X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=106a4bf610cade2c8d61b9da3d65c07b97656af2;hb=8a36125691db1d8f79e98507373cbc6ea47271d4;hp=9609cdc7cbd79b9a1afc83d0408f102b6d33e0fc;hpb=7dedd01fe68e1bc71e98f1f13b6e607814dec07b;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 9609cdc..106a4bf 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -6,8 +6,8 @@ perlunicode - Unicode support in Perl =head2 Important Caveats -WARNING: While the implementation of Unicode support in Perl is now fairly -complete it is still evolving to some extent. +WARNING: While the implementation of Unicode support in Perl is now +fairly complete it is still evolving to some extent. In particular the way Unicode is handled on EBCDIC platforms is still rather experimental. On such a platform references to UTF-8 encoding @@ -52,6 +52,9 @@ ASCII based machines or recognize UTF-EBCDIC on EBCDIC based machines. B is needed>. +You can also use the C pragma to change the default encoding +of the data in your script; see L. + =back =head2 Byte and Character semantics @@ -102,10 +105,11 @@ literal UTF-8 string constant in the program), character semantics apply; otherwise, byte semantics are in effect. To force byte semantics on Unicode data, the C pragma should be used. -Notice that if you have a string with byte semantics and you then -add character data into it, the bytes will be upgraded I (or if in EBCDIC, after a translation -to ISO 8859-1). +Notice that if you concatenate strings with byte semantics and strings +with Unicode character data, the bytes will by default be upgraded +I (or if in EBCDIC, after a +translation to ISO 8859-1). To change this, use the C +pragma, see L. Under character semantics, many operations that formerly operated on bytes change to operating on characters. For ASCII data this makes no @@ -168,134 +172,212 @@ match property) constructs. 
For instance, C<\p{Lu}> matches any character with the Unicode uppercase property, while C<\p{M}> matches any mark character. Single letter properties may omit the brackets, so that can be written C<\pM> also. Many predefined character classes -are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>. The -names of the C classes are the official Unicode script and block -names but with all non-alphanumeric characters removed, for example -the block name C<"Latin-1 Supplement"> becomes C<\p{InLatin1Supplement}>. - -Here is the list as of Unicode 3.1.0 (the two-letter classes) and -as defined by Perl (the one-letter classes) (in Unicode materials -what Perl calls C is often called C): - - L Letter - Lu Letter, Uppercase - Ll Letter, Lowercase - Lt Letter, Titlecase - Lm Letter, Modifier - Lo Letter, Other - M Mark - Mn Mark, Non-Spacing - Mc Mark, Spacing Combining - Me Mark, Enclosing - N Number - Nd Number, Decimal Digit - Nl Number, Letter - No Number, Other - P Punctuation - Pc Punctuation, Connector - Pd Punctuation, Dash - Ps Punctuation, Open - Pe Punctuation, Close - Pi Punctuation, Initial quote - (may behave like Ps or Pe depending on usage) - Pf Punctuation, Final quote - (may behave like Ps or Pe depending on usage) - Po Punctuation, Other - S Symbol - Sm Symbol, Math - Sc Symbol, Currency - Sk Symbol, Modifier - So Symbol, Other - Z Separator - Zs Separator, Space - Zl Separator, Line - Zp Separator, Paragraph - C Other - Cc Other, Control - Cf Other, Format - Cs Other, Surrogate - Co Other, Private Use - Cn Other, Not Assigned (Unicode defines no Cn characters) +are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>. + +The C<\p{Is...}> test for "general properties" such as "letter", +"digit", while the C<\p{In...}> test for Unicode scripts and blocks. 
+ +The official Unicode script and block names have spaces and dashes and +separators, but for convenience you can have dashes, spaces, and +underbars at every word division, and you need not care about correct +casing. It is recommended, however, that for consistency you use the +following naming: the official Unicode script, block, or property name +(see below for the additional rules that apply to block names), +with whitespace and dashes replaced with underbar, and the words +"uppercase-first-lowercase-rest". That is, "Latin-1 Supplement" +becomes "Latin_1_Supplement". + +You can also negate both C<\p{}> and C<\P{}> by introducing a caret +(^) between the first curly and the property name: C<\p{^In_Tamil}> is +equal to C<\P{In_Tamil}>. + +The C<In> and C<Is> can be left out: C<\p{Greek}> is equal to +C<\p{In_Greek}>, C<\P{Pd}> is equal to C<\P{IsPd}>. + + Short Long + + L Letter + Lu Uppercase_Letter + Ll Lowercase_Letter + Lt Titlecase_Letter + Lm Modifier_Letter + Lo Other_Letter + + M Mark + Mn Nonspacing_Mark + Mc Spacing_Mark + Me Enclosing_Mark + + N Number + Nd Decimal_Number + Nl Letter_Number + No Other_Number + + P Punctuation + Pc Connector_Punctuation + Pd Dash_Punctuation + Ps Open_Punctuation + Pe Close_Punctuation + Pi Initial_Punctuation + (may behave like Ps or Pe depending on usage) + Pf Final_Punctuation + (may behave like Ps or Pe depending on usage) + Po Other_Punctuation + + S Symbol + Sm Math_Symbol + Sc Currency_Symbol + Sk Modifier_Symbol + So Other_Symbol + + Z Separator + Zs Space_Separator + Zl Line_Separator + Zp Paragraph_Separator + + C Other + Cc Control + Cf Format + Cs Surrogate + Co Private_Use + Cn Unassigned + +There's also C which is an alias for C, C, and C. 
+ +The following reserved ranges have C tests: + + CJK_Ideograph_Extension_A + CJK_Ideograph + Hangul_Syllable + Non_Private_Use_High_Surrogate + Private_Use_High_Surrogate + Low_Surrogate + Private_Surrogate + CJK_Ideograph_Extension_B + Plane_15_Private_Use + Plane_16_Private_Use + +For example C<"\x{AC00}" =~ \p{HangulSyllable}> will test true. +(Handling of surrogates is not implemented yet, because Perl +uses UTF-8 and not UTF-16 internally to represent Unicode.) Additionally, because scripts differ in their directionality (for example Hebrew is written right to left), all characters have their directionality defined: - BidiL Left-to-Right - BidiLRE Left-to-Right Embedding - BidiLRO Left-to-Right Override - BidiR Right-to-Left - BidiAL Right-to-Left Arabic - BidiRLE Right-to-Left Embedding - BidiRLO Right-to-Left Override - BidiPDF Pop Directional Format - BidiEN European Number - BidiES European Number Separator - BidiET European Number Terminator - BidiAN Arabic Number - BidiCS Common Number Separator - BidiNSM Non-Spacing Mark - BidiBN Boundary Neutral - BidiB Paragraph Separator - BidiS Segment Separator - BidiWS Whitespace - BidiON Other Neutrals + BidiL Left-to-Right + BidiLRE Left-to-Right Embedding + BidiLRO Left-to-Right Override + BidiR Right-to-Left + BidiAL Right-to-Left Arabic + BidiRLE Right-to-Left Embedding + BidiRLO Right-to-Left Override + BidiPDF Pop Directional Format + BidiEN European Number + BidiES European Number Separator + BidiET European Number Terminator + BidiAN Arabic Number + BidiCS Common Number Separator + BidiNSM Non-Spacing Mark + BidiBN Boundary Neutral + BidiB Paragraph Separator + BidiS Segment Separator + BidiWS Whitespace + BidiON Other Neutrals =head2 Scripts The scripts available for C<\p{In...}> and C<\P{In...}>, for example \p{InCyrillic>, are as follows, for example C<\p{InLatin}> or C<\P{InHan}>: - Latin - Greek - Cyrillic - Armenian - Hebrew - Arabic - Syriac - Thaana - Devanagari - Bengali - Gurmukhi - Gujarati - 
Oriya - Tamil - Telugu - Kannada - Malayalam - Sinhala - Thai - Lao - Tibetan - Myanmar - Georgian - Hangul - Ethiopic - Cherokee - CanadianAboriginal - Ogham - Runic - Khmer - Mongolian - Hiragana - Katakana - Bopomofo - Han - Yi - OldItalic - Gothic - Deseret - Inherited + Arabic + Armenian + Bengali + Bopomofo + Canadian-Aboriginal + Cherokee + Cyrillic + Deseret + Devanagari + Ethiopic + Georgian + Gothic + Greek + Gujarati + Gurmukhi + Han + Hangul + Hebrew + Hiragana + Inherited + Kannada + Katakana + Khmer + Lao + Latin + Malayalam + Mongolian + Myanmar + Ogham + Old-Italic + Oriya + Runic + Sinhala + Syriac + Tamil + Telugu + Thaana + Thai + Tibetan + Yi + +There are also extended property classes that supplement the basic +properties, defined by the F Unicode database: + + ASCII_Hex_Digit + Bidi_Control + Dash + Diacritic + Extender + Hex_Digit + Hyphen + Ideographic + Join_Control + Noncharacter_Code_Point + Other_Alphabetic + Other_Lowercase + Other_Math + Other_Uppercase + Quotation_Mark + White_Space + +and further derived properties: + + Alphabetic Lu + Ll + Lt + Lm + Lo + Other_Alphabetic + Lowercase Ll + Other_Lowercase + Uppercase Lu + Other_Uppercase + Math Sm + Other_Math + + ID_Start Lu + Ll + Lt + Lm + Lo + Nl + ID_Continue ID_Start + Mn + Mc + Nd + Pc + + Any Any character + Assigned Any non-Cn character + Common Any character (or unassigned code point) + not explicitly assigned to a script =head2 Blocks In addition to B, Unicode also defines B of characters. The difference between scripts and blocks is that the -former concept is closer to natural languages, while the latter +scripts concept is closer to natural languages, while the blocks concept is more an artificial grouping based on groups of 256 Unicode characters. For example, the C script contains letters from -many blocks, but it does not contain all the characters from those -blocks, it does not for example contain digits. +many blocks. 
On the other hand, the C script does not contain +all the characters from those blocks, it does not for example contain +digits because digits are shared across many scripts. Digits and +other similar groups, like punctuation, are in a category called +C. For more about scripts see the UTR #24: http://www.unicode.org/unicode/reports/tr24/ @@ -307,107 +389,107 @@ a script called C and a block called C, the block version has C appended to its name, C<\p{InKatakanaBlock}>. Notice that this definition was introduced in Perl 5.8.0: in Perl -5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the -preferential character class definition; this meant that the -definitions of some character classes changed (the ones in the +5.6 only the blocks were used; in Perl 5.8.0 scripts became the +preferential Unicode character class definition; this meant that +the definitions of some character classes changed (the ones in the below list that have the C appended). - BasicLatin - Latin1Supplement - LatinExtendedA - LatinExtendedB - IPAExtensions - SpacingModifierLetters - CombiningDiacriticalMarks - GreekBlock - CyrillicBlock - ArmenianBlock - HebrewBlock - ArabicBlock - SyriacBlock - ThaanaBlock - DevanagariBlock - BengaliBlock - GurmukhiBlock - GujaratiBlock - OriyaBlock - TamilBlock - TeluguBlock - KannadaBlock - MalayalamBlock - SinhalaBlock - ThaiBlock - LaoBlock - TibetanBlock - MyanmarBlock - GeorgianBlock - HangulJamo - EthiopicBlock - CherokeeBlock - UnifiedCanadianAboriginalSyllabics - OghamBlock - RunicBlock - KhmerBlock - MongolianBlock - LatinExtendedAdditional - GreekExtended - GeneralPunctuation - SuperscriptsandSubscripts - CurrencySymbols - CombiningMarksforSymbols - LetterlikeSymbols - NumberForms + Alphabetic Presentation Forms + Arabic Block + Arabic Presentation Forms-A + Arabic Presentation Forms-B + Armenian Block Arrows - MathematicalOperators - MiscellaneousTechnical - ControlPictures - OpticalCharacterRecognition - EnclosedAlphanumerics - 
BoxDrawing - BlockElements - GeometricShapes - MiscellaneousSymbols + Basic Latin + Bengali Block + Block Elements + Bopomofo Block + Bopomofo Extended + Box Drawing + Braille Patterns + Byzantine Musical Symbols + CJK Compatibility + CJK Compatibility Forms + CJK Compatibility Ideographs + CJK Compatibility Ideographs Supplement + CJK Radicals Supplement + CJK Symbols and Punctuation + CJK Unified Ideographs + CJK Unified Ideographs Extension A + CJK Unified Ideographs Extension B + Cherokee Block + Combining Diacritical Marks + Combining Half Marks + Combining Marks for Symbols + Control Pictures + Currency Symbols + Cyrillic Block + Deseret Block + Devanagari Block Dingbats - BraillePatterns - CJKRadicalsSupplement - KangxiRadicals - IdeographicDescriptionCharacters - CJKSymbolsandPunctuation - HiraganaBlock - KatakanaBlock - BopomofoBlock - HangulCompatibilityJamo + Enclosed Alphanumerics + Enclosed CJK Letters and Months + Ethiopic Block + General Punctuation + Geometric Shapes + Georgian Block + Gothic Block + Greek Block + Greek Extended + Gujarati Block + Gurmukhi Block + Halfwidth and Fullwidth Forms + Hangul Compatibility Jamo + Hangul Jamo + Hangul Syllables + Hebrew Block + High Private Use Surrogates + High Surrogates + Hiragana Block + IPA Extensions + Ideographic Description Characters Kanbun - BopomofoExtended - EnclosedCJKLettersandMonths - CJKCompatibility - CJKUnifiedIdeographsExtensionA - CJKUnifiedIdeographs - YiSyllables - YiRadicals - HangulSyllables - HighSurrogates - HighPrivateUseSurrogates - LowSurrogates - PrivateUse - CJKCompatibilityIdeographs - AlphabeticPresentationForms - ArabicPresentationFormsA - CombiningHalfMarks - CJKCompatibilityForms - SmallFormVariants - ArabicPresentationFormsB + Kangxi Radicals + Kannada Block + Katakana Block + Khmer Block + Lao Block + Latin 1 Supplement + Latin Extended Additional + Latin Extended-A + Latin Extended-B + Letterlike Symbols + Low Surrogates + Malayalam Block + Mathematical Alphanumeric 
Symbols + Mathematical Operators + Miscellaneous Symbols + Miscellaneous Technical + Mongolian Block + Musical Symbols + Myanmar Block + Number Forms + Ogham Block + Old Italic Block + Optical Character Recognition + Oriya Block + Private Use + Runic Block + Sinhala Block + Small Form Variants + Spacing Modifier Letters Specials - HalfwidthandFullwidthForms - OldItalicBlock - GothicBlock - DeseretBlock - ByzantineMusicalSymbols - MusicalSymbols - MathematicalAlphanumericSymbols - CJKUnifiedIdeographsExtensionB - CJKCompatibilityIdeographsSupplement + Superscripts and Subscripts + Syriac Block Tags + Tamil Block + Telugu Block + Thaana Block + Thai Block + Tibetan Block + Unified Canadian Aboriginal Syllabics + Yi Radicals + Yi Syllables =item * @@ -427,10 +509,11 @@ pack('C0', ...). =item * Case translation operators use the Unicode case translation tables -when provided character input. Note that C translates to -uppercase, while C translates to titlecase (for languages -that make the distinction). Naturally the corresponding backslash -sequences have the same semantics. +when provided character input. Note that C (also known as C<\U> +in doublequoted strings) translates to uppercase, while C +(also known as C<\u> in doublequoted strings) translates to titlecase +(for languages that make the distinction). Naturally the +corresponding backslash sequences have the same semantics. =item * @@ -456,7 +539,9 @@ outside of the utf8 pragma too.) The C and C functions work on characters. This is like C and C, not like C and C. In fact, the latter are how you now emulate -byte-oriented C and C under utf8. +byte-oriented C and C for Unicode strings. +(Note that this reveals the internal UTF-8 encoding of strings and +you are not supposed to do that unless you know what you are doing.) =item * @@ -472,6 +557,40 @@ wide bit complement. 
=item * +lc(), uc(), lcfirst(), and ucfirst() work for the following cases: + +=over 8 + +=item * + +the case mapping is from a single Unicode character to another +single Unicode character + +=item * + +the case mapping is from a single Unicode character to more +than one Unicode character + +=back + +What doesn't yet work are the followng cases: + +=over 8 + +=item * + +the "final sigma" (Greek) + +=item * + +anything to with locales (Lithuanian, Turkish, Azeri) + +=back + +See the Unicode Technical Report #21, Case Mappings, for more details. + +=item * + And finally, C reverses by character rather than by byte. =back @@ -495,6 +614,69 @@ some attempt to apply 8-bit locale info to characters in the range characters above that range (when mapped into Unicode). It will also tend to run slower. Avoidance of locales is strongly encouraged. +=head1 UNICODE REGULAR EXPRESSION SUPPORT LEVEL + +The following list of Unicode regular expression support describes +feature by feature the Unicode support implemented in Perl as of Perl +5.8.0. The "Level N" and the section numbers refer to the Unicode +Technical Report 18, "Unicode Regular Expression Guidelines". + +=over 4 + +=item * + +Level 1 - Basic Unicode Support + + 2.1 Hex Notation - done [1] + Named Notation - done [2] + 2.2 Categories - done [3][4] + 2.3 Subtraction - MISSING [5][6] + 2.4 Simple Word Boundaries - done [7] + 2.5 Simple Loose Matches - MISSING [8] + 2.6 End of Line - MISSING [9][10] + + [ 1] \x{...} + [ 2] \N{...} + [ 3] . 
\p{Is...} \P{Is...} + [ 4] now scripts (see UTR#24 Script Names) in addition to blocks + [ 5] have negation + [ 6] can use look-ahead to emulate subtraction + [ 7] include Letters in word characters + [ 8] see UTR#21 Case Mappings + [ 9] see UTR#13 Unicode Newline Guidelines + [10] should do ^ and $ also on \x{2028} and \x{2029} + +=item * + +Level 2 - Extended Unicode Support + + 3.1 Surrogates - MISSING + 3.2 Canonical Equivalents - MISSING [11][12] + 3.3 Locale-Independent Graphemes - MISSING [13] + 3.4 Locale-Independent Words - MISSING [14] + 3.5 Locale-Independent Loose Matches - MISSING [15] + + [11] see UTR#15 Unicode Normalization + [12] have Unicode::Normalize but not integrated to regexes + [13] have \X but at this level . should equal that + [14] need three classes, not just \w and \W + [15] see UTR#21 Case Mappings + +=item * + +Level 3 - Locale-Sensitive Support + + 4.1 Locale-Dependent Categories - MISSING + 4.2 Locale-Dependent Graphemes - MISSING [16][17] + 4.3 Locale-Dependent Words - MISSING + 4.4 Locale-Dependent Loose Matches - MISSING + 4.5 Locale-Dependent Ranges - MISSING + + [16] see UTR#10 Unicode Collation Algorithms + [17] have Unicode::Collate but not integrated to regexes + +=back + =head1 SEE ALSO L, L, L, L