X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=106a4bf610cade2c8d61b9da3d65c07b97656af2;hb=17c338f39c13131c1bc175ef38013b54bc98396d;hp=6bd0423c688ec9b33d7dbe3188bffbbd1c7d6dcb;hpb=71d929cb3e130a6486f59d4312ce76d7d6eea647;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 6bd0423..106a4bf 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -52,6 +52,9 @@ ASCII based machines or recognize UTF-EBCDIC on EBCDIC based machines.
 B<NOTE: this should be the only place where an explicit C<use utf8> is needed>.
 
+You can also use the C<encoding> pragma to change the default encoding
+of the data in your script; see L<encoding>.
+
 =back
 
 =head2 Byte and Character semantics
 
@@ -102,10 +105,11 @@ literal UTF-8 string constant in the program), character semantics
 apply; otherwise, byte semantics are in effect. To force byte semantics
 on Unicode data, the C<bytes> pragma should be used.
 
-Notice that if you have a string with byte semantics and you then
-add character data into it, the bytes will be upgraded I<as if they were
-ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a translation
-to ISO 8859-1).
+Notice that if you concatenate strings with byte semantics and strings
+with Unicode character data, the bytes will by default be upgraded
+I<as if they were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a
+translation to ISO 8859-1). To change this, use the C<encoding>
+pragma, see L<encoding>.
 
 Under character semantics, many operations that formerly operated on
 bytes change to operating on characters. For ASCII data this makes no
@@ -173,88 +177,89 @@ are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.
 The C<\p{Is...}> test for "general properties" such as "letter",
 "digit", while the C<\p{In...}> test for Unicode scripts and blocks.
 
-The official Unicode script and block names have spaces and
-dashes and separators, but for convenience you can have
-dashes, spaces, and underbars at every word division, and
-you need not care about correct casing. It is recommended,
-however, that for consistency you use the following naming:
-the official Unicode script or block name (see below for
-the additional rules that apply to block names), with the whitespace
-and dashes removed, and the words "uppercase-first-lowercase-otherwise".
-That is, "Latin-1 Supplement" becomes "Latin1Supplement".
+The official Unicode script and block names have spaces and dashes and
+separators, but for convenience you can have dashes, spaces, and
+underbars at every word division, and you need not care about correct
+casing. It is recommended, however, that for consistency you use the
+following naming: the official Unicode script, block, or property name
+(see below for the additional rules that apply to block names),
+with whitespace and dashes replaced with underbar, and the words
+"uppercase-first-lowercase-rest". That is, "Latin-1 Supplement"
+becomes "Latin_1_Supplement".
 
 You can also negate both C<\p{}> and C<\P{}> by introducing a caret
-(^) between the first curly and the property name: C<\p{^InTamil}> is
-equal to C<\P{InTamil}>.
+(^) between the first curly and the property name: C<\p{^In_Tamil}> is
+equal to C<\P{In_Tamil}>.
 
 The C<In> and C<Is> can be left out: C<\p{Greek}> is equal to
-C<\p{InGreek}>, C<\P{Pd}> is equal to C<\P{Pd}>.
+C<\p{In_Greek}>, C<\P{Pd}> is equal to C<\P{Pd}>.
 
     Short       Long
 
     L           Letter
-    Lu          Uppercase Letter
-    Ll          Lowercase Letter
-    Lt          Titlecase Letter
-    Lm          Modifier Letter
-    Lo          Other Letter
+    Lu          Uppercase_Letter
+    Ll          Lowercase_Letter
+    Lt          Titlecase_Letter
+    Lm          Modifier_Letter
+    Lo          Other_Letter
 
     M           Mark
-    Mn          Non-Spacing Mark
-    Mc          Spacing Combining Mark
-    Me          Enclosing Mark
+    Mn          Nonspacing_Mark
+    Mc          Spacing_Mark
+    Me          Enclosing_Mark
 
     N           Number
-    Nd          Decimal Digit Number
-    Nl          Letter Number
-    No          Other Number
+    Nd          Decimal_Number
+    Nl          Letter_Number
+    No          Other_Number
 
     P           Punctuation
-    Pc          Connector Punctuation
-    Pd          Dash Punctuation
-    Ps          Open Punctuation
-    Pe          Close Punctuation
-    Pi          Initial Punctuation
+    Pc          Connector_Punctuation
+    Pd          Dash_Punctuation
+    Ps          Open_Punctuation
+    Pe          Close_Punctuation
+    Pi          Initial_Punctuation
                 (may behave like Ps or Pe depending on usage)
-    Pf          Final Punctuation
+    Pf          Final_Punctuation
                 (may behave like Ps or Pe depending on usage)
-    Po          Other Punctuation
+    Po          Other_Punctuation
 
     S           Symbol
-    Sm          Math Symbol
-    Sc          Currency Symbol
-    Sk          Modifier Symbol
-    So          Other Symbol
+    Sm          Math_Symbol
+    Sc          Currency_Symbol
+    Sk          Modifier_Symbol
+    So          Other_Symbol
 
     Z           Separator
-    Zs          Space Separator
-    Zl          Line Separator
-    Zp          Paragraph Separator
+    Zs          Space_Separator
+    Zl          Line_Separator
+    Zp          Paragraph_Separator
 
     C           Other
-    Cc          (Other) Control
-    Cf          (Other) Format
-    Cs          (Other) Surrogate
-    Co          (Other) Private Use
-    Cn          (Other) Not Assigned
+    Cc          Control
+    Cf          Format
+    Cs          Surrogate
+    Co          Private_Use
+    Cn          Unassigned
 
 There's also C<L&> which is an alias for C<Ll>, C<Lu>, and C<Lt>.
 
 The following reserved ranges have C<In> tests:
 
-    CJK Ideograph Extension A
-    CJK Ideograph
-    Hangul Syllable
-    Non Private Use High Surrogate
-    Private Use High Surrogate
-    Low Surrogate
-    Private Surrogate
-    CJK Ideograph Extension B
-    Plane 15 Private Use
-    Plane 16 Private Use
+    CJK_Ideograph_Extension_A
+    CJK_Ideograph
+    Hangul_Syllable
+    Non_Private_Use_High_Surrogate
+    Private_Use_High_Surrogate
+    Low_Surrogate
+    Private_Surrogate
+    CJK_Ideograph_Extension_B
+    Plane_15_Private_Use
+    Plane_16_Private_Use
 
 For example C<"\x{AC00}" =~ \p{HangulSyllable}> will test true.
-(Handling of surrogates is not implemented yet.)
+(Handling of surrogates is not implemented yet, because Perl
+uses UTF-8 and not UTF-16 internally to represent Unicode.)
 
 Additionally, because scripts differ in their directionality
 (for example Hebrew is written right to left), all characters
@@ -285,66 +290,66 @@ have their directionality defined:
 The scripts available for C<\p{In...}> and C<\P{In...}>, for example
 \p{InCyrillic>, are as follows, for example C<\p{InLatin}> or C<\P{InHan}>:
 
-    Latin
-    Greek
-    Cyrillic
-    Armenian
-    Hebrew
     Arabic
-    Syriac
-    Thaana
-    Devanagari
+    Armenian
     Bengali
-    Gurmukhi
+    Bopomofo
+    Canadian-Aboriginal
+    Cherokee
+    Cyrillic
+    Deseret
+    Devanagari
+    Ethiopic
+    Georgian
+    Gothic
+    Greek
     Gujarati
-    Oriya
-    Tamil
-    Telugu
+    Gurmukhi
+    Han
+    Hangul
+    Hebrew
+    Hiragana
+    Inherited
     Kannada
-    Malayalam
-    Sinhala
-    Thai
+    Katakana
+    Khmer
     Lao
-    Tibetan
+    Latin
+    Malayalam
+    Mongolian
     Myanmar
-    Georgian
-    Hangul
-    Ethiopic
-    Cherokee
-    Canadian Aboriginal
     Ogham
+    Old-Italic
+    Oriya
     Runic
-    Khmer
-    Mongolian
-    Hiragana
-    Katakana
-    Bopomofo
-    Han
+    Sinhala
+    Syriac
+    Tamil
+    Telugu
+    Thaana
+    Thai
+    Tibetan
     Yi
-    Old Italic
-    Gothic
-    Deseret
-    Inherited
 
 There are also extended property classes that supplement the basic
 properties, defined by the F<PropList.txt> Unicode database:
 
-    White_space
+    ASCII_Hex_Digit
     Bidi_Control
-    Join_Control
     Dash
-    Hyphen
-    Quotation_Mark
-    Other_Math
-    Hex_Digit
-    ASCII_Hex_Digit
-    Other_Alphabetic
-    Ideographic
     Diacritic
     Extender
+    Hex_Digit
+    Hyphen
+    Ideographic
+    Join_Control
+    Noncharacter_Code_Point
+    Other_Alphabetic
     Other_Lowercase
+    Other_Math
     Other_Uppercase
-    Noncharacter_Code_Point
+    Quotation_Mark
+    White_Space
 
 and further derived properties:
 
@@ -359,17 +364,20 @@ and further derived properties:
     Any             Any character
     Assigned        Any non-Cn character
     Common          Any character (or unassigned code point)
-                    not explicitly assigned to a script.
+                    not explicitly assigned to a script
 
 =head2 Blocks
 
 In addition to B<scripts>, Unicode also defines B<blocks> of
 characters. The difference between scripts and blocks is that the
-former concept is closer to natural languages, while the latter
+scripts concept is closer to natural languages, while the blocks concept
 is more an artificial grouping based on groups of 256 Unicode
 characters. For example, the C<Latin> script contains letters from
-many blocks, but it does not contain all the characters from those
-blocks, it does not for example contain digits.
+many blocks. On the other hand, the C<Latin> script does not contain
+all the characters from those blocks, it does not for example contain
+digits because digits are shared across many scripts. Digits and
+other similar groups, like punctuation, are in a category called
+C<Common>.
 
 For more about scripts see the UTR #24:
 http://www.unicode.org/unicode/reports/tr24/
@@ -381,107 +389,107 @@ a script called C<Katakana> and a block called C<Katakana>, the block
 version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
 
 Notice that this definition was introduced in Perl 5.8.0: in Perl
-5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the
+5.6 only the blocks were used; in Perl 5.8.0 scripts became the
 preferential Unicode character class definition; this meant that
 the definitions of some character classes changed (the ones in the
 below list that have the C<Block> appended).
 
+    Alphabetic Presentation Forms
+    Arabic Block
+    Arabic Presentation Forms-A
+    Arabic Presentation Forms-B
+    Armenian Block
+    Arrows
     Basic Latin
-    Latin 1 Supplement
-    Latin Extended-A
-    Latin Extended-B
-    IPA Extensions
-    Spacing Modifier Letters
+    Bengali Block
+    Block Elements
+    Bopomofo Block
+    Bopomofo Extended
+    Box Drawing
+    Braille Patterns
+    Byzantine Musical Symbols
+    CJK Compatibility
+    CJK Compatibility Forms
+    CJK Compatibility Ideographs
+    CJK Compatibility Ideographs Supplement
+    CJK Radicals Supplement
+    CJK Symbols and Punctuation
+    CJK Unified Ideographs
+    CJK Unified Ideographs Extension A
+    CJK Unified Ideographs Extension B
+    Cherokee Block
     Combining Diacritical Marks
-    Greek Block
+    Combining Half Marks
+    Combining Marks for Symbols
+    Control Pictures
+    Currency Symbols
     Cyrillic Block
-    Armenian Block
-    Hebrew Block
-    Arabic Block
-    Syriac Block
-    Thaana Block
+    Deseret Block
     Devanagari Block
-    Bengali Block
-    Gurmukhi Block
-    Gujarati Block
-    Oriya Block
-    Tamil Block
-    Telugu Block
-    Kannada Block
-    Malayalam Block
-    Sinhala Block
-    Thai Block
-    Lao Block
-    Tibetan Block
-    Myanmar Block
+    Dingbats
+    Enclosed Alphanumerics
+    Enclosed CJK Letters and Months
+    Ethiopic Block
+    General Punctuation
+    Geometric Shapes
     Georgian Block
+    Gothic Block
+    Greek Block
+    Greek Extended
+    Gujarati Block
+    Gurmukhi Block
+    Halfwidth and Fullwidth Forms
+    Hangul Compatibility Jamo
     Hangul Jamo
-    Ethiopic Block
-    Cherokee Block
-    Unified Canadian Aboriginal Syllabics
-    Ogham Block
-    Runic Block
+    Hangul Syllables
+    Hebrew Block
+    High Private Use Surrogates
+    High Surrogates
+    Hiragana Block
+    IPA Extensions
+    Ideographic Description Characters
+    Kanbun
+    Kangxi Radicals
+    Kannada Block
+    Katakana Block
     Khmer Block
-    Mongolian Block
+    Lao Block
+    Latin 1 Supplement
     Latin Extended Additional
-    Greek Extended
-    General Punctuation
-    Superscripts and Subscripts
-    Currency Symbols
-    Combining Marks for Symbols
+    Latin Extended-A
+    Latin Extended-B
     Letterlike Symbols
-    Number Forms
-    Arrows
+    Low Surrogates
+    Malayalam Block
+    Mathematical Alphanumeric Symbols
     Mathematical Operators
+    Miscellaneous Symbols
     Miscellaneous Technical
-    Control Pictures
+    Mongolian Block
+    Musical Symbols
+    Myanmar Block
+    Number Forms
+    Ogham Block
+    Old Italic Block
     Optical Character Recognition
-    Enclosed Alphanumerics
-    Box Drawing
-    Block Elements
-    Geometric Shapes
-    Miscellaneous Symbols
-    Dingbats
-    Braille Patterns
-    CJK Radicals Supplement
-    Kangxi Radicals
-    Ideographic Description Characters
-    CJK Symbols and Punctuation
-    Hiragana Block
-    Katakana Block
-    Bopomofo Block
-    Hangul Compatibility Jamo
-    Kanbun
-    Bopomofo Extended
-    Enclosed CJK Letters and Months
-    CJK Compatibility
-    CJK Unified Ideographs Extension A
-    CJK Unified Ideographs
-    Yi Syllables
-    Yi Radicals
-    Hangul Syllables
-    High Surrogates
-    High Private Use Surrogates
-    Low Surrogates
+    Oriya Block
     Private Use
-    CJK Compatibility Ideographs
-    Alphabetic Presentation Forms
-    Arabic Presentation Forms-A
-    Combining Half Marks
-    CJK Compatibility Forms
+    Runic Block
+    Sinhala Block
     Small Form Variants
-    Arabic Presentation Forms-B
+    Spacing Modifier Letters
     Specials
-    Halfwidth and Fullwidth Forms
-    Old Italic Block
-    Gothic Block
-    Deseret Block
-    Byzantine Musical Symbols
-    Musical Symbols
-    Mathematical Alphanumeric Symbols
-    CJK Unified Ideographs Extension B
-    CJK Compatibility Ideographs Supplement
+    Superscripts and Subscripts
+    Syriac Block
     Tags
+    Tamil Block
+    Telugu Block
+    Thaana Block
+    Thai Block
+    Tibetan Block
+    Unified Canadian Aboriginal Syllabics
+    Yi Radicals
+    Yi Syllables
 
 =item *
 
@@ -501,10 +509,11 @@ pack('C0', ...).
 =item *
 
 Case translation operators use the Unicode case translation tables
-when provided character input. Note that C<uc()> translates to
-uppercase, while C<ucfirst> translates to titlecase (for languages
-that make the distinction). Naturally the corresponding backslash
-sequences have the same semantics.
+when provided character input. Note that C<uc()> (also known as C<\U>
+in doublequoted strings) translates to uppercase, while C<ucfirst>
+(also known as C<\u> in doublequoted strings) translates to titlecase
+(for languages that make the distinction). Naturally the
+corresponding backslash sequences have the same semantics.
 
 =item *
 
@@ -548,15 +557,37 @@ wide bit complement.
 
 =item *
 
-lc(), uc(), lcfirst(), and ucfirst() work only for some of the
-simplest cases, where the mapping goes from a single Unicode character
-to another single Unicode character, and where the mapping does not
-depend on surrounding characters, or on locales. More complex cases,
-where for example one character maps into several, are not yet
-implemented. See the Unicode Technical Report #21, Case Mappings,
-for more details. The Unicode::UCD module (part of Perl since 5.8.0)
-casespec() and casefold() interfaces supply information about the more
-complex cases.
+lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
+
+=over 8
+
+=item *
+
+the case mapping is from a single Unicode character to another
+single Unicode character
+
+=item *
+
+the case mapping is from a single Unicode character to more
+than one Unicode character
+
+=back
+
+What doesn't yet work are the following cases:
+
+=over 8
+
+=item *
+
+the "final sigma" (Greek)
+
+=item *
+
+anything to do with locales (Lithuanian, Turkish, Azeri)
+
+=back
+
+See the Unicode Technical Report #21, Case Mappings, for more details.
 
 =item *
 
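Reviewer note (not part of the patch): the two hunks above describe uppercase versus titlecase mapping and one-to-many case mappings. A minimal sketch of both behaviours follows, assuming a Perl that already includes the one-to-many mappings this patch documents. The helper name `case_forms` is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# case_forms() is a hypothetical helper for this illustration only.
# It returns the uppercase and the titlecase form of its argument.
sub case_forms {
    my ($str) = @_;
    return (uc($str), ucfirst($str));
}

# U+01C6 LATIN SMALL LETTER DZ WITH CARON has distinct uppercase
# (U+01C4) and titlecase (U+01C5) forms, so uc() and ucfirst() differ.
my ($upper, $title) = case_forms("\x{01C6}");
printf "upper U+%04X, title U+%04X\n", ord($upper), ord($title);

# U+FB00 LATIN SMALL LIGATURE FF uppercases to the two characters "FF",
# an instance of a single character mapping to more than one.
my ($ff_upper) = case_forms("\x{FB00}");
printf "uc() length: %d\n", length($ff_upper);
```

Code points above U+00FF are used deliberately so that character semantics apply without any pragma. The final sigma and locale-dependent cases listed as unimplemented are not exercised here.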