X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=4e62fed0b51dbd865eae251ecb3b76917ddd1f1f;hb=83ce3e12e086bc5a21f37af9378b7c01fa5d73d8;hp=21c5bb3ab7c858d844d001ec4c9d1a6f2145aee1;hpb=822502e5e1ee67853c76322faa5c660c9f9a49da;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 21c5bb3..4e62fed 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -10,6 +10,10 @@ Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features. +People who want to learn to use Unicode in Perl, should probably read +L, before reading +this reference document. + =over 4 =item Input and Output Layers @@ -20,15 +24,15 @@ the ":utf8" layer. Other encodings can be converted to Perl's encoding on input or from Perl's encoding on output by use of the ":encoding(...)" layer. See L. -To indicate that Perl source itself is using a particular encoding, -see L. +To indicate that Perl source itself is in UTF-8, use C. =item Regular Expressions The regular expression compiler produces polymorphic opcodes. That is, the pattern adapts to the data and automatically switches to the Unicode -character scheme when presented with Unicode data--or instead uses -a traditional byte scheme when presented with byte data. +character scheme when presented with data that is internally encoded in +UTF-8 -- or instead uses a traditional byte scheme when presented with +byte data. =item C still needed to enable UTF-8/UTF-EBCDIC in scripts @@ -39,9 +43,6 @@ ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based machines. B is needed.> See L. -You can also use the C pragma to change the default encoding -of the data in your script; see L. 
- =item BOM-marked scripts and UTF-16 scripts autodetected If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE, @@ -52,17 +53,12 @@ ISO 8859-1 or other eight-bit encodings.) =item C needed to upgrade non-Latin-1 byte strings -By default, there is a fundamental asymmetry in Perl's unicode model: +By default, there is a fundamental asymmetry in Perl's Unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in I, but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happen to agree with Latin-1. -If you wish to interpret byte strings as UTF-8 instead, use the -C pragma: - - use encoding 'utf8'; - See L for more details. =back @@ -112,9 +108,7 @@ If strings operating under byte semantics and strings with Unicode character data are concatenated, the new string will be created by decoding the byte strings as I, even if the old Unicode string used EBCDIC. This translation is done without -regard to the system's native 8-bit encoding. To change this for -systems with non-Latin-1 and non-EBCDIC native encodings, use the -C pragma. See L. +regard to the system's native 8-bit encoding. Under character semantics, many operations that formerly operated on bytes now operate on characters. A character in Perl is @@ -134,17 +128,16 @@ Character semantics have the following effects: Strings--including hash keys--and regular expression patterns may contain characters that have an ordinal value larger than 255. -If you use a Unicode editor to edit your program, Unicode characters -may occur directly within the literal strings in one of the various -Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized -as such and converted to Perl's internal representation only if the -appropriate L is specified. 
+If you use a Unicode editor to edit your program, Unicode characters may +occur directly within the literal strings in UTF-8 encoding, or UTF-16. +(The former requires a BOM or C, the latter requires a BOM.) -Unicode characters can also be added to a string by using the -C<\x{...}> notation. The Unicode code for the desired character, in -hexadecimal, should be placed in the braces. For instance, a smiley -face is C<\x{263A}>. This encoding scheme only works for characters -with a code of 0x100 or above. +Unicode characters can also be added to a string by using the C<\x{...}> +notation. The Unicode code for the desired character, in hexadecimal, +should be placed in the braces. For instance, a smiley face is +C<\x{263A}>. This encoding scheme works for all characters, but +for characters under 0x100, note that Perl may use an 8-bit encoding +internally, for optimization and/or backward compatibility. Additionally, if you @@ -163,8 +156,7 @@ names. =item * Regular expressions match characters instead of bytes. "." matches -a character instead of a byte. The C<\C> pattern is provided to force -a match a single byte--a C in C, hence C<\C>. +a character instead of a byte. =item * @@ -173,17 +165,13 @@ bytes and match against the character properties specified in the Unicode properties database. C<\w> can be used to match a Japanese ideograph, for instance. -(However, and as a limitation of the current implementation, using -C<\w> or C<\W> I a C<[...]> character class will still match -with byte semantics.) - =item * Named Unicode properties, scripts, and block ranges may be used like character classes via the C<\p{}> "matches property" construct and the C<\P{}> negation, "doesn't match property". -See L for more details. +See L for more details. You can define your own character properties and use them in the regular expression with the C<\p{}> or C<\P{}> construct. 
@@ -196,7 +184,7 @@ The special pattern C<\X> matches any extended Unicode sequence--"a combining character sequence" in Standardese--where the first character is a base character and subsequent characters are mark characters that apply to the base character. C<\X> is equivalent to -C<(?:\PM\pM*)>. +C<< (?>\PM\pM*) >>. =item * @@ -317,8 +305,7 @@ You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret equal to C<\P{Tamil}>. B +Unicode 5.0.0 in July 2006.> =over 4 @@ -425,16 +412,23 @@ such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: Arabic Armenian + Balinese Bengali Bopomofo + Braille + Buginese Buhid CanadianAboriginal Cherokee + Coptic + Cuneiform + Cypriot Cyrillic Deseret Devanagari Ethiopic Georgian + Glagolitic Gothic Greek Gujarati @@ -447,25 +441,39 @@ such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: Inherited Kannada Katakana + Kharoshthi Khmer Lao Latin + Limbu + LinearB Malayalam Mongolian Myanmar + NewTaiLue + Nko Ogham OldItalic + OldPersian Oriya + Osmanya + PhagsPa + Phoenician Runic + Shavian Sinhala + SylotiNagri Syriac Tagalog Tagbanwa + TaiLe Tamil Telugu Thaana Thai Tibetan + Tifinagh + Ugaritic Yi =item Extended property classes @@ -479,7 +487,6 @@ properties, defined by the F Unicode database: Deprecated Diacritic Extender - GraphemeLink HexDigit Hyphen Ideographic @@ -491,31 +498,44 @@ properties, defined by the F Unicode database: OtherAlphabetic OtherDefaultIgnorableCodePoint OtherGraphemeExtend + OtherIDStart + OtherIDContinue OtherLowercase OtherMath OtherUppercase + PatternSyntax + PatternWhiteSpace QuotationMark Radical SoftDotted + STerm TerminalPunctuation UnifiedIdeograph + VariationSelector WhiteSpace and there are further derived properties: - Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic - Lowercase Ll + OtherLowercase - Uppercase Lu + OtherUppercase - Math Sm + OtherMath + Alphabetic = Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic + Lowercase = Ll + OtherLowercase + Uppercase = Lu 
+ OtherUppercase + Math = Sm + OtherMath - ID_Start Lu + Ll + Lt + Lm + Lo + Nl - ID_Continue ID_Start + Mn + Mc + Nd + Pc + IDStart = Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart + IDContinue = IDStart + Mn + Mc + Nd + Pc + OtherIDContinue - Any Any character - Assigned Any non-Cn character (i.e. synonym for \P{Cn}) - Unassigned Synonym for \p{Cn} - Common Any character (or unassigned code point) - not explicitly assigned to a script + DefaultIgnorableCodePoint + = OtherDefaultIgnorableCodePoint + + Cf + Cc + Cs + Noncharacters + VariationSelector + - WhiteSpace - FFF9..FFFB (Annotation Characters) + + Any = Any code points (i.e. U+0000 to U+10FFFF) + Assigned = Any non-Cn code points (i.e. synonym for \P{Cn}) + Unassigned = Synonym for \p{Cn} + ASCII = ASCII (i.e. U+0000 to U+007F) + + Common = Any character (or unassigned code point) + not explicitly assigned to a script =item Use of "Is" Prefix @@ -535,9 +555,9 @@ blocks. It does not, for example, contain digits, because digits are shared across many scripts. Digits and similar groups, like punctuation, are in a category called C. -For more about scripts, see the UTR #24: +For more about scripts, see the UAX#24 "Script Names": - http://www.unicode.org/unicode/reports/tr24/ + http://www.unicode.org/reports/tr24/ For more about blocks, see: @@ -551,12 +571,17 @@ for block tests to avoid confusion. 
These block names are supported: + InAegeanNumbers InAlphabeticPresentationForms + InAncientGreekMusicalNotation + InAncientGreekNumbers InArabic InArabicPresentationFormsA InArabicPresentationFormsB + InArabicSupplement InArmenian InArrows + InBalinese InBasicLatin InBengali InBlockElements @@ -564,6 +589,7 @@ These block names are supported: InBopomofoExtended InBoxDrawing InBraillePatterns + InBuginese InBuhid InByzantineMusicalSymbols InCJKCompatibility @@ -571,27 +597,38 @@ These block names are supported: InCJKCompatibilityIdeographs InCJKCompatibilityIdeographsSupplement InCJKRadicalsSupplement + InCJKStrokes InCJKSymbolsAndPunctuation InCJKUnifiedIdeographs InCJKUnifiedIdeographsExtensionA InCJKUnifiedIdeographsExtensionB InCherokee InCombiningDiacriticalMarks + InCombiningDiacriticalMarksSupplement InCombiningDiacriticalMarksforSymbols InCombiningHalfMarks InControlPictures + InCoptic + InCountingRodNumerals + InCuneiform + InCuneiformNumbersAndPunctuation InCurrencySymbols + InCypriotSyllabary InCyrillic - InCyrillicSupplementary + InCyrillicSupplement InDeseret InDevanagari InDingbats InEnclosedAlphanumerics InEnclosedCJKLettersAndMonths InEthiopic + InEthiopicExtended + InEthiopicSupplement InGeneralPunctuation InGeometricShapes InGeorgian + InGeorgianSupplement + InGlagolitic InGothic InGreekExtended InGreekAndCoptic @@ -613,13 +650,20 @@ These block names are supported: InKannada InKatakana InKatakanaPhoneticExtensions + InKharoshthi InKhmer + InKhmerSymbols InLao InLatin1Supplement InLatinExtendedA InLatinExtendedAdditional InLatinExtendedB + InLatinExtendedC + InLatinExtendedD InLetterlikeSymbols + InLimbu + InLinearBIdeograms + InLinearBSyllabary InLowSurrogates InMalayalam InMathematicalAlphanumericSymbols @@ -627,17 +671,28 @@ These block names are supported: InMiscellaneousMathematicalSymbolsA InMiscellaneousMathematicalSymbolsB InMiscellaneousSymbols + InMiscellaneousSymbolsAndArrows InMiscellaneousTechnical + InModifierToneLetters InMongolian 
InMusicalSymbols InMyanmar + InNKo + InNewTaiLue InNumberForms InOgham InOldItalic + InOldPersian InOpticalCharacterRecognition InOriya + InOsmanya + InPhagspa + InPhoenician + InPhoneticExtensions + InPhoneticExtensionsSupplement InPrivateUseArea InRunic + InShavian InSinhala InSmallFormVariants InSpacingModifierLetters @@ -646,21 +701,30 @@ These block names are supported: InSupplementalArrowsA InSupplementalArrowsB InSupplementalMathematicalOperators + InSupplementalPunctuation InSupplementaryPrivateUseAreaA InSupplementaryPrivateUseAreaB + InSylotiNagri InSyriac InTagalog InTagbanwa InTags + InTaiLe + InTaiXuanJingSymbols InTamil InTelugu InThaana InThai InTibetan + InTifinagh + InUgaritic InUnifiedCanadianAboriginalSyllabics InVariationSelectors + InVariationSelectorsSupplement + InVerticalForms InYiRadicals InYiSyllables + InYijingHexagramSymbols =back @@ -690,6 +754,10 @@ or more newline-separated lines. Each line must be one of the following: =item * +A single hexadecimal number denoting a Unicode code point to include. + +=item * + Two hexadecimal numbers separated by horizontal whitespace (space or tabular characters) denoting a range of Unicode code points to include. @@ -780,10 +848,6 @@ two (or more) classes. It's important to remember not to use "&" for the first set -- that would be intersecting with nothing (resulting in an empty set). -A final note on the user-defined property tests: they will be used -only if the scalar has been marked as having Unicode characters. -Old byte-style strings will not be affected. - =head2 User-Defined Case Mappings You can also define your own mappings to be used in the lc(), @@ -845,9 +909,8 @@ See L. The following list of Unicode support for regular expressions describes all the features currently supported. The references to "Level N" -and the section numbers refer to the Unicode Technical Report 18, -"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0, -Perl 5.8.0). 
+and the section numbers refer to the Unicode Technical Standard #18, +"Unicode Regular Expressions", version 11, in May 2005. =over 4 @@ -855,37 +918,42 @@ Perl 5.8.0). Level 1 - Basic Unicode Support - 2.1 Hex Notation - done [1] - Named Notation - done [2] - 2.2 Categories - done [3][4] - 2.3 Subtraction - MISSING [5][6] - 2.4 Simple Word Boundaries - done [7] - 2.5 Simple Loose Matches - done [8] - 2.6 End of Line - MISSING [9][10] - - [ 1] \x{...} - [ 2] \N{...} - [ 3] . \p{...} \P{...} - [ 4] support for scripts (see UTR#24 Script Names), blocks, - binary properties, enumerated non-binary properties, and - numeric properties (as listed in UTR#18 Other Properties) - [ 5] have negation - [ 6] can use regular expression look-ahead [a] - or user-defined character properties [b] to emulate subtraction - [ 7] include Letters in word characters - [ 8] note that Perl does Full case-folding in matching, not Simple: + RL1.1 Hex Notation - done [1] + RL1.2 Properties - done [2][3] + RL1.2a Compatibility Properties - done [4] + RL1.3 Subtraction and Intersection - MISSING [5] + RL1.4 Simple Word Boundaries - done [6] + RL1.5 Simple Loose Matches - done [7] + RL1.6 Line Boundaries - MISSING [8] + RL1.7 Supplementary Code Points - done [9] + + [1] \x{...} + [2] \p{...} \P{...} + [3] supports not only minimal list (general category, scripts, + Alphabetic, Lowercase, Uppercase, WhiteSpace, + NoncharacterCodePoint, DefaultIgnorableCodePoint, Any, + ASCII, Assigned), but also bidirectional types, blocks, etc. + (see L) + [4] \d \D \s \S \w \W \X [:prop:] [:^prop:] + [5] can use regular expression look-ahead [a] or + user-defined character properties [b] to emulate set operations + [6] \b \B + [7] note that Perl does Full case-folding in matching, not Simple: for example U+1F88 is equivalent with U+1F00 U+03B9, not with 1F80. 
This difference matters for certain Greek capital letters with certain modifiers: the Full case-folding decomposes the letter, while the Simple case-folding would map it to a single character. - [ 9] see UTR #13 Unicode Newline Guidelines - [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029} - (should also affect <>, $., and script line numbers) - (the \x{85}, \x{2028} and \x{2029} do match \s) + [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r), + CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029); + should also affect <>, $., and script line numbers; + should not split lines within CRLF [c] (i.e. there is no empty + line between \r and \n) + [9] UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to U+10FFFF + but also beyond U+10FFFF [d] [a] You can mimic class subtraction using lookahead. -For example, what UTR #18 might write as +For example, what UTS#18 might write as [{Greek}-[{UNASSIGNED}]] @@ -901,40 +969,62 @@ But in this particular example, you probably really want which will match assigned characters known to be part of the Greek script. Also see the Unicode::Regex::Set module, it does implement the full -UTR #18 grouping, intersection, union, and removal (subtraction) syntax. +UTS#18 grouping, intersection, union, and removal (subtraction) syntax. + +[b] '+' for union, '-' for removal (set-difference), '&' for intersection +(see L) -[b] See L. +[c] Try the C<:crlf> layer (see L). + +[d] Avoid C (or say C) to allow +U+FFFF (C<\x{FFFF}>). =item * Level 2 - Extended Unicode Support - 3.1 Surrogates - MISSING [11] - 3.2 Canonical Equivalents - MISSING [12][13] - 3.3 Locale-Independent Graphemes - MISSING [14] - 3.4 Locale-Independent Words - MISSING [15] - 3.5 Locale-Independent Loose Matches - MISSING [16] - - [11] Surrogates are solely a UTF-16 concept and Perl's internal - representation is UTF-8. The Encode module does UTF-16, though. 
- [12] see UTR#15 Unicode Normalization - [13] have Unicode::Normalize but not integrated to regexes - [14] have \X but at this level . should equal that - [15] need three classes, not just \w and \W - [16] see UTR#21 Case Mappings + RL2.1 Canonical Equivalents - MISSING [10][11] + RL2.2 Default Grapheme Clusters - MISSING [12][13] + RL2.3 Default Word Boundaries - MISSING [14] + RL2.4 Default Loose Matches - MISSING [15] + RL2.5 Name Properties - MISSING [16] + RL2.6 Wildcard Properties - MISSING + + [10] see UAX#15 "Unicode Normalization Forms" + [11] have Unicode::Normalize but not integrated to regexes + [12] have \X but at this level . should equal that + [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable + clusters as a single grapheme cluster. + [14] see UAX#29, Word Boundaries + [15] see UAX#21 "Case Mappings" + [16] have \N{...} but neither compute names of CJK Ideographs + and Hangul Syllables nor use a loose match [e] + +[e] C<\N{...}> allows namespaces (see L). 
=item * -Level 3 - Locale-Sensitive Support - - 4.1 Locale-Dependent Categories - MISSING - 4.2 Locale-Dependent Graphemes - MISSING [16][17] - 4.3 Locale-Dependent Words - MISSING - 4.4 Locale-Dependent Loose Matches - MISSING - 4.5 Locale-Dependent Ranges - MISSING - - [16] see UTR#10 Unicode Collation Algorithms - [17] have Unicode::Collate but not integrated to regexes +Level 3 - Tailored Support + + RL3.1 Tailored Punctuation - MISSING + RL3.2 Tailored Grapheme Clusters - MISSING [17][18] + RL3.3 Tailored Word Boundaries - MISSING + RL3.4 Tailored Loose Matches - MISSING + RL3.5 Tailored Ranges - MISSING + RL3.6 Context Matching - MISSING [19] + RL3.7 Incremental Matches - MISSING + ( RL3.8 Unicode Set Sharing ) + RL3.9 Possible Match Sets - MISSING + RL3.10 Folded Matching - MISSING [20] + RL3.11 Submatchers - MISSING + + [17] see UAX#10 "Unicode Collation Algorithms" + [18] have Unicode::Collate but not integrated to regexes + [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see + outside of the target substring + [20] need insensitive matching for linguistic features other than case; + for example, hiragana to katakana, wide and narrow, simplified Han + to traditional Han (see UTR#30 "Character Foldings") =back @@ -1339,7 +1429,7 @@ Unicode is discouraged. =head2 Interaction with Extensions When Perl exchanges data with an extension, the extension should be -able to understand the UTF-8 flag and act accordingly. If the +able to understand the UTF8 flag and act accordingly. If the extension doesn't know about the flag, it's likely that the extension will return incorrectly-flagged data. 
@@ -1384,7 +1474,7 @@ derived class with such a C method: sub param { my($self,$name,$value) = @_; utf8::upgrade($name); # make sure it is UTF-8 encoded - if (defined $value) + if (defined $value) { utf8::upgrade($value); # make sure it is UTF-8 encoded return $self->SUPER::param($name,$value); } else { @@ -1433,7 +1523,7 @@ to work under 5.6, so you should be safe to try them out. A filehandle that should read or write UTF-8 if ($] > 5.007) { - binmode $fh, ":utf8"; + binmode $fh, ":encoding(utf8)"; } =item * @@ -1442,7 +1532,7 @@ A scalar that is going to be passed to some extension Be it Compress::Zlib, Apache::Request or any extension that has no mention of Unicode in the manpage, you need to make sure that the -UTF-8 flag is stripped off. Note that at the time of this writing +UTF8 flag is stripped off. Note that at the time of this writing (October 2002) the mentioned modules are not UTF-8-aware. Please check the documentation to verify if this is still true. @@ -1456,7 +1546,7 @@ check the documentation to verify if this is still true. A scalar we got back from an extension If you believe the scalar comes back as UTF-8, you will most likely -want the UTF-8 flag restored: +want the UTF8 flag restored: if ($] > 5.007) { require Encode; @@ -1518,7 +1608,7 @@ A large scalar that you know can only contain ASCII Scalars that contain only ASCII and are marked as UTF-8 are sometimes a drag to your program. If you recognize such a situation, just remove -the UTF-8 flag: +the UTF8 flag: utf8::downgrade($val) if $] > 5.007; @@ -1526,7 +1616,7 @@ the UTF-8 flag: =head1 SEE ALSO -L, L, L, L, L, L, +L, L, L, L, L, L, L, L =cut