X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=61d62d219f965ffa87f1f1281de0d4bc45d6184a;hb=b9ad30b40cf004f5ea6fd7a945a950cf873aed7b;hp=1101b5ee084672c30a501c7f67f91531f0ab06fa;hpb=1e54db1a8aea187ba2e790aca2ab81fab24ff92d;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 1101b5e..61d62d2 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -10,6 +10,10 @@ Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features. +People who want to learn to use Unicode in Perl, should probably read +L before reading this reference +document. + =over 4 =item Input and Output Layers @@ -20,15 +24,15 @@ the ":utf8" layer. Other encodings can be converted to Perl's encoding on input or from Perl's encoding on output by use of the ":encoding(...)" layer. See L. -To indicate that Perl source itself is using a particular encoding, -see L. +To indicate that Perl source itself is in UTF-8, use C. =item Regular Expressions The regular expression compiler produces polymorphic opcodes. That is, the pattern adapts to the data and automatically switches to the Unicode -character scheme when presented with Unicode data--or instead uses -a traditional byte scheme when presented with byte data. +character scheme when presented with data that is internally encoded in +UTF-8 -- or instead uses a traditional byte scheme when presented with +byte data. =item C still needed to enable UTF-8/UTF-EBCDIC in scripts @@ -39,8 +43,23 @@ ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based machines. B is needed.> See L. -You can also use the C pragma to change the default encoding -of the data in your script; see L. 
+=item BOM-marked scripts and UTF-16 scripts autodetected + +If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE, +or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either +endianness, Perl will correctly read in the script as Unicode. +(BOMless UTF-8 cannot be effectively recognized or differentiated from +ISO 8859-1 or other eight-bit encodings.) + +=item C needed to upgrade non-Latin-1 byte strings + +By default, there is a fundamental asymmetry in Perl's Unicode model: +implicit upgrading from byte strings to Unicode strings assumes that +they were encoded in I, but Unicode strings are +downgraded with UTF-8 encoding. This happens because the first 256 +codepoints in Unicode happen to agree with Latin-1. + +See L for more details. =back @@ -86,12 +105,10 @@ Otherwise, byte semantics are in effect. The C pragma should be used to force byte semantics on Unicode data. If strings operating under byte semantics and strings with Unicode -character data are concatenated, the new string will be upgraded to -I, even if the old Unicode string used EBCDIC. -This translation is done without regard to the system's native 8-bit -encoding, so to change this for systems with non-Latin-1 and -non-EBCDIC native encodings use the C pragma. See -L. +character data are concatenated, the new string will be created by +decoding the byte strings as I, even if the +old Unicode string used EBCDIC. This translation is done without +regard to the system's native 8-bit encoding. Under character semantics, many operations that formerly operated on bytes now operate on characters. A character in Perl is @@ -111,17 +128,16 @@ Character semantics have the following effects: Strings--including hash keys--and regular expression patterns may contain characters that have an ordinal value larger than 255. 
-If you use a Unicode editor to edit your program, Unicode characters -may occur directly within the literal strings in one of the various -Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized -as such and converted to Perl's internal representation only if the -appropriate L is specified. +If you use a Unicode editor to edit your program, Unicode characters may +occur directly within the literal strings in UTF-8 encoding, or UTF-16. +(The former requires a BOM or C, the latter requires a BOM.) -Unicode characters can also be added to a string by using the -C<\x{...}> notation. The Unicode code for the desired character, in -hexadecimal, should be placed in the braces. For instance, a smiley -face is C<\x{263A}>. This encoding scheme only works for characters -with a code of 0x100 or above. +Unicode characters can also be added to a string by using the C<\x{...}> +notation. The Unicode code for the desired character, in hexadecimal, +should be placed in the braces. For instance, a smiley face is +C<\x{263A}>. This encoding scheme works for all characters, but +for characters under 0x100, note that Perl may use an 8-bit encoding +internally, for optimization and/or backward compatibility. Additionally, if you @@ -130,7 +146,6 @@ Additionally, if you you can use the C<\N{...}> notation and put the official Unicode character name within the braces, such as C<\N{WHITE SMILING FACE}>. - =item * If an appropriate L is specified, identifiers within the @@ -141,8 +156,7 @@ names. =item * Regular expressions match characters instead of bytes. "." matches -a character instead of a byte. The C<\C> pattern is provided to force -a match a single byte--a C in C, hence C<\C>. +a character instead of a byte. =item * @@ -155,7 +169,120 @@ ideograph, for instance. Named Unicode properties, scripts, and block ranges may be used like character classes via the C<\p{}> "matches property" construct and -the C<\P{}> negation, "doesn't match property". 
+the C<\P{}> negation, "doesn't match property". + +See L for more details. + +You can define your own character properties and use them +in the regular expression with the C<\p{}> or C<\P{}> construct. + +See L for more details. + +=item * + +The special pattern C<\X> matches any extended Unicode +sequence--"a combining character sequence" in Standardese--where the +first character is a base character and subsequent characters are mark +characters that apply to the base character. C<\X> is equivalent to +C<(?:\PM\pM*)>. + +=item * + +The C operator translates characters instead of bytes. Note +that the C functionality has been removed. For similar +functionality see pack('U0', ...) and pack('C0', ...). + +=item * + +Case translation operators use the Unicode case translation tables +when character input is provided. Note that C, or C<\U> in +interpolated strings, translates to uppercase, while C, +or C<\u> in interpolated strings, translates to titlecase in languages +that make the distinction. + +=item * + +Most operators that deal with positions or lengths in a string will +automatically switch to using character positions, including +C, C, C, C, C, C, +C, C, and C. An operator that +specifically does not switch is C. Operators that really don't +care include operators that treat strings as a bucket of bits such as +C, and operators dealing with filenames. + +=item * + +The C/C letter C does I change, since it is often +used for byte-oriented formats. Again, think C in the C language. + +There is a new C specifier that converts between Unicode characters +and code points. There is also a C specifier that is the equivalent of +C/C and properly handles character values even if they are above 255. + +=item * + +The C and C functions work on characters, similar to +C and C, I C and +C. C and C are methods for +emulating byte-oriented C and C on Unicode strings. 
+While these methods reveal the internal encoding of Unicode strings, +that is not something one normally needs to care about at all. + +=item * + +The bit string operators, C<& | ^ ~>, can operate on character data. +However, for backward compatibility, such as when using bit string +operations when characters are all less than 256 in ordinal value, one +should not use C<~> (the bit complement) with characters of both +values less than 256 and values greater than 256. Most importantly, +DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) +will not hold. The reason for this mathematical I is that +the complement cannot return B the 8-bit (byte-wide) bit +complement B the full character-wide bit complement. + +=item * + +lc(), uc(), lcfirst(), and ucfirst() work for the following cases: + +=over 8 + +=item * + +the case mapping is from a single Unicode character to another +single Unicode character, or + +=item * + +the case mapping is from a single Unicode character to more +than one Unicode character. + +=back + +Things to do with locales (Lithuanian, Turkish, Azeri) do B work +since Perl does not understand the concept of Unicode locales. + +See the Unicode Technical Report #21, Case Mappings, for more details. + +But you can also define your own mappings to be used in the lc(), +lcfirst(), uc(), and ucfirst() (or their string-inlined versions). + +See L for more details. + +=back + +=over 4 + +=item * + +And finally, C reverses by character rather than by byte. + +=back + +=head2 Unicode Character Properties + +Named Unicode properties, scripts, and block ranges may be used like +character classes via the C<\p{}> "matches property" construct and +the C<\P{}> negation, "doesn't match property". For instance, C<\p{Lu}> matches any character with the Unicode "Lu" (Letter, uppercase) property, while C<\p{M}> matches any character @@ -178,8 +305,11 @@ You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret equal to C<\P{Tamil}>. 
B +Unicode 5.0.0 in July 2006.> + +=over 4 + +=item General Category Here are the basic Unicode General Category properties, followed by their long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>, @@ -188,6 +318,7 @@ for instance, are identical. Short Long L Letter + LC CasedLetter Lu UppercaseLetter Ll LowercaseLetter Lt TitlecaseLetter @@ -235,60 +366,69 @@ for instance, are identical. Single-letter properties match all characters in any of the two-letter sub-properties starting with the same letter. -C is a special case, which is an alias for C, C, and C. +C and C are special cases, which are aliases for the set of +C, C, and C. Because Perl hides the need for the user to understand the internal representation of Unicode characters, there is no need to implement the somewhat messy concept of surrogates. C is therefore not supported. +=item Bidirectional Character Types + Because scripts differ in their directionality--Hebrew is -written right to left, for example--Unicode supplies these properties: +written right to left, for example--Unicode supplies these properties in +the BidiClass class: Property Meaning - BidiL Left-to-Right - BidiLRE Left-to-Right Embedding - BidiLRO Left-to-Right Override - BidiR Right-to-Left - BidiAL Right-to-Left Arabic - BidiRLE Right-to-Left Embedding - BidiRLO Right-to-Left Override - BidiPDF Pop Directional Format - BidiEN European Number - BidiES European Number Separator - BidiET European Number Terminator - BidiAN Arabic Number - BidiCS Common Number Separator - BidiNSM Non-Spacing Mark - BidiBN Boundary Neutral - BidiB Paragraph Separator - BidiS Segment Separator - BidiWS Whitespace - BidiON Other Neutrals - -For example, C<\p{BidiR}> matches characters that are normally + L Left-to-Right + LRE Left-to-Right Embedding + LRO Left-to-Right Override + R Right-to-Left + AL Right-to-Left Arabic + RLE Right-to-Left Embedding + RLO Right-to-Left Override + PDF Pop Directional Format + EN European Number + ES European 
Number Separator + ET European Number Terminator + AN Arabic Number + CS Common Number Separator + NSM Non-Spacing Mark + BN Boundary Neutral + B Paragraph Separator + S Segment Separator + WS Whitespace + ON Other Neutrals + +For example, C<\p{BidiClass:R}> matches characters that are normally written right to left. -=back - -=head2 Scripts +=item Scripts The script names which can be used by C<\p{...}> and C<\P{...}>, such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: Arabic Armenian + Balinese Bengali Bopomofo + Braille + Buginese Buhid CanadianAboriginal Cherokee + Coptic + Cuneiform + Cypriot Cyrillic Deseret Devanagari Ethiopic Georgian + Glagolitic Gothic Greek Gujarati @@ -301,27 +441,43 @@ such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: Inherited Kannada Katakana + Kharoshthi Khmer Lao Latin + Limbu + LinearB Malayalam Mongolian Myanmar + NewTaiLue + Nko Ogham OldItalic + OldPersian Oriya + Osmanya + PhagsPa + Phoenician Runic + Shavian Sinhala + SylotiNagri Syriac Tagalog Tagbanwa + TaiLe Tamil Telugu Thaana Thai Tibetan + Tifinagh + Ugaritic Yi +=item Extended property classes + Extended property classes can supplement the basic properties, defined by the F Unicode database: @@ -331,7 +487,6 @@ properties, defined by the F Unicode database: Deprecated Diacritic Extender - GraphemeLink HexDigit Hyphen Ideographic @@ -343,37 +498,52 @@ properties, defined by the F Unicode database: OtherAlphabetic OtherDefaultIgnorableCodePoint OtherGraphemeExtend + OtherIDStart + OtherIDContinue OtherLowercase OtherMath OtherUppercase + PatternSyntax + PatternWhiteSpace QuotationMark Radical SoftDotted + STerm TerminalPunctuation UnifiedIdeograph + VariationSelector WhiteSpace and there are further derived properties: - Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic - Lowercase Ll + OtherLowercase - Uppercase Lu + OtherUppercase - Math Sm + OtherMath + Alphabetic = Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic + Lowercase = Ll + OtherLowercase + 
Uppercase = Lu + OtherUppercase + Math = Sm + OtherMath + + IDStart = Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart + IDContinue = IDStart + Mn + Mc + Nd + Pc + OtherIDContinue + + DefaultIgnorableCodePoint + = OtherDefaultIgnorableCodePoint + + Cf + Cc + Cs + Noncharacters + VariationSelector + - WhiteSpace - FFF9..FFFB (Annotation Characters) + + Any = Any code points (i.e. U+0000 to U+10FFFF) + Assigned = Any non-Cn code points (i.e. synonym for \P{Cn}) + Unassigned = Synonym for \p{Cn} + ASCII = ASCII (i.e. U+0000 to U+007F) - ID_Start Lu + Ll + Lt + Lm + Lo + Nl - ID_Continue ID_Start + Mn + Mc + Nd + Pc + Common = Any character (or unassigned code point) + not explicitly assigned to a script - Any Any character - Assigned Any non-Cn character (i.e. synonym for \P{Cn}) - Unassigned Synonym for \p{Cn} - Common Any character (or unassigned code point) - not explicitly assigned to a script +=item Use of "Is" Prefix For backward compatibility (with Perl 5.6), all properties mentioned so far may have C prepended to their name, so C<\P{IsLu}>, for example, is equal to C<\P{Lu}>. -=head2 Blocks +=item Blocks In addition to B, Unicode also defines B of characters. The difference between scripts and blocks is that the @@ -385,9 +555,9 @@ blocks. It does not, for example, contain digits, because digits are shared across many scripts. Digits and similar groups, like punctuation, are in a category called C. -For more about scripts, see the UTR #24: +For more about scripts, see the UAX#24 "Script Names": - http://www.unicode.org/unicode/reports/tr24/ + http://www.unicode.org/reports/tr24/ For more about blocks, see: @@ -401,12 +571,17 @@ for block tests to avoid confusion. 
These block names are supported: + InAegeanNumbers InAlphabeticPresentationForms + InAncientGreekMusicalNotation + InAncientGreekNumbers InArabic InArabicPresentationFormsA InArabicPresentationFormsB + InArabicSupplement InArmenian InArrows + InBalinese InBasicLatin InBengali InBlockElements @@ -414,6 +589,7 @@ These block names are supported: InBopomofoExtended InBoxDrawing InBraillePatterns + InBuginese InBuhid InByzantineMusicalSymbols InCJKCompatibility @@ -421,27 +597,38 @@ These block names are supported: InCJKCompatibilityIdeographs InCJKCompatibilityIdeographsSupplement InCJKRadicalsSupplement + InCJKStrokes InCJKSymbolsAndPunctuation InCJKUnifiedIdeographs InCJKUnifiedIdeographsExtensionA InCJKUnifiedIdeographsExtensionB InCherokee InCombiningDiacriticalMarks + InCombiningDiacriticalMarksSupplement InCombiningDiacriticalMarksforSymbols InCombiningHalfMarks InControlPictures + InCoptic + InCountingRodNumerals + InCuneiform + InCuneiformNumbersAndPunctuation InCurrencySymbols + InCypriotSyllabary InCyrillic - InCyrillicSupplementary + InCyrillicSupplement InDeseret InDevanagari InDingbats InEnclosedAlphanumerics InEnclosedCJKLettersAndMonths InEthiopic + InEthiopicExtended + InEthiopicSupplement InGeneralPunctuation InGeometricShapes InGeorgian + InGeorgianSupplement + InGlagolitic InGothic InGreekExtended InGreekAndCoptic @@ -463,13 +650,20 @@ These block names are supported: InKannada InKatakana InKatakanaPhoneticExtensions + InKharoshthi InKhmer + InKhmerSymbols InLao InLatin1Supplement InLatinExtendedA InLatinExtendedAdditional InLatinExtendedB + InLatinExtendedC + InLatinExtendedD InLetterlikeSymbols + InLimbu + InLinearBIdeograms + InLinearBSyllabary InLowSurrogates InMalayalam InMathematicalAlphanumericSymbols @@ -477,17 +671,28 @@ These block names are supported: InMiscellaneousMathematicalSymbolsA InMiscellaneousMathematicalSymbolsB InMiscellaneousSymbols + InMiscellaneousSymbolsAndArrows InMiscellaneousTechnical + InModifierToneLetters InMongolian 
InMusicalSymbols InMyanmar + InNKo + InNewTaiLue InNumberForms InOgham InOldItalic + InOldPersian InOpticalCharacterRecognition InOriya + InOsmanya + InPhagspa + InPhoenician + InPhoneticExtensions + InPhoneticExtensionsSupplement InPrivateUseArea InRunic + InShavian InSinhala InSmallFormVariants InSpacingModifierLetters @@ -496,127 +701,51 @@ These block names are supported: InSupplementalArrowsA InSupplementalArrowsB InSupplementalMathematicalOperators + InSupplementalPunctuation InSupplementaryPrivateUseAreaA InSupplementaryPrivateUseAreaB + InSylotiNagri InSyriac InTagalog InTagbanwa InTags + InTaiLe + InTaiXuanJingSymbols InTamil InTelugu InThaana InThai InTibetan + InTifinagh + InUgaritic InUnifiedCanadianAboriginalSyllabics InVariationSelectors + InVariationSelectorsSupplement + InVerticalForms InYiRadicals InYiSyllables - -=over 4 - -=item * - -The special pattern C<\X> matches any extended Unicode -sequence--"a combining character sequence" in Standardese--where the -first character is a base character and subsequent characters are mark -characters that apply to the base character. C<\X> is equivalent to -C<(?:\PM\pM*)>. - -=item * - -The C operator translates characters instead of bytes. Note -that the C functionality has been removed. For similar -functionality see pack('U0', ...) and pack('C0', ...). - -=item * - -Case translation operators use the Unicode case translation tables -when character input is provided. Note that C, or C<\U> in -interpolated strings, translates to uppercase, while C, -or C<\u> in interpolated strings, translates to titlecase in languages -that make the distinction. - -=item * - -Most operators that deal with positions or lengths in a string will -automatically switch to using character positions, including -C, C, C, C, C, -C, C, and C. Operators that -specifically do not switch include C, C, and -C. 
Operators that really don't care include C, -operators that treats strings as a bucket of bits such as C, -and operators dealing with filenames. - -=item * - -The C/C letters C and C do I change, -since they are often used for byte-oriented formats. Again, think -C in the C language. - -There is a new C specifier that converts between Unicode characters -and code points. - -=item * - -The C and C functions work on characters, similar to -C and C, I C and -C. C and C are methods for -emulating byte-oriented C and C on Unicode strings. -While these methods reveal the internal encoding of Unicode strings, -that is not something one normally needs to care about at all. - -=item * - -The bit string operators, C<& | ^ ~>, can operate on character data. -However, for backward compatibility, such as when using bit string -operations when characters are all less than 256 in ordinal value, one -should not use C<~> (the bit complement) with characters of both -values less than 256 and values greater than 256. Most importantly, -DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) -will not hold. The reason for this mathematical I is that -the complement cannot return B the 8-bit (byte-wide) bit -complement B the full character-wide bit complement. - -=item * - -lc(), uc(), lcfirst(), and ucfirst() work for the following cases: - -=over 8 - -=item * - -the case mapping is from a single Unicode character to another -single Unicode character, or - -=item * - -the case mapping is from a single Unicode character to more -than one Unicode character. + InYijingHexagramSymbols =back -Things to do with locales (Lithuanian, Turkish, Azeri) do B work -since Perl does not understand the concept of Unicode locales. - -See the Unicode Technical Report #21, Case Mappings, for more details. - -=back +=head2 User-Defined Character Properties -=over 4 +You can define your own character properties by defining subroutines +whose names begin with "In" or "Is". 
The subroutines can be defined in +any package. The user-defined properties can be used in the regular +expression C<\p> and C<\P> constructs; if you are using a user-defined +property from a package other than the one you are in, you must specify +its package in the C<\p> or C<\P> construct. -=item * + # assuming property IsForeign defined in Lang:: + package main; # property package name required + if ($txt =~ /\p{Lang::IsForeign}+/) { ... } -And finally, C reverses by character rather than by byte. + package Lang; # property package name not required + if ($txt =~ /\p{IsForeign}+/) { ... } -=back -=head2 User-Defined Character Properties - -You can define your own character properties by defining subroutines -whose names begin with "In" or "Is". The subroutines must be defined -in the C
package. The user-defined properties can be used in the -regular expression C<\p> and C<\P> constructs. Note that the effect -is compile-time and immutable once defined. +Note that the effect is compile-time and immutable once defined. The subroutines must return a specially-formatted string, with one or more newline-separated lines. Each line must be one of the following: @@ -625,29 +754,40 @@ or more newline-separated lines. Each line must be one of the following: =item * +A single hexadecimal number denoting a Unicode code point to include. + +=item * + Two hexadecimal numbers separated by horizontal whitespace (space or tabular characters) denoting a range of Unicode code points to include. =item * Something to include, prefixed by "+": a built-in character -property (prefixed by "utf8::"), to represent all the characters in that -property; two hexadecimal code points for a range; or a single -hexadecimal code point. +property (prefixed by "utf8::") or a user-defined character property, +to represent all the characters in that property; two hexadecimal code +points for a range; or a single hexadecimal code point. =item * Something to exclude, prefixed by "-": an existing character -property (prefixed by "utf8::"), for all the characters in that -property; two hexadecimal code points for a range; or a single -hexadecimal code point. +property (prefixed by "utf8::") or a user-defined character property, +to represent all the characters in that property; two hexadecimal code +points for a range; or a single hexadecimal code point. =item * Something to negate, prefixed "!": an existing character -property (prefixed by "utf8::") for all the characters except the -characters in the property; two hexadecimal code points for a range; -or a single hexadecimal code point. +property (prefixed by "utf8::") or a user-defined character property, +to represent all the characters in that property; two hexadecimal code +points for a range; or a single hexadecimal code point. 
+ +=item * + +Something to intersect with, prefixed by "&": an existing character +property (prefixed by "utf8::") or a user-defined character property, +for all the characters except the characters in the property; two +hexadecimal code points for a range; or a single hexadecimal code point. =back @@ -695,9 +835,25 @@ The negation is useful for defining (surprise!) negated classes. END } +Intersection is useful for getting the common characters matched by +two (or more) classes. + + sub InFooAndBar { + return <<'END'; + +main::Foo + &main::Bar + END + } + +It's important to remember not to use "&" for the first set -- that +would be intersecting with nothing (resulting in an empty set). + +=head2 User-Defined Case Mappings + You can also define your own mappings to be used in the lc(), lcfirst(), uc(), and ucfirst() (or their string-inlined versions). -The principle is the same: define subroutines in the C
package +The principle is similar to that of user-defined character +properties: to define subroutines in the C
package with names like C (for lc() and lcfirst()), C (for the first character in ucfirst()), and C (for uc(), and the rest of the characters in ucfirst()). @@ -741,9 +897,9 @@ are not directly user-accessible, one can use either the C module, or just match case-insensitively (that's when the C mapping is used). -A final note on the user-defined property tests and mappings: they -will be used only if the scalar has been marked as having Unicode -characters. Old byte-style strings will not be affected. +A final note on the user-defined case mappings: they will be used +only if the scalar has been marked as having Unicode characters. +Old byte-style strings will not be affected. =head2 Character Encodings for Input and Output @@ -753,9 +909,8 @@ See L. The following list of Unicode support for regular expressions describes all the features currently supported. The references to "Level N" -and the section numbers refer to the Unicode Technical Report 18, -"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0, -Perl 5.8.0). +and the section numbers refer to the Unicode Technical Standard #18, +"Unicode Regular Expressions", version 11, in May 2005. =over 4 @@ -763,35 +918,42 @@ Perl 5.8.0). Level 1 - Basic Unicode Support - 2.1 Hex Notation - done [1] - Named Notation - done [2] - 2.2 Categories - done [3][4] - 2.3 Subtraction - MISSING [5][6] - 2.4 Simple Word Boundaries - done [7] - 2.5 Simple Loose Matches - done [8] - 2.6 End of Line - MISSING [9][10] - - [ 1] \x{...} - [ 2] \N{...} - [ 3] . 
\p{...} \P{...} - [ 4] now scripts (see UTR#24 Script Names) in addition to blocks - [ 5] have negation - [ 6] can use regular expression look-ahead [a] - or user-defined character properties [b] to emulate subtraction - [ 7] include Letters in word characters - [ 8] note that Perl does Full case-folding in matching, not Simple: + RL1.1 Hex Notation - done [1] + RL1.2 Properties - done [2][3] + RL1.2a Compatibility Properties - done [4] + RL1.3 Subtraction and Intersection - MISSING [5] + RL1.4 Simple Word Boundaries - done [6] + RL1.5 Simple Loose Matches - done [7] + RL1.6 Line Boundaries - MISSING [8] + RL1.7 Supplementary Code Points - done [9] + + [1] \x{...} + [2] \p{...} \P{...} + [3] supports not only minimal list (general category, scripts, + Alphabetic, Lowercase, Uppercase, WhiteSpace, + NoncharacterCodePoint, DefaultIgnorableCodePoint, Any, + ASCII, Assigned), but also bidirectional types, blocks, etc. + (see L) + [4] \d \D \s \S \w \W \X [:prop:] [:^prop:] + [5] can use regular expression look-ahead [a] or + user-defined character properties [b] to emulate set operations + [6] \b \B + [7] note that Perl does Full case-folding in matching, not Simple: for example U+1F88 is equivalent with U+1F00 U+03B9, not with 1F80. This difference matters for certain Greek capital letters with certain modifiers: the Full case-folding decomposes the letter, while the Simple case-folding would map it to a single character. - [ 9] see UTR #13 Unicode Newline Guidelines - [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029} - (should also affect <>, $., and script line numbers) - (the \x{85}, \x{2028} and \x{2029} do match \s) + [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r), + CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029); + should also affect <>, $., and script line numbers; + should not split lines within CRLF [c] (i.e. 
there is no empty + line between \r and \n) + [9] UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to U+10FFFF + but also beyond U+10FFFF [d] [a] You can mimic class subtraction using lookahead. -For example, what UTR #18 might write as +For example, what UTS#18 might write as [{Greek}-[{UNASSIGNED}]] @@ -807,40 +969,62 @@ But in this particular example, you probably really want which will match assigned characters known to be part of the Greek script. Also see the Unicode::Regex::Set module, it does implement the full -UTR #18 grouping, intersection, union, and removal (subtraction) syntax. +UTS#18 grouping, intersection, union, and removal (subtraction) syntax. + +[b] '+' for union, '-' for removal (set-difference), '&' for intersection +(see L) + +[c] Try the C<:crlf> layer (see L). -[b] See L. +[d] Avoid C (or say C) to allow +U+FFFF (C<\x{FFFF}>). =item * Level 2 - Extended Unicode Support - 3.1 Surrogates - MISSING [11] - 3.2 Canonical Equivalents - MISSING [12][13] - 3.3 Locale-Independent Graphemes - MISSING [14] - 3.4 Locale-Independent Words - MISSING [15] - 3.5 Locale-Independent Loose Matches - MISSING [16] - - [11] Surrogates are solely a UTF-16 concept and Perl's internal - representation is UTF-8. The Encode module does UTF-16, though. - [12] see UTR#15 Unicode Normalization - [13] have Unicode::Normalize but not integrated to regexes - [14] have \X but at this level . should equal that - [15] need three classes, not just \w and \W - [16] see UTR#21 Case Mappings + RL2.1 Canonical Equivalents - MISSING [10][11] + RL2.2 Default Grapheme Clusters - MISSING [12][13] + RL2.3 Default Word Boundaries - MISSING [14] + RL2.4 Default Loose Matches - MISSING [15] + RL2.5 Name Properties - MISSING [16] + RL2.6 Wildcard Properties - MISSING + + [10] see UAX#15 "Unicode Normalization Forms" + [11] have Unicode::Normalize but not integrated to regexes + [12] have \X but at this level . 
should equal that + [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable + clusters as a single grapheme cluster. + [14] see UAX#29, Word Boundaries + [15] see UAX#21 "Case Mappings" + [16] have \N{...} but neither compute names of CJK Ideographs + and Hangul Syllables nor use a loose match [e] + +[e] C<\N{...}> allows namespaces (see L). =item * -Level 3 - Locale-Sensitive Support - - 4.1 Locale-Dependent Categories - MISSING - 4.2 Locale-Dependent Graphemes - MISSING [16][17] - 4.3 Locale-Dependent Words - MISSING - 4.4 Locale-Dependent Loose Matches - MISSING - 4.5 Locale-Dependent Ranges - MISSING - - [16] see UTR#10 Unicode Collation Algorithms - [17] have Unicode::Collate but not integrated to regexes +Level 3 - Tailored Support + + RL3.1 Tailored Punctuation - MISSING + RL3.2 Tailored Grapheme Clusters - MISSING [17][18] + RL3.3 Tailored Word Boundaries - MISSING + RL3.4 Tailored Loose Matches - MISSING + RL3.5 Tailored Ranges - MISSING + RL3.6 Context Matching - MISSING [19] + RL3.7 Incremental Matches - MISSING + ( RL3.8 Unicode Set Sharing ) + RL3.9 Possible Match Sets - MISSING + RL3.10 Folded Matching - MISSING [20] + RL3.11 Submatchers - MISSING + + [17] see UAX#10 "Unicode Collation Algorithms" + [18] have Unicode::Collate but not integrated to regexes + [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see + outside of the target substring + [20] need insensitive matching for linguistic features other than case; + for example, hiragana to katakana, wide and narrow, simplified Han + to traditional Han (see UTR#30 "Character Foldings") =back @@ -1072,8 +1256,9 @@ as Unicode (UTF-8), there still are many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both, but it is not. -The following are such interfaces. For all of these Perl currently -(as of 5.8.1) simply assumes byte strings both as arguments and results. +The following are such interfaces. 
For all of these interfaces Perl +currently (as of 5.8.3) simply assumes byte strings both as arguments +and results, or UTF-8 strings if the C pragma has been used. One reason why Perl does not attempt to resolve the role of Unicode in these cases is that the answers are highly dependent on the operating @@ -1086,8 +1271,8 @@ portable concept. Similarly for the qx and system: how well will the =item * -chmod, chmod, chown, chroot, exec, link, mkdir -rename, rmdir stat, symlink, truncate, unlink, utime +chdir, chmod, chown, chroot, exec, link, lstat, mkdir, +rename, rmdir, stat, symlink, truncate, unlink, utime, -X =item * @@ -1149,7 +1334,7 @@ Unicode model is not to use UTF-8 until it is absolutely necessary. =item * -C) writes a Unicode character code point into +C writes a Unicode character code point into a buffer encoding the code point as UTF-8, and returns a pointer pointing after the UTF-8 bytes. @@ -1244,7 +1429,7 @@ Unicode is discouraged. =head2 Interaction with Extensions When Perl exchanges data with an extension, the extension should be -able to understand the UTF-8 flag and act accordingly. If the +able to understand the UTF8 flag and act accordingly. If the extension doesn't know about the flag, it's likely that the extension will return incorrectly-flagged data. @@ -1289,7 +1474,7 @@ derived class with such a C method: sub param { my($self,$name,$value) = @_; utf8::upgrade($name); # make sure it is UTF-8 encoded - if (defined $value) + if (defined $value) { utf8::upgrade($value); # make sure it is UTF-8 encoded return $self->SUPER::param($name,$value); } else { @@ -1314,8 +1499,12 @@ byte-encoded. In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 a caching scheme was introduced which will hopefully make the slowness -somewhat less spectacular. Operations with UTF-8 encoded strings are -still slower, though. +somewhat less spectacular, at least for some operations. 
In general, +operations with UTF-8 encoded strings are still slower. As an example, +the Unicode properties (character classes) like C<\p{Nd}> are known to +be quite a bit slower (5-20 times) than their simpler counterparts +like C<\d> (then again, there are 268 Unicode characters matching C +compared with the 10 ASCII characters matching C). =head2 Porting code from perl-5.6.X @@ -1334,7 +1523,7 @@ to work under 5.6, so you should be safe to try them out. A filehandle that should read or write UTF-8 if ($] > 5.007) { - binmode $fh, ":utf8"; + binmode $fh, ":encoding(utf8)"; } =item * @@ -1343,7 +1532,7 @@ A scalar that is going to be passed to some extension Be it Compress::Zlib, Apache::Request or any extension that has no mention of Unicode in the manpage, you need to make sure that the -UTF-8 flag is stripped off. Note that at the time of this writing +UTF8 flag is stripped off. Note that at the time of this writing (October 2002) the mentioned modules are not UTF-8-aware. Please check the documentation to verify if this is still true. @@ -1357,7 +1546,7 @@ check the documentation to verify if this is still true. A scalar we got back from an extension If you believe the scalar comes back as UTF-8, you will most likely -want the UTF-8 flag restored: +want the UTF8 flag restored: if ($] > 5.007) { require Encode; @@ -1419,7 +1608,7 @@ A large scalar that you know can only contain ASCII Scalars that contain only ASCII and are marked as UTF-8 are sometimes a drag to your program. If you recognize such a situation, just remove -the UTF-8 flag: +the UTF8 flag: utf8::downgrade($val) if $] > 5.007; @@ -1427,7 +1616,7 @@ the UTF-8 flag: =head1 SEE ALSO -L, L, L, L, L, L, +L, L, L, L, L, L, L, L =cut