X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=c913047099908c9d158faa0db29c2b5d694cdb04;hb=5145b83ccb40455ee1421b25f5971eb7e2a87afc;hp=7df28e2cf349f786630b306653ca14b6f9e87b9a;hpb=8aa8f774be44d46814d4ddbad03e302f1eb37338;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 7df28e2..c913047 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -10,6 +10,10 @@ Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features. +People who want to learn to use Unicode in Perl, should probably read +L before reading this reference +document. + =over 4 =item Input and Output Layers @@ -20,15 +24,15 @@ the ":utf8" layer. Other encodings can be converted to Perl's encoding on input or from Perl's encoding on output by use of the ":encoding(...)" layer. See L. -To indicate that Perl source itself is using a particular encoding, -see L. +To indicate that Perl source itself is in UTF-8, use C. =item Regular Expressions The regular expression compiler produces polymorphic opcodes. That is, the pattern adapts to the data and automatically switches to the Unicode -character scheme when presented with Unicode data--or instead uses -a traditional byte scheme when presented with byte data. +character scheme when presented with data that is internally encoded in +UTF-8 -- or instead uses a traditional byte scheme when presented with +byte data. =item C still needed to enable UTF-8/UTF-EBCDIC in scripts @@ -39,8 +43,23 @@ ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based machines. B is needed.> See L. -You can also use the C pragma to change the default encoding -of the data in your script; see L. 
+=item BOM-marked scripts and UTF-16 scripts autodetected + +If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE, +or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either +endianness, Perl will correctly read in the script as Unicode. +(BOMless UTF-8 cannot be effectively recognized or differentiated from +ISO 8859-1 or other eight-bit encodings.) + +=item C needed to upgrade non-Latin-1 byte strings + +By default, there is a fundamental asymmetry in Perl's Unicode model: +implicit upgrading from byte strings to Unicode strings assumes that +they were encoded in I, but Unicode strings are +downgraded with UTF-8 encoding. This happens because the first 256 +codepoints in Unicode happen to agree with Latin-1. + +See L for more details. =back @@ -86,12 +105,10 @@ Otherwise, byte semantics are in effect. The C pragma should be used to force byte semantics on Unicode data. If strings operating under byte semantics and strings with Unicode -character data are concatenated, the new string will be upgraded to -I, even if the old Unicode string used EBCDIC. -This translation is done without regard to the system's native 8-bit -encoding, so to change this for systems with non-Latin-1 and -non-EBCDIC native encodings use the C pragma. See -L. +character data are concatenated, the new string will be created by +decoding the byte strings as I, even if the +old Unicode string used EBCDIC. This translation is done without +regard to the system's native 8-bit encoding. Under character semantics, many operations that formerly operated on bytes now operate on characters. A character in Perl is @@ -111,17 +128,16 @@ Character semantics have the following effects: Strings--including hash keys--and regular expression patterns may contain characters that have an ordinal value larger than 255. 
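A minimal sketch of the point above, assuming nothing beyond core Perl: a string, a hash key, and a pattern each carry a character whose ordinal is above 255. The variable names are illustrative only.

```perl
use strict;
use warnings;

# One character above 255 (U+263A, WHITE SMILING FACE).
my $smiley = "\x{263A}";

# Under character semantics this is a single character,
# however many bytes Perl uses for it internally.
my $len = length $smiley;
my $ord = ord $smiley;

# Such characters are also usable as hash keys ...
my %seen = ($smiley => 1);

# ... and inside regular expression patterns.
my $matched = "a${smiley}b" =~ /\x{263A}/ ? 1 : 0;
```

Nothing is printed here on purpose: writing such a string to a filehandle without an output layer would raise a "Wide character" warning.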
-If you use a Unicode editor to edit your program, Unicode characters -may occur directly within the literal strings in one of the various -Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized -as such and converted to Perl's internal representation only if the -appropriate L is specified. +If you use a Unicode editor to edit your program, Unicode characters may +occur directly within the literal strings in UTF-8 encoding, or UTF-16. +(The former requires a BOM or C, the latter requires a BOM.) -Unicode characters can also be added to a string by using the -C<\x{...}> notation. The Unicode code for the desired character, in -hexadecimal, should be placed in the braces. For instance, a smiley -face is C<\x{263A}>. This encoding scheme only works for characters -with a code of 0x100 or above. +Unicode characters can also be added to a string by using the C<\x{...}> +notation. The Unicode code for the desired character, in hexadecimal, +should be placed in the braces. For instance, a smiley face is +C<\x{263A}>. This encoding scheme works for all characters, but +for characters under 0x100, note that Perl may use an 8-bit encoding +internally, for optimization and/or backward compatibility. Additionally, if you @@ -130,7 +146,6 @@ Additionally, if you you can use the C<\N{...}> notation and put the official Unicode character name within the braces, such as C<\N{WHITE SMILING FACE}>. - =item * If an appropriate L is specified, identifiers within the @@ -141,8 +156,7 @@ names. =item * Regular expressions match characters instead of bytes. "." matches -a character instead of a byte. The C<\C> pattern is provided to force -a match a single byte--a C in C, hence C<\C>. +a character instead of a byte. =item * @@ -155,7 +169,120 @@ ideograph, for instance. Named Unicode properties, scripts, and block ranges may be used like character classes via the C<\p{}> "matches property" construct and -the C<\P{}> negation, "doesn't match property". 
+the C<\P{}> negation, "doesn't match property". + +See L for more details. + +You can define your own character properties and use them +in the regular expression with the C<\p{}> or C<\P{}> construct. + +See L for more details. + +=item * + +The special pattern C<\X> matches any extended Unicode +sequence--"a combining character sequence" in Standardese--where the +first character is a base character and subsequent characters are mark +characters that apply to the base character. C<\X> is equivalent to +C<(?:\PM\pM*)>. + +=item * + +The C operator translates characters instead of bytes. Note +that the C functionality has been removed. For similar +functionality see pack('U0', ...) and pack('C0', ...). + +=item * + +Case translation operators use the Unicode case translation tables +when character input is provided. Note that C, or C<\U> in +interpolated strings, translates to uppercase, while C, +or C<\u> in interpolated strings, translates to titlecase in languages +that make the distinction. + +=item * + +Most operators that deal with positions or lengths in a string will +automatically switch to using character positions, including +C, C, C, C, C, C, +C, C, and C. An operator that +specifically does not switch is C. Operators that really don't +care include operators that treat strings as a bucket of bits such as +C, and operators dealing with filenames. + +=item * + +The C/C letter C does I change, since it is often +used for byte-oriented formats. Again, think C in the C language. + +There is a new C specifier that converts between Unicode characters +and code points. There is also a C specifier that is the equivalent of +C/C and properly handles character values even if they are above 255. + +=item * + +The C and C functions work on characters, similar to +C and C, I C and +C. C and C are methods for +emulating byte-oriented C and C on Unicode strings. 
+While these methods reveal the internal encoding of Unicode strings, +that is not something one normally needs to care about at all. + +=item * + +The bit string operators, C<& | ^ ~>, can operate on character data. +However, for backward compatibility, such as when using bit string +operations when characters are all less than 256 in ordinal value, one +should not use C<~> (the bit complement) with characters of both +values less than 256 and values greater than 256. Most importantly, +DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) +will not hold. The reason for this mathematical I is that +the complement cannot return B the 8-bit (byte-wide) bit +complement B the full character-wide bit complement. + +=item * + +lc(), uc(), lcfirst(), and ucfirst() work for the following cases: + +=over 8 + +=item * + +the case mapping is from a single Unicode character to another +single Unicode character, or + +=item * + +the case mapping is from a single Unicode character to more +than one Unicode character. + +=back + +Things to do with locales (Lithuanian, Turkish, Azeri) do B work +since Perl does not understand the concept of Unicode locales. + +See the Unicode Technical Report #21, Case Mappings, for more details. + +But you can also define your own mappings to be used in the lc(), +lcfirst(), uc(), and ucfirst() (or their string-inlined versions). + +See L for more details. + +=back + +=over 4 + +=item * + +And finally, C reverses by character rather than by byte. + +=back + +=head2 Unicode Character Properties + +Named Unicode properties, scripts, and block ranges may be used like +character classes via the C<\p{}> "matches property" construct and +the C<\P{}> negation, "doesn't match property". 
For instance, C<\p{Lu}> matches any character with the Unicode "Lu" (Letter, uppercase) property, while C<\p{M}> matches any character @@ -177,6 +304,13 @@ You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret (^) between the first brace and the property name: C<\p{^Tamil}> is equal to C<\P{Tamil}>. +B + +=over 4 + +=item General Category + Here are the basic Unicode General Category properties, followed by their long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>, for instance, are identical. @@ -184,6 +318,7 @@ for instance, are identical. Short Long L Letter + LC CasedLetter Lu UppercaseLetter Ll LowercaseLetter Lt TitlecaseLetter @@ -231,60 +366,69 @@ for instance, are identical. Single-letter properties match all characters in any of the two-letter sub-properties starting with the same letter. -C is a special case, which is an alias for C, C, and C. +C and C are special cases, which are aliases for the set of +C, C, and C. Because Perl hides the need for the user to understand the internal representation of Unicode characters, there is no need to implement the somewhat messy concept of surrogates. C is therefore not supported. 
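The General Category rules above can be sketched as follows. This is an illustration only; the variable names are made up, and it assumes a perl that accepts the long names without underscores, as the table spells them.

```perl
use strict;
use warnings;

# Short and long General Category names are interchangeable.
my $upper_short = "A" =~ /^\p{Lu}$/              ? 1 : 0;
my $upper_long  = "A" =~ /^\p{UppercaseLetter}$/ ? 1 : 0;

# A single-letter property covers all of its two-letter sub-properties,
# so \p{L} matches a lowercase letter (Ll) too.
my $any_letter = "a" =~ /^\p{L}$/ ? 1 : 0;

# \P{...} and \p{^...} are two spellings of the same negation.
my $not_upper_P     = "a" =~ /^\P{Lu}$/  ? 1 : 0;
my $not_upper_caret = "a" =~ /^\p{^Lu}$/ ? 1 : 0;

# A digit is not a letter of any kind.
my $digit_is_letter = "7" =~ /^\p{L}$/ ? 1 : 0;
```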
+=item Bidirectional Character Types + Because scripts differ in their directionality--Hebrew is -written right to left, for example--Unicode supplies these properties: +written right to left, for example--Unicode supplies these properties in +the BidiClass class: Property Meaning - BidiL Left-to-Right - BidiLRE Left-to-Right Embedding - BidiLRO Left-to-Right Override - BidiR Right-to-Left - BidiAL Right-to-Left Arabic - BidiRLE Right-to-Left Embedding - BidiRLO Right-to-Left Override - BidiPDF Pop Directional Format - BidiEN European Number - BidiES European Number Separator - BidiET European Number Terminator - BidiAN Arabic Number - BidiCS Common Number Separator - BidiNSM Non-Spacing Mark - BidiBN Boundary Neutral - BidiB Paragraph Separator - BidiS Segment Separator - BidiWS Whitespace - BidiON Other Neutrals - -For example, C<\p{BidiR}> matches characters that are normally + L Left-to-Right + LRE Left-to-Right Embedding + LRO Left-to-Right Override + R Right-to-Left + AL Right-to-Left Arabic + RLE Right-to-Left Embedding + RLO Right-to-Left Override + PDF Pop Directional Format + EN European Number + ES European Number Separator + ET European Number Terminator + AN Arabic Number + CS Common Number Separator + NSM Non-Spacing Mark + BN Boundary Neutral + B Paragraph Separator + S Segment Separator + WS Whitespace + ON Other Neutrals + +For example, C<\p{BidiClass:R}> matches characters that are normally written right to left. 
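As a hedged illustration of the BidiClass properties listed above, the sketch below tests one Hebrew letter and one Latin letter. It assumes a perl that accepts the C<BidiClass:R> spelling used in this document; the variable names are illustrative.

```perl
use strict;
use warnings;

my $alef = "\x{05D0}";   # HEBREW LETTER ALEF, written right to left

# The Hebrew letter has bidirectional type R ...
my $alef_is_rtl = $alef =~ /^\p{BidiClass:R}$/ ? 1 : 0;

# ... while a Latin letter has type L and is not R.
my $a_is_ltr = "A" =~ /^\p{BidiClass:L}$/ ? 1 : 0;
my $a_is_rtl = "A" =~ /^\p{BidiClass:R}$/ ? 1 : 0;
```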
-=back - -=head2 Scripts +=item Scripts The script names which can be used by C<\p{...}> and C<\P{...}>, such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: Arabic Armenian + Balinese Bengali Bopomofo + Braille + Buginese Buhid CanadianAboriginal Cherokee + Coptic + Cuneiform + Cypriot Cyrillic Deseret Devanagari Ethiopic Georgian + Glagolitic Gothic Greek Gujarati @@ -297,27 +441,43 @@ such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: Inherited Kannada Katakana + Kharoshthi Khmer Lao Latin + Limbu + LinearB Malayalam Mongolian Myanmar + NewTaiLue + Nko Ogham OldItalic + OldPersian Oriya + Osmanya + PhagsPa + Phoenician Runic + Shavian Sinhala + SylotiNagri Syriac Tagalog Tagbanwa + TaiLe Tamil Telugu Thaana Thai Tibetan + Tifinagh + Ugaritic Yi +=item Extended property classes + Extended property classes can supplement the basic properties, defined by the F Unicode database: @@ -327,7 +487,6 @@ properties, defined by the F Unicode database: Deprecated Diacritic Extender - GraphemeLink HexDigit Hyphen Ideographic @@ -339,37 +498,52 @@ properties, defined by the F Unicode database: OtherAlphabetic OtherDefaultIgnorableCodePoint OtherGraphemeExtend + OtherIDStart + OtherIDContinue OtherLowercase OtherMath OtherUppercase + PatternSyntax + PatternWhiteSpace QuotationMark Radical SoftDotted + STerm TerminalPunctuation UnifiedIdeograph + VariationSelector WhiteSpace and there are further derived properties: - Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic - Lowercase Ll + OtherLowercase - Uppercase Lu + OtherUppercase - Math Sm + OtherMath + Alphabetic = Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic + Lowercase = Ll + OtherLowercase + Uppercase = Lu + OtherUppercase + Math = Sm + OtherMath + + IDStart = Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart + IDContinue = IDStart + Mn + Mc + Nd + Pc + OtherIDContinue - ID_Start Lu + Ll + Lt + Lm + Lo + Nl - ID_Continue ID_Start + Mn + Mc + Nd + Pc + DefaultIgnorableCodePoint + = OtherDefaultIgnorableCodePoint 
+ + Cf + Cc + Cs + Noncharacters + VariationSelector + - WhiteSpace - FFF9..FFFB (Annotation Characters) - Any Any character - Assigned Any non-Cn character (i.e. synonym for \P{Cn}) - Unassigned Synonym for \p{Cn} - Common Any character (or unassigned code point) - not explicitly assigned to a script + Any = Any code points (i.e. U+0000 to U+10FFFF) + Assigned = Any non-Cn code points (i.e. synonym for \P{Cn}) + Unassigned = Synonym for \p{Cn} + ASCII = ASCII (i.e. U+0000 to U+007F) + + Common = Any character (or unassigned code point) + not explicitly assigned to a script + +=item Use of "Is" Prefix For backward compatibility (with Perl 5.6), all properties mentioned so far may have C prepended to their name, so C<\P{IsLu}>, for example, is equal to C<\P{Lu}>. -=head2 Blocks +=item Blocks In addition to B, Unicode also defines B of characters. The difference between scripts and blocks is that the @@ -381,9 +555,9 @@ blocks. It does not, for example, contain digits, because digits are shared across many scripts. Digits and similar groups, like punctuation, are in a category called C. -For more about scripts, see the UTR #24: +For more about scripts, see the UAX#24 "Script Names": - http://www.unicode.org/unicode/reports/tr24/ + http://www.unicode.org/reports/tr24/ For more about blocks, see: @@ -397,12 +571,17 @@ for block tests to avoid confusion. 
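The script-versus-block distinction described above can be sketched like this. The example is illustrative only and assumes the C<In>-prefixed block names from the list that follows are accepted by the perl in use.

```perl
use strict;
use warnings;

my $digit    = "7";          # U+0037, shared across scripts (Common)
my $a_macron = "\x{0101}";   # LATIN SMALL LETTER A WITH MACRON

# The digit sits in the Basic Latin block but is not in the Latin script.
my $digit_in_block  = $digit =~ /^\p{InBasicLatin}$/ ? 1 : 0;
my $digit_in_script = $digit =~ /^\p{Latin}$/        ? 1 : 0;

# The accented letter is in the Latin script but outside Basic Latin.
my $letter_in_script = $a_macron =~ /^\p{Latin}$/        ? 1 : 0;
my $letter_in_block  = $a_macron =~ /^\p{InBasicLatin}$/ ? 1 : 0;
```

Testing the same character against both forms is a quick way to see which of the two groupings a pattern really needs.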
These block names are supported: + InAegeanNumbers InAlphabeticPresentationForms + InAncientGreekMusicalNotation + InAncientGreekNumbers InArabic InArabicPresentationFormsA InArabicPresentationFormsB + InArabicSupplement InArmenian InArrows + InBalinese InBasicLatin InBengali InBlockElements @@ -410,6 +589,7 @@ These block names are supported: InBopomofoExtended InBoxDrawing InBraillePatterns + InBuginese InBuhid InByzantineMusicalSymbols InCJKCompatibility @@ -417,27 +597,38 @@ These block names are supported: InCJKCompatibilityIdeographs InCJKCompatibilityIdeographsSupplement InCJKRadicalsSupplement + InCJKStrokes InCJKSymbolsAndPunctuation InCJKUnifiedIdeographs InCJKUnifiedIdeographsExtensionA InCJKUnifiedIdeographsExtensionB InCherokee InCombiningDiacriticalMarks + InCombiningDiacriticalMarksSupplement InCombiningDiacriticalMarksforSymbols InCombiningHalfMarks InControlPictures + InCoptic + InCountingRodNumerals + InCuneiform + InCuneiformNumbersAndPunctuation InCurrencySymbols + InCypriotSyllabary InCyrillic - InCyrillicSupplementary + InCyrillicSupplement InDeseret InDevanagari InDingbats InEnclosedAlphanumerics InEnclosedCJKLettersAndMonths InEthiopic + InEthiopicExtended + InEthiopicSupplement InGeneralPunctuation InGeometricShapes InGeorgian + InGeorgianSupplement + InGlagolitic InGothic InGreekExtended InGreekAndCoptic @@ -459,13 +650,20 @@ These block names are supported: InKannada InKatakana InKatakanaPhoneticExtensions + InKharoshthi InKhmer + InKhmerSymbols InLao InLatin1Supplement InLatinExtendedA InLatinExtendedAdditional InLatinExtendedB + InLatinExtendedC + InLatinExtendedD InLetterlikeSymbols + InLimbu + InLinearBIdeograms + InLinearBSyllabary InLowSurrogates InMalayalam InMathematicalAlphanumericSymbols @@ -473,17 +671,28 @@ These block names are supported: InMiscellaneousMathematicalSymbolsA InMiscellaneousMathematicalSymbolsB InMiscellaneousSymbols + InMiscellaneousSymbolsAndArrows InMiscellaneousTechnical + InModifierToneLetters InMongolian 
InMusicalSymbols InMyanmar + InNKo + InNewTaiLue InNumberForms InOgham InOldItalic + InOldPersian InOpticalCharacterRecognition InOriya + InOsmanya + InPhagspa + InPhoenician + InPhoneticExtensions + InPhoneticExtensionsSupplement InPrivateUseArea InRunic + InShavian InSinhala InSmallFormVariants InSpacingModifierLetters @@ -492,127 +701,51 @@ These block names are supported: InSupplementalArrowsA InSupplementalArrowsB InSupplementalMathematicalOperators + InSupplementalPunctuation InSupplementaryPrivateUseAreaA InSupplementaryPrivateUseAreaB + InSylotiNagri InSyriac InTagalog InTagbanwa InTags + InTaiLe + InTaiXuanJingSymbols InTamil InTelugu InThaana InThai InTibetan + InTifinagh + InUgaritic InUnifiedCanadianAboriginalSyllabics InVariationSelectors + InVariationSelectorsSupplement + InVerticalForms InYiRadicals InYiSyllables - -=over 4 - -=item * - -The special pattern C<\X> matches any extended Unicode -sequence--"a combining character sequence" in Standardese--where the -first character is a base character and subsequent characters are mark -characters that apply to the base character. C<\X> is equivalent to -C<(?:\PM\pM*)>. - -=item * - -The C operator translates characters instead of bytes. Note -that the C functionality has been removed. For similar -functionality see pack('U0', ...) and pack('C0', ...). - -=item * - -Case translation operators use the Unicode case translation tables -when character input is provided. Note that C, or C<\U> in -interpolated strings, translates to uppercase, while C, -or C<\u> in interpolated strings, translates to titlecase in languages -that make the distinction. - -=item * - -Most operators that deal with positions or lengths in a string will -automatically switch to using character positions, including -C, C, C, C, C, -C, C, and C. Operators that -specifically do not switch include C, C, and -C. 
Operators that really don't care include C, -operators that treats strings as a bucket of bits such as C, -and operators dealing with filenames. - -=item * - -The C/C letters C and C do I change, -since they are often used for byte-oriented formats. Again, think -C in the C language. - -There is a new C specifier that converts between Unicode characters -and code points. - -=item * - -The C and C functions work on characters, similar to -C and C, I C and -C. C and C are methods for -emulating byte-oriented C and C on Unicode strings. -While these methods reveal the internal encoding of Unicode strings, -that is not something one normally needs to care about at all. - -=item * - -The bit string operators, C<& | ^ ~>, can operate on character data. -However, for backward compatibility, such as when using bit string -operations when characters are all less than 256 in ordinal value, one -should not use C<~> (the bit complement) with characters of both -values less than 256 and values greater than 256. Most importantly, -DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) -will not hold. The reason for this mathematical I is that -the complement cannot return B the 8-bit (byte-wide) bit -complement B the full character-wide bit complement. - -=item * - -lc(), uc(), lcfirst(), and ucfirst() work for the following cases: - -=over 8 - -=item * - -the case mapping is from a single Unicode character to another -single Unicode character, or - -=item * - -the case mapping is from a single Unicode character to more -than one Unicode character. + InYijingHexagramSymbols =back -Things to do with locales (Lithuanian, Turkish, Azeri) do B work -since Perl does not understand the concept of Unicode locales. - -See the Unicode Technical Report #21, Case Mappings, for more details. - -=back - -=over 4 +=head2 User-Defined Character Properties -=item * +You can define your own character properties by defining subroutines +whose names begin with "In" or "Is". 
The subroutines can be defined in +any package. The user-defined properties can be used in the regular +expression C<\p> and C<\P> constructs; if you are using a user-defined +property from a package other than the one you are in, you must specify +its package in the C<\p> or C<\P> construct. -And finally, C reverses by character rather than by byte. + # assuming property IsForeign defined in Lang:: + package main; # property package name required + if ($txt =~ /\p{Lang::IsForeign}+/) { ... } -=back + package Lang; # property package name not required + if ($txt =~ /\p{IsForeign}+/) { ... } -=head2 User-Defined Character Properties -You can define your own character properties by defining subroutines -whose names begin with "In" or "Is". The subroutines must be defined -in the C
package. The user-defined properties can be used in the -regular expression C<\p> and C<\P> constructs. Note that the effect -is compile-time and immutable once defined. +Note that the effect is compile-time and immutable once defined. The subroutines must return a specially-formatted string, with one or more newline-separated lines. Each line must be one of the following: @@ -627,23 +760,30 @@ tabular characters) denoting a range of Unicode code points to include. =item * Something to include, prefixed by "+": a built-in character -property (prefixed by "utf8::"), to represent all the characters in that -property; two hexadecimal code points for a range; or a single -hexadecimal code point. +property (prefixed by "utf8::") or a user-defined character property, +to represent all the characters in that property; two hexadecimal code +points for a range; or a single hexadecimal code point. =item * Something to exclude, prefixed by "-": an existing character -property (prefixed by "utf8::"), for all the characters in that -property; two hexadecimal code points for a range; or a single -hexadecimal code point. +property (prefixed by "utf8::") or a user-defined character property, +to represent all the characters in that property; two hexadecimal code +points for a range; or a single hexadecimal code point. =item * Something to negate, prefixed "!": an existing character -property (prefixed by "utf8::") for all the characters except the -characters in the property; two hexadecimal code points for a range; -or a single hexadecimal code point. +property (prefixed by "utf8::") or a user-defined character property, +for all the characters except the characters in the property; two +hexadecimal code points for a range; or a single hexadecimal code point. 
+ +=item * + +Something to intersect with, prefixed by "&": an existing character +property (prefixed by "utf8::") or a user-defined character property, +to represent all the characters in that property; two hexadecimal code +points for a range; or a single hexadecimal code point. =back @@ -691,9 +831,29 @@ The negation is useful for defining (surprise!) negated classes. END } +Intersection is useful for getting the common characters matched by +two (or more) classes. + + sub InFooAndBar { + return <<'END'; + +main::Foo + &main::Bar + END + } + +It's important to remember not to use "&" for the first set -- that +would be intersecting with nothing (resulting in an empty set). + +A final note on the user-defined property tests: they will be used +only if the scalar has been marked as having Unicode characters. +Old byte-style strings will not be affected. + +=head2 User-Defined Case Mappings + You can also define your own mappings to be used in the lc(), lcfirst(), uc(), and ucfirst() (or their string-inlined versions). -The principle is the same: define subroutines in the C
package +The principle is similar to that of user-defined character +properties: to define subroutines in the C
package with names like C (for lc() and lcfirst()), C (for the first character in ucfirst()), and C (for uc(), and the rest of the characters in ucfirst()). @@ -737,9 +897,9 @@ are not directly user-accessible, one can use either the C module, or just match case-insensitively (that's when the C mapping is used). -A final note on the user-defined property tests and mappings: they -will be used only if the scalar has been marked as having Unicode -characters. Old byte-style strings will not be affected. +A final note on the user-defined case mappings: they will be used +only if the scalar has been marked as having Unicode characters. +Old byte-style strings will not be affected. =head2 Character Encodings for Input and Output @@ -749,8 +909,8 @@ See L. The following list of Unicode support for regular expressions describes all the features currently supported. The references to "Level N" -and the section numbers refer to the Unicode Technical Report 18, -"Unicode Regular Expression Guidelines". +and the section numbers refer to the Unicode Technical Standard #18, +"Unicode Regular Expressions", version 11, in May 2005. =over 4 @@ -758,35 +918,42 @@ and the section numbers refer to the Unicode Technical Report 18, Level 1 - Basic Unicode Support - 2.1 Hex Notation - done [1] - Named Notation - done [2] - 2.2 Categories - done [3][4] - 2.3 Subtraction - MISSING [5][6] - 2.4 Simple Word Boundaries - done [7] - 2.5 Simple Loose Matches - done [8] - 2.6 End of Line - MISSING [9][10] - - [ 1] \x{...} - [ 2] \N{...} - [ 3] . 
\p{...} \P{...} - [ 4] now scripts (see UTR#24 Script Names) in addition to blocks - [ 5] have negation - [ 6] can use regular expression look-ahead [a] - or user-defined character properties [b] to emulate subtraction - [ 7] include Letters in word characters - [ 8] note that Perl does Full case-folding in matching, not Simple: + RL1.1 Hex Notation - done [1] + RL1.2 Properties - done [2][3] + RL1.2a Compatibility Properties - done [4] + RL1.3 Subtraction and Intersection - MISSING [5] + RL1.4 Simple Word Boundaries - done [6] + RL1.5 Simple Loose Matches - done [7] + RL1.6 Line Boundaries - MISSING [8] + RL1.7 Supplementary Code Points - done [9] + + [1] \x{...} + [2] \p{...} \P{...} + [3] supports not only minimal list (general category, scripts, + Alphabetic, Lowercase, Uppercase, WhiteSpace, + NoncharacterCodePoint, DefaultIgnorableCodePoint, Any, + ASCII, Assigned), but also bidirectional types, blocks, etc. + (see L) + [4] \d \D \s \S \w \W \X [:prop:] [:^prop:] + [5] can use regular expression look-ahead [a] or + user-defined character properties [b] to emulate set operations + [6] \b \B + [7] note that Perl does Full case-folding in matching, not Simple: for example U+1F88 is equivalent with U+1F00 U+03B9, not with 1F80. This difference matters for certain Greek capital letters with certain modifiers: the Full case-folding decomposes the letter, while the Simple case-folding would map it to a single character. - [ 9] see UTR#13 Unicode Newline Guidelines - [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029} - (should also affect <>, $., and script line numbers) - (the \x{85}, \x{2028} and \x{2029} do match \s) + [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r), + CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029); + should also affect <>, $., and script line numbers; + should not split lines within CRLF [c] (i.e. 
there is no empty + line between \r and \n) + [9] UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to U+10FFFF + but also beyond U+10FFFF [d] [a] You can mimic class subtraction using lookahead. -For example, what TR18 might write as +For example, what UTS#18 might write as [{Greek}-[{UNASSIGNED}]] @@ -801,38 +968,63 @@ But in this particular example, you probably really want which will match assigned characters known to be part of the Greek script. -[b] See L. +Also see the Unicode::Regex::Set module; it implements the full +UTS#18 grouping, intersection, union, and removal (subtraction) syntax. -=item * +[b] '+' for union, '-' for removal (set-difference), '&' for intersection +(see L) -Level 2 - Extended Unicode Support +[c] Try the C<:crlf> layer (see L). - 3.1 Surrogates - MISSING [11] - 3.2 Canonical Equivalents - MISSING [12][13] - 3.3 Locale-Independent Graphemes - MISSING [14] - 3.4 Locale-Independent Words - MISSING [15] - 3.5 Locale-Independent Loose Matches - MISSING [16] - - [11] Surrogates are solely a UTF-16 concept and Perl's internal - representation is UTF-8. The Encode module does UTF-16, though. - [12] see UTR#15 Unicode Normalization - [13] have Unicode::Normalize but not integrated to regexes - [14] have \X but at this level . should equal that - [15] need three classes, not just \w and \W - [16] see UTR#21 Case Mappings +[d] Avoid C (or say C) to allow +U+FFFF (C<\x{FFFF}>). =item * -Level 3 - Locale-Sensitive Support +Level 2 - Extended Unicode Support + + RL2.1 Canonical Equivalents - MISSING [10][11] + RL2.2 Default Grapheme Clusters - MISSING [12][13] + RL2.3 Default Word Boundaries - MISSING [14] + RL2.4 Default Loose Matches - MISSING [15] + RL2.5 Name Properties - MISSING [16] + RL2.6 Wildcard Properties - MISSING + + [10] see UAX#15 "Unicode Normalization Forms" + [11] have Unicode::Normalize but not integrated to regexes + [12] have \X but at this level . 
should equal that + [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable + clusters as a single grapheme cluster. + [14] see UAX#29, Word Boundaries + [15] see UAX#21 "Case Mappings" + [16] have \N{...} but neither compute names of CJK Ideographs + and Hangul Syllables nor use a loose match [e] + +[e] C<\N{...}> allows namespaces (see L). - 4.1 Locale-Dependent Categories - MISSING - 4.2 Locale-Dependent Graphemes - MISSING [16][17] - 4.3 Locale-Dependent Words - MISSING - 4.4 Locale-Dependent Loose Matches - MISSING - 4.5 Locale-Dependent Ranges - MISSING +=item * - [16] see UTR#10 Unicode Collation Algorithms - [17] have Unicode::Collate but not integrated to regexes +Level 3 - Tailored Support + + RL3.1 Tailored Punctuation - MISSING + RL3.2 Tailored Grapheme Clusters - MISSING [17][18] + RL3.3 Tailored Word Boundaries - MISSING + RL3.4 Tailored Loose Matches - MISSING + RL3.5 Tailored Ranges - MISSING + RL3.6 Context Matching - MISSING [19] + RL3.7 Incremental Matches - MISSING + ( RL3.8 Unicode Set Sharing ) + RL3.9 Possible Match Sets - MISSING + RL3.10 Folded Matching - MISSING [20] + RL3.11 Submatchers - MISSING + + [17] see UAX#10 "Unicode Collation Algorithms" + [18] have Unicode::Collate but not integrated to regexes + [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see + outside of the target substring + [20] need insensitive matching for linguistic features other than case; + for example, hiragana to katakana, wide and narrow, simplified Han + to traditional Han (see UTR#30 "Character Foldings") =back @@ -896,7 +1088,7 @@ Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe. =item * -UTF-16, UTF-16BE, UTF16-LE, Surrogates, and BOMs (Byte Order Marks) +UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks) The followings items are mostly for reference and general Unicode knowledge, Perl doesn't use these constructs internally. @@ -948,7 +1140,7 @@ format". 
=item * -UTF-32, UTF-32BE, UTF32-LE +UTF-32, UTF-32BE, UTF-32LE The UTF-32 family is pretty much like the UTF-16 family, except that the units are 32-bit, and therefore the surrogate scheme is not @@ -1056,6 +1248,67 @@ straddling of the proverbial fence causes problems. =back +=head2 When Unicode Does Not Happen + +While Perl does have extensive ways to input and output in Unicode, +and a few other 'entry points' like the @ARGV which can be interpreted +as Unicode (UTF-8), there still are many places where Unicode (in some +encoding or another) could be given as arguments or received as +results, or both, but it is not. + +The following are such interfaces. For all of these interfaces Perl +currently (as of 5.8.3) simply assumes byte strings both as arguments +and results, or UTF-8 strings if the C pragma has been used. + +One reason why Perl does not attempt to resolve the role of Unicode in +these cases is that the answers are highly dependent on the operating +system and the file system(s). For example, whether filenames can be +in Unicode, and in exactly what kind of encoding, is not exactly a +portable concept. Similarly for the qx and system: how well will the +'command line interface' (and which of them?) handle Unicode? + +=over 4 + +=item * + +chdir, chmod, chown, chroot, exec, link, lstat, mkdir, +rename, rmdir, stat, symlink, truncate, unlink, utime, -X + +=item * + +%ENV + +=item * + +glob (aka the <*>) + +=item * + +open, opendir, sysopen + +=item * + +qx (aka the backtick operator), system + +=item * + +readdir, readlink + +=back + +=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl) + +Sometimes (see L) there are +situations where you simply need to force Perl to believe that a byte +string is UTF-8, or vice versa. The low-level calls +utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are +the answers. 
+Do not use them without careful thought, though: Perl may easily get +very confused, angry, or even crash, if you suddenly change the 'nature' +of a scalar like that. You have to be especially careful if you use +utf8::upgrade(): not every byte string is valid UTF-8. + =head2 Using Unicode in XS If you want to handle Perl Unicode in XS extensions, you may find the @@ -1081,7 +1334,7 @@ Unicode model is not to use UTF-8 until it is absolutely necessary. =item * -C) writes a Unicode character code point into +C writes a Unicode character code point into a buffer encoding the code point as UTF-8, and returns a pointer pointing after the UTF-8 bytes. @@ -1176,7 +1429,7 @@ Unicode is discouraged. =head2 Interaction with Extensions When Perl exchanges data with an extension, the extension should be -able to understand the UTF-8 flag and act accordingly. If the +able to understand the UTF8 flag and act accordingly. If the extension doesn't know about the flag, it's likely that the extension will return incorrectly-flagged data. @@ -1240,57 +1493,18 @@ Unicode data much easier. Some functions are slower when working on UTF-8 encoded strings than on byte encoded strings. All functions that need to hop over -characters such as length(), substr() or index() can work B -faster when the underlying data are byte-encoded. Witness the -following benchmark: - - % perl -e ' - use Benchmark; - use strict; - our $l = 10000; - our $u = our $b = "x" x $l; - substr($u,0,1) = "\x{100}"; - timethese(-2,{ - LENGTH_B => q{ length($b) }, - LENGTH_U => q{ length($u) }, - SUBSTR_B => q{ substr($b, $l/4, $l/2) }, - SUBSTR_U => q{ substr($u, $l/4, $l/2) }, - }); - ' - Benchmark: running LENGTH_B, LENGTH_U, SUBSTR_B, SUBSTR_U for at least 2 CPU seconds... 
- LENGTH_B: 2 wallclock secs ( 2.36 usr + 0.00 sys = 2.36 CPU) @ 5649983.05/s (n=13333960) - LENGTH_U: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 12155.45/s (n=25648) - SUBSTR_B: 3 wallclock secs ( 2.16 usr + 0.00 sys = 2.16 CPU) @ 374480.09/s (n=808877) - SUBSTR_U: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 6791.00/s (n=14329) - -The numbers show an incredible slowness on long UTF-8 strings. You -should carefully avoid using these functions in tight loops. If you -want to iterate over characters, the superior coding technique would -split the characters into an array instead of using substr, as the following -benchmark shows: - - % perl -e ' - use Benchmark; - use strict; - our $l = 10000; - our $u = our $b = "x" x $l; - substr($u,0,1) = "\x{100}"; - timethese(-5,{ - SPLIT_B => q{ for my $c (split //, $b){} }, - SPLIT_U => q{ for my $c (split //, $u){} }, - SUBSTR_B => q{ for my $i (0..length($b)-1){my $c = substr($b,$i,1);} }, - SUBSTR_U => q{ for my $i (0..length($u)-1){my $c = substr($u,$i,1);} }, - }); - ' - Benchmark: running SPLIT_B, SPLIT_U, SUBSTR_B, SUBSTR_U for at least 5 CPU seconds... - SPLIT_B: 6 wallclock secs ( 5.29 usr + 0.00 sys = 5.29 CPU) @ 56.14/s (n=297) - SPLIT_U: 5 wallclock secs ( 5.17 usr + 0.01 sys = 5.18 CPU) @ 55.21/s (n=286) - SUBSTR_B: 5 wallclock secs ( 5.34 usr + 0.00 sys = 5.34 CPU) @ 123.22/s (n=658) - SUBSTR_U: 7 wallclock secs ( 6.20 usr + 0.00 sys = 6.20 CPU) @ 0.81/s (n=5) - -Even though the algorithm based on C is faster than -C for byte-encoded data, it pales in comparison to the speed -of C when used with UTF-8 data. +characters such as length(), substr() or index(), or matching regular +expressions can work B faster when the underlying data are +byte-encoded. + +In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 +a caching scheme was introduced which will hopefully make the slowness +somewhat less spectacular, at least for some operations. 
In general, +operations with UTF-8 encoded strings are still slower. As an example, +the Unicode properties (character classes) like C<\p{Nd}> are known to +be quite a bit slower (5-20 times) than their simpler counterparts +like C<\d> (then again, there are 268 Unicode characters matching C<\p{Nd}> +compared with the 10 ASCII characters matching C<\d>). =head2 Porting code from perl-5.6.X @@ -1318,7 +1532,7 @@ A scalar that is going to be passed to some extension Be it Compress::Zlib, Apache::Request or any extension that has no mention of Unicode in the manpage, you need to make sure that the -UTF-8 flag is stripped off. Note that at the time of this writing +UTF8 flag is stripped off. Note that at the time of this writing (October 2002) the mentioned modules are not UTF-8-aware. Please check the documentation to verify if this is still true. @@ -1332,7 +1546,7 @@ check the documentation to verify if this is still true. A scalar we got back from an extension If you believe the scalar comes back as UTF-8, you will most likely -want the UTF-8 flag restored: +want the UTF8 flag restored: if ($] > 5.007) { require Encode; @@ -1394,7 +1608,7 @@ A large scalar that you know can only contain ASCII Scalars that contain only ASCII and are marked as UTF-8 are sometimes a drag to your program. If you recognize such a situation, just remove -the UTF-8 flag: +the UTF8 flag: utf8::downgrade($val) if $] > 5.007; @@ -1402,7 +1616,7 @@ the UTF-8 flag: =head1 SEE ALSO -L, L, L, L, L, L, +L, L, L, L, L, L, L, L =cut