X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=a6f748ec6cf4169001a1ff2c52e15c2e3dec4f6b;hb=b3631f69ca17c134df671ddcddb78a6862b927cd;hp=25d512e70e70bafb1b3d59dfa72bb9f3b4e593f3;hpb=818c4caa84c1eb56340765ecb8e5b3df206aeab1;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 25d512e..a6f748e 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -10,9 +10,13 @@ Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features. +People who want to learn to use Unicode in Perl should probably read +L<perluniintro> before reading this reference +document. + =over 4 -=item Input and Output Disciplines +=item Input and Output Layers Perl knows when a filehandle uses Perl's internal Unicode encodings (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with @@ -20,15 +24,15 @@ the ":utf8" layer. Other encodings can be converted to Perl's encoding on input or from Perl's encoding on output by use of the ":encoding(...)" layer. See L. -To indicate that Perl source itself is using a particular encoding, -see L. +To indicate that Perl source itself is in UTF-8, use C<use utf8;>. =item Regular Expressions The regular expression compiler produces polymorphic opcodes. That is, the pattern adapts to the data and automatically switches to the Unicode -character scheme when presented with Unicode data--or instead uses -a traditional byte scheme when presented with byte data. +character scheme when presented with data that is internally encoded in +UTF-8 -- or instead uses a traditional byte scheme when presented with +byte data. 
=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts @@ -37,10 +41,25 @@ included to enable recognition of UTF-8 in the Perl scripts themselves (in string or regular expression literals, or in identifier names) on ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based machines. B<NOTE: this should be the only place where an explicit C<use utf8> -is needed.> +is needed.> See L<utf8>. + +=item BOM-marked scripts and UTF-16 scripts autodetected + +If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE, +or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either +endianness, Perl will correctly read in the script as Unicode. +(BOMless UTF-8 cannot be effectively recognized or differentiated from +ISO 8859-1 or other eight-bit encodings.) + +=item C<use encoding> needed to upgrade non-Latin-1 byte strings + +By default, there is a fundamental asymmetry in Perl's Unicode model: +implicit upgrading from byte strings to Unicode strings assumes that +they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are +downgraded with UTF-8 encoding. This happens because the first 256 +codepoints in Unicode happen to agree with Latin-1. -You can also use the C<encoding> pragma to change the default encoding -of the data in your script; see L<encoding>. +See L</"Byte and Character Semantics"> for more details. =back @@ -67,13 +86,6 @@ character data. Such data may come from filehandles, from calls to external programs, from information provided by the system (such as %ENV), or from literals and constants in the source text. -On Windows platforms, if the C<-C> command line switch is used or the -${^WIDE_SYSTEM_CALLS} global flag is set to C<1>, all system calls -will use the corresponding wide-character APIs. This feature is -available only on Windows to conform to the API standard already -established for that platform--and there are very few non-Windows -platforms that have Unicode-aware APIs. - The C<bytes> pragma will always, regardless of platform, force byte semantics in a particular lexical scope. See L<bytes>. 
@@ -87,18 +99,16 @@ Unless explicitly stated, Perl operators use character semantics for Unicode data and byte semantics for non-Unicode data. The decision to use character semantics is made transparently. If input data comes from a Unicode source--for example, if a character -encoding discipline is added to a filehandle or a literal Unicode +encoding layer is added to a filehandle or a literal Unicode string constant appears in a program--character semantics apply. Otherwise, byte semantics are in effect. The C pragma should be used to force byte semantics on Unicode data. If strings operating under byte semantics and strings with Unicode -character data are concatenated, the new string will be upgraded to -I, even if the old Unicode string used EBCDIC. -This translation is done without regard to the system's native 8-bit -encoding, so to change this for systems with non-Latin-1 and -non-EBCDIC native encodings use the C pragma. See -L. +character data are concatenated, the new string will be created by +decoding the byte strings as I, even if the +old Unicode string used EBCDIC. This translation is done without +regard to the system's native 8-bit encoding. Under character semantics, many operations that formerly operated on bytes now operate on characters. A character in Perl is @@ -118,17 +128,16 @@ Character semantics have the following effects: Strings--including hash keys--and regular expression patterns may contain characters that have an ordinal value larger than 255. -If you use a Unicode editor to edit your program, Unicode characters -may occur directly within the literal strings in one of the various -Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized -as such and converted to Perl's internal representation only if the -appropriate L is specified. +If you use a Unicode editor to edit your program, Unicode characters may +occur directly within the literal strings in UTF-8 encoding, or UTF-16. 
+(The former requires a BOM or C<use utf8>, the latter requires a BOM.) -Unicode characters can also be added to a string by using the -C<\x{...}> notation. The Unicode code for the desired character, in -hexadecimal, should be placed in the braces. For instance, a smiley -face is C<\x{263A}>. This encoding scheme only works for characters -with a code of 0x100 or above. +Unicode characters can also be added to a string by using the C<\x{...}> +notation. The Unicode code for the desired character, in hexadecimal, +should be placed in the braces. For instance, a smiley face is +C<\x{263A}>. This encoding scheme works for all characters, but +for characters under 0x100, note that Perl may use an 8-bit encoding +internally, for optimization and/or backward compatibility. Additionally, if you @@ -137,7 +146,6 @@ Additionally, if you you can use the C<\N{...}> notation and put the official Unicode character name within the braces, such as C<\N{WHITE SMILING FACE}>. - =item * If an appropriate L<encoding> is specified, identifiers within the @@ -148,8 +156,7 @@ names. =item * Regular expressions match characters instead of bytes. "." matches -a character instead of a byte. The C<\C> pattern is provided to force -a match a single byte--a C<char> in C, hence C<\C>. +a character instead of a byte. =item * @@ -162,7 +169,120 @@ ideograph, for instance. Named Unicode properties, scripts, and block ranges may be used like character classes via the C<\p{}> "matches property" construct and -the C<\P{}> negation, "doesn't match property". +the C<\P{}> negation, "doesn't match property". + +See L</"Unicode Character Properties"> for more details. + +You can define your own character properties and use them +in the regular expression with the C<\p{}> or C<\P{}> construct. + +See L</"User-Defined Character Properties"> for more details. 
+ +=item * + +The special pattern C<\X> matches any extended Unicode +sequence--"a combining character sequence" in Standardese--where the +first character is a base character and subsequent characters are mark +characters that apply to the base character. C<\X> is equivalent to +C<(?:\PM\pM*)>. + +=item * + +The C operator translates characters instead of bytes. Note +that the C functionality has been removed. For similar +functionality see pack('U0', ...) and pack('C0', ...). + +=item * + +Case translation operators use the Unicode case translation tables +when character input is provided. Note that C, or C<\U> in +interpolated strings, translates to uppercase, while C, +or C<\u> in interpolated strings, translates to titlecase in languages +that make the distinction. + +=item * + +Most operators that deal with positions or lengths in a string will +automatically switch to using character positions, including +C, C, C, C, C, C, +C, C, and C. An operator that +specifically does not switch is C. Operators that really don't +care include operators that treat strings as a bucket of bits such as +C, and operators dealing with filenames. + +=item * + +The C/C letter C does I change, since it is often +used for byte-oriented formats. Again, think C in the C language. + +There is a new C specifier that converts between Unicode characters +and code points. There is also a C specifier that is the equivalent of +C/C and properly handles character values even if they are above 255. + +=item * + +The C and C functions work on characters, similar to +C and C, I C and +C. C and C are methods for +emulating byte-oriented C and C on Unicode strings. +While these methods reveal the internal encoding of Unicode strings, +that is not something one normally needs to care about at all. + +=item * + +The bit string operators, C<& | ^ ~>, can operate on character data. 
+However, for backward compatibility, such as when using bit string +operations when characters are all less than 256 in ordinal value, one +should not use C<~> (the bit complement) with characters of both +values less than 256 and values greater than 256. Most importantly, +DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) +will not hold. The reason for this mathematical I is that +the complement cannot return B the 8-bit (byte-wide) bit +complement B the full character-wide bit complement. + +=item * + +lc(), uc(), lcfirst(), and ucfirst() work for the following cases: + +=over 8 + +=item * + +the case mapping is from a single Unicode character to another +single Unicode character, or + +=item * + +the case mapping is from a single Unicode character to more +than one Unicode character. + +=back + +Things to do with locales (Lithuanian, Turkish, Azeri) do B work +since Perl does not understand the concept of Unicode locales. + +See the Unicode Technical Report #21, Case Mappings, for more details. + +But you can also define your own mappings to be used in the lc(), +lcfirst(), uc(), and ucfirst() (or their string-inlined versions). + +See L for more details. + +=back + +=over 4 + +=item * + +And finally, C reverses by character rather than by byte. + +=back + +=head2 Unicode Character Properties + +Named Unicode properties, scripts, and block ranges may be used like +character classes via the C<\p{}> "matches property" construct and +the C<\P{}> negation, "doesn't match property". For instance, C<\p{Lu}> matches any character with the Unicode "Lu" (Letter, uppercase) property, while C<\p{M}> matches any character @@ -184,13 +304,21 @@ You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret (^) between the first brace and the property name: C<\p{^Tamil}> is equal to C<\P{Tamil}>. +B + +=over 4 + +=item General Category + Here are the basic Unicode General Category properties, followed by their -long form. 
You can use either; C<\p{Lu}> and C<\p{LowercaseLetter}>, +long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>, for instance, are identical. Short Long L Letter + LC CasedLetter Lu UppercaseLetter Ll LowercaseLetter Lt TitlecaseLetter @@ -238,60 +366,69 @@ for instance, are identical. Single-letter properties match all characters in any of the two-letter sub-properties starting with the same letter. -C is a special case, which is an alias for C, C, and C. +C and C are special cases, which are aliases for the set of +C, C, and C. Because Perl hides the need for the user to understand the internal representation of Unicode characters, there is no need to implement the somewhat messy concept of surrogates. C is therefore not supported. +=item Bidirectional Character Types + Because scripts differ in their directionality--Hebrew is -written right to left, for example--Unicode supplies these properties: +written right to left, for example--Unicode supplies these properties in +the BidiClass class: Property Meaning - BidiL Left-to-Right - BidiLRE Left-to-Right Embedding - BidiLRO Left-to-Right Override - BidiR Right-to-Left - BidiAL Right-to-Left Arabic - BidiRLE Right-to-Left Embedding - BidiRLO Right-to-Left Override - BidiPDF Pop Directional Format - BidiEN European Number - BidiES European Number Separator - BidiET European Number Terminator - BidiAN Arabic Number - BidiCS Common Number Separator - BidiNSM Non-Spacing Mark - BidiBN Boundary Neutral - BidiB Paragraph Separator - BidiS Segment Separator - BidiWS Whitespace - BidiON Other Neutrals - -For example, C<\p{BidiR}> matches characters that are normally + L Left-to-Right + LRE Left-to-Right Embedding + LRO Left-to-Right Override + R Right-to-Left + AL Right-to-Left Arabic + RLE Right-to-Left Embedding + RLO Right-to-Left Override + PDF Pop Directional Format + EN European Number + ES European Number Separator + ET European Number Terminator + AN Arabic Number + CS Common Number Separator + NSM 
Non-Spacing Mark + BN Boundary Neutral + B Paragraph Separator + S Segment Separator + WS Whitespace + ON Other Neutrals + +For example, C<\p{BidiClass:R}> matches characters that are normally written right to left. -=back - -=head2 Scripts +=item Scripts The script names which can be used by C<\p{...}> and C<\P{...}>, such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: Arabic Armenian + Balinese Bengali Bopomofo + Braille + Buginese Buhid CanadianAboriginal Cherokee + Coptic + Cuneiform + Cypriot Cyrillic Deseret Devanagari Ethiopic Georgian + Glagolitic Gothic Greek Gujarati @@ -304,27 +441,43 @@ such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows: Inherited Kannada Katakana + Kharoshthi Khmer Lao Latin + Limbu + LinearB Malayalam Mongolian Myanmar + NewTaiLue + Nko Ogham OldItalic + OldPersian Oriya + Osmanya + PhagsPa + Phoenician Runic + Shavian Sinhala + SylotiNagri Syriac Tagalog Tagbanwa + TaiLe Tamil Telugu Thaana Thai Tibetan + Tifinagh + Ugaritic Yi +=item Extended property classes + Extended property classes can supplement the basic properties, defined by the F Unicode database: @@ -334,7 +487,6 @@ properties, defined by the F Unicode database: Deprecated Diacritic Extender - GraphemeLink HexDigit Hyphen Ideographic @@ -346,37 +498,52 @@ properties, defined by the F Unicode database: OtherAlphabetic OtherDefaultIgnorableCodePoint OtherGraphemeExtend + OtherIDStart + OtherIDContinue OtherLowercase OtherMath OtherUppercase + PatternSyntax + PatternWhiteSpace QuotationMark Radical SoftDotted + STerm TerminalPunctuation UnifiedIdeograph + VariationSelector WhiteSpace and there are further derived properties: - Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic - Lowercase Ll + OtherLowercase - Uppercase Lu + OtherUppercase - Math Sm + OtherMath + Alphabetic = Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic + Lowercase = Ll + OtherLowercase + Uppercase = Lu + OtherUppercase + Math = Sm + OtherMath - ID_Start Lu + Ll + Lt + Lm + Lo + Nl - 
ID_Continue ID_Start + Mn + Mc + Nd + Pc + IDStart = Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart + IDContinue = IDStart + Mn + Mc + Nd + Pc + OtherIDContinue - Any Any character - Assigned Any non-Cn character (i.e. synonym for \P{Cn}) - Unassigned Synonym for \p{Cn} - Common Any character (or unassigned code point) - not explicitly assigned to a script + DefaultIgnorableCodePoint + = OtherDefaultIgnorableCodePoint + + Cf + Cc + Cs + Noncharacters + VariationSelector + - WhiteSpace - FFF9..FFFB (Annotation Characters) + + Any = Any code point (i.e. U+0000 to U+10FFFF) + Assigned = Any non-Cn code point (i.e. synonym for \P{Cn}) + Unassigned = Synonym for \p{Cn} + ASCII = ASCII (i.e. U+0000 to U+007F) + + Common = Any character (or unassigned code point) + not explicitly assigned to a script + +=item Use of "Is" Prefix For backward compatibility (with Perl 5.6), all properties mentioned so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for example, is equal to C<\P{Lu}>. -=head2 Blocks +=item Blocks In addition to B<scripts>, Unicode also defines B<blocks> of characters. The difference between scripts and blocks is that the @@ -388,9 +555,9 @@ blocks. It does not, for example, contain digits, because digits are shared across many scripts. Digits and similar groups, like punctuation, are in a category called C<Common>. -For more about scripts, see the UTR #24: +For more about scripts, see the UAX#24 "Script Names": - http://www.unicode.org/unicode/reports/tr24/ + http://www.unicode.org/reports/tr24/ For more about blocks, see: @@ -404,12 +571,17 @@ for block tests to avoid confusion. 
These block names are supported: + InAegeanNumbers InAlphabeticPresentationForms + InAncientGreekMusicalNotation + InAncientGreekNumbers InArabic InArabicPresentationFormsA InArabicPresentationFormsB + InArabicSupplement InArmenian InArrows + InBalinese InBasicLatin InBengali InBlockElements @@ -417,6 +589,7 @@ These block names are supported: InBopomofoExtended InBoxDrawing InBraillePatterns + InBuginese InBuhid InByzantineMusicalSymbols InCJKCompatibility @@ -424,27 +597,38 @@ These block names are supported: InCJKCompatibilityIdeographs InCJKCompatibilityIdeographsSupplement InCJKRadicalsSupplement + InCJKStrokes InCJKSymbolsAndPunctuation InCJKUnifiedIdeographs InCJKUnifiedIdeographsExtensionA InCJKUnifiedIdeographsExtensionB InCherokee InCombiningDiacriticalMarks + InCombiningDiacriticalMarksSupplement InCombiningDiacriticalMarksforSymbols InCombiningHalfMarks InControlPictures + InCoptic + InCountingRodNumerals + InCuneiform + InCuneiformNumbersAndPunctuation InCurrencySymbols + InCypriotSyllabary InCyrillic - InCyrillicSupplementary + InCyrillicSupplement InDeseret InDevanagari InDingbats InEnclosedAlphanumerics InEnclosedCJKLettersAndMonths InEthiopic + InEthiopicExtended + InEthiopicSupplement InGeneralPunctuation InGeometricShapes InGeorgian + InGeorgianSupplement + InGlagolitic InGothic InGreekExtended InGreekAndCoptic @@ -466,13 +650,20 @@ These block names are supported: InKannada InKatakana InKatakanaPhoneticExtensions + InKharoshthi InKhmer + InKhmerSymbols InLao InLatin1Supplement InLatinExtendedA InLatinExtendedAdditional InLatinExtendedB + InLatinExtendedC + InLatinExtendedD InLetterlikeSymbols + InLimbu + InLinearBIdeograms + InLinearBSyllabary InLowSurrogates InMalayalam InMathematicalAlphanumericSymbols @@ -480,17 +671,28 @@ These block names are supported: InMiscellaneousMathematicalSymbolsA InMiscellaneousMathematicalSymbolsB InMiscellaneousSymbols + InMiscellaneousSymbolsAndArrows InMiscellaneousTechnical + InModifierToneLetters InMongolian 
InMusicalSymbols InMyanmar + InNKo + InNewTaiLue InNumberForms InOgham InOldItalic + InOldPersian InOpticalCharacterRecognition InOriya + InOsmanya + InPhagspa + InPhoenician + InPhoneticExtensions + InPhoneticExtensionsSupplement InPrivateUseArea InRunic + InShavian InSinhala InSmallFormVariants InSpacingModifierLetters @@ -499,134 +701,51 @@ These block names are supported: InSupplementalArrowsA InSupplementalArrowsB InSupplementalMathematicalOperators + InSupplementalPunctuation InSupplementaryPrivateUseAreaA InSupplementaryPrivateUseAreaB + InSylotiNagri InSyriac InTagalog InTagbanwa InTags + InTaiLe + InTaiXuanJingSymbols InTamil InTelugu InThaana InThai InTibetan + InTifinagh + InUgaritic InUnifiedCanadianAboriginalSyllabics InVariationSelectors + InVariationSelectorsSupplement + InVerticalForms InYiRadicals InYiSyllables - -=over 4 - -=item * - -The special pattern C<\X> matches any extended Unicode -sequence--"a combining character sequence" in Standardese--where the -first character is a base character and subsequent characters are mark -characters that apply to the base character. C<\X> is equivalent to -C<(?:\PM\pM*)>. - -=item * - -The C operator translates characters instead of bytes. Note -that the C functionality has been removed. For similar -functionality see pack('U0', ...) and pack('C0', ...). - -=item * - -Case translation operators use the Unicode case translation tables -when character input is provided. Note that C, or C<\U> in -interpolated strings, translates to uppercase, while C, -or C<\u> in interpolated strings, translates to titlecase in languages -that make the distinction. - -=item * - -Most operators that deal with positions or lengths in a string will -automatically switch to using character positions, including -C, C, C, C, C, -C, C, and C. Operators that -specifically do not switch include C, C, and -C. 
Operators that really don't care include C, -operators that treats strings as a bucket of bits such as C, -and operators dealing with filenames. - -=item * - -The C/C letters C and C do I change, -since they are often used for byte-oriented formats. Again, think -C in the C language. - -There is a new C specifier that converts between Unicode characters -and code points. - -=item * - -The C and C functions work on characters, similar to -C and C, I C and -C. C and C are methods for -emulating byte-oriented C and C on Unicode strings. -While these methods reveal the internal encoding of Unicode strings, -that is not something one normally needs to care about at all. - -=item * - -The bit string operators, C<& | ^ ~>, can operate on character data. -However, for backward compatibility, such as when using bit string -operations when characters are all less than 256 in ordinal value, one -should not use C<~> (the bit complement) with characters of both -values less than 256 and values greater than 256. Most importantly, -DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) -will not hold. The reason for this mathematical I is that -the complement cannot return B the 8-bit (byte-wide) bit -complement B the full character-wide bit complement. - -=item * - -lc(), uc(), lcfirst(), and ucfirst() work for the following cases: - -=over 8 - -=item * - -the case mapping is from a single Unicode character to another -single Unicode character, or - -=item * - -the case mapping is from a single Unicode character to more -than one Unicode character. + InYijingHexagramSymbols =back -The following cases do not yet work: - -=over 8 - -=item * - -the "final sigma" (Greek), and - -=item * - -anything to with locales (Lithuanian, Turkish, Azeri). - -=back - -See the Unicode Technical Report #21, Case Mappings, for more details. 
+=head2 User-Defined Character Properties -=item * +You can define your own character properties by defining subroutines +whose names begin with "In" or "Is". The subroutines can be defined in +any package. The user-defined properties can be used in the regular +expression C<\p> and C<\P> constructs; if you are using a user-defined +property from a package other than the one you are in, you must specify +its package in the C<\p> or C<\P> construct. -And finally, C reverses by character rather than by byte. + # assuming property IsForeign defined in Lang:: + package main; # property package name required + if ($txt =~ /\p{Lang::IsForeign}+/) { ... } -=back + package Lang; # property package name not required + if ($txt =~ /\p{IsForeign}+/) { ... } -=head2 User-Defined Character Properties -You can define your own character properties by defining subroutines -whose names begin with "In" or "Is". The subroutines must be -visible in the package that uses the properties. The user-defined -properties can be used in the regular expression C<\p> and C<\P> -constructs. +Note that the effect is compile-time and immutable once defined. The subroutines must return a specially-formatted string, with one or more newline-separated lines. Each line must be one of the following: @@ -641,23 +760,30 @@ tabular characters) denoting a range of Unicode code points to include. =item * Something to include, prefixed by "+": a built-in character -property (prefixed by "utf8::"), to represent all the characters in that -property; two hexadecimal code points for a range; or a single -hexadecimal code point. +property (prefixed by "utf8::") or a user-defined character property, +to represent all the characters in that property; two hexadecimal code +points for a range; or a single hexadecimal code point. 
=item * Something to exclude, prefixed by "-": an existing character -property (prefixed by "utf8::"), for all the characters in that -property; two hexadecimal code points for a range; or a single -hexadecimal code point. +property (prefixed by "utf8::") or a user-defined character property, +to represent all the characters in that property; two hexadecimal code +points for a range; or a single hexadecimal code point. =item * Something to negate, prefixed by "!": an existing character -property (prefixed by "utf8::") for all the characters except the -characters in the property; two hexadecimal code points for a range; -or a single hexadecimal code point. +property (prefixed by "utf8::") or a user-defined character property, +to represent all the characters except the characters in that property; two +hexadecimal code points for a range; or a single hexadecimal code point. + +=item * + +Something to intersect with, prefixed by "&": an existing character +property (prefixed by "utf8::") or a user-defined character property, +for all the characters in the property; two +hexadecimal code points for a range; or a single hexadecimal code point. =back @@ -705,6 +831,76 @@ The negation is useful for defining (surprise!) negated classes. END } +Intersection is useful for getting the common characters matched by +two (or more) classes. + + sub InFooAndBar { + return <<'END'; + +main::Foo + &main::Bar + END + } + +It's important to remember not to use "&" for the first set -- that +would be intersecting with nothing (resulting in an empty set). + +A final note on the user-defined property tests: they will be used +only if the scalar has been marked as having Unicode characters. +Old byte-style strings will not be affected. + +=head2 User-Defined Case Mappings + +You can also define your own mappings to be used in the lc(), +lcfirst(), uc(), and ucfirst() (or their string-inlined versions). 
The principle is similar to that of user-defined character +properties: to define subroutines in the C<main>
package +with names like C (for lc() and lcfirst()), C (for +the first character in ucfirst()), and C (for uc(), and the +rest of the characters in ucfirst()). + +The string returned by the subroutines needs now to be three +hexadecimal numbers separated by tabulators: start of the source +range, end of the source range, and start of the destination range. +For example: + + sub ToUpper { + return </F. The mapping data is returned as +the here-document, and the C are special exception +mappings derived from <$Config{privlib}>/F. +The C and C mappings that one can see in the directory +are not directly user-accessible, one can use either the +C module, or just match case-insensitively (that's when +the C mapping is used). + +A final note on the user-defined case mappings: they will be used +only if the scalar has been marked as having Unicode characters. +Old byte-style strings will not be affected. + =head2 Character Encodings for Input and Output See L. @@ -713,8 +909,8 @@ See L. The following list of Unicode support for regular expressions describes all the features currently supported. The references to "Level N" -and the section numbers refer to the Unicode Technical Report 18, -"Unicode Regular Expression Guidelines". +and the section numbers refer to the Unicode Technical Standard #18, +"Unicode Regular Expressions", version 11, in May 2005. =over 4 @@ -722,35 +918,42 @@ and the section numbers refer to the Unicode Technical Report 18, Level 1 - Basic Unicode Support - 2.1 Hex Notation - done [1] - Named Notation - done [2] - 2.2 Categories - done [3][4] - 2.3 Subtraction - MISSING [5][6] - 2.4 Simple Word Boundaries - done [7] - 2.5 Simple Loose Matches - done [8] - 2.6 End of Line - MISSING [9][10] - - [ 1] \x{...} - [ 2] \N{...} - [ 3] . 
\p{...} \P{...} - [ 4] now scripts (see UTR#24 Script Names) in addition to blocks - [ 5] have negation - [ 6] can use regular expression look-ahead [a] - or user-defined character properties [b] to emulate subtraction - [ 7] include Letters in word characters - [ 8] note that Perl does Full case-folding in matching, not Simple: - for example U+1F88 is equivalent with U+1F000 U+03B9, + RL1.1 Hex Notation - done [1] + RL1.2 Properties - done [2][3] + RL1.2a Compatibility Properties - done [4] + RL1.3 Subtraction and Intersection - MISSING [5] + RL1.4 Simple Word Boundaries - done [6] + RL1.5 Simple Loose Matches - done [7] + RL1.6 Line Boundaries - MISSING [8] + RL1.7 Supplementary Code Points - done [9] + + [1] \x{...} + [2] \p{...} \P{...} + [3] supports not only minimal list (general category, scripts, + Alphabetic, Lowercase, Uppercase, WhiteSpace, + NoncharacterCodePoint, DefaultIgnorableCodePoint, Any, + ASCII, Assigned), but also bidirectional types, blocks, etc. + (see L) + [4] \d \D \s \S \w \W \X [:prop:] [:^prop:] + [5] can use regular expression look-ahead [a] or + user-defined character properties [b] to emulate set operations + [6] \b \B + [7] note that Perl does Full case-folding in matching, not Simple: + for example U+1F88 is equivalent with U+1F00 U+03B9, not with 1F80. This difference matters for certain Greek capital letters with certain modifiers: the Full case-folding decomposes the letter, while the Simple case-folding would map it to a single character. - [ 9] see UTR#13 Unicode Newline Guidelines - [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}) - (should also affect <>, $., and script line numbers) - (the \x{85}, \x{2028} and \x{2029} do match \s) + [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r), + CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029); + should also affect <>, $., and script line numbers; + should not split lines within CRLF [c] (i.e. 
there is no empty + line between \r and \n) + [9] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to U+10FFFF + but also beyond U+10FFFF [d] [a] You can mimic class subtraction using lookahead. -For example, what TR18 might write as +For example, what UTS#18 might write as [{Greek}-[{UNASSIGNED}]] @@ -765,36 +968,63 @@ But in this particular example, you probably really want which will match assigned characters known to be part of the Greek script. -[b] See L</"User-Defined Character Properties">. - -=item * +Also see the Unicode::Regex::Set module; it implements the full +UTS#18 grouping, intersection, union, and removal (subtraction) syntax. -Level 2 - Extended Unicode Support +[b] '+' for union, '-' for removal (set-difference), '&' for intersection +(see L</"User-Defined Character Properties">) - 3.1 Surrogates - MISSING - 3.2 Canonical Equivalents - MISSING [11][12] - 3.3 Locale-Independent Graphemes - MISSING [13] - 3.4 Locale-Independent Words - MISSING [14] - 3.5 Locale-Independent Loose Matches - MISSING [15] +[c] Try the C<:crlf> layer (see L<PerlIO>). - [11] see UTR#15 Unicode Normalization - [12] have Unicode::Normalize but not integrated to regexes - [13] have \X but at this level . should equal that - [14] need three classes, not just \w and \W - [15] see UTR#21 Case Mappings +[d] Avoid C<use warnings 'utf8';> (or say C<no warnings 'utf8';>) to allow +U+FFFF (C<\x{FFFF}>). 
=item * -Level 3 - Locale-Sensitive Support +Level 2 - Extended Unicode Support - 4.1 Locale-Dependent Categories - MISSING - 4.2 Locale-Dependent Graphemes - MISSING [16][17] - 4.3 Locale-Dependent Words - MISSING - 4.4 Locale-Dependent Loose Matches - MISSING - 4.5 Locale-Dependent Ranges - MISSING + RL2.1 Canonical Equivalents - MISSING [10][11] + RL2.2 Default Grapheme Clusters - MISSING [12][13] + RL2.3 Default Word Boundaries - MISSING [14] + RL2.4 Default Loose Matches - MISSING [15] + RL2.5 Name Properties - MISSING [16] + RL2.6 Wildcard Properties - MISSING + + [10] see UAX#15 "Unicode Normalization Forms" + [11] have Unicode::Normalize but not integrated to regexes + [12] have \X but at this level . should equal that + [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable + clusters as a single grapheme cluster. + [14] see UAX#29, Word Boundaries + [15] see UAX#21 "Case Mappings" + [16] have \N{...} but neither compute names of CJK Ideographs + and Hangul Syllables nor use a loose match [e] + +[e] C<\N{...}> allows namespaces (see L). 
+ +=item * - [16] see UTR#10 Unicode Collation Algorithms - [17] have Unicode::Collate but not integrated to regexes +Level 3 - Tailored Support + + RL3.1 Tailored Punctuation - MISSING + RL3.2 Tailored Grapheme Clusters - MISSING [17][18] + RL3.3 Tailored Word Boundaries - MISSING + RL3.4 Tailored Loose Matches - MISSING + RL3.5 Tailored Ranges - MISSING + RL3.6 Context Matching - MISSING [19] + RL3.7 Incremental Matches - MISSING + ( RL3.8 Unicode Set Sharing ) + RL3.9 Possible Match Sets - MISSING + RL3.10 Folded Matching - MISSING [20] + RL3.11 Submatchers - MISSING + + [17] see UAX#10 "Unicode Collation Algorithms" + [18] have Unicode::Collate but not integrated to regexes + [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see + outside of the target substring + [20] need insensitive matching for linguistic features other than case; + for example, hiragana to katakana, wide and narrow, simplified Han + to traditional Han (see UTR#30 "Character Foldings") =back @@ -858,7 +1088,7 @@ Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe. =item * -UTF-16, UTF-16BE, UTF16-LE, Surrogates, and BOMs (Byte Order Marks) +UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks) The following items are mostly for reference and general Unicode knowledge; Perl doesn't use these constructs internally. @@ -910,7 +1140,7 @@ format". =item * -UTF-32, UTF-32BE, UTF32-LE +UTF-32, UTF-32BE, UTF-32LE The UTF-32 family is pretty much like the UTF-16 family, except that the units are 32-bit, and therefore the surrogate scheme is not @@ -1005,10 +1235,10 @@ there are a couple of exceptions: =item * -If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG) -contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching), -the default encodings of your STDIN, STDOUT, and STDERR, and of -B, are considered to be UTF-8.
+You can enable automatic UTF-8-ification of your standard file +handles, default C layer, and C<@ARGV> by using either +the C<-C> command line switch or the C environment +variable; see L for the documentation of the C<-C> switch. =item * @@ -1018,10 +1248,73 @@ straddling of the proverbial fence causes problems. =back +=head2 When Unicode Does Not Happen + +While Perl does have extensive ways to input and output in Unicode, +and a few other 'entry points' like @ARGV which can be interpreted +as Unicode (UTF-8), there still are many places where Unicode (in some +encoding or another) could be given as arguments or received as +results, or both, but it is not. + +The following are such interfaces. For all of these interfaces Perl +currently (as of 5.8.3) simply assumes byte strings both as arguments +and results, or UTF-8 strings if the C pragma has been used. + +One reason why Perl does not attempt to resolve the role of Unicode in +these cases is that the answers are highly dependent on the operating +system and the file system(s). For example, whether filenames can be +in Unicode, and in exactly what kind of encoding, is not exactly a +portable concept. Similarly for qx and system: how well will the +'command line interface' (and which of them?) handle Unicode? + +=over 4 + +=item * + +chdir, chmod, chown, chroot, exec, link, lstat, mkdir, +rename, rmdir, stat, symlink, truncate, unlink, utime, -X + +=item * + +%ENV + +=item * + +glob (aka the <*>) + +=item * + +open, opendir, sysopen + +=item * + +qx (aka the backtick operator), system + +=item * + +readdir, readlink + +=back + +=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl) + +Sometimes (see L) there are +situations where you simply need to force Perl to believe that a byte +string is UTF-8, or vice versa. The low-level calls +utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are +the answers.
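The two low-level calls named above can be sketched as follows. This is a minimal illustration rather than part of the patch; it assumes Perl 5.8.1 or later (for C<utf8::is_utf8>), and the sample strings and variable names are made up for the example. Note that in this sketch utf8::upgrade() changes only the internal storage of the scalar, treating the existing bytes as Latin-1, so the characters and the length stay the same.

```perl
use strict;
use warnings;

# A byte string holding one Latin-1 character (e with acute, 0xE9).
# The string itself is an arbitrary example value.
my $str = "caf\xE9";

# Before upgrading, the scalar is stored as bytes and the UTF8 flag is off.
my $was_utf8 = utf8::is_utf8($str) ? 1 : 0;

# utf8::upgrade() re-stores the same characters internally as UTF-8;
# the bytes are interpreted as Latin-1, so no character changes.
utf8::upgrade($str);
my $is_utf8   = utf8::is_utf8($str) ? 1 : 0;
my $len_after = length $str;    # length is still counted in characters

# utf8::downgrade() goes the other way. A true second argument makes it
# return false instead of dying when a character does not fit in a byte.
my $wide = "\x{100}";
my $ok   = utf8::downgrade($wide, 1) ? 1 : 0;

print "flag before=$was_utf8 after=$is_utf8 length=$len_after wide_ok=$ok\n";
# prints "flag before=0 after=1 length=4 wide_ok=0"
```

Passing the second argument to utf8::downgrade() is a deliberate choice here: it lets the caller test for failure rather than trap an exception.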
+ +Do not use them without careful thought, though: Perl may easily get +very confused, angry, or even crash, if you suddenly change the 'nature' +of a scalar like that. Be especially careful if you use +utf8::upgrade(): not every byte string is valid UTF-8. + =head2 Using Unicode in XS -If you want to handle Perl Unicode in XS extensions, you may find -the following C APIs useful. See L for details. +If you want to handle Perl Unicode in XS extensions, you may find the +following C APIs useful. See also L for an +explanation about Unicode at the XS level, and L for the API +details. =over 4 @@ -1041,7 +1334,7 @@ Unicode model is not to use UTF-8 until it is absolutely necessary. =item * -C) writes a Unicode character code point into +C writes a Unicode character code point into a buffer encoding the code point as UTF-8, and returns a pointer pointing after the UTF-8 bytes. @@ -1136,7 +1429,7 @@ Unicode is discouraged. =head2 Interaction with Extensions When Perl exchanges data with an extension, the extension should be -able to understand the UTF-8 flag and act accordingly. If the +able to understand the UTF8 flag and act accordingly. If the extension doesn't know about the flag, it's likely that the extension will return incorrectly-flagged data. @@ -1200,61 +1493,130 @@ Unicode data much easier. Some functions are slower when working on UTF-8 encoded strings than on byte encoded strings. All functions that need to hop over -characters such as length(), substr() or index() can work B -faster when the underlying data are byte-encoded.
Witness the -following benchmark: - - % perl -e ' - use Benchmark; - use strict; - our $l = 10000; - our $u = our $b = "x" x $l; - substr($u,0,1) = "\x{100}"; - timethese(-2,{ - LENGTH_B => q{ length($b) }, - LENGTH_U => q{ length($u) }, - SUBSTR_B => q{ substr($b, $l/4, $l/2) }, - SUBSTR_U => q{ substr($u, $l/4, $l/2) }, - }); - ' - Benchmark: running LENGTH_B, LENGTH_U, SUBSTR_B, SUBSTR_U for at least 2 CPU seconds... - LENGTH_B: 2 wallclock secs ( 2.36 usr + 0.00 sys = 2.36 CPU) @ 5649983.05/s (n=13333960) - LENGTH_U: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 12155.45/s (n=25648) - SUBSTR_B: 3 wallclock secs ( 2.16 usr + 0.00 sys = 2.16 CPU) @ 374480.09/s (n=808877) - SUBSTR_U: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 6791.00/s (n=14329) - -The numbers show an incredible slowness on long UTF-8 strings. You -should carefully avoid using these functions in tight loops. If you -want to iterate over characters, the superior coding technique would -split the characters into an array instead of using substr, as the following -benchmark shows: - - % perl -e ' - use Benchmark; - use strict; - our $l = 10000; - our $u = our $b = "x" x $l; - substr($u,0,1) = "\x{100}"; - timethese(-5,{ - SPLIT_B => q{ for my $c (split //, $b){} }, - SPLIT_U => q{ for my $c (split //, $u){} }, - SUBSTR_B => q{ for my $i (0..length($b)-1){my $c = substr($b,$i,1);} }, - SUBSTR_U => q{ for my $i (0..length($u)-1){my $c = substr($u,$i,1);} }, - }); - ' - Benchmark: running SPLIT_B, SPLIT_U, SUBSTR_B, SUBSTR_U for at least 5 CPU seconds... 
- SPLIT_B: 6 wallclock secs ( 5.29 usr + 0.00 sys = 5.29 CPU) @ 56.14/s (n=297) - SPLIT_U: 5 wallclock secs ( 5.17 usr + 0.01 sys = 5.18 CPU) @ 55.21/s (n=286) - SUBSTR_B: 5 wallclock secs ( 5.34 usr + 0.00 sys = 5.34 CPU) @ 123.22/s (n=658) - SUBSTR_U: 7 wallclock secs ( 6.20 usr + 0.00 sys = 6.20 CPU) @ 0.81/s (n=5) - -Even though the algorithm based on C is faster than -C for byte-encoded data, it pales in comparison to the speed -of C when used with UTF-8 data. +characters such as length(), substr() or index(), or matching regular +expressions can work B faster when the underlying data are +byte-encoded. + +In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 +a caching scheme was introduced which will hopefully make the slowness +somewhat less spectacular, at least for some operations. In general, +operations with UTF-8 encoded strings are still slower. As an example, +the Unicode properties (character classes) like C<\p{Nd}> are known to +be quite a bit slower (5-20 times) than their simpler counterparts +like C<\d> (then again, there are 268 Unicode characters matching C +compared with the 10 ASCII characters matching C). + +=head2 Porting code from perl-5.6.X + +Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer +was required to use the C pragma to declare that a given scope +expected to deal with Unicode data and had to make sure that only +Unicode data were reaching that scope. If you have code that is +working with 5.6, you will need some of the following adjustments to +your code. The examples are written such that the code will continue +to work under 5.6, so you should be safe to try them out.
+ +=over 4 + +=item * + +A filehandle that should read or write UTF-8 + + if ($] > 5.007) { + binmode $fh, ":utf8"; + } + +=item * + +A scalar that is going to be passed to some extension + +Be it Compress::Zlib, Apache::Request or any extension that has no +mention of Unicode in the manpage, you need to make sure that the +UTF8 flag is stripped off. Note that at the time of this writing +(October 2002) the mentioned modules are not UTF-8-aware. Please +check the documentation to verify if this is still true. + + if ($] > 5.007) { + require Encode; + $val = Encode::encode_utf8($val); # make octets + } + +=item * + +A scalar we got back from an extension + +If you believe the scalar comes back as UTF-8, you will most likely +want the UTF8 flag restored: + + if ($] > 5.007) { + require Encode; + $val = Encode::decode_utf8($val); + } + +=item * + +Same thing, if you are really sure it is UTF-8 + + if ($] > 5.007) { + require Encode; + Encode::_utf8_on($val); + } + +=item * + +A wrapper for fetchrow_array and fetchrow_hashref + +When the database contains only UTF-8, a wrapper function or method is +a convenient way to replace all your fetchrow_array and +fetchrow_hashref calls. A wrapper function will also make it easier to +adapt to future enhancements in your database driver. Note that at the +time of this writing (October 2002), the DBI has no standardized way +to deal with UTF-8 data. Please check the documentation to verify if +that is still true. 
+ + sub fetchrow { + my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref} + if ($] < 5.007) { + return $sth->$what; + } else { + require Encode; + if (wantarray) { + my @arr = $sth->$what; + for (@arr) { + defined && /[^\000-\177]/ && Encode::_utf8_on($_); + } + return @arr; + } else { + my $ret = $sth->$what; + if (ref $ret) { + for my $k (keys %$ret) { + defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k}; + } + return $ret; + } else { + defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret; + return $ret; + } + } + } + } + + +=item * + +A large scalar that you know can only contain ASCII + +Scalars that contain only ASCII and are marked as UTF-8 are sometimes +a drag to your program. If you recognize such a situation, just remove +the UTF8 flag: + + utf8::downgrade($val) if $] > 5.007; + +=back =head1 SEE ALSO -L, L, L, L, L, L, -L, L +L, L, L, L, L, L, +L, L =cut
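The porting notes on passing a scalar to an extension and getting one back can be tried out together as a round trip. The following is only a sketch under stated assumptions: it requires Perl 5.8 with the Encode module, omits the C<$]> version guard used in the patch, and the sample string and variable names are invented for the illustration.

```perl
use strict;
use warnings;
use Encode ();

# A character string containing one non-ASCII character (U+263A).
# The value is an arbitrary example.
my $chars = "\x{263A} smile";

# What a non-Unicode-aware extension should be handed: plain UTF-8 octets.
my $octets = Encode::encode_utf8($chars);

# What to do with octets that came back and are believed to be UTF-8:
# decode them into a character string again.
my $back = Encode::decode_utf8($octets);

printf "chars=%d octets=%d roundtrip=%s\n",
    length($chars), length($octets), ($back eq $chars ? "ok" : "differs");
# prints "chars=7 octets=9 roundtrip=ok"
```

The difference between the two lengths is the point of the exercise: the same data is seven characters but nine octets, so code on either side of the extension boundary has to agree on which of the two it is counting.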