X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=106a4bf610cade2c8d61b9da3d65c07b97656af2;hb=17c338f39c13131c1bc175ef38013b54bc98396d;hp=b0efcca8dfcffcc31057fba59656eae1bf0d19a6;hpb=393fec973b1b95a178b4b9600173880d9f93debf;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index b0efcca..106a4bf 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -4,44 +4,128 @@ perlunicode - Unicode support in Perl
 
 =head1 DESCRIPTION
 
-WARNING: The implementation of Unicode support in Perl is incomplete.
-Expect sudden and unannounced changes!
+=head2 Important Caveats
 
-Beginning with version 5.6, Perl uses logically wide characters to
-represent strings internally.  This internal representation of strings
-uses the UTF-8 encoding.
+WARNING: While the implementation of Unicode support in Perl is now
+fairly complete, it is still evolving to some extent.
+
+In particular, the way Unicode is handled on EBCDIC platforms is still
+rather experimental.  On such a platform, references to UTF-8 encoding
+in this document and elsewhere should be read as meaning UTF-EBCDIC,
+as specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC
+issues are specifically discussed.  There is no C<utfebcdic> pragma or
+":utfebcdic" layer; rather, "utf8" and ":utf8" are re-used to mean the
+platform's "natural" 8-bit encoding of Unicode.  See L<perlebcdic> for
+more discussion of the issues.
+
+The following areas are still under development.
+
+=over 4
+
+=item Input and Output Disciplines
+
+A filehandle can be marked as containing perl's internal Unicode
+encoding (UTF-8 or UTF-EBCDIC) by opening it with the ":utf8" layer.
+Other encodings can be converted to perl's encoding on input, or from
+perl's encoding on output, by use of the ":encoding()" layer.  There is
+not yet a clean way to mark the Perl source itself as being in a
+particular encoding.
+
+=item Regular Expressions
+
+The regular expression compiler does now attempt to produce
+polymorphic opcodes.  That is, the pattern should now adapt to the
+data and automatically switch to the Unicode character scheme when
+presented with Unicode data, or to a traditional byte scheme when
+presented with byte data.  The implementation is still new and
+(particularly on EBCDIC platforms) may need further work.
 
-In future, Perl-level operations will expect to work with characters
-rather than bytes, in general.
+=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
 
-However, Perl v5.6 aims to provide a safe migration path from byte
-semantics to character semantics for programs. To preserve compatibility
-with earlier versions of Perl which allowed byte semantics in Perl
-operations (owing to the fact that the internal representation for
-characters was in bytes) byte semantics will continue to be in effect
-until a the C pragma is used in the C
package, or the C<$^U>
-global flag is explicitly set.
+The C<utf8> pragma implements the tables used for Unicode support.
+These tables are automatically loaded on demand, so the C<utf8> pragma
+need not normally be used.
+
+However, as a compatibility measure, this pragma must be explicitly
+used to enable recognition of UTF-8 in the Perl scripts themselves on
+ASCII-based machines, or to recognize UTF-EBCDIC on EBCDIC-based
+machines.  B<NOTE: this should be the only place where an explicit
+C<use utf8> is needed>.
+
+You can also use the C<encoding> pragma to change the default encoding
+of the data in your script; see L<encoding>.
+
+=back
+
+=head2 Byte and Character semantics
+
+Beginning with version 5.6, Perl uses logically wide characters to
+represent strings internally.  This internal representation of strings
+uses either the UTF-8 or the UTF-EBCDIC encoding.
+
+In future, Perl-level operations can be expected to work with
+characters rather than bytes, in general.
+
+However, strictly as an interim compatibility measure, Perl aims to
+provide a safe migration path from byte semantics to character
+semantics for programs.  For operations where Perl can unambiguously
+decide that the input data is characters, Perl now switches to
+character semantics.  For operations where this determination cannot
+be made without additional information from the user, Perl decides in
+favor of compatibility, and chooses to use byte semantics.
+
+This behavior preserves compatibility with earlier versions of Perl,
+which allowed byte semantics in Perl operations, but only as long as
+none of the program's inputs are marked as being a source of Unicode
+character data.  Such data may come from filehandles, from calls to
+external programs, from information provided by the system (such as %ENV),
+or from literals and constants in the source text.
+
+If the C<-C> command line switch is used (or the
+${^WIDE_SYSTEM_CALLS} global flag is set to C<1>), all system calls
+will use the corresponding wide character APIs.  Note that this is
+currently only implemented on Windows, since other platforms lack an
+API standard in this area.
+
+Regardless of the above, the C<bytes> pragma can always be used to
+force byte semantics in a particular lexical scope.  See L<bytes>.
+
+The C<utf8> pragma is primarily a compatibility device that enables
+recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
+Note that this pragma is only required until a future version of Perl
+in which character semantics will become the default.  This pragma may
+then become a no-op.  See L<utf8>.
+
+Unless mentioned otherwise, Perl operators will use character semantics
+when they are dealing with Unicode data, and byte semantics otherwise.
+Thus, character semantics for these operations apply transparently; if
+the input data came from a Unicode source (for example, by adding a
+character encoding discipline to the filehandle whence it came, or a
+literal UTF-8 string constant in the program), character semantics
+apply; otherwise, byte semantics are in effect.  To force byte semantics
+on Unicode data, the C<bytes> pragma should be used.
+
+Notice that if you concatenate strings with byte semantics and strings
+with Unicode character data, the bytes will by default be upgraded
+I<as if they were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a
+translation to ISO 8859-1).  To change this, use the C<encoding>
+pragma; see L<encoding>.
 
 Under character semantics, many operations that formerly operated on
-bytes change to operating on characters. 
For ASCII data this makes -no difference, because UTF-8 stores ASCII in single bytes, but for -any character greater than C, the character is stored in +bytes change to operating on characters. For ASCII data this makes no +difference, because UTF-8 stores ASCII in single bytes, but for any +character greater than C, the character B be stored in a sequence of two or more bytes, all of which have the high bit set. -But by and large, the user need not worry about this, because Perl -hides it from the user. A character in Perl is logically just a number -ranging from 0 to 2**32 or so. Larger characters encode to longer -sequences of bytes internally, but again, this is just an internal -detail which is hidden at the Perl level. - -The C pragma can be used to force byte semantics in a particular -lexical scope. See L. - -The C pragma is a compatibility device to enables recognition -of UTF-8 in literals encountered by the parser. It is also used -for enabling some experimental Unicode support features. Note that -this pragma is only required until a future version of Perl in which -character semantics will become the default. This pragma may then -become a no-op. See L. + +For C1 controls or Latin 1 characters on an EBCDIC platform the +character may be stored in a UTF-EBCDIC multi byte sequence. But by +and large, the user need not worry about this, because Perl hides it +from the user. A character in Perl is logically just a number ranging +from 0 to 2**32 or so. Larger characters encode to longer sequences +of bytes internally, but again, this is just an internal detail which +is hidden at the Perl level. + +=head2 Effects of character semantics Character semantics have the following effects: @@ -50,52 +134,35 @@ Character semantics have the following effects: =item * Strings and patterns may contain characters that have an ordinal value -larger than 255. In Perl v5.6, this is only enabled if the lexical -scope has a C declaration (due to compatibility needs) but -future versions may enable this by default. +larger than 255. -Presuming you use a Unicode editor to edit your program, such characters -will typically occur directly within the literal strings as UTF-8 -characters, but you can also specify a particular character with an -extension of the C<\x> notation. UTF-8 characters are specified by -putting the hexadecimal code within curlies after the C<\x>. For instance, -a Unicode smiley face is C<\x{263A}>. A character in the Latin-1 range -(128..255) should be written C<\x{ab}> rather than C<\xab>, since the -former will turn into a two-byte UTF-8 code, while the latter will -continue to be interpreted as generating a 8-bit byte rather than a -character. In fact, if C<-w> is turned on, it will produce a warning -that you might be generating invalid UTF-8. +Presuming you use a Unicode editor to edit your program, such +characters will typically occur directly within the literal strings as +UTF-8 (or UTF-EBCDIC on EBCDIC platforms) characters, but you can also +specify a particular character with an extension of the C<\x> +notation. UTF-X characters are specified by putting the hexadecimal +code within curlies after the C<\x>. For instance, a Unicode smiley +face is C<\x{263A}>. =item * Identifiers within the Perl script may contain Unicode alphanumeric characters, including ideographs. (You are currently on your own when -it comes to using the canonical forms of characters--Perl doesn't (yet) -attempt to canonicalize variable names for you.) - -This also needs C currently. [XXX: Why? 
High-bit chars were
-syntax errors when they occurred within identifiers in previous versions,
-so this should be enabled by default.]
+it comes to using the canonical forms of characters--Perl doesn't
+(yet) attempt to canonicalize variable names for you.)
 
 =item *
 
 Regular expressions match characters instead of bytes.  For instance,
 "." matches a character instead of a byte.  (However, the C<\C> pattern
-is provided to force a match a single byte ("C" in C, hence
-C<\C>).)
-
-Unicode support in regular expressions needs C currently.
-[XXX: Because the SWASH routines need to be loaded. And the RE engine
-appears to need an overhaul to Unicode by default anyway.]
+is provided to force a match of a single byte ("C<char>" in C, hence C<\C>).)
 
 =item *
 
 Character classes in regular expressions match characters instead of
 bytes, and match against the character properties specified in the
-Unicode properties database. So C<\w> can be used to match an ideograph,
-for instance.
-
-C is needed to enable this. See above.
+Unicode properties database.  So C<\w> can be used to match an
+ideograph, for instance.
 
 =item *
 
@@ -103,11 +170,326 @@ Named Unicode properties and block ranges make be used as character
 classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
 match property) constructs.  For instance, C<\p{Lu}> matches any
 character with the Unicode uppercase property, while C<\p{M}> matches
-any mark character. Single letter properties may omit the brackets, so
-that can be written C<\pM> also. Many predefined character classes are
-available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.
-
-C is needed to enable this. See above.
+any mark character.  Single letter properties may omit the brackets,
+so that can be written C<\pM> also.  Many predefined character classes
+are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.
+
+The C<\p{Is...}> forms test for "general properties" such as "letter"
+and "digit", while the C<\p{In...}> forms test for Unicode scripts and
+blocks.
+
+The official Unicode script and block names have spaces and dashes as
+separators, but for convenience you can have dashes, spaces, and
+underbars at every word division, and you need not care about correct
+casing.  It is recommended, however, that for consistency you use the
+following naming: the official Unicode script, block, or property name
+(see below for the additional rules that apply to block names),
+with whitespace and dashes replaced with underbars, and each word
+written uppercase-first, lowercase-rest.  That is, "Latin-1 Supplement"
+becomes "Latin_1_Supplement".
+
+You can also negate both C<\p{}> and C<\P{}> by introducing a caret
+(^) between the first curly and the property name: C<\p{^In_Tamil}> is
+equal to C<\P{In_Tamil}>.
+
+The C<In> and C<Is> can be left out: C<\p{Greek}> is equal to
+C<\p{In_Greek}>, and C<\P{Pd}> is equal to C<\P{Is_Pd}>.
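+
+For example, here is a minimal sketch of these classes in action (the
+string and the printed messages are made up purely for illustration):
+
+    my $str = "Perl \x{263A}";   # "Perl", a space, and a smiley face
+
+    # the 'P' is an uppercase letter
+    print "has uppercase\n"  if $str =~ /\p{Lu}/;
+
+    # the smiley is in the Other_Symbol (So) category
+    print "has a symbol\n"   if $str =~ /\p{So}/;
+
+    # \P{L} and \p{^L} both match any non-letter (here, the space)
+    print "has non-letter\n" if $str =~ /\P{L}/;
+    print "ditto\n"          if $str =~ /\p{^L}/;
+
+The recognized general properties, in their short and long forms, are: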
+ + Short Long + + L Letter + Lu Uppercase_Letter + Ll Lowercase_Letter + Lt Titlecase_Letter + Lm Modifier_Letter + Lo Other_Letter + + M Mark + Mn Nonspacing_Mark + Mc Spacing_Mark + Me Enclosing_Mark + + N Number + Nd Decimal_Number + Nl Letter_Number + No Other_Number + + P Punctuation + Pc Connector_Punctuation + Pd Dash_Punctuation + Ps Open_Punctuation + Pe Close_Punctuation + Pi Initial_Punctuation + (may behave like Ps or Pe depending on usage) + Pf Final_Punctuation + (may behave like Ps or Pe depending on usage) + Po Other_Punctuation + + S Symbol + Sm Math_Symbol + Sc Currency_Symbol + Sk Modifier_Symbol + So Other_Symbol + + Z Separator + Zs Space_Separator + Zl Line_Separator + Zp Paragraph_Separator + + C Other + Cc Control + Cf Format + Cs Surrogate + Co Private_Use + Cn Unassigned + +There's also C which is an alias for C, C, and C. + +The following reserved ranges have C tests: + + CJK_Ideograph_Extension_A + CJK_Ideograph + Hangul_Syllable + Non_Private_Use_High_Surrogate + Private_Use_High_Surrogate + Low_Surrogate + Private_Surrogate + CJK_Ideograph_Extension_B + Plane_15_Private_Use + Plane_16_Private_Use + +For example C<"\x{AC00}" =~ \p{HangulSyllable}> will test true. +(Handling of surrogates is not implemented yet, because Perl +uses UTF-8 and not UTF-16 internally to represent Unicode.) + +Additionally, because scripts differ in their directionality +(for example Hebrew is written right to left), all characters +have their directionality defined: + + BidiL Left-to-Right + BidiLRE Left-to-Right Embedding + BidiLRO Left-to-Right Override + BidiR Right-to-Left + BidiAL Right-to-Left Arabic + BidiRLE Right-to-Left Embedding + BidiRLO Right-to-Left Override + BidiPDF Pop Directional Format + BidiEN European Number + BidiES European Number Separator + BidiET European Number Terminator + BidiAN Arabic Number + BidiCS Common Number Separator + BidiNSM Non-Spacing Mark + BidiBN Boundary Neutral + BidiB Paragraph Separator + BidiS Segment Separator + BidiWS Whitespace + BidiON Other Neutrals + +=head2 Scripts + +The scripts available for C<\p{In...}> and C<\P{In...}>, for example +\p{InCyrillic>, are as follows, for example C<\p{InLatin}> or C<\P{InHan}>: + + Arabic + Armenian + Bengali + Bopomofo + Canadian-Aboriginal + Cherokee + Cyrillic + Deseret + Devanagari + Ethiopic + Georgian + Gothic + Greek + Gujarati + Gurmukhi + Han + Hangul + Hebrew + Hiragana + Inherited + Kannada + Katakana + Khmer + Lao + Latin + Malayalam + Mongolian + Myanmar + Ogham + Old-Italic + Oriya + Runic + Sinhala + Syriac + Tamil + Telugu + Thaana + Thai + Tibetan + Yi + +There are also extended property classes that supplement the basic +properties, defined by the F Unicode database: + + ASCII_Hex_Digit + Bidi_Control + Dash + Diacritic + Extender + Hex_Digit + Hyphen + Ideographic + Join_Control + Noncharacter_Code_Point + Other_Alphabetic + Other_Lowercase + Other_Math + Other_Uppercase + Quotation_Mark + White_Space + +and further derived properties: + + Alphabetic Lu + Ll + Lt + Lm + Lo + Other_Alphabetic + Lowercase Ll + Other_Lowercase + Uppercase Lu + Other_Uppercase + Math Sm + Other_Math + + ID_Start Lu + Ll + Lt + Lm + Lo + Nl + ID_Continue ID_Start + Mn + Mc + Nd + Pc + + Any Any character + Assigned Any non-Cn character + Common Any character (or unassigned code point) + not explicitly assigned to a script + +=head2 Blocks + +In addition to B, Unicode also defines B of +characters. 
The difference between scripts and blocks is that the +scripts concept is closer to natural languages, while the blocks +concept is more an artificial grouping based on groups of 256 Unicode +characters. For example, the C script contains letters from +many blocks. On the other hand, the C script does not contain +all the characters from those blocks, it does not for example contain +digits because digits are shared across many scripts. Digits and +other similar groups, like punctuation, are in a category called +C. + +For more about scripts see the UTR #24: +http://www.unicode.org/unicode/reports/tr24/ +For more about blocks see +http://www.unicode.org/Public/UNIDATA/Blocks.txt + +Because there are overlaps in naming (there are, for example, both +a script called C and a block called C, the block +version has C appended to its name, C<\p{InKatakanaBlock}>. + +Notice that this definition was introduced in Perl 5.8.0: in Perl +5.6 only the blocks were used; in Perl 5.8.0 scripts became the +preferential Unicode character class definition; this meant that +the definitions of some character classes changed (the ones in the +below list that have the C appended). + + Alphabetic Presentation Forms + Arabic Block + Arabic Presentation Forms-A + Arabic Presentation Forms-B + Armenian Block + Arrows + Basic Latin + Bengali Block + Block Elements + Bopomofo Block + Bopomofo Extended + Box Drawing + Braille Patterns + Byzantine Musical Symbols + CJK Compatibility + CJK Compatibility Forms + CJK Compatibility Ideographs + CJK Compatibility Ideographs Supplement + CJK Radicals Supplement + CJK Symbols and Punctuation + CJK Unified Ideographs + CJK Unified Ideographs Extension A + CJK Unified Ideographs Extension B + Cherokee Block + Combining Diacritical Marks + Combining Half Marks + Combining Marks for Symbols + Control Pictures + Currency Symbols + Cyrillic Block + Deseret Block + Devanagari Block + Dingbats + Enclosed Alphanumerics + Enclosed CJK Letters and Months + Ethiopic Block + General Punctuation + Geometric Shapes + Georgian Block + Gothic Block + Greek Block + Greek Extended + Gujarati Block + Gurmukhi Block + Halfwidth and Fullwidth Forms + Hangul Compatibility Jamo + Hangul Jamo + Hangul Syllables + Hebrew Block + High Private Use Surrogates + High Surrogates + Hiragana Block + IPA Extensions + Ideographic Description Characters + Kanbun + Kangxi Radicals + Kannada Block + Katakana Block + Khmer Block + Lao Block + Latin 1 Supplement + Latin Extended Additional + Latin Extended-A + Latin Extended-B + Letterlike Symbols + Low Surrogates + Malayalam Block + Mathematical Alphanumeric Symbols + Mathematical Operators + Miscellaneous Symbols + Miscellaneous Technical + Mongolian Block + Musical Symbols + Myanmar Block + Number Forms + Ogham Block + Old Italic Block + Optical Character Recognition + Oriya Block + Private Use + Runic Block + Sinhala Block + Small Form Variants + Spacing Modifier Letters + Specials + Superscripts and Subscripts + Syriac Block + Tags + Tamil Block + Telugu Block + Thaana Block + Thai Block + Tibetan Block + Unified Canadian Aboriginal Syllabics + Yi Radicals + Yi Syllables =item * @@ -117,47 +499,32 @@ character is a base character and subsequent characters are mark characters that apply to the base character. It is equivalent to C<(?:\PM\pM*)>. -C is needed to enable this. See above. - =item * -The C operator translates characters instead of bytes. It can also -be forced to translate between 8-bit codes and UTF-8 regardless of the -surrounding utf8 state. 
For instance, if you know your input in Latin-1,
-you can say:
-
-    use utf8;
-    while (<>) {
-	tr/\0-\xff//CU; # latin1 char to utf8
-	...
-    }
-
-Similarly you could translate your output with
-
-    tr/\0-\x{ff}//UC; # utf8 to latin1 char
-
-No, C doesn't take /U or /C (yet?).
-
-C is needed to enable this. See above.
+The C<tr///> operator translates characters instead of bytes.  Note
+that the C<tr///CU> functionality has been removed, as the interface
+was a mistake.  For similar functionality see pack('U0', ...) and
+pack('C0', ...).
 
 =item *
 
 Case translation operators use the Unicode case translation tables
-when provided character input. Note that C translates to
-uppercase, while C translates to titlecase (for languages
-that make the distinction). Naturally the corresponding backslash
-sequences have the same semantics.
+when provided character input.  Note that C<uc()> (also known as C<\U>
+in doublequoted strings) translates to uppercase, while C<ucfirst()>
+(also known as C<\u> in doublequoted strings) translates to titlecase
+(for languages that make the distinction).  Naturally the
+corresponding backslash sequences have the same semantics.
 
 =item *
 
 Most operators that deal with positions or lengths in the string will
-automatically switch to using character positions, including C,
-C, C, C, C, C,
-C, and C. Operators that specifically don't switch
-include C, C, and C. Operators that really
-don't care include C, as well as any other operator that
-treats a string as a bucket of bits, such as C, and the
-operators dealing with filenames.
+automatically switch to using character positions, including
+C, C, C, C, C,
+C, C, and C.  Operators that
+specifically don't switch include C<vec>, C<pack>, and
+C<unpack>.  Operators that really don't care include C, as
+well as any other operator that treats a string as a bucket of bits,
+such as C, and the operators dealing with filenames.
 
 =item *
 
@@ -172,7 +539,55 @@ outside of the utf8 pragma too.)
 The C and C functions work on characters.  This is like
 C and C, not like C and
 C.  In fact, the latter are how you now emulate
-byte-oriented C and C under utf8.
+byte-oriented C<chr()> and C<ord()> for Unicode strings.
+(Note that this reveals the internal UTF-8 encoding of strings and
+you are not supposed to do that unless you know what you are doing.)
+
+=item *
+
+The bit string operators C<& | ^ ~> can operate on character data.
+However, for backward compatibility reasons (bit string operations
+when the characters are all less than 256 in ordinal value) one should
+not mix C<~> (the bit complement) and characters both less than 256 and
+equal to or greater than 256.  Most importantly, De Morgan's laws
+(C<~($x|$y) eq ~$x&~$y>, C<~($x&$y) eq ~$x|~$y>) won't hold.
+Another way to look at this is that the complement cannot return
+B<both> the 8-bit (byte) wide bit complement B<and> the full character
+wide bit complement.
+
+=item *
+
+lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
+
+=over 8
+
+=item *
+
+the case mapping is from a single Unicode character to another
+single Unicode character
+
+=item *
+
+the case mapping is from a single Unicode character to more
+than one Unicode character
+
+=back
+
+What doesn't yet work are the following cases:
+
+=over 8
+
+=item *
+
+the "final sigma" (Greek)
+
+=item *
+
+anything to do with locales (Lithuanian, Turkish, Azeri)
+
+=back
+
+See the Unicode Technical Report #21, Case Mappings, for more details.
 
 =item *
 
@@ -180,14 +595,18 @@ And finally, C reverses by character rather than by byte.
 
 =back
 
+=head2 Character encodings for input and output
+
+See L<Encode>.
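+
+As a minimal sketch of the above (the file names here are made up, and
+the C<:encoding()> and C<:utf8> layers assume the PerlIO support
+described in L</"Important Caveats">):
+
+    # Read ISO 8859-1 data, converting it to Perl's internal encoding.
+    open my $in, '<:encoding(iso-8859-1)', 'input.txt'
+        or die "input.txt: $!";
+
+    # Write the same data back out encoded as UTF-8.
+    open my $out, '>:utf8', 'output.txt'
+        or die "output.txt: $!";
+
+    while (my $line = <$in>) {
+        print $out $line;
+    }
+
+    close $in;
+    close $out or die "output.txt: $!";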
+ =head1 CAVEATS As of yet, there is no method for automatically coercing input and -output to some encoding other than UTF-8. This is planned in the near -future, however. +output to some encoding other than UTF-8 or UTF-EBCDIC. This is planned +in the near future, however. -Whether a piece of data will be treated as "characters" or "bytes" -by internal operations cannot be divined at the current time. +Whether an arbitrary piece of data will be treated as "characters" or +"bytes" by internal operations cannot be divined at the current time. Use of locales with utf8 may lead to odd results. Currently there is some attempt to apply 8-bit locale info to characters in the range @@ -195,8 +614,71 @@ some attempt to apply 8-bit locale info to characters in the range characters above that range (when mapped into Unicode). It will also tend to run slower. Avoidance of locales is strongly encouraged. +=head1 UNICODE REGULAR EXPRESSION SUPPORT LEVEL + +The following list of Unicode regular expression support describes +feature by feature the Unicode support implemented in Perl as of Perl +5.8.0. The "Level N" and the section numbers refer to the Unicode +Technical Report 18, "Unicode Regular Expression Guidelines". + +=over 4 + +=item * + +Level 1 - Basic Unicode Support + + 2.1 Hex Notation - done [1] + Named Notation - done [2] + 2.2 Categories - done [3][4] + 2.3 Subtraction - MISSING [5][6] + 2.4 Simple Word Boundaries - done [7] + 2.5 Simple Loose Matches - MISSING [8] + 2.6 End of Line - MISSING [9][10] + + [ 1] \x{...} + [ 2] \N{...} + [ 3] . \p{Is...} \P{Is...} + [ 4] now scripts (see UTR#24 Script Names) in addition to blocks + [ 5] have negation + [ 6] can use look-ahead to emulate subtracion + [ 7] include Letters in word characters + [ 8] see UTR#21 Case Mappings + [ 9] see UTR#13 Unicode Newline Guidelines + [10] should do ^ and $ also on \x{2028} and \x{2029} + +=item * + +Level 2 - Extended Unicode Support + + 3.1 Surrogates - MISSING + 3.2 Canonical Equivalents - MISSING [11][12] + 3.3 Locale-Independent Graphemes - MISSING [13] + 3.4 Locale-Independent Words - MISSING [14] + 3.5 Locale-Independent Loose Matches - MISSING [15] + + [11] see UTR#15 Unicode Normalization + [12] have Unicode::Normalize but not integrated to regexes + [13] have \X but at this level . should equal that + [14] need three classes, not just \w and \W + [15] see UTR#21 Case Mappings + +=item * + +Level 3 - Locale-Sensitive Support + + 4.1 Locale-Dependent Categories - MISSING + 4.2 Locale-Dependent Graphemes - MISSING [16][17] + 4.3 Locale-Dependent Words - MISSING + 4.4 Locale-Dependent Loose Matches - MISSING + 4.5 Locale-Dependent Ranges - MISSING + + [16] see UTR#10 Unicode Collation Algorithms + [17] have Unicode::Collate but not integrated to regexes + +=back + =head1 SEE ALSO -L, L, L +L, L, L, L =cut