X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=5b8d5be06cf3182e999198a980824f4a7d8786c2;hb=231c9faeb17b45588bbde0b49d0d32f25d2a1286;hp=c8e31bf66cef2896710d73a99405e5c04521059b;hpb=21bad92165270edd85ff697c883b65506d5af626;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index c8e31bf..5b8d5be 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -4,28 +4,40 @@ perlunicode - Unicode support in Perl
 
 =head1 DESCRIPTION
 
-=head2 Important Caveat
+=head2 Important Caveats
 
-WARNING: The implementation of Unicode support in Perl is incomplete.
+WARNING: While the implementation of Unicode support in Perl is now fairly
+complete it is still evolving to some extent.
 
-The following areas need further work.
+In particular the way Unicode is handled on EBCDIC platforms is still rather
+experimental. On such a platform references to UTF-8 encoding in this
+document and elsewhere should be read as meaning UTF-EBCDIC as specified
+in Unicode Technical Report 16 unless ASCII vs EBCDIC issues are specifically
+discussed. There is no C<utfebcdic> pragma or ":utfebcdic" layer, rather
+"utf8" and ":utf8" are re-used to mean platform's "natural" 8-bit encoding
+of Unicode. See L<perlebcdic> for more discussion of the issues.
 
-=over
+The following areas are still under development.
+
+=over 4
 
 =item Input and Output Disciplines
 
-There is currently no easy way to mark data read from a file or other
-external source as being utf8. This will be one of the major areas of
-focus in the near future.
+A filehandle can be marked as containing perl's internal Unicode encoding
+(UTF-8 or UTF-EBCDIC) by opening it with the ":utf8" layer.
+Other encodings can be converted to perl's encoding on input, or from
+perl's encoding on output by use of the ":encoding()" layer.
+There is not yet a clean way to mark the perl source itself as being
+in an particular encoding.
 
 =item Regular Expressions
 
-The existing regular expression compiler does not produce polymorphic
-opcodes. This means that the determination on whether to match Unicode
-characters is made when the pattern is compiled, based on whether the
-pattern contains Unicode characters, and not when the matching happens
-at run time. This needs to be changed to adaptively match Unicode if
-the string to be matched is Unicode.
+The regular expression compiler does now attempt to produce
+polymorphic opcodes. That is the pattern should now adapt to the data
+and automatically switch to the Unicode character scheme when presented
+with Unicode data, or a traditional byte scheme when presented with
+byte data. The implementation is still new and (particularly on
+EBCDIC platforms) may need further work.
 
 =item C<use utf8> still needed to enable a few features
 
@@ -66,7 +78,7 @@ or from literals and constants in the source text.
 If the C<-C> command line switch is used, (or the ${^WIDE_SYSTEM_CALLS}
 global flag is set to C<1>), all system calls will use the
 corresponding wide character APIs. This is currently only implemented
-on Windows.
+on Windows since UNIXes lack API standard on this area.
 
 Regardless of the above, the C<use bytes> pragma can always be used to force
 byte semantics in a particular lexical scope. See L<bytes>.
@@ -114,12 +126,7 @@ will typically occur directly within the literal strings as UTF-8
 characters, but you can also specify a particular character with an
 extension of the C<\x> notation. UTF-8 characters are specified by
 putting the hexadecimal code within curlies after the C<\x>. For instance,
-a Unicode smiley face is C<\x{263A}>. A character in the Latin-1 range
-(128..255) should be written C<\x{ab}> rather than C<\xab>, since the
-former will turn into a two-byte UTF-8 code, while the latter will
-continue to be interpreted as generating a 8-bit byte rather than a
-character. In fact, if C<-w> is turned on, it will produce a warning
-that you might be generating invalid UTF-8.
+a Unicode smiley face is C<\x{263A}>.
 
 =item *
 
@@ -162,20 +169,10 @@ C<(?:\PM\pM*)>.
 
 =item *
 
-The C<tr///> operator translates characters instead of bytes. It can also
-be forced to translate between 8-bit codes and UTF-8. For instance, if you
-know your input in Latin-1, you can say:
-
-    while (<>) {
-        tr/\0-\xff//CU;         # latin1 char to utf8
-        ...
-    }
-
-Similarly you could translate your output with
-
-    tr/\0-\x{ff}//UC;           # utf8 to latin1 char
-
-No, C<tr///> doesn't take /U or /C (yet?).
+The C<tr///> operator translates characters instead of bytes. Note
+that the C<tr///CU> functionality has been removed, as the interface
+was a mistake. For similar functionality see pack('U0', ...) and
+pack('C0', ...).
 
 =item *
 
@@ -213,6 +210,18 @@ byte-oriented C<chr()> and C<ord()> under utf8.
 
 =item *
 
+The bit string operators C<& | ^ ~> can operate on character data.
+However, for backward compatibility reasons (bit string operations
+when the characters all are less than 256 in ordinal value) one cannot
+mix C<~> (the bit complement) and characters both less than 256 and
+equal or greater than 256. Most importantly, the DeMorgan's laws
+(C<~($x|$y) eq ~$x&~$y>, C<~($x&$y) eq ~$x|~$y>) won't hold.
+Another way to look at this is that the complement cannot return
+B<both> the 8-bit (byte) wide bit complement, and the full character
+wide bit complement.
+
+=item *
+
 And finally, C<scalar reverse()> reverses by character rather than by byte.
 
 =back