X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=b05edab7b587459961b60fdd3d14086b0ecfb10c;hb=56da5a46eac515b5a165aaf05cb06f7bcdfd8e67;hp=1101b5ee084672c30a501c7f67f91531f0ab06fa;hpb=1e54db1a8aea187ba2e790aca2ab81fab24ff92d;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 1101b5e..b05edab 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -42,6 +42,21 @@ is needed.> See L. You can also use the C pragma to change the default encoding of the data in your script; see L. +=item C needed to upgrade non-Latin-1 byte strings + +By default, there is a fundamental asymmetry in Perl's unicode model: +implicit upgrading from byte strings to Unicode strings assumes that +they were encoded in I, but Unicode strings are +downgraded with UTF-8 encoding. This happens because the first 256 +codepoints in Unicode happens to agree with Latin-1. + +If you wish to interpret byte strings as UTF-8 instead, use the +C pragma: + + use encoding 'utf8'; + +See L for more details. + =back =head2 Byte and Character Semantics @@ -86,12 +101,12 @@ Otherwise, byte semantics are in effect. The C pragma should be used to force byte semantics on Unicode data. If strings operating under byte semantics and strings with Unicode -character data are concatenated, the new string will be upgraded to -I, even if the old Unicode string used EBCDIC. -This translation is done without regard to the system's native 8-bit -encoding, so to change this for systems with non-Latin-1 and -non-EBCDIC native encodings use the C pragma. See -L. +character data are concatenated, the new string will be created by +decoding the byte strings as I, even if the +old Unicode string used EBCDIC. This translation is done without +regard to the system's native 8-bit encoding. To change this for +systems with non-Latin-1 and non-EBCDIC native encodings, use the +C pragma. See L. Under character semantics, many operations that formerly operated on bytes now operate on characters. A character in Perl is @@ -151,6 +166,10 @@ bytes and match against the character properties specified in the Unicode properties database. C<\w> can be used to match a Japanese ideograph, for instance. +(However, and as a limitation of the current implementation, using +C<\w> or C<\W> I a C<[...]> character class will still match +with byte semantics.) + =item * Named Unicode properties, scripts, and block ranges may be used like @@ -1072,8 +1091,9 @@ as Unicode (UTF-8), there still are many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both, but it is not. -The following are such interfaces. For all of these Perl currently -(as of 5.8.1) simply assumes byte strings both as arguments and results. +The following are such interfaces. For all of these interfaces Perl +currently (as of 5.8.3) simply assumes byte strings both as arguments +and results, or UTF-8 strings if the C pragma has been used. One reason why Perl does not attempt to resolve the role of Unicode in this cases is that the answers are highly dependent on the operating @@ -1086,8 +1106,8 @@ portable concept. Similarly for the qx and system: how well will the =item * -chmod, chmod, chown, chroot, exec, link, mkdir -rename, rmdir stat, symlink, truncate, unlink, utime +chmod, chmod, chown, chroot, exec, link, lstat, mkdir, +rename, rmdir, stat, symlink, truncate, unlink, utime, -X =item * @@ -1149,7 +1169,7 @@ Unicode model is not to use UTF-8 until it is absolutely necessary. =item * -C) writes a Unicode character code point into +C writes a Unicode character code point into a buffer encoding the code point as UTF-8, and returns a pointer pointing after the UTF-8 bytes. @@ -1314,8 +1334,12 @@ byte-encoded. In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 a caching scheme was introduced which will hopefully make the slowness -somewhat less spectacular. Operations with UTF-8 encoded strings are -still slower, though. +somewhat less spectacular, at least for some operations. In general, +operations with UTF-8 encoded strings are still slower. As an example, +the Unicode properties (character classes) like C<\p{Nd}> are known to +be quite a bit slower (5-20 times) than their simpler counterparts +like C<\d> (then again, there 268 Unicode characters matching C +compared with the 10 ASCII characters matching C). =head2 Porting code from perl-5.6.X