X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=b05edab7b587459961b60fdd3d14086b0ecfb10c;hb=56da5a46eac515b5a165aaf05cb06f7bcdfd8e67;hp=190247aea796369a877a6603ef29db9832a4cbf3;hpb=fb9cc17475f7385a07b3c8693a1ca73c68a368d6;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 190247a..b05edab 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -42,6 +42,21 @@ is needed.> See L.
 You can also use the C pragma to change the default encoding
 of the data in your script; see L.
 
+=item C needed to upgrade non-Latin-1 byte strings
+
+By default, there is a fundamental asymmetry in Perl's unicode model:
+implicit upgrading from byte strings to Unicode strings assumes that
+they were encoded in I, but Unicode strings are
+downgraded with UTF-8 encoding. This happens because the first 256
+codepoints in Unicode happens to agree with Latin-1.
+
+If you wish to interpret byte strings as UTF-8 instead, use the
+C pragma:
+
+    use encoding 'utf8';
+
+See L for more details.
+
 =back
 
 =head2 Byte and Character Semantics
@@ -86,12 +101,12 @@ Otherwise, byte semantics are in effect. The C pragma should be
 used to force byte semantics on Unicode data.
 
 If strings operating under byte semantics and strings with Unicode
-character data are concatenated, the new string will be upgraded to
-I, even if the old Unicode string used EBCDIC.
-This translation is done without regard to the system's native 8-bit
-encoding, so to change this for systems with non-Latin-1 and
-non-EBCDIC native encodings use the C pragma. See
-L.
+character data are concatenated, the new string will be created by
+decoding the byte strings as I, even if the
+old Unicode string used EBCDIC. This translation is done without
+regard to the system's native 8-bit encoding. To change this for
+systems with non-Latin-1 and non-EBCDIC native encodings, use the
+C pragma. See L.
 
 Under character semantics, many operations that formerly operated on
 bytes now operate on characters. A character in Perl is
@@ -151,6 +166,10 @@ bytes and match against the character properties specified in the
 Unicode properties database. C<\w> can be used to match a Japanese
 ideograph, for instance.
 
+(However, and as a limitation of the current implementation, using
+C<\w> or C<\W> I a C<[...]> character class will still match
+with byte semantics.)
+
 =item *
 
 Named Unicode properties, scripts, and block ranges may be used like
@@ -1072,8 +1091,9 @@ as Unicode (UTF-8), there still are many places where Unicode (in some
 encoding or another) could be given as arguments or received as
 results, or both, but it is not.
 
-The following are such interfaces. For all of these Perl currently
-(as of 5.8.1) simply assumes byte strings both as arguments and results.
+The following are such interfaces. For all of these interfaces Perl
+currently (as of 5.8.3) simply assumes byte strings both as arguments
+and results, or UTF-8 strings if the C pragma has been used.
 
 One reason why Perl does not attempt to resolve the role of Unicode in
 this cases is that the answers are highly dependent on the operating
@@ -1086,8 +1106,8 @@ portable concept. Similarly for the qx and system: how well will the
 
 =item *
 
-chmod, chmod, chown, chroot, exec, link, mkdir
-rename, rmdir stat, symlink, truncate, unlink, utime
+chmod, chmod, chown, chroot, exec, link, lstat, mkdir,
+rename, rmdir, stat, symlink, truncate, unlink, utime, -X
 
 =item *