X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=d6367004a39b76f9ac0f85d6d4ce1c89cde5e9d4;hb=80a5d8e74b5512d4ab704d0e83466ae41247ce55;hp=783ee396950f21c4a33894ba44cbd6a44d6f6cca;hpb=004283b80f6094bb85aba6f48a74e3c5c34ea24f;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 783ee39..d636700 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -20,6 +20,11 @@ Other encodings can be converted to perl's encoding on input, or from
 perl's encoding on output by use of the ":encoding(...)" layer.
 See L.
 
+In some filesystems (for example Microsoft NTFS and Apple HFS+) the
+filenames are in UTF-8 . By using opendir() and File::Glob you can
+make readdir() and glob() to return the filenames as Unicode, see
+L and L for details.
+
 To mark the Perl source itself as being in a particular encoding,
 see L.
 
@@ -137,19 +142,6 @@ This works for all characters that have names.
 
 =item *
 
-If Unicode is used in hash keys, there is a subtle effect on the hashes.
-The hash becomes "Unicode-sticky" so that keys retrieved from the hash
-(either by %hash, each %hash, or keys %hash) will be in Unicode, not
-in bytes, even when the keys were bytes went they "went in". This
-"stickiness" persists unless the hash is completely emptied, either by
-using delete() or clearing the with undef() or assigning an empty list
-to the hash. Most of the time this difference is negligible, but
-there are few places where it matters: for example the regular
-expression character classes like C<\w> behave differently for
-bytes and characters.
-
-=item *
-
 If an appropriate L is specified, identifiers within the Perl script
 may contain Unicode alphanumeric characters, including ideographs.
 (You are currently on your own when it comes to using the
@@ -750,6 +742,11 @@ The following table is from Unicode 3.2.
 Note the A0..BF in U+0800..U+0FFF, the 80..9F in U+D000...U+D7FF,
 the 90..BF in U+10000..U+3FFFF, and the 80...8F in U+100000..U+10FFFF.
 
+The "gaps" are caused by legal UTF-8 avoiding non-shortest encodings:
+it is technically possible to UTF-8-encode a single code point in different
+ways, but that is explicitly forbidden, and the shortest possible encoding
+should always be used (and that is what Perl does).
+
 Or, another way to look at it, as bits:
 
  Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
@@ -792,7 +789,7 @@ are the range 0xDC00..0xDFFFF. The surrogate encoding is
 
 and the decoding is
 
-	$uni = 0x10000 + ($hi - 0xD8000) * 0x400 + ($lo - 0xDC00);
+	$uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
 
 If you try to generate surrogates (for example by using chr()), you
 will get a warning if warnings are turned on (C<-w> or C