X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperluniintro.pod;h=86360d4a726753bd72b551eddac85f8ed4e2ec1a;hb=98af1e142028dcf116f32636ea54f4c3e9494651;hp=19bc82e7dade851ff22a744c7570a7fdd8f42608;hpb=cc5e35530cecae3181bf0cafea49ebb522764a26;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 19bc82e..86360d4 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -160,14 +160,14 @@ strings contain a character beyond 0x00FF.
 
 For example,
 
-      perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
+     perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
 
 produces a fairly useless mixture of native bytes and UTF-8, as well
 as a warning:
 
     Wide character in print at ...
 
-To output UTF-8, use the C<:utf8> output layer.  Prepending
+To output UTF-8, use the C<:encoding> or C<:utf8> output layer.  Prepending
 
     binmode(STDOUT, ":utf8");
 
@@ -247,12 +247,12 @@ constants: you cannot use variables in them.  if you want similar run-time
 functionality, use C<chr()> and C<ord()>.
 
 If you want to force the result to Unicode characters, use the special
-C<"U0"> prefix.  It consumes no arguments but forces the result to be
-in Unicode characters, instead of bytes.
+C<"U0"> prefix.  It consumes no arguments but causes the following bytes
+to be interpreted as the UTF-8 encoding of Unicode characters:
 
-    my $chars = pack("U0C*", 0x80, 0x42);
+    my $chars = pack("U0W*", 0x80, 0x42);
 
-Likewise, you can force the result to be bytes by using the special
+Likewise, you can stop such UTF-8 interpretation by using the special
 C<"C0"> prefix.
 
 =head2 Handling Unicode
 
@@ -278,21 +278,7 @@ encodings, I/O, and certain special cases:
 
 When you combine legacy data and Unicode the legacy data needs
 to be upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if
-applicable) is assumed.  You can override this assumption by
-using the C<encoding> pragma, for example
-
-    use encoding 'latin2'; # ISO 8859-2
-
-in which case literals (string or regular expressions), C<chr>,
-and C<ord> in your whole script are assumed to produce Unicode
-characters from ISO 8859-2 code points.  Note that the matching for
-encoding names is forgiving: instead of C<latin2> you could have
-said C<Latin2>, or C<latin-2>, or other variations.  With just
-
-    use encoding;
-
-the environment variable C<PERL_ENCODING> will be consulted.
-If that variable isn't set, the encoding pragma will fail.
+applicable) is assumed.
 
 The C<Encode> module knows about many encodings and has interfaces
 for doing conversions between those encodings:
 
@@ -331,7 +317,9 @@ and on already open streams, use C<binmode()>:
 The matching of encoding names is loose: case does not matter, and
 many encodings have several aliases.  Note that the C<:utf8> layer
 must always be specified exactly like that; it is I<not> subject to
-the loose matching of encoding names.
+the loose matching of encoding names.  Also note that C<:utf8> is unsafe
+for input, because it accepts the data without validating that it is
+indeed valid UTF-8.
 
 See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
 L<Encode::PerlIO> for the C<:encoding()> layer, and
 
@@ -343,7 +331,7 @@ Unicode or legacy encodings does not magically turn the data into
 Unicode in Perl's eyes.  To do that, specify the appropriate
 layer when opening files
 
-    open(my $fh,'<:utf8', 'anything');
+    open(my $fh,'<:encoding(utf8)', 'anything');
     my $line_of_unicode = <$fh>;
 
     open(my $fh,'<:encoding(Big5)', 'anything');
 
@@ -352,7 +340,7 @@ layer when opening files
 
 The I/O layers can also be specified more flexibly with the C<open>
 pragma.  See L<open>, or look at the following example.
-    use open ':utf8'; # input and output default layer will be UTF-8
+    use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
 
     open X, ">file";
     print X chr(0x100), "\n";
     close X;
 
@@ -372,11 +360,6 @@ With the C<open> pragma you can use the C<:locale> layer
     printf "%#x\n", ord(<I>), "\n";  # this should print 0xc1
     close I;
 
-or you can also use the C<':encoding(...)'> layer
-
-    open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
-    my $line_of_unicode = <$epic>;
-
 These methods install a transparent filter on the I/O stream that
 converts data from the specified encoding when it is read in from the
 stream.  The result is always Unicode.
 
@@ -404,8 +387,8 @@ the file "text.utf8", encoded as UTF-8:
 
     while (<$nihongo>) { print $unicode $_ }
 
 The naming of encodings, both by the C<open()> and by the C<open>
-pragma, is similar to the C<encoding> pragma in that it allows for
-flexible names: C<koi8-r> and C<KOI8R> will both be understood.
+pragma allows for flexible names: C<koi8-r> and C<KOI8R> will both be
+understood.
 
 Common encodings recognized by ISO, MIME, IANA, and various other
 standardisation organisations are recognised; for a more detailed
 
@@ -425,13 +408,13 @@ by repeatedly encoding the data:
 
     local $/; ## read in the whole file of 8-bit characters
     $t = <F>;
     close F;
-    open F, ">:utf8", "file";
+    open F, ">:encoding(utf8)", "file";
     print F $t; ## convert to UTF-8 on output
     close F;
 
 If you run this code twice, the contents of the F<file> will be twice
-UTF-8 encoded.  A C<use open ':utf8'> would have avoided the bug, or
-explicitly opening also the F<file> for input as UTF-8.
+UTF-8 encoded.  A C<use open ':encoding(utf8)'> would have avoided the
+bug, or explicitly opening also the F<file> for input as UTF-8.
 
 B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
 Perl has been built with the new PerlIO feature (which is the default
 
@@ -452,7 +435,7 @@ displayed as C<\x..>, and the rest of the characters as themselves:
 
        chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
        sprintf("\\x%02X", $_) :    # \x..
       quotemeta(chr($_))       # else quoted or as themselves
-  } unpack("U*", $_[0]));      # unpack Unicode characters
+  } unpack("W*", $_[0]));      # unpack Unicode characters
   }
 
 For example,
 
@@ -492,11 +475,12 @@ explicitly-defined I/O layers).  But if you must, there are two ways of
 looking behind the scenes.
 
 One way of peeking inside the internal encoding of Unicode characters
-is to use C<unpack("C*", ...)> to get the bytes or C<unpack("H*", ...)>
-to display the bytes:
+is to use C<unpack("C*", ...)> to get the bytes of whatever the string
+encoding happens to be, or C<unpack("U0..", ...)> to get the bytes of the
+UTF-8 encoding:
 
     # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
-    print join(" ", unpack("H*", pack("U", 0x100))), "\n";
+    print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
 
 Yet another way would be to use the Devel::Peek module:
 
@@ -530,8 +514,8 @@ CAPITAL LETTER As should be considered equal, or even As of any case.
 The long answer is that you need to consider character normalization
 and casing issues: see L<Unicode::Normalize>, Unicode Technical
 Reports #15 and #21, I<Unicode Normalization Forms> and I<Case
-Mappings>, http://www.unicode.org/unicode/reports/tr15/ and
-http://www.unicode.org/unicode/reports/tr21/
+Mappings>, http://www.unicode.org/unicode/reports/tr15/ and
+http://www.unicode.org/unicode/reports/tr21/
 
 As of Perl 5.8.0, the "Full" case-folding of I<Case
 Mappings/SpecialCasing> is implemented.
 
@@ -668,22 +652,24 @@ How Do I Detect Data That's Not Valid In a Particular Encoding?
 
 Use the C<Encode> package to try converting it.  For example,
 
-    use Encode 'encode_utf8';
-    if (encode_utf8($string_of_bytes_that_I_think_is_utf8)) {
-        # valid
+    use Encode 'decode_utf8';
+
+    if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
+        # $string is valid utf8
     } else {
-        # invalid
+        # $string is not valid utf8
     }
 
-For UTF-8 only, you can use:
+Or use C<unpack> to try decoding it:
 
     use warnings;
-    @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);
+    @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
 
-If invalid, a C<Malformed UTF-8 character (byte 0x##) in unpack>
-warning is produced.  The "U0" means "expect strictly UTF-8 encoded
-Unicode".  Without that the C<unpack> would accept also
-data like C<chr(0xFF)>, similarly to the C<pack> as we saw earlier.
+If invalid, a C<Malformed UTF-8 character> warning is produced.  The
+"C0" means "process the string character per character".  Without that,
+the C<unpack("U*", ...)> would work in C<U0> mode (the default if the
+format string starts with C<U>) and it would return the bytes making up
+the UTF-8 encoding of the target string, something that will always work.
 
 =item *
 
@@ -725,8 +711,8 @@ Back to converting data.
 If you have (or want) data in your system's native 8-bit encoding (e.g.
 Latin-1, EBCDIC, etc.), you can use pack/unpack to convert to/from
 Unicode.
 
-    $native_string = pack("C*", unpack("U*", $Unicode_string));
-    $Unicode_string = pack("U*", unpack("C*", $native_string));
+    $native_string = pack("W*", unpack("U*", $Unicode_string));
+    $Unicode_string = pack("U*", unpack("W*", $native_string));
 
 If you have a sequence of bytes you B<know> is valid UTF-8,
 but Perl doesn't know it yet, you can make Perl a believer, too:
 
@@ -734,6 +720,10 @@ but Perl doesn't know it yet, you can make Perl a believer, too:
 
     use Encode 'decode_utf8';
     $Unicode = decode_utf8($bytes);
 
+or:
+
+    $Unicode = pack("U0a*", $bytes);
+
 You can convert well-formed UTF-8 to a sequence of bytes, but if
 you just want to convert random binary data into UTF-8, you can't.
 B<Any random collection of bytes isn't well-formed UTF-8>.  You can
 
@@ -797,44 +787,44 @@ show a decimal number in hexadecimal.  If you have just the
 
 Unicode Consortium
 
-    http://www.unicode.org/
+http://www.unicode.org/
 
 =item *
 
 Unicode FAQ
 
-    http://www.unicode.org/unicode/faq/
+http://www.unicode.org/unicode/faq/
 
 =item *
 
 Unicode Glossary
 
-    http://www.unicode.org/glossary/
+http://www.unicode.org/glossary/
 
 =item *
 
 Unicode Useful Resources
 
-    http://www.unicode.org/unicode/onlinedat/resources.html
+http://www.unicode.org/unicode/onlinedat/resources.html
 
 =item *
 
 Unicode and Multilingual Support in HTML, Fonts, Web Browsers and
 Other Applications
 
-    http://www.alanwood.net/unicode/
+http://www.alanwood.net/unicode/
 
 =item *
 
 UTF-8 and Unicode FAQ for Unix/Linux
 
-    http://www.cl.cam.ac.uk/~mgk25/unicode.html
+http://www.cl.cam.ac.uk/~mgk25/unicode.html
 
 =item *
 
 Legacy Character Sets
 
-    http://www.czyborra.com/
-    http://www.eki.ee/letter/
+http://www.czyborra.com/
+http://www.eki.ee/letter/
 
 =item *
 
@@ -843,7 +833,7 @@ directory
 
     $Config{installprivlib}/unicore
 
-in Perl 5.8.0 or newer, and 
+in Perl 5.8.0 or newer, and
 
     $Config{installprivlib}/unicode
 
@@ -878,7 +868,7 @@ to UTF-8 bytes and back, the code works even with older Perl 5 versions.
 
 =head1 SEE ALSO
 
-L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
 L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>, L<Unicode::UCD>
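The pattern this patch documents (write and read through an C<:encoding> layer instead of the raw C<:utf8> layer, and validate suspect bytes with C<decode_utf8> plus C<Encode::FB_CROAK>) can be exercised end to end. Below is a minimal sketch, assuming Perl 5.8 or newer with the core Encode and File::Temp modules; the temporary file and the sample strings are invented for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);
use File::Temp qw(tempfile);

# Write a string containing a wide character through an :encoding layer,
# so the bytes on disk are real UTF-8 rather than Perl's internal form.
my ($fh, $path) = tempfile();
binmode($fh, ':encoding(UTF-8)');
print $fh "\x{100}\n";    # LATIN CAPITAL LETTER A WITH MACRON
close $fh;

# Read it back through the same layer; the result is a character string.
open(my $in, '<:encoding(UTF-8)', $path) or die "open: $!";
my $line = <$in>;
close $in;
print ord($line) == 0x100 ? "round-trip ok\n" : "round-trip failed\n";

# Validate arbitrary bytes as UTF-8 the way the patch recommends:
# decode_utf8 with FB_CROAK dies on malformed input, so wrap it in eval.
my $good = encode_utf8("\x{100}");   # the two bytes 0xC4 0x80
my $bad  = "\xFF\xFE";               # not valid UTF-8
print "good: ", (eval { decode_utf8($good, Encode::FB_CROAK); 1 } ? "valid" : "invalid"), "\n";
print "bad: ",  (eval { decode_utf8($bad,  Encode::FB_CROAK); 1 } ? "valid" : "invalid"), "\n";
```

Run as an ordinary script, this prints C<round-trip ok>, C<good: valid>, and C<bad: invalid>, confirming that the C<:encoding(UTF-8)> layer round-trips characters and that C<FB_CROAK> rejects malformed byte sequences.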