X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperluniintro.pod;h=9337e5f91925ba0df7a8955bbd7470184525669f;hb=42e1efa1e62f0d241a2d8e4847bce98f732060a3;hp=eadcedd74bf01f406a1acbd7abeeb49b7c8fd9a6;hpb=4c496f0cc0d05e588e924cab74c61dfe12f0f2cb;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index eadcedd..9337e5f 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -246,16 +246,14 @@ Note that both C<\x{...}> and C<\N{...}> are
 compile-time string constants: you cannot use variables in them.
 If you want similar run-time functionality, use C<chr()> and
 C<charnames::vianame()>.
 
-Also note that if all the code points for pack "U" are below 0x100,
-bytes will be generated, just like if you were using C<chr()>.
-
-   my $bytes = pack("U*", 0x80, 0xFF);
-
 If you want to force the result to Unicode characters, use the special
-C<"U0"> prefix.  It consumes no arguments but forces the result to be
-in Unicode characters, instead of bytes.
+C<"U0"> prefix.  It consumes no arguments but causes the following bytes
+to be interpreted as the UTF-8 encoding of Unicode characters:
+
+   my $chars = pack("U0W*", 0x80, 0x42);
 
-   my $chars = pack("U0U*", 0x80, 0xFF);
+Likewise, you can stop such UTF-8 interpretation by using the special
+C<"C0"> prefix.
 
 =head2 Handling Unicode
 
@@ -265,7 +263,7 @@ C<substr()> will work on the Unicode characters; regular expressions
 will work on the Unicode characters (see L<perlunicode> and
 L<perlretut>).
 
 Note that Perl considers combining character sequences to be
-characters, so for example
+separate characters, so for example
 
     use charnames ':full';
     print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";
 
@@ -280,27 +278,13 @@ encodings, I/O, and certain special cases:
 
 When you combine legacy data and Unicode, the legacy data needs
 to be upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if
-applicable) is assumed.
You can override this assumption by
-using the C<encoding> pragma, for example
-
-    use encoding 'latin2'; # ISO 8859-2
-
-in which case literals (string or regular expressions), C<chr()>,
-and C<ord()> in your whole script are assumed to produce Unicode
-characters from ISO 8859-2 code points.  Note that the matching for
-encoding names is forgiving: instead of C<latin2> you could have
-said C<Latin 2>, or C<iso8859-2>, or other variations.  With just
-
-    use encoding;
-
-the environment variable C<PERL_ENCODING> will be consulted.
-If that variable isn't set, the encoding pragma will fail.
+applicable) is assumed.
 
 The C<Encode> module knows about many encodings and has interfaces
 for doing conversions between those encodings:
 
-    use Encode 'from_to';
-    from_to($data, "iso-8859-3", "utf-8"); # from legacy to utf-8
+    use Encode 'decode';
+    $data = decode("iso-8859-3", $data); # convert from legacy to utf-8
 
 =head2 Unicode I/O
 
@@ -406,8 +390,8 @@ the file "text.utf8", encoded as UTF-8:
 
     while (<$nihongo>) { print $unicode $_ }
 
 The naming of encodings, both by the C<open()> and by the C<open>
-pragma, is similar to the C<encoding> pragma in that it allows for
-flexible names: C<koi8-r> and C<KOI8R> will both be understood.
+pragma allows for flexible names: C<koi8-r> and C<KOI8R> will both be
+understood.
 
 Common encodings recognized by ISO, MIME, IANA, and various other
 standardisation organisations are recognised; for a more detailed
@@ -454,7 +438,7 @@ displayed as C<\x..>, and the rest of the characters as themselves:
           chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
           sprintf("\\x%02X", $_) :    # \x..
           quotemeta(chr($_))          # else quoted or as themselves
-        } unpack("U*", $_[0]));       # unpack Unicode characters
+        } unpack("W*", $_[0]));       # unpack Unicode characters
    }
 
 For example,
 
@@ -494,17 +478,18 @@ explicitly-defined I/O layers).  But if you must, there are two ways
 of looking behind the scenes.
One way of peeking inside the internal encoding of Unicode characters
-is to use C<unpack("C*", ...)> to get the bytes or C<unpack("H*", ...)>
-to display the bytes:
+is to use C<unpack("C*", ...)> to get the bytes of whatever the string
+encoding happens to be, or C<unpack("U0..", ...)> to get the bytes of the
+UTF-8 encoding:
 
     # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
-    print join(" ", unpack("H*", pack("U", 0x100))), "\n";
+    print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
 
 Yet another way would be to use the Devel::Peek module:
 
     perl -MDevel::Peek -e 'Dump(chr(0x100))'
 
-That shows the UTF8 flag in FLAGS and both the UTF-8 bytes
+That shows the C<UTF8> flag in FLAGS and both the UTF-8 bytes
 and Unicode characters in C<PV>.  See also later in this document
 the discussion about the C<utf8::is_utf8()> function.
 
@@ -638,7 +623,7 @@ C<$string>.  If the flag is off, the bytes in the scalar are interpreted
 as a single byte encoding.  If the flag is on, the bytes in the scalar
 are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
 points of the characters.  Bytes added to a UTF-8 encoded string are
-automatically upgraded to UTF-8.  If mixed non-UTF8 and UTF-8 scalars
+automatically upgraded to UTF-8.  If mixed non-UTF-8 and UTF-8 scalars
 are merged (double-quoted interpolation, explicit concatenation, and
 printf/sprintf parameter substitution), the result will be UTF-8 encoded
 as if copies of the byte strings were upgraded to UTF-8: for example,
 
@@ -670,22 +655,24 @@ How Do I Detect Data That's Not Valid In a Particular Encoding?
 
 Use the C<Encode> package to try converting it.
 For example,
 
-    use Encode 'encode_utf8';
-    if (encode_utf8($string_of_bytes_that_I_think_is_utf8)) {
+    use Encode 'decode_utf8';
+    if (decode_utf8($string_of_bytes_that_I_think_is_utf8)) {
         # valid
     } else {
         # invalid
     }
 
-For UTF-8 only, you can use:
+Or use C<unpack> to try decoding it:
 
     use warnings;
-    @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);
+    @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
 
 If invalid, a C<Malformed UTF-8 character (byte 0x##) in unpack>
-warning is produced.
The "U0" means "expect strictly UTF-8 encoded
-Unicode".  Without that the C<unpack()> would accept also
-data like C<chr(0xFF)>, similarly to the C<pack()> as we saw earlier.
+warning is produced.  The "C0" means
+"process the string character per character".  Without that the
+C<unpack> would work in C<U0> mode (the default if the format
+string starts with C<U>) and it would return the bytes making up the UTF-8
+encoding of the target string, something that will always work.
 
 =item *
 
@@ -727,8 +714,8 @@ Back to converting data.  If you have (or want) data in your
 system's native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you
 can use pack/unpack to convert to/from Unicode.
 
-    $native_string = pack("C*", unpack("U*", $Unicode_string));
-    $Unicode_string = pack("U*", unpack("C*", $native_string));
+    $native_string = pack("W*", unpack("U*", $Unicode_string));
+    $Unicode_string = pack("U*", unpack("W*", $native_string));
 
 If you have a sequence of bytes you B<know> is valid UTF-8,
 but Perl doesn't know it yet, you can make Perl a believer, too:
 
@@ -736,6 +723,10 @@ but Perl doesn't know it yet, you can make Perl a believer, too:
 
     use Encode 'decode_utf8';
     $Unicode = decode_utf8($bytes);
 
+or:
+
+    $Unicode = pack("U0a*", $bytes);
+
 You can convert well-formed UTF-8 to a sequence of bytes, but if you
 just want to convert random binary data into UTF-8, you can't.
 B<Any random collection of bytes isn't valid UTF-8>.  You can
 
@@ -880,7 +871,7 @@ to UTF-8 bytes and back, the code works even with older Perl 5
 versions.
 
 =head1 SEE ALSO
 
-L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perlunicode>, L<Encode>, L<open>, L<utf8>, L<bytes>,
 L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
 L<Unicode::UCD>