X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperluniintro.pod;h=9337e5f91925ba0df7a8955bbd7470184525669f;hb=42e1efa1e62f0d241a2d8e4847bce98f732060a3;hp=eadcedd74bf01f406a1acbd7abeeb49b7c8fd9a6;hpb=4c496f0cc0d05e588e924cab74c61dfe12f0f2cb;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index eadcedd..9337e5f 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -246,16 +246,14 @@ Note that both C<\x{...}> and C<\N{...}> are
 compile-time string constants: you cannot use variables in them.
 If you want similar run-time functionality, use C<chr()> and
 C<charnames::vianame()>.
 
-Also note that if all the code points for pack "U" are below 0x100,
-bytes will be generated, just like if you were using C<chr()>.
-
-   my $bytes = pack("U*", 0x80, 0xFF);
-
 If you want to force the result to Unicode characters, use the special
-C<"U0"> prefix.  It consumes no arguments but forces the result to be
-in Unicode characters, instead of bytes.
+C<"U0"> prefix.  It consumes no arguments but causes the following bytes
+to be interpreted as the UTF-8 encoding of Unicode characters:
+
+   my $chars = pack("U0W*", 0x80, 0x42);
 
-   my $chars = pack("U0U*", 0x80, 0xFF);
+Likewise, you can stop such UTF-8 interpretation by using the special
+C<"C0"> prefix.
 
 =head2 Handling Unicode
 
@@ -265,7 +263,7 @@ C<substr()> will work on the Unicode characters; regular expressions
 will work on the Unicode characters (see L<perlunicode> and
 L<perlretut>).
 
 Note that Perl considers combining character sequences to be
-characters, so for example
+separate characters, so for example
 
     use charnames ':full';
     print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";
 
@@ -280,27 +278,13 @@ encodings, I/O, and certain special cases:
 
 When you combine legacy data and Unicode, the legacy data needs
 to be upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if
-applicable) is assumed.
You can override this assumption by
-using the C<encoding> pragma, for example
-
-    use encoding 'latin2'; # ISO 8859-2
-
-in which case literals (string or regular expressions), C<chr()>,
-and C<ord()> in your whole script are assumed to produce Unicode
-characters from ISO 8859-2 code points.  Note that the matching for
-encoding names is forgiving: instead of C<latin2> you could have
-said C<Latin 2>, or C<iso8859-2>, or other variations.  With just
-
-    use encoding;
-
-the environment variable C<PERL_ENCODING> will be consulted.
-If that variable isn't set, the encoding pragma will fail.
+applicable) is assumed.
 
 The C<Encode> module knows about many encodings and has interfaces
 for doing conversions between those encodings:
 
-    use Encode 'from_to';
-    from_to($data, "iso-8859-3", "utf-8"); # from legacy to utf-8
+    use Encode 'decode';
+    $data = decode("iso-8859-3", $data); # convert from legacy to utf-8
 
 =head2 Unicode I/O
 
@@ -406,8 +390,8 @@ the file "text.utf8", encoded as UTF-8:
 
     while (<$nihongo>) { print $unicode $_ }
 
 The naming of encodings, both by the C<open()> and by the C<open>
-pragma, is similar to the C<encoding> pragma in that it allows for
-flexible names: C<koi8-r> and C<KOI8R> will both be understood.
+pragma allows for flexible names: C<koi8-r> and C<KOI8R> will both be
+understood.
 
 Common encodings recognized by ISO, MIME, IANA, and various other
 standardisation organisations are recognised; for a more detailed
@@ -454,7 +438,7 @@ displayed as C<\x..>, and the rest of the characters as themselves:
           chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
           sprintf("\\x%02X", $_) :    # \x..
           quotemeta(chr($_))          # else quoted or as themselves
-        } unpack("U*", $_[0]));       # unpack Unicode characters
+        } unpack("W*", $_[0]));       # unpack Unicode characters
    }
 
 For example,
 
@@ -494,17 +478,18 @@ explicitly-defined I/O layers).  But if you must, there are two ways
 of looking behind the scenes.
One way of peeking inside the internal encoding of Unicode characters
-is to use C<unpack("C*", ...)> to get the bytes or C<unpack("H*", ...)>
-to display the bytes:
+is to use C<unpack("C*", ...)> to get the bytes of whatever the string
+encoding happens to be, or C<unpack("U0..", ...)> to get the bytes of the
+UTF-8 encoding:
 
     # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
-    print join(" ", unpack("H*", pack("U", 0x100))), "\n";
+    print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
 
 Yet another way would be to use the Devel::Peek module:
 
     perl -MDevel::Peek -e 'Dump(chr(0x100))'
 
-That shows the UTF8 flag in FLAGS and both the UTF-8 bytes
+That shows the C<UTF8> flag in FLAGS and both the UTF-8 bytes
 and Unicode characters in C<PV>.  See also later in this document
 the discussion about the C<utf8::is_utf8()> function.
 
@@ -638,7 +623,7 @@ C<$string>.  If the flag is off, the bytes in the scalar are interpreted
 as a single byte encoding.  If the flag is on, the bytes in the scalar
 are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
 points of the characters.  Bytes added to a UTF-8 encoded string are
-automatically upgraded to UTF-8.  If mixed non-UTF8 and UTF-8 scalars
+automatically upgraded to UTF-8.  If mixed non-UTF-8 and UTF-8 scalars
 are merged (double-quoted interpolation, explicit concatenation, and
 printf/sprintf parameter substitution), the result will be UTF-8 encoded
 as if copies of the byte strings were upgraded to UTF-8: for example,
 
@@ -670,22 +655,24 @@ How Do I Detect Data That's Not Valid In a Particular Encoding?
 
 Use the C<Encode> package to try converting it.
 For example,
 
-    use Encode 'encode_utf8';
-    if (encode_utf8($string_of_bytes_that_I_think_is_utf8)) {
+    use Encode 'decode_utf8';
+    if (decode_utf8($string_of_bytes_that_I_think_is_utf8)) {
         # valid
     } else {
         # invalid
     }
 
-For UTF-8 only, you can use:
+Or use C<unpack> to try decoding it:
 
     use warnings;
-    @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);
+    @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
 
 If invalid, a C<Malformed UTF-8 character (byte 0x##) in unpack>
-warning is produced.
The "U0" means "expect strictly UTF-8 encoded
-Unicode".  Without that the C<unpack()> would accept also
-data like C<chr(0xFF)>, similarly to the C<pack()> as we saw earlier.
+warning is produced.  The "C0" means
+"process the string character per character".  Without that the
+C<unpack> would work in C<U0> mode (the default if the format
+string starts with C<U>) and it would return the bytes making up the UTF-8
+encoding of the target string, something that will always work.
 
 =item *
 
@@ -727,8 +714,8 @@ Back to converting data.  If you have (or want) data in your
 system's native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you
 can use pack/unpack to convert to/from Unicode.
 
-    $native_string = pack("C*", unpack("U*", $Unicode_string));
-    $Unicode_string = pack("U*", unpack("C*", $native_string));
+    $native_string = pack("W*", unpack("U*", $Unicode_string));
+    $Unicode_string = pack("U*", unpack("W*", $native_string));
 
 If you have a sequence of bytes you B<know> is valid UTF-8,
 but Perl doesn't know it yet, you can make Perl a believer, too:
 
@@ -736,6 +723,10 @@ but Perl doesn't know it yet, you can make Perl a believer, too:
 
     use Encode 'decode_utf8';
     $Unicode = decode_utf8($bytes);
 
+or:
+
+    $Unicode = pack("U0a*", $bytes);
+
 You can convert well-formed UTF-8 to a sequence of bytes, but if you
 just want to convert random binary data into UTF-8, you can't.
 B<Any random collection of bytes isn't valid UTF-8>.  You can
 
@@ -880,7 +871,7 @@ to UTF-8 bytes and back, the code works even with older Perl 5
 versions.
 
 =head1 SEE ALSO
 
-L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perlunicode>, L<Encode>, L<open>, L<utf8>, L<bytes>,
 L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
 L<Unicode::UCD>