Add the URL for annotated svn of S03.

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 9337e5f..86360d4 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -160,14 +160,14 @@ strings contain a character beyond 0x00FF.
 
 For example,
 
-      perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'              
+      perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
 
 produces a fairly useless mixture of native bytes and UTF-8, as well
 as a warning:
 
      Wide character in print at ...
 
-To output UTF-8, use the C<:utf8> output layer.  Prepending
+To output UTF-8, use the C<:encoding> or C<:utf8> output layer.  Prepending
 
       binmode(STDOUT, ":utf8");
 
@@ -252,7 +252,7 @@ to be interpreted as the UTF-8 encoding of Unicode characters:
 
    my $chars = pack("U0W*", 0x80, 0x42);
 
-Likewise, you can stop such UTF-8 interpretation by using the special 
+Likewise, you can stop such UTF-8 interpretation by using the special
 C<"C0"> prefix.
 
 =head2 Handling Unicode
@@ -278,7 +278,7 @@ encodings, I/O, and certain special cases:
 
 When you combine legacy data and Unicode the legacy data needs
 to be upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if
-applicable) is assumed. 
+applicable) is assumed.
 
 The C<Encode> module knows about many encodings and has interfaces
 for doing conversions between those encodings:
@@ -317,7 +317,9 @@ and on already open streams, use C<binmode()>:
 The matching of encoding names is loose: case does not matter, and
 many encodings have several aliases.  Note that the C<:utf8> layer
 must always be specified exactly like that; it is I<not> subject to
-the loose matching of encoding names.
+the loose matching of encoding names. Also note that C<:utf8> is unsafe for
+input, because it accepts the data without validating that it is indeed valid
+UTF8.
 
 See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
 L<Encode::PerlIO> for the C<:encoding()> layer, and
@@ -329,7 +331,7 @@ Unicode or legacy encodings does not magically turn the data into
 Unicode in Perl's eyes.  To do that, specify the appropriate
 layer when opening files
 
-    open(my $fh,'<:utf8', 'anything');
+    open(my $fh,'<:encoding(utf8)', 'anything');
     my $line_of_unicode = <$fh>;
 
     open(my $fh,'<:encoding(Big5)', 'anything');
@@ -338,7 +340,7 @@ layer when opening files
 The I/O layers can also be specified more flexibly with
 the C<open> pragma.  See L<open>, or look at the following example.
 
-    use open ':utf8'; # input and output default layer will be UTF-8
+    use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
     open X, ">file";
     print X chr(0x100), "\n";
     close X;
@@ -358,11 +360,6 @@ With the C<open> pragma you can use the C<:locale> layer
     printf "%#x\n", ord(<I>), "\n"; # this should print 0xc1
     close I;
 
-or you can also use the C<':encoding(...)'> layer
-
-    open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
-    my $line_of_unicode = <$epic>;
-
 These methods install a transparent filter on the I/O stream that
 converts data from the specified encoding when it is read in from the
 stream.  The result is always Unicode.
@@ -411,13 +408,13 @@ by repeatedly encoding the data:
     local $/; ## read in the whole file of 8-bit characters
     $t = <F>;
     close F;
-    open F, ">:utf8", "file";
+    open F, ">:encoding(utf8)", "file";
     print F $t; ## convert to UTF-8 on output
     close F;
 
 If you run this code twice, the contents of the F<file> will be twice
-UTF-8 encoded.  A C<use open ':utf8'> would have avoided the bug, or
-explicitly opening also the F<file> for input as UTF-8.
+UTF-8 encoded.  A C<use open ':encoding(utf8)'> would have avoided the
+bug, or explicitly opening also the F<file> for input as UTF-8.
 
 B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
 Perl has been built with the new PerlIO feature (which is the default
@@ -517,8 +514,8 @@ CAPITAL LETTER As should be considered equal, or even As of any case.
 The long answer is that you need to consider character normalization
 and casing issues: see L<Unicode::Normalize>, Unicode Technical
 Reports #15 and #21, I<Unicode Normalization Forms> and I<Case
-Mappings>, http://www.unicode.org/unicode/reports/tr15/ and 
-http://www.unicode.org/unicode/reports/tr21/ 
+Mappings>, http://www.unicode.org/unicode/reports/tr15/ and
+http://www.unicode.org/unicode/reports/tr21/
 
 As of Perl 5.8.0, the "Full" case-folding of I<Case
 Mappings/SpecialCasing> is implemented.
@@ -656,10 +653,11 @@ Use the C<Encode> package to try converting it.
 For example,
 
     use Encode 'decode_utf8';
-    if (decode_utf8($string_of_bytes_that_I_think_is_utf8)) {
-        # valid
+    
+    if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
+        # $string is valid utf8
     } else {
-        # invalid
+        # $string is not valid utf8
     }
 
 Or use C<unpack> to try decoding it:
@@ -667,11 +665,10 @@ Or use C<unpack> to try decoding it:
     use warnings;
     @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
 
-If invalid, a C<Malformed UTF-8 character (byte 0x##) in unpack>
-warning is produced. The "C0" means 
-"process the string character per character".  Without that the 
-C<unpack("U*", ...)> would work in C<U0> mode (the default if the format 
-string starts with C<U>) and it would return the bytes making up the UTF-8 
+If invalid, a C<Malformed UTF-8 character> warning is produced. The "C0" means
+"process the string character per character".  Without that, the
+C<unpack("U*", ...)> would work in C<U0> mode (the default if the format
+string starts with C<U>) and it would return the bytes making up the UTF-8
 encoding of the target string, something that will always work.
 
 =item *
@@ -726,7 +723,7 @@ but Perl doesn't know it yet, you can make Perl a believer, too:
 or:
 
     $Unicode = pack("U0a*", $bytes);
-   
+
 You can convert well-formed UTF-8 to a sequence of bytes, but if
 you just want to convert random binary data into UTF-8, you can't.
 B<Any random collection of bytes isn't well-formed UTF-8>.  You can
@@ -790,44 +787,44 @@ show a decimal number in hexadecimal.  If you have just the
 
 Unicode Consortium
 
-    http://www.unicode.org/
+http://www.unicode.org/
 
 =item *
 
 Unicode FAQ
 
-    http://www.unicode.org/unicode/faq/
+http://www.unicode.org/unicode/faq/
 
 =item *
 
 Unicode Glossary
 
-    http://www.unicode.org/glossary/
+http://www.unicode.org/glossary/
 
 =item *
 
 Unicode Useful Resources
 
-    http://www.unicode.org/unicode/onlinedat/resources.html
+http://www.unicode.org/unicode/onlinedat/resources.html
 
 =item *
 
 Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
 
-    http://www.alanwood.net/unicode/
+http://www.alanwood.net/unicode/
 
 =item *
 
 UTF-8 and Unicode FAQ for Unix/Linux
 
-    http://www.cl.cam.ac.uk/~mgk25/unicode.html
+http://www.cl.cam.ac.uk/~mgk25/unicode.html
 
 =item *
 
 Legacy Character Sets
 
-    http://www.czyborra.com/
-    http://www.eki.ee/letter/
+http://www.czyborra.com/
+http://www.eki.ee/letter/
 
 =item *
 
@@ -836,7 +833,7 @@ directory
 
     $Config{installprivlib}/unicore
 
-in Perl 5.8.0 or newer, and 
+in Perl 5.8.0 or newer, and
 
     $Config{installprivlib}/unicode