perltodo update.

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index c94f3d2..d6eae60 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -128,10 +128,9 @@ This model was found to be wrong, or at least clumsy: the Unicodeness
 is now carried with the data, not attached to the operations.  (There
 is one remaining case where an explicit C<use utf8> is needed: if your
 Perl script itself is encoded in UTF-8, you can use UTF-8 in your
-variable and subroutine names, and in your string and regular
-expression literals, by saying C<use utf8>.  This is not the default
-because that would break existing scripts having legacy 8-bit data in
-them.)
+identifier names, and in your string and regular expression literals,
+by saying C<use utf8>.  This is not the default because that would
+break existing scripts having legacy 8-bit data in them.)
 
 =head2 Perl's Unicode Model
 
@@ -169,15 +168,27 @@ To output UTF-8 always, use the ":utf8" output discipline.  Prepending
 to this sample program ensures the output is completely UTF-8, and      
 of course, removes the warning.
 
-Perl 5.8.0 also supports Unicode on EBCDIC platforms.  There, the
-support is somewhat harder to implement since additional conversions
-are needed at every step.  Because of these difficulties, the Unicode
-support isn't quite as full as in other, mainly ASCII-based, platforms
-(the Unicode support is better than in the 5.6 series, which didn't
-work much at all for EBCDIC platform).  On EBCDIC platforms, the
-internal Unicode encoding form is UTF-EBCDIC instead of UTF-8 (the
-difference is that as UTF-8 is "ASCII-safe" in that ASCII characters
-encode to UTF-8 as-is, UTF-EBCDIC is "EBCDIC-safe").
+If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG)
+contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching),
+the default encoding of your STDIN, STDOUT, and STDERR, and of
+B<any subsequent file open>, is UTF-8.  Note that this means
+that Perl expects other software to work, too: if STDIN coming
+in from another command is not UTF-8, Perl will complain about
+malformed UTF-8.
+
+=head2 Unicode and EBCDIC
+
+Perl 5.8.0 also supports Unicode on EBCDIC platforms.  There,
+the Unicode support is somewhat more complex to implement since
+additional conversions are needed at every step.  Some problems
+remain; see L<perlebcdic> for details.
+
+In any case, the Unicode support on EBCDIC platforms is better than
+in the 5.6 series, which didn't work much at all for EBCDIC platforms.
+On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
+instead of UTF-8 (the difference is that as UTF-8 is "ASCII-safe" in
+that ASCII characters encode to UTF-8 as-is, UTF-EBCDIC is
+"EBCDIC-safe").
 
 =head2 Creating Unicode
 
@@ -268,8 +279,8 @@ C<Latin 2>, or C<iso8859-2>, and so forth.  With just
 
     use encoding;
 
-first the environment variable C<PERL_ENCODING> will be consulted,
-and if that doesn't exist, ISO 8859-1 (Latin 1) will be assumed.
+the environment variable C<PERL_ENCODING> will be consulted,
+but if that doesn't exist, the encoding pragma fails.
 
 The C<Encode> module knows about many encodings and it has interfaces
 for doing conversions between those encodings:
@@ -596,18 +607,33 @@ string are necessary UTF-8 encoded, or that any of the characters have
 code points greater than 0xFF (255) or even 0x80 (128), or that the
 string has any characters at all.  All the C<is_utf8()> does is to
 return the value of the internal "utf8ness" flag attached to the
-$string.  If the flag is on, characters added to that string will be
-automatically upgraded to UTF-8 (and even then only if they really
-need to be upgraded, that is, if their code point is greater than 0xFF).
+$string.  If the flag is off, the bytes in the scalar are interpreted
+as a single byte encoding.  If the flag is on, the bytes in the scalar
+are interpreted as the (multibyte, variable-length) UTF-8 encoded code
+points of the characters.  Bytes added to a UTF-8 encoded string are
+automatically upgraded to UTF-8.  If mixed non-UTF8 and UTF-8 scalars
+are merged (doublequoted interpolation, explicit concatenation, and
+printf/sprintf parameter substitution), the result will be UTF-8 encoded
+as if copies of the byte strings were upgraded to UTF-8: for example,
+
+    $a = "ab\x80c";
+    $b = "\x{100}";
+    print "$a = $b\n";
+
+the output string will be UTF-8-encoded "ab\x80c = \x{100}\n", but note
+that C<$a> will stay single byte encoded.
 
 Sometimes you might really need to know the byte length of a string
-instead of the character length.  For that use the C<bytes> pragma
-and its only defined function C<length()>:
+instead of the character length. For that use either the
+C<Encode::encode_utf8()> function or the C<bytes> pragma and its only
+defined function C<length()>:
 
     my $unicode = chr(0x100);
     print length($unicode), "\n"; # will print 1
+    require Encode;
+    print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
     use bytes;
-    print length($unicode), "\n"; # will print 2 (the 0xC4 0x80 of the UTF-8)
+    print length($unicode), "\n"; # will also print 2 (the 0xC4 0x80 of the UTF-8)
 
 =item