is now carried with the data, not attached to the operations. (There
is one remaining case where an explicit C<use utf8> is needed: if your
Perl script itself is encoded in UTF-8, you can use UTF-8 in your
-variable and subroutine names, and in your string and regular
-expression literals, by saying C<use utf8>. This is not the default
-because that would break existing scripts having legacy 8-bit data in
-them.)
+identifier names, and in your string and regular expression literals,
+by saying C<use utf8>. This is not the default because that would
+break existing scripts having legacy 8-bit data in them.)
=head2 Perl's Unicode Model
to this sample program ensures the output is completely UTF-8, and
of course, removes the warning.
-Perl 5.8.0 also supports Unicode on EBCDIC platforms. There, the
-support is somewhat harder to implement since additional conversions
-are needed at every step. Because of these difficulties, the Unicode
-support isn't quite as full as in other, mainly ASCII-based, platforms
-(the Unicode support is better than in the 5.6 series, which didn't
-work much at all for EBCDIC platform). On EBCDIC platforms, the
-internal Unicode encoding form is UTF-EBCDIC instead of UTF-8 (the
-difference is that as UTF-8 is "ASCII-safe" in that ASCII characters
-encode to UTF-8 as-is, UTF-EBCDIC is "EBCDIC-safe").
+If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG)
+contain the string 'UTF-8' or 'UTF8' (matched case-insensitively),
+the default encoding of your STDIN, STDOUT, and STDERR, and of
+B<any subsequent file open>, is UTF-8. Note that this means Perl
+expects other software to cooperate: if STDIN coming in from another
+command is not UTF-8, Perl will complain about malformed UTF-8.
+
+=head2 Unicode and EBCDIC
+
+Perl 5.8.0 also supports Unicode on EBCDIC platforms. There,
+the Unicode support is somewhat more complex to implement since
+additional conversions are needed at every step. Some problems
+remain; see L<perlebcdic> for details.
+
+In any case, the Unicode support on EBCDIC platforms is better than
+in the 5.6 series, which hardly worked at all on EBCDIC platforms.
+On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
+instead of UTF-8 (just as UTF-8 is "ASCII-safe" in that ASCII
+characters encode to UTF-8 as-is, UTF-EBCDIC is "EBCDIC-safe").
=head2 Creating Unicode
use encoding;
-first the environment variable C<PERL_ENCODING> will be consulted,
-and if that doesn't exist, ISO 8859-1 (Latin 1) will be assumed.
+the environment variable C<PERL_ENCODING> will be consulted,
+but if that doesn't exist, the encoding pragma fails.
The C<Encode> module knows about many encodings and it has interfaces
for doing conversions between those encodings:
code points greater than 0xFF (255) or even 0x80 (128), or that the
string has any characters at all. All the C<is_utf8()> does is to
return the value of the internal "utf8ness" flag attached to the
-$string. If the flag is on, characters added to that string will be
-automatically upgraded to UTF-8 (and even then only if they really
-need to be upgraded, that is, if their code point is greater than 0xFF).
+$string. If the flag is off, the bytes in the scalar are interpreted
+as a single byte encoding. If the flag is on, the bytes in the scalar
+are interpreted as the (multibyte, variable-length) UTF-8 encoded code
+points of the characters. Bytes added to a UTF-8 encoded string are
+automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8 scalars
+are merged (double-quoted interpolation, explicit concatenation, and
+printf/sprintf parameter substitution), the result will be UTF-8 encoded
+as if copies of the byte strings were upgraded to UTF-8: for example,
+
+ $a = "ab\x80c";
+ $b = "\x{100}";
+ print "$a = $b\n";
+
+the output string will be UTF-8-encoded "ab\x80c = \x{100}\n", but note
+that C<$a> will stay single byte encoded.
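+
+As a minimal sketch of the upgrade behavior described above, the flag
+can be inspected before and after merging. The variable names here are
+illustrative only, and C<Encode::is_utf8()> is assumed as the way to
+read the flag; nothing is printed from the merged string itself, to
+avoid wide-character warnings:
+
+    use Encode ();
+
+    my $bytes = "ab\x80c";           # no code point above 0xFF
+    my $wide  = "\x{100}";           # one code point above 0xFF
+    my $both  = "$bytes = $wide\n";  # merged from a copy of $bytes
+
+    # flag of the byte string, unchanged by the merge: prints "off"
+    printf "%s\n", Encode::is_utf8($bytes) ? "on" : "off";
+    # flag of the merged string: prints "on"
+    printf "%s\n", Encode::is_utf8($both) ? "on" : "off";
+    # lengths in characters: prints "4 9"
+    printf "%d %d\n", length($bytes), length($both);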
Sometimes you might really need to know the byte length of a string
-instead of the character length. For that use the C<bytes> pragma
-and its only defined function C<length()>:
+instead of the character length. For that use either the
+C<Encode::encode_utf8()> function or the C<bytes> pragma and its only
+defined function C<length()>:
my $unicode = chr(0x100);
print length($unicode), "\n"; # will print 1
+ require Encode;
+ print length(Encode::encode_utf8($unicode)), "\n"; # will print 2
use bytes;
- print length($unicode), "\n"; # will print 2 (the 0xC4 0x80 of the UTF-8)
+ print length($unicode), "\n"; # will also print 2 (the 0xC4 0x80 of the UTF-8)
=item