For example,
- perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
+ perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
produces a fairly useless mixture of native bytes and UTF-8, as well
as a warning:
Wide character in print at ...
-To output UTF-8, use the C<:utf8> output layer. Prepending
+To output UTF-8, use the C<:encoding> or C<:utf8> output layer. Prepending
binmode(STDOUT, ":utf8");
constants: you cannot use variables in them. if you want similar
run-time functionality, use C<chr()> and C<charnames::vianame()>.
-Also note that if all the code points for pack "U" are below 0x100,
-bytes will be generated, just like if you were using C<chr()>.
-
- my $bytes = pack("U*", 0x80, 0xFF);
-
If you want to force the result to Unicode characters, use the special
-C<"U0"> prefix. It consumes no arguments but forces the result to be
-in Unicode characters, instead of bytes.
+C<"U0"> prefix. It consumes no arguments but causes the following bytes
+to be interpreted as the UTF-8 encoding of Unicode characters:
- my $chars = pack("U0U*", 0x80, 0xFF);
+ my $chars = pack("U0W*", 0x80, 0x42);
+
+Likewise, you can stop such UTF-8 interpretation by using the special
+C<"C0"> prefix.
=head2 Handling Unicode
When you combine legacy data and Unicode the legacy data needs
to be upgraded to Unicode. Normally ISO 8859-1 (or EBCDIC, if
-applicable) is assumed. You can override this assumption by
-using the C<encoding> pragma, for example
-
- use encoding 'latin2'; # ISO 8859-2
-
-in which case literals (string or regular expressions), C<chr()>,
-and C<ord()> in your whole script are assumed to produce Unicode
-characters from ISO 8859-2 code points. Note that the matching for
-encoding names is forgiving: instead of C<latin2> you could have
-said C<Latin 2>, or C<iso8859-2>, or other variations. With just
-
- use encoding;
-
-the environment variable C<PERL_ENCODING> will be consulted.
-If that variable isn't set, the encoding pragma will fail.
+applicable) is assumed.
The C<Encode> module knows about many encodings and has interfaces
for doing conversions between those encodings:
The matching of encoding names is loose: case does not matter, and
many encodings have several aliases. Note that the C<:utf8> layer
must always be specified exactly like that; it is I<not> subject to
-the loose matching of encoding names.
+the loose matching of encoding names. Also note that C<:utf8> is unsafe for
+input, because it accepts the data without validating that it is indeed valid
+UTF-8.
See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
L<Encode::PerlIO> for the C<:encoding()> layer, and
Unicode in Perl's eyes. To do that, specify the appropriate
layer when opening files
- open(my $fh,'<:utf8', 'anything');
+ open(my $fh,'<:encoding(utf8)', 'anything');
my $line_of_unicode = <$fh>;
open(my $fh,'<:encoding(Big5)', 'anything');
The I/O layers can also be specified more flexibly with
the C<open> pragma. See L<open>, or look at the following example.
- use open ':utf8'; # input and output default layer will be UTF-8
+ use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
open X, ">file";
print X chr(0x100), "\n";
close X;
printf "%#x\n", ord(<I>), "\n"; # this should print 0xc1
close I;
-or you can also use the C<':encoding(...)'> layer
-
- open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
- my $line_of_unicode = <$epic>;
-
These methods install a transparent filter on the I/O stream that
converts data from the specified encoding when it is read in from the
stream. The result is always Unicode.
while (<$nihongo>) { print $unicode $_ }
The naming of encodings, both by the C<open()> and by the C<open>
-pragma, is similar to the C<encoding> pragma in that it allows for
-flexible names: C<koi8-r> and C<KOI8R> will both be understood.
+pragma allows for flexible names: C<koi8-r> and C<KOI8R> will both be
+understood.
Common encodings recognized by ISO, MIME, IANA, and various other
standardisation organisations are recognised; for a more detailed
local $/; ## read in the whole file of 8-bit characters
$t = <F>;
close F;
- open F, ">:utf8", "file";
+ open F, ">:encoding(utf8)", "file";
print F $t; ## convert to UTF-8 on output
close F;
If you run this code twice, the contents of the F<file> will be twice
-UTF-8 encoded. A C<use open ':utf8'> would have avoided the bug, or
-explicitly opening also the F<file> for input as UTF-8.
+UTF-8 encoded. A C<use open ':encoding(utf8)'> would have avoided the
+bug, or explicitly opening also the F<file> for input as UTF-8.
B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
Perl has been built with the new PerlIO feature (which is the default
chr($_) =~ /[[:cntrl:]]/ ? # else if control character ...
sprintf("\\x%02X", $_) : # \x..
quotemeta(chr($_)) # else quoted or as themselves
- } unpack("U*", $_[0])); # unpack Unicode characters
+ } unpack("W*", $_[0])); # unpack Unicode characters
}
For example,
ways of looking behind the scenes.
One way of peeking inside the internal encoding of Unicode characters
-is to use C<unpack("C*", ...> to get the bytes or C<unpack("H*", ...)>
-to display the bytes:
+is to use C<unpack("C*", ...)> to get the bytes of whatever the string
+encoding happens to be, or C<unpack("U0..", ...)> to get the bytes of the
+UTF-8 encoding:
# this prints c4 80 for the UTF-8 bytes 0xc4 0x80
- print join(" ", unpack("H*", pack("U", 0x100))), "\n";
+ print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
Yet another way would be to use the Devel::Peek module:
The long answer is that you need to consider character normalization
and casing issues: see L<Unicode::Normalize>, Unicode Technical
Reports #15 and #21, I<Unicode Normalization Forms> and I<Case
-Mappings>, http://www.unicode.org/unicode/reports/tr15/ and
-http://www.unicode.org/unicode/reports/tr21/
+Mappings>, http://www.unicode.org/unicode/reports/tr15/ and
+http://www.unicode.org/unicode/reports/tr21/
As of Perl 5.8.0, the "Full" case-folding of I<Case
Mappings/SpecialCasing> is implemented.
Use the C<Encode> package to try converting it.
For example,
- use Encode 'encode_utf8';
- if (encode_utf8($string_of_bytes_that_I_think_is_utf8)) {
- # valid
+ use Encode 'decode_utf8';
+ eval { decode_utf8($string, Encode::FB_CROAK) };
+ if ($@) {
+ # $string is not valid utf8
} else {
- # invalid
+ # $string is valid utf8
}
-For UTF-8 only, you can use:
+Or use C<unpack> to try decoding it:
use warnings;
- @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);
+ @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
-If invalid, a C<Malformed UTF-8 character (byte 0x##) in unpack>
-warning is produced. The "U0" means "expect strictly UTF-8 encoded
-Unicode". Without that the C<unpack("U*", ...)> would accept also
-data like C<chr(0xFF>), similarly to the C<pack> as we saw earlier.
+If invalid, a C<Malformed UTF-8 character> warning is produced. The "C0" means
+"process the string character by character". Without that, the
+C<unpack("U*", ...)> would work in C<U0> mode (the default if the format
+string starts with C<U>) and it would return the bytes making up the UTF-8
+encoding of the target string, something that will always work.
=item *
native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you can use
pack/unpack to convert to/from Unicode.
- $native_string = pack("C*", unpack("U*", $Unicode_string));
- $Unicode_string = pack("U*", unpack("C*", $native_string));
+ $native_string = pack("W*", unpack("U*", $Unicode_string));
+ $Unicode_string = pack("U*", unpack("W*", $native_string));
If you have a sequence of bytes you B<know> is valid UTF-8,
but Perl doesn't know it yet, you can make Perl a believer, too:
use Encode 'decode_utf8';
$Unicode = decode_utf8($bytes);
+or:
+
+ $Unicode = pack("U0a*", $bytes);
+
You can convert well-formed UTF-8 to a sequence of bytes, but if
you just want to convert random binary data into UTF-8, you can't.
B<Any random collection of bytes isn't well-formed UTF-8>. You can
Unicode Consortium
- http://www.unicode.org/
+http://www.unicode.org/
=item *
Unicode FAQ
- http://www.unicode.org/unicode/faq/
+http://www.unicode.org/unicode/faq/
=item *
Unicode Glossary
- http://www.unicode.org/glossary/
+http://www.unicode.org/glossary/
=item *
Unicode Useful Resources
- http://www.unicode.org/unicode/onlinedat/resources.html
+http://www.unicode.org/unicode/onlinedat/resources.html
=item *
Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
- http://www.alanwood.net/unicode/
+http://www.alanwood.net/unicode/
=item *
UTF-8 and Unicode FAQ for Unix/Linux
- http://www.cl.cam.ac.uk/~mgk25/unicode.html
+http://www.cl.cam.ac.uk/~mgk25/unicode.html
=item *
Legacy Character Sets
- http://www.czyborra.com/
- http://www.eki.ee/letter/
+http://www.czyborra.com/
+http://www.eki.ee/letter/
=item *
$Config{installprivlib}/unicore
-in Perl 5.8.0 or newer, and
+in Perl 5.8.0 or newer, and
$Config{installprivlib}/unicode
=head1 SEE ALSO
-L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perlunicode>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
L<Unicode::UCD>