For example,
- perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
+ perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
produces a fairly useless mixture of native bytes and UTF-8, as well
as a warning:
Wide character in print at ...
-To output UTF-8, use the C<:utf8> output layer. Prepending
+To output UTF-8, use the C<:encoding> or C<:utf8> output layer. Prepending
binmode(STDOUT, ":utf8");
my $chars = pack("U0W*", 0x80, 0x42);
-Likewise, you can stop such UTF-8 interpretation by using the special
+Likewise, you can stop such UTF-8 interpretation by using the special
C<"C0"> prefix.
=head2 Handling Unicode
When you combine legacy data and Unicode the legacy data needs
to be upgraded to Unicode. Normally ISO 8859-1 (or EBCDIC, if
-applicable) is assumed.
+applicable) is assumed.
The C<Encode> module knows about many encodings and has interfaces
for doing conversions between those encodings:
The matching of encoding names is loose: case does not matter, and
many encodings have several aliases. Note that the C<:utf8> layer
must always be specified exactly like that; it is I<not> subject to
-the loose matching of encoding names.
+the loose matching of encoding names. Also note that C<:utf8> is unsafe for
+input, because it accepts the data without validating that it is actually
+well-formed UTF-8.
See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
L<Encode::PerlIO> for the C<:encoding()> layer, and
Unicode in Perl's eyes. To do that, specify the appropriate
layer when opening files
- open(my $fh,'<:utf8', 'anything');
+ open(my $fh,'<:encoding(utf8)', 'anything');
my $line_of_unicode = <$fh>;
open(my $fh,'<:encoding(Big5)', 'anything');
The I/O layers can also be specified more flexibly with
the C<open> pragma. See L<open>, or look at the following example.
- use open ':utf8'; # input and output default layer will be UTF-8
+ use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
open X, ">file";
print X chr(0x100), "\n";
close X;
printf "%#x\n", ord(<I>), "\n"; # this should print 0xc1
close I;
-or you can also use the C<':encoding(...)'> layer
-
- open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
- my $line_of_unicode = <$epic>;
-
These methods install a transparent filter on the I/O stream that
converts data from the specified encoding when it is read in from the
stream. The result is always Unicode.
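As a minimal sketch of that filtering behaviour, the snippet below reads UTF-8 bytes through an C<:encoding(UTF-8)> layer and gets back a single character. The in-memory filehandle and the variable names are assumptions made for illustration; a real program would open a file on disk instead.

```perl
use strict;
use warnings;

# Two bytes that form the UTF-8 encoding of U+0100, followed by a newline.
my $bytes = "\xC4\x80\n";

# Open the scalar as an in-memory file and decode it while reading.
open(my $fh, '<:encoding(UTF-8)', \$bytes) or die "cannot open: $!";
my $line = <$fh>;
close $fh;
chomp $line;

# The two input bytes arrive as one character with code point 0x100.
printf "%d %#x\n", length($line), ord($line);   # prints "1 0x100"
```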
local $/; ## read in the whole file of 8-bit characters
$t = <F>;
close F;
- open F, ">:utf8", "file";
+ open F, ">:encoding(utf8)", "file";
print F $t; ## convert to UTF-8 on output
close F;
If you run this code twice, the contents of the F<file> will be twice
-UTF-8 encoded. A C<use open ':utf8'> would have avoided the bug, or
-explicitly opening also the F<file> for input as UTF-8.
+UTF-8 encoded. A C<use open ':encoding(utf8)'> would have avoided the
+bug, as would explicitly opening the F<file> for input as UTF-8.
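The double-encoding effect can be reproduced without any files. The following is a sketch only, using C<Encode::encode> in place of the output layer; the variable names and the C<hex_dump> helper are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_dump {
    my ($str) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $str;
}

my $char  = chr(0xDF);               # one character, U+00DF
my $once  = encode('UTF-8', $char);  # correct: two bytes
# Encoding the bytes again treats each byte as a Latin-1 character.
my $twice = encode('UTF-8', $once);

print hex_dump($once),  "\n";        # prints "C3 9F"
print hex_dump($twice), "\n";        # prints "C3 83 C2 9F"
```

Decoding the input first, which is what an input layer does, is what keeps the second pass from re-encoding bytes that are already UTF-8.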
B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
Perl has been built with the new PerlIO feature (which is the default
The long answer is that you need to consider character normalization
and casing issues: see L<Unicode::Normalize>, Unicode Technical
Reports #15 and #21, I<Unicode Normalization Forms> and I<Case
-Mappings>, http://www.unicode.org/unicode/reports/tr15/ and
-http://www.unicode.org/unicode/reports/tr21/
+Mappings>, http://www.unicode.org/unicode/reports/tr15/ and
+http://www.unicode.org/unicode/reports/tr21/
As of Perl 5.8.0, the "Full" case-folding of I<Case
Mappings/SpecialCasing> is implemented.
For example,
use Encode 'decode_utf8';
- if (decode_utf8($string_of_bytes_that_I_think_is_utf8)) {
- # valid
+ if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
+ # $string is valid utf8
} else {
- # invalid
+ # $string is not valid utf8
}
Or use C<unpack> to try decoding it:
use warnings;
@chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
-If invalid, a C<Malformed UTF-8 character (byte 0x##) in unpack>
-warning is produced. The "C0" means
-"process the string character per character". Without that the
-C<unpack("U*", ...)> would work in C<U0> mode (the default if the format
-string starts with C<U>) and it would return the bytes making up the UTF-8
+If invalid, a C<Malformed UTF-8 character> warning is produced. The "C0" means
+"process the string character per character". Without that, the
+C<unpack("U*", ...)> would work in C<U0> mode (the default if the format
+string starts with C<U>) and it would return the bytes making up the UTF-8
encoding of the target string, something that will always work.
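The C<Encode>-based check can be wrapped in a small predicate. This is a sketch under stated assumptions: C<is_valid_utf8> is a hypothetical helper name, and it uses the strict C<UTF-8> encoding name rather than C<decode_utf8>. The argument is copied first because C<decode> may modify its source string when a CHECK value is given.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() may consume its argument when CHECK is set
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

print is_valid_utf8("\xC4\x80")     ? "valid\n" : "invalid\n";  # prints "valid"
print is_valid_utf8("\xC4\x80\xFF") ? "valid\n" : "invalid\n";  # prints "invalid"
```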
=item *
or:
$Unicode = pack("U0a*", $bytes);
-
+
You can convert well-formed UTF-8 to a sequence of bytes, but if
you just want to convert random binary data into UTF-8, you can't.
B<Any random collection of bytes isn't well-formed UTF-8>. You can
Unicode Consortium
- http://www.unicode.org/
+http://www.unicode.org/
=item *
Unicode FAQ
- http://www.unicode.org/unicode/faq/
+http://www.unicode.org/unicode/faq/
=item *
Unicode Glossary
- http://www.unicode.org/glossary/
+http://www.unicode.org/glossary/
=item *
Unicode Useful Resources
- http://www.unicode.org/unicode/onlinedat/resources.html
+http://www.unicode.org/unicode/onlinedat/resources.html
=item *
Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
- http://www.alanwood.net/unicode/
+http://www.alanwood.net/unicode/
=item *
UTF-8 and Unicode FAQ for Unix/Linux
- http://www.cl.cam.ac.uk/~mgk25/unicode.html
+http://www.cl.cam.ac.uk/~mgk25/unicode.html
=item *
Legacy Character Sets
- http://www.czyborra.com/
- http://www.eki.ee/letter/
+http://www.czyborra.com/
+http://www.eki.ee/letter/
=item *
$Config{installprivlib}/unicore
-in Perl 5.8.0 or newer, and
+in Perl 5.8.0 or newer, and
$Config{installprivlib}/unicode