Naturally, C<ord()> will do the reverse: turn a character to a code point.
-Note that C<\x..>, C<\x{..}> and C<chr(...)> for arguments less than
-0x100 (decimal 256) will generate an eight-bit character for backward
-compatibility with older Perls. For arguments of 0x100 or more,
-Unicode will always be produced. If you want UTF-8 always, use
-C<pack("U", ...)> instead of C<\x..>, C<\x{..}>, or C<chr()>.
+Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>
+and C<chr(...)> for arguments less than 0x100 (decimal 256) will
+generate an eight-bit character for backward compatibility with older
+Perls. For arguments of 0x100 or more, Unicode will always be
+produced. If you want UTF-8 always, use C<pack("U", ...)> instead of
+C<\x..>, C<\x{...}>, or C<chr()>.
You can also use the C<charnames> pragma to invoke characters
by name in doublequoted strings:
my $georgian_an = pack("U", 0x10a0);
+Note that both C<\x{...}> and C<\N{...}> are compile-time string
+constants: you cannot use variables in them. If you want similar
+run-time functionality, use C<chr()> and C<charnames::vianame()>.
+
=head2 Handling Unicode
Handling Unicode is for the most part transparent: just use the
Normally writing out Unicode data
- print chr(0x100), "\n";
+ print FH chr(0x100), "\n";
+
+will print out the raw UTF-8 bytes, but you will get a warning
+out of that if you use C<-w> or C<use warnings>. To avoid the
+warning, open the stream explicitly in UTF-8:
+
+ open FH, ">:utf8", "file";
-will print out the raw UTF-8 bytes.
+and on already open streams use C<binmode()>:
-But reading in correctly formed UTF-8 data will not magically turn
+ binmode(STDOUT, ":utf8");
+
+Reading in correctly formed UTF-8 data will not magically turn
the data into Unicode in Perl's eyes.
You can use either the C<':utf8'> I/O discipline when opening files
The I/O disciplines can also be specified more flexibly with
the C<open> pragma; see L<open>:
- use open ':utf8'; # input and output will be UTF-8
- open X, ">utf8";
- print X chr(0x100), "\n"; # this would have been UTF-8 without the pragma
+ use open ':utf8'; # input and output default discipline will be UTF-8
+ open X, ">file";
+ print X chr(0x100), "\n";
close X;
- open Y, "<utf8";
+ open Y, "<file";
printf "%#x\n", ord(<Y>); # this should print 0x100
close Y;
close F;
If you run this code twice, the contents of the F<file> will be twice
-UTF-8 encoded. A C<use open ':utf8'> would have avoided the bug.
+UTF-8 encoded. A C<use open ':utf8'> would have avoided the bug, as
+would explicitly opening the F<file> for input as UTF-8.
=head2 Special Cases
The question of string equivalence turns somewhat complicated
in Unicode: what do you mean by equal?
- Is C<LATIN CAPITAL LETTER A WITH ACUTE> equal to
- C<LATIN CAPITAL LETTER A>?
+(Is C<LATIN CAPITAL LETTER A WITH ACUTE> equal to
+C<LATIN CAPITAL LETTER A>?)
The short answer is that by default Perl compares equivalence
(C<eq>, C<ne>) based only on code points of the characters.
People like to see their strings nicely sorted, or as Unicode
parlance goes, collated. But again, what do you mean by collate?
- Does C<LATIN CAPITAL LETTER A WITH ACUTE> come before or after
- C<LATIN CAPITAL LETTER A WITH GRAVE>?
+(Does C<LATIN CAPITAL LETTER A WITH ACUTE> come before or after
+C<LATIN CAPITAL LETTER A WITH GRAVE>?)
The short answer is that by default Perl compares strings (C<lt>,
C<le>, C<cmp>, C<ge>, C<gt>) based only on the code points of the
=back
+=head1 UNICODE IN OLDER PERLS
+
+If you cannot upgrade your Perl to 5.8.0 or later, you can still
+do some Unicode processing by using the modules C<Unicode::String>,
+C<Unicode::Map8>, and C<Unicode::Map>, available from CPAN.
+If you have GNU recode installed, you can also use the
+Perl front end C<Convert::Recode> for character conversions.
+
=head1 SEE ALSO
L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,