in the largest Chinese, Japanese, and Korean dictionaries are also
encoded. The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.
+Unicode 1.0 was released in October 1991, and 4.0 in April 2003.
A Unicode I<character> is an abstract entity. It is not bound to any
particular integer width, especially not to the C language C<char>.
I<code points>.
The Unicode standard prefers using hexadecimal notation for the code
-points. If numbers like C<0x0041> are unfamiliar to
-you, take a peek at a later section, L</"Hexadecimal Notation">.
-The Unicode standard uses the notation C<U+0041 LATIN CAPITAL LETTER A>,
-to give the hexadecimal code point and the normative name of
-the character.
+points. If numbers like C<0x0041> are unfamiliar to you, take a peek
+at a later section, L</"Hexadecimal Notation">. The Unicode standard
+uses the notation C<U+0041 LATIN CAPITAL LETTER A> to give the
+hexadecimal code point and the normative name of the character.
Unicode also defines various I<properties> for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation";
A common myth about Unicode is that it would be "16-bit", that is,
Unicode is only represented as C<0x10000> (or 65536) characters from
-C<0x0000> to C<0xFFFF>. B<This is untrue.> Since Unicode 2.0, Unicode
-has been defined all the way up to 21 bits (C<0x10FFFF>), and since
-Unicode 3.1, characters have been defined beyond C<0xFFFF>. The first
-C<0x10000> characters are called the I<Plane 0>, or the I<Basic
-Multilingual Plane> (BMP). With Unicode 3.1, 17 planes in all are
-defined--but nowhere near full of defined characters, yet.
+C<0x0000> to C<0xFFFF>. B<This is untrue.> Since Unicode 2.0 (July
+1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
+and since Unicode 3.1 (March 2001), characters have been defined
+beyond C<0xFFFF>. The first C<0x10000> characters are called the
+I<Plane 0>, or the I<Basic Multilingual Plane> (BMP). With Unicode
+3.1, 17 (yes, seventeen) planes in all were defined--but they are
+nowhere near full of defined characters, yet.
Another myth is that the 256-character blocks have something to
do with languages--that each block would define the characters used
For further information see L<Unicode::UCD>.
The Unicode code points are just abstract numbers. To input and
-output these abstract numbers, the numbers must be I<encoded> somehow.
-Unicode defines several I<character encoding forms>, of which I<UTF-8>
-is perhaps the most popular. UTF-8 is a variable length encoding that
-encodes Unicode characters as 1 to 6 bytes (only 4 with the currently
-defined characters). Other encodings include UTF-16 and UTF-32 and their
-big- and little-endian variants (UTF-8 is byte-order independent)
-The ISO/IEC 10646 defines the UCS-2 and UCS-4 encoding forms.
+output these abstract numbers, the numbers must be I<encoded> or
+I<serialised> somehow. Unicode defines several I<character encoding
+forms>, of which I<UTF-8> is perhaps the most popular. UTF-8 is a
+variable length encoding that encodes Unicode characters as 1 to 6
+bytes (only 4 with the currently defined characters). Other encodings
+include UTF-16 and UTF-32 and their big- and little-endian variants
+(UTF-8 is byte-order independent). The ISO/IEC 10646 defines the UCS-2
+and UCS-4 encoding forms.
For more information about encodings--for instance, to learn what
I<surrogates> and I<byte order marks> (BOMs) are--see L<perlunicode>.
to this sample program ensures that the output is completely UTF-8,
and removes the program's warning.
-If your locale environment variables (C<LANGUAGE>, C<LC_ALL>,
-C<LC_CTYPE>, C<LANG>) contain the strings 'UTF-8' or 'UTF8',
-regardless of case, then the default encoding of your STDIN, STDOUT,
-and STDERR and of B<any subsequent file open>, is UTF-8. Note that
-this means that Perl expects other software to work, too: if Perl has
-been led to believe that STDIN should be UTF-8, but then STDIN coming
-in from another command is not UTF-8, Perl will complain about the
-malformed UTF-8.
+You can enable automatic UTF-8-ification of your standard file
+handles, default C<open()> layer, and C<@ARGV> by using either
+the C<-C> command line switch or the C<PERL_UNICODE> environment
+variable; see L<perlrun> for the documentation of the C<-C> switch.
+
+Note that this means that Perl expects other software to work, too:
+if Perl has been led to believe that STDIN should be UTF-8, but then
+STDIN coming in from another command is not UTF-8, Perl will complain
+about the malformed UTF-8.
All features that combine Unicode and I/O also require using the new
PerlIO feature. Almost all Perl 5.8 platforms do use PerlIO, though:
With the C<open> pragma you can use the C<:locale> layer
- $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R';
+ BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
# the :locale will probe the locale environment variables like LC_ALL
use open OUT => ':locale'; # russki parusski
open(O, ">koi8");
explicitly opening also the F<file> for input as UTF-8.
B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
-Perl has been built with the new PerlIO feature.
+Perl has been built with the new PerlIO feature (which is the default
+on most systems).
=head2 Displaying Unicode As Text
sprintf("\\x{%04X}", $_) : # \x{...}
chr($_) =~ /[[:cntrl:]]/ ? # else if control character ...
sprintf("\\x%02X", $_) : # \x..
- chr($_) # else as themselves
+ quotemeta(chr($_)) # else quoted or as themselves
} unpack("U*", $_[0])); # unpack Unicode characters
}
nice_string("foo\x{100}bar\n")
-returns:
+returns the string
+
+ 'foo\x{0100}bar\x0A'
- "foo\x{0100}bar\x0A"
+which is ready to be printed.
=head2 Special Cases
That shows the UTF8 flag in FLAGS and both the UTF-8 bytes
and Unicode characters in C<PV>. See also later in this document
-the discussion about the C<is_utf8> function of the C<Encode> module.
+the discussion about the C<utf8::is_utf8()> function.
=back
Okay, if you insist:
- use Encode 'is_utf8';
- print is_utf8($string) ? 1 : 0, "\n";
+ print utf8::is_utf8($string) ? 1 : 0, "\n";
But note that this doesn't mean that any of the characters in the
string are necessarily UTF-8 encoded, or that any of the characters have
$b = "\x{100}";
print "$a = $b\n";
-the output string will be UTF-8-encoded C<ab\x80c\x{100}\n>, but note
-that C<$a> will stay byte-encoded.
+the output string will be UTF-8-encoded C<ab\x80c = \x{100}\n>, but
+C<$a> will stay byte-encoded.
Sometimes you might really need to know the byte length of a string
instead of the character length. For that use either the
How Does Unicode Work With Traditional Locales?
In Perl, not very well. Avoid using locales through the C<locale>
-pragma. Use only one or the other.
+pragma. Use only one or the other. But see L<perlrun> for the
+description of the C<-C> switch and its environment counterpart,
+C<$ENV{PERL_UNICODE}>, to see how to enable various Unicode features,
+for example by using locale settings.
=back
=head1 SEE ALSO
L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
-L<perlretut>, L<Unicode::Collate>, L<Unicode::Normalize>, L<Unicode::UCD>
+L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
+L<Unicode::UCD>
=head1 ACKNOWLEDGMENTS
=head1 AUTHOR, COPYRIGHT, AND LICENSE
-Copyright 2001-2002 Jarkko Hietaniemi <jhi@iki.fi>
+Copyright 2001-2002 Jarkko Hietaniemi E<lt>jhi@iki.fiE<gt>
This document may be distributed under the same terms as Perl itself.