pod/perltrap.pod Perl traps for the unwary
pod/perlunicode.pod Perl Unicode support
pod/perluniintro.pod Perl Unicode introduction
+pod/perlunifaq.pod Perl Unicode FAQ
pod/perlunitut.pod Perl Unicode tutorial
pod/perlutil.pod utilities packaged with the Perl distribution
pod/perlvar.pod Perl predefined variables
$octets = encode("iso-8859-1", $string);
B<CAVEAT>: When you run C<$octets = encode("utf8", $string)>, then $octets
-B<may not be equal to> $string. Though they both contain the same data, the utf8 flag
-for $octets is B<always> off. When you encode anything, utf8 flag of
+B<may not be equal to> $string. Though they both contain the same data, the UTF8 flag
+for $octets is B<always> off. When you encode anything, the UTF8 flag of
the result is always off, even when it contains completely valid utf8
-string. See L</"The UTF-8 flag"> below.
+string. See L</"The UTF8 flag"> below.
If the $string is C<undef> then C<undef> is returned.
B<CAVEAT>: When you run C<$string = decode("utf8", $octets)>, then $string
B<may not be equal to> $octets. Though they both contain the same data,
-the utf8 flag for $string is on unless $octets entirely consists of
-ASCII data (or EBCDIC on EBCDIC machines). See L</"The UTF-8 flag">
+the UTF8 flag for $string is on unless $octets entirely consists of
+ASCII data (or EBCDIC on EBCDIC machines). See L</"The UTF8 flag">
below.
If the $string is C<undef> then C<undef> is returned.
$data = decode("iso-8859-1", $data); #2
Both #1 and #2 make $data consist of a completely valid UTF-8 string
-but only #2 turns utf8 flag on. #1 is equivalent to
+but only #2 turns the UTF8 flag on. #1 is equivalent to
$data = encode("utf8", decode("iso-8859-1", $data));
-See L</"The UTF-8 flag"> below.
+See L</"The UTF8 flag"> below.
=item $octets = encode_utf8($string);
See L<Encode::Encoding> for more details.
-=head1 The UTF-8 flag
+=head1 The UTF8 flag
-Before the introduction of utf8 support in perl, The C<eq> operator
+Before the introduction of Unicode support in perl, the C<eq> operator
just compared the strings represented by two scalars. Beginning with
-perl 5.8, C<eq> compares two strings with simultaneous consideration
-of I<the utf8 flag>. To explain why we made it so, I will quote page
-402 of C<Programming Perl, 3rd ed.>
+perl 5.8, C<eq> compares two strings with simultaneous consideration of
+I<the UTF8 flag>. To explain why we made it so, I will quote page 402 of
+C<Programming Perl, 3rd ed.>
=over 2
Back when C<Programming Perl, 3rd ed.> was written, not even Perl 5.6.0
was born and many features documented in the book remained
unimplemented for a long time. Perl 5.8 corrected this and the introduction
-of the UTF-8 flag is one of them. You can think of this perl notion as of a
-byte-oriented mode (utf8 flag off) and a character-oriented mode (utf8
+of the UTF8 flag is one of them. You can think of this perl notion as of a
+byte-oriented mode (UTF8 flag off) and a character-oriented mode (UTF8
flag on).
-Here is how Encode takes care of the utf8 flag.
+Here is how Encode takes care of the UTF8 flag.
=over 2
=item *
-When you encode, the resulting utf8 flag is always off.
+When you encode, the resulting UTF8 flag is always off.
=item *
-When you decode, the resulting utf8 flag is on unless you can
+When you decode, the resulting UTF8 flag is on unless you can
unambiguously represent data. Here is the definition of
dis-ambiguity.
After C<$utf8 = decode('foo', $octet);>,
- When $octet is... The utf8 flag in $utf8 is
+ When $octet is... The UTF8 flag in $utf8 is
---------------------------------------------
In ASCII only (or EBCDIC only) OFF
In ISO-8859-1 ON
Goal #1. And with Encode Goal #2 is assumed but you still have to be
careful in such cases mentioned in B<CAVEAT> paragraphs.
-This utf8 flag is not visible in perl scripts, exactly for the same
+This UTF8 flag is not visible in perl scripts, exactly for the same
reason you cannot (or you I<don't have to>) see if a scalar contains a
string, integer, or floating point number. But you can still peek
and poke these if you will. See the section below.
=item is_utf8(STRING [, CHECK])
-[INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
+[INTERNAL] Tests whether the UTF8 flag is turned on in the STRING.
If CHECK is true, also checks the data in STRING for being well-formed
UTF-8. Returns true if successful, false otherwise.
=item _utf8_on(STRING)
-[INTERNAL] Turns on the UTF-8 flag in STRING. The data in STRING is
+[INTERNAL] Turns on the UTF8 flag in STRING. The data in STRING is
B<not> checked for being well-formed UTF-8. Do not use unless you
B<know> that the STRING is well-formed UTF-8. Returns the previous
-state of the UTF-8 flag (so please don't treat the return value as
+state of the UTF8 flag (so please don't treat the return value as
indicating success or failure), or C<undef> if STRING is not a string.
=item _utf8_off(STRING)
-[INTERNAL] Turns off the UTF-8 flag in STRING. Do not use frivolously.
-Returns the previous state of the UTF-8 flag (so please don't treat the
+[INTERNAL] Turns off the UTF8 flag in STRING. Do not use frivolously.
+Returns the previous state of the UTF8 flag (so please don't treat the
return value as indicating success or failure), or C<undef> if STRING is
not a string.
=back
-=head1 UTF-8 vs. utf8
+=head1 UTF-8 vs. utf8 vs. UTF8
....We now view strings not as sequences of bytes, but as sequences
of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
find_encoding("utf_8")->name # ditto. "_" are treated as "-"
find_encoding("UTF8")->name # is 'utf8'.
+The UTF8 flag is internally called UTF8, without a hyphen. It indicates
+whether a string is internally encoded as utf8, also without a hyphen.
=head1 SEE ALSO
C<use encoding 'utf8';>, it will print C<4> instead, since C<$string>
is three octets when interpreted as Latin-1.
+=head2 Side effects
+
+If the C<encoding> pragma is in scope then the lengths returned are
+calculated from the length of C<$/> in Unicode characters, which is not
+always the same as the length of C<$/> in the native encoding.
+
+This pragma affects utf8::upgrade, but not utf8::downgrade.
+
=head1 FEATURES THAT REQUIRE 5.8.1
Some of the features offered by this pragma requires perl 5.8.1. Most
=item :utf8
-Declares that the stream accepts perl's internal encoding of
+Declares that the stream accepts perl's I<internal> encoding of
characters. (Which really is UTF-8 on ASCII machines, but is
UTF-EBCDIC on EBCDIC machines.) This allows any character perl can
represent to be read from or written to the stream. The UTF-X encoding
platforms). The C<no utf8> pragma tells Perl to switch back to treating
the source text as literal bytes in the current lexical scope.
-This pragma is primarily a compatibility device. Perl versions
-earlier than 5.6 allowed arbitrary bytes in source code, whereas
-in future we would like to standardize on the UTF-8 encoding for
-source text.
-
B<Do not use this pragma for anything else than telling Perl that your
script is written in UTF-8.> The utility functions described below are
-useful for their own purposes, but they are not really part of the
-"pragmatic" effect.
+directly usable without C<use utf8;>.
+
+Because it is not possible to reliably tell UTF-8 from native 8 bit
+encodings, you need either a Byte Order Mark at the beginning of your
+source code, or C<use utf8;>, to instruct perl.
-Until UTF-8 becomes the default format for source text, either this
-pragma or the L<encoding> pragma should be used to recognize UTF-8
-in the source. When UTF-8 becomes the standard source format, this
-pragma will effectively become a no-op. For convenience in what
-follows the term I<UTF-X> is used to refer to UTF-8 on ASCII and ISO
-Latin based platforms and UTF-EBCDIC on EBCDIC based platforms.
+When UTF-8 becomes the standard source format, this pragma will
+effectively become a no-op. For convenience in what follows the term
+I<UTF-X> is used to refer to UTF-8 on ASCII and ISO Latin based
+platforms and UTF-EBCDIC on EBCDIC based platforms.
See also the effects of the C<-C> switch and its cousin, the
C<$ENV{PERL_UNICODE}>, in L<perlrun>.
this pragma until the end the block (or file, if at top level) by
C<no utf8;>.
-If you want to automatically upgrade your 8-bit legacy bytes to Unicode,
-use the L<encoding> pragma instead of this pragma. For example, if
-you want to implicitly upgrade your ISO 8859-1 (Latin-1) bytes to Unicode
-as used in e.g. C<chr()> and C<\x{...}>, try this:
-
- use encoding "latin-1";
- my $c = chr(0xc4);
- my $x = "\x{c5}";
-
-In case you are wondering: C<use encoding 'utf8';> is mostly the same as
-C<use utf8;>, except that C<use encoding> marks all string literals in the
-source code as Unicode, regardless of whether they contain any high-bit bytes.
-Moreover, C<use encoding> installs IO layers on C<STDIN> and C<STDOUT> to work
-with Unicode strings; see L<encoding> for details.
-
=head2 Utility functions
The following functions are defined in the C<utf8::> package by the
=item * $num_octets = utf8::upgrade($string)
-Converts in-place the octet sequence in the native encoding
+Converts in-place the internal octet sequence in the native encoding
(Latin-1 or EBCDIC) to the equivalent character sequence in I<UTF-X>.
-I<$string> already encoded as characters does no harm.
-Returns the number of octets necessary to represent the string as I<UTF-X>.
-Can be used to make sure that the UTF-8 flag is on,
-so that C<\w> or C<lc()> work as Unicode on strings
-containing characters in the range 0x80-0xFF (on ASCII and
-derivatives).
+I<$string> already encoded as characters does no harm. Returns the
+number of octets necessary to represent the string as I<UTF-X>. Can be
+used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
+work as Unicode on strings containing characters in the range 0x80-0xFF
+(on ASCII and derivatives).
B<Note that this function does not handle arbitrary encodings.>
-Therefore I<Encode.pm> is recommended for the general purposes.
-
-Affected by the encoding pragma.
+Therefore Encode is recommended for general purposes; see also
+L<Encode>.
=item * $success = utf8::downgrade($string[, FAIL_OK])
-Converts in-place the character sequence in I<UTF-X>
-to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC).
-I<$string> already encoded as octets does no harm.
-Returns true on success. On failure dies or, if the value of
-C<FAIL_OK> is true, returns false.
-Can be used to make sure that the UTF-8 flag is off,
-e.g. when you want to make sure that the substr() or length() function
-works with the usually faster byte algorithm.
+Converts in-place the internal octet sequence in I<UTF-X> to the
+equivalent octet sequence in the native encoding (Latin-1 or EBCDIC).
+I<$string> already encoded as native 8 bit does no harm. Can be used to
+make sure that the UTF-8 flag is off, e.g. when you want to make sure
+that the substr() or length() function works with the usually faster
+byte algorithm.
-B<Note that this function does not handle arbitrary encodings.>
-Therefore I<Encode.pm> is recommended for the general purposes.
+Fails if the original I<UTF-X> sequence cannot be represented in the
+native 8 bit encoding. On failure dies or, if the value of C<FAIL_OK> is
+true, returns false.
-B<Not> affected by the encoding pragma.
+Returns true on success.
+
+B<Note that this function does not handle arbitrary encodings.>
+Therefore Encode is recommended for general purposes; see also
+L<Encode>.
-B<NOTE:> this function is experimental and may change
-or be removed without notice.
+B<NOTE:> this function is experimental and may change or be removed
+without notice.
=item * utf8::encode($string)
-Converts in-place the character sequence to the corresponding octet sequence
-in I<UTF-X>. The UTF-8 flag is turned off. Returns nothing.
+Converts in-place the character sequence to the corresponding octet
+sequence in I<UTF-X>. The UTF8 flag is turned off, so that after this
+operation, the string is a byte string. Returns nothing.
B<Note that this function does not handle arbitrary encodings.>
-Therefore I<Encode.pm> is recommended for the general purposes.
+Therefore Encode is recommended for general purposes; see also
+L<Encode>.
-=item * utf8::decode($string)
+=item * $success = utf8::decode($string)
-Attempts to convert in-place the octet sequence in I<UTF-X>
-to the corresponding character sequence. The UTF-8 flag is turned on
-only if the source string contains multiple-byte I<UTF-X> characters.
-If I<$string> is invalid as I<UTF-X>, returns false; otherwise returns true.
+Attempts to convert in-place the octet sequence in I<UTF-X> to the
+corresponding character sequence. The UTF-8 flag is turned on only if
+the source string contains multiple-byte I<UTF-X> characters. If
+I<$string> is invalid as I<UTF-X>, returns false; otherwise returns
+true.
B<Note that this function does not handle arbitrary encodings.>
-Therefore I<Encode.pm> is recommended for the general purposes.
+Therefore Encode is recommended for general purposes; see also
+L<Encode>.
-B<NOTE:> this function is experimental and may change
-or be removed without notice.
+B<NOTE:> this function is experimental and may change or be removed
+without notice.
=item * $flag = utf8::is_utf8(STRING)
-(Since Perl 5.8.1) Test whether STRING is in UTF-8. Functionally
-the same as Encode::is_utf8().
+(Since Perl 5.8.1) Test whether STRING is in UTF-8 internally.
+Functionally the same as Encode::is_utf8().
=item * $flag = utf8::valid(STRING)
=head1 SEE ALSO
-L<perluniintro>, L<encoding>, L<perlrun>, L<bytes>, L<perlunicode>
+L<perlunitut>, L<perluniintro>, L<perlrun>, L<bytes>, L<perlunicode>
=cut
Copies a stringified representation of the source SV into the
destination SV. Automatically performs any necessary mg_get and
coercion of numeric values into strings. Guaranteed to preserve
-UTF-8 flag even from overloaded objects. Similar in nature to
+UTF8 flag even from overloaded objects. Similar in nature to
sv_2pv[_flags] but operates directly on an SV instead of just the
string. Mostly uses sv_2pv_flags to do its work, except when that
would lose the UTF-8'ness of the PV.
comparison operators, C<cmp>, C<gt>, C<lt> etc. If there are two or
more dots in the literal, the leading C<v> may be omitted.
- print v9786; # prints UTF-8 encoded SMILEY, "\x{263a}"
+ print v9786; # prints SMILEY, "\x{263a}"
print v102.111.111; # prints "foo"
print 102.111.111; # same
=item Malformed UTF-8 character (%s)
-(S utf8) (F) Perl detected something that didn't comply with UTF-8
-encoding rules.
+(S utf8) (F) Perl detected a string that didn't comply with UTF-8
+encoding rules, even though it had the UTF8 flag on.
-One possible cause is that you read in data that you thought to be in
-UTF-8 but it wasn't (it was for example legacy 8-bit data). Another
-possibility is careless use of utf8::upgrade().
+One possible cause is that you set the UTF8 flag yourself for data that
+you thought to be in UTF-8 but it wasn't (it was for example legacy
+8-bit data). To guard against this, you can use Encode::decode_utf8.
+
+If you use the C<:encoding(UTF-8)> PerlIO layer for input, invalid byte
+sequences are handled gracefully, but if you use C<:utf8>, the flag is
+set without validating the data, possibly resulting in this error
+message.
+
+See also L<Encode/"Handling Malformed Data">.
=item Malformed UTF-16 surrogate
If you chomp a list, each element is chomped, and the total number of
characters removed is returned.
-If the C<encoding> pragma is in scope then the lengths returned are
-calculated from the length of C<$/> in Unicode characters, which is not
-always the same as the length of C<$/> in the native encoding.
-
Note that parentheses are necessary when you're chomping anything
that is not a simple variable. This is because C<chomp $cwd = `pwd`;>
is interpreted as C<(chomp $cwd) = `pwd`;>, rather than as
Returns the character represented by that NUMBER in the character set.
For example, C<chr(65)> is C<"A"> in either ASCII or Unicode, and
-chr(0x263a) is a Unicode smiley face. Note that characters from 128
-to 255 (inclusive) are by default not encoded in UTF-8 Unicode for
-backward compatibility reasons (but see L<encoding>).
+chr(0x263a) is a Unicode smiley face.
Negative values give the Unicode replacement character (chr(0xfffd)),
except under the L<bytes> pragma, where low eight bits of the value
For the reverse, use L</ord>.
-Note that under the C<bytes> pragma the NUMBER is masked to
-the low eight bits.
+Note that characters from 128 to 255 (inclusive) are by default
+internally not encoded as UTF-8 for backward compatibility reasons.
-See L<perlunicode> and L<encoding> for more about Unicode.
+See L<perlunicode> for more about Unicode.
=item chroot FILENAME
X<chroot> X<root>
Note the I<characters>: if the EXPR is in Unicode, you will get the
number of characters, not the number of bytes. To get the length
-in bytes, use C<do { use bytes; length(EXPR) }>, see L<bytes>.
+of the internal string in bytes, use C<bytes::length(EXPR)>, see
+L<bytes>. Note that the internal encoding is variable, and the number
+of bytes is usually meaningless. To get the number of bytes that the
+string would have when encoded as UTF-8, use
+C<length(Encode::encode_utf8(EXPR))>.
=item link OLDFILE,NEWFILE
X<link>
that affect how the input and output are processed (see L<open> and
L<PerlIO> for more details). For example
- open(FH, "<:utf8", "file")
+ open(FH, "<:encoding(UTF-8)", "file")
will open the UTF-8 encoded file containing Unicode characters,
see L<perluniintro>. Note that if layers are specified in the
uses C<$_>.
For the reverse, see L</chr>.
-See L<perlunicode> and L<encoding> for more about Unicode.
+See L<perlunicode> for more about Unicode.
=item our EXPR
X<our> X<global>
extend the string with sufficiently many zero bytes. It is an error
to try to write off the beginning of the string (i.e. negative OFFSET).
-The string should not contain any character with the value > 255 (which
-can only happen if you're using UTF-8 encoding). If it does, it will be
-treated as something that is not UTF-8 encoded. When the C<vec> was
-assigned to, other parts of your program will also no longer consider the
-string to be UTF-8 encoded. In other words, if you do have such characters
-in your string, vec() will operate on the actual byte string, and not the
-conceptual character string.
+If the string happens to be encoded as UTF-8 internally (and thus has
+the UTF8 flag set), this is ignored by C<vec>, and it operates on the
+internal byte string, not the conceptual character string, even if you
+only have characters with values less than 256.
Strings created with C<vec> can also be manipulated with the logical
operators C<|>, C<&>, C<^>, and C<~>. These operators will assume a bit
produced a new character set containing all the characters you can
possibly think of and more. There are several ways of representing these
characters, and the one Perl uses is called UTF-8. UTF-8 uses
-a variable number of bytes to represent a character, instead of just
-one. You can learn more about Unicode at http://www.unicode.org/
+a variable number of bytes to represent a character. You can learn more
+about Unicode and Perl's Unicode model in L<perlunicode>.
=head2 How can I recognise a UTF-8 string?
has that byte sequence as well. So you can't tell just by looking - this
is what makes Unicode input an interesting problem.
-The API function C<is_utf8_string> can help; it'll tell you if a string
-contains only valid UTF-8 characters. However, it can't do the work for
-you. On a character-by-character basis, C<is_utf8_char> will tell you
-whether the current character in a string is valid UTF-8.
+In general, you either have to know what you're dealing with, or you
+have to guess. The API function C<is_utf8_string> can help; it'll tell
+you if a string contains only valid UTF-8 characters. However, it can't
+do the work for you. On a character-by-character basis, C<is_utf8_char>
+will tell you whether the current character in a string is valid UTF-8.
=head2 How does UTF-8 represent Unicode characters?
As mentioned above, UTF-8 uses a variable number of bytes to store a
-character. Characters with values 1...128 are stored in one byte, just
-like good ol' ASCII. Character 129 is stored as C<v194.129>; this
+character. Characters with values 0...127 are stored in one byte, just
+like good ol' ASCII. Character 128 is stored as C<v194.128>; this
continues up to character 191, which is C<v194.191>. Now we've run out of
bits (191 is binary C<10111111>) so we move on; 192 is C<v195.128>. And
so it goes on, moving to three bytes at character 2048.
=head2 How does Perl store UTF-8 strings?
Currently, Perl deals with Unicode strings and non-Unicode strings
-slightly differently. If a string has been identified as being UTF-8
-encoded, Perl will set a flag in the SV, C<SVf_UTF8>. You can check and
-manipulate this flag with the following macros:
+slightly differently. A flag in the SV, C<SVf_UTF8>, indicates that the
+string is internally encoded as UTF-8. Without it, the byte value is the
+codepoint number and vice versa (in other words, the string is encoded
+as iso-8859-1). You can check and manipulate this flag with the
+following macros:
SvUTF8(sv)
SvUTF8_on(sv)
undesirable results.
The problem comes when you have, for instance, a string that isn't
-flagged is UTF-8, and contains a byte sequence that could be UTF-8 -
+flagged as UTF-8, and contains a byte sequence that could be UTF-8 -
especially when combining non-UTF-8 and UTF-8 strings.
Never forget that the C<SVf_UTF8> flag is separate to the PV value; you
The C<char*> string does not tell you the whole story, and you can't
copy or reconstruct an SV just by copying the string value. Check if the
-old SV has the UTF-8 flag set, and act accordingly:
+old SV has the UTF8 flag set, and act accordingly:
p = SvPV(sv, len);
frobnicate(p);
appropriately.
Since just passing an SV to an XS function and copying the data of
-the SV is not enough to copy the UTF-8 flags, even less right is just
+the SV is not enough to copy the UTF8 flags, even less right is just
passing a C<char *> to an XS function.
=head2 How do I convert a string to UTF-8?
-If you're mixing UTF-8 and non-UTF-8 strings, you might find it necessary
-to upgrade one of the strings to UTF-8. If you've got an SV, the easiest
-way to do this is:
+If you're mixing UTF-8 and non-UTF-8 strings, it is necessary to upgrade
+one of the strings to UTF-8. If you've got an SV, the easiest way to do
+this is:
sv_utf8_upgrade(sv);
If you do this in a binary operator, you will actually change one of the
strings that came into the operator, and, while it shouldn't be noticeable
-by the end user, it can cause problems.
+by the end user, it can cause problems in deficient code.
Instead, C<bytes_to_utf8> will give you a UTF-8-encoded B<copy> of its
string argument. This is useful for having the data available for
point of view) characters in a single byte while encoding the rarer
ones in three or more bytes.
-So what has this got to do with C<pack>? Well, if you want to convert
-between a Unicode number and its UTF-8 representation you can do so by
-using template code C<U>. As an example, let's produce the UTF-8
-representation of the Euro currency symbol (code number 0x20AC):
+Perl uses UTF-8, internally, for most Unicode strings.
+
+So what has this got to do with C<pack>? Well, if you want to compose a
+Unicode string (that is internally encoded as UTF-8), you can do so by
+using template code C<U>. As an example, let's produce the Euro currency
+symbol (code number 0x20AC):
$UTF8{Euro} = pack( 'U', 0x20AC );
+ # Equivalent to: $UTF8{Euro} = "\x{20ac}";
-Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes: "\xe2\x82\xac". The
-round trip can be completed with C<unpack>:
+Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes:
+"\xe2\x82\xac". However, it contains only 1 character, number 0x20AC.
+The round trip can be completed with C<unpack>:
$Unicode{Euro} = unpack( 'U', $UTF8{Euro} );
+Unpacking using the C<U> template code also works on UTF-8 encoded byte
+strings.
+
Usually you'll want to pack or unpack UTF-8 strings:
# pack and unpack the Hebrew alphabet
my $alefbet = pack( 'U*', 0x05d0..0x05ea );
my @hebrew = unpack( 'U*', $utf );
+Please note: in the general case, you're better off using
+Encode::decode_utf8 to decode a UTF-8 encoded byte string to a Perl
+Unicode string, and Encode::encode_utf8 to encode a Perl Unicode string
+to UTF-8 bytes. These functions provide means of handling invalid byte
+sequences and generally have a friendlier interface.
=head2 Another Portable Binary Encoding
later. If the bytes are native 8-bit bytes, you can use the C<bytes>
pragma. If the bytes are in a string (regular expression being a
curious string), you can often also use the C<\xHH> notation instead
-of embedding the bytes as-is. If they are in some particular legacy
-encoding (ether single-byte or something more complicated), you can
-use the C<encoding> pragma. (If you want to write your code in UTF-8,
-you can use either the C<utf8> pragma, or the C<encoding> pragma.)
-The C<bytes> and C<utf8> pragmata are available since Perl 5.6.0, and
-the C<encoding> pragma since Perl 5.8.0.
+of embedding the bytes as-is. (If you want to write your code in UTF-8,
+you can use the C<utf8> pragma.) The C<bytes> and C<utf8> pragmata are
+available since Perl 5.6.0.
=head2 System Resources
With the advent of 5.6.0, Perl regexps can handle more than just the
standard ASCII character set. Perl now supports I<Unicode>, a standard
for representing the alphabets from virtually all of the world's written
-languages, and a host of symbols. Perl uses the UTF-8 encoding, in which
-ASCII characters are still encoded as one byte, but characters greater
-than C<chr(127)> may be stored as two or more bytes.
+languages, and a host of symbols. Perl's text strings are Unicode strings, so
+they can contain characters with a value (codepoint or character number) higher
+than 255.
What does this mean for regexps? Well, regexp users don't need to know
much about Perl's internal representation of strings. But they do need
-to know 1) how to represent Unicode characters in a regexp and 2) when
-a matching operation will treat the string to be searched as a
-sequence of bytes (the old way) or as a sequence of Unicode characters
-(the new way). The answer to 1) is that Unicode characters greater
-than C<chr(127)> may be represented using the C<\x{hex}> notation,
-with C<hex> a hexadecimal integer:
+to know 1) how to represent Unicode characters in a regexp and 2) that
+a matching operation will treat the string to be searched as a sequence
+of characters, not bytes. The answer to 1) is that Unicode characters
+greater than C<chr(255)> are represented using the C<\x{hex}> notation,
+because the C<\0> octal and C<\x> hex escapes (without curly braces)
+don't go further than 255.
/\x{263a}/; # match a Unicode smiley face :)
-Unicode characters in the range of 128-255 use two hexadecimal digits
-with braces: C<\x{ab}>. Note that this is in general different than
-C<\xab>, which is just a hexadecimal byte with no Unicode significance,
-except when your script is encoded in UTF-8 where C<\xab> has the
-same byte representation as C<\x{ab}>.
-
B<NOTE>: In Perl 5.6.0 it used to be that one needed to say C<use
utf8> to use any Unicode features. This is no more the case: for
almost all Unicode processing, the explicit C<utf8> pragma is not
lib/perl5/X.X.X/unicore directory (where X.X.X is the perl
version number as it is installed on your system).
-The answer to requirement 2), as of 5.6.0, is that if a regexp
-contains Unicode characters, the string is searched as a sequence of
-Unicode characters. Otherwise, the string is searched as a sequence of
-bytes. If the string is being searched as a sequence of Unicode
-characters, but matching a single byte is required, we can use the C<\C>
-escape sequence. C<\C> is a character class akin to C<.> except that
-it matches I<any> byte 0-255. So
-
- use charnames ":full"; # use named chars with Unicode full names
- $x = "a";
- $x =~ /\C/; # matches 'a', eats one byte
- $x = "";
- $x =~ /\C/; # doesn't match, no bytes to match
- $x = "\N{MERCURY}"; # two-byte Unicode character
- $x =~ /\C/; # matches, but dangerous!
-
-The last regexp matches, but is dangerous because the string
-I<character> position is no longer synchronized to the string I<byte>
-position. This generates the warning 'Malformed UTF-8
-character'. The C<\C> is best used for matching the binary data in strings
-with binary data intermixed with Unicode characters.
-
-Let us now discuss the rest of the character classes. Just as with
-Unicode characters, there are named Unicode character classes
-represented by the C<\p{name}> escape sequence. Closely associated is
-the C<\P{name}> character class, which is the negation of the
-C<\p{name}> class. For example, to match lower and uppercase
-characters,
+The answer to requirement 2), as of 5.6.0, is that a regexp uses Unicode
+characters. Internally, this is encoded to bytes using either UTF-8 or a
+native 8 bit encoding, depending on the history of the string, but
+conceptually it is a sequence of characters, not bytes. See
+L<perlunitut> for a tutorial about that.
+
+Let us now discuss Unicode character classes. Just as with Unicode
+characters, there are named Unicode character classes represented by the
+C<\p{name}> escape sequence. Closely associated is the C<\P{name}>
+character class, which is the negation of the C<\p{name}> class. For
+example, to match lower and uppercase characters,
use charnames ":full"; # use named chars with Unicode full names
$x = "BOB";
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.
+People who want to learn to use Unicode in Perl should probably read
+L<the Perl Unicode tutorial|perlunitut> before reading this reference
+document.
+
=over 4
=item Input and Output Layers
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.
-To indicate that Perl source itself is using a particular encoding,
-see L<encoding>.
+To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
=item Regular Expressions
The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
-character scheme when presented with Unicode data--or instead uses
-a traditional byte scheme when presented with byte data.
+character scheme when presented with data that is internally encoded in
+UTF-8 -- or instead uses a traditional byte scheme when presented with
+byte data.
=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.
-You can also use the C<encoding> pragma to change the default encoding
-of the data in your script; see L<encoding>.
-
=item BOM-marked scripts and UTF-16 scripts autodetected
If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF16-BE,
downgraded with UTF-8 encoding. This happens because the first 256
codepoints in Unicode happens to agree with Latin-1.
-If you wish to interpret byte strings as UTF-8 instead, use the
-C<encoding> pragma:
-
- use encoding 'utf8';
-
See L</"Byte and Character Semantics"> for more details.
=back
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC. This translation is done without
-regard to the system's native 8-bit encoding. To change this for
-systems with non-Latin-1 and non-EBCDIC native encodings, use the
-C<encoding> pragma. See L<encoding>.
+regard to the system's native 8-bit encoding.
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.
-If you use a Unicode editor to edit your program, Unicode characters
-may occur directly within the literal strings in one of the various
-Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
-as such and converted to Perl's internal representation only if the
-appropriate L<encoding> is specified.
+If you use a Unicode editor to edit your program, Unicode characters may
+occur directly within the literal strings in UTF-8 encoding, or UTF-16.
+(The former requires a BOM or C<use utf8>, the latter requires a BOM.)
-Unicode characters can also be added to a string by using the
-C<\x{...}> notation. The Unicode code for the desired character, in
-hexadecimal, should be placed in the braces. For instance, a smiley
-face is C<\x{263A}>. This encoding scheme only works for characters
-with a code of 0x100 or above.
+Unicode characters can also be added to a string by using the C<\x{...}>
+notation. The Unicode code for the desired character, in hexadecimal,
+should be placed in the braces. For instance, a smiley face is
+C<\x{263A}>. This encoding scheme works for all characters, but
+for characters under 0x100, note that Perl may use an 8 bit encoding
+internally, for optimization and/or backward compatibility.
Additionally, if you
=item *
Regular expressions match characters instead of bytes. "." matches
-a character instead of a byte. The C<\C> pattern is provided to force
-a match a single byte--a C<char> in C, hence C<\C>.
+a character instead of a byte.
=item *
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.
-(However, and as a limitation of the current implementation, using
-C<\w> or C<\W> I<inside> a C<[...]> character class will still match
-with byte semantics.)
-
=item *
Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
-See L</"Unicode Character Properties"> for more details.
+See L</"Unicode Character Properties"> for more details.
You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
=head2 Interaction with Extensions
When Perl exchanges data with an extension, the extension should be
-able to understand the UTF-8 flag and act accordingly. If the
+able to understand the UTF8 flag and act accordingly. If the
extension doesn't know about the flag, it's likely that the extension
will return incorrectly-flagged data.
Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
-UTF-8 flag is stripped off. Note that at the time of this writing
+UTF8 flag is stripped off. Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.
A scalar we got back from an extension
If you believe the scalar comes back as UTF-8, you will most likely
-want the UTF-8 flag restored:
+want the UTF8 flag restored:
if ($] > 5.007) {
require Encode;
Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program. If you recognize such a situation, just remove
-the UTF-8 flag:
+the UTF8 flag:
utf8::downgrade($val) if $] > 5.007;
=head1 SEE ALSO
-L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perluniintro>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">
=cut
--- /dev/null
+=head1 NAME
+
+perlunifaq - Perl Unicode FAQ
+
+=head1 DESCRIPTION
+
+This is a list of questions and answers about Unicode in Perl, intended to be
+read after L<perlunitut>.
+
+=head2 perlunitut isn't really a Unicode tutorial, is it?
+
+No, and this isn't really a Unicode FAQ.
+
+Perl has an abstracted interface for all supported character encodings, so these
+are actually a generic C<Encode> tutorial and C<Encode> FAQ. But many people
+think that Unicode is special and magical, and I didn't want to disappoint
+them, so I decided to call the document a Unicode tutorial.
+
+=head2 What about binary data, like images?
+
+Well, apart from a bare C<binmode $fh>, you shouldn't treat them specially.
+(The binmode is needed because otherwise Perl may convert line endings on Win32
+systems.)
+
+Be careful, though, to never combine text strings with binary strings. If you
+need text in a binary stream, encode your text strings first using the
+appropriate encoding, then join them with binary strings. See also: "What if I
+don't encode?".
+
+=head2 What about the UTF8 flag?
+
+Please, unless you're hacking the internals, or debugging weirdness, don't
+think about the UTF8 flag at all. That means that you very probably shouldn't
+use C<is_utf8>, C<_utf8_on> or C<_utf8_off> at all.
+
+Perl's internal format happens to be UTF-8. Unfortunately, Perl can't keep a
+secret, so everyone knows about this. That is the source of much confusion.
+It's better to pretend that the internal format is some unknown encoding,
+and that you always have to encode and decode explicitly.
+
+=head2 When should I decode or encode?
+
+Whenever you're communicating with anything that is external to your perl
+process, like a database, a text file, a socket, or another program. Even if
+the thing you're communicating with is also written in Perl.
+
+=head2 What if I don't decode?
+
+Whenever your encoded, binary string is used together with a text string, Perl
+will assume that your binary string was encoded with ISO-8859-1, also known as
+latin-1. If it wasn't latin-1, then your data is unpleasantly converted. For
+example, if it was UTF-8, the individual bytes of multibyte characters are seen
+as separate characters, and then again converted to UTF-8. Such double encoding
+can be compared to double HTML encoding (C<&amp;gt;>), or double URI encoding
+(C<%253E>).
+
+This silent implicit decoding is known as "upgrading". That may sound
+positive, but it's best to avoid it.
+
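The double encoding described above can be sketched as follows. This is a minimal illustration, assuming the undecoded octets are the UTF-8 encoding of U+00E9; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative values (assumptions, not taken from the FAQ text):
my $octets = "\xC3\xA9";    # UTF-8 encoded bytes of U+00E9, not yet decoded
my $text   = "\x{263A}";    # a text string holding one wide character

# Forgetting to decode: the two bytes are treated as two Latin-1 characters.
my $mixed          = $text . $octets;
my $double_encoded = encode('UTF-8', $mixed);

# Decoding first keeps U+00E9 as a single character.
my $fixed        = $text . decode('UTF-8', $octets);
my $encoded_once = encode('UTF-8', $fixed);

printf "undecoded: %d characters, %d bytes after encoding\n",
    length $mixed, length $double_encoded;
printf "decoded:   %d characters, %d bytes after encoding\n",
    length $fixed, length $encoded_once;
```

The undecoded variant ends up with one character too many, and each of the two stray characters is encoded again on output.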
+=head2 What if I don't encode?
+
+Your text string will be sent using the bytes in Perl's internal format. In
+some cases, Perl will warn you that you're doing something wrong, with a
+friendly warning:
+
+ Wide character in print at example.pl line 2.
+
+Because the internal format is often UTF-8, these bugs are hard to spot:
+UTF-8 is usually the encoding you wanted! But don't be lazy, and don't
+use the fact that Perl's internal format is UTF-8 to your advantage. Encode
+explicitly to avoid weird bugs, and to show to maintenance programmers that you
+thought this through.
+
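A small sketch of the warning and of the explicit alternative. It assumes a scratch file created with File::Temp and collects warnings in an array for inspection; neither detail comes from the FAQ itself.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

my $text = "\x{263A}";    # one character above 255

# Collect warnings instead of printing them (illustrative only).
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my ($fh, $filename) = tempfile(UNLINK => 1);
binmode $fh;                           # plain byte handle, no encoding layer

print {$fh} $text;                     # not encoded: "Wide character" warning
print {$fh} encode('UTF-8', $text);    # encoded explicitly: no warning

close $fh or die "close failed: $!";

print "warnings seen: ", scalar(@warnings), "\n";
print $warnings[0] if @warnings;
```

Both prints happen to write the same three bytes here, which is exactly why the missing encode step is easy to overlook.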
+=head2 Is there a way to automatically decode or encode?
+
+If all data that comes from a certain handle is encoded in exactly the same
+way, you can tell the PerlIO system to automatically decode everything, with
+the C<encoding> layer. If you do this, you can't accidentally forget to decode
+or encode anymore, on things that use the layered handle.
+
+You can provide this layer when C<open>ing the file:
+
+ open my $fh, '>:encoding(UTF-8)', $filename; # auto encoding on write
+ open my $fh, '<:encoding(UTF-8)', $filename; # auto decoding on read
+
+Or if you already have an open filehandle:
+
+ binmode $fh, ':encoding(UTF-8)';
+
+Some database drivers for DBI can also automatically encode and decode, but
+that is typically limited to the UTF-8 encoding, because they cheat.
+
+=head2 Cheat?! Tell me, how can I cheat?
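As a rough, self-contained sketch of the layered handles shown above: the text is written and read back through C<:encoding(UTF-8)>, so the program only ever sees characters. The temporary file and the sample string are assumptions made for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration (illustrative, not from the FAQ).
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

my $text = "caf\x{E9} \x{263A}";    # 6 characters

# The layer encodes on the way out...
open my $out, '>:encoding(UTF-8)', $filename or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";

# ...and decodes on the way in.
open my $in, '<:encoding(UTF-8)', $filename or die "open for read: $!";
my $read_back = do { local $/; <$in> };
close $in;

my $size_on_disk = -s $filename;    # counted in bytes, not characters
printf "%d characters in memory, %d bytes on disk\n",
    length $read_back, $size_on_disk;
```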
+
+Well, because Perl's internal format is UTF-8, you can just skip the encoding
+or decoding step, and manipulate the UTF8 flag directly.
+
+Instead of C<:encoding(UTF-8)>, you can simply use C<:utf8>. This is widely
+accepted as good behavior when you're writing, but it can be dangerous when
+reading, because it causes internal inconsistency when you have invalid byte
+sequences.
+
+Instead of C<decode> and C<encode>, you could use C<_utf8_on> and C<_utf8_off>,
+but this is considered bad style. Especially C<_utf8_on> can be dangerous, for
+the same reason that C<:utf8> can.
+
+There are some shortcuts for oneliners; see C<-C> in L<perlrun>.
+
+=head2 What if I don't know which encoding was used?
+
+Do whatever you can to find out, and if you have to: guess. (Don't forget to
+document your guess with a comment.)
+
+You could open the document in a web browser, and change the character set or
+character encoding until you can visually confirm that all characters look the
+way they should.
+
+There is no way to reliably detect the encoding automatically, so if people
+keep sending you data without charset indication, you may have to educate them.
+
+=head2 Can I use Unicode in my Perl sources?
+
+Yes, you can! If your sources are UTF-8 encoded, you can indicate that with the
+C<use utf8> pragma.
+
+ use utf8;
+
+This doesn't do anything to your input, or to your output. It only influences
+the way your sources are read. You can use Unicode in string literals, in
+identifiers (but they still have to be "word characters" according to C<\w>),
+and even in custom delimiters.
+
+=head2 Data::Dumper doesn't restore the UTF8 flag; is it broken?
+
+No, Data::Dumper's Unicode abilities are as they should be. There have been
+some complaints that it should restore the UTF8 flag when the data is read
+again with C<eval>. However, you should really not look at the flag, and
+nothing indicates that Data::Dumper should break this rule.
+
+Here's what happens: when Perl reads in a string literal, it sticks to 8 bit
+encoding as long as it can. (But perhaps originally it was internally encoded
+as UTF-8, when you dumped it.) When it has to give that up because other
+characters are added to the text string, it silently upgrades the string to
+UTF-8.
+
+If you properly encode your strings for output, none of this is of your
+concern, and you can just C<eval> dumped data as always.
+
+=head2 How can I determine if a string is a text string or a binary string?
+
+You can't. Some use the UTF8 flag for this, but that's misuse, and makes well
+behaved modules like Data::Dumper look bad. The flag is useless for this
+purpose, because it's off when an 8 bit encoding (by default ISO-8859-1) is
+used to store the string.
+
+This is something you, the programmer, have to keep track of; sorry. You could
+consider adopting a kind of "Hungarian notation" to help with this.
+
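To see why the flag says nothing about text versus binary, here is a debugging-only sketch. It assumes a text string whose characters all fit in 8 bits; C<utf8::is_utf8> and C<utf8::upgrade> are used purely to inspect and change the internal storage, which normal code should not need to do.

```perl
use strict;
use warnings;

my $text = "caf\x{E9}";    # a text string; every character is below 0x100
my $flag_before = utf8::is_utf8($text) ? 1 : 0;

my $upgraded = $text;
utf8::upgrade($upgraded);  # changes only the internal representation
my $flag_after = utf8::is_utf8($upgraded) ? 1 : 0;

# The two strings hold the same characters and compare equal.
my $same = $text eq $upgraded ? 1 : 0;

print "flag before: $flag_before, flag after: $flag_after, equal: $same\n";
```

The same text string is seen once with the flag off and once with it on, so the flag cannot serve as a "this is text" marker.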
+=head2 How do I convert from encoding FOO to encoding BAR?
+
+By first converting the FOO-encoded byte string to a text string, and then the
+text string to a BAR-encoded byte string:
+
+ my $text_string = decode('FOO', $foo_string);
+ my $bar_string = encode('BAR', $text_string);
+
+or by skipping the text string part, and going directly from one binary
+encoding to the other:
+
+ use Encode qw(from_to);
+ from_to($string, 'FOO', 'BAR'); # changes contents of $string
+
+or by letting automatic decoding and encoding do all the work:
+
+ open my $foofh, '<:encoding(FOO)', 'example.foo.txt';
+ open my $barfh, '>:encoding(BAR)', 'example.bar.txt';
+ print { $barfh } $_ while <$foofh>;
+
+=head2 What about the C<use bytes> pragma?
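With concrete names filled in for FOO and BAR, the first two approaches might look like the sketch below. ISO-8859-1 and UTF-8 are chosen here only as an example pair, and the sample string is an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# Example input: "cafe" with e-acute, encoded as ISO-8859-1 (illustrative).
my $latin1_octets = "caf\xE9";

# Two explicit steps: octets -> text string -> octets.
my $text_string = decode('iso-8859-1', $latin1_octets);
my $utf8_octets = encode('UTF-8', $text_string);

# One step, converting the octets in place.
my $in_place = "caf\xE9";
from_to($in_place, 'iso-8859-1', 'UTF-8');

printf "two-step result: %d bytes, in-place result: %d bytes\n",
    length $utf8_octets, length $in_place;
```

The two-step form leaves a text string available for further processing; C<from_to> is convenient when only the bytes matter.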
+
+Don't use it. It makes no sense to deal with bytes in a text string, and it
+makes no sense to deal with characters in a byte string. Do the proper
+conversions (by decoding/encoding), and things will work out well: you get
+character counts for decoded data, and byte counts for encoded data.
+
+C<use bytes> is usually a failed attempt to do something useful. Just forget
+about it.
+
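The point about counts can be illustrated without C<use bytes>. In this sketch, which assumes UTF-8 as the output encoding, C<length> gives characters on the decoded string and bytes on the encoded one.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "\x{263A}\x{263A}";          # text string: two characters
my $char_count = length $text;

my $octets = encode('UTF-8', $text);    # byte string: the encoded form
my $byte_count = length $octets;

print "characters: $char_count, bytes: $byte_count\n";
```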
+=head2 What are C<decode_utf8> and C<encode_utf8>?
+
+These are alternate syntaxes for C<decode('utf8', ...)> and C<encode('utf8',
+...)>.
+
+=head2 What's the difference between C<UTF-8> and C<utf8>?
+
+C<UTF-8> is the official standard. C<utf8> is Perl's way of being liberal in
+what it accepts. If you have to communicate with things that aren't so liberal,
+you may want to consider using C<UTF-8>. If you have to communicate with things
+that are too liberal, you may have to use C<utf8>. The full explanation is in
+L<Encode>.
+
+C<UTF-8> is internally known as C<utf-8-strict>. The tutorial uses UTF-8
+consistently, even where utf8 is actually used internally, because the
+distinction can be hard to make, and is mostly irrelevant.
+
+For example, utf8 can be used for code points that don't exist in Unicode, like
+9999999, but if you encode that to UTF-8, you get a substitution character (by
+default; see L<Encode/"Handling Malformed Data"> for more ways of dealing with
+this.)
+
+Okay, if you insist: the "internal format" is utf8, not UTF-8. (When it's not
+some other encoding.)
+
+=head2 I lost track; what encoding is the internal format really?
+
+It's good that you lost track, because you shouldn't depend on the internal
+format being any specific encoding. But since you asked: by default, the
+internal format is either ISO-8859-1 (latin-1), or utf8, depending on the
+history of the string. On EBCDIC platforms, this may even be different.
+
+Perl knows how it stored the string internally, and will use that knowledge
+when you C<encode>. In other words: don't try to find out what the internal
+encoding for a certain string is, but instead just encode it into the encoding
+that you want.
+
+=head2 What character encodings does Perl support?
+
+To find out which character encodings your Perl supports, run:
+
+ perl -MEncode -le "print for Encode->encodings(':all')"
+
+=head2 Which version of perl should I use?
+
+Well, if you can, upgrade to the most recent, but certainly C<5.8.1> or newer.
+The tutorial and FAQ are based on the status quo as of C<5.8.8>.
+
+You should also check your modules, and upgrade them if necessary. For example,
+HTML::Entities requires version >= 1.32 to function correctly, even though the
+changelog is silent about this.
+
+=head1 AUTHOR
+
+Juerd Waalboer <juerd@cpan.org>
+
+=head1 SEE ALSO
+
+L<perlunicode>, L<perluniintro>, L<Encode>
+
When you combine legacy data and Unicode the legacy data needs
to be upgraded to Unicode. Normally ISO 8859-1 (or EBCDIC, if
-applicable) is assumed. You can override this assumption by
-using the C<encoding> pragma, for example
-
- use encoding 'latin2'; # ISO 8859-2
-
-in which case literals (string or regular expressions), C<chr()>,
-and C<ord()> in your whole script are assumed to produce Unicode
-characters from ISO 8859-2 code points. Note that the matching for
-encoding names is forgiving: instead of C<latin2> you could have
-said C<Latin 2>, or C<iso8859-2>, or other variations. With just
-
- use encoding;
-
-the environment variable C<PERL_ENCODING> will be consulted.
-If that variable isn't set, the encoding pragma will fail.
+applicable) is assumed.
The C<Encode> module knows about many encodings and has interfaces
for doing conversions between those encodings:
while (<$nihongo>) { print $unicode $_ }
The naming of encodings, both by the C<open()> and by the C<open>
-pragma, is similar to the C<encoding> pragma in that it allows for
-flexible names: C<koi8-r> and C<KOI8R> will both be understood.
+pragma allows for flexible names: C<koi8-r> and C<KOI8R> will both be
+understood.
Common encodings recognized by ISO, MIME, IANA, and various other
standardisation organisations are recognised; for a more detailed
=head1 SEE ALSO
-L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perlunicode>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>,
L<Unicode::UCD>
irrelevant here, and so are encodings. Each character is just that: the
character.
+Text strings are also called B<Unicode strings>, because in Perl, every text
+string is a Unicode string.
+
On a text string, you would do things like:
$text =~ s/foo/bar/;
"Content-Type: text/plain; charset=UTF-8",
"Content-Length: $byte_count"
-=head2 Q and A
-
-=head3 This isn't really a Unicode tutorial, is it?
-
-No, Perl has an abstracted interface for all supported character encodings, so
-this is actually a generic C<Encode> tutorial. But many people think that
-Unicode is special and magical, and I didn't want to disappoint them, so I
-decided to call this document a Unicode tutorial.
-
-=head3 What about binary data, like images?
-
-Well, apart from a bare C<binmode $fh>, you shouldn't treat them specially.
-(The binmode is needed because otherwise Perl may convert line endings on Win32
-systems.)
-
-Be careful, though, to never combine text strings with binary strings. If you
-need text in a binary stream, encode your text strings first using the
-appropriate encoding, then join them with binary strings. See also: "What if I
-don't encode?".
-
-=head3 What about the UTF-8 flag?
-
-Please, unless you're hacking the internals, or debugging weirdness, don't
-think about the UTF-8 flag at all. That means that you very probably shouldn't
-use C<is_utf8>, C<_utf8_on> or C<_utf8_off> at all.
-
-Perl's internal format happens to be UTF-8. Unfortunately, Perl can't keep a
-secret, so everyone knows about this. That is the source of much confusion.
-It's better to pretend that the internal format is some unknown encoding,
-and that you always have to encode and decode explicitly.
-
-=head3 When should I decode or encode?
-
-Whenever you're communicating with anything that is external to your perl
-process, like a database, a text file, a socket, or another program. Even if
-the thing you're communicating with is also written in Perl.
-
-=head3 What if I don't decode?
-
-Whenever your encoded, binary string is used together with a text string, Perl
-will assume that your binary string was encoded with ISO-8859-1, also known as
-latin-1. If it wasn't latin-1, then your data is unpleasantly converted. For
-example, if it was UTF-8, the individual bytes of multibyte characters are seen
-as separate characters, and then again converted to UTF-8. Such double encoding
-can be compared to double HTML encoding (C<&amp;gt;>), or double URI encoding
-(C<%253E>).
-
-This silent implicit decoding is known as "upgrading". That may sound
-positive, but it's best to avoid it.
-
-=head3 What if I don't encode?
-
-Your text string will be sent using the bytes in Perl's internal format. In
-some cases, Perl will warn you that you're doing something wrong, with a
-friendly warning:
-
- Wide character in print at example.pl line 2.
-
-Because the internal format is often UTF-8, these bugs are hard to spot,
-because UTF-8 is usually the encoding you wanted! But don't be lazy, and don't
-use the fact that Perl's internal format is UTF-8 to your advantage. Encode
-explicitly to avoid weird bugs, and to show to maintenance programmers that you
-thought this through.
-
-=head3 Is there a way to automatically decode or encode?
-
-If all data that comes from a certain handle is encoded in exactly the same
-way, you can tell the PerlIO system to automatically decode everything, with
-the C<encoding> layer. If you do this, you can't accidentally forget to decode
-or encode anymore, on things that use the layered handle.
-
-You can provide this layer when C<open>ing the file:
-
- open my $fh, '>:encoding(UTF-8)', $filename; # auto encoding on write
- open my $fh, '<:encoding(UTF-8)', $filename; # auto decoding on read
-
-Or if you already have an open filehandle:
-
- binmode $fh, ':encoding(UTF-8)';
-
-Some database drivers for DBI can also automatically encode and decode, but
-that is typically limited to the UTF-8 encoding, because they cheat.
-
-=head3 Cheat?! Tell me, how can I cheat?
-
-Well, because Perl's internal format is UTF-8, you can just skip the encoding
-or decoding step, and manipulate the UTF-8 flag directly.
-
-Instead of C<:encoding(UTF-8)>, you can simply use C<:utf8>. This is widely
-accepted as good behavior.
-
-Instead of C<decode> and C<encode>, you could use C<_utf8_on> and C<_utf8_off>.
-But this is, contrary to C<:utf8>, considered bad style.
-
-There are some shortcuts for oneliners; see C<-C> in L<perlrun>.
-
-=head3 What if I don't know which encoding was used?
-
-Do whatever you can to find out, and if you have to: guess. (Don't forget to
-document your guess with a comment.)
-
-You could open the document in a web browser, and change the character set or
-character encoding until you can visually confirm that all characters look the
-way they should.
-
-There is no way to reliably detect the encoding automatically, so if people
-keep sending you data without charset indication, you may have to educate them.
-
-=head3 Can I use Unicode in my Perl sources?
-
-Yes, you can! If your sources are UTF-8 encoded, you can indicate that with the
-C<use utf8> pragma.
-
- use utf8;
-
-This doesn't do anything to your input, or to your output. It only influences
-the way your sources are read. You can use Unicode in string literals, in
-identifiers (but they still have to be "word characters" according to C<\w>),
-and even in custom delimiters.
-
-=head3 Data::Dumper doesn't restore the UTF-8 flag; is it broken?
-
-No, Data::Dumper's Unicode abilities are as they should be. There have been
-some complaints that it should restore the UTF-8 flag when the data is read
-again with C<eval>. However, you should really not look at the flag, and
-nothing indicates that Data::Dumper should break this rule.
-
-Here's what happens: when Perl reads in a string literal, it sticks to 8 bit
-encoding as long as it can. (But perhaps originally it was internally encoded
-as UTF-8, when you dumped it.) When it has to give that up because other
-characters are added to the text string, it silently upgrades the string to
-UTF-8.
-
-If you properly encode your strings for output, none of this is of your
-concern, and you can just C<eval> dumped data as always.
-
-=head3 How can I determine if a string is a text string or a binary string?
-
-You can't. Some use the UTF-8 flag for this, but that's misuse, and makes well
-behaved modules like Data::Dumper look bad. The flag is useless for this
-purpose, because it's off when an 8 bit encoding (by default ISO-8859-1) is
-used to store the string.
-
-This is something you, the programmer, has to keep track of; sorry. You could
-consider adopting a kind of "Hungarian notation" to help with this.
-
-=head3 How do I convert from encoding FOO to encoding BAR?
-
-By first converting the FOO-encoded byte string to a text string, and then the
-text string to a BAR-encoded byte string:
-
- my $text_string = decode('FOO', $foo_string);
- my $bar_string = encode('BAR', $text_string);
-
-or by skipping the text string part, and going directly from one binary
-encoding to the other:
-
- use Encode qw(from_to);
- from_to($string, 'FOO', 'BAR'); # changes contents of $string
-
-or by letting automatic decoding and encoding do all the work:
-
- open my $foofh, '<:encoding(FOO)', 'example.foo.txt';
- open my $barfh, '>:encoding(BAR)', 'example.bar.txt';
- print { $barfh } $_ while <$foofh>;
-
-=head3 What about the C<use bytes> pragma?
-
-Don't use it. It makes no sense to deal with bytes in a text string, and it
-makes no sense to deal with characters in a byte string. Do the proper
-conversions (by decoding/encoding), and things will work out well: you get
-character counts for decoded data, and byte counts for encoded data.
-
-C<use bytes> is usually a failed attempt to do something useful. Just forget
-about it.
-
-=head3 What are C<decode_utf8> and C<encode_utf8>?
-
-These are alternate syntaxes for C<decode('utf8', ...)> and C<encode('utf8',
-...)>.
-
-=head3 What's the difference between C<UTF-8> and C<utf8>?
-
-C<UTF-8> is the official standard. C<utf8> is Perl's way of being liberal in
-what it accepts. If you have to communicate with things that aren't so liberal,
-you may want to consider using C<UTF-8>. If you have to communicate with things
-that are too liberal, you may have to use C<utf8>. The full explanation is in
-L<Encode>.
-
-C<UTF-8> is internally known as C<utf-8-strict>. This tutorial uses UTF-8
-consistently, even where utf8 is actually used internally, because the
-distinction can be hard to make, and is mostly irrelevant.
-
-Okay, if you insist: the "internal format" is utf8, not UTF-8. (When it's not
-some other encoding.)
-
-=head3 I lost track; what encoding is the internal format really?
-
-It's good that you lost track, because you shouldn't depend on the internal
-format being any specific encoding. But since you asked: by default, the
-internal format is either ISO-8859-1 (latin-1), or utf8, depending on the
-history of the string.
-
-Perl knows how it stored the string internally, and will use that knowledge
-when you C<encode>. In other words: don't try to find out what the internal
-encoding for a certain string is, but instead just encode it into the encoding
-that you want.
-
-=head3 What character encodings does Perl support?
-
-To find out which character encodings your Perl supports, run:
-
- perl -MEncode -le "print for Encode->encodings(':all')"
-
-=head3 Which version of perl should I use?
-
-Well, if you can, upgrade to the most recent, but certainly C<5.8.1> or newer.
-This tutorial is based on the status quo as of C<5.8.7>.
-
-You should also check your modules, and upgrade them if necessary. For example,
-HTML::Entities requires version >= 1.32 to function correctly, even though the
-changelog is silent about this.
-
=head1 SUMMARY
Decode everything you receive, encode everything you send out. (If it's text
data.)
+=head1 Q and A (or FAQ)
+
+After reading this document, you ought to read L<perlunifaq> too.
+
=head1 ACKNOWLEDGEMENTS
Thanks to Johan Vromans from Squirrel Consultancy. His UTF-8 rants during the
=head1 SEE ALSO
-L<perlunicode>, L<perluniintro>, L<Encode>
+L<perlunifaq>, L<perlunicode>, L<perluniintro>, L<Encode>
as a string composed of characters with those ordinals. Thus in Perl v5.6.0
it equals C<chr(5) . chr(6) . chr(0)> and will return true for
C<$^V eq v5.6.0>. Note that the characters in this string value can
-potentially be in Unicode range.
+potentially be greater than 255.
This variable first appeared in perl 5.6.0; earlier versions of perl will
see an undefined value.
Copies a stringified representation of the source SV into the
destination SV. Automatically performs any necessary mg_get and
coercion of numeric values into strings. Guaranteed to preserve
-UTF-8 flag even from overloaded objects. Similar in nature to
+UTF8 flag even from overloaded objects. Similar in nature to
sv_2pv[_flags] but operates directly on an SV instead of just the
string. Mostly uses sv_2pv_flags to do its work, except when that
would lose the UTF-8'ness of the PV.