From: Juerd Waalboer <#####@juerd.nl> Date: Sun, 4 Mar 2007 16:00:19 +0000 (+0100) Subject: Re: [PATCH] (Re: [PATCH] unicode/utf8 pod) X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=2575c402a8f9be55f848bdfb219afbf912c50ac1;p=p5sagit%2Fp5-mst-13.2.git Re: [PATCH] (Re: [PATCH] unicode/utf8 pod) Message-ID: <20070304150019.GN4723@c4.convolution.nl> p4raw-id: //depot/perl@30493 --- diff --git a/MANIFEST b/MANIFEST index 4961f5f..5b9f51c 100644 --- a/MANIFEST +++ b/MANIFEST @@ -3117,6 +3117,7 @@ pod/perltoot.pod Perl OO tutorial, part 1 pod/perltrap.pod Perl traps for the unwary pod/perlunicode.pod Perl Unicode support pod/perluniintro.pod Perl Unicode introduction +pod/perlunifaq.pod Perl Unicode FAQ pod/perlunitut.pod Perl Unicode tutorial pod/perlutil.pod utilities packaged with the Perl distribution pod/perlvar.pod Perl predefined variables diff --git a/ext/Encode/Encode.pm b/ext/Encode/Encode.pm index ae04755..bdfa695 100644 --- a/ext/Encode/Encode.pm +++ b/ext/Encode/Encode.pm @@ -406,10 +406,10 @@ iso-8859-1 (also known as Latin1), $octets = encode("iso-8859-1", $string); B: When you run C<$octets = encode("utf8", $string)>, then $octets -B $string. Though they both contain the same data, the utf8 flag -for $octets is B off. When you encode anything, utf8 flag of +B $string. Though they both contain the same data, the UTF8 flag +for $octets is B off. When you encode anything, UTF8 flag of the result is always off, even when it contains completely valid utf8 -string. See L below. +string. See L below. If the $string is C then C is returned. @@ -427,8 +427,8 @@ For example, to convert ISO-8859-1 data to a string in Perl's internal format: B: When you run C<$string = decode("utf8", $octets)>, then $string B $octets. Though they both contain the same data, -the utf8 flag for $string is on unless $octets entirely consists of -ASCII data (or EBCDIC on EBCDIC machines). 
See L +the UTF8 flag for $string is on unless $octets entirely consists of +ASCII data (or EBCDIC on EBCDIC machines). See L below. If the $string is C then C is returned. @@ -458,11 +458,11 @@ B: The following operations look the same but are not quite so; $data = decode("iso-8859-1", $data); #2 Both #1 and #2 make $data consist of a completely valid UTF-8 string -but only #2 turns utf8 flag on. #1 is equivalent to +but only #2 turns UTF8 flag on. #1 is equivalent to $data = encode("utf8", decode("iso-8859-1", $data)); -See L below. +See L below. =item $octets = encode_utf8($string); @@ -684,13 +684,13 @@ arguments are taken as aliases for I<$object>. See L for more details. -=head1 The UTF-8 flag +=head1 The UTF8 flag -Before the introduction of utf8 support in perl, The C operator +Before the introduction of Unicode support in perl, The C operator just compared the strings represented by two scalars. Beginning with -perl 5.8, C compares two strings with simultaneous consideration -of I. To explain why we made it so, I will quote page -402 of C +perl 5.8, C compares two strings with simultaneous consideration of +I. To explain why we made it so, I will quote page 402 of +C =over 2 @@ -719,27 +719,27 @@ byte-oriented Perl and a character-oriented Perl. Back when C was written, not even Perl 5.6.0 was born and many features documented in the book remained unimplemented for a long time. Perl 5.8 corrected this and the introduction -of the UTF-8 flag is one of them. You can think of this perl notion as of a -byte-oriented mode (utf8 flag off) and a character-oriented mode (utf8 +of the UTF8 flag is one of them. You can think of this perl notion as of a +byte-oriented mode (UTF8 flag off) and a character-oriented mode (UTF8 flag on). -Here is how Encode takes care of the utf8 flag. +Here is how Encode takes care of the UTF8 flag. =over 2 =item * -When you encode, the resulting utf8 flag is always off. +When you encode, the resulting UTF8 flag is always off. 
=item * -When you decode, the resulting utf8 flag is on unless you can +When you decode, the resulting UTF8 flag is on unless you can unambiguously represent data. Here is the definition of dis-ambiguity. After C<$utf8 = decode('foo', $octet);>, - When $octet is... The utf8 flag in $utf8 is + When $octet is... The UTF8 flag in $utf8 is --------------------------------------------- In ASCII only (or EBCDIC only) OFF In ISO-8859-1 ON @@ -750,7 +750,7 @@ As you see, there is one exception, In ASCII. That way you can assume Goal #1. And with Encode Goal #2 is assumed but you still have to be careful in such cases mentioned in B paragraphs. -This utf8 flag is not visible in perl scripts, exactly for the same +This UTF8 flag is not visible in perl scripts, exactly for the same reason you cannot (or you I) see if a scalar contains a string, integer, or floating point number. But you can still peek and poke these if you will. See the section below. @@ -766,7 +766,7 @@ implementation. As such, they are efficient but may change. =item is_utf8(STRING [, CHECK]) -[INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING. +[INTERNAL] Tests whether the UTF8 flag is turned on in the STRING. If CHECK is true, also checks the data in STRING for being well-formed UTF-8. Returns true if successful, false otherwise. @@ -774,22 +774,22 @@ As of perl 5.8.1, L also has utf8::is_utf8(). =item _utf8_on(STRING) -[INTERNAL] Turns on the UTF-8 flag in STRING. The data in STRING is +[INTERNAL] Turns on the UTF8 flag in STRING. The data in STRING is B checked for being well-formed UTF-8. Do not use unless you B that the STRING is well-formed UTF-8. Returns the previous -state of the UTF-8 flag (so please don't treat the return value as +state of the UTF8 flag (so please don't treat the return value as indicating success or failure), or C if STRING is not a string. =item _utf8_off(STRING) -[INTERNAL] Turns off the UTF-8 flag in STRING. Do not use frivolously. 
-Returns the previous state of the UTF-8 flag (so please don't treat the +[INTERNAL] Turns off the UTF8 flag in STRING. Do not use frivolously. +Returns the previous state of the UTF8 flag (so please don't treat the return value as indicating success or failure), or C if STRING is not a string. =back -=head1 UTF-8 vs. utf8 +=head1 UTF-8 vs. utf8 vs. UTF8 ....We now view strings not as sequences of bytes, but as sequences of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit @@ -836,6 +836,8 @@ goes "liberal" find_encoding("utf_8")->name # ditto. "_" are treated as "-" find_encoding("UTF8")->name # is 'utf8'. +The UTF8 flag is internally called UTF8, without a hyphen. It indicates +whether a string is internally encoded as utf8, also without a hypen. =head1 SEE ALSO diff --git a/ext/Encode/encoding.pm b/ext/Encode/encoding.pm index eb84e48..1f418e3 100644 --- a/ext/Encode/encoding.pm +++ b/ext/Encode/encoding.pm @@ -307,6 +307,14 @@ Will print C<2>, because C<$string> is upgraded as UTF-8. Without C, it will print C<4> instead, since C<$string> is three octets when interpreted as Latin-1. +=head2 Side effects + +If the C pragma is in scope then the lengths returned are +calculated from the length of C<$/> in Unicode characters, which is not +always the same as the length of C<$/> in the native encoding. + +This pragma affects utf8::upgrade, but not utf8::downgrade. + =head1 FEATURES THAT REQUIRE 5.8.1 Some of the features offered by this pragma requires perl 5.8.1. Most diff --git a/lib/PerlIO.pm b/lib/PerlIO.pm index 116deb5..c0acdec 100644 --- a/lib/PerlIO.pm +++ b/lib/PerlIO.pm @@ -121,7 +121,7 @@ The C<:mmap> layer will not exist if platform does not support C. =item :utf8 -Declares that the stream accepts perl's internal encoding of +Declares that the stream accepts perl's I encoding of characters. (Which really is UTF-8 on ASCII machines, but is UTF-EBCDIC on EBCDIC machines.) 
This allows any character perl can represent to be read from or written to the stream. The UTF-X encoding diff --git a/lib/utf8.pm b/lib/utf8.pm index 5ff900d..f8c1c10 100644 --- a/lib/utf8.pm +++ b/lib/utf8.pm @@ -50,22 +50,18 @@ program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based platforms). The C pragma tells Perl to switch back to treating the source text as literal bytes in the current lexical scope. -This pragma is primarily a compatibility device. Perl versions -earlier than 5.6 allowed arbitrary bytes in source code, whereas -in future we would like to standardize on the UTF-8 encoding for -source text. - B The utility functions described below are -useful for their own purposes, but they are not really part of the -"pragmatic" effect. +directly usable without C. + +Because it is not possible to reliably tell UTF-8 from native 8 bit +encodings, you need either a Byte Order Mark at the beginning of your +source code, or C, to instruct perl. -Until UTF-8 becomes the default format for source text, either this -pragma or the L pragma should be used to recognize UTF-8 -in the source. When UTF-8 becomes the standard source format, this -pragma will effectively become a no-op. For convenience in what -follows the term I is used to refer to UTF-8 on ASCII and ISO -Latin based platforms and UTF-EBCDIC on EBCDIC based platforms. +When UTF-8 becomes the standard source format, this pragma will +effectively become a no-op. For convenience in what follows the term +I is used to refer to UTF-8 on ASCII and ISO Latin based +platforms and UTF-EBCDIC on EBCDIC based platforms. See also the effects of the C<-C> switch and its cousin, the C<$ENV{PERL_UNICODE}>, in L. @@ -93,21 +89,6 @@ UTF-X. If you want to have such bytes under C, you can disable this pragma until the end the block (or file, if at top level) by C. -If you want to automatically upgrade your 8-bit legacy bytes to Unicode, -use the L pragma instead of this pragma. 
For example, if -you want to implicitly upgrade your ISO 8859-1 (Latin-1) bytes to Unicode -as used in e.g. C and C<\x{...}>, try this: - - use encoding "latin-1"; - my $c = chr(0xc4); - my $x = "\x{c5}"; - -In case you are wondering: C is mostly the same as -C, except that C marks all string literals in the -source code as Unicode, regardless of whether they contain any high-bit bytes. -Moreover, C installs IO layers on C and C to work -with Unicode strings; see L for details. - =head2 Utility functions The following functions are defined in the C package by the @@ -118,64 +99,69 @@ you should not say that unless you really want to have UTF-8 source code. =item * $num_octets = utf8::upgrade($string) -Converts in-place the octet sequence in the native encoding +Converts in-place the internal octet sequence in the native encoding (Latin-1 or EBCDIC) to the equivalent character sequence in I. -I<$string> already encoded as characters does no harm. -Returns the number of octets necessary to represent the string as I. -Can be used to make sure that the UTF-8 flag is on, -so that C<\w> or C work as Unicode on strings -containing characters in the range 0x80-0xFF (on ASCII and -derivatives). +I<$string> already encoded as characters does no harm. Returns the +number of octets necessary to represent the string as I. Can be +used to make sure that the UTF-8 flag is on, so that C<\w> or C +work as Unicode on strings containing characters in the range 0x80-0xFF +(on ASCII and derivatives). B -Therefore I is recommended for the general purposes. - -Affected by the encoding pragma. +Therefore Encode is recommended for the general purposes; see also +L. =item * $success = utf8::downgrade($string[, FAIL_OK]) -Converts in-place the character sequence in I -to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC). -I<$string> already encoded as octets does no harm. -Returns true on success. On failure dies or, if the value of -C is true, returns false. 
-Can be used to make sure that the UTF-8 flag is off, -e.g. when you want to make sure that the substr() or length() function -works with the usually faster byte algorithm. +Converts in-place the internal octet sequence in I to the +equivalent octet sequence in the native encoding (Latin-1 or EBCDIC). +I<$string> already encoded as native 8 bit does no harm. Can be used to +make sure that the UTF-8 flag is off, e.g. when you want to make sure +that the substr() or length() function works with the usually faster +byte algorithm. -B -Therefore I is recommended for the general purposes. +Fails if the original I sequence cannot be represented in the +native 8 bit encoding. On failure dies or, if the value of C is +true, returns false. -B affected by the encoding pragma. +Returns true on success. + +B +Therefore Encode is recommended for the general purposes; see also +L. -B this function is experimental and may change -or be removed without notice. +B this function is experimental and may change or be removed +without notice. =item * utf8::encode($string) -Converts in-place the character sequence to the corresponding octet sequence -in I. The UTF-8 flag is turned off. Returns nothing. +Converts in-place the character sequence to the corresponding octet +sequence in I. The UTF8 flag is turned off, so that after this +operation, the string is a byte string. Returns nothing. B -Therefore I is recommended for the general purposes. +Therefore Encode is recommended for the general purposes; see also +L. -=item * utf8::decode($string) +=item * $success = utf8::decode($string) -Attempts to convert in-place the octet sequence in I -to the corresponding character sequence. The UTF-8 flag is turned on -only if the source string contains multiple-byte I characters. -If I<$string> is invalid as I, returns false; otherwise returns true. +Attempts to convert in-place the octet sequence in I to the +corresponding character sequence. 
The UTF-8 flag is turned on only if +the source string contains multiple-byte I characters. If +I<$string> is invalid as I, returns false; otherwise returns +true. B -Therefore I is recommended for the general purposes. +Therefore Encode is recommended for the general purposes; see also +L. -B this function is experimental and may change -or be removed without notice. +B this function is experimental and may change or be removed +without notice. =item * $flag = utf8::is_utf8(STRING) -(Since Perl 5.8.1) Test whether STRING is in UTF-8. Functionally -the same as Encode::is_utf8(). +(Since Perl 5.8.1) Test whether STRING is in UTF-8 internally. +Functionally the same as Encode::is_utf8(). =item * $flag = utf8::valid(STRING) @@ -213,6 +199,6 @@ portable answers. =head1 SEE ALSO -L, L, L, L, L +L, L, L, L, L =cut diff --git a/pod/perlapi.pod b/pod/perlapi.pod index 00468b6..b49f1ee 100644 --- a/pod/perlapi.pod +++ b/pod/perlapi.pod @@ -5413,7 +5413,7 @@ X Copies a stringified representation of the source SV into the destination SV. Automatically performs any necessary mg_get and coercion of numeric values into strings. Guaranteed to preserve -UTF-8 flag even from overloaded objects. Similar in nature to +UTF8 flag even from overloaded objects. Similar in nature to sv_2pv[_flags] but operates directly on an SV instead of just the string. Mostly uses sv_2pv_flags to do its work, except when that would lose the UTF-8'ness of the PV. diff --git a/pod/perldata.pod b/pod/perldata.pod index cbfe070..c960a0e 100644 --- a/pod/perldata.pod +++ b/pod/perldata.pod @@ -385,7 +385,7 @@ Unicode strings, and for comparing version "numbers" using the string comparison operators, C, C, C etc. If there are two or more dots in the literal, the leading C may be omitted. 
- print v9786; # prints UTF-8 encoded SMILEY, "\x{263a}" + print v9786; # prints SMILEY, "\x{263a}" print v102.111.111; # prints "foo" print 102.111.111; # same diff --git a/pod/perldiag.pod b/pod/perldiag.pod index 4651661..1b01b6b 100644 --- a/pod/perldiag.pod +++ b/pod/perldiag.pod @@ -2263,12 +2263,19 @@ when the function is called. =item Malformed UTF-8 character (%s) -(S utf8) (F) Perl detected something that didn't comply with UTF-8 -encoding rules. +(S utf8) (F) Perl detected a string that didn't comply with UTF-8 +encoding rules, even though it had the UTF8 flag on. -One possible cause is that you read in data that you thought to be in -UTF-8 but it wasn't (it was for example legacy 8-bit data). Another -possibility is careless use of utf8::upgrade(). +One possible cause is that you set the UTF8 flag yourself for data that +you thought to be in UTF-8 but it wasn't (it was for example legacy +8-bit data). To guard against this, you can use Encode::decode_utf8. + +If you use the C<:encoding(UTF-8)> PerlIO layer for input, invalid byte +sequences are handled gracefully, but if you use C<:utf8>, the flag is +set without validating the data, possibly resulting in this error +message. + +See also L. =item Malformed UTF-16 surrogate diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod index 3e2c57a..90da492 100644 --- a/pod/perlfunc.pod +++ b/pod/perlfunc.pod @@ -759,10 +759,6 @@ You can actually chomp anything that's an lvalue, including an assignment: If you chomp a list, each element is chomped, and the total number of characters removed is returned. -If the C pragma is in scope then the lengths returned are -calculated from the length of C<$/> in Unicode characters, which is not -always the same as the length of C<$/> in the native encoding. - Note that parentheses are necessary when you're chomping anything that is not a simple variable. 
This is because C is interpreted as C<(chomp $cwd) = `pwd`;>, rather than as @@ -839,9 +835,7 @@ X X X X Returns the character represented by that NUMBER in the character set. For example, C is C<"A"> in either ASCII or Unicode, and -chr(0x263a) is a Unicode smiley face. Note that characters from 128 -to 255 (inclusive) are by default not encoded in UTF-8 Unicode for -backward compatibility reasons (but see L). +chr(0x263a) is a Unicode smiley face. Negative values give the Unicode replacement character (chr(0xfffd)), except under the L pragma, where low eight bits of the value @@ -851,10 +845,10 @@ If NUMBER is omitted, uses C<$_>. For the reverse, use L. -Note that under the C pragma the NUMBER is masked to -the low eight bits. +Note that characters from 128 to 255 (inclusive) are by default +internally not encoded as UTF-8 for backward compatibility reasons. -See L and L for more about Unicode. +See L for more about Unicode. =item chroot FILENAME X X @@ -2664,7 +2658,11 @@ For that, use C and C respectively. Note the I: if the EXPR is in Unicode, you will get the number of characters, not the number of bytes. To get the length -in bytes, use C, see L. +of the internal string in bytes, use C, see +L. Note that the internal encoding is variable, and the number +of bytes usually meaningless. To get the number of bytes that the +string would have when encoded as UTF-8, use +C. =item link OLDFILE,NEWFILE X @@ -3113,7 +3111,7 @@ You may use the three-argument form of open to specify IO "layers" that affect how the input and output are processed (see L and L for more details). For example - open(FH, "<:utf8", "file") + open(FH, "<:encoding(UTF-8)", "file") will open the UTF-8 encoded file containing Unicode characters, see L. Note that if layers are specified in the @@ -3419,7 +3417,7 @@ or Unicode) value of the first character of EXPR. If EXPR is omitted, uses C<$_>. For the reverse, see L. -See L and L for more about Unicode. +See L for more about Unicode. 
=item our EXPR X X @@ -7000,13 +6998,10 @@ If an element off the end of the string is written to, Perl will first extend the string with sufficiently many zero bytes. It is an error to try to write off the beginning of the string (i.e. negative OFFSET). -The string should not contain any character with the value > 255 (which -can only happen if you're using UTF-8 encoding). If it does, it will be -treated as something that is not UTF-8 encoded. When the C was -assigned to, other parts of your program will also no longer consider the -string to be UTF-8 encoded. In other words, if you do have such characters -in your string, vec() will operate on the actual byte string, and not the -conceptual character string. +If the string happens to be encoded as UTF-8 internally (and thus has +the UTF8 flag set), this is ignored by C, and it operates on the +internal byte string, not the conceptual character string, even if you +only have characters with values less than 256. Strings created with C can also be manipulated with the logical operators C<|>, C<&>, C<^>, and C<~>. These operators will assume a bit diff --git a/pod/perlguts.pod b/pod/perlguts.pod index 36a0ea1..3a40e68 100644 --- a/pod/perlguts.pod +++ b/pod/perlguts.pod @@ -2431,8 +2431,8 @@ To fix this, some people formed Unicode, Inc. and produced a new character set containing all the characters you can possibly think of and more. There are several ways of representing these characters, and the one Perl uses is called UTF-8. UTF-8 uses -a variable number of bytes to represent a character, instead of just -one. You can learn more about Unicode at http://www.unicode.org/ +a variable number of bytes to represent a character. You can learn more +about Unicode and Perl's Unicode model in L. =head2 How can I recognise a UTF-8 string? @@ -2443,16 +2443,17 @@ C. Unfortunately, the non-Unicode string C has that byte sequence as well. 
So you can't tell just by looking - this is what makes Unicode input an interesting problem. -The API function C can help; it'll tell you if a string -contains only valid UTF-8 characters. However, it can't do the work for -you. On a character-by-character basis, C will tell you -whether the current character in a string is valid UTF-8. +In general, you either have to know what you're dealing with, or you +have to guess. The API function C can help; it'll tell +you if a string contains only valid UTF-8 characters. However, it can't +do the work for you. On a character-by-character basis, C +will tell you whether the current character in a string is valid UTF-8. =head2 How does UTF-8 represent Unicode characters? As mentioned above, UTF-8 uses a variable number of bytes to store a -character. Characters with values 1...128 are stored in one byte, just -like good ol' ASCII. Character 129 is stored as C; this +character. Characters with values 0...127 are stored in one byte, just +like good ol' ASCII. Character 128 is stored as C; this continues up to character 191, which is C. Now we've run out of bits (191 is binary C<10111111>) so we move on; 192 is C. And so it goes on, moving to three bytes at character 2048. @@ -2509,9 +2510,11 @@ So don't do that! =head2 How does Perl store UTF-8 strings? Currently, Perl deals with Unicode strings and non-Unicode strings -slightly differently. If a string has been identified as being UTF-8 -encoded, Perl will set a flag in the SV, C. You can check and -manipulate this flag with the following macros: +slightly differently. A flag in the SV, C, indicates that the +string is internally encoded as UTF-8. Without it, the byte value is the +codepoint number and vice versa (in other words, the string is encoded +as iso-8859-1). You can check and manipulate this flag with the +following macros: SvUTF8(sv) SvUTF8_on(sv) @@ -2523,7 +2526,7 @@ C, C and other string handling operations will have undesirable results. 
The problem comes when you have, for instance, a string that isn't -flagged is UTF-8, and contains a byte sequence that could be UTF-8 - +flagged as UTF-8, and contains a byte sequence that could be UTF-8 - especially when combining non-UTF-8 and UTF-8 strings. Never forget that the C flag is separate to the PV value; you @@ -2541,7 +2544,7 @@ manipulating SVs. More specifically, you cannot expect to do this: The C string does not tell you the whole story, and you can't copy or reconstruct an SV just by copying the string value. Check if the -old SV has the UTF-8 flag set, and act accordingly: +old SV has the UTF8 flag set, and act accordingly: p = SvPV(sv, len); frobnicate(p); @@ -2554,14 +2557,14 @@ not it's dealing with UTF-8 data, so that it can handle the string appropriately. Since just passing an SV to an XS function and copying the data of -the SV is not enough to copy the UTF-8 flags, even less right is just +the SV is not enough to copy the UTF8 flags, even less right is just passing a C to an XS function. =head2 How do I convert a string to UTF-8? -If you're mixing UTF-8 and non-UTF-8 strings, you might find it necessary -to upgrade one of the strings to UTF-8. If you've got an SV, the easiest -way to do this is: +If you're mixing UTF-8 and non-UTF-8 strings, it is necessary to upgrade +one of the strings to UTF-8. If you've got an SV, the easiest way to do +this is: sv_utf8_upgrade(sv); @@ -2572,7 +2575,7 @@ However, you must not do this, for example: If you do this in a binary operator, you will actually change one of the strings that came into the operator, and, while it shouldn't be noticeable -by the end user, it can cause problems. +by the end user, it can cause problems in deficient code. Instead, C will give you a UTF-8-encoded B of its string argument. 
This is useful for having the data available for diff --git a/pod/perlpacktut.pod b/pod/perlpacktut.pod index 1cb127e..d907b18 100644 --- a/pod/perlpacktut.pod +++ b/pod/perlpacktut.pod @@ -633,24 +633,36 @@ The UTF-8 encoding avoids this by storing the most common (from a western point of view) characters in a single byte while encoding the rarer ones in three or more bytes. -So what has this got to do with C? Well, if you want to convert -between a Unicode number and its UTF-8 representation you can do so by -using template code C. As an example, let's produce the UTF-8 -representation of the Euro currency symbol (code number 0x20AC): +Perl uses UTF-8, internally, for most Unicode strings. + +So what has this got to do with C? Well, if you want to compose a +Unicode string (that is internally encoded as UTF-8), you can do so by +using template code C. As an example, let's produce the Euro currency +symbol (code number 0x20AC): $UTF8{Euro} = pack( 'U', 0x20AC ); + # Equivalent to: $UTF8{Euro} = "\x{20ac}"; -Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes: "\xe2\x82\xac". The -round trip can be completed with C: +Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes: +"\xe2\x82\xac". However, it contains only 1 character, number 0x20AC. +The round trip can be completed with C: $Unicode{Euro} = unpack( 'U', $UTF8{Euro} ); +Unpacking using the C template code also works on UTF-8 encoded byte +strings. + Usually you'll want to pack or unpack UTF-8 strings: # pack and unpack the Hebrew alphabet my $alefbet = pack( 'U*', 0x05d0..0x05ea ); my @hebrew = unpack( 'U*', $utf ); +Please note: in the general case, you're better off using +Encode::decode_utf8 to decode a UTF-8 encoded byte string to a Perl +unicode string, and Encode::encode_utf8 to encode a Perl unicode string +to UTF-8 bytes. These functions provide means of handling invalid byte +sequences and generally have a friendlier interface. 
=head2 Another Portable Binary Encoding diff --git a/pod/perlport.pod b/pod/perlport.pod index 4905ad6..e7a8ca5 100644 --- a/pod/perlport.pod +++ b/pod/perlport.pod @@ -672,12 +672,9 @@ ISO 8859-1 bytes beyond 0x7f into your strings might cause trouble later. If the bytes are native 8-bit bytes, you can use the C pragma. If the bytes are in a string (regular expression being a curious string), you can often also use the C<\xHH> notation instead -of embedding the bytes as-is. If they are in some particular legacy -encoding (ether single-byte or something more complicated), you can -use the C pragma. (If you want to write your code in UTF-8, -you can use either the C pragma, or the C pragma.) -The C and C pragmata are available since Perl 5.6.0, and -the C pragma since Perl 5.8.0. +of embedding the bytes as-is. (If you want to write your code in UTF-8, +you can use the C.) The C and C pragmata are +available since Perl 5.6.0. =head2 System Resources diff --git a/pod/perlretut.pod b/pod/perlretut.pod index c1f37fe..da3e82c 100644 --- a/pod/perlretut.pod +++ b/pod/perlretut.pod @@ -1841,27 +1841,21 @@ substituted. With the advent of 5.6.0, Perl regexps can handle more than just the standard ASCII character set. Perl now supports I, a standard for representing the alphabets from virtually all of the world's written -languages, and a host of symbols. Perl uses the UTF-8 encoding, in which -ASCII characters are still encoded as one byte, but characters greater -than C may be stored as two or more bytes. +languages, and a host of symbols. Perl's text strings are unicode strings, so +they can contain characters with a value (codepoint or character number) higher +than 255 What does this mean for regexps? Well, regexp users don't need to know much about Perl's internal representation of strings. 
But they do need -to know 1) how to represent Unicode characters in a regexp and 2) when -a matching operation will treat the string to be searched as a -sequence of bytes (the old way) or as a sequence of Unicode characters -(the new way). The answer to 1) is that Unicode characters greater -than C may be represented using the C<\x{hex}> notation, -with C a hexadecimal integer: +to know 1) how to represent Unicode characters in a regexp and 2) that +a matching operation will treat the string to be searched as a sequence +of characters, not bytes. The answer to 1) is that Unicode characters +greater than C are represented using the C<\x{hex}> notation, +because the \0 octal and \x hex (without curly braces) don't go further +than 255. /\x{263a}/; # match a Unicode smiley face :) -Unicode characters in the range of 128-255 use two hexadecimal digits -with braces: C<\x{ab}>. Note that this is in general different than -C<\xab>, which is just a hexadecimal byte with no Unicode significance, -except when your script is encoded in UTF-8 where C<\xab> has the -same byte representation as C<\x{ab}>. - B: In Perl 5.6.0 it used to be that one needed to say C to use any Unicode features. This is no more the case: for almost all Unicode processing, the explicit C pragma is not @@ -1896,34 +1890,17 @@ A list of full names is found in the file NamesList.txt in the lib/perl5/X.X.X/unicore directory (where X.X.X is the perl version number as it is installed on your system). -The answer to requirement 2), as of 5.6.0, is that if a regexp -contains Unicode characters, the string is searched as a sequence of -Unicode characters. Otherwise, the string is searched as a sequence of -bytes. If the string is being searched as a sequence of Unicode -characters, but matching a single byte is required, we can use the C<\C> -escape sequence. C<\C> is a character class akin to C<.> except that -it matches I byte 0-255. 
So - - use charnames ":full"; # use named chars with Unicode full names - $x = "a"; - $x =~ /\C/; # matches 'a', eats one byte - $x = ""; - $x =~ /\C/; # doesn't match, no bytes to match - $x = "\N{MERCURY}"; # two-byte Unicode character - $x =~ /\C/; # matches, but dangerous! - -The last regexp matches, but is dangerous because the string -I position is no longer synchronized to the string I -position. This generates the warning 'Malformed UTF-8 -character'. The C<\C> is best used for matching the binary data in strings -with binary data intermixed with Unicode characters. - -Let us now discuss the rest of the character classes. Just as with -Unicode characters, there are named Unicode character classes -represented by the C<\p{name}> escape sequence. Closely associated is -the C<\P{name}> character class, which is the negation of the -C<\p{name}> class. For example, to match lower and uppercase -characters, +The answer to requirement 2), as of 5.6.0, is that a regexp uses unicode +characters. Internally, this is encoded to bytes using either UTF-8 or a +native 8 bit encoding, depending on the history of the string, but +conceptually it is a sequence of characters, not bytes. See +L for a tutorial about that. + +Let us now discuss Unicode character classes. Just as with Unicode +characters, there are named Unicode character classes represented by the +C<\p{name}> escape sequence. Closely associated is the C<\P{name}> +character class, which is the negation of the C<\p{name}> class. For +example, to match lower and uppercase characters, use charnames ":full"; # use named chars with Unicode full names $x = "BOB"; diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index 1a49f04..c913047 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -10,6 +10,10 @@ Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features. 
+People who want to learn to use Unicode in Perl, should probably read +L before reading this reference +document. + =over 4 =item Input and Output Layers @@ -20,15 +24,15 @@ the ":utf8" layer. Other encodings can be converted to Perl's encoding on input or from Perl's encoding on output by use of the ":encoding(...)" layer. See L. -To indicate that Perl source itself is using a particular encoding, -see L. +To indicate that Perl source itself is in UTF-8, use C. =item Regular Expressions The regular expression compiler produces polymorphic opcodes. That is, the pattern adapts to the data and automatically switches to the Unicode -character scheme when presented with Unicode data--or instead uses -a traditional byte scheme when presented with byte data. +character scheme when presented with data that is internally encoded in +UTF-8 -- or instead uses a traditional byte scheme when presented with +byte data. =item C still needed to enable UTF-8/UTF-EBCDIC in scripts @@ -39,9 +43,6 @@ ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based machines. B is needed.> See L. -You can also use the C pragma to change the default encoding -of the data in your script; see L. - =item BOM-marked scripts and UTF-16 scripts autodetected If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF16-BE, @@ -58,11 +59,6 @@ they were encoded in I, but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happens to agree with Latin-1. -If you wish to interpret byte strings as UTF-8 instead, use the -C pragma: - - use encoding 'utf8'; - See L for more details. =back @@ -112,9 +108,7 @@ If strings operating under byte semantics and strings with Unicode character data are concatenated, the new string will be created by decoding the byte strings as I, even if the old Unicode string used EBCDIC. This translation is done without -regard to the system's native 8-bit encoding. 
To change this for -systems with non-Latin-1 and non-EBCDIC native encodings, use the -C pragma. See L. +regard to the system's native 8-bit encoding. Under character semantics, many operations that formerly operated on bytes now operate on characters. A character in Perl is @@ -134,17 +128,16 @@ Character semantics have the following effects: Strings--including hash keys--and regular expression patterns may contain characters that have an ordinal value larger than 255. -If you use a Unicode editor to edit your program, Unicode characters -may occur directly within the literal strings in one of the various -Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized -as such and converted to Perl's internal representation only if the -appropriate L is specified. +If you use a Unicode editor to edit your program, Unicode characters may +occur directly within the literal strings in UTF-8 encoding, or UTF-16. +(The former requires a BOM or C<use utf8>, the latter requires a BOM.) -Unicode characters can also be added to a string by using the -C<\x{...}> notation. The Unicode code for the desired character, in -hexadecimal, should be placed in the braces. For instance, a smiley -face is C<\x{263A}>. This encoding scheme only works for characters -with a code of 0x100 or above. +Unicode characters can also be added to a string by using the C<\x{...}> +notation. The Unicode code for the desired character, in hexadecimal, +should be placed in the braces. For instance, a smiley face is +C<\x{263A}>. This encoding scheme works for all characters, but +for characters under 0x100, note that Perl may use an 8 bit encoding +internally, for optimization and/or backward compatibility. Additionally, if you @@ -163,8 +156,7 @@ names. =item * Regular expressions match characters instead of bytes. "." matches -a character instead of a byte. The C<\C> pattern is provided to force -a match a single byte--a C in C, hence C<\C>. +a character instead of a byte.
=item * @@ -173,17 +165,13 @@ bytes and match against the character properties specified in the Unicode properties database. C<\w> can be used to match a Japanese ideograph, for instance. -(However, and as a limitation of the current implementation, using -C<\w> or C<\W> I a C<[...]> character class will still match -with byte semantics.) - =item * Named Unicode properties, scripts, and block ranges may be used like character classes via the C<\p{}> "matches property" construct and the C<\P{}> negation, "doesn't match property". -See L for more details. +See L for more details. You can define your own character properties and use them in the regular expression with the C<\p{}> or C<\P{}> construct. @@ -1441,7 +1429,7 @@ Unicode is discouraged. =head2 Interaction with Extensions When Perl exchanges data with an extension, the extension should be -able to understand the UTF-8 flag and act accordingly. If the +able to understand the UTF8 flag and act accordingly. If the extension doesn't know about the flag, it's likely that the extension will return incorrectly-flagged data. @@ -1544,7 +1532,7 @@ A scalar that is going to be passed to some extension Be it Compress::Zlib, Apache::Request or any extension that has no mention of Unicode in the manpage, you need to make sure that the -UTF-8 flag is stripped off. Note that at the time of this writing +UTF8 flag is stripped off. Note that at the time of this writing (October 2002) the mentioned modules are not UTF-8-aware. Please check the documentation to verify if this is still true. @@ -1558,7 +1546,7 @@ check the documentation to verify if this is still true. 
A scalar we got back from an extension If you believe the scalar comes back as UTF-8, you will most likely -want the UTF-8 flag restored: +want the UTF8 flag restored: if ($] > 5.007) { require Encode; @@ -1620,7 +1608,7 @@ A large scalar that you know can only contain ASCII Scalars that contain only ASCII and are marked as UTF-8 are sometimes a drag to your program. If you recognize such a situation, just remove -the UTF-8 flag: +the UTF8 flag: utf8::downgrade($val) if $] > 5.007; @@ -1628,7 +1616,7 @@ the UTF-8 flag: =head1 SEE ALSO -L, L, L, L, L, L, +L, L, L, L, L, L, L, L =cut diff --git a/pod/perlunifaq.pod b/pod/perlunifaq.pod new file mode 100644 index 0000000..4b2290a --- /dev/null +++ b/pod/perlunifaq.pod @@ -0,0 +1,248 @@ +=head1 NAME + +perlunifaq - Perl Unicode FAQ + +=head1 DESCRIPTION + +This is a list of questions and answers about Unicode in Perl, intended to be +read after L<perlunitut>. + +=head2 perlunitut isn't really a Unicode tutorial, is it? + +No, and this isn't really a Unicode FAQ. + +Perl has an abstracted interface for all supported character encodings, so these +are actually a generic C<Encode> tutorial and C<Encode> FAQ. But many people +think that Unicode is special and magical, and I didn't want to disappoint +them, so I decided to call the document a Unicode tutorial. + +=head2 What about binary data, like images? + +Well, apart from a bare C<binmode>, you shouldn't treat them specially. +(The binmode is needed because otherwise Perl may convert line endings on Win32 +systems.) + +Be careful, though, to never combine text strings with binary strings. If you +need text in a binary stream, encode your text strings first using the +appropriate encoding, then join them with binary strings. See also: "What if I +don't encode?". + +=head2 What about the UTF8 flag? + +Please, unless you're hacking the internals, or debugging weirdness, don't +think about the UTF8 flag at all. That means that you very probably shouldn't +use C<is_utf8>, C<_utf8_on> or C<_utf8_off> at all.
+ +Perl's internal format happens to be UTF-8. Unfortunately, Perl can't keep a +secret, so everyone knows about this. That is the source of much confusion. +It's better to pretend that the internal format is some unknown encoding, +and that you always have to encode and decode explicitly. + +=head2 When should I decode or encode? + +Whenever you're communicating with anything that is external to your perl +process, like a database, a text file, a socket, or another program. Even if +the thing you're communicating with is also written in Perl. + +=head2 What if I don't decode? + +Whenever your encoded, binary string is used together with a text string, Perl +will assume that your binary string was encoded with ISO-8859-1, also known as +latin-1. If it wasn't latin-1, then your data is unpleasantly converted. For +example, if it was UTF-8, the individual bytes of multibyte characters are seen +as separate characters, and then again converted to UTF-8. Such double encoding +can be compared to double HTML encoding (C<&amp;gt;>), or double URI encoding +(C<%253E>). + +This silent implicit decoding is known as "upgrading". That may sound +positive, but it's best to avoid it. + +=head2 What if I don't encode? + +Your text string will be sent using the bytes in Perl's internal format. In +some cases, Perl will warn you that you're doing something wrong, with a +friendly warning: + + Wide character in print at example.pl line 2. + +Because the internal format is often UTF-8, these bugs are hard to spot, +because UTF-8 is usually the encoding you wanted! But don't be lazy, and don't +use the fact that Perl's internal format is UTF-8 to your advantage. Encode +explicitly to avoid weird bugs, and to show to maintenance programmers that you +thought this through. + +=head2 Is there a way to automatically decode or encode?
+ +If all data that comes from a certain handle is encoded in exactly the same +way, you can tell the PerlIO system to automatically decode everything, with +the C<:encoding> layer. If you do this, you can't accidentally forget to decode +or encode anymore, on things that use the layered handle. + +You can provide this layer when C<open>ing the file: + + open my $fh, '>:encoding(UTF-8)', $filename; # auto encoding on write + open my $fh, '<:encoding(UTF-8)', $filename; # auto decoding on read + +Or if you already have an open filehandle: + + binmode $fh, ':encoding(UTF-8)'; + +Some database drivers for DBI can also automatically encode and decode, but +that is typically limited to the UTF-8 encoding, because they cheat. + +=head2 Cheat?! Tell me, how can I cheat? + +Well, because Perl's internal format is UTF-8, you can just skip the encoding +or decoding step, and manipulate the UTF8 flag directly. + +Instead of C<:encoding(UTF-8)>, you can simply use C<:utf8>. This is widely +accepted as good behavior when you're writing, but it can be dangerous when +reading, because it causes internal inconsistency when you have invalid byte +sequences. + +Instead of C<decode> and C<encode>, you could use C<_utf8_on> and C<_utf8_off>, +but this is considered bad style. Especially C<_utf8_on> can be dangerous, for +the same reason that C<:utf8> can. + +There are some shortcuts for oneliners; see C<-C> in L<perlrun>. + +=head2 What if I don't know which encoding was used? + +Do whatever you can to find out, and if you have to: guess. (Don't forget to +document your guess with a comment.) + +You could open the document in a web browser, and change the character set or +character encoding until you can visually confirm that all characters look the +way they should. + +There is no way to reliably detect the encoding automatically, so if people +keep sending you data without charset indication, you may have to educate them. + +=head2 Can I use Unicode in my Perl sources? + +Yes, you can!
If your sources are UTF-8 encoded, you can indicate that with the +C<use utf8> pragma. + + use utf8; + +This doesn't do anything to your input, or to your output. It only influences +the way your sources are read. You can use Unicode in string literals, in +identifiers (but they still have to be "word characters" according to C<\w>), +and even in custom delimiters. + +=head2 Data::Dumper doesn't restore the UTF8 flag; is it broken? + +No, Data::Dumper's Unicode abilities are as they should be. There have been +some complaints that it should restore the UTF8 flag when the data is read +again with C<eval>. However, you should really not look at the flag, and +nothing indicates that Data::Dumper should break this rule. + +Here's what happens: when Perl reads in a string literal, it sticks to 8 bit +encoding as long as it can. (But perhaps originally it was internally encoded +as UTF-8, when you dumped it.) When it has to give that up because other +characters are added to the text string, it silently upgrades the string to +UTF-8. + +If you properly encode your strings for output, none of this is of your +concern, and you can just C<eval> dumped data as always. + +=head2 How can I determine if a string is a text string or a binary string? + +You can't. Some use the UTF8 flag for this, but that's misuse, and makes well +behaved modules like Data::Dumper look bad. The flag is useless for this +purpose, because it's off when an 8 bit encoding (by default ISO-8859-1) is +used to store the string. + +This is something you, the programmer, have to keep track of; sorry. You could +consider adopting a kind of "Hungarian notation" to help with this. + +=head2 How do I convert from encoding FOO to encoding BAR?
+ +By first converting the FOO-encoded byte string to a text string, and then the +text string to a BAR-encoded byte string: + + my $text_string = decode('FOO', $foo_string); + my $bar_string = encode('BAR', $text_string); + +or by skipping the text string part, and going directly from one binary +encoding to the other: + + use Encode qw(from_to); + from_to($string, 'FOO', 'BAR'); # changes contents of $string + +or by letting automatic decoding and encoding do all the work: + + open my $foofh, '<:encoding(FOO)', 'example.foo.txt'; + open my $barfh, '>:encoding(BAR)', 'example.bar.txt'; + print { $barfh } $_ while <$foofh>; + +=head2 What about the C<use bytes> pragma? + +Don't use it. It makes no sense to deal with bytes in a text string, and it +makes no sense to deal with characters in a byte string. Do the proper +conversions (by decoding/encoding), and things will work out well: you get +character counts for decoded data, and byte counts for encoded data. + +C<use bytes> is usually a failed attempt to do something useful. Just forget +about it. + +=head2 What are C<decode_utf8> and C<encode_utf8>? + +These are alternate syntaxes for C<decode('utf8', ...)> and C<encode('utf8', ...)>. + +=head2 What's the difference between C<UTF-8> and C<utf8>? + +C<UTF-8> is the official standard. C<utf8> is Perl's way of being liberal in +what it accepts. If you have to communicate with things that aren't so liberal, +you may want to consider using C<UTF-8>. If you have to communicate with things +that are too liberal, you may have to use C<utf8>. The full explanation is in +L<Encode>. + +C<UTF-8> is internally known as C<utf-8-strict>. The tutorial uses UTF-8 +consistently, even where utf8 is actually used internally, because the +distinction can be hard to make, and is mostly irrelevant. + +For example, utf8 can be used for code points that don't exist in Unicode, like +9999999, but if you encode that to UTF-8, you get a substitution character (by +default; see L<Encode/"Handling Malformed Data"> for more ways of dealing with +this.) + +Okay, if you insist: the "internal format" is utf8, not UTF-8. (When it's not +some other encoding.)
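The decode-then-encode route and the "character counts for decoded data, byte counts for encoded data" rule from the answers above can be sketched roughly as follows. This is only an illustration, not part of the patch: the sample byte and the choice of ISO-8859-1 and UTF-8 as the FOO and BAR encodings are assumptions, as are the variable names.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# Assumed sample input: U+00E9 (e with acute) stored as one ISO-8859-1 byte.
my $latin1_bytes = "\xE9";

# Decode: FOO-encoded byte string -> text string.
my $text = decode('ISO-8859-1', $latin1_bytes);

# Encode: text string -> BAR-encoded byte string.
my $utf8_bytes = encode('UTF-8', $text);

# length() counts characters on decoded data and bytes on encoded data.
printf "text string:  %d character(s)\n", length $text;
printf "UTF-8 string: %d byte(s)\n",      length $utf8_bytes;

# from_to() skips the text string step and rewrites its argument in place.
my $in_place = "\xE9";
from_to($in_place, 'ISO-8859-1', 'UTF-8');
print "from_to() produced the same bytes\n" if $in_place eq $utf8_bytes;
```

Neither step needs to look at the UTF8 flag; the program only has to know which variables hold text and which hold encoded bytes.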
+ +=head2 I lost track; what encoding is the internal format really? + +It's good that you lost track, because you shouldn't depend on the internal +format being any specific encoding. But since you asked: by default, the +internal format is either ISO-8859-1 (latin-1), or utf8, depending on the +history of the string. On EBCDIC platforms, this may even be different. + +Perl knows how it stored the string internally, and will use that knowledge +when you C<encode>. In other words: don't try to find out what the internal +encoding for a certain string is, but instead just encode it into the encoding +that you want. + +=head2 What character encodings does Perl support? + +To find out which character encodings your Perl supports, run: + + perl -MEncode -le "print for Encode->encodings(':all')" + +=head2 Which version of perl should I use? + +Well, if you can, upgrade to the most recent, but certainly C<5.8.1> or newer. +The tutorial and FAQ are based on the status quo as of C<5.8.8>. + +You should also check your modules, and upgrade them if necessary. For example, +HTML::Entities requires version >= 1.32 to function correctly, even though the +changelog is silent about this. + +=head1 AUTHOR + +Juerd Waalboer + +=head1 SEE ALSO + +L, L, L + diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod index b0d5859..9337e5f 100644 --- a/pod/perluniintro.pod +++ b/pod/perluniintro.pod @@ -278,21 +278,7 @@ encodings, I/O, and certain special cases: When you combine legacy data and Unicode the legacy data needs to be upgraded to Unicode. Normally ISO 8859-1 (or EBCDIC, if -applicable) is assumed. You can override this assumption by -using the C pragma, for example - - use encoding 'latin2'; # ISO 8859-2 - -in which case literals (string or regular expressions), C, -and C in your whole script are assumed to produce Unicode -characters from ISO 8859-2 code points. Note that the matching for -encoding names is forgiving: instead of C you could have -said C, or C, or other variations.
With just - - use encoding; - -the environment variable C will be consulted. -If that variable isn't set, the encoding pragma will fail. +applicable) is assumed. The C module knows about many encodings and has interfaces for doing conversions between those encodings: @@ -404,8 +390,8 @@ the file "text.utf8", encoded as UTF-8: while (<$nihongo>) { print $unicode $_ } The naming of encodings, both by the C and by the C -pragma, is similar to the C pragma in that it allows for -flexible names: C and C will both be understood. +pragma allows for flexible names: C and C will both be +understood. Common encodings recognized by ISO, MIME, IANA, and various other standardisation organisations are recognised; for a more detailed @@ -885,7 +871,7 @@ to UTF-8 bytes and back, the code works even with older Perl 5 versions. =head1 SEE ALSO -L, L, L, L, L, L, +L, L, L, L, L, L, L, L, L, L, L diff --git a/pod/perlunitut.pod b/pod/perlunitut.pod index ae8d0b1..5328049 100644 --- a/pod/perlunitut.pod +++ b/pod/perlunitut.pod @@ -64,6 +64,9 @@ B, or B are made of characters. Bytes are irrelevant here, and so are encodings. Each character is just that: the character. +Text strings are also called B, because in Perl, every text +string is a Unicode string. + On a text string, you would do things like: $text =~ s/foo/bar/; @@ -170,234 +173,15 @@ known. "Content-Type: text/plain; charset=UTF-8", "Content-Length: $byte_count" -=head2 Q and A - -=head3 This isn't really a Unicode tutorial, is it? - -No, Perl has an abstracted interface for all supported character encodings, so -this is actually a generic C tutorial. But many people think that -Unicode is special and magical, and I didn't want to disappoint them, so I -decided to call this document a Unicode tutorial. - -=head3 What about binary data, like images? - -Well, apart from a bare C, you shouldn't treat them specially. -(The binmode is needed because otherwise Perl may convert line endings on Win32 -systems.) 
- -Be careful, though, to never combine text strings with binary strings. If you -need text in a binary stream, encode your text strings first using the -appropriate encoding, then join them with binary strings. See also: "What if I -don't encode?". - -=head3 What about the UTF-8 flag? - -Please, unless you're hacking the internals, or debugging weirdness, don't -think about the UTF-8 flag at all. That means that you very probably shouldn't -use C, C<_utf8_on> or C<_utf8_off> at all. - -Perl's internal format happens to be UTF-8. Unfortunately, Perl can't keep a -secret, so everyone knows about this. That is the source of much confusion. -It's better to pretend that the internal format is some unknown encoding, -and that you always have to encode and decode explicitly. - -=head3 When should I decode or encode? - -Whenever you're communicating with anything that is external to your perl -process, like a database, a text file, a socket, or another program. Even if -the thing you're communicating with is also written in Perl. - -=head3 What if I don't decode? - -Whenever your encoded, binary string is used together with a text string, Perl -will assume that your binary string was encoded with ISO-8859-1, also known as -latin-1. If it wasn't latin-1, then your data is unpleasantly converted. For -example, if it was UTF-8, the individual bytes of multibyte characters are seen -as separate characters, and then again converted to UTF-8. Such double encoding -can be compared to double HTML encoding (C<&gt;>), or double URI encoding -(C<%253E>). - -This silent implicit decoding is known as "upgrading". That may sound -positive, but it's best to avoid it. - -=head3 What if I don't encode? - -Your text string will be sent using the bytes in Perl's internal format. In -some cases, Perl will warn you that you're doing something wrong, with a -friendly warning: - - Wide character in print at example.pl line 2. 
- -Because the internal format is often UTF-8, these bugs are hard to spot, -because UTF-8 is usually the encoding you wanted! But don't be lazy, and don't -use the fact that Perl's internal format is UTF-8 to your advantage. Encode -explicitly to avoid weird bugs, and to show to maintenance programmers that you -thought this through. - -=head3 Is there a way to automatically decode or encode? - -If all data that comes from a certain handle is encoded in exactly the same -way, you can tell the PerlIO system to automatically decode everything, with -the C layer. If you do this, you can't accidentally forget to decode -or encode anymore, on things that use the layered handle. - -You can provide this layer when Cing the file: - - open my $fh, '>:encoding(UTF-8)', $filename; # auto encoding on write - open my $fh, '<:encoding(UTF-8)', $filename; # auto decoding on read - -Or if you already have an open filehandle: - - binmode $fh, ':encoding(UTF-8)'; - -Some database drivers for DBI can also automatically encode and decode, but -that is typically limited to the UTF-8 encoding, because they cheat. - -=head3 Cheat?! Tell me, how can I cheat? - -Well, because Perl's internal format is UTF-8, you can just skip the encoding -or decoding step, and manipulate the UTF-8 flag directly. - -Instead of C<:encoding(UTF-8)>, you can simply use C<:utf8>. This is widely -accepted as good behavior. - -Instead of C and C, you could use C<_utf8_on> and C<_utf8_off>. -But this is, contrary to C<:utf8>, considered bad style. - -There are some shortcuts for oneliners; see C<-C> in L. - -=head3 What if I don't know which encoding was used? - -Do whatever you can to find out, and if you have to: guess. (Don't forget to -document your guess with a comment.) - -You could open the document in a web browser, and change the character set or -character encoding until you can visually confirm that all characters look the -way they should. 
- -There is no way to reliably detect the encoding automatically, so if people -keep sending you data without charset indication, you may have to educate them. - -=head3 Can I use Unicode in my Perl sources? - -Yes, you can! If your sources are UTF-8 encoded, you can indicate that with the -C pragma. - - use utf8; - -This doesn't do anything to your input, or to your output. It only influences -the way your sources are read. You can use Unicode in string literals, in -identifiers (but they still have to be "word characters" according to C<\w>), -and even in custom delimiters. - -=head3 Data::Dumper doesn't restore the UTF-8 flag; is it broken? - -No, Data::Dumper's Unicode abilities are as they should be. There have been -some complaints that it should restore the UTF-8 flag when the data is read -again with C. However, you should really not look at the flag, and -nothing indicates that Data::Dumper should break this rule. - -Here's what happens: when Perl reads in a string literal, it sticks to 8 bit -encoding as long as it can. (But perhaps originally it was internally encoded -as UTF-8, when you dumped it.) When it has to give that up because other -characters are added to the text string, it silently upgrades the string to -UTF-8. - -If you properly encode your strings for output, none of this is of your -concern, and you can just C dumped data as always. - -=head3 How can I determine if a string is a text string or a binary string? - -You can't. Some use the UTF-8 flag for this, but that's misuse, and makes well -behaved modules like Data::Dumper look bad. The flag is useless for this -purpose, because it's off when an 8 bit encoding (by default ISO-8859-1) is -used to store the string. - -This is something you, the programmer, has to keep track of; sorry. You could -consider adopting a kind of "Hungarian notation" to help with this. - -=head3 How do I convert from encoding FOO to encoding BAR? 
- -By first converting the FOO-encoded byte string to a text string, and then the -text string to a BAR-encoded byte string: - - my $text_string = decode('FOO', $foo_string); - my $bar_string = encode('BAR', $text_string); - -or by skipping the text string part, and going directly from one binary -encoding to the other: - - use Encode qw(from_to); - from_to($string, 'FOO', 'BAR'); # changes contents of $string - -or by letting automatic decoding and encoding do all the work: - - open my $foofh, '<:encoding(FOO)', 'example.foo.txt'; - open my $barfh, '>:encoding(BAR)', 'example.bar.txt'; - print { $barfh } $_ while <$foofh>; - -=head3 What about the C pragma? - -Don't use it. It makes no sense to deal with bytes in a text string, and it -makes no sense to deal with characters in a byte string. Do the proper -conversions (by decoding/encoding), and things will work out well: you get -character counts for decoded data, and byte counts for encoded data. - -C is usually a failed attempt to do something useful. Just forget -about it. - -=head3 What are C and C? - -These are alternate syntaxes for C and C. - -=head3 What's the difference between C and C? - -C is the official standard. C is Perl's way of being liberal in -what it accepts. If you have to communicate with things that aren't so liberal, -you may want to consider using C. If you have to communicate with things -that are too liberal, you may have to use C. The full explanation is in -L. - -C is internally known as C. This tutorial uses UTF-8 -consistently, even where utf8 is actually used internally, because the -distinction can be hard to make, and is mostly irrelevant. - -Okay, if you insist: the "internal format" is utf8, not UTF-8. (When it's not -some other encoding.) - -=head3 I lost track; what encoding is the internal format really? - -It's good that you lost track, because you shouldn't depend on the internal -format being any specific encoding. 
But since you asked: by default, the -internal format is either ISO-8859-1 (latin-1), or utf8, depending on the -history of the string. - -Perl knows how it stored the string internally, and will use that knowledge -when you C. In other words: don't try to find out what the internal -encoding for a certain string is, but instead just encode it into the encoding -that you want. - -=head3 What character encodings does Perl support? - -To find out which character encodings your Perl supports, run: - - perl -MEncode -le "print for Encode->encodings(':all')" - -=head3 Which version of perl should I use? - -Well, if you can, upgrade to the most recent, but certainly C<5.8.1> or newer. -This tutorial is based on the status quo as of C<5.8.7>. - -You should also check your modules, and upgrade them if necessary. For example, -HTML::Entities requires version >= 1.32 to function correctly, even though the -changelog is silent about this. - =head1 SUMMARY Decode everything you receive, encode everything you send out. (If it's text data.) +=head1 Q and A (or FAQ) + +After reading this document, you ought to read L too. + =head1 ACKNOWLEDGEMENTS Thanks to Johan Vromans from Squirrel Consultancy. His UTF-8 rants during the @@ -421,5 +205,5 @@ Juerd Waalboer =head1 SEE ALSO -L, L, L +L, L, L, L diff --git a/pod/perlvar.pod b/pod/perlvar.pod index 563a599..f5b098b 100644 --- a/pod/perlvar.pod +++ b/pod/perlvar.pod @@ -1310,7 +1310,7 @@ The revision, version, and subversion of the Perl interpreter, represented as a string composed of characters with those ordinals. Thus in Perl v5.6.0 it equals C and will return true for C<$^V eq v5.6.0>. Note that the characters in this string value can -potentially be in Unicode range. +potentially be greater than 255. This variable first appeared in perl 5.6.0; earlier versions of perl will see an undefined value. 
diff --git a/sv.c b/sv.c index 3d5a4f3..f24eb29 100644 --- a/sv.c +++ b/sv.c @@ -2809,7 +2809,7 @@ Perl_sv_2pv_flags(pTHX_ register SV *sv, STRLEN *lp, I32 flags) Copies a stringified representation of the source SV into the destination SV. Automatically performs any necessary mg_get and coercion of numeric values into strings. Guaranteed to preserve -UTF-8 flag even from overloaded objects. Similar in nature to +UTF8 flag even from overloaded objects. Similar in nature to sv_2pv[_flags] but operates directly on an SV instead of just the string. Mostly uses sv_2pv_flags to do its work, except when that would lose the UTF-8'ness of the PV.