X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperluniintro.pod;h=bee286f5eaad9f5f322c7fca99f191dd2bfdcd8f;hb=ee811f5e3bed4f099447f2fff9d2bdc4f0220f5e;hp=eadcedd74bf01f406a1acbd7abeeb49b7c8fd9a6;hpb=4c496f0cc0d05e588e924cab74c61dfe12f0f2cb;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index eadcedd..bee286f 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -24,7 +24,7 @@ Unicode 1.0 was released in October 1991, and 4.0 in April 2003.
 A Unicode I<character> is an abstract entity.  It is not bound to any
 particular integer width, especially not to the C language C<char>.
 Unicode is language-neutral and display-neutral: it does not encode the
-language of the text and it does not define fonts or other graphical
+language of the text, and it does not generally define fonts or other graphical
 layout details.  Unicode operates on characters and on text built from
 those characters.

@@ -45,25 +45,29 @@ these properties are independent of the names of the characters.
 Furthermore, various operations on the characters like uppercasing,
 lowercasing, and collating (sorting) are defined.

-A Unicode character consists either of a single code point, or a
-I<base character> (like C<LATIN CAPITAL LETTER A>), followed by one or
-more I<modifiers> (like C<COMBINING ACUTE ACCENT>).  This sequence of
+A Unicode I<logical> "character" can actually consist of more than one internal
+I<actual> "character" or code point.  For Western languages, this is adequately
+modelled by a I<base character> (like C<LATIN CAPITAL LETTER A>) followed
+by one or more I<modifiers> (like C<COMBINING ACUTE ACCENT>).  This sequence of
 base character and modifiers is called a I<combining character
-
-Whether to call these combining character sequences "characters"
-depends on your point of view.  If you are a programmer, you probably
-would tend towards seeing each element in the sequences as one unit,
-or "character".  The whole sequence could be seen as one "character",
-however, from the user's point of view, since that's probably what it
-looks like in the context of the user's language.
+sequence>. 
+Some non-western languages require more complicated
+models, so Unicode created the I<grapheme cluster> concept, and then the
+I<extended grapheme cluster>.  For example, a Korean Hangul syllable is
+considered a single logical character, but most often consists of three actual
+Unicode characters: a leading consonant followed by an interior vowel followed
+by a trailing consonant.
+
+Whether to call these extended grapheme clusters "characters" depends on your
+point of view.  If you are a programmer, you probably would tend towards seeing
+each element in the sequences as one unit, or "character".  The whole sequence
+could be seen as one "character", however, from the user's point of view, since
+that's probably what it looks like in the context of the user's language.

 With this "whole sequence" view of characters, the total number of
 characters is open-ended.  But in the programmer's "one unit is one
 character" point of view, the concept of "characters" is more
-deterministic.  In this document, we take that second point of view:
-one "character" is one Unicode code point, be it a base character or
-a combining character.
+deterministic.  In this document, we take that second point of view:
+one "character" is one Unicode code point.

 For some combinations, there are I<precomposed> characters.
 C<LATIN CAPITAL LETTER A WITH ACUTE>, for example, is defined as

@@ -84,7 +88,7 @@ character.  Firstly, there are unallocated code points within
 otherwise used blocks.  Secondly, there are special Unicode control
 characters that do not represent true characters.

-A common myth about Unicode is that it would be "16-bit", that is,
+A common myth about Unicode is that it is "16-bit", that is,
 Unicode is only represented as C<0x10000> (or 65536) characters from
 C<0x0000> to C<0xFFFF>.  B<Not true.>  Since Unicode 2.0 (July 1996),
 Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
@@ -94,22 +98,22 @@ I<Plane 0>, or the I<Basic Multilingual Plane> (BMP).  With Unicode
 3.1, 17 (yes, seventeen) planes in all were defined--but they are
 nowhere near full of defined characters, yet. 
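As an aside to the "16-bit myth" hunk above (editor's illustration, not part of the patch): in Perl 5.8 and later a code point above C<0xFFFF> is a single ordinary character, with no surrogate pairs involved.

```perl
# Editor's sketch, not from the patch: characters beyond the BMP
# behave exactly like any other character in Perl.
my $above_bmp = chr(0x1D400);      # U+1D400 MATHEMATICAL BOLD CAPITAL A
printf "%#x\n", ord($above_bmp);   # prints 0x1d400 -- well past 0xFFFF
print length($above_bmp), "\n";    # prints 1 -- one character, not two
```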
-Another myth is that the 256-character blocks have something to
+Another myth is about Unicode blocks--that they have something to
 do with languages--that each block would define the characters used
 by a language or a set of languages.  B<Not true.>
 The division into blocks exists, but it is almost completely
 accidental--an artifact of how the characters have been and
-still are allocated.  Instead, there is a concept called I<scripts>,
-which is more useful: there is C<Latin> script, C<Greek> script, and
-so on.  Scripts usually span varied parts of several blocks.
-For further information see L<Unicode::UCD>.
+still are allocated.  Instead, there is a concept called I<scripts>, which is
+more useful: there is C<Latin> script, C<Greek> script, and so on.  Scripts
+usually span varied parts of several blocks.  For more information about
+scripts, see L<perlunicode/Scripts>.

 The Unicode code points are just abstract numbers.  To input and
 output these abstract numbers, the numbers must be I<encoded> or
 I<serialised> somehow.  Unicode defines several I<character encoding
 forms>, of which I<UTF-8> is perhaps the most popular.  UTF-8 is a
 variable length encoding that encodes Unicode characters as 1 to 6
-bytes (only 4 with the currently defined characters).  Other encodings
+bytes.  Other encodings
 include UTF-16 and UTF-32 and their big- and little-endian variants
 (UTF-8 is byte-order independent) The ISO/IEC 10646 defines the UCS-2
 and UCS-4 encoding forms.

@@ -125,8 +129,7 @@ serious Unicode work.  The maintenance release 5.6.1 fixed many of the
 problems of the initial Unicode implementation, but for example
 regular expressions still do not work with Unicode in 5.6.1.

-B<Starting from Perl 5.8.0, the use of C<use utf8> is no longer
-necessary.> In earlier releases the C<utf8> pragma was used to declare
+B<Starting from Perl 5.8.0, the use of C<use utf8> is needed only in much more
+restricted circumstances.> In earlier releases the C<utf8> pragma was used to declare
 that operations in the current block or file would be Unicode-aware.
 This model was found to be wrong, or at least clumsy: the "Unicodeness"
 is now carried with the data, instead of being attached to the
@@ -141,8 +144,8 @@ scripts with legacy 8-bit data in them would break.  See L<utf8>.

 Perl supports both pre-5.6 strings of eight-bit native bytes, and
 strings of Unicode characters.  The principle is that Perl tries to
 keep its data as eight-bit bytes for as long as possible, but as soon
-as Unicodeness cannot be avoided, the data is transparently upgraded
-to Unicode.
+as Unicodeness cannot be avoided, the data is (mostly) transparently upgraded
+to Unicode.  There are some problems--see L<perlunicode/The "Unicode Bug">.

 Internally, Perl currently uses either whatever the native eight-bit
 character set of the platform (for example Latin-1) is, defaulting to
@@ -152,22 +155,22 @@ character set.  Otherwise, it uses UTF-8.

 A user of Perl does not normally need to know nor care how Perl
 happens to encode its internal strings, but it becomes relevant when
-outputting Unicode strings to a stream without a PerlIO layer -- one with
-the "default" encoding.  In such a case, the raw bytes used internally
+outputting Unicode strings to a stream without a PerlIO layer (one with
+the "default" encoding).  In such a case, the raw bytes used internally
 (the native character set or UTF-8, as appropriate for each string)
 will be used, and a "Wide character" warning will be issued if those
 strings contain a character beyond 0x00FF.

 For example,

-      perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
+      perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

 produces a fairly useless mixture of native bytes and UTF-8, as well
 as a warning:

      Wide character in print at ...

-To output UTF-8, use the C<:utf8> output layer.  Prepending
+To output UTF-8, use the C<:encoding> or C<:utf8> output layer.  Prepending

      binmode(STDOUT, ":utf8");

@@ -193,11 +196,12 @@ C.

 Perl 5.8.0 also supports Unicode on EBCDIC platforms.
 There, Unicode support is somewhat more complex to implement since
-additional conversions are needed at every step.  Some problems
-remain, see L<perlebcdic> for details.
+additional conversions are needed at every step.
+
+Later Perl releases have added code that will not work on EBCDIC platforms, and
+no one has complained, so the divergence has continued.  If you want to run
+Perl on an EBCDIC platform, send email to perlbug@perl.org

-In any case, the Unicode support on EBCDIC platforms is better than
-in the 5.6 series, which didn't work much at all for EBCDIC platform.
 On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
 instead of UTF-8.  The difference is that as UTF-8 is "ASCII-safe" in
 that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is

@@ -246,16 +250,14 @@ Note that both C<\x{...}> and C<\N{...}> are compile-time string
 constants: you cannot use variables in them.  if you want similar
 run-time functionality, use C<chr()> and C<charnames::vianame()>.

-Also note that if all the code points for pack "U" are below 0x100,
-bytes will be generated, just like if you were using C<chr()>.
-
-   my $bytes = pack("U*", 0x80, 0xFF);
-
 If you want to force the result to Unicode characters, use the special
-C<"U0"> prefix.  It consumes no arguments but forces the result to be
-in Unicode characters, instead of bytes.
+C<"U0"> prefix.  It consumes no arguments but causes the following bytes
+to be interpreted as the UTF-8 encoding of Unicode characters:
+
+   my $chars = pack("U0W*", 0x80, 0x42);

-   my $chars = pack("U0U*", 0x80, 0xFF);
+Likewise, you can stop such UTF-8 interpretation by using the special
+C<"C0"> prefix.

 =head2 Handling Unicode

@@ -264,14 +266,14 @@ strings as usual.  Functions like C<index()>, C<length()>, and
 C<substr()> will work on the Unicode characters; regular expressions
 will work on the Unicode characters (see L<perlunicode> and L<perlretut>). 
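To accompany the C<Handling Unicode> hunk above, a small sketch (editor's addition, not from the patch) of the string functions operating on characters rather than bytes:

```perl
use charnames ':full';

# One Unicode character plus two ASCII ones: three characters total,
# even though the first occupies two bytes in UTF-8 internally.
my $s = "\N{LATIN CAPITAL LETTER A WITH MACRON}bc";
print length($s), "\n";        # 3 -- counts characters, not bytes
print substr($s, 1, 2), "\n";  # "bc" -- substr() indexes by character
print uc($s), "\n";            # uc() uppercases the Unicode character too
```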
-Note that Perl considers combining character sequences to be
-characters, so for example
+Note that Perl considers grapheme clusters to be separate characters, so for
+example

     use charnames ':full';
     print length("\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"), "\n";

 will print 2, not 1.  The only exception is that regular expressions
-have C<\X> for matching a combining character sequence.
+have C<\X> for matching an extended grapheme cluster.

 Life is not quite so transparent, however, when working with legacy
 encodings, I/O, and certain special cases:

@@ -280,27 +282,13 @@ encodings, I/O, and certain special cases:

 When you combine legacy data and Unicode the legacy data needs
 to be upgraded to Unicode.  Normally ISO 8859-1 (or EBCDIC, if
-applicable) is assumed.  You can override this assumption by
-using the C<encoding> pragma, for example
-
-    use encoding 'latin2'; # ISO 8859-2
-
-in which case literals (string or regular expressions), C<chr()>,
-and C<ord()> in your whole script are assumed to produce Unicode
-characters from ISO 8859-2 code points.  Note that the matching for
-encoding names is forgiving: instead of C<latin2> you could have
-said C<Latin 2>, or C<iso8859-2>, or other variations.  With just
-
-    use encoding;
-
-the environment variable C<PERL_ENCODING> will be consulted.
-If that variable isn't set, the encoding pragma will fail.
+applicable) is assumed.

 The C<Encode> module knows about many encodings and has interfaces
 for doing conversions between those encodings:

-    use Encode 'from_to';
-    from_to($data, "iso-8859-3", "utf-8"); # from legacy to utf-8
+    use Encode 'decode';
+    $data = decode("iso-8859-3", $data); # convert from legacy to utf-8

 =head2 Unicode I/O

@@ -333,7 +321,9 @@ and on already open streams, use C<binmode()>:

 The matching of encoding names is loose: case does not matter, and
 many encodings have several aliases.  Note that the C<:utf8> layer
 must always be specified exactly like that; it is I<not> subject to
-the loose matching of encoding names.
+the loose matching of encoding names. 
+Also note that C<:utf8> is unsafe for input, because it accepts the data
+without validating that it is indeed valid UTF8.

 See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
 L<Encode::PerlIO> for the C<:encoding()> layer, and

@@ -345,7 +335,7 @@ Unicode or legacy encodings does not magically turn the data
 into Unicode in Perl's eyes.  To do that, specify the appropriate
 layer when opening files

-    open(my $fh,'<:utf8', 'anything');
+    open(my $fh,'<:encoding(utf8)', 'anything');
     my $line_of_unicode = <$fh>;

     open(my $fh,'<:encoding(Big5)', 'anything');
@@ -354,7 +344,7 @@ layer when opening files
 The I/O layers can also be specified more flexibly with
 the C<open> pragma.  See L<open>, or look at the following example.

-    use open ':utf8'; # input and output default layer will be UTF-8
+    use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
     open X, ">file";
     print X chr(0x100), "\n";
     close X;

@@ -374,11 +364,6 @@ With the C<open> pragma you can use the C<:locale> layer

     printf "%#x\n", ord(<I>), "\n"; # this should print 0xc1
     close I;

-or you can also use the C<':encoding(...)'> layer
-
-    open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
-    my $line_of_unicode = <$epic>;
-
 These methods install a transparent filter on the I/O stream that
 converts data from the specified encoding when it is read in from
 the stream.  The result is always Unicode.

@@ -406,8 +391,8 @@ the file "text.utf8", encoded as UTF-8:

     while (<$nihongo>) { print $unicode $_ }

 The naming of encodings, both by the C<open()> and by the C<open>
-pragma, is similar to the C<encoding> pragma in that it allows for
-flexible names: C<koi8-r> and C<KOI8R> will both be understood.
+pragma allows for flexible names: C<koi8-r> and C<KOI8R> will both be
+understood. 
 Common encodings recognized by ISO, MIME, IANA, and various other
 standardisation organisations are recognised; for a more detailed

@@ -427,13 +412,13 @@ by repeatedly encoding the data:

     local $/; ## read in the whole file of 8-bit characters
     $t = <F>;
     close F;
-    open F, ">:utf8", "file";
+    open F, ">:encoding(utf8)", "file";
     print F $t; ## convert to UTF-8 on output
     close F;

 If you run this code twice, the contents of the F<file> will be twice
-UTF-8 encoded.  A C<use open ':utf8'> would have avoided the bug, or
-explicitly opening also the F<file> for input as UTF-8.
+UTF-8 encoded.  A C<use open ':encoding(utf8)'> would have avoided the
+bug, or explicitly opening also the F<file> for input as UTF-8.

 B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
 Perl has been built with the new PerlIO feature (which is the default

@@ -454,7 +439,7 @@ displayed as C<\x..>, and the rest of the characters as themselves:

         chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
         sprintf("\\x%02X", $_) :    # \x..
         quotemeta(chr($_))          # else quoted or as themselves
-      } unpack("U*", $_[0]));       # unpack Unicode characters
+      } unpack("W*", $_[0]));       # unpack Unicode characters
 }

 For example,

@@ -494,17 +479,18 @@ explicitly-defined I/O layers).  But if you must, there are two ways of
 looking behind the scenes.

 One way of peeking inside the internal encoding of Unicode characters
-is to use C<unpack("C*", ...)> to get the bytes or C<unpack("H*", ...)>
-to display the bytes:
+is to use C<unpack("C*", ...)> to get the bytes of whatever the string
+encoding happens to be, or C<unpack("H*", ...)> to get the bytes of the
+UTF-8 encoding:

     # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
-    print join(" ", unpack("H*", pack("U", 0x100))), "\n";
+    print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";

 Yet another way would be to use the Devel::Peek module:

     perl -MDevel::Peek -e 'Dump(chr(0x100))'

-That shows the UTF8 flag in FLAGS and both the UTF-8 bytes
+That shows the C<UTF8> flag in FLAGS and both the UTF-8 bytes
 and Unicode characters in C<PV>.  See also later in this
 document the discussion about the C<is_utf8> function. 
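A companion sketch to the "peeking behind the scenes" hunk above (editor's addition, not part of the patch): getting the UTF-8 bytes of a character explicitly via C<Encode>, without relying on the internal representation at all.

```perl
use Encode 'encode_utf8';

# Editor's sketch: U+0100 encodes to the two UTF-8 bytes 0xC4 0x80.
my $str  = chr(0x100);                       # one character, U+0100
my @utf8 = unpack("C*", encode_utf8($str));  # bytes of its UTF-8 form
printf "%02x %02x\n", @utf8;                 # prints "c4 80"
```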
@@ -530,13 +516,12 @@ case, the answer is no (because 0x00C1 != 0x0041).  But sometimes, any
 CAPITAL LETTER As should be considered equal, or even As of any case.

 The long answer is that you need to consider character normalization
-and casing issues: see L<Unicode::Normalize>, Unicode Technical
-Reports #15 and #21, I<Unicode Normalization Forms> and I<Case
-Mappings>, http://www.unicode.org/unicode/reports/tr15/ and
-http://www.unicode.org/unicode/reports/tr21/
+and casing issues: see L<Unicode::Normalize>, Unicode Technical Report #15,
+L<Unicode Normalization Forms|http://www.unicode.org/unicode/reports/tr15> and
+sections on case mapping in the L<Unicode Standard|http://www.unicode.org>.

 As of Perl 5.8.0, the "Full" case-folding of I<Case
-Mappings/SpecialCasing> is implemented.
+Mappings/SpecialCasing> is implemented, but bugs remain in C<qr//i> with them.

 =item *

@@ -556,7 +541,7 @@ C<0x00C1> > C<0x00C0>.
 The long answer is that "it depends", and a good answer cannot be
 given without knowing (at the very least) the language context.
 See L<Unicode::Collate>, and I<Unicode Collation Algorithm>
-http://www.unicode.org/unicode/reports/tr10/
+L<http://www.unicode.org/unicode/reports/tr10/>

 =back

@@ -568,19 +553,19 @@

 Character Ranges and Classes

-Character ranges in regular expression character classes (C</[a-z]/>)
-and in the C<tr///> (also known as C<y///>) operator are not magically
-Unicode-aware.  What this means that C<[A-Za-z]> will not magically start
-to mean "all alphabetic letters"; not that it does mean that even for
-8-bit characters, you should be using C</[[:alpha:]]/> in that case.
+Character ranges in regular expression bracketed character classes ( e.g.,
+C</[a-z]/>) and in the C<tr///> (also known as C<y///>) operator are not
+magically Unicode-aware.  What this means is that C<[A-Za-z]> will not
+magically start to mean "all alphabetic letters" (not that it does mean that
+even for 8-bit characters; for those, if you are using locales (L<perllocale>),
+use C</[[:alpha:]]/>; and if not, use the 8-bit-aware property C<\p{alpha}>).
+
+All the properties that begin with C<\p> (and its inverse C<\P>) are actually
+character classes that are Unicode-aware.  There are dozens of them, see
+L<perluniprops>. 
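To go with the character-class hunk above, a short sketch (editor's illustration, not part of the patch) of how a C<\p{...}> property differs from an explicit ASCII range:

```perl
use utf8;  # so the accented letter below is read as one character

# [A-Za-z] stops at the diaeresis; \p{Alphabetic} is Unicode-aware.
my $word = "naïve";
my ($ascii_run) = $word =~ /([A-Za-z]+)/;        # captures "na"
my ($alpha_run) = $word =~ /(\p{Alphabetic}+)/;  # captures "naïve"
print "$ascii_run $alpha_run\n";
```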
-For specifying character classes like that in regular expressions,
-you can use the various Unicode properties--C<\pL>, or perhaps
-C<\p{Alphabetic}>, in this particular case.  You can use Unicode
-code points as the end points of character ranges, but there is no
-magic associated with specifying a certain range.  For further
-information--there are dozens of Unicode character classes--see
-L<perlunicode>.
+You can use Unicode code points as the end points of character ranges, and the
+range will include all Unicode code points that lie between those end points.

 =item *

@@ -621,11 +606,12 @@ Unicode; for that, see the earlier I/O discussion.

 How Do I Know Whether My String Is In Unicode?

-You shouldn't care.  No, you really shouldn't.  No, really.  If you
-have to care--beyond the cases described above--it means that we
-didn't get the transparency of Unicode quite right.
+You shouldn't have to care.  But you may, because currently the semantics of the
+characters whose ordinals are in the range 128 to 255 are different depending on
+whether the string they are contained within is in Unicode or not.
+(See L<perlunicode/The "Unicode Bug">.)

-Okay, if you insist:
+To determine if a string is in Unicode, use:

     print utf8::is_utf8($string) ? 1 : 0, "\n";

@@ -636,9 +622,9 @@ string has any characters at all.  All the C<is_utf8()> does is to
 return the value of the internal "utf8ness" flag attached to the
 C<$string>.  If the flag is off, the bytes in the scalar are interpreted
 as a single byte encoding.  If the flag is on, the bytes in the scalar
-are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
-points of the characters.  Bytes added to an UTF-8 encoded string are
-automatically upgraded to UTF-8.  If mixed non-UTF8 and UTF-8 scalars
+are interpreted as the (variable-length, potentially multi-byte) UTF-8 encoded
+code points of the characters.  Bytes added to a UTF-8 encoded string are
+automatically upgraded to UTF-8. 
+If mixed non-UTF-8 and UTF-8 scalars
 are merged (double-quoted interpolation, explicit concatenation, and
 printf/sprintf parameter substitution), the result will be UTF-8 encoded
 as if copies of the byte strings were upgraded to UTF-8: for example,

@@ -652,8 +638,8 @@ C<$a> will stay byte-encoded.

 Sometimes you might really need to know the byte length of a string
 instead of the character length.  For that use either the
-C<Encode::encode_utf8()> function or the C<bytes> pragma and its only
-defined function C<bytes::length()>:
+C<Encode::encode_utf8()> function or the C<bytes> pragma and
+the C<length()> function:

     my $unicode = chr(0x100);
     print length($unicode), "\n"; # will print 1

@@ -662,6 +648,7 @@ defined function C<bytes::length()>:
     use bytes;
     print length($unicode), "\n"; # will also print 2
                                   # (the 0xC4 0x80 of the UTF-8)
+    no bytes;

 =item *

 How Do I Detect Data That's Not Valid In a Particular Encoding?

 Use the C<Encode> package to try converting it.  For example,

-    use Encode 'encode_utf8';
-    if (encode_utf8($string_of_bytes_that_I_think_is_utf8)) {
-        # valid
+    use Encode 'decode_utf8';
+
+    if (eval { decode_utf8($string, Encode::FB_CROAK); 1 }) {
+        # $string is valid utf8
     } else {
-        # invalid
+        # $string is not valid utf8
     }

-For UTF-8 only, you can use:
+Or use C<unpack> to try decoding it:

     use warnings;
-    @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);
+    @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);

-If invalid, a C<Malformed UTF-8 character>
-warning is produced.  The "U0" means "expect strictly UTF-8 encoded
-Unicode".  Without that the C<unpack()> would accept also
-data like C<chr(0xFF)>), similarly to the C<pack()> as we saw earlier.
+If invalid, a C<Malformed UTF-8 character> warning is produced.  The "C0" means
+"process the string character per character".  Without that, the
+C<unpack()> would work in C<U0> mode (the default if the format
+string starts with C<U>) and it would return the bytes making up the UTF-8
+encoding of the target string, something that will always work.
 Latin-1, EBCDIC, etc.), you can use pack/unpack to convert
 to/from Unicode.

-    $native_string = pack("C*", unpack("U*", $Unicode_string));
-    $Unicode_string = pack("U*", unpack("C*", $native_string));
+    $native_string = pack("W*", unpack("U*", $Unicode_string));
+    $Unicode_string = pack("U*", unpack("W*", $native_string));

 If you have a sequence of bytes you B<know> is valid UTF-8,
 but Perl doesn't know it yet, you can make Perl a believer, too:

     use Encode 'decode_utf8';
     $Unicode = decode_utf8($bytes);

-You can convert well-formed UTF-8 to a sequence of bytes, but if
-you just want to convert random binary data into UTF-8, you can't.
-B<Any random collection of bytes isn't well-formed UTF-8>.  You can
-use C<unpack("C*", $string)> for the former, and you can create
-well-formed Unicode data by C<pack("U*", 0xff, ...)>.
+or:
+
+    $Unicode = pack("U0a*", $bytes);
+
+You can find the bytes that make up a UTF-8 sequence with
+
+    @bytes = unpack("C*", $Unicode_string)
+
+and you can create well-formed Unicode with
+
+    $Unicode_string = pack("U*", 0xff, ...)

 =item *

 How Do I Display Unicode?  How Do I Input Unicode?

-See http://www.alanwood.net/unicode/ and
-http://www.cl.cam.ac.uk/~mgk25/unicode.html
+See L<http://www.alanwood.net/unicode/> and
+L<http://www.cl.cam.ac.uk/~mgk25/unicode.html>

 =item *

@@ -799,44 +794,44 @@ show a decimal number in hexadecimal. 
 If you have just the

 Unicode Consortium

-    http://www.unicode.org/
+L<http://www.unicode.org/>

 =item *

 Unicode FAQ

-    http://www.unicode.org/unicode/faq/
+L<http://www.unicode.org/unicode/faq/>

 =item *

 Unicode Glossary

-    http://www.unicode.org/glossary/
+L<http://www.unicode.org/glossary/>

 =item *

 Unicode Useful Resources

-    http://www.unicode.org/unicode/onlinedat/resources.html
+L<http://www.unicode.org/unicode/onlinedat/resources.html>

 =item *

 Unicode and Multilingual Support in HTML, Fonts, Web Browsers
 and Other Applications

-    http://www.alanwood.net/unicode/
+L<http://www.alanwood.net/unicode/>

 =item *

 UTF-8 and Unicode FAQ for Unix/Linux

-    http://www.cl.cam.ac.uk/~mgk25/unicode.html
+L<http://www.cl.cam.ac.uk/~mgk25/unicode.html>

 =item *

 Legacy Character Sets

-    http://www.czyborra.com/
-    http://www.eki.ee/letter/
+L<http://www.czyborra.com/>
+L<http://www.eki.ee/letter/>

 =item *

@@ -845,7 +840,7 @@ directory

     $Config{installprivlib}/unicore

-in Perl 5.8.0 or newer, and
+in Perl 5.8.0 or newer, and

     $Config{installprivlib}/unicode

@@ -880,7 +875,7 @@ to UTF-8 bytes and back, the code works even with older Perl 5
 versions.

 =head1 SEE ALSO

-L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,
 L<perlretut>, L<perlrun>, L<Unicode::Collate>, L<Unicode::Normalize>, L<Unicode::UCD>