X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperluniintro.pod;h=54ce2f0a1c6803ded1e19378fa015f2a14775736;hb=9e5bbba0de25c01ae9355c7a97e237602a37e9f3;hp=6c82efde15922e28db25997f7957c59de6839cbc;hpb=a69635b797939b348e6ed6c090a2b89709dc47b1;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 6c82efd..54ce2f0 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -344,7 +344,8 @@ layer when opening files
 The I/O layers can also be specified more flexibly with
 the C pragma. See L, or look at the following example.

- use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
+ use open ':encoding(utf8)'; # input/output default encoding will be
+ # UTF-8
 open X, ">file";
 print X chr(0x100), "\n";
 close X;
@@ -355,7 +356,8 @@ the C pragma. See L, or look at the following example.
 With the C pragma you can use the C<:locale> layer

 BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
- # the :locale will probe the locale environment variables like LC_ALL
+ # the :locale will probe the locale environment variables like
+ # LC_ALL
 use open OUT => ':locale'; # russki parusski
 open(O, ">koi8");
 print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
@@ -432,13 +434,13 @@ its argument so that Unicode characters with code points greater than
 255 are displayed as C<\x{...}>, control characters (like C<\n>) are
 displayed as C<\x..>, and the rest of the characters as themselves:

- sub nice_string {
- join("",
- map { $_ > 255 ? # if wide character...
- sprintf("\\x{%04X}", $_) : # \x{...}
- chr($_) =~ /[[:cntrl:]]/ ? # else if control character ...
- sprintf("\\x%02X", $_) : # \x..
- quotemeta(chr($_)) # else quoted or as themselves
+ sub nice_string {
+ join("",
+ map { $_ > 255 ? # if wide character...
+ sprintf("\\x{%04X}", $_) : # \x{...}
+ chr($_) =~ /[[:cntrl:]]/ ? # else if control character ...
+ sprintf("\\x%02X", $_) : # \x..
+ quotemeta(chr($_)) # else quoted or as themselves
 } unpack("W*", $_[0])); # unpack Unicode characters
 }

@@ -553,19 +555,19 @@ L

 Character Ranges and Classes

-Character ranges in regular expression character classes (C)
-and in the C (also known as C) operator are not magically
-Unicode-aware. What this means is that C<[A-Za-z]> will not magically start
-to mean "all alphabetic letters"; not that it does mean that even for
-8-bit characters, you should be using C in that case.
+Character ranges in regular expression bracketed character classes ( e.g.,
+C) and in the C (also known as C) operator are not
+magically Unicode-aware. What this means is that C<[A-Za-z]> will not
+magically start to mean "all alphabetic letters" (not that it does mean that
+even for 8-bit characters; for those, if you are using locales (L),
+use C; and if not, use the 8-bit-aware property C<\p{alpha}>).

-For specifying character classes like that in regular expressions,
-you can use the various Unicode properties--C<\pL>, or perhaps
-C<\p{Alphabetic}>, in this particular case. You can use Unicode
-code points as the end points of character ranges, but there is no
-magic associated with specifying a certain range. For further
-information--there are dozens of Unicode character classes--see
-L.
+All the properties that begin with C<\p> (and its inverse C<\P>) are actually
+character classes that are Unicode-aware. There are dozens of them, see
+L.
+
+You can use Unicode code points as the end points of character ranges, and the
+range will include all Unicode code points that lie between those end points.

 =item *

@@ -607,7 +609,7 @@ Unicode; for that, see the earlier I/O discussion.
 How Do I Know Whether My String Is In Unicode?

 You shouldn't have to care. But you may, because currently the semantics of the
-characters whose ordinals are in the range 128 to 255 is different depending on
+characters whose ordinals are in the range 128 to 255 are different depending on
 whether the string they are contained within is in Unicode or not.
 (See L.)

@@ -622,8 +624,8 @@ string has any characters at all. All the C does is to
 return the value of the internal "utf8ness" flag attached to the
 C<$string>. If the flag is off, the bytes in the scalar are interpreted
 as a single byte encoding. If the flag is on, the bytes in the scalar
-are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
-points of the characters. Bytes added to a UTF-8 encoded string are
+are interpreted as the (variable-length, potentially multi-byte) UTF-8 encoded
+code points of the characters. Bytes added to a UTF-8 encoded string are
 automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8 scalars
 are merged (double-quoted interpolation, explicit concatenation, and
 printf/sprintf parameter substitution), the result will be UTF-8 encoded
@@ -648,6 +650,7 @@ the C function:
 use bytes;
 print length($unicode), "\n"; # will also print 2
 # (the 0xC4 0x80 of the UTF-8)
+ no bytes;

 =item *

@@ -730,11 +733,11 @@ or:

 You can find the bytes that make up a UTF-8 sequence with

- @bytes = unpack("C*", $Unicode_string)
+ @bytes = unpack("C*", $Unicode_string)

 and you can create well-formed Unicode with

- $Unicode_string = pack("U*", 0xff, ...)
+ $Unicode_string = pack("U*", 0xff, ...)

 =item *