From: Jeffrey Friedl
Date: Sun, 16 Dec 2001 11:36:32 +0000 (-0800)
Subject: Will the real Unicode encoding please stand up?
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=58c274a11b245c0b622f3aa697372d5c1dc88354;p=p5sagit%2Fp5-mst-13.2.git

Will the real Unicode encoding please stand up?

Message-Id: <200112161936.fBGJaWe41263@ventrue.corp.yahoo.com>
p4raw-id: //depot/perl@13726
---

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0ecfba0..67ce214 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -358,18 +358,23 @@ its argument so that Unicode characters with code points greater than
 255 are displayed as "\x{...}", control characters (like "\n") are
 displayed as "\x..", and the rest of the characters as themselves.
 
-sub nice_string {
-    join("",
-      map { $_ > 255 ?                  # if wide character...
-            sprintf("\\x{%x}", $_) :    # \x{...}
-            chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
-            sprintf("\\x%02x", $_) :    # \x..
-            chr($_) }                   # else as themselves
-      unpack("U*", $_[0]));             # unpack Unicode characters
-}
-
-For example, C will return
-C<"foo\x{100}bar\x0a">.
+    sub nice_string {
+        join("",
+          map { $_ > 255 ?                  # if wide character...
+                sprintf("\\x{%x}", $_) :    # \x{...}
+                chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
+                sprintf("\\x%02x", $_) :    # \x..
+                chr($_)                     # else as themselves
+          } unpack("U*", $_[0]));           # unpack Unicode characters
+    }
+
+For example,
+
+    nice_string("foo\x{100}bar\n")
+
+will return:
+
+    "foo\x{100}bar\x0a"
 
 =head2 Special Cases
 
@@ -423,7 +428,7 @@ C?)
 
 The short answer is that by default Perl compares equivalence (C,
 C) based only on code points of the characters.
-In the above case, no (because 0x00C1 != 0x0041). But sometimes any
+In the above case, the answer is no (because 0x00C1 != 0x0041). But sometimes any
 CAPITAL LETTER As being considered equal, or even any As of any case,
 would be desirable.
 
@@ -433,7 +438,7 @@ Reports #15 and #21, I and I,
 http://www.unicode.org/unicode/reports/tr15/
 http://www.unicode.org/unicode/reports/tr21/
 
-As of Perl 5.8.0, the's regular expression case-ignoring matching
+As of Perl 5.8.0, regular expression case-ignoring matching
 implements only 1:1 semantics: one character matches one character.
 In I both 1:N and N:1 matches are defined.
 
@@ -447,9 +452,9 @@ parlance goes, collated. But again, what do you mean by collate?
 (Does C come before or after
 C?)
 
-The short answer is that by default Perl compares strings (C,
+The short answer is that by default, Perl compares strings (C,
 C, C, C, C) based only on the code points of the
-characters. In the above case, after, since 0x00C1 > 0x00C0.
+characters. In the above case, the answer is "after", since 0x00C1 > 0x00C0.
 
 The long answer is that "it depends", and a good answer cannot be
 given without knowing (at the very least) the language context.
@@ -468,12 +473,12 @@ Character Ranges
 
 Character ranges in regular expression character classes (C)
 and in the C (also known as C) operator are not magically
-Unicode-aware. What this means that C<[a-z]> will not magically start
+Unicode-aware. What this means that C<[A-Za-z]> will not magically start
 to mean "all alphabetic letters" (not that it does mean that even for
 8-bit characters, you should be using C for that).
 
-For specifying things like that in regular expressions you can use the
-various Unicode properties, C<\pL> in this particular case. You can
+For specifying things like that in regular expressions, you can use the
+various Unicode properties, C<\pL> or perhaps C<\p{Alphabetic}>, in this particular case. You can
 use Unicode code points as the end points of character ranges, but
 that means that particular code point range, nothing more. For
 further information, see L.
@@ -485,7 +490,7 @@ String-To-Number Conversions
 Unicode does define several other decimal (and numeric) characters
 than just the familiar 0 to 9, such as the Arabic and Indic digits.
 Perl does not support string-to-number conversion for digits other
-than the 0 to 9 (and a to f for hexadecimal).
+than ASCII 0 to 9 (and ASCII a to f for hexadecimal).
 
 =back