Add perltodo: write an XS cookbook

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod

index 1a49f04..5dbd3cd 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -10,6 +10,10 @@ Unicode support is an extensive requirement. While Perl does not
 implement the Unicode standard or the accompanying technical reports
 from cover to cover, Perl does support many Unicode features.
 
+People who want to learn to use Unicode in Perl should probably read
+L<the Perl Unicode tutorial, perlunitut|perlunitut>, before reading
+this reference document.
+
 =over 4
 
 =item Input and Output Layers
@@ -20,15 +24,15 @@ the ":utf8" layer.  Other encodings can be converted to Perl's
 encoding on input or from Perl's encoding on output by use of the
 ":encoding(...)"  layer.  See L<open>.
 
-To indicate that Perl source itself is using a particular encoding,
-see L<encoding>.
+To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
 
 =item Regular Expressions
 
 The regular expression compiler produces polymorphic opcodes.  That is,
 the pattern adapts to the data and automatically switches to the Unicode
-character scheme when presented with Unicode data--or instead uses
-a traditional byte scheme when presented with byte data.
+character scheme when presented with data that is internally encoded in
+UTF-8 -- or instead uses a traditional byte scheme when presented with
+byte data.
 
 =item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
 
@@ -39,9 +43,6 @@ ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
 machines.  B<These are the only times when an explicit C<use utf8>
 is needed.>  See L<utf8>.
 
-You can also use the C<encoding> pragma to change the default encoding
-of the data in your script; see L<encoding>.
-
 =item BOM-marked scripts and UTF-16 scripts autodetected
 
 If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF16-BE,
@@ -52,17 +53,12 @@ ISO 8859-1 or other eight-bit encodings.)
 
 =item C<use encoding> needed to upgrade non-Latin-1 byte strings
 
-By default, there is a fundamental asymmetry in Perl's unicode model:
+By default, there is a fundamental asymmetry in Perl's Unicode model:
 implicit upgrading from byte strings to Unicode strings assumes that
 they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
 downgraded with UTF-8 encoding.  This happens because the first 256
 codepoints in Unicode happens to agree with Latin-1.  
 
-If you wish to interpret byte strings as UTF-8 instead, use the
-C<encoding> pragma:
-
-    use encoding 'utf8';
-
 See L</"Byte and Character Semantics"> for more details.
 
 =back
@@ -83,6 +79,16 @@ character semantics.  For operations where this determination cannot
 be made without additional information from the user, Perl decides in
 favor of compatibility and chooses to use byte semantics.
 
+Under byte semantics, when C<use locale> is in effect, Perl uses the 
+semantics associated with the current locale.  Absent a C<use locale>, Perl
+currently uses US-ASCII (or Basic Latin in Unicode terminology) byte semantics,
+meaning that characters whose ordinal numbers are in the range 128 - 255 are
+undefined except for their ordinal numbers.  This means that none have case
+(upper and lower), nor are any a member of character classes, like C<[:alpha:]>
+or C<\w>.
+(But all do belong to the C<\W> class or the Perl regular expression extension
+C<[:^alpha:]>.)
+
 This behavior preserves compatibility with earlier versions of Perl,
 which allowed byte semantics in Perl operations only if
 none of the program's inputs were marked as being as source of Unicode
@@ -109,12 +115,8 @@ Otherwise, byte semantics are in effect.  The C<bytes> pragma should
 be used to force byte semantics on Unicode data.
 
 If strings operating under byte semantics and strings with Unicode
-character data are concatenated, the new string will be created by
-decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
-old Unicode string used EBCDIC.  This translation is done without
-regard to the system's native 8-bit encoding.  To change this for
-systems with non-Latin-1 and non-EBCDIC native encodings, use the
-C<encoding> pragma.  See L<encoding>.
+character data are concatenated, the new string will have 
+character semantics.  This can cause surprises: see L</BUGS>, below.
 
 Under character semantics, many operations that formerly operated on
 bytes now operate on characters. A character in Perl is
@@ -134,17 +136,16 @@ Character semantics have the following effects:
 Strings--including hash keys--and regular expression patterns may
 contain characters that have an ordinal value larger than 255.
 
-If you use a Unicode editor to edit your program, Unicode characters
-may occur directly within the literal strings in one of the various
-Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
-as such and converted to Perl's internal representation only if the
-appropriate L<encoding> is specified.
+If you use a Unicode editor to edit your program, Unicode characters may
+occur directly within the literal strings in UTF-8 encoding, or UTF-16.
+(The former requires a BOM or C<use utf8>; the latter requires a BOM.)
 
-Unicode characters can also be added to a string by using the
-C<\x{...}> notation.  The Unicode code for the desired character, in
-hexadecimal, should be placed in the braces. For instance, a smiley
-face is C<\x{263A}>.  This encoding scheme only works for characters
-with a code of 0x100 or above.
+Unicode characters can also be added to a string by using the C<\x{...}>
+notation.  The Unicode code for the desired character, in hexadecimal,
+should be placed in the braces. For instance, a smiley face is
+C<\x{263A}>.  This encoding scheme works for all characters, but
+for characters under 0x100, note that Perl may use an 8 bit encoding
+internally, for optimization and/or backward compatibility.
 
 Additionally, if you
 
@@ -163,8 +164,7 @@ names.
 =item *
 
 Regular expressions match characters instead of bytes.  "." matches
-a character instead of a byte.  The C<\C> pattern is provided to force
-a match a single byte--a C<char> in C, hence C<\C>.
+a character instead of a byte.
 
 =item *
 
@@ -173,17 +173,13 @@ bytes and match against the character properties specified in the
 Unicode properties database.  C<\w> can be used to match a Japanese
 ideograph, for instance.
 
-(However, and as a limitation of the current implementation, using
-C<\w> or C<\W> I<inside> a C<[...]> character class will still match
-with byte semantics.)
-
 =item *
 
 Named Unicode properties, scripts, and block ranges may be used like
 character classes via the C<\p{}> "matches property" construct and
 the C<\P{}> negation, "doesn't match property".
 
-See L</"Unicode  Character Properties"> for more details.
+See L</"Unicode Character Properties"> for more details.
 
 You can define your own character properties and use them
 in the regular expression with the C<\p{}> or C<\P{}> construct.
@@ -196,7 +192,7 @@ The special pattern C<\X> matches any extended Unicode
 sequence--"a combining character sequence" in Standardese--where the
 first character is a base character and subsequent characters are mark
 characters that apply to the base character.  C<\X> is equivalent to
-C<(?:\PM\pM*)>.
+C<< (?>\PM\pM*) >>.
 
 =item *
 
@@ -766,6 +762,10 @@ or more newline-separated lines.  Each line must be one of the following:
 
 =item *
 
+A single hexadecimal number denoting a Unicode code point to include.
+
+=item *
+
 Two hexadecimal numbers separated by horizontal whitespace (space or
 tabular characters) denoting a range of Unicode code points to include.
 
@@ -856,10 +856,6 @@ two (or more) classes.
 It's important to remember not to use "&" for the first set -- that
 would be intersecting with nothing (resulting in an empty set).
 
-A final note on the user-defined property tests: they will be used
-only if the scalar has been marked as having Unicode characters.
-Old byte-style strings will not be affected.
-
 =head2 User-Defined Case Mappings
 
 You can also define your own mappings to be used in the lc(),
@@ -945,14 +941,14 @@ Level 1 - Basic Unicode Support
              Alphabetic, Lowercase, Uppercase, WhiteSpace,
              NoncharacterCodePoint, DefaultIgnorableCodePoint, Any,
              ASCII, Assigned), but also bidirectional types, blocks, etc.
-             (see L</"Unicode Character Properties">)
+             (see "Unicode Character Properties")
         [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
         [5]  can use regular expression look-ahead [a] or
              user-defined character properties [b] to emulate set operations
         [6]  \b \B
         [7]  note that Perl does Full case-folding in matching, not Simple:
-             for example U+1F88 is equivalent with U+1F00 U+03B9,
-             not with 1F80.  This difference matters for certain Greek
+             for example U+1F88 is equivalent to U+1F00 U+03B9,
+             not to U+1F80.  This difference matters mainly for certain Greek
              capital letters with certain modifiers: the Full case-folding
              decomposes the letter, while the Simple case-folding would map
              it to a single character.
@@ -1311,15 +1307,13 @@ readdir, readlink
 =head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
 
 Sometimes (see L</"When Unicode Does Not Happen">) there are
-situations where you simply need to force Perl to believe that a byte
-string is UTF-8, or vice versa.  The low-level calls
-utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
+situations where you simply need to force a byte
+string into UTF-8, or vice versa.  The low-level calls
+utf8::upgrade($bytestring) and utf8::downgrade($utf8string[, FAIL_OK]) are
 the answers.
 
-Do not use them without careful thought, though: Perl may easily get
-very confused, angry, or even crash, if you suddenly change the 'nature'
-of scalar like that.  Especially careful you have to be if you use the
-utf8::upgrade(): any random byte string is not valid UTF-8.
+Note that utf8::downgrade() can fail if the string contains characters
+that don't fit into a byte.
 
 =head2 Using Unicode in XS
 
@@ -1333,7 +1327,7 @@ details.
 =item *
 
 C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
-pragma is not in effect.  C<SvUTF8(sv)> returns true is the C<UTF8>
+pragma is not in effect.  C<SvUTF8(sv)> returns true if the C<UTF8>
 flag is on; the bytes pragma is ignored.  The C<UTF8> flag being on
 does B<not> mean that there are any characters of code points greater
 than 255 (or 127) in the scalar or that there are even any characters
@@ -1346,15 +1340,15 @@ Unicode model is not to use UTF-8 until it is absolutely necessary.
 
 =item *
 
-C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
+C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
 a buffer encoding the code point as UTF-8, and returns a pointer
-pointing after the UTF-8 bytes.
+pointing after the UTF-8 bytes.  It works appropriately on EBCDIC machines.
 
 =item *
 
-C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
+C<utf8_to_uvchr(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
 returns the Unicode character code point and, optionally, the length of
-the UTF-8 byte sequence.
+the UTF-8 byte sequence.  It works appropriately on EBCDIC machines.
 
 =item *
 
@@ -1400,7 +1394,7 @@ two pointers pointing to the same UTF-8 encoded buffer.
 
 =item *
 
-C<utf8_hop(s, off)> will return a pointer to an UTF-8 encoded buffer
+C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
 that is C<off> (positive or negative) Unicode characters displaced
 from the UTF-8 buffer C<s>.  Be careful not to overstep the buffer:
 C<utf8_hop()> will merrily run off the end or the beginning of the
@@ -1418,7 +1412,7 @@ output more readable.
 
 =item *
 
-C<ibcmp_utf8(s1, pe1, u1, l1, u1, s2, pe2, l2, u2)> can be used to
+C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
 compare two strings case-insensitively in Unicode.  For case-sensitive
 comparisons you can just use C<memEQ()> and C<memNE()> as usual.
 
@@ -1438,10 +1432,32 @@ use characters above that range when mapped into Unicode.  Perl's
 Unicode support will also tend to run slower.  Use of locales with
 Unicode is discouraged.
 
+=head2 Problems with characters whose ordinal numbers are in the range 128 - 255 with no Locale specified
+
+Without a locale specified, unlike all other characters or code points,
+these characters have very different semantics in byte semantics versus
+character semantics.
+In character semantics they are interpreted as Unicode code points, which means
+they are viewed as Latin-1 (ISO-8859-1).
+In byte semantics, they are considered to be unassigned characters,
+meaning that the only semantics they have is their
+ordinal numbers, and that they are not members of various character classes.
+None are considered to match C<\w> for example, but all match C<\W>.
+Besides these class matches,
+the known operations that this affects are those that change the case,
+regular expression matching while ignoring case,
+and B<quotemeta()>.
+This can lead to unexpected results in which a string's semantics suddenly
+change if a code point above 255 is appended to or removed from it,
+which changes the string's semantics from byte to character or vice versa.
+This behavior is scheduled to change in version 5.12, but in the meantime,
+a workaround is to always call utf8::upgrade($string), or to use the
+standard modules L<Encode> or L<charnames>.
+
 =head2 Interaction with Extensions
 
 When Perl exchanges data with an extension, the extension should be
-able to understand the UTF-8 flag and act accordingly. If the
+able to understand the UTF8 flag and act accordingly. If the
 extension doesn't know about the flag, it's likely that the extension
 will return incorrectly-flagged data.
 
@@ -1486,7 +1502,7 @@ derived class with such a C<param> method:
     sub param {
       my($self,$name,$value) = @_;
       utf8::upgrade($name);     # make sure it is UTF-8 encoded
-      if (defined $value)
+      if (defined $value) {
         utf8::upgrade($value);  # make sure it is UTF-8 encoded
         return $self->SUPER::param($name,$value);
       } else {
@@ -1518,6 +1534,15 @@ be quite a bit slower (5-20 times) than their simpler counterparts
 like C<\d> (then again, there 268 Unicode characters matching C<Nd>
 compared with the 10 ASCII characters matching C<d>).
 
+=head2 Possible problems on EBCDIC platforms
+
+In earlier versions, when byte and character data were concatenated,
+the new string was sometimes created by
+decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
+old Unicode string used EBCDIC.
+
+If you find any remaining cases of this, please report them as bugs.
+
 =head2 Porting code from perl-5.6.X
 
 Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
@@ -1535,7 +1560,7 @@ to work under 5.6, so you should be safe to try them out.
 A filehandle that should read or write UTF-8
 
   if ($] > 5.007) {
-    binmode $fh, ":utf8";
+    binmode $fh, ":encoding(utf8)";
   }
 
 =item *
@@ -1544,7 +1569,7 @@ A scalar that is going to be passed to some extension
 
 Be it Compress::Zlib, Apache::Request or any extension that has no
 mention of Unicode in the manpage, you need to make sure that the
-UTF-8 flag is stripped off. Note that at the time of this writing
+UTF8 flag is stripped off. Note that at the time of this writing
 (October 2002) the mentioned modules are not UTF-8-aware. Please
 check the documentation to verify if this is still true.
 
@@ -1558,7 +1583,7 @@ check the documentation to verify if this is still true.
 A scalar we got back from an extension
 
 If you believe the scalar comes back as UTF-8, you will most likely
-want the UTF-8 flag restored:
+want the UTF8 flag restored:
 
   if ($] > 5.007) {
     require Encode;
@@ -1620,7 +1645,7 @@ A large scalar that you know can only contain ASCII
 
 Scalars that contain only ASCII and are marked as UTF-8 are sometimes
 a drag to your program. If you recognize such a situation, just remove
-the UTF-8 flag:
+the UTF8 flag:
 
   utf8::downgrade($val) if $] > 5.007;
 
@@ -1628,7 +1653,7 @@ the UTF-8 flag:
 
 =head1 SEE ALSO
 
-L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perluniintro>, L<Encode>, L<open>, L<utf8>, L<bytes>,
 L<perlretut>, L<perlvar/"${^UNICODE}">
 
 =cut
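
The behaviour described in the added section on characters in the range 128 - 255, and the FAIL_OK argument to utf8::downgrade(), can be illustrated with a short script. This is a minimal sketch and not part of the patch. It assumes an ASCII platform, no "use locale", and no "use feature 'unicode_strings'" (or a "use v5.12"-style declaration that would enable it). The variable names are illustrative only.

```perl
use strict;
use warnings;

# "\xE9" is e-acute in Latin-1.  Assumption: ASCII platform, no locale,
# and no "unicode_strings" feature, so this string starts out under
# byte semantics.
my $s = "\xE9";
my $before = ($s =~ /\w/) ? 1 : 0;    # byte semantics: 128-255 have no class

utf8::upgrade($s);                    # same character, now UTF-8 internally
my $after = ($s =~ /\w/) ? 1 : 0;     # character semantics: a Latin-1 letter

# A character above 255 cannot fit into a byte.  With a true FAIL_OK
# argument, utf8::downgrade() returns false instead of dying.
my $wide = "\x{263A}";
my $ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "matches \\w before upgrade: $before\n";
print "matches \\w after upgrade:  $after\n";
print "downgrade succeeded:       $ok\n";
```

Under those assumptions the string should fail to match \w before the upgrade and match it afterwards, which is the change of semantics the patch warns about. Calling utf8::downgrade($wide) without the second argument would die on the wide character instead of returning false.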