Remove the other 4 bits of MAD code designed to abort on local $^L.

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 21c5bb3..c913047 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -10,6 +10,10 @@ Unicode support is an extensive requirement. While Perl does not
 implement the Unicode standard or the accompanying technical reports
 from cover to cover, Perl does support many Unicode features.
 
+People who want to learn to use Unicode in Perl should probably read
+L<the Perl Unicode tutorial|perlunitut> before reading this reference
+document.
+
 =over 4
 
 =item Input and Output Layers
@@ -20,15 +24,15 @@ the ":utf8" layer.  Other encodings can be converted to Perl's
 encoding on input or from Perl's encoding on output by use of the
 ":encoding(...)"  layer.  See L<open>.
 
-To indicate that Perl source itself is using a particular encoding,
-see L<encoding>.
+To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
 
 =item Regular Expressions
 
 The regular expression compiler produces polymorphic opcodes.  That is,
 the pattern adapts to the data and automatically switches to the Unicode
-character scheme when presented with Unicode data--or instead uses
-a traditional byte scheme when presented with byte data.
+character scheme when presented with data that is internally encoded in
+UTF-8 -- or instead uses a traditional byte scheme when presented with
+byte data.
 
 =item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
 
@@ -39,9 +43,6 @@ ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
 machines.  B<These are the only times when an explicit C<use utf8>
 is needed.>  See L<utf8>.
 
-You can also use the C<encoding> pragma to change the default encoding
-of the data in your script; see L<encoding>.
-
 =item BOM-marked scripts and UTF-16 scripts autodetected
 
 If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF16-BE,
@@ -58,11 +59,6 @@ they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
 downgraded with UTF-8 encoding.  This happens because the first 256
 codepoints in Unicode happens to agree with Latin-1.  
 
-If you wish to interpret byte strings as UTF-8 instead, use the
-C<encoding> pragma:
-
-    use encoding 'utf8';
-
 See L</"Byte and Character Semantics"> for more details.
 
 =back
@@ -112,9 +108,7 @@ If strings operating under byte semantics and strings with Unicode
 character data are concatenated, the new string will be created by
 decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
 old Unicode string used EBCDIC.  This translation is done without
-regard to the system's native 8-bit encoding.  To change this for
-systems with non-Latin-1 and non-EBCDIC native encodings, use the
-C<encoding> pragma.  See L<encoding>.
+regard to the system's native 8-bit encoding.
 
 Under character semantics, many operations that formerly operated on
 bytes now operate on characters. A character in Perl is
@@ -134,17 +128,16 @@ Character semantics have the following effects:
 Strings--including hash keys--and regular expression patterns may
 contain characters that have an ordinal value larger than 255.
 
-If you use a Unicode editor to edit your program, Unicode characters
-may occur directly within the literal strings in one of the various
-Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
-as such and converted to Perl's internal representation only if the
-appropriate L<encoding> is specified.
+If you use a Unicode editor to edit your program, Unicode characters may
+occur directly within the literal strings in UTF-8 encoding, or UTF-16.
+(The former requires a BOM or C<use utf8>, the latter requires a BOM.)
 
-Unicode characters can also be added to a string by using the
-C<\x{...}> notation.  The Unicode code for the desired character, in
-hexadecimal, should be placed in the braces. For instance, a smiley
-face is C<\x{263A}>.  This encoding scheme only works for characters
-with a code of 0x100 or above.
+Unicode characters can also be added to a string by using the C<\x{...}>
+notation.  The Unicode code for the desired character, in hexadecimal,
+should be placed in the braces. For instance, a smiley face is
+C<\x{263A}>.  This encoding scheme works for all characters, but
+for characters under 0x100, note that Perl may use an 8 bit encoding
+internally, for optimization and/or backward compatibility.
 
 Additionally, if you
 
@@ -163,8 +156,7 @@ names.
 =item *
 
 Regular expressions match characters instead of bytes.  "." matches
-a character instead of a byte.  The C<\C> pattern is provided to force
-a match a single byte--a C<char> in C, hence C<\C>.
+a character instead of a byte.
 
 =item *
 
@@ -173,17 +165,13 @@ bytes and match against the character properties specified in the
 Unicode properties database.  C<\w> can be used to match a Japanese
 ideograph, for instance.
 
-(However, and as a limitation of the current implementation, using
-C<\w> or C<\W> I<inside> a C<[...]> character class will still match
-with byte semantics.)
-
 =item *
 
 Named Unicode properties, scripts, and block ranges may be used like
 character classes via the C<\p{}> "matches property" construct and
 the C<\P{}> negation, "doesn't match property".
 
-See L</"Unicode  Character Properties"> for more details.
+See L</"Unicode Character Properties"> for more details.
 
 You can define your own character properties and use them
 in the regular expression with the C<\p{}> or C<\P{}> construct.
@@ -317,8 +305,7 @@ You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
 equal to C<\P{Tamil}>.
 
 B<NOTE: the properties, scripts, and blocks listed here are as of
-Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002.  Unicode 4.0.0
-came out in April 2003, and Perl 5.8.1 in September 2003.>
+Unicode 5.0.0 in July 2006.>
 
 =over 4
 
@@ -425,16 +412,23 @@ such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
 
     Arabic
     Armenian
+    Balinese
     Bengali
     Bopomofo
+    Braille
+    Buginese
     Buhid
     CanadianAboriginal
     Cherokee
+    Coptic
+    Cuneiform
+    Cypriot
     Cyrillic
     Deseret
     Devanagari
     Ethiopic
     Georgian
+    Glagolitic
     Gothic
     Greek
     Gujarati
@@ -447,25 +441,39 @@ such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
     Inherited
     Kannada
     Katakana
+    Kharoshthi
     Khmer
     Lao
     Latin
+    Limbu
+    LinearB
     Malayalam
     Mongolian
     Myanmar
+    NewTaiLue
+    Nko
     Ogham
     OldItalic
+    OldPersian
     Oriya
+    Osmanya
+    PhagsPa
+    Phoenician
     Runic
+    Shavian
     Sinhala
+    SylotiNagri
     Syriac
     Tagalog
     Tagbanwa
+    TaiLe
     Tamil
     Telugu
     Thaana
     Thai
     Tibetan
+    Tifinagh
+    Ugaritic
     Yi
 
 =item Extended property classes
@@ -479,7 +487,6 @@ properties, defined by the F<PropList> Unicode database:
     Deprecated
     Diacritic
     Extender
-    GraphemeLink
     HexDigit
     Hyphen
     Ideographic
@@ -491,31 +498,44 @@ properties, defined by the F<PropList> Unicode database:
     OtherAlphabetic
     OtherDefaultIgnorableCodePoint
     OtherGraphemeExtend
+    OtherIDStart
+    OtherIDContinue
     OtherLowercase
     OtherMath
     OtherUppercase
+    PatternSyntax
+    PatternWhiteSpace
     QuotationMark
     Radical
     SoftDotted
+    STerm
     TerminalPunctuation
     UnifiedIdeograph
+    VariationSelector
     WhiteSpace
 
 and there are further derived properties:
 
-    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
-    Lowercase       Ll + OtherLowercase
-    Uppercase       Lu + OtherUppercase
-    Math            Sm + OtherMath
+    Alphabetic  =  Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic
+    Lowercase   =  Ll + OtherLowercase
+    Uppercase   =  Lu + OtherUppercase
+    Math        =  Sm + OtherMath
+
+    IDStart     =  Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart
+    IDContinue  =  IDStart + Mn + Mc + Nd + Pc + OtherIDContinue
 
-    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
-    ID_Continue     ID_Start + Mn + Mc + Nd + Pc
+    DefaultIgnorableCodePoint
+                =  OtherDefaultIgnorableCodePoint
+                   + Cf + Cc + Cs + Noncharacters + VariationSelector
+                   - WhiteSpace - FFF9..FFFB (Annotation Characters)
 
-    Any             Any character
-    Assigned        Any non-Cn character (i.e. synonym for \P{Cn})
-    Unassigned      Synonym for \p{Cn}
-    Common          Any character (or unassigned code point)
-                    not explicitly assigned to a script
+    Any         =  Any code points (i.e. U+0000 to U+10FFFF)
+    Assigned    =  Any non-Cn code points (i.e. synonym for \P{Cn})
+    Unassigned  =  Synonym for \p{Cn}
+    ASCII       =  ASCII (i.e. U+0000 to U+007F)
+
+    Common      =  Any character (or unassigned code point)
+                   not explicitly assigned to a script
 
 =item Use of "Is" Prefix
 
@@ -535,9 +555,9 @@ blocks. It does not, for example, contain digits, because digits are
 shared across many scripts. Digits and similar groups, like
 punctuation, are in a category called C<Common>.
 
-For more about scripts, see the UTR #24:
+For more about scripts, see the UAX#24 "Script Names":
 
-   http://www.unicode.org/unicode/reports/tr24/
+   http://www.unicode.org/reports/tr24/
 
 For more about blocks, see:
 
@@ -551,12 +571,17 @@ for block tests to avoid confusion.
 
 These block names are supported:
 
+    InAegeanNumbers
     InAlphabeticPresentationForms
+    InAncientGreekMusicalNotation
+    InAncientGreekNumbers
     InArabic
     InArabicPresentationFormsA
     InArabicPresentationFormsB
+    InArabicSupplement
     InArmenian
     InArrows
+    InBalinese
     InBasicLatin
     InBengali
     InBlockElements
@@ -564,6 +589,7 @@ These block names are supported:
     InBopomofoExtended
     InBoxDrawing
     InBraillePatterns
+    InBuginese
     InBuhid
     InByzantineMusicalSymbols
     InCJKCompatibility
@@ -571,27 +597,38 @@ These block names are supported:
     InCJKCompatibilityIdeographs
     InCJKCompatibilityIdeographsSupplement
     InCJKRadicalsSupplement
+    InCJKStrokes
     InCJKSymbolsAndPunctuation
     InCJKUnifiedIdeographs
     InCJKUnifiedIdeographsExtensionA
     InCJKUnifiedIdeographsExtensionB
     InCherokee
     InCombiningDiacriticalMarks
+    InCombiningDiacriticalMarksSupplement
     InCombiningDiacriticalMarksforSymbols
     InCombiningHalfMarks
     InControlPictures
+    InCoptic
+    InCountingRodNumerals
+    InCuneiform
+    InCuneiformNumbersAndPunctuation
     InCurrencySymbols
+    InCypriotSyllabary
     InCyrillic
-    InCyrillicSupplementary
+    InCyrillicSupplement
     InDeseret
     InDevanagari
     InDingbats
     InEnclosedAlphanumerics
     InEnclosedCJKLettersAndMonths
     InEthiopic
+    InEthiopicExtended
+    InEthiopicSupplement
     InGeneralPunctuation
     InGeometricShapes
     InGeorgian
+    InGeorgianSupplement
+    InGlagolitic
     InGothic
     InGreekExtended
     InGreekAndCoptic
@@ -613,13 +650,20 @@ These block names are supported:
     InKannada
     InKatakana
     InKatakanaPhoneticExtensions
+    InKharoshthi
     InKhmer
+    InKhmerSymbols
     InLao
     InLatin1Supplement
     InLatinExtendedA
     InLatinExtendedAdditional
     InLatinExtendedB
+    InLatinExtendedC
+    InLatinExtendedD
     InLetterlikeSymbols
+    InLimbu
+    InLinearBIdeograms
+    InLinearBSyllabary
     InLowSurrogates
     InMalayalam
     InMathematicalAlphanumericSymbols
@@ -627,17 +671,28 @@ These block names are supported:
     InMiscellaneousMathematicalSymbolsA
     InMiscellaneousMathematicalSymbolsB
     InMiscellaneousSymbols
+    InMiscellaneousSymbolsAndArrows
     InMiscellaneousTechnical
+    InModifierToneLetters
     InMongolian
     InMusicalSymbols
     InMyanmar
+    InNKo
+    InNewTaiLue
     InNumberForms
     InOgham
     InOldItalic
+    InOldPersian
     InOpticalCharacterRecognition
     InOriya
+    InOsmanya
+    InPhagspa
+    InPhoenician
+    InPhoneticExtensions
+    InPhoneticExtensionsSupplement
     InPrivateUseArea
     InRunic
+    InShavian
     InSinhala
     InSmallFormVariants
     InSpacingModifierLetters
@@ -646,21 +701,30 @@ These block names are supported:
     InSupplementalArrowsA
     InSupplementalArrowsB
     InSupplementalMathematicalOperators
+    InSupplementalPunctuation
     InSupplementaryPrivateUseAreaA
     InSupplementaryPrivateUseAreaB
+    InSylotiNagri
     InSyriac
     InTagalog
     InTagbanwa
     InTags
+    InTaiLe
+    InTaiXuanJingSymbols
     InTamil
     InTelugu
     InThaana
     InThai
     InTibetan
+    InTifinagh
+    InUgaritic
     InUnifiedCanadianAboriginalSyllabics
     InVariationSelectors
+    InVariationSelectorsSupplement
+    InVerticalForms
     InYiRadicals
     InYiSyllables
+    InYijingHexagramSymbols
 
 =back
 
@@ -845,9 +909,8 @@ See L<Encode>.
 
 The following list of Unicode support for regular expressions describes
 all the features currently supported.  The references to "Level N"
-and the section numbers refer to the Unicode Technical Report 18,
-"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
-Perl 5.8.0).
+and the section numbers refer to the Unicode Technical Standard #18,
+"Unicode Regular Expressions", version 11, in May 2005.
 
 =over 4
 
@@ -855,37 +918,42 @@ Perl 5.8.0).
 
 Level 1 - Basic Unicode Support
 
-        2.1 Hex Notation                        - done          [1]
-            Named Notation                      - done          [2]
-        2.2 Categories                          - done          [3][4]
-        2.3 Subtraction                         - MISSING       [5][6]
-        2.4 Simple Word Boundaries              - done          [7]
-        2.5 Simple Loose Matches                - done          [8]
-        2.6 End of Line                         - MISSING       [9][10]
-
-        [ 1] \x{...}
-        [ 2] \N{...}
-        [ 3] . \p{...} \P{...}
-        [ 4] support for scripts (see UTR#24 Script Names), blocks,
-             binary properties, enumerated non-binary properties, and
-             numeric properties (as listed in UTR#18 Other Properties)
-        [ 5] have negation
-        [ 6] can use regular expression look-ahead [a]
-             or user-defined character properties [b] to emulate subtraction
-        [ 7] include Letters in word characters
-        [ 8] note that Perl does Full case-folding in matching, not Simple:
+        RL1.1   Hex Notation                        - done          [1]
+        RL1.2   Properties                          - done          [2][3]
+        RL1.2a  Compatibility Properties            - done          [4]
+        RL1.3   Subtraction and Intersection        - MISSING       [5]
+        RL1.4   Simple Word Boundaries              - done          [6]
+        RL1.5   Simple Loose Matches                - done          [7]
+        RL1.6   Line Boundaries                     - MISSING       [8]
+        RL1.7   Supplementary Code Points           - done          [9]
+
+        [1]  \x{...}
+        [2]  \p{...} \P{...}
+        [3]  supports not only minimal list (general category, scripts,
+             Alphabetic, Lowercase, Uppercase, WhiteSpace,
+             NoncharacterCodePoint, DefaultIgnorableCodePoint, Any,
+             ASCII, Assigned), but also bidirectional types, blocks, etc.
+             (see L</"Unicode Character Properties">)
+        [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
+        [5]  can use regular expression look-ahead [a] or
+             user-defined character properties [b] to emulate set operations
+        [6]  \b \B
+        [7]  note that Perl does Full case-folding in matching, not Simple:
              for example U+1F88 is equivalent with U+1F00 U+03B9,
              not with 1F80.  This difference matters for certain Greek
              capital letters with certain modifiers: the Full case-folding
              decomposes the letter, while the Simple case-folding would map
              it to a single character.
-        [ 9] see UTR #13 Unicode Newline Guidelines
-        [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
-             (should also affect <>, $., and script line numbers)
-             (the \x{85}, \x{2028} and \x{2029} do match \s)
+        [8]  should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r),
+             CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029);
+             should also affect <>, $., and script line numbers;
+             should not split lines within CRLF [c] (i.e. there is no empty
+             line between \r and \n)
+        [9]  UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to U+10FFFF
+             but also beyond U+10FFFF [d]
 
 [a] You can mimic class subtraction using lookahead.
-For example, what UTR #18 might write as
+For example, what UTS#18 might write as
 
     [{Greek}-[{UNASSIGNED}]]
 
@@ -901,40 +969,62 @@ But in this particular example, you probably really want
 which will match assigned characters known to be part of the Greek script.
 
 Also see the Unicode::Regex::Set module, it does implement the full
-UTR #18 grouping, intersection, union, and removal (subtraction) syntax.
+UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
+
+[b] '+' for union, '-' for removal (set-difference), '&' for intersection
+(see L</"User-Defined Character Properties">)
+
+[c] Try the C<:crlf> layer (see L<PerlIO>).
 
-[b] See L</"User-Defined Character Properties">.
+[d] Avoid C<use warnings 'utf8';> (or say C<no warnings 'utf8';>) to allow
+U+FFFF (C<\x{FFFF}>).
 
 =item *
 
 Level 2 - Extended Unicode Support
 
-        3.1 Surrogates                          - MISSING      [11]
-        3.2 Canonical Equivalents               - MISSING       [12][13]
-        3.3 Locale-Independent Graphemes        - MISSING       [14]
-        3.4 Locale-Independent Words            - MISSING       [15]
-        3.5 Locale-Independent Loose Matches    - MISSING       [16]
-
-        [11] Surrogates are solely a UTF-16 concept and Perl's internal
-             representation is UTF-8.  The Encode module does UTF-16, though.
-        [12] see UTR#15 Unicode Normalization
-        [13] have Unicode::Normalize but not integrated to regexes
-        [14] have \X but at this level . should equal that
-        [15] need three classes, not just \w and \W
-        [16] see UTR#21 Case Mappings
+        RL2.1   Canonical Equivalents           - MISSING       [10][11]
+        RL2.2   Default Grapheme Clusters       - MISSING       [12][13]
+        RL2.3   Default Word Boundaries         - MISSING       [14]
+        RL2.4   Default Loose Matches           - MISSING       [15]
+        RL2.5   Name Properties                 - MISSING       [16]
+        RL2.6   Wildcard Properties             - MISSING
+
+        [10] see UAX#15 "Unicode Normalization Forms"
+        [11] have Unicode::Normalize but not integrated to regexes
+        [12] have \X but at this level . should equal that
+        [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable
+             clusters as a single grapheme cluster.
+        [14] see UAX#29, Word Boundaries
+        [15] see UAX#21 "Case Mappings"
+        [16] have \N{...} but neither compute names of CJK Ideographs
+             and Hangul Syllables nor use a loose match [e]
+
+[e] C<\N{...}> allows namespaces (see L<charnames>).
 
 =item *
 
-Level 3 - Locale-Sensitive Support
-
-        4.1 Locale-Dependent Categories         - MISSING
-        4.2 Locale-Dependent Graphemes          - MISSING       [16][17]
-        4.3 Locale-Dependent Words              - MISSING
-        4.4 Locale-Dependent Loose Matches      - MISSING
-        4.5 Locale-Dependent Ranges             - MISSING
-
-        [16] see UTR#10 Unicode Collation Algorithms
-        [17] have Unicode::Collate but not integrated to regexes
+Level 3 - Tailored Support
+
+        RL3.1   Tailored Punctuation            - MISSING
+        RL3.2   Tailored Grapheme Clusters      - MISSING       [17][18]
+        RL3.3   Tailored Word Boundaries        - MISSING
+        RL3.4   Tailored Loose Matches          - MISSING
+        RL3.5   Tailored Ranges                 - MISSING
+        RL3.6   Context Matching                - MISSING       [19]
+        RL3.7   Incremental Matches             - MISSING
+      ( RL3.8   Unicode Set Sharing )
+        RL3.9   Possible Match Sets             - MISSING
+        RL3.10  Folded Matching                 - MISSING       [20]
+        RL3.11  Submatchers                     - MISSING
+
+        [17] see UTS#10 "Unicode Collation Algorithm"
+        [18] have Unicode::Collate but not integrated to regexes
+        [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see
+             outside of the target substring
+        [20] need insensitive matching for linguistic features other than case;
+             for example, hiragana to katakana, wide and narrow, simplified Han
+             to traditional Han (see UTR#30 "Character Foldings")
 
 =back
 
@@ -1339,7 +1429,7 @@ Unicode is discouraged.
 =head2 Interaction with Extensions
 
 When Perl exchanges data with an extension, the extension should be
-able to understand the UTF-8 flag and act accordingly. If the
+able to understand the UTF8 flag and act accordingly. If the
 extension doesn't know about the flag, it's likely that the extension
 will return incorrectly-flagged data.
 
@@ -1442,7 +1532,7 @@ A scalar that is going to be passed to some extension
 
 Be it Compress::Zlib, Apache::Request or any extension that has no
 mention of Unicode in the manpage, you need to make sure that the
-UTF-8 flag is stripped off. Note that at the time of this writing
+UTF8 flag is stripped off. Note that at the time of this writing
 (October 2002) the mentioned modules are not UTF-8-aware. Please
 check the documentation to verify if this is still true.
 
@@ -1456,7 +1546,7 @@ check the documentation to verify if this is still true.
 A scalar we got back from an extension
 
 If you believe the scalar comes back as UTF-8, you will most likely
-want the UTF-8 flag restored:
+want the UTF8 flag restored:
 
   if ($] > 5.007) {
     require Encode;
@@ -1518,7 +1608,7 @@ A large scalar that you know can only contain ASCII
 
 Scalars that contain only ASCII and are marked as UTF-8 are sometimes
 a drag to your program. If you recognize such a situation, just remove
-the UTF-8 flag:
+the UTF8 flag:
 
   utf8::downgrade($val) if $] > 5.007;
 
@@ -1526,7 +1616,7 @@ the UTF-8 flag:
 
 =head1 SEE ALSO
 
-L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perluniintro>, L<Encode>, L<open>, L<utf8>, L<bytes>,
 L<perlretut>, L<perlvar/"${^UNICODE}">
 
 =cut
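
Note on the patched text: the sketch below is not part of the patch. It is a minimal illustration, under stated assumptions, of two points the revised documentation makes: that C<\x{...}> works for every code point (including those below 0x100), and that class subtraction can be emulated with a negative look-ahead, as footnote [a] describes. The variable names and the choice of "Greek minus uppercase letters" as the set difference are illustrative assumptions, not taken from perlunicode.

```perl
use strict;
use warnings;
use utf8;    # only matters if this source file contains literal UTF-8

# \x{...} accepts any code point, above or below 0x100.
my $smiley  = "\x{263A}";    # WHITE SMILING FACE
my $e_acute = "\x{E9}";      # LATIN SMALL LETTER E WITH ACUTE

# Under character semantics each is one character long, regardless of
# how Perl chooses to store it internally.
my @lengths = (length($smiley), length($e_acute));

# Emulate the set difference [Greek - uppercase letters] with a
# negative look-ahead (hypothetical example set, chosen for clarity).
my $greek_not_upper = qr/(?!\p{Lu})\p{Greek}/;

my $alpha_matches = ("\x{3B1}" =~ $greek_not_upper) ? 1 : 0;  # small alpha
my $upper_matches = ("\x{391}" =~ $greek_not_upper) ? 1 : 0;  # capital alpha

print "lengths: @lengths\n";
print "small alpha matches: $alpha_matches\n";
print "capital alpha matches: $upper_matches\n";
```

The look-ahead rejects a position before C<\p{Greek}> is tried, so the pattern still consumes exactly one character. For full union, intersection, and removal syntax, the patched text points to Unicode::Regex::Set and to user-defined character properties instead.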