X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=106a4bf610cade2c8d61b9da3d65c07b97656af2;hb=17c338f39c13131c1bc175ef38013b54bc98396d;hp=641d99991de3bc56d4cf5f9b83aaf321d10b2ef8;hpb=1ac13f9adaf79f6c342d2230ad9a2b9a7918e1b2;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 641d999..106a4bf 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -52,6 +52,9 @@ ASCII based machines or recognize UTF-EBCDIC on EBCDIC based machines.
 B<NOTE: this should be the only place where an explicit C<use utf8> is
 needed>.
 
+You can also use the C<encoding> pragma to change the default encoding
+of the data in your script; see L<encoding>.
+
 =back
 
 =head2 Byte and Character semantics
@@ -102,10 +105,11 @@ literal UTF-8 string constant in the program), character semantics
 apply; otherwise, byte semantics are in effect.  To force byte semantics
 on Unicode data, the C<bytes> pragma should be used.
 
-Notice that if you have a string with byte semantics and you then
-add character data into it, the bytes will be upgraded I<as if they
-were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a translation
-to ISO 8859-1).
+Notice that if you concatenate strings with byte semantics and strings
+with Unicode character data, the bytes will by default be upgraded
+I<as if they were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a
+translation to ISO 8859-1).  To change this, use the C<encoding>
+pragma; see L<encoding>.
 
 Under character semantics, many operations that formerly operated on
 bytes change to operating on characters.  For ASCII data this makes no
@@ -173,157 +177,179 @@ are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.
 The C<\p{Is...}> test for "general properties" such as "letter",
 "digit", while the C<\p{In...}> test for Unicode scripts and blocks.
 
-The official Unicode script and block names have spaces and
-dashes and separators, but for convenience you can have
-dashes, spaces, and underbars at every word division, and
-you need not care about correct casing.  It is recommended,
-however, that for consistency you use the following naming:
-the official Unicode script or block name (see below for
-the additional rules that apply to block names), with the whitespace
-and dashes removed, and the words "uppercase-first-lowercase-otherwise".
-That is, "Latin-1 Supplement" becomes "Latin1Supplement".
+The official Unicode script and block names have spaces and dashes and
+separators, but for convenience you can have dashes, spaces, and
+underbars at every word division, and you need not care about correct
+casing.  It is recommended, however, that for consistency you use the
+following naming: the official Unicode script, block, or property name
+(see below for the additional rules that apply to block names),
+with whitespace and dashes replaced with underbars, and the words
+"uppercase-first-lowercase-rest".  That is, "Latin-1 Supplement"
+becomes "Latin_1_Supplement".
 
 You can also negate both C<\p{}> and C<\P{}> by introducing a caret
-(^) between the first curly and the property name: C<\p{^InTamil}> is
-equal to C<\P{InTamil}>.
+(^) between the first curly and the property name: C<\p{^In_Tamil}> is
+equal to C<\P{In_Tamil}>.
 
 The C<In> and C<Is> can be left out: C<\p{Greek}> is equal to
-C<\p{InGreek}>, C<\P{Pd}> is equal to C<\P{Pd}>.
-
-Here is the list as of Unicode 3.1.1 (the two-letter classes) and as
-defined by Perl (the one-letter classes).
-
-    L  Letter
-    Lu Letter, Uppercase
-    Ll Letter, Lowercase
-    Lt Letter, Titlecase
-    Lm Letter, Modifier
-    Lo Letter, Other
-    M  Mark
-    Mn Mark, Non-Spacing
-    Mc Mark, Spacing Combining
-    Me Mark, Enclosing
-    N  Number
-    Nd Number, Decimal Digit
-    Nl Number, Letter
-    No Number, Other
-    P  Punctuation
-    Pc Punctuation, Connector
-    Pd Punctuation, Dash
-    Ps Punctuation, Open
-    Pe Punctuation, Close
-    Pi Punctuation, Initial quote
-        (may behave like Ps or Pe depending on usage)
-    Pf Punctuation, Final quote
-        (may behave like Ps or Pe depending on usage)
-    Po Punctuation, Other
-    S  Symbol
-    Sm Symbol, Math
-    Sc Symbol, Currency
-    Sk Symbol, Modifier
-    So Symbol, Other
-    Z  Separator
-    Zs Separator, Space
-    Zl Separator, Line
-    Zp Separator, Paragraph
-    C  Other
-    Cc Other, Control
-    Cf Other, Format
-    Cs Other, Surrogate
-    Co Other, Private Use
-    Cn Other, Not Assigned
+C<\p{In_Greek}>, C<\P{Pd}> is equal to C<\P{IsPd}>.
+
+    Short       Long
+
+    L           Letter
+    Lu          Uppercase_Letter
+    Ll          Lowercase_Letter
+    Lt          Titlecase_Letter
+    Lm          Modifier_Letter
+    Lo          Other_Letter
+
+    M           Mark
+    Mn          Nonspacing_Mark
+    Mc          Spacing_Mark
+    Me          Enclosing_Mark
+
+    N           Number
+    Nd          Decimal_Number
+    Nl          Letter_Number
+    No          Other_Number
+
+    P           Punctuation
+    Pc          Connector_Punctuation
+    Pd          Dash_Punctuation
+    Ps          Open_Punctuation
+    Pe          Close_Punctuation
+    Pi          Initial_Punctuation
+                (may behave like Ps or Pe depending on usage)
+    Pf          Final_Punctuation
+                (may behave like Ps or Pe depending on usage)
+    Po          Other_Punctuation
+
+    S           Symbol
+    Sm          Math_Symbol
+    Sc          Currency_Symbol
+    Sk          Modifier_Symbol
+    So          Other_Symbol
+
+    Z           Separator
+    Zs          Space_Separator
+    Zl          Line_Separator
+    Zp          Paragraph_Separator
+
+    C           Other
+    Cc          Control
+    Cf          Format
+    Cs          Surrogate
+    Co          Private_Use
+    Cn          Unassigned
 
 There's also C<L&> which is an alias for C<Ll>, C<Lu>, and C<Lt>.
 
+The following reserved ranges have C<In> tests:
+
+    CJK_Ideograph_Extension_A
+    CJK_Ideograph
+    Hangul_Syllable
+    Non_Private_Use_High_Surrogate
+    Private_Use_High_Surrogate
+    Low_Surrogate
+    Private_Surrogate
+    CJK_Ideograph_Extension_B
+    Plane_15_Private_Use
+    Plane_16_Private_Use
+
+For example, C<"\x{AC00}" =~ /\p{HangulSyllable}/> will test true.
+(Handling of surrogates is not implemented yet, because Perl
+uses UTF-8 and not UTF-16 internally to represent Unicode.)
+
 Additionally, because scripts differ in their directionality
 (for example Hebrew is written right to left), all characters
 have their directionality defined:
 
-    BidiL   Left-to-Right
-    BidiLRE Left-to-Right Embedding
-    BidiLRO Left-to-Right Override
-    BidiR   Right-to-Left
-    BidiAL  Right-to-Left Arabic
-    BidiRLE Right-to-Left Embedding
-    BidiRLO Right-to-Left Override
-    BidiPDF Pop Directional Format
-    BidiEN  European Number
-    BidiES  European Number Separator
-    BidiET  European Number Terminator
-    BidiAN  Arabic Number
-    BidiCS  Common Number Separator
-    BidiNSM Non-Spacing Mark
-    BidiBN  Boundary Neutral
-    BidiB   Paragraph Separator
-    BidiS   Segment Separator
-    BidiWS  Whitespace
-    BidiON  Other Neutrals
+    BidiL       Left-to-Right
+    BidiLRE     Left-to-Right Embedding
+    BidiLRO     Left-to-Right Override
+    BidiR       Right-to-Left
+    BidiAL      Right-to-Left Arabic
+    BidiRLE     Right-to-Left Embedding
+    BidiRLO     Right-to-Left Override
+    BidiPDF     Pop Directional Format
+    BidiEN      European Number
+    BidiES      European Number Separator
+    BidiET      European Number Terminator
+    BidiAN      Arabic Number
+    BidiCS      Common Number Separator
+    BidiNSM     Non-Spacing Mark
+    BidiBN      Boundary Neutral
+    BidiB       Paragraph Separator
+    BidiS       Segment Separator
+    BidiWS      Whitespace
+    BidiON      Other Neutrals
 
 =head2 Scripts
 
 The scripts available for C<\p{In...}> and C<\P{In...}>, for example
 \p{InCyrillic>, are as follows, for example C<\p{InLatin}> or C<\P{InHan}>:
 
-    Latin
-    Greek
-    Cyrillic
-    Armenian
-    Hebrew
     Arabic
-    Syriac
-    Thaana
-    Devanagari
+    Armenian
     Bengali
-    Gurmukhi
+    Bopomofo
+    Canadian-Aboriginal
+    Cherokee
+    Cyrillic
+    Deseret
+    Devanagari
+    Ethiopic
+    Georgian
+    Gothic
+    Greek
     Gujarati
-    Oriya
-    Tamil
-    Telugu
+    Gurmukhi
+    Han
+    Hangul
+    Hebrew
+    Hiragana
+    Inherited
     Kannada
-    Malayalam
-    Sinhala
-    Thai
+    Katakana
+    Khmer
     Lao
-    Tibetan
+    Latin
+    Malayalam
+    Mongolian
     Myanmar
-    Georgian
-    Hangul
-    Ethiopic
-    Cherokee
-    CanadianAboriginal
     Ogham
+    Old-Italic
+    Oriya
     Runic
-    Khmer
-    Mongolian
-    Hiragana
-    Katakana
-    Bopomofo
-    Han
+    Sinhala
+    Syriac
+    Tamil
+    Telugu
+    Thaana
+    Thai
+    Tibetan
     Yi
-    OldItalic
-    Gothic
-    Deseret
-    Inherited
 
 There are also extended property classes that supplement the basic
 properties, defined by the F<PropList> Unicode database:
 
-    White_space
+    ASCII_Hex_Digit
     Bidi_Control
-    Join_Control
     Dash
-    Hyphen
-    Quotation_Mark
-    Other_Math
-    Hex_Digit
-    ASCII_Hex_Digit
-    Other_Alphabetic
-    Ideographic
     Diacritic
     Extender
+    Hex_Digit
+    Hyphen
+    Ideographic
+    Join_Control
+    Noncharacter_Code_Point
+    Other_Alphabetic
     Other_Lowercase
+    Other_Math
     Other_Uppercase
-    Noncharacter_Code_Point
+    Quotation_Mark
+    White_Space
 
 and further derived properties:
 
@@ -338,17 +364,20 @@ and further derived properties:
     Any             Any character
     Assigned        Any non-Cn character
     Common          Any character (or unassigned code point)
-                    not explicitly assigned to a script.
+                    not explicitly assigned to a script
 
 =head2 Blocks
 
 In addition to B<scripts>, Unicode also defines B<blocks> of
 characters.  The difference between scripts and blocks is that the
-former concept is closer to natural languages, while the latter
+scripts concept is closer to natural languages, while the blocks
 concept is more an artificial grouping based on groups of 256 Unicode
 characters.  For example, the C<Latin> script contains letters from
-many blocks, but it does not contain all the characters from those
-blocks, it does not for example contain digits.
+many blocks.  On the other hand, the C<Latin> script does not contain
+all the characters from those blocks: for example, it does not contain
+digits, because digits are shared across many scripts.  Digits and
+other similar groups, like punctuation, are in a category called
+C<Common>.
 
 For more about scripts see the UTR #24:
 http://www.unicode.org/unicode/reports/tr24/
@@ -360,107 +389,107 @@ a script called C<Katakana> and a block called C<Katakana>, the block
 version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
 
 Notice that this definition was introduced in Perl 5.8.0: in Perl
-5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the
+5.6 only the blocks were used; in Perl 5.8.0 scripts became the
 preferential Unicode character class definition; this meant that
 the definitions of some character classes changed (the ones in the
 below list that have the C<Block> appended).
 
-   BasicLatin
-   Latin1Supplement
-   LatinExtendedA
-   LatinExtendedB
-   IPAExtensions
-   SpacingModifierLetters
-   CombiningDiacriticalMarks
-   GreekBlock
-   CyrillicBlock
-   ArmenianBlock
-   HebrewBlock
-   ArabicBlock
-   SyriacBlock
-   ThaanaBlock
-   DevanagariBlock
-   BengaliBlock
-   GurmukhiBlock
-   GujaratiBlock
-   OriyaBlock
-   TamilBlock
-   TeluguBlock
-   KannadaBlock
-   MalayalamBlock
-   SinhalaBlock
-   ThaiBlock
-   LaoBlock
-   TibetanBlock
-   MyanmarBlock
-   GeorgianBlock
-   HangulJamo
-   EthiopicBlock
-   CherokeeBlock
-   UnifiedCanadianAboriginalSyllabics
-   OghamBlock
-   RunicBlock
-   KhmerBlock
-   MongolianBlock
-   LatinExtendedAdditional
-   GreekExtended
-   GeneralPunctuation
-   SuperscriptsandSubscripts
-   CurrencySymbols
-   CombiningMarksforSymbols
-   LetterlikeSymbols
-   NumberForms
+   Alphabetic Presentation Forms
+   Arabic Block
+   Arabic Presentation Forms-A
+   Arabic Presentation Forms-B
+   Armenian Block
    Arrows
-   MathematicalOperators
-   MiscellaneousTechnical
-   ControlPictures
-   OpticalCharacterRecognition
-   EnclosedAlphanumerics
-   BoxDrawing
-   BlockElements
-   GeometricShapes
-   MiscellaneousSymbols
+   Basic Latin
+   Bengali Block
+   Block Elements
+   Bopomofo Block
+   Bopomofo Extended
+   Box Drawing
+   Braille Patterns
+   Byzantine Musical Symbols
+   CJK Compatibility
+   CJK Compatibility Forms
+   CJK Compatibility Ideographs
+   CJK Compatibility Ideographs Supplement
+   CJK Radicals Supplement
+   CJK Symbols and Punctuation
+   CJK Unified Ideographs
+   CJK Unified Ideographs Extension A
+   CJK Unified Ideographs Extension B
+   Cherokee Block
+   Combining Diacritical Marks
+   Combining Half Marks
+   Combining Marks for Symbols
+   Control Pictures
+   Currency Symbols
+   Cyrillic Block
+   Deseret Block
+   Devanagari Block
    Dingbats
-   BraillePatterns
-   CJKRadicalsSupplement
-   KangxiRadicals
-   IdeographicDescriptionCharacters
-   CJKSymbolsandPunctuation
-   HiraganaBlock
-   KatakanaBlock
-   BopomofoBlock
-   HangulCompatibilityJamo
+   Enclosed Alphanumerics
+   Enclosed CJK Letters and Months
+   Ethiopic Block
+   General Punctuation
+   Geometric Shapes
+   Georgian Block
+   Gothic Block
+   Greek Block
+   Greek Extended
+   Gujarati Block
+   Gurmukhi Block
+   Halfwidth and Fullwidth Forms
+   Hangul Compatibility Jamo
+   Hangul Jamo
+   Hangul Syllables
+   Hebrew Block
+   High Private Use Surrogates
+   High Surrogates
+   Hiragana Block
+   IPA Extensions
+   Ideographic Description Characters
    Kanbun
-   BopomofoExtended
-   EnclosedCJKLettersandMonths
-   CJKCompatibility
-   CJKUnifiedIdeographsExtensionA
-   CJKUnifiedIdeographs
-   YiSyllables
-   YiRadicals
-   HangulSyllables
-   HighSurrogates
-   HighPrivateUseSurrogates
-   LowSurrogates
-   PrivateUse
-   CJKCompatibilityIdeographs
-   AlphabeticPresentationForms
-   ArabicPresentationFormsA
-   CombiningHalfMarks
-   CJKCompatibilityForms
-   SmallFormVariants
-   ArabicPresentationFormsB
+   Kangxi Radicals
+   Kannada Block
+   Katakana Block
+   Khmer Block
+   Lao Block
+   Latin 1 Supplement
+   Latin Extended Additional
+   Latin Extended-A
+   Latin Extended-B
+   Letterlike Symbols
+   Low Surrogates
+   Malayalam Block
+   Mathematical Alphanumeric Symbols
+   Mathematical Operators
+   Miscellaneous Symbols
+   Miscellaneous Technical
+   Mongolian Block
+   Musical Symbols
+   Myanmar Block
+   Number Forms
+   Ogham Block
+   Old Italic Block
+   Optical Character Recognition
+   Oriya Block
+   Private Use
+   Runic Block
+   Sinhala Block
+   Small Form Variants
+   Spacing Modifier Letters
    Specials
-   HalfwidthandFullwidthForms
-   OldItalicBlock
-   GothicBlock
-   DeseretBlock
-   ByzantineMusicalSymbols
-   MusicalSymbols
-   MathematicalAlphanumericSymbols
-   CJKUnifiedIdeographsExtensionB
-   CJKCompatibilityIdeographsSupplement
+   Superscripts and Subscripts
+   Syriac Block
    Tags
+   Tamil Block
+   Telugu Block
+   Thaana Block
+   Thai Block
+   Tibetan Block
+   Unified Canadian Aboriginal Syllabics
+   Yi Radicals
+   Yi Syllables
 
 =item *
 
@@ -480,10 +509,11 @@ pack('C0', ...).
 =item *
 
 Case translation operators use the Unicode case translation tables
-when provided character input.  Note that C<uc()> translates to
-uppercase, while C<ucfirst> translates to titlecase (for languages
-that make the distinction).  Naturally the corresponding backslash
-sequences have the same semantics.
+when provided character input.  Note that C<uc()> (also known as C<\U>
+in double-quoted strings) translates to uppercase, while C<ucfirst>
+(also known as C<\u> in double-quoted strings) translates to titlecase
+(for languages that make the distinction).  Naturally the
+corresponding backslash sequences have the same semantics.
 
 =item *
 
@@ -527,15 +557,37 @@ wide bit complement.
 
 =item *
 
-lc(), uc(), lcfirst(), and ucfirst() work only for some of the
-simplest cases, where the mapping goes from a single Unicode character
-to another single Unicode character, and where the mapping does not
-depend on surrounding characters, or on locales.  More complex cases,
-where for example one character maps into several, are not yet
-implemented.  See the Unicode Technical Report #21, Case Mappings,
-for more details.  The Unicode::UCD module (part of Perl since 5.8.0)
-casespec() and casefold() interfaces supply information about the more
-complex cases.
+lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
+
+=over 8
+
+=item *
+
+the case mapping is from a single Unicode character to another
+single Unicode character
+
+=item *
+
+the case mapping is from a single Unicode character to more
+than one Unicode character
+
+=back
+
+What doesn't yet work are the following cases:
+
+=over 8
+
+=item *
+
+the "final sigma" (Greek)
+
+=item *
+
+anything to do with locales (Lithuanian, Turkish, Azeri)
+
+=back
+
+See the Unicode Technical Report #21, Case Mappings, for more details.
 
 =item *