B<NOTE: this should be the only place where an explicit C<use utf8> is
needed>.
+You can also use the C<encoding> pragma to change the default encoding
+of the data in your script; see L<encoding>.
+
=back
=head2 Byte and Character semantics
apply; otherwise, byte semantics are in effect. To force byte semantics
on Unicode data, the C<bytes> pragma should be used.
-Notice that if you have a string with byte semantics and you then
-add character data into it, the bytes will be upgraded I<as if they
-were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a translation
-to ISO 8859-1).
+Notice that if you concatenate strings with byte semantics and strings
+with Unicode character data, the bytes will by default be upgraded
+I<as if they were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a
+translation to ISO 8859-1). To change this, use the C<encoding>
+pragma; see L<encoding>.
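
As a rough illustration of the default upgrade, the following sketch concatenates a one-byte string with a string holding a character above 0xFF. The variable names are invented for this example, and the code point values assume an ASCII-based platform rather than EBCDIC.

```perl
use strict;
use warnings;

# One byte, 0xE9: under byte semantics this is just a byte value.
my $bytes = "\xE9";

# One character above 0xFF, which forces character semantics.
my $chars = "\x{263A}";

# Concatenation upgrades the byte as if it were Latin-1, so 0xE9
# becomes the single character U+00E9 rather than two raw bytes.
my $joined = $bytes . $chars;

my $length = length $joined;
my $first  = ord substr($joined, 0, 1);

printf "length=%d first=U+%04X\n", $length, $first;
```

The joined string is measured in characters, not bytes, which is why the former byte still counts as a single element.
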
Under character semantics, many operations that formerly operated on
bytes change to operating on characters. For ASCII data this makes no
any mark character. Single letter properties may omit the brackets,
so that can be written C<\pM> also. Many predefined character classes
are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.
-The recommended naming convention of the C<In> classes are the
-official Unicode script and block names, but with all non-alphanumeric
-characters removed, for example the block name C<"Latin-1 Supplement">
-becomes C<\p{InLatin1Supplement}>. Perl will ignore the case of
-letters, and any space or dash can be a space, dash, underbar, or be
-missing altogether, so C<\p{ in latin 1 supplement }> will work, too.
+
+The C<\p{Is...}> forms test for "general properties" such as "letter"
+and "digit", while the C<\p{In...}> forms test for Unicode scripts and
+blocks.
+
+The official Unicode script and block names use spaces and dashes as
+separators, but for convenience you can have dashes, spaces, and
+underbars at every word division, and you need not care about correct
+casing. It is recommended, however, that for consistency you use the
+following naming: the official Unicode script, block, or property name
+(see below for the additional rules that apply to block names),
+with whitespace and dashes replaced with underbars, and each word
+written "uppercase-first-lowercase-rest". That is, "Latin-1 Supplement"
+becomes "Latin_1_Supplement".
+
You can also negate both C<\p{}> and C<\P{}> by introducing a caret
-(^) between the first curly and the property name: C<\p{^InTamil}> is
-equal to C<\P{Tamil}>.
-
-Here is the list as of Unicode 3.1.0 (the two-letter classes) and
-as defined by Perl (the one-letter classes) (in Unicode materials
-what Perl calls C<L> is often called C<L&>):
-
- L Letter
- Lu Letter, Uppercase
- Ll Letter, Lowercase
- Lt Letter, Titlecase
- Lm Letter, Modifier
- Lo Letter, Other
- M Mark
- Mn Mark, Non-Spacing
- Mc Mark, Spacing Combining
- Me Mark, Enclosing
- N Number
- Nd Number, Decimal Digit
- Nl Number, Letter
- No Number, Other
- P Punctuation
- Pc Punctuation, Connector
- Pd Punctuation, Dash
- Ps Punctuation, Open
- Pe Punctuation, Close
- Pi Punctuation, Initial quote
- (may behave like Ps or Pe depending on usage)
- Pf Punctuation, Final quote
- (may behave like Ps or Pe depending on usage)
- Po Punctuation, Other
- S Symbol
- Sm Symbol, Math
- Sc Symbol, Currency
- Sk Symbol, Modifier
- So Symbol, Other
- Z Separator
- Zs Separator, Space
- Zl Separator, Line
- Zp Separator, Paragraph
- C Other
- Cc Other, Control
- Cf Other, Format
- Cs Other, Surrogate
- Co Other, Private Use
- Cn Other, Not Assigned (Unicode defines no Cn characters)
+(^) between the first curly and the property name: C<\p{^In_Tamil}> is
+equal to C<\P{In_Tamil}>.
+
+The C<In> and C<Is> can be left out: C<\p{Greek}> is equal to
+C<\p{In_Greek}>, and C<\P{Pd}> is equal to C<\P{Is_Pd}>.
+
+ Short Long
+
+ L Letter
+ Lu Uppercase_Letter
+ Ll Lowercase_Letter
+ Lt Titlecase_Letter
+ Lm Modifier_Letter
+ Lo Other_Letter
+
+ M Mark
+ Mn Nonspacing_Mark
+ Mc Spacing_Mark
+ Me Enclosing_Mark
+
+ N Number
+ Nd Decimal_Number
+ Nl Letter_Number
+ No Other_Number
+
+ P Punctuation
+ Pc Connector_Punctuation
+ Pd Dash_Punctuation
+ Ps Open_Punctuation
+ Pe Close_Punctuation
+ Pi Initial_Punctuation
+ (may behave like Ps or Pe depending on usage)
+ Pf Final_Punctuation
+ (may behave like Ps or Pe depending on usage)
+ Po Other_Punctuation
+
+ S Symbol
+ Sm Math_Symbol
+ Sc Currency_Symbol
+ Sk Modifier_Symbol
+ So Other_Symbol
+
+ Z Separator
+ Zs Space_Separator
+ Zl Line_Separator
+ Zp Paragraph_Separator
+
+ C Other
+ Cc Control
+ Cf Format
+ Cs Surrogate
+ Co Private_Use
+ Cn Unassigned
+
+There's also C<L&> which is an alias for C<Ll>, C<Lu>, and C<Lt>.
+
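The short and long names in the table, the brace-less single-letter form, and the two ways of negating can be tried out with a small sketch like the one below. The variable names are invented for illustration, and which property spellings are accepted may differ between Perl releases, so treat this as an assumption to check against your own installation.

```perl
use strict;
use warnings;

my $upper = "A";    # U+0041, general category Lu
my $lower = "a";    # U+0061, general category Ll

# Short and long forms name the same general category.
my $short_ok = $upper =~ /\p{Lu}/               ? 1 : 0;
my $long_ok  = $upper =~ /\p{Uppercase_Letter}/ ? 1 : 0;

# A single-letter property may omit the braces.
my $letter_ok = $lower =~ /\pL/ ? 1 : 0;

# The caret form \p{^Lu} and the capital-P form \P{Lu} both negate.
my $caret_neg = $lower =~ /\p{^Lu}/ ? 1 : 0;
my $big_p_neg = $lower =~ /\P{Lu}/  ? 1 : 0;

# L& covers the cased letters: Ll, Lu, and Lt.
my $cased_ok = $lower =~ /\p{L&}/ ? 1 : 0;

print "$short_ok $long_ok $letter_ok $caret_neg $big_p_neg $cased_ok\n";
```
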
+The following reserved ranges have C<In> tests:
+
+ CJK_Ideograph_Extension_A
+ CJK_Ideograph
+ Hangul_Syllable
+ Non_Private_Use_High_Surrogate
+ Private_Use_High_Surrogate
+ Low_Surrogate
+ Private_Surrogate
+ CJK_Ideograph_Extension_B
+ Plane_15_Private_Use
+ Plane_16_Private_Use
+
+For example C<"\x{AC00}" =~ /\p{HangulSyllable}/> will test true.
+(Handling of surrogates is not implemented yet, because Perl
+uses UTF-8 and not UTF-16 internally to represent Unicode.)
Additionally, because scripts differ in their directionality
(for example Hebrew is written right to left), all characters
have their directionality defined:
- BidiL Left-to-Right
- BidiLRE Left-to-Right Embedding
- BidiLRO Left-to-Right Override
- BidiR Right-to-Left
- BidiAL Right-to-Left Arabic
- BidiRLE Right-to-Left Embedding
- BidiRLO Right-to-Left Override
- BidiPDF Pop Directional Format
- BidiEN European Number
- BidiES European Number Separator
- BidiET European Number Terminator
- BidiAN Arabic Number
- BidiCS Common Number Separator
- BidiNSM Non-Spacing Mark
- BidiBN Boundary Neutral
- BidiB Paragraph Separator
- BidiS Segment Separator
- BidiWS Whitespace
- BidiON Other Neutrals
+ BidiL Left-to-Right
+ BidiLRE Left-to-Right Embedding
+ BidiLRO Left-to-Right Override
+ BidiR Right-to-Left
+ BidiAL Right-to-Left Arabic
+ BidiRLE Right-to-Left Embedding
+ BidiRLO Right-to-Left Override
+ BidiPDF Pop Directional Format
+ BidiEN European Number
+ BidiES European Number Separator
+ BidiET European Number Terminator
+ BidiAN Arabic Number
+ BidiCS Common Number Separator
+ BidiNSM Non-Spacing Mark
+ BidiBN Boundary Neutral
+ BidiB Paragraph Separator
+ BidiS Segment Separator
+ BidiWS Whitespace
+ BidiON Other Neutrals
=head2 Scripts
The scripts available for C<\p{In...}> and C<\P{In...}>, for example
C<\p{InLatin}> or C<\P{InHan}>, are as follows:
- Latin
- Greek
- Cyrillic
- Armenian
- Hebrew
- Arabic
- Syriac
- Thaana
- Devanagari
- Bengali
- Gurmukhi
- Gujarati
- Oriya
- Tamil
- Telugu
- Kannada
- Malayalam
- Sinhala
- Thai
- Lao
- Tibetan
- Myanmar
- Georgian
- Hangul
- Ethiopic
- Cherokee
- CanadianAboriginal
- Ogham
- Runic
- Khmer
- Mongolian
- Hiragana
- Katakana
- Bopomofo
- Han
- Yi
- OldItalic
- Gothic
- Deseret
- Inherited
+ Arabic
+ Armenian
+ Bengali
+ Bopomofo
+ Canadian-Aboriginal
+ Cherokee
+ Cyrillic
+ Deseret
+ Devanagari
+ Ethiopic
+ Georgian
+ Gothic
+ Greek
+ Gujarati
+ Gurmukhi
+ Han
+ Hangul
+ Hebrew
+ Hiragana
+ Inherited
+ Kannada
+ Katakana
+ Khmer
+ Lao
+ Latin
+ Malayalam
+ Mongolian
+ Myanmar
+ Ogham
+ Old-Italic
+ Oriya
+ Runic
+ Sinhala
+ Syriac
+ Tamil
+ Telugu
+ Thaana
+ Thai
+ Tibetan
+ Yi
+
+There are also extended property classes that supplement the basic
+properties, defined by the F<PropList> Unicode database:
+
+ ASCII_Hex_Digit
+ Bidi_Control
+ Dash
+ Diacritic
+ Extender
+ Hex_Digit
+ Hyphen
+ Ideographic
+ Join_Control
+ Noncharacter_Code_Point
+ Other_Alphabetic
+ Other_Lowercase
+ Other_Math
+ Other_Uppercase
+ Quotation_Mark
+ White_Space
+
+and further derived properties:
+
+ Alphabetic Lu + Ll + Lt + Lm + Lo + Other_Alphabetic
+ Lowercase Ll + Other_Lowercase
+ Uppercase Lu + Other_Uppercase
+ Math Sm + Other_Math
+
+ ID_Start Lu + Ll + Lt + Lm + Lo + Nl
+ ID_Continue ID_Start + Mn + Mc + Nd + Pc
+
+ Any Any character
+ Assigned Any non-Cn character
+ Common Any character (or unassigned code point)
+ not explicitly assigned to a script
=head2 Blocks
In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
-former concept is closer to natural languages, while the latter
+scripts concept is closer to natural languages, while the blocks
concept is more an artificial grouping based on groups of 256 Unicode
characters. For example, the C<Latin> script contains letters from
-many blocks, but it does not contain all the characters from those
-blocks, it does not for example contain digits.
+many blocks. On the other hand, the C<Latin> script does not contain
+all the characters from those blocks: for example, it does not contain
+digits, because digits are shared across many scripts. Digits and
+other similar groups, like punctuation, are in a category called
+C<Common>.
For more about scripts see the UTR #24:
http://www.unicode.org/unicode/reports/tr24/
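
The difference between a script and a block can be seen with a letter and a digit that share a block but not a script. This is only a sketch: the variable names are invented, and it assumes a Perl in which C<\p{Latin}> and C<\p{Common}> test scripts and C<\p{InBasicLatin}> tests the block.

```perl
use strict;
use warnings;

my $letter = "A";
my $digit  = "1";

# The letter belongs to the Latin script; the digit does not.
my $letter_is_latin = $letter =~ /\p{Latin}/ ? 1 : 0;
my $digit_is_latin  = $digit  =~ /\p{Latin}/ ? 1 : 0;

# Digits are shared between scripts, so they are classed as Common.
my $digit_is_common = $digit =~ /\p{Common}/ ? 1 : 0;

# Both characters nevertheless sit in the same block, Basic Latin.
my $letter_in_block = $letter =~ /\p{InBasicLatin}/ ? 1 : 0;
my $digit_in_block  = $digit  =~ /\p{InBasicLatin}/ ? 1 : 0;

print "latin: $letter_is_latin $digit_is_latin\n";
print "common: $digit_is_common\n";
print "block: $letter_in_block $digit_in_block\n";
```
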
version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
Notice that this definition was introduced in Perl 5.8.0: in Perl
-5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the
-preferential character class definition; this meant that the
-definitions of some character classes changed (the ones in the
+5.6 only the blocks were used; in Perl 5.8.0 scripts became the
+preferential Unicode character class definition; this meant that
+the definitions of some character classes changed (the ones in the
below list that have the C<Block> appended).
- BasicLatin
- Latin1Supplement
- LatinExtendedA
- LatinExtendedB
- IPAExtensions
- SpacingModifierLetters
- CombiningDiacriticalMarks
- GreekBlock
- CyrillicBlock
- ArmenianBlock
- HebrewBlock
- ArabicBlock
- SyriacBlock
- ThaanaBlock
- DevanagariBlock
- BengaliBlock
- GurmukhiBlock
- GujaratiBlock
- OriyaBlock
- TamilBlock
- TeluguBlock
- KannadaBlock
- MalayalamBlock
- SinhalaBlock
- ThaiBlock
- LaoBlock
- TibetanBlock
- MyanmarBlock
- GeorgianBlock
- HangulJamo
- EthiopicBlock
- CherokeeBlock
- UnifiedCanadianAboriginalSyllabics
- OghamBlock
- RunicBlock
- KhmerBlock
- MongolianBlock
- LatinExtendedAdditional
- GreekExtended
- GeneralPunctuation
- SuperscriptsandSubscripts
- CurrencySymbols
- CombiningMarksforSymbols
- LetterlikeSymbols
- NumberForms
+ Alphabetic Presentation Forms
+ Arabic Block
+ Arabic Presentation Forms-A
+ Arabic Presentation Forms-B
+ Armenian Block
Arrows
- MathematicalOperators
- MiscellaneousTechnical
- ControlPictures
- OpticalCharacterRecognition
- EnclosedAlphanumerics
- BoxDrawing
- BlockElements
- GeometricShapes
- MiscellaneousSymbols
+ Basic Latin
+ Bengali Block
+ Block Elements
+ Bopomofo Block
+ Bopomofo Extended
+ Box Drawing
+ Braille Patterns
+ Byzantine Musical Symbols
+ CJK Compatibility
+ CJK Compatibility Forms
+ CJK Compatibility Ideographs
+ CJK Compatibility Ideographs Supplement
+ CJK Radicals Supplement
+ CJK Symbols and Punctuation
+ CJK Unified Ideographs
+ CJK Unified Ideographs Extension A
+ CJK Unified Ideographs Extension B
+ Cherokee Block
+ Combining Diacritical Marks
+ Combining Half Marks
+ Combining Marks for Symbols
+ Control Pictures
+ Currency Symbols
+ Cyrillic Block
+ Deseret Block
+ Devanagari Block
Dingbats
- BraillePatterns
- CJKRadicalsSupplement
- KangxiRadicals
- IdeographicDescriptionCharacters
- CJKSymbolsandPunctuation
- HiraganaBlock
- KatakanaBlock
- BopomofoBlock
- HangulCompatibilityJamo
+ Enclosed Alphanumerics
+ Enclosed CJK Letters and Months
+ Ethiopic Block
+ General Punctuation
+ Geometric Shapes
+ Georgian Block
+ Gothic Block
+ Greek Block
+ Greek Extended
+ Gujarati Block
+ Gurmukhi Block
+ Halfwidth and Fullwidth Forms
+ Hangul Compatibility Jamo
+ Hangul Jamo
+ Hangul Syllables
+ Hebrew Block
+ High Private Use Surrogates
+ High Surrogates
+ Hiragana Block
+ IPA Extensions
+ Ideographic Description Characters
Kanbun
- BopomofoExtended
- EnclosedCJKLettersandMonths
- CJKCompatibility
- CJKUnifiedIdeographsExtensionA
- CJKUnifiedIdeographs
- YiSyllables
- YiRadicals
- HangulSyllables
- HighSurrogates
- HighPrivateUseSurrogates
- LowSurrogates
- PrivateUse
- CJKCompatibilityIdeographs
- AlphabeticPresentationForms
- ArabicPresentationFormsA
- CombiningHalfMarks
- CJKCompatibilityForms
- SmallFormVariants
- ArabicPresentationFormsB
+ Kangxi Radicals
+ Kannada Block
+ Katakana Block
+ Khmer Block
+ Lao Block
+ Latin 1 Supplement
+ Latin Extended Additional
+ Latin Extended-A
+ Latin Extended-B
+ Letterlike Symbols
+ Low Surrogates
+ Malayalam Block
+ Mathematical Alphanumeric Symbols
+ Mathematical Operators
+ Miscellaneous Symbols
+ Miscellaneous Technical
+ Mongolian Block
+ Musical Symbols
+ Myanmar Block
+ Number Forms
+ Ogham Block
+ Old Italic Block
+ Optical Character Recognition
+ Oriya Block
+ Private Use
+ Runic Block
+ Sinhala Block
+ Small Form Variants
+ Spacing Modifier Letters
Specials
- HalfwidthandFullwidthForms
- OldItalicBlock
- GothicBlock
- DeseretBlock
- ByzantineMusicalSymbols
- MusicalSymbols
- MathematicalAlphanumericSymbols
- CJKUnifiedIdeographsExtensionB
- CJKCompatibilityIdeographsSupplement
+ Superscripts and Subscripts
+ Syriac Block
Tags
+ Tamil Block
+ Telugu Block
+ Thaana Block
+ Thai Block
+ Tibetan Block
+ Unified Canadian Aboriginal Syllabics
+ Yi Radicals
+ Yi Syllables
=item *
=item *
Case translation operators use the Unicode case translation tables
-when provided character input. Note that C<uc()> translates to
-uppercase, while C<ucfirst> translates to titlecase (for languages
-that make the distinction). Naturally the corresponding backslash
-sequences have the same semantics.
+when provided character input. Note that C<uc()> (also known as C<\U>
+in double-quoted strings) translates to uppercase, while C<ucfirst()>
+(also known as C<\u> in double-quoted strings) translates to titlecase
+(for languages that make the distinction). Naturally the
+corresponding backslash sequences have the same semantics.
=item *
=item *
-lc(), uc(), lcfirst(), and ucfirst() work only for some of the
-simplest cases, where the mapping goes from a single Unicode character
-to another single Unicode character, and where the mapping does not
-depend on surrounding characters, or on locales. More complex cases,
-where for example one character maps into several, are not yet
-implemented. See the Unicode Technical Report #21, Case Mappings,
-for more details. The Unicode::UCD module (part of Perl since 5.8.0)
-casespec() and casefold() interfaces supply information about the more
-complex cases.
+lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
+
+=over 8
+
+=item *
+
+the case mapping is from a single Unicode character to another
+single Unicode character
+
+=item *
+
+the case mapping is from a single Unicode character to more
+than one Unicode character
+
+=back
+
+What doesn't yet work are the following cases:
+
+=over 8
+
+=item *
+
+the "final sigma" (Greek)
+
+=item *
+
+anything to do with locales (Lithuanian, Turkish, Azeri)
+
+=back
+
+See the Unicode Technical Report #21, Case Mappings, for more details.
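
The two working cases, together with the uppercase versus titlecase distinction, can be sketched as follows. The characters were picked for illustration only: U+01C6 has separate uppercase and titlecase forms, and U+FB00 is a ligature whose uppercase is two characters. The variable names are invented for this example.

```perl
use strict;
use warnings;

# Single character to single character: U+01C6 is a digraph with
# distinct uppercase (U+01C4) and titlecase (U+01C5) forms.
my $dz    = "\x{01C6}";
my $upper = uc $dz;         # uppercase mapping
my $title = ucfirst $dz;    # titlecase mapping

# Single character to several characters: the ff ligature U+FB00
# has no one-character uppercase, so it maps to two letters.
my $ligature_upper = uc "\x{FB00}";

printf "U+%04X U+%04X %s\n", ord $upper, ord $title, $ligature_upper;
```
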
=item *