B<NOTE: this should be the only place where an explicit C<use utf8> is
needed>.
+You can also use the C<encoding> pragma to change the default encoding
+of the data in your script; see L<encoding>.
+
=back
=head2 Byte and Character semantics
apply; otherwise, byte semantics are in effect. To force byte semantics
on Unicode data, the C<bytes> pragma should be used.
-Notice that if you have a string with byte semantics and you then
-add character data into it, the bytes will be upgraded I<as if they
-were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a translation
-to ISO 8859-1).
+Notice that if you concatenate strings with byte semantics and strings
+with Unicode character data, the bytes will by default be upgraded
+I<as if they were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a
+translation to ISO 8859-1).  To change this, use the C<encoding>
+pragma; see L<encoding>.
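+
+As a minimal sketch of the default upgrade described above (assuming
+Perl 5.8 or later with no C<encoding> pragma in effect; the variable
+names are only illustrative):
+
+    use strict;
+    use warnings;
+
+    my $bytes = "\xE9";           # one byte, byte semantics
+    my $chars = "\x{263A}";       # one character above 0xFF
+    my $both  = $bytes . $chars;  # $bytes is upgraded as if Latin-1
+
+    # The byte 0xE9 becomes the character U+00E9, so this prints
+    # "2 00E9 263A".
+    printf "%d %04X %04X\n", length($both), map { ord } split //, $both;
+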
Under character semantics, many operations that formerly operated on
bytes change to operating on characters. For ASCII data this makes no
character with the Unicode uppercase property, while C<\p{M}> matches
any mark character. Single letter properties may omit the brackets,
so that can be written C<\pM> also. Many predefined character classes
-are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>. The
-names of the C<In> classes are the official Unicode script and block
-names but with all non-alphanumeric characters removed, for example
-the block name C<"Latin-1 Supplement"> becomes C<\p{InLatin1Supplement}>.
-
-Here is the list as of Unicode 3.1.0 (the two-letter classes) and
-as defined by Perl (the one-letter classes) (in Unicode materials
-what Perl calls C<L> is often called C<L&>):
-
- L Letter
- Lu Letter, Uppercase
- Ll Letter, Lowercase
- Lt Letter, Titlecase
- Lm Letter, Modifier
- Lo Letter, Other
- M Mark
- Mn Mark, Non-Spacing
- Mc Mark, Spacing Combining
- Me Mark, Enclosing
- N Number
- Nd Number, Decimal Digit
- Nl Number, Letter
- No Number, Other
- P Punctuation
- Pc Punctuation, Connector
- Pd Punctuation, Dash
- Ps Punctuation, Open
- Pe Punctuation, Close
- Pi Punctuation, Initial quote
- (may behave like Ps or Pe depending on usage)
- Pf Punctuation, Final quote
- (may behave like Ps or Pe depending on usage)
- Po Punctuation, Other
- S Symbol
- Sm Symbol, Math
- Sc Symbol, Currency
- Sk Symbol, Modifier
- So Symbol, Other
- Z Separator
- Zs Separator, Space
- Zl Separator, Line
- Zp Separator, Paragraph
- C Other
- Cc Other, Control
- Cf Other, Format
- Cs Other, Surrogate
- Co Other, Private Use
- Cn Other, Not Assigned (Unicode defines no Cn characters)
+are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.
+
+The C<\p{Is...}> forms test for "general properties" such as "letter"
+or "digit", while the C<\p{In...}> forms test for Unicode scripts and
+blocks.
+
+The official Unicode script and block names contain spaces and dashes
+as separators, but for convenience you may write dashes, spaces, or
+underbars at every word division, and case is not significant.  It is
+recommended, however, that for consistency you use the following
+naming: the official Unicode script, block, or property name (see
+below for the additional rules that apply to block names), with
+whitespace and dashes replaced by underbars, and each word written
+"uppercase-first-lowercase-rest".  That is, "Latin-1 Supplement"
+becomes "Latin_1_Supplement".
+
+You can also negate both C<\p{}> and C<\P{}> by introducing a caret
+(^) between the first curly and the property name: C<\p{^In_Tamil}> is
+equal to C<\P{In_Tamil}>.
+
+The C<In> and C<Is> can be left out: C<\p{Greek}> is equal to
+C<\p{In_Greek}>, C<\P{Pd}> is equal to C<\P{Is_Pd}>.
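+
+A short sketch of negation and of the prefix-free forms, using only
+the C<Greek> script and the C<Pd> category (this assumes a Perl recent
+enough to accept the caret form; the test characters are arbitrary):
+
+    use strict;
+    use warnings;
+
+    my $alpha = "\x{3B1}";      # GREEK SMALL LETTER ALPHA
+
+    print "alpha is Greek\n"     if $alpha =~ /\p{Greek}/;
+    print "'a' is not Greek\n"   if "a" =~ /\P{Greek}/;
+    print "caret form agrees\n"  if "a" =~ /\p{^Greek}/;
+    print "'-' is a dash\n"      if "-" =~ /\p{Pd}/;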
+
+    Short       Long
+
+    L           Letter
+    Lu          Uppercase_Letter
+    Ll          Lowercase_Letter
+    Lt          Titlecase_Letter
+    Lm          Modifier_Letter
+    Lo          Other_Letter
+
+    M           Mark
+    Mn          Nonspacing_Mark
+    Mc          Spacing_Mark
+    Me          Enclosing_Mark
+
+    N           Number
+    Nd          Decimal_Number
+    Nl          Letter_Number
+    No          Other_Number
+
+    P           Punctuation
+    Pc          Connector_Punctuation
+    Pd          Dash_Punctuation
+    Ps          Open_Punctuation
+    Pe          Close_Punctuation
+    Pi          Initial_Punctuation
+                (may behave like Ps or Pe depending on usage)
+    Pf          Final_Punctuation
+                (may behave like Ps or Pe depending on usage)
+    Po          Other_Punctuation
+
+    S           Symbol
+    Sm          Math_Symbol
+    Sc          Currency_Symbol
+    Sk          Modifier_Symbol
+    So          Other_Symbol
+
+    Z           Separator
+    Zs          Space_Separator
+    Zl          Line_Separator
+    Zp          Paragraph_Separator
+
+    C           Other
+    Cc          Control
+    Cf          Format
+    Cs          Surrogate
+    Co          Private_Use
+    Cn          Unassigned
+
+There's also C<L&> which is an alias for C<Ll>, C<Lu>, and C<Lt>.
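+
+As a rough illustration of the short names above, the following
+sketch classifies a few arbitrarily chosen characters by general
+category and also tries C<L&> on each (it assumes a Perl that accepts
+these short names without an C<Is> prefix):
+
+    use strict;
+    use warnings;
+
+    for my $ch ("A", "a", "\x{1C5}", "5", "_") {
+        my $cat = $ch =~ /\p{Lu}/ ? "Lu"
+                : $ch =~ /\p{Ll}/ ? "Ll"
+                : $ch =~ /\p{Lt}/ ? "Lt"
+                : $ch =~ /\p{Nd}/ ? "Nd"
+                : $ch =~ /\p{Pc}/ ? "Pc"
+                :                   "other";
+        my $cased = $ch =~ /\p{L&}/ ? "yes" : "no";
+        printf "U+%04X %-5s L&: %s\n", ord($ch), $cat, $cased;
+    }
+
+Here U+01C5 is a titlecase letter, so it is reported as C<Lt> and
+also matches C<L&>, while the digit and the underscore do not.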
+
+The following reserved ranges have C<In> tests:
+
+ CJK_Ideograph_Extension_A
+ CJK_Ideograph
+ Hangul_Syllable
+ Non_Private_Use_High_Surrogate
+ Private_Use_High_Surrogate
+ Low_Surrogate
+ Private_Surrogate
+ CJK_Ideograph_Extension_B
+ Plane_15_Private_Use
+ Plane_16_Private_Use
+
+For example C<"\x{AC00}" =~ /\p{HangulSyllable}/> will test true.
+(Handling of surrogates is not implemented yet, because Perl
+uses UTF-8 and not UTF-16 internally to represent Unicode.)
Additionally, because scripts differ in their directionality
(for example Hebrew is written right to left), all characters
have their directionality defined:
- BidiL Left-to-Right
- BidiLRE Left-to-Right Embedding
- BidiLRO Left-to-Right Override
- BidiR Right-to-Left
- BidiAL Right-to-Left Arabic
- BidiRLE Right-to-Left Embedding
- BidiRLO Right-to-Left Override
- BidiPDF Pop Directional Format
- BidiEN European Number
- BidiES European Number Separator
- BidiET European Number Terminator
- BidiAN Arabic Number
- BidiCS Common Number Separator
- BidiNSM Non-Spacing Mark
- BidiBN Boundary Neutral
- BidiB Paragraph Separator
- BidiS Segment Separator
- BidiWS Whitespace
- BidiON Other Neutrals
+    BidiL       Left-to-Right
+    BidiLRE     Left-to-Right Embedding
+    BidiLRO     Left-to-Right Override
+    BidiR       Right-to-Left
+    BidiAL      Right-to-Left Arabic
+    BidiRLE     Right-to-Left Embedding
+    BidiRLO     Right-to-Left Override
+    BidiPDF     Pop Directional Format
+    BidiEN      European Number
+    BidiES      European Number Separator
+    BidiET      European Number Terminator
+    BidiAN      Arabic Number
+    BidiCS      Common Number Separator
+    BidiNSM     Non-Spacing Mark
+    BidiBN      Boundary Neutral
+    BidiB       Paragraph Separator
+    BidiS       Segment Separator
+    BidiWS      Whitespace
+    BidiON      Other Neutrals
=head2 Scripts
The scripts available for C<\p{In...}> and C<\P{In...}>, for example
\p{InCyrillic>, are as follows, for example C<\p{InLatin}> or C<\P{InHan}>:
- Latin
- Greek
- Cyrillic
- Armenian
- Hebrew
- Arabic
- Syriac
- Thaana
- Devanagari
- Bengali
- Gurmukhi
- Gujarati
- Oriya
- Tamil
- Telugu
- Kannada
- Malayalam
- Sinhala
- Thai
- Lao
- Tibetan
- Myanmar
- Georgian
- Hangul
- Ethiopic
- Cherokee
- CanadianAboriginal
- Ogham
- Runic
- Khmer
- Mongolian
- Hiragana
- Katakana
- Bopomofo
- Han
- Yi
- OldItalic
- Gothic
- Deseret
- Inherited
+ Arabic
+ Armenian
+ Bengali
+ Bopomofo
+ Canadian-Aboriginal
+ Cherokee
+ Cyrillic
+ Deseret
+ Devanagari
+ Ethiopic
+ Georgian
+ Gothic
+ Greek
+ Gujarati
+ Gurmukhi
+ Han
+ Hangul
+ Hebrew
+ Hiragana
+ Inherited
+ Kannada
+ Katakana
+ Khmer
+ Lao
+ Latin
+ Malayalam
+ Mongolian
+ Myanmar
+ Ogham
+ Old-Italic
+ Oriya
+ Runic
+ Sinhala
+ Syriac
+ Tamil
+ Telugu
+ Thaana
+ Thai
+ Tibetan
+ Yi
+
+There are also extended property classes that supplement the basic
+properties, defined by the F<PropList> Unicode database:
+
+ ASCII_Hex_Digit
+ Bidi_Control
+ Dash
+ Diacritic
+ Extender
+ Hex_Digit
+ Hyphen
+ Ideographic
+ Join_Control
+ Noncharacter_Code_Point
+ Other_Alphabetic
+ Other_Lowercase
+ Other_Math
+ Other_Uppercase
+ Quotation_Mark
+ White_Space
+
+and further derived properties:
+
+    Alphabetic      Lu + Ll + Lt + Lm + Lo + Other_Alphabetic
+    Lowercase       Ll + Other_Lowercase
+    Uppercase       Lu + Other_Uppercase
+    Math            Sm + Other_Math
+
+    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
+    ID_Continue     ID_Start + Mn + Mc + Nd + Pc
+
+    Any             Any character
+    Assigned        Any non-Cn character
+    Common          Any character (or unassigned code point)
+                    not explicitly assigned to a script
=head2 Blocks
In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
-former concept is closer to natural languages, while the latter
+scripts concept is closer to natural languages, while the blocks
concept is more an artificial grouping based on groups of 256 Unicode
characters. For example, the C<Latin> script contains letters from
-many blocks, but it does not contain all the characters from those
-blocks, it does not for example contain digits.
+many blocks.  On the other hand, the C<Latin> script does not contain
+all the characters from those blocks; for example, it does not contain
+digits, because digits are shared across many scripts.  Digits and
+other similar groups, like punctuation, are in a category called
+C<Common>.
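+
+A minimal sketch of that distinction, assuming a Perl in which
+C<\p{Latin}> and C<\p{Common}> test scripts rather than blocks (the
+three sample characters are arbitrary):
+
+    use strict;
+    use warnings;
+
+    for my $ch ("a", "5", "!") {
+        printf "%s: Latin=%d Common=%d\n", $ch,
+            ($ch =~ /\p{Latin}/  ? 1 : 0),
+            ($ch =~ /\p{Common}/ ? 1 : 0);
+    }
+
+Only the letter is reported as C<Latin>; the digit and the
+punctuation mark are reported as C<Common>.
+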
For more about scripts see the UTR #24:
http://www.unicode.org/unicode/reports/tr24/
version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
Notice that this definition was introduced in Perl 5.8.0: in Perl
-5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the
-preferential character class definition; this meant that the
-definitions of some character classes changed (the ones in the
+5.6 only the blocks were used; in Perl 5.8.0 scripts became the
+preferential Unicode character class definition; this meant that
+the definitions of some character classes changed (the ones in the
below list that have the C<Block> appended).
- BasicLatin
- Latin1Supplement
- LatinExtendedA
- LatinExtendedB
- IPAExtensions
- SpacingModifierLetters
- CombiningDiacriticalMarks
- GreekBlock
- CyrillicBlock
- ArmenianBlock
- HebrewBlock
- ArabicBlock
- SyriacBlock
- ThaanaBlock
- DevanagariBlock
- BengaliBlock
- GurmukhiBlock
- GujaratiBlock
- OriyaBlock
- TamilBlock
- TeluguBlock
- KannadaBlock
- MalayalamBlock
- SinhalaBlock
- ThaiBlock
- LaoBlock
- TibetanBlock
- MyanmarBlock
- GeorgianBlock
- HangulJamo
- EthiopicBlock
- CherokeeBlock
- UnifiedCanadianAboriginalSyllabics
- OghamBlock
- RunicBlock
- KhmerBlock
- MongolianBlock
- LatinExtendedAdditional
- GreekExtended
- GeneralPunctuation
- SuperscriptsandSubscripts
- CurrencySymbols
- CombiningMarksforSymbols
- LetterlikeSymbols
- NumberForms
+ Alphabetic Presentation Forms
+ Arabic Block
+ Arabic Presentation Forms-A
+ Arabic Presentation Forms-B
+ Armenian Block
Arrows
- MathematicalOperators
- MiscellaneousTechnical
- ControlPictures
- OpticalCharacterRecognition
- EnclosedAlphanumerics
- BoxDrawing
- BlockElements
- GeometricShapes
- MiscellaneousSymbols
+ Basic Latin
+ Bengali Block
+ Block Elements
+ Bopomofo Block
+ Bopomofo Extended
+ Box Drawing
+ Braille Patterns
+ Byzantine Musical Symbols
+ CJK Compatibility
+ CJK Compatibility Forms
+ CJK Compatibility Ideographs
+ CJK Compatibility Ideographs Supplement
+ CJK Radicals Supplement
+ CJK Symbols and Punctuation
+ CJK Unified Ideographs
+ CJK Unified Ideographs Extension A
+ CJK Unified Ideographs Extension B
+ Cherokee Block
+ Combining Diacritical Marks
+ Combining Half Marks
+ Combining Marks for Symbols
+ Control Pictures
+ Currency Symbols
+ Cyrillic Block
+ Deseret Block
+ Devanagari Block
Dingbats
- BraillePatterns
- CJKRadicalsSupplement
- KangxiRadicals
- IdeographicDescriptionCharacters
- CJKSymbolsandPunctuation
- HiraganaBlock
- KatakanaBlock
- BopomofoBlock
- HangulCompatibilityJamo
+ Enclosed Alphanumerics
+ Enclosed CJK Letters and Months
+ Ethiopic Block
+ General Punctuation
+ Geometric Shapes
+ Georgian Block
+ Gothic Block
+ Greek Block
+ Greek Extended
+ Gujarati Block
+ Gurmukhi Block
+ Halfwidth and Fullwidth Forms
+ Hangul Compatibility Jamo
+ Hangul Jamo
+ Hangul Syllables
+ Hebrew Block
+ High Private Use Surrogates
+ High Surrogates
+ Hiragana Block
+ IPA Extensions
+ Ideographic Description Characters
Kanbun
- BopomofoExtended
- EnclosedCJKLettersandMonths
- CJKCompatibility
- CJKUnifiedIdeographsExtensionA
- CJKUnifiedIdeographs
- YiSyllables
- YiRadicals
- HangulSyllables
- HighSurrogates
- HighPrivateUseSurrogates
- LowSurrogates
- PrivateUse
- CJKCompatibilityIdeographs
- AlphabeticPresentationForms
- ArabicPresentationFormsA
- CombiningHalfMarks
- CJKCompatibilityForms
- SmallFormVariants
- ArabicPresentationFormsB
+ Kangxi Radicals
+ Kannada Block
+ Katakana Block
+ Khmer Block
+ Lao Block
+ Latin 1 Supplement
+ Latin Extended Additional
+ Latin Extended-A
+ Latin Extended-B
+ Letterlike Symbols
+ Low Surrogates
+ Malayalam Block
+ Mathematical Alphanumeric Symbols
+ Mathematical Operators
+ Miscellaneous Symbols
+ Miscellaneous Technical
+ Mongolian Block
+ Musical Symbols
+ Myanmar Block
+ Number Forms
+ Ogham Block
+ Old Italic Block
+ Optical Character Recognition
+ Oriya Block
+ Private Use
+ Runic Block
+ Sinhala Block
+ Small Form Variants
+ Spacing Modifier Letters
Specials
- HalfwidthandFullwidthForms
- OldItalicBlock
- GothicBlock
- DeseretBlock
- ByzantineMusicalSymbols
- MusicalSymbols
- MathematicalAlphanumericSymbols
- CJKUnifiedIdeographsExtensionB
- CJKCompatibilityIdeographsSupplement
+ Superscripts and Subscripts
+ Syriac Block
Tags
+ Tamil Block
+ Telugu Block
+ Thaana Block
+ Thai Block
+ Tibetan Block
+ Unified Canadian Aboriginal Syllabics
+ Yi Radicals
+ Yi Syllables
=item *
=item *
Case translation operators use the Unicode case translation tables
-when provided character input. Note that C<uc()> translates to
-uppercase, while C<ucfirst> translates to titlecase (for languages
-that make the distinction). Naturally the corresponding backslash
-sequences have the same semantics.
+when provided character input.  Note that C<uc()> (also known as C<\U>
+in double-quoted strings) translates to uppercase, while C<ucfirst()>
+(also known as C<\u> in double-quoted strings) translates to titlecase
+(for languages that make the distinction).
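+
+As a small sketch of the uppercase/titlecase distinction (assuming
+Perl 5.8 or later; the digraph U+01C6 is chosen only because its
+uppercase and titlecase forms differ):
+
+    use strict;
+    use warnings;
+
+    my $dz = "\x{1C6}";     # LATIN SMALL LETTER DZ WITH CARON
+
+    # Prints "uc: U+01C4  ucfirst: U+01C5"
+    printf "uc: U+%04X  ucfirst: U+%04X\n",
+        ord(uc $dz), ord(ucfirst $dz);
+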
=item *
=item *
-lc(), uc(), lcfirst(), and ucfirst() work for simple cases
-where the mapping goes from a single Unicode character to
-another single Unicode character. More complex cases, where
-for example one character maps into several, are not yet implemented.
+lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
+
+=over 8
+
+=item *
+
+the case mapping is from a single Unicode character to another
+single Unicode character
+
+=item *
+
+the case mapping is from a single Unicode character to more
+than one Unicode character
+
+=back
+
+What doesn't yet work are the following cases:
+
+=over 8
+
+=item *
+
+the "final sigma" (Greek)
+
+=item *
+
+anything to do with locales (Lithuanian, Turkish, Azeri)
+
+=back
+
+See the Unicode Technical Report #21, Case Mappings, for more details.
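+
+As a small sketch of the one-to-many case listed above (assuming a
+Perl that includes this change; the ligature U+FB00 is just one
+convenient example of a character whose uppercase form is longer):
+
+    use strict;
+    use warnings;
+
+    my $ff    = "\x{FB00}";     # LATIN SMALL LIGATURE FF
+    my $upper = uc $ff;         # expands to the two letters "FF"
+
+    printf "%d -> %d characters\n", length($ff), length($upper);
+    print "uppercased to 'FF'\n" if $upper eq "FF";
+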
=item *