B<NOTE: this should be the only place where an explicit C<use utf8> is
needed>.
+You can also use the C<encoding> pragma to change the default encoding
+of the data in your script; see L<encoding>.
+
=back
=head2 Byte and Character semantics
apply; otherwise, byte semantics are in effect. To force byte semantics
on Unicode data, the C<bytes> pragma should be used.
-Notice that if you have a string with byte semantics and you then
-add character data into it, the bytes will be upgraded I<as if they
-were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a translation
-to ISO 8859-1).
+Notice that if you concatenate strings with byte semantics and strings
+with Unicode character data, the bytes will by default be upgraded
+I<as if they were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a
+translation to ISO 8859-1). To change this, use the C<encoding>
+pragma; see L<encoding>.
Under character semantics, many operations that formerly operated on
bytes change to operating on characters. For ASCII data this makes no
The C<\p{Is...}> test for "general properties" such as "letter",
"digit", while the C<\p{In...}> test for Unicode scripts and blocks.
-The official Unicode script and block names have spaces and
-dashes and separators, but for convenience you can have
-dashes, spaces, and underbars at every word division, and
-you need not care about correct casing. It is recommended,
-however, that for consistency you use the following naming:
-the official Unicode script or block name (see below for
-the additional rules that apply to block names), with the whitespace
-and dashes removed, and the words "uppercase-first-lowercase-otherwise".
-That is, "Latin-1 Supplement" becomes "Latin1Supplement".
+The official Unicode script and block names have spaces and dashes and
+separators, but for convenience you can have dashes, spaces, and
+underbars at every word division, and you need not care about correct
+casing. It is recommended, however, that for consistency you use the
+following naming: the official Unicode script, block, or property name
+(see below for the additional rules that apply to block names),
+with whitespace and dashes replaced by underbars, and each word
+written "uppercase-first-lowercase-rest". That is, "Latin-1 Supplement"
+becomes "Latin_1_Supplement".
You can also negate both C<\p{}> and C<\P{}> by introducing a caret
-(^) between the first curly and the property name: C<\p{^InTamil}> is
-equal to C<\P{InTamil}>.
+(^) between the first curly and the property name: C<\p{^In_Tamil}> is
+equal to C<\P{In_Tamil}>.
The C<In> and C<Is> can be left out: C<\p{Greek}> is equal to
-C<\p{InGreek}>, C<\P{Pd}> is equal to C<\P{Pd}>.
+C<\p{In_Greek}>, C<\P{Pd}> is equal to C<\P{IsPd}>.
Short Long
L Letter
- Lu Uppercase Letter
- Ll Lowercase Letter
- Lt Titlecase Letter
- Lm Modifier Letter
- Lo Other Letter
+ Lu Uppercase_Letter
+ Ll Lowercase_Letter
+ Lt Titlecase_Letter
+ Lm Modifier_Letter
+ Lo Other_Letter
M Mark
- Mn Non-Spacing Mark
- Mc Spacing Combining Mark
- Me Enclosing Mark
+ Mn Nonspacing_Mark
+ Mc Spacing_Mark
+ Me Enclosing_Mark
N Number
- Nd Decimal Digit Number
- Nl Letter Number
- No Other Number
+ Nd Decimal_Number
+ Nl Letter_Number
+ No Other_Number
P Punctuation
- Pc Connector Punctuation
- Pd Dash Punctuation
- Ps Open Punctuation
- Pe Close Punctuation
- Pi Initial Punctuation
+ Pc Connector_Punctuation
+ Pd Dash_Punctuation
+ Ps Open_Punctuation
+ Pe Close_Punctuation
+ Pi Initial_Punctuation
(may behave like Ps or Pe depending on usage)
- Pf Final Punctuation
+ Pf Final_Punctuation
(may behave like Ps or Pe depending on usage)
- Po Other Punctuation
+ Po Other_Punctuation
S Symbol
- Sm Math Symbol
- Sc Currency Symbol
- Sk Modifier Symbol
- So Other Symbol
+ Sm Math_Symbol
+ Sc Currency_Symbol
+ Sk Modifier_Symbol
+ So Other_Symbol
Z Separator
- Zs Space Separator
- Zl Line Separator
- Zp Paragraph Separator
+ Zs Space_Separator
+ Zl Line_Separator
+ Zp Paragraph_Separator
C Other
- Cc (Other) Control
- Cf (Other) Format
- Cs (Other) Surrogate
- Co (Other) Private Use
- Cn (Other) Not Assigned
+ Cc Control
+ Cf Format
+ Cs Surrogate
+ Co Private_Use
+ Cn Unassigned
There's also C<L&> which is an alias for C<Ll>, C<Lu>, and C<Lt>.
The following reserved ranges have C<In> tests:
- CJK Ideograph Extension A
- CJK Ideograph
- Hangul Syllable
- Non Private Use High Surrogate
- Private Use High Surrogate
- Low Surrogate
- Private Surrogate
- CJK Ideograph Extension B
- Plane 15 Private Use
- Plane 16 Private Use
+ CJK_Ideograph_Extension_A
+ CJK_Ideograph
+ Hangul_Syllable
+ Non_Private_Use_High_Surrogate
+ Private_Use_High_Surrogate
+ Low_Surrogate
+ Private_Surrogate
+ CJK_Ideograph_Extension_B
+ Plane_15_Private_Use
+ Plane_16_Private_Use
For example C<"\x{AC00}" =~ \p{HangulSyllable}> will test true.
-(Handling of surrogates is not implemented yet.)
+(Handling of surrogates is not implemented yet, because Perl
+uses UTF-8 and not UTF-16 internally to represent Unicode.)
Additionally, because scripts differ in their directionality
(for example Hebrew is written right to left), all characters
The scripts available for C<\p{In...}> and C<\P{In...}>, for example
\p{InCyrillic>, are as follows, for example C<\p{InLatin}> or C<\P{InHan}>:
- Latin
- Greek
- Cyrillic
- Armenian
- Hebrew
Arabic
- Syriac
- Thaana
- Devanagari
+ Armenian
Bengali
- Gurmukhi
+ Bopomofo
+ Canadian-Aboriginal
+ Cherokee
+ Cyrillic
+ Deseret
+ Devanagari
+ Ethiopic
+ Georgian
+ Gothic
+ Greek
Gujarati
- Oriya
- Tamil
- Telugu
+ Gurmukhi
+ Han
+ Hangul
+ Hebrew
+ Hiragana
+ Inherited
Kannada
- Malayalam
- Sinhala
- Thai
+ Katakana
+ Khmer
Lao
- Tibetan
+ Latin
+ Malayalam
+ Mongolian
Myanmar
- Georgian
- Hangul
- Ethiopic
- Cherokee
- Canadian Aboriginal
Ogham
+ Old-Italic
+ Oriya
Runic
- Khmer
- Mongolian
- Hiragana
- Katakana
- Bopomofo
- Han
+ Sinhala
+ Syriac
+ Tamil
+ Telugu
+ Thaana
+ Thai
+ Tibetan
Yi
- Old Italic
- Gothic
- Deseret
- Inherited
There are also extended property classes that supplement the basic
properties, defined by the F<PropList> Unicode database:
- White_space
+ ASCII_Hex_Digit
Bidi_Control
- Join_Control
Dash
- Hyphen
- Quotation_Mark
- Other_Math
- Hex_Digit
- ASCII_Hex_Digit
- Other_Alphabetic
- Ideographic
Diacritic
Extender
+ Hex_Digit
+ Hyphen
+ Ideographic
+ Join_Control
+ Noncharacter_Code_Point
+ Other_Alphabetic
Other_Lowercase
+ Other_Math
Other_Uppercase
- Noncharacter_Code_Point
+ Quotation_Mark
+ White_Space
and further derived properties:
Any Any character
Assigned Any non-Cn character
Common Any character (or unassigned code point)
- not explicitly assigned to a script.
+ not explicitly assigned to a script
=head2 Blocks
In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
-former concept is closer to natural languages, while the latter
+scripts concept is closer to natural languages, while the blocks
concept is more an artificial grouping based on groups of 256 Unicode
characters. For example, the C<Latin> script contains letters from
-many blocks, but it does not contain all the characters from those
-blocks, it does not for example contain digits.
+many blocks. On the other hand, the C<Latin> script does not contain
+all the characters from those blocks; for example, it does not contain
+digits, because digits are shared across many scripts. Digits and
+other similar groups, like punctuation, are in a category called
+C<Common>.
For more about scripts see the UTR #24:
http://www.unicode.org/unicode/reports/tr24/
version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
Notice that this definition was introduced in Perl 5.8.0: in Perl
-5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the
+5.6 only the blocks were used; in Perl 5.8.0 scripts became the
preferential Unicode character class definition; this meant that
the definitions of some character classes changed (the ones in the
below list that have the C<Block> appended).
+ Alphabetic Presentation Forms
+ Arabic Block
+ Arabic Presentation Forms-A
+ Arabic Presentation Forms-B
+ Armenian Block
+ Arrows
Basic Latin
- Latin 1 Supplement
- Latin Extended-A
- Latin Extended-B
- IPA Extensions
- Spacing Modifier Letters
+ Bengali Block
+ Block Elements
+ Bopomofo Block
+ Bopomofo Extended
+ Box Drawing
+ Braille Patterns
+ Byzantine Musical Symbols
+ CJK Compatibility
+ CJK Compatibility Forms
+ CJK Compatibility Ideographs
+ CJK Compatibility Ideographs Supplement
+ CJK Radicals Supplement
+ CJK Symbols and Punctuation
+ CJK Unified Ideographs
+ CJK Unified Ideographs Extension A
+ CJK Unified Ideographs Extension B
+ Cherokee Block
Combining Diacritical Marks
- Greek Block
+ Combining Half Marks
+ Combining Marks for Symbols
+ Control Pictures
+ Currency Symbols
Cyrillic Block
- Armenian Block
- Hebrew Block
- Arabic Block
- Syriac Block
- Thaana Block
+ Deseret Block
Devanagari Block
- Bengali Block
- Gurmukhi Block
- Gujarati Block
- Oriya Block
- Tamil Block
- Telugu Block
- Kannada Block
- Malayalam Block
- Sinhala Block
- Thai Block
- Lao Block
- Tibetan Block
- Myanmar Block
+ Dingbats
+ Enclosed Alphanumerics
+ Enclosed CJK Letters and Months
+ Ethiopic Block
+ General Punctuation
+ Geometric Shapes
Georgian Block
+ Gothic Block
+ Greek Block
+ Greek Extended
+ Gujarati Block
+ Gurmukhi Block
+ Halfwidth and Fullwidth Forms
+ Hangul Compatibility Jamo
Hangul Jamo
- Ethiopic Block
- Cherokee Block
- Unified Canadian Aboriginal Syllabics
- Ogham Block
- Runic Block
+ Hangul Syllables
+ Hebrew Block
+ High Private Use Surrogates
+ High Surrogates
+ Hiragana Block
+ IPA Extensions
+ Ideographic Description Characters
+ Kanbun
+ Kangxi Radicals
+ Kannada Block
+ Katakana Block
Khmer Block
- Mongolian Block
+ Lao Block
+ Latin 1 Supplement
Latin Extended Additional
- Greek Extended
- General Punctuation
- Superscripts and Subscripts
- Currency Symbols
- Combining Marks for Symbols
+ Latin Extended-A
+ Latin Extended-B
Letterlike Symbols
- Number Forms
- Arrows
+ Low Surrogates
+ Malayalam Block
+ Mathematical Alphanumeric Symbols
Mathematical Operators
+ Miscellaneous Symbols
Miscellaneous Technical
- Control Pictures
+ Mongolian Block
+ Musical Symbols
+ Myanmar Block
+ Number Forms
+ Ogham Block
+ Old Italic Block
Optical Character Recognition
- Enclosed Alphanumerics
- Box Drawing
- Block Elements
- Geometric Shapes
- Miscellaneous Symbols
- Dingbats
- Braille Patterns
- CJK Radicals Supplement
- Kangxi Radicals
- Ideographic Description Characters
- CJK Symbols and Punctuation
- Hiragana Block
- Katakana Block
- Bopomofo Block
- Hangul Compatibility Jamo
- Kanbun
- Bopomofo Extended
- Enclosed CJK Letters and Months
- CJK Compatibility
- CJK Unified Ideographs Extension A
- CJK Unified Ideographs
- Yi Syllables
- Yi Radicals
- Hangul Syllables
- High Surrogates
- High Private Use Surrogates
- Low Surrogates
+ Oriya Block
Private Use
- CJK Compatibility Ideographs
- Alphabetic Presentation Forms
- Arabic Presentation Forms-A
- Combining Half Marks
- CJK Compatibility Forms
+ Runic Block
+ Sinhala Block
Small Form Variants
- Arabic Presentation Forms-B
+ Spacing Modifier Letters
Specials
- Halfwidth and Fullwidth Forms
- Old Italic Block
- Gothic Block
- Deseret Block
- Byzantine Musical Symbols
- Musical Symbols
- Mathematical Alphanumeric Symbols
- CJK Unified Ideographs Extension B
- CJK Compatibility Ideographs Supplement
+ Superscripts and Subscripts
+ Syriac Block
Tags
+ Tamil Block
+ Telugu Block
+ Thaana Block
+ Thai Block
+ Tibetan Block
+ Unified Canadian Aboriginal Syllabics
+ Yi Radicals
+ Yi Syllables
=item *
=item *
Case translation operators use the Unicode case translation tables
-when provided character input. Note that C<uc()> translates to
-uppercase, while C<ucfirst> translates to titlecase (for languages
-that make the distinction). Naturally the corresponding backslash
-sequences have the same semantics.
+when provided character input. Note that C<uc()> (also known as C<\U>
+in double-quoted strings) translates to uppercase, while C<ucfirst()>
+(also known as C<\u> in double-quoted strings) translates to titlecase
+(for languages that make the distinction). Naturally the
+corresponding backslash sequences have the same semantics.
=item *
=item *
-lc(), uc(), lcfirst(), and ucfirst() work only for some of the
-simplest cases, where the mapping goes from a single Unicode character
-to another single Unicode character, and where the mapping does not
-depend on surrounding characters, or on locales. More complex cases,
-where for example one character maps into several, are not yet
-implemented. See the Unicode Technical Report #21, Case Mappings,
-for more details. The Unicode::UCD module (part of Perl since 5.8.0)
-casespec() and casefold() interfaces supply information about the more
-complex cases.
+lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
+
+=over 8
+
+=item *
+
+the case mapping is from a single Unicode character to another
+single Unicode character
+
+=item *
+
+the case mapping is from a single Unicode character to more
+than one Unicode character
+
+=back
+
+What doesn't yet work are the following cases:
+
+=over 8
+
+=item *
+
+the "final sigma" (Greek)
+
+=item *
+
+anything to do with locales (Lithuanian, Turkish, Azeri)
+
+=back
+
+See the Unicode Technical Report #21, Case Mappings, for more details.
=item *