B<NOTE: this should be the only place where an explicit C<use utf8> is
needed>.
+You can also use the C<encoding> pragma to change the default encoding
+of the data in your script; see L<encoding>.
+
=back
=head2 Byte and Character semantics
apply; otherwise, byte semantics are in effect. To force byte semantics
on Unicode data, the C<bytes> pragma should be used.
-Notice that if you have a string with byte semantics and you then
-add character data into it, the bytes will be upgraded I<as if they
-were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a translation
-to ISO 8859-1).
+Notice that if you concatenate strings with byte semantics and strings
+with Unicode character data, the bytes will by default be upgraded
+I<as if they were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a
+translation to ISO 8859-1). To change this, use the C<encoding>
+pragma; see L<encoding>.
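
The default upgrade can be observed with a short script. This is only a sketch: it assumes an ASCII-based platform and Perl 5.8.1 or later (for C<utf8::is_utf8>), and the variable names are made up for the example. Concatenating a byte string that contains 0xE9 with a character above 0xFF gives a five-character string in which the byte has become U+00E9:

```perl
use strict;
use warnings;

# A string holding only code points below 0x100 (byte semantics by default).
my $bytes = "caf\xE9";

# A string holding a character above 0xFF, so it carries character data.
my $chars = "\x{263A}";

# Concatenation upgrades $bytes as if it were ISO 8859-1 (Latin-1).
my $joined = $bytes . $chars;

printf "length: %d\n", length($joined);
printf "fourth character: U+%04X\n", ord(substr($joined, 3, 1));
printf "character data: %s\n", utf8::is_utf8($joined) ? "yes" : "no";
```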
Under character semantics, many operations that formerly operated on
bytes change to operating on characters. For ASCII data this makes no
The C<\p{Is...}> test for "general properties" such as "letter",
"digit", while the C<\p{In...}> test for Unicode scripts and blocks.
-The official Unicode script and block names have spaces and
-dashes and separators, but for convenience you can have
-dashes, spaces, and underbars at every word division, and
-you need not care about correct casing. It is recommended,
-however, that for consistency you use the following naming:
-the official Unicode script or block name (see below for
-the additional rules that apply to block names), with the whitespace
-and dashes removed, and the words "uppercase-first-lowercase-otherwise".
-That is, "Latin-1 Supplement" becomes "Latin1Supplement".
+The official Unicode script and block names have spaces and dashes as
+separators, but for convenience you can have dashes, spaces, and
+underbars at every word division, and you need not care about correct
+casing. It is recommended, however, that for consistency you use the
+following naming: the official Unicode script, block, or property name
+(see below for the additional rules that apply to block names),
+with whitespace and dashes replaced with underbars, and the words
+"uppercase-first-lowercase-rest". That is, "Latin-1 Supplement"
+becomes "Latin_1_Supplement".
You can also negate both C<\p{}> and C<\P{}> by introducing a caret
-(^) between the first curly and the property name: C<\p{^InTamil}> is
-equal to C<\P{InTamil}>.
+(^) between the first curly and the property name: C<\p{^In_Tamil}> is
+equal to C<\P{In_Tamil}>.
The C<In> and C<Is> can be left out: C<\p{Greek}> is equal to
-C<\p{InGreek}>, C<\P{Pd}> is equal to C<\P{Pd}>.
+C<\p{In_Greek}>, C<\P{Pd}> is equal to C<\P{IsPd}>.
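
The negation forms and the prefix-free names can be tried out as follows. This is a minimal sketch, assuming a Perl whose Unicode tables know the C<Greek> script and the C<Pd> general category; the variable names are illustrative only:

```perl
use strict;
use warnings;

my $alpha   = "\x{3B1}";   # GREEK SMALL LETTER ALPHA
my $latin_a = "A";         # LATIN CAPITAL LETTER A
my $endash  = "\x{2013}";  # EN DASH, general category Pd

# \p{...} matches a character that has the property.
my $is_greek = $alpha =~ /\p{Greek}/ ? 1 : 0;

# \P{...} and \p{^...} both match a character that lacks the property.
my $not_greek_upper = $latin_a =~ /\P{Greek}/  ? 1 : 0;
my $not_greek_caret = $latin_a =~ /\p{^Greek}/ ? 1 : 0;

# The short general category name works without the Is prefix.
my $is_dash  = $endash  =~ /\p{Pd}/ ? 1 : 0;
my $not_dash = $latin_a =~ /\P{Pd}/ ? 1 : 0;

print "alpha is Greek: $is_greek\n";
print "A is not Greek: $not_greek_upper (\\P), $not_greek_caret (\\p{^})\n";
print "en dash is Pd: $is_dash; A is not Pd: $not_dash\n";
```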
Short Long
L Letter
- Lu Uppercase Letter
- Ll Lowercase Letter
- Lt Titlecase Letter
- Lm Modifier Letter
- Lo Other Letter
+ Lu Uppercase_Letter
+ Ll Lowercase_Letter
+ Lt Titlecase_Letter
+ Lm Modifier_Letter
+ Lo Other_Letter
M Mark
- Mn Non-Spacing Mark
- Mc Spacing Combining Mark
- Me Enclosing Mark
+ Mn Nonspacing_Mark
+ Mc Spacing_Mark
+ Me Enclosing_Mark
N Number
- Nd Decimal Digit Number
- Nl Letter Number
- No Other Number
+ Nd Decimal_Number
+ Nl Letter_Number
+ No Other_Number
P Punctuation
- Pc Connector Punctuation
- Pd Dash Punctuation
- Ps Open Punctuation
- Pe Close Punctuation
- Pi Initial Punctuation
+ Pc Connector_Punctuation
+ Pd Dash_Punctuation
+ Ps Open_Punctuation
+ Pe Close_Punctuation
+ Pi Initial_Punctuation
(may behave like Ps or Pe depending on usage)
- Pf Final Punctuation
+ Pf Final_Punctuation
(may behave like Ps or Pe depending on usage)
- Po Other Punctuation
+ Po Other_Punctuation
S Symbol
- Sm Math Symbol
- Sc Currency Symbol
- Sk Modifier Symbol
- So Other Symbol
+ Sm Math_Symbol
+ Sc Currency_Symbol
+ Sk Modifier_Symbol
+ So Other_Symbol
Z Separator
- Zs Space Separator
- Zl Line Separator
- Zp Paragraph Separator
+ Zs Space_Separator
+ Zl Line_Separator
+ Zp Paragraph_Separator
C Other
- Cc (Other) Control
- Cf (Other) Format
- Cs (Other) Surrogate
- Co (Other) Private Use
- Cn (Other) Not Assigned
+ Cc Control
+ Cf Format
+ Cs Surrogate
+ Co Private_Use
+ Cn Unassigned
There's also C<L&> which is an alias for C<Ll>, C<Lu>, and C<Lt>.
The following reserved ranges have C<In> tests:
- CJK Ideograph Extension A
- CJK Ideograph
- Hangul Syllable
- Non Private Use High Surrogate
- Private Use High Surrogate
- Low Surrogate
- Private Surrogate
- CJK Ideograph Extension B
- Plane 15 Private Use
- Plane 16 Private Use
+ CJK_Ideograph_Extension_A
+ CJK_Ideograph
+ Hangul_Syllable
+ Non_Private_Use_High_Surrogate
+ Private_Use_High_Surrogate
+ Low_Surrogate
+ Private_Surrogate
+ CJK_Ideograph_Extension_B
+ Plane_15_Private_Use
+ Plane_16_Private_Use
For example C<"\x{AC00}" =~ \p{HangulSyllable}> will test true.
(Handling of surrogates is not implemented yet, because Perl
Other_Math
Other_Uppercase
Quotation_Mark
- White_space
+ White_Space
and further derived properties:
Any Any character
Assigned Any non-Cn character
Common Any character (or unassigned code point)
- not explicitly assigned to a script.
+ not explicitly assigned to a script
=head2 Blocks
version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
Notice that this definition was introduced in Perl 5.8.0: in Perl
-5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the
+5.6 only the blocks were used; in Perl 5.8.0 scripts became the
preferential Unicode character class definition; this meant that
the definitions of some character classes changed (the ones in the
below list that have the C<Block> appended).
=item *
Case translation operators use the Unicode case translation tables
-when provided character input. Note that C<uc()> translates to
-uppercase, while C<ucfirst> translates to titlecase (for languages
-that make the distinction). Naturally the corresponding backslash
-sequences have the same semantics.
+when provided character input. Note that C<uc()> (also known as C<\U>
+in double-quoted strings) translates to uppercase, while C<ucfirst>
+(also known as C<\u> in double-quoted strings) translates to titlecase
+(for languages that make the distinction). Naturally the
+corresponding backslash sequences have the same semantics.
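
The uppercase/titlecase distinction is visible with the Latin digraph U+01C6, whose uppercase and titlecase forms are different characters. The following is a sketch only; the choice of character and the variable names are assumptions made for the example:

```perl
use strict;
use warnings;

my $dz = "\x{01C6}";           # LATIN SMALL LETTER DZ WITH CARON

my $upper = uc($dz);           # uppercase form, U+01C4
my $title = ucfirst($dz);      # titlecase form, U+01C5

# The backslash sequences in double-quoted strings behave the same way.
my $upper_bs = "\U$dz";
my $title_bs = "\u$dz";

printf "uc:      U+%04X\n", ord($upper);
printf "ucfirst: U+%04X\n", ord($title);
printf "\\U agrees with uc: %s\n",      $upper_bs eq $upper ? "yes" : "no";
printf "\\u agrees with ucfirst: %s\n", $title_bs eq $title ? "yes" : "no";
```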
=item *
=item *
-lc(), uc(), lcfirst(), and ucfirst() work only for some of the
-simplest cases, where the mapping goes from a single Unicode character
-to another single Unicode character, and where the mapping does not
-depend on surrounding characters, or on locales. More complex cases,
-where for example one character maps into several, are not yet
-implemented. See the Unicode Technical Report #21, Case Mappings,
-for more details. The Unicode::UCD module (part of Perl since 5.8.0)
-casespec() and casefold() interfaces supply information about the more
-complex cases.
+lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
+
+=over 8
+
+=item *
+
+the case mapping is from a single Unicode character to another
+single Unicode character
+
+=item *
+
+the case mapping is from a single Unicode character to more
+than one Unicode character
+
+=back
+
+The following cases do not yet work:
+
+=over 8
+
+=item *
+
+the "final sigma" (Greek)
+
+=item *
+
+anything to do with locales (Lithuanian, Turkish, Azeri)
+
+=back
+
+See the Unicode Technical Report #21, Case Mappings, for more details.
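
A one-to-many mapping can be seen with a ligature. This sketch assumes a Perl that implements the multi-character case mappings described above; the ligature U+FB00 is picked purely for illustration:

```perl
use strict;
use warnings;

my $ligature = "\x{FB00}";     # LATIN SMALL LIGATURE FF, a single character

# Uppercasing maps the one character to the two characters "FF".
my $upper = uc($ligature);

printf "before: %d character(s)\n", length($ligature);
printf "after:  %d character(s), '%s'\n", length($upper), $upper;
```

Because the result can be longer than the input, code that assumes C<length(uc($s)) == length($s)> is not safe for arbitrary Unicode data.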
=item *