=head1 NAME
perlunicode - Unicode support in Perl
=head1 DESCRIPTION
=head2 Important Caveats
Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.
People who want to learn to use Unicode in Perl should probably read
L<perlunitut>, the Perl Unicode tutorial, before reading
this reference document.
=over 4
=item Input and Output Layers
Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.
To indicate that Perl source itself is in UTF-8, use the C<use utf8> pragma.
The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.
=back
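The effect of an encoding layer can be sketched as follows. This is a minimal illustration, not a recommended I/O recipe: the temporary file and the sample text are assumptions made for the example, and the non-ASCII characters are written as escapes so that the example itself stays plain ASCII and does not depend on C<use utf8>.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# 12 characters; three of them are outside ASCII.
my $text = "na\x{EF}ve caf\x{E9} \x{263A}";

# Output: the layer turns characters into UTF-8 bytes.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)') or die "binmode: $!";
print {$out} $text;
close($out) or die "close: $!";

# Input with the layer: bytes are decoded back into characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "open: $!";
my $chars = do { local $/; <$in> };
close($in);

# Input without a decoding layer: the raw bytes as stored on disk.
open(my $raw, '<:raw', $path) or die "open: $!";
my $bytes = do { local $/; <$raw> };
close($raw);

printf "characters: %d, bytes: %d\n", length($chars), length($bytes);
```

Read through the layer, the data compares equal to the original string; read raw, it is longer because each non-ASCII character occupies more than one byte.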
Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
be used to force byte semantics on Unicode data, and the
C<use feature 'unicode_strings'> pragma to force Unicode semantics on
byte data (though in 5.12 it isn't fully implemented).
If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
character semantics. This can cause surprises: see L</BUGS>, below.
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden for Perl code.
See L<perluniintro> for more.
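The difference between the two semantics shows up most plainly in C<length>. The following sketch assumes the input arrives as UTF-8 octets and is decoded with the core C<Encode> module; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Five bytes: "cafe" with an acute accent on the e, encoded as UTF-8.
my $octets = "caf\xC3\xA9";

# Decoding yields four characters.
my $chars = decode('UTF-8', $octets);

my $byte_len = length $octets;    # counts bytes
my $char_len = length $chars;     # counts characters

# The bytes pragma forces byte semantics on character data,
# exposing the length of the internal encoding.
my $forced_len = do { use bytes; length $chars };

# Concatenating a byte-semantics string with character data gives a
# string with character semantics: two characters here.
my $mixed     = "\xE9" . "\x{263A}";
my $mixed_len = length $mixed;

print "$byte_len $char_len $forced_len $mixed_len\n";
```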
=head2 Effects of Character Semantics
Character semantics have the following effects:
=over 4
=item *
Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.
If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or C<use utf8>, the latter requires a BOM.)
Unicode characters can also be added to a string by using the C<\x{...}>
notation. The Unicode code for the desired character, in hexadecimal,
should be placed in the braces. For instance, a smiley face is
C<\x{263A}>. This encoding scheme works for all characters, but
for characters under 0x100, note that Perl may use an 8 bit encoding
internally, for optimization and/or backward compatibility.
Additionally, if you
    use charnames ':full';
you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.
=item *
If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.
=item *
Regular expressions match characters instead of bytes. "." matches
a character instead of a byte.
=item *
Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.
=item *
Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
See L</"Unicode Character Properties"> for more details.
You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</"User-Defined Character Properties"> for more details.
=item *
The special pattern C<\X> matches a logical character, an "extended grapheme
cluster" in Standardese. In Unicode, what appears to the user to be a single
character, for example an accented letter, may in fact be composed of a sequence
of characters, in this case a base letter followed by an accent character. C<\X>
will match the entire sequence.
=item *
The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.
=item *
Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction (which is equivalent to uppercase in languages
without the distinction).
=item *
Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
specifically does not switch is C<vec()>. Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.
=item *
The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Again, think C<char> in the C language.
There is a new C<U> specifier that converts between Unicode characters
and code points. There is also a C<W> specifier that is the equivalent of
C<chr()>/C<ord()> and properly handles character values even if they are above 255.
=item *
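The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.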
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.
=item *
The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) on strings that contain both
characters with values less than 256 and characters with values of 256
or more. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.
=item *
You can define your own mappings to be used in lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
See L</"User-Defined Case Mappings"> for more details.
=back
=over 4
=item *
And finally, C<scalar reverse()> reverses by character rather than by byte.
=back
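Several of the effects listed above can be observed in a short script. This is only a sketch: the sample strings and variable names are chosen for illustration, and the titlecase example assumes the standard Unicode case mappings for U+01C6 (a single character that looks like "dz" with a caron).

```perl
use strict;
use warnings;
use charnames ':full';

# \x{...} and \N{...} produce the same character.
my $smiley = "\x{263A}";
my $named  = "\N{WHITE SMILING FACE}";

# Length and position operators count characters, not bytes.
my $str      = "ab" . $smiley . "cd";
my $len      = length $str;
my $third    = substr($str, 2, 1);
my $pos_c    = index($str, 'c');
my $reversed = scalar reverse($str);

# "." matches one character; \X matches a whole grapheme cluster.
my $accented = "G\x{0301}x";              # G, COMBINING ACUTE ACCENT, x
my @chars    = $accented =~ /(.)/g;
my @clusters = $accented =~ /(\X)/g;

# \w consults the Unicode properties, so it matches an ideograph.
my $ideograph_is_word = ("\x{65E5}" =~ /^\w$/) ? 1 : 0;

# uc() gives uppercase, ucfirst() gives titlecase.
my $upper = uc("\x{01C6}");
my $title = ucfirst("\x{01C6}");

# pack "U" and "W" handle values above 255; ord() and chr() likewise.
my $packed_u   = pack("U", 0x263A);
my $unpacked_w = unpack("W", pack("W", 0x263A));

printf "length=%d index=%d chars=%d clusters=%d\n",
    $len, $pos_c, scalar(@chars), scalar(@clusters);
```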
=head2 Unicode Character Properties
Most Unicode character properties are accessible by using regular expressions.
They are used like character classes via the C<\p{}> "matches property"
construct and the C<\P{}> negation, "doesn't match property".
For instance, C<\p{Uppercase}> matches any character with the Unicode
"Uppercase" property, while C<\p{L}> matches any character with a
General_Category of "L" (letter) property. Braces are not
required for single letter properties, so C<\p{L}> is equivalent to C<\pL>.
More formally, C<\p{Uppercase}> matches any character whose Unicode Uppercase
property value is True, and C<\P{Uppercase}> matches any character whose
Uppercase property value is False, and they could have been written as
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
This formality is needed when properties are not binary, that is if they can
take on more values than just True and False. For example, the Bidi_Class (see
L</"Bidirectional Character Types"> below), can take on a number of different
values, such as Left, Right, Whitespace, and others. To match these, one needs
to specify the property name (Bidi_Class), and the value being matched against
(Left, Right, I<etc.>). This is done, as in the examples above, by having the
two components separated by an equal sign (or interchangeably, a colon), like
C<\p{Bidi_Class: Left}>.
All Unicode-defined character properties may be written in these compound forms
of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
additional properties that are written only in the single form, as well as
single-form short-cuts for all binary properties and certain others described
below, in which you may omit the property name and the equals or colon
separator.
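The single and compound forms can be compared directly. The sketch below uses the short Bidi_Class values C<L> and C<R> from the table in the Bidirectional Character Types section; the sample characters are illustrative choices.

```perl
use strict;
use warnings;

# Binary property: the single form and the compound forms agree.
my $single_true    = ('A' =~ /^\p{Uppercase}$/)       ? 1 : 0;
my $compound_true  = ('A' =~ /^\p{Uppercase=True}$/)  ? 1 : 0;
my $single_false   = ('a' =~ /^\P{Uppercase}$/)       ? 1 : 0;
my $compound_false = ('a' =~ /^\p{Uppercase=False}$/) ? 1 : 0;

# Single-letter properties may drop the braces.
my $braced   = ('A' =~ /^\p{L}$/) ? 1 : 0;
my $unbraced = ('A' =~ /^\pL$/)   ? 1 : 0;

# Non-binary property: name and value, separated by "=" or ":".
my $hebrew_alef = "\x{05D0}";
my $rtl_colon   = ($hebrew_alef =~ /^\p{Bidi_Class: R}$/) ? 1 : 0;
my $rtl_equals  = ($hebrew_alef =~ /^\p{Bidi_Class=R}$/)  ? 1 : 0;
my $ltr         = ('A' =~ /^\p{Bidi_Class: L}$/)          ? 1 : 0;

print join(' ', $single_true, $compound_true, $single_false,
    $compound_false, $braced, $unbraced, $rtl_colon, $rtl_equals, $ltr), "\n";
```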
Most Unicode character properties have at least two synonyms (or aliases if you
prefer), a short one that is easier to type, and a longer one that is more
descriptive and hence easier to understand. Thus the "L"
and "Letter" above are equivalent and can be used interchangeably. Likewise,
"Upper" is a synonym for "Uppercase", and we could have written
C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
various synonyms for the values the property can be. For binary properties,
"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" correspondingly has "F",
"No", and "N". But be careful. A short form of a value for one property may
not mean the same thing as the same short form for another. Thus, for the
General_Category property, "L" means "Letter", but for the Bidi_Class property,
"L" means "Left". A complete list of properties and synonyms is in
L<perluniprops>.
Upper/lower case differences in the property names and values are irrelevant,
thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
Similarly, you can add or subtract underscores anywhere in the middle of a
word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
is irrelevant adjacent to non-word characters, such as the braces and the equals
or colon separators so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
equivalent to these as well. In fact, in most cases, white space and even
hyphens can be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
equivalent. All this is called "loose-matching" by Unicode. The few places
where stricter matching is employed are in the middle of numbers, and in the Perl
extension properties that begin or end with an underscore. Stricter matching
cares about white space (except adjacent to the non-word characters) and
hyphens, and non-interior underscores.
You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.
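Loose matching and caret negation can be tried out as below. This is a hedged sketch: the spellings are copied from the text above, each pattern is compiled and matched inside C<eval> so that a spelling rejected by a particular Perl version is recorded as undefined rather than aborting the script, and the hash name C<%matches_A> is made up for the example.

```perl
use strict;
use warnings;

# Spellings of the Uppercase property taken from the text above.
my @spellings = (
    '\p{Upper}', '\p{upper}', '\p{UpPeR}', '\p{U_p_p_e_r}',
    '\p{ Upper }', '\p{ Upper_case : Y }', '\p{ Up-per case = Yes}',
);

# Record whether each spelling matches "A" (1 or 0), or undef if the
# pattern is rejected.
my %matches_A;
for my $spelling (@spellings) {
    $matches_A{$spelling} = eval { ('A' =~ /^$spelling$/) ? 1 : 0 };
}

# A caret after the opening brace negates, just like \P{...}.
my $tamil_a     = "\x{0B85}";                       # TAMIL LETTER A
my $caret_form  = ('A' =~ /^\p{^Tamil}$/)      ? 1 : 0;
my $big_p_form  = ('A' =~ /^\P{Tamil}$/)       ? 1 : 0;
my $tamil_caret = ($tamil_a =~ /^\p{^Tamil}$/) ? 1 : 0;

for my $spelling (@spellings) {
    my $result = $matches_A{$spelling};
    printf "%-26s %s\n", $spelling, defined $result ? $result : 'rejected';
}
```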
=head3 B<General_Category>
Every Unicode character is assigned a general category, which is the "most
usual categorization of a character" (from
L<http://www.unicode.org/reports/tr44>).
The compound way of writing these is like C<\p{General_Category=Number}>
(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
through the equal or colon separator is omitted. So you can instead just write
C<\pN>.
Here are the short and long forms of the General Category properties:
    Short       Long
    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter
    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark
    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number
    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation
    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol
    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator
    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate (not usable)
    Co          Private_Use
    Cn          Unassigned
Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special cases, which are aliases for the set of
C<Ll>, C<Lu>, and C<Lt>.
Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. C<Cs> is therefore not
supported.
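As an illustration of the General_Category forms, the following sketch checks a few characters against the short, long, and compound spellings. The sample characters (a digit, a Roman numeral, and a titlecase letter) are chosen for the example.

```perl
use strict;
use warnings;

# A decimal digit matches every spelling of "Number".
my $digit = '7';
my @number_forms = (
    ($digit =~ /^\pN$/)                          ? 1 : 0,
    ($digit =~ /^\p{N}$/)                        ? 1 : 0,
    ($digit =~ /^\p{Number}$/)                   ? 1 : 0,
    ($digit =~ /^\p{gc:n}$/)                     ? 1 : 0,
    ($digit =~ /^\p{General_Category=Number}$/)  ? 1 : 0,
);

# U+2167 ROMAN NUMERAL EIGHT is a Letter_Number: it is in N and Nl,
# but not in Nd.
my $roman       = "\x{2167}";
my $roman_is_n  = ($roman =~ /^\pN$/)    ? 1 : 0;
my $roman_is_nl = ($roman =~ /^\p{Nl}$/) ? 1 : 0;
my $roman_is_nd = ($roman =~ /^\p{Nd}$/) ? 1 : 0;

# U+01C5 is a titlecase letter, so it is also a Cased_Letter.
my $title       = "\x{01C5}";
my $title_is_lt = ($title =~ /^\p{Lt}$/) ? 1 : 0;
my $title_is_lc = ($title =~ /^\p{LC}$/) ? 1 : 0;
my $title_is_lu = ($title =~ /^\p{Lu}$/) ? 1 : 0;

print "@number_forms | $roman_is_n $roman_is_nl $roman_is_nd | ",
    "$title_is_lt $title_is_lc $title_is_lu\n";
```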
=head3 B<Bidirectional Character Types>
Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties in
the Bidi_Class class:
    Property    Meaning
    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals
This property is always written in the compound form.
For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left.
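A small sketch of how Bidi_Class might be used in practice follows. The helper C<has_rtl> is hypothetical, written for this example only; it simply reports whether a string contains any character of class C<R> or C<AL>.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains a strongly
# right-to-left character (Bidi_Class R or AL).
sub has_rtl {
    my ($string) = @_;
    return $string =~ /[\p{Bidi_Class:R}\p{Bidi_Class:AL}]/ ? 1 : 0;
}

my $hebrew_alef = "\x{05D0}";    # HEBREW LETTER ALEF, class R
my $arabic_alef = "\x{0627}";    # ARABIC LETTER ALEF, class AL

my $alef_is_r   = ($hebrew_alef =~ /^\p{Bidi_Class:R}$/)  ? 1 : 0;
my $alef_is_al  = ($arabic_alef =~ /^\p{Bidi_Class:AL}$/) ? 1 : 0;
my $digit_is_en = ('5' =~ /^\p{Bidi_Class:EN}$/)          ? 1 : 0;
my $space_is_ws = (' ' =~ /^\p{Bidi_Class:WS}$/)          ? 1 : 0;

my $latin_only = has_rtl("plain text 123");
my $with_rtl   = has_rtl("abc ${hebrew_alef} def");

print "$alef_is_r $alef_is_al $digit_is_en $space_is_ws ",
    "$latin_only $with_rtl\n";
```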
=head3 B<Scripts>
The world's languages are written in a number of scripts. This sentence
(unless you're reading it in translation) is written in Latin, while Russian is
written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
Hiragana or Katakana. There are many more.
The Unicode Script property gives what script a given character is in,
and can be matched with the compound form like C<\p{Script=Hebrew}> (short:
C<\p{sc=hebr}>). Perl furnishes shortcuts for all script names. You can omit
everything up through the equals (or colon), and simply write C<\p{Latin}> or
C<\P{Cyrillic}>.
A complete list of scripts and their shortcuts is in L<perluniprops>.
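The script forms described above can be exercised as follows. The sample characters are illustrative; each is a single letter from the script named in its comment.

```perl
use strict;
use warnings;

my $hebrew   = "\x{05D0}";    # HEBREW LETTER ALEF
my $cyrillic = "\x{0416}";    # CYRILLIC CAPITAL LETTER ZHE
my $hiragana = "\x{3042}";    # HIRAGANA LETTER A

# Compound form, short compound form, and the single-form shortcut.
my $long_form  = ($hebrew =~ /^\p{Script=Hebrew}$/) ? 1 : 0;
my $short_form = ($hebrew =~ /^\p{sc=hebr}$/)       ? 1 : 0;
my $shortcut   = ($hebrew =~ /^\p{Hebrew}$/)        ? 1 : 0;

my $z_is_latin       = ('Z' =~ /^\p{Latin}$/)          ? 1 : 0;
my $zhe_is_cyrillic  = ($cyrillic =~ /^\p{Cyrillic}$/) ? 1 : 0;
my $zhe_is_not_latin = ($cyrillic =~ /^\P{Latin}$/)    ? 1 : 0;
my $a_is_hiragana    = ($hiragana =~ /^\p{Hiragana}$/) ? 1 : 0;

print "$long_form $short_form $shortcut $z_is_latin ",
    "$zhe_is_cyrillic $zhe_is_not_latin $a_is_hiragana\n";
```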
=head3 B<Use of "Is" Prefix>
For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
C<\p{Arabic}>.
=head3 B<Blocks>
In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of Unicode
characters with consecutive ordinal values. For example, the "Basic Latin"
block is all characters whose ordinals are between 0 and 127, inclusive, in
other words, the ASCII characters. The "Latin" script contains some letters
from this block as well as several more, like "Latin-1 Supplement",
"Latin Extended-A", I<etc.>, but it does not contain all the characters from
those blocks. It does not, for example, contain digits, because digits are
shared across many scripts. Digits and similar groups, like punctuation, are in
the script called C<Common>. There is also a script called C<Inherited> for
characters that modify other characters, and inherit the script value of the
controlling character.
For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
L<http://www.unicode.org/reports/tr24>
The Script property is likely to be the one you want to use when processing
natural language; the Block property may be useful in working with the nuts and
bolts of Unicode.
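The distinction between blocks and scripts can be made concrete with a few characters. In this sketch the Block property is written in its compound form, C<\p{Block: ...}>; the sample characters are illustrative choices.

```perl
use strict;
use warnings;

# A digit lives in the Basic Latin block, but its script is Common.
my $digit_in_block  = ('7' =~ /^\p{Block: Basic_Latin}$/) ? 1 : 0;
my $digit_is_latin  = ('7' =~ /^\p{Script=Latin}$/)       ? 1 : 0;
my $digit_is_common = ('7' =~ /^\p{Script=Common}$/)      ? 1 : 0;

# Two characters from the Latin-1 Supplement block: one is a Latin
# letter, the other a symbol shared across scripts.
my $e_acute = "\x{E9}";    # LATIN SMALL LETTER E WITH ACUTE
my $times   = "\x{D7}";    # MULTIPLICATION SIGN
my $both_in_block =
    ("$e_acute$times" =~ /^\p{Block: Latin_1_Supplement}{2}$/) ? 1 : 0;
my $e_is_latin      = ($e_acute =~ /^\p{Script=Latin}$/) ? 1 : 0;
my $times_is_common = ($times =~ /^\p{Script=Common}$/)  ? 1 : 0;

# A combining accent takes its script from the character it modifies.
my $accent_is_inherited =
    ("\x{0301}" =~ /^\p{Script=Inherited}$/) ? 1 : 0;

print "$digit_in_block $digit_is_latin $digit_is_common ",
    "$both_in_block $e_is_latin $times_is_common $accent_is_inherited\n";
```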
Block names are matched in the compound form, like C<\p{Block: Arrows}> or
C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
Unicode-defined short name. But Perl does provide a (slight) shortcut: You
can say, for example, C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
compatibility, the C<In> prefix may be omitted if there is no naming conflict
with a script or any other property, and you can even use an C<Is> prefix
instead in those cases. But it is not a good idea to do this, for a couple
of reasons:
=over 4
=item 1
It is confusing. There are many naming conflicts, and you may forget some.
For example, C<\p{Hebrew}> means the I