implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.
+People who want to learn to use Unicode in Perl should probably read
+L<the Perl Unicode tutorial, perlunitut|perlunitut>, before reading
+this reference document.
+
=over 4
=item Input and Output Layers
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.
-To indicate that Perl source itself is using a particular encoding,
-see L<encoding>.
+To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
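As a minimal sketch of the C<:encoding(...)> layer described in this item (not taken from the Perl documentation), the following round-trips one non-ASCII character through C<:encoding(UTF-8)>. The variable names are illustrative, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# One character above 0xFF, written with \x{...} so the source stays ASCII.
my $text = "smiley: \x{263A}";

# Write through an encoding layer to an in-memory file (a stand-in
# for a real file on disk).
my $bytes = '';
open(my $out, '>:encoding(UTF-8)', \$bytes) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

# Read the same bytes back through the matching input layer.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open for read: $!";
my $round_trip = <$in>;
close($in) or die "close: $!";

printf "%d characters, %d bytes\n", length($round_trip), length($bytes);
```

The character count and the byte count differ because U+263A occupies three bytes in UTF-8 while remaining a single character to Perl.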
=item Regular Expressions
The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
-character scheme when presented with Unicode data--or instead uses
-a traditional byte scheme when presented with byte data.
+character scheme when presented with data that is internally encoded in
+UTF-8 -- or instead uses a traditional byte scheme when presented with
+byte data.
=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.
-You can also use the C<encoding> pragma to change the default encoding
-of the data in your script; see L<encoding>.
-
=item BOM-marked scripts and UTF-16 scripts autodetected
If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF16-BE,
=item C<use encoding> needed to upgrade non-Latin-1 byte strings
-By default, there is a fundamental asymmetry in Perl's unicode model:
+By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
codepoints in Unicode happens to agree with Latin-1.
-If you wish to interpret byte strings as UTF-8 instead, use the
-C<encoding> pragma:
-
- use encoding 'utf8';
-
See L</"Byte and Character Semantics"> for more details.
=back
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC. This translation is done without
-regard to the system's native 8-bit encoding. To change this for
-systems with non-Latin-1 and non-EBCDIC native encodings, use the
-C<encoding> pragma. See L<encoding>.
+regard to the system's native 8-bit encoding.
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.
-If you use a Unicode editor to edit your program, Unicode characters
-may occur directly within the literal strings in one of the various
-Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
-as such and converted to Perl's internal representation only if the
-appropriate L<encoding> is specified.
+If you use a Unicode editor to edit your program, Unicode characters may
+occur directly within the literal strings in UTF-8 or UTF-16 encoding.
+(The former requires a BOM or C<use utf8>; the latter requires a BOM.)
-Unicode characters can also be added to a string by using the
-C<\x{...}> notation. The Unicode code for the desired character, in
-hexadecimal, should be placed in the braces. For instance, a smiley
-face is C<\x{263A}>. This encoding scheme only works for characters
-with a code of 0x100 or above.
+Unicode characters can also be added to a string by using the C<\x{...}>
+notation. The Unicode code for the desired character, in hexadecimal,
+should be placed in the braces. For instance, a smiley face is
+C<\x{263A}>. This encoding scheme works for all characters, but
+for characters under 0x100, note that Perl may use an 8 bit encoding
+internally, for optimization and/or backward compatibility.
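The following sketch illustrates the C<\x{...}> notation and the point about internal storage. It assumes only the built-in C<utf8::upgrade> and C<utf8::is_utf8> functions; the variable names are illustrative. Which internal representation Perl picks for a character under 0x100 is an implementation detail, so the script prints the flags rather than relying on them.

```perl
use strict;
use warnings;

my $smiley  = "\x{263A}";   # U+263A WHITE SMILING FACE
my $e_acute = "\x{E9}";     # U+00E9, a character under 0x100

# Force the UTF-8 internal representation on a copy.
my $upgraded = $e_acute;
utf8::upgrade($upgraded);

printf "smiley: length %d, ord 0x%X\n", length($smiley), ord($smiley);
printf "same string after upgrade: %s\n",
    ($e_acute eq $upgraded ? "yes" : "no");
printf "internal UTF8 flag: original=%d upgraded=%d\n",
    (utf8::is_utf8($e_acute)  ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);
```

The two copies compare equal regardless of how they are stored, which is why the internal encoding should rarely matter to Perl-level code.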
Additionally, if you
=item *
Regular expressions match characters instead of bytes. "." matches
-a character instead of a byte. The C<\C> pattern is provided to force
-a match a single byte--a C<char> in C, hence C<\C>.
+a character instead of a byte.
=item *
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.
-(However, and as a limitation of the current implementation, using
-C<\w> or C<\W> I<inside> a C<[...]> character class will still match
-with byte semantics.)
-
=item *
Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
-See L</"Unicode Character Properties"> for more details.
+See L</"Unicode Character Properties"> for more details.
You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character. C<\X> is equivalent to
-C<(?:\PM\pM*)>.
+C<< (?>\PM\pM*) >>.
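A small sketch of what C<\X> does in practice, assuming a string made of a base letter, a combining mark, and one more letter (the variable names are illustrative):

```perl
use strict;
use warnings;

# "e" followed by U+0301 COMBINING ACUTE ACCENT, then a plain "a".
my $string = "e\x{301}a";

my @code_points = split //, $string;     # individual characters
my @graphemes   = $string =~ /(\X)/g;    # combining character sequences

printf "%d code points, %d combining character sequences\n",
    scalar(@code_points), scalar(@graphemes);
```

Here "." would match three times, whereas C<\X> matches twice because the accent is grouped with its base character.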
=item *
equal to C<\P{Tamil}>.
B<NOTE: the properties, scripts, and blocks listed here are as of
-Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0
-came out in April 2003, and Perl 5.8.1 in September 2003.>
+Unicode 5.0.0 in July 2006.>
=over 4
Arabic
Armenian
+ Balinese
Bengali
Bopomofo
+ Braille
+ Buginese
Buhid
CanadianAboriginal
Cherokee
+ Coptic
+ Cuneiform
+ Cypriot
Cyrillic
Deseret
Devanagari
Ethiopic
Georgian
+ Glagolitic
Gothic
Greek
Gujarati
Inherited
Kannada
Katakana
+ Kharoshthi
Khmer
Lao
Latin
+ Limbu
+ LinearB
Malayalam
Mongolian
Myanmar
+ NewTaiLue
+ Nko
Ogham
OldItalic
+ OldPersian
Oriya
+ Osmanya
+ PhagsPa
+ Phoenician
Runic
+ Shavian
Sinhala
+ SylotiNagri
Syriac
Tagalog
Tagbanwa
+ TaiLe
Tamil
Telugu
Thaana
Thai
Tibetan
+ Tifinagh
+ Ugaritic
Yi
=item Extended property classes
Deprecated
Diacritic
Extender
- GraphemeLink
HexDigit
Hyphen
Ideographic
OtherAlphabetic
OtherDefaultIgnorableCodePoint
OtherGraphemeExtend
+ OtherIDStart
+ OtherIDContinue
OtherLowercase
OtherMath
OtherUppercase
+ PatternSyntax
+ PatternWhiteSpace
QuotationMark
Radical
SoftDotted
+ STerm
TerminalPunctuation
UnifiedIdeograph
+ VariationSelector
WhiteSpace
and there are further derived properties:
- Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
- Lowercase Ll + OtherLowercase
- Uppercase Lu + OtherUppercase
- Math Sm + OtherMath
+ Alphabetic = Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic
+ Lowercase = Ll + OtherLowercase
+ Uppercase = Lu + OtherUppercase
+ Math = Sm + OtherMath
- ID_Start Lu + Ll + Lt + Lm + Lo + Nl
- ID_Continue ID_Start + Mn + Mc + Nd + Pc
+ IDStart = Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart
+ IDContinue = IDStart + Mn + Mc + Nd + Pc + OtherIDContinue
- Any Any character
- Assigned Any non-Cn character (i.e. synonym for \P{Cn})
- Unassigned Synonym for \p{Cn}
- Common Any character (or unassigned code point)
- not explicitly assigned to a script
+ DefaultIgnorableCodePoint
+ = OtherDefaultIgnorableCodePoint
+ + Cf + Cc + Cs + Noncharacters + VariationSelector
+ - WhiteSpace - FFF9..FFFB (Annotation Characters)
+
+ Any = Any code points (i.e. U+0000 to U+10FFFF)
+ Assigned = Any non-Cn code points (i.e. synonym for \P{Cn})
+ Unassigned = Synonym for \p{Cn}
+ ASCII = ASCII (i.e. U+0000 to U+007F)
+
+ Common = Any character (or unassigned code point)
+ not explicitly assigned to a script
=item Use of "Is" Prefix
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called C<Common>.
-For more about scripts, see the UTR #24:
+For more about scripts, see the UAX#24 "Script Names":
- http://www.unicode.org/unicode/reports/tr24/
+ http://www.unicode.org/reports/tr24/
For more about blocks, see:
These block names are supported:
+ InAegeanNumbers
InAlphabeticPresentationForms
+ InAncientGreekMusicalNotation
+ InAncientGreekNumbers
InArabic
InArabicPresentationFormsA
InArabicPresentationFormsB
+ InArabicSupplement
InArmenian
InArrows
+ InBalinese
InBasicLatin
InBengali
InBlockElements
InBopomofoExtended
InBoxDrawing
InBraillePatterns
+ InBuginese
InBuhid
InByzantineMusicalSymbols
InCJKCompatibility
InCJKCompatibilityIdeographs
InCJKCompatibilityIdeographsSupplement
InCJKRadicalsSupplement
+ InCJKStrokes
InCJKSymbolsAndPunctuation
InCJKUnifiedIdeographs
InCJKUnifiedIdeographsExtensionA
InCJKUnifiedIdeographsExtensionB
InCherokee
InCombiningDiacriticalMarks
+ InCombiningDiacriticalMarksSupplement
InCombiningDiacriticalMarksforSymbols
InCombiningHalfMarks
InControlPictures
+ InCoptic
+ InCountingRodNumerals
+ InCuneiform
+ InCuneiformNumbersAndPunctuation
InCurrencySymbols
+ InCypriotSyllabary
InCyrillic
- InCyrillicSupplementary
+ InCyrillicSupplement
InDeseret
InDevanagari
InDingbats
InEnclosedAlphanumerics
InEnclosedCJKLettersAndMonths
InEthiopic
+ InEthiopicExtended
+ InEthiopicSupplement
InGeneralPunctuation
InGeometricShapes
InGeorgian
+ InGeorgianSupplement
+ InGlagolitic
InGothic
InGreekExtended
InGreekAndCoptic
InKannada
InKatakana
InKatakanaPhoneticExtensions
+ InKharoshthi
InKhmer
+ InKhmerSymbols
InLao
InLatin1Supplement
InLatinExtendedA
InLatinExtendedAdditional
InLatinExtendedB
+ InLatinExtendedC
+ InLatinExtendedD
InLetterlikeSymbols
+ InLimbu
+ InLinearBIdeograms
+ InLinearBSyllabary
InLowSurrogates
InMalayalam
InMathematicalAlphanumericSymbols
InMiscellaneousMathematicalSymbolsA
InMiscellaneousMathematicalSymbolsB
InMiscellaneousSymbols
+ InMiscellaneousSymbolsAndArrows
InMiscellaneousTechnical
+ InModifierToneLetters
InMongolian
InMusicalSymbols
InMyanmar
+ InNKo
+ InNewTaiLue
InNumberForms
InOgham
InOldItalic
+ InOldPersian
InOpticalCharacterRecognition
InOriya
+ InOsmanya
+ InPhagspa
+ InPhoenician
+ InPhoneticExtensions
+ InPhoneticExtensionsSupplement
InPrivateUseArea
InRunic
+ InShavian
InSinhala
InSmallFormVariants
InSpacingModifierLetters
InSupplementalArrowsA
InSupplementalArrowsB
InSupplementalMathematicalOperators
+ InSupplementalPunctuation
InSupplementaryPrivateUseAreaA
InSupplementaryPrivateUseAreaB
+ InSylotiNagri
InSyriac
InTagalog
InTagbanwa
InTags
+ InTaiLe
+ InTaiXuanJingSymbols
InTamil
InTelugu
InThaana
InThai
InTibetan
+ InTifinagh
+ InUgaritic
InUnifiedCanadianAboriginalSyllabics
InVariationSelectors
+ InVariationSelectorsSupplement
+ InVerticalForms
InYiRadicals
InYiSyllables
+ InYijingHexagramSymbols
=back
=item *
+A single hexadecimal number denoting a Unicode code point to include.
+
+=item *
+
Two hexadecimal numbers separated by horizontal whitespace (space or
tabular characters) denoting a range of Unicode code points to include.
It's important to remember not to use "&" for the first set -- that
would be intersecting with nothing (resulting in an empty set).
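As a hedged sketch of the syntax described above, the two subroutines below define hypothetical properties named C<InKana> and C<InKanaAssigned>; the names and the choice of ranges are assumptions made for illustration. The first uses plain hexadecimal ranges, the second starts with "+" lines and then removes a set with "-".

```perl
use strict;
use warnings;

# Hypothetical property built from two hexadecimal ranges
# (the Hiragana and Katakana blocks), one range per line.
sub InKana {
    return "3040\t309F\n"
         . "30A0\t30FF\n";
}

# Hypothetical property built with set operations: the first lines use
# "+" to include sets, then "-" removes the unassigned code points.
sub InKanaAssigned {
    return "+utf8::InHiragana\n"
         . "+utf8::InKatakana\n"
         . "-utf8::IsCn\n";
}

my $hiragana_a = "\x{3042}";   # HIRAGANA LETTER A
my $reserved   = "\x{3040}";   # unassigned code point in the Hiragana block

for my $char ($hiragana_a, $reserved, "A") {
    printf "U+%04X InKana=%d InKanaAssigned=%d\n",
        ord($char),
        ($char =~ /\p{InKana}/         ? 1 : 0),
        ($char =~ /\p{InKanaAssigned}/ ? 1 : 0);
}
```

U+3040 is treated here as unassigned, which holds for the Unicode versions this sketch was written against; a later version could in principle assign it.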
-A final note on the user-defined property tests: they will be used
-only if the scalar has been marked as having Unicode characters.
-Old byte-style strings will not be affected.
-
=head2 User-Defined Case Mappings
You can also define your own mappings to be used in the lc(),
The following list of Unicode support for regular expressions describes
all the features currently supported. The references to "Level N"
-and the section numbers refer to the Unicode Technical Report 18,
-"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
-Perl 5.8.0).
+and the section numbers refer to the Unicode Technical Standard #18,
+"Unicode Regular Expressions", version 11, in May 2005.
=over 4
Level 1 - Basic Unicode Support
- 2.1 Hex Notation - done [1]
- Named Notation - done [2]
- 2.2 Categories - done [3][4]
- 2.3 Subtraction - MISSING [5][6]
- 2.4 Simple Word Boundaries - done [7]
- 2.5 Simple Loose Matches - done [8]
- 2.6 End of Line - MISSING [9][10]
-
- [ 1] \x{...}
- [ 2] \N{...}
- [ 3] . \p{...} \P{...}
- [ 4] support for scripts (see UTR#24 Script Names), blocks,
- binary properties, enumerated non-binary properties, and
- numeric properties (as listed in UTR#18 Other Properties)
- [ 5] have negation
- [ 6] can use regular expression look-ahead [a]
- or user-defined character properties [b] to emulate subtraction
- [ 7] include Letters in word characters
- [ 8] note that Perl does Full case-folding in matching, not Simple:
+ RL1.1 Hex Notation - done [1]
+ RL1.2 Properties - done [2][3]
+ RL1.2a Compatibility Properties - done [4]
+ RL1.3 Subtraction and Intersection - MISSING [5]
+ RL1.4 Simple Word Boundaries - done [6]
+ RL1.5 Simple Loose Matches - done [7]
+ RL1.6 Line Boundaries - MISSING [8]
+ RL1.7 Supplementary Code Points - done [9]
+
+ [1] \x{...}
+ [2] \p{...} \P{...}
+ [3] supports not only minimal list (general category, scripts,
+ Alphabetic, Lowercase, Uppercase, WhiteSpace,
+ NoncharacterCodePoint, DefaultIgnorableCodePoint, Any,
+ ASCII, Assigned), but also bidirectional types, blocks, etc.
+ (see L</"Unicode Character Properties">)
+ [4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
+ [5] can use regular expression look-ahead [a] or
+ user-defined character properties [b] to emulate set operations
+ [6] \b \B
+ [7] note that Perl does Full case-folding in matching, not Simple:
for example U+1F88 is equivalent with U+1F00 U+03B9,
not with 1F80. This difference matters for certain Greek
capital letters with certain modifiers: the Full case-folding
decomposes the letter, while the Simple case-folding would map
it to a single character.
- [ 9] see UTR #13 Unicode Newline Guidelines
- [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
- (should also affect <>, $., and script line numbers)
- (the \x{85}, \x{2028} and \x{2029} do match \s)
+ [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r),
+ CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029);
+ should also affect <>, $., and script line numbers;
+ should not split lines within CRLF [c] (i.e. there is no empty
+ line between \r and \n)
+ [9] UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to U+10FFFF
+ but also beyond U+10FFFF [d]
[a] You can mimic class subtraction using lookahead.
-For example, what UTR #18 might write as
+For example, what UTS#18 might write as
[{Greek}-[{UNASSIGNED}]]
which will match assigned characters known to be part of the Greek script.
Also see the Unicode::Regex::Set module, it does implement the full
-UTR #18 grouping, intersection, union, and removal (subtraction) syntax.
+UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
+
+[b] '+' for union, '-' for removal (set-difference), '&' for intersection
+(see L</"User-Defined Character Properties">)
-[b] See L</"User-Defined Character Properties">.
+[c] Try the C<:crlf> layer (see L<PerlIO>).
+
+[d] Avoid C<use warnings 'utf8';> (or say C<no warnings 'utf8';>) to allow
+U+FFFF (C<\x{FFFF}>).
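The look-ahead emulation of subtraction mentioned in note [a] can be sketched as follows. The pattern subtracts unassigned code points from the Greek and Coptic block; the variable names are illustrative, and U+0378 is assumed to be unassigned, which holds for the Unicode versions this sketch was written against.

```perl
use strict;
use warnings;

# The negative look-ahead rejects unassigned code points before the
# block property gets a chance to match them.
my $assigned_greek = qr/(?!\p{Unassigned})\p{InGreekAndCoptic}/;

my $alpha    = "\x{3B1}";   # GREEK SMALL LETTER ALPHA
my $reserved = "\x{378}";   # inside the block, but assumed unassigned

for my $char ($alpha, $reserved) {
    printf "U+%04X in block=%d assigned in block=%d\n",
        ord($char),
        ($char =~ /\p{InGreekAndCoptic}/ ? 1 : 0),
        ($char =~ $assigned_greek        ? 1 : 0);
}
```

The same effect could be had with a user-defined property using "-" lines; the look-ahead form has the advantage of needing no extra subroutine.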
=item *
Level 2 - Extended Unicode Support
- 3.1 Surrogates - MISSING [11]
- 3.2 Canonical Equivalents - MISSING [12][13]
- 3.3 Locale-Independent Graphemes - MISSING [14]
- 3.4 Locale-Independent Words - MISSING [15]
- 3.5 Locale-Independent Loose Matches - MISSING [16]
-
- [11] Surrogates are solely a UTF-16 concept and Perl's internal
- representation is UTF-8. The Encode module does UTF-16, though.
- [12] see UTR#15 Unicode Normalization
- [13] have Unicode::Normalize but not integrated to regexes
- [14] have \X but at this level . should equal that
- [15] need three classes, not just \w and \W
- [16] see UTR#21 Case Mappings
+ RL2.1 Canonical Equivalents - MISSING [10][11]
+ RL2.2 Default Grapheme Clusters - MISSING [12][13]
+ RL2.3 Default Word Boundaries - MISSING [14]
+ RL2.4 Default Loose Matches - MISSING [15]
+ RL2.5 Name Properties - MISSING [16]
+ RL2.6 Wildcard Properties - MISSING
+
+ [10] see UAX#15 "Unicode Normalization Forms"
+ [11] have Unicode::Normalize but not integrated to regexes
+ [12] have \X but at this level . should equal that
+ [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable
+ clusters as a single grapheme cluster.
+ [14] see UAX#29, Word Boundaries
+ [15] see UAX#21 "Case Mappings"
+ [16] have \N{...} but neither compute names of CJK Ideographs
+ and Hangul Syllables nor use a loose match [e]
+
+[e] C<\N{...}> allows namespaces (see L<charnames>).
=item *
-Level 3 - Locale-Sensitive Support
-
- 4.1 Locale-Dependent Categories - MISSING
- 4.2 Locale-Dependent Graphemes - MISSING [16][17]
- 4.3 Locale-Dependent Words - MISSING
- 4.4 Locale-Dependent Loose Matches - MISSING
- 4.5 Locale-Dependent Ranges - MISSING
-
- [16] see UTR#10 Unicode Collation Algorithms
- [17] have Unicode::Collate but not integrated to regexes
+Level 3 - Tailored Support
+
+ RL3.1 Tailored Punctuation - MISSING
+ RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
+ RL3.3 Tailored Word Boundaries - MISSING
+ RL3.4 Tailored Loose Matches - MISSING
+ RL3.5 Tailored Ranges - MISSING
+ RL3.6 Context Matching - MISSING [19]
+ RL3.7 Incremental Matches - MISSING
+ ( RL3.8 Unicode Set Sharing )
+ RL3.9 Possible Match Sets - MISSING
+ RL3.10 Folded Matching - MISSING [20]
+ RL3.11 Submatchers - MISSING
+
+ [17] see UTS#10 "Unicode Collation Algorithm"
+ [18] have Unicode::Collate but not integrated to regexes
+ [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see
+ outside of the target substring
+ [20] need insensitive matching for linguistic features other than case;
+ for example, hiragana to katakana, wide and narrow, simplified Han
+ to traditional Han (see UTR#30 "Character Foldings")
=back
=head2 Interaction with Extensions
When Perl exchanges data with an extension, the extension should be
-able to understand the UTF-8 flag and act accordingly. If the
+able to understand the UTF8 flag and act accordingly. If the
extension doesn't know about the flag, it's likely that the extension
will return incorrectly-flagged data.
sub param {
my($self,$name,$value) = @_;
utf8::upgrade($name); # make sure it is UTF-8 encoded
- if (defined $value)
+ if (defined $value) {
utf8::upgrade($value); # make sure it is UTF-8 encoded
return $self->SUPER::param($name,$value);
} else {
A filehandle that should read or write UTF-8
if ($] > 5.007) {
- binmode $fh, ":utf8";
+ binmode $fh, ":encoding(utf8)";
}
=item *
Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
-UTF-8 flag is stripped off. Note that at the time of this writing
+UTF8 flag is stripped off. Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.
A scalar we got back from an extension
If you believe the scalar comes back as UTF-8, you will most likely
-want the UTF-8 flag restored:
+want the UTF8 flag restored:
if ($] > 5.007) {
require Encode;
Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program. If you recognize such a situation, just remove
-the UTF-8 flag:
+the UTF8 flag:
utf8::downgrade($val) if $] > 5.007;
=head1 SEE ALSO
-L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perluniintro>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">
=cut