implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.
+People who want to learn to use Unicode in Perl should probably read
+L<the Perl Unicode tutorial|perlunitut> before reading this reference
+document.
+
=over 4
=item Input and Output Layers
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.
-To indicate that Perl source itself is using a particular encoding,
-see L<encoding>.
+To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
=item Regular Expressions
The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
-character scheme when presented with Unicode data--or instead uses
-a traditional byte scheme when presented with byte data.
+character scheme when presented with data that is internally encoded in
+UTF-8 -- or instead uses a traditional byte scheme when presented with
+byte data.
=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.
-You can also use the C<encoding> pragma to change the default encoding
-of the data in your script; see L<encoding>.
-
=item BOM-marked scripts and UTF-16 scripts autodetected
If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF16-BE,
downgraded with UTF-8 encoding. This happens because the first 256
codepoints in Unicode happens to agree with Latin-1.
-If you wish to interpret byte strings as UTF-8 instead, use the
-C<encoding> pragma:
-
- use encoding 'utf8';
-
See L</"Byte and Character Semantics"> for more details.
=back
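+
+As a minimal sketch of the C<:encoding(...)> layer described under
+"Input and Output Layers" (the in-memory filehandles and the variable
+names are illustrative assumptions), a character can be written out as
+UTF-8 bytes and decoded again on input:
+
+    use strict;
+    use warnings;
+
+    my $char = "\x{263A}";          # one character, WHITE SMILING FACE
+
+    # Encode on output: the layer turns characters into UTF-8 bytes.
+    open(my $out, '>:encoding(UTF-8)', \my $bytes) or die "open: $!";
+    print {$out} $char;
+    close($out);
+
+    # Decode on input: the layer turns the bytes back into characters.
+    open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open: $!";
+    my $decoded = <$in>;
+    close($in);
+
+    printf "%d byte(s) on disk, %d character(s) in Perl\n",
+        length($bytes), length($decoded);
+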
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC. This translation is done without
-regard to the system's native 8-bit encoding. To change this for
-systems with non-Latin-1 and non-EBCDIC native encodings, use the
-C<encoding> pragma. See L<encoding>.
+regard to the system's native 8-bit encoding.
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.
-If you use a Unicode editor to edit your program, Unicode characters
-may occur directly within the literal strings in one of the various
-Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
-as such and converted to Perl's internal representation only if the
-appropriate L<encoding> is specified.
+If you use a Unicode editor to edit your program, Unicode characters may
+occur directly within the literal strings in UTF-8 encoding, or UTF-16.
+(The former requires a BOM or C<use utf8>; the latter requires a BOM.)
-Unicode characters can also be added to a string by using the
-C<\x{...}> notation. The Unicode code for the desired character, in
-hexadecimal, should be placed in the braces. For instance, a smiley
-face is C<\x{263A}>. This encoding scheme only works for characters
-with a code of 0x100 or above.
+Unicode characters can also be added to a string by using the C<\x{...}>
+notation. The Unicode code for the desired character, in hexadecimal,
+should be placed in the braces. For instance, a smiley face is
+C<\x{263A}>. This encoding scheme works for all characters, but
+for characters under 0x100, note that Perl may use an 8 bit encoding
+internally, for optimization and/or backward compatibility.
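+
+For illustration (a sketch; the variable names are assumptions), the
+notation can be used for code points on either side of 0x100, and each
+resulting string is one character long regardless of how Perl stores
+it internally:
+
+    my $smiley = "\x{263A}";        # U+263A, above 0xFF
+    my $eacute = "\x{E9}";          # U+00E9, below 0x100
+
+    printf "U+%04X, length %d\n", ord($smiley), length($smiley);
+    printf "U+%04X, length %d\n", ord($eacute), length($eacute);
+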
Additionally, if you
=item *
Regular expressions match characters instead of bytes. "." matches
-a character instead of a byte. The C<\C> pattern is provided to force
-a match a single byte--a C<char> in C, hence C<\C>.
+a character instead of a byte.
=item *
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.
-(However, and as a limitation of the current implementation, using
-C<\w> or C<\W> I<inside> a C<[...]> character class will still match
-with byte semantics.)
-
=item *
Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
-See L</"Unicode Character Properties"> for more details.
+See L</"Unicode Character Properties"> for more details.
You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
equal to C<\P{Tamil}>.
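+
+A short sketch of these constructs (the sample code points are chosen
+only for illustration):
+
+    my $tamil_a = "\x{0B85}";       # TAMIL LETTER A
+
+    print "in Tamil\n"      if $tamil_a =~ /\p{Tamil}/;
+    print "not in Tamil\n"  if "A"      =~ /\P{Tamil}/;
+    print "same negation\n" if "A"      =~ /\p{^Tamil}/;
+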
B<NOTE: the properties, scripts, and blocks listed here are as of
-Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0
-came out in April 2003, and Perl 5.8.1 in September 2003.>
+Unicode 5.0.0 in July 2006.>
=over 4
Arabic
Armenian
+ Balinese
Bengali
Bopomofo
+ Braille
+ Buginese
Buhid
CanadianAboriginal
Cherokee
+ Coptic
+ Cuneiform
+ Cypriot
Cyrillic
Deseret
Devanagari
Ethiopic
Georgian
+ Glagolitic
Gothic
Greek
Gujarati
Inherited
Kannada
Katakana
+ Kharoshthi
Khmer
Lao
Latin
+ Limbu
+ LinearB
Malayalam
Mongolian
Myanmar
+ NewTaiLue
+ Nko
Ogham
OldItalic
+ OldPersian
Oriya
+ Osmanya
+ PhagsPa
+ Phoenician
Runic
+ Shavian
Sinhala
+ SylotiNagri
Syriac
Tagalog
Tagbanwa
+ TaiLe
Tamil
Telugu
Thaana
Thai
Tibetan
+ Tifinagh
+ Ugaritic
Yi
=item Extended property classes
Deprecated
Diacritic
Extender
- GraphemeLink
HexDigit
Hyphen
Ideographic
OtherAlphabetic
OtherDefaultIgnorableCodePoint
OtherGraphemeExtend
+ OtherIDStart
+ OtherIDContinue
OtherLowercase
OtherMath
OtherUppercase
+ PatternSyntax
+ PatternWhiteSpace
QuotationMark
Radical
SoftDotted
+ STerm
TerminalPunctuation
UnifiedIdeograph
+ VariationSelector
WhiteSpace
and there are further derived properties:
- Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
- Lowercase Ll + OtherLowercase
- Uppercase Lu + OtherUppercase
- Math Sm + OtherMath
+ Alphabetic = Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic
+ Lowercase = Ll + OtherLowercase
+ Uppercase = Lu + OtherUppercase
+ Math = Sm + OtherMath
+
+ IDStart = Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart
+ IDContinue = IDStart + Mn + Mc + Nd + Pc + OtherIDContinue
- ID_Start Lu + Ll + Lt + Lm + Lo + Nl
- ID_Continue ID_Start + Mn + Mc + Nd + Pc
+ DefaultIgnorableCodePoint
+ = OtherDefaultIgnorableCodePoint
+ + Cf + Cc + Cs + Noncharacters + VariationSelector
+ - WhiteSpace - FFF9..FFFB (Annotation Characters)
- Any Any character
- Assigned Any non-Cn character (i.e. synonym for \P{Cn})
- Unassigned Synonym for \p{Cn}
- Common Any character (or unassigned code point)
- not explicitly assigned to a script
+ Any = Any code point (i.e. U+0000 to U+10FFFF)
+ Assigned = Any non-Cn code point (i.e. synonym for \P{Cn})
+ Unassigned = Synonym for \p{Cn}
+ ASCII = ASCII (i.e. U+0000 to U+007F)
+
+ Common = Any character (or unassigned code point)
+ not explicitly assigned to a script
=item Use of "Is" Prefix
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called C<Common>.
-For more about scripts, see the UTR #24:
+For more about scripts, see the UAX#24 "Script Names":
- http://www.unicode.org/unicode/reports/tr24/
+ http://www.unicode.org/reports/tr24/
For more about blocks, see:
These block names are supported:
+ InAegeanNumbers
InAlphabeticPresentationForms
+ InAncientGreekMusicalNotation
+ InAncientGreekNumbers
InArabic
InArabicPresentationFormsA
InArabicPresentationFormsB
+ InArabicSupplement
InArmenian
InArrows
+ InBalinese
InBasicLatin
InBengali
InBlockElements
InBopomofoExtended
InBoxDrawing
InBraillePatterns
+ InBuginese
InBuhid
InByzantineMusicalSymbols
InCJKCompatibility
InCJKCompatibilityIdeographs
InCJKCompatibilityIdeographsSupplement
InCJKRadicalsSupplement
+ InCJKStrokes
InCJKSymbolsAndPunctuation
InCJKUnifiedIdeographs
InCJKUnifiedIdeographsExtensionA
InCJKUnifiedIdeographsExtensionB
InCherokee
InCombiningDiacriticalMarks
+ InCombiningDiacriticalMarksSupplement
InCombiningDiacriticalMarksforSymbols
InCombiningHalfMarks
InControlPictures
+ InCoptic
+ InCountingRodNumerals
+ InCuneiform
+ InCuneiformNumbersAndPunctuation
InCurrencySymbols
+ InCypriotSyllabary
InCyrillic
- InCyrillicSupplementary
+ InCyrillicSupplement
InDeseret
InDevanagari
InDingbats
InEnclosedAlphanumerics
InEnclosedCJKLettersAndMonths
InEthiopic
+ InEthiopicExtended
+ InEthiopicSupplement
InGeneralPunctuation
InGeometricShapes
InGeorgian
+ InGeorgianSupplement
+ InGlagolitic
InGothic
InGreekExtended
InGreekAndCoptic
InKannada
InKatakana
InKatakanaPhoneticExtensions
+ InKharoshthi
InKhmer
+ InKhmerSymbols
InLao
InLatin1Supplement
InLatinExtendedA
InLatinExtendedAdditional
InLatinExtendedB
+ InLatinExtendedC
+ InLatinExtendedD
InLetterlikeSymbols
+ InLimbu
+ InLinearBIdeograms
+ InLinearBSyllabary
InLowSurrogates
InMalayalam
InMathematicalAlphanumericSymbols
InMiscellaneousMathematicalSymbolsA
InMiscellaneousMathematicalSymbolsB
InMiscellaneousSymbols
+ InMiscellaneousSymbolsAndArrows
InMiscellaneousTechnical
+ InModifierToneLetters
InMongolian
InMusicalSymbols
InMyanmar
+ InNKo
+ InNewTaiLue
InNumberForms
InOgham
InOldItalic
+ InOldPersian
InOpticalCharacterRecognition
InOriya
+ InOsmanya
+ InPhagspa
+ InPhoenician
+ InPhoneticExtensions
+ InPhoneticExtensionsSupplement
InPrivateUseArea
InRunic
+ InShavian
InSinhala
InSmallFormVariants
InSpacingModifierLetters
InSupplementalArrowsA
InSupplementalArrowsB
InSupplementalMathematicalOperators
+ InSupplementalPunctuation
InSupplementaryPrivateUseAreaA
InSupplementaryPrivateUseAreaB
+ InSylotiNagri
InSyriac
InTagalog
InTagbanwa
InTags
+ InTaiLe
+ InTaiXuanJingSymbols
InTamil
InTelugu
InThaana
InThai
InTibetan
+ InTifinagh
+ InUgaritic
InUnifiedCanadianAboriginalSyllabics
InVariationSelectors
+ InVariationSelectorsSupplement
+ InVerticalForms
InYiRadicals
InYiSyllables
+ InYijingHexagramSymbols
=back
The following list of Unicode support for regular expressions describes
all the features currently supported. The references to "Level N"
-and the section numbers refer to the Unicode Technical Report 18,
-"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
-Perl 5.8.0).
+and the section numbers refer to the Unicode Technical Standard #18,
+"Unicode Regular Expressions", version 11, in May 2005.
=over 4
Level 1 - Basic Unicode Support
- 2.1 Hex Notation - done [1]
- Named Notation - done [2]
- 2.2 Categories - done [3][4]
- 2.3 Subtraction - MISSING [5][6]
- 2.4 Simple Word Boundaries - done [7]
- 2.5 Simple Loose Matches - done [8]
- 2.6 End of Line - MISSING [9][10]
-
- [ 1] \x{...}
- [ 2] \N{...}
- [ 3] . \p{...} \P{...}
- [ 4] support for scripts (see UTR#24 Script Names), blocks,
- binary properties, enumerated non-binary properties, and
- numeric properties (as listed in UTR#18 Other Properties)
- [ 5] have negation
- [ 6] can use regular expression look-ahead [a]
- or user-defined character properties [b] to emulate subtraction
- [ 7] include Letters in word characters
- [ 8] note that Perl does Full case-folding in matching, not Simple:
+ RL1.1 Hex Notation - done [1]
+ RL1.2 Properties - done [2][3]
+ RL1.2a Compatibility Properties - done [4]
+ RL1.3 Subtraction and Intersection - MISSING [5]
+ RL1.4 Simple Word Boundaries - done [6]
+ RL1.5 Simple Loose Matches - done [7]
+ RL1.6 Line Boundaries - MISSING [8]
+ RL1.7 Supplementary Code Points - done [9]
+
+ [1] \x{...}
+ [2] \p{...} \P{...}
+ [3] supports not only minimal list (general category, scripts,
+ Alphabetic, Lowercase, Uppercase, WhiteSpace,
+ NoncharacterCodePoint, DefaultIgnorableCodePoint, Any,
+ ASCII, Assigned), but also bidirectional types, blocks, etc.
+ (see L</"Unicode Character Properties">)
+ [4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
+ [5] can use regular expression look-ahead [a] or
+ user-defined character properties [b] to emulate set operations
+ [6] \b \B
+ [7] note that Perl does Full case-folding in matching, not Simple:
for example U+1F88 is equivalent with U+1F00 U+03B9,
not with 1F80. This difference matters for certain Greek
capital letters with certain modifiers: the Full case-folding
decomposes the letter, while the Simple case-folding would map
it to a single character.
- [ 9] see UTR #13 Unicode Newline Guidelines
- [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
- (should also affect <>, $., and script line numbers)
- (the \x{85}, \x{2028} and \x{2029} do match \s)
+ [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r),
+ CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029);
+ should also affect <>, $., and script line numbers;
+ should not split lines within CRLF [c] (i.e. there is no empty
+ line between \r and \n)
+ [9] UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to U+10FFFF
+ but also beyond U+10FFFF [d]
[a] You can mimic class subtraction using lookahead.
-For example, what UTR #18 might write as
+For example, what UTS#18 might write as
[{Greek}-[{UNASSIGNED}]]
which will match assigned characters known to be part of the Greek script.
Also see the Unicode::Regex::Set module, it does implement the full
-UTR #18 grouping, intersection, union, and removal (subtraction) syntax.
+UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
+
+[b] '+' for union, '-' for removal (set-difference), '&' for intersection
+(see L</"User-Defined Character Properties">)
+
+[c] Try the C<:crlf> layer (see L<PerlIO>).
-[b] See L</"User-Defined Character Properties">.
+[d] Avoid C<use warnings 'utf8';> (or say C<no warnings 'utf8';>) to allow
+U+FFFF (C<\x{FFFF}>).
=item *
Level 2 - Extended Unicode Support
- 3.1 Surrogates - MISSING [11]
- 3.2 Canonical Equivalents - MISSING [12][13]
- 3.3 Locale-Independent Graphemes - MISSING [14]
- 3.4 Locale-Independent Words - MISSING [15]
- 3.5 Locale-Independent Loose Matches - MISSING [16]
-
- [11] Surrogates are solely a UTF-16 concept and Perl's internal
- representation is UTF-8. The Encode module does UTF-16, though.
- [12] see UTR#15 Unicode Normalization
- [13] have Unicode::Normalize but not integrated to regexes
- [14] have \X but at this level . should equal that
- [15] need three classes, not just \w and \W
- [16] see UTR#21 Case Mappings
+ RL2.1 Canonical Equivalents - MISSING [10][11]
+ RL2.2 Default Grapheme Clusters - MISSING [12][13]
+ RL2.3 Default Word Boundaries - MISSING [14]
+ RL2.4 Default Loose Matches - MISSING [15]
+ RL2.5 Name Properties - MISSING [16]
+ RL2.6 Wildcard Properties - MISSING
+
+ [10] see UAX#15 "Unicode Normalization Forms"
+ [11] have Unicode::Normalize but not integrated to regexes
+ [12] have \X but at this level . should equal that
+ [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable
+ clusters as a single grapheme cluster.
+ [14] see UAX#29, Word Boundaries
+ [15] see UAX#21 "Case Mappings"
+ [16] have \N{...} but neither compute names of CJK Ideographs
+ and Hangul Syllables nor use a loose match [e]
+
+[e] C<\N{...}> allows namespaces (see L<charnames>).
=item *
-Level 3 - Locale-Sensitive Support
-
- 4.1 Locale-Dependent Categories - MISSING
- 4.2 Locale-Dependent Graphemes - MISSING [16][17]
- 4.3 Locale-Dependent Words - MISSING
- 4.4 Locale-Dependent Loose Matches - MISSING
- 4.5 Locale-Dependent Ranges - MISSING
-
- [16] see UTR#10 Unicode Collation Algorithms
- [17] have Unicode::Collate but not integrated to regexes
+Level 3 - Tailored Support
+
+ RL3.1 Tailored Punctuation - MISSING
+ RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
+ RL3.3 Tailored Word Boundaries - MISSING
+ RL3.4 Tailored Loose Matches - MISSING
+ RL3.5 Tailored Ranges - MISSING
+ RL3.6 Context Matching - MISSING [19]
+ RL3.7 Incremental Matches - MISSING
+ ( RL3.8 Unicode Set Sharing )
+ RL3.9 Possible Match Sets - MISSING
+ RL3.10 Folded Matching - MISSING [20]
+ RL3.11 Submatchers - MISSING
+
+ [17] see UTS#10 "Unicode Collation Algorithm"
+ [18] have Unicode::Collate but not integrated to regexes
+ [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see
+ outside of the target substring
+ [20] need insensitive matching for linguistic features other than case;
+ for example, hiragana to katakana, wide and narrow, simplified Han
+ to traditional Han (see UTR#30 "Character Foldings")
=back
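+
+The look-ahead emulation of class subtraction mentioned in note [a]
+can be sketched as follows. The variable name and the two test code
+points are assumptions made for this example; U+0378 is used because
+it lies inside the Greek and Coptic block but is not assigned:
+
+    # "Greek and Coptic block minus unassigned code points"
+    my $assigned_greek = qr/(?!\p{Unassigned})\p{InGreekAndCoptic}/;
+
+    # U+03B1 GREEK SMALL LETTER ALPHA is assigned, so it matches.
+    print "alpha matches\n"   if "\x{3B1}" =~ $assigned_greek;
+
+    # U+0378 is in the block but unassigned, so it is rejected.
+    print "U+0378 rejected\n" if "\x{378}" !~ $assigned_greek;
+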
=head2 Interaction with Extensions
When Perl exchanges data with an extension, the extension should be
-able to understand the UTF-8 flag and act accordingly. If the
+able to understand the UTF8 flag and act accordingly. If the
extension doesn't know about the flag, it's likely that the extension
will return incorrectly-flagged data.
Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
-UTF-8 flag is stripped off. Note that at the time of this writing
+UTF8 flag is stripped off. Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.
A scalar we got back from an extension
If you believe the scalar comes back as UTF-8, you will most likely
-want the UTF-8 flag restored:
+want the UTF8 flag restored:
if ($] > 5.007) {
require Encode;
Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program. If you recognize such a situation, just remove
-the UTF-8 flag:
+the UTF8 flag:
utf8::downgrade($val) if $] > 5.007;
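+
+As a hedged sketch of what the flag does and does not change (the
+variable name and sample value are assumptions), the UTF8 flag can be
+toggled on an ASCII-only scalar without altering the string itself:
+
+    my $val = "plain ASCII";
+
+    utf8::upgrade($val);            # internal encoding becomes UTF-8
+    print "flag on\n"   if utf8::is_utf8($val);
+
+    utf8::downgrade($val);          # back to the byte representation
+    print "flag off\n"  unless utf8::is_utf8($val);
+
+    print "unchanged\n" if $val eq "plain ASCII";
+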
=head1 SEE ALSO
-L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perluniintro>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">
=cut