3 perlunicode - Unicode support in Perl
7 =head2 Important Caveats
WARNING: While the implementation of Unicode support in Perl is now
fairly complete, it is still evolving to some extent.
In particular, the way Unicode is handled on EBCDIC platforms is still
rather experimental. On such a platform, references to UTF-8 encoding
in this document and elsewhere should be read as meaning UTF-EBCDIC as
specified in Unicode Technical Report 16, unless ASCII vs EBCDIC issues
are specifically discussed. There is no C<utfebcdic> pragma or
":utfebcdic" layer; rather, "utf8" and ":utf8" are re-used to mean
the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
for more discussion of the issues.
21 The following areas are still under development.
25 =item Input and Output Disciplines
A filehandle can be marked as containing perl's internal Unicode
encoding (UTF-8 or UTF-EBCDIC) by opening it with the ":utf8" layer.
Other encodings can be converted to perl's encoding on input, or from
perl's encoding on output, by use of the ":encoding()" layer. There is
not yet a clean way to mark the Perl source itself as being in a
particular encoding.
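As a minimal sketch of the two layers described above, the following
reads ISO 8859-1 bytes through ":encoding()" and writes the resulting
characters through ":utf8". In-memory filehandles are used only to keep
the example self-contained; the variable names are illustrative.

```perl
use strict;
use warnings;

# Bytes as they might sit in a file: "cafe" with e-acute in ISO 8859-1.
my $latin1_bytes = "caf\xE9\n";

# Convert to perl's internal encoding on input.
open(my $in, '<:encoding(iso-8859-1)', \$latin1_bytes) or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;

# Mark the output handle as carrying UTF-8.
my $utf8_bytes = '';
open(my $out, '>:utf8', \$utf8_bytes) or die "open: $!";
print {$out} $line;
close $out;

printf "%d characters in, %d bytes out\n",
    length($line), length($utf8_bytes);
```

The string read has four characters, while the UTF-8 output occupies
five bytes because the e-acute is written as a two-byte sequence.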
34 =item Regular Expressions
The regular expression compiler now attempts to produce polymorphic
opcodes. That is, the pattern should now adapt to the data and
automatically switch to the Unicode character scheme when presented
with Unicode data, or a traditional byte scheme when presented with
byte data. The implementation is still new and (particularly on
EBCDIC platforms) may need further work.
43 =item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
45 The C<utf8> pragma implements the tables used for Unicode support.
46 These tables are automatically loaded on demand, so the C<utf8> pragma
47 need not normally be used.
However, as a compatibility measure, this pragma must be explicitly
used to enable recognition of UTF-8 in the Perl scripts themselves on
ASCII based machines, or to recognize UTF-EBCDIC on EBCDIC based
machines. B<NOTE: this should be the only place where an explicit
C<use utf8> is needed>.
55 You can also use the C<encoding> pragma to change the default encoding
56 of the data in your script; see L<encoding>. Currently this cannot
57 be combined with C<use utf8>.
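The effect of C<use utf8> on literals in the script can be sketched as
follows. This assumes the file is saved as UTF-8 and that the accented
letter is the single precomposed character U+00E9; the variable names
are illustrative.

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);

{
    use utf8;                       # literals are decoded as UTF-8
    $with_pragma = length "é";      # one character
}
{
    no utf8;                        # literals are taken as raw bytes
    $without_pragma = length "é";   # two bytes: 0xC3 0xA9
}

print "with use utf8: $with_pragma, without: $without_pragma\n";
```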
61 =head2 Byte and Character semantics
63 Beginning with version 5.6, Perl uses logically wide characters to
64 represent strings internally. This internal representation of strings
65 uses either the UTF-8 or the UTF-EBCDIC encoding.
67 In future, Perl-level operations can be expected to work with
68 characters rather than bytes, in general.
70 However, as strictly an interim compatibility measure, Perl aims to
71 provide a safe migration path from byte semantics to character
72 semantics for programs. For operations where Perl can unambiguously
73 decide that the input data is characters, Perl now switches to
74 character semantics. For operations where this determination cannot
75 be made without additional information from the user, Perl decides in
76 favor of compatibility, and chooses to use byte semantics.
This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations, but only as long as
none of the program's inputs are marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as
%ENV), or from literals and constants in the source text.
If the C<-C> command line switch is used (or the
${^WIDE_SYSTEM_CALLS} global flag is set to C<1>), all system calls
will use the corresponding wide character APIs. Note that this is
currently only implemented on Windows, since other platforms lack an
API standard in this area.
91 Regardless of the above, the C<bytes> pragma can always be used to
92 force byte semantics in a particular lexical scope. See L<bytes>.
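A minimal sketch of the C<bytes> pragma in a lexical scope: the same
string is measured once in characters and once in bytes. The variable
names are illustrative.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";          # one character, U+263A

my $char_len = length $smiley;    # character semantics

my $byte_len;
{
    use bytes;                    # byte semantics in this block only
    $byte_len = length $smiley;   # length of the internal encoding
}

print "characters: $char_len, bytes: $byte_len\n";
```

On an ASCII platform the internal UTF-8 form of U+263A is three bytes
long, so the two lengths differ.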
94 The C<utf8> pragma is primarily a compatibility device that enables
95 recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
96 Note that this pragma is only required until a future version of Perl
97 in which character semantics will become the default. This pragma may
98 then become a no-op. See L<utf8>.
100 Unless mentioned otherwise, Perl operators will use character semantics
101 when they are dealing with Unicode data, and byte semantics otherwise.
102 Thus, character semantics for these operations apply transparently; if
103 the input data came from a Unicode source (for example, by adding a
104 character encoding discipline to the filehandle whence it came, or a
105 literal UTF-8 string constant in the program), character semantics
106 apply; otherwise, byte semantics are in effect. To force byte semantics
107 on Unicode data, the C<bytes> pragma should be used.
Notice that if you concatenate strings with byte semantics and strings
with Unicode character data, the bytes will by default be upgraded
I<as if they were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a
translation to ISO 8859-1). To change this, use the C<encoding>
pragma; see L<encoding>.
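The upgrade on concatenation can be sketched like this, assuming an
ASCII platform. C<utf8::is_utf8()> is used only to peek at the internal
flag for illustration; ordinary code should not need it.

```perl
use strict;
use warnings;

my $latin1 = "\xE9";        # a single byte, no Unicode data involved
my $wide   = "\x{263A}";    # Unicode character data

my $joined = $latin1 . $wide;

# The byte 0xE9 is now the character U+00E9, as in ISO 8859-1.
printf "%d characters; first is U+%04X\n",
    length($joined), ord($joined);
```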
115 Under character semantics, many operations that formerly operated on
116 bytes change to operating on characters. For ASCII data this makes no
117 difference, because UTF-8 stores ASCII in single bytes, but for any
118 character greater than C<chr(127)>, the character B<may> be stored in
119 a sequence of two or more bytes, all of which have the high bit set.
121 For C1 controls or Latin 1 characters on an EBCDIC platform the
122 character may be stored in a UTF-EBCDIC multi byte sequence. But by
123 and large, the user need not worry about this, because Perl hides it
124 from the user. A character in Perl is logically just a number ranging
125 from 0 to 2**32 or so. Larger characters encode to longer sequences
126 of bytes internally, but again, this is just an internal detail which
127 is hidden at the Perl level.
129 =head2 Effects of character semantics
131 Character semantics have the following effects:
Strings and patterns may contain characters that have an ordinal value
larger than 255.
Presuming you use a Unicode editor to edit your program, such
characters will typically occur directly within the literal strings as
UTF-8 (or UTF-EBCDIC on EBCDIC platforms) characters, but you can also
specify a particular character with an extension of the C<\x>
notation. UTF-X characters are specified by putting the hexadecimal
code within curlies after the C<\x>. For instance, a Unicode smiley
face is C<\x{263A}>.
150 Identifiers within the Perl script may contain Unicode alphanumeric
151 characters, including ideographs. (You are currently on your own when
152 it comes to using the canonical forms of characters--Perl doesn't
153 (yet) attempt to canonicalize variable names for you.)
Regular expressions match characters instead of bytes. For instance,
"." matches a character instead of a byte. (However, the C<\C> pattern
is provided to force a match of a single byte ("C<char>" in C, hence
C<\C>).)
163 Character classes in regular expressions match characters instead of
164 bytes, and match against the character properties specified in the
165 Unicode properties database. So C<\w> can be used to match an
166 ideograph, for instance.
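A short sketch of both points: "." consumes one whole character, and
C<\w> accepts an ideograph. The sample code points are chosen for
illustration only.

```perl
use strict;
use warnings;

my $smiley    = "\x{263A}";   # WHITE SMILING FACE
my $ideograph = "\x{4E2D}";   # a CJK unified ideograph

# "." matches the whole character, not one byte of its encoding.
my $dot_matches  = ($smiley =~ /\A.\z/) ? 1 : 0;

# \w consults the Unicode properties, so an ideograph is a word character.
my $word_matches = ($ideograph =~ /\A\w\z/) ? 1 : 0;

print "dot: $dot_matches, \\w: $word_matches\n";
```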
Named Unicode properties and block ranges may be used as character
classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
match property) constructs. For instance, C<\p{Lu}> matches any
character with the Unicode uppercase property, while C<\p{M}> matches
any mark character. Single letter properties may omit the brackets,
so C<\p{M}> can also be written C<\pM>. Many predefined character
classes are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.
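As an illustration of C<\p{}> with general category properties, the
sketch below tests an uppercase letter, a lowercase letter, and a
combining mark. The code points are examples only.

```perl
use strict;
use warnings;

my $upper = "\x{0394}";   # GREEK CAPITAL LETTER DELTA (Lu)
my $lower = "\x{03B4}";   # GREEK SMALL LETTER DELTA (Ll)
my $mark  = "\x{0301}";   # COMBINING ACUTE ACCENT (Mn)

my $upper_is_lu  = ($upper =~ /\A\p{Lu}\z/) ? 1 : 0;
my $lower_is_lu  = ($lower =~ /\A\p{Lu}\z/) ? 1 : 0;

# The braces are optional for single-letter properties.
my $mark_braces  = ($mark =~ /\A\p{M}\z/) ? 1 : 0;
my $mark_short   = ($mark =~ /\A\pM\z/)   ? 1 : 0;

print "Lu: $upper_is_lu/$lower_is_lu, M: $mark_braces/$mark_short\n";
```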
The C<\p{Is...}> tests cover "general properties" such as "letter" and
"digit", while the C<\p{In...}> tests cover Unicode scripts and blocks.
The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can have dashes, spaces, and
underbars at every word division, and you need not care about correct
casing. It is recommended, however, that for consistency you use the
following naming: the official Unicode script, block, or property name
(see below for the additional rules that apply to block names),
with whitespace and dashes replaced with underbar, and each word
written "uppercase-first-lowercase-rest". That is, "Latin-1 Supplement"
becomes "Latin_1_Supplement".
191 You can also negate both C<\p{}> and C<\P{}> by introducing a caret
192 (^) between the first curly and the property name: C<\p{^In_Tamil}> is
193 equal to C<\P{In_Tamil}>.
The C<In> and C<Is> can be left out: C<\p{Greek}> is equal to
C<\p{In_Greek}>, C<\P{Pd}> is equal to C<\P{IsPd}>.
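The caret negation and the prefix-free form can be checked with a small
sketch. It compares C<\p{^Lu}> against C<\P{Lu}> on a few sample
characters and uses C<\p{Greek}> without a prefix; the samples are
arbitrary.

```perl
use strict;
use warnings;

my @samples = ("A", "a", "\x{03B1}");   # last one is GREEK SMALL LETTER ALPHA

# \p{^Lu} and \P{Lu} should agree on every character.
my $agree = 1;
for my $char (@samples) {
    my $caret   = ($char =~ /\A\p{^Lu}\z/) ? 1 : 0;
    my $capital = ($char =~ /\A\P{Lu}\z/)  ? 1 : 0;
    $agree = 0 if $caret != $capital;
}

# The script name can be used without any prefix.
my $alpha_is_greek = ($samples[2] =~ /\A\p{Greek}\z/) ? 1 : 0;

print "negations agree: $agree, alpha is Greek: $alpha_is_greek\n";
```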
218 Pc Connector_Punctuation
222 Pi Initial_Punctuation
223 (may behave like Ps or Pe depending on usage)
225 (may behave like Ps or Pe depending on usage)
237 Zp Paragraph_Separator
246 There's also C<L&> which is an alias for C<Ll>, C<Lu>, and C<Lt>.
248 The following reserved ranges have C<In> tests:
250 CJK_Ideograph_Extension_A
253 Non_Private_Use_High_Surrogate
254 Private_Use_High_Surrogate
257 CJK_Ideograph_Extension_B
261 For example C<"\x{AC00}" =~ \p{HangulSyllable}> will test true.
262 (Handling of surrogates is not implemented yet, because Perl
263 uses UTF-8 and not UTF-16 internally to represent Unicode.)
265 Additionally, because scripts differ in their directionality
266 (for example Hebrew is written right to left), all characters
267 have their directionality defined:
270 BidiLRE Left-to-Right Embedding
271 BidiLRO Left-to-Right Override
273 BidiAL Right-to-Left Arabic
274 BidiRLE Right-to-Left Embedding
275 BidiRLO Right-to-Left Override
276 BidiPDF Pop Directional Format
277 BidiEN European Number
278 BidiES European Number Separator
279 BidiET European Number Terminator
281 BidiCS Common Number Separator
282 BidiNSM Non-Spacing Mark
283 BidiBN Boundary Neutral
284 BidiB Paragraph Separator
285 BidiS Segment Separator
287 BidiON Other Neutrals
The scripts available for C<\p{In...}> and C<\P{In...}>, for example
C<\p{InLatin}> or C<\P{InHan}>, are as follows:
335 There are also extended property classes that supplement the basic
336 properties, defined by the F<PropList> Unicode database:
347 Noncharacter_Code_Point
355 and further derived properties:
357 Alphabetic Lu + Ll + Lt + Lm + Lo + Other_Alphabetic
358 Lowercase Ll + Other_Lowercase
359 Uppercase Lu + Other_Uppercase
362 ID_Start Lu + Ll + Lt + Lm + Lo + Nl
363 ID_Continue ID_Start + Mn + Mc + Nd + Pc
366 Assigned Any non-Cn character
367 Common Any character (or unassigned code point)
368 not explicitly assigned to a script
In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
scripts concept is closer to natural languages, while the blocks
concept is more an artificial grouping based on groups of 256 Unicode
characters. For example, the C<Latin> script contains letters from
many blocks. On the other hand, the C<Latin> script does not contain
all the characters from those blocks; for example, it does not contain
digits, because digits are shared across many scripts. Digits and
other similar groups, like punctuation, are in a category called
C<Common>.
384 http://www.unicode.org/unicode/reports/tr24/
385 For more about blocks see
386 http://www.unicode.org/Public/UNIDATA/Blocks.txt
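The script versus block distinction can be sketched with a letter and a
digit. Note the assumption: this uses the explicit C<Script=> and
C<Block=> spellings accepted by later Perl releases, not the C<In...>
names described in this document, so it may not run on a 5.8-era perl.

```perl
use strict;
use warnings;

my $letter = "a";
my $digit  = "7";

# Both characters live in the same block...
my $letter_in_block = ($letter =~ /\A\p{Block=Basic_Latin}\z/) ? 1 : 0;
my $digit_in_block  = ($digit  =~ /\A\p{Block=Basic_Latin}\z/) ? 1 : 0;

# ...but only the letter belongs to the Latin script.
my $letter_in_script = ($letter =~ /\A\p{Script=Latin}\z/) ? 1 : 0;
my $digit_in_script  = ($digit  =~ /\A\p{Script=Latin}\z/) ? 1 : 0;

# Digits are shared across scripts and fall under Common.
my $digit_is_common  = ($digit =~ /\A\p{Script=Common}\z/) ? 1 : 0;

print "block: $letter_in_block/$digit_in_block, ",
      "script: $letter_in_script/$digit_in_script, ",
      "common: $digit_is_common\n";
```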
Because there are overlaps in naming (there are, for example, both
a script called C<Katakana> and a block called C<Katakana>), the block
version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
Notice that this definition was introduced in Perl 5.8.0: in Perl
5.6 only the blocks were used; in Perl 5.8.0 scripts became the
preferential Unicode character class definition. This meant that
the definitions of some character classes changed (the ones in the
list below that have the C<Block> appended).
398 Alphabetic Presentation Forms
400 Arabic Presentation Forms-A
401 Arabic Presentation Forms-B
411 Byzantine Musical Symbols
413 CJK Compatibility Forms
414 CJK Compatibility Ideographs
415 CJK Compatibility Ideographs Supplement
416 CJK Radicals Supplement
417 CJK Symbols and Punctuation
418 CJK Unified Ideographs
419 CJK Unified Ideographs Extension A
420 CJK Unified Ideographs Extension B
422 Combining Diacritical Marks
424 Combining Marks for Symbols
431 Enclosed Alphanumerics
432 Enclosed CJK Letters and Months
442 Halfwidth and Fullwidth Forms
443 Hangul Compatibility Jamo
447 High Private Use Surrogates
451 Ideographic Description Characters
459 Latin Extended Additional
465 Mathematical Alphanumeric Symbols
466 Mathematical Operators
467 Miscellaneous Symbols
468 Miscellaneous Technical
475 Optical Character Recognition
481 Spacing Modifier Letters
483 Superscripts and Subscripts
491 Unified Canadian Aboriginal Syllabics
The special pattern C<\X> matches any extended Unicode sequence
(a "combining character sequence" in Standardese), where the first
character is a base character and subsequent characters are mark
characters that apply to the base character. It is equivalent to
C<(?:\PM\pM*)>.
505 The C<tr///> operator translates characters instead of bytes. Note
506 that the C<tr///CU> functionality has been removed, as the interface
507 was a mistake. For similar functionality see pack('U0', ...) and
512 Case translation operators use the Unicode case translation tables
513 when provided character input. Note that C<uc()> (also known as C<\U>
514 in doublequoted strings) translates to uppercase, while C<ucfirst>
515 (also known as C<\u> in doublequoted strings) translates to titlecase
516 (for languages that make the distinction). Naturally the
517 corresponding backslash sequences have the same semantics.
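The difference between uppercase and titlecase shows up with a digraph
character that has three case forms. A minimal sketch, using the
Latin DZ-with-caron digraph as the sample:

```perl
use strict;
use warnings;

my $small = "\x{01C6}";       # LATIN SMALL LETTER DZ WITH CARON

my $upper = uc $small;        # uppercase form, U+01C4
my $title = ucfirst $small;   # titlecase form, U+01C5

printf "uc: U+%04X, ucfirst: U+%04X\n", ord($upper), ord($title);
```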
521 Most operators that deal with positions or lengths in the string will
522 automatically switch to using character positions, including
523 C<chop()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
524 C<sprintf()>, C<write()>, and C<length()>. Operators that
525 specifically don't switch include C<vec()>, C<pack()>, and
526 C<unpack()>. Operators that really don't care include C<chomp()>, as
527 well as any other operator that treats a string as a bucket of bits,
528 such as C<sort()>, and the operators dealing with filenames.
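A short sketch of position and length operators counting characters
rather than bytes. The string mixes ASCII with one wide character; the
names are illustrative.

```perl
use strict;
use warnings;

my $text = "a\x{263A}b";            # three characters

my $len    = length $text;          # counts characters
my $where  = index($text, "b");     # character offset of "b"
my $middle = substr($text, 1, 1);   # the whole wide character

my $copy = $text;
chop $copy;                         # removes one character

printf "length %d, index %d, middle U+%04X, after chop %d\n",
    $len, $where, ord($middle), length($copy);
```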
532 The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change,
533 since they're often used for byte-oriented formats. (Again, think
534 "C<char>" in the C language.) However, there is a new "C<U>" specifier
535 that will convert between UTF-8 characters and integers. (It works
536 outside of the utf8 pragma too.)
540 The C<chr()> and C<ord()> functions work on characters. This is like
541 C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and
542 C<unpack("C")>. In fact, the latter are how you now emulate
543 byte-oriented C<chr()> and C<ord()> for Unicode strings.
544 (Note that this reveals the internal UTF-8 encoding of strings and
545 you are not supposed to do that unless you know what you are doing.)
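The correspondence between C<chr()>/C<ord()> and the "C<U>" pack
specifier can be sketched as follows; the code point is an arbitrary
example.

```perl
use strict;
use warnings;

my $code = 0x263A;

my $from_chr  = chr($code);          # character from its number
my $from_pack = pack("U", $code);    # the same, via pack

my $back = ord($from_chr);           # number from the character

# unpack("U*") returns one integer per character.
my @codes = unpack("U*", "a" . $from_chr);

printf "round trip U+%04X; codes: %s\n",
    $back, join(" ", map { sprintf "U+%04X", $_ } @codes);
```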
The bit string operators C<& | ^ ~> can operate on character data.
However, for backward compatibility reasons (bit string operations
when the characters all are less than 256 in ordinal value) one should
not mix C<~> (the bit complement) and characters both less than 256 and
equal or greater than 256. Most importantly, De Morgan's laws
(C<~($x|$y) eq ~$x&~$y>, C<~($x&$y) eq ~$x|~$y>) won't hold.
Another way to look at this is that the complement cannot return
B<both> the 8-bit (byte) wide bit complement B<and> the full character
wide bit complement.
561 lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
567 the case mapping is from a single Unicode character to another
568 single Unicode character
572 the case mapping is from a single Unicode character to more
573 than one Unicode character
What doesn't yet work are the following cases:
583 the "final sigma" (Greek)
anything to do with locales (Lithuanian, Turkish, Azeri)
591 See the Unicode Technical Report #21, Case Mappings, for more details.
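The two supported kinds of case mapping can be illustrated with one
character each. This is only a sketch; the sample characters are a
Greek letter for the one-to-one case and a Latin ligature for the
one-to-many case.

```perl
use strict;
use warnings;

# One character to one character: capital delta to small delta.
my $one_to_one = lc "\x{0394}";

# One character to several: the "ff" ligature uppercases to two letters.
my $one_to_many = uc "\x{FB00}";

printf "lc gives U+%04X; uc gives %d characters (%s)\n",
    ord($one_to_one), length($one_to_many), $one_to_many;
```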
595 And finally, C<scalar reverse()> reverses by character rather than by byte.
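A minimal sketch of character-wise reversal, using a string with one
wide character in the middle:

```perl
use strict;
use warnings;

my $text     = "a\x{263A}b";
my $reversed = scalar reverse($text);

# The wide character stays intact; only the order of characters changes.
printf "first U+%04X, middle U+%04X, last U+%04X\n",
    map { ord } split //, $reversed;
```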
599 =head2 Character encodings for input and output
605 As of yet, there is no method for automatically coercing input and
606 output to some encoding other than UTF-8 or UTF-EBCDIC. This is planned
607 in the near future, however.
609 Whether an arbitrary piece of data will be treated as "characters" or
610 "bytes" by internal operations cannot be divined at the current time.
612 Use of locales with utf8 may lead to odd results. Currently there is
613 some attempt to apply 8-bit locale info to characters in the range
614 0..255, but this is demonstrably incorrect for locales that use
615 characters above that range (when mapped into Unicode). It will also
616 tend to run slower. Avoidance of locales is strongly encouraged.
618 =head1 UNICODE REGULAR EXPRESSION SUPPORT LEVEL
620 The following list of Unicode regular expression support describes
621 feature by feature the Unicode support implemented in Perl as of Perl
622 5.8.0. The "Level N" and the section numbers refer to the Unicode
623 Technical Report 18, "Unicode Regular Expression Guidelines".
629 Level 1 - Basic Unicode Support
631 2.1 Hex Notation - done [1]
632 Named Notation - done [2]
633 2.2 Categories - done [3][4]
634 2.3 Subtraction - MISSING [5][6]
635 2.4 Simple Word Boundaries - done [7]
636 2.5 Simple Loose Matches - MISSING [8]
637 2.6 End of Line - MISSING [9][10]
641 [ 3] . \p{Is...} \P{Is...}
642 [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
[ 6] can use look-ahead to emulate subtraction
645 [ 7] include Letters in word characters
646 [ 8] see UTR#21 Case Mappings
647 [ 9] see UTR#13 Unicode Newline Guidelines
648 [10] should do ^ and $ also on \x{2028} and \x{2029}
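Footnote [6] mentions emulating subtraction with look-ahead. One way
this might look, as a sketch: "any letter except an uppercase one" is
written as a negative look-ahead for C<\p{Lu}> in front of C<\p{L}>.
The pattern and variable names are illustrative.

```perl
use strict;
use warnings;

# \p{L} minus \p{Lu}: refuse uppercase first, then accept any letter.
my $letters_not_upper = qr/\A(?:(?!\p{Lu})\p{L})+\z/;

my $all_lower = ("abc" =~ $letters_not_upper) ? 1 : 0;
my $has_upper = ("aBc" =~ $letters_not_upper) ? 1 : 0;
my $greek     = ("\x{03B1}\x{03B2}" =~ $letters_not_upper) ? 1 : 0;

print "abc: $all_lower, aBc: $has_upper, Greek lowercase: $greek\n";
```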
652 Level 2 - Extended Unicode Support
654 3.1 Surrogates - MISSING
655 3.2 Canonical Equivalents - MISSING [11][12]
656 3.3 Locale-Independent Graphemes - MISSING [13]
657 3.4 Locale-Independent Words - MISSING [14]
658 3.5 Locale-Independent Loose Matches - MISSING [15]
660 [11] see UTR#15 Unicode Normalization
661 [12] have Unicode::Normalize but not integrated to regexes
662 [13] have \X but at this level . should equal that
663 [14] need three classes, not just \w and \W
664 [15] see UTR#21 Case Mappings
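Since canonical equivalence is not built into the regex engine, one
possible workaround is to normalize both sides with
C<Unicode::Normalize> before matching. A sketch, with illustrative
variable names:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{E9}";      # e-acute as one character
my $decomposed = "e\x{301}";    # "e" followed by a combining acute

# The two spellings are canonically equivalent but not string-equal.
my $raw_equal = ($composed eq $decomposed) ? 1 : 0;

# After normalizing to the same form they compare and match alike.
my $nfc_equal = (NFC($decomposed) eq $composed) ? 1 : 0;
my $nfd_match = (NFD($composed) =~ /\Ae\x{301}\z/) ? 1 : 0;

print "raw: $raw_equal, NFC: $nfc_equal, NFD match: $nfd_match\n";
```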
668 Level 3 - Locale-Sensitive Support
670 4.1 Locale-Dependent Categories - MISSING
671 4.2 Locale-Dependent Graphemes - MISSING [16][17]
672 4.3 Locale-Dependent Words - MISSING
673 4.4 Locale-Dependent Loose Matches - MISSING
674 4.5 Locale-Dependent Ranges - MISSING
676 [16] see UTR#10 Unicode Collation Algorithms
677 [17] have Unicode::Collate but not integrated to regexes
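C<Unicode::Collate> can be used on its own for ordering, even though
it is not integrated with regular expressions. The sketch below
contrasts it with the built-in code point sort; it assumes the
module's default collation data is installed alongside it.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = ("z", "\x{E9}", "a");   # "\x{E9}" is e-acute

# Built-in sort compares code points: e-acute sorts after "z".
my @by_codepoint = sort @words;

# The collator follows the Unicode Collation Algorithm instead.
my $collator     = Unicode::Collate->new;
my @by_collation = $collator->sort(@words);

print "code point order: ",
      join(" ", map { sprintf "U+%04X", ord } @by_codepoint), "\n";
print "collation order:  ",
      join(" ", map { sprintf "U+%04X", ord } @by_collation), "\n";
```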
683 L<bytes>, L<utf8>, L<perlretut>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">