perlunicode - Unicode support in Perl

=head2 Important Caveats

WARNING: While the implementation of Unicode support in Perl is now
fairly complete, it is still evolving to some extent.

In particular, the way Unicode is handled on EBCDIC platforms is still
rather experimental. On such a platform, references to UTF-8 encoding
in this document and elsewhere should be read as meaning UTF-EBCDIC as
specified in Unicode Technical Report 16, unless ASCII vs EBCDIC issues
are specifically discussed. There is no C<utfebcdic> pragma or
":utfebcdic" layer; rather, "utf8" and ":utf8" are re-used to mean
the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic> for
more discussion of the issues.

The following areas are still under development.

=item Input and Output Disciplines

A filehandle can be marked as containing perl's internal Unicode
encoding (UTF-8 or UTF-EBCDIC) by opening it with the ":utf8" layer.
Other encodings can be converted to perl's encoding on input, or from
perl's encoding on output, by use of the ":encoding()" layer. There is
not yet a clean way to mark the Perl source itself as being in a
particular encoding.

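
As a minimal sketch of the layers described above, the following writes
one character through an ":encoding(UTF-8)" layer and reads it back.
The in-memory filehandle and the variable names are assumptions made
for illustration only; a real program would normally open a file.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";    # one character, U+263A

# Encode on output: characters become UTF-8 bytes in an in-memory file.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$out} $smiley;
close($out) or die "close failed: $!";

# Decode on input: UTF-8 bytes become characters again.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open failed: $!";
my $roundtrip = <$in>;
close($in) or die "close failed: $!";

printf "bytes written: %d, characters read: %d\n",
    length($buffer), length($roundtrip);
```

On an ASCII platform the buffer should hold three bytes, while the
string read back should hold a single character.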
=item Regular Expressions

The regular expression compiler now attempts to produce
polymorphic opcodes. That is, the pattern should now adapt to the data
and automatically switch to the Unicode character scheme when
presented with Unicode data, or to a traditional byte scheme when
presented with byte data. The implementation is still new and
(particularly on EBCDIC platforms) may need further work.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

The C<utf8> pragma implements the tables used for Unicode support.
These tables are automatically loaded on demand, so the C<utf8> pragma
need not normally be used.

However, as a compatibility measure, this pragma must be explicitly
used to enable recognition of UTF-8 in the Perl scripts themselves on
ASCII based machines, or of UTF-EBCDIC on EBCDIC based machines.
B<NOTE: this should be the only place where an explicit C<use utf8> is
needed>.

You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.

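
The effect of C<use utf8> on literals in the script can be sketched as
follows. This assumes the script file itself is saved as UTF-8 on an
ASCII platform; the variable names are illustrative.

```perl
use strict;
use warnings;

my ($decoded, $raw);
{
    use utf8;       # the literal is read as UTF-8 encoded characters
    $decoded = "☺";
}
{
    no utf8;        # the literal is read as a sequence of raw bytes
    $raw = "☺";
}

printf "with utf8: %d, without utf8: %d\n",
    length($decoded), length($raw);
```

With the pragma the literal should be one character long; without it,
the same source text should yield the three bytes of its UTF-8 encoding.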
=head2 Byte and Character semantics

Beginning with version 5.6, Perl uses logically wide characters to
represent strings internally. This internal representation of strings
uses either the UTF-8 or the UTF-EBCDIC encoding.

In future, Perl-level operations can be expected to work with
characters rather than bytes, in general.

However, as strictly an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data is characters, Perl now switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility, and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations, but only as long as
none of the program's inputs are marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

If the C<-C> command line switch is used (or the
${^WIDE_SYSTEM_CALLS} global flag is set to C<1>), all system calls
will use the corresponding wide character APIs. Note that this is
currently only implemented on Windows, since other platforms lack an
API standard in this area.

Regardless of the above, the C<bytes> pragma can always be used to
force byte semantics in a particular lexical scope. See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required until a future version of Perl
in which character semantics will become the default. This pragma may
then become a no-op. See L<utf8>.

Unless mentioned otherwise, Perl operators will use character semantics
when they are dealing with Unicode data, and byte semantics otherwise.
Thus, character semantics for these operations apply transparently; if
the input data came from a Unicode source (for example, by adding a
character encoding discipline to the filehandle whence it came, or a
literal UTF-8 string constant in the program), character semantics
apply; otherwise, byte semantics are in effect. To force byte semantics
on Unicode data, the C<bytes> pragma should be used.

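
A small sketch of forcing byte semantics with the C<bytes> pragma, as
described above. It assumes an ASCII platform, where U+263A is stored
as three UTF-8 bytes; the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "\x{263A}";          # a single character above 255

my $characters = length($text); # character semantics
my $octets;
{
    use bytes;                  # byte semantics in this block only
    $octets = length($text);
}

printf "characters: %d, bytes: %d\n", $characters, $octets;
```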
Notice that if you concatenate strings with byte semantics and strings
with Unicode character data, the bytes will by default be upgraded
I<as if they were ISO 8859-1 (Latin-1)> (or if in EBCDIC, after a
translation to ISO 8859-1). To change this, use the C<encoding>
pragma; see L<encoding>.

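
The upgrade described above can be observed with a short sketch. It
assumes an ASCII platform and no C<encoding> pragma in effect; the
variable names are illustrative.

```perl
use strict;
use warnings;

my $byte_string = "\xE9";       # one byte, no Unicode data involved
my $char_string = "\x{263A}";   # Unicode character data

# The byte 0xE9 is treated as Latin-1, so it becomes character U+00E9.
my $joined = $byte_string . $char_string;

printf "length: %d, first character: U+%04X\n",
    length($joined), ord(substr($joined, 0, 1));
```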
Under character semantics, many operations that formerly operated on
bytes change to operating on characters. For ASCII data this makes no
difference, because UTF-8 stores ASCII in single bytes, but for any
character greater than C<chr(127)>, the character B<may> be stored in
a sequence of two or more bytes, all of which have the high bit set.

For C1 controls or Latin 1 characters on an EBCDIC platform the
character may be stored in a UTF-EBCDIC multi byte sequence. But by
and large, the user need not worry about this, because Perl hides it
from the user. A character in Perl is logically just a number ranging
from 0 to 2**32 or so. Larger characters encode to longer sequences
of bytes internally, but again, this is just an internal detail which
is hidden at the Perl level.

=head2 Effects of character semantics

Character semantics have the following effects:

Strings and patterns may contain characters that have an ordinal value
larger than 255.

Presuming you use a Unicode editor to edit your program, such
characters will typically occur directly within the literal strings as
UTF-8 (or UTF-EBCDIC on EBCDIC platforms) characters, but you can also
specify a particular character with an extension of the C<\x>
notation. UTF-X characters are specified by putting the hexadecimal
code within curlies after the C<\x>. For instance, a Unicode smiley
is C<\x{263A}>.

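
As a brief illustration of the C<\x{...}> notation, the sketch below
uses it in both a string literal and a pattern. The code point U+263A
(a smiley) and the variable names are chosen for the example.

```perl
use strict;
use warnings;

my $greeting = "Hello \x{263A}";    # six ASCII characters plus U+263A

# The same notation is accepted inside a pattern.
my $found = $greeting =~ /\x{263A}$/ ? 1 : 0;

printf "length: %d, last character: U+%04X, found: %d\n",
    length($greeting), ord(substr($greeting, -1)), $found;
```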
Identifiers within the Perl script may contain Unicode alphanumeric
characters, including ideographs. (You are currently on your own when
it comes to using the canonical forms of characters--Perl doesn't
(yet) attempt to canonicalize variable names for you.)

Regular expressions match characters instead of bytes. For instance,
"." matches a character instead of a byte. (However, the C<\C> pattern
is provided to force a match of a single byte ("C<char>" in C, hence C<\C>).)

Character classes in regular expressions match characters instead of
bytes, and match against the character properties specified in the
Unicode properties database. So C<\w> can be used to match an
ideograph, for instance.

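
The two regular expression effects just described can be checked with
a short sketch. The ideograph U+4E2D is only an example; it occupies
three bytes in UTF-8 but is a single character to the regex engine.

```perl
use strict;
use warnings;

my $ideograph = "\x{4E2D}";     # a CJK ideograph

# "." consumes the whole character, not one of its bytes.
my $dot_matches  = $ideograph =~ /^.$/  ? 1 : 0;

# \w consults the Unicode properties, so an ideograph is a word character.
my $word_matches = $ideograph =~ /^\w$/ ? 1 : 0;

printf "dot: %d, word: %d\n", $dot_matches, $word_matches;
```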
Named Unicode properties and block ranges may be used as character
classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
match property) constructs. For instance, C<\p{Lu}> matches any
character with the Unicode uppercase property, while C<\p{M}> matches
any mark character. Single letter properties may omit the braces,
so C<\p{M}> can also be written C<\pM>. Many predefined character classes
are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.

The C<\p{Is...}> forms test for "general properties" such as "letter"
and "digit", while the C<\p{In...}> forms test for Unicode scripts and
blocks.

The official Unicode script and block names use spaces and dashes as
separators, but for convenience you can have dashes, spaces, and
underbars at every word division, and you need not care about correct
casing. It is recommended, however, that for consistency you use the
following naming: the official Unicode script, block, or property name
(see below for the additional rules that apply to block names),
with whitespace and dashes replaced with underbars, and the words
"uppercase-first-lowercase-rest". That is, "Latin-1 Supplement"
becomes "Latin_1_Supplement".

You can also negate both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first curly and the property name: C<\p{^In_Tamil}> is
equal to C<\P{In_Tamil}>.

The C<In> and C<Is> can be left out: C<\p{Greek}> is equal to
C<\p{In_Greek}>, C<\P{Pd}> is equal to C<\P{IsPd}>.

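
The property constructs described above can be exercised with a few
one-character tests. This is only a sketch; the sample characters and
variable names are assumptions, and the short forms without C<In> or
C<Is> are used throughout.

```perl
use strict;
use warnings;

my $upper     = "A"        =~ /^\p{Lu}$/    ? 1 : 0;  # uppercase letter
my $mark      = "\x{0301}" =~ /^\pM$/       ? 1 : 0;  # combining acute accent
my $not_upper = "a"        =~ /^\P{Lu}$/    ? 1 : 0;  # \P{} negates
my $caret     = "a"        =~ /^\p{^Lu}$/   ? 1 : 0;  # caret negates as well
my $greek     = "\x{03B1}" =~ /^\p{Greek}$/ ? 1 : 0;  # Greek small alpha

printf "upper=%d mark=%d not_upper=%d caret=%d greek=%d\n",
    $upper, $mark, $not_upper, $caret, $greek;
```

Each of the five tests is expected to succeed.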
    Pc      Connector_Punctuation
    Pi      Initial_Punctuation
            (may behave like Ps or Pe depending on usage)
            (may behave like Ps or Pe depending on usage)
    Zp      Paragraph_Separator

There's also C<L&> which is an alias for C<Ll>, C<Lu>, and C<Lt>.

The following reserved ranges have C<In> tests:

    CJK_Ideograph_Extension_A
    Non_Private_Use_High_Surrogate
    Private_Use_High_Surrogate
    CJK_Ideograph_Extension_B

For example C<"\x{AC00}" =~ /\p{HangulSyllable}/> will test true.
(Handling of surrogates is not implemented yet, because Perl
uses UTF-8 and not UTF-16 internally to represent Unicode.)

Additionally, because scripts differ in their directionality
(for example Hebrew is written right to left), all characters
have their directionality defined:

    BidiLRE     Left-to-Right Embedding
    BidiLRO     Left-to-Right Override
    BidiAL      Right-to-Left Arabic
    BidiRLE     Right-to-Left Embedding
    BidiRLO     Right-to-Left Override
    BidiPDF     Pop Directional Format
    BidiEN      European Number
    BidiES      European Number Separator
    BidiET      European Number Terminator
    BidiCS      Common Number Separator
    BidiNSM     Non-Spacing Mark
    BidiBN      Boundary Neutral
    BidiB       Paragraph Separator
    BidiS       Segment Separator
    BidiON      Other Neutrals

The scripts available for C<\p{In...}> and C<\P{In...}>, for example
C<\p{InLatin}> or C<\P{InHan}>, are as follows:

There are also extended property classes that supplement the basic
properties, defined by the F<PropList> Unicode database:

    Noncharacter_Code_Point

and further derived properties:

    Alphabetic      Lu + Ll + Lt + Lm + Lo + Other_Alphabetic
    Lowercase       Ll + Other_Lowercase
    Uppercase       Lu + Other_Uppercase
    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
    ID_Continue     ID_Start + Mn + Mc + Nd + Pc
    Assigned        Any non-Cn character
    Common          Any character (or unassigned code point)
                    not explicitly assigned to a script

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
scripts concept is closer to natural languages, while the blocks
concept is more an artificial grouping based on groups of 256 Unicode
characters. For example, the C<Latin> script contains letters from
many blocks. On the other hand, the C<Latin> script does not contain
all the characters from those blocks. It does not, for example, contain
digits, because digits are shared across many scripts. Digits and
other similar groups, like punctuation, are in a category called
C<Common>.

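
The point about digits can be checked with a short sketch. It uses the
short script names C<\p{Latin}> and C<\p{Common}>; the sample
characters and hash names are illustrative.

```perl
use strict;
use warnings;

my (%is_latin, %is_common);
for my $char ('A', '1') {
    # A letter belongs to the Latin script; a digit is shared by many
    # scripts and is therefore classified as Common.
    $is_latin{$char}  = $char =~ /^\p{Latin}$/  ? 1 : 0;
    $is_common{$char} = $char =~ /^\p{Common}$/ ? 1 : 0;
}

printf "%s: Latin=%d Common=%d\n", $_, $is_latin{$_}, $is_common{$_}
    for ('A', '1');
```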
For more about scripts see the UTR #24:
http://www.unicode.org/unicode/reports/tr24/
For more about blocks see
http://www.unicode.org/Public/UNIDATA/Blocks.txt

Because there are overlaps in naming (there are, for example, both
a script called C<Katakana> and a block called C<Katakana>), the block
version has C<Block> appended to its name: C<\p{InKatakanaBlock}>.

Notice that this definition was introduced in Perl 5.8.0. In Perl
5.6 only the blocks were used; in Perl 5.8.0 scripts became the
preferred Unicode character class definition, which meant that
the definitions of some character classes changed (the ones in the
list below that have C<Block> appended).

    Alphabetic Presentation Forms
    Arabic Presentation Forms-A
    Arabic Presentation Forms-B
    Byzantine Musical Symbols
    CJK Compatibility Forms
    CJK Compatibility Ideographs
    CJK Compatibility Ideographs Supplement
    CJK Radicals Supplement
    CJK Symbols and Punctuation
    CJK Unified Ideographs
    CJK Unified Ideographs Extension A
    CJK Unified Ideographs Extension B
    Combining Diacritical Marks
    Combining Marks for Symbols
    Enclosed Alphanumerics
    Enclosed CJK Letters and Months
    Halfwidth and Fullwidth Forms
    Hangul Compatibility Jamo
    High Private Use Surrogates
    Ideographic Description Characters
    Latin Extended Additional
    Mathematical Alphanumeric Symbols
    Mathematical Operators
    Miscellaneous Symbols
    Miscellaneous Technical
    Optical Character Recognition
    Spacing Modifier Letters
    Superscripts and Subscripts
    Unified Canadian Aboriginal Syllabics

The special pattern C<\X> matches any extended Unicode sequence
(a "combining character sequence" in Standardese), where the first
character is a base character and subsequent characters are mark
characters that apply to the base character. It is equivalent to
C<(?:\PM\pM*)>.

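
A small sketch of C<\X> in use: the string below holds three
characters but only two combining character sequences. The sample
string and variable names are illustrative.

```perl
use strict;
use warnings;

# 'e' followed by a combining acute accent (U+0301), then a plain 'a'.
my $text = "e\x{0301}a";

# Each \X match takes a base character together with its marks.
my @sequences = $text =~ /\X/g;

printf "characters: %d, sequences: %d\n",
    length($text), scalar(@sequences);
```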
The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed, as the interface
was a mistake. For similar functionality see pack('U0', ...) and
pack('C0', ...).

Case translation operators use the Unicode case translation tables
when provided character input. Note that C<uc()> (also known as C<\U>
in doublequoted strings) translates to uppercase, while C<ucfirst>
(also known as C<\u> in doublequoted strings) translates to titlecase
(for languages that make the distinction). Naturally the
corresponding backslash sequences have the same semantics.

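
The difference between uppercase and titlecase can be sketched with a
character that has both forms. U+01C6 (LATIN SMALL LETTER DZ WITH
CARON) is chosen purely as an example; the variable names are
illustrative.

```perl
use strict;
use warnings;

my $small = "\x{01C6}";         # small dz-with-caron digraph

my $upper = uc($small);         # uppercase form
my $title = ucfirst($small);    # titlecase form

# The backslash sequences follow the same rules as the functions.
my $consistent = ("\U$small" eq $upper && "\u$small" eq $title) ? 1 : 0;

printf "uc: U+%04X, ucfirst: U+%04X, consistent: %d\n",
    ord($upper), ord($title), $consistent;
```

Here C<uc()> should give U+01C4 and C<ucfirst()> should give U+01C5.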
Most operators that deal with positions or lengths in the string will
automatically switch to using character positions, including
C<chop()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. Operators that
specifically don't switch include C<vec()>, C<pack()>, and
C<unpack()>. Operators that really don't care include C<chomp()>, as
well as any other operator that treats a string as a bucket of bits,
such as C<sort()>, and the operators dealing with filenames.

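
A brief sketch of character positions in C<length()>, C<substr()>,
C<index()>, and C<chop()>. The sample strings and variable names are
illustrative.

```perl
use strict;
use warnings;

my $text = "ab\x{263A}cd";              # U+263A sits at position 2

my $length   = length($text);           # counted in characters
my $middle   = substr($text, 2, 1);     # the whole wide character
my $position = index($text, 'c');       # position after the wide character

my $copy = "ab\x{263A}";
chop($copy);                            # removes one character, not one byte

printf "length=%d middle=U+%04X index=%d chopped=%s\n",
    $length, ord($middle), $position, $copy;
```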
The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change,
since they're often used for byte-oriented formats. (Again, think
"C<char>" in the C language.) However, there is a new "C<U>" specifier
that will convert between UTF-8 characters and integers. (It works
outside of the utf8 pragma too.)

The C<chr()> and C<ord()> functions work on characters. This is like
C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and
C<unpack("C")>. In fact, the latter are how you now emulate
byte-oriented C<chr()> and C<ord()> for Unicode strings.
(Note that this reveals the internal UTF-8 encoding of strings and
you are not supposed to do that unless you know what you are doing.)

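
The correspondence between C<chr()>/C<ord()> and the "C<U>" specifier
can be sketched as follows. Only the character-oriented side is shown;
the code point and variable names are illustrative.

```perl
use strict;
use warnings;

my $char = chr(0x263A);                 # character from a code point
my $code = ord($char);                  # code point from a character

my $packed     = pack('U', 0x263A);     # same character via pack
my ($unpacked) = unpack('U', $char);    # same code point via unpack

printf "ord: U+%04X, unpack: U+%04X, pack matches chr: %d\n",
    $code, $unpacked, ($packed eq $char ? 1 : 0);
```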
The bit string operators C<& | ^ ~> can operate on character data.
However, for backward compatibility reasons (bit string operations
when the characters all are less than 256 in ordinal value) one should
not mix C<~> (the bit complement) and characters both less than 256 and
equal to or greater than 256. Most importantly, De Morgan's laws
(C<~($x|$y) eq ~$x&~$y>, C<~($x&$y) eq ~$x|~$y>) won't hold.
Another way to look at this is that the complement cannot return
B<both> the 8-bit (byte) wide bit complement B<and> the full character
wide bit complement.

lc(), uc(), lcfirst(), and ucfirst() work for the following cases:

the case mapping is from a single Unicode character to another
single Unicode character

the case mapping is from a single Unicode character to more
than one Unicode character

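
Both kinds of mapping listed above can be seen in a short sketch. The
Greek alpha and the "ff" ligature (U+FB00) are example inputs only.

```perl
use strict;
use warnings;

# One character to one character: small alpha to capital alpha.
my $single = uc("\x{03B1}");

# One character to more than one: the "ff" ligature has no single
# uppercase form, so it maps to the two letters "FF".
my $multi = uc("\x{FB00}");

printf "single: U+%04X (length %d), multi: %s (length %d)\n",
    ord($single), length($single), $multi, length($multi);
```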
What doesn't yet work are the following cases:

the "final sigma" (Greek)

anything to do with locales (Lithuanian, Turkish, Azeri)

See the Unicode Technical Report #21, Case Mappings, for more details.

And finally, C<scalar reverse()> reverses by character rather than by byte.

=head2 Character encodings for input and output

As of yet, there is no method for automatically coercing input and
output to some encoding other than UTF-8 or UTF-EBCDIC. This is planned
in the near future, however.

Whether an arbitrary piece of data will be treated as "characters" or
"bytes" by internal operations cannot be divined at the current time.

Use of locales with utf8 may lead to odd results. Currently there is
some attempt to apply 8-bit locale info to characters in the range
0..255, but this is demonstrably incorrect for locales that use
characters above that range (when mapped into Unicode). It will also
tend to run slower. Avoidance of locales is strongly encouraged.

=head1 UNICODE REGULAR EXPRESSION SUPPORT LEVEL

The following list of Unicode regular expression support describes
feature by feature the Unicode support implemented in Perl as of Perl
5.8.0. The "Level N" and the section numbers refer to the Unicode
Technical Report 18, "Unicode Regular Expression Guidelines".

    Level 1 - Basic Unicode Support

    2.1 Hex Notation              - done      [1]
        Named Notation            - done      [2]
    2.2 Categories                - done      [3][4]
    2.3 Subtraction               - MISSING   [5][6]
    2.4 Simple Word Boundaries    - done      [7]
    2.5 Simple Loose Matches      - MISSING   [8]
    2.6 End of Line               - MISSING   [9][10]

    [ 3] . \p{Is...} \P{Is...}
    [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
    [ 6] can use look-ahead to emulate subtraction
    [ 7] include Letters in word characters
    [ 8] see UTR#21 Case Mappings
    [ 9] see UTR#13 Unicode Newline Guidelines
    [10] should do ^ and $ also on \x{2028} and \x{2029}

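
Note [6] above mentions emulating subtraction with look-ahead. One
possible way to do so is sketched below for the set "letters minus
uppercase letters"; the pattern and the sample characters are
assumptions made for illustration, not a prescribed idiom.

```perl
use strict;
use warnings;

# The negative look-ahead rejects uppercase letters before \p{L}
# consumes the character, which leaves "any letter except uppercase".
my $non_upper_letter = qr/^(?!\p{Lu})\p{L}$/;

my %result;
for my $char ('a', 'A', '1', "\x{03B1}") {
    my $label = sprintf('U+%04X', ord($char));
    $result{$label} = $char =~ $non_upper_letter ? 1 : 0;
}

printf "%s => %d\n", $_, $result{$_} for sort keys %result;
```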
    Level 2 - Extended Unicode Support

    3.1 Surrogates                          - MISSING
    3.2 Canonical Equivalents               - MISSING   [11][12]
    3.3 Locale-Independent Graphemes        - MISSING   [13]
    3.4 Locale-Independent Words            - MISSING   [14]
    3.5 Locale-Independent Loose Matches    - MISSING   [15]

    [11] see UTR#15 Unicode Normalization
    [12] have Unicode::Normalize but not integrated to regexes
    [13] have \X but at this level . should equal that
    [14] need three classes, not just \w and \W
    [15] see UTR#21 Case Mappings

    Level 3 - Locale-Sensitive Support

    4.1 Locale-Dependent Categories         - MISSING
    4.2 Locale-Dependent Graphemes          - MISSING   [16][17]
    4.3 Locale-Dependent Words              - MISSING
    4.4 Locale-Dependent Loose Matches      - MISSING
    4.5 Locale-Dependent Ranges             - MISSING

    [16] see UTR#10 Unicode Collation Algorithms
    [17] have Unicode::Collate but not integrated to regexes

L<bytes>, L<utf8>, L<perlretut>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">