[p5sagit/p5-mst-13.2.git] / pod / perlunicode.pod
Commit Line Data
393fec97 1=head1 NAME
2
3perlunicode - Unicode support in Perl
4
5=head1 DESCRIPTION
6
0a1f2d14 7=head2 Important Caveats
21bad921 8
376d9008 9Unicode support is an extensive requirement. While Perl does not
c349b1b9 10implement the Unicode standard or the accompanying technical reports
11from cover to cover, Perl does support many Unicode features.
21bad921 12
2575c402 13People who want to learn to use Unicode in Perl should probably read
e4911a48 14L<the Perl Unicode tutorial, perlunitut|perlunitut>, before reading
15this reference document.
2575c402 16
13a2d996 17=over 4
21bad921 18
fae2c0fb 19=item Input and Output Layers
21bad921 20
376d9008 21Perl knows when a filehandle uses Perl's internal Unicode encodings
1bfb14c4 22(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
23the ":utf8" layer. Other encodings can be converted to Perl's
24encoding on input or from Perl's encoding on output by use of the
25":encoding(...)" layer. See L<open>.
c349b1b9 26
2575c402 27To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
21bad921 28
29=item Regular Expressions
30
c349b1b9 31The regular expression compiler produces polymorphic opcodes. That is,
376d9008 32the pattern adapts to the data and automatically switches to the Unicode
2575c402 33character scheme when presented with data that is internally encoded in
ac036724 34UTF-8, or instead uses a traditional byte scheme when presented with
2575c402 35byte data.
21bad921 36
ad0029c4 37=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
21bad921 38
376d9008 39As a compatibility measure, the C<use utf8> pragma must be explicitly
40included to enable recognition of UTF-8 in the Perl scripts themselves
1bfb14c4 41(in string or regular expression literals, or in identifier names) on
42ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
376d9008 43machines. B<These are the only times when an explicit C<use utf8>
8f8cf39c 44is needed.> See L<utf8>.
21bad921 45
7aa207d6 46=item BOM-marked scripts and UTF-16 scripts autodetected
47
48If a Perl script begins with a Unicode BOM (UTF-16LE, UTF-16BE,
49or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
50endianness, Perl will correctly read in the script as Unicode.
51(BOMless UTF-8 cannot be effectively recognized or differentiated from
52ISO 8859-1 or other eight-bit encodings.)
53
990e18f7 54=item C<use encoding> needed to upgrade non-Latin-1 byte strings
55
38a44b82 56By default, there is a fundamental asymmetry in Perl's Unicode model:
990e18f7 57implicit upgrading from byte strings to Unicode strings assumes that
58they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
59downgraded with UTF-8 encoding. This happens because the first 256
51f494cc 60codepoints in Unicode happen to agree with Latin-1.
990e18f7 61
990e18f7 62See L</"Byte and Character Semantics"> for more details.
63
21bad921 64=back
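
The layer behavior described under "Input and Output Layers" can be sketched
as follows. This is a minimal illustration, not a recommended idiom: it uses
in-memory filehandles so that nothing touches the disk, and the sample text
and every variable name are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical sample text: "caf", U+00E9, a space, and U+263A.
my $text = "caf\x{E9} \x{263A}";

# Encode on output: the layer turns characters into UTF-8 bytes.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

# Decode on input: the layer turns the UTF-8 bytes back into characters.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $round_trip = do { local $/; <$in> };
close($in);

my $byte_count = length $buffer;        # bytes written: 3 + 2 + 1 + 3
my $char_count = length $round_trip;    # characters read back
```

The byte count exceeds the character count because the two non-ASCII
characters occupy more than one byte each once encoded.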
65
376d9008 66=head2 Byte and Character Semantics
393fec97 67
376d9008 68Beginning with version 5.6, Perl uses logically-wide characters to
3e4dbfed 69represent strings internally.
393fec97 70
376d9008 71In future, Perl-level operations will be expected to work with
72characters rather than bytes.
393fec97 73
376d9008 74However, as an interim compatibility measure, Perl aims to
75daf61c 75provide a safe migration path from byte semantics to character
76semantics for programs. For operations where Perl can unambiguously
376d9008 77decide that the input data are characters, Perl switches to
75daf61c 78character semantics. For operations where this determination cannot
79be made without additional information from the user, Perl decides in
376d9008 80favor of compatibility and chooses to use byte semantics.
8cbd9a7a 81
51f494cc 82Under byte semantics, when C<use locale> is in effect, Perl uses the
e1b711da 83semantics associated with the current locale. Absent a C<use locale>, and
84absent a C<use feature 'unicode_strings'> pragma, Perl currently uses US-ASCII
85(or Basic Latin in Unicode terminology) byte semantics, meaning that characters
86whose ordinal numbers are in the range 128 - 255 are undefined except for their
87ordinal numbers. This means that none have case (upper and lower), nor are any
88a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
89to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)
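
A small sketch of the byte-semantics behavior just described, assuming
neither C<use locale> nor C<use feature 'unicode_strings'> is in effect. The
variable names are illustrative only. The same character is tested twice:
once as a native byte string and once after C<utf8::upgrade> has changed its
internal representation.

```perl
use strict;
use warnings;

my $native   = "\xE9";        # U+00E9 held as a single byte
my $upgraded = $native;
utf8::upgrade($upgraded);     # same character, now stored internally as UTF-8

# Byte semantics: the character is not a word character and has no case.
my $native_is_word    = $native =~ /\w/        ? 1 : 0;
my $native_has_case   = uc($native) ne $native ? 1 : 0;

# Character semantics: it is a letter with an uppercase form.
my $upgraded_is_word  = $upgraded =~ /\w/          ? 1 : 0;
my $upgraded_has_case = uc($upgraded) ne $upgraded ? 1 : 0;
```

The two strings compare equal with C<eq>, yet they behave differently, which
is the inconsistency discussed in L</The "Unicode Bug">.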
2bbc8d55 90
8cbd9a7a 91This behavior preserves compatibility with earlier versions of Perl,
376d9008 92which allowed byte semantics in Perl operations only if
e1b711da 93none of the program's inputs were marked as being a source of Unicode
8cbd9a7a 94character data. Such data may come from filehandles, from calls to
95external programs, from information provided by the system (such as %ENV),
21bad921 96or from literals and constants in the source text.
8cbd9a7a 97
376d9008 98The C<bytes> pragma will always, regardless of platform, force byte
99semantics in a particular lexical scope. See L<bytes>.
8cbd9a7a 100
e1b711da 101The C<use feature 'unicode_strings'> pragma is intended to always, regardless
102of platform, force Unicode semantics in a particular lexical scope. In
103release 5.12, it is partially implemented, applying only to case changes.
104See L</The "Unicode Bug"> below.
105
8cbd9a7a 106The C<utf8> pragma is primarily a compatibility device that enables
75daf61c 107recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
376d9008 108Note that this pragma is only required while Perl defaults to byte
109semantics; when character semantics become the default, this pragma
110may become a no-op. See L<utf8>.
111
112Unless explicitly stated, Perl operators use character semantics
113for Unicode data and byte semantics for non-Unicode data.
114The decision to use character semantics is made transparently. If
115input data comes from a Unicode source--for example, if a character
fae2c0fb 116encoding layer is added to a filehandle or a literal Unicode
376d9008 117string constant appears in a program--character semantics apply.
118Otherwise, byte semantics are in effect. The C<bytes> pragma should
e1b711da 119be used to force byte semantics on Unicode data, and the C<use feature
120'unicode_strings'> pragma to force Unicode semantics on byte data (though in
1215.12 it isn't fully implemented).
376d9008 122
123If strings operating under byte semantics and strings with Unicode
51f494cc 124character data are concatenated, the new string will have
42bde815 125character semantics. This can cause surprises: see L</BUGS>, below.
7dedd01f 126
feda178f 127Under character semantics, many operations that formerly operated on
376d9008 128bytes now operate on characters. A character in Perl is
feda178f 129logically just a number ranging from 0 to 2**31 or so. Larger
376d9008 130characters may encode into longer sequences of bytes internally, but
131this internal detail is mostly hidden for Perl code.
132See L<perluniintro> for more.
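
The concatenation rule and the Latin-1 assumption made during upgrading can
be observed with a short sketch. The strings and variable names are
assumptions chosen for illustration, and C<use bytes> appears only to expose
the internal byte count.

```perl
use strict;
use warnings;

my $latin1 = "caf\xE9";       # four characters, none above 255
my $wide   = "\x{263A}";      # one character above 255

# Joining the two yields a string with character semantics.
my $joined = $latin1 . $wide;

my $char_length = length $joined;                     # counts characters
my $byte_length = do { use bytes; length $joined };   # counts internal bytes
my $fourth_ord  = ord substr($joined, 3, 1);          # still 0xE9 after upgrading
```

The fourth character keeps its ordinal value because the upgrade treats the
original bytes as Latin-1 code points.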
393fec97 133
376d9008 134=head2 Effects of Character Semantics
393fec97 135
136Character semantics have the following effects:
137
138=over 4
139
140=item *
141
376d9008 142Strings--including hash keys--and regular expression patterns may
574c8022 143contain characters that have an ordinal value larger than 255.
393fec97 144
2575c402 145If you use a Unicode editor to edit your program, Unicode characters may
146occur directly within the literal strings in UTF-8 encoding, or UTF-16.
147(The former requires a BOM or C<use utf8>, the latter requires a BOM.)
3e4dbfed 148
195e542a 149Unicode characters can also be added to a string by using the C<\N{U+...}>
150notation. The Unicode code for the desired character, in hexadecimal,
151should be placed in the braces, after the C<U>. For instance, a smiley face is
6f335b04 152C<\N{U+263A}>.
153
195e542a 154Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
155above. For characters below 0x100 you may get byte semantics instead of
6f335b04 156character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
195e542a 157the additional problem that the value for such characters gives the EBCDIC
6f335b04 158character rather than the Unicode one.
3e4dbfed 159
160Additionally, if you
574c8022 161
3e4dbfed 162 use charnames ':full';
574c8022 163
1bfb14c4 164you can use the C<\N{...}> notation and put the official Unicode
165character name within the braces, such as C<\N{WHITE SMILING FACE}>.
6f335b04 166See L<charnames>.
376d9008 167
393fec97 168=item *
169
574c8022 170If an appropriate L<encoding> is specified, identifiers within the
171Perl script may contain Unicode alphanumeric characters, including
376d9008 172ideographs. Perl does not currently attempt to canonicalize variable
173names.
393fec97 174
393fec97 175=item *
176
1bfb14c4 177Regular expressions match characters instead of bytes. "." matches
2575c402 178a character instead of a byte.
393fec97 179
393fec97 180=item *
181
182Character classes in regular expressions match characters instead of
376d9008 183bytes and match against the character properties specified in the
1bfb14c4 184Unicode properties database. C<\w> can be used to match a Japanese
75daf61c 185ideograph, for instance.
393fec97 186
393fec97 187=item *
188
eb0cc9e3 189Named Unicode properties, scripts, and block ranges may be used like
376d9008 190character classes via the C<\p{}> "matches property" construct and
822502e5 191the C<\P{}> negation, "doesn't match property".
2575c402 192See L</"Unicode Character Properties"> for more details.
822502e5 193
194You can define your own character properties and use them
195in the regular expression with the C<\p{}> or C<\P{}> construct.
822502e5 196See L</"User-Defined Character Properties"> for more details.
197
198=item *
199
9f815e24 200The special pattern C<\X> matches a logical character, an "extended grapheme
201cluster" in Standardese. In Unicode what appears to the user to be a single
51f494cc 202character, for example an accented C<G>, may in fact be composed of a sequence
203of characters, in this case a C<G> followed by an accent character. C<\X>
204will match the entire sequence.
822502e5 205
206=item *
207
208The C<tr///> operator translates characters instead of bytes. Note
209that the C<tr///CU> functionality has been removed. For similar
210functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.
211
212=item *
213
214Case translation operators use the Unicode case translation tables
215when character input is provided. Note that C<uc()>, or C<\U> in
216interpolated strings, translates to uppercase, while C<ucfirst>,
217or C<\u> in interpolated strings, translates to titlecase in languages
e1b711da 218that make the distinction (which is equivalent to uppercase in languages
219without the distinction).
822502e5 220
221=item *
222
223Most operators that deal with positions or lengths in a string will
224automatically switch to using character positions, including
225C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
226C<sprintf()>, C<write()>, and C<length()>. An operator that
51f494cc 227specifically does not switch is C<vec()>. Operators that really don't
228care include operators that treat strings as a bucket of bits such as
822502e5 229C<sort()>, and operators dealing with filenames.
230
231=item *
232
51f494cc 233The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
822502e5 234used for byte-oriented formats. Again, think C<char> in the C language.
235
236There is a new C<U> specifier that converts between Unicode characters
237and code points. There is also a C<W> specifier that is the equivalent of
238C<chr>/C<ord> and properly handles character values even if they are above 255.
239
240=item *
241
242The C<chr()> and C<ord()> functions work on characters, similar to
243C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
244C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
245emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
246While these methods reveal the internal encoding of Unicode strings,
247that is not something one normally needs to care about at all.
248
249=item *
250
251The bit string operators, C<& | ^ ~>, can operate on character data.
252However, to stay backward compatible with bit string operations on
253strings whose characters are all less than 256 in ordinal value, one
254should not use C<~> (the bit complement) on a string that mixes
255characters below 256 with characters of 256 and above. Most importantly,
256De Morgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
257will not hold. The reason for this mathematical I<faux pas> is that
258the complement cannot return B<both> the 8-bit (byte-wide) bit
259complement B<and> the full character-wide bit complement.
260
261=item *
262
e1b711da 263You can define your own mappings to be used in lc(),
822502e5 264lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
822502e5 265See L</"User-Defined Case Mappings"> for more details.
266
267=back
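
Several of the effects listed above can be seen in a short sketch. The sample
strings and variable names are assumptions made for illustration; the
character names and code points are the standard Unicode ones.

```perl
use strict;
use warnings;
use charnames ':full';

# Three spellings of the same character, U+263A.
my $by_code = "\N{U+263A}";
my $by_hex  = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";

# Length and position operators count characters, not bytes.
my $string   = "I \x{263A} Perl";
my $length   = length $string;
my $position = index $string, 'Perl';
my $smiley   = substr $string, 2, 1;

# \X matches a base character plus its combining marks as one unit.
my $accented = "G\x{301}";              # G followed by COMBINING ACUTE ACCENT
my @clusters = "${accented}o" =~ /(\X)/g;

# uc() uppercases, while ucfirst() titlecases where the two differ.
my $dz    = "\x{1C6}";                  # LATIN SMALL LETTER DZ WITH CARON
my $upper = uc $dz;                     # U+01C4
my $title = ucfirst $dz;                # U+01C5
```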
268
269=over 4
270
271=item *
272
273And finally, C<scalar reverse()> reverses by character rather than by byte.
274
275=back
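
The character-wise behavior of C<scalar reverse()> and of the C<pack()>
letters C<W> and C<U> can be sketched like this. The sample string and the
variable names are illustrative assumptions.

```perl
use strict;
use warnings;

my $string   = "ab\x{263A}";
my $reversed = scalar reverse $string;    # reversed character by character

# "W" mirrors ord() and chr(), including for values above 255.
my @ordinals = unpack 'W*', $string;
my $rebuilt  = pack 'W*', @ordinals;

# "U" converts a Unicode code point into the corresponding character.
my $smiley   = pack 'U', 0x263A;
```

If the reversal had worked on bytes instead, the multi-byte internal encoding
of U+263A would have been scrambled.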
276
277=head2 Unicode Character Properties
278
51f494cc 279Most Unicode character properties are accessible by using regular expressions.
280They are used like character classes via the C<\p{}> "matches property"
281construct and the C<\P{}> negation, "doesn't match property".
282
283For instance, C<\p{Uppercase}> matches any character with the Unicode
284"Uppercase" property, while C<\p{L}> matches any character with a
285General_Category of "L" (Letter). Braces are not
286required for single-letter properties, so C<\p{L}> is equivalent to C<\pL>.
287
e1b711da 288More formally, C<\p{Uppercase}> matches any character whose Unicode Uppercase
289property value is True, and C<\P{Uppercase}> matches any character whose
290Uppercase property value is False, and they could have been written as
51f494cc 291C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
292
293This formality is needed when properties are not binary, that is, if they can
294take on more values than just True and False. For example, the Bidi_Class property (see
295L</"Bidirectional Character Types"> below) can take on a number of different
296values, such as Left, Right, Whitespace, and others. To match these, one needs
e1b711da 297to specify the property name (Bidi_Class), and the value being matched against
9f815e24 298(Left, Right, I<etc.>). This is done, as in the examples above, by having the
299two components separated by an equal sign (or interchangeably, a colon), like
51f494cc 300C<\p{Bidi_Class: Left}>.
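
As a rough illustration of the single and compound forms, the following
sketch records whether a few sample characters match. The C<%result> hash and
its keys are hypothetical names, and the Bidi_Class values are written with
the short names C<L> and C<R> from the table in
L</"Bidirectional Character Types">.

```perl
use strict;
use warnings;

my $alef = "\x{5D0}";    # HEBREW LETTER ALEF, written right to left

my %result = (
    upper_single   => ('A'   =~ /\p{Uppercase}/       ? 1 : 0),
    upper_compound => ('A'   =~ /\p{Uppercase=True}/  ? 1 : 0),
    not_upper      => ('a'   =~ /\P{Uppercase}/       ? 1 : 0),
    false_compound => ('a'   =~ /\p{Uppercase=False}/ ? 1 : 0),
    letter_braces  => ('a'   =~ /\p{L}/               ? 1 : 0),
    letter_bare    => ('a'   =~ /\pL/                 ? 1 : 0),
    alef_is_right  => ($alef =~ /\p{Bidi_Class: R}/   ? 1 : 0),
    alef_is_left   => ($alef =~ /\p{Bidi_Class=L}/    ? 1 : 0),
);
```

Every entry is expected to be true except C<alef_is_left>.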
301
302All Unicode-defined character properties may be written in these compound forms
303of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some
304additional properties that are written only in the single form, as well as
305single-form short-cuts for all binary properties and certain others described
306below, in which you may omit the property name and the equals or colon
307separator.
308
309Most Unicode character properties have at least two synonyms (or aliases if you
310prefer): a short one that is easier to type, and a longer one that is more
311descriptive and hence easier to understand. Thus the "L"
312and "Letter" above are equivalent and can be used interchangeably. Likewise,
313"Upper" is a synonym for "Uppercase", and we could have written
314C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
315various synonyms for the values the property can be. For binary properties,
316"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" correspondingly has "F",
317"No", and "N". But be careful. A short form of a value for one property may
e1b711da 318not mean the same thing as the same short form for another. Thus, for the
51f494cc 319General_Category property, "L" means "Letter", but for the Bidi_Class property,
320"L" means "Left". A complete list of properties and synonyms is in
321L<perluniprops>.
322
323Upper/lower case differences in the property names and values are irrelevant,
324thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
325Similarly, you can add or subtract underscores anywhere in the middle of a
326word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
327is irrelevant adjacent to non-word characters, such as the braces and the equals
328or colon separators so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
329equivalent to these as well. In fact, in most cases, white space and even
330hyphens can be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
331equivalent. All this is called "loose-matching" by Unicode. The few places
332where stricter matching is employed are in the middle of numbers, and in the Perl
333extension properties that begin or end with an underscore. Stricter matching
334cares about white space (except adjacent to the non-word characters) and
335hyphens, and non-interior underscores.
4193bef7 336
376d9008 337You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
338(^) between the first brace and the property name: C<\p{^Tamil}> is
eb0cc9e3 339equal to C<\P{Tamil}>.
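
Loose matching and caret negation can be checked with a sketch like the one
below. The list of spellings is an assumption drawn from the examples above,
and each pattern is built at run time so that all spellings go through the
same code path.

```perl
use strict;
use warnings;

# Hypothetical selection of spellings that loose matching treats alike.
my @spellings = ('Upper', 'upper', 'UpPeR', 'U_p_p_e_r', ' Upper ', 'Uppercase=Yes');

# Count how many spellings match a capital "A".
my $matching = grep {
    my $pattern = '\p{' . $_ . '}';
    'A' =~ /\A$pattern\z/;
} @spellings;

# A caret inside \p{} negates the property, just as \P{} does.
my $tamil_a        = "\x{B85}";    # TAMIL LETTER A
my $caret_negates  = ('a' =~ /\p{^Tamil}/ && 'a' =~ /\P{Tamil}/) ? 1 : 0;
my $tamil_excluded = ($tamil_a =~ /\p{^Tamil}/)                   ? 1 : 0;
```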
4193bef7 340
51f494cc 341=head3 B<General_Category>
14bb0a9a 342
51f494cc 343Every Unicode character is assigned a general category, which is the "most
344usual categorization of a character" (from
345L<http://www.unicode.org/reports/tr44>).
822502e5 346
9f815e24 347The compound way of writing these is like C<\p{General_Category=Number}>
51f494cc 348(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
349through the equal or colon separator is omitted. So you can instead just write
350C<\pN>.
822502e5 351
51f494cc 352Here are the short and long forms of the General Category properties:
393fec97 353
d73e5302 354 Short Long
355
356 L Letter
51f494cc 357 LC, L& Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
358 Lu Uppercase_Letter
359 Ll Lowercase_Letter
360 Lt Titlecase_Letter
361 Lm Modifier_Letter
362 Lo Other_Letter
d73e5302 363
364 M Mark
51f494cc 365 Mn Nonspacing_Mark
366 Mc Spacing_Mark
367 Me Enclosing_Mark
d73e5302 368
369 N Number
51f494cc 370 Nd Decimal_Number (also Digit)
371 Nl Letter_Number
372 No Other_Number
373
374 P Punctuation (also Punct)
375 Pc Connector_Punctuation
376 Pd Dash_Punctuation
377 Ps Open_Punctuation
378 Pe Close_Punctuation
379 Pi Initial_Punctuation
d73e5302 380 (may behave like Ps or Pe depending on usage)
51f494cc 381 Pf Final_Punctuation
d73e5302 382 (may behave like Ps or Pe depending on usage)
51f494cc 383 Po Other_Punctuation
d73e5302 384
385 S Symbol
51f494cc 386 Sm Math_Symbol
387 Sc Currency_Symbol
388 Sk Modifier_Symbol
389 So Other_Symbol
d73e5302 390
391 Z Separator
51f494cc 392 Zs Space_Separator
393 Zl Line_Separator
394 Zp Paragraph_Separator
d73e5302 395
396 C Other
51f494cc 397 Cc Control (also Cntrl)
e150c829 398 Cf Format
eb0cc9e3 399 Cs Surrogate (not usable)
51f494cc 400 Co Private_Use
e150c829 401 Cn Unassigned
1ac13f9a 402
376d9008 403Single-letter properties match all characters in any of the
3e4dbfed 404two-letter sub-properties starting with the same letter.
12ac2576 405C<LC> and C<L&> are special cases, which are aliases for the set of
406C<Ll>, C<Lu>, and C<Lt>.
32293815 407
eb0cc9e3 408Because Perl hides the need for the user to understand the internal
1bfb14c4 409representation of Unicode characters, there is no need to implement
410the somewhat messy concept of surrogates. C<Cs> is therefore not
eb0cc9e3 411supported.
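
The relationship between the long, short, and single-letter General_Category
forms can be sketched as follows. The C<%gc> hash and its keys are
illustrative names only.

```perl
use strict;
use warnings;

my $roman_eight = "\x{2167}";    # ROMAN NUMERAL EIGHT, General_Category Nl

my %gc = (
    long_form     => ('5' =~ /\p{General_Category=Number}/ ? 1 : 0),
    short_form    => ('5' =~ /\p{gc:n}/                    ? 1 : 0),
    shortcut      => ('5' =~ /\pN/                         ? 1 : 0),
    # A single-letter property covers all of its two-letter sub-properties.
    roman_is_n    => ($roman_eight =~ /\pN/     ? 1 : 0),
    roman_is_nl   => ($roman_eight =~ /\p{Nl}/  ? 1 : 0),
    roman_is_nd   => ($roman_eight =~ /\p{Nd}/  ? 1 : 0),
    # LC and L& both stand for the set of Ll, Lu, and Lt.
    lower_in_lc   => ('a' =~ /\p{LC}/ ? 1 : 0),
    lower_in_lamp => ('a' =~ /\p{L&}/ ? 1 : 0),
    lower_in_lu   => ('a' =~ /\p{Lu}/ ? 1 : 0),
);
```

Only C<roman_is_nd> and C<lower_in_lu> are expected to be false.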
d73e5302 412
51f494cc 413=head3 B<Bidirectional Character Types>
822502e5 414
376d9008 415Because scripts differ in their directionality--Hebrew is
12ac2576 416written right to left, for example--Unicode supplies these values for
51f494cc 417the Bidi_Class property:
32293815 418
eb0cc9e3 419 Property Meaning
92e830a9 420
12ac2576 421 L Left-to-Right
422 LRE Left-to-Right Embedding
423 LRO Left-to-Right Override
424 R Right-to-Left
51f494cc 425 AL Arabic Letter
12ac2576 426 RLE Right-to-Left Embedding
427 RLO Right-to-Left Override
428 PDF Pop Directional Format
429 EN European Number
51f494cc 430 ES European Separator
431 ET European Terminator
12ac2576 432 AN Arabic Number
51f494cc 433 CS Common Separator
12ac2576 434 NSM Non-Spacing Mark
435 BN Boundary Neutral
436 B Paragraph Separator
437 S Segment Separator
438 WS Whitespace
439 ON Other Neutrals
440
51f494cc 441This property is always written in the compound form.
442For example, C<\p{Bidi_Class:R}> matches characters that are normally
eb0cc9e3 443written right to left.
444
51f494cc 445=head3 B<Scripts>
446
e1b711da 447The world's languages are written in a number of scripts. This sentence
448(unless you're reading it in translation) is written in Latin, while Russian is
449written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
450Hiragana or Katakana. There are many more.
51f494cc 451
452The Unicode Script property gives what script a given character is in,
453and can be matched with the compound form like C<\p{Script=Hebrew}> (short:
454C<\p{sc=hebr}>). Perl furnishes shortcuts for all script names. You can omit
455everything up through the equals (or colon), and simply write C<\p{Latin}> or
456C<\P{Cyrillic}>.
457
458A complete list of scripts and their shortcuts is in L<perluniprops>.
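
A brief sketch of script matching in its compound and shortcut forms follows.
The sample characters and the C<%script> hash are assumptions for
illustration. Note also that some later Perl releases resolve the shortcut
form through the Script_Extensions property instead; for the characters used
here the outcome is the same either way.

```perl
use strict;
use warnings;

my $alef = "\x{5D0}";    # HEBREW LETTER ALEF
my $zhe  = "\x{436}";    # CYRILLIC SMALL LETTER ZHE

my %script = (
    hebrew_long     => ($alef =~ /\p{Script=Hebrew}/ ? 1 : 0),
    hebrew_short    => ($alef =~ /\p{sc=hebr}/       ? 1 : 0),
    hebrew_shortcut => ($alef =~ /\p{Hebrew}/        ? 1 : 0),
    latin_shortcut  => ('a'   =~ /\p{Latin}/         ? 1 : 0),
    not_cyrillic    => ('a'   =~ /\P{Cyrillic}/      ? 1 : 0),
    cyrillic        => ($zhe  =~ /\p{Cyrillic}/      ? 1 : 0),
    # Digits are shared across scripts, so they belong to Common.
    digit_common    => ('5'   =~ /\p{Script=Common}/ ? 1 : 0),
    digit_latin     => ('5'   =~ /\p{Latin}/         ? 1 : 0),
);
```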
459
51f494cc 460=head3 B<Use of "Is" Prefix>
822502e5 461
1bfb14c4 462For backward compatibility (with Perl 5.6), all properties mentioned
51f494cc 463so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
464example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
465C<\p{Arabic}>.
eb0cc9e3 466
51f494cc 467=head3 B<Blocks>
2796c109 468
1bfb14c4 469In addition to B<scripts>, Unicode also defines B<blocks> of
470characters. The difference between scripts and blocks is that the
471concept of scripts is closer to natural languages, while the concept
51f494cc 472of blocks is more of an artificial grouping based on groups of Unicode
9f815e24 473characters with consecutive ordinal values. For example, the "Basic Latin"
51f494cc 474block is all characters whose ordinals are between 0 and 127, inclusive, in
9f815e24 475other words, the ASCII characters. The "Latin" script contains some letters
476from this block as well as several more, like "Latin-1 Supplement",
477"Latin Extended-A", I<etc.>, but it does not contain all the characters from
51f494cc 478those blocks. It does not, for example, contain digits, because digits are
479shared across many scripts. Digits and similar groups, like punctuation, are in
480the script called C<Common>. There is also a script called C<Inherited> for
481characters that modify other characters, and inherit the script value of the
482controlling character.
483
484For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
485L<http://www.unicode.org/reports/tr24>
486
487The Script property is likely to be the one you want to use when processing
488natural language; the Block property may be useful in working with the nuts and
489bolts of Unicode.
490
491Block names are matched in the compound form, like C<\p{Block: Arrows}> or
492C<\p{Blk=Hebrew}>. Unlike most other properties only a few block names have a
493Unicode-defined short name. But Perl does provide a (slight) shortcut: You
494can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
495compatibility, the C<In> prefix may be omitted if there is no naming conflict
496with a script or any other property, and you can even use an C<Is> prefix
497instead in those cases. But it is not a good idea to do this, for a couple of
498reasons:
499
500=over 4
501
502=item 1
503
504It is confusing. There are many naming conflicts, and you may forget some.
9f815e24 505For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
51f494cc 506Hebrew. But would you remember that 6 months from now?
507
508=item 2
509
510It is unstable. A new version of Unicode may pre-empt the current meaning by
511creating a property with the same name. There was a time in very early Unicode
9f815e24 512releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
51f494cc 513doesn't.
32293815 514
393fec97 515=back
516
51f494cc 517Some people just prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
518instead of the shortcuts, for clarity, and because they can't remember the
519difference between 'In' and 'Is' anyway (or aren't confident that those who
520eventually will read their code will know).
521
522A complete list of blocks and their shortcuts is in L<perluniprops>.
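
The difference between blocks and scripts can be made concrete with a
character that sits in a Latin-named block without belonging to the Latin
script. This is only a sketch; the choice of U+00F7 and the C<%block> hash
are assumptions made for the example.

```perl
use strict;
use warnings;

my $arrow    = "\x{2190}";    # LEFTWARDS ARROW
my $alef     = "\x{5D0}";     # HEBREW LETTER ALEF
my $division = "\x{F7}";      # DIVISION SIGN
my $e_acute  = "\x{E9}";      # LATIN SMALL LETTER E WITH ACUTE

my %block = (
    arrows_compound   => ($arrow    =~ /\p{Block: Arrows}/             ? 1 : 0),
    arrows_in_prefix  => ($arrow    =~ /\p{In_Arrows}/                 ? 1 : 0),
    hebrew_block      => ($alef     =~ /\p{Blk=Hebrew}/                ? 1 : 0),
    # Same block, different scripts.
    division_block    => ($division =~ /\p{Block=Latin_1_Supplement}/  ? 1 : 0),
    division_latin    => ($division =~ /\p{Script=Latin}/              ? 1 : 0),
    division_common   => ($division =~ /\p{Script=Common}/             ? 1 : 0),
    e_acute_block     => ($e_acute  =~ /\p{Block=Latin_1_Supplement}/  ? 1 : 0),
    e_acute_latin     => ($e_acute  =~ /\p{Script=Latin}/              ? 1 : 0),
);
```

Only C<division_latin> is expected to be false: the division sign shares a
block with the accented letter but belongs to the C<Common> script.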
523
9f815e24 524=head3 B<Other Properties>
525
526There are many more properties than the very basic ones described here.
527A complete list is in L<perluniprops>.
528
529Unicode defines all its properties in the compound form, so all single-form
530properties are Perl extensions. A number of these are just synonyms for the
531Unicode ones, but some are genuine extensions, including a couple that are in
532the compound form. And quite a few of these are actually recommended by Unicode
533(in L<http://www.unicode.org/reports/tr18>).
534
535This section gives some details on all the extensions that aren't synonyms for
536compound-form Unicode properties (for those, you'll have to refer to the
537L<Unicode Standard|http://www.unicode.org/reports/tr44>).
538
539=over
540
541=item B<C<\p{All}>>
542
543This matches any of the 1_114_112 Unicode code points. It is a synonym for
544C<\p{Any}>.
545
546=item B<C<\p{Alnum}>>
547
548This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.
549
550=item B<C<\p{Any}>>
551
552This matches any of the 1_114_112 Unicode code points. It is a synonym for
553C<\p{All}>.
554
555=item B<C<\p{Assigned}>>
556
557This matches any assigned code point; that is, any code point whose general
558category is not Unassigned (or equivalently, not Cn).
559
560=item B<C<\p{Blank}>>
561
562This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
563spacing horizontally.
564
565=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)
566
567Matches a character that has a non-canonical decomposition.
568
569To understand the use of this rarely used property=value combination, it is
570necessary to know some basics about decomposition.
571Consider a character, say H. It could appear with various marks around it,
572such as an acute accent, or a circumflex, or various hooks, circles, arrows,
573I<etc.>, above, below, to one side and/or the other, I<etc.> There are many
574possibilities among the world's languages. The number of combinations is
575astronomical, and if there were a character for each combination, it would
576soon exhaust Unicode's more than a million possible characters. So Unicode
577took a different approach: there is a character for the base H, and a
578character for each of the possible marks, and they can be combined variously
579to get a final logical character. So a logical character--what appears to be a
580single character--can be a sequence of more than one individual character.
581This is called an "extended grapheme cluster". (Perl furnishes the C<\X>
582construct to match such sequences.)
583
584But Unicode's intent is to unify the existing character set standards and
585practices, and a number of pre-existing standards have single characters that
586mean the same thing as some of these combinations. An example is ISO-8859-1,
587which has quite a few of these in the Latin-1 range, an example being "LATIN
588CAPITAL LETTER E WITH ACUTE". Because this character was in this pre-existing
589standard, Unicode added it to its repertoire. But this character is considered
590by Unicode to be equivalent to the sequence consisting of first the character
591"LATIN CAPITAL LETTER E", then the character "COMBINING ACUTE ACCENT".
592
593"LATIN CAPITAL LETTER E WITH ACUTE" is called a "pre-composed" character, and
594the equivalence with the sequence is called canonical equivalence. All
595pre-composed characters are said to have a decomposition (into the equivalent
596sequence) and the decomposition type is also called canonical.
597
598However, many more characters have a different type of decomposition, a
599"compatible" or "non-canonical" decomposition. The sequences that form these
600decompositions are not considered canonically equivalent to the pre-composed
601character. An example, again in the Latin-1 range, is the "SUPERSCRIPT ONE".
602It is kind of like a regular digit 1, but not exactly; its decomposition
603into the digit 1 is called a "compatible" decomposition, specifically a
604"super" decomposition. There are several such compatibility
605decompositions (see L<http://www.unicode.org/reports/tr44>), including one
606called "compat" which means some miscellaneous type of decomposition
607that doesn't fit into the decomposition categories that Unicode has chosen.
608
609Note that most Unicode characters don't have a decomposition, so their
610decomposition type is "None".
611
612Perl has added the C<Non_Canonical> type, for your convenience, to mean any of
613the compatibility decompositions.
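
The two kinds of decomposition can be observed with the core
C<Unicode::Normalize> module, which this document does not otherwise discuss;
it is used here only as an assumption-laden illustration, and the variable
names are hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

my $e_acute = "\x{C9}";    # LATIN CAPITAL LETTER E WITH ACUTE
my $super_1 = "\x{B9}";    # SUPERSCRIPT ONE

# Canonical decomposition: E followed by COMBINING ACUTE ACCENT.
my $canonical = NFD($e_acute);

# Compatibility decomposition: NFD leaves the character alone, NFKD does not.
my $untouched = NFD($super_1);
my $compat    = NFKD($super_1);

my %dt = (
    super_non_canonical   => ($super_1 =~ /\p{Decomposition_Type: Non_Canonical}/ ? 1 : 0),
    e_acute_non_canonical => ($e_acute =~ /\p{Decomposition_Type: Non_Canonical}/ ? 1 : 0),
    e_acute_canonical     => ($e_acute =~ /\p{Decomposition_Type=Canonical}/      ? 1 : 0),
    plain_h_none          => ('H'      =~ /\p{Dt=None}/                           ? 1 : 0),
);
```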
614
615=item B<C<\p{Graph}>>
616
617Matches any character that is graphic. Theoretically, this means a character
618that on a printer would cause ink to be used.
619
620=item B<C<\p{HorizSpace}>>
621
622This is the same as C<\h> and C<\p{Blank}>: A character that changes the
623spacing horizontally.
624
625=item B<C<\p{In=*}>>
626
627This is a synonym for C<\p{Present_In=*}>.
628
629=item B<C<\p{PerlSpace}>>
630
631This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>.
632
633Mnemonic: Perl's (original) space.
634
635=item B<C<\p{PerlWord}>>
636
637This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
638
639Mnemonic: Perl's (original) word.
640
641=item B<C<\p{PosixAlnum}>>
642
643This matches any alphanumeric character in the ASCII range, namely
644C<[A-Za-z0-9]>.
645
646=item B<C<\p{PosixAlpha}>>
647
648This matches any alphabetic character in the ASCII range, namely C<[A-Za-z]>.
649
650=item B<C<\p{PosixBlank}>>
651
652This matches any blank character in the ASCII range, namely C<S<[ \t]>>.
653
654=item B<C<\p{PosixCntrl}>>
655
656This matches any control character in the ASCII range, namely C<[\x00-\x1F\x7F]>.
657
658=item B<C<\p{PosixDigit}>>
659
660This matches any digit character in the ASCII range, namely C<[0-9]>.
661
662=item B<C<\p{PosixGraph}>>
663
664This matches any graphical character in the ASCII range, namely C<[\x21-\x7E]>.
665
666=item B<C<\p{PosixLower}>>
667
668This matches any lowercase character in the ASCII range, namely C<[a-z]>.
669
670=item B<C<\p{PosixPrint}>>
671
672This matches any printable character in the ASCII range, namely C<[\x20-\x7E]>.
673These are the graphical characters plus SPACE.
674
675=item B<C<\p{PosixPunct}>>
676
677This matches any punctuation character in the ASCII range, namely
678C<[\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E]>. These are the
679graphical characters that aren't word characters. Note that the Posix standard
680includes in its definition of punctuation, those characters that Unicode calls
681"symbols."
682
683=item B<C<\p{PosixSpace}>>
684
685This matches any space character in the ASCII range, namely
686C<S<[ \f\n\r\t\x0B]>> (the last being a vertical tab).
687
688=item B<C<\p{PosixUpper}>>
689
690This matches any uppercase character in the ASCII range, namely C<[A-Z]>.
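
The ASCII restriction of the C<Posix> properties shows up when they are
compared with their unrestricted counterparts. The following is a minimal
sketch with hypothetical variable names.

```perl
use strict;
use warnings;

my $e_acute = "\x{E9}";    # LATIN SMALL LETTER E WITH ACUTE
my $nbsp    = "\x{A0}";    # NO-BREAK SPACE
my $dollar  = '$';

my %posix = (
    # Alphanumeric in Unicode terms, but outside the ASCII range.
    e_acute_alnum       => ($e_acute =~ /\p{Alnum}/      ? 1 : 0),
    e_acute_posix_alnum => ($e_acute =~ /\p{PosixAlnum}/ ? 1 : 0),
    ascii_posix_alnum   => ('e'      =~ /\p{PosixAlnum}/ ? 1 : 0),
    # Posix punctuation includes what Unicode calls symbols.
    dollar_posix_punct  => ($dollar  =~ /\p{PosixPunct}/ ? 1 : 0),
    dollar_currency     => ($dollar  =~ /\p{Sc}/         ? 1 : 0),
    # The vertical tab is Posix space; the no-break space is not ASCII.
    vt_posix_space      => ("\x0B"   =~ /\p{PosixSpace}/ ? 1 : 0),
    nbsp_posix_space    => ($nbsp    =~ /\p{PosixSpace}/ ? 1 : 0),
    nbsp_blank          => ($nbsp    =~ /\p{Blank}/      ? 1 : 0),
);
```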
691
692=item B<C<\p{Present_In: *}>> (Short: C<\p{In=*}>)
693
694This property is used when you need to know in what Unicode version(s) a
695character is present.
696
697The "*" above stands for some two-digit Unicode version number, such as
698C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>. This property will
699match the code points whose final disposition has been settled as of the
700Unicode release given by the version number; C<\p{Present_In: Unassigned}>
701will match those code points whose meaning has yet to be assigned.
702
703For example, C<U+0041> "LATIN CAPITAL LETTER A" was present in the very first
704Unicode release available, which is C<1.1>, so this property is true for all
705valid "*" versions. On the other hand, C<U+1EFF> was not assigned until version
5.1 when it became "LATIN SMALL LETTER Y WITH LOOP", so the only "*" values that
would match it are 5.1, 5.2, and later.
708
709Unicode furnishes the C<Age> property from which this is derived. The problem
710with Age is that a strict interpretation of it (which Perl takes) has it
711matching the precise release a code point's meaning is introduced in. Thus
712C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1. This is not usually what
713you want.
714
715Some non-Perl implementations of the Age property may change its meaning to be
716the same as the Perl Present_In property; just be aware of that.
717
718Another confusion with both these properties is that the definition is not
719that the code point has been assigned, but that the meaning of the code point
has been determined. This is because 66 code points will always be
unassigned, and so the Age for them is the Unicode version in which the
decision to make them so was made. For example, C<U+FDD0> is to be permanently
723unassigned to a character, and the decision to do that was made in version 3.1,
724so C<\p{Age=3.1}> matches this character and C<\p{Present_In: 3.1}> and up
725matches as well.
726
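The difference between the two properties can be checked directly. This is a
minimal sketch using the two code points discussed above; the hash name and
its keys are invented for the example.

    use strict;
    use warnings;

    my $y_loop = "\x{1EFF}";   # assigned in Unicode 5.1
    my $cap_a  = "A";          # U+0041, present since 1.1

    my %seen = (
        y_in_50  => ($y_loop =~ /\p{Present_In: 5.0}/ ? 1 : 0),
        y_in_51  => ($y_loop =~ /\p{Present_In: 5.1}/ ? 1 : 0),
        y_in_52  => ($y_loop =~ /\p{Present_In: 5.2}/ ? 1 : 0),
        y_age_51 => ($y_loop =~ /\p{Age=5.1}/         ? 1 : 0),
        y_age_52 => ($y_loop =~ /\p{Age=5.2}/         ? 1 : 0),
        a_in_52  => ($cap_a  =~ /\p{Present_In: 5.2}/ ? 1 : 0),
        a_age_52 => ($cap_a  =~ /\p{Age=5.2}/         ? 1 : 0),
    );

    print "$_=$seen{$_}\n" for sort keys %seen;

C<Present_In> should be true from the introducing version onwards, while
C<Age> should be true only for the introducing version itself.
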
727=item B<C<\p{Print}>>
728
ae5b72c8 729This matches any character that is graphical or blank, except controls.
9f815e24 730
731=item B<C<\p{SpacePerl}>>
732
733This is the same as C<\s>, including beyond ASCII.
734
4d4acfba 735Mnemonic: Space, as modified by Perl. (It doesn't include the vertical tab
736which both the Posix standard and Unicode consider to be space.)
9f815e24 737
738=item B<C<\p{VertSpace}>>
739
740This is the same as C<\v>: A character that changes the spacing vertically.
741
742=item B<C<\p{Word}>>
743
744This is the same as C<\w>, including beyond ASCII.
745
746=back
747
376d9008 748=head2 User-Defined Character Properties
491fd90a 749
51f494cc 750You can define your own binary character properties by defining subroutines
751whose names begin with "In" or "Is". The subroutines can be defined in any
752package. The user-defined properties can be used in the regular expression
753C<\p> and C<\P> constructs; if you are using a user-defined property from a
754package other than the one you are in, you must specify its package in the
755C<\p> or C<\P> construct.
bac0b425 756
    # assuming property IsForeign is defined in Lang::
bac0b425 758 package main; # property package name required
759 if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
760
761 package Lang; # property package name not required
762 if ($txt =~ /\p{IsForeign}+/) { ... }
763
764
765Note that the effect is compile-time and immutable once defined.
491fd90a 766
376d9008 767The subroutines must return a specially-formatted string, with one
768or more newline-separated lines. Each line must be one of the following:
491fd90a 769
770=over 4
771
772=item *
773
510254c9 774A single hexadecimal number denoting a Unicode code point to include.
775
776=item *
777
99a6b1f0 778Two hexadecimal numbers separated by horizontal whitespace (space or
376d9008 779tabular characters) denoting a range of Unicode code points to include.
491fd90a 780
781=item *
782
376d9008 783Something to include, prefixed by "+": a built-in character
bac0b425 784property (prefixed by "utf8::") or a user-defined character property,
785to represent all the characters in that property; two hexadecimal code
786points for a range; or a single hexadecimal code point.
491fd90a 787
788=item *
789
376d9008 790Something to exclude, prefixed by "-": an existing character
bac0b425 791property (prefixed by "utf8::") or a user-defined character property,
792to represent all the characters in that property; two hexadecimal code
793points for a range; or a single hexadecimal code point.
491fd90a 794
795=item *
796
Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters except those in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.
801
802=item *
803
Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.
491fd90a 808
809=back
810
811For example, to define a property that covers both the Japanese
812syllabaries (hiragana and katakana), you can define
813
814 sub InKana {
d5822f25 815 return <<END;
816 3040\t309F
817 30A0\t30FF
491fd90a 818 END
819 }
820
d5822f25 821Imagine that the here-doc end marker is at the beginning of the line.
822Now you can use C<\p{InKana}> and C<\P{InKana}>.
491fd90a 823
824You could also have used the existing block property names:
825
826 sub InKana {
827 return <<'END';
828 +utf8::InHiragana
829 +utf8::InKatakana
830 END
831 }
832
Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the unassigned code points:
836
837 sub InKana {
838 return <<'END';
839 +utf8::InHiragana
840 +utf8::InKatakana
841 -utf8::IsCn
842 END
843 }
844
845The negation is useful for defining (surprise!) negated classes.
846
847 sub InNotKana {
848 return <<'END';
849 !utf8::InHiragana
850 -utf8::InKatakana
851 +utf8::IsCn
852 END
853 }
854
bac0b425 855Intersection is useful for getting the common characters matched by
856two (or more) classes.
857
858 sub InFooAndBar {
859 return <<'END';
860 +main::Foo
861 &main::Bar
862 END
863 }
864
ac036724 865It's important to remember not to use "&" for the first set; that
bac0b425 866would be intersecting with nothing (resulting in an empty set).
867
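Putting these pieces together, here is a small self-contained sketch. The
property names C<InAsciiHex>, C<InAsciiDigit>, and C<InAsciiHexLetter> are
hypothetical and chosen only for illustration. Plain concatenated strings are
used instead of here-documents so that the example does not depend on where
the end marker is placed.

    use strict;
    use warnings;

    # Hypothetical property: the ASCII hexadecimal digits, as three ranges.
    sub InAsciiHex {
        return "0030\t0039\n"     # 0-9
             . "0041\t0046\n"     # A-F
             . "0061\t0066\n";    # a-f
    }

    # Hypothetical property: the ASCII decimal digits.
    sub InAsciiDigit {
        return "0030\t0039\n";
    }

    # Built from the two above: hex digits with the decimal digits removed.
    sub InAsciiHexLetter {
        return "+main::InAsciiHex\n"
             . "-main::InAsciiDigit\n";
    }

    my @words   = qw(deadBEEF 0x1F cafe 42);
    my @hex     = grep { /^\p{InAsciiHex}+$/ }       @words;
    my @letters = grep { /^\p{InAsciiHexLetter}+$/ } @words;

    print "@hex\n";       # deadBEEF cafe 42
    print "@letters\n";   # deadBEEF cafe
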
822502e5 868=head2 User-Defined Case Mappings
869
You can also define your own mappings to be used in lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
The principle is similar to that of user-defined character
properties: you define subroutines
with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
the first character in ucfirst()), and C<ToUpper> (for uc(), and the
rest of the characters in ucfirst()).
877
The string returned by the subroutines needs to be two hexadecimal numbers
separated by two tab characters: the two numbers being, respectively, the source
code point and the destination code point. For example:
3a2263fe 881
882 sub ToUpper {
883 return <<END;
51f494cc 884 0061\t\t0041
3a2263fe 885 END
886 }
887
defines a uc() mapping that causes only the character "a"
to be mapped to "A"; all other characters will remain unchanged.
3a2263fe 890
(For serious hackers only) The above means you have to furnish a complete
mapping; you can't just override a couple of characters and leave the rest
unchanged. You can find all the mappings in the directory
C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as the
here-document, and the C<utf8::ToSpecFoo> are special exception mappings
derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>. The "Digit" and
"Fold" mappings that one can see in the directory are not directly
user-accessible; one can use either the C<Unicode::UCD> module, or just match
case-insensitively (that's when the "Fold" mapping is used).
3a2263fe 900
51f494cc 901The mappings will only take effect on scalars that have been marked as having
902Unicode characters, for example by using C<utf8::upgrade()>.
903Old byte-style strings are not affected.
3a2263fe 904
51f494cc 905The mappings are in effect for the package they are defined in.
3a2263fe 906
376d9008 907=head2 Character Encodings for Input and Output
8cbd9a7a 908
7221edc9 909See L<Encode>.
8cbd9a7a 910
c29a771d 911=head2 Unicode Regular Expression Support Level
776f8809 912
376d9008 913The following list of Unicode support for regular expressions describes
914all the features currently supported. The references to "Level N"
8158862b 915and the section numbers refer to the Unicode Technical Standard #18,
916"Unicode Regular Expressions", version 11, in May 2005.
776f8809 917
918=over 4
919
920=item *
921
922Level 1 - Basic Unicode Support
923
8158862b 924 RL1.1 Hex Notation - done [1]
925 RL1.2 Properties - done [2][3]
926 RL1.2a Compatibility Properties - done [4]
927 RL1.3 Subtraction and Intersection - MISSING [5]
928 RL1.4 Simple Word Boundaries - done [6]
929 RL1.5 Simple Loose Matches - done [7]
930 RL1.6 Line Boundaries - MISSING [8]
931 RL1.7 Supplementary Code Points - done [9]
932
933 [1] \x{...}
934 [2] \p{...} \P{...}
e1b711da 935 [3] supports not only minimal list, but all Unicode character
936 properties (see L</Unicode Character Properties>)
8158862b 937 [4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
938 [5] can use regular expression look-ahead [a] or
939 user-defined character properties [b] to emulate set operations
940 [6] \b \B
e1b711da 941 [7] note that Perl does Full case-folding in matching (but with bugs),
      not Simple: for example U+1F88 is equivalent to U+1F00 U+03B9,
      not to U+1F80. This difference matters mainly for certain Greek
376d9008 944 capital letters with certain modifiers: the Full case-folding
945 decomposes the letter, while the Simple case-folding would map
e0f9d4a8 946 it to a single character.
8158862b 947 [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r),
948 CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029);
949 should also affect <>, $., and script line numbers;
950 should not split lines within CRLF [c] (i.e. there is no empty
951 line between \r and \n)
  [9] UTF-8/UTF-EBCDIC used in Perl allows not only U+10000 to U+10FFFF
      but also beyond U+10FFFF [d]
7207e29d 954
237bad5b 955[a] You can mimic class subtraction using lookahead.
8158862b 956For example, what UTS#18 might write as
29bdacb8 957
dbe420b4 958 [{Greek}-[{UNASSIGNED}]]
959
960in Perl can be written as:
961
1d81abf3 962 (?!\p{Unassigned})\p{InGreekAndCoptic}
963 (?=\p{Assigned})\p{InGreekAndCoptic}
dbe420b4 964
965But in this particular example, you probably really want
966
1bfb14c4 967 \p{GreekAndCoptic}
dbe420b4 968
969which will match assigned characters known to be part of the Greek script.
29bdacb8 970
Also see the L<Unicode::Regex::Set> module; it implements the full
UTS#18 grouping, intersection, union, and removal (subtraction) syntax.
973
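As a runnable sketch of the lookahead technique from [a], the following
emulates what UTS#18 might write as C<[{Greek}-[{Lu}]]>, that is, Greek
characters with the uppercase letters removed. The sample word and the
variable names are assumptions made for the example.

    use strict;
    use warnings;

    # A Greek word written with escapes: one capital, four small letters.
    my $word = "\x{391}\x{3B8}\x{3AE}\x{3BD}\x{3B1}";

    # The negative lookahead rejects uppercase letters before \p{Greek}
    # gets a chance to consume the character.
    my @not_upper = $word =~ /((?!\p{Lu})\p{Greek})/g;

    printf "%d of %d\n", scalar(@not_upper), length $word;   # 4 of 5
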
974[b] '+' for union, '-' for removal (set-difference), '&' for intersection
975(see L</"User-Defined Character Properties">)
976
977[c] Try the C<:crlf> layer (see L<PerlIO>).
5ca1ac52 978
c670e63a 979[d] U+FFFF will currently generate a warning message if 'utf8' warnings are
980 enabled
237bad5b 981
776f8809 982=item *
983
984Level 2 - Extended Unicode Support
985
8158862b 986 RL2.1 Canonical Equivalents - MISSING [10][11]
c670e63a 987 RL2.2 Default Grapheme Clusters - MISSING [12]
8158862b 988 RL2.3 Default Word Boundaries - MISSING [14]
989 RL2.4 Default Loose Matches - MISSING [15]
990 RL2.5 Name Properties - MISSING [16]
991 RL2.6 Wildcard Properties - MISSING
992
993 [10] see UAX#15 "Unicode Normalization Forms"
994 [11] have Unicode::Normalize but not integrated to regexes
e1b711da 995 [12] have \X but we don't have a "Grapheme Cluster Mode"
8158862b 996 [14] see UAX#29, Word Boundaries
997 [15] see UAX#21 "Case Mappings"
998 [16] have \N{...} but neither compute names of CJK Ideographs
999 and Hangul Syllables nor use a loose match [e]
1000
1001[e] C<\N{...}> allows namespaces (see L<charnames>).
776f8809 1002
1003=item *
1004
8158862b 1005Level 3 - Tailored Support
1006
1007 RL3.1 Tailored Punctuation - MISSING
1008 RL3.2 Tailored Grapheme Clusters - MISSING [17][18]
1009 RL3.3 Tailored Word Boundaries - MISSING
1010 RL3.4 Tailored Loose Matches - MISSING
1011 RL3.5 Tailored Ranges - MISSING
1012 RL3.6 Context Matching - MISSING [19]
1013 RL3.7 Incremental Matches - MISSING
1014 ( RL3.8 Unicode Set Sharing )
1015 RL3.9 Possible Match Sets - MISSING
1016 RL3.10 Folded Matching - MISSING [20]
1017 RL3.11 Submatchers - MISSING
1018
1019 [17] see UAX#10 "Unicode Collation Algorithms"
1020 [18] have Unicode::Collate but not integrated to regexes
1021 [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see
1022 outside of the target substring
1023 [20] need insensitive matching for linguistic features other than case;
1024 for example, hiragana to katakana, wide and narrow, simplified Han
1025 to traditional Han (see UTR#30 "Character Foldings")
776f8809 1026
1027=back
1028
c349b1b9 1029=head2 Unicode Encodings
1030
376d9008 1031Unicode characters are assigned to I<code points>, which are abstract
1032numbers. To use these numbers, various encodings are needed.
c349b1b9 1033
1034=over 4
1035
c29a771d 1036=item *
5cb3728c 1037
1038UTF-8
c349b1b9 1039
UTF-8 is a variable-length (1 to 6 bytes; current character allocations
require at most 4 bytes), byte-order independent encoding. For ASCII (and we
1042really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
1043transparent.
c349b1b9 1044
8c007b5a 1045The following table is from Unicode 3.2.
05632f9a 1046
e1b711da 1047 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
05632f9a 1048
e1b711da 1049 U+0000..U+007F 00..7F
1050 U+0080..U+07FF * C2..DF 80..BF
1051 U+0800..U+0FFF E0 * A0..BF 80..BF
ec90690f 1052 U+1000..U+CFFF E1..EC 80..BF 80..BF
1053 U+D000..U+D7FF ED 80..9F 80..BF
e1b711da 1054 U+D800..U+DFFF +++++++ utf16 surrogates, not legal utf8 +++++++
ec90690f 1055 U+E000..U+FFFF EE..EF 80..BF 80..BF
e1b711da 1056 U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF
1057 U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
1058 U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
1059
1060Note the gaps before several of the byte entries above marked by '*'. These are
1061caused by legal UTF-8 avoiding non-shortest encodings: it is technically
1062possible to UTF-8-encode a single code point in different ways, but that is
1063explicitly forbidden, and the shortest possible encoding should always be used
1064(and that is what Perl does).
37361303 1065
376d9008 1066Another way to look at it is via bits:
05632f9a 1067
1068 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
1069
1070 0aaaaaaa 0aaaaaaa
1071 00000bbbbbaaaaaa 110bbbbb 10aaaaaa
1072 ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa
1073 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa
1074
9f815e24 1075As you can see, the continuation bytes all begin with "10", and the
e1b711da 1076leading bits of the start byte tell how many bytes there are in the
05632f9a 1077encoded character.
1078
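To see the byte patterns from the tables above for particular code points,
something like the following can be used. It is only a sketch: C<utf8_hex> is
a hypothetical helper name, and it relies on the built-in C<utf8::encode>
function to turn a character string into its UTF-8 octets.

    use strict;
    use warnings;

    # Return the UTF-8 bytes of a code point as space-separated hex.
    sub utf8_hex {
        my ($code_point) = @_;
        my $bytes = chr $code_point;
        utf8::encode($bytes);        # characters -> UTF-8 octets, in place
        return join ' ', map { sprintf '%02X', ord } split //, $bytes;
    }

    printf "U+%04X  %s\n", $_, utf8_hex($_) for 0x41, 0xE9, 0x20AC, 0x1F600;

The four code points take one, two, three, and four bytes respectively, and
every byte after the first falls in the range 80..BF.
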
c29a771d 1079=item *
5cb3728c 1080
1081UTF-EBCDIC
dbe420b4 1082
376d9008 1083Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 1084
c29a771d 1085=item *
5cb3728c 1086
1e54db1a 1087UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
c349b1b9 1088
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
dbe420b4 1091
c349b1b9 1092UTF-16 is a 2 or 4 byte encoding. The Unicode code points
1bfb14c4 1093C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
1094points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9 1095using I<surrogates>, the first 16-bit unit being the I<high
1096surrogate>, and the second being the I<low surrogate>.
1097
376d9008 1098Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 1099range of Unicode code points in pairs of 16-bit units. The I<high
9f815e24 1100surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
376d9008 1101are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9 1102
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
1105
1106and the decoding is
1107
1a3fa709 1108 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 1109
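As a worked example of the two formulas, the sketch below converts U+1F600 to
a surrogate pair and back. The helper names C<to_surrogates> and
C<from_surrogates> are hypothetical. The division is wrapped in C<int()>
because Perl's C</> operator does not truncate unless C<use integer> is in
effect.

    use strict;
    use warnings;

    # Split a code point in U+10000..U+10FFFF into (high, low) surrogates.
    sub to_surrogates {
        my ($uni) = @_;
        my $offset = $uni - 0x10000;
        return (int($offset / 0x400) + 0xD800, $offset % 0x400 + 0xDC00);
    }

    # Combine a (high, low) surrogate pair back into a code point.
    sub from_surrogates {
        my ($hi, $lo) = @_;
        return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
    }

    my ($hi, $lo) = to_surrogates(0x1F600);
    printf "%04X %04X -> U+%04X\n", $hi, $lo, from_surrogates($hi, $lo);
    # D83D DE00 -> U+1F600
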
feda178f 1110If you try to generate surrogates (for example by using chr()), you
e1b711da 1111will get a warning, if warnings are turned on, because those code
376d9008 1112points are not valid for a Unicode character.
9466bab6 1113
376d9008 1114Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 1115itself can be used for in-memory computations, but if storage or
376d9008 1116transfer is required either UTF-16BE (big-endian) or UTF-16LE
1117(little-endian) encodings must be chosen.
c349b1b9 1118
1119This introduces another problem: what if you just know that your data
376d9008 1120is UTF-16, but you don't know which endianness? Byte Order Marks, or
1121BOMs, are a solution to this. A special character has been reserved
86bbd6d1 1122in Unicode to function as a byte order marker: the character with the
376d9008 1123code point C<U+FEFF> is the BOM.
042da322 1124
c349b1b9 1125The trick is that if you read a BOM, you will know the byte order,
376d9008 1126since if it was written on a big-endian platform, you will read the
1127bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1128you will read the bytes C<0xFF 0xFE>. (And if the originating platform
1129was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 1130
86bbd6d1 1131The way this trick works is that the character with the code point
376d9008 1132C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format". (Actually, C<U+FFFE> is legal for use by your program, even for
input/output, but better not use it if you need a BOM. But it is "illegal for
interchange", so that an unsuspecting program won't get confused.)
c349b1b9 1138
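The byte sequences described above are enough for a simple BOM check. The
following is a minimal sketch, not a complete detector: C<sniff_bom> is a
hypothetical helper, it looks only at the three signatures mentioned in this
item, and it returns nothing when none of them is present.

    use strict;
    use warnings;

    # Guess the encoding of a byte string from its leading BOM, if any.
    sub sniff_bom {
        my ($bytes) = @_;
        return 'UTF-8'    if substr($bytes, 0, 3) eq "\xEF\xBB\xBF";
        return 'UTF-16BE' if substr($bytes, 0, 2) eq "\xFE\xFF";
        return 'UTF-16LE' if substr($bytes, 0, 2) eq "\xFF\xFE";
        return;    # no BOM recognized
    }

    print sniff_bom("\xFF\xFEh\x00i\x00"), "\n";   # UTF-16LE
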
c29a771d 1139=item *
5cb3728c 1140
1e54db1a 1141UTF-32, UTF-32BE, UTF-32LE
c349b1b9 1142
The UTF-32 family is pretty much like the UTF-16 family, except that
042da322 1144the units are 32-bit, and therefore the surrogate scheme is not
376d9008 1145needed. The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
1146C<0xFF 0xFE 0x00 0x00> for LE.
c349b1b9 1147
c29a771d 1148=item *
5cb3728c 1149
1150UCS-2, UCS-4
c349b1b9 1151
86bbd6d1 1152Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 1153encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e 1154because it does not use surrogates. UCS-4 is a 32-bit encoding,
1155functionally identical to UTF-32.
c349b1b9 1156
c29a771d 1157=item *
5cb3728c 1158
1159UTF-7
c349b1b9 1160
376d9008 1161A seven-bit safe (non-eight-bit) encoding, which is useful if the
1162transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1163
95a1a48b 1164=back
1165
0d7c09bb 1166=head2 Security Implications of Unicode
1167
e1b711da 1168Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
1169Also, note the following:
1170
0d7c09bb 1171=over 4
1172
1173=item *
1174
1175Malformed UTF-8
bf0fa0b2 1176
1177Unfortunately, the specification of UTF-8 leaves some room for
1178interpretation of how many bytes of encoded output one should generate
376d9008 1179from one input Unicode character. Strictly speaking, the shortest
1180possible sequence of UTF-8 bytes should be generated,
1181because otherwise there is potential for an input buffer overflow at
feda178f 1182the receiving end of a UTF-8 connection. Perl always generates the
e1b711da 1183shortest length UTF-8, and with warnings on, Perl will warn about
376d9008 1184non-shortest length UTF-8 along with other malformations, such as the
1185surrogates, which are not real Unicode code points.
bf0fa0b2 1186
0d7c09bb 1187=item *
1188
1189Regular expressions behave slightly differently between byte data and
376d9008 1190character (Unicode) data. For example, the "word character" character
class C<\w> will work differently depending on whether the data is eight-bit
bytes or Unicode.
0d7c09bb 1193
376d9008 1194In the first case, the set of C<\w> characters is either small--the
1195default set of alphabetic characters, digits, and the "_"--or, if you
0d7c09bb 1196are using a locale (see L<perllocale>), the C<\w> might contain a few
1197more letters according to your language and country.
1198
376d9008 1199In the second case, the C<\w> set of characters is much, much larger.
1bfb14c4 1200Most importantly, even in the set of the first 256 characters, it will
1201probably match different characters: unlike most locales, which are
1202specific to a language and country pair, Unicode classifies all the
1203characters that are letters I<somewhere> as C<\w>. For example, your
1204locale might not think that LATIN SMALL LETTER ETH is a letter (unless
1205you happen to speak Icelandic), but Unicode does.
0d7c09bb 1206
376d9008 1207As discussed elsewhere, Perl has one foot (two hooves?) planted in
1bfb14c4 1208each of two worlds: the old world of bytes and the new world of
1209characters, upgrading from bytes to characters when necessary.
376d9008 1210If your legacy code does not explicitly use Unicode, no automatic
1211switch-over to characters should happen. Characters shouldn't get
1bfb14c4 1212downgraded to bytes, either. It is possible to accidentally mix bytes
1213and characters, however (see L<perluniintro>), in which case C<\w> in
1214regular expressions might start behaving differently. Review your
1215code. Use warnings and the C<strict> pragma.
0d7c09bb 1216
1217=back
1218
c349b1b9 1219=head2 Unicode in Perl on EBCDIC
1220
376d9008 1221The way Unicode is handled on EBCDIC platforms is still
1222experimental. On such platforms, references to UTF-8 encoding in this
1223document and elsewhere should be read as meaning the UTF-EBCDIC
1224specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
c349b1b9 1225are specifically discussed. There is no C<utfebcdic> pragma or
376d9008 1226":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
86bbd6d1 1227the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1228for more discussion of the issues.
c349b1b9 1229
b310b053 1230=head2 Locales
1231
4616122b 1232Usually locale settings and Unicode do not affect each other, but
b310b053 1233there are a couple of exceptions:
1234
1235=over 4
1236
1237=item *
1238
8aa8f774 1239You can enable automatic UTF-8-ification of your standard file
1240handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.
b310b053 1243
1244=item *
1245
376d9008 1246Perl tries really hard to work both with Unicode and the old
1247byte-oriented world. Most often this is nice, but sometimes Perl's
1248straddling of the proverbial fence causes problems.
b310b053 1249
1250=back
1251
1aad1664 1252=head2 When Unicode Does Not Happen
1253
While Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like C<@ARGV> that can be interpreted
as Unicode (UTF-8), there are still many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.
1259
e1b711da 1260The following are such interfaces. Also, see L</The "Unicode Bug">.
1261For all of these interfaces Perl
6cd4dd6c 1262currently (as of 5.8.3) simply assumes byte strings both as arguments
1263and results, or UTF-8 strings if the C<encoding> pragma has been used.
1aad1664 1264
1265One reason why Perl does not attempt to resolve the role of Unicode in
e1b711da 1266these cases is that the answers are highly dependent on the operating
1aad1664 1267system and the file system(s). For example, whether filenames can be
1268in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept. Similarly for C<qx> and C<system>: how well will the
'command line interface' (and which of them?) handle Unicode?
1271
1272=over 4
1273
557a2462 1274=item *
1275
51f494cc 1276chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
1e8e8236 1277rename, rmdir, stat, symlink, truncate, unlink, utime, -X
557a2462 1278
1279=item *
1280
1281%ENV
1282
1283=item *
1284
1285glob (aka the <*>)
1286
1287=item *
1aad1664 1288
557a2462 1289open, opendir, sysopen
1aad1664 1290
557a2462 1291=item *
1aad1664 1292
557a2462 1293qx (aka the backtick operator), system
1aad1664 1294
557a2462 1295=item *
1aad1664 1296
557a2462 1297readdir, readlink
1aad1664 1298
1299=back
1300
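Because these interfaces take byte strings, one common approach is to encode
character strings explicitly before passing them in. The sketch below assumes
that the file system stores names as UTF-8 octets, which is not true
everywhere; C<filename_octets> is a hypothetical helper built on the
C<encode> function from L<Encode>.

    use strict;
    use warnings;
    use Encode qw(encode);

    # Assumption: the file system expects names as UTF-8 octets.
    sub filename_octets {
        my ($name) = @_;
        return encode('UTF-8', $name);
    }

    my $name   = "r\x{E9}sum\x{E9}.txt";
    my $octets = filename_octets($name);
    printf "%d characters, %d octets\n", length $name, length $octets;
    # 10 characters, 12 octets

The resulting octet string is what would then be handed to C<open>,
C<mkdir>, and the other functions listed above.
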
e1b711da 1301=head2 The "Unicode Bug"
1302
The term "Unicode bug" has been applied to an inconsistency with the
6f335b04 1304Unicode characters whose ordinals are in the Latin-1 Supplement block, that
e1b711da 1305is, between 128 and 255. Without a locale specified, unlike all other
1306characters or code points, these characters have very different semantics in
1307byte semantics versus character semantics.
1308
1309In character semantics they are interpreted as Unicode code points, which means
1310they have the same semantics as Latin-1 (ISO-8859-1).
1311
1312In byte semantics, they are considered to be unassigned characters, meaning
1313that the only semantics they have is their ordinal numbers, and that they are
1314not members of various character classes. None are considered to match C<\w>
1315for example, but all match C<\W>. (On EBCDIC platforms, the behavior may
1316be different from this, depending on the underlying C language library
1317functions.)
1318
1319The behavior is known to have effects on these areas:
1320
1321=over 4
1322
1323=item *
1324
1325Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
1326and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in regular expression
1327substitutions.
1328
1329=item *
1330
1331Using caseless (C</i>) regular expression matching
1332
1333=item *
1334
1335Matching a number of properties in regular expressions, such as C<\w>
1336
1337=item *
1338
1339User-defined case change mappings. You can create a C<ToUpper()> function, for
1340example, which overrides Perl's built-in case mappings. The scalar must be
1341encoded in utf8 for your function to actually be invoked.
1342
1343=back
1344
1345This behavior can lead to unexpected results in which a string's semantics
1346suddenly change if a code point above 255 is appended to or removed from it,
1347which changes the string's semantics from byte to character or vice versa. As
1348an example, consider the following program and its output:
1349
1350 $ perl -le'
1351 $s1 = "\xC2";
1352 $s2 = "\x{2660}";
1353 for ($s1, $s2, $s1.$s2) {
1354 print /\w/ || 0;
1355 }
1356 '
1357 0
1358 0
1359 1
1360
If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?
e1b711da 1362
1363This anomaly stems from Perl's attempt to not disturb older programs that
1364didn't use Unicode, and hence had no semantics for characters outside of the
1365ASCII range (except in a locale), along with Perl's desire to add Unicode
1366support seamlessly. The result wasn't seamless: these characters were
1367orphaned.
1368
1369Work is being done to correct this, but only some of it was complete in time
1370for the 5.12 release. What has been finished is the important part of the case
1371changing component. Due to concerns, and some evidence, that older code might
1372have come to rely on the existing behavior, the new behavior must be explicitly
1373enabled by the feature C<unicode_strings> in the L<feature> pragma, even though
1374no new syntax is involved.
1375
1376See L<perlfunc/lc> for details on how this pragma works in combination with
1377various others for casing. Even though the pragma only affects casing
1378operations in the 5.12 release, it is planned to have it affect all the
1379problematic behaviors in later releases: you can't have one without them all.
1380
1381In the meantime, a workaround is to always call utf8::upgrade($string), or to
use the standard module L<Encode>. Also, a scalar that has any characters
whose ordinal is 0x100 or above, or which were specified using either of the
C<\N{...}> notations will automatically have character semantics.
e1b711da 1385
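The effect of the C<utf8::upgrade()> workaround can be seen with a single
Latin-1 character. This sketch assumes that neither a locale nor the
C<unicode_strings> feature is in effect; the variable names are illustrative.

    use strict;
    use warnings;

    my $s = "\xE9";                      # LATIN SMALL LETTER E WITH ACUTE

    my $before = $s =~ /\w/ ? 1 : 0;     # byte semantics
    utf8::upgrade($s);                   # same characters, now stored as UTF-8
    my $after  = $s =~ /\w/ ? 1 : 0;     # character semantics

    print "$before $after\n";            # 0 1

The string compares equal before and after the upgrade; only its internal
representation, and hence the semantics applied to it, has changed.
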
1aad1664 1386=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1387
e1b711da 1388Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
1389there are situations where you simply need to force a byte
2bbc8d55 1390string into UTF-8, or vice versa. The low-level calls
1391utf8::upgrade($bytestring) and utf8::downgrade($utf8string[, FAIL_OK]) are
1aad1664 1392the answers.
1393
2bbc8d55 1394Note that utf8::downgrade() can fail if the string contains characters
1395that don't fit into a byte.
1aad1664 1396
e1b711da 1397Calling either function on a string that already is in the desired state is a
1398no-op.
1399
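A short sketch of both calls follows, using C<utf8::is_utf8()> to inspect the
internal flag. The variable names are made up for the example. The second
string contains U+263A, which does not fit into a byte, so the downgrade is
attempted with C<FAIL_OK> set in order to get a false return value instead of
an exception.

    use strict;
    use warnings;

    my $latin = "caf\xE9";
    utf8::upgrade($latin);
    my $is_upgraded = utf8::is_utf8($latin) ? 1 : 0;     # 1
    utf8::downgrade($latin);
    my $is_downgraded = utf8::is_utf8($latin) ? 0 : 1;   # 1

    my $wide = "caf\x{E9} \x{263A}";
    my $ok = utf8::downgrade($wide, 1) ? 1 : 0;          # 0: cannot fit

    print "$is_upgraded $is_downgraded $ok\n";           # 1 1 0
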
95a1a48b 1400=head2 Using Unicode in XS
1401
3a2263fe 1402If you want to handle Perl Unicode in XS extensions, you may find the
1403following C APIs useful. See also L<perlguts/"Unicode Support"> for an
1404explanation about Unicode at the XS level, and L<perlapi> for the API
1405details.
95a1a48b 1406
1407=over 4
1408
1409=item *
1410
1bfb14c4 1411C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
2bbc8d55 1412pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
1bfb14c4 1413flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
1414does B<not> mean that there are any characters of code points greater
1415than 255 (or 127) in the scalar or that there are even any characters
1416in the scalar. What the C<UTF8> flag means is that the sequence of
1417octets in the representation of the scalar is the sequence of UTF-8
1418encoded code points of the characters of a string. The C<UTF8> flag
1419being off means that each octet in this representation encodes a
1420single character with code point 0..255 within the string. Perl's
1421Unicode model is not to use UTF-8 until it is absolutely necessary.
95a1a48b 1422
1423=item *
1424
2bbc8d55 1425C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
1bfb14c4 1426a buffer encoding the code point as UTF-8, and returns a pointer
2bbc8d55 1427pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.
95a1a48b 1428
1429=item *
1430
2bbc8d55 1431C<utf8_to_uvchr(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
376d9008 1432returns the Unicode character code point and, optionally, the length of
2bbc8d55 1433the UTF-8 byte sequence. It works appropriately on EBCDIC machines.
95a1a48b 1434
1435=item *
1436
376d9008 1437C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
1438in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
95a1a48b 1439scalar.
1440
1441=item *
1442
376d9008 1443C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
1444encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
1446it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
1447opposite of C<sv_utf8_encode()>. Note that none of these are to be
1448used as general-purpose encoding or decoding interfaces: C<use Encode>
1449for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
1450but C<sv_utf8_downgrade()> is not (since the encoding pragma is
1451designed to be a one-way street).
95a1a48b 1452
=item *

C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
character.

=item *

C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
are valid UTF-8.

=item *

C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer. C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C<utf8_distance(a, b)> will tell the distance in characters between
two pointers into the same UTF-8 encoded buffer.

=item *

C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars. By default they are useful
only for debugging--they display B<all> characters as hexadecimal code
points--but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=item *

C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode. For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.

=back

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

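The meaning of the C<UTF8> flag described in the list above can also be
observed from Perl code. The following is only a small sketch: it uses
the built-in C<utf8::is_utf8()>, C<utf8::upgrade()> and
C<utf8::downgrade()> functions, and the variable names are made up for
illustration.

```perl
use strict;
use warnings;

# A string whose only non-ASCII character is U+00E9; it fits in one octet.
my $s = "caf\x{E9}";
my $flag_before = utf8::is_utf8($s) ? 1 : 0;   # flag is off: stored as bytes

# Upgrading changes the internal representation, not the string's meaning.
my $t = $s;
utf8::upgrade($t);
my $flag_after = utf8::is_utf8($t) ? 1 : 0;    # flag is on: stored as UTF-8
my $same       = ($s eq $t) ? 1 : 0;           # still the same characters

# Downgrading succeeds because every code point is below 256.
# The second argument asks for a false return instead of an exception.
my $ok = utf8::downgrade($t, 1) ? 1 : 0;

printf "before=%d after=%d same=%d length=%d downgraded=%d\n",
    $flag_before, $flag_after, $same, length($t), $ok;
```

The two copies compare equal and have the same length in characters even
though one is flagged and the other is not, which is the point made
above: the flag describes the storage format, not the content.
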
=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)

Perl by default comes with the latest supported Unicode version built in, but
you can change to use any earlier one.

Download the files in the version of Unicode that you want from the Unicode web
site L<http://www.unicode.org>. These should replace the existing files in
C<$Config{privlib}>/F<unicore>. (C<%Config> is available from the Config
module.) Follow the instructions in F<README.perl> in that directory to change
some of their names, and then run F<make>.

It is even possible to download them to a different directory and then change
F<utf8_heavy.pl> in the directory C<$Config{privlib}> to point to the new
directory. Alternatively, you can make a copy of that directory before making
the change and use C<@INC> or the C<-I> run-time flag to switch between
versions at will (but, because of caching, not in the middle of a process).
All this is beyond the scope of these instructions.

=head1 BUGS

=head2 Interaction with Locales

Use of locales with Unicode data may lead to odd results. Currently,
Perl attempts to attach 8-bit locale info to characters in the range
0..255, but this technique is demonstrably incorrect for locales that
use characters above that range when mapped into Unicode. Perl's
Unicode support will also tend to run slower. Use of locales with
Unicode is discouraged.

=head2 Problems with characters in the Latin-1 Supplement range

See L</The "Unicode Bug">.

=head2 Problems with case-insensitive regular expression matching

There are problems with case-insensitive matches, including those involving
character classes (enclosed in [square brackets]), characters whose fold
is to multiple characters (for example, the single character LATIN SMALL
LIGATURE FFL matches case-insensitively with the 3-character string C<ffl>),
and characters in the Latin-1 Supplement.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly. If the
extension doesn't know about the flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange. If the documentation does not talk about Unicode
at all, suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular C<Foo::Bar::escape_html>
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    use Encode ();

    sub my_escape_html ($) {
      my($what) = shift;
      return unless defined $what;
      # characters -> UTF-8 octets -> extension -> characters again
      Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
    }

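To try the wrapper without a real extension at hand, one can pair it
with a stand-in. In the sketch below, C<Foo::Bar::escape_html> is a
hypothetical pure-Perl stub that only escapes three HTML metacharacters
and otherwise passes octets through; it is an assumption made for
illustration, not the behaviour of any real module. The prototype from
the example above is left out for brevity.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for an extension function that only understands
# octets: it escapes three HTML metacharacters and leaves all other
# bytes unchanged.
sub Foo::Bar::escape_html {
    my ($octets) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
    $octets =~ s/([&<>])/$entity{$1}/g;
    return $octets;
}

# The wrapper from the text: characters in, characters out.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    return Encode::decode_utf8(
        Foo::Bar::escape_html(Encode::encode_utf8($what)));
}

my $in  = "\x{263A} <b>";        # WHITE SMILING FACE, a space, and a tag
my $out = my_escape_html($in);   # the smiley survives as one character
```

The caller never sees octets: the wide character goes in as a single
character and comes back as a single character, while the stub only
ever handles bytes.
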
Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be in a position to use the otherwise
dangerous C<Encode::_utf8_on()> function. Let's say the popular
C<Foo::Bar> extension, written in C, provides a C<param> method that
lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    use Encode ();

    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know it is UTF-8 encoded
        return $ret;
      }
    }

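C<Encode::_utf8_on()> turns the flag on without looking at the octets,
so it is only safe when the extension really hands back what it was
given. As a more defensive alternative (a sketch, and not what the
example above does), the built-in C<utf8::decode()> performs the same
kind of conversion but first checks that the octets are well-formed
UTF-8, returning false if they are not. The helper name
C<octets_to_chars> is invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn UTF-8 octets into characters, or return
# undef when the input is not well-formed UTF-8.
sub octets_to_chars {
    my ($octets) = @_;
    return undef unless defined $octets;
    my $copy = $octets;                    # leave the caller's data alone
    return utf8::decode($copy) ? $copy : undef;
}

my $good = octets_to_chars("\xE2\x98\xBA");   # UTF-8 encoding of U+263A
my $bad  = octets_to_chars("\xE2\x98");       # truncated sequence
```

Here C<$good> holds the single character U+263A, while C<$bad> is
undefined because the input was cut off in the middle of a sequence.
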
Some extensions provide filters on data entry/exit points, such as
C<DB_File::filter_store_key> and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over
characters, such as length(), substr() or index(), or matching regular
expressions, can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).

=head2 Problems on EBCDIC platforms

There are a number of known problems with Perl on EBCDIC platforms. If you
want to use Perl there, send email to perlbug@perl.org.

In earlier versions, when byte and character data were concatenated,
the new string was sometimes created by decoding the byte strings as
I<ISO 8859-1 (Latin-1)>, even if the old Unicode string used EBCDIC.

If you find any of these, please report them as bugs.

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 4

=item *

A filehandle that should read or write UTF-8

    if ($] > 5.007) {
      binmode $fh, ":encoding(utf8)";
    }

=item *

A scalar that is going to be passed to some extension

Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off. Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify whether this is still true.

    if ($] > 5.007) {
      require Encode;
      $val = Encode::encode_utf8($val); # make octets
    }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

    if ($] > 5.007) {
      require Encode;
      $val = Encode::decode_utf8($val);
    }

=item *

Same thing, if you are really sure it is UTF-8

    if ($] > 5.007) {
      require Encode;
      Encode::_utf8_on($val);
    }

=item *

A wrapper for fetchrow_array and fetchrow_hashref

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your fetchrow_array and
fetchrow_hashref calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (October 2002), the DBI has no standardized way
to deal with UTF-8 data. Please check the documentation to verify whether
that is still true.

    sub fetchrow {
      my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
      if ($] < 5.007) {
        return $sth->$what;
      } else {
        require Encode;
        if (wantarray) {
          my @arr = $sth->$what;
          for (@arr) {
            defined && /[^\000-\177]/ && Encode::_utf8_on($_);
          }
          return @arr;
        } else {
          my $ret = $sth->$what;
          if (ref $ret) {
            for my $k (keys %$ret) {
              defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
            }
            return $ret;
          } else {
            defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
            return $ret;
          }
        }
      }
    }

=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag on your program. If you recognize such a situation, just remove
the UTF8 flag:

    utf8::downgrade($val) if $] > 5.007;

=back

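The first item in the list above, a filehandle with an encoding layer,
can be tried out without touching the disk. The following sketch uses
an in-memory file (a filehandle opened on a scalar reference) and the
same C<:encoding(utf8)> layer; it assumes a Perl of 5.8 or later and so
omits the C<$]> guard. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "na\x{EF}ve \x{263A}";    # 7 characters, two of them non-ASCII

# Write the characters through an encoding layer into an in-memory file.
my $buffer = '';
open(my $out, '>', \$buffer) or die "Cannot open in-memory file: $!";
binmode($out, ':encoding(utf8)');
print {$out} $text;
close($out) or die "Cannot close in-memory file: $!";

# $buffer now holds octets: U+00EF takes 2 bytes and U+263A takes 3.
my $octet_count = length($buffer);

# Read the octets back through the same layer to recover the characters.
open(my $in, '<', \$buffer) or die "Cannot reopen in-memory file: $!";
binmode($in, ':encoding(utf8)');
my $round_trip = do { local $/; <$in> };
close($in);
```

The layer does the encoding on output and the decoding on input, so the
program itself only ever deals with characters; the octets exist only
inside C<$buffer>.
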
=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">,
L<http://www.unicode.org/reports/tr44>.

=cut