=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

=over 4

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is using a particular encoding,
see L<encoding>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with Unicode data--or instead uses
a traditional byte scheme when presented with byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.

If you wish to interpret byte strings as UTF-8 instead, use the
C<encoding> pragma:

    use encoding 'utf8';

See L</"Byte and Character Semantics"> for more details.

=back
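
As an illustration of the input and output layers described above, the
following minimal sketch writes a string through an C<:encoding(UTF-8)>
layer and reads it back both with and without that layer. The file name
and variable names are hypothetical, and an ASCII-based platform is
assumed.

    use strict;
    use warnings;

    my $path = "perlunicode-demo.txt";    # hypothetical scratch file
    my $text = "caf\x{E9} \x{263A}";      # 6 characters

    # Writing: the layer encodes characters into UTF-8 bytes.
    open(my $out, ">:encoding(UTF-8)", $path) or die "open: $!";
    print $out $text;
    close($out) or die "close: $!";

    # Reading through the same layer decodes the bytes into characters.
    open(my $in, "<:encoding(UTF-8)", $path) or die "open: $!";
    my $round_trip = do { local $/; <$in> };
    close($in);

    # Reading with no decoding layer shows the encoded bytes instead.
    open(my $raw, "<:raw", $path) or die "open: $!";
    my $octets = do { local $/; <$raw> };
    close($raw);
    unlink $path;

    printf "%d characters, %d bytes on disk\n",
        length($round_trip), length($octets);

Under these assumptions the program reports 6 characters and 9 bytes:
the two non-ASCII characters occupy two and three bytes respectively
once encoded.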

=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
be used to force byte semantics on Unicode data.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC. This translation is done without
regard to the system's native 8-bit encoding. To change this for
systems with non-Latin-1 and non-EBCDIC native encodings, use the
C<encoding> pragma. See L<encoding>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden for Perl code.
See L<perluniintro> for more.
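
The difference between the two semantics can be observed with
C<length()>. The following is a small sketch rather than a recommended
idiom; the variable names are illustrative and an ASCII-based platform
is assumed.

    use strict;
    use warnings;

    my $smiley = "\x{263A}";         # one character, ordinal above 255
    my $chars  = length($smiley);    # character semantics

    my $bytes;
    {
        use bytes;                   # byte semantics, this scope only
        $bytes = length($smiley);    # size of the internal representation
    }

    # Concatenating a byte string with a Unicode string upgrades the
    # byte string as if it were Latin-1, so 0xE9 stays code point 0xE9.
    my $joined = "\xE9" . $smiley;

    printf "%d character, %d bytes; joined: %d characters, first U+%04X\n",
        $chars, $bytes, length($joined), ord($joined);

The smiley counts as one character but three bytes, and the joined
string has two characters, the first of which is still U+00E9.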

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters
may occur directly within the literal strings in one of the various
Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
as such and converted to Perl's internal representation only if the
appropriate L<encoding> is specified.

Unicode characters can also be added to a string by using the
C<\x{...}> notation. The Unicode code for the desired character, in
hexadecimal, should be placed in the braces. For instance, a smiley
face is C<\x{263A}>. This encoding scheme only works for characters
with a code of 0x100 or above.

Additionally, if you

    use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte. The C<\C> pattern is provided to force
a match of a single byte--a C<char> in C, hence C<\C>.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

(However, and as a limitation of the current implementation, using
C<\w> or C<\W> I<inside> a C<[...]> character class will still match
with byte semantics.)

=item *

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property. Brackets are not
required for single letter properties, so C<\p{M}> is equivalent to
C<\pM>. Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.

The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant. It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
becomes C<Latin1Supplement>.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

B<NOTE: the properties, scripts, and blocks listed here are as of
Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0
came out in April 2003, and Perl 5.8.1 in September 2003.>

Here are the basic Unicode General Category properties, followed by their
long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.

    Short       Long

    L           Letter
    Lu          UppercaseLetter
    Ll          LowercaseLetter
    Lt          TitlecaseLetter
    Lm          ModifierLetter
    Lo          OtherLetter

    M           Mark
    Mn          NonspacingMark
    Mc          SpacingMark
    Me          EnclosingMark

    N           Number
    Nd          DecimalNumber
    Nl          LetterNumber
    No          OtherNumber

    P           Punctuation
    Pc          ConnectorPunctuation
    Pd          DashPunctuation
    Ps          OpenPunctuation
    Pe          ClosePunctuation
    Pi          InitialPunctuation
                (may behave like Ps or Pe depending on usage)
    Pf          FinalPunctuation
                (may behave like Ps or Pe depending on usage)
    Po          OtherPunctuation

    S           Symbol
    Sm          MathSymbol
    Sc          CurrencySymbol
    Sk          ModifierSymbol
    So          OtherSymbol

    Z           Separator
    Zs          SpaceSeparator
    Zl          LineSeparator
    Zp          ParagraphSeparator

    C           Other
    Cc          Control
    Cf          Format
    Cs          Surrogate   (not usable)
    Co          PrivateUse
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>.

Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. C<Cs> is therefore not
supported.

Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties:

    Property    Meaning

    BidiL       Left-to-Right
    BidiLRE     Left-to-Right Embedding
    BidiLRO     Left-to-Right Override
    BidiR       Right-to-Left
    BidiAL      Right-to-Left Arabic
    BidiRLE     Right-to-Left Embedding
    BidiRLO     Right-to-Left Override
    BidiPDF     Pop Directional Format
    BidiEN      European Number
    BidiES      European Number Separator
    BidiET      European Number Terminator
    BidiAN      Arabic Number
    BidiCS      Common Number Separator
    BidiNSM     Non-Spacing Mark
    BidiBN      Boundary Neutral
    BidiB       Paragraph Separator
    BidiS       Segment Separator
    BidiWS      Whitespace
    BidiON      Other Neutrals

For example, C<\p{BidiR}> matches characters that are normally
written right to left.

=back
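
The notations described above can be tried out with a short script.
This is only a sketch: the sample characters and variable names are
chosen for illustration, and each test is expected to succeed on a
Perl with Unicode support.

    use strict;
    use warnings;
    use charnames ':full';

    my $by_number = "\x{263A}";                 # code point in hexadecimal
    my $by_name   = "\N{WHITE SMILING FACE}";   # official character name
    print "same character\n" if $by_number eq $by_name;

    my $che   = "\x{0427}";   # CYRILLIC CAPITAL LETTER CHE
    my $kanji = "\x{65E5}";   # a Japanese ideograph
    my $acute = "\x{0301}";   # COMBINING ACUTE ACCENT

    print "uppercase letter\n" if $che   =~ /\p{Lu}/;
    print "negated property\n" if $che   =~ /\P{Ll}/ && $che =~ /\p{^Ll}/;
    print "word character\n"   if $kanji =~ /\w/;
    print "other letter\n"     if $kanji =~ /\p{Lo}/;
    print "mark\n"             if $acute =~ /\pM/ && $acute =~ /\p{Mn}/;

All of the sample characters have ordinals above 255, so the matches
do not depend on how byte strings are treated.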

=head2 Scripts

The script names which can be used by C<\p{...}> and C<\P{...}>,
such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:

    Arabic
    Armenian
    Bengali
    Bopomofo
    Buhid
    CanadianAboriginal
    Cherokee
    Cyrillic
    Deseret
    Devanagari
    Ethiopic
    Georgian
    Gothic
    Greek
    Gujarati
    Gurmukhi
    Han
    Hangul
    Hanunoo
    Hebrew
    Hiragana
    Inherited
    Kannada
    Katakana
    Khmer
    Lao
    Latin
    Malayalam
    Mongolian
    Myanmar
    Ogham
    OldItalic
    Oriya
    Runic
    Sinhala
    Syriac
    Tagalog
    Tagbanwa
    Tamil
    Telugu
    Thaana
    Thai
    Tibetan
    Yi

Extended property classes can supplement the basic
properties, defined by the F<PropList> Unicode database:

    ASCIIHexDigit
    BidiControl
    Dash
    Deprecated
    Diacritic
    Extender
    GraphemeLink
    HexDigit
    Hyphen
    Ideographic
    IDSBinaryOperator
    IDSTrinaryOperator
    JoinControl
    LogicalOrderException
    NoncharacterCodePoint
    OtherAlphabetic
    OtherDefaultIgnorableCodePoint
    OtherGraphemeExtend
    OtherLowercase
    OtherMath
    OtherUppercase
    QuotationMark
    Radical
    SoftDotted
    TerminalPunctuation
    UnifiedIdeograph
    WhiteSpace

and there are further derived properties:

    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
    Lowercase       Ll + OtherLowercase
    Uppercase       Lu + OtherUppercase
    Math            Sm + OtherMath

    ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
    ID_Continue     ID_Start + Mn + Mc + Nd + Pc

    Any             Any character
    Assigned        Any non-Cn character (i.e. synonym for \P{Cn})
    Unassigned      Synonym for \p{Cn}
    Common          Any character (or unassigned code point)
                    not explicitly assigned to a script

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
example, is equal to C<\P{Lu}>.

=head2 Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of 256
Unicode characters. For example, the C<Latin> script contains letters
from many blocks but does not contain all the characters from those
blocks. It does not, for example, contain digits, because digits are
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called C<Common>.

For more about scripts, see the UTR #24:

    http://www.unicode.org/unicode/reports/tr24/

For more about blocks, see:

    http://www.unicode.org/Public/UNIDATA/Blocks.txt

Block names are given with the C<In> prefix. For example, the
Katakana block is referenced via C<\p{InKatakana}>. The C<In>
prefix may be omitted if there is no naming conflict with a script
or any other property, but it is recommended that C<In> always be used
for block tests to avoid confusion.
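
To make the distinction between scripts and blocks concrete, the
sketch below tests one accented Latin letter and one digit against a
script name and against C<In>-prefixed block names. The sample
characters and variable names are illustrative only.

    use strict;
    use warnings;

    my $a_macron = "\x{0101}";   # LATIN SMALL LETTER A WITH MACRON
    my $digit    = "7";

    # A Latin letter that lives outside the BasicLatin block.
    print "Latin script\n" if $a_macron =~ /\p{Latin}/;
    print "LatinExtendedA block, not BasicLatin\n"
        if $a_macron =~ /\p{InLatinExtendedA}/
        && $a_macron !~ /\p{InBasicLatin}/;

    # A digit sits in the BasicLatin block but belongs to no script.
    print "digit: BasicLatin block, not Latin script\n"
        if $digit =~ /\p{InBasicLatin}/ && $digit !~ /\p{Latin}/;

The digit illustrates the point made above: it shares a block with the
ASCII letters, yet it is not part of the C<Latin> script.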

These block names are supported:

    InAlphabeticPresentationForms
    InArabic
    InArabicPresentationFormsA
    InArabicPresentationFormsB
    InArmenian
    InArrows
    InBasicLatin
    InBengali
    InBlockElements
    InBopomofo
    InBopomofoExtended
    InBoxDrawing
    InBraillePatterns
    InBuhid
    InByzantineMusicalSymbols
    InCJKCompatibility
    InCJKCompatibilityForms
    InCJKCompatibilityIdeographs
    InCJKCompatibilityIdeographsSupplement
    InCJKRadicalsSupplement
    InCJKSymbolsAndPunctuation
    InCJKUnifiedIdeographs
    InCJKUnifiedIdeographsExtensionA
    InCJKUnifiedIdeographsExtensionB
    InCherokee
    InCombiningDiacriticalMarks
    InCombiningDiacriticalMarksforSymbols
    InCombiningHalfMarks
    InControlPictures
    InCurrencySymbols
    InCyrillic
    InCyrillicSupplementary
    InDeseret
    InDevanagari
    InDingbats
    InEnclosedAlphanumerics
    InEnclosedCJKLettersAndMonths
    InEthiopic
    InGeneralPunctuation
    InGeometricShapes
    InGeorgian
    InGothic
    InGreekExtended
    InGreekAndCoptic
    InGujarati
    InGurmukhi
    InHalfwidthAndFullwidthForms
    InHangulCompatibilityJamo
    InHangulJamo
    InHangulSyllables
    InHanunoo
    InHebrew
    InHighPrivateUseSurrogates
    InHighSurrogates
    InHiragana
    InIPAExtensions
    InIdeographicDescriptionCharacters
    InKanbun
    InKangxiRadicals
    InKannada
    InKatakana
    InKatakanaPhoneticExtensions
    InKhmer
    InLao
    InLatin1Supplement
    InLatinExtendedA
    InLatinExtendedAdditional
    InLatinExtendedB
    InLetterlikeSymbols
    InLowSurrogates
    InMalayalam
    InMathematicalAlphanumericSymbols
    InMathematicalOperators
    InMiscellaneousMathematicalSymbolsA
    InMiscellaneousMathematicalSymbolsB
    InMiscellaneousSymbols
    InMiscellaneousTechnical
    InMongolian
    InMusicalSymbols
    InMyanmar
    InNumberForms
    InOgham
    InOldItalic
    InOpticalCharacterRecognition
    InOriya
    InPrivateUseArea
    InRunic
    InSinhala
    InSmallFormVariants
    InSpacingModifierLetters
    InSpecials
    InSuperscriptsAndSubscripts
    InSupplementalArrowsA
    InSupplementalArrowsB
    InSupplementalMathematicalOperators
    InSupplementaryPrivateUseAreaA
    InSupplementaryPrivateUseAreaB
    InSyriac
    InTagalog
    InTagbanwa
    InTags
    InTamil
    InTelugu
    InThaana
    InThai
    InTibetan
    InUnifiedCanadianAboriginalSyllabics
    InVariationSelectors
    InYiRadicals
    InYiSyllables

=over 4

=item *

The special pattern C<\X> matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character. C<\X> is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction.

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. Operators that
specifically do not switch include C<vec()>, C<pack()>, and
C<unpack()>. Operators that really don't care include
operators that treat strings as a bucket of bits, such as C<sort()>,
and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters C<c> and C<C> do I<not> change,
since they are often used for byte-oriented formats. Again, think
C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("U")> and C<unpack("U")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 256. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

lc(), uc(), lcfirst(), and ucfirst() work for the following cases:

=over 8

=item *

the case mapping is from a single Unicode character to another
single Unicode character, or

=item *

the case mapping is from a single Unicode character to more
than one Unicode character.

=back

Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
since Perl does not understand the concept of Unicode locales.

See the Unicode Technical Report #21, Case Mappings, for more details.

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back
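
Several of the effects listed above can be checked with a few lines of
code. The following is a sketch under the assumption of an ASCII-based
platform; the sample strings and variable names are illustrative.

    use strict;
    use warnings;

    # \X matches a base character together with its combining marks.
    my $word     = "e\x{0301}le\x{0300}ve";   # two accented letters
    my @clusters = $word =~ /(\X)/g;
    printf "%d characters, %d combining sequences\n",
        length($word), scalar @clusters;

    # uc() maps to uppercase, ucfirst() to titlecase.
    my $dz = "\x{01C6}";    # LATIN SMALL LETTER DZ WITH CARON
    printf "uc: U+%04X, ucfirst: U+%04X\n",
        ord(uc $dz), ord(ucfirst $dz);

    # A case mapping from one character to more than one character.
    my $upper_ff = uc("\x{FB00}");    # LATIN SMALL LIGATURE FF
    print "ligature uppercases to $upper_ff\n";

    # Positions, reversal, chr() and pack("U") all count characters.
    my $s        = "a\x{263A}b";
    my $reversed = reverse $s;
    print "character positions\n"
        if substr($s, 1, 1) eq "\x{263A}" && index($s, "b") == 2;
    print "reversed by character\n" if $reversed eq "b\x{263A}a";
    print "chr agrees with pack U\n"
        if chr(0x263A) eq pack("U", 0x263A);

The sample word has 7 characters but only 5 combining sequences, the
letter U+01C6 has the distinct uppercase and titlecase forms U+01C4 and
U+01C5, and the ligature uppercases to the two-character string C<FF>.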

=head2 User-Defined Character Properties

You can define your own character properties by defining subroutines
whose names begin with "In" or "Is". The subroutines can be defined in
any package. The user-defined properties can be used in the regular
expression C<\p> and C<\P> constructs; if you are using a user-defined
property from a package other than the one you are in, you must specify
its package in the C<\p> or C<\P> construct.

    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }

Note that the effect is compile-time and immutable once defined.

The subroutines must return a specially-formatted string, with one
or more newline-separated lines. Each line must be one of the following:

=over 4

=item *

Two hexadecimal numbers separated by horizontal whitespace (space or
tabular characters) denoting a range of Unicode code points to include.

=item *

Something to include, prefixed by "+": a built-in character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to exclude, prefixed by "-": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a user-defined character property,
for all the characters except the characters in the property; two
hexadecimal code points for a range; or a single hexadecimal code point.

=back

For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define

    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }

Imagine that the here-doc end marker is at the beginning of the line.
Now you can use C<\p{InKana}> and C<\P{InKana}>.

You could also have used the existing block property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the non-characters:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

The negation is useful for defining (surprise!) negated classes.

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

Intersection is useful for getting the common characters matched by
two (or more) classes.

    sub InFooAndBar {
        return <<'END';
    +main::Foo
    &main::Bar
    END
    }

It's important to remember not to use "&" for the first set -- that
would be intersecting with nothing (resulting in an empty set).
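
The property examples above rely on here-documents whose end marker
must sit at the start of a line, which is awkward to show in an
indented listing. The following self-contained sketch builds the same
kind of strings with ordinary double-quoted literals instead. The
property name C<InAssignedKana> and the variable names are hypothetical
and exist only for this illustration.

    use strict;
    use warnings;

    # Ranges: start and end code points in hexadecimal, tab-separated.
    sub InKana {
        return "3040\t309F\n"     # Hiragana block
             . "30A0\t30FF\n";    # Katakana block
    }

    # Builds on the property above and removes unassigned code points.
    sub InAssignedKana {
        return "+main::InKana\n"
             . "-utf8::IsCn\n";
    }

    my $hiragana_a = "\x{3042}";   # HIRAGANA LETTER A
    my $reserved   = "\x{3040}";   # in the Hiragana block, but unassigned
    my $cyrillic   = "\x{0427}";   # CYRILLIC CAPITAL LETTER CHE

    print "kana\n"          if $hiragana_a =~ /\p{InKana}/;
    print "assigned kana\n" if $hiragana_a =~ /\p{InAssignedKana}/;
    print "in block but unassigned\n"
        if $reserved =~ /\p{InKana}/ && $reserved =~ /\P{InAssignedKana}/;
    print "not kana\n"      if $cyrillic =~ /\P{InKana}/;

Returning a plain string keeps the tab and newline escapes visible; the
regular expression engine only sees the resulting text, so either
style should define the same property.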

You can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
The principle is the same: define subroutines in the C<main> package
with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
the first character in ucfirst()), and C<ToUpper> (for uc(), and the
rest of the characters in ucfirst()).

The string returned by the subroutines needs now to be three
hexadecimal numbers separated by tabulators: start of the source
range, end of the source range, and start of the destination range.
For example:

    sub ToUpper {
        return <<END;
    0061\t0063\t0041
    END
    }

defines an uc() mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", "C"; all other characters will remain
unchanged.

If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty, but the two tabulator characters are still needed.
For example:

    sub ToLower {
        return <<END;
    0041\t\t0061
    END
    }

defines a lc() mapping that causes only "A" to be mapped to "a"; all
other characters will remain unchanged.

(For serious hackers only) If you want to introspect the default
mappings, you can find the data in the directory
C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as
the here-document, and the C<utf8::ToSpecFoo> are special exception
mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
The C<Digit> and C<Fold> mappings that one can see in the directory
are not directly user-accessible; one can use either the
C<Unicode::UCD> module, or just match case-insensitively (that's when
the C<Fold> mapping is used).

A final note on the user-defined property tests and mappings: they
will be used only if the scalar has been marked as having Unicode
characters. Old byte-style strings will not be affected.
797
376d9008 798=head2 Character Encodings for Input and Output
8cbd9a7a 799
7221edc9 800See L<Encode>.
8cbd9a7a 801
c29a771d 802=head2 Unicode Regular Expression Support Level
776f8809 803
376d9008 804The following list of Unicode support for regular expressions describes
805all the features currently supported. The references to "Level N"
806and the section numbers refer to the Unicode Technical Report 18,
965cd703 807"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
808Perl 5.8.0).
776f8809 809
810=over 4
811
812=item *
813
814Level 1 - Basic Unicode Support
815
816 2.1 Hex Notation - done [1]
3bfdc84c 817 Named Notation - done [2]
776f8809 818 2.2 Categories - done [3][4]
819 2.3 Subtraction - MISSING [5][6]
820 2.4 Simple Word Boundaries - done [7]
78d3e1bf 821 2.5 Simple Loose Matches - done [8]
776f8809 822 2.6 End of Line - MISSING [9][10]
823
824 [ 1] \x{...}
825 [ 2] \N{...}
eb0cc9e3 826 [ 3] . \p{...} \P{...}
29bdacb8 827 [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
776f8809 828 [ 5] have negation
237bad5b 829 [ 6] can use regular expression look-ahead [a]
830 or user-defined character properties [b] to emulate subtraction
776f8809 831 [ 7] include Letters in word characters
376d9008 832 [ 8] note that Perl does Full case-folding in matching, not Simple:
835863de 833 for example U+1F88 is equivalent with U+1F00 U+03B9,
e0f9d4a8 834 not with 1F80. This difference matters for certain Greek
376d9008 835 capital letters with certain modifiers: the Full case-folding
836 decomposes the letter, while the Simple case-folding would map
e0f9d4a8 837 it to a single character.
5ca1ac52 838 [ 9] see UTR #13 Unicode Newline Guidelines
835863de 839 [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
ec83e909 840 (should also affect <>, $., and script line numbers)
3bfdc84c 841 (the \x{85}, \x{2028} and \x{2029} do match \s)
7207e29d 842
237bad5b 843[a] You can mimic class subtraction using lookahead.
5ca1ac52 844For example, what UTR #18 might write as
29bdacb8 845
dbe420b4 846 [{Greek}-[{UNASSIGNED}]]
847
848in Perl can be written as:
849
1d81abf3 850 (?!\p{Unassigned})\p{InGreekAndCoptic}
851 (?=\p{Assigned})\p{InGreekAndCoptic}
dbe420b4 852
853But in this particular example, you probably really want
854
1bfb14c4 855 \p{GreekAndCoptic}
dbe420b4 856
857which will match assigned characters known to be part of the Greek script.
29bdacb8 858
5ca1ac52 859Also see the Unicode::Regex::Set module, it does implement the full
860UTR #18 grouping, intersection, union, and removal (subtraction) syntax.
861
818c4caa 862[b] See L</"User-Defined Character Properties">.
237bad5b 863
776f8809 864=item *
865
866Level 2 - Extended Unicode Support
867
63de3cb2 868 3.1 Surrogates - MISSING [11]
869 3.2 Canonical Equivalents - MISSING [12][13]
870 3.3 Locale-Independent Graphemes - MISSING [14]
871 3.4 Locale-Independent Words - MISSING [15]
872 3.5 Locale-Independent Loose Matches - MISSING [16]
873
874 [11] Surrogates are solely a UTF-16 concept and Perl's internal
875 representation is UTF-8. The Encode module does UTF-16, though.
876 [12] see UTR#15 Unicode Normalization
877 [13] have Unicode::Normalize but not integrated to regexes
878 [14] have \X but at this level . should equal that
879 [15] need three classes, not just \w and \W
880 [16] see UTR#21 Case Mappings
776f8809 881
882=item *
883
884Level 3 - Locale-Sensitive Support
885
886 4.1 Locale-Dependent Categories - MISSING
887 4.2 Locale-Dependent Graphemes - MISSING [16][17]
888 4.3 Locale-Dependent Words - MISSING
889 4.4 Locale-Dependent Loose Matches - MISSING
890 4.5 Locale-Dependent Ranges - MISSING
891
892 [16] see UTR#10 Unicode Collation Algorithms
893 [17] have Unicode::Collate but not integrated to regexes
894
895=back
896
c349b1b9 897=head2 Unicode Encodings
898
376d9008 899Unicode characters are assigned to I<code points>, which are abstract
900numbers. To use these numbers, various encodings are needed.
c349b1b9 901
902=over 4
903
c29a771d 904=item *
5cb3728c 905
906UTF-8
c349b1b9 907
3e4dbfed 908UTF-8 is a variable-length (1 to 6 bytes, current character allocations
376d9008 909require 4 bytes), byte-order independent encoding. For ASCII (and we
910really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
911transparent.
c349b1b9 912
8c007b5a 913The following table is from Unicode 3.2.
05632f9a 914
915 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
916
8c007b5a 917 U+0000..U+007F 00..7F
918 U+0080..U+07FF C2..DF 80..BF
ec90690f 919 U+0800..U+0FFF E0 A0..BF 80..BF
920 U+1000..U+CFFF E1..EC 80..BF 80..BF
921 U+D000..U+D7FF ED 80..9F 80..BF
8c007b5a 922 U+D800..U+DFFF ******* ill-formed *******
ec90690f 923 U+E000..U+FFFF EE..EF 80..BF 80..BF
05632f9a 924 U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
925 U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
926 U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
927
Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
C<80..8F> in C<U+100000..U+10FFFF>.  The "gaps" are caused by legal
UTF-8 avoiding non-shortest encodings: it is technically possible to
UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always
be used.  So that's what Perl does.
37361303 935
376d9008 936Another way to look at it is via bits:
05632f9a 937
938 Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
939
940 0aaaaaaa 0aaaaaaa
941 00000bbbbbaaaaaa 110bbbbb 10aaaaaa
942 ccccbbbbbbaaaaaa 1110cccc 10bbbbbb 10aaaaaa
943 00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa
944
As you can see, the continuation bytes all begin with C<10>, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.
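
The bit layout above can be checked by encoding a few code points by hand and comparing the result with what the Encode module produces. This is a minimal sketch: C<utf8_bytes_by_hand> is a hypothetical helper written for this illustration, and it does not reject surrogates or code points above U+10FFFF.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: encode one code point (up to U+10FFFF) by hand,
# following the bit layout in the table above.  Each branch picks the
# shortest form that can hold the code point.
sub utf8_bytes_by_hand {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;
    return (0xC0 | ($cp >> 6),
            0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

for my $cp (0x41, 0xE9, 0x20AC, 0x10348) {
    my @by_hand   = utf8_bytes_by_hand($cp);
    my @by_encode = unpack("C*", Encode::encode("UTF-8", chr $cp));
    printf "U+%04X  by hand: %-12s Encode: %s\n", $cp,
        join(" ", map { sprintf "%02X", $_ } @by_hand),
        join(" ", map { sprintf "%02X", $_ } @by_encode);
}
```

The two columns should agree for every code point in the loop, since both always choose the shortest encoding.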
948
c29a771d 949=item *
5cb3728c 950
951UTF-EBCDIC
dbe420b4 952
376d9008 953Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
dbe420b4 954
c29a771d 955=item *
5cb3728c 956
1e54db1a 957UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
c349b1b9 958
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
dbe420b4 961
c349b1b9 962UTF-16 is a 2 or 4 byte encoding. The Unicode code points
1bfb14c4 963C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
964points C<U+10000..U+10FFFF> in two 16-bit units. The latter case is
c349b1b9 965using I<surrogates>, the first 16-bit unit being the I<high
966surrogate>, and the second being the I<low surrogate>.
967
376d9008 968Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
c349b1b9 969range of Unicode code points in pairs of 16-bit units. The I<high
376d9008 970surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
971are the range C<U+DC00..U+DFFF>. The surrogate encoding is
c349b1b9 972
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
975
976and the decoding is
977
1a3fa709 978 $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
c349b1b9 979
feda178f 980If you try to generate surrogates (for example by using chr()), you
376d9008 981will get a warning if warnings are turned on, because those code
982points are not valid for a Unicode character.
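
The surrogate arithmetic can be verified with a round trip. The sketch below wraps the two formulas in hypothetical helper subs, C<to_surrogates> and C<from_surrogates>. It uses C<int()> because Perl's C</> operator does floating-point division unless C<use integer> is in effect.

```perl
use strict;
use warnings;

# Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
# int() forces integer division; "/" alone would yield a fraction.
sub to_surrogates {
    my ($uni) = @_;
    my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    my $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
    return ($hi, $lo);
}

# Combine a high and a low surrogate back into one code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1D11E);    # MUSICAL SYMBOL G CLEF
printf "U+1D11E -> %04X %04X -> U+%X\n", $hi, $lo, from_surrogates($hi, $lo);
# prints "U+1D11E -> D834 DD1E -> U+1D11E"
```

No surrogate characters are ever created with C<chr()> here; the values stay plain integers, so no warnings are triggered.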
9466bab6 983
376d9008 984Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
c349b1b9 985itself can be used for in-memory computations, but if storage or
376d9008 986transfer is required either UTF-16BE (big-endian) or UTF-16LE
987(little-endian) encodings must be chosen.
c349b1b9 988
989This introduces another problem: what if you just know that your data
376d9008 990is UTF-16, but you don't know which endianness? Byte Order Marks, or
991BOMs, are a solution to this. A special character has been reserved
86bbd6d1 992in Unicode to function as a byte order marker: the character with the
376d9008 993code point C<U+FEFF> is the BOM.
042da322 994
c349b1b9 995The trick is that if you read a BOM, you will know the byte order,
376d9008 996since if it was written on a big-endian platform, you will read the
997bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
998you will read the bytes C<0xFF 0xFE>. (And if the originating platform
999was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
042da322 1000
The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in
big-endian format".
c349b1b9 1006
c29a771d 1007=item *
5cb3728c 1008
1e54db1a 1009UTF-32, UTF-32BE, UTF-32LE
c349b1b9 1010
The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed.  The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
C<0xFF 0xFE 0x00 0x00> for LE.
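
The BOM signatures for the UTF-16 and UTF-32 families (and for UTF-8) can be turned into a small detector. This is only a sketch: C<sniff_bom> is a hypothetical helper, not part of Perl or Encode, and it expects a byte string holding the first few octets of the data.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding from the first bytes of a buffer.
# The UTF-32 signatures are tested first because the UTF-32LE BOM begins
# with the same two bytes as the UTF-16LE BOM.
sub sniff_bom {
    my ($bytes) = @_;
    return "UTF-32BE" if $bytes =~ /^\x00\x00\xFE\xFF/;
    return "UTF-32LE" if $bytes =~ /^\xFF\xFE\x00\x00/;
    return "UTF-8"    if $bytes =~ /^\xEF\xBB\xBF/;
    return "UTF-16BE" if $bytes =~ /^\xFE\xFF/;
    return "UTF-16LE" if $bytes =~ /^\xFF\xFE/;
    return;    # no BOM recognized
}

for my $sample ("\xFE\xFF\x00h", "\xFF\xFEh\x00", "\xEF\xBB\xBFh", "hello") {
    my $guess = sniff_bom($sample);
    printf "%-12s %s\n", unpack("H*", substr($sample, 0, 4)),
        defined $guess ? $guess : "(no BOM)";
}
```

One ambiguity remains: UTF-16LE data whose first character after the BOM is U+0000 starts with the same four bytes as a UTF-32LE BOM, so this sketch would report it as UTF-32LE.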
c349b1b9 1015
c29a771d 1016=item *
5cb3728c 1017
1018UCS-2, UCS-4
c349b1b9 1019
86bbd6d1 1020Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
376d9008 1021encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
339cfa0e 1022because it does not use surrogates. UCS-4 is a 32-bit encoding,
1023functionally identical to UTF-32.
c349b1b9 1024
c29a771d 1025=item *
5cb3728c 1026
1027UTF-7
c349b1b9 1028
376d9008 1029A seven-bit safe (non-eight-bit) encoding, which is useful if the
1030transport or storage is not eight-bit safe. Defined by RFC 2152.
c349b1b9 1031
95a1a48b 1032=back
1033
0d7c09bb 1034=head2 Security Implications of Unicode
1035
1036=over 4
1037
1038=item *
1039
1040Malformed UTF-8
bf0fa0b2 1041
Unfortunately, the specification of UTF-8 leaves some room for
interpretation of how many bytes of encoded output one should generate
from one input Unicode character.  Strictly speaking, the shortest
possible sequence of UTF-8 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a UTF-8 connection.  Perl always generates the
shortest length UTF-8, and with warnings on, Perl will warn about
non-shortest length UTF-8 along with other malformations, such as the
surrogates, which are not real Unicode code points.
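
A non-shortest sequence can be shown being rejected at a trust boundary. The sketch below uses the Encode module's strict C<"UTF-8"> decoder with C<Encode::FB_CROAK>. Whether to croak or to substitute a replacement character is a design choice for your application, and croaking is chosen here only to make the rejection visible.

```perl
use strict;
use warnings;
use Encode ();

# "\xC0\xAF" is a non-shortest (overlong) two-byte form of "/" (U+002F);
# the only legal UTF-8 encoding of that character is the single byte 0x2F.
my $overlong = "\xC0\xAF";

# FB_CROAK makes decode() die on malformed input.  decode() may modify its
# input when a CHECK argument is given, so a copy is passed.
my $decoded = eval {
    Encode::decode("UTF-8", my $copy = $overlong, Encode::FB_CROAK);
};
print defined $decoded ? "accepted\n" : "rejected: $@";
```

Filtering for C</> before decoding would miss the overlong form, which is why the decoder itself has to refuse it.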
bf0fa0b2 1051
0d7c09bb 1052=item *
1053
Regular expressions behave slightly differently between byte data and
character (Unicode) data.  For example, the "word character" character
class C<\w> will work differently depending on whether the data is
eight-bit bytes or Unicode.
0d7c09bb 1058
376d9008 1059In the first case, the set of C<\w> characters is either small--the
1060default set of alphabetic characters, digits, and the "_"--or, if you
0d7c09bb 1061are using a locale (see L<perllocale>), the C<\w> might contain a few
1062more letters according to your language and country.
1063
376d9008 1064In the second case, the C<\w> set of characters is much, much larger.
1bfb14c4 1065Most importantly, even in the set of the first 256 characters, it will
1066probably match different characters: unlike most locales, which are
1067specific to a language and country pair, Unicode classifies all the
1068characters that are letters I<somewhere> as C<\w>. For example, your
1069locale might not think that LATIN SMALL LETTER ETH is a letter (unless
1070you happen to speak Icelandic), but Unicode does.
0d7c09bb 1071
376d9008 1072As discussed elsewhere, Perl has one foot (two hooves?) planted in
1bfb14c4 1073each of two worlds: the old world of bytes and the new world of
1074characters, upgrading from bytes to characters when necessary.
376d9008 1075If your legacy code does not explicitly use Unicode, no automatic
1076switch-over to characters should happen. Characters shouldn't get
1bfb14c4 1077downgraded to bytes, either. It is possible to accidentally mix bytes
1078and characters, however (see L<perluniintro>), in which case C<\w> in
1079regular expressions might start behaving differently. Review your
1080code. Use warnings and the C<strict> pragma.
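
The difference is easy to reproduce with LATIN SMALL LETTER ETH. The following sketch assumes default regular expression semantics: no locale in effect and, on newer perls, no C</u> modifier and no C<unicode_strings> feature, either of which would change the result.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER ETH (U+00F0), first as a one-byte string...
my $as_byte = "\xF0";

# ...and then the same character stored internally as UTF-8.
my $as_char = "\xF0";
utf8::upgrade($as_char);

# The two scalars compare equal, yet \w treats them differently.
printf "strings are equal:       %s\n", ($as_byte eq $as_char ? "yes" : "no");
printf "byte string matches \\w:  %s\n", ($as_byte =~ /\w/ ? "yes" : "no");
printf "char string matches \\w:  %s\n", ($as_char =~ /\w/ ? "yes" : "no");
```

Under the stated assumptions the byte string does not match C<\w> while the character string does, even though C<eq> considers them the same string.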
0d7c09bb 1081
1082=back
1083
c349b1b9 1084=head2 Unicode in Perl on EBCDIC
1085
376d9008 1086The way Unicode is handled on EBCDIC platforms is still
1087experimental. On such platforms, references to UTF-8 encoding in this
1088document and elsewhere should be read as meaning the UTF-EBCDIC
1089specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
c349b1b9 1090are specifically discussed. There is no C<utfebcdic> pragma or
376d9008 1091":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
86bbd6d1 1092the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
1093for more discussion of the issues.
c349b1b9 1094
b310b053 1095=head2 Locales
1096
4616122b 1097Usually locale settings and Unicode do not affect each other, but
b310b053 1098there are a couple of exceptions:
1099
1100=over 4
1101
1102=item *
1103
You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.
b310b053 1108
1109=item *
1110
376d9008 1111Perl tries really hard to work both with Unicode and the old
1112byte-oriented world. Most often this is nice, but sometimes Perl's
1113straddling of the proverbial fence causes problems.
b310b053 1114
1115=back
1116
1aad1664 1117=head2 When Unicode Does Not Happen
1118
While Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like C<@ARGV> that can be interpreted
as Unicode (UTF-8), there are still many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.
1124
6cd4dd6c 1125The following are such interfaces. For all of these interfaces Perl
1126currently (as of 5.8.3) simply assumes byte strings both as arguments
1127and results, or UTF-8 strings if the C<encoding> pragma has been used.
1aad1664 1128
One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept.  Similarly for C<qx> and C<system>: how well will the
'command line interface' (and which of them?) handle Unicode?
1135
1136=over 4
1137
557a2462 1138=item *
1139
chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X
557a2462 1142
1143=item *
1144
1145%ENV
1146
1147=item *
1148
1149glob (aka the <*>)
1150
1151=item *
1aad1664 1152
557a2462 1153open, opendir, sysopen
1aad1664 1154
557a2462 1155=item *
1aad1664 1156
557a2462 1157qx (aka the backtick operator), system
1aad1664 1158
557a2462 1159=item *
1aad1664 1160
557a2462 1161readdir, readlink
1aad1664 1162
1163=back
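
Because the interfaces listed above deal in byte strings, one workable approach is to encode names yourself on the way out and decode them on the way back in. The sketch below does this for a file name. It assumes a file system that stores the octets of a name unchanged and accepts UTF-8 names, which is typical on Linux but is not guaranteed elsewhere.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp ();

my $dir  = File::Temp::tempdir(CLEANUP => 1);
my $name = "\x{263A}.txt";                 # a character string

# Encode explicitly: open() is handed octets, not characters.
my $octets = Encode::encode("UTF-8", $name);
open my $fh, ">", "$dir/$octets" or die "open: $!";
close $fh or die "close: $!";

# readdir() hands back octets as well, so decode them again.
opendir my $dh, $dir or die "opendir: $!";
my @names = map  { Encode::decode("UTF-8", $_) }
            grep { !/^\.\.?$/ } readdir $dh;
closedir $dh;

print "round trip ok\n" if @names == 1 && $names[0] eq $name;
```

The choice of UTF-8 is an assumption of this sketch; the right encoding is whatever your operating system and file system actually use for names.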
1164
1165=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
1166
1167Sometimes (see L</"When Unicode Does Not Happen">) there are
1168situations where you simply need to force Perl to believe that a byte
1169string is UTF-8, or vice versa. The low-level calls
1170utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
1171the answers.
1172
Do not use them without careful thought, though: Perl may easily get
very confused, angry, or even crash, if you suddenly change the 'nature'
of a scalar like that.  Be especially careful if you use
utf8::upgrade(): not every byte string is valid UTF-8.
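
The following sketch shows what the two calls do to a scalar's internal representation. It relies on C<utf8::is_utf8()> to inspect the flag and on the optional second argument of C<utf8::downgrade()>, which makes it return false instead of dying when the string cannot be stored one byte per character.

```perl
use strict;
use warnings;

my $s    = "caf\xE9";          # four characters, stored as four bytes
my $copy = $s;

utf8::upgrade($s);             # same characters, now stored as UTF-8
print "flag on after upgrade\n"    if utf8::is_utf8($s);
print "still the same string\n"    if $s eq $copy;

utf8::downgrade($s);           # back to one byte per character
print "flag off after downgrade\n" unless utf8::is_utf8($s);

# A character above 255 cannot be stored one byte per character; with a
# true second argument downgrade() returns false instead of dying.
my $wide = "\x{263A}";
print "cannot downgrade\n" unless utf8::downgrade($wide, 1);
```

Note that the string compares equal before and after the upgrade: only the storage changes, not the characters.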
1177
95a1a48b 1178=head2 Using Unicode in XS
1179
3a2263fe 1180If you want to handle Perl Unicode in XS extensions, you may find the
1181following C APIs useful. See also L<perlguts/"Unicode Support"> for an
1182explanation about Unicode at the XS level, and L<perlapi> for the API
1183details.
95a1a48b 1184
1185=over 4
1186
1187=item *
1188
1bfb14c4 1189C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect.  C<SvUTF8(sv)> returns true if the C<UTF8>
1191flag is on; the bytes pragma is ignored. The C<UTF8> flag being on
1192does B<not> mean that there are any characters of code points greater
1193than 255 (or 127) in the scalar or that there are even any characters
1194in the scalar. What the C<UTF8> flag means is that the sequence of
1195octets in the representation of the scalar is the sequence of UTF-8
1196encoded code points of the characters of a string. The C<UTF8> flag
1197being off means that each octet in this representation encodes a
1198single character with code point 0..255 within the string. Perl's
1199Unicode model is not to use UTF-8 until it is absolutely necessary.
95a1a48b 1200
1201=item *
1202
fb9cc174 1203C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
1bfb14c4 1204a buffer encoding the code point as UTF-8, and returns a pointer
95a1a48b 1205pointing after the UTF-8 bytes.
1206
1207=item *
1208
376d9008 1209C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
1210returns the Unicode character code point and, optionally, the length of
1211the UTF-8 byte sequence.
95a1a48b 1212
1213=item *
1214
376d9008 1215C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
1216in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
95a1a48b 1217scalar.
1218
1219=item *
1220
376d9008 1221C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
1222encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
1223possible. C<sv_utf8_encode(sv)> is like sv_utf8_upgrade except that
1224it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
1225opposite of C<sv_utf8_encode()>. Note that none of these are to be
1226used as general-purpose encoding or decoding interfaces: C<use Encode>
1227for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
1228but C<sv_utf8_downgrade()> is not (since the encoding pragma is
1229designed to be a one-way street).
95a1a48b 1230
1231=item *
1232
376d9008 1233C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
90f968e0 1234character.
95a1a48b 1235
1236=item *
1237
376d9008 1238C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
95a1a48b 1239are valid UTF-8.
1240
1241=item *
1242
376d9008 1243C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
1244character in the buffer. C<UNISKIP(chr)> will return the number of bytes
1245required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
90f968e0 1246is useful for example for iterating over the characters of a UTF-8
376d9008 1247encoded buffer; C<UNISKIP()> is useful, for example, in computing
90f968e0 1248the size required for a UTF-8 encoded buffer.
95a1a48b 1249
1250=item *
1251
376d9008 1252C<utf8_distance(a, b)> will tell the distance in characters between the
95a1a48b 1253two pointers pointing to the same UTF-8 encoded buffer.
1254
1255=item *
1256
C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
1258that is C<off> (positive or negative) Unicode characters displaced
1259from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
1260C<utf8_hop()> will merrily run off the end or the beginning of the
1261buffer if told to do so.
95a1a48b 1262
d2cc3551 1263=item *
1264
376d9008 1265C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
1266C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
1267output of Unicode strings and scalars. By default they are useful
1268only for debugging--they display B<all> characters as hexadecimal code
1bfb14c4 1269points--but with the flags C<UNI_DISPLAY_ISPRINT>,
1270C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
1271output more readable.
d2cc3551 1272
1273=item *
1274
C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
1276compare two strings case-insensitively in Unicode. For case-sensitive
1277comparisons you can just use C<memEQ()> and C<memNE()> as usual.
d2cc3551 1278
c349b1b9 1279=back
1280
95a1a48b 1281For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
1282in the Perl source code distribution.
1283
c29a771d 1284=head1 BUGS
1285
376d9008 1286=head2 Interaction with Locales
7eabb34d 1287
376d9008 1288Use of locales with Unicode data may lead to odd results. Currently,
1289Perl attempts to attach 8-bit locale info to characters in the range
12900..255, but this technique is demonstrably incorrect for locales that
1291use characters above that range when mapped into Unicode. Perl's
1292Unicode support will also tend to run slower. Use of locales with
1293Unicode is discouraged.
c29a771d 1294
376d9008 1295=head2 Interaction with Extensions
7eabb34d 1296
376d9008 1297When Perl exchanges data with an extension, the extension should be
7eabb34d 1298able to understand the UTF-8 flag and act accordingly. If the
376d9008 1299extension doesn't know about the flag, it's likely that the extension
1300will return incorrectly-flagged data.
7eabb34d 1301
So if you're working with Unicode data, consult the documentation of
every module you're using to see if there are any issues with Unicode data
1304exchange. If the documentation does not talk about Unicode at all,
a73d23f6 1305suspect the worst and probably look at the source to learn how the
376d9008 1306module is implemented. Modules written completely in Perl shouldn't
a73d23f6 1307cause problems. Modules that directly or indirectly access code written
1308in other programming languages are at risk.
7eabb34d 1309
376d9008 1310For affected functions, the simple strategy to avoid data corruption is
7eabb34d 1311to always make the encoding of the exchanged data explicit. Choose an
376d9008 1312encoding that you know the extension can handle. Convert arguments passed
7eabb34d 1313to the extensions to that encoding and convert results back from that
1314encoding. Write wrapper functions that do the conversions for you, so
1315you can later change the functions when the extension catches up.
1316
376d9008 1317To provide an example, let's say the popular Foo::Bar::escape_html
7eabb34d 1318function doesn't deal with Unicode data yet. The wrapper function
1319would convert the argument to raw UTF-8 and convert the result back to
376d9008 1320Perl's internal representation like so:
7eabb34d 1321
1322 sub my_escape_html ($) {
1323 my($what) = shift;
1324 return unless defined $what;
1325 Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
1326 }
1327
1328Sometimes, when the extension does not convert data but just stores
1329and retrieves them, you will be in a position to use the otherwise
1330dangerous Encode::_utf8_on() function. Let's say the popular
66b79f27 1331C<Foo::Bar> extension, written in C, provides a C<param> method that
7eabb34d 1332lets you store and retrieve data according to these prototypes:
1333
1334 $self->param($name, $value); # set a scalar
1335 $value = $self->param($name); # retrieve a scalar
1336
1337If it does not yet provide support for any encoding, one could write a
1338derived class with such a C<param> method:
1339
    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }
1352
Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family.  Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.
1357
376d9008 1358=head2 Speed
7eabb34d 1359
c29a771d 1360Some functions are slower when working on UTF-8 encoded strings than
574c8022 1361on byte encoded strings. All functions that need to hop over
7c17141f 1362characters such as length(), substr() or index(), or matching regular
1363expressions can work B<much> faster when the underlying data are
1364byte-encoded.
1365
1366In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
1367a caching scheme was introduced which will hopefully make the slowness
a104b433 1368somewhat less spectacular, at least for some operations. In general,
1369operations with UTF-8 encoded strings are still slower. As an example,
1370the Unicode properties (character classes) like C<\p{Nd}> are known to
1371be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
1373compared with the 10 ASCII characters matching C<d>).
666f95b9 1374
c8d992ba 1375=head2 Porting code from perl-5.6.X
1376
1377Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
1378was required to use the C<utf8> pragma to declare that a given scope
1379expected to deal with Unicode data and had to make sure that only
1380Unicode data were reaching that scope. If you have code that is
1381working with 5.6, you will need some of the following adjustments to
1382your code. The examples are written such that the code will continue
1383to work under 5.6, so you should be safe to try them out.
1384
1385=over 4
1386
1387=item *
1388
1389A filehandle that should read or write UTF-8
1390
1391 if ($] > 5.007) {
1392 binmode $fh, ":utf8";
1393 }
1394
1395=item *
1396
1397A scalar that is going to be passed to some extension
1398
1399Be it Compress::Zlib, Apache::Request or any extension that has no
1400mention of Unicode in the manpage, you need to make sure that the
1401UTF-8 flag is stripped off. Note that at the time of this writing
1402(October 2002) the mentioned modules are not UTF-8-aware. Please
1403check the documentation to verify if this is still true.
1404
1405 if ($] > 5.007) {
1406 require Encode;
1407 $val = Encode::encode_utf8($val); # make octets
1408 }
1409
1410=item *
1411
1412A scalar we got back from an extension
1413
1414If you believe the scalar comes back as UTF-8, you will most likely
1415want the UTF-8 flag restored:
1416
1417 if ($] > 5.007) {
1418 require Encode;
1419 $val = Encode::decode_utf8($val);
1420 }
1421
1422=item *
1423
1424Same thing, if you are really sure it is UTF-8
1425
1426 if ($] > 5.007) {
1427 require Encode;
1428 Encode::_utf8_on($val);
1429 }
1430
1431=item *
1432
1433A wrapper for fetchrow_array and fetchrow_hashref
1434
1435When the database contains only UTF-8, a wrapper function or method is
1436a convenient way to replace all your fetchrow_array and
1437fetchrow_hashref calls. A wrapper function will also make it easier to
1438adapt to future enhancements in your database driver. Note that at the
1439time of this writing (October 2002), the DBI has no standardized way
1440to deal with UTF-8 data. Please check the documentation to verify if
1441that is still true.
1442
1443 sub fetchrow {
1444 my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
1445 if ($] < 5.007) {
1446 return $sth->$what;
1447 } else {
1448 require Encode;
1449 if (wantarray) {
1450 my @arr = $sth->$what;
1451 for (@arr) {
1452 defined && /[^\000-\177]/ && Encode::_utf8_on($_);
1453 }
1454 return @arr;
1455 } else {
1456 my $ret = $sth->$what;
1457 if (ref $ret) {
1458 for my $k (keys %$ret) {
1459 defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
1460 }
1461 return $ret;
1462 } else {
1463 defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
1464 return $ret;
1465 }
1466 }
1467 }
1468 }
1469
1470
1471=item *
1472
1473A large scalar that you know can only contain ASCII
1474
1475Scalars that contain only ASCII and are marked as UTF-8 are sometimes
1476a drag to your program. If you recognize such a situation, just remove
1477the UTF-8 flag:
1478
1479 utf8::downgrade($val) if $] > 5.007;
1480
1481=back
1482
393fec97 1483=head1 SEE ALSO
1484
72ff2908 1485L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
a05d7ebb 1486L<perlretut>, L<perlvar/"${^UNICODE}">
393fec97 1487
1488=cut