pod/perlunicode.pod

   1 =head1 NAME
   2
   3 perlunicode - Unicode support in Perl
   4
   5 =head1 DESCRIPTION
   6
   7 =head2 Important Caveats
   8
   9 WARNING: While the implementation of Unicode support in Perl is now fairly
  10 complete, it is still evolving to some extent.
  11
  12 In particular the way Unicode is handled on EBCDIC platforms is still rather
  13 experimental. On such a platform references to UTF-8 encoding in this
  14 document and elsewhere should be read as meaning UTF-EBCDIC as specified
  15 in Unicode Technical Report 16 unless ASCII vs EBCDIC issues are specifically
  16 discussed. There is no C<utfebcdic> pragma or ":utfebcdic" layer; rather,
  17 "utf8" and ":utf8" are re-used to mean the platform's "natural" 8-bit encoding
  18 of Unicode. See L<perlebcdic> for more discussion of the issues.
  19
  20 The following areas are still under development.
  21
  22 =over 4
  23
  24 =item Input and Output Disciplines
  25
  26 A filehandle can be marked as containing perl's internal Unicode encoding
  27 (UTF-8 or UTF-EBCDIC) by opening it with the ":utf8" layer.
  28 Other encodings can be converted to perl's encoding on input, or from
  29 perl's encoding on output by use of the ":encoding()" layer.
  30 There is not yet a clean way to mark the perl source itself as being
  31 in a particular encoding.
  32
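As a minimal sketch of the two layers described above, the following
reads ISO-8859-1 data through C<:encoding()> and writes it back out
through C<:utf8>.  It assumes a Perl recent enough to support in-memory
filehandles and the C<Encode> module; the variable names are purely
illustrative.

```perl
use strict;
use warnings;

# The bytes for "caf" plus e-acute in ISO-8859-1: e-acute is the byte 0xE9.
my $latin1_bytes = "caf\xE9\n";

# Read from an in-memory file, converting ISO-8859-1 to Perl's encoding.
open(my $in, '<:encoding(iso-8859-1)', \$latin1_bytes) or die "open: $!";
my $line = <$in>;
close($in);
chomp $line;

# Write the same characters to another in-memory file via the ":utf8" layer.
my $utf8_bytes = '';
open(my $out, '>:utf8', \$utf8_bytes) or die "open: $!";
print {$out} $line;
close($out);

# On an ASCII platform this prints "4 characters in, 5 bytes out".
printf "%d characters in, %d bytes out\n", length($line), length($utf8_bytes);
```
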
  33 =item Regular Expressions
  34
  35 The regular expression compiler now attempts to produce
  36 polymorphic opcodes.  That is, the pattern should now adapt to the data
  37 and automatically switch to the Unicode character scheme when presented
  38 with Unicode data, or a traditional byte scheme when presented with
  39 byte data.  The implementation is still new and (particularly on
  40 EBCDIC platforms) may need further work.
  41
  42 =item C<use utf8> still needed to enable a few features
  43
  44 The C<utf8> pragma implements the tables used for Unicode support.  These
  45 tables are automatically loaded on demand, so the C<utf8> pragma need not
  46 normally be used.
  47
  48 However, as a compatibility measure, this pragma must be explicitly used
  49 to enable recognition of UTF-8 encoded literals and identifiers in the
  50 source text on ASCII based machines, or of UTF-EBCDIC encoded literals
  51 and identifiers on EBCDIC based machines.
  52
  53 =back
  54
  55 =head2 Byte and Character semantics
  56
  57 Beginning with version 5.6, Perl uses logically wide characters to
  58 represent strings internally.  This internal representation of strings
  59 uses either the UTF-8 or the UTF-EBCDIC encoding.
  60
  61 In future, Perl-level operations can be expected to work with characters
  62 rather than bytes, in general.
  63
  64 However, as strictly an interim compatibility measure, Perl v5.6 aims to
  65 provide a safe migration path from byte semantics to character semantics
  66 for programs.  For operations where Perl can unambiguously decide that the
  67 input data is characters, Perl now switches to character semantics.
  68 For operations where this determination cannot be made without additional
  69 information from the user, Perl decides in favor of compatibility, and
  70 chooses to use byte semantics.
  71
  72 This behavior preserves compatibility with earlier versions of Perl,
  73 which allowed byte semantics in Perl operations, but only as long as
  74 none of the program's inputs are marked as being a source of Unicode
  75 character data.  Such data may come from filehandles, from calls to
  76 external programs, from information provided by the system (such as %ENV),
  77 or from literals and constants in the source text.
  78
  79 If the C<-C> command line switch is used (or the ${^WIDE_SYSTEM_CALLS}
  80 global flag is set to C<1>), all system calls will use the
  81 corresponding wide character APIs.  This is currently only implemented
  82 on Windows, because Unix systems lack a standard API in this area.
  83
  84 Regardless of the above, the C<bytes> pragma can always be used to force
  85 byte semantics in a particular lexical scope.  See L<bytes>.
  86
  87 The C<utf8> pragma is primarily a compatibility device that enables
  88 recognition of UTF-(8|EBCDIC) in literals encountered by the parser.  It may also
  89 be used for enabling some of the more experimental Unicode support features.
  90 Note that this pragma is only required until a future version of Perl
  91 in which character semantics will become the default.  This pragma may
  92 then become a no-op.  See L<utf8>.
  93
  94 Unless mentioned otherwise, Perl operators will use character semantics
  95 when they are dealing with Unicode data, and byte semantics otherwise.
  96 Thus, character semantics for these operations apply transparently; if
  97 the input data came from a Unicode source (for example, by adding a
  98 character encoding discipline to the filehandle whence it came, or a
  99 literal UTF-8 string constant in the program), character semantics
 100 apply; otherwise, byte semantics are in effect.  To force byte semantics
 101 on Unicode data, the C<bytes> pragma should be used.
 102
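The difference between the two semantics can be sketched by measuring
the same string with and without the C<bytes> pragma.  This is only an
illustration; the byte count shown assumes an ASCII (UTF-8) platform,
and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";    # one character, ordinal value above 255

# Character semantics: length() counts characters.
my $char_len = length($smiley);

# Byte semantics, limited to the lexical scope of the do block.
my $byte_len = do { use bytes; length($smiley) };

# On an ASCII platform this prints "1 character, 3 bytes".
print "$char_len character, $byte_len bytes\n";
```
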
 103 Under character semantics, many operations that formerly operated on
 104 bytes change to operating on characters.  For ASCII data this makes
 105 no difference, because UTF-8 stores ASCII in single bytes, but for
 106 any character greater than C<chr(127)>, the character may be stored in
 107 a sequence of two or more bytes, all of which have the high bit set.
 108
 109 For C1 controls or Latin 1 characters on an EBCDIC platform the
 110 character may be stored in a UTF-EBCDIC multi byte sequence.  But by
 111 and large, the user need not worry about this, because Perl hides it
 112 from the user.  A character in Perl is logically just a number ranging
 113 from 0 to 2**32 or so.  Larger characters encode to longer sequences
 114 of bytes internally, but again, this is just an internal detail which
 115 is hidden at the Perl level.
 116
 117 =head2 Effects of character semantics
 118
 119 Character semantics have the following effects:
 120
 121 =over 4
 122
 123 =item *
 124
 125 Strings and patterns may contain characters that have an ordinal value
 126 larger than 255.
 127
 128 Presuming you use a Unicode editor to edit your program, such characters
 129 will typically occur directly within the literal strings as UTF-(8|EBCDIC)
 130 characters, but you can also specify a particular character with an
 131 extension of the C<\x> notation.  UTF-X characters are specified by
 132 putting the hexadecimal code within curlies after the C<\x>.  For instance,
 133 a Unicode smiley face is C<\x{263A}>.
 134
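A short sketch of the C<\x{}> notation, using the smiley face mentioned
above (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";                        # WHITE SMILING FACE
my $count  = length($smiley);                   # one character
my $code   = sprintf("U+%04X", ord($smiley));   # its ordinal, in hex

# Prints "1 character, code point U+263A".
print "$count character, code point $code\n";
```
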
 135 =item *
 136
 137 Identifiers within the Perl script may contain Unicode alphanumeric
 138 characters, including ideographs.  (You are currently on your own when
 139 it comes to using the canonical forms of characters--Perl doesn't (yet)
 140 attempt to canonicalize variable names for you.)
 141
 142 =item *
 143
 144 Regular expressions match characters instead of bytes.  For instance,
 145 "." matches a character instead of a byte.  (However, the C<\C> pattern
 146 is provided to force a match of a single byte ("C<char>" in C, hence
 147 C<\C>).)
 148
 149 =item *
 150
 151 Character classes in regular expressions match characters instead of
 152 bytes, and match against the character properties specified in the
 153 Unicode properties database.  So C<\w> can be used to match an ideograph,
 154 for instance.
 155
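The two items above can be sketched with a single ideograph.  The
example leaves out C<\C>, which is not available in every Perl release;
the variable names are illustrative.

```perl
use strict;
use warnings;

my $ideograph = "\x{4E2D}";    # a CJK unified ideograph

# "." consumes the whole character, not just one of its bytes.
my $dot_ok  = $ideograph =~ /^.$/  ? 1 : 0;

# \w consults the Unicode properties, so the ideograph is a word character.
my $word_ok = $ideograph =~ /^\w$/ ? 1 : 0;

# Prints "dot matched: 1, \w matched: 1".
print "dot matched: $dot_ok, \\w matched: $word_ok\n";
```
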
 156 =item *
 157
 158 Named Unicode properties and block ranges may be used as character
 159 classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
 160 match property) constructs.  For instance, C<\p{Lu}> matches any
 161 character with the Unicode uppercase property, while C<\p{M}> matches
 162 any mark character.  Single letter properties may omit the braces,
 163 so that can also be written C<\pM>.  Many predefined character classes
 164 are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.  The
 165 names of the C<In> classes are the official Unicode block names but
 166 with all non-alphanumeric characters removed, for example the block
 167 name C<"Latin-1 Supplement"> becomes C<\p{InLatin1Supplement}>.
 168
 169 Here is the list as of Unicode 3.1.0 (the two-letter classes) and
 170 as defined by Perl (the one-letter classes) (in Unicode materials
 171 what Perl calls C<L> is often called C<L&>):
 172
 173    L  Letter
 174    Lu Letter, Uppercase
 175    Ll Letter, Lowercase
 176    Lt Letter, Titlecase
 177    Lm Letter, Modifier
 178    Lo Letter, Other
 179    M  Mark
 180    Mn Mark, Non-Spacing
 181    Mc Mark, Spacing Combining
 182    Me Mark, Enclosing
 183    N  Number
 184    Nd Number, Decimal Digit
 185    Nl Number, Letter
 186    No Number, Other
 187    P  Punctuation
 188    Pc Punctuation, Connector
 189    Pd Punctuation, Dash
 190    Ps Punctuation, Open
 191    Pe Punctuation, Close
 192    Pi Punctuation, Initial quote
 193        (may behave like Ps or Pe depending on usage)
 194    Pf Punctuation, Final quote
 195        (may behave like Ps or Pe depending on usage)
 196    Po Punctuation, Other
 197    S  Symbol
 198    Sm Symbol, Math
 199    Sc Symbol, Currency
 200    Sk Symbol, Modifier
 201    So Symbol, Other
 202    Z  Separator
 203    Zs Separator, Space
 204    Zl Separator, Line
 205    Zp Separator, Paragraph
 206    C  Other
 207    Cc Other, Control
 208    Cf Other, Format
 209    Cs Other, Surrogate
 210    Co Other, Private Use
 211    Cn Other, Not Assigned (Unicode defines no Cn characters)
 212
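As a small sketch of the general category classes listed above, the
following tests one uppercase letter and one combining mark.  The
sample characters and variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my $delta = "\x{0394}";    # GREEK CAPITAL LETTER DELTA (category Lu)
my $acute = "\x{0301}";    # COMBINING ACUTE ACCENT (category Mn)

my $is_upper = $delta =~ /^\p{Lu}$/ ? 1 : 0;   # two-letter class
my $is_mark  = $acute =~ /^\pM$/    ? 1 : 0;   # one-letter class, no braces
my $not_mark = $delta =~ /^\PM$/    ? 1 : 0;   # \P negates the property

# Prints "Lu: 1, M: 1, not M: 1".
print "Lu: $is_upper, M: $is_mark, not M: $not_mark\n";
```
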
 213 Additionally, because scripts differ in their directionality
 214 (for example Hebrew is written right to left), all characters
 215 have their directionality defined:
 216
 217    BidiL   Left-to-Right
 218    BidiLRE Left-to-Right Embedding
 219    BidiLRO Left-to-Right Override
 220    BidiR   Right-to-Left
 221    BidiAL  Right-to-Left Arabic
 222    BidiRLE Right-to-Left Embedding
 223    BidiRLO Right-to-Left Override
 224    BidiPDF Pop Directional Format
 225    BidiEN  European Number
 226    BidiES  European Number Separator
 227    BidiET  European Number Terminator
 228    BidiAN  Arabic Number
 229    BidiCS  Common Number Separator
 230    BidiNSM Non-Spacing Mark
 231    BidiBN  Boundary Neutral
 232    BidiB   Paragraph Separator
 233    BidiS   Segment Separator
 234    BidiWS  Whitespace
 235    BidiON  Other Neutrals
 236
 237 =head2 Scripts
 238
 239 The scripts available for C<\p{In...}> and C<\P{In...}> are as
 240 follows; for example, C<\p{InCyrillic}>, C<\p{InLatin}>,
 241 or C<\P{InHan}>:
 242
 243    Latin
 244    Greek
 245    Cyrillic
 246    Armenian
 247    Hebrew
 248    Arabic
 249    Syriac
 250    Thaana
 251    Devanagari
 252    Bengali
 253    Gurmukhi
 254    Gujarati
 255    Oriya
 256    Tamil
 257    Telugu
 258    Kannada
 259    Malayalam
 260    Sinhala
 261    Thai
 262    Lao
 263    Tibetan
 264    Myanmar
 265    Georgian
 266    Hangul
 267    Ethiopic
 268    Cherokee
 269    CanadianAboriginal
 270    Ogham
 271    Runic
 272    Khmer
 273    Mongolian
 274    Hiragana
 275    Katakana
 276    Bopomofo
 277    Han
 278    Yi
 279    OldItalic
 280    Gothic
 281    Deseret
 282    Inherited
 283
 284 =head2 Blocks
 285
 286 In addition to B<scripts>, Unicode also defines B<blocks> of
 287 characters.  The difference between scripts and blocks is that the
 288 former concept is closer to natural languages, while the latter
 289 concept is more an artificial grouping based on groups of 256 Unicode
 290 characters.  For example, the C<Latin> script contains letters from
 291 many blocks, but it does not contain all the characters from those
 292 blocks; it does not, for example, contain digits.
 293
 294 For more about scripts see the UTR #24:
 295 http://www.unicode.org/unicode/reports/tr24/
 296 For more about blocks see
 297 http://www.unicode.org/Public/UNIDATA/Blocks.txt
 298
 299 Because there are overlaps in naming (there are, for example, both
 300 a script called C<Katakana> and a block called C<Katakana>), the block
 301 version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
 302
 303 Notice that this definition was introduced in Perl 5.8.0: in Perl
 304 5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the
 305 preferred character class definition; this meant that the
 306 definitions of some character classes changed (the ones in the
 307 list below that have C<Block> appended).
 308
 309    BasicLatin
 310    Latin1Supplement
 311    LatinExtendedA
 312    LatinExtendedB
 313    IPAExtensions
 314    SpacingModifierLetters
 315    CombiningDiacriticalMarks
 316    GreekBlock
 317    CyrillicBlock
 318    ArmenianBlock
 319    HebrewBlock
 320    ArabicBlock
 321    SyriacBlock
 322    ThaanaBlock
 323    DevanagariBlock
 324    BengaliBlock
 325    GurmukhiBlock
 326    GujaratiBlock
 327    OriyaBlock
 328    TamilBlock
 329    TeluguBlock
 330    KannadaBlock
 331    MalayalamBlock
 332    SinhalaBlock
 333    ThaiBlock
 334    LaoBlock
 335    TibetanBlock
 336    MyanmarBlock
 337    GeorgianBlock
 338    HangulJamo
 339    EthiopicBlock
 340    CherokeeBlock
 341    UnifiedCanadianAboriginalSyllabics
 342    OghamBlock
 343    RunicBlock
 344    KhmerBlock
 345    MongolianBlock
 346    LatinExtendedAdditional
 347    GreekExtended
 348    GeneralPunctuation
 349    SuperscriptsandSubscripts
 350    CurrencySymbols
 351    CombiningMarksforSymbols
 352    LetterlikeSymbols
 353    NumberForms
 354    Arrows
 355    MathematicalOperators
 356    MiscellaneousTechnical
 357    ControlPictures
 358    OpticalCharacterRecognition
 359    EnclosedAlphanumerics
 360    BoxDrawing
 361    BlockElements
 362    GeometricShapes
 363    MiscellaneousSymbols
 364    Dingbats
 365    BraillePatterns
 366    CJKRadicalsSupplement
 367    KangxiRadicals
 368    IdeographicDescriptionCharacters
 369    CJKSymbolsandPunctuation
 370    HiraganaBlock
 371    KatakanaBlock
 372    BopomofoBlock
 373    HangulCompatibilityJamo
 374    Kanbun
 375    BopomofoExtended
 376    EnclosedCJKLettersandMonths
 377    CJKCompatibility
 378    CJKUnifiedIdeographsExtensionA
 379    CJKUnifiedIdeographs
 380    YiSyllables
 381    YiRadicals
 382    HangulSyllables
 383    HighSurrogates
 384    HighPrivateUseSurrogates
 385    LowSurrogates
 386    PrivateUse
 387    CJKCompatibilityIdeographs
 388    AlphabeticPresentationForms
 389    ArabicPresentationFormsA
 390    CombiningHalfMarks
 391    CJKCompatibilityForms
 392    SmallFormVariants
 393    ArabicPresentationFormsB
 394    Specials
 395    HalfwidthandFullwidthForms
 396    OldItalicBlock
 397    GothicBlock
 398    DeseretBlock
 399    ByzantineMusicalSymbols
 400    MusicalSymbols
 401    MathematicalAlphanumericSymbols
 402    CJKUnifiedIdeographsExtensionB
 403    CJKCompatibilityIdeographsSupplement
 404    Tags
 405
 406 =item *
 407
 408 The special pattern C<\X> matches any extended Unicode sequence
 409 (a "combining character sequence" in Standardese), where the first
 410 character is a base character and subsequent characters are mark
 411 characters that apply to the base character.  It is equivalent to
 412 C<(?:\PM\pM*)>.
 413
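A sketch of C<\X> on a base character followed by a combining mark.
Later Perl releases define C<\X> more broadly than the equivalence
given above, but this simple case should behave the same either way.
The variable names are illustrative.

```perl
use strict;
use warnings;

# "e" followed by COMBINING ACUTE ACCENT, then a plain "a".
my $text = "e\x{0301}a";

# Each \X match takes a base character together with its marks.
my @sequences = $text =~ /(\X)/g;

# Prints "3 characters, 2 sequences".
printf "%d characters, %d sequences\n", length($text), scalar(@sequences);
```
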
 414 =item *
 415
 416 The C<tr///> operator translates characters instead of bytes.  Note
 417 that the C<tr///CU> functionality has been removed, as the interface
 418 was a mistake.  For similar functionality see C<pack('U0', ...)> and
 419 C<pack('C0', ...)>.
 420
 421 =item *
 422
 423 Case translation operators use the Unicode case translation tables
 424 when provided character input.  Note that C<uc()> translates to
 425 uppercase, while C<ucfirst> translates to titlecase (for languages
 426 that make the distinction).  Naturally the corresponding backslash
 427 sequences have the same semantics.
 428
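The uppercase versus titlecase distinction can be sketched with the
digraph U+01C6, which has separate uppercase and titlecase forms.  The
choice of character and the variable names are illustrative only.

```perl
use strict;
use warnings;

my $dz = "\x{01C6}";         # LATIN SMALL LETTER DZ WITH CARON

my $upper = uc($dz);         # full uppercase form
my $title = ucfirst($dz);    # titlecase form

# Prints "uc: U+01C4, ucfirst: U+01C5".
printf "uc: U+%04X, ucfirst: U+%04X\n", ord($upper), ord($title);
```
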
 429 =item *
 430
 431 Most operators that deal with positions or lengths in the string will
 432 automatically switch to using character positions, including C<chop()>,
 433 C<substr()>, C<pos()>, C<index()>, C<rindex()>, C<sprintf()>,
 434 C<write()>, and C<length()>.  Operators that specifically don't switch
 435 include C<vec()>, C<pack()>, and C<unpack()>.  Operators that really
 436 don't care include C<chomp()>, as well as any other operator that
 437 treats a string as a bucket of bits, such as C<sort()>, and the
 438 operators dealing with filenames.
 439
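A brief sketch of character positions in a string that starts with a
character above 255.  The string and variable names are hypothetical;
the last line also shows C<scalar reverse()> working by character.

```perl
use strict;
use warnings;

my $text = "\x{263A}abc";    # smiley followed by three ASCII letters

my $len      = length($text);          # counts characters: 4
my $middle   = substr($text, 1, 2);    # "ab", offsets are in characters
my $where    = index($text, "b");      # 2, a character position
my $reversed = scalar reverse($text);  # the smiley ends up last, intact

# Prints "length 4, substr ab, index 2, last U+263A".
printf "length %d, substr %s, index %d, last U+%04X\n",
    $len, $middle, $where, ord(substr($reversed, -1));
```
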
 440 =item *
 441
 442 The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change,
 443 since they're often used for byte-oriented formats.  (Again, think
 444 "C<char>" in the C language.)  However, there is a new "C<U>" specifier
 445 that will convert between UTF-8 characters and integers.  (It works
 446 outside of the utf8 pragma too.)
 447
 448 =item *
 449
 450 The C<chr()> and C<ord()> functions work on characters.  This is like
 451 C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and
 452 C<unpack("C")>.  In fact, the latter are how you now emulate
 453 byte-oriented C<chr()> and C<ord()> under utf8.
 454
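The correspondence between C<chr()>/C<ord()> and the "C<U>" specifier
can be sketched as follows (variable names are illustrative):

```perl
use strict;
use warnings;

my $from_chr  = chr(0x263A);          # character from its ordinal
my $from_pack = pack("U", 0x263A);    # the same character via pack

my $same = $from_chr eq $from_pack ? 1 : 0;

# unpack("U") recovers the ordinal, just as ord() does.
my ($ordinal) = unpack("U", $from_chr);

# Prints "same: 1, ord: U+263A, unpack: U+263A".
printf "same: %d, ord: U+%04X, unpack: U+%04X\n",
    $same, ord($from_chr), $ordinal;
```
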
 455 =item *
 456
 457 The bit string operators C<& | ^ ~> can operate on character data.
 458 However, for backward compatibility reasons (bit string operations
 459 when the characters all are less than 256 in ordinal value) one cannot
 460 mix C<~> (the bit complement) and characters both less than 256 and
 461 equal to or greater than 256.  Most importantly, De Morgan's laws
 462 (C<~($x|$y) eq ~$x&~$y>, C<~($x&$y) eq ~$x|~$y>) won't hold.
 463 Another way to look at this is that the complement cannot return
 464 B<both> the 8-bit (byte) wide bit complement, and the full character
 465 wide bit complement.
 466
 467 =item *
 468
 469 And finally, C<scalar reverse()> reverses by character rather than by byte.
 470
 471 =back
 472
 473 =head2 Character encodings for input and output
 474
 475 See L<Encode>.
 476
 477 =head1 CAVEATS
 478
 479 As of yet, there is no method for automatically coercing input and
 480 output to some encoding other than UTF-8 or UTF-EBCDIC.  This is planned
 481 in the near future, however.
 482
 483 Whether an arbitrary piece of data will be treated as "characters" or
 484 "bytes" by internal operations cannot be divined at the current time.
 485
 486 Use of locales with utf8 may lead to odd results.  Currently there is
 487 some attempt to apply 8-bit locale info to characters in the range
 488 0..255, but this is demonstrably incorrect for locales that use
 489 characters above that range (when mapped into Unicode).  It will also
 490 tend to run slower.  Avoidance of locales is strongly encouraged.
 491
 492 =head1 SEE ALSO
 493
 494 L<bytes>, L<utf8>, L<perlretut>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">
 495
 496 =cut