ext/Encode/lib/Encode/Supported.pod

   1 =head1 NAME
   2
   3 Encode::Supported -- Encodings supported by Encode
   4
   5 =head1 DESCRIPTION
   6
   7 =head2 Encoding Names
   8
   9 Encoding names are case insensitive. White space in names
  10 is ignored.  In addition, an encoding may have aliases.
  11 Each encoding has one "canonical" name.  The "canonical"
  12 name is chosen from the names of the encoding by picking
  13 the first in the following sequence (with a few exceptions).
  14
  15 =over 4
  16
  17 =item *
  18
  19 The name used by the Perl community.  That includes 'utf8' and 'ascii'.
  20 Unlike aliases, canonical names reach the method directly, so
  21 frequently used names such as 'utf8' need no alias lookup.
  22
  23 =item *
  24
  25 The MIME name as defined in IETF RFCs.  This includes all "iso-"s.
  26
  27 =item *
  28
  29 The name in the IANA registry.
  30
  31 =item *
  32
  33 The name used by the organization that defined it.
  34
  35 =back
  36
  37 Where a I<de jure> canonical name differs from the one used by the
  38 Encode module, it is always aliased, provided the encoding is
  39 implemented.  So you can safely tell whether a given encoding is
  40 implemented just by passing its canonical name.
  41
  42 Because of all the alias issues, and because in the general case
  43 encodings have state, "Encode" uses an encoding object internally
  44 once an operation is in progress.
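
As a minimal sketch of the lookup described above, assuming a stock Encode
installation: C<find_encoding> returns the encoding object for a canonical
name or an alias, and the object's C<name> method reports the canonical name.
The helper C<canonical_name> is hypothetical and exists only for this
illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return Encode's canonical name for $name,
# or undef when no such encoding (or alias) is implemented.
sub canonical_name {
    my ($name) = @_;
    my $enc = Encode::find_encoding($name);    # encoding object or undef
    return defined $enc ? $enc->name : undef;
}

for my $name (qw(latin1 iso-8859-1 no-such-encoding)) {
    my $canon = canonical_name($name);
    printf "%-18s => %s\n", $name,
      defined $canon ? $canon : '(not implemented)';
}
```

An alias and its canonical name resolve to the same encoding, while an
unknown name yields no object at all, which is what makes the "just pass the
name" test workable.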
  45
  46 =head1 Supported Encodings
  47
  48 As of Perl 5.8.0, at least the following encodings are recognized.
  49 Note that unless otherwise specified, they are all case insensitive
  50 (via alias) and all occurrences of spaces are replaced with '-'.
  51 In other words, "ISO 8859 1" and "iso-8859-1" are identical.
  52
  53 Encodings are categorized and implemented in several different modules
  54 but you don't have to C<use Encode::XX> to make them available for
  55 most cases.  Encode.pm will automatically load those modules on demand.
  56
  57 =head2 Built-in Encodings
  58
  59 The following encodings are always available.
  60
  61   Canonical     Aliases                      Comments & References
  62   ----------------------------------------------------------------
  63   ascii         US-ascii                                    [ECMA]
  64   iso-8859-1    latin1                                       [ISO]
  65   utf8          UTF-8                                    [RFC2279]
  66   ----------------------------------------------------------------
  67
  68 =head2 Encode::Unicode -- other Unicode encodings
  69
  70 Unicode coding schemes other than native utf8 are supported by
  71 Encode::Unicode, which will be autoloaded on demand.
  72
  73   ----------------------------------------------------------------
  74   UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
  75   UCS-2LE                                                     [UC]
  76   UTF-16                                                      [UC]
  77   UTF-16BE                                                    [UC]
  78   UTF-16LE                                                    [UC]
  79   UTF-32                                                      [UC]
  80   UTF-32BE                                                    [UC]
  81   UTF-32LE                                                    [UC]
  82   ----------------------------------------------------------------
  83
  84 To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
  85 see L<Encode::Unicode>.
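
A rough illustration of the byte-order differences, assuming a stock Encode
installation. It encodes the single character 'A' under several of the names
above and prints the resulting octets in hex. The helper C<hex_bytes> is
hypothetical; it is not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex octets.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf( '%02X', ord $_ ) } split //, $bytes;
}

# Encode the same character under each name and show the octets.
for my $name (qw(UTF-16 UTF-16BE UTF-16LE UCS-2BE)) {
    printf "%-8s %s\n", $name, hex_bytes( encode( $name, 'A' ) );
}
```

The BE and LE variants differ only in octet order, while plain UTF-16
prepends a byte order mark when encoding.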
  86
  87 =head2 Encode::Byte -- Extended ASCII
  88
  89 Encode::Byte implements most single-byte encodings except for
  90 Symbols and EBCDIC. The following encodings are based on single-byte
  91 encodings implemented as extended ASCII.  Most of them map
  92 \x80-\xff (upper half) to non-ASCII characters.
  93
  94 =over 4
  95
  96 =item ISO-8859 and corresponding vendor mappings
  97
  98 Since there are so many, they are presented in table format with
  99 languages and corresponding encoding names by vendors.  Note that
 100 the table is sorted in order of ISO-8859 and the corresponding vendor
 101 mappings are slightly different from those of ISO.  See
 102 L<http://czyborra.com/charsets/iso8859.html> for details.
 103
 104   Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
 105   ----------------------------------------------------------------
 106   N. America    (ASCII)         cp437        AdobeStandardEncoding
 107                                 cp863 (DOSCanadaF)
 108   W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
 109                                                          hp-roman8
 110                                 cp860 (DOSPortuguese)
 111   Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
 112                                                 MacCroatian
 113                                                 MacRomanian
 114                                                 MacRumanian
 115   Latin3 [1]    iso-8859-3
 116   Latin4 [2]    iso-8859-4
 117   Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
 118     (See also next section)     cp866           MacUkrainian
 119   Arabic        iso-8859-6      cp864   cp1256  MacArabic
 120                                 cp1006          MacFarsi
 121   Greek         iso-8859-7      cp737   cp1253  MacGreek
 122                                 cp869 (DOSGreek2)
 123   Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
 124   Turkish       iso-8859-9      cp857   cp1254  MacTurkish
 125   Nordics       iso-8859-10     cp865
 126                                 cp861           MacIcelandic
 127                                                 MacSami
 128   Thai          iso-8859-11 [3] cp874           MacThai
 129   (iso-8859-12 is nonexistent. Reserved for Indics?)
 130   Baltics       iso-8859-13     cp775   cp1257
 131   Celtics       iso-8859-14
 132   Latin9 [4]    iso-8859-15
 133   Latin10       iso-8859-16
 134   Vietnamese    viscii                  cp1258  MacVietnamese
 135   ----------------------------------------------------------------
 136
 137   [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
 138   [2] Baltics.  Now on 8859-10, except for Latvian.
 139   [3] Also known as TIS 620.
 140   [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
 141       letters that are missing from 8859-1 were added.
 142
 143 All cp* are also available as ibm-*, ms-*, and windows-* .  See also
 144 L<http://czyborra.com/charsets/codepages.html>.
 145
 146 Macintosh encodings don't seem to be registered in such entities as
 147 IANA.  "Canonical" names in Encode are based upon Apple's Tech Note
 148 1150.  See L<http://developer.apple.com/technotes/tn/tn1150.html>
 149 for details.
 150
 151 =item KOI8 - De Facto Standard for the Cyrillic world
 152
 153 Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
 154 popular on the Net.  L<Encode> comes with the following KOI charsets.
 155 For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.
 156
 157   ----------------------------------------------------------------
 158   koi8-f
 159   koi8-r cp878                                           [RFC1489]
 160   koi8-u                                                 [RFC2319]
 161   ----------------------------------------------------------------
 162
 163 =item gsm0338 - Hentai Latin 1
 164
 165 GSM0338 is for GSM handsets. Though it shares alphanumerics with
 166 ASCII, control character ranges and other parts are mapped very
 167 differently, presumably to store Greek and Cyrillic alphabets.
 168 This is also covered in Encode::Byte even though it is not an
 169 "extended ASCII" encoding.
 170
 171 =back
 172
 173 =head2 CJK: Chinese, Japanese, Korean (Multibyte)
 174
 175 Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
 176 below.  Also note that these are implemented in distinct modules by
 177 country, due to size concerns (simplified Chinese is mapped
 178 to 'CN', continental China, while traditional Chinese is mapped to
 179 'TW', Taiwan).  Please refer to their respective documentation pages.
 180
 181 =over 4
 182
 183 =item Encode::CN -- Continental China
 184
 185   Standard      DOS/Win Macintosh                Comment/Reference
 186   ----------------------------------------------------------------
 187   euc-cn [1]            MacChineseSimp
 188   (gbk)         cp936 [2]
 189   gb12345-raw                      { GB12345 without CES }
 190   gb2312-raw                       { GB2312  without CES }
 191   hz
 192   iso-ir-165
 193   ----------------------------------------------------------------
 194
 195   [1] GB2312 is aliased to this.  See L<Microsoft-related naming mess>
 196   [2] gbk is aliased to this.  See L<Microsoft-related naming mess>
 197
 198 =item Encode::JP -- Japan
 199
 200   Standard      DOS/Win Macintosh                Comment/Reference
 201   ----------------------------------------------------------------
 202   euc-jp
 203   shiftjis      cp932   macJapanese
 204   7bit-jis
 206   iso-2022-jp                                            [RFC1468]
 207   iso-2022-jp-1                                          [RFC2237]
 208   jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
 209   jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
 210   jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
 211   ----------------------------------------------------------------
 212
 213 =item Encode::KR -- Korea
 214
 215   Standard      DOS/Win Macintosh                Comment/Reference
 216   ----------------------------------------------------------------
 217   euc-kr                MacKorean                        [RFC1557]
 218                 cp949 [1]
 219   iso-2022-kr                                            [RFC1557]
 220   johab                                  [KS X 1001:1998, Annex 3]
 221   ksc5601-raw                              { KSC5601 without CES }
 222   ----------------------------------------------------------------
 223
 224   [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
 225   See below.
 226
 227 =item Encode::TW -- Taiwan
 228
 229   Standard      DOS/Win Macintosh                Comment/Reference
 230   ----------------------------------------------------------------
 231   big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
 232   big5-hkscs
 233   ----------------------------------------------------------------
 234
 235 =item Encode::HanExtra -- More Chinese via CPAN
 236
 237 Due to size concerns, additional Chinese encodings below are
 238 distributed separately on CPAN, under the name Encode::HanExtra.
 239
 240   Standard      DOS/Win Macintosh                Comment/Reference
 241   ----------------------------------------------------------------
 242   gb18030
 243   euc-tw
 244   big5plus
 245   ----------------------------------------------------------------
 246
 247 =back
 248
 249 =head2 Miscellaneous encodings
 250
 251 =over 4
 252
 253 =item Encode::EBCDIC
 254
 255 See L<perlebcdic> for details.
 256
 257   ----------------------------------------------------------------
 258   cp37
 259   cp500
 260   cp875
 261   cp1026
 262   cp1047
 263   posix-bc
 264   ----------------------------------------------------------------
 265
 266 =item Encode::Symbol
 267
 268 For symbols and dingbats.
 269
 270   ----------------------------------------------------------------
 271   symbol
 272   dingbats
 273   MacDingbats
 274   AdobeZdingbat
 275   AdobeSymbol
 276   ----------------------------------------------------------------
 277
 278 =back
 279
 280 =head1 Unsupported encodings
 281
 282 The following encodings are not supported as yet; some because they
 283 are rarely used, some because of technical difficulties.  They may
 284 be supported by external modules via CPAN in the future, however.
 285
 286 =over 4
 287
 288 =item   ISO-2022-JP-2 [RFC1554]
 289
 290 Not very popular yet.  Needs the Unicode Database or an equivalent to
 291 implement encode(), because it includes JIS X 0208/0212, KSC5601, and
 292 GB2312 simultaneously, whose code points in Unicode overlap.  So you
 293 need to look up the database to determine to which character set a
 294 given Unicode character should belong.
 295
 296 =item ISO-2022-CN [RFC1922]
 297
 298 Not very popular.  Needs CNS 11643-1 and -2 which are not available in
 299 this module.  CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
 300 Autrijus Tang may add support for this encoding in his module in the future.
 301
 302 =item Various HP-UX encodings
 303
 304 The following are unsupported due to the lack of mapping data.
 305
 306   '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
 307   '15' - japanese15, korean15, and roi15
 308
 309 =item Cyrillic encoding ISO-IR-111
 310
 311 Anton Tagunov doubts its usefulness.
 312
 313 =item ISO-8859-8-1 [Hebrew]
 314
 315 None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
 316 MacHebrew are supported only because mappings were
 317 available at L<http://www.unicode.org/>).  Contributions welcome.
 318
 319 =item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
 320
 321 Ditto.
 322
 323 =item Vietnamese encoding TCVN
 324
 325 Ditto.
 326
 327 =item Vietnamese encodings VPS
 328
 329 Though Jungshik Shin has reported that Mozilla supports this encoding,
 330 it was too late before 5.8.0 for us to add it.  In the future, it
 331 may be available via a separate module.  See
 332 L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
 333 and
 334 L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
 335 if you are interested in helping us.
 336
 337 =item Various Mac encodings
 338
 339 The following are unsupported due to the lack of mapping data.
 340
 341   MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
 342   MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
 343   MacLaotian,   MacMalayalam, MacMongolian, MacOriya
 344   MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
 345   MacVietnamese
 346
 347 The rest which are already available are based upon the vendor mappings
 348 at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
 349
 350 =item (Mac) Indic encodings
 351
 352 The maps for the following are available at L<http://www.unicode.org/>
 353 but remain unsupported because those encodings need an algorithmic
 354 approach, currently unsupported by F<enc2xs>:
 355
 356   MacDevanagari
 357   MacGurmukhi
 358   MacGujarati
 359
 360 For details, please see C<Unicode mapping issues and notes:> at
 361 L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
 362
 363 I believe this issue is prevalent not only for Mac Indics but also in
 364 other Indic encodings, but the above were the only Indic encoding
 365 maps that I could find at L<http://www.unicode.org/> .
 366
 367 =back
 368
 369 =head1 Encoding vs. Charset -- terminology
 370
 371 We are used to using the term (character) I<encoding> and I<character
 372 set> interchangeably.  But just as confusing the terms byte and
 373 character is dangerous and the terms should be differentiated when
 374 needed, we need to differentiate I<encoding> and I<character set>.
 375
 376 To understand that, here is a description of how we make computers
 377 grok our characters.
 378
 379 =over 4
 380
 381 =item *
 382
 383 First we start with which characters to include.  We call this
 384 collection of characters I<character repertoire>.
 385
 386 =item *
 387
 388 Then we have to give each character a unique ID so your computer can
 389 tell the difference between 'a' and 'A'.  This itemized character
 390 repertoire is now a I<character set>.
 391
 392 =item *
 393
 394 If your computer can grok the character set without further
 395 processing, you can go ahead and use it.  This is called a I<coded
 396 character set> (CCS) or I<raw character encoding>.  ASCII is used this
 397 way for most cases.
 398
 399 =item *
 400
 401 But in many cases, especially multi-byte CJK encodings, you have to
 402 tweak a little more.  Your network connection may not accept any data
 403 with the Most Significant Bit set, and your computer may not be able to
 404 tell if a given byte is a whole character or just half of it.  So you
 405 have to I<encode> the character set to use it.
 406
 407 A I<character encoding scheme> (CES) determines how to encode a given
 408 character set, or a set of multiple character sets.  7bit ISO-2022 is
 409 an example of a CES.  You switch between character sets via I<escape
 410 sequences>.
 411
 412 =back
 413
 414 Technically, or mathematically, speaking, a character set encoded in
 415 such a CES that maps character by character may form a CCS.  EUC is such
 416 an example.  The CES of EUC is as follows:
 417
 418 =over 4
 419
 420 =item *
 421
 422 Map ASCII unchanged.
 423
 424 =item *
 425
 426 Map a character set that consists of 94**N or 96**N members (94 or
 427 96 to the power of N) by adding 0x80 to each byte.
 428
 429 =item *
 430
 431 You can also use 0x8e and 0x8f to indicate that the following sequence of
 432 characters belongs to yet another character set.  0x80 is added to
 433 each following byte.
 434
 435 =back
 436
 437 By carefully looking at the encoded byte sequence, you can find that
 438 each byte sequence forms a unique number.  In that sense, EUC is a CCS
 439 generated by the CES above from up to four CCSs (complicated?).  UTF-8
 440 falls into this category.  See L<perlunicode/"UTF-8"> to find out how
 441 UTF-8 maps Unicode to a byte sequence.
 442
 443 You may also have found out by now why 7bit ISO-2022 cannot comprise
 444 a CCS.  If you look at a byte sequence \x21\x21, you can't tell if
 445 it is two !'s or IDEOGRAPHIC SPACE.  EUC maps the latter to \xA1\xA1
 446 so you have no trouble differentiating between "!!" and S<"  ">.
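
The "add 0x80" rule and the resulting ambiguity can be sketched as follows,
assuming a stock Encode installation with C<euc-jp> available. The code
encodes IDEOGRAPHIC SPACE (U+3000) as EUC-JP, then clears the high bit of
each byte to recover the 7-bit form and compares it with the bytes of "!!".
The variable names are chosen for this illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# IDEOGRAPHIC SPACE (U+3000) encoded as EUC-JP.
my $euc = encode( 'euc-jp', "\x{3000}" );

# Undo the "add 0x80" step by clearing the high bit of each byte.
my $raw = join '', map { chr( ord($_) & 0x7F ) } split //, $euc;

# ASCII passes through EUC unchanged.
my $bangs = encode( 'euc-jp', '!!' );

printf "euc-jp U+3000 : %s\n", unpack 'H*', $euc;
printf "high bit off  : %s\n", unpack 'H*', $raw;
printf "euc-jp '!!'   : %s\n", unpack 'H*', $bangs;
print "7-bit form collides with '!!'\n" if $raw eq $bangs;
```

With the high bit set the two-byte character cannot be mistaken for ASCII;
without it, only an escape sequence can tell the reader which character set
is in effect.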
 447
 448 =head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
 449
 450 This section tries to classify the supported encodings by their
 451 applicability for information exchange over the Internet and to
 452 choose the most suitable aliases to name them in the context of
 453 such communication.
 454
 455 =over 4
 456
 457 =item *
 458
 459 To (en|de)code encodings marked by C<(**)>, you need
 460 C<Encode::HanExtra>, available from CPAN.
 461
 462 =back
 463
 464 Encoding names
 465
 466   US-ASCII    UTF-8    ISO-8859-*  KOI8-R
 467   Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
 468   EUC-KR      Big5     GB2312
 469
 470 are registered with IANA as preferred MIME names and may
 471 be used over the Internet.
 472
 473 C<Shift_JIS> has been officialized by JIS X 0208:1997.
 474 L<Microsoft-related naming mess> gives details.
 475
 476 C<GB2312> is the IANA name for C<EUC-CN>.
 477 See L<Microsoft-related naming mess> for details.
 478
 479 C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
 480 with Encode. See L<Encode::CN> for details.
 481
 482   EUC-CN
 483   KOI8-U        [RFC2319]
 484
 485 have not been registered with IANA (as of March 2002) but
 486 seem to be supported by major web browsers.
 487 The IANA name for C<EUC-CN> is C<GB2312>.
 488
 489   KS_C_5601-1987
 490
 491 is heavily misused.
 492 See L<Microsoft-related naming mess> for details.
 493
 494 C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
 495 with Encode. See L<Encode::KR> for details.
 496
 497   UTF-16 UTF-16BE UTF-16LE
 498
 499 are IANA-registered C<charset>s. See [RFC 2781] for details.
 500 Jungshik Shin reports that UTF-16 with a BOM is well accepted
 501 by MS IE 5/6 and NS 4/6. Beware however that
 502
 503 =over 4
 504
 505 =item *
 506
 507 C<UTF-16> support in any software you're going to be
 508 using/interoperating with has probably been less tested
 509 than C<UTF-8> support
 510
 511 =item *
 512
 513 C<UTF-8> coded data seamlessly passes traditional
 514 command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
 515 data is likely to cause confusion (with its zero bytes,
 516 for example)
 517
 518 =item *
 519
 520 it is beyond the power of words to describe the way HTML browsers
 521 encode non-C<ASCII> form data. To get a general impression, visit
 522 L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
 523 While encoding of form data has stabilized for C<UTF-8> encoded pages
 524 (at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
 525 expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
 526 pages!
 527
 528 =back
 529
 530 The rule of thumb is to use C<UTF-8> unless you know what
 531 you're doing and unless you really benefit from using C<UTF-16>.
 532
 533
 534   ISO-IR-165    [RFC1345]
 535   VISCII
 536   GB 12345
 537   GB 18030 (**)  (see links below)
 538   EUC-TW   (**)
 539
 540 are totally valid encodings but not registered at IANA.
 541 The names under which they are listed here are probably the
 542 most widely-known names for these encodings and are recommended
 543 names.
 544
 545   BIG5PLUS (**)
 546
 547 is a proprietary name.
 548
 549 =head2 Microsoft-related naming mess
 550
 551 Microsoft products misuse the following names:
 552
 553 =over 4
 554
 555 =item KS_C_5601-1987
 556
 557 Microsoft extension to C<EUC-KR>.
 558
 559 Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).
 560
 561 See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
 562 for details.
 563
 564 Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
 565 misusage. I<Raw> C<KS_C_5601-1987> encoding is available as
 566 C<ksc5601-raw>.
 567
 568 See L<Encode::KR> for details.
 569
 570 =item GB2312
 571
 572 Microsoft extension to C<EUC-CN>.
 573
 574 Proper names: C<CP936>, C<GBK>.
 575
 576 C<GB2312> has been registered in the C<EUC-CN> meaning at
 577 IANA. This has partially repaired the situation: Microsoft's
 578 C<GB2312> has become a superset of the official C<GB2312>.
 579
 580 Encode aliases C<GB2312> to C<euc-cn> in full agreement with
 581 IANA registration. C<cp936> is supported separately.
 582 I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
 583
 584 See L<Encode::CN> for details.
 585
 586 =item Big5
 587
 588 Microsoft extension to C<Big5>.
 589
 590 Proper name: C<CP950>.
 591
 592 Encode separately supports C<Big5> and C<cp950>.
 593
 594 =item Shift_JIS
 595
 596 Microsoft's understanding of C<Shift_JIS>.
 597
 598 JIS has not endorsed the full Microsoft standard however.
 599 The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
 600 character sets, while Microsoft has always used C<Shift_JIS>
 601 to encode a wider character repertoire. See C<IANA> registration for
 602 C<Windows-31J>.
 603
 604 As a historical predecessor, Microsoft's variant
 605 probably has a stronger claim to the name, though it may be objected
 606 that Microsoft shouldn't have used JIS as part of the name
 607 in the first place.
 608
 609 Unambiguous name: C<CP932>. C<IANA> name (not used?): C<Windows-31J>.
 610
 611 Encode separately supports C<Shift_JIS> and C<cp932>.
 612
 613 =back
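
To see how the aliasing described in this section plays out, a short sketch
(assuming a stock Encode installation with the CJK modules available) can
print the canonical name that Encode resolves each label to. The list of
labels is chosen for illustration only.

```perl
use strict;
use warnings;
use Encode ();

# Print the canonical Encode name behind each commonly misused label.
for my $label (qw(GB2312 GBK KS_C_5601-1987 Shift_JIS cp932)) {
    my $enc = Encode::find_encoding($label);    # encoding object or undef
    printf "%-15s => %s\n", $label,
      $enc ? $enc->name : '(not implemented)';
}
```

Per the text above, C<GB2312> should resolve to C<euc-cn> and
C<KS_C_5601-1987> to C<cp949>. When the Microsoft variant is what you
actually have, name it explicitly (C<cp936>, C<cp932>, and so on).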
 614
 615 =head1 Glossary
 616
 617 =over 4
 618
 619 =item character repertoire
 620
 621 A collection of unique characters.  A I<character> set in the strictest
 622 sense. At this stage, characters are not numbered.
 623
 624 =item coded character set (CCS)
 625
 626 A character set that is mapped in a way computers can use directly.
 627 Many character encodings, including EUC, fall in this category.
 628
 629 =item character encoding scheme (CES)
 630
 631 An algorithm to map a character set to a byte sequence.  You don't
 632 have to be able to tell which character set a given byte sequence
 633 belongs to.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an
 634 example of being both a CCS and CES.
 635
 636 =item charset (in MIME context)
 637
 638 has long been used in the meaning of C<encoding>, CES.
 639
 640 While the word combination C<character set> has lost this meaning
 641 in MIME context since [RFC 2130], the C<charset> abbreviation has
 642 retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:
 643
 644  This document uses the term "charset" to mean a set of rules for
 645  mapping from a sequence of octets to a sequence of characters, such
 646  as the combination of a coded character set and a character encoding
 647  scheme; this is also what is used as an identifier in MIME "charset="
 648  parameters, and registered in the IANA charset registry ...  (Note
 649  that this is NOT a term used by other standards bodies, such as ISO).
 650                                                [RFC 2277]
 651
 652 =item EUC
 653
 654 Extended Unix Code.  See ISO-2022.
 655
 656 =item ISO-2022
 657
 658 A CES that was carefully designed to coexist with ASCII.  It comes in
 659 a 7-bit version and an 8-bit version.
 660
 661 The 7 bit version switches character set via escape sequence so it
 662 cannot form a CCS.  Since this is more difficult to handle in programs
 663 than the 8 bit version, the 7 bit version is not very popular except for
 664 iso-2022-jp, the I<de facto> standard CES for e-mails.
 665
 666 The 8 bit version can form a CCS.  EUC and ISO-8859 are two examples
 667 thereof.  Pre-5.6 perl could use them as string literals.
 668
 669 =item UCS
 670
 671 Short for I<Universal Character Set>.  When you say just UCS, it means
 672 I<Unicode>.
 673
 674 =item UCS-2
 675
 676 ISO/IEC 10646 encoding form: Universal Character Set coded in two
 677 octets.
 678
 679 =item Unicode
 680
 681 A character set that aims to include all character repertoires of the
 682 world.  Many character sets in various national as well as industrial
 683 standards have become, in a way, just subsets of Unicode.
 684
 685 =item UTF
 686
 687 Short for I<Unicode Transformation Format>.  Determines how to map a
 688 Unicode character into a byte sequence.
 689
 690 =item UTF-16
 691
 692 A UTF in 16-bit encoding.  Can either be in big endian or little
 693 endian.  The big endian version is called UTF-16BE (equal to UCS-2 +
 694 surrogate support) and the little endian version is called UTF-16LE.
 695
 696 =back
 697
 698 =head1 See Also
 699
 700 L<Encode>,
 701 L<Encode::Byte>,
 702 L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
 703 L<Encode::EBCDIC>, L<Encode::Symbol>
 704
 705 =head1 References
 706
 707 =over 4
 708
 709 =item ECMA
 710
 711 European Computer Manufacturers Association
 712 L<http://www.ecma.ch>
 713
 714 =over 4
 715
 716 =item ECMA-035 (eq C<ISO-2022>)
 717
 718 L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
 719
 720 The specification of ISO-2022 is available from the link above.
 721
 722 =back
 723
 724 =item IANA
 725
 726 Internet Assigned Numbers Authority
 727 L<http://www.iana.org/>
 728
 729 =over 4
 730
 731 =item Assigned Charset Names by IANA
 732
 733 L<http://www.iana.org/assignments/character-sets>
 734
 735 Most of the C<canonical names> in Encode derive from this list
 736 so you can directly apply the string you have extracted from MIME
 737 header of mails and web pages.
 738
 739 =back
 740
 741 =item ISO
 742
 743 International Organization for Standardization
 744 L<http://www.iso.ch/>
 745
 746 =item RFC
 747
 748 Request For Comments -- need I say more?
 749 L<http://www.rfc-editor.org/>, L<http://www.rfc.net/>,
 750 L<http://www.faqs.org/rfcs/>
 751
 752 =item UC
 753
 754 Unicode Consortium
 755 L<http://www.unicode.org/>
 756
 757 =over 4
 758
 759 =item Unicode Glossary
 760
 761 L<http://www.unicode.org/glossary/>
 762
 763 The glossary of this document is based upon this site.
 764
 765 =back
 766
 767 =back
 768
 769 =head2 Other Notable Sites
 770
 771 =over 4
 772
 773 =item czyborra.com
 774
 775 L<http://czyborra.com/>
 776
 777 Contains a lot of useful information, especially gory details of ISO
 778 vs. vendor mappings.
 779
 780 =item CJK.inf
 781
 782 L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
 783
 784 Somewhat obsolete (last update in 1996), but still useful.  Also try
 785
 786 L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
 787
 788 You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.
 789
 790 =item Jungshik Shin's Hangul FAQ
 791
 792 L<http://jshin.net/faq>
 793
 794 And especially its subject 8.
 795
 796 L<http://jshin.net/faq/qa8.html>
 797
 798 A comprehensive overview of the Korean (C<KS *>) standards.
 799
 800 =item debian.org: "Introduction to i18n"
 801
 802 A brief description for most of the mentioned CJK encodings is
 803 contained in
 804 L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
 805
 806 =back
 807
 808 =head2 Offline sources
 809
 810 =over 4
 811
 812 =item C<CJKV Information Processing> by Ken Lunde
 813
 814 CJKV Information Processing
 815 1999 O'Reilly & Associates, ISBN : 1-56592-224-7
 816
 817 The modern successor of C<CJK.inf>.
 818
 819 Features a comprehensive coverage of CJKV character sets and
 820 encodings along with many other issues faced by anyone trying
 821 to better support CJKV languages/scripts in all the areas of
 822 information processing.
 823
 824 To purchase this book, visit
 825 L<http://www.oreilly.com/catalog/cjkvinfo/>
 826 or your favourite bookstore.
 827
 828 =back
 829
 830 =cut