ext/Encode/lib/Encode/Supported.pod

   1 =head1 NAME
   2
   3 Encode::Supported -- Encodings supported by Encode
   4
   5 =head1 DESCRIPTION
   6
   7 =head2 Encoding Names
   8
   9 Encoding names are case insensitive. White space in names
  10 is ignored.  In addition, an encoding may have aliases.
  11 Each encoding has one "canonical" name.  The "canonical"
  12 name is chosen from the names of the encoding by picking
  13 the first in the following sequence (with a few exceptions).
  14
  15 =over 4
  16
  17 =item *
  18
  19 The name used by the Perl community.  That includes 'utf8' and 'ascii'.
  20 Unlike aliases, canonical names directly reach the method so such
  21 frequently used words like 'utf8' don't need to do alias lookups.
  22
  23 =item *
  24
  25 The MIME name as defined in IETF RFCs.  This includes all "iso-"s.
  26
  27 =item *
  28
  29 The name in the IANA registry.
  30
  31 =item *
  32
  33 The name used by the organization that defined it.
  34
  35 =back
  36
  37 Where I<de jure> canonical names differ from those used by the Encode
  38 module, they are always aliased, provided the encoding is implemented.
  39 So you can safely tell whether a given encoding is implemented just by
  40 passing the canonical name.
  41
  42 Because of all the alias issues, and because in the general case
  43 encodings have state, "Encode" uses an encoding object internally
  44 once an operation is in progress.
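
The lookup rules above can be tried out with C<find_encoding> and
C<resolve_alias>, both of which Encode exports on request.  This is only
a sketch: the sample names are picked for illustration, and the last one
is deliberately bogus.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# Sample names chosen for illustration; the last one is deliberately bogus.
my @names = ('latin1', 'Latin1', 'US-ascii', 'no-such-encoding');

my %canonical;
for my $name (@names) {
    # resolve_alias() returns the canonical name, or a false value
    # when Encode does not know the encoding.
    my $canon = resolve_alias($name);
    $canonical{$name} = $canon;
    printf "%-18s => %s\n", $name, $canon ? $canon : '(not supported)';
}

# find_encoding() returns the encoding object that Encode uses internally.
my $obj = find_encoding('latin1');
print "object name: ", $obj->name, "\n";
```

A false return value from C<resolve_alias> is the simplest way to tell
that an encoding is not available.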
  45
  46 =head1 Supported Encodings
  47
  48 As of Perl 5.8.0, at least the following encodings are recognized.
  49 Note that unless otherwise specified, they are all case insensitive
  50 (via alias) and all occurrences of spaces are replaced with '-'.
  51 In other words, "ISO 8859 1" and "iso-8859-1" are identical.
  52
  53 Encodings are categorized and implemented in several different modules
  54 but you don't have to C<use Encode::XX> to make them available for
  55 most cases.  Encode.pm will automatically load those modules on demand.
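
A short sketch of on-demand loading, assuming a stock Encode
installation: C<< Encode->encodings(":all") >> lists every encoding
Encode knows about, while the call without arguments lists only those
already loaded.  The sample character is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# ":all" lists every encoding Encode knows about, loaded or not.
my @all   = Encode->encodings(':all');
my $known = grep { $_ eq 'euc-jp' } @all;

# No "use Encode::JP" here: the module is pulled in on first use.
my $octets = encode('euc-jp', "\x{3042}");    # HIRAGANA LETTER A

# Without arguments, encodings() lists only what is currently loaded.
my $loaded = grep { $_ eq 'euc-jp' } Encode->encodings();

printf "known: %s, loaded after first use: %s, octets: %d\n",
    ($known ? 'yes' : 'no'), ($loaded ? 'yes' : 'no'), length $octets;
```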
  56
  57 =head2 Built-in Encodings
  58
  59 The following encodings are always available.
  60
  61   Canonical     Aliases                      Comments & References
  62   ----------------------------------------------------------------
  63   ascii         US-ascii                                    [ECMA]
  64   ascii-ctrl                                      Special Encoding
  65   iso-8859-1    latin1                                       [ISO]
  66   null                                            Special Encoding
  67   utf8          UTF-8                                    [RFC2279]
  68   ----------------------------------------------------------------
  69
  70 I<null> and I<ascii-ctrl> are special.  "null" fails for all characters,
  71 so when you set fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
  72 CHARACTERS will fall back to character references.  Ditto for
  73 "ascii-ctrl" except for control characters.  For fallback modes, see
  74 L<Encode>.
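
As a rough illustration of the fallback modes named above, the sketch
below encodes a sample string to "ascii" and to "null".  The sample text
is an assumption made for the example; C<LEAVE_SRC> is added so that
C<encode> does not modify the source string in place.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{e9}";    # "cafe" with e-acute (U+00E9); sample data

# LEAVE_SRC keeps encode() from modifying $text in place.
my $html = encode('ascii', $text, Encode::FB_HTMLCREF() | Encode::LEAVE_SRC());
my $xml  = encode('ascii', $text, Encode::FB_XMLCREF()  | Encode::LEAVE_SRC());
my $qq   = encode('ascii', $text, Encode::FB_PERLQQ()   | Encode::LEAVE_SRC());

# "null" maps nothing, so every character should take the fallback path.
my $null = encode('null', $text, Encode::FB_HTMLCREF() | Encode::LEAVE_SRC());

print "$_\n" for $html, $xml, $qq, $null;
```

With "ascii" only the accented letter is replaced by a character
reference; with "null" the whole string is.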
  75
  76 =head2 Encode::Unicode -- other Unicode encodings
  77
  78 Unicode coding schemes other than native utf8 are supported by
  79 Encode::Unicode, which will be autoloaded on demand.
  80
  81   ----------------------------------------------------------------
  82   UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
  83   UCS-2LE                                                     [UC]
  84   UTF-16                                                      [UC]
  85   UTF-16BE                                                    [UC]
  86   UTF-16LE                                                    [UC]
  87   UTF-32                                                      [UC]
  88   UTF-32BE      UCS-4                                         [UC]
  89   UTF-32LE                                                    [UC]
  90   ----------------------------------------------------------------
  91
  92 To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
  93 see L<Encode::Unicode>.
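
For a quick look at the difference, this sketch dumps the octets that
each scheme produces for the single character "A" (U+0041).  Only
C<encode> from Encode is used; the choice of character is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of "A" (U+0041) under several Unicode encoding schemes.
my %hex;
for my $name (qw(UCS-2BE UTF-16 UTF-16BE UTF-16LE UTF-32BE UTF-32LE)) {
    $hex{$name} = unpack 'H*', encode($name, 'A');
    printf "%-9s %s\n", $name, $hex{$name};
}
```

Note that plain UTF-16 prepends a byte order mark when encoding, while
the BE and LE variants do not.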
  94
  95 =head2 Encode::Byte -- Extended ASCII
  96
  97 Encode::Byte implements most single-byte encodings except for
  98 Symbols and EBCDIC. The following encodings are based on single-byte
  99 encodings implemented as extended ASCII.  Most of them map
 100 \x80-\xff (upper half) to non-ASCII characters.
 101
 102 =over 4
 103
 104 =item ISO-8859 and corresponding vendor mappings
 105
 106 Since there are so many, they are presented in table format with
 107 languages and corresponding encoding names by vendors.  Note that
 108 the table is sorted in order of ISO-8859 and the corresponding vendor
 109 mappings are slightly different from those of ISO.  See
 110 L<http://czyborra.com/charsets/iso8859.html> for details.
 111
 112   Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
 113   ----------------------------------------------------------------
 114   N. America    (ASCII)         cp437        AdobeStandardEncoding
 115                                 cp863 (DOSCanadaF)
 116   W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
 117                                                          hp-roman8
 118                                 cp860 (DOSPortuguese)
 119   Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
 120                                                 MacCroatian
 121                                                 MacRomanian
 122                                                 MacRumanian
 123   Latin3 [1]    iso-8859-3
 124   Latin4 [2]    iso-8859-4
 125   Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
 126     (See also next section)     cp866           MacUkrainian
 127   Arabic        iso-8859-6      cp864   cp1256  MacArabic
 128                                 cp1006          MacFarsi
 129   Greek         iso-8859-7      cp737   cp1253  MacGreek
 130                                 cp869 (DOSGreek2)
 131   Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
 132   Turkish       iso-8859-9      cp857   cp1254  MacTurkish
 133   Nordics       iso-8859-10     cp865
 134                                 cp861           MacIcelandic
 135                                                 MacSami
 136   Thai          iso-8859-11 [3] cp874           MacThai
 137   (iso-8859-12 is nonexistent. Reserved for Indics?)
 138   Baltics       iso-8859-13     cp775   cp1257
 139   Celtics       iso-8859-14
 140   Latin9 [4]    iso-8859-15
 141   Latin10       iso-8859-16
 142   Vietnamese    viscii                  cp1258  MacVietnamese
 143   ----------------------------------------------------------------
 144
 145   [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
 146   [2] Baltics.  Now on 8859-10, except for Latvian.
 147   [3] Also known as TIS 620.
 148   [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
 149       letters that are missing from 8859-1 were added.
 150
 151 All cp* are also available as ibm-*, ms-*, and windows-* .  See also
 152 L<http://czyborra.com/charsets/codepages.html>.
 153
 154 Macintosh encodings don't seem to be registered in such entities as
 155 IANA.  "Canonical" names in Encode are based upon Apple's Tech Note
 156 1150.  See L<http://developer.apple.com/technotes/tn/tn1150.html>
 157 for details.
 158
 159 =item KOI8 - De Facto Standard for the Cyrillic world
 160
 161 Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
 162 popular on the Net.  L<Encode> comes with the following KOI charsets.
 163 For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.
 164
 165   ----------------------------------------------------------------
 166   koi8-f
 167   koi8-r cp878                                           [RFC1489]
 168   koi8-u                                                 [RFC2319]
 169   ----------------------------------------------------------------
 170
 171 =item gsm0338 - Hentai Latin 1
 172
 173 GSM0338 is for GSM handsets. Though it shares alphanumerals with
 174 ASCII, control character ranges and other parts are mapped very
 175 differently, presumably to store Greek and Cyrillic alphabets.
 176 This is also covered in Encode::Byte even though it is not an
 177 "extended ASCII" encoding.
 178
 179 =back
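
The sketch below shows why the choice among these single-byte encodings
matters: the same character lands on a different byte in each of them.
The two sample characters are arbitrary, and C<from_to> is the in-place
converter exported by Encode.

```perl
use strict;
use warnings;
use Encode qw(encode from_to);

# The same character lands on different bytes in different encodings.
my %byte;
for my $name (qw(iso-8859-1 cp1252 cp437 MacRoman)) {
    $byte{$name} = unpack 'H*', encode($name, "\x{e9}");     # e-acute
}
for my $name (qw(iso-8859-5 koi8-r cp1251)) {
    $byte{$name} = unpack 'H*', encode($name, "\x{410}");    # CYRILLIC CAPITAL A
}
printf "%-10s %s\n", $_, $byte{$_} for sort keys %byte;

# from_to() converts octets in place, here from KOI8-R to cp1251.
my $octets = "\xE1";
from_to($octets, 'koi8-r', 'cp1251');
printf "koi8-r E1 becomes cp1251 %s\n", unpack 'H*', $octets;
```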
 180
 181 =head2 CJK: Chinese, Japanese, Korean (Multibyte)
 182
 183 Note that Vietnamese is listed above.  Also read "Encoding vs Charset"
 184 below.  Also note that these are implemented in distinct modules by
 185 countries, due to size concerns (simplified Chinese is mapped
 186 to 'CN', continental China, while traditional Chinese is mapped to
 187 'TW', Taiwan).  Please refer to their respective documentation pages.
 188
 189 =over 4
 190
 191 =item Encode::CN -- Continental China
 192
 193   Standard      DOS/Win Macintosh                Comment/Reference
 194   ----------------------------------------------------------------
 195   euc-cn [1]            MacChineseSimp
 196   (gbk)         cp936 [2]
 197   gb12345-raw                      { GB12345 without CES }
 198   gb2312-raw                       { GB2312  without CES }
 199   hz
 200   iso-ir-165
 201   ----------------------------------------------------------------
 202
 203   [1] GB2312 is aliased to this.  See L<Microsoft-related naming mess>
 204   [2] gbk is aliased to this.  See L<Microsoft-related naming mess>
 205
 206 =item Encode::JP -- Japan
 207
 208   Standard      DOS/Win Macintosh                Comment/Reference
 209   ----------------------------------------------------------------
 210   euc-jp
 211   shiftjis      cp932   macJapanese
 212   7bit-jis
 213   iso-2022-jp                                            [RFC1468]
 214   iso-2022-jp-1                                          [RFC2237]
 215   jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
 216   jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
 217   jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
 218   ----------------------------------------------------------------
 219
 220 =item Encode::KR -- Korea
 221
 222   Standard      DOS/Win Macintosh                Comment/Reference
 223   ----------------------------------------------------------------
 224   euc-kr                MacKorean                        [RFC1557]
 225                 cp949 [1]
 226   iso-2022-kr                                            [RFC1557]
 227   johab                                  [KS X 1001:1998, Annex 3]
 228   ksc5601-raw                              { KSC5601 without CES }
 229   ----------------------------------------------------------------
 230
 231   [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
 232   See below.
 233
 234 =item Encode::TW -- Taiwan
 235
 236   Standard      DOS/Win Macintosh                Comment/Reference
 237   ----------------------------------------------------------------
 238   big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
 239   big5-hkscs
 240   ----------------------------------------------------------------
 241
 242 =item Encode::HanExtra -- More Chinese via CPAN
 243
 244 Due to size concerns, additional Chinese encodings below are
 245 distributed separately on CPAN, under the name Encode::HanExtra.
 246
 247   Standard      DOS/Win Macintosh                Comment/Reference
 248   ----------------------------------------------------------------
 249   big5ext                                   CMEX's Big5e Extension
 250   big5plus                                  CMEX's Big5+ Extension
 251   cccii         Chinese Character Code for Information Interchange
 252   euc-tw                             EUC (Extended Unix Character)
 253   gb18030                          GBK with Traditional Characters
 254   ----------------------------------------------------------------
 255
 256 =item Encode::JIS2K -- JIS X 0213 encodings via CPAN
 257
 258 Due to size concerns, additional Japanese encodings below are
 259 distributed separately on CPAN, under the name Encode::JIS2K.
 260
 261   Standard      DOS/Win Macintosh                Comment/Reference
 262   ----------------------------------------------------------------
 263   euc-jisx0213
 264   shiftjisx0213
 265   iso-2022-jp-3
 266   jis0213-1-raw
 267   jis0213-2-raw
 268   ----------------------------------------------------------------
 269
 270 =back
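
As an illustration of the multibyte encodings listed above, this sketch
encodes a two-character Japanese string with three of the Encode::JP
encodings and dumps the resulting octets.  The sample text is an
assumption made for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $nihon = "\x{65e5}\x{672c}";    # "Japan" in kanji; sample text

my %hex = map { $_ => unpack('H*', encode($_, $nihon)) }
          qw(euc-jp shiftjis iso-2022-jp);
printf "%-12s %s\n", $_, $hex{$_} for sort keys %hex;

# Decoding gets the original characters back.
my $round_trip = decode('shiftjis', encode('shiftjis', $nihon));
print "round trip ok\n" if $round_trip eq $nihon;
```

The same two characters take four octets in euc-jp and shiftjis but ten
in iso-2022-jp, because the latter brackets them with escape sequences.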
 271
 272 =head2 Miscellaneous encodings
 273
 274 =over 4
 275
 276 =item Encode::EBCDIC
 277
 278 See L<perlebcdic> for details.
 279
 280   ----------------------------------------------------------------
 281   cp37
 282   cp500
 283   cp875
 284   cp1026
 285   cp1047
 286   posix-bc
 287   ----------------------------------------------------------------
 288
 289 =item Encode::Symbol
 290
 291 For symbols and dingbats.
 292
 293   ----------------------------------------------------------------
 294   symbol
 295   dingbats
 296   MacDingbats
 297   AdobeZdingbat
 298   AdobeSymbol
 299   ----------------------------------------------------------------
 300
 301 =item Encode::MIME::Header
 302
 303 Strictly speaking, the MIME header encoding documented in RFC 2047 is
 304 more an encapsulation than an encoding, but it is included anyway.
 305
 306   ----------------------------------------------------------------
 307   MIME-Header                                            [RFC2047]
 308   MIME-B                                                 [RFC2047]
 309   MIME-Q                                                 [RFC2047]
 310   ----------------------------------------------------------------
 311
 312 =item Encode::Guess
 313
 314 This is not the name of an encoding but a utility that lets you pick
 315 the most appropriate encoding for given data out of a list of
 316 I<suspects>.  See L<Encode::Guess> for details.
 317
 318 =back
 319
 320 =head1 Unsupported encodings
 321
 322 The following encodings are not supported as yet; some because they
 323 are rarely used, some because of technical difficulties.  They may
 324 be supported by external modules via CPAN in the future, however.
 325
 326 =over 4
 327
 328 =item ISO-2022-JP-2 [RFC1554]
 329
 330 Not very popular yet.  Needs the Unicode Database or an equivalent to
 331 implement encode(), because it includes JIS X 0208/0212, KSC5601, and
 332 GB2312 simultaneously, whose code points in Unicode overlap.  You
 333 need to look up the database to determine which character set a given
 334 Unicode character should belong to.
 335
 336 =item ISO-2022-CN [RFC1922]
 337
 338 Not very popular.  Needs CNS 11643-1 and -2 which are not available in
 339 this module.  CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
 340 Autrijus Tang may add support for this encoding in his module in the future.
 341
 342 =item Various HP-UX encodings
 343
 344 The following are unsupported due to the lack of mapping data.
 345
 346   '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
 347   '15' - japanese15, korean15, and roi15
 348
 349 =item Cyrillic encoding ISO-IR-111
 350
 351 Anton Tagunov doubts its usefulness.
 352
 353 =item ISO-8859-8-I [Hebrew]
 354
 355 None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
 356 MacHebrew are supported only because there were mappings
 357 available at L<http://www.unicode.org/>).  Contributions welcome.
 358
 359 =item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
 360
 361 Ditto.
 362
 363 =item Vietnamese encoding TCVN
 364
 365 Ditto.
 366
 367 =item Vietnamese encodings VPS
 368
 369 Though Jungshik Shin has reported that Mozilla supports this encoding,
 370 the report came too late for us to add it before 5.8.0.  In the future, it
 371 may be available via a separate module.  See
 372 L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
 373 and
 374 L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
 375 if you are interested in helping us.
 376
 377 =item Various Mac encodings
 378
 379 The following are unsupported due to the lack of mapping data.
 380
 381   MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
 382   MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
 383   MacLaotian,   MacMalayalam, MacMongolian, MacOriya
 384   MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
 385   MacVietnamese
 386
 387 The Mac encodings that are already available are based upon the vendor mappings
 388 at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
 389
 390 =item (Mac) Indic encodings
 391
 392 The maps for the following are available at L<http://www.unicode.org/>
 393 but remain unsupported because those encodings need an algorithmic
 394 approach, which F<enc2xs> does not currently support:
 395
 396   MacDevanagari
 397   MacGurmukhi
 398   MacGujarati
 399
 400 For details, please see C<Unicode mapping issues and notes:> at
 401 L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
 402
 403 I believe this issue is prevalent not only for Mac Indics but also in
 404 other Indic encodings, but the above were the only Indic encoding
 405 maps that I could find at L<http://www.unicode.org/> .
 406
 407 =back
 408
 409 =head1 Encoding vs. Charset -- terminology
 410
 411 We are used to using the terms (character) I<encoding> and I<character
 412 set> interchangeably.  But just as confusing the terms byte and
 413 character is dangerous and the terms should be differentiated when
 414 needed, we need to differentiate I<encoding> and I<character set>.
 415
 416 To understand that, here is a description of how we make computers
 417 grok our characters.
 418
 419 =over 4
 420
 421 =item *
 422
 423 First we decide which characters to include.  We call this
 424 collection of characters a I<character repertoire>.
 425
 426 =item *
 427
 428 Then we have to give each character a unique ID so your computer can
 429 tell the difference between 'a' and 'A'.  This itemized character
 430 repertoire is now a I<character set>.
 431
 432 =item *
 433
 434 If your computer can grok the character set without further
 435 processing, you can go ahead and use it.  This is called a I<coded
 436 character set> (CCS) or I<raw character encoding>.  ASCII is used this
 437 way for most cases.
 438
 439 =item *
 440
 441 But in many cases, especially multi-byte CJK encodings, you have to
 442 tweak a little more.  Your network connection may not accept any data
 443 with the Most Significant Bit set, and your computer may not be able to
 444 tell if a given byte is a whole character or just half of it.  So you
 445 have to I<encode> the character set to use it.
 446
 447 A I<character encoding scheme> (CES) determines how to encode a given
 448 character set, or a set of multiple character sets.  7bit ISO-2022 is
 449 an example of a CES.  You switch between character sets via I<escape
 450 sequences>.
 451
 452 =back
 453
 454 Technically, or mathematically, speaking, a character set encoded in
 455 such a CES that maps character by character may form a CCS.  EUC is such
 456 an example.  The CES of EUC is as follows:
 457
 458 =over 4
 459
 460 =item *
 461
 462 Map ASCII unchanged.
 463
 464 =item *
 465
 466 Map a character set that consists of 94 or 96 to the power of N
 467 members by adding 0x80 to each byte.
 468
 469 =item *
 470
 471 You can also use 0x8e and 0x8f to indicate that the following sequence of
 472 characters belongs to yet another character set.  To each following byte
 473 is added the value 0x80.
 474
 475 =back
 476
 477 By carefully looking at the encoded byte sequence, you can find that the
 478 byte sequence corresponds to a unique number.  In that sense, EUC is a CCS
 479 generated by the CES above from up to four CCSs (complicated?).  UTF-8
 480 falls into this category.  See L<perlunicode/"UTF-8"> to find out how
 481 UTF-8 maps Unicode to a byte sequence.
 482
 483 You may also have found out by now why 7bit ISO-2022 cannot comprise
 484 a CCS.  If you look at a byte sequence \x21\x21, you can't tell if
 485 it is two !'s or IDEOGRAPHIC SPACE.  EUC maps the latter to \xA1\xA1
 486 so you have no trouble differentiating between "!!" and S<"  ">.
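
The EUC rule and the \x21\x21 ambiguity can be checked with a few lines
of Perl.  This is a sketch that assumes the C<jis0208-raw>, C<euc-jp>
and C<iso-2022-jp> encodings from Encode::JP listed earlier; the
"by hand" step simply applies the "add 0x80 to each byte" rule to the
raw bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $space = "\x{3000}";    # IDEOGRAPHIC SPACE

# The bare coded character set: two 7-bit bytes, the same as "!!".
my $raw = encode('jis0208-raw', $space);

# EUC applies "add 0x80 to each byte" to the raw JIS X 0208 bytes.
my $by_hand = join '', map { chr(ord($_) | 0x80) } split //, $raw;
my $euc     = encode('euc-jp', $space);

# 7bit ISO-2022 keeps the raw bytes and brackets them with escape sequences.
my $iso2022 = encode('iso-2022-jp', $space);

printf "raw %s, by hand %s, euc-jp %s, iso-2022-jp %s\n",
    map { unpack 'H*', $_ } $raw, $by_hand, $euc, $iso2022;
```

Without the surrounding escape sequences, the two bytes in the
iso-2022-jp output are indistinguishable from two exclamation marks.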
 487
 488 =head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
 489
 490 This section tries to classify the supported encodings by their
 491 applicability for information exchange over the Internet and to
 492 choose the most suitable aliases to name them in the context of
 493 such communication.
 494
 495 =over 4
 496
 497 =item *
 498
 499 To (en|de)code encodings marked by C<(**)>, you need
 500 C<Encode::HanExtra>, available from CPAN.
 501
 502 =back
 503
 504 Encoding names
 505
 506   US-ASCII    UTF-8    ISO-8859-*  KOI8-R
 507   Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
 508   EUC-KR      Big5     GB2312
 509
 510 are registered with IANA as preferred MIME names and may
 511 be used over the Internet.
 512
 513 C<Shift_JIS> has been made official by JIS X 0208:1997.
 514 L<Microsoft-related naming mess> gives details.
 515
 516 C<GB2312> is the IANA name for C<EUC-CN>.
 517 See L<Microsoft-related naming mess> for details.
 518
 519 C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
 520 with Encode. See L<Encode::CN> for details.
 521
 522   EUC-CN
 523   KOI8-U        [RFC2319]
 524
 525 have not been registered with IANA (as of March 2002) but
 526 seem to be supported by major web browsers.
 527 The IANA name for C<EUC-CN> is C<GB2312>.
 528
 529   KS_C_5601-1987
 530
 531 is heavily misused.
 532 See L<Microsoft-related naming mess> for details.
 533
 534 C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
 535 with Encode. See L<Encode::KR> for details.
 536
 537   UTF-16 UTF-16BE UTF-16LE
 538
 539 are IANA-registered C<charset>s. See [RFC 2781] for details.
 540 Jungshik Shin reports that UTF-16 with a BOM is well accepted
 541 by MS IE 5/6 and NS 4/6. Beware however that
 542
 543 =over 4
 544
 545 =item *
 546
 547 C<UTF-16> support in any software you're going to be
 548 using/interoperating with has probably been less tested
 549 than C<UTF-8> support
 550
 551 =item *
 552
 553 C<UTF-8> coded data seamlessly passes traditional
 554 command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
 555 data is likely to cause confusion (with its zero bytes,
 556 for example)
 557
 558 =item *
 559
 560 it is beyond the power of words to describe the way HTML browsers
 561 encode non-C<ASCII> form data. To get a general impression, visit
 562 L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
 563 While encoding of form data has stabilized for C<UTF-8> encoded pages
 564 (at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
 565 expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
 566 pages!
 567
 568 =back
 569
 570 The rule of thumb is to use C<UTF-8> unless you know what
 571 you're doing and unless you really benefit from using C<UTF-16>.
 572
 573   ISO-IR-165    [RFC1345]
 574   VISCII
 575   GB 12345
 576   GB 18030 (**)  (see links below)
 577   EUC-TW   (**)
 578
 579 are totally valid encodings but not registered at IANA.
 580 The names under which they are listed here are probably the
 581 most widely-known names for these encodings and are recommended
 582 names.
 583
 584   BIG5PLUS (**)
 585
 586 is a proprietary name.
 587
 588 =head2 Microsoft-related naming mess
 589
 590 Microsoft products misuse the following names:
 591
 592 =over 4
 593
 594 =item KS_C_5601-1987
 595
 596 Microsoft extension to C<EUC-KR>.
 597
 598 Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).
 599
 600 See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
 601 for details.
 602
 603 Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
 604 misuse. I<Raw> C<KS_C_5601-1987> encoding is available as
 605 C<ksc5601-raw>.
 606
 607 See L<Encode::KR> for details.
 608
 609 =item GB2312
 610
 611 Microsoft extension to C<EUC-CN>.
 612
 613 Proper names: C<CP936>, C<GBK>.
 614
 615 C<GB2312> has been registered in the C<EUC-CN> meaning at
 616 IANA. This has partially repaired the situation: Microsoft's
 617 C<GB2312> has become a superset of the official C<GB2312>.
 618
 619 Encode aliases C<GB2312> to C<euc-cn> in full agreement with
 620 IANA registration. C<cp936> is supported separately.
 621 I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
 622
 623 See L<Encode::CN> for details.
 624
 625 =item Big5
 626
 627 Microsoft extension to C<Big5>.
 628
 629 Proper name: C<CP950>.
 630
 631 Encode separately supports C<Big5> and C<cp950>.
 632
 633 =item Shift_JIS
 634
 635 Microsoft's understanding of C<Shift_JIS>.
 636
 637 JIS has not endorsed the full Microsoft standard, however.
 638 The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
 639 character sets, while Microsoft has always used C<Shift_JIS>
 640 to encode a wider character repertoire. See C<IANA> registration for
 641 C<Windows-31J>.
 642
 643 As a historical predecessor, Microsoft's variant
 644 probably has more rights for the name, though it may be objected
 645 that Microsoft shouldn't have used JIS as part of the name
 646 in the first place.
 647
 648 Unambiguous name: C<CP932>. C<IANA> name (not used?): C<Windows-31J>.
 649
 650 Encode separately supports C<Shift_JIS> and C<cp932>.
 651
 652 =back
 653
 654 =head1 Glossary
 655
 656 =over 4
 657
 658 =item character repertoire
 659
 660 A collection of unique characters.  A I<character> set in the strictest
 661 sense. At this stage, characters are not numbered.
 662
 663 =item coded character set (CCS)
 664
 665 A character set that is mapped in a way computers can use directly.
 666 Many character encodings, including EUC, fall in this category.
 667
 668 =item character encoding scheme (CES)
 669
 670 An algorithm to map a character set to a byte sequence.  You don't
 671 have to be able to tell which character set a given byte sequence
 672 belongs to.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an
 673 example of being both a CCS and CES.
 674
 675 =item charset (in MIME context)
 676
 677 has long been used in the meaning of C<encoding>, CES.
 678
 679 While the word combination C<character set> has lost this meaning
 680 in MIME context since [RFC 2130], the C<charset> abbreviation has
 681 retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:
 682
 683  This document uses the term "charset" to mean a set of rules for
 684  mapping from a sequence of octets to a sequence of characters, such
 685  as the combination of a coded character set and a character encoding
 686  scheme; this is also what is used as an identifier in MIME "charset="
 687  parameters, and registered in the IANA charset registry ...  (Note
 688  that this is NOT a term used by other standards bodies, such as ISO).
 689                                                [RFC 2277]
 690
 691 =item EUC
 692
 693 Extended Unix Character.  See ISO-2022.
 694
 695 =item ISO-2022
 696
 697 A CES that was carefully designed to coexist with ASCII.  There are a 7
 698 bit version and an 8 bit version.
 699
 700 The 7 bit version switches character set via escape sequence so it
 701 cannot form a CCS.  Since this is more difficult to handle in programs
 702 than the 8 bit version, the 7 bit version is not very popular except for
 703 iso-2022-jp, the I<de facto> standard CES for e-mails.
 704
 705 The 8 bit version can form a CCS.  EUC and ISO-8859 are two examples
 706 thereof.  Pre-5.6 perl could use them as string literals.
 707
 708 =item UCS
 709
 710 Short for I<Universal Character Set>.  When you say just UCS, it means
 711 I<Unicode>.
 712
 713 =item UCS-2
 714
 715 ISO/IEC 10646 encoding form: Universal Character Set coded in two
 716 octets.
 717
 718 =item Unicode
 719
 720 A character set that aims to include all character repertoires of the
 721 world.  Many character sets in various national as well as industrial
 722 standards have become, in a way, just subsets of Unicode.
 723
 724 =item UTF
 725
 726 Short for I<Unicode Transformation Format>.  Determines how to map a
 727 Unicode character into a byte sequence.
 728
 729 =item UTF-16
 730
 731 A UTF in 16-bit encoding.  Can either be in big endian or little
 732 endian.  The big endian version is called UTF-16BE (equal to UCS-2 +
 733 surrogate support) and the little endian version is called UTF-16LE.
 734
 735 =back
 736
 737 =head1 See Also
 738
 739 L<Encode>,
 740 L<Encode::Byte>,
 741 L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
 742 L<Encode::EBCDIC>, L<Encode::Symbol>,
 743 L<Encode::MIME::Header>, L<Encode::Guess>
 744
 745 =head1 References
 746
 747 =over 4
 748
 749 =item ECMA
 750
 751 European Computer Manufacturers Association
 752 L<http://www.ecma.ch>
 753
 754 =over 4
 755
 756 =item ECMA-035 (eq C<ISO-2022>)
 757
 758 L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
 759
 760 The specification of ISO-2022 is available from the link above.
 761
 762 =back
 763
 764 =item IANA
 765
 766 Internet Assigned Numbers Authority
 767 L<http://www.iana.org/>
 768
 769 =over 4
 770
 771 =item Assigned Charset Names by IANA
 772
 773 L<http://www.iana.org/assignments/character-sets>
 774
 775 Most of the C<canonical names> in Encode derive from this list
 776 so you can directly apply the string you have extracted from MIME
 777 header of mails and web pages.
 778
 779 =back
 780
 781 =item ISO
 782
 783 International Organization for Standardization
 784 L<http://www.iso.ch/>
 785
 786 =item RFC
 787
 788 Request For Comments -- need I say more?
 789 L<http://www.rfc-editor.org/>, L<http://www.rfc.net/>,
 790 L<http://www.faqs.org/rfcs/>
 791
 792 =item UC
 793
 794 Unicode Consortium
 795 L<http://www.unicode.org/>
 796
 797 =over 4
 798
 799 =item Unicode Glossary
 800
 801 L<http://www.unicode.org/glossary/>
 802
 803 The glossary of this document is based upon this site.
 804
 805 =back
 806
 807 =back
 808
 809 =head2 Other Notable Sites
 810
 811 =over 4
 812
 813 =item czyborra.com
 814
 815 L<http://czyborra.com/>
 816
 817 Contains a lot of useful information, especially gory details of ISO
 818 vs. vendor mappings.
 819
 820 =item CJK.inf
 821
 822 L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
 823
 824 Somewhat obsolete (last update in 1996), but still useful.  Also try
 825
 826 L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
 827
 828 You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.
 829
 830 =item Jungshik Shin's Hangul FAQ
 831
 832 L<http://jshin.net/faq>
 833
 834 And especially its subject 8.
 835
 836 L<http://jshin.net/faq/qa8.html>
 837
 838 A comprehensive overview of the Korean (C<KS *>) standards.
 839
 840 =item debian.org: "Introduction to i18n"
 841
 842 A brief description for most of the mentioned CJK encodings is
 843 contained in
 844 L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
 845
 846 =back
 847
 848 =head2 Offline sources
 849
 850 =over 4
 851
 852 =item C<CJKV Information Processing> by Ken Lunde
 853
 854 CJKV Information Processing
 855 1999 O'Reilly & Associates, ISBN : 1-56592-224-7
 856
 857 The modern successor of C<CJK.inf>.
 858
 859 Features a comprehensive coverage of CJKV character sets and
 860 encodings along with many other issues faced by anyone trying
 861 to better support CJKV languages/scripts in all the areas of
 862 information processing.
 863
 864 To purchase this book, visit
 865 L<http://www.oreilly.com/catalog/cjkvinfo/>
 866 or your favourite bookstore.
 867
 868 =back
 869
 870 =cut