ext/Encode/lib/Encode/Supported.pod

   1 =head1 NAME
   2
   3 Encode::Supported -- Encodings supported by Encode
   4
   5 =head1 DESCRIPTION
   6
   7 =head2 Encoding Names
   8
   9 Encoding names are case insensitive. White space in names
  10 is ignored.  In addition an encoding may have aliases.
  11 Each encoding has one "canonical" name.  The "canonical"
  12 name is chosen from the names of the encoding by picking
  13 the first in the following sequence (with a few exceptions).
  14
  15 =over
  16
  17 =item *
  18
  19 The name used by the Perl community.  That includes 'utf8' and 'ascii'.
  20 Unlike aliases, canonical names reach the method directly, so
  21 frequently used names like 'utf8' don't need an alias lookup.
  22
  23 =item *
  24
  25 The MIME name as defined in IETF RFCs.  This includes all "iso-" names.
  26
  27 =item *
  28
  29 The name in the IANA registry.
  30
  31 =item *
  32
  33 The name used by the organization that defined it.
  34
  35 =back
  36
  37 Where the I<de jure> canonical name differs from the one the Encode
  38 module uses, it is always aliased, provided the encoding is implemented
  39 at all.  So you can safely tell whether a given encoding is implemented
  40 just by passing the canonical name.
  41
  42 Because of all the alias issues, and because in the general case
  43 encodings have state, "Encode" uses an encoding object internally
  44 once an operation is in progress.
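
The lookup rules above can be tried out with C<find_encoding>, which
returns an encoding object (or C<undef> for an unknown name) whose
C<name> method reports the canonical name.  The following is only a
minimal sketch; the name C<no-such-encoding> is made up for
illustration, and the results for other aliases may vary between
Encode versions.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up a few spellings and record the canonical name of each.
# find_encoding() returns an encoding object, or undef when the name
# (or alias) is not known to this installation.
my %canonical;
for my $name ('latin1', 'ISO-8859-1', 'US-ASCII', 'no-such-encoding') {
    my $obj = find_encoding($name);
    $canonical{$name} = defined $obj ? $obj->name : undef;
    printf "%-18s => %s\n", $name,
        defined $canonical{$name} ? $canonical{$name} : '(not implemented)';
}
```

Because the object is resolved once, it can be kept and reused instead
of passing the name on every call.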
  45
  46 =head1 Supported Encodings
  47
  48 As of Perl 5.8.0, at least the following encodings are recognized.
  49 Note that unless otherwise specified, they are all case insensitive
  50 (via alias) and all occurrences of spaces are replaced with '-'.
  51 In other words, "ISO 8859 1" and "iso-8859-1" are identical.
  52
  53 Encodings are categorized and implemented in several different modules
  54 but in most cases you don't have to C<use Encode::XX> to make them
  55 available.  Encode.pm will automatically load those modules on demand.
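
As a rough illustration of on-demand loading, C<< Encode->encodings() >>
lists the encodings whose modules are currently loaded, while
C<< Encode->encodings(":all") >> lists everything that can be loaded.
This sketch assumes a standard Perl installation with Encode::JP
available; the exact counts will differ from system to system.

```perl
use strict;
use warnings;
use Encode;

# Encodings whose modules are loaded right now.
my @loaded = Encode->encodings();

# Every encoding Encode knows how to load on this installation.
my @all = Encode->encodings(':all');

# Using an encoding loads its module behind the scenes; no explicit
# "use Encode::JP" is needed here.
my $bytes = encode('euc-jp', "\x{3042}");    # HIRAGANA LETTER A

printf "%d loaded at start, %d available in total\n",
    scalar @loaded, scalar @all;
printf "euc-jp bytes: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
```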
  56
  57 =head2 Built-in Encodings
  58
  59 The following encodings are always available.
  60
  61   Canonical     Aliases                      Comments & References
  62   ----------------------------------------------------------------
  63   ascii         US-ascii                                    [ECMA]
  64   iso-8859-1    latin1                                       [ISO]
  65   utf8          UTF-8                                    [RFC2279]
  66   ----------------------------------------------------------------
  67
  68 =head2 Encode::Unicode -- other Unicode encodings
  69
  70 Unicode coding schemes other than native utf8 are supported by
  71 Encode::Unicode which will be autoloaded on demand.
  72
  73   ----------------------------------------------------------------
  74   UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
  75   UCS-2LE                                                     [UC]
  76   UTF-16                                                      [UC]
  77   UTF-16BE                                                    [UC]
  78   UTF-16LE                                                    [UC]
  79   UTF-32                                                      [UC]
  80   UTF-32BE                                                    [UC]
  81   UTF-32LE                                                    [UC]
  82   ----------------------------------------------------------------
  83
  84 To find out how those (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
  85 see L<Encode::Unicode>.
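
For a quick feel of the difference, the sketch below encodes a single
ASCII letter with three of the names in the table.  The helper
C<hex_dump> is a hypothetical name introduced only for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_dump { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $text = "A";    # U+0041

# Explicit byte order: no byte order mark (BOM) is written.
my $be = encode('UTF-16BE', $text);
my $le = encode('UTF-16LE', $text);

# Plain "UTF-16": Encode writes a BOM followed by big-endian data.
my $bom = encode('UTF-16', $text);

print "UTF-16BE: ", hex_dump($be),  "\n";    # 00 41
print "UTF-16LE: ", hex_dump($le),  "\n";    # 41 00
print "UTF-16:   ", hex_dump($bom), "\n";    # FE FF 00 41
```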
  86
  87 =head2 Encode::Byte -- Extended ASCII
  88
  89 Encode::Byte implements most single-byte encodings except for
  90 Symbols and EBCDIC.  The following encodings are single-byte
  91 encodings implemented as extended ASCII.  In most cases they use
  92 \x80-\xff (the upper half) to map non-ASCII characters.
  93
  94 =over 2
  95
  96 =item ISO-8859 and corresponding vendor mappings
  97
  98 Since there are so many, they are presented in table format with
  99 languages and corresponding encoding names by vendors.  Note that the
 100 table is sorted in order of ISO-8859 and that the corresponding vendor
 101 mappings are slightly different from those of ISO.  See
 102 L<http://czyborra.com/charsets/iso8859.html> for details.
 103
 104   Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
 105   ----------------------------------------------------------------
 106   N. America    (ASCII)         cp437        AdobeStandardEncoding
 107                                 cp863 (DOSCanadaF)
 108   W.  Europe    iso-8859-1     cp850   cp1252  MacRoman  nextstep
 109                                                          hp-roman8
 110                                 cp860 (DOSPortuguese)
 111   Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
 112                                                 MacCroatian
 113                                                 MacRomanian
 114                                                 MacRumanian
 115   Latin3 [1]    iso-8859-3
 116   Latin4 [2]    iso-8859-4
 117   Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
 118     (Also see next section)     cp866           MacUkrainian
 119   Arabic        iso-8859-6      cp864   cp1256  MacArabic
 120                                 cp1006          MacFarsi
 121   Greek         iso-8859-7      cp737   cp1253  MacGreek
 122                                 cp869 (DOSGreek2)
 123   Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
 124   Turkish       iso-8859-9      cp857   cp1254  MacTurkish
 125   Nordics       iso-8859-10     cp865
 126                                 cp861           MacIcelandic
 127                                                 MacSami
 128   Thai          iso-8859-11 [3] cp874           MacThai
 129   (iso-8859-12 is nonexistent. Reserved for Indics?)
 130   Baltics       iso-8859-13     cp775           cp1257
 131   Celtics       iso-8859-14
 132   Latin9 [4]    iso-8859-15
 133   Latin10       iso-8859-16
 134   Vietnamese    viscii                  cp1258  MacVietnamese
 135   ----------------------------------------------------------------
 136
 137   [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
 138   [2] Baltics.  Now on 8859-10.
 139   [3] Also known as TIS 620.
 140   [4] Nicknamed Latin0; Euro sign as well as French and Finnish
 141       letters that are missing from 8859-1 are added.
 142
 143 All cp* are also available as ibm-*, ms-*, and windows-* .  See also
 144 L<http://czyborra.com/charsets/codepages.html>.
 145
 146 Macintosh encodings don't seem to be registered with bodies such as
 147 IANA.  "Canonical" names in Encode are based upon Apple's Tech Note
 148 1150.  See L<http://developer.apple.com/technotes/tn/tn1150.html>
 149 for details.
 150
 151 =item KOI8 - De Facto Standard for Cyrillic world
 152
 153 Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
 154 popular on the Net.  L<Encode> comes with the following KOI charsets.
 155 For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.
 156
 157   ----------------------------------------------------------------
 158   koi8-f
 159   koi8-r cp878                                           [RFC1489]
 160   koi8-u                                                 [RFC2319]
 161
 162 =item gsm0338 - Hentai Latin 1
 163
 164 GSM0338 is for GSM handsets.  Though it shares alphanumerics with
 165 ASCII, control character ranges and other parts are mapped very
 166 differently, presumably to store Greek and Cyrillic alphabets.
 167 This is also covered in Encode::Byte even though it does not
 168 comply with extended ASCII.
 169
 170 =back
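
To make the "upper half" idea concrete, the sketch below encodes one
Cyrillic letter with three of the single-byte encodings listed above.
It is only an illustration under the assumption that Encode::Byte is
installed as usual; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char = "\x{0430}";    # CYRILLIC SMALL LETTER A

# The same character lands on a different upper-half byte in each encoding.
my %byte;
for my $enc ('iso-8859-5', 'koi8-r', 'cp1251') {
    $byte{$enc} = encode($enc, $char);
    printf "%-10s 0x%02X\n", $enc, ord $byte{$enc};
}

# The lower half is plain ASCII in all three (count of encodings that agree).
my $matches = grep { encode($_, "Perl") eq "Perl" } keys %byte;

# Decoding with the wrong encoding silently yields a different character.
my $wrong = decode('iso-8859-5', $byte{'koi8-r'});
printf "koi8-r byte read as iso-8859-5: U+%04X\n", ord $wrong;
```

This is why the encoding of single-byte data has to be known in
advance: nothing in the bytes themselves says which table applies.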
 171
 172 =head2 The CJK: Chinese, Japanese, Korean (Multibyte)
 173
 174 Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
 175 below.  Also note that these are implemented in a distinct module for
 176 each language, due to size concerns.  Please refer to their
 177 respective documentation pages.
 178
 179 =over 4
 180
 181 =item Encode::CN -- Continental China
 182
 183   Standard      DOS/Win Macintosh                Comment/Reference
 184   ----------------------------------------------------------------
 185   euc-cn [1]            MacChineseSimp
 186   (gbk)         cp936 [2]
 187   gb12345-raw                      { GB12345 without CES }
 188   gb2312-raw                       { GB2312  without CES }
 189   hz
 190   iso-ir-165
 191   ----------------------------------------------------------------
 192
 193   [1] GB2312 is aliased to this.  See L<Microsoft-related naming mess>
 194   [2] gbk is aliased to this.  See L<Microsoft-related naming mess>
 195
 196 =item Encode::JP -- Japan
 197
 198   Standard      DOS/Win Macintosh                Comment/Reference
 199   ----------------------------------------------------------------
 200   euc-jp
 201   shiftjis      cp932   macJapanese
 202   7bit-jis
 204   iso-2022-jp                                            [RFC1468]
 205   iso-2022-jp-1                                          [RFC2237]
 206   jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
 207   jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
 208   jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
 209   ----------------------------------------------------------------
 210
 211 =item Encode::KR -- Korea
 212
 213   Standard      DOS/Win Macintosh                Comment/Reference
 214   ----------------------------------------------------------------
 215   euc-kr                MacKorean                        [RFC1557]
 216                 cp949 [1]
 217   iso-2022-kr                                            [RFC1557]
 218   johab                                  [KS X 1001:1998, Annex 3]
 219   ksc5601-raw                              { KSC5601 without CES }
 220   ----------------------------------------------------------------
 221
 222   [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
 223   See below.
 224
 225 =item Encode::TW -- Taiwan
 226
 227   Standard      DOS/Win Macintosh                Comment/Reference
 228   ----------------------------------------------------------------
 229   big5          cp950   MacChineseTrad
 230   big5-hkscs
 231   ----------------------------------------------------------------
 232
 233 =item Encode::HanExtra -- More Chinese via CPAN
 234
 235 Due to size concerns, additional Chinese encodings below are
 236 distributed separately on CPAN, under the name Encode::HanExtra.
 237
 238   Standard      DOS/Win Macintosh                Comment/Reference
 239   ----------------------------------------------------------------
 240   gb18030
 241   euc-tw
 242   big5plus
 243   ----------------------------------------------------------------
 244
 245 =back
 246
 247 =head2 Miscellaneous encodings
 248
 249 =over 4
 250
 251 =item Encode::EBCDIC
 252
 253 See L<perlebcdic> for details.
 254
 255   ----------------------------------------------------------------
 256   cp37
 257   cp500
 258   cp875
 259   cp1026
 260   cp1047
 261   posix-bc
 262   ----------------------------------------------------------------
 263
 264 =item Encode::Symbol
 265
 266 For symbols and dingbats.
 267
 268   ----------------------------------------------------------------
 269   symbol
 270   dingbats
 271   MacDingbats
 272   AdobeZdingbat
 273   AdobeSymbol
 274   ----------------------------------------------------------------
 275
 276 =back
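
Unlike the extended-ASCII encodings, the EBCDIC code pages listed under
Encode::EBCDIC do not even keep ASCII letters and digits in place.  A
minimal sketch, assuming Encode::EBCDIC is installed as in a standard
Perl build:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# In cp37, 'A' is 0xC1, '0' is 0xF0 and space is 0x40.
my $ebcdic = encode('cp37', 'A0 ');
print join(' ', map { sprintf '%02X', ord($_) } split //, $ebcdic), "\n";

# Decoding with the same encoding restores the original text.
my $round_trip = decode('cp37', $ebcdic);
```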
 277
 278 =head1 Unsupported encodings
 279
 280 The following are not supported as yet.  Some because they are rarely
 281 used, some because of technical difficulties.  They may be supported by
 282 external modules via CPAN in the future, however.
 283
 284 =over 4
 285
 286 =item ISO-2022-JP-2 [RFC1554]
 287
 288 Not very popular yet.  Needs Unicode Database or equivalent to
 289 implement encode() (because it includes JIS X 0208/0212, KSC5601, and
 290 GB2312 simultaneously, whose code points in Unicode overlap, so you
 291 need to look up the database to determine which character set a given
 292 Unicode character should belong to).
 293
 294 =item ISO-2022-CN [RFC1922]
 295
 296 Not very popular.  Needs CNS 11643-1 and 2 which are not available in
 297 this module.  CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
 298 Autrijus may add support for this encoding in his module in the future.
 299
 300 =item Various HP-UX encodings
 301
 302 The following are unsupported due to the lack of mapping data.
 303
 304   '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
 305   '15' - japanese15, korean15, and roi15
 306
 307 =item Cyrillic encoding ISO-IR-111
 308
 309 Anton doubts its usefulness.
 310
 311 =item ISO-8859-8-I [Hebrew]
 312
 313 None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255
 314 and MacHebrew are supported only because there were mappings
 315 available at L<http://www.unicode.org/>).  Contributions welcome.
 316
 317 =item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
 318
 319 Ditto.
 320
 321 =item Thai encoding TCVN
 322
 323 Ditto.
 324
 325 =item Vietnamese encodings VPS
 326
 327 Though Jungshik has reported that Mozilla supports this encoding, it
 328 was too late before 5.8.0 for us to add it.  It may come in the future
 329 via a separate module.  See
 330 L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
 331 and
 332 L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
 333 if you are interested in helping us.
 334
 335 =item Various Mac encodings
 336
 337 The following are unsupported due to the lack of mapping data.
 338
 339   MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
 340   MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
 341   MacLaotian,   MacMalayalam, MacMongolian, MacOriya
 342   MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
 343   MacVietnamese
 344
 345 The Mac encodings that are already available are based upon the vendor
 346 mappings at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
 347
 348 =item (Mac) Indic encodings
 349
 350 The maps for the following are available at L<http://www.unicode.org/>
 351 but they remain unsupported because those encodings need an algorithmic
 352 approach, which is currently unsupported by F<enc2xs>.
 353
 354   MacDevanagari
 355   MacGurmukhi
 356   MacGujarati
 357
 358 For details, please see C<Unicode mapping issues and notes:> at
 359 L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
 360
 361 I believe this issue is prevalent not only for Mac Indics but also in
 362 other Indic encodings, but the above were the only Indic encoding
 363 maps that I could find at L<http://www.unicode.org/> .
 364
 365 =back
 366
 367 =head1 Encoding vs. Charset -- terminology
 368
 369 We are used to using the terms (character) I<encoding> and I<character set>
 370 interchangeably.  But just as confusing the terms byte and character is
 371 dangerous and the two should be differentiated when needed, we need to
 372 differentiate I<encoding> and I<character set>.
 373
 374 To understand that, let's follow how we make computers grok our characters.
 375
 376 =over 4
 377
 378 =item *
 379
 380 First we start with which characters to include.  We call this
 381 collection of characters a I<character repertoire>.
 382
 383 =item *
 384
 385 Then we have to give each character a unique ID so your computer can
 386 tell the difference between 'a' and 'A'.  This itemized character
 387 repertoire is now a I<character set>.
 388
 389 =item *
 390
 391 If your computer can grok the character set without further
 392 processing, you can go ahead and use it.  This is called a I<coded
 393 character set> (CCS) or I<raw character encoding>.  ASCII is used this
 394 way in most cases.
 395
 396 =item *
 397
 398 But in many cases, especially with multi-byte CJK encodings, you have
 399 to tweak a little more.  Your network connection may not accept any data
 400 with the Most Significant Bit set, and your computer may not be able to
 401 tell if a given byte is a whole character or just half of it.  So you
 402 have to I<encode> the character set to use it.
 403
 404 A I<character encoding scheme> (CES) determines how to encode a given
 405 character set, or a set of multiple character sets.  7bit ISO-2022 is
 406 an example of a CES.  You switch between character sets via I<escape
 407 sequences>.
 408
 409 =back
 410
 411 Technically, or mathematically speaking, a character set encoded in
 412 such a CES that maps character by character may form a CCS.  EUC is such
 413 an example.  The CES of EUC is as follows:
 414
 415 =over 4
 416
 417 =item *
 418
 419 Map ASCII unchanged.
 420
 421 =item *
 422
 423 Map a character set that consists of 94 or 96 to the power of N
 424 members by adding 0x80 to each byte.
 425
 426 =item *
 427
 428 You can also use 0x8e and 0x8f to indicate that the following sequence
 429 of characters belongs to yet another character set.  0x80 is added to
 430 each following byte.
 431
 432 =back
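
The "add 0x80 to each byte" rule can be checked by hand.  The sketch
below takes the bare JIS X 0208 code of one character from
C<jis0208-raw>, applies the rule, and compares the result with what
C<euc-jp> produces.  It assumes Encode::JP is installed; the variable
names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{3042}";    # HIRAGANA LETTER A

# jis0208-raw gives the bare JIS X 0208 code: two 7-bit bytes.
my $raw = encode('jis0208-raw', $char);

# Apply the EUC rule by hand: add 0x80 to each byte.
my $by_hand = join '', map { chr(ord($_) + 0x80) } split //, $raw;

# Compare with what Encode produces for euc-jp.
my $euc = encode('euc-jp', $char);
print $by_hand eq $euc ? "same bytes\n" : "different bytes\n";

# ASCII passes through unchanged.
my $ascii = encode('euc-jp', 'Perl');
```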
 433
 434 By carefully looking at the encoded byte sequence, you may find that
 435 the byte sequence forms a unique number.  In that sense EUC is a CCS
 436 generated by the CES above from up to four CCSs (complicated?).  UTF-8
 437 falls into this category.  See L<perlunicode/"UTF-8"> to find out how
 438 UTF-8 maps Unicode to a byte sequence.
 439
 440 You may also see by now why 7bit ISO-2022 cannot form a CCS.  If
 441 you look at a byte sequence \x21\x21, you can't tell if it is two !'s
 442 or IDEOGRAPHIC SPACE.  EUC maps the latter to \xA1\xA1 so you have no
 443 trouble telling "!!" from an IDEOGRAPHIC SPACE.
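
The \x21\x21 ambiguity can be observed directly.  This is a small
sketch assuming Encode::JP is installed; the substitution that strips
escape sequences is a simplification written for this example and
only covers the two escapes that appear here.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bangs = "!!";
my $space = "\x{3000}";    # IDEOGRAPHIC SPACE

# EUC: the two strings produce distinct byte sequences.
my $euc_bangs = encode('euc-jp', $bangs);    # 21 21
my $euc_space = encode('euc-jp', $space);    # A1 A1

# 7bit ISO-2022: the payload bytes are identical; only the escape
# sequences that switch character sets tell them apart.
my $iso_bangs = encode('iso-2022-jp', $bangs);
my $iso_space = encode('iso-2022-jp', $space);

# Strip the escape sequences to look at the bare payload.
(my $payload = $iso_space) =~ s/\e[\$(][A-Za-z@]//g;
print "payloads are identical\n" if $payload eq $iso_bangs;
```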
 444
 445 =head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
 446
 447 This section tries to classify the supported encodings by their
 448 applicability for information exchange over the Internet and to
 449 choose the most suitable aliases to name them in the context of
 450 such communication.
 451
 452 =over 2
 453
 454 =item *
 455
 456 To (en|de) code Encodings marked as C<(**)>, You need
 457 C<Encode::HanExtra>, available from CPAN.
 458
 459 =back
 460
 461 Encoding names
 462
 463   US-ASCII    UTF-8    ISO-8859-*  KOI8-R
 464   Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
 465   EUC-KR      Big5     GB2312
 466
 467 are registered to IANA as preferred MIME names and may probably
 468 be used over the Internet.
 469
 470 C<Shift_JIS> has been made official by JIS X 0208:1997.
 471 L<Microsoft-related naming mess> gives details.
 472
 473 C<GB2312> is the IANA name for C<EUC-CN>.
 474 See L<Microsoft-related naming mess> for details.
 475
 476 C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
 477 with Encode. See L<Encode::CN> for details.
 478
 479   EUC-CN
 480   KOI8-U        [RFC2319]
 481
 482 have not been registered with IANA (as of March 2002) but
 483 seem to be supported by major web browsers.
 484 The IANA name for C<EUC-CN> is C<GB2312>.
 485
 486   KS_C_5601-1987
 487
 488 is heavily misused.
 489 See L<Microsoft-related naming mess> for details.
 490
 491 C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
 492 with Encode. See L<Encode::KR> for details.
 493
 494   UTF-16 UTF-16BE UTF-16LE
 495
 496 are IANA-registered C<charset>s. See [RFC 2781] for details.
 497 Jungshik Shin reports that UTF-16 with a BOM is well accepted
 498 by MS IE 5/6 and NS 4/6. Beware however that
 499
 500 =over 2
 501
 502 =item *
 503
 504 C<UTF-16> support in any software you're going to be
 505 using/interoperating with has probably been less tested
 506 than C<UTF-8> support
 507
 508 =item *
 509
 510 C<UTF-8> coded data seamlessly passes traditional
 511 command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
 512 data is likely to cause confusion (with its zero bytes,
 513 for example)
 514
 515 =item *
 516
 517 It is beyond the power of words to describe the way HTML browsers
 518 encode non-C<ASCII> form data. To get a general impression visit
 519 L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
 520 While encoding of form data has stabilized for C<UTF-8> coded pages
 521 (at least IE 5/6, NS 6, Opera 6 behave consistently), be sure to
 522 expect fun (and cross-browser discrepancies) with C<UTF-16> coded
 523 pages!
 524
 525 =back
 526
 527 The rule of thumb is to use C<UTF-8> unless you know what
 528 you're doing and unless you really benefit from using C<UTF-16>.
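
The zero-byte concern mentioned above is easy to see for plain ASCII
text.  A minimal sketch; the sample string is arbitrary:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "cat";

my $utf8  = encode('UTF-8',    $text);
my $utf16 = encode('UTF-16LE', $text);

# tr/// with an empty replacement list just counts matching bytes.
my $nul_in_utf8  = ($utf8  =~ tr/\0//);
my $nul_in_utf16 = ($utf16 =~ tr/\0//);

printf "UTF-8:    %d bytes, %d NUL\n", length $utf8,  $nul_in_utf8;
printf "UTF-16LE: %d bytes, %d NUL\n", length $utf16, $nul_in_utf16;
```

Tools that treat NUL as a string terminator, or that assume one byte
per ASCII character, are the ones most likely to be confused.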
 529
 530
 531   ISO-IR-165    [RFC1345]
 532   VISCII
 533   GB 12345
 534   GB 18030 (**)  (see links below)
 535   EUC-TW   (**)
 536
 537 are perfectly valid encodings but are not registered with IANA.
 538 The names under which they are listed here are probably the
 539 most widely-known names for these encodings and are the recommended
 540 names.
 541
 542   BIG5PLUS (**)
 543
 544 is a somewhat proprietary name.
 545
 546 =head2 Microsoft-related naming mess
 547
 548 Microsoft products misuse the following names:
 549
 550 =over 2
 551
 552 =item KS_C_5601-1987
 553
 554 Microsoft extension to C<EUC-KR>.
 555
 556 Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).
 557
 558 See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
 559 for details.
 560
 561 Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
 562 misuse.  I<Raw> C<KS_C_5601-1987> encoding is available as
 563 C<ksc5601-raw>.
 564
 565 See L<Encode::KR> for details.
 566
 567 =item GB2312
 568
 569 Microsoft extension to C<EUC-CN>.
 570
 571 Proper names: C<CP936>, C<GBK>.
 572
 573 C<GB2312> has been registered in the C<EUC-CN> meaning at
 574 IANA. This has partially repaired the situation: Microsoft's
 575 C<GB2312> has become a superset of the official C<GB2312>.
 576
 577 Encode aliases C<GB2312> to C<euc-cn> in full agreement with
 578 IANA registration. C<cp936> is supported separately.
 579 I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
 580
 581 See L<Encode::CN> for details.
 582
 583 =item Big5
 584
 585 Microsoft extension to C<Big5>.
 586
 587 Proper name: C<CP950>.
 588
 589 Encode separately supports C<Big5> and C<cp950>.
 590
 591 =item Shift_JIS
 592
 593 Microsoft's understanding of C<Shift_JIS>.
 594
 595 JIS has not endorsed the full Microsoft standard, however.
 596 The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
 597 subsets, while Microsoft has always used C<Shift_JIS> to
 598 encode a wider character repertoire.  See the C<IANA> registration for
 599 C<Windows-31J>.
 600
 601 As the historical predecessor, Microsoft's variant
 602 probably has a stronger claim to the name, although it may be objected
 603 that Microsoft shouldn't have used JIS as part of the name
 604 in the first place.
 605
 606 Unambiguous name: C<CP932>. C<IANA> name (not used?): C<Windows-31J>.
 607
 608 Encode separately supports C<Shift_JIS> and C<cp932>.
 609
 610 =back
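
How Encode resolves some of these contested names can be checked with
C<find_encoding>.  The sketch below assumes Encode::KR and Encode::CN
are installed; the results reflect the aliasing described above and
could differ in an Encode version that changes its alias table.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map each contested name to the canonical name Encode resolves it to.
my %resolved;
for my $name ('KS_C_5601-1987', 'UHC', 'GB2312', 'GBK') {
    my $obj = find_encoding($name);
    $resolved{$name} = $obj ? $obj->name : '(unknown)';
    printf "%-16s => %s\n", $name, $resolved{$name};
}
```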
 611
 612 =head1 Glossary
 613
 614 =over 2
 615
 616 =item character repertoire
 617
 618 A collection of unique characters.  A I<character set> in the
 619 strictest sense.  At this stage characters are not numbered.
 620
 621 =item coded character set (CCS)
 622
 623 A character set that is mapped in a way computers can use directly.
 624 Many character encodings, including EUC, fall into this category.
 625
 626 =item character encoding scheme (CES)
 627
 628 An algorithm to map a character set to a byte sequence.  You don't
 629 have to be able to tell which character set a given byte sequence
 630 belongs to.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an
 631 example of being both a CCS and a CES.
 632
 633 =item charset (in MIME context)
 634
 635 Has long been used to mean C<encoding>, that is, a CES.
 636
 637 While the phrase C<character set> has lost this meaning
 638 in the MIME context since [RFC 2130], the abbreviation C<charset> has
 639 retained it.  This is how [RFC 2277] and [RFC 2278] bless C<charset>:
 640
 641
 642  This document uses the term "charset" to mean a set of rules for
 643  mapping from a sequence of octets to a sequence of characters, such
 644  as the combination of a coded character set and a character encoding
 645  scheme; this is also what is used as an identifier in MIME "charset="
 646  parameters, and registered in the IANA charset registry ...  (Note
 647  that this is NOT a term used by other standards bodies, such as ISO).
 648                                                [RFC 2277]
 649
 650 =item EUC
 651
 652 Extended Unix Code.  See ISO-2022.
 653
 654 =item ISO-2022
 655
 656 A CES that was carefully designed to coexist with ASCII.  There are a
 657 7 bit version and an 8 bit version.
 658
 659 The 7 bit version switches character sets via escape sequences so it
 660 cannot form a CCS.  Since this is more difficult to handle in programs
 661 than the 8 bit version, the 7 bit version is not very popular except for
 662 iso-2022-jp, the de facto standard CES for e-mails.
 663
 664 The 8 bit version can form a CCS.  EUC and ISO-8859 are two examples
 665 thereof.  Pre-5.6 perl could use them as string literals.
 666
 667 =item UCS
 668
 669 Short for I<Universal Character Set>.  When you say just UCS, it means
 670 I<Unicode>.
 671
 672 =item UCS-2
 673
 674 ISO/IEC 10646 encoding form: Universal Character Set coded in two
 675 octets.
 676
 677 =item Unicode
 678
 679 A character set that aims to include all character repertoires of the
 680 world.  Many character sets in various national as well as industrial
 681 standards have become, in a way, just subsets of Unicode.
 682
 683 =item UTF
 684
 685 Short for I<Unicode Transformation Format>.  Determines how to map a
 686 Unicode character into a byte sequence.
 687
 688 =item UTF-16
 689
 690 A UTF in 16-bit encoding.  Can be either big endian or little
 691 endian.  The big endian version is called UTF-16BE (equal to UCS-2 plus
 692 surrogate support) and the little endian version is UTF-16LE.
 693
 694 =back
 695
 696 =head1 See Also
 697
 698 L<Encode>,
 699 L<Encode::Byte>,
 700 L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
 701 L<Encode::EBCDIC>, L<Encode::Symbol>
 702
 703 =head1 References
 704
 705 =over 2
 706
 707 =item ECMA
 708
 709 European Computer Manufacturers Association
 710 L<http://www.ecma.ch>
 711
 712 =over 2
 713
 714 =item ECMA-035 (eq C<ISO-2022>)
 715
 716 L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
 717
 718 The specification of ISO-2022 itself is available from the link above.
 719
 720 =back
 721
 722 =item IANA
 723
 724 Internet Assigned Numbers Authority
 725 L<http://www.iana.org/>
 726
 727 =over 2
 728
 729 =item Assigned Charset Names by IANA
 730
 731 L<http://www.iana.org/assignments/character-sets>
 732
 733 Most of the C<canonical names> in Encode derive from this list
 734 so you can directly apply the strings you have extracted from the MIME
 735 headers of mails and web pages.
 736
 737 =back
 738
 739 =item ISO
 740
 741 International Organization for Standardization
 742 L<http://www.iso.ch/>
 743
 744 =item RFC
 745
 746 Request For Comments -- need I say more?
 747 L<http://www.rfc.net/>, L<http://www.faqs.org/rfcs/>
 748
 749 =item UC
 750
 751 Unicode Consortium
 752 L<http://www.unicode.org/>
 753
 754 =over 2
 755
 756 =item Unicode Glossary
 757
 758 L<http://www.unicode.org/glossary/>
 759
 760 The glossary of this document is based upon this site.
 761
 762 =back
 763
 764 =back
 765
 766 =head2 Other Notable Sites
 767
 768 =over 2
 769
 770 =item czyborra.com
 771
 772 L<http://czyborra.com/>
 773
 774 Contains a lot of useful information, especially gory details of ISO
 775 vs. vendor mappings.
 776
 777 =item CJK.inf
 778
 779 L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
 780
 781 Somewhat obsolete (last update in 1996), but still useful.  Also try
 782
 783 L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
 784
 785 You will find brief info on C<EUC-CN> and C<GBK>, and mostly on C<GB 18030>.
 786
 787 =item Jungshik Shin's Hangul FAQ
 788
 789 L<http://jshin.net/faq>
 790
 791 And especially its subject 8.
 792
 793 L<http://jshin.net/faq/qa8.html>
 794
 795 A comprehensive overview of the Korean (C<KS *>) standards.
 796
 797 =back
 798
 799 =head2 Offline sources
 800
 801 =over 2
 802
 803 =item C<CJKV Information Processing> by Ken Lunde
 804
 805 CJKV Information Processing
 806 1999 O'Reilly & Associates, ISBN : 1-56592-224-7
 807
 808 The modern successor of the C<CJK.inf>.
 809
 810 Features comprehensive coverage of CJKV character sets and
 811 encodings along with many other issues faced by anyone trying
 812 to better support CJKV languages/scripts in all the areas of
 813 information processing.
 814
 815 To purchase this book visit
 816 L<http://www.oreilly.com/catalog/cjkvinfo/>
 817
 818 =back
 819
 820 =cut
 821
 822 I could not find this page because the hostname doesn't resolve!
 823
 824 Brief description for most of the mentioned CJK encodings
 825 L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>