ext/Encode/lib/Encode/Supported.pod

   1 =head1 NAME
   2
   3 Encode::Supported -- Encodings supported by Encode
   4
   5 =head1 DESCRIPTION
   6
   7 =head2 Encoding Names
   8
   9 Encoding names are case insensitive. White space in names
  10 is ignored.  In addition an encoding may have aliases.
  11 Each encoding has one "canonical" name.  The "canonical"
  12 name is chosen from the names of the encoding by picking
  13 the first in the following sequence (with a few exceptions).
  14
  15 =over
  16
  17 =item *
  18
  19 The name used by the Perl community.  That includes 'utf8' and 'ascii'.
  20 Unlike aliases, canonical names reach the method directly, so
  21 frequently used names like 'utf8' work without alias lookups.
  22
  23 =item *
  24
  25 The MIME name as defined in IETF RFCs.  This includes all "iso-"'s.
  26
  27 =item *
  28
  29 The name in the IANA registry.
  30
  31 =item *
  32
  33 The name used by the organization that defined it.
  34
  35 =back
  36
  37 Where I<de jure> canonical names differ from those of the Encode
  38 module, they are always aliased if the encoding is implemented.  So
  39 you can safely tell whether a given encoding is implemented just by
  40 passing the canonical name.
  41
  42 Because of all the alias issues, and because in the general case
  43 encodings have state, "Encode" uses the encoding object internally
  44 once an operation is in progress.
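
As a rough illustration of the naming rules above, the sketch below asks
Encode which canonical name a given name resolves to.  It assumes the
Encode module is installed; C<resolve_alias> and C<find_encoding> are
documented in L<Encode>, while the sample names (including the
deliberately bogus C<no-such-encoding>) are chosen here only for
illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# resolve_alias() returns the canonical name, or a false value when
# the name is unknown to Encode.
for my $name (qw(latin1 Latin1 ISO-8859-1 no-such-encoding)) {
    my $canon = Encode::resolve_alias($name);
    printf "%-18s => %s\n", $name, $canon ? $canon : '(not supported)';
}

# find_encoding() returns the encoding object that Encode works with
# internally; its name() method reports the canonical name.
my $enc = find_encoding('latin1');
print $enc->name, "\n";    # iso-8859-1
```

Because an unknown name simply yields a false value, the same call can
serve as a cheap "is this encoding implemented?" test.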
  45
  46 =head1 Supported Encodings
  47
  48 As of Perl 5.8.0, at least the following encodings are recognized.
  49 Note that unless otherwise specified, they are all case insensitive
  50 (via alias) and all occurrences of spaces are replaced with '-'.  In
  51 other words, "ISO 8859 1" and "iso-8859-1" are identical.
  52
  53 Encodings are categorized and implemented in several different modules
  54 but in most cases you don't have to C<use Encode::XX> to make them
  55 available.  Encode.pm will automatically load those modules on demand.
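
The automatic loading can be seen in a small sketch like the one below,
which assumes a standard Encode installation and never mentions
C<Encode::JP> explicitly.  The sample character is an arbitrary choice
made for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);    # note: no "use Encode::JP" anywhere

# U+3042 is HIRAGANA LETTER A.  Asking for euc-jp is enough to make
# Encode load the module that implements it.
my $bytes = encode('euc-jp', "\x{3042}");
print unpack('H*', $bytes), "\n";    # a4a2

# Encode->encodings(':all') lists canonical names, including those
# that are only loaded on demand.
my @all = Encode->encodings(':all');
print scalar(@all), " encodings available\n";
```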
  56
  57 =head2 Built-in Encodings
  58
  59 The following encodings are always available.
  60
  61   Canonical     Aliases                      Comments & References
  62   ----------------------------------------------------------------
  63   ascii         US-ascii                                    [ECMA]
  64   iso-8859-1    latin1                                       [ISO]
  65   utf8          UTF-8                                    [RFC2279]
  66   UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
  67   UCS-2LE                                                     [UC]
  68   UTF-16                                                      [UC]
  69   UTF-16BE                                                    [UC]
  70   UTF-16LE                                                    [UC]
  71   UTF-32                                                      [UC]
  72   UTF-32BE                                                    [UC]
  73   UTF-32LE                                                    [UC]
  74   ----------------------------------------------------------------
  75
  76 To find how those (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
  77 see L<Encode::Unicode>.
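
For a quick feel of the differences, the following sketch dumps the
bytes that a few of these encodings produce for the same two-character
string.  The sample string is an assumption made for this example: one
ASCII letter plus one character outside the Basic Multilingual Plane,
so that surrogate pairs show up in the UTF-16 output.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "A\x{10000}";    # U+0041 followed by U+10000

# Print each encoding's output as a hex string.
for my $name (qw(UTF-16BE UTF-16LE UTF-16 UTF-32BE)) {
    printf "%-8s %s\n", $name, unpack('H*', encode($name, $text));
}
```

When neither BE nor LE is given, encode() emits big endian data with a
byte order mark (C<feff>) in front.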
  78
  79 =head2 Encode::Byte -- Extended ASCII
  80
  81 Encode::Byte implements most single-byte encodings except for
  82 Symbols and EBCDIC.  The following encodings are single-byte
  83 encodings implemented as extended ASCII.  In most cases they use
  84 \x80-\xff (upper half) to map non-ASCII characters.
  85
  86 =over 2
  87
  88 =item ISO-8859 and corresponding vendor mappings
  89
  90 Since there are so many, they are presented in table format with
  91 languages and corresponding encoding names by vendors.  Note the table
  92 is sorted in order of ISO-8859 and the corresponding vendor mappings
  93 are slightly different from those of ISO.  See
  94 L<http://czyborra.com/charsets/iso8859.html> for details.
  95
  96   Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  97   ----------------------------------------------------------------
  98   N. America    (ASCII)         cp437        AdobeStandardEncoding
  99                                 cp863 (DOSCanadaF)
 100   W.  Europe    (iso-8859-1)    cp850   cp1252  MacRoman  nextstep
 101                                                          hp-roman8
 102                                 cp860 (DOSPortuguese)
 103   CE. Europe    iso-8859-2      cp852   cp1250  MacCentralEurRoman
 104                                                 MacCroatian
 105                                                 MacRomanian
 106                                                 MacRumanian
 107   Latin3(*3)    iso-8859-3
 108   Latin4(*4)    iso-8859-4
 109   Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
 110     (Also see next section)     cp866           MacUkrainian
 111   Arabic        iso-8859-6      cp864   cp1256  MacArabic
 112                                 cp1006          MacFarsi
 113   Greek         iso-8859-7      cp737   cp1253  MacGreek
 114                                 cp869 (DOSGreek2)
 115   Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
 116   Turkish       iso-8859-9      cp857   cp1254  MacTurkish
 117   Nordics       iso-8859-10     cp865
 118                                 cp861           MacIcelandic
 119                                                 MacSami
 120   Thai          iso-8859-11     cp874           MacThai
 121   (iso-8859-12 is nonexistent. Reserved for Indics?)
 122   Baltics       iso-8859-13     cp775   cp1257
 123   Celtics       iso-8859-14
 124   Latin9(*15)   iso-8859-15
 125   Latin10       iso-8859-16
 126   Vietnamese    viscii                  cp1258  MacVietnamese
 127   ----------------------------------------------------------------
 128
 129   (*3) Esperanto, Maltese, and Turkish.  Turkish is now on 8859-9.
 130   (*4) Baltics.  Now on 8859-10.
 131   (*15) Nicknamed Latin0; Euro sign as well as French and Finnish
 132         letters that are missing from 8859-1 are added.
 133
 134 All cp* are also available as ibm-*, ms-*, and windows-* .  See also
 135 L<http://czyborra.com/charsets/codepages.html>.
 136
 137 Macintosh encodings don't seem to be registered in such entities as
 138 IANA.  "Canonical" names in Encode are based upon Apple's Tech Note
 139 1150.  See L<http://developer.apple.com/technotes/tn/tn1150.html>
 140 for details.
 141
 142 =item KOI8 - De Facto Standard for Cyrillic world
 143
 144 Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
 145 popular on the Net.  L<Encode> comes with the following KOI
 146 charsets.  For gory details, see
 147 L<http://czyborra.com/charsets/cyrillic.html>.
 148
 149   ----------------------------------------------------------------
 150   koi8-f
 151   koi8-r cp878                                           [RFC1489]
 152   koi8-u                                                 [RFC2319]
 153
 154 =item gsm0338 - Hentai Latin 1
 155
 156 GSM0338 is for GSM handsets.  Though it shares alphanumerics with ASCII,
 157 control character ranges and other parts are mapped very differently,
 158 presumably to store Greek and Cyrillic alphabets.  This one is also
 159 covered in Encode::Byte even though it does not comply with extended
 160 ASCII.
 161
 162 =back
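
To make the differences among the single-byte Cyrillic encodings listed
above concrete, the sketch below encodes one character with three of
them.  It assumes a standard Encode installation; the choice of
character and of encodings is only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+0410 is CYRILLIC CAPITAL LETTER A.
my $char = "\x{0410}";

# The same character lands on a different upper-half byte in each
# encoding.
for my $name (qw(iso-8859-5 koi8-r cp1251)) {
    printf "%-12s 0x%s\n", $name, uc unpack('H*', encode($name, $char));
}
```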
 163
 164 =head2 The CJK: Chinese, Japanese, Korean (Multibyte)
 165
 166 Note Vietnamese is listed above.  Also read "Encoding vs Charset"
 167 below.  Also note these are implemented in a distinct module for each
 168 language, due to size concerns.  Please also refer to their
 169 respective document pages.
 170
 171 =over 4
 172
 173 =item Encode::CN -- Continental China
 174
 175   Standard      DOS/Win Macintosh                Comment/Reference
 176   ----------------------------------------------------------------
 177   euc-cn(*1)            MacChineseSimp
 178   (gbk)         cp936 (*2)
 179   gb12345-raw                      { GB12345 without CES }
 180   gb2312-raw                       { GB2312  without CES }
 181   hz
 182   iso-ir-165
 183   ----------------------------------------------------------------
 184
 185   (*1) GB2312 is aliased to this.  See L<Microsoft-related naming mess>
 186   (*2) gbk is aliased to this.  See L<Microsoft-related naming mess>
 187
 188 =item Encode::JP -- Japan
 189
 190   Standard      DOS/Win Macintosh                Comment/Reference
 191   ----------------------------------------------------------------
 192   euc-jp
 193   shiftjis      cp932   macJapanese
 194   7bit-jis
 196   iso-2022-jp                                            [RFC1468]
 197   iso-2022-jp-1                                          [RFC2237]
 198   jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
 199   jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
 200   jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
 201   ----------------------------------------------------------------
 202
 203 =item Encode::KR -- Korea
 204
 205   Standard      DOS/Win Macintosh                Comment/Reference
 206   ----------------------------------------------------------------
 207   euc-kr                MacKorean                        [RFC1557]
 208                 cp949 (*)
 209   iso-2022-kr                                            [RFC1557]
 210   johab                                  [KS X 1001:1998, Annex 3]
 211   ksc5601-raw                              { KSC5601 without CES }
 212   ----------------------------------------------------------------
 213
 214   (*) ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to
 215   this.  See below.
 216
 217
 218 =item Encode::TW -- Taiwan
 219
 220   Standard      DOS/Win Macintosh                Comment/Reference
 221   ----------------------------------------------------------------
 222   big5          cp950   MacChineseTrad
 223   big5-hkscs
 224   ----------------------------------------------------------------
 225
 226 =item Encode::HanExtra -- More Chinese via CPAN
 227
 228 Due to size concerns, additional Chinese encodings below are
 229 distributed separately on CPAN, under the name Encode::HanExtra.
 230
 231   Standard      DOS/Win Macintosh                Comment/Reference
 232   ----------------------------------------------------------------
 233   gb18030
 234   euc-tw
 235   big5plus
 236   ----------------------------------------------------------------
 237
 238 =back
 239
 240 =head2 Miscellaneous encodings
 241
 242 =over 4
 243
 244 =item Encode::EBCDIC
 245
 246 See L<perlebcdic> for details.
 247
 248   ----------------------------------------------------------------
 249   cp37
 250   cp500
 251   cp875
 252   cp1026
 253   cp1047
 254   posix-bc
 255   ----------------------------------------------------------------
 256
 257 =item Encode::Symbol
 258
 259 For symbols  and dingbats.
 260
 261   ----------------------------------------------------------------
 262   symbol
 263   dingbats
 264   MacDingbats
 265   AdobeZdingbat
 266   AdobeSymbol
 267   ----------------------------------------------------------------
 268
 269 =back
 270
 271 =head1 Unsupported encodings
 272
 273 The following are not supported as yet.  Some because they are rarely
 274 used, some because of technical difficulty.  They may be supported by
 275 external modules via CPAN in the future, however.
 276
 277 =over 4
 278
 279 =item   ISO-2022-JP-2 [RFC1554]
 280
 281 Not very popular yet.  Needs Unicode Database or equivalent to
 282 implement encode() (because it includes JIS X 0208/0212, KSC5601, and
 283 GB2312 simultaneously, whose code points in Unicode overlap.  So you
 284 need to look up the database to determine what character set a given
 285 Unicode character should belong to).
 286
 287 =item   ISO-2022-CN [RFC1922]
 288
 289 Not very popular.  Needs CNS 11643-1 and 2 which are not available in
 290 this module.  CNS 11643 is supported (via euc-tw) in
 291 Encode::HanExtra.  Autrijus may add support for this encoding in his
 292 module in the future.
 293
 294 =item various HP-UX encodings
 295
 296 The following are unsupported due to the lack of mapping data.
 297
 298   '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
 299   '15' - japanese15, korean15, and  roi15
 300
 301 =item Cyrillic encoding ISO-IR-111
 302
 303 Anton doubts its usefulness.
 304
 305 =item ISO-8859-8-1 [Hebrew]
 306
 307 None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255
 308 and MacHebrew are supported only because there were mappings
 309 available at L<http://www.unicode.org/>).  Contributions welcome.
 310
 311 =item Thai encoding TCVN
 312
 313 Ditto.
 314
 315 =item Vietnamese encodings VPS
 316
 317 Though Jungshik has reported that Mozilla supports this encoding, it was too late for us to add it.  It may be supported in the future via a separate module.  See
 318 L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf> and
 319 L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
 320 if you are interested in helping us.
 321
 322 =item various Mac encodings
 323
 324 The following are unsupported due to the lack of mapping data.
 325
 326   MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
 327   MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
 328   MacLaotian,   MacMalayalam, MacMongolian, MacOriya
 329   MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
 330   MacVietnamese
 331
 332 The rest, which are already available, are based upon the vendor mappings at
 333 L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
 334
 335 =item (Mac) Indic encodings
 336
 337 The maps for the following are available at L<http://www.unicode.org/>
 338 but remain unsupported because those encodings need an algorithmic
 339 approach, which is unsupported by F<enc2xs>.
 340
 341   MacDevanagari
 342   MacGurmukhi
 343   MacGujarati
 344
 345 For details, please see C<Unicode mapping issues and notes:> at
 346 L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
 347
 348 I believe this issue is prevalent not only for Mac Indics but also in
 349 other Indic encodings, but those mentioned above were the only Indic
 350 encoding maps that I could find at L<http://www.unicode.org/> .
 351
 352 =back
 353
 354 =head1 Encoding vs. Charset -- terminology
 355
 356 We are used to using the terms (character) I<encoding> and I<character set>
 357 interchangeably.  But just as confusing the terms byte and character is
 358 dangerous and the two should be differentiated when needed, we need to
 359 differentiate I<encoding> and I<character set>.
 360
 361 To understand that, let's follow how we make computers grok our characters.
 362
 363 =over 4
 364
 365 =item *
 366
 367 First we start with which characters to include.  We call this
 368 collection of characters a I<character repertoire>.
 369
 370 =item *
 371
 372 Then we have to give each character a unique ID so your computer can
 373 tell the difference between 'a' and 'A'.  This itemized character
 374 repertoire is now a I<character set>.
 375
 376 =item *
 377
 378 If your computer can grok the character set without further
 379 processing, you can go ahead and use it.  This is called a I<coded
 380 character set> (CCS) or I<raw character encoding>.  ASCII is used this
 381 way in most cases.
 382
 383 =item *
 384
 385 But in many cases, especially multi-byte CJK encodings, you have to
 386 tweak a little more.  Your network connection may not accept any data
 387 with the Most Significant Bit set, and your computer may not be able to
 388 tell if a given byte is a whole character or just half of it.  So you
 389 have to I<encode> the character set to use it.
 390
 391 A I<character encoding scheme> (CES) determines how to encode a given
 392 character set, or a set of multiple character sets.  7bit ISO-2022 is
 393 an example of a CES.  You switch between character sets via I<escape
 394 sequences>.
 395
 396 =back
 397
 398 Technically, or mathematically speaking, a character set encoded in
 399 such a CES that maps character by character may form a CCS.  EUC is such
 400 an example.  The CES of EUC is as follows:
 401
 402 =over 4
 403
 404 =item *
 405
 406 Map ASCII unchanged.
 407
 408 =item *
 409
 410 Map a character set that consists of 94 or 96 to the power of N
 411 members by adding 0x80 to each byte.
 412
 413 =item *
 414
 415 You can also use 0x8e and 0x8f to tell that the following sequence of
 416 characters belongs to yet another character set.  0x80 is added to
 417 each following byte.
 418
 419 =back
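
The second rule above can be checked by hand with a sketch like the
following.  It assumes the C<jis0208-raw> and C<euc-jp> encodings from
Encode::JP are available; the variable names and the sample character
are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{3042}";    # HIRAGANA LETTER A

# jis0208-raw is the bare JIS X 0208 code, with no CES applied.
my $raw = encode('jis0208-raw', $char);

# Apply the EUC rule by hand: add 0x80 to each byte.
my $by_hand = join '', map { chr(ord($_) + 0x80) } split //, $raw;

# Let Encode do the same job.
my $euc = encode('euc-jp', $char);

printf "raw %s, by hand %s, euc-jp %s\n",
    map { unpack 'H*', $_ } $raw, $by_hand, $euc;
```

If the rule holds, the hand-made bytes and the C<euc-jp> output are
identical.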
 420
 421 By carefully looking at the encoded byte sequence, you may find the
 422 byte sequence forms a unique number.  In that sense EUC is a CCS
 423 generated by the CES above from up to four CCSs (complicated?).  UTF-8
 424 falls into this category.  See L<perlunicode/"UTF-8"> to find how
 425 UTF-8 maps Unicode to a byte sequence.
 426
 427 You may also see by now why 7bit ISO-2022 cannot form a CCS.  If
 428 you look at a byte sequence \x21\x21, you can't tell if it is two !'s
 429 or IDEOGRAPHIC SPACE.  EUC maps the latter to \xA1\xA1 so you have no
 430 trouble telling "!!" from IDEOGRAPHIC SPACE.
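
The ambiguity can be observed with a sketch along these lines, assuming
the C<euc-jp> and C<iso-2022-jp> encodings from Encode::JP.  It prints
the bytes each encoding produces for two exclamation marks and for
IDEOGRAPHIC SPACE (U+3000).

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bangs = "!!";
my $space = "\x{3000}";    # IDEOGRAPHIC SPACE

# Dump both strings as hex under each encoding.
for my $name (qw(euc-jp iso-2022-jp)) {
    printf "%-12s '!!' => %s   U+3000 => %s\n", $name,
        unpack('H*', encode($name, $bangs)),
        unpack('H*', encode($name, $space));
}
```

In the C<iso-2022-jp> output the payload bytes for U+3000 are the same
C<2121> as for "!!"; only the surrounding escape sequences tell the two
apart.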
 431
 432 =head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
 433
 434 This section tries to classify the supported encodings by their
 435 applicability for information exchange over the Internet and to
 436 choose the most suitable aliases to name them in the context of
 437 such communication.
 438
 439 =over 2
 440
 441 =item *
 442
 443 To (en|de)code encodings marked as C<(**)>, you need
 444 C<Encode::HanExtra>, available from CPAN.
 445
 446 =back
 447
 448 Encoding names
 449
 450   US-ASCII    UTF-8    ISO-8859-*  KOI8-R
 451   Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
 452   EUC-KR      Big5     GB2312
 453
 454 are registered with IANA as preferred MIME names and can probably
 455 be used over the Internet.
 456
 457 C<Shift_JIS> has been made official by JIS X 0208-1997.
 458 L<Microsoft-related naming mess> gives details.
 459
 460 C<GB2312> is the IANA name for C<EUC-CN>.
 461 See L<Microsoft-related naming mess> for details.
 462
 463 C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
 464 with Encode. See L<Encode::CN> for details.
 465
 466   EUC-CN
 467   KOI8-U        [RFC2319]
 468
 469 have not been registered with IANA (as of March 2002) but
 470 seem to be supported by major web browsers.
 471 The IANA name for C<EUC-CN> is C<GB2312>.
 472
 473   KS_C_5601-1987
 474
 475 is heavily misused.
 476 See L<Microsoft-related naming mess> for details.
 477
 478 C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
 479 with Encode. See L<Encode::KR> for details.
 480
 481   UTF-16 UTF-16BE UTF-16LE
 482
 483 are IANA-registered C<charset>s.  See [RFC 2781] for details.
 484 Jungshik Shin reports that UTF-16 with a BOM is well accepted
 485 by MS IE 5/6 and NS 4/6. Beware however that
 486
 487 =over 2
 488
 489 =item *
 490
 491 C<UTF-16> support in any software you're going to be
 492 using/interoperating with has probably been less tested
 493 than C<UTF-8> support
 494
 495 =item *
 496
 497 data coded with C<UTF-8> seamlessly passes traditional
 498 command piping (C<cat>, C<more>, etc.) while UTF-16 coded
 499 data is likely to cause confusion (with its zero bytes,
 500 for example)
 501
 502 =item *
 503
 504 it is beyond the power of words to describe the way HTML browsers
 505 encode non-C<ASCII> form data. To get a general impression refer to
 506 L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
 507 While encoding of form data has stabilized for C<UTF-8> coded pages
 508 (at least IE 5/6, NS 6, Opera 6 behave consistently), be sure to
 509 expect fun (and cross-browser discrepancies) with C<UTF-16> coded
 510 pages!
 511
 512 =back
 513
 514 The rule of thumb is to use C<UTF-8> unless you know what
 515 you're doing and unless you really benefit from using C<UTF-16>.
 516
 517
 518   ISO-IR-165    [RFC1345]
 519   GBK
 520   VISCII
 521   GB 12345
 522   GB 18030 (**)  (see links below)
 523   EUC-TW   (**)
 524
 525 are totally valid encodings but not registered at IANA.
 526 The names under which they are listed here are probably the
 527 most widely-known names for these encodings and are recommended
 528 names.
 529
 530   BIG5PLUS (**)
 531
 532 is a somewhat proprietary name.
 533
 534 =head2 Microsoft-related naming mess
 535
 536 Microsoft products misuse the following names:
 537
 538 =over 2
 539
 540 =item KS_C_5601-1987
 541
 542 Microsoft extension to C<EUC-KR>.
 543
 544 Proper name: C<CP949>.
 545
 546 See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
 547 for details.
 548
 549 Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
 550 misusage. I<Raw> C<KS_C_5601-1987> encoding is available as
 551 C<ksc5601-raw>.
 552
 553 See L<Encode::KR> for details.
 554
 555 =item GB2312
 556
 557 Microsoft extension to C<EUC-CN>.
 558
 559 Proper names: C<CP936>, C<GBK>.
 560
 561 C<GB2312> has been registered in the C<EUC-CN> meaning at
 562 IANA. This has partially repaired the situation: Microsoft's
 563 C<GB2312> has become a superset of the official C<GB2312>.
 564
 565 Encode aliases C<GB2312> to C<euc-cn> in full agreement with
 566 IANA registration. C<cp936> is supported separately.
 567 I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
 568
 569 See L<Encode::CN> for details.
 570
 571 =item Big5
 572
 573 Microsoft extension to C<Big5>.
 574
 575 Proper name: C<CP950>.
 576
 577 Encode separately supports C<Big5> and C<cp950>.
 578
 579 =item Shift_JIS
 580
 581 Microsoft's understanding of C<Shift_JIS>.
 582
 583 JIS has not endorsed the full Microsoft standard however.
 584 The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
 585 subsets, while Microsoft has always been meaning C<Shift_JIS> to
 586 encode a wider character repertoire.
 587
 588 As a historical predecessor, Microsoft's variant
 589 probably has more rights to the name, though it may be objected
 590 that Microsoft shouldn't have used JIS as part of the name
 591 in the first place.
 592
 593 Unambiguous name: C<CP932>.
 594
 595 Encode separately supports C<Shift_JIS> and C<cp932>.
 596
 597 =back
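
How Encode settles these names on a given installation can be checked
with a short sketch such as the one below.  It relies only on
C<Encode::resolve_alias>; the list of names is taken from the items
above.

```perl
use strict;
use warnings;
use Encode ();

# Ask Encode for the canonical name behind each disputed name.
for my $name (qw(KS_C_5601-1987 GB2312 Big5 Shift_JIS)) {
    my $canon = Encode::resolve_alias($name);
    printf "%-15s => %s\n", $name, $canon ? $canon : '(not supported)';
}
```

Per the descriptions above, C<KS_C_5601-1987> should come back as
C<cp949> and C<GB2312> as C<euc-cn>; the canonical names reported for
the other two may vary with the Encode version.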
 598
 599 =head1 Glossary
 600
 601 =over 2
 602
 603 =item character repertoire
 604
 605 A collection of unique characters.  A I<character> set in the
 606 strictest sense.  At this stage characters are not numbered.
 607
 608 =item coded character set (CCS)
 609
 610 A character set that is mapped in a way computers can use directly.
 611 Many character encodings including EUC fall in this category.
 612
 613 =item character encoding scheme (CES)
 614
 615 An algorithm to map a character set to a byte sequence.  You don't
 616 have to be able to tell which character set a given byte sequence
 617 belongs to.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an
 618 example of being both a CCS and CES.
 619
 620 =item charset (in MIME context)
 621
 622 Has long been used in the meaning of C<encoding>, CES.
 623
 624 While the word combination C<character set> has lost this meaning
 625 in MIME context since [RFC 2130], the abbreviation C<charset> has
 626 retained it.  This is how [RFC 2277], [RFC 2278] bless C<charset>:
 627
 628
 629  This document uses the term "charset" to mean a set of rules for
 630  mapping from a sequence of octets to a sequence of characters, such
 631  as the combination of a coded character set and a character encoding
 632  scheme; this is also what is used as an identifier in MIME "charset="
 633  parameters, and registered in the IANA charset registry ...  (Note
 634  that this is NOT a term used by other standards bodies, such as ISO).
 635                                                [RFC 2277]
 636
 637 =item EUC
 638
 639 Extended Unix Code.  See ISO-2022.
 640
 641 =item ISO-2022
 642
 643 A CES that was carefully designed to coexist with ASCII.  There are a
 644 7-bit version and an 8-bit version.
 645
 646 The 7-bit version switches character sets via escape sequences so it
 647 cannot form a CCS.  Since this is more difficult to handle in programs
 648 than the 8-bit version, the 7-bit version is not very popular except for
 649 iso-2022-jp, the de facto standard CES for e-mails.
 650
 651 The 8-bit version can form a CCS.  EUC and ISO-8859 are two examples
 652 thereof.  Pre-5.6 perl could use them as string literals.
 653
 654 =item UCS
 655
 656 Short for I<Universal Character Set>.  When you say just UCS, it means
 657 I<Unicode>.
 658
 659 =item UCS-2
 660
 661 ISO/IEC 10646 encoding form: Universal Character Set coded in two
 662 octets.
 663
 664 =item Unicode
 665
 666 A character set that aims to include all character repertoires of the
 667 world.  Many character sets in various national as well as industrial
 668 standards have become, in a way, just subsets of Unicode.
 669
 670 =item UTF
 671
 672 Short for I<Unicode Transformation Format>.  Determines how to map a
 673 Unicode character into a byte sequence.
 674
 675 =item UTF-16
 676
 677 A UTF in 16-bit encoding.  Can be either big endian or little
 678 endian.  The big endian version is called UTF-16BE (equal to UCS-2 +
 679 surrogate support) and the little endian version is UTF-16LE.
 680
 681 =back
 682
 683 =head1 See Also
 684
 685 L<Encode>,
 686 L<Encode::Byte>,
 687 L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
 688 L<Encode::EBCDIC>, L<Encode::Symbol>
 689
 690 =head1 References
 691
 692 =over 2
 693
 694 =item ECMA
 695
 696 European Computer Manufacturers Association
 697 L<http://www.ecma.ch>
 698
 699 =over 2
 700
 701 =item ECMA-035 (eq C<ISO-2022>)
 702
 703 L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
 704
 705 The very specification of ISO-2022 is available from the link above.
 706
 707 =back
 708
 709 =item IANA
 710
 711 Internet Assigned Numbers Authority
 712 L<http://www.iana.org/>
 713
 714 =over 2
 715
 716 =item Assigned Charset Names by IANA
 717
 718 L<http://www.iana.org/assignments/character-sets>
 719
 720 Most of the C<canonical names> in Encode derive from this list
 721 so you can directly apply the string you have extracted from the MIME
 722 headers of mails and web pages.
 723
 724 =back
 725
 726 =item ISO
 727
 728 International Organization for Standardization
 729 L<http://www.iso.ch/>
 730
 731 =item RFC
 732
 733 Request For Comments -- need I say more?
 734 L<http://www.rfc.net/>, L<http://www.faqs.org/rfcs/>
 735
 736 =item UC
 737
 738 Unicode Consortium
 739 L<http://www.unicode.org/>
 740
 741 =over 2
 742
 743 =item Unicode Glossary
 744
 745 L<http://www.unicode.org/glossary/>
 746
 747 The glossary of this document is based upon this site.
 748
 749 =back
 750
 751 =back
 752
 753 =head2 Other Notable Sites
 754
 755 =over 2
 756
 757 =item czyborra.com
 758
 759 L<http://czyborra.com/>
 760
 761 Contains a lot of useful information, especially gory details of ISO
 762 vs. vendor mappings.
 763
 764 =item CJK.inf
 765
 766 L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
 767
 768 Somewhat obsolete (last update in 1996), but still useful.  Also try
 769
 770 L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
 771
 772 You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.
 773
 774 =item Jungshik Shin's Hangul FAQ
 775
 776 L<http://jshin.net/faq>
 777
 778 And especially its subject 8
 779
 780 L<http://jshin.net/faq/qa8.html>
 781
 782 a comprehensive overview of the Korean (C<KS *>) standards.
 783
 784 =back
 785
 786 =head2 Offline sources
 787
 788 =over 2
 789
 790 =item C<CJKV Information Processing> by Ken Lunde
 791
 792 CJKV Information Processing
 793 1999 O'Reilly & Associates, ISBN : 1-56592-224-7
 794
 795 The modern successor of the C<CJK.inf>.
 796
 797 Features comprehensive coverage of CJKV character sets and
 798 encodings along with many other issues faced by anyone trying
 799 to better support CJKV languages/scripts in all the areas of
 800 information processing.
 801
 802 To purchase this book visit
 803 L<http://www.oreilly.com/catalog/cjkvinfo/>
 804
 805 =back
 806
 807 =cut
 808
 809 I could not find this page because the hostname doesn't resolve!
 810
 811  Brief description for most of the mentioned CJK encodings
 812 L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>