=head1 NAME

Encode::Supported -- Supported encodings by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored. In addition an encoding may have aliases.
Each encoding has one "canonical" name. The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence (with a few exceptions).

=over

=item *

The name used by the Perl community. That includes 'utf8' and 'ascii'.
Unlike aliases, canonical names directly reach the method, so
frequently used names like 'utf8' don't need an alias lookup.

=item *

The MIME name as defined in IETF RFCs. This includes all "iso-"'s.

=item *

The name in the IANA registry.

=item *

The name used by the organization that defined it.

=back

When a I<de jure> canonical name differs from the one used by the
Encode module, it is always aliased, provided the encoding is
implemented. So you can safely tell whether a given encoding is
implemented just by passing the canonical name.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses an encoding object internally
once an operation is in progress.
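
The lookup described above can be sketched as follows. This is only an
illustration: the sample names are arbitrary, and the results assume a
stock Encode installation. C<Encode::resolve_alias> returns the
canonical name for an alias (or a false value when the encoding is not
available), and C<find_encoding> returns the encoding object itself.

  use strict;
  use warnings;
  use Encode qw(find_encoding);

  # An alias resolves to its canonical name; the lookup ignores case.
  my $canon = Encode::resolve_alias("Latin1");

  # A name that nothing implements resolves to a false value.
  my $bogus = Encode::resolve_alias("no-such-encoding");

  # find_encoding() hands back the encoding object used internally.
  my $obj   = find_encoding("latin1");
  my $bytes = $obj->encode("caf\x{e9}");

  printf "%s / %s / %s\n",
      $canon, ($bogus ? "found" : "not found"), $obj->name;

Holding on to the object avoids repeating the alias lookup when the
same encoding is used many times.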

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.
In other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available. Encode.pm will automatically load those modules on demand.
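
As a rough sketch of on-demand loading (the sample character is chosen
only for illustration), C<< Encode->encodings() >> lists the encodings
whose modules are already loaded, while
C<< Encode->encodings(":all") >> also lists those that can be loaded
when first used:

  use strict;
  use warnings;
  use Encode;

  # Encodings whose modules are loaded right now.
  my @loaded = Encode->encodings();

  # Every encoding that Encode knows how to load on demand.
  my @all = Encode->encodings(":all");

  # No "use Encode::JP" is needed; the module is pulled in here.
  my $bytes = encode("euc-jp", "\x{3042}");    # HIRAGANA LETTER A

  printf "%d loaded, %d available\n", scalar @loaded, scalar @all;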

=head2 Built-in Encodings

The following encodings are always available.

  Canonical     Aliases       Comments & References
  ----------------------------------------------------------------
  ascii         US-ascii      [ECMA]
  iso-8859-1    latin1        [ISO]
  utf8          UTF-8         [RFC2279]
  ----------------------------------------------------------------

=head2 Encode::Unicode -- other Unicode encodings

Unicode coding schemes other than native utf8 are supported by
Encode::Unicode, which will be autoloaded on demand.

  ----------------------------------------------------------------
  UCS-2BE       UCS-2, iso-10646-1      [IANA, UC]
  UCS-2LE                               [UC]
  UTF-16                                [UC]
  UTF-16BE                              [UC]
  UTF-16LE                              [UC]
  UTF-32                                [UC]
  UTF-32BE                              [UC]
  UTF-32LE                              [UC]
  ----------------------------------------------------------------

To find out how those (UCS-2|UTF-(16|32))(LE|BE)? differ from one
another, see L<Encode::Unicode>.
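
A small sketch of the byte-order differences, using the single
character "A" as an arbitrary sample. The helper C<$hex> is introduced
here only for display:

  use strict;
  use warnings;
  use Encode qw(encode);

  # Render a byte string as space-separated hex pairs.
  my $hex = sub { join " ", map { sprintf "%02X", ord } split //, shift };

  # Same character, four coding schemes.
  for my $name (qw(UTF-16BE UTF-16LE UTF-16 UTF-32LE)) {
      printf "%-8s %s\n", $name, $hex->(encode($name, "A"));
  }

When neither BE nor LE is given, C<encode> emits big-endian data with a
byte order mark in front, which is why the C<UTF-16> form is two bytes
longer than the other 16-bit forms.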

=head2 Encode::Byte -- Extended ASCII

Encode::Byte implements most single-byte encodings except for
Symbols and EBCDIC. The following encodings are single-byte
encodings implemented as extended ASCII. In most cases they use
\x80-\xff (upper half) to map non-ASCII characters.
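
To illustrate what "upper half" means in practice, here is a minimal
sketch that decodes the same byte through three of the encodings listed
below. The byte 0xB1 is an arbitrary choice:

  use strict;
  use warnings;
  use Encode qw(decode);

  # The same upper-half byte means something different in each encoding.
  my %char;
  for my $name (qw(iso-8859-1 iso-8859-2 koi8-r)) {
      $char{$name} = decode($name, "\xB1");
      printf "%-10s 0xB1 -> U+%04X\n", $name, ord $char{$name};
  }

  # The lower half (ASCII) decodes identically across the family.
  my $same = decode("iso-8859-2", "abc") eq decode("koi8-r", "abc");

Only the interpretation of bytes \x80-\xff varies, which is why data in
these encodings must always travel with its encoding name.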

=over 2

=item ISO-8859 and corresponding vendor mappings

Since there are so many, they are presented in table format with
languages and corresponding encoding names by vendors. Note that the
table is sorted in order of ISO-8859, and that the corresponding vendor
mappings are slightly different from those of ISO. See
L<http://czyborra.com/charsets/iso8859.html> for details.

  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  ----------------------------------------------------------------
  N. America    (ASCII)         cp437        AdobeStandardEncoding
                                cp863 (DOSCanadaF)
  W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
                                                         hp-roman8
                                cp860 (DOSPortuguese)
  Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                MacCroatian
                                                MacRomanian
                                                MacRumanian
  Latin3 [1]    iso-8859-3
  Latin4 [2]    iso-8859-4
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
    (Also see next section)     cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11 [3] cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9 [4]    iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------

  [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
  [2] Baltics. Now on 8859-10.
  [3] Also known as TIS 620.
  [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
      letters that are missing from 8859-1 are added.

All cp* are also available as ibm-*, ms-*, and windows-*. See also
L<http://czyborra.com/charsets/codepages.html>.

Macintosh encodings don't seem to be registered with such entities as
IANA. "Canonical" names in Encode are based upon Apple's Tech Note
1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.

=item KOI8 - De Facto Standard for Cyrillic world

Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net. L<Encode> comes with the following KOI charsets.
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.

  ----------------------------------------------------------------
  koi8-f
  koi8-r        cp878           [RFC1489]
  koi8-u                        [RFC2319]
  ----------------------------------------------------------------

=item gsm0338 - Hentai Latin 1

GSM0338 is for GSM handsets. Though it shares alphanumerals with
ASCII, control character ranges and other parts are mapped very
differently, presumably to store Greek and Cyrillic alphabets.
This is also covered in Encode::Byte even though it does not
conform to extended ASCII.

=back
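
The difference between KOI8 and ISO-8859-5 mentioned above can be seen
with a short sketch. The three-letter Cyrillic sample is an arbitrary
choice made for this example:

  use strict;
  use warnings;
  use Encode qw(encode decode);

  my $text = "\x{041C}\x{0438}\x{0440}";    # Cyrillic letters EM, I, ER

  my $koi8 = encode("koi8-r",     $text);
  my $iso  = encode("iso-8859-5", $text);

  # Both use one byte per character, but the byte values differ.
  printf "koi8-r:     %s\n", unpack("H*", $koi8);
  printf "iso-8859-5: %s\n", unpack("H*", $iso);

  # Decoding with the matching name restores the original text.
  my $round_trip = decode("koi8-r", $koi8) eq $text;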

=head2 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs Charset"
below. Also note that these are implemented in a distinct module for
each language, due to size concerns. Please refer to their
respective documentation pages.

=over 4

=item Encode::CN -- Continental China

  Standard      DOS/Win   Macintosh        Comment/Reference
  ----------------------------------------------------------------
  euc-cn [1]              MacChineseSimp
  (gbk)         cp936 [2]
  gb12345-raw                              { GB12345 without CES }
  gb2312-raw                               { GB2312  without CES }
  hz
  iso-ir-165
  ----------------------------------------------------------------

  [1] GB2312 is aliased to this. See L<Microsoft-related naming mess>
  [2] gbk is aliased to this. See L<Microsoft-related naming mess>

=item Encode::JP -- Japan

  Standard      DOS/Win   Macintosh        Comment/Reference
  ----------------------------------------------------------------
  euc-jp
  shiftjis      cp932     macJapanese
  7bit-jis
  iso-2022-jp                              [RFC1468]
  iso-2022-jp-1                            [RFC2237]
  jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
  jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
  jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
  ----------------------------------------------------------------

=item Encode::KR -- Korea

  Standard      DOS/Win   Macintosh        Comment/Reference
  ----------------------------------------------------------------
  euc-kr                  MacKorean        [RFC1557]
                cp949 [1]
  iso-2022-kr                              [RFC1557]
  johab                                    [KS X 1001:1998, Annex 3]
  ksc5601-raw                              { KSC5601 without CES }
  ----------------------------------------------------------------

  [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
      See below.

=item Encode::TW -- Taiwan

  Standard      DOS/Win   Macintosh        Comment/Reference
  ----------------------------------------------------------------
  big5-eten     cp950     MacChineseTrad   {big5 aliased to big5-eten}
  big5-hkscs
  ----------------------------------------------------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  Standard      DOS/Win   Macintosh        Comment/Reference
  ----------------------------------------------------------------
  gb18030
  euc-tw
  big5plus
  ----------------------------------------------------------------

=back
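
As a hedged illustration of how the multibyte encodings above differ,
the following sketch encodes a few sample characters (chosen
arbitrarily for this example) and prints the resulting bytes in hex.
Only encodings bundled with Encode are used; Encode::HanExtra is not
required.

  use strict;
  use warnings;
  use Encode qw(encode);

  my @samples = (
      [ "euc-jp",      "\x{3042}" ],    # HIRAGANA LETTER A
      [ "shiftjis",    "\x{3042}" ],
      [ "iso-2022-jp", "\x{3042}" ],
      [ "euc-cn",      "\x{4E2D}" ],    # CJK ideograph "middle"
      [ "big5-eten",   "\x{4E2D}" ],
      [ "euc-kr",      "\x{AC00}" ],    # HANGUL SYLLABLE GA
  );

  # Map each encoding name to the hex form of its encoded sample.
  my %hex;
  for my $sample (@samples) {
      my ($name, $char) = @$sample;
      $hex{$name} = unpack "H*", encode($name, $char);
      printf "%-12s %s\n", $name, $hex{$name};
  }

The C<iso-2022-jp> result is the longest because the character is
wrapped in escape sequences that switch into JIS X 0208 and back to
ASCII.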

=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See L<perlebcdic> for details.

  ----------------------------------------------------------------
  cp37
  cp500
  cp875
  cp1026
  cp1047
  posix-bc
  ----------------------------------------------------------------

=item Encode::Symbol

For symbols and dingbats.

  ----------------------------------------------------------------
  symbol
  dingbats
  MacDingbats
  AdobeZdingbat
  AdobeSymbol
  ----------------------------------------------------------------

=back

=head1 Unsupported encodings

The following are not supported as yet, some because they are rarely
used, some because of technical difficulties. They may be supported by
external modules via CPAN in the future, however.

=over 4

=item ISO-2022-JP-2 [RFC1554]

Not very popular yet. Needs the Unicode Database or equivalent to
implement encode(), because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap. So you
need to look up the database to determine which character set a given
Unicode character should belong to.

=item ISO-2022-CN [RFC1922]

Not very popular. Needs CNS 11643-1 and 2, which are not available in
this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
Autrijus may add support for this encoding in his module in the future.

=item Various HP-UX encodings

The following are unsupported due to the lack of mapping data.

  '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
  '15' - japanese15, korean15, and roi15

=item Cyrillic encoding ISO-IR-111

Anton doubts its usefulness.

=item ISO-8859-8-1 [Hebrew]

None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
MacHebrew are supported only because there were mappings
available at L<http://www.unicode.org/>). Contributions welcome.

=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]

Ditto.

=item Thai encoding TCVN

Ditto.

=item Vietnamese encodings VPS

Though Jungshik has reported that Mozilla supports this encoding, it
was too late before 5.8.0 for us to add it. It may be supported in the
future via a separate module. See
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
and
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
if you are interested in helping us.

=item Various Mac encodings

The following are unsupported due to the lack of mapping data.

  MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
  MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
  MacLaotian,   MacMalayalam, MacMongolian, MacOriya
  MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
  MacVietnamese

The Mac encodings that are already available are based upon the vendor
mappings at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/>.

=item (Mac) Indic encodings

The maps for the following are available at L<http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, which is currently unsupported by F<enc2xs>.

  MacDevanagari
  MacGurmukhi
  MacGujarati

For details, please see C<Unicode mapping issues and notes:> at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT>.

I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but the above were the only Indic encoding
maps that I could find at L<http://www.unicode.org/>.

=back

=head1 Encoding vs. Charset -- terminology

We are used to using the terms (character) I<encoding> and I<character
set> interchangeably. But just as confusing the terms byte and
character is dangerous, and the two should be differentiated when
needed, we need to differentiate I<encoding> and I<character set>.

To understand that, let's follow how we make computers grok our
characters.

=over 4

=item *

First we start with which characters to include. We call this
collection of characters a I<character repertoire>.

=item *

Then we have to give each character a unique ID so your computer can
tell the difference between 'a' and 'A'. This itemized character
repertoire is now a I<character set>.

=item *

If your computer can grok the character set without further
processing, you can go ahead and use it. This is called a I<coded
character set> (CCS) or I<raw character encoding>. ASCII is used this
way in most cases.

=item *

But in many cases, especially with multi-byte CJK encodings, you have
to tweak a little more. Your network connection may not accept any data
with the Most Significant Bit set. Your computer may not be able to
tell if a given byte is a whole character or just half of it. So you
have to I<encode> the character set to use it.

A I<character encoding scheme> (CES) determines how to encode a given
character set, or a set of multiple character sets. 7bit ISO-2022 is
an example of a CES. You switch between character sets via I<escape
sequences>.

=back

Technically, or mathematically speaking, a character set encoded in
such a CES that maps character by character may form a CCS. EUC is such
an example. The CES of EUC is as follows:

=over 4

=item *

Map ASCII unchanged.

=item *

Map a character set that consists of 94**N or 96**N members (94 or 96
to the power of N) by adding 0x80 to each byte.

=item *

You can also use 0x8e and 0x8f to indicate that the following sequence
of bytes belongs to yet another character set. 0x80 is added to each
following byte.

=back

By carefully looking at the encoded byte sequence, you may find that
the byte sequence forms a unique number. In that sense EUC is a CCS
generated by the CES above from up to four CCSs (complicated?). UTF-8
falls into this category. See L<perlunicode/"UTF-8"> to find out how
UTF-8 maps Unicode to a byte sequence.

You may also see by now why 7bit ISO-2022 cannot form a CCS. If
you look at a byte sequence \x21\x21, you can't tell if it is two !'s
or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1, so you have no
trouble telling "!!" and IDEOGRAPHIC SPACE apart.
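
The \x21\x21 ambiguity can be reproduced with a short sketch. It
assumes the C<jis0208-raw>, C<iso-2022-jp> and C<euc-jp> encodings
listed under Encode::JP above; the variable names are chosen only for
this example.

  use strict;
  use warnings;
  use Encode qw(encode);

  my $space = "\x{3000}";    # IDEOGRAPHIC SPACE

  # The raw JIS X 0208 code (the CCS) for the character is 0x21 0x21,
  my $raw = encode("jis0208-raw", $space);

  # which, read as ASCII, is indistinguishable from two !'s.
  my $looks_like_bangs = $raw eq "!!";

  # 7-bit ISO-2022 needs escape sequences to say which set is active.
  my $iso2022 = encode("iso-2022-jp", $space);

  # EUC sets the high bit of each byte instead, so no state is needed.
  my $euc = encode("euc-jp", $space);

  printf "raw=%s iso-2022-jp=%s euc-jp=%s\n",
      map { unpack "H*", $_ } $raw, $iso2022, $euc;

Because the EUC bytes can never be mistaken for ASCII, a decoder can
classify every byte without remembering what came before it.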

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

=over 2

=item *

To (en|de)code encodings marked as C<(**)>, you need
C<Encode::HanExtra>, available from CPAN.

=back

Encoding names

  US-ASCII    UTF-8    ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
  EUC-KR      Big5     GB2312

are registered with IANA as preferred MIME names and may
be used over the Internet.

C<Shift_JIS> has been made official by JIS X 0208:1997.
L<Microsoft-related naming mess> gives details.

C<GB2312> is the IANA name for C<EUC-CN>.
See L<Microsoft-related naming mess> for details.

C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
with Encode. See L<Encode::CN> for details.

  EUC-CN
  KOI8-U        [RFC2319]

have not been registered with IANA (as of March 2002) but
seem to be supported by major web browsers.
The IANA name for C<EUC-CN> is C<GB2312>.

  KS_C_5601-1987

is heavily misused.
See L<Microsoft-related naming mess> for details.

C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
with Encode. See L<Encode::KR> for details.

  UTF-16 UTF-16BE UTF-16LE

are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware however that

=over 2

=item *

C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support

=item *

C<UTF-8> coded data seamlessly passes traditional
command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
data is likely to cause confusion (with its zero bytes,
for example)

=item *

it is beyond the power of words to describe the way HTML browsers
encode non-C<ASCII> form data. To get a general impression, visit
L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
While encoding of form data has stabilized for C<UTF-8> coded pages
(at least IE 5/6, NS 6, Opera 6 behave consistently), be sure to
expect fun (and cross-browser discrepancies) with C<UTF-16> coded
pages!

=back

The rule of thumb is to use C<UTF-8> unless you know what
you're doing and unless you really benefit from using C<UTF-16>.
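
The zero-byte problem mentioned above is easy to demonstrate. This is
a minimal sketch with an arbitrary ASCII-only sample string:

  use strict;
  use warnings;
  use Encode qw(encode);

  my $text  = "plain ASCII text";
  my $utf8  = encode("UTF-8",    $text);
  my $utf16 = encode("UTF-16LE", $text);

  # Count the NUL bytes in each encoded form.
  my $nul8  = () = $utf8  =~ /\0/g;
  my $nul16 = () = $utf16 =~ /\0/g;

  printf "UTF-8: %d bytes, %d NULs; UTF-16LE: %d bytes, %d NULs\n",
      length $utf8, $nul8, length $utf16, $nul16;

For ASCII-only text the UTF-8 form is byte-for-byte identical to the
input, while every second byte of the UTF-16 form is NUL, which is what
trips up tools that treat NUL as a string terminator.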

  ISO-IR-165    [RFC1345]
  VISCII
  GB 12345
  GB 18030 (**)  (see links below)
  EUC-TW   (**)

are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are recommended
names.

  BIG5PLUS (**)

is a somewhat proprietary name.

=head2 Microsoft-related naming mess

Microsoft products misuse the following names:

=over 2

=item KS_C_5601-1987

Microsoft extension to C<EUC-KR>.

Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).

See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
for details.

Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
misusage. I<Raw> C<KS_C_5601-1987> encoding is available as
C<ksc5601-raw>.

See L<Encode::KR> for details.

=item GB2312

Microsoft extension to C<EUC-CN>.

Proper names: C<CP936>, C<GBK>.

C<GB2312> has been registered in the C<EUC-CN> meaning at
IANA. This has partially repaired the situation: Microsoft's
C<GB2312> has become a superset of the official C<GB2312>.

Encode aliases C<GB2312> to C<euc-cn> in full agreement with
IANA registration. C<cp936> is supported separately.
I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.

See L<Encode::CN> for details.

=item Big5

Microsoft extension to C<Big5>.

Proper name: C<CP950>.

Encode separately supports C<Big5> and C<cp950>.

=item Shift_JIS

Microsoft's understanding of C<Shift_JIS>.

JIS has not endorsed the full Microsoft standard, however.
The official C<Shift_JIS> includes only the JIS X 0201 and JIS X 0208
character sets, while Microsoft has always used C<Shift_JIS>
to encode a wider character repertoire. See the C<IANA> registration
for C<Windows-31J>.

As a historical predecessor, Microsoft's variant
probably has more rights to the name, although it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.

Unambiguous name: C<CP932>. C<IANA> name (not used?): C<Windows-31J>.

Encode separately supports C<Shift_JIS> and C<cp932>.

=back
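
The aliasing decisions described above can be checked with a short
sketch. The results assume a stock Encode installation, and the byte
pair used for the Shift_JIS comparison is one well-known point of
disagreement chosen for illustration.

  use strict;
  use warnings;
  use Encode qw(decode);

  # Where do the commonly misused names end up?
  my %alias = map { $_ => Encode::resolve_alias($_) }
      qw(KS_C_5601-1987 GB2312 Big5);
  printf "%-15s => %s\n", $_, $alias{$_} for sort keys %alias;

  # Bytes 0x81 0x60 are a wave dash in both mappings, but the two
  # disagree on which Unicode character that is.
  my $sjis  = decode("shiftjis", "\x81\x60");
  my $cp932 = decode("cp932",    "\x81\x60");
  printf "shiftjis: U+%04X, cp932: U+%04X\n", ord $sjis, ord $cp932;

Differences like this one are why the two names should not be used
interchangeably when data has to round-trip through Unicode.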

=head1 Glossary

=over 2

=item character repertoire

A collection of unique characters. A I<character set> in the
strictest sense. At this stage characters are not numbered.

=item coded character set (CCS)

A character set that is mapped in a way computers can use directly.
Many character encodings, including EUC, fall in this category.

=item character encoding scheme (CES)

An algorithm to map a character set to a byte sequence. You don't
have to be able to tell which character set a given byte sequence
belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
example of being both a CCS and a CES.

=item charset (in MIME context)

has long been used in the meaning of C<encoding>, CES.

While the C<character set> word combination has lost this meaning
in MIME context since [RFC 2130], the C<charset> abbreviation has
retained it. This is how [RFC 2277], [RFC 2278] bless C<charset>:

 This document uses the term "charset" to mean a set of rules for
 mapping from a sequence of octets to a sequence of characters, such
 as the combination of a coded character set and a character encoding
 scheme; this is also what is used as an identifier in MIME "charset="
 parameters, and registered in the IANA charset registry ...  (Note
 that this is NOT a term used by other standards bodies, such as ISO).
 [RFC 2277]

=item EUC

Extended Unix Character. See ISO-2022.

=item ISO-2022

A CES that was carefully designed to coexist with ASCII. There are a
7-bit version and an 8-bit version.

The 7-bit version switches character sets via escape sequences, so it
cannot form a CCS. Since this is more difficult to handle in programs
than the 8-bit version, the 7-bit version is not very popular except
for iso-2022-jp, the de facto standard CES for e-mails.

The 8-bit version can form a CCS. EUC and ISO-8859 are two examples
thereof. Pre-5.6 perl could use them as string literals.

=item UCS

Short for I<Universal Character Set>. When you say just UCS, it means
I<Unicode>.

=item UCS-2

ISO/IEC 10646 encoding form: Universal Character Set coded in two
octets.

=item Unicode

A character set that aims to include all character repertoires of the
world. Many character sets in various national as well as industrial
standards have become, in a way, just subsets of Unicode.

=item UTF

Short for I<Unicode Transformation Format>. Determines how to map a
Unicode character into a byte sequence.

=item UTF-16

A UTF in 16-bit encoding. Can be either big endian or little
endian. The big endian version is called UTF-16BE (equal to UCS-2 +
surrogate support) and the little endian version is UTF-16LE.

=back
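
The "UCS-2 + surrogate support" remark in the UTF-16 entry can be
illustrated with a minimal sketch. The two sample code points are
arbitrary: one inside the Basic Multilingual Plane and one just beyond
it.

  use strict;
  use warnings;
  use Encode qw(encode);

  my $bmp    = "\x{263A}";     # WHITE SMILING FACE, fits in 16 bits
  my $beyond = "\x{10000}";    # first code point past U+FFFF

  # Inside the BMP, UCS-2BE and UTF-16BE produce the same two octets.
  my $ucs2  = encode("UCS-2BE",  $bmp);
  my $utf16 = encode("UTF-16BE", $bmp);

  # Beyond the BMP, UTF-16 uses a surrogate pair (four octets).
  my $pair = encode("UTF-16BE", $beyond);

  printf "BMP: %s / %s, beyond: %s\n",
      map { unpack "H*", $_ } $ucs2, $utf16, $pair;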

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>

=head1 References

=over 2

=item ECMA

European Computer Manufacturers Association
L<http://www.ecma.ch>

=over 2

=item ECMA-035 (eq C<ISO-2022>)

L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

The very specification of ISO-2022 is available from the link above.

=back

=item IANA

Internet Assigned Numbers Authority
L<http://www.iana.org/>

=over 2

=item Assigned Charset Names by IANA

L<http://www.iana.org/assignments/character-sets>

Most of the C<canonical names> in Encode derive from this list,
so you can directly apply the string you have extracted from the MIME
headers of mails and web pages.

=back

=item ISO

International Organization for Standardization
L<http://www.iso.ch/>

=item RFC

Request For Comments -- need I say more?
L<http://www.rfc.net/>, L<http://www.faqs.org/rfcs/>

=item UC

Unicode Consortium
L<http://www.unicode.org/>

=over 2

=item Unicode Glossary

L<http://www.unicode.org/glossary/>

The glossary of this document is based upon this site.

=back

=back

=head2 Other Notable Sites

=over 2

=item czyborra.com

L<http://czyborra.com/>

Contains a lot of useful information, especially gory details of ISO
vs. vendor mappings.

=item CJK.inf

L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Somewhat obsolete (last update in 1996), but still useful. Also try

L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

You will find brief info on C<EUC-CN> and C<GBK>, and mostly on
C<GB 18030>.

=item Jungshik Shin's Hangul FAQ

L<http://jshin.net/faq>

And especially its subject 8.

L<http://jshin.net/faq/qa8.html>

A comprehensive overview of the Korean (C<KS *>) standards.

=back

=head2 Offline sources

=over 2

=item C<CJKV Information Processing> by Ken Lunde

CJKV Information Processing
1999 O'Reilly & Associates, ISBN : 1-56592-224-7

The modern successor of C<CJK.inf>.

Features comprehensive coverage of CJKV character sets and
encodings along with many other issues faced by anyone trying
to better support CJKV languages/scripts in all the areas of
information processing.

To purchase this book visit
L<http://www.oreilly.com/catalog/cjkvinfo/>

=back

=cut

I could not find this page because the hostname doesn't resolve!

Brief description for most of the mentioned CJK encodings
L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>