=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored. In addition, an encoding may have aliases.
Each encoding has one "canonical" name. The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence (with a few exceptions).

=over

=item *

The name used by the Perl community. That includes 'utf8' and 'ascii'.
Unlike aliases, canonical names directly reach the method, so such
frequently used words as 'utf8' don't need to do alias lookups.

=item *

The MIME name as defined in IETF RFCs. This includes all "iso-"s.

=item *

The name in the IANA registry.

=item *

The name used by the organization that defined it.

=back

Where the I<de jure> canonical name differs from the one used by the
Encode module, it is always aliased, provided the encoding is
implemented at all. So you can safely tell whether a given encoding is
implemented just by passing its canonical name.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses an encoding object internally
once an operation is in progress.

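The naming rules above can be sketched as follows. This is only an
illustration, assuming a Perl with the Encode module installed; the
names passed in are arbitrary examples. C<find_encoding> returns the
encoding object, and C<Encode::resolve_alias> returns the canonical
name or nothing at all.

```perl
use strict;
use warnings;
use Encode;

# An alias and a differently-cased name both reach one canonical name.
my $from_alias = find_encoding("latin1")->name;
my $from_upper = find_encoding("ISO-8859-1")->name;
print "$from_alias $from_upper\n";

# resolve_alias() gives the canonical name, or undef when the
# encoding is not implemented (iso-8859-12 does not exist).
my $known   = Encode::resolve_alias("latin1");
my $unknown = Encode::resolve_alias("iso-8859-12");
print defined $unknown ? "implemented\n" : "not implemented\n";
```

Comparing C<Encode::resolve_alias($name)> with C<$name> itself is one
way to check whether a name is already canonical.
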
=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.
In other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules
but in most cases you don't have to C<use Encode::XX> to make them
available. Encode.pm will automatically load those modules on demand.

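A minimal sketch of on-demand loading, assuming a stock Encode
installation: the script below never says C<use Encode::Byte> or
C<use Encode::JP>, yet the encodings they provide are usable. The
character chosen is just an example.

```perl
use strict;
use warnings;
use Encode;

# No "use Encode::Byte" here: the module behind koi8-r is loaded
# the first time the encoding is requested.
my $bytes = encode("koi8-r", "\x{0410}");    # CYRILLIC CAPITAL LETTER A
my $hex   = unpack("H*", $bytes);
print "koi8-r: $hex\n";

# encodings(":all") lists every encoding Encode knows how to load,
# including those whose modules have not been loaded yet.
my @all     = Encode->encodings(":all");
my $has_euc = grep { $_ eq "euc-jp" } @all;
print scalar(@all), " encodings available\n";
```

Calling C<< Encode->encodings() >> without arguments lists only the
encodings that are currently loaded.
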
=head2 Built-in Encodings

The following encodings are always available.

  Canonical     Aliases                      Comments & References
  ----------------------------------------------------------------
  ascii         US-ascii                     [ECMA]
  iso-8859-1    latin1                       [ISO]
  utf8          UTF-8                        [RFC2279]
  ----------------------------------------------------------------

=head2 Encode::Unicode -- other Unicode encodings

Unicode coding schemes other than native utf8 are supported by
Encode::Unicode, which will be autoloaded on demand.

  ----------------------------------------------------------------
  UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
  UCS-2LE                                                     [UC]
  UTF-16                                                      [UC]
  UTF-16BE                                                    [UC]
  UTF-16LE                                                    [UC]
  UTF-32                                                      [UC]
  UTF-32BE                                                    [UC]
  UTF-32LE                                                    [UC]
  ----------------------------------------------------------------

To find out how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
see L<Encode::Unicode>.

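As a rough illustration of those differences (a sketch only, using an
arbitrary ASCII character), the same code point comes out as different
byte sequences depending on the byte order and on whether a byte order
mark (BOM) is written:

```perl
use strict;
use warnings;
use Encode;

my $char = "A";    # U+0041

# Explicit byte order: no BOM is written.
my $be = unpack("H*", encode("UTF-16BE", $char));
my $le = unpack("H*", encode("UTF-16LE", $char));

# Plain "UTF-16" writes a BOM followed by big-endian data.
my $bom = unpack("H*", encode("UTF-16", $char));

# UTF-32 uses four bytes per character.
my $be32 = unpack("H*", encode("UTF-32BE", $char));

print "$be $le $bom $be32\n";    # 0041 4100 feff0041 00000041
```
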
=head2 Encode::Byte -- Extended ASCII

Encode::Byte implements most single-byte encodings except for
Symbols and EBCDIC. The following are single-byte encodings
implemented as extended ASCII. In most cases they use
\x80-\xff (upper half) to map non-ASCII characters.

=over 2

=item ISO-8859 and corresponding vendor mappings

Since there are so many, they are presented in table format with
languages and corresponding encoding names by vendors. Note that the
table is sorted in order of ISO-8859 and that the corresponding vendor
mappings are slightly different from those of ISO. See
L<http://czyborra.com/charsets/iso8859.html> for details.

  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  ----------------------------------------------------------------
  N. America    (ASCII)         cp437        AdobeStandardEncoding
                                cp863 (DOSCanadaF)
  W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
                                                          hp-roman8
                                cp860 (DOSPortuguese)
  Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                MacCroatian
                                                MacRomanian
                                                MacRumanian
  Latin3 [1]    iso-8859-3
  Latin4 [2]    iso-8859-4
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
    (Also see next section)     cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11 [3] cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9 [4]    iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------

  [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
  [2] Baltics. Now on 8859-10.
  [3] Also known as TIS 620.
  [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
      letters that are missing from 8859-1 are added.

All cp* are also available as ibm-*, ms-*, and windows-*. See also
L<http://czyborra.com/charsets/codepages.html>.

Macintosh encodings don't seem to be registered in such entities as
IANA. "Canonical" names in Encode are based upon Apple's Tech Note
1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.

=item KOI8 - De Facto Standard for Cyrillic world

Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net. L<Encode> comes with the following KOI charsets.
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.

  ----------------------------------------------------------------
  koi8-f
  koi8-r cp878                                           [RFC1489]
  koi8-u                                                 [RFC2319]
  ----------------------------------------------------------------

=item gsm0338 - Hentai Latin 1

GSM0338 is for GSM handsets. Though it shares alphanumerics with
ASCII, control character ranges and other parts are mapped very
differently, presumably to store Greek and Cyrillic alphabets.
This is also covered in Encode::Byte even though it does not
comply with extended ASCII.

=back

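To make the "upper half" point concrete, here is a small sketch
(assuming a stock Encode installation; the characters and bytes are
arbitrary examples). The same character maps to a different byte in
each single-byte encoding, and the same byte decodes to a different
character:

```perl
use strict;
use warnings;
use Encode;

my $euro = "\x{20AC}";    # EURO SIGN, absent from iso-8859-1

# The same character sits at a different upper-half byte per encoding.
my $in_latin9 = unpack("H*", encode("iso-8859-15", $euro));
my $in_cp1252 = unpack("H*", encode("cp1252",      $euro));
print "iso-8859-15: $in_latin9, cp1252: $in_cp1252\n";

# Conversely, one byte decodes to different characters.
my $as_latin1 = sprintf "U+%04X", ord(decode("iso-8859-1", "\xE1"));
my $as_koi8r  = sprintf "U+%04X", ord(decode("koi8-r",     "\xE1"));
print "0xE1 is $as_latin1 in iso-8859-1 and $as_koi8r in koi8-r\n";
```
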
=head2 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs Charset"
below. Also note that these are implemented in distinct modules by
language, due to size concerns. Please refer to their
respective documentation pages.

=over 4

=item Encode::CN -- Continental China

    Standard      DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    euc-cn [1]            MacChineseSimp
    (gbk)         cp936 [2]
    gb12345-raw                            { GB12345 without CES }
    gb2312-raw                             { GB2312  without CES }
    hz
    iso-ir-165
    ----------------------------------------------------------------

    [1] GB2312 is aliased to this. See L<Microsoft-related naming mess>
    [2] gbk is aliased to this. See L<Microsoft-related naming mess>

=item Encode::JP -- Japan

    Standard      DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    euc-jp
    shiftjis      cp932   macJapanese
    7bit-jis
    iso-2022-jp                                            [RFC1468]
    iso-2022-jp-1                                          [RFC2237]
    jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
    jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
    jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
    ----------------------------------------------------------------

=item Encode::KR -- Korea

    Standard      DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    euc-kr                MacKorean                        [RFC1557]
                  cp949 [1]
    iso-2022-kr                                            [RFC1557]
    johab                                  [KS X 1001:1998, Annex 3]
    ksc5601-raw                              { KSC5601 without CES }
    ----------------------------------------------------------------

    [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
        See below.

=item Encode::TW -- Taiwan

    Standard      DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    big5          cp950   MacChineseTrad
    big5-hkscs
    ----------------------------------------------------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

    Standard      DOS/Win Macintosh                Comment/Reference
    ----------------------------------------------------------------
    gb18030
    euc-tw
    big5plus
    ----------------------------------------------------------------

=back

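As a brief sketch of how the multibyte encodings above differ (assuming
the bundled Encode::CN, Encode::JP and Encode::TW are installed; the
character is an arbitrary example), one ideograph encodes to a different
two-byte sequence in each of them:

```perl
use strict;
use warnings;
use Encode;

my $ideograph = "\x{4E2D}";    # CJK UNIFIED IDEOGRAPH-4E2D

# Encode the same character with four CJK encodings.
my %hex_for;
for my $name (qw(euc-cn big5 euc-jp shiftjis)) {
    $hex_for{$name} = unpack("H*", encode($name, $ideograph));
}
printf "%-9s %s\n", $_, $hex_for{$_} for sort keys %hex_for;

# Decoding gives back the original code point.
my $round_trip = decode("big5", encode("big5", $ideograph));
```
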
=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See L<perlebcdic> for details.

  ----------------------------------------------------------------
  cp37
  cp500
  cp875
  cp1026
  cp1047
  posix-bc
  ----------------------------------------------------------------

=item Encode::Symbol

For symbols and dingbats.

  ----------------------------------------------------------------
  symbol
  dingbats
  MacDingbats
  AdobeZdingbat
  AdobeSymbol
  ----------------------------------------------------------------

=back

=head1 Unsupported encodings

The following are not supported as yet. Some because they are rarely
used, some because of technical difficulties. They may be supported by
external modules via CPAN in the future, however.

=over 4

=item ISO-2022-JP-2 [RFC1554]

Not very popular yet. Needs the Unicode Database or equivalent to
implement encode() (because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap, so you
need to look up the database to determine which character set a given
Unicode character should belong to).

=item ISO-2022-CN [RFC1922]

Not very popular. Needs CNS 11643-1 and 2, which are not available in
this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
Autrijus may add support for this encoding in his module in the future.

=item Various HP-UX encodings

The following are unsupported due to the lack of mapping data.

  '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
  '15' - japanese15, korean15, and roi15

=item Cyrillic encoding ISO-IR-111

Anton doubts its usefulness.

=item ISO-8859-8-1 [Hebrew]

None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255
and MacHebrew are supported only because there were mappings
available at L<http://www.unicode.org/>). Contributions welcome.

=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]

Ditto.

=item Thai encoding TCVN

Ditto.

=item Vietnamese encodings VPS

Though Jungshik has reported that Mozilla supports this encoding, it
was too late for us to add it before 5.8.0. It may be added in the
future via a separate module. See
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
and
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
if you are interested in helping us.

=item Various Mac encodings

The following are unsupported due to the lack of mapping data.

  MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
  MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
  MacLaotian,   MacMalayalam, MacMongolian, MacOriya
  MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
  MacVietnamese

The Mac encodings that are already available are based upon the vendor
mappings at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/>.

=item (Mac) Indic encodings

The maps for the following are available at L<http://www.unicode.org/>
but they remain unsupported because those encodings need an algorithmic
approach, which is currently unsupported by F<enc2xs>.

  MacDevanagari
  MacGurmukhi
  MacGujarati

For details, please see C<Unicode mapping issues and notes:> at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT>.

I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but the above were the only Indic encoding
maps that I could find at L<http://www.unicode.org/>.

=back

=head1 Encoding vs. Charset -- terminology

We are used to using the terms (character) I<encoding> and I<character
set> interchangeably. But just as confusing the terms byte and
character is dangerous and the two should be differentiated when
needed, we need to differentiate I<encoding> and I<character set>.

To understand that, let's follow how we make computers grok our
characters.

=over 4

=item *

First we start with which characters to include. We call this
collection of characters a I<character repertoire>.

=item *

Then we have to give each character a unique ID so your computer can
tell 'a' from 'A'. This itemized character
repertoire is now a I<character set>.

=item *

If your computer can grok the character set without further
processing, you can go ahead and use it. This is called a I<coded
character set> (CCS) or I<raw character encoding>. ASCII is used this
way in most cases.

=item *

But in many cases, especially with multi-byte CJK encodings, you have
to tweak a little more. Your network connection may not accept any
data with the Most Significant Bit set. Your computer may not be able
to tell if a given byte is a whole character or just half of it. So
you have to I<encode> the character set to use it.

A I<character encoding scheme> (CES) determines how to encode a given
character set, or a set of multiple character sets. 7bit ISO-2022 is
an example of a CES. You switch between character sets via I<escape
sequences>.

=back

Technically, or mathematically speaking, a character set encoded in
such a CES that maps character by character may form a CCS. EUC is such
an example. The CES of EUC is as follows:

=over 4

=item *

Map ASCII unchanged.

=item *

Map a character set that consists of 94 or 96 to the power of N
members by adding 0x80 to each byte.

=item *

You can also use 0x8e and 0x8f to indicate that the following sequence
of characters belongs to yet another character set. 0x80 is added to
each following byte.

=back

By carefully looking at the encoded byte sequence, you may find that
the byte sequence forms a unique number. In that sense EUC is a CCS
generated by the CES above from up to four CCSs (complicated?). UTF-8
falls into this category. See L<perlunicode/"UTF-8"> to find out how
UTF-8 maps Unicode to a byte sequence.

You may also see by now why 7bit ISO-2022 cannot form a CCS. If
you look at the byte sequence \x21\x21, you can't tell if it is two !'s
or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1 so you have no
trouble telling "!!" from an ideographic space.

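The IDEOGRAPHIC SPACE example can be checked with a short sketch
(assuming the bundled Encode::JP is installed). Stripping the 0x80
offset by hand, as done below, is only an illustration of the EUC rule
described above, not something Encode requires you to do:

```perl
use strict;
use warnings;
use Encode;

my $space = "\x{3000}";    # IDEOGRAPHIC SPACE

# EUC-JP: the two JIS X 0208 bytes with 0x80 added to each.
my $euc     = encode("euc-jp", $space);
my $euc_hex = unpack("H*", $euc);

# Undo the 0x80 offset to recover the raw 7-bit bytes.
my $raw = join "", map { chr(ord($_) & 0x7F) } split //, $euc;

# 7bit ISO-2022-JP: the same raw bytes wrapped in escape sequences.
my $jis_hex = unpack("H*", encode("iso-2022-jp", $space));

# Two exclamation marks stay as plain ASCII in EUC-JP.
my $bangs = encode("euc-jp", "!!");

print "$euc_hex $raw $jis_hex $bangs\n";
```

Without the surrounding escape sequences, the raw bytes of the
ideographic space are indistinguishable from the string "!!".
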
=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

=over 2

=item *

To (en|de)code encodings marked as C<(**)>, you need
C<Encode::HanExtra>, available from CPAN.

=back

Encoding names

  US-ASCII    UTF-8    ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
  EUC-KR      Big5     GB2312

are registered with IANA as preferred MIME names and may
be used over the Internet.

C<Shift_JIS> has been made official by JIS X 0208:1997.
L<Microsoft-related naming mess> gives details.

C<GB2312> is the IANA name for C<EUC-CN>.
See L<Microsoft-related naming mess> for details.

C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
with Encode. See L<Encode::CN> for details.

  EUC-CN
  KOI8-U        [RFC2319]

have not been registered with IANA (as of March 2002) but
seem to be supported by major web browsers.
The IANA name for C<EUC-CN> is C<GB2312>.

  KS_C_5601-1987

is heavily misused.
See L<Microsoft-related naming mess> for details.

C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
with Encode. See L<Encode::KR> for details.

  UTF-16 UTF-16BE UTF-16LE

are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware however that

=over 2

=item *

C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support

=item *

C<UTF-8> coded data seamlessly passes traditional
command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
data is likely to cause confusion (with its zero bytes,
for example)

=item *

it is beyond the power of words to describe the way HTML browsers
encode non-C<ASCII> form data. To get a general impression, visit
L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
While encoding of form data has stabilized for C<UTF-8> coded pages
(at least IE 5/6, NS 6, Opera 6 behave consistently), be sure to
expect fun (and cross-browser discrepancies) with C<UTF-16> coded
pages!

=back

The rule of thumb is to use C<UTF-8> unless you know what
you're doing and unless you really benefit from using C<UTF-16>.

  ISO-IR-165    [RFC1345]
  VISCII
  GB 12345
  GB 18030 (**)  (see links below)
  EUC-TW   (**)

are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are recommended
names.

  BIG5PLUS (**)

is a somewhat proprietary name.

=head2 Microsoft-related naming mess

Microsoft products misuse the following names:

=over 2

=item KS_C_5601-1987

Microsoft extension to C<EUC-KR>.

Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).

See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
for details.

Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
misusage. I<Raw> C<KS_C_5601-1987> encoding is available as
C<ksc5601-raw>.

See L<Encode::KR> for details.

=item GB2312

Microsoft extension to C<EUC-CN>.

Proper names: C<CP936>, C<GBK>.

C<GB2312> has been registered in the C<EUC-CN> meaning at
IANA. This has partially repaired the situation: Microsoft's
C<GB2312> has become a superset of the official C<GB2312>.

Encode aliases C<GB2312> to C<euc-cn> in full agreement with
IANA registration. C<cp936> is supported separately.
I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.

See L<Encode::CN> for details.

=item Big5

Microsoft extension to C<Big5>.

Proper name: C<CP950>.

Encode separately supports C<Big5> and C<cp950>.

=item Shift_JIS

Microsoft's understanding of C<Shift_JIS>.

JIS has not endorsed the full Microsoft standard, however.
The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
subsets, while Microsoft has always used C<Shift_JIS> to
encode a wider character repertoire. See the C<IANA> registration for
C<Windows-31J>.

As a historical predecessor, Microsoft's variant
probably has more rights to the name, though it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.

Unambiguous name: C<CP932>. C<IANA> name (not used?): C<Windows-31J>.

Encode separately supports C<Shift_JIS> and C<cp932>.

=back

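The aliasing described in this section can be observed with a small
sketch (assuming the bundled Encode::CN, Encode::KR and Encode::JP are
installed). The circled digit is merely an example of a character that
belongs to Microsoft's extension of C<Shift_JIS> and not to JIS X 0208:

```perl
use strict;
use warnings;
use Encode;

# Names as commonly seen in the wild, and what Encode maps them to.
my $gb  = find_encoding("GB2312")->name;
my $gbk = find_encoding("GBK")->name;
my $ks  = find_encoding("KS_C_5601-1987")->name;
print "GB2312 => $gb, GBK => $gbk, KS_C_5601-1987 => $ks\n";

# shiftjis and cp932 are separate encodings: a vendor extension
# character is encodable in cp932 only.
my $circled_one = "\x{2460}";    # CIRCLED DIGIT ONE
my $in_cp932 = unpack("H*", encode("cp932",    $circled_one));
my $in_sjis  = unpack("H*", encode("shiftjis", $circled_one));
print "cp932: $in_cp932, shiftjis: $in_sjis\n";
```

With the default CHECK value, C<encode> substitutes a replacement
character for anything the target encoding cannot represent, which is
why the C<shiftjis> result differs rather than raising an error.
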
=head1 Glossary

=over 2

=item character repertoire

A collection of unique characters. A I<character> set in the most
strict sense. At this stage characters are not numbered.

=item coded character set (CCS)

A character set that is mapped in a way computers can use directly.
Many character encodings, including EUC, fall in this category.

=item character encoding scheme (CES)

An algorithm to map a character set to a byte sequence. You don't
have to be able to tell which character set a given byte sequence
belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
example of being both a CCS and a CES.

=item charset (in MIME context)

has long been used in the meaning of C<encoding>, CES.

While the C<character set> word combination has lost this meaning
in MIME context since [RFC 2130], the C<charset> abbreviation has
retained it. This is how [RFC 2277], [RFC 2278] bless C<charset>:

  This document uses the term "charset" to mean a set of rules for
  mapping from a sequence of octets to a sequence of characters, such
  as the combination of a coded character set and a character encoding
  scheme; this is also what is used as an identifier in MIME "charset="
  parameters, and registered in the IANA charset registry ...  (Note
  that this is NOT a term used by other standards bodies, such as ISO).
  [RFC 2277]

=item EUC

Extended Unix Code. See ISO-2022.

=item ISO-2022

A CES that was carefully designed to coexist with ASCII. There are a
7-bit version and an 8-bit version.

The 7-bit version switches character sets via escape sequences, so it
cannot form a CCS. Since this is more difficult to handle in programs
than the 8-bit version, the 7-bit version is not very popular except
for iso-2022-jp, the de facto standard CES for e-mails.

The 8-bit version can form a CCS. EUC and ISO-8859 are two examples
thereof. Pre-5.6 perl could use them as string literals.

=item UCS

Short for I<Universal Character Set>. When you say just UCS, it means
I<Unicode>.

=item UCS-2

ISO/IEC 10646 encoding form: Universal Character Set coded in two
octets.

=item Unicode

A character set that aims to include all character repertoires of the
world. Many character sets in various national as well as industrial
standards have become, in a way, just subsets of Unicode.

=item UTF

Short for I<Unicode Transformation Format>. Determines how to map a
Unicode character into a byte sequence.

=item UTF-16

A UTF in 16-bit encoding. Can be either big endian or little
endian. The big endian version is called UTF-16BE (equal to UCS-2 +
surrogate support) and the little endian version is UTF-16LE.

=back

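The "UCS-2 + surrogate support" remark can be illustrated with a short
sketch (the two code points are arbitrary examples): a character inside
the 16-bit range takes one 16-bit unit, while a character beyond it is
written as a surrogate pair.

```perl
use strict;
use warnings;
use Encode;

my $bmp    = "\x{3042}";     # fits in 16 bits
my $beyond = "\x{10000}";    # needs a surrogate pair in UTF-16

# One 16-bit unit versus two (high surrogate, low surrogate).
my $bmp_be    = unpack("H*", encode("UTF-16BE", $bmp));
my $beyond_be = unpack("H*", encode("UTF-16BE", $beyond));

# The little endian form swaps the bytes within each 16-bit unit.
my $beyond_le = unpack("H*", encode("UTF-16LE", $beyond));

# UTF-8 maps the same code point to a four-byte sequence.
my $beyond_utf8 = unpack("H*", encode("UTF-8", $beyond));

print "$bmp_be $beyond_be $beyond_le $beyond_utf8\n";
# 3042 d800dc00 00d800dc f0908080
```
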
=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>

=head1 References

=over 2

=item ECMA

European Computer Manufacturers Association
L<http://www.ecma.ch>

=over 2

=item ECMA-035 (eq C<ISO-2022>)

L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

The very specification of ISO-2022 is available from the link above.

=back

=item IANA

Internet Assigned Numbers Authority
L<http://www.iana.org/>

=over 2

=item Assigned Charset Names by IANA

L<http://www.iana.org/assignments/character-sets>

Most of the C<canonical names> in Encode derive from this list,
so you can directly apply the string you have extracted from the MIME
headers of mails and web pages.

=back

=item ISO

International Organization for Standardization
L<http://www.iso.ch/>

=item RFC

Request For Comments -- need I say more?
L<http://www.rfc.net/>, L<http://www.faqs.org/rfcs/>

=item UC

Unicode Consortium
L<http://www.unicode.org/>

=over 2

=item Unicode Glossary

L<http://www.unicode.org/glossary/>

The glossary of this document is based upon this site.

=back

=back

=head2 Other Notable Sites

=over 2

=item czyborra.com

L<http://czyborra.com/>

Contains a lot of useful information, especially gory details of ISO
vs. vendor mappings.

=item CJK.inf

L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Somewhat obsolete (last update in 1996), but still useful. Also try

L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.

=item Jungshik Shin's Hangul FAQ

L<http://jshin.net/faq>

And especially its subject 8.

L<http://jshin.net/faq/qa8.html>

A comprehensive overview of the Korean (C<KS *>) standards.

=back

=head2 Offline sources

=over 2

=item C<CJKV Information Processing> by Ken Lunde

CJKV Information Processing
1999 O'Reilly & Associates, ISBN : 1-56592-224-7

The modern successor of C<CJK.inf>.

Features comprehensive coverage of CJKV character sets and
encodings along with many other issues faced by anyone trying
to better support CJKV languages/scripts in all the areas of
information processing.

To purchase this book, visit
L<http://www.oreilly.com/catalog/cjkvinfo/>

=back

=cut

I could not find this page because the hostname doesn't resolve!

Brief description for most of the mentioned CJK encodings
L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>