Upgrade to Encode 1.26, from Dan Kogai.
=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored. In addition, an encoding may have aliases.
Each encoding has one "canonical" name. The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence (with a few exceptions).

=over

=item *

The name used by the Perl community. That includes 'utf8' and 'ascii'.
Unlike aliases, canonical names reach the method directly, so
frequently used names like 'utf8' resolve without alias lookups.

=item *

The MIME name as defined in IETF RFCs. This includes all "iso-"s.

=item *

The name in the IANA registry.

=item *

The name used by the organization that defined it.

=back

In case the I<de jure> canonical name differs from that of the Encode
module, it is always aliased if the encoding is ever implemented. So you
can safely tell whether a given encoding is implemented just by passing
the canonical name.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses the encoding object internally
once an operation is in progress.

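The naming rules above can be tried out with a minimal sketch like the following. It assumes a reasonably recent Encode, where C<find_encoding> returns an encoding object (or undef) and the object's C<name> method yields the canonical name. The name C<no-such-encoding-x> is made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Resolve a few spellings to their canonical names.
# find_encoding() returns an encoding object, or undef if nothing matches.
my %canonical;
for my $name ('latin1', 'Latin1', 'ISO-8859-1', 'US-ASCII', 'ascii') {
    my $enc = Encode::find_encoding($name);
    $canonical{$name} = defined $enc ? $enc->name : undef;
    printf "%-12s => %s\n", $name, $canonical{$name} // '(not implemented)';
}

# A made-up name (hypothetical, for illustration only) is simply not found.
my $missing = Encode::find_encoding('no-such-encoding-x');
print "no-such-encoding-x is ", (defined $missing ? "" : "not "), "implemented\n";
```

Checking the return value of C<find_encoding> for definedness is one way to test whether an encoding is implemented before using it.
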
=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'. In
other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available. Encode.pm automatically loads those modules on demand.

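As a small sketch of that on-demand loading, the script below encodes one character to C<euc-jp> without ever saying C<use Encode::JP>. Checking C<%INC> afterwards is only an illustration of what happened behind the scenes, not something a program normally needs to do.

```perl
use strict;
use warnings;
use Encode qw(encode);

# No "use Encode::JP" here: Encode loads it when euc-jp is first requested.
my $bytes = encode('euc-jp', "\x{3042}");    # U+3042 HIRAGANA LETTER A
printf "euc-jp bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $bytes;

# After the call, the implementing module shows up in %INC.
my $loaded = exists $INC{'Encode/JP.pm'} ? 1 : 0;
print "Encode::JP loaded: $loaded\n";
```
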
=head2 Built-in Encodings

The following encodings are always available.

  Canonical     Aliases                      Comments & References
  ----------------------------------------------------------------
  ascii         US-ascii                                    [ECMA]
  iso-8859-1    latin1                                       [ISO]
  utf8          UTF-8                                    [RFC2279]
  UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
  UCS-2LE                                                     [UC]
  UTF-16                                                      [UC]
  UTF-16BE                                                    [UC]
  UTF-16LE                                                    [UC]
  UTF-32                                                      [UC]
  UTF-32BE                                                    [UC]
  UTF-32LE                                                    [UC]
  ----------------------------------------------------------------

To find out how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
see L<Encode::Unicode>.

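A quick way to see the difference between the byte-order variants is to encode a single ASCII letter with each of them. This is only a sketch; the helper C<hex_bytes> is a name introduced here for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex pairs.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# Encode the letter "A" (U+0041) in several Unicode encodings.
my %out = map { $_ => hex_bytes(encode($_, 'A')) }
    qw(UTF-16BE UTF-16LE UTF-16 UTF-32BE);

printf "%-8s %s\n", $_, $out{$_} for sort keys %out;
```

The BE and LE variants write no byte order mark, while plain UTF-16 prepends one when encoding.
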
=head2 Encode::Byte -- Extended ASCII

Encode::Byte implements most single-byte encodings except for
Symbols and EBCDIC. The following encodings are single-byte
encodings implemented as extended ASCII. In most cases they use
\x80-\xff (the upper half) to map non-ASCII characters.

=over 2

=item ISO-8859 and corresponding vendor mappings

Since there are so many, they are presented in table format with
languages and the corresponding encoding names by vendor. Note that the
table is sorted in order of ISO-8859 and that the corresponding vendor
mappings are slightly different from those of ISO. See
L<http://czyborra.com/charsets/iso8859.html> for details.

  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  ----------------------------------------------------------------
  N. America    (ASCII)         cp437        AdobeStandardEncoding
                                cp863 (DOSCanadaF)
  W. Europe     (iso-8859-1)    cp850   cp1252  MacRoman  nextstep
                                                         hp-roman8
                                cp860 (DOSPortuguese)
  CE. Europe    iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                MacCroatian
                                                MacRomanian
                                                MacRumanian
  Latin3(*3)    iso-8859-3
  Latin4(*4)    iso-8859-4
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
    (Also see next section)     cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11     cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9(*15)   iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------

  (*3)  Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
  (*4)  Baltics. Now on 8859-10.
  (*15) Nicknamed Latin0; the Euro sign as well as French and Finnish
        letters that are missing from 8859-1 were added.

All cp* are also available as ibm-*, ms-*, and windows-*. See also
L<http://czyborra.com/charsets/codepages.html>.

Macintosh encodings don't seem to be registered with such entities as
IANA. "Canonical" names in Encode are based upon Apple's Tech Note
1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.

=item KOI8 - De Facto Standard for Cyrillic world

Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net. L<Encode> comes with the following KOI charsets.
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.

  ----------------------------------------------------------------
  koi8-f
  koi8-r cp878                                           [RFC1489]
  koi8-u                                                 [RFC2319]

=item gsm0338 - Hentai Latin 1

GSM0338 is for GSM handsets. Though it shares alphanumerals with ASCII,
control character ranges and other parts are mapped very differently,
presumably to store Greek and Cyrillic alphabets. This one is also
covered in Encode::Byte even though it does not comply with extended
ASCII.

=back

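The "upper half" idea can be illustrated with a short sketch: the same character may land on different bytes in different 8-bit encodings, and the same byte may mean different characters, while the ASCII half stays put. The encoding names are taken from the tables above; the variable names are chosen here for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The Euro sign (U+20AC) sits on different upper-half bytes per encoding.
my $euro_latin9 = encode('iso-8859-15', "\x{20AC}");
my $euro_cp1252 = encode('cp1252',      "\x{20AC}");
printf "iso-8859-15: %02X, cp1252: %02X\n",
    ord($euro_latin9), ord($euro_cp1252);

# The same byte decodes to different Cyrillic letters.
my $from_koi8 = decode('koi8-r',     "\xC1");
my $from_iso5 = decode('iso-8859-5', "\xC1");
printf "0xC1 is U+%04X in koi8-r and U+%04X in iso-8859-5\n",
    ord($from_koi8), ord($from_iso5);

# The ASCII half is left alone.
my $ascii_ok = encode('koi8-r', 'Perl') eq 'Perl' ? 1 : 0;
print "ASCII unchanged: $ascii_ok\n";
```
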
=head2 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs. Charset"
below. Also note that these are implemented in distinct modules by
language, due to size concerns. Please also refer to their
respective documentation pages.

=over 4

=item Encode::CN -- Continental China

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-cn(*1)            MacChineseSimp
  (gbk)         cp936(*2)
  gb12345-raw                            { GB12345 without CES }
  gb2312-raw                             { GB2312  without CES }
  hz
  iso-ir-165
  ----------------------------------------------------------------

  (*1) GB2312 is aliased to this. See L<Microsoft-related naming mess>
  (*2) gbk is aliased to this. See L<Microsoft-related naming mess>

=item Encode::JP -- Japan

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-jp
  shiftjis      cp932   macJapanese
  7bit-jis
  iso-2022-jp                                            [RFC1468]
  iso-2022-jp-1                                          [RFC2237]
  jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
  jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
  jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
  ----------------------------------------------------------------

=item Encode::KR -- Korea

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-kr                MacKorean                        [RFC1557]
                cp949(*)
  iso-2022-kr                                            [RFC1557]
  johab                                  [KS X 1001:1998, Annex 3]
  ksc5601-raw                            { KSC5601 without CES }
  ----------------------------------------------------------------

  (*) ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to
      this. See below.

=item Encode::TW -- Taiwan

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  big5          cp950   MacChineseTrad
  big5-hkscs
  ----------------------------------------------------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  gb18030
  euc-tw
  big5plus
  ----------------------------------------------------------------

=back

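To get a feel for how the Japanese encodings listed above differ on the wire, the sketch below encodes one hiragana letter three ways. It assumes the bundled Encode::JP is available, and the helper C<hex_bytes> is a name introduced here for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex pairs.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my $hiragana_a = "\x{3042}";    # U+3042 HIRAGANA LETTER A

# One character, three multibyte encodings from the Encode::JP table.
my %bytes = map { $_ => hex_bytes(encode($_, $hiragana_a)) }
    qw(euc-jp shiftjis iso-2022-jp);

printf "%-12s %s\n", $_, $bytes{$_} for sort keys %bytes;
```

The C<iso-2022-jp> output is longer because it wraps the two character bytes in escape sequences that switch into and back out of JIS X 0208.
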
=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See L<perlebcdic> for details.

  ----------------------------------------------------------------
  cp37
  cp500
  cp875
  cp1026
  cp1047
  posix-bc
  ----------------------------------------------------------------

=item Encode::Symbol

For symbols and dingbats.

  ----------------------------------------------------------------
  symbol
  dingbats
  MacDingbats
  AdobeZdingbat
  AdobeSymbol
  ----------------------------------------------------------------

=back

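The EBCDIC code pages map even the basic Latin letters to bytes that look nothing like ASCII, as this small sketch shows. It assumes the bundled Encode::EBCDIC is available; the sample byte string is written out by hand for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# In EBCDIC code pages, "A" is 0xC1 rather than ASCII's 0x41.
my $ebcdic_a = encode('cp37', 'A');
printf "cp37 'A' = %02X\n", ord($ebcdic_a);

# Decode a hand-written cp1047 byte string back to text.
my $text = decode('cp1047', "\xC8\x85\x93\x93\x96");
print "decoded: $text\n";
```
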
=head1 Unsupported encodings

The following are not supported as yet. Some because they are rarely
used, some because of technical difficulty. They may be supported by
external modules via CPAN in the future, however.

=over 4

=item ISO-2022-JP-2 [RFC1554]

Not very popular yet. Needs the Unicode Database or an equivalent to
implement encode() (because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap. So you
need to look up the database to determine to which character set a
given Unicode character should belong).

=item ISO-2022-CN [RFC1922]

Not very popular. Needs CNS 11643-1 and 2, which are not available in
this module. CNS 11643 is supported (via euc-tw) in
Encode::HanExtra. Autrijus may add support for this encoding in his
module in the future.

=item various HP-UX encodings

The following are unsupported due to the lack of mapping data.

  '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
  '15' - japanese15, korean15, and roi15

=item Cyrillic encoding ISO-IR-111

Anton doubts its usefulness.

=item ISO-8859-8-1 [Hebrew]

None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
MacHebrew are supported only because mappings were
available at L<http://www.unicode.org/>). Contributions are welcome.

=item Thai encoding TCVN

Ditto.

=item Vietnamese encodings VPS

Though Jungshik has reported that Mozilla supports this encoding, it
was too late for us to add it. It may be supported in the future via a
separate module. See
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf> and
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
if you are interested in helping us.

=item various Mac encodings

The following are unsupported due to the lack of mapping data.

  MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
  MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
  MacLaotian,   MacMalayalam, MacMongolian, MacOriya
  MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
  MacVietnamese

The rest, which are already available, are based upon the vendor mappings at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .

=item (Mac) Indic encodings

The maps for the following are available at L<http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, which is not supported by F<enc2xs>.

  MacDevanagari
  MacGurmukhi
  MacGujarati

For details, please see C<Unicode mapping issues and notes:> at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .

I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but the ones mentioned above were the only Indic
encoding maps that I could find at L<http://www.unicode.org/> .

=back

=head1 Encoding vs. Charset -- terminology

We are used to using the terms (character) I<encoding> and I<character set>
interchangeably. But just as using the terms byte and character
interchangeably is dangerous and the two should be differentiated when
needed, we need to differentiate I<encoding> and I<character set>.

To understand that, let's follow how we make computers grok our characters.

=over 4

=item *

First we start with which characters to include. We call this
collection of characters a I<character repertoire>.

=item *

Then we have to give each character a unique ID so your computer can
tell the difference between 'a' and 'A'. This itemized character
repertoire is now a I<character set>.

=item *

If your computer can grok the character set without further
processing, you can go ahead and use it. This is called a I<coded
character set> (CCS) or I<raw character encoding>. ASCII is used this
way in most cases.

=item *

But in many cases, especially with multi-byte CJK encodings, you have to
tweak a little more. Your network connection may not accept any data
with the Most Significant Bit set, and your computer may not be able to
tell if a given byte is a whole character or just half of it. So you
have to I<encode> the character set to use it.

A I<character encoding scheme> (CES) determines how to encode a given
character set, or a set of multiple character sets. 7bit ISO-2022 is
an example of a CES. You switch between character sets via I<escape
sequences>.

=back

Technically, or mathematically speaking, a character set encoded in
such a CES that maps character by character may form a CCS. EUC is such
an example. The CES of EUC is as follows:

=over 4

=item *

Map ASCII unchanged.

=item *

Map a character set that consists of 94 or 96 to the power of N
members by adding 0x80 to each byte.

=item *

You can also use 0x8e and 0x8f to indicate that the following sequence of
characters belongs to yet another character set. 0x80 is added to each
following byte.

=back

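The "add 0x80 to each byte" rule can be checked with a rough sketch that takes the raw JIS X 0208 code of a character (the C<jis0208-raw> encoding listed under Encode::JP) and applies the rule by hand. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{3042}";    # U+3042 HIRAGANA LETTER A

# The coded character set alone: two 7-bit bytes, no CES applied.
my $raw = encode('jis0208-raw', $char);

# Apply the EUC rule by hand: add 0x80 to each byte.
my $by_hand = join '', map { chr(ord($_) + 0x80) } split //, $raw;

# Compare with what the euc-jp encoding produces.
my $euc = encode('euc-jp', $char);
printf "raw=%s by_hand=%s euc-jp=%s\n",
    map { unpack 'H*', $_ } $raw, $by_hand, $euc;
```
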
By carefully looking at the encoded byte sequence, you may find that each
byte sequence forms a unique number. In that sense EUC is a CCS
generated by the CES above from up to four CCSs (complicated?). UTF-8
falls into this category. See L<perlunicode/"UTF-8"> to find out how
UTF-8 maps Unicode to a byte sequence.

You may also see by now why 7bit ISO-2022 cannot form a CCS. If
you look at the byte sequence \x21\x21, you can't tell whether it is two
!'s or one IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1, so you have
no trouble telling "!!" and an ideographic space apart.

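The ambiguity of \x21\x21 can be demonstrated with a short sketch: under C<iso-2022-jp> the same two bytes decode differently depending on the escape sequence that precedes them, while C<euc-jp> uses distinct bytes for the ideographic space. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Same payload bytes, different shift state.
my $in_ascii = decode('iso-2022-jp', "\x21\x21");            # ASCII state
my $in_jis   = decode('iso-2022-jp', "\e\$B\x21\x21\e(B");   # JIS X 0208 state

# EUC needs no state: the ideographic space has its own byte pair.
my $in_euc = decode('euc-jp', "\xA1\xA1");

printf "ascii state: %d char(s), jis state: U+%04X, euc-jp: U+%04X\n",
    length($in_ascii), ord($in_jis), ord($in_euc);
```
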
=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

=over 2

=item *

To (en|de)code encodings marked with C<(**)>, you need
C<Encode::HanExtra>, available from CPAN.

=back

Encoding names

  US-ASCII    UTF-8    ISO-8859-*    KOI8-R
  Shift_JIS   EUC-JP   ISO-2022-JP   ISO-2022-JP-1
  EUC-KR      Big5     GB2312

are registered with IANA as preferred MIME names and may
be used over the Internet.

C<Shift_JIS> has been made official by JIS X 0208-1997.
L<Microsoft-related naming mess> gives details.

C<GB2312> is the IANA name for C<EUC-CN>.
See L<Microsoft-related naming mess> for details.

The C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
with Encode. See L<Encode::CN> for details.

  EUC-CN
  KOI8-U        [RFC2319]

have not been registered with IANA (as of March 2002) but
seem to be supported by major web browsers.
The IANA name for C<EUC-CN> is C<GB2312>.

  KS_C_5601-1987

is heavily misused.
See L<Microsoft-related naming mess> for details.

The C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
with Encode. See L<Encode::KR> for details.

  UTF-16 UTF-16BE UTF-16LE

are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware however that

=over 2

=item *

C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support

=item *

data coded with C<UTF-8> seamlessly passes traditional
command piping (C<cat>, C<more>, etc.) while UTF-16 coded
data is likely to cause confusion (with its zero bytes,
for example)

=item *

it is beyond the power of words to describe the way HTML browsers
encode non-C<ASCII> form data. To get a general impression refer to
L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
While encoding of form data has stabilized for C<UTF-8> coded pages
(at least IE 5/6, NS 6, Opera 6 behave consistently), be sure to
expect fun (and cross-browser discrepancies) with C<UTF-16> coded
pages!

=back

The rule of thumb is to use C<UTF-8> unless you know what
you're doing and unless you really need C<UTF-16>.

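The point about zero bytes is easy to verify with a minimal sketch. The sample string is arbitrary; C<tr> is used here simply to count NUL bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "plain ASCII";

# UTF-8 leaves pure ASCII text byte-for-byte unchanged.
my $utf8 = encode('UTF-8', $text);

# UTF-16 pads every ASCII character with a zero byte.
my $utf16 = encode('UTF-16LE', $text);

# Count the NUL bytes in each encoded form.
my $nul_in_utf8  = ($utf8  =~ tr/\0//);
my $nul_in_utf16 = ($utf16 =~ tr/\0//);

printf "UTF-8: %d bytes, %d NULs; UTF-16LE: %d bytes, %d NULs\n",
    length($utf8), $nul_in_utf8, length($utf16), $nul_in_utf16;
```

Tools that treat NUL as a string terminator are the ones most likely to trip over UTF-16 data in a pipeline.
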
  ISO-IR-165    [RFC1345]
  GBK
  VISCII
  GB 12345
  GB 18030 (**)  (see links below)
  EUC-TW   (**)

are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are the recommended
names.

  BIG5PLUS (**)

is a somewhat proprietary name.

=head2 Microsoft-related naming mess

Microsoft products misuse the following names:

=over 2

=item KS_C_5601-1987

Microsoft extension to C<EUC-KR>.

Proper name: C<CP949>.

See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
for details.

Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
misuse. The I<raw> C<KS_C_5601-1987> encoding is available as
C<ksc5601-raw>.

See L<Encode::KR> for details.

=item GB2312

Microsoft extension to C<EUC-CN>.

Proper names: C<CP936>, C<GBK>.

C<GB2312> has been registered in the C<EUC-CN> meaning at
IANA. This has partially repaired the situation: Microsoft's
C<GB2312> has become a superset of the official C<GB2312>.

Encode aliases C<GB2312> to C<euc-cn> in full agreement with
the IANA registration. C<cp936> is supported separately.
The I<raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.

See L<Encode::CN> for details.

=item Big5

Microsoft extension to C<Big5>.

Proper name: C<CP950>.

Encode separately supports C<Big5> and C<cp950>.

=item Shift_JIS

Microsoft's understanding of C<Shift_JIS>.

JIS has not endorsed the full Microsoft standard however.
The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
subsets, while Microsoft has always used C<Shift_JIS> to
encode a wider character repertoire.

As a historical predecessor, Microsoft's variant
probably has more rights to the name, although it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.

Unambiguous name: C<CP932>.

Encode separately supports C<Shift_JIS> and C<cp932>.

=back

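How Encode resolves these contested names can be inspected with a short sketch like the one below. It assumes a reasonably recent Encode with the bundled CJK modules installed; other versions may resolve some names differently.

```perl
use strict;
use warnings;
use Encode ();

# Ask Encode which canonical encoding each contested name resolves to.
my %resolved;
for my $name ('GB2312', 'GBK', 'KS_C_5601-1987', 'Shift_JIS', 'cp932') {
    my $enc = Encode::find_encoding($name);
    $resolved{$name} = defined $enc ? $enc->name : '(not found)';
    printf "%-16s => %s\n", $name, $resolved{$name};
}
```

Note that C<Shift_JIS> and C<cp932> resolve to two different encodings, in line with the separate support described above.
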
=head1 Glossary

=over 2

=item character repertoire

A collection of unique characters. A I<character> set in the
strictest sense. At this stage characters are not numbered.

=item coded character set (CCS)

A character set that is mapped in a way computers can use directly.
Many character encodings, including EUC, fall into this category.

=item character encoding scheme (CES)

An algorithm to map a character set to a byte sequence. You don't
have to be able to tell which character set a given byte sequence
belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
example of being both a CCS and a CES.

=item charset (in MIME context)

has long been used in the meaning of C<encoding>, CES.

While the word combination C<character set> has lost this meaning
in the MIME context since [RFC 2130], the abbreviation C<charset> has
retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:

  This document uses the term "charset" to mean a set of rules for
  mapping from a sequence of octets to a sequence of characters, such
  as the combination of a coded character set and a character encoding
  scheme; this is also what is used as an identifier in MIME "charset="
  parameters, and registered in the IANA charset registry ...  (Note
  that this is NOT a term used by other standards bodies, such as ISO).
  [RFC 2277]

=item EUC

Extended Unix Character. See ISO-2022.

=item ISO-2022

A CES that was carefully designed to coexist with ASCII. There are a
7 bit version and an 8 bit version.

The 7 bit version switches character sets via escape sequences, so it
cannot form a CCS. Since this is more difficult to handle in programs
than the 8 bit version, the 7 bit version is not very popular except for
iso-2022-jp, the de facto standard CES for e-mails.

The 8 bit version can form a CCS. EUC and ISO-8859 are two examples
thereof. Pre-5.6 perl could use them as string literals.

=item UCS

Short for I<Universal Character Set>. When you say just UCS, it means
I<Unicode>.

=item UCS-2

ISO/IEC 10646 encoding form: Universal Character Set coded in two
octets.

=item Unicode

A character set that aims to include all character repertoires of the
world. Many character sets in various national as well as industrial
standards have become, in a way, just subsets of Unicode.

=item UTF

Short for I<Unicode Transformation Format>. Determines how to map a
Unicode character into a byte sequence.

=item UTF-16

A UTF in 16-bit encoding. Can be either big endian or little
endian. The big endian version is called UTF-16BE (equal to UCS-2 +
surrogate support) and the little endian version is UTF-16LE.

=back

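The "surrogate support" mentioned under UTF-16 can be seen in a minimal sketch: a character inside the 16-bit range takes one 16-bit unit, while one beyond it takes a surrogate pair. The sample code points are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character that fits in 16 bits: one UTF-16 code unit (2 bytes).
my $bmp = encode('UTF-16BE', "\x{3042}");

# A character beyond U+FFFF: a surrogate pair (4 bytes).
my $beyond = encode('UTF-16BE', "\x{10000}");

printf "U+3042  => %s\n", unpack('H*', $bmp);
printf "U+10000 => %s\n", unpack('H*', $beyond);
```
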
=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>

=head1 References

=over 2

=item ECMA

European Computer Manufacturers Association
L<http://www.ecma.ch>

=over 2

=item ECMA-035 (eq C<ISO-2022>)

L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

The very specification of ISO-2022 is available from the link above.

=back

=item IANA

Internet Assigned Numbers Authority
L<http://www.iana.org/>

=over 2

=item Assigned Charset Names by IANA

L<http://www.iana.org/assignments/character-sets>

Most of the C<canonical names> in Encode derive from this list,
so you can directly apply the string you have extracted from the MIME
headers of mails and web pages.

=back

=item ISO

International Organization for Standardization
L<http://www.iso.ch/>

=item RFC

Request For Comment -- need I say more?
L<http://www.rfc.net/>, L<http://www.faqs.org/rfcs/>

=item UC

Unicode Consortium
L<http://www.unicode.org/>

=over 2

=item Unicode Glossary

L<http://www.unicode.org/glossary/>

The glossary of this document is based upon this site.

=back

=back

=head2 Other Notable Sites

=over 2

=item czyborra.com

L<http://czyborra.com/>

Contains a lot of useful information, especially gory details of ISO
vs. vendor mappings.

=item CJK.inf

L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Somewhat obsolete (last update in 1996), but still useful. Also try

L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

You will find brief info on C<EUC-CN> and C<GBK>, and mostly on C<GB 18030>.

=item Jungshik Shin's Hangul FAQ

L<http://jshin.net/faq>

And especially its subject 8

L<http://jshin.net/faq/qa8.html>

a comprehensive overview of the Korean (C<KS *>) standards.

=back

=head2 Offline sources

=over 2

=item C<CJKV Information Processing> by Ken Lunde

CJKV Information Processing
1999 O'Reilly & Associates, ISBN : 1-56592-224-7

The modern successor of C<CJK.inf>.

Features comprehensive coverage of CJKV character sets and
encodings along with many other issues faced by anyone trying
to better support CJKV languages/scripts in all the areas of
information processing.

To purchase this book visit
L<http://www.oreilly.com/catalog/cjkvinfo/>

=back

=cut

I could not find this page because the hostname doesn't resolve!

 Brief description for most of the mentioned CJK encodings
L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>