Upgrade to Encode 1.63.
5d030b67 1=head1 NAME
2
0ab8f81e 3Encode::Supported -- Encodings supported by Encode
5d030b67 4
5=head1 DESCRIPTION
6
5129552c 7=head2 Encoding Names
5d030b67 8
9Encoding names are case insensitive. White space in names
0ab8f81e 10is ignored. In addition, an encoding may have aliases.
5d030b67 11Each encoding has one "canonical" name. The "canonical"
12name is chosen from the names of the encoding by picking
a999c27c 13the first in the following sequence (with a few exceptions).
5d030b67 14
0ab8f81e 15=over 4
a999c27c 16
17=item *
18
The name used by the Perl community. That includes 'utf8' and 'ascii'.
Unlike aliases, canonical names are resolved directly, so frequently
used names such as 'utf8' need no alias lookup.
a999c27c 22
23=item *
24
0ab8f81e 25The MIME name as defined in IETF RFCs. This includes all "iso-"s.
a999c27c 26
27=item *
28
29The name in the IANA registry.
962111ca 30
a999c27c 31=item *
32
33The name used by the organization that defined it.
34
35=back
36
Where a I<de jure> canonical name differs from the name the Encode
module uses, the I<de jure> name is always provided as an alias whenever
the encoding is implemented. So you can safely tell whether a given
encoding is implemented just by passing its canonical name.
5d030b67 41
5129552c 42Because of all the alias issues, and because in the general case
962111ca 43encodings have state, "Encode" uses an encoding object internally
5129552c 44once an operation is in progress.
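
The lookup described above can be sketched with C<find_encoding>, which
returns an encoding object (or C<undef> for an unknown name); the
object's C<name> method reports the canonical name. The names tried
here, including C<no-such-encoding>, are chosen only for illustration,
and the canonical names reported may vary between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up each name; find_encoding() returns an encoding object,
# or undef when the name is neither canonical nor a known alias.
my %canonical;
for my $name ('latin1', 'Latin1', 'ISO-8859-1', 'US-ASCII', 'no-such-encoding') {
    my $enc = find_encoding($name);
    $canonical{$name} = defined $enc ? $enc->name : undef;
    printf "%-18s => %s\n", $name,
        defined $canonical{$name} ? $canonical{$name} : '(not supported)';
}
```

Because aliases and canonical names resolve to the same object, a
defined return value is enough to tell that an encoding is implemented.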
5d030b67 45
5129552c 46=head1 Supported Encodings
5d030b67 47
As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.
In other words, "ISO 8859 1" and "iso-8859-1" are identical.
5d030b67 52
Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available. Encode.pm will automatically load those modules on demand.
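
A minimal sketch of on-demand loading: only C<Encode> is loaded
explicitly, yet the Japanese encodings are usable. Checking C<%INC> for
C<Encode/JP.pm> is just one illustrative way to observe that the module
was pulled in; the sample character is an arbitrary choice.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Only Encode itself is loaded explicitly; Encode::JP is not.
my $hiragana_a = "\x{3042}";    # HIRAGANA LETTER A

# Asking for a Japanese encoding makes Encode load Encode::JP itself.
my $euc  = encode('euc-jp',   $hiragana_a);
my $sjis = encode('shiftjis', $hiragana_a);

my $jp_loaded = exists $INC{'Encode/JP.pm'} ? 1 : 0;

printf "euc-jp:   %s\n", join ' ', map { sprintf '%02X', ord } split //, $euc;
printf "shiftjis: %s\n", join ' ', map { sprintf '%02X', ord } split //, $sjis;
print  "Encode::JP loaded on demand: ", ($jp_loaded ? 'yes' : 'no'), "\n";
```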
5d030b67 56
5129552c 57=head2 Built-in Encodings
5d030b67 58
5129552c 59The following encodings are always available.
5d030b67 60
  Canonical     Aliases                      Comments & References
  ----------------------------------------------------------------
  ascii         US-ascii                                    [ECMA]
  iso-8859-1    latin1                                       [ISO]
  utf8          UTF-8                                    [RFC2279]
  ----------------------------------------------------------------
67
c731e18e 68=head2 Encode::Unicode -- other Unicode encodings
69
70Unicode coding schemes other than native utf8 are supported by
0ab8f81e 71Encode::Unicode, which will be autoloaded on demand.
c731e18e 72
  ----------------------------------------------------------------
  UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
  UCS-2LE                                                     [UC]
  UTF-16                                                      [UC]
  UTF-16BE                                                    [UC]
  UTF-16LE                                                    [UC]
  UTF-32                                                      [UC]
  UTF-32BE                                                    [UC]
  UTF-32LE                                                    [UC]
  ----------------------------------------------------------------
5d030b67 83
0ab8f81e 84To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
f2a2953c 85see L<Encode::Unicode>.
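
As a quick illustration of how these schemes differ, the sketch below
encodes the single character 'A' in several of them and dumps the bytes.
The helper C<hex_dump> is a name made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, shift }

my $text = 'A';    # U+0041
my %bytes;
for my $name (qw(UCS-2BE UTF-16BE UTF-16LE UTF-16 UTF-32BE UTF-32LE)) {
    $bytes{$name} = encode($name, $text);
    printf "%-9s %s\n", $name, hex_dump($bytes{$name});
}
```

The BE and LE variants differ only in byte order, while plain C<UTF-16>
on output prepends a byte order mark to big-endian data.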
86
a999c27c 87=head2 Encode::Byte -- Extended ASCII
5d030b67 88
0ab8f81e 89Encode::Byte implements most single-byte encodings except for
90Symbols and EBCDIC. The following encodings are based on single-byte
91encodings implemented as extended ASCII. Most of them map
92\x80-\xff (upper half) to non-ASCII characters.
a999c27c 93
0ab8f81e 94=over 4
a999c27c 95
96=item ISO-8859 and corresponding vendor mappings
97
Since there are so many, they are presented in table format with
languages and corresponding encoding names by vendors. Note that
the table is sorted in order of ISO-8859 and the corresponding vendor
mappings are slightly different from those of ISO. See
L<http://czyborra.com/charsets/iso8859.html> for details.
103
  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  ----------------------------------------------------------------
  N. America    (ASCII)         cp437        AdobeStandardEncoding
                                cp863 (DOSCanadaF)
  W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
                                                         hp-roman8
                                cp860 (DOSPortuguese)
  Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                MacCroatian
                                                MacRomanian
                                                MacRumanian
  Latin3 [1]    iso-8859-3
  Latin4 [2]    iso-8859-4
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
    (See also next section)     cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11 [3] cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9 [4]    iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------
136
  [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
  [2] Baltics. Now on 8859-10, except for Latvian.
  [3] Also known as TIS 620.
  [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
      letters that are missing from 8859-1 were added.
a999c27c 142
All cp* are also available as ibm-*, ms-*, and windows-*. See also
L<http://czyborra.com/charsets/codepages.html>.

Macintosh encodings don't seem to be registered with bodies such as
IANA. "Canonical" names in Encode are based upon Apple's Tech Note
1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.
a999c27c 150
0ab8f81e 151=item KOI8 - De Facto Standard for the Cyrillic world
a999c27c 152
Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net. L<Encode> comes with the following KOI charsets.
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.
5d030b67 156
  ----------------------------------------------------------------
  koi8-f
  koi8-r        cp878                                    [RFC1489]
  koi8-u                                                 [RFC2319]
  ----------------------------------------------------------------
962111ca 162
a999c27c 163=item gsm0338 - Hentai Latin 1
164
GSM0338 is for GSM handsets. Though it shares alphanumerics with
ASCII, control character ranges and other parts are mapped very
differently, presumably to store Greek and Cyrillic alphabets.
This is also covered in Encode::Byte even though it is not an
"extended ASCII" encoding.
a999c27c 170
171=back
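
To make the differences among the byte encodings above concrete, the
following sketch decodes one sample byte per encoding and reports the
resulting Unicode code point. The sample bytes are picked for
illustration only; it also checks two of the vendor-prefix aliases
(C<windows-*>, C<ibm-*>) mentioned above.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# One sample byte per encoding, with the character it is expected to be.
my @samples = (
    [ 'iso-8859-1',  "\xA4" ],    # CURRENCY SIGN
    [ 'iso-8859-15', "\xA4" ],    # EURO SIGN takes its place in Latin9
    [ 'cp1252',      "\x80" ],    # EURO SIGN in the Windows mapping
    [ 'koi8-r',      "\xC1" ],    # CYRILLIC SMALL LETTER A
);

my %code_point;
for my $sample (@samples) {
    my ($encoding, $byte) = @$sample;
    $code_point{$encoding} = ord(decode($encoding, $byte));
    printf "%-12s 0x%02X => U+%04X\n",
        $encoding, ord($byte), $code_point{$encoding};
}

# Vendor prefixes are aliases for the same code pages.
my $windows = find_encoding('windows-1252')->name;
my $ibm     = find_encoding('ibm-850')->name;
print "windows-1252 => $windows, ibm-850 => $ibm\n";
```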
5d030b67 172
0ab8f81e 173=head2 CJK: Chinese, Japanese, Korean (Multibyte)
5d030b67 174
Note that Vietnamese is listed above. Also read "Encoding vs. Charset"
below. Also note that these are implemented in distinct modules by
country, due to size concerns (simplified Chinese is mapped
to 'CN', continental China, while traditional Chinese is mapped to
'TW', Taiwan). Please refer to their respective documentation pages.
5d030b67 180
5129552c 181=over 4
182
183=item Encode::CN -- Continental China
184
  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-cn [1]            MacChineseSimp
  (gbk)         cp936 [2]
  gb12345-raw                            { GB12345 without CES }
  gb2312-raw                             { GB2312  without CES }
  hz
  iso-ir-165
  ----------------------------------------------------------------
5129552c 194
0ab8f81e 195 [1] GB2312 is aliased to this. See L<Microsoft-related naming mess>
196 [2] gbk is aliased to this. See L<Microsoft-related naming mess>
f2a2953c 197
5129552c 198=item Encode::JP -- Japan
199
  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-jp
  shiftjis      cp932   macJapanese
  7bit-jis
  iso-2022-jp                                            [RFC1468]
  iso-2022-jp-1                                          [RFC2237]
  jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
  jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
  jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
  ----------------------------------------------------------------
5129552c 212
213=item Encode::KR -- Korea
214
  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-kr                MacKorean                        [RFC1557]
                cp949 [1]
  iso-2022-kr                                            [RFC1557]
  johab                                [KS X 1001:1998, Annex 3]
  ksc5601-raw                            { KSC5601 without CES }
  ----------------------------------------------------------------
5129552c 223
962111ca 224 [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
225 See below.
226
5129552c 227=item Encode::TW -- Taiwan
228
  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
  big5-hkscs
  ----------------------------------------------------------------
5129552c 234
235=item Encode::HanExtra -- More Chinese via CPAN
236
237Due to size concerns, additional Chinese encodings below are
238distributed separately on CPAN, under the name Encode::HanExtra.
239
962111ca 240 Standard DOS/Win Macintosh Comment/Reference
67d7b5ef 241 ----------------------------------------------------------------
5129552c 242 gb18030
243 euc-tw
244 big5plus
67d7b5ef 245 ----------------------------------------------------------------
5129552c 246
247=back
248
249=head2 Miscellaneous encodings
250
251=over 4
252
253=item Encode::EBCDIC
5d030b67 254
a999c27c 255See L<perlebcdic> for details.
5d030b67 256
67d7b5ef 257 ----------------------------------------------------------------
5d030b67 258 cp37
a999c27c 259 cp500
260 cp875
261 cp1026
262 cp1047
5d030b67 263 posix-bc
67d7b5ef 264 ----------------------------------------------------------------
5129552c 265
=item Encode::Symbol
5d030b67 267
5129552c 268For symbols and dingbats.
5d030b67 269
67d7b5ef 270 ----------------------------------------------------------------
5d030b67 271 symbol
272 dingbats
a999c27c 273 MacDingbats
274 AdobeZdingbat
275 AdobeSymbol
67d7b5ef 276 ----------------------------------------------------------------
277
278=back
279
280=head1 Unsupported encodings
281
0ab8f81e 282The following encodings are not supported as yet; some because they
283are rarely used, some because of technical difficulties. They may
284be supported by external modules via CPAN in the future, however.
67d7b5ef 285
286=over 4
287
288=item ISO-2022-JP-2 [RFC1554]
289
Not very popular yet. Needs the Unicode Database or equivalent to
implement encode(), because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap. You
therefore need to look up the database to determine to which character
set a given Unicode character should belong.
295
962111ca 296=item ISO-2022-CN [RFC1922]
67d7b5ef 297
0ab8f81e 298Not very popular. Needs CNS 11643-1 and -2 which are not available in
962111ca 299this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
Autrijus Tang may add support for this encoding in his module in the future.
67d7b5ef 301
0ab8f81e 302=item Various HP-UX encodings
67d7b5ef 303
962111ca 304The following are unsupported due to the lack of mapping data.
305
67d7b5ef 306 '8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
962111ca 307 '15' - japanese15, korean15, and roi15
67d7b5ef 308
309=item Cyrillic encoding ISO-IR-111
310
0ab8f81e 311Anton Tagunov doubts its usefulness.
67d7b5ef 312
313=item ISO-8859-8-1 [Hebrew]
314
None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
MacHebrew are supported only because mappings were
available at L<http://www.unicode.org/>). Contributions welcome.
318
319=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
320
321Ditto.
67d7b5ef 322
=item Vietnamese encoding TCVN
324
325Ditto.
326
327=item Vietnamese encodings VPS
328
0ab8f81e 329Though Jungshik Shin has reported that Mozilla supports this encoding,
330it was too late before 5.8.0 for us to add it. In the future, it
331may be available via a separate module. See
962111ca 332L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
333and
a999c27c 334L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
335if you are interested in helping us.
67d7b5ef 336
962111ca 337=item Various Mac encodings
67d7b5ef 338
962111ca 339The following are unsupported due to the lack of mapping data.
a999c27c 340
341 MacArmenian, MacBengali, MacBurmese, MacEthiopic
342 MacExtArabic, MacGeorgian, MacKannada, MacKhmer
343 MacLaotian, MacMalayalam, MacMongolian, MacOriya
344 MacSinhalese, MacTamil, MacTelugu, MacTibetan
345 MacVietnamese
346
0ab8f81e 347The rest which are already available are based upon the vendor mappings
962111ca 348at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
a999c27c 349
350=item (Mac) Indic encodings
351
The maps for the following are available at L<http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, currently unsupported by F<enc2xs>:
67d7b5ef 355
a999c27c 356 MacDevanagari
357 MacGurmukhi
358 MacGujarati
67d7b5ef 359
a999c27c 360For details, please see C<Unicode mapping issues and notes:> at
361L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
362
I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but the above were the only Indic encoding
maps that I could find at L<http://www.unicode.org/>.
5129552c 366
367=back
5d030b67 368
a999c27c 369=head1 Encoding vs. Charset -- terminology
5d030b67 370
0ab8f81e 371We are used to using the term (character) I<encoding> and I<character
372set> interchangeably. But just as confusing the terms byte and
373character is dangerous and the terms should be differentiated when
374needed, we need to differentiate I<encoding> and I<character set>.
5d030b67 375
0ab8f81e 376To understand that, here is a description of how we make computers
377grok our characters.
a999c27c 378
379=over 4
380
381=item *
67d7b5ef 382
a999c27c 383First we start with which characters to include. We call this
384collection of characters I<character repertoire>.
5d030b67 385
a999c27c 386=item *
5d030b67 387
a999c27c 388Then we have to give each character a unique ID so your computer can
0ab8f81e 389tell the difference between 'a' and 'A'. This itemized character
962111ca 390repertoire is now a I<character set>.
a63c962f 391
a999c27c 392=item *
393
If your computer can grok the character set without further
processing, you can go ahead and use it. This is called a I<coded
character set> (CCS) or I<raw character encoding>. ASCII is used this
way in most cases.
398
399=item *
400
0ab8f81e 401But in many cases, especially multi-byte CJK encodings, you have to
a999c27c 402tweak a little more. Your network connection may not accept any data
0ab8f81e 403with the Most Significant Bit set, and your computer may not be able to
a999c27c 404tell if a given byte is a whole character or just half of it. So you
405have to I<encode> the character set to use it.
406
407A I<character encoding scheme> (CES) determines how to encode a given
408character set, or a set of multiple character sets. 7bit ISO-2022 is
0ab8f81e 409an example of a CES. You switch between character sets via I<escape
410sequences>.
67d7b5ef 411
412=back
413
0ab8f81e 414Technically, or mathematically, speaking, a character set encoded in
a999c27c 415such a CES that maps character by character may form a CCS. EUC is such
0ab8f81e 416an example. The CES of EUC is as follows:
67d7b5ef 417
a999c27c 418=over 4
5d030b67 419
a999c27c 420=item *
5d030b67 421
a999c27c 422Map ASCII unchanged.
423
424=item *
425
Map a character set whose size is a power of 94 or 96 (that is, 94**N
or 96**N members, N bytes per character) by adding 0x80 to each byte.
428
429=item *
430
You can also use 0x8e and 0x8f to indicate that the following sequence of
characters belongs to yet another character set. The value 0x80 is added
to each following byte.
a999c27c 434
435=back
436
By carefully looking at the encoded byte sequence, you can find that the
byte sequence forms a unique number. In that sense, EUC is a CCS
generated by the CES above from up to four CCSs (complicated?). UTF-8
falls into this category. See L<perlunicode/"UTF-8"> to find out how
UTF-8 maps Unicode to a byte sequence.

You may also have found out by now why 7bit ISO-2022 cannot comprise
a CCS. If you look at a byte sequence \x21\x21, you can't tell if
it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1
so you have no trouble differentiating between "!!" and S<" ">.
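
The relationship among the raw character set, EUC, and 7bit ISO-2022
can be sketched as follows, using IDEOGRAPHIC SPACE as in the text
above. This assumes the C<jis0208-raw>, C<euc-jp> and C<iso-2022-jp>
encodings from Encode::JP are available; the variable names are
illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $ideographic_space = "\x{3000}";

# Raw JIS X 0208 code: the coded character set without any CES.
my $raw = encode('jis0208-raw', $ideographic_space);

# EUC by hand: add 0x80 to each byte of the raw code.
my $euc_by_hand = join '', map { chr(ord($_) + 0x80) } split //, $raw;

# EUC as produced by Encode, for comparison.
my $euc = encode('euc-jp', $ideographic_space);

# 7bit ISO-2022: the raw bytes again, announced by an escape sequence.
my $iso2022 = encode('iso-2022-jp', $ideographic_space);

printf "raw:         %s\n", join ' ', map { sprintf '%02X', ord } split //, $raw;
printf "euc-jp:      %s\n", join ' ', map { sprintf '%02X', ord } split //, $euc;
printf "iso-2022-jp: %s\n", join ' ', map { sprintf '%02X', ord } split //, $iso2022;
```

Without its escape sequence, the ISO-2022 payload is indistinguishable
from two ASCII exclamation marks, whereas the EUC bytes are unambiguous.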
67d7b5ef 447
a63c962f 448=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
449
450This section tries to classify the supported encodings by their
451applicability for information exchange over the Internet and to
452choose the most suitable aliases to name them in the context of
453such communication.
454
0ab8f81e 455=over 4
67d7b5ef 456
457=item *
458
0ab8f81e 459To (en|de)code encodings marked by C<(**)>, you need
a999c27c 460C<Encode::HanExtra>, available from CPAN.
67d7b5ef 461
462=back
463
a63c962f 464Encoding names
5d030b67 465
f2a2953c 466 US-ASCII UTF-8 ISO-8859-* KOI8-R
467 Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
468 EUC-KR Big5 GB2312
a999c27c 469
0ab8f81e 470are registered with IANA as preferred MIME names and may
a999c27c 471be used over the Internet.
5d030b67 472
C<Shift_JIS> has been made official by JIS X 0208:1997.
a999c27c 474L<Microsoft-related naming mess> gives details.
5d030b67 475
a999c27c 476C<GB2312> is the IANA name for C<EUC-CN>.
477See L<Microsoft-related naming mess> for details.
478
479C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
f2a2953c 480with Encode. See L<Encode::CN> for details.
5d030b67 481
a63c962f 482 EUC-CN
f2a2953c 483 KOI8-U [RFC2319]
5d030b67 484
a999c27c 485have not been registered with IANA (as of March 2002) but
486seem to be supported by major web browsers.
0ab8f81e 487The IANA name for C<EUC-CN> is C<GB2312>.
67d7b5ef 488
489 KS_C_5601-1987
490
a999c27c 491is heavily misused.
492See L<Microsoft-related naming mess> for details.
493
C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
with Encode. See L<Encode::KR> for details.
496
497 UTF-16 UTF-16BE UTF-16LE
498
448e90bb 499are IANA-registered C<charset>s. See [RFC 2781] for details.
f2a2953c 500Jungshik Shin reports that UTF-16 with a BOM is well accepted
501by MS IE 5/6 and NS 4/6. Beware however that
502
0ab8f81e 503=over 4
f2a2953c 504
505=item *
5d030b67 506
f2a2953c 507C<UTF-16> support in any software you're going to be
508using/interoperating with has probably been less tested
than C<UTF-8> support
5d030b67 510
f2a2953c 511=item *
512
c731e18e 513C<UTF-8> coded data seamlessly passes traditional
514command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
0ab8f81e 515data is likely to cause confusion (with its zero bytes,
f2a2953c 516for example)
517
518=item *
519
520it is beyond the power of words to describe the way HTML browsers
0ab8f81e 521encode non-C<ASCII> form data. To get a general impression, visit
f2a2953c 522L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
0ab8f81e 523While encoding of form data has stabilized for C<UTF-8> encoded pages
524(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
525expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
f2a2953c 526pages!
527
528=back
529
530The rule of thumb is to use C<UTF-8> unless you know what
c731e18e 531you're doing and unless you really benefit from using C<UTF-16>.
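
A small sketch of why C<UTF-16> data tends to confuse byte-oriented
tools: for plain ASCII text, C<UTF-8> output is identical to the input
bytes, while C<UTF-16LE> output interleaves a zero byte with every
character. The sample string is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = 'plain ASCII';

my $utf8  = encode('UTF-8',    $text);
my $utf16 = encode('UTF-16LE', $text);

# tr/\0// with an empty replacement only counts; it changes nothing.
my $nul_in_utf8  = ($utf8  =~ tr/\0//);
my $nul_in_utf16 = ($utf16 =~ tr/\0//);

printf "UTF-8:    %2d bytes, %2d NUL bytes\n", length $utf8,  $nul_in_utf8;
printf "UTF-16LE: %2d bytes, %2d NUL bytes\n", length $utf16, $nul_in_utf16;
```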
a999c27c 532
5d030b67 533
  ISO-IR-165    [RFC1345]
  VISCII
  GB 12345
  GB 18030 (**)  (see links below)
  EUC-TW   (**)
5d030b67 539
540are totally valid encodings but not registered at IANA.
a63c962f 541The names under which they are listed here are probably the
542most widely-known names for these encodings and are recommended
543names.
544
f2a2953c 545 BIG5PLUS (**)
a63c962f 546
0ab8f81e 547is a proprietary name.
5d030b67 548
a999c27c 549=head2 Microsoft-related naming mess
550
551Microsoft products misuse the following names:
5d030b67 552
0ab8f81e 553=over 4
a63c962f 554
a999c27c 555=item KS_C_5601-1987
5d030b67 556
a999c27c 557Microsoft extension to C<EUC-KR>.
5d030b67 558
c731e18e 559Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).
67d7b5ef 560
f2a2953c 561See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
a999c27c 562for details.
5d030b67 563
Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
misuse. I<Raw> C<KS_C_5601-1987> encoding is available as
C<ksc5601-raw>.
5d030b67 567
f2a2953c 568See L<Encode::KR> for details.
67d7b5ef 569
a999c27c 570=item GB2312
67d7b5ef 571
a999c27c 572Microsoft extension to C<EUC-CN>.
a63c962f 573
a999c27c 574Proper names: C<CP936>, C<GBK>.
a63c962f 575
a999c27c 576C<GB2312> has been registered in the C<EUC-CN> meaning at
577IANA. This has partially repaired the situation: Microsoft's
578C<GB2312> has become a superset of the official C<GB2312>.
67d7b5ef 579
a999c27c 580Encode aliases C<GB2312> to C<euc-cn> in full agreement with
581IANA registration. C<cp936> is supported separately.
f2a2953c 582I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
a999c27c 583
f2a2953c 584See L<Encode::CN> for details.
a999c27c 585
586=item Big5
587
588Microsoft extension to C<Big5>.
589
590Proper name: C<CP950>.
591
592Encode separately supports C<Big5> and C<cp950>.
593
594=item Shift_JIS
595
596Microsoft's understanding of C<Shift_JIS>.
597
598JIS has not endorsed the full Microsoft standard however.
599The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
0ab8f81e 600character sets, while Microsoft has always used C<Shift_JIS>
85982a32 601to encode a wider character repertoire. See C<IANA> registration for
c731e18e 602C<Windows-31J>.
a999c27c 603
0ab8f81e 604As a historical predecessor, Microsoft's variant
605probably has more rights for the name, though it may be objected
a999c27c 606that Microsoft shouldn't have used JIS as part of the name
607in the first place.
608
fcb875d4 609Unambiguous name: C<CP932>. C<IANA> name (not used?): C<Windows-31J>.
a999c27c 610
611Encode separately supports C<Shift_JIS> and C<cp932>.
612
613=back
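
The alias decisions described above can be checked with a short sketch
like the following, which asks Encode for the canonical name behind
each contested name. The list of names is taken from this section; the
exact canonical names reported could differ in other Encode versions.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Resolve each contested name to the canonical name Encode uses for it.
my %resolved;
for my $name (qw(KS_C_5601-1987 GB2312 GBK Big5 cp950 Shift_JIS cp932)) {
    my $enc = find_encoding($name);
    $resolved{$name} = $enc ? $enc->name : '(not supported)';
    printf "%-15s => %s\n", $name, $resolved{$name};
}
```

Names that resolve to different canonical names (for example
C<Shift_JIS> and C<cp932>) are handled as separate encodings.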
614
615=head1 Glossary
616
0ab8f81e 617=over 4
a999c27c 618
619=item character repertoire
620
0ab8f81e 621A collection of unique characters. A I<character> set in the strictest
622sense. At this stage, characters are not numbered.
a999c27c 623
624=item coded character set (CCS)
625
626A character set that is mapped in a way computers can use directly.
0ab8f81e 627Many character encodings, including EUC, fall in this category.
a999c27c 628
629=item character encoding scheme (CES)
630
An algorithm to map a character set to a byte sequence. You don't
have to be able to tell which character set a given byte sequence
belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
example of something that is both a CCS and a CES.
635
f2a2953c 636=item charset (in MIME context)
637
Has long been used to mean C<encoding>, that is, CES.
639
0ab8f81e 640While the word combination C<character set> has lost this meaning
641in MIME context since [RFC 2130], the C<charset> abbreviation has
642retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:
f2a2953c 643
644 This document uses the term "charset" to mean a set of rules for
645 mapping from a sequence of octets to a sequence of characters, such
646 as the combination of a coded character set and a character encoding
647 scheme; this is also what is used as an identifier in MIME "charset="
648 parameters, and registered in the IANA charset registry ... (Note
649 that this is NOT a term used by other standards bodies, such as ISO).
650 [RFC 2277]
651
a999c27c 652=item EUC
653
Extended Unix Code. See ISO-2022.
a999c27c 655
656=item ISO-2022
657
A CES that was carefully designed to coexist with ASCII. There is a
7 bit version and an 8 bit version.
f2a2953c 660
0ab8f81e 661The 7 bit version switches character set via escape sequence so it
f2a2953c 662cannot form a CCS. Since this is more difficult to handle in programs
0ab8f81e 663than the 8 bit version, the 7 bit version is not very popular except for
664iso-2022-jp, the I<de facto> standard CES for e-mails.
f2a2953c 665
0ab8f81e 666The 8 bit version can form a CCS. EUC and ISO-8859 are two examples
962111ca 667thereof. Pre-5.6 perl could use them as string literals.
a999c27c 668
669=item UCS
670
671Short for I<Universal Character Set>. When you say just UCS, it means
0ab8f81e 672I<Unicode>.
a999c27c 673
674=item UCS-2
675
676ISO/IEC 10646 encoding form: Universal Character Set coded in two
677octets.
678
679=item Unicode
680
0ab8f81e 681A character set that aims to include all character repertoires of the
962111ca 682world. Many character sets in various national as well as industrial
f2a2953c 683standards have become, in a way, just subsets of Unicode.
a999c27c 684
685=item UTF
686
f2a2953c 687Short for I<Unicode Transformation Format>. Determines how to map a
0ab8f81e 688Unicode character into a byte sequence.
a999c27c 689
690=item UTF-16
691
692A UTF in 16-bit encoding. Can either be in big endian or little
0ab8f81e 693endian. The big endian version is called UTF-16BE (equal to UCS-2 +
694surrogate support) and the little endian version is called UTF-16LE.
67d7b5ef 695
696=back
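
As a sketch of the "UCS-2 + surrogate support" remark in the UTF-16
entry above: for a character inside the Basic Multilingual Plane,
C<UCS-2BE> and C<UTF-16BE> produce the same two bytes, while a
character beyond it is written by C<UTF-16BE> as a surrogate pair. The
sample characters are chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bmp_char    = "\x{3042}";     # HIRAGANA LETTER A, inside the BMP
my $astral_char = "\x{10000}";    # first code point beyond the BMP

# Inside the BMP the two encodings agree byte for byte.
my $ucs2_bmp  = encode('UCS-2BE',  $bmp_char);
my $utf16_bmp = encode('UTF-16BE', $bmp_char);

# Beyond the BMP, UTF-16 uses two 16-bit surrogates (four bytes).
my $utf16_astral = encode('UTF-16BE', $astral_char);

printf "UCS-2BE  U+3042:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $ucs2_bmp;
printf "UTF-16BE U+3042:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf16_bmp;
printf "UTF-16BE U+10000: %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf16_astral;
```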
5d030b67 697
698=head1 See Also
699
5129552c 700L<Encode>,
701L<Encode::Byte>,
a63c962f 702L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
5129552c 703L<Encode::EBCDIC>, L<Encode::Symbol>
5d030b67 704
a999c27c 705=head1 References
706
0ab8f81e 707=over 4
a999c27c 708
709=item ECMA
710
711European Computer Manufacturers Association
712L<http://www.ecma.ch>
713
0ab8f81e 714=over 4
a999c27c 715
0ab8f81e 716=item ECMA-035 (eq C<ISO-2022>)
a999c27c 717
718L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
719
0ab8f81e 720The specification of ISO-2022 is available from the link above.
a999c27c 721
722=back
723
724=item IANA
725
726Internet Assigned Numbers Authority
727L<http://www.iana.org/>
728
0ab8f81e 729=over 4
a999c27c 730
731=item Assigned Charset Names by IANA
732
733L<http://www.iana.org/assignments/character-sets>
734
735Most of the C<canonical names> in Encode derive from this list
736so you can directly apply the string you have extracted from MIME
448e90bb 737header of mails and web pages.
a999c27c 738
739=back
740
741=item ISO
742
743International Organization for Standardization
744L<http://www.iso.ch/>
745
746=item RFC
747
962111ca 748Request For Comments -- need I say more?
0ab8f81e 749L<http://www.rfc-editor.org/>, L<http://www.rfc.net/>,
750L<http://www.faqs.org/rfcs/>
a999c27c 751
752=item UC
753
754Unicode Consortium
755L<http://www.unicode.org/>
756
0ab8f81e 757=over 4
a999c27c 758
759=item Unicode Glossary
760
761L<http://www.unicode.org/glossary/>
762
962111ca 763The glossary of this document is based upon this site.
a999c27c 764
765=back
766
767=back
768
769=head2 Other Notable Sites
770
0ab8f81e 771=over 4
a999c27c 772
773=item czyborra.com
774
f2a2953c 775L<http://czyborra.com/>
a999c27c 776
Contains a lot of useful information, especially gory details of ISO
vs. vendor mappings.
779
780=item CJK.inf
781
782L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
783
784Somewhat obsolete (last update in 1996), but still useful. Also try
785
786L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
787
0ab8f81e 788You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.
a999c27c 789
f2a2953c 790=item Jungshik Shin's Hangul FAQ
791
792L<http://jshin.net/faq>
793
0ab8f81e 794And especially its subject 8.
f2a2953c 795
796L<http://jshin.net/faq/qa8.html>
797
962111ca 798A comprehensive overview of the Korean (C<KS *>) standards.
f2a2953c 799
0ab8f81e 800=item debian.org: "Introduction to i18n"
801
802A brief description for most of the mentioned CJK encodings is
803contained in
804L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
805
f2a2953c 806=back
807
808=head2 Offline sources
809
0ab8f81e 810=over 4
f2a2953c 811
812=item C<CJKV Information Processing> by Ken Lunde
813
814CJKV Information Processing
8151999 O'Reilly & Associates, ISBN : 1-56592-224-7
816
0ab8f81e 817The modern successor of C<CJK.inf>.
f2a2953c 818
Features comprehensive coverage of CJKV character sets and
encodings along with many other issues faced by anyone trying
to better support CJKV languages/scripts in all the areas of
information processing.
823
0ab8f81e 824To purchase this book, visit
f2a2953c 825L<http://www.oreilly.com/catalog/cjkvinfo/>
0ab8f81e 826or your favourite bookstore.
f2a2953c 827
a999c27c 828=back
829
5d030b67 830=cut