=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored. In addition, an encoding may have aliases.
Each encoding has one "canonical" name. The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence (with a few exceptions).

=over 4

=item *

The name used by the Perl community. That includes 'utf8' and 'ascii'.
Unlike aliases, canonical names reach the method directly, so
frequently used names such as 'utf8' need no alias lookup.

=item *

The MIME name as defined in IETF RFCs. This includes all "iso-"s.

=item *

The name in the IANA registry.

=item *

The name used by the organization that defined it.

=back

Where a I<de jure> canonical name differs from the one used by the
Encode module, it is always aliased, provided the encoding is
implemented at all. So you can safely tell whether a given encoding
is implemented just by passing the canonical name.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses an encoding object internally
once an operation is in progress.

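As a rough illustration of the naming rules above, the following sketch
asks Encode which canonical name a few candidate names resolve to. The
names probed are examples only, and C<no-such-encoding> is a made-up name
used to show the failure case.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# Names to probe; 'no-such-encoding' is a deliberately bogus, hypothetical name.
for my $name ('latin1', 'ISO-8859-1', 'koi8-r', 'no-such-encoding') {
    my $enc = find_encoding($name);    # encoding object, or undef if unknown
    if ($enc) {
        printf "%-18s => %s\n", $name, $enc->name;
    }
    else {
        printf "%-18s => (not implemented)\n", $name;
    }
}

# resolve_alias() returns the canonical name, or a false value if there is none.
my $canonical = resolve_alias('latin1');
print "latin1 resolves to $canonical\n";
```

C<find_encoding> returns the encoding object that Encode would use
internally, so a defined result is a reasonable test for "implemented".
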
=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.
In other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available. Encode.pm will automatically load those modules on demand.

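The on-demand loading described above can be observed with a short
sketch like the one below. It assumes a stock Encode installation that
includes Encode::JP; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode;

# Encodings whose modules are already loaded (the built-in ones at least).
my @loaded = Encode->encodings();

# Every encoding Encode knows how to load on demand.
my @all = Encode->encodings(':all');

printf "%d loaded, %d available\n", scalar @loaded, scalar @all;

# Using an encoding is enough to load its module; no "use Encode::JP" needed.
my $octets = Encode::encode('euc-jp', "\x{3042}");    # HIRAGANA LETTER A
printf "euc-jp octets: %s\n", join ' ', map { sprintf '%02X', ord } split //, $octets;
```
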
=head2 Built-in Encodings

The following encodings are always available.

  Canonical   Aliases     Comments & References
  ----------------------------------------------------------------
  ascii       US-ascii    [ECMA]
  iso-8859-1  latin1      [ISO]
  utf8        UTF-8       [RFC2279]
  ----------------------------------------------------------------

=head2 Encode::Unicode -- other Unicode encodings

Unicode coding schemes other than native utf8 are supported by
Encode::Unicode, which will be autoloaded on demand.

  ----------------------------------------------------------------
  UCS-2BE     UCS-2, iso-10646-1      [IANA, UC]
  UCS-2LE                             [UC]
  UTF-16                              [UC]
  UTF-16BE                            [UC]
  UTF-16LE                            [UC]
  UTF-32                              [UC]
  UTF-32BE    UCS-4                   [UC]
  UTF-32LE                            [UC]
  ----------------------------------------------------------------

To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
see L<Encode::Unicode>.

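To get a feel for how these schemes differ, one can encode a single
character under each name and look at the octets. This is only a sketch;
the choice of the letter "A" is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "A";    # U+0041

# Print each encoded form as space-separated hex octets.
for my $name (qw(UCS-2BE UCS-2LE UTF-16 UTF-16BE UTF-16LE UTF-32BE UTF-32LE)) {
    my $octets = encode($name, $text);
    my $hex    = join ' ', map { sprintf '%02X', ord } split //, $octets;
    printf "%-9s %s\n", $name, $hex;
}
```

The BE/LE variants differ only in byte order, while plain C<UTF-16>
prepends a byte order mark when encoding.
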
=head2 Encode::Byte -- Extended ASCII

Encode::Byte implements most single-byte encodings except for
Symbols and EBCDIC. The following encodings are based on single-byte
encodings implemented as extended ASCII. Most of them map
\x80-\xff (upper half) to non-ASCII characters.

=over 4

=item ISO-8859 and corresponding vendor mappings

Since there are so many, they are presented in table format with
languages and corresponding encoding names by vendors. Note that
the table is sorted in order of ISO-8859 and the corresponding vendor
mappings are slightly different from those of ISO. See
L<http://czyborra.com/charsets/iso8859.html> for details.

  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh           Others
  ----------------------------------------------------------------
  N. America    (ASCII)         cp437                               AdobeStandardEncoding
                                cp863 (DOSCanadaF)
  W. Europe     iso-8859-1      cp850   cp1252  MacRoman            nextstep
                                                                    hp-roman8
                                cp860 (DOSPortuguese)
  Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                MacCroatian
                                                MacRomanian
                                                MacRumanian
  Latin3 [1]    iso-8859-3
  Latin4 [2]    iso-8859-4
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
  (See also next section)       cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11 [3] cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9 [4]    iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------

  [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
  [2] Baltics. Now on 8859-10, except for Latvian.
  [3] Also known as TIS 620.
  [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
      letters that are missing from 8859-1 were added.

All cp* are also available as ibm-*, ms-*, and windows-*. See also
L<http://czyborra.com/charsets/codepages.html>.

Macintosh encodings don't seem to be registered in such entities as
IANA. "Canonical" names in Encode are based upon Apple's Tech Note
1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.

=item KOI8 - De Facto Standard for the Cyrillic world

Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net. L<Encode> comes with the following KOI charsets.
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.

  ----------------------------------------------------------------
  koi8-f
  koi8-r      cp878       [RFC1489]
  koi8-u                  [RFC2319]
  ----------------------------------------------------------------

=item gsm0338 - Hentai Latin 1

GSM0338 is for GSM handsets. Though it shares alphanumerics with
ASCII, control character ranges and other parts are mapped very
differently, presumably to store Greek and Cyrillic alphabets.
This is also covered in Encode::Byte even though it is not an
"extended ASCII" encoding.

=back

=head2 CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs. Charset"
below. Also note that these are implemented in distinct modules by
country, due to size concerns (simplified Chinese is mapped
to 'CN', continental China, while traditional Chinese is mapped to
'TW', Taiwan). Please refer to their respective documentation pages.

=over 4

=item Encode::CN -- Continental China

  Standard      DOS/Win   Macintosh       Comment/Reference
  ----------------------------------------------------------------
  euc-cn [1]              MacChineseSimp
  (gbk)         cp936 [2]
  gb12345-raw                             { GB12345 without CES }
  gb2312-raw                              { GB2312 without CES }
  hz
  iso-ir-165
  ----------------------------------------------------------------

  [1] GB2312 is aliased to this. See L<Microsoft-related naming mess>
  [2] gbk is aliased to this. See L<Microsoft-related naming mess>

=item Encode::JP -- Japan

  Standard      DOS/Win   Macintosh       Comment/Reference
  ----------------------------------------------------------------
  euc-jp
  shiftjis      cp932     macJapanese
  7bit-jis
  iso-2022-jp                             [RFC1468]
  iso-2022-jp-1                           [RFC2237]
  jis0201-raw   { JIS X 0201 (roman + halfwidth kana) without CES }
  jis0208-raw   { JIS X 0208 (Kanji + fullwidth kana) without CES }
  jis0212-raw   { JIS X 0212 (Extended Kanji) without CES }
  ----------------------------------------------------------------

=item Encode::KR -- Korea

  Standard      DOS/Win   Macintosh       Comment/Reference
  ----------------------------------------------------------------
  euc-kr                  MacKorean       [RFC1557]
                cp949 [1]
  iso-2022-kr                             [RFC1557]
  johab                                   [KS X 1001:1998, Annex 3]
  ksc5601-raw                             { KSC5601 without CES }
  ----------------------------------------------------------------

  [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
      See below.

=item Encode::TW -- Taiwan

  Standard      DOS/Win   Macintosh       Comment/Reference
  ----------------------------------------------------------------
  big5-eten     cp950     MacChineseTrad  {big5 aliased to big5-eten}
  big5-hkscs
  ----------------------------------------------------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  Standard      Comment/Reference
  ----------------------------------------------------------------
  big5ext       CMEX's Big5e Extension
  big5plus      CMEX's Big5+ Extension
  cccii         Chinese Character Code for Information Interchange
  euc-tw        EUC (Extended Unix Code)
  gb18030       GBK with Traditional Characters
  ----------------------------------------------------------------

=item Encode::JIS2K -- JIS X 0213 encodings via CPAN

Due to size concerns, the additional Japanese encodings below are
distributed separately on CPAN, under the name Encode::JIS2K.

  Standard      DOS/Win   Macintosh       Comment/Reference
  ----------------------------------------------------------------
  euc-jisx0213
  shiftjisx0213
  iso-2022-jp-3
  jis0213-1-raw
  jis0213-2-raw
  ----------------------------------------------------------------

=back

=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See L<perlebcdic> for details.

  ----------------------------------------------------------------
  cp37
  cp500
  cp875
  cp1026
  cp1047
  posix-bc
  ----------------------------------------------------------------

=item Encode::Symbols

For symbols and dingbats.

  ----------------------------------------------------------------
  symbol
  dingbats
  MacDingbats
  AdobeZdingbat
  AdobeSymbol
  ----------------------------------------------------------------

=item Encode::MIME::Header

Strictly speaking, MIME header encoding documented in RFC 2047 is more
of encapsulation than encoding. It is included anyway.

  ----------------------------------------------------------------
  MIME-Header                                   [RFC2047]
  MIME-B                                        [RFC2047]
  MIME-Q                                        [RFC2047]
  ----------------------------------------------------------------

=item Encode::Guess

This one is not the name of an encoding but a utility that lets you
pick the most appropriate encoding for given data out of a list of
I<suspects>. See L<Encode::Guess> for details.

=back

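The "encapsulation" nature of C<MIME-Header> can be seen in a small
round trip. This is a sketch only: the sample header value is an
illustrative RFC 2047 encoded-word, and the exact form that C<encode>
produces (charset, B or Q, folding) is left to the module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# An RFC 2047 encoded-word: charset ISO-8859-1, "Q" encoding.
my $raw = '=?ISO-8859-1?Q?Andr=E9?=';

# decode() unwraps the encapsulation and returns Perl characters.
my $name = decode('MIME-Header', $raw);

# encode() wraps the characters again; decoding that gives them back.
my $header = encode('MIME-Header', $name);
my $back   = decode('MIME-Header', $header);

print "re-encoded header: $header\n";
print "round trip ", ($back eq $name ? "ok" : "FAILED"), "\n";
```
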
=head1 Unsupported encodings

The following encodings are not supported as yet; some because they
are rarely used, some because of technical difficulties. They may
be supported by external modules via CPAN in the future, however.

=over 4

=item ISO-2022-JP-2 [RFC1554]

Not very popular yet. Needs Unicode Database or equivalent to
implement encode() (because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap, so you
need to look up the database to determine to which character set a
given Unicode character should belong).

=item ISO-2022-CN [RFC1922]

Not very popular. Needs CNS 11643-1 and -2 which are not available in
this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
Autrijus Tang may add support for this encoding in his module in future.

=item Various HP-UX encodings

The following are unsupported due to the lack of mapping data.

  '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
  '15' - japanese15, korean15, and roi15

=item Cyrillic encoding ISO-IR-111

Anton Tagunov doubts its usefulness.

=item ISO-8859-8-1 [Hebrew]

None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
MacHebrew are supported only because there were mappings
available at L<http://www.unicode.org/>). Contributions welcome.

=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]

Ditto.

=item Thai encoding TCVN

Ditto.

=item Vietnamese encodings VPS

Though Jungshik Shin has reported that Mozilla supports this encoding,
it was too late before 5.8.0 for us to add it. In the future, it
may be available via a separate module. See
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
and
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
if you are interested in helping us.

=item Various Mac encodings

The following are unsupported due to the lack of mapping data.

  MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
  MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
  MacLaotian,   MacMalayalam, MacMongolian, MacOriya
  MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
  MacVietnamese

The rest which are already available are based upon the vendor mappings
at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/>.

=item (Mac) Indic encodings

The maps for the following are available at L<http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, currently unsupported by F<enc2xs>:

  MacDevanagari
  MacGurmukhi
  MacGujarati

For details, please see C<Unicode mapping issues and notes:> at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT>.

I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but the above were the only Indic encoding
maps that I could find at L<http://www.unicode.org/>.

=back

=head1 Encoding vs. Charset -- terminology

We are used to using the term (character) I<encoding> and I<character
set> interchangeably. But just as confusing the terms byte and
character is dangerous and the terms should be differentiated when
needed, we need to differentiate I<encoding> and I<character set>.

To understand that, here is a description of how we make computers
grok our characters.

=over 4

=item *

First we start with which characters to include. We call this
collection of characters a I<character repertoire>.

=item *

Then we have to give each character a unique ID so your computer can
tell the difference between 'a' and 'A'. This itemized character
repertoire is now a I<character set>.

=item *

If your computer can grok the character set without further
processing, you can go ahead and use it. This is called a I<coded
character set> (CCS) or I<raw character encoding>. ASCII is used this
way for most cases.

=item *

But in many cases, especially multi-byte CJK encodings, you have to
tweak a little more. Your network connection may not accept any data
with the Most Significant Bit set, and your computer may not be able to
tell if a given byte is a whole character or just half of it. So you
have to I<encode> the character set to use it.

A I<character encoding scheme> (CES) determines how to encode a given
character set, or a set of multiple character sets. 7bit ISO-2022 is
an example of a CES. You switch between character sets via I<escape
sequences>.

=back

Technically, or mathematically, speaking, a character set encoded in
such a CES that maps character by character may form a CCS. EUC is such
an example. The CES of EUC is as follows:

=over 4

=item *

Map ASCII unchanged.

=item *

Map a character set that consists of 94 or 96 to the power of N
members by adding 0x80 to each byte.

=item *

You can also use 0x8e and 0x8f to indicate that the following sequence of
characters belongs to yet another character set. 0x80 is added to each
following byte.

=back

By carefully looking at the encoded byte sequence, you can find that the
byte sequence forms a unique number. In that sense, EUC is a CCS
generated by the CES above from up to four CCS (complicated?). UTF-8
falls into this category. See L<perlunicode/"UTF-8"> to find out how
UTF-8 maps Unicode to a byte sequence.

You may also have found out by now why 7bit ISO-2022 cannot comprise
a CCS. If you look at a byte sequence \x21\x21, you can't tell if
it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1
so you have no trouble differentiating between "!!" and S<"  ">.

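The \x21\x21 versus \xA1\xA1 point can be checked directly. The sketch
below encodes IDEOGRAPHIC SPACE three ways using encoding names listed
earlier in this document; C<hex_octets> is a local helper written for
this example, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $space = "\x{3000}";    # IDEOGRAPHIC SPACE

# Local helper (not part of Encode): render octets as hex for display.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $octets;
}

my $raw = encode('jis0208-raw', $space);    # the bare two-byte code
my $euc = encode('euc-jp',      $space);    # same code with 0x80 added to each byte
my $jis = encode('iso-2022-jp', $space);    # same code wrapped in escape sequences

printf "%-12s %s\n", 'jis0208-raw', hex_octets($raw);
printf "%-12s %s\n", 'euc-jp',      hex_octets($euc);
printf "%-12s %s\n", 'iso-2022-jp', hex_octets($jis);

# Two ASCII exclamation marks stay below 0x80 in EUC, so they cannot collide.
printf "%-12s %s\n", 'euc-jp "!!"', hex_octets(encode('euc-jp', '!!'));
```

Without the escape sequences, the ISO-2022 payload is byte-for-byte the
same as two ASCII exclamation marks, which is the ambiguity described above.
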
=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

=over 4

=item *

To (en|de)code encodings marked by C<(**)>, you need
C<Encode::HanExtra>, available from CPAN.

=back

Encoding names

  US-ASCII    UTF-8    ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
  EUC-KR      Big5     GB2312

are registered with IANA as preferred MIME names and may
be used over the Internet.

C<Shift_JIS> has been made official by JIS X 0208:1997.
L<Microsoft-related naming mess> gives details.

C<GB2312> is the IANA name for C<EUC-CN>.
See L<Microsoft-related naming mess> for details.

C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
with Encode. See L<Encode::CN> for details.

  EUC-CN
  KOI8-U        [RFC2319]

have not been registered with IANA (as of March 2002) but
seem to be supported by major web browsers.
The IANA name for C<EUC-CN> is C<GB2312>.

  KS_C_5601-1987

is heavily misused.
See L<Microsoft-related naming mess> for details.

C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
with Encode. See L<Encode::KR> for details.

  UTF-16 UTF-16BE UTF-16LE

are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware, however, that

=over 4

=item *

C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support

=item *

C<UTF-8> coded data seamlessly passes traditional
command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
data is likely to cause confusion (with its zero bytes,
for example)

=item *

it is beyond the power of words to describe the way HTML browsers
encode non-C<ASCII> form data. To get a general impression, visit
L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
While encoding of form data has stabilized for C<UTF-8> encoded pages
(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
pages!

=back

The rule of thumb is to use C<UTF-8> unless you know what
you're doing and unless you really benefit from using C<UTF-16>.

  ISO-IR-165    [RFC1345]
  VISCII
  GB 12345
  GB 18030 (**) (see links below)
  EUC-TW   (**)

are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are recommended
names.

  BIG5PLUS (**)

is a proprietary name.

=head2 Microsoft-related naming mess

Microsoft products misuse the following names:

=over 4

=item KS_C_5601-1987

Microsoft extension to C<EUC-KR>.

Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).

See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
for details.

Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
misusage. I<Raw> C<KS_C_5601-1987> encoding is available as
C<ksc5601-raw>.

See L<Encode::KR> for details.

=item GB2312

Microsoft extension to C<EUC-CN>.

Proper names: C<CP936>, C<GBK>.

C<GB2312> has been registered in the C<EUC-CN> meaning at
IANA. This has partially repaired the situation: Microsoft's
C<GB2312> has become a superset of the official C<GB2312>.

Encode aliases C<GB2312> to C<euc-cn> in full agreement with
IANA registration. C<cp936> is supported separately.
I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.

See L<Encode::CN> for details.

=item Big5

Microsoft extension to C<Big5>.

Proper name: C<CP950>.

Encode separately supports C<Big5> and C<cp950>.

=item Shift_JIS

Microsoft's understanding of C<Shift_JIS>.

JIS has not endorsed the full Microsoft standard, however.
The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
character sets, while Microsoft has always used C<Shift_JIS>
to encode a wider character repertoire. See C<IANA> registration for
C<Windows-31J>.

As the historical predecessor, Microsoft's variant
probably has a stronger claim to the name, though it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.

Unambiguous name: C<CP932>. C<IANA> name (not used?): C<Windows-31J>.

Encode separately supports C<Shift_JIS> and C<cp932>.

=back

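To see how Encode treats the contested names, one can ask it to resolve
them and compare a vendor-only character across the two Japanese
variants. This is a sketch; CIRCLED DIGIT ONE is chosen here only as an
example of a character outside JIS X 0208, and what the stricter
C<shiftjis> encoding emits for it is printed rather than assumed.

```perl
use strict;
use warnings;
use Encode qw(encode resolve_alias);

# How Encode resolves the contested names discussed above.
for my $name (qw(GB2312 GBK KS_C_5601-1987 Shift_JIS Big5)) {
    my $canon = resolve_alias($name);
    printf "%-15s => %s\n", $name, $canon || '(unknown)';
}

# CIRCLED DIGIT ONE is a vendor addition that cp932 can represent.
my $circled = "\x{2460}";
my $cp932   = encode('cp932',    $circled);
my $sjis    = encode('shiftjis', $circled);    # substitution character expected

printf "cp932:    %s\n", join ' ', map { sprintf '%02X', ord } split //, $cp932;
printf "shiftjis: %s\n", join ' ', map { sprintf '%02X', ord } split //, $sjis;
```
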
=head1 Glossary

=over 4

=item character repertoire

A collection of unique characters. A I<character> set in the strictest
sense. At this stage, characters are not numbered.

=item coded character set (CCS)

A character set that is mapped in a way computers can use directly.
Many character encodings, including EUC, fall in this category.

=item character encoding scheme (CES)

An algorithm to map a character set to a byte sequence. You don't
have to be able to tell which character set a given byte sequence
belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
example of being both a CCS and a CES.

=item charset (in MIME context)

has long been used in the meaning of C<encoding>, CES.

While the word combination C<character set> has lost this meaning
in MIME context since [RFC 2130], the C<charset> abbreviation has
retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:

  This document uses the term "charset" to mean a set of rules for
  mapping from a sequence of octets to a sequence of characters, such
  as the combination of a coded character set and a character encoding
  scheme; this is also what is used as an identifier in MIME "charset="
  parameters, and registered in the IANA charset registry ...  (Note
  that this is NOT a term used by other standards bodies, such as ISO).
  [RFC 2277]

=item EUC

Extended Unix Code. See ISO-2022.

=item ISO-2022

A CES that was carefully designed to coexist with ASCII. There are a 7
bit version and an 8 bit version.

The 7 bit version switches character set via escape sequence so it
cannot form a CCS. Since this is more difficult to handle in programs
than the 8 bit version, the 7 bit version is not very popular except for
iso-2022-jp, the I<de facto> standard CES for e-mails.

The 8 bit version can form a CCS. EUC and ISO-8859 are two examples
thereof. Pre-5.6 perl could use them as string literals.

=item UCS

Short for I<Universal Character Set>. When you say just UCS, it means
I<Unicode>.

=item UCS-2

ISO/IEC 10646 encoding form: Universal Character Set coded in two
octets.

=item Unicode

A character set that aims to include all character repertoires of the
world. Many character sets in various national as well as industrial
standards have become, in a way, just subsets of Unicode.

=item UTF

Short for I<Unicode Transformation Format>. Determines how to map a
Unicode character into a byte sequence.

=item UTF-16

A UTF in 16-bit encoding. Can either be in big endian or little
endian. The big endian version is called UTF-16BE (equal to UCS-2 +
surrogate support) and the little endian version is called UTF-16LE.

=back

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>,
L<Encode::MIME::Header>, L<Encode::Guess>

=head1 References

=over 4

=item ECMA

European Computer Manufacturers Association
L<http://www.ecma.ch>

=over 4

=item ECMA-035 (eq C<ISO-2022>)

L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

The specification of ISO-2022 is available from the link above.

=back

=item IANA

Internet Assigned Numbers Authority
L<http://www.iana.org/>

=over 4

=item Assigned Charset Names by IANA

L<http://www.iana.org/assignments/character-sets>

Most of the C<canonical names> in Encode derive from this list
so you can directly apply the string you have extracted from MIME
header of mails and web pages.

=back

=item ISO

International Organization for Standardization
L<http://www.iso.ch/>

=item RFC

Request For Comments -- need I say more?
L<http://www.rfc-editor.org/>, L<http://www.rfc.net/>,
L<http://www.faqs.org/rfcs/>

=item UC

Unicode Consortium
L<http://www.unicode.org/>

=over 4

=item Unicode Glossary

L<http://www.unicode.org/glossary/>

The glossary of this document is based upon this site.

=back

=back

=head2 Other Notable Sites

=over 4

=item czyborra.com

L<http://czyborra.com/>

Contains a lot of useful information, especially gory details of ISO
vs. vendor mappings.

=item CJK.inf

L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Somewhat obsolete (last update in 1996), but still useful. Also try

L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.

=item Jungshik Shin's Hangul FAQ

L<http://jshin.net/faq>

And especially its subject 8.

L<http://jshin.net/faq/qa8.html>

A comprehensive overview of the Korean (C<KS *>) standards.

=item debian.org: "Introduction to i18n"

A brief description for most of the mentioned CJK encodings is
contained in
L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>

=back

=head2 Offline sources

=over 4

=item C<CJKV Information Processing> by Ken Lunde

CJKV Information Processing
1999 O'Reilly & Associates, ISBN : 1-56592-224-7

The modern successor of C<CJK.inf>.

Features a comprehensive coverage of CJKV character sets and
encodings along with many other issues faced by anyone trying
to better support CJKV languages/scripts in all the areas of
information processing.

To purchase this book, visit
L<http://www.oreilly.com/catalog/cjkvinfo/>
or your favourite bookstore.

=back

=cut