[Encode] 1.74 released -- final for 5.8.0-RC1
5d030b67 1=head1 NAME
2
0ab8f81e 3Encode::Supported -- Encodings supported by Encode
5d030b67 4
5=head1 DESCRIPTION
6
5129552c 7=head2 Encoding Names
5d030b67 8
9Encoding names are case insensitive. White space in names
0ab8f81e 10is ignored. In addition, an encoding may have aliases.
5d030b67 11Each encoding has one "canonical" name. The "canonical"
12name is chosen from the names of the encoding by picking
a999c27c 13the first in the following sequence (with a few exceptions).
5d030b67 14
0ab8f81e 15=over 4
a999c27c 16
17=item *
18
The name used by the Perl community. That includes 'utf8' and 'ascii'.
Unlike aliases, canonical names reach the encoding method directly, so
frequently used names such as 'utf8' need no alias lookup.
a999c27c 22
23=item *
24
0ab8f81e 25The MIME name as defined in IETF RFCs. This includes all "iso-"s.
a999c27c 26
27=item *
28
29The name in the IANA registry.
962111ca 30
a999c27c 31=item *
32
33The name used by the organization that defined it.
34
35=back
36
Where a I<de jure> canonical name differs from the one used by the
Encode module, the I<de jure> name is always aliased, provided the
encoding is implemented at all. You can therefore tell whether a given
encoding is implemented simply by passing its canonical name.
5d030b67 41
5129552c 42Because of all the alias issues, and because in the general case
962111ca 43encodings have state, "Encode" uses an encoding object internally
5129552c 44once an operation is in progress.
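
The lookup described above can be sketched as follows. This is a
minimal illustration, not part of the original text: the alias
spellings tried and the deliberately bogus name are arbitrary choices.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Each alias resolves to an encoding object; ->name is its canonical name.
my %canonical = map { $_ => find_encoding($_)->name }
    qw(latin1 Latin1 US-ascii ascii);

printf "%-10s => %s\n", $_, $canonical{$_} for sort keys %canonical;

# An unknown name yields undef, which doubles as an "is it implemented?" test.
my $missing = find_encoding('no-such-encoding');
print "no-such-encoding is ", ( defined $missing ? "known" : "unknown" ), "\n";
```

The object returned by C<find_encoding> is the "encoding object" the
paragraph above refers to.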
5d030b67 45
5129552c 46=head1 Supported Encodings
5d030b67 47
As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.
In other words, "ISO 8859 1" and "iso-8859-1" are identical.
5d030b67 52
Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available. Encode.pm will automatically load those modules on demand.
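
As a rough sketch of on-demand loading (the choice of C<euc-jp> and of
the sample character is only for illustration), the following compares
what is loaded with what is available, then uses an Encode::JP encoding
without ever naming that module:

```perl
use strict;
use warnings;
use Encode;

my @loaded = Encode->encodings();          # encodings loaded so far
my @all    = Encode->encodings(':all');    # everything Encode can load

printf "loaded: %d, available: %d\n", scalar @loaded, scalar @all;

# No "use Encode::JP" here: the module is pulled in on first use.
my $octets = encode( 'euc-jp', "\x{3042}" );    # HIRAGANA LETTER A
printf "euc-jp bytes: %s\n", uc unpack( 'H*', $octets );
```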
5d030b67 56
5129552c 57=head2 Built-in Encodings
5d030b67 58
5129552c 59The following encodings are always available.
5d030b67 60
  Canonical   Aliases                      Comments & References
  ----------------------------------------------------------------
  ascii       US-ascii                     [ECMA]
  ascii-ctrl                               Special Encoding
  iso-8859-1  latin1                       [ISO]
  null                                     Special Encoding
  utf8        UTF-8                        [RFC2279]
  ----------------------------------------------------------------
69
I<null> and I<ascii-ctrl> are special. "null" fails for every
character, so when you set the fallback mode to PERLQQ, HTMLCREF or
XMLCREF, ALL CHARACTERS will fall back to character references. The
same goes for "ascii-ctrl", except for control characters. For
fallback modes, see L<Encode>.
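
The fallback behaviour can be sketched like this. The sample strings
are arbitrary, and the exact escape formats printed are whatever your
Encode release produces; treat this as an illustration rather than a
specification.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{e9}";    # U+00E9 has no mapping in ASCII

# Each FB_* constant below includes LEAVE_SRC, so $text is not modified.
my $perlqq = encode( 'ascii', $text, Encode::FB_PERLQQ() );
my $html   = encode( 'ascii', $text, Encode::FB_HTMLCREF() );
my $xml    = encode( 'ascii', $text, Encode::FB_XMLCREF() );

# "null" maps nothing, so every character takes the fallback path.
my $plain    = 'abc';
my $all_refs = encode( 'null', $plain, Encode::FB_HTMLCREF() );

print "$_\n" for $perlqq, $html, $xml, $all_refs;
```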
75
c731e18e 76=head2 Encode::Unicode -- other Unicode encodings
77
78Unicode coding schemes other than native utf8 are supported by
0ab8f81e 79Encode::Unicode, which will be autoloaded on demand.
c731e18e 80
  ----------------------------------------------------------------
  UCS-2BE     UCS-2, iso-10646-1           [IANA, UC]
  UCS-2LE                                  [UC]
  UTF-16                                   [UC]
  UTF-16BE                                 [UC]
  UTF-16LE                                 [UC]
  UTF-32                                   [UC]
  UTF-32BE    UCS-4                        [UC]
  UTF-32LE                                 [UC]
  ----------------------------------------------------------------
5d030b67 91
0ab8f81e 92To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
f2a2953c 93see L<Encode::Unicode>.
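
To get a feel for the differences, here is a small sketch that encodes
a single letter under several of these names and dumps the bytes. The
letter and the selection of names are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = 'A';    # U+0041

# Hex dump of the same character under each coding scheme.
my %hex = map { $_ => uc unpack( 'H*', encode( $_, $char ) ) }
    qw(UTF-16 UTF-16BE UTF-16LE UTF-32BE UCS-2BE);

printf "%-9s %s\n", $_, $hex{$_} for sort keys %hex;
```

Plain C<UTF-16> prepends a byte order mark when encoding, while the
explicit BE and LE variants do not.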
94
a999c27c 95=head2 Encode::Byte -- Extended ASCII
5d030b67 96
0ab8f81e 97Encode::Byte implements most single-byte encodings except for
98Symbols and EBCDIC. The following encodings are based on single-byte
99encodings implemented as extended ASCII. Most of them map
100\x80-\xff (upper half) to non-ASCII characters.
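
A minimal sketch of what "extended ASCII" means in practice: the same
upper-half byte decodes to different characters depending on the
encoding, while the lower half stays put. The byte C<\xA4> is just one
convenient example.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\xA4";

my $latin1 = decode( 'iso-8859-1',  $byte );    # CURRENCY SIGN
my $latin9 = decode( 'iso-8859-15', $byte );    # EURO SIGN
my $lower  = decode( 'iso-8859-15', 'abc' );    # lower half is plain ASCII

printf "iso-8859-1 : U+%04X\n", ord $latin1;
printf "iso-8859-15: U+%04X\n", ord $latin9;
print "lower half : $lower\n";
```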
a999c27c 101
0ab8f81e 102=over 4
a999c27c 103
104=item ISO-8859 and corresponding vendor mappings
105
Since there are so many, they are presented in table format with
languages and corresponding encoding names by vendors. Note that
the table is sorted in order of ISO-8859 and that the corresponding
vendor mappings are slightly different from those of ISO. See
L<http://czyborra.com/charsets/iso8859.html> for details.
111
  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  ----------------------------------------------------------------
  N. America    (ASCII)         cp437        AdobeStandardEncoding
                                cp863 (DOSCanadaF)
  W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
                                                         hp-roman8
                                cp860 (DOSPortuguese)
  Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                MacCroatian
                                                MacRomanian
                                                MacRumanian
  Latin3 [1]    iso-8859-3
  Latin4 [2]    iso-8859-4
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
    (See also next section)     cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11 [3] cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9 [4]    iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------

  [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
  [2] Baltics. Now on 8859-10, except for Latvian.
  [3] Also known as TIS 620.
  [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
      letters that are missing from 8859-1 were added.
a999c27c 150
All cp* are also available as ibm-*, ms-*, and windows-*. See also
L<http://czyborra.com/charsets/codepages.html>.

Macintosh encodings don't seem to be registered with bodies such as
IANA. "Canonical" names in Encode are based upon Apple's Tech Note
1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.
a999c27c 158
0ab8f81e 159=item KOI8 - De Facto Standard for the Cyrillic world
a999c27c 160
Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net. L<Encode> comes with the following KOI charsets.
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.
5d030b67 164
  ----------------------------------------------------------------
  koi8-f
  koi8-r      cp878                        [RFC1489]
  koi8-u                                   [RFC2319]
  ----------------------------------------------------------------
962111ca 170
a999c27c 171=item gsm0338 - Hentai Latin 1
172
GSM0338 is for GSM handsets. Though it shares alphanumerics with
ASCII, control character ranges and other parts are mapped very
differently, presumably to store Greek and Cyrillic alphabets.
This is also covered in Encode::Byte even though it is not an
"extended ASCII" encoding.
a999c27c 178
179=back
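
The items above can be illustrated with a short sketch: one Cyrillic
letter encoded under four of the single-byte encodings listed, followed
by a check that vendor-prefixed names resolve to their C<cp*> canonical
names. The letter and the alias spellings are arbitrary examples.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

my $char = "\x{0430}";    # CYRILLIC SMALL LETTER A

# One character, four different single bytes.
my %byte = map { $_ => sprintf( '%02X', ord( encode( $_, $char ) ) ) }
    qw(iso-8859-5 koi8-r cp1251 cp866);

printf "%-11s 0x%s\n", $_, $byte{$_} for sort keys %byte;

# ibm-*, ms-* and windows-* are aliases for the cp* names.
my %alias = map { $_ => find_encoding($_)->name } qw(windows-1251 ibm-866);

printf "%-13s => %s\n", $_, $alias{$_} for sort keys %alias;
```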
5d030b67 180
0ab8f81e 181=head2 CJK: Chinese, Japanese, Korean (Multibyte)
5d030b67 182
Note that Vietnamese is listed above. Also read "Encoding vs. Charset"
below. Also note that these are implemented in distinct modules by
country, due to size concerns (simplified Chinese is mapped
to 'CN', continental China, while traditional Chinese is mapped to
'TW', Taiwan). Please refer to their respective documentation pages.
5d030b67 188
5129552c 189=over 4
190
191=item Encode::CN -- Continental China
192
  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-cn [1]            MacChineseSimp
  (gbk)         cp936 [2]
  gb12345-raw                      { GB12345 without CES }
  gb2312-raw                       { GB2312  without CES }
  hz
  iso-ir-165
  ----------------------------------------------------------------

  [1] GB2312 is aliased to this.  See L<Microsoft-related naming mess>
  [2] gbk is aliased to this.  See L<Microsoft-related naming mess>
f2a2953c 205
5129552c 206=item Encode::JP -- Japan
207
  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-jp
  shiftjis      cp932   macJapanese
  7bit-jis
  iso-2022-jp                                          [RFC1468]
  iso-2022-jp-1                                        [RFC2237]
  jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
  jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
  jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
  ----------------------------------------------------------------
5129552c 219
220=item Encode::KR -- Korea
221
  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-kr                MacKorean                        [RFC1557]
                cp949 [1]
  iso-2022-kr                                            [RFC1557]
  johab                             [KS X 1001:1998, Annex 3]
  ksc5601-raw                              { KSC5601 without CES }
  ----------------------------------------------------------------

  [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
      See below.
233
5129552c 234=item Encode::TW -- Taiwan
235
  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
  big5-hkscs
  ----------------------------------------------------------------
5129552c 241
242=item Encode::HanExtra -- More Chinese via CPAN
243
244Due to size concerns, additional Chinese encodings below are
245distributed separately on CPAN, under the name Encode::HanExtra.
246
  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  big5ext                                   CMEX's Big5e Extension
  big5plus                                  CMEX's Big5+ Extension
  cccii         Chinese Character Code for Information Interchange
  euc-tw                             EUC (Extended Unix Character)
  gb18030                          GBK with Traditional Characters
  ----------------------------------------------------------------
255
256=item Encode::JIS2K -- JIS X 0213 encodings via CPAN
257
258Due to size concerns, additional Japanese encodings below are
259distributed separately on CPAN, under the name Encode::JIS2K.
260
  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-jisx0213
  shiftjisx0213
  iso-2022-jp-3
  jis0213-1-raw
  jis0213-2-raw
  ----------------------------------------------------------------
5129552c 269
270=back
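
As an illustration of how differently the CJK encodings above treat the
same character, this sketch encodes one ideograph that exists in the
Chinese and Japanese standards alike. The ideograph and the four
encodings picked are arbitrary; only encodings bundled with Encode are
used.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{4E2D}";    # CJK UNIFIED IDEOGRAPH-4E2D ("middle")

# Each module (Encode::CN, ::TW, ::JP) is loaded on demand.
my %hex = map { $_ => uc unpack( 'H*', encode( $_, $char ) ) }
    qw(euc-cn big5-eten euc-jp shiftjis);

printf "%-10s %s\n", $_, $hex{$_} for sort keys %hex;
```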
271
272=head2 Miscellaneous encodings
273
274=over 4
275
276=item Encode::EBCDIC
5d030b67 277
a999c27c 278See L<perlebcdic> for details.
5d030b67 279
67d7b5ef 280 ----------------------------------------------------------------
5d030b67 281 cp37
a999c27c 282 cp500
283 cp875
284 cp1026
285 cp1047
5d030b67 286 posix-bc
67d7b5ef 287 ----------------------------------------------------------------
5129552c 288
=item Encode::Symbol
5d030b67 290
5129552c 291For symbols and dingbats.
5d030b67 292
67d7b5ef 293 ----------------------------------------------------------------
5d030b67 294 symbol
295 dingbats
a999c27c 296 MacDingbats
297 AdobeZdingbat
298 AdobeSymbol
67d7b5ef 299 ----------------------------------------------------------------
300
e8c86ba6 301=item Encode::MIME::Header
302
Strictly speaking, the MIME header encoding documented in RFC 2047 is
more an encapsulation than an encoding, but it is included anyway.
305
306 ----------------------------------------------------------------
307 MIME-Header [RFC2047]
308 MIME-B [RFC2047]
309 MIME-Q [RFC2047]
310 ----------------------------------------------------------------
311
312=item Encode::Guess
313
This is not the name of an encoding but a utility that picks the most
appropriate encoding for a piece of data from a given list of
I<suspects>. See L<Encode::Guess> for details.
317
67d7b5ef 318=back
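
Two of the miscellaneous modules above lend themselves to a quick
sketch. The header strings and the sample data are made up for
illustration, and the guess relies on Encode::Guess's default suspects;
real data usually needs an explicit suspect list.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::Guess;    # exports guess_encoding

# RFC 2047 encoded-words, one "Q" and one "B" flavoured.
my $from_q = decode( 'MIME-Header', '=?ISO-8859-1?Q?caf=E9?=' );
my $from_b = decode( 'MIME-Header', '=?UTF-8?B?w6k=?=' );

printf "Q word: %d characters, last is U+%04X\n",
    length $from_q, ord substr( $from_q, -1 );
printf "B word: U+%04X\n", ord $from_b;

# guess_encoding returns an encoding object on success,
# or a plain string describing why it could not decide.
my $data    = "caf\xC3\xA9";
my $guess   = guess_encoding($data);
my $verdict = ref $guess ? $guess->name : "undecided: $guess";
print "guess: $verdict\n";
```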
319
320=head1 Unsupported encodings
321
0ab8f81e 322The following encodings are not supported as yet; some because they
323are rarely used, some because of technical difficulties. They may
324be supported by external modules via CPAN in the future, however.
67d7b5ef 325
326=over 4
327
328=item ISO-2022-JP-2 [RFC1554]
329
Not very popular yet. Needs the Unicode Database or an equivalent to
implement encode(), because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, and their code points in Unicode overlap. You
therefore need to look up the database to determine which character
set a given Unicode character should belong to.
335
962111ca 336=item ISO-2022-CN [RFC1922]
67d7b5ef 337
Not very popular. Needs CNS 11643-1 and -2, which are not available in
this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
Autrijus Tang may add support for this encoding in his module in the
future.
67d7b5ef 341
0ab8f81e 342=item Various HP-UX encodings
67d7b5ef 343
962111ca 344The following are unsupported due to the lack of mapping data.
345
67d7b5ef 346 '8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
962111ca 347 '15' - japanese15, korean15, and roi15
67d7b5ef 348
349=item Cyrillic encoding ISO-IR-111
350
0ab8f81e 351Anton Tagunov doubts its usefulness.
67d7b5ef 352
353=item ISO-8859-8-1 [Hebrew]
354
None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255
and MacHebrew are supported solely because mappings were available at
L<http://www.unicode.org/>). Contributions welcome.
358
359=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
360
361Ditto.
67d7b5ef 362
=item Vietnamese encoding TCVN
364
365Ditto.
366
367=item Vietnamese encodings VPS
368
Though Jungshik Shin has reported that Mozilla supports this encoding,
it was too late for us to add it before 5.8.0. In the future, it
may be available via a separate module. See
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
and
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
if you are interested in helping us.
67d7b5ef 376
962111ca 377=item Various Mac encodings
67d7b5ef 378
962111ca 379The following are unsupported due to the lack of mapping data.
a999c27c 380
381 MacArmenian, MacBengali, MacBurmese, MacEthiopic
382 MacExtArabic, MacGeorgian, MacKannada, MacKhmer
383 MacLaotian, MacMalayalam, MacMongolian, MacOriya
384 MacSinhalese, MacTamil, MacTelugu, MacTibetan
385 MacVietnamese
386
Those that are already available are based upon the vendor mappings
at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/>.
a999c27c 389
390=item (Mac) Indic encodings
391
The maps for the following are available at L<http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, which F<enc2xs> does not currently support:
67d7b5ef 395
a999c27c 396 MacDevanagari
397 MacGurmukhi
398 MacGujarati
67d7b5ef 399
For details, please see C<Unicode mapping issues and notes:> at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT>.

I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but the above were the only Indic encoding
maps that I could find at L<http://www.unicode.org/>.
5129552c 406
407=back
5d030b67 408
a999c27c 409=head1 Encoding vs. Charset -- terminology
5d030b67 410
We are used to using the terms (character) I<encoding> and I<character
set> interchangeably. But just as confusing the terms byte and
character is dangerous and the terms should be differentiated when
needed, we need to differentiate I<encoding> and I<character set>.
5d030b67 415
0ab8f81e 416To understand that, here is a description of how we make computers
417grok our characters.
a999c27c 418
419=over 4
420
421=item *
67d7b5ef 422
First we start with which characters to include. We call this
collection of characters a I<character repertoire>.
5d030b67 425
a999c27c 426=item *
5d030b67 427
a999c27c 428Then we have to give each character a unique ID so your computer can
0ab8f81e 429tell the difference between 'a' and 'A'. This itemized character
962111ca 430repertoire is now a I<character set>.
a63c962f 431
a999c27c 432=item *
433
If your computer can grok the character set without further
processing, you can go ahead and use it. This is called a I<coded
character set> (CCS) or I<raw character encoding>. ASCII is used this
way in most cases.
438
439=item *
440
0ab8f81e 441But in many cases, especially multi-byte CJK encodings, you have to
a999c27c 442tweak a little more. Your network connection may not accept any data
0ab8f81e 443with the Most Significant Bit set, and your computer may not be able to
a999c27c 444tell if a given byte is a whole character or just half of it. So you
445have to I<encode> the character set to use it.
446
447A I<character encoding scheme> (CES) determines how to encode a given
448character set, or a set of multiple character sets. 7bit ISO-2022 is
0ab8f81e 449an example of a CES. You switch between character sets via I<escape
450sequences>.
67d7b5ef 451
452=back
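
The steps above can be sketched in a few lines. The sample string is
arbitrary; C<iso-2022-jp> stands in here for "7bit ISO-2022".

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character set gives every character a unique ID.
printf "a => %d, A => %d\n", ord('a'), ord('A');

# A CES such as 7bit ISO-2022 switches character sets with escape sequences.
my $jis = encode( 'iso-2022-jp', "A\x{3042}B" );    # U+3042: HIRAGANA LETTER A

( my $visible = $jis ) =~ s/\e/<ESC>/g;
print "$visible\n";
```

Note that no byte in the result has its Most Significant Bit set, which
is the whole point of the 7-bit scheme.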
453
Technically, or mathematically, speaking, a character set encoded with
a CES that maps character by character may itself form a CCS. EUC is
one such example. The CES of EUC is as follows:
67d7b5ef 457
a999c27c 458=over 4
5d030b67 459
a999c27c 460=item *
5d030b67 461
a999c27c 462Map ASCII unchanged.
463
464=item *
465
Map a character set that consists of 94**N or 96**N members (N bytes
per character) by adding 0x80 to each byte.
468
469=item *
470
You can also use 0x8e and 0x8f to indicate that the following sequence of
characters belongs to yet another character set. 0x80 is added to each
following byte.
a999c27c 474
475=back
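
The three rules can be checked against Encode itself. This is only a
sketch: it assumes the C<jis0208-raw> encoding from Encode::JP yields
the bare two-byte JIS X 0208 code, and the sample characters are
arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{3042}";    # HIRAGANA LETTER A

# Rule 2: take the raw JIS X 0208 code and add 0x80 to each byte by hand.
my $raw     = encode( 'jis0208-raw', $char );
my $by_hand = join '', map { chr( ord($_) | 0x80 ) } split //, $raw;
my $euc     = encode( 'euc-jp', $char );

printf "raw %s + 0x80 per byte = %s, euc-jp = %s\n",
    map { uc unpack( 'H*', $_ ) } $raw, $by_hand, $euc;

# Rule 1: ASCII passes through unchanged.
my $ascii = encode( 'euc-jp', 'A' );

# Rule 3: halfwidth katakana sit in another set, announced by 0x8e.
my $kana = encode( 'euc-jp', "\x{FF71}" );    # HALFWIDTH KATAKANA LETTER A
printf "halfwidth kana: %s\n", uc unpack( 'H*', $kana );
```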
476
By looking carefully at the encoded byte sequence, you can see that
each byte sequence corresponds to a unique number. In that sense, EUC
is a CCS generated by the CES above from up to four CCS (complicated?).
UTF-8 falls into this category. See L<perlunicode/"UTF-8"> to find out
how UTF-8 maps Unicode to a byte sequence.
482
You may also have found out by now why 7bit ISO-2022 cannot comprise
a CCS. If you look at a byte sequence \x21\x21, you can't tell if
it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1
so you have no trouble differentiating between "!!" and S<" ">.
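
That ambiguity is easy to reproduce. The sketch below assumes, as
before, that C<jis0208-raw> decodes bare JIS X 0208 codes.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\x21\x21";

# Without knowing the active character set, both readings are valid.
my $as_ascii = decode( 'ascii',       $bytes );
my $as_jis   = decode( 'jis0208-raw', $bytes );

printf "as ASCII      : %d character(s), first is U+%04X\n",
    length $as_ascii, ord $as_ascii;
printf "as JIS X 0208 : %d character(s), first is U+%04X\n",
    length $as_jis, ord $as_jis;

# EUC moves the JIS X 0208 reading out of the ASCII range.
my $from_euc = decode( 'euc-jp', "\xA1\xA1" );
printf "euc-jp A1A1   : U+%04X\n", ord $from_euc;
```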
67d7b5ef 487
a63c962f 488=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
489
490This section tries to classify the supported encodings by their
491applicability for information exchange over the Internet and to
492choose the most suitable aliases to name them in the context of
493such communication.
494
0ab8f81e 495=over 4
67d7b5ef 496
497=item *
498
0ab8f81e 499To (en|de)code encodings marked by C<(**)>, you need
a999c27c 500C<Encode::HanExtra>, available from CPAN.
67d7b5ef 501
502=back
503
a63c962f 504Encoding names
5d030b67 505
f2a2953c 506 US-ASCII UTF-8 ISO-8859-* KOI8-R
507 Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
508 EUC-KR Big5 GB2312
a999c27c 509
0ab8f81e 510are registered with IANA as preferred MIME names and may
a999c27c 511be used over the Internet.
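
If you need the preferred MIME spelling for a canonical Encode name,
later Encode releases offer a C<mime_name> method on encoding objects.
The sketch below assumes such a release; the method did not exist when
this document was written, and the names queried are examples only.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# mime_name is an assumption about your Encode version; see the note above.
my %mime = map { $_ => find_encoding($_)->mime_name }
    qw(shiftjis euc-jp);

printf "%-9s => %s\n", $_, $mime{$_} for sort keys %mime;
```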
5d030b67 512
C<Shift_JIS> was made official by JIS X 0208:1997.
L<Microsoft-related naming mess> gives details.
5d030b67 515
a999c27c 516C<GB2312> is the IANA name for C<EUC-CN>.
517See L<Microsoft-related naming mess> for details.
518
519C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
f2a2953c 520with Encode. See L<Encode::CN> for details.
5d030b67 521
a63c962f 522 EUC-CN
f2a2953c 523 KOI8-U [RFC2319]
5d030b67 524
a999c27c 525have not been registered with IANA (as of March 2002) but
526seem to be supported by major web browsers.
0ab8f81e 527The IANA name for C<EUC-CN> is C<GB2312>.
67d7b5ef 528
529 KS_C_5601-1987
530
a999c27c 531is heavily misused.
532See L<Microsoft-related naming mess> for details.
533
C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
with Encode. See L<Encode::KR> for details.
536
537 UTF-16 UTF-16BE UTF-16LE
538
are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware, however, that:
542
0ab8f81e 543=over 4
f2a2953c 544
545=item *
5d030b67 546
C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support
5d030b67 550
f2a2953c 551=item *
552
c731e18e 553C<UTF-8> coded data seamlessly passes traditional
554command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
0ab8f81e 555data is likely to cause confusion (with its zero bytes,
f2a2953c 556for example)
557
558=item *
559
560it is beyond the power of words to describe the way HTML browsers
0ab8f81e 561encode non-C<ASCII> form data. To get a general impression, visit
f2a2953c 562L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
0ab8f81e 563While encoding of form data has stabilized for C<UTF-8> encoded pages
564(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
565expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
f2a2953c 566pages!
567
568=back
569
570The rule of thumb is to use C<UTF-8> unless you know what
c731e18e 571you're doing and unless you really benefit from using C<UTF-16>.
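
The zero-byte problem mentioned above is easy to see in a small
sketch (the sample word is arbitrary):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = 'cat';

my $utf8  = encode( 'UTF-8',    $text );
my $utf16 = encode( 'UTF-16LE', $text );

# ASCII text is byte-for-byte identical in UTF-8 ...
printf "UTF-8   : %s\n", uc unpack( 'H*', $utf8 );

# ... but every other byte is NUL in UTF-16.
printf "UTF-16LE: %s\n", uc unpack( 'H*', $utf16 );

my $nul_count = () = $utf16 =~ /\0/g;
print "NUL bytes in UTF-16LE output: $nul_count\n";
```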
a999c27c 572
  ISO-IR-165    [RFC1345]
  VISCII
  GB 12345
  GB 18030 (**)  (see links below)
  EUC-TW   (**)
5d030b67 578
579are totally valid encodings but not registered at IANA.
a63c962f 580The names under which they are listed here are probably the
581most widely-known names for these encodings and are recommended
582names.
583
f2a2953c 584 BIG5PLUS (**)
a63c962f 585
0ab8f81e 586is a proprietary name.
5d030b67 587
a999c27c 588=head2 Microsoft-related naming mess
589
590Microsoft products misuse the following names:
5d030b67 591
0ab8f81e 592=over 4
a63c962f 593
a999c27c 594=item KS_C_5601-1987
5d030b67 595
a999c27c 596Microsoft extension to C<EUC-KR>.
5d030b67 597
c731e18e 598Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).
67d7b5ef 599
f2a2953c 600See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
a999c27c 601for details.
5d030b67 602
Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
misuse. I<Raw> C<KS_C_5601-1987> encoding is available as
C<ksc5601-raw>.
5d030b67 606
f2a2953c 607See L<Encode::KR> for details.
67d7b5ef 608
a999c27c 609=item GB2312
67d7b5ef 610
a999c27c 611Microsoft extension to C<EUC-CN>.
a63c962f 612
a999c27c 613Proper names: C<CP936>, C<GBK>.
a63c962f 614
a999c27c 615C<GB2312> has been registered in the C<EUC-CN> meaning at
616IANA. This has partially repaired the situation: Microsoft's
617C<GB2312> has become a superset of the official C<GB2312>.
67d7b5ef 618
a999c27c 619Encode aliases C<GB2312> to C<euc-cn> in full agreement with
620IANA registration. C<cp936> is supported separately.
f2a2953c 621I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
a999c27c 622
f2a2953c 623See L<Encode::CN> for details.
a999c27c 624
625=item Big5
626
627Microsoft extension to C<Big5>.
628
629Proper name: C<CP950>.
630
631Encode separately supports C<Big5> and C<cp950>.
632
633=item Shift_JIS
634
635Microsoft's understanding of C<Shift_JIS>.
636
JIS has not endorsed the full Microsoft standard, however.
The official C<Shift_JIS> includes only the JIS X 0201 and JIS X 0208
character sets, while Microsoft has always used C<Shift_JIS>
to encode a wider character repertoire. See the C<IANA> registration
for C<Windows-31J>.
a999c27c 642
As the historical predecessor, Microsoft's variant
probably has a stronger claim to the name, though it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.
647
fcb875d4 648Unambiguous name: C<CP932>. C<IANA> name (not used?): C<Windows-31J>.
a999c27c 649
650Encode separately supports C<Shift_JIS> and C<cp932>.
651
652=back
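
The alias decisions described in this section can be inspected
directly. This sketch simply asks Encode what each contested name
resolves to; the list of names is taken from the items above.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Contested name => canonical Encode name it is aliased to.
my %resolved = map { $_ => find_encoding($_)->name }
    qw(GB2312 GBK KS_C_5601-1987 UHC Big5 Shift_JIS);

printf "%-15s => %s\n", $_, $resolved{$_} for sort keys %resolved;
```

The Microsoft variants remain reachable under their unambiguous
C<cp936>, C<cp949>, C<cp950> and C<cp932> names.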
653
654=head1 Glossary
655
0ab8f81e 656=over 4
a999c27c 657
658=item character repertoire
659
0ab8f81e 660A collection of unique characters. A I<character> set in the strictest
661sense. At this stage, characters are not numbered.
a999c27c 662
663=item coded character set (CCS)
664
665A character set that is mapped in a way computers can use directly.
0ab8f81e 666Many character encodings, including EUC, fall in this category.
a999c27c 667
668=item character encoding scheme (CES)
669
An algorithm to map a character set to a byte sequence. You don't
have to be able to tell which character set a given byte sequence
belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
example of something that is both a CCS and a CES.
674
f2a2953c 675=item charset (in MIME context)
676
The term has long been used to mean C<encoding>, that is, a CES.
678
0ab8f81e 679While the word combination C<character set> has lost this meaning
680in MIME context since [RFC 2130], the C<charset> abbreviation has
681retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:
f2a2953c 682
683 This document uses the term "charset" to mean a set of rules for
684 mapping from a sequence of octets to a sequence of characters, such
685 as the combination of a coded character set and a character encoding
686 scheme; this is also what is used as an identifier in MIME "charset="
687 parameters, and registered in the IANA charset registry ... (Note
688 that this is NOT a term used by other standards bodies, such as ISO).
689 [RFC 2277]
690
a999c27c 691=item EUC
692
0ab8f81e 693Extended Unix Character. See ISO-2022.
a999c27c 694
695=item ISO-2022
696
A CES that was carefully designed to coexist with ASCII. There is a
7-bit version and an 8-bit version.
f2a2953c 699
0ab8f81e 700The 7 bit version switches character set via escape sequence so it
f2a2953c 701cannot form a CCS. Since this is more difficult to handle in programs
0ab8f81e 702than the 8 bit version, the 7 bit version is not very popular except for
703iso-2022-jp, the I<de facto> standard CES for e-mails.
f2a2953c 704
0ab8f81e 705The 8 bit version can form a CCS. EUC and ISO-8859 are two examples
962111ca 706thereof. Pre-5.6 perl could use them as string literals.
a999c27c 707
708=item UCS
709
710Short for I<Universal Character Set>. When you say just UCS, it means
0ab8f81e 711I<Unicode>.
a999c27c 712
713=item UCS-2
714
715ISO/IEC 10646 encoding form: Universal Character Set coded in two
716octets.
717
718=item Unicode
719
0ab8f81e 720A character set that aims to include all character repertoires of the
962111ca 721world. Many character sets in various national as well as industrial
f2a2953c 722standards have become, in a way, just subsets of Unicode.
a999c27c 723
724=item UTF
725
f2a2953c 726Short for I<Unicode Transformation Format>. Determines how to map a
0ab8f81e 727Unicode character into a byte sequence.
a999c27c 728
729=item UTF-16
730
731A UTF in 16-bit encoding. Can either be in big endian or little
0ab8f81e 732endian. The big endian version is called UTF-16BE (equal to UCS-2 +
733surrogate support) and the little endian version is called UTF-16LE.
67d7b5ef 734
735=back
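
To make the UTF-16 entry concrete, here is a small sketch showing a
character inside the Basic Multilingual Plane next to one that needs a
surrogate pair. Both sample characters are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bmp    = "\x{3042}";     # HIRAGANA LETTER A, fits in 16 bits
my $astral = "\x{1F600}";    # beyond U+FFFF, needs a surrogate pair

my %hex = (
    bmp_be    => uc unpack( 'H*', encode( 'UTF-16BE', $bmp ) ),
    astral_be => uc unpack( 'H*', encode( 'UTF-16BE', $astral ) ),
    astral_le => uc unpack( 'H*', encode( 'UTF-16LE', $astral ) ),
);

printf "%-9s %s\n", $_, $hex{$_} for sort keys %hex;
```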
5d030b67 736
737=head1 See Also
738
L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>,
L<Encode::MIME::Header>, L<Encode::Guess>
5d030b67 744
a999c27c 745=head1 References
746
0ab8f81e 747=over 4
a999c27c 748
749=item ECMA
750
751European Computer Manufacturers Association
752L<http://www.ecma.ch>
753
0ab8f81e 754=over 4
a999c27c 755
0ab8f81e 756=item ECMA-035 (eq C<ISO-2022>)
a999c27c 757
758L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
759
0ab8f81e 760The specification of ISO-2022 is available from the link above.
a999c27c 761
762=back
763
764=item IANA
765
766Internet Assigned Numbers Authority
767L<http://www.iana.org/>
768
0ab8f81e 769=over 4
a999c27c 770
771=item Assigned Charset Names by IANA
772
773L<http://www.iana.org/assignments/character-sets>
774
775Most of the C<canonical names> in Encode derive from this list
776so you can directly apply the string you have extracted from MIME
448e90bb 777header of mails and web pages.
a999c27c 778
779=back
780
781=item ISO
782
783International Organization for Standardization
784L<http://www.iso.ch/>
785
786=item RFC
787
962111ca 788Request For Comments -- need I say more?
0ab8f81e 789L<http://www.rfc-editor.org/>, L<http://www.rfc.net/>,
790L<http://www.faqs.org/rfcs/>
a999c27c 791
792=item UC
793
794Unicode Consortium
795L<http://www.unicode.org/>
796
0ab8f81e 797=over 4
a999c27c 798
799=item Unicode Glossary
800
801L<http://www.unicode.org/glossary/>
802
962111ca 803The glossary of this document is based upon this site.
a999c27c 804
805=back
806
807=back
808
809=head2 Other Notable Sites
810
0ab8f81e 811=over 4
a999c27c 812
813=item czyborra.com
814
f2a2953c 815L<http://czyborra.com/>
a999c27c 816
Contains a lot of useful information, especially gory details of ISO
vs. vendor mappings.
819
820=item CJK.inf
821
822L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
823
824Somewhat obsolete (last update in 1996), but still useful. Also try
825
826L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
827
0ab8f81e 828You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.
a999c27c 829
f2a2953c 830=item Jungshik Shin's Hangul FAQ
831
832L<http://jshin.net/faq>
833
0ab8f81e 834And especially its subject 8.
f2a2953c 835
836L<http://jshin.net/faq/qa8.html>
837
962111ca 838A comprehensive overview of the Korean (C<KS *>) standards.
f2a2953c 839
0ab8f81e 840=item debian.org: "Introduction to i18n"
841
842A brief description for most of the mentioned CJK encodings is
843contained in
844L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
845
f2a2953c 846=back
847
848=head2 Offline sources
849
0ab8f81e 850=over 4
f2a2953c 851
852=item C<CJKV Information Processing> by Ken Lunde
853
854CJKV Information Processing
8551999 O'Reilly & Associates, ISBN : 1-56592-224-7
856
0ab8f81e 857The modern successor of C<CJK.inf>.
f2a2953c 858
Features comprehensive coverage of CJKV character sets and
encodings along with many other issues faced by anyone trying
to better support CJKV languages/scripts in all the areas of
information processing.
863
0ab8f81e 864To purchase this book, visit
f2a2953c 865L<http://www.oreilly.com/catalog/cjkvinfo/>
0ab8f81e 866or your favourite bookstore.
f2a2953c 867
a999c27c 868=back
869
5d030b67 870=cut