=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored. In addition, an encoding may have aliases.
Each encoding has one "canonical" name. The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence (with a few exceptions).

=over 2

=item *

The name used by the Perl community. That includes 'utf8' and 'ascii'.
Unlike aliases, canonical names directly reach the method so such
frequently used words like 'utf8' don't need to do alias lookups.

=item *

The MIME name as defined in IETF RFCs. This includes all "iso-"s.

=item *

The name in the IANA registry.

=item *

The name used by the organization that defined it.

=back

In case a I<de jure> canonical name differs from the one used by the
Encode module, the I<de jure> name is always provided as an alias,
as long as the encoding is implemented at all. So you can safely tell
whether a given encoding is implemented just by passing its canonical
name.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses an encoding object internally
once an operation is in progress.
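
As a rough illustration of the name handling described above, the
sketch below asks Encode for the encoding object behind several
spellings and prints its canonical name. It relies only on
C<find_encoding> and the object's C<name> method. The helper
C<canonical_name> is a hypothetical name introduced for this example,
and the canonical name reported for some spellings (C<UTF-8>, for
instance) may differ between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: return the canonical name for a spelling,
# or undef when Encode does not know the encoding.
sub canonical_name {
    my ($name) = @_;
    my $enc = find_encoding($name);    # resolves case, spaces and aliases
    return $enc ? $enc->name : undef;
}

for my $name ('latin1', 'ISO-8859-1', 'iso 8859 1', 'utf8', 'no-such-encoding') {
    my $canon = canonical_name($name);
    printf "%-18s => %s\n", $name, defined $canon ? $canon : '(not supported)';
}
```

An undefined result is the signal that the encoding is not implemented.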

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.
In other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules
but in most cases you don't have to C<use Encode::XX> to make them
available. Encode.pm will automatically load those modules on demand.

=head2 Built-in Encodings

The following encodings are always available.

  Canonical   Aliases               Comments & References
  ----------------------------------------------------------------
  ascii       US-ascii ISO-646-US   [ECMA]
  ascii-ctrl                        Special Encoding
  iso-8859-1  latin1                [ISO]
  null                              Special Encoding
  utf8        UTF-8                 [RFC2279]
  ----------------------------------------------------------------

I<null> and I<ascii-ctrl> are special. "null" fails for all characters,
so when you set fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
CHARACTERS will fall back to character references. Ditto for
"ascii-ctrl" except for control characters. For fallback modes, see
L<Encode>.
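
The fallback behaviour can be sketched with the plain C<ascii>
encoding, where only the non-ASCII character falls back. This is a
minimal example assuming the C<FB_PERLQQ>, C<FB_HTMLCREF> and
C<FB_XMLCREF> constants documented in L<Encode>; the variable names are
made up for illustration. The last call uses C<null>, for which every
character is expected to fall back.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{e9}";    # the last character is outside ASCII

# The FB_* constants include LEAVE_SRC, so $text is left unmodified.
my $perlqq  = encode('ascii', $text, Encode::FB_PERLQQ());
my $htmlref = encode('ascii', $text, Encode::FB_HTMLCREF());
my $xmlref  = encode('ascii', $text, Encode::FB_XMLCREF());

print "$perlqq\n";     # Perl-style escape for the unmappable character
print "$htmlref\n";    # caf&#233;
print "$xmlref\n";     # caf&#xe9;

# With "null" nothing can be encoded, so every character falls back.
my $word     = 'Perl';
my $all_refs = encode('null', $word, Encode::FB_XMLCREF());
print "$all_refs\n";
```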

=head2 Encode::Unicode -- other Unicode encodings

Unicode coding schemes other than native utf8 are supported by
Encode::Unicode, which will be autoloaded on demand.

  ----------------------------------------------------------------
  UCS-2BE   UCS-2, iso-10646-1    [IANA, UC]
  UCS-2LE                         [UC]
  UTF-16                          [UC]
  UTF-16BE                        [UC]
  UTF-16LE                        [UC]
  UTF-32                          [UC]
  UTF-32BE  UCS-4                 [UC]
  UTF-32LE                        [UC]
  UTF-7                           [RFC2152]
  ----------------------------------------------------------------

To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
see L<Encode::Unicode>.

UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
encoding. It is implemented separately by Encode::Unicode::UTF7.
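
To get a feel for how the byte orders differ, the following sketch
encodes a single character with several of the names from the table and
prints the resulting bytes in hex. C<hex_bytes> is a hypothetical
helper written for this example, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $octets;
}

my $char = "A";    # U+0041
for my $name (qw(UTF-16BE UTF-16LE UTF-16 UTF-32BE UTF-32LE)) {
    printf "%-9s %s\n", $name, hex_bytes(encode($name, $char));
}
```

Note that plain C<UTF-16> prepends a byte order mark when encoding,
while the explicit C<BE> and C<LE> variants do not.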

=head2 Encode::Byte -- Extended ASCII

Encode::Byte implements most single-byte encodings except for
Symbols and EBCDIC. The following encodings are based on single-byte
encodings implemented as extended ASCII. Most of them map
\x80-\xff (upper half) to non-ASCII characters.

=over 2

=item ISO-8859 and corresponding vendor mappings

Since there are so many, they are presented in table format with
languages and corresponding encoding names by vendors. Note that
the table is sorted in order of ISO-8859 and the corresponding vendor
mappings are slightly different from those of ISO. See
L<http://czyborra.com/charsets/iso8859.html> for details.

  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  ----------------------------------------------------------------
  N. America    (ASCII)         cp437        AdobeStandardEncoding
                                cp863 (DOSCanadaF)
  W. Europe     iso-8859-1      cp850   cp1252  MacRoman   nextstep
                                                           hp-roman8
                                cp860 (DOSPortuguese)
  Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                MacCroatian
                                                MacRomanian
                                                MacRumanian
  Latin3[1]     iso-8859-3
  Latin4[2]     iso-8859-4
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
    (See also next section)     cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11[3]  cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9 [4]    iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------

  [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
  [2] Baltics. Now on 8859-10, except for Latvian.
  [3] TIS 620 + Non-Breaking Space (0xA0 / U+00A0)
  [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
      letters that are missing from 8859-1 were added.

All cp* are also available as ibm-*, ms-*, and windows-*. See also
L<http://czyborra.com/charsets/codepages.html>.

Macintosh encodings don't seem to be registered in such entities as
IANA. "Canonical" names in Encode are based upon Apple's Tech Note
1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.

=item KOI8 - De Facto Standard for the Cyrillic world

Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net. L<Encode> comes with the following KOI charsets.
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.

  ----------------------------------------------------------------
  koi8-f
  koi8-r  cp878                   [RFC1489]
  koi8-u                          [RFC2319]
  ----------------------------------------------------------------

=back
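
The difference between the vendor mappings is easiest to see by taking
one Cyrillic letter through several of the single-byte encodings listed
above. The sketch below decodes one KOI8-R byte into a character and
re-encodes it for Windows and ISO; the variable names are illustrative
only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $koi8   = "\xC1";                        # CYRILLIC SMALL LETTER A in koi8-r
my $char   = decode('koi8-r', $koi8);       # the character U+0430
my $cp1251 = encode('cp1251', $char);       # same letter as Windows bytes
my $iso    = encode('iso-8859-5', $char);   # same letter as ISO bytes

printf "U+%04X koi8-r=%02X cp1251=%02X iso-8859-5=%02X\n",
    ord($char), ord($koi8), ord($cp1251), ord($iso);
```

The same character ends up as a different byte value in each encoding,
which is why a byte string is meaningless without its encoding name.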

=head2 gsm0338 - Hentai Latin 1

GSM0338 is for GSM handsets. Though it shares alphanumerics with
ASCII, control character ranges and other parts are mapped very
differently, mainly to store Greek characters. There are also escape
sequences (starting with 0x1B) to cover e.g. the Euro sign.

This was once handled by L<Encode::Byte> but because of all those
unusual specifications, Encode 2.20 has relocated the support to
L<Encode::GSM0338>. See L<Encode::GSM0338> for details.

=over 2

=item gsm0338 support before 2.19

Some special cases like a trailing 0x00 byte or a lone 0x1B byte are not
well-defined and decode() will return an empty string for them.
One possible workaround is

   # double a trailing 0x00 so that it survives decoding
   $gsm =~ s/\x00\z/\x00\x00/;
   $uni = decode("gsm0338", $gsm);
   # a lone trailing escape byte is represented by U+00A0
   $uni .= "\xA0" if $gsm =~ /\x1B\z/;

Note that the Encode implementation of GSM0338 does not implement the
reuse of Latin capital letters as Greek capital letters (for example,
the 0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
LETTER ZETA)).

The GSM0338 is also covered in Encode::Byte even though it is not
an "extended ASCII" encoding.

=back

=head2 CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs Charset"
below. Also note that these are implemented in distinct modules per
country, due to size concerns (simplified Chinese is mapped
to 'CN', continental China, while traditional Chinese is mapped to
'TW', Taiwan). Please refer to their respective documentation pages.

=over 2

=item Encode::CN -- Continental China

  Standard      DOS/Win   Macintosh       Comment/Reference
  ----------------------------------------------------------------
  euc-cn [1]              MacChineseSimp
  (gbk)         cp936 [2]
  gb12345-raw                             { GB12345 without CES }
  gb2312-raw                              { GB2312  without CES }
  hz
  iso-ir-165
  ----------------------------------------------------------------

  [1] GB2312 is aliased to this. See "Microsoft-related naming mess"
  [2] gbk is aliased to this. See "Microsoft-related naming mess"

=item Encode::JP -- Japan

  Standard      DOS/Win   Macintosh       Comment/Reference
  ----------------------------------------------------------------
  euc-jp
  shiftjis      cp932     macJapanese
  7bit-jis
  iso-2022-jp                             [RFC1468]
  iso-2022-jp-1                           [RFC2237]
  jis0201-raw   { JIS X 0201 (roman + halfwidth kana) without CES }
  jis0208-raw   { JIS X 0208 (Kanji + fullwidth kana) without CES }
  jis0212-raw   { JIS X 0212 (Extended Kanji)         without CES }
  ----------------------------------------------------------------

=item Encode::KR -- Korea

  Standard      DOS/Win   Macintosh       Comment/Reference
  ----------------------------------------------------------------
  euc-kr                  MacKorean       [RFC1557]
                cp949 [1]
  iso-2022-kr                             [RFC1557]
  johab                                   [KS X 1001:1998, Annex 3]
  ksc5601-raw                             { KSC5601 without CES }
  ----------------------------------------------------------------

  [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
      See below.

=item Encode::TW -- Taiwan

  Standard      DOS/Win   Macintosh       Comment/Reference
  ----------------------------------------------------------------
  big5-eten     cp950     MacChineseTrad  {big5 aliased to big5-eten}
  big5-hkscs
  ----------------------------------------------------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  Standard      DOS/Win   Macintosh       Comment/Reference
  ----------------------------------------------------------------
  big5ext                                 CMEX's Big5e Extension
  big5plus                                CMEX's Big5+ Extension
  cccii                                   Chinese Character Code for
                                          Information Interchange
  euc-tw                                  EUC (Extended Unix Code)
  gb18030                                 GBK with Traditional Characters
  ----------------------------------------------------------------

=item Encode::JIS2K -- JIS X 0213 encodings via CPAN

Due to size concerns, the additional Japanese encodings below are
distributed separately on CPAN, under the name Encode::JIS2K.

  Standard      DOS/Win   Macintosh       Comment/Reference
  ----------------------------------------------------------------
  euc-jisx0213
  shiftjisx0213
  iso-2022-jp-3
  jis0213-1-raw
  jis0213-2-raw
  ----------------------------------------------------------------

=back
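
As a small illustration of how the same character set is carried by
different Japanese encodings, the sketch below encodes one hiragana
letter with three of the names from the Encode::JP table and checks
that each result decodes back to the same character. C<hex_bytes> is a
hypothetical helper for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $octets;
}

my $char = "\x{3042}";    # HIRAGANA LETTER A
for my $name (qw(euc-jp shiftjis iso-2022-jp)) {
    my $octets = encode($name, $char);
    printf "%-12s %s\n", $name, hex_bytes($octets);

    # Decoding the bytes must give back the same character.
    die "round trip failed for $name" unless decode($name, $octets) eq $char;
}
```

The C<iso-2022-jp> result is longer because it wraps the two character
bytes in escape sequences that switch into and out of JIS X 0208.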

=head2 Miscellaneous encodings

=over 2

=item Encode::EBCDIC

See L<perlebcdic> for details.

  ----------------------------------------------------------------
  cp37
  cp500
  cp875
  cp1026
  cp1047
  posix-bc
  ----------------------------------------------------------------

=item Encode::Symbol

For symbols and dingbats.

  ----------------------------------------------------------------
  symbol
  dingbats
  MacDingbats
  AdobeZdingbat
  AdobeSymbol
  ----------------------------------------------------------------

=item Encode::MIME::Header

Strictly speaking, MIME header encoding documented in RFC 2047 is more
of encapsulation than encoding. However, support for it is imperative
in the modern world, so it is provided.

  ----------------------------------------------------------------
  MIME-Header                             [RFC2047]
  MIME-B                                  [RFC2047]
  MIME-Q                                  [RFC2047]
  ----------------------------------------------------------------
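
The encapsulation can be sketched as follows: a character string is
turned into RFC 2047 encoded words with C<MIME-B> and C<MIME-Q>, and
C<MIME-Header> decodes either form. The sample string and variable
names are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $subject = "\x{3042}";    # HIRAGANA LETTER A

my $b_form = encode('MIME-B', $subject);    # base64 encoded word
my $q_form = encode('MIME-Q', $subject);    # quoted-printable encoded word
print "$b_form\n$q_form\n";

# Both forms, and a hand-written encoded word, decode to the same string.
for my $header ($b_form, $q_form, '=?UTF-8?B?44GC?=') {
    print decode('MIME-Header', $header) eq $subject ? "ok\n" : "not ok\n";
}
```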

=item Encode::Guess

This one is not the name of an encoding but a utility that lets you
pick the most appropriate encoding for a piece of data out of given
I<suspects>. See L<Encode::Guess> for details.
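
A minimal sketch of guessing, assuming the C<guess_encoding> function
that L<Encode::Guess> exports: it returns an encoding object on success
and an error string otherwise, so the result has to be checked with
C<ref>. The helper C<guess_name> and the sample data are hypothetical.
With short or ambiguous input the guess may fail or name more than one
candidate.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;

# Hypothetical helper: name of the guessed encoding, or the error text.
sub guess_name {
    my ($octets, @suspects) = @_;
    my $guess = guess_encoding($octets, @suspects);
    return ref $guess ? $guess->name : "guess failed: $guess";
}

# Three kanji encoded as EUC-JP serve as sample data.
my $octets = encode('euc-jp', "\x{65e5}\x{672c}\x{8a9e}");
print guess_name($octets, qw(euc-jp shiftjis 7bit-jis)), "\n";
```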

=back

=head1 Unsupported encodings

The following encodings are not supported as yet; some because they
are rarely used, some because of technical difficulties. They may
be supported by external modules via CPAN in the future, however.

=over 2

=item ISO-2022-JP-2 [RFC1554]

Not very popular yet. Needs Unicode Database or equivalent to
implement encode() (because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap. So you
need to look up the database to determine to what character set a given
Unicode character should belong).

=item ISO-2022-CN [RFC1922]

Not very popular. Needs CNS 11643-1 and -2 which are not available in
this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
Autrijus Tang may add support for this encoding in his module in the
future.

=item Various HP-UX encodings

The following are unsupported due to the lack of mapping data.

  '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
  '15' - japanese15, korean15, and roi15

=item Cyrillic encoding ISO-IR-111

Anton Tagunov doubts its usefulness.

=item ISO-8859-8-1 [Hebrew]

None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255
and MacHebrew are supported only because there were mappings
available at L<http://www.unicode.org/>). Contributions welcome.

=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]

Ditto.

=item Thai encoding TCVN

Ditto.

=item Vietnamese encodings VPS

Though Jungshik Shin has reported that Mozilla supports this encoding,
the report came too late for us to add it before 5.8.0. In the future,
it may be available via a separate module. See
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
and
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
if you are interested in helping us.

=item Various Mac encodings

The following are unsupported due to the lack of mapping data.

  MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
  MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
  MacLaotian,   MacMalayalam, MacMongolian, MacOriya
  MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
  MacVietnamese

The rest which are already available are based upon the vendor mappings
at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .

=item (Mac) Indic encodings

The maps for the following are available at L<http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, currently unsupported by F<enc2xs>:

  MacDevanagari
  MacGurmukhi
  MacGujarati

For details, please see C<Unicode mapping issues and notes:> at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .

I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but the above were the only Indic encoding
maps that I could find at L<http://www.unicode.org/> .

=back

=head1 Encoding vs. Charset -- terminology

We are used to using the term (character) I<encoding> and I<character
set> interchangeably. But just as confusing the terms byte and
character is dangerous and the terms should be differentiated when
needed, we need to differentiate I<encoding> and I<character set>.

To understand that, here is a description of how we make computers
grok our characters.

=over 2

=item *

First we start with which characters to include. We call this
collection of characters I<character repertoire>.

=item *

Then we have to give each character a unique ID so your computer can
tell the difference between 'a' and 'A'. This itemized character
repertoire is now a I<character set>.

=item *

If your computer can grok the character set without further
processing, you can go ahead and use it. This is called a I<coded
character set> (CCS) or I<raw character encoding>. ASCII is used this
way for most cases.

=item *

But in many cases, especially multi-byte CJK encodings, you have to
tweak a little more. Your network connection may not accept any data
with the Most Significant Bit set, and your computer may not be able to
tell if a given byte is a whole character or just half of it. So you
have to I<encode> the character set to use it.

A I<character encoding scheme> (CES) determines how to encode a given
character set, or a set of multiple character sets. 7bit ISO-2022 is
an example of a CES. You switch between character sets via I<escape
sequences>.

=back

Technically, or mathematically, speaking, a character set encoded in
such a CES that maps character by character may form a CCS. EUC is such
an example. The CES of EUC is as follows:

=over 2

=item *

Map ASCII unchanged.

=item *

Map a character set that consists of 94 or 96 to the power of N
members by adding 0x80 to each byte.

=item *

You can also use 0x8e and 0x8f to indicate that the following sequence of
characters belongs to yet another character set. The value 0x80 is
added to each following byte.

=back

By carefully looking at the encoded byte sequence, you can find that the
byte sequence forms a unique number. In that sense, EUC is a CCS
generated by a CES above from up to four CCS (complicated?). UTF-8
falls into this category. See L<perlunicode/"UTF-8"> to find out how
UTF-8 maps Unicode to a byte sequence.

You may also have found out by now why 7bit ISO-2022 cannot comprise
a CCS. If you look at a byte sequence \x21\x21, you can't tell if
it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1
so you have no trouble differentiating between "!!" and S<" ">.
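
The "add 0x80 to each byte" rule of EUC can be checked directly with
the raw and encoded forms that Encode::JP provides. The sketch below
takes the C<jis0208-raw> value of a character, sets the high bit of
each byte by hand, and compares the result with what C<euc-jp>
produces. C<add_high_bit> is a hypothetical helper written for this
example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: apply the EUC rule "add 0x80 to each byte".
sub add_high_bit {
    my ($octets) = @_;
    return join '', map { chr(ord($_) + 0x80) } split //, $octets;
}

# IDEOGRAPHIC SPACE and HIRAGANA LETTER A
for my $char ("\x{3000}", "\x{3042}") {
    my $raw = encode('jis0208-raw', $char);    # CCS value, no CES applied
    my $euc = encode('euc-jp', $char);         # the same character in EUC
    printf "U+%04X raw=%s euc=%s rule %s\n",
        ord($char), unpack('H*', $raw), unpack('H*', $euc),
        add_high_bit($raw) eq $euc ? 'holds' : 'differs';
}
```

For IDEOGRAPHIC SPACE the raw value is the \x21\x21 mentioned above,
which only becomes distinguishable from "!!" once the high bits are set.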

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

=over 2

=item *

To (en|de)code encodings marked by C<(**)>, you need
C<Encode::HanExtra>, available from CPAN.

=back

Encoding names

  US-ASCII    UTF-8    ISO-8859-*   KOI8-R
  Shift_JIS   EUC-JP   ISO-2022-JP  ISO-2022-JP-1
  EUC-KR      Big5     GB2312

are registered with IANA as preferred MIME names and may
be used over the Internet.

C<Shift_JIS> has been made official by JIS X 0208:1997.
L</"Microsoft-related naming mess"> gives details.

C<GB2312> is the IANA name for C<EUC-CN>.
See L</"Microsoft-related naming mess"> for details.

C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
with Encode. See L<Encode::CN> for details.

  EUC-CN
  KOI8-U        [RFC2319]

have not been registered with IANA (as of March 2002) but
seem to be supported by major web browsers.
The IANA name for C<EUC-CN> is C<GB2312>.

  KS_C_5601-1987

is heavily misused.
See L</"Microsoft-related naming mess"> for details.

C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
with Encode. See L<Encode::KR> for details.

  UTF-16 UTF-16BE UTF-16LE

are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware however that

=over 2

=item *

C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support

=item *

C<UTF-8> coded data seamlessly passes traditional
command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
data is likely to cause confusion (with its zero bytes,
for example)

=item *

it is beyond the power of words to describe the way HTML browsers
encode non-C<ASCII> form data. To get a general impression, visit
L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
While encoding of form data has stabilized for C<UTF-8> encoded pages
(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
pages!

=back

The rule of thumb is to use C<UTF-8> unless you know what
you're doing and unless you really benefit from using C<UTF-16>.

  ISO-IR-165      [RFC1345]
  VISCII
  GB 12345
  GB 18030 (**)   (see links below)
  EUC-TW   (**)

are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are recommended
names.

  BIG5PLUS (**)

is a proprietary name.

=head2 Microsoft-related naming mess

Microsoft products misuse the following names:

=over 2

=item KS_C_5601-1987

Microsoft extension to C<EUC-KR>.

Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).

See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
for details.

Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
misuse. I<Raw> C<KS_C_5601-1987> encoding is available as
C<ksc5601-raw>.

See L<Encode::KR> for details.

=item GB2312

Microsoft extension to C<EUC-CN>.

Proper names: C<CP936>, C<GBK>.

C<GB2312> has been registered in the C<EUC-CN> meaning at
IANA. This has partially repaired the situation: Microsoft's
C<GB2312> has become a superset of the official C<GB2312>.

Encode aliases C<GB2312> to C<euc-cn> in full agreement with
IANA registration. C<cp936> is supported separately.
I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.

See L<Encode::CN> for details.

=item Big5

Microsoft extension to C<Big5>.

Proper name: C<CP950>.

Encode separately supports C<Big5> and C<cp950>.

=item Shift_JIS

Microsoft's understanding of C<Shift_JIS>.

JIS has not endorsed the full Microsoft standard however.
The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
character sets, while Microsoft has always used C<Shift_JIS>
to encode a wider character repertoire. See C<IANA> registration for
C<Windows-31J>.

As a historical predecessor, Microsoft's variant
probably has more rights for the name, though it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.

Unambiguous name: C<CP932>. C<IANA> name (also used by Mozilla, and
provided as an alias by Encode): C<Windows-31J>.

Encode separately supports C<Shift_JIS> and C<cp932>.

=back
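
How Encode resolves the contested names can be checked with a short
loop. The sketch below prints the canonical encoding that each name is
aliased to, using only C<find_encoding> and C<name>; the helper
C<resolved_name> is a hypothetical name for this example, and the
mappings in the test reflect the aliases described in this section.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: canonical name behind a (possibly misused) name.
sub resolved_name {
    my ($name) = @_;
    my $enc = find_encoding($name);
    return $enc ? $enc->name : undef;
}

for my $name ('KS_C_5601-1987', 'GB2312', 'GBK', 'Big5', 'Shift_JIS', 'Windows-31J') {
    my $canon = resolved_name($name);
    printf "%-15s => %s\n", $name, defined $canon ? $canon : '(not supported)';
}
```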

=head1 Glossary

=over 2

=item character repertoire

A collection of unique characters. A I<character set> in the strictest
sense. At this stage, characters are not numbered.

=item coded character set (CCS)

A character set that is mapped in a way computers can use directly.
Many character encodings, including EUC, fall in this category.

=item character encoding scheme (CES)

An algorithm to map a character set to a byte sequence. You don't
have to be able to tell which character set a given byte sequence
belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
example of being both a CCS and CES.

=item charset (in MIME context)

Has long been used to mean C<encoding>, that is, CES.

While the word combination C<character set> has lost this meaning
in MIME context since [RFC 2130], the C<charset> abbreviation has
retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:

  This document uses the term "charset" to mean a set of rules for
  mapping from a sequence of octets to a sequence of characters, such
  as the combination of a coded character set and a character encoding
  scheme; this is also what is used as an identifier in MIME "charset="
  parameters, and registered in the IANA charset registry ...  (Note
  that this is NOT a term used by other standards bodies, such as ISO).
  [RFC 2277]

=item EUC

Extended Unix Code. See ISO-2022.

=item ISO-2022

A CES that was carefully designed to coexist with ASCII. There is a
7-bit version and an 8-bit version.

The 7-bit version switches character sets via escape sequences so it
cannot form a CCS. Since this is more difficult to handle in programs
than the 8-bit version, the 7-bit version is not very popular except for
iso-2022-jp, the I<de facto> standard CES for e-mails.

The 8-bit version can form a CCS. EUC and ISO-8859 are two examples
thereof. Pre-5.6 perl could use them as string literals.

=item UCS

Short for I<Universal Character Set>. When you say just UCS, it means
I<Unicode>.

=item UCS-2

ISO/IEC 10646 encoding form: Universal Character Set coded in two
octets.

=item Unicode

A character set that aims to include all character repertoires of the
world. Many character sets in various national as well as industrial
standards have become, in a way, just subsets of Unicode.

=item UTF

Short for I<Unicode Transformation Format>. Determines how to map a
Unicode character into a byte sequence.

=item UTF-16

A UTF in 16-bit encoding. Can either be in big endian or little
endian. The big endian version is called UTF-16BE (equal to UCS-2 +
surrogate support) and the little endian version is called UTF-16LE.

=back

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>,
L<Encode::MIME::Header>, L<Encode::Guess>

=head1 References

=over 2

=item ECMA

European Computer Manufacturers Association
L<http://www.ecma.ch>

=over 2

=item ECMA-035 (eq C<ISO-2022>)

L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

The specification of ISO-2022 is available from the link above.

=back

=item IANA

Internet Assigned Numbers Authority
L<http://www.iana.org/>

=over 2

=item Assigned Charset Names by IANA

L<http://www.iana.org/assignments/character-sets>

Most of the C<canonical names> in Encode derive from this list
so you can directly apply the string you have extracted from MIME
header of mails and web pages.

=back

=item ISO

International Organization for Standardization
L<http://www.iso.ch/>

=item RFC

Request For Comments -- need I say more?
L<http://www.rfc-editor.org/>, L<http://www.rfc.net/>,
L<http://www.faqs.org/rfcs/>

=item UC

Unicode Consortium
L<http://www.unicode.org/>

=over 2

=item Unicode Glossary

L<http://www.unicode.org/glossary/>

The glossary of this document is based upon this site.

=back

=back

=head2 Other Notable Sites

=over 2

=item czyborra.com

L<http://czyborra.com/>

Contains a lot of useful information, especially gory details of ISO
vs. vendor mappings.

=item CJK.inf

L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Somewhat obsolete (last update in 1996), but still useful. Also try

L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.

=item Jungshik Shin's Hangul FAQ

L<http://jshin.net/faq>

And especially its subject 8.

L<http://jshin.net/faq/qa8.html>

A comprehensive overview of the Korean (C<KS *>) standards.

=item debian.org: "Introduction to i18n"

A brief description for most of the mentioned CJK encodings is
contained in
L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>

=back

=head2 Offline sources

=over 2

=item C<CJKV Information Processing> by Ken Lunde

CJKV Information Processing
1999 O'Reilly & Associates, ISBN: 1-56592-224-7

The modern successor of C<CJK.inf>.

Features a comprehensive coverage of CJKV character sets and
encodings along with many other issues faced by anyone trying
to better support CJKV languages/scripts in all the areas of
information processing.

To purchase this book, visit
L<http://www.oreilly.com/catalog/cjkvinfo/>
or your favourite bookstore.

=back

=cut