3 Encode::Supported -- Supported encodings by Encode
9 Encoding names are case insensitive. White space in names
10 is ignored. In addition an encoding may have aliases.
11 Each encoding has one "canonical" name. The "canonical"
12 name is chosen from the names of the encoding by picking
13 the first in the following sequence (with a few exceptions).
19 The name used by the perl community. That includes 'utf8' and 'ascii'.
Unlike aliases, canonical names reach the method directly, so
frequently used names like 'utf8' need no alias lookup.
The MIME name as defined in IETF RFCs. This includes all "iso-"'s.
29 The name in the IANA registry.
33 The name used by the organization that defined it.
In case I<de jure> canonical names differ from those of the Encode
module, they are always aliased if the encoding is ever implemented. So you can
39 safely tell if a given encoding is implemented or not just by passing
42 Because of all the alias issues, and because in the general case
43 encodings have state, "Encode" uses the encoding object internally
44 once an operation is in progress.
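As a minimal sketch of the name handling described above, the snippet below asks Encode for the encoding object behind a few names and prints each object's canonical name. The name list and the C<%canonical> hash are illustrative only; C<find_encoding> returns C<undef> for a name it cannot resolve.

    use strict;
    use warnings;
    use Encode qw(find_encoding);

    # Map a few user-supplied names to the canonical name Encode uses.
    my %canonical;
    for my $name ('latin1', 'ISO-8859-1', 'UTF8', 'no-such-encoding') {
        my $enc = find_encoding($name);    # encoding object, or undef
        $canonical{$name} = $enc ? $enc->name : undef;
        printf "%-18s => %s\n", $name,
            defined $canonical{$name} ? $canonical{$name} : '(not implemented)';
    }

Here 'latin1' is an alias and resolves to 'iso-8859-1', while 'UTF8' matches the canonical name 'utf8' case-insensitively.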
46 =head1 Supported Encodings
48 As of Perl 5.8.0, at least the following encodings are recognized.
49 Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'. In
51 other words, "ISO 8859 1" and "iso-8859-1" are identical.
53 Encodings are categorized and implemented in several different modules
54 but you don't have to C<use Encode::XX> to make them available for
most cases. Encode.pm will automatically load those modules on demand.
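The on-demand loading can be sketched as follows, assuming a standard Encode installation that ships Encode::JP. The variable names are illustrative; note that the script never says C<use Encode::JP>.

    use strict;
    use warnings;
    use Encode qw(encode);    # no "use Encode::JP" here

    my $hiragana_a = "\x{3042}";    # HIRAGANA LETTER A
    my $bytes      = encode('euc-jp', $hiragana_a);

    # Show the encoded octets in hex and whether Encode::JP got loaded.
    printf "euc-jp bytes: %s\n",
        join ' ', map { sprintf '%02X', ord $_ } split //, $bytes;
    printf "Encode::JP loaded: %s\n", $INC{'Encode/JP.pm'} ? 'yes' : 'no';

The first call that names a Japanese encoding is what triggers the load, so scripts that never touch those encodings do not pay for them.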
57 =head2 Built-in Encodings
59 The following encodings are always available.
61 Canonical Aliases Comments & References
62 ----------------------------------------------------------------
64 iso-8859-1 latin1 [ISO]
66 UCS-2 ucs2, iso-10646-1, UTF-16LE [IANA, UC]
68 ----------------------------------------------------------------
70 =head2 Encode::Byte -- Extended ASCII
Encode::Byte implements most single-byte encodings except for
Symbols and EBCDIC. The following encodings are single-byte
encodings implemented as extended ASCII. In most cases they use
\x80-\xff (the upper half) to map non-ASCII characters.
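To illustrate the use of the upper half, the sketch below encodes one non-ASCII character with several single-byte encodings and prints the resulting byte. The choice of character and the C<%byte_for> hash are illustrative only.

    use strict;
    use warnings;
    use Encode qw(encode);

    my $e_acute = "\x{E9}";    # LATIN SMALL LETTER E WITH ACUTE

    # The same character lands on different upper-half bytes
    # depending on the standard or vendor mapping.
    my %byte_for;
    for my $name (qw(iso-8859-1 cp437 cp1252 MacRoman)) {
        $byte_for{$name} = ord encode($name, $e_acute);
        printf "%-10s 0x%02X\n", $name, $byte_for{$name};
    }

Every result is at or above 0x80, but the exact byte depends on the mapping, which is why the encoding name has to travel with the data.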
79 =item ISO-8859 and corresponding vendor mappings
Since there are so many, they are presented in table format with
languages and the corresponding encoding names by vendor. Note that the table
is sorted in order of ISO-8859 and that the corresponding vendor mappings
are slightly different from those of ISO. See
85 L<http://czyborra.com/charsets/iso8859.html> for details.
87 Lang/Regions ISO/Other Std. DOS Windows Macintosh Others
88 ----------------------------------------------------------------
89 N. America (ASCII) cp437 AdobeStandardEncoding
91 W. Europe (iso-8859-1) cp850 cp1252 MacRoman nextstep
94 CE. Europe iso-8859-2 cp852 cp1250 MacCentralEurRoman
100 Cyrillics iso-8859-5 cp855 cp1251 MacCyrillic
101 (Also see next section) cp866 MacUkrainian
102 Arabic iso-8859-6 cp864 cp1256 MacArabic
104 Greek iso-8859-7 cp737 cp1253 MacGreek
106 Hebrew iso-8859-8 cp862 cp1255 MacHebrew
107 Turkish iso-8859-9 cp857 cp1254 MacTurkish
108 Nordics iso-8859-10 cp865
111 Thai iso-8859-11 cp874 MacThai
112 (iso-8859-12 is nonexistent. Reserved for Indics?)
113 Baltics iso-8859-13 cp775 cp1257
115 Latin9(*15) iso-8859-15
117 Vietnamese viscii cp1258 MacVietnamese
118 ----------------------------------------------------------------
(*3) Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
(*4) Baltics. Now on 8859-10.
(*15) Nicknamed Latin0; the Euro sign as well as French and Finnish
letters that are missing from 8859-1 are added.
125 All cp* are also available as ibm-*, ms-*, and windows-* . See also
126 L<http://czyborra.com/charsets/codepages.html>.
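A short sketch of the code page aliases, assuming the alias rules behave as described above. The sample names and the C<%resolved> hash are illustrative; a name that does not resolve is reported rather than treated as an error.

    use strict;
    use warnings;
    use Encode qw(find_encoding);

    # Resolve vendor-style spellings to Encode's canonical cp* name.
    my %resolved;
    for my $alias (qw(cp1252 windows-1252 ibm-850)) {
        my $enc = find_encoding($alias);
        $resolved{$alias} = $enc ? $enc->name : '(not resolved)';
        printf "%-14s => %s\n", $alias, $resolved{$alias};
    }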
128 Macintosh encodings don't seem to be registered in such entities as
129 IANA. "Canonical" names in Encode are based upon Apple's Tech Note
130 1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
133 =item KOI8 - De Facto Standard for Cyrillic world
Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more popular
on the Net. L<Encode> comes with the following KOI charsets. For
gory details, see L<http://czyborra.com/charsets/cyrillic.html>.
140 ----------------------------------------------------------------
142 koi8-r cp878 [RFC1489]
145 =item gsm0338 - Hentai Latin 1
GSM0338 is for GSM handsets. Though it shares alphanumerics with ASCII,
control character ranges and other parts are mapped very differently,
presumably to store Cyrillics. This one is also covered in
Encode::Byte even though it does not comply with extended ASCII.
154 =head2 The CJK: Chinese, Japanese, Korean (Multibyte)
Note that Vietnamese is listed above. Also read "Encoding vs Charset"
below. Also note that these are implemented in a distinct module for
each language, due to size concerns. Please also refer to their
respective document pages.
163 =item Encode::CN -- Continental China
  Standard      DOS/Win   Macintosh        Comment
  ----------------------------------------------------------------
  euc-cn                  MacChineseSimp   GB2312 is aliased to this
  (gbk)         cp936                      GBK is aliased to this
  gb12345-raw                              GB12345 as is
  gb2312-raw                               GB2312 as is
  ----------------------------------------------------------------
175 =item Encode::JP -- Japan
177 Standard DOS/Win Macintosh Comment/Reference
178 ----------------------------------------------------------------
180 shiftjis cp932 macJapanese
183 iso-2022-jp [RFC1468]
184 iso-2022-jp-1 [RFC2237]
185 ----------------------------------------------------------------
187 =item Encode::KR -- Korea
189 ----------------------------------------------------------------
190 euc-kr MacKorean [RFC1557]
191 cp949 ks_c_5601-1987 is an alias
193 iso-2022-kr [RFC1557]
194 johab [KS X 1001:1998, Annex 3]
195 ksc5601-raw KSC5601 as is
196 ----------------------------------------------------------------
198 =item Encode::TW -- Taiwan
200 ----------------------------------------------------------------
201 big5 cp950 MacChineseTrad
203 ----------------------------------------------------------------
205 =item Encode::HanExtra -- More Chinese via CPAN
207 Due to size concerns, additional Chinese encodings below are
208 distributed separately on CPAN, under the name Encode::HanExtra.
210 ----------------------------------------------------------------
214 ----------------------------------------------------------------
218 =head2 Miscellaneous encodings
224 See L<perlebcdic> for details.
226 ----------------------------------------------------------------
233 ----------------------------------------------------------------
235 =item Encode::Symbols
237 For symbols and dingbats.
239 ----------------------------------------------------------------
245 ----------------------------------------------------------------
249 =head1 Unsupported encodings
The following are not supported as yet, some because they are rarely
used, some because of technical difficulty. They may, however, be supported by
external modules via CPAN in the future.
257 =item ISO-2022-JP-2 [RFC1554]
Not very popular yet. Needs the Unicode Database or equivalent to
implement encode() (because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap, so you
need to look up the database to determine which character set a given
Unicode character should belong to).
265 =item ISO-2022-CN [RFC1922]
267 Not very popular. Needs CNS 11643-1 and 2 which are not available in
268 this module. CNS 11643 is supported (via euc-tw) in
269 Encode::HanExtra. Autrijus may add support for this encoding in his
=item various HP-UX encodings
The following are unsupported due to the lack of mapping data.
276 '8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
277 '15' - japanese15, korean15, and roi15
279 =item Cyrillic encoding ISO-IR-111
281 Anton doubts its usefulness.
283 =item ISO-8859-8-1 [Hebrew]
None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
MacHebrew are supported only because there were mappings
available at L<http://www.unicode.org/>). Contributions welcome.
289 =item Thai encoding TCVN
293 =item Vietnamese encodings VPS
Though Jungshik has reported that Mozilla supports this encoding, it was
too late for us to add it. It may be supported in the future via a
separate module. See
296 L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf> and
297 L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
298 if you are interested in helping us.
300 =item various Mac encodings
The following are unsupported due to the lack of mapping data.
304 MacArmenian, MacBengali, MacBurmese, MacEthiopic
305 MacExtArabic, MacGeorgian, MacKannada, MacKhmer
306 MacLaotian, MacMalayalam, MacMongolian, MacOriya
307 MacSinhalese, MacTamil, MacTelugu, MacTibetan
The Mac encodings that are already available are based upon the vendor mappings at
311 L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
313 =item (Mac) Indic encodings
The maps for the following are available at L<http://www.unicode.org/>
but they remain unsupported because those encodings need an algorithmic
approach, which F<enc2xs> does not support.
323 For details, please see C<Unicode mapping issues and notes:> at
324 L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but those mentioned were the only Indic encoding
maps that I could find at L<http://www.unicode.org/> .
332 =head1 Encoding vs. Charset -- terminology
We are used to using the terms (character) I<encoding> and I<character set>
interchangeably. But just as confusing the terms byte and character is
dangerous and the two should be differentiated when needed, we need to
differentiate I<encoding> and I<character set>.
To understand that, let's follow how we make computers grok our characters.
345 First we start with which characters to include. We call this
346 collection of characters I<character repertoire>.
Then we have to give each character a unique ID so your computer can
tell the difference between 'a' and 'A'. This itemized character
repertoire is now a I<character set>.
If your computer can grok the character set without further
processing, you can go ahead and use it. This is called a I<coded
358 character set> (CCS) or I<raw character encoding>. ASCII is used this
But in many cases, especially with multi-byte CJK encodings, you have to
tweak a little more. Your network connection may not accept any data
with the Most Significant Bit set, or your computer may not be able to
tell if a given byte is a whole character or just half of it. So you
have to I<encode> the character set to use it.
369 A I<character encoding scheme> (CES) determines how to encode a given
370 character set, or a set of multiple character sets. 7bit ISO-2022 is
371 an example of CES. You switch between character sets via I<escape
Technically, or mathematically speaking, a character set encoded in
such a CES that maps character by character may form a CCS. EUC is such
an example. The CES of EUC is as follows:
Map a character set that consists of 94 or 96 to the power of N
members by adding 0x80 to each byte.
You can also use 0x8e and 0x8f to tell that the following sequence of
characters belongs to yet another character set. Each following byte
By carefully looking at the encoded byte sequence, you may find that the
byte sequence forms a unique number. In that sense EUC is a CCS
generated by the CES above from up to four CCSs (complicated?). UTF-8
falls into this category. See L<perlunicode/"UTF-8"> to find how
UTF-8 maps Unicode to a byte sequence.
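The "add 0x80 to each byte" rule can be checked with a small sketch that compares a raw coded character set with its EUC form. It assumes the C<gb2312-raw> and C<euc-cn> encodings from Encode::CN are installed; the sample character and variable names are illustrative only.

    use strict;
    use warnings;
    use Encode qw(encode);

    my $char = "\x{4E2D}";    # CJK ideograph U+4E2D

    my $raw = encode('gb2312-raw', $char);    # CCS code, 7-bit bytes
    my $euc = encode('euc-cn',     $char);    # same code after the EUC step

    # Apply the EUC rule by hand: add 0x80 to every raw byte.
    my $shifted = join '', map { chr(ord($_) + 0x80) } split //, $raw;

    printf "raw: %s  euc: %s  raw+0x80: %s\n",
        map { unpack 'H*', $_ } $raw, $euc, $shifted;
    print $shifted eq $euc ? "rule holds\n" : "rule does not hold\n";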
You may also see by now why 7bit ISO-2022 cannot form a CCS. If
you look at a byte sequence \x21\x21, you can't tell if it is two !'s
or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1 so you have no
trouble telling "!!" from an ideographic space.
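The ambiguity can be seen directly with a sketch that encodes both strings in the two schemes, assuming Encode::JP is installed. The C<show> helper and the C<%out> hash are hypothetical names introduced only for this illustration.

    use strict;
    use warnings;
    use Encode qw(encode);

    # Render octets readably: printable ASCII as is, the rest as \xNN.
    sub show {
        my ($bytes) = @_;
        $bytes =~ s/([^\x21-\x7E])/sprintf '\\x%02X', ord $1/ge;
        return $bytes;
    }

    my %sample = (
        'two bangs'         => "!!",
        'ideographic space' => "\x{3000}",
    );

    my %out;
    for my $enc (qw(iso-2022-jp euc-jp)) {
        for my $label (sort keys %sample) {
            $out{$enc}{$label} = encode($enc, $sample{$label});
            printf "%-12s %-18s %s\n", $enc, $label, show($out{$enc}{$label});
        }
    }

In the ISO-2022-JP output the payload bytes of the ideographic space are the same \x21\x21 as the two exclamation marks; only the surrounding escape sequences distinguish them. In EUC-JP the bytes themselves differ.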
410 =head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
412 This section tries to classify the supported encodings by their
413 applicability for information exchange over the Internet and to
414 choose the most suitable aliases to name them in the context of
To (en|de)code encodings marked as C<(*)>, you need
422 C<Encode::HanExtra>, available from CPAN.
428 US-ASCII UTF-8 ISO-8859-* KOI8-R
429 Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
are registered with IANA as preferred MIME names and may
be used over the Internet.
C<Shift_JIS> has been made official by JIS X 0208-1997.
436 L<Microsoft-related naming mess> gives details.
438 C<GB2312> is the IANA name for C<EUC-CN>.
439 See L<Microsoft-related naming mess> for details.
441 C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
442 with Encode. See L<Encode::CN -- Continental China> for details.
445 KOI8-U (http://www.faqs.org/rfcs/rfc2319.html)
447 have not been registered with IANA (as of March 2002) but
448 seem to be supported by major web browsers.
449 IANA name for C<EUC-CN> is C<GB2312>.
454 See L<Microsoft-related naming mess> for details.
C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
457 with Encode. See L<Encode::KR -- Korea> for details.
462 waiting for comments from Jungshik Shin to soften this - Anton
is an IANA-registered preferred MIME name
465 but probably should be avoided as encoding for web pages due to
466 the lack of browser support.
468 ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html)
GB 18030 (*) (see links below)
475 are totally valid encodings but not registered at IANA.
476 The names under which they are listed here are probably the
477 most widely-known names for these encodings and are recommended
is a somewhat proprietary name.
484 =head2 Microsoft-related naming mess
486 Microsoft products misuse the following names:
492 Microsoft extension to C<EUC-KR>.
494 Proper name: C<CP949>.
497 http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html
Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect
this common misuse.
I<Raw> C<KS_C_5601-1987> encoding is available as C<ksc5601-raw>.
504 See L<Encode::KR -- Korea> for details.
508 Microsoft extension to C<EUC-CN>.
510 Proper names: C<CP936>, C<GBK>.
512 C<GB2312> has been registered in the C<EUC-CN> meaning at
513 IANA. This has partially repaired the situation: Microsoft's
514 C<GB2312> has become a superset of the official C<GB2312>.
516 Encode aliases C<GB2312> to C<euc-cn> in full agreement with
517 IANA registration. C<cp936> is supported separately.
I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
520 See L<Encode::CN -- Continental China> for details.
524 Microsoft extension to C<Big5>.
526 Proper name: C<CP950>.
528 Encode separately supports C<Big5> and C<cp950>.
532 Microsoft's understanding of C<Shift_JIS>.
534 JIS has not endorsed the full Microsoft standard however.
535 The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
536 subsets, while Microsoft has always been meaning C<Shift_JIS> to
537 encode a wider character repertoire.
539 As a historical predecessor Microsoft's variant
540 probably has more rights for the name, albeit it may be objected
541 that Microsoft shouldn't have used JIS as part of the name
Unambiguous name: C<CP932>.
546 Encode separately supports C<Shift_JIS> and C<cp932>.
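The alias decisions described in this section can be inspected with a minimal sketch like the one below, assuming Encode::KR, Encode::CN and Encode::JP are installed. The list of names and the C<%resolved> hash are illustrative only.

    use strict;
    use warnings;
    use Encode qw(find_encoding);

    # Ask Encode which implementation each commonly seen name maps to.
    my %resolved;
    for my $name ('KS_C_5601-1987', 'GB2312', 'Shift_JIS', 'cp932') {
        my $enc = find_encoding($name);
        $resolved{$name} = $enc ? $enc->name : '(not implemented)';
        printf "%-16s => %s\n", $name, $resolved{$name};
    }

C<Shift_JIS> and C<cp932> resolve to different encoding objects, so data labeled one way is not silently decoded the other way.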
554 =item character repertoire
A collection of unique characters. A I<character> set in the
strictest sense. At this stage characters are not numbered.
559 =item coded character set (CCS)
561 A character set that is mapped in a way computers can use directly.
Many character encodings including EUC fall into this category.
564 =item character encoding scheme (CES)
An algorithm to map a character set to a byte sequence. You don't
have to be able to tell which character set a given byte sequence
belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
example of being both a CCS and a CES.
Extended Unix Code. See ISO-2022.
A CES that was carefully designed to coexist with ASCII. There are a 7-bit
version and an 8-bit version. The 8-bit version can form a CCS. EUC
and ISO-8859 are two examples thereof.
583 Short for I<Universal Character Set>. When you say just UCS, it means
588 ISO/IEC 10646 encoding form: Universal Character Set coded in two
A character set that aims to include all character
repertoires of the world. Many character sets in various national as
well as industrial standards are therefore subsets thereof.
Short for I<Unicode Transformation Format>. Determines how to map a
Unicode character into a byte sequence.
604 A UTF in 16-bit encoding. Can either be in big endian or little
605 endian. Big endian version is called UTF-16BE and little endian
614 L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
615 L<Encode::EBCDIC>, L<Encode::Symbol>
623 European Computer Manufacturers Association
624 L<http://www.ecma.ch>
=item ECMA-035 (eq C<ISO-2022>)
630 L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
The very specification of ISO-2022 is available from the link above.
638 Internet Assigned Numbers Authority
639 L<http://www.iana.org/>
643 =item Assigned Charset Names by IANA
645 L<http://www.iana.org/assignments/character-sets>
Most of the C<canonical names> in Encode derive from this list,
so you can directly apply the string you have extracted from the MIME
headers of mails and web pages.
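As a rough sketch of that workflow, the snippet below pulls the charset out of a header line and passes it straight to Encode. The header, the body octets, and the regular expression are made up for illustration and do not amount to a full MIME parser.

    use strict;
    use warnings;
    use Encode qw(decode find_encoding);

    # Hypothetical input: one header line and the raw body octets.
    my $header = 'Content-Type: text/plain; charset=ISO-8859-1';
    my $body   = "caf\xE9";

    # Naive extraction of the charset parameter (illustration only).
    my ($charset) = $header =~ /charset="?([^";\s]+)"?/i;
    defined $charset or die "no charset parameter found\n";

    # Check that Encode knows the name before decoding with it.
    find_encoding($charset) or die "unsupported charset: $charset\n";
    my $text = decode($charset, $body);

    printf "%s: %d characters\n", $charset, length $text;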
655 International Organization for Standardization
656 L<http://www.iso.ch/>
660 Request For Comment -- need I say more?
661 L<http://www.rfc.net/>
666 L<http://www.unicode.org/>
670 =item Unicode Glossary
672 L<http://www.unicode.org/glossary/>
The glossary of this document is based upon this site.
680 =head2 Other Notable Sites
L<http://czyborra.com/>
Contains a lot of useful information, especially gory details of ISO
693 L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
695 Somewhat obsolete (last update in 1996), but still useful. Also try
697 L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
699 You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
705 I could not find this page because the hostname doesn't resolve!
707 Brief description for most of the mentioned CJK encodings
708 L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>