=head1 NAME

Encode::Supports -- Supported encodings by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names is ignored.
In addition, an encoding may have aliases. Each encoding has one
"canonical" name. The "canonical" name is chosen from the names of
the encoding by picking the first in the following sequence:

  o The MIME name as defined in IETF RFCs.
  o The name in the IANA registry.
  o The name used by the organization that defined it.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses the encoding object internally
once an operation is in progress.

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.
In other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to load those modules yourself to
make them available. Encode.pm will automatically load them on demand.

=head2 Built-in Encodings

The following encodings are always available.

  Canonical   Aliases
  -----------------------
  iso-8859-1  latin1
  US-ascii    ascii
  UCS-2       ucs2, iso-10646-1
  UCS-2le
  UTF-8       utf8
  -----------------------

=head2 Encode::Byte

The following encodings are single-byte encodings implemented as
extensions of ASCII. In most cases they use \x80-\xff (the upper half)
to map non-ASCII characters.


  -----------------------
  (iso-8859-1 is in built-in)
  iso-8859-2   latin2
  iso-8859-3   latin3
  iso-8859-4   latin4
  iso-8859-5
  iso-8859-6
  iso-8859-7
  iso-8859-8
  iso-8859-9   latin5
  iso-8859-10  latin6
  iso-8859-11
  (iso-8859-12 is nonexistent)
  iso-8859-13  latin7
  iso-8859-14  latin8
  iso-8859-15  latin9
  iso-8859-16  latin10

  koi8-f
  koi8-r
  koi8-u

  viscii       # ASCII + Vietnamese

  cp1250       WinLatin2
  cp1251       WinCyrillic
  cp1252       WinLatin1
  cp1253       WinGreek
  cp1254       WinTurkish
  cp1255       WinHebrew
  cp1256       WinArabic
  cp1257       WinBaltic
  cp1258       WinVietnamese
  # all cp* are also available as ibm-* and ms-*

  maccentraleuropean
  maccroatian
  macroman
  maccyrillic
  macromanian
  macsami
  macgreek
  macthai
  macicelandic
  macturkish
  macukraine

  nextstep
  gsm0338      # used in GSM handsets
  roman8       # what is this?
  -----------------------

=head2 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs. Charset"
below. Also note that these encodings are implemented in a distinct
module for each language, due to size concerns. Please also refer to
their respective documentation pages.

=over 4

=item Encode::CN -- Continental China

  -----------------------
  cp936        gbk
  euc-cn
  gb12345
  gb2312
  hz
  iso-ir-165
  -----------------------

=item Encode::JP -- Japan

  -----------------------
  7bit-jis     jis
  cp932
  euc-jp       ujis
  iso-2022-jp
  iso-2022-jp-1
  macjapan
  shiftjis     Shift_JIS, sjis
  -----------------------

=item Encode::KR -- Korea

  -----------------------
  euc-kr
  ksc5601
  cp949
  -----------------------

=item Encode::TW -- Taiwan

  -----------------------
  big5
  big5-hkscs
  cp950
  -----------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  -----------------------
  gb18030
  euc-tw
  big5plus
  -----------------------

=back

=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See perlebcdic for details.

  -----------------------
  cp1047
  cp37
  posix-bc
  -----------------------

=item Encode::Symbols

For symbols and dingbats.


  -----------------------
  symbol
  dingbats
  macdingbats
  -----------------------

=back

=head1 Encoding vs. Charset

Character encoding (or just "encoding") and character set (or just
"charset") are often used interchangeably, but they are different
concepts. A charset determines which characters may appear in a given
text. An encoding maps the characters of one or more charsets to a
stream of bits. Note that a given encoding may contain multiple
charsets; complex CJK encodings are usually implemented that way. For
instance, euc-jp contains ASCII, JIS X 0201-1978 (Hankaku Kana),
JIS X 0208-1997 (Zenkaku Kana and Kanji) and JIS X 0212-1990
(Extended Kanji) in a single encoding.

As the name suggests, the Encode module supports encodings, not
individual charsets.

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to choose
the most suitable aliases to name them in the context of such
communication.

Encoding names

  US-ASCII    UTF-8    ISO-8859-*   KOI8-R
  Shift_JIS   EUC-JP   ISO-2022-JP  ISO-2022-JP-1
  EUC-KR      Big5

are IANA-registered as preferred MIME names and can probably be used
over the Internet.

C<...> is no longer Microsoft-proprietary since it was made official
by JIS X 0208-1997. It is probably the most widespread encoding for
Japanese on the Internet.

EUC-CN has not been registered with IANA (as of March 2002) but seems
to be supported by major web browsers. (IANA has registered this
encoding as C<...>, but C<...> currently has a different meaning to
the C<Encode> module. It will probably become an alias to C<...> in
the future; until then it is safer to avoid using C<...> as an
encoding name within Perl.)

  UTF-16
  KOI8-U   (http://www.faqs.org/rfcs/rfc2319.html)

are IANA-registered (C<...> even as a preferred MIME name) but should
probably be avoided as encodings for web pages due to lack of browser
support.

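The canonical-name and alias rules described under "Encoding Names"
can be checked from code. The following is only a minimal sketch: it
assumes the installed Encode exports find_encoding() and that encoding
objects provide a name() method (neither call is shown elsewhere in
this document), and canonical_name() is a hypothetical helper
introduced here purely for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: resolve a user-supplied encoding name (canonical
# name or alias) to the canonical name Encode uses internally.
# Returns undef when Encode does not recognize the name.
sub canonical_name {
    my ($name) = @_;
    my $enc = Encode::find_encoding($name);
    return defined $enc ? $enc->name : undef;
}

# Try an alias, a canonical name, and a name that should not resolve.
for my $name ('latin1', 'iso-8859-1', 'no-such-encoding') {
    my $canon = canonical_name($name);
    printf "%-18s => %s\n", $name,
        defined $canon ? $canon : '(not recognized)';
}
```

Passing the canonical name rather than an arbitrary alias is probably
the safer choice when the name has to be sent over the Internet, since
aliases are specific to Encode.
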

  ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html)
  GBK
  VISCII
  GB 12345
  GB 18030 (*)  (see links below)
  EUC-TW   (*)

are perfectly valid encodings but are not registered at IANA. The names
under which they are listed here are probably the most widely known
names for these encodings and are the recommended names.

=for comment
this used to be listed as supported but does not work @15457
when it's clear they will be uncommented or deleted - Anton
  ISO-2022 (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
  CNS 11643 (only planes 1 and 2 available)

  BIG5PLUS (*)

is a somewhat proprietary name.

C<(*)>-marked encodings belong to C<Encode::HanExtra>, available from
CPAN.

You can probably get some information on CJK encodings at

=over 4

=item *

brief description for most of the mentioned CJK encodings
L<...>

=item *

several years old, but still useful
L<...>

=item *

and some in-depth reading for the heroes :-)
L<...>

=back

L<...> (eq C<...>) gives brief info on C<...>, C<...> and mostly on
C<...> F<...>

The information in this section is fragile and error-prone;
I<probably> is the most popular adverb :)

Please feel free to send your comments, disagreements and additions to
L<...>. (Note, however, that the mission of this document is to cover
the C<Encode>-supported encodings only.)

=head1 See Also

L<...>, L<...>, L<...>, L<...>, L<...>, L<...>, L<...>, L<...>

=cut