=head1 NAME
-Encode::Supports -- Supported encodings by Encode
+Encode::Supported -- Supported encodings by Encode
=head1 DESCRIPTION
is ignored. In addition an encoding may have aliases.
Each encoding has one "canonical" name. The "canonical"
name is chosen from the names of the encoding by picking
-he first in the following sequence:
+the first in the following sequence (with a few exceptions):
- o The MIME name as defined in IETF RFCs.
- o The name in the IANA registry.
- o The name used by the organization that defined it.
+=over
+
+=item *
+
+The name used by the perl community. That includes 'utf8' and 'ascii'.
+Unlike aliases, canonical names reach the method directly, so
+frequently used names like 'utf8' should do without alias lookups.
+
+=item *
+
+The MIME name as defined in IETF RFCs. This includes all "iso-" names.
+
+=item *
+
+The name in the IANA registry.
+
+=item *
+
+The name used by the organization that defined it.
+
+=back
+
+Where a I<de jure> canonical name differs from the one used by the
+Encode module, it is always aliased, provided the encoding is
+implemented at all. So you can safely tell whether a given encoding is
+implemented just by passing its canonical name.
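+
+A minimal sketch of that check, assuming the C<find_encoding> function
+exported by L<Encode>. The helper name C<is_implemented> is made up for
+this illustration and is not part of Encode:
+
+    use strict;
+    use warnings;
+    use Encode qw(find_encoding);
+
+    # Hypothetical helper: true if Encode can resolve the given name.
+    sub is_implemented {
+        my ($name) = @_;
+        return defined( find_encoding($name) ) ? 1 : 0;
+    }
+
+    # Report the canonical name for each candidate, if there is one.
+    for my $name (qw(iso-8859-1 latin1 no-such-encoding)) {
+        my $enc = find_encoding($name);
+        printf "%-18s %s\n", $name,
+            defined $enc ? 'canonical name: ' . $enc->name
+                         : 'not implemented';
+    }
+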
Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses the encoding object internally
The following encodings are always available.
- Canonical Aliases
- -----------------------
- iso-8859-1 latin1
- US-ascii ascii
- UCS-2 ucs2, iso-10646-1
- UCS-2le
- UTF-8 utf8
- -----------------------
-
-=head2 Encode::Byte
-
-The following encodings are based single-byte encoding implemented as
-extended ASCII. For most cases it uses \x80-\xff (upper half) to map
-non-ASCII characters.
-
- -----------------------
- (iso-8859-1 is in built-in)
- iso-8859-2 latin2
- iso-8859-3 latin3
- iso-8859-4 latin4
- iso-8859-5
- iso-8859-6
- iso-8859-7
- iso-8859-8
- iso-8859-9 latin5
- iso-8859-10 latin6
- iso-8859-11
- (iso-8859-12 is nonexistent)
- iso-8859-13 latin7
- iso-8859-14 latin8
- iso-8859-15 latin9
- iso-8859-16 latin10
-
- koi8-f
- koi8-r
- koi8-u
-
- viscii # ASCII + vietnamese
-
- cp1250 WinLatin2
- cp1251 WinCyrillic
- cp1252 WinLatin1
- cp1253 WinGreek
- cp1254 WinTurkiskh
- cp1255 WinHebrew
- cp1256 WinArabic
- cp1257 WinBaltic
- cp1258 WinVietnamese
- # all cp* are also available as ibm-* and ms-*
-
- maccentraleuropean
- maccroatian
- macroman
- maccyrillic
- macromanian
- macsami
- macgreek
- macthai
- macicelandic
- macturkish
- macukraine
-
- nextstep
- gsm0338 # used in GSM handsets
- roman8 # what is this?
- -----------------------
+ Canonical Aliases Comments & References
+ ----------------------------------------------------------------
+ ascii US-ascii [ECMA]
+ iso-8859-1 latin1 [ISO]
+ utf8 UTF-8 [RFC2279]
+ UCS-2BE UCS-2, iso-10646-1 [IANA, UC]
+ UCS-2LE [UC]
+ UTF-16 [UC]
+ UTF-16BE [UC]
+ UTF-16LE [UC]
+ UTF-32 [UC]
+ UTF-32BE [UC]
+ UTF-32LE [UC]
+ ----------------------------------------------------------------
+
+To find out how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
+see L<Encode::Unicode>.
+
+=head2 Encode::Byte -- Extended ASCII
+
+Encode::Byte implements most single-byte encodings except for
+Symbols and EBCDIC. The following encodings are single-byte encodings
+implemented as extended ASCII. In most cases they use \x80-\xff (the
+upper half) to map non-ASCII characters.
+
+=over 2
+
+=item ISO-8859 and corresponding vendor mappings
+
+Since there are so many, they are presented in table format with
+languages and the corresponding encoding names by vendor. Note that the
+table is sorted in ISO-8859 order and that the vendor mappings differ
+slightly from those of ISO. See
+L<http://czyborra.com/charsets/iso8859.html> for details.
+
+  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
+  ----------------------------------------------------------------
+  N. America    (ASCII)         cp437                      AdobeStandardEncoding
+                                cp863 (DOSCanadaF)
+  W. Europe     (iso-8859-1)    cp850   cp1252  MacRoman   nextstep
+                                                           hp-roman8
+                                cp860 (DOSPortuguese)
+  CE. Europe    iso-8859-2      cp852   cp1250  MacCentralEurRoman
+                                                MacCroatian
+                                                MacRomanian
+                                                MacRumanian
+  Latin3(*3)    iso-8859-3
+  Latin4(*4)    iso-8859-4
+  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
+    (Also see next section)     cp866           MacUkrainian
+  Arabic        iso-8859-6      cp864   cp1256  MacArabic
+                                cp1006          MacFarsi
+  Greek         iso-8859-7      cp737   cp1253  MacGreek
+                                cp869 (DOSGreek2)
+  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
+  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
+  Nordics       iso-8859-10     cp865
+                                cp861           MacIcelandic
+                                                MacSami
+  Thai          iso-8859-11     cp874           MacThai
+  (iso-8859-12 is nonexistent. Reserved for Indics?)
+  Baltics       iso-8859-13     cp775   cp1257
+  Celtics       iso-8859-14
+  Latin9(*15)   iso-8859-15
+  Latin10       iso-8859-16
+  Vietnamese    viscii                  cp1258  MacVietnamese
+  ----------------------------------------------------------------
+
+  (*3)  Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
+  (*4)  Baltics. Now on 8859-10.
+  (*15) Nicknamed Latin0; the Euro sign as well as French and Finnish
+        letters that are missing from 8859-1 are added.
+
+All cp* are also available as ibm-*, ms-*, and windows-*. See also
+L<http://czyborra.com/charsets/codepages.html>.
+
+Macintosh encodings don't seem to be registered with bodies such as
+IANA. "Canonical" names in Encode are based upon Apple's Tech Note
+1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
+for details.
+
+=item KOI8 - De Facto Standard for the Cyrillic world
+
+Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more popular
+on the Net. L<Encode> comes with the following KOI charsets. For
+gory details, see L<http://czyborra.com/charsets/cyrillic.html>.
+
+ ----------------------------------------------------------------
+ koi8-f
+ koi8-r cp878 [RFC1489]
+ koi8-u [RFC2319]
+
+=item gsm0338 - Hentai Latin 1
+
+GSM0338 is for GSM handsets. Though it shares alphanumerics with ASCII,
+control character ranges and other parts are mapped very differently,
+presumably to store Greek and Cyrillic alphabets. This one is also
+covered in Encode::Byte even though it does not comply with extended
+ASCII.
+
+=back
=head2 The CJK: Chinese, Japanese, Korean (Multibyte)
=item Encode::CN -- Continental China
- -----------------------
- cp936 gbk
- euc-cn
- gb12345
- gb2312
+ Standard DOS/Win Macintosh Comment/Reference
+ ----------------------------------------------------------------
+ euc-cn(*1) MacChineseSimp
+ (gbk) cp936 (*2)
+ gb12345-raw { GB12345 without CES }
+ gb2312-raw { GB2312 without CES }
hz
iso-ir-165
- -----------------------
+ ----------------------------------------------------------------
+
+ (*1) GB2312 is aliased to this. See L<Microsoft-related naming mess>
+ (*2) gbk is aliased to this. See L<Microsoft-related naming mess>
=item Encode::JP -- Japan
- -----------------------
- 7bit-jis jis
- cp932
- euc-jp ujis
- iso-2022-jp
- iso-2022-jp-1
- macjapan
- shiftjis Shift_JIS, sjis
- -----------------------
+ Standard DOS/Win Macintosh Comment/Reference
+ ----------------------------------------------------------------
+ euc-jp
+ shiftjis cp932 macJapanese
+ 7bit-jis
+ iso-2022-jp [RFC1468]
+ iso-2022-jp-1 [RFC2237]
+ jis0201-raw { JIS X 0201 (roman + halfwidth kana) without CES }
+ jis0208-raw { JIS X 0208 (Kanji + fullwidth kana) without CES }
+ jis0212-raw { JIS X 0212 (Extended Kanji) without CES }
+ ----------------------------------------------------------------
=item Encode::KR -- Korea
- -----------------------
- euc-kr
- ksc5601
- cp949
- -----------------------
-
+ Standard DOS/Win Macintosh Comment/Reference
+ ----------------------------------------------------------------
+ euc-kr MacKorean [RFC1557]
+ cp949 (*)
+ iso-2022-kr [RFC1557]
+ johab [KS X 1001:1998, Annex 3]
+ ksc5601-raw { KSC5601 without CES }
+ ----------------------------------------------------------------
+
+ (*) ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to
+ this. See below.
+
+
=item Encode::TW -- Taiwan
- -----------------------
- big5
+ Standard DOS/Win Macintosh Comment/Reference
+ ----------------------------------------------------------------
+ big5 cp950 MacChineseTrad
big5-hkscs
- cp950
- -----------------------
+ ----------------------------------------------------------------
=item Encode::HanExtra -- More Chinese via CPAN
Due to size concerns, additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.
- -----------------------
+ Standard DOS/Win Macintosh Comment/Reference
+ ----------------------------------------------------------------
gb18030
euc-tw
big5plus
- -----------------------
+ ----------------------------------------------------------------
=back
=item Encode::EBCDIC
-See perlebcdic for details.
+See L<perlebcdic> for details.
- -----------------------
- cp1047
+ ----------------------------------------------------------------
cp37
+ cp500
+ cp875
+ cp1026
+ cp1047
posix-bc
- -----------------------
+ ----------------------------------------------------------------
=item Encode::Symbols
For symbols and dingbats.
- -----------------------
+ ----------------------------------------------------------------
symbol
dingbats
- macdingbats
- -----------------------
+ MacDingbats
+ AdobeZdingbat
+ AdobeSymbol
+ ----------------------------------------------------------------
+
+=back
+
+=head1 Unsupported encodings
+
+The following are not supported as yet. Some because they are rarely
+used, some because of technical difficulty. They may be supported by
+external modules via CPAN in the future, however.
+
+=over 4
+
+=item ISO-2022-JP-2 [RFC1554]
+
+Not very popular yet. Needs the Unicode Database or equivalent to
+implement encode(), because it includes JIS X 0208/0212, KSC5601, and
+GB2312 simultaneously, whose code points overlap in Unicode. So you
+need to look up the database to determine which character set a given
+Unicode character should belong to.
+
+=item ISO-2022-CN [RFC1922]
+
+Not very popular. Needs CNS 11643-1 and 2, which are not available in
+this module. CNS 11643 is supported (via euc-tw) in
+Encode::HanExtra. Autrijus may add support for this encoding in his
+module in the future.
+
+=item various HP-UX encodings
+
+The following are unsupported due to the lack of mapping data.
+
+ '8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
+ '15' - japanese15, korean15, and roi15
+
+=item Cyrillic encoding ISO-IR-111
+
+Anton doubts its usefulness.
+
+=item ISO-8859-8-1 [Hebrew]
+
+None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
+MacHebrew are supported only because mappings were
+available at L<http://www.unicode.org/>). Contributions welcome.
+
+=item Thai encoding TCVN
+
+Ditto.
+
+=item Vietnamese encodings VPS
+
+Though Jungshik has reported that Mozilla supports this encoding, it
+was too late for us to add it. It may be supported in the future via a
+separate module. See
+L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf> and
+L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
+if you are interested in helping us.
+
+=item various Mac encodings
+
+The following are unsupported due to the lack of mapping data.
+
+ MacArmenian, MacBengali, MacBurmese, MacEthiopic
+ MacExtArabic, MacGeorgian, MacKannada, MacKhmer
+ MacLaotian, MacMalayalam, MacMongolian, MacOriya
+ MacSinhalese, MacTamil, MacTelugu, MacTibetan
+ MacVietnamese
+
+The Mac encodings that are already available are based upon the vendor
+mappings at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/>.
+
+=item (Mac) Indic encodings
+
+The maps for the following are available at L<http://www.unicode.org/>
+but they remain unsupported because those encodings need an algorithmic
+approach, which F<enc2xs> does not support.
+
+ MacDevanagari
+ MacGurmukhi
+ MacGujarati
+
+For details, please see C<Unicode mapping issues and notes:> at
+L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
+
+I believe this issue affects not only Mac Indic encodings but also
+other Indic encodings; those mentioned above were simply the only Indic
+encoding maps that I could find at L<http://www.unicode.org/>.
+
+=back
+
+=head1 Encoding vs. Charset -- terminology
+
+We are used to using the terms (character) I<encoding> and I<character
+set> interchangeably. But just as confusing the terms byte and character
+is dangerous and the two should be differentiated when needed, we need
+to differentiate I<encoding> and I<character set>.
+
+To understand that, let's follow how we make computers grok our characters.
+
+=over 4
+
+=item *
+
+First we start with which characters to include. We call this
+collection of characters I<character repertoire>.
+
+=item *
+
+Then we have to give each character a unique ID so your computer can
+tell the difference between 'a' and 'A'. This itemized character
+repertoire is now a I<character set>.
+
+=item *
+
+If your computer can grok the character set without further
+processing, you can go ahead and use it. This is called a I<coded
+character set> (CCS) or I<raw character encoding>. ASCII is used this
+way in most cases.
+
+=item *
+
+But in many cases, especially with multi-byte CJK encodings, you have to
+tweak a little more. Your network connection may not accept any data
+with the Most Significant Bit set, and your computer may not be able to
+tell if a given byte is a whole character or just half of it. So you
+have to I<encode> the character set to use it.
+
+A I<character encoding scheme> (CES) determines how to encode a given
+character set, or a set of multiple character sets. 7bit ISO-2022 is
+an example of a CES. You switch between character sets via I<escape
+sequences>.
=back
-=head1 Encoding vs. Charset
+Technically, or mathematically speaking, a character set encoded in
+such a CES that maps character by character may form a CCS. EUC is such
+an example. The CES of EUC is as follows:
+
+=over 4
+
+=item *
-Character encoding (or just "encoding") and Character Set (or just
-"charset") are often used interchangeably but they are different
-concepts.
+Map ASCII unchanged.
-Charset determines which characters to be included in a given text.
+=item *
-Encoding actually maps charset(s) to stream of bits.
+Map a character set that consists of 94 or 96 to the power of N
+members (an N-byte set) by adding 0x80 to each byte.
-Note a given encoding may contain multiple charsets and complex CJK
-encodings are usually implemented that way.
+=item *
-For instance, euc-jp contains ASCII, JIS X 0201-1978 (Hankaku Kana),
-JIS X 0208-1997 (ZenkakuKana and Kanji) and JIS X 0212-1990 (Extended
-Kanji) in a single encoding.
+You can also use 0x8e and 0x8f to indicate that the following sequence of
+characters belongs to yet another character set. 0x80 is added to each
+following byte.
-As the name suggests, the Encode module supports encodings, not
-individual charsets.
+=back
+
+By looking carefully at the encoded byte sequence, you may find that the
+byte sequence forms a unique number. In that sense EUC is a CCS
+generated by the CES above from up to four CCSs (complicated?). UTF-8
+falls into this category. See L<perlunicode/"UTF-8"> to find out how
+UTF-8 maps Unicode to a byte sequence.
+
+You may also see by now why 7bit ISO-2022 cannot form a CCS. If
+you look at a byte sequence \x21\x21, you can't tell if it is two !'s
+or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1 so you have no
+trouble telling "!!" from an IDEOGRAPHIC SPACE.
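+
+The difference can be seen by encoding IDEOGRAPHIC SPACE (U+3000) both
+ways. This is a minimal sketch, assuming the C<encode> function
+exported by L<Encode> and that L<Encode::JP> is installed. The helper
+C<hex_bytes> is made up for this illustration:
+
+    use strict;
+    use warnings;
+    use Encode qw(encode);
+
+    # Hypothetical helper: render a byte string as space-separated hex.
+    sub hex_bytes {
+        my ($bytes) = @_;
+        return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
+    }
+
+    # IDEOGRAPHIC SPACE is the 7-bit pair 0x21 0x21 in JIS X 0208.
+    my $space = "\x{3000}";
+
+    # EUC adds 0x80 to each byte of the pair.
+    print "euc-jp:      ", hex_bytes( encode( 'euc-jp', $space ) ), "\n";
+
+    # 7bit ISO-2022 keeps the pair and brackets it with escape sequences.
+    print "iso-2022-jp: ", hex_bytes( encode( 'iso-2022-jp', $space ) ), "\n";
+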
=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
choose the most suitable aliases to name them in the context of
such communication.
+=over 2
+
+=item *
+
+To (en|de)code encodings marked C<(**)>, you need
+C<Encode::HanExtra>, available from CPAN.
+
+=back
+
Encoding names
- US-ASCII UTF-8
- ISO-8859-* KOI8-R
- Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
- EUC-KR
- Big5
+ US-ASCII UTF-8 ISO-8859-* KOI8-R
+ Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
+ EUC-KR Big5 GB2312
+
+are registered with IANA as preferred MIME names and can probably
+be used over the Internet.
+
+C<Shift_JIS> has been made official by JIS X 0208-1997.
+L<Microsoft-related naming mess> gives details.
-are L<http://www.iana.org/assignments/character-sets>-registered as
-preferred MIME names and may probably be used over the Internet.
+C<GB2312> is the IANA name for C<EUC-CN>.
+See L<Microsoft-related naming mess> for details.
-C<Shift_JIS> is no longer Microsft proprietary since it has been
-officialized by JIS X 0208-1997. It is probably the most wide
-spread encoding for Japanese on the Internet.
+C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
+with Encode. See L<Encode::CN> for details.
EUC-CN
+ KOI8-U [RFC2319]
+
+have not been registered with IANA (as of March 2002) but
+seem to be supported by major web browsers.
+The IANA name for C<EUC-CN> is C<GB2312>.
+
+ KS_C_5601-1987
+
+is heavily misused.
+See L<Microsoft-related naming mess> for details.
-has not been registered with IANA (as of march 2002) but
-seems to be supported by major web browsers. (IANA has registered
-this encoding as C<GB2312>, but C<gb2312> currently has a different
-meaning to the C<Encode> module. It will probably become alias to
-C<EUC-CN> in the future; until then it is safer to avoid using
-C<gb2312> as encoding name within Perl).
+C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
+with Encode. See L<Encode::KR> for details.
- UTF-16
- KOI8-U (http://www.faqs.org/rfcs/rfc2319.html)
+ UTF-16 UTF-16BE UTF-16LE
-are IANA-registered (C<UTF-16> even as a preferred MIME name)
-but probably should be avoided as encoding for web pages due to
-lack of browser support.
+are IANA-registered C<charset>s. See [RFC 2781] for details.
+Jungshik Shin reports that UTF-16 with a BOM is well accepted
+by MS IE 5/6 and NS 4/6. Beware however that
- ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html)
+=over 2
+
+=item *
+
+C<UTF-16> support in any software you're going to be
+using/interoperating with has probably been less tested
+than C<UTF-8> support
+
+=item *
+
+data coded with C<UTF-8> seamlessly passes traditional
+command piping (C<cat>, C<more>, etc.) while UTF-16 coded
+data is likely to cause confusion (with its zero bytes,
+for example)
+
+=item *
+
+it is beyond the power of words to describe the way HTML browsers
+encode non-C<ASCII> form data. To get a general impression refer to
+L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
+While encoding of form data has stabilized for C<UTF-8> coded pages
+(at least IE 5/6, NS 6, Opera 6 behave consistently), be sure to
+expect fun (and cross-browser discrepancies) with C<UTF-16> coded
+pages!
+
+=back
+
+The rule of thumb is to use C<UTF-8> unless you know what
+you're doing and unless you really benefit from using C<UTF-16>.
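+
+The zero bytes mentioned above are easy to see. This is a minimal
+sketch, assuming the C<encode> function exported by L<Encode>. The
+helper C<hex_bytes> is made up for this illustration:
+
+    use strict;
+    use warnings;
+    use Encode qw(encode);
+
+    # Hypothetical helper: render a byte string as space-separated hex.
+    sub hex_bytes {
+        my ($bytes) = @_;
+        return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
+    }
+
+    # Encode the same ASCII-only text both ways and compare the bytes.
+    my $text = "cat";
+    print "UTF-8:    ", hex_bytes( encode( 'UTF-8',    $text ) ), "\n";
+    print "UTF-16BE: ", hex_bytes( encode( 'UTF-16BE', $text ) ), "\n";
+
+Every ASCII character gains a zero byte in C<UTF-16BE>, which is what
+trips up byte-oriented tools.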
+
+
+ ISO-IR-165 [RFC1345]
GBK
VISCII
GB 12345
- GB 18030 (*) (see links bellow)
- EUC-TW (*)
+ GB 18030 (**) (see links below)
+ EUC-TW (**)
are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are recommended
names.
+ BIG5PLUS (**)
-=for comment this used to be listed as supported but
+is a somewhat proprietary name.
-do not work @15457 when it's clear they will be uncommented
-or deleted - Anton
-ISO-2022 (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
-CNS 11643 (only plains 1 and 2 available)
+=head2 Microsoft-related naming mess
- BIG5PLUS (*)
+Microsoft products misuse the following names:
-is a bit proprietary name. C<(*)>-marked encodings belong to
-C<Encode::HanExtra> available from CPAN.
+=over 2
-You may probably get some info on CJK encodings at
+=item KS_C_5601-1987
-brief description for most of the mentioned CJK encodings
-L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>
+Microsoft extension to C<EUC-KR>.
-several years old, but still useful
-L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
+Proper name: C<CP949>.
+
+See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
+for details.
+
+Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
+misuse. I<Raw> C<KS_C_5601-1987> encoding is available as
+C<ksc5601-raw>.
+
+See L<Encode::KR> for details.
+
+=item GB2312
+
+Microsoft extension to C<EUC-CN>.
+
+Proper names: C<CP936>, C<GBK>.
+
+C<GB2312> has been registered in the C<EUC-CN> meaning at
+IANA. This has partially repaired the situation: Microsoft's
+C<GB2312> has become a superset of the official C<GB2312>.
+
+Encode aliases C<GB2312> to C<euc-cn> in full agreement with
+IANA registration. C<cp936> is supported separately.
+I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
+
+See L<Encode::CN> for details.
+
+=item Big5
+
+Microsoft extension to C<Big5>.
+
+Proper name: C<CP950>.
-and some in-depth reading for the heroes :-)
-L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (eq C<ISO-2022>)
+Encode separately supports C<Big5> and C<cp950>.
-gives brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
-F<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
+=item Shift_JIS
-The nature of information in this section is most fragile and
-error-prone; I<probably> is the most popular adverb :)
-Please feel free to send your comments, disagreements and
-additions to L<...>. (Note however,
-that the mission of this document is to cover the
-C<Encode>-supported encodings only.
+Microsoft's understanding of C<Shift_JIS>.
+
+JIS has not endorsed the full Microsoft standard however.
+The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
+subsets, while Microsoft has always used C<Shift_JIS> to
+encode a wider character repertoire.
+
+As a historical predecessor Microsoft's variant
+probably has more rights to the name, though it may be objected
+that Microsoft shouldn't have used JIS as part of the name
+in the first place.
+
+Unambiguous name: C<CP932>.
+
+Encode separately supports C<Shift_JIS> and C<cp932>.
+
+=back
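+
+The aliasing described above can be inspected with C<find_encoding>.
+This is a minimal sketch, assuming the relevant CJK modules
+(L<Encode::KR>, L<Encode::CN>, L<Encode::JP>) are installed. What it
+prints depends on the Encode version in use:
+
+    use strict;
+    use warnings;
+    use Encode qw(find_encoding);
+
+    # Print the canonical name each of the disputed names resolves to.
+    for my $name (qw(KS_C_5601-1987 GB2312 cp936 Shift_JIS cp932)) {
+        my $enc = find_encoding($name);
+        printf "%-16s => %s\n", $name,
+            defined $enc ? $enc->name : '(not available)';
+    }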
+
+=head1 Glossary
+
+=over 2
+
+=item character repertoire
+
+A collection of unique characters. A I<character> set in the strictest
+sense. At this stage characters are not numbered.
+
+=item coded character set (CCS)
+
+A character set that is mapped in a way computers can use directly.
+Many character encodings including EUC fall in this category.
+
+=item character encoding scheme (CES)
+
+An algorithm to map a character set to a byte sequence. You don't
+have to be able to tell which character set a given byte sequence
+belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
+example of being both a CCS and a CES.
+
+=item charset (in MIME context)
+
+has long been used in the meaning of C<encoding>, CES.
+
+While C<character set> word combination has lost this meaning
+in MIME context since [RFC 2130], C<charset> abbreviation has
+retained it. This is how [RFC 2277], [RFC 2278] bless C<charset>:
+
+
+ This document uses the term "charset" to mean a set of rules for
+ mapping from a sequence of octets to a sequence of characters, such
+ as the combination of a coded character set and a character encoding
+ scheme; this is also what is used as an identifier in MIME "charset="
+ parameters, and registered in the IANA charset registry ... (Note
+ that this is NOT a term used by other standards bodies, such as ISO).
+ [RFC 2277]
+
+=item EUC
+
+Short for I<Extended Unix Code>. See ISO-2022.
+
+=item ISO-2022
+
+A CES that was carefully designed to coexist with ASCII. There are a
+7-bit version and an 8-bit version.
+
+The 7-bit version switches character sets via escape sequences, so it
+cannot form a CCS. Since this is more difficult to handle in programs
+than the 8-bit version, the 7-bit version is not very popular except for
+iso-2022-jp, the de facto standard CES for e-mails.
+
+The 8-bit version can form a CCS. EUC and ISO-8859 are two examples
+thereof. Pre-5.6 perl could use them as string literals.
+
+=item UCS
+
+Short for I<Universal Character Set>. When you say just UCS, it means
+I<Unicode>.
+
+=item UCS-2
+
+ISO/IEC 10646 encoding form: Universal Character Set coded in two
+octets.
+
+=item Unicode
+
+A character set that aims to include all character repertoires of the
+world. Many character sets in various national as well as industrial
+standards have become, in a way, just subsets of Unicode.
+
+=item UTF
+
+Short for I<Unicode Transformation Format>. Determines how to map a
+Unicode character into a byte sequence.
+
+=item UTF-16
+
+A UTF in 16-bit encoding. Can be either big endian or little
+endian. The big endian version is called UTF-16BE (equal to UCS-2 +
+Surrogate Support) and the little endian version is UTF-16LE.
+
+=back
=head1 See Also
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>
+=head1 References
+
+=over 2
+
+=item ECMA
+
+European Computer Manufacturers Association
+L<http://www.ecma.ch>
+
+=over 2
+
+=item ECMA-035 (eq C<ISO-2022>)
+
+L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
+
+The very specification of ISO-2022 is available from the link above.
+
+=back
+
+=item IANA
+
+Internet Assigned Numbers Authority
+L<http://www.iana.org/>
+
+=over 2
+
+=item Assigned Charset Names by IANA
+
+L<http://www.iana.org/assignments/character-sets>
+
+Most of the C<canonical names> in Encode derive from this list
+so you can directly apply the string you have extracted from the MIME
+headers of mails and web pages.
+
+=back
+
+=item ISO
+
+International Organization for Standardization
+L<http://www.iso.ch/>
+
+=item RFC
+
+Request For Comments -- need I say more?
+L<http://www.rfc.net/>, L<http://www.faqs.org/rfcs/>
+
+=item UC
+
+Unicode Consortium
+L<http://www.unicode.org/>
+
+=over 2
+
+=item Unicode Glossary
+
+L<http://www.unicode.org/glossary/>
+
+The glossary of this document is based upon this site.
+
+=back
+
+=back
+
+=head2 Other Notable Sites
+
+=over 2
+
+=item czyborra.com
+
+L<http://czyborra.com/>
+
+Contains a lot of useful information, especially gory details of ISO
+vs. vendor mappings.
+
+=item CJK.inf
+
+L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
+
+Somewhat obsolete (last update in 1996), but still useful. Also try
+
+L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
+
+You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
+
+=item Jungshik Shin's Hangul FAQ
+
+L<http://jshin.net/faq>
+
+And especially its subject 8
+
+L<http://jshin.net/faq/qa8.html>
+
+a comprehensive overview of the Korean (C<KS *>) standards.
+
+=back
+
+=head2 Offline sources
+
+=over 2
+
+=item C<CJKV Information Processing> by Ken Lunde
+
+CJKV Information Processing
+1999 O'Reilly & Associates, ISBN : 1-56592-224-7
+
+The modern successor of the C<CJK.inf>.
+
+Features comprehensive coverage of CJKV character sets and
+encodings along with many other issues faced by anyone trying
+to better support CJKV languages/scripts in all the areas of
+information processing.
+
+To purchase this book visit
+L<http://www.oreilly.com/catalog/cjkvinfo/>
+
+=back
+
=cut
+
+I could not find this page because the hostname doesn't resolve!
+
+ Brief description for most of the mentioned CJK encodings
+L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>