=head1 NAME
-Encode::Supported -- Supported encodings by Encode
+Encode::Supported -- Encodings supported by Encode
=head1 DESCRIPTION
=head2 Encoding Names
Encoding names are case insensitive. White space in names
-is ignored. In addition an encoding may have aliases.
+is ignored. In addition, an encoding may have aliases.
Each encoding has one "canonical" name. The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence (with a few exceptions).
-=over
+=over 4
=item *
-The name used by the perl community. That includes 'utf8' and 'ascii'.
-Unlike aliases, canonical names directly reaches the method so such
-frequently used words like 'utf8' should do without alias lookups.
+The name used by the Perl community. That includes 'utf8' and 'ascii'.
+Unlike aliases, canonical names directly reach the method so such
+frequently used words like 'utf8' don't need to do alias lookups.
=item *
-The MIME name as defined in IETF RFCs This includes all "iso-"'s.
+The MIME name as defined in IETF RFCs. This includes all "iso-"s.
=item *
The name in the IANA registry.
-
+
=item *
The name used by the organization that defined it.
the canonical name.
Because of all the alias issues, and because in the general case
-encodings have state, "Encode" uses the encoding object internally
+encodings have state, "Encode" uses an encoding object internally
once an operation is in progress.
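+
+A minimal sketch of how names resolve, assuming only the documented
+C<find_encoding> and C<resolve_alias> functions of L<Encode>; the
+input names are arbitrary examples:
+
+  use Encode qw(find_encoding resolve_alias);
+
+  # Lookups are case insensitive, and aliases lead to the same object.
+  for my $name (qw(latin1 Latin1 ISO-8859-1)) {
+      my $enc = find_encoding($name)
+          or die "unknown encoding: $name";
+      printf "%-12s => %s\n", $name, $enc->name;
+  }
+
+  # resolve_alias() returns the canonical name, or false if unknown.
+  print resolve_alias("latin1"), "\n";
+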
=head1 Supported Encodings
As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
-(via alias) and all occurrance of spaces are replaced with '-'. In
-other words, "ISO 8859 1" and "iso-8859-1" are identical.
+(via alias) and all occurrences of spaces are replaced with '-'.
+In other words, "ISO 8859 1" and "iso-8859-1" are identical.
Encodings are categorized and implemented in several different modules
but you don't have to C<use Encode::XX> to make them available for
-most cases. Encode.pm will automatically load those modules in need.
+most cases. Encode.pm will automatically load those modules on demand.
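+
+To see which names are available on a given installation, something
+like the following sketch can be used; it relies only on the
+documented C<< Encode->encodings >> class method:
+
+  use Encode;
+
+  # Without arguments, only encodings that are already loaded are listed.
+  my @loaded = Encode->encodings();
+
+  # ":all" also includes encodings from modules not yet loaded.
+  my @all = Encode->encodings(":all");
+
+  printf "%d loaded, %d available\n", scalar @loaded, scalar @all;
+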
=head2 Built-in Encodings
The following encodings are always available.
- Canonical Aliases Comments & References
+ Canonical Aliases Comments & References
----------------------------------------------------------------
- ascii US-ascii [ECMA]
- iso-8859-1 latin1 [ISO]
- utf8 UTF-8 [RFC2279]
+ ascii US-ascii ISO-646-US [ECMA]
+ ascii-ctrl Special Encoding
+ iso-8859-1 latin1 [ISO]
+ null Special Encoding
+ utf8 UTF-8 [RFC2279]
----------------------------------------------------------------
+I<null> and I<ascii-ctrl> are special. "null" fails for all characters,
+so when you set the fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
+CHARACTERS will fall back to character references. Ditto for
+"ascii-ctrl" except for control characters. For fallback modes, see
+L<Encode>.
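+
+As a rough illustration (a sketch, not taken from the Encode test
+suite), the following passes a copy of an arbitrary sample string to
+C<encode> with the C<FB_HTMLCREF> check value:
+
+  use Encode qw(encode FB_HTMLCREF);
+
+  my $text = "caf\x{e9}";      # sample input, "cafe" with e-acute
+
+  # encode() may modify its source argument when a check value is
+  # given, so each call works on a fresh copy.
+  for my $name (qw(ascii ascii-ctrl null)) {
+      my $copy = $text;
+      print "$name: ", encode($name, $copy, FB_HTMLCREF), "\n";
+  }
+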
=head2 Encode::Unicode -- other Unicode encodings
Unicode coding schemes other than native utf8 are supported by
-Encode::Unicode which will be autoloaded on demand.
+Encode::Unicode, which will be autoloaded on demand.
----------------------------------------------------------------
UCS-2BE UCS-2, iso-10646-1 [IANA, UC]
UTF-16BE [UC]
UTF-16LE [UC]
UTF-32 [UC]
- UTF-32BE [UC]
+ UTF-32BE UCS-4 [UC]
UTF-32LE [UC]
+ UTF-7 [RFC2152]
----------------------------------------------------------------
-To find how those (UCS-2|UTF-(16|32))(LE|BE)? differ to one another,
+To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
see L<Encode::Unicode>.
+UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
+encoding. It is implemented separately by Encode::Unicode::UTF7.
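+
+A short sketch of the byte-order difference, assuming nothing beyond
+C<encode> from L<Encode>; the sample character is arbitrary:
+
+  use Encode qw(encode);
+
+  my $char = "\x{263A}";       # WHITE SMILING FACE, sample input
+
+  # Dump the encoded bytes in hex for each scheme.
+  for my $name (qw(UTF-16BE UTF-16LE UTF-32BE UTF-32LE)) {
+      printf "%-8s %s\n", $name, unpack("H*", encode($name, $char));
+  }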
+
=head2 Encode::Byte -- Extended ASCII
-Encode::Byte implements most of single-byte encodings except for
-Symbols and EBCDIC. The following encodings are based single-byte
-encoding implemented as extended ASCII. For most cases it uses
-\x80-\xff (upper half) to map non-ASCII characters.
+Encode::Byte implements most single-byte encodings except for
+Symbols and EBCDIC. The following encodings are based on single-byte
+encodings implemented as extended ASCII. Most of them map
+\x80-\xff (upper half) to non-ASCII characters.
-=over 2
+=over 4
=item ISO-8859 and corresponding vendor mappings
-Since there are so many, They are presented in table format with
-Languages and corresponding encoding names by vendors. Note the table
-is sorted in order of ISO-8859 and the corresponding vendor mappings
-are slightly different from that of ISO. See
+Since there are so many, they are presented in table format with
+languages and corresponding encoding names by vendors. Note that
+the table is sorted in order of ISO-8859 and the corresponding vendor
+mappings are slightly different from those of ISO. See
L<http://czyborra.com/charsets/iso8859.html> for details.
- Lang/Regions ISO/Other Std. DOS Windows Macintosh Others
+ Lang/Regions ISO/Other Std. DOS Windows Macintosh Others
----------------------------------------------------------------
- N. America (ASCII) cp437 AdobeStandardEncoding
- cp863 (DOSCanadaF)
- W. Europe (iso-8859-1) cp850 cp1252 MacRoman nextstep
- hp-roman8
- cp860 (DOSPortuguese)
- CE. Europe iso-8859-2 cp852 cp1250 MacCentralEurRoman
- MacCroatian
- MacRomanian
- MacRumanian
- Latin3(*3) iso-8859-3
- Latin4(*4) iso-8859-4
- Cyrillics iso-8859-5 cp855 cp1251 MacCyrillic
- (Also see next section) cp866 MacUkrainian
- Arabic iso-8859-6 cp864 cp1256 MacArabic
- cp1006 MacFarsi
- Greek iso-8859-7 cp737 cp1253 MacGreek
- cp869 (DOSGreek2)
- Hebrew iso-8859-8 cp862 cp1255 MacHebrew
- Turkish iso-8859-9 cp857 cp1254 MacTurkish
- Nordics iso-8859-10 cp865
- cp861 MacIcelandic
- MacSami
- Thai iso-8859-11 cp874 MacThai
+ N. America (ASCII) cp437 AdobeStandardEncoding
+ cp863 (DOSCanadaF)
+ W. Europe iso-8859-1 cp850 cp1252 MacRoman nextstep
+ hp-roman8
+ cp860 (DOSPortuguese)
+ Cntrl. Europe iso-8859-2 cp852 cp1250 MacCentralEurRoman
+ MacCroatian
+ MacRomanian
+ MacRumanian
+ Latin3[1] iso-8859-3
+ Latin4[2] iso-8859-4
+ Cyrillics iso-8859-5 cp855 cp1251 MacCyrillic
+ (See also next section) cp866 MacUkrainian
+ Arabic iso-8859-6 cp864 cp1256 MacArabic
+ cp1006 MacFarsi
+ Greek iso-8859-7 cp737 cp1253 MacGreek
+ cp869 (DOSGreek2)
+ Hebrew iso-8859-8 cp862 cp1255 MacHebrew
+ Turkish iso-8859-9 cp857 cp1254 MacTurkish
+ Nordics iso-8859-10 cp865
+ cp861 MacIcelandic
+ MacSami
+ Thai iso-8859-11[3] cp874 MacThai
(iso-8859-12 is nonexistent. Reserved for Indics?)
- Baltics iso-8859-13 cp775 cp1257
+ Baltics iso-8859-13 cp775 cp1257
Celtics iso-8859-14
- Latin9(*15) iso-8859-15
+ Latin9 [4] iso-8859-15
Latin10 iso-8859-16
- Vietnamese viscii cp1258 MacVietnamese
+ Vietnamese viscii cp1258 MacVietnamese
----------------------------------------------------------------
- (*3) Esperanto, Maltese, and Turkish. Turkish is now on 8859-5
- (*4) Baltics. Now on 8859-10
- (*9) Nicknamed Latin0; Euro sign as well as French and Finnish
- letters that are missing from 8859-1 are added.
+ [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
+ [2] Baltics. Now on 8859-10, except for Latvian.
+ [3] TIS 620 + Non-Breaking Space (0xA0 / U+00A0)
+ [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
+ letters that are missing from 8859-1 were added.
All cp* are also available as ibm-*, ms-*, and windows-* . See also
L<http://czyborra.com/charsets/codepages.html>.
Macintosh encodings don't seem to be registered in such entities as
IANA. "Canonical" names in Encode are based upon Apple's Tech Note
1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
-for details
+for details.
-=item KOI8 - De Facto Standard for Cyrillic world
+=item KOI8 - De Facto Standard for the Cyrillic world
-Though ISO-8859 does have ISO-8859, KOI8 series is far more popular
-in the Net. L<Encode> comes with the following KOI charsets. for
-gory details, See <http://czyborra.com/charsets/cyrillic.html> for
-details.
+Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
+popular on the Net. L<Encode> comes with the following KOI charsets.
+For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.
----------------------------------------------------------------
- koi8-f
- koi8-r cp878 [RFC1489]
- koi8-u [RFC2319]
-
+ koi8-f
+ koi8-r cp878 [RFC1489]
+ koi8-u [RFC2319]
+ ----------------------------------------------------------------
+
=item gsm0338 - Hentai Latin 1
-GSM0338 is for GSM handsets. Though it shares alpanumerals with ASCII,
-control character ranges and other parts are mapped very differently,
-presumablly to store Greek and Cyrillic alphabets. This one is also
-covered in Encode::Byte even thought this one does not comply extended
-ASCII.
+GSM0338 is for GSM handsets. Though it shares alphanumerics with
+ASCII, control character ranges and other parts are mapped very
+differently, mainly to store Greek characters. There are also escape
+sequences (starting with 0x1B) to cover e.g. the Euro sign. Some
+special cases like a trailing 0x00 byte or a lone 0x1B byte are not
+well-defined and decode() will return an empty string for them.
+One possible workaround is
+
+ $gsm =~ s/\x00\z/\x00\x00/;
+ $uni = decode("gsm0338", $gsm);
+ $uni .= "\xA0" if $gsm =~ /\x1B\z/;
+
+Note that the Encode implementation of GSM0338 does not implement the
+reuse of Latin capital letters as Greek capital letters (for example,
+0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
+LETTER ZETA)).
+
+The GSM0338 is also covered in Encode::Byte even though it is not
+an "extended ASCII" encoding.
=back
-=head2 The CJK: Chinese, Japanese, Korean (Multibyte)
+=head2 CJK: Chinese, Japanese, Korean (Multibyte)
-Note Vietnamese is listed above. Also read "Encoding vs Charset"
-below. Also note these are implemented in distinct module by
-languages, due the the size concerns. Please also refer to their
-respective document pages.
+Note that Vietnamese is listed above. Also read "Encoding vs Charset"
+below. Also note that these are implemented in distinct modules by
+country, due to size concerns (simplified Chinese is mapped
+to 'CN', continental China, while traditional Chinese is mapped to
+'TW', Taiwan). Please refer to their respective documentation pages.
=over 4
=item Encode::CN -- Continental China
- Standard DOS/Win Macintosh Comment/Reference
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
- euc-cn(*1) MacChineseSimp
- (gbk) cp936 (*2)
- gb12345-raw { GB12345 without CES }
- gb2312-raw { GB2312 without CES }
+ euc-cn [1] MacChineseSimp
+ (gbk) cp936 [2]
+ gb12345-raw { GB12345 without CES }
+ gb2312-raw { GB2312 without CES }
hz
iso-ir-165
----------------------------------------------------------------
- (*1) GB2312 is aliased to this. see L<Microsoft-related naming mess>
- (*2) gbk is aliased to this. see L<Microsoft-related naming mess>
+ [1] GB2312 is aliased to this. See L<Microsoft-related naming mess>
+ [2] gbk is aliased to this. See L<Microsoft-related naming mess>
=item Encode::JP -- Japan
- Standard DOS/Win Macintosh Comment/Reference
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
euc-jp
- shiftjis cp932 macJapanese
+ shiftjis cp932 macJapanese
7bit-jis
- euc-jp
- iso-2022-jp [RFC1468]
- iso-2022-jp-1 [RFC2237]
+ iso-2022-jp [RFC1468]
+ iso-2022-jp-1 [RFC2237]
jis0201-raw { JIS X 0201 (roman + halfwidth kana) without CES }
jis0208-raw { JIS X 0208 (Kanji + fullwidth kana) without CES }
jis0212-raw { JIS X 0212 (Extended Kanji) without CES }
=item Encode::KR -- Korea
- Standard DOS/Win Macintosh Comment/Reference
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
- euc-kr MacKorean [RFC1557]
- cp949 (*)
- iso-2022-kr [RFC1557]
+ euc-kr MacKorean [RFC1557]
+ cp949 [1]
+ iso-2022-kr [RFC1557]
johab [KS X 1001:1998, Annex 3]
ksc5601-raw { KSC5601 without CES }
----------------------------------------------------------------
- (*) ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to
- this. See below.
-
-
+ [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
+ See below.
+
=item Encode::TW -- Taiwan
- Standard DOS/Win Macintosh Comment/Reference
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
- big5 cp950 MacChineseTrad
- big5-hkscs
+ big5-eten cp950 MacChineseTrad {big5 aliased to big5-eten}
+ big5-hkscs
----------------------------------------------------------------
=item Encode::HanExtra -- More Chinese via CPAN
-Due to size concerns, additional Chinese encodings below are
+Due to the size concerns, additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.
- Standard DOS/Win Macintosh Comment/Reference
+ Standard DOS/Win Macintosh Comment/Reference
+ ----------------------------------------------------------------
+ big5ext CMEX's Big5e Extension
+ big5plus CMEX's Big5+ Extension
+ cccii Chinese Character Code for Information Interchange
+ euc-tw EUC (Extended Unix Character)
+ gb18030 GBK with Traditional Characters
+ ----------------------------------------------------------------
+
+=item Encode::JIS2K -- JIS X 0213 encodings via CPAN
+
+Due to size concerns, additional Japanese encodings below are
+distributed separately on CPAN, under the name Encode::JIS2K.
+
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
- gb18030
- euc-tw
- big5plus
+ euc-jisx0213
+ shiftjisx0213
+ iso-2022-jp-3
+ jis0213-1-raw
+ jis0213-2-raw
----------------------------------------------------------------
=back
AdobeSymbol
----------------------------------------------------------------
+=item Encode::MIME::Header
+
+Strictly speaking, the MIME header encoding documented in RFC 2047 is
+more of an encapsulation than an encoding. However, support for it in
+the modern world is imperative, so the following names are supported.
+
+ ----------------------------------------------------------------
+ MIME-Header [RFC2047]
+ MIME-B [RFC2047]
+ MIME-Q [RFC2047]
+ ----------------------------------------------------------------
+
+=item Encode::Guess
+
+This one is not the name of an encoding but a utility that lets you pick
+the most appropriate encoding for some data out of given I<suspects>. See
+L<Encode::Guess> for details.
+
=back
=head1 Unsupported encodings
-The following are not supported as yet. Some because they are rarely
-usede, some because of technical difficulty. They may be supported by
-external modules via CPAN in future, however.
+The following encodings are not supported as yet; some because they
+are rarely used, some because of technical difficulties. They may
+be supported by external modules via CPAN in the future, however.
=over 4
=item ISO-2022-JP-2 [RFC1554]
Not very popular yet. Needs Unicode Database or equivalent to
-implement encode() (Because it includes JIS X 0208/0212, KSC5601, and
-GB2312 sumulteniously, which code points in unicode overlap. So you
-need to lookup the database to determine what character set a given
+implement encode() (because it includes JIS X 0208/0212, KSC5601, and
+GB2312 simultaneously, whose code points in Unicode overlap. So you
+need to look up the database to determine to what character set a given
Unicode character should belong).
-=item ISO-2022-CN [RFC1922]
+=item ISO-2022-CN [RFC1922]
+
+Not very popular. Needs CNS 11643-1 and -2 which are not available in
+this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
+Autrijus Tang may add support for this encoding in his module in the future.
-Not very popular. Needs CNS 11643-1 and 2 which are not available in
-this module. CNS 11643 is supported (via euc-tw) in
-Encode::HanExtra. Autrijus may add support for this encoding in his
-module in future
+=item Various HP-UX encodings
-=item various UP-UX encodings
+The following are unsupported due to the lack of mapping data.
-The following are unsoported due to the lack of mapping data.
-
'8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
- '15' - japanese15, korean15, and roi15
+ '15' - japanese15, korean15, and roi15
=item Cyrillic encoding ISO-IR-111
-Anton doubts its usefulness.
+Anton Tagunov doubts its usefulness.
=item ISO-8859-8-1 [Hebrew]
None of the Encode team knows Hebrew enough (ISO-8859-8, cp1255 and
MacHebrew are supported because and just because there were mappings
-available at L<http://www.unicode.org/>). Contribution welcome.
+available at L<http://www.unicode.org/>). Contributions welcome.
+
+=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
+
+Ditto.
=item Thai encoding TCVN
=item Vietnamese encodings VPS
-Though Jungshik has reported that mozilla supports this encoding, It was too late for us to add one. In future via a separate module. See
-L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf> and
+Though Jungshik Shin has reported that Mozilla supports this encoding,
+it was too late before 5.8.0 for us to add it. In the future, it
+may be available via a separate module. See
+L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
+and
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
if you are interested in helping us.
-=item various Mac encodings
+=item Various Mac encodings
-The following are unsoported due to the lack of mapping data.
+The following are unsupported due to the lack of mapping data.
MacArmenian, MacBengali, MacBurmese, MacEthiopic
MacExtArabic, MacGeorgian, MacKannada, MacKhmer
MacSinhalese, MacTamil, MacTelugu, MacTibetan
MacVietnamese
-The rest of which already available are based upon the vendor mappings at
-L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
+The rest, which are already available, are based upon the vendor mappings
+at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
=item (Mac) Indic encodings
-The maps for the following is available at L<http://www.unicode.org/>
-but remains unsupport because those encordigs need algorithmical
-approach, unsupported by F<enc2xs>
+The maps for the following are available at L<http://www.unicode.org/>
+but remain unsupported because those encodings need an algorithmic
+approach, currently unsupported by F<enc2xs>:
MacDevanagari
MacGurmukhi
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
I believe this issue is prevalent not only for Mac Indics but also in
-other Indic encodings but those mentions were the only Indic encodings
+other Indic encodings, but the above were the only Indic encodings
maps that I could find at L<http://www.unicode.org/> .
=back
=head1 Encoding vs. Charset -- terminology
-We are used to using the term (character) I<encoding> and I<character set>
-interchangeably. But just as using the term byte and character is
-dangerous and should be differenciated when needed, we need to
-differenciate I<encoding> and I<character set>.
+We are used to using the term (character) I<encoding> and I<character
+set> interchangeably. But just as confusing the terms byte and
+character is dangerous and the terms should be differentiated when
+needed, we need to differentiate I<encoding> and I<character set>.
-To understand that, it's follow how we make computers grok our characters.
+To understand that, here is a description of how we make computers
+grok our characters.
=over 4
=item *
Then we have to give each character a unique ID so your computer can
-tell the differnce from 'a' to 'A'. This itemized character
-repartoire is now a I<character set>.
+tell the difference between 'a' and 'A'. This itemized character
+repertoire is now a I<character set>.
=item *
If your computer can grow the character set without further
-proccessing, you can go ahead use it. This is called a I<coded
+processing, you can go ahead and use it. This is called a I<coded
character set> (CCS) or I<raw character encoding>. ASCII is used this
way for most cases.
=item *
-But in many cases especially multi-byte CJK encodings, you have to
+But in many cases, especially multi-byte CJK encodings, you have to
tweak a little more. Your network connection may not accept any data
-with the Most Significant Bit set, Your computer may not be able to
+with the Most Significant Bit set, and your computer may not be able to
tell if a given byte is a whole character or just half of it. So you
have to I<encode> the character set to use it.
A I<character encoding scheme> (CES) determines how to encode a given
character set, or a set of multiple character sets. 7bit ISO-2022 is
-an example of CES. You switch between character sets via I<escape
-sequence>.
+an example of a CES. You switch between character sets via I<escape
+sequences>.
=back
-Technically, or Mathematically speaking, a character set encoded in
+Technically, or mathematically, speaking, a character set encoded in
such a CES that maps character by character may form a CCS. EUC is such
-an example. CES of EUC is as follows;
+an example. The CES of EUC is as follows:
=over 4
=item *
-You can also use 0x8e and 0x8f to tell the following sequence of
-characters belong to yet another character set. each following byte
-is added by 0x80
+You can also use 0x8e and 0x8f to indicate that the following sequence of
+characters belongs to yet another character set. The value 0x80 is
+added to each following byte.
=back
-By carefully looking at at the encoded byte sequence, you may find the
-byte sequence conforms a unique number. In that sense EUC is a CCS
+By carefully looking at the encoded byte sequence, you can find that the
+byte sequence corresponds to a unique number. In that sense, EUC is a CCS
generated by a CES above from up to four CCS (complicated?). UTF-8
-falls into this category. See L<perlunicode/"UTF-8"> to find how
+falls into this category. See L<perlunicode/"UTF-8"> to find out how
UTF-8 maps Unicode to a byte sequence.
-You may also find by now why 7bit ISO-2022 cannot conform a CCS. If
-you look at a byte sequence \x21\x21, you can't tell if it is two !'s
-or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1 so you have no
-trouble between "!!". and " "
+You may also have found out by now why 7bit ISO-2022 cannot form
+a CCS. If you look at a byte sequence \x21\x21, you can't tell if
+it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1
+so you have no trouble differentiating between "!!" and S<" ">.
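+
+The difference can be observed with a small sketch like the one
+below, which assumes the C<euc-jp> and C<iso-2022-jp> encodings from
+L<Encode::JP> are installed:
+
+  use Encode qw(encode);
+
+  my $bangs = "!!";
+  my $space = "\x{3000}";      # IDEOGRAPHIC SPACE
+
+  # EUC keeps the two apart by setting the high bit; 7bit ISO-2022
+  # has to wrap the same \x21\x21 bytes in escape sequences.
+  for my $name (qw(euc-jp iso-2022-jp)) {
+      printf "%-12s !!=%s space=%s\n", $name,
+          unpack("H*", encode($name, $bangs)),
+          unpack("H*", encode($name, $space));
+  }
+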
=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
choose the most suitable aliases to name them in the context of
such communication.
-=over 2
+=over 4
=item *
-To (en|de) code Encodings marked as C<(**)>, You need
+To (en|de)code encodings marked by C<(**)>, you need
C<Encode::HanExtra>, available from CPAN.
=back
Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
EUC-KR Big5 GB2312
-are registered to IANA as preferred MIME names and may probably
+are registered with IANA as preferred MIME names and may
be used over the Internet.
C<Shift_JIS> has been officialized by JIS X 0208:1997.
have not been registered with IANA (as of March 2002) but
seem to be supported by major web browsers.
-IANA name for C<EUC-CN> is C<GB2312>.
+The IANA name for C<EUC-CN> is C<GB2312>.
KS_C_5601-1987
UTF-16 UTF-16BE UTF-16LE
-are a IANA-registered C<charset>s. See [RFC 2781] for details.
+are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware however that
-=over 2
+=over 4
=item *
C<UTF-8> coded data seamlessly passes traditional
command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
-data is likely to cause confusion (with it's zero bytes,
+data is likely to cause confusion (with its zero bytes,
for example)
=item *
it is beyond the power of words to describe the way HTML browsers
-encode non-C<ASCII> form data. To get a general impression visit
+encode non-C<ASCII> form data. To get a general impression, visit
L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
-While encoding of form data has stabilized for C<UTF-8> coded pages
-(at least IE 5/6, NS 6, Opera 6 behave consitently), be sure to
-expect fun (and cross-browser discrepancies) with C<UTF-16> coded
+While encoding of form data has stabilized for C<UTF-8> encoded pages
+(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
+expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
pages!
=back
The rule of thumb is to use C<UTF-8> unless you know what
you're doing and unless you really benefit from using C<UTF-16>.
-
ISO-IR-165 [RFC1345]
- GBK
VISCII
GB 12345
GB 18030 (**) (see links bellow)
BIG5PLUS (**)
-is a bit proprietary name.
+is a proprietary name.
=head2 Microsoft-related naming mess
Microsoft products misuse the following names:
-=over 2
+=over 4
=item KS_C_5601-1987
JIS has not endorsed the full Microsoft standard however.
The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
-subsets, while Microsoft has always been meaning C<Shift_JIS> to
-encode a wider character repertoire. See C<IANA> registration for
+character sets, while Microsoft has always used C<Shift_JIS>
+to encode a wider character repertoire. See C<IANA> registration for
C<Windows-31J>.
-As a historical predecessor Microsoft's variant
-probably has more rights for the name, albeit it may be objected
+As a historical predecessor, Microsoft's variant
+probably has more rights for the name, though it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.
=head1 Glossary
-=over 2
+=over 4
=item character repertoire
-A collection of unique characters. A I<character> set in the most
-strict sense. At this stage characters are not numberd.
+A collection of unique characters. A I<character> set in the strictest
+sense. At this stage, characters are not numbered.
=item coded character set (CCS)
A character set that is mapped in a way computers can use directly.
-Many character encodings including EUC falls in this category.
+Many character encodings, including EUC, fall in this category.
=item character encoding scheme (CES)
has long been used in the meaning of C<encoding>, CES.
-While C<character set> word combination has lost this meaning
-in MIME context since [RFC 2130], C<charset> abbreviation has
-retained it. This is how [RFC 2277], [RFC 2278] bless C<charset>:
-
+While the word combination C<character set> has lost this meaning
+in MIME context since [RFC 2130], the C<charset> abbreviation has
+retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:
This document uses the term "charset" to mean a set of rules for
mapping from a sequence of octets to a sequence of characters, such
scheme; this is also what is used as an identifier in MIME "charset="
parameters, and registered in the IANA charset registry ... (Note
that this is NOT a term used by other standards bodies, such as ISO).
- [RFC 2277]
+ [RFC 2277]
=item EUC
-Extended Unix Character. See ISO-2022
+Extended Unix Character. See ISO-2022.
=item ISO-2022
-A CES that was carefully designed to coexist with ASCII. There are 7
-bit version and 8 bit version.
+A CES that was carefully designed to coexist with ASCII. There are a 7
+bit version and an 8 bit version.
-7 bit version switches character set via escape sequence so this
+The 7 bit version switches character set via escape sequence so it
cannot form a CCS. Since this is more difficult to handle in programs
-than the 8 bit version, 7 bit version is not very popular except for
-iso-2022-jp, the de facto standard CES for e-mails.
+than the 8 bit version, the 7 bit version is not very popular except for
+iso-2022-jp, the I<de facto> standard CES for e-mails.
-8 bit version can conform a CCS. EUC and ISO-8859 are two examples
-thereof. pre-5.6 perl could use them as string literals.
+The 8 bit version can form a CCS. EUC and ISO-8859 are two examples
+thereof. Pre-5.6 perl could use them as string literals.
=item UCS
Short for I<Universal Character Set>. When you say just UCS, it means
-I<Unicode>
+I<Unicode>.
=item UCS-2
=item Unicode
-A Character Set that aims to include all character repertoire of the
-world. Many character sets in various national as well as industorial
+A character set that aims to include all character repertoires of the
+world. Many character sets in various national as well as industrial
standards have become, in a way, just subsets of Unicode.
=item UTF
Short for I<Unicode Transformation Format>. Determines how to map a
-unicode character into byte sequnece.
+Unicode character into a byte sequence.
=item UTF-16
A UTF in 16-bit encoding. Can either be in big endian or little
-endian. Big endian version is called UTF-16BE (equals to UCS-2 +
-Surrogate Support) and little endian version is UTF-16LE.
+endian. The big endian version is called UTF-16BE (equal to UCS-2 +
+surrogate support) and the little endian version is called UTF-16LE.
=back
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>
+L<Encode::MIME::Header>, L<Encode::Guess>
=head1 References
-=over 2
+=over 4
=item ECMA
European Computer Manufacturers Association
L<http://www.ecma.ch>
-=over 2
+=over 4
-=item EMCA-035 (eq C<ISO-2022>)
+=item ECMA-035 (eq C<ISO-2022>)
L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
-The very dspecification of ISO-2022 is available from the link above.
+The specification of ISO-2022 is available from the link above.
=back
Internet Assigned Numbers Authority
L<http://www.iana.org/>
-=over 2
+=over 4
=item Assigned Charset Names by IANA
Most of the C<canonical names> in Encode derive from this list
so you can directly apply the string you have extracted from MIME
-header of mails and we pages.
+header of mails and web pages.
=back
=item RFC
-Request For Comment -- need I say more?
-L<http://www.rfc.net/>, L<http://www.faqs.org/rfcs/>
+Request For Comments -- need I say more?
+L<http://www.rfc-editor.org/>, L<http://www.rfc.net/>,
+L<http://www.faqs.org/rfcs/>
=item UC
Unicode Consortium
L<http://www.unicode.org/>
-=over 2
+=over 4
=item Unicode Glossary
L<http://www.unicode.org/glossary/>
-The glossary of this document is based opon this site.
+The glossary of this document is based upon this site.
=back
=head2 Other Notable Sites
-=over 2
+=over 4
=item czyborra.com
L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
-You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
+You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.
=item Jungshik Shin's Hangul FAQ
L<http://jshin.net/faq>
-And especially it's subject 8
+And especially its subject 8.
L<http://jshin.net/faq/qa8.html>
-a comprehensive overview of the Korean (C<KS *>) standards.
+A comprehensive overview of the Korean (C<KS *>) standards.
+
+=item debian.org: "Introduction to i18n"
+
+A brief description for most of the mentioned CJK encodings is
+contained in
+L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
=back
=head2 Offline sources
-=over 2
+=over 4
=item C<CJKV Information Processing> by Ken Lunde
CJKV Information Processing
1999 O'Reilly & Associates, ISBN : 1-56592-224-7
-The modern successor of the C<CJK.inf>.
+The modern successor of C<CJK.inf>.
-Features a comprehensive coverage on CJKV character sets and
+Features a comprehensive coverage of CJKV character sets and
encodings along with many other issues faced by anyone trying
to better support CJKV languages/scripts in all the areas of
information processing.
-To purchase this book visit
+To purchase this book, visit
L<http://www.oreilly.com/catalog/cjkvinfo/>
+or your favourite bookstore.
=back
=cut
-
-I could not find this page because the hostname doesn't resolve!
-
- Brief description for most of the mentioned CJK encodings
-L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>