ascii US-ascii [ECMA]
iso-8859-1 latin1 [ISO]
utf8 UTF-8 [RFC2279]
- UCS-2 ucs2, iso-10646-1, UTF-16LE [IANA, UC]
- UTF-16LE UCS-2LE [UC]
----------------------------------------------------------------
+
+=head2 Encode::Unicode -- other Unicode encodings
+
+Unicode coding schemes other than native utf8 are supported by
+Encode::Unicode which will be autoloaded on demand.
+
+ ----------------------------------------------------------------
+ UCS-2BE UCS-2, iso-10646-1 [IANA, UC]
+ UCS-2LE [UC]
+ UTF-16 [UC]
+ UTF-16BE [UC]
+ UTF-16LE [UC]
+ UTF-32 [UC]
+ UTF-32BE [UC]
+ UTF-32LE [UC]
+ ----------------------------------------------------------------
+
+To find out how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
+see L<Encode::Unicode>.
+
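+As a rough illustration of the byte-order variants listed above (a
+minimal sketch, assuming a perl with the Encode module installed; the
+variable names are only for illustration):
+
+    use Encode qw(encode);
+
+    my $char = "\x{3042}";                  # HIRAGANA LETTER A
+    my $be   = encode("UTF-16BE", $char);   # big endian, no BOM
+    my $le   = encode("UTF-16LE", $char);   # little endian, no BOM
+    my $bom  = encode("UTF-16",   $char);   # BOM prepended on encode
+
+    # Dump each result as hex to compare the byte order.
+    printf "%-8s %s\n", $_->[0], unpack("H*", $_->[1])
+        for (["UTF-16BE", $be], ["UTF-16LE", $le], ["UTF-16", $bom]);
+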
=head2 Encode::Byte -- Extended ASCII
Encode::Byte implements most of single-byte encodings except for
GSM0338 is for GSM handsets. Though it shares alpanumerals with ASCII,
control character ranges and other parts are mapped very differently,
-presumablly to store Cyrillics. This one is also covered in
-Encode::Byte even thought this one does not comply extended ASCII.
+presumably to store Greek and Cyrillic alphabets. This one is also
+covered in Encode::Byte even though it does not comply with extended
+ASCII.
=back
=item Encode::CN -- Continental China
- Standard DOS/Win Macintosh Comment
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
- euc-cn MacChineseSimp GB2312 is aliased to this
- (gbk) cp936 GBK is aliased to to this
- gb12345-raw GB12345 as is
- gb2312-raw GB2312 as is
+ euc-cn(*1) MacChineseSimp
+ (gbk) cp936 (*2)
+ gb12345-raw { GB12345 without CES }
+ gb2312-raw { GB2312 without CES }
hz
iso-ir-165
----------------------------------------------------------------
+ (*1) GB2312 is aliased to this. See L<Microsoft-related naming mess>
+ (*2) gbk is aliased to this. See L<Microsoft-related naming mess>
+
=item Encode::JP -- Japan
- Standard DOS/Win Macintosh Comment/Reference
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
euc-jp
shiftjis cp932 macJapanese
- 7bit-jis jis
- euc-jp ujis
- iso-2022-jp [RFC1468]
- iso-2022-jp-1 [RFC2237]
+ 7bit-jis
+ euc-jp
+ iso-2022-jp [RFC1468]
+ iso-2022-jp-1 [RFC2237]
+ jis0201-raw { JIS X 0201 (roman + halfwidth kana) without CES }
+ jis0208-raw { JIS X 0208 (Kanji + fullwidth kana) without CES }
+ jis0212-raw { JIS X 0212 (Extended Kanji) without CES }
----------------------------------------------------------------
=item Encode::KR -- Korea
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
euc-kr MacKorean [RFC1557]
- cp949 ks_c_5601-1987 is an alias
- thereof.
+ cp949 (*)
iso-2022-kr [RFC1557]
johab [KS X 1001:1998, Annex 3]
- ksc5601-raw KSC5601 as is
+ ksc5601-raw { KSC5601 without CES }
----------------------------------------------------------------
+ (*) ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to
+ this. See below.
+
+
=item Encode::TW -- Taiwan
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
big5 cp950 MacChineseTrad
big5-hkscs
Due to size concerns, additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.
+ Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
gb18030
euc-tw
dangerous and should be differenciated when needed, we need to
differenciate I<encoding> and I<character set>.
-To understand that, it's follow how we make computers grok our character.
+To understand that, let's follow how we make computers grok our characters.
=over 4
=item *
-To (en|de) code Encodings marked as C<(*)>, You need
+To (en|de)code encodings marked as C<(**)>, you need
C<Encode::HanExtra>, available from CPAN.
=back
Encoding names
- US-ASCII UTF-8 ISO-8859-* KOI8-R
- Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
- EUC-KR Big5 GB2312
+ US-ASCII UTF-8 ISO-8859-* KOI8-R
+ Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
+ EUC-KR Big5 GB2312
are registered to IANA as preferred MIME names and may probably
be used over the Internet.
-C<Shift_JIS> has been officialized by JIS X 0208-1997.
+C<Shift_JIS> has been officialized by JIS X 0208:1997.
L<Microsoft-related naming mess> gives details.
C<GB2312> is the IANA name for C<EUC-CN>.
See L<Microsoft-related naming mess> for details.
C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
-with Encode. See L<Encode::CN -- Continental China> for details.
+with Encode. See L<Encode::CN> for details.
EUC-CN
- KOI8-U (http://www.faqs.org/rfcs/rfc2319.html)
+ KOI8-U [RFC2319]
have not been registered with IANA (as of March 2002) but
seem to be supported by major web browsers.
See L<Microsoft-related naming mess> for details.
C<KS_C_5601-1987> I<raw> encoding is available as C<kcs5601-raw>
-with Encode. See L<Encode::KR -- Korea> for details.
+with Encode. See L<Encode::KR> for details.
+
+ UTF-16 UTF-16BE UTF-16LE
+
+are IANA-registered C<charset>s. See [RFC 2781] for details.
+Jungshik Shin reports that UTF-16 with a BOM is well accepted
+by MS IE 5/6 and NS 4/6. Beware however that
+
+=over 2
+
+=item *
+
+C<UTF-16> support in any software you're going to be
+using/interoperating with has probably been less tested
+than C<UTF-8> support
+
+=item *
+
+C<UTF-8> coded data seamlessly passes traditional
+command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
+data is likely to cause confusion (with its zero bytes,
+for example)
+
+=item *
+
+it is beyond the power of words to describe the way HTML browsers
+encode non-C<ASCII> form data. To get a general impression visit
+L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
+While encoding of form data has stabilized for C<UTF-8> coded pages
+(at least IE 5/6, NS 6, Opera 6 behave consistently), be sure to
+expect fun (and cross-browser discrepancies) with C<UTF-16> coded
+pages!
- UTF-16
+=back
-=for comment
-waiting for comments from Jungshik Shin to soften this - Anton
+The rule of thumb is to use C<UTF-8> unless you know what
+you're doing and unless you really benefit from using C<UTF-16>.
-is a IANA-registered preferred MIME name
-but probably should be avoided as encoding for web pages due to
-the lack of browser support.
- ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html)
+ ISO-IR-165 [RFC1345]
GBK
VISCII
GB 12345
- GB 18030 (*) (see links bellow)
- EUC-TW (*)
+ GB 18030 (**) (see links below)
+ EUC-TW (**)
are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are recommended
names.
- BIG5PLUS (*)
+ BIG5PLUS (**)
is a bit proprietary name.
Microsoft extension to C<EUC-KR>.
-Proper name: C<CP949>.
+Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).
-See
-http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html
+See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
for details.
-Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect
-this common misusage.
-I<Raw> C<KS_C_5601-1987> encoding is available as C<kcs5601-raw>.
+Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
+misuse. I<Raw> C<KS_C_5601-1987> encoding is available as
+C<ksc5601-raw>.
-See L<Encode::KR -- Korea> for details.
+See L<Encode::KR> for details.
=item GB2312
Encode aliases C<GB2312> to C<euc-cn> in full agreement with
IANA registration. C<cp936> is supported separately.
-I<Raw> C<GB_2312-80> encoding is available as C<kcs5601-raw>.
+I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
-See L<Encode::CN -- Continental China> for details.
+See L<Encode::CN> for details.
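+
+The aliases described in this section can be checked from a script. A
+minimal sketch, assuming the Encode module is installed
+(C<Encode::resolve_alias> returns the canonical name, or a false value
+when the name is unknown):
+
+    use Encode;
+
+    for my $name (qw(GB2312 cp936 KS_C_5601-1987)) {
+        my $canon = Encode::resolve_alias($name);
+        print "$name => ", ($canon || "(unknown)"), "\n";
+    }
+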
=item Big5
JIS has not endorsed the full Microsoft standard however.
The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
subsets, while Microsoft has always been meaning C<Shift_JIS> to
-encode a wider character repertoire.
+encode a wider character repertoire. See the C<IANA> registration for
+C<Windows-31J>.
As a historical predecessor Microsoft's variant
probably has more rights for the name, albeit it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.
-Unabiguous name: C<CP932>.
+Unambiguous name: C<CP932>. C<IANA> name (not used?): C<Windows-31J>.
Encode separately supports C<Shift_JIS> and C<cp932>.
belongs. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
example of being both a CCS and CES.
+=item charset (in MIME context)
+
+has long been used to mean C<encoding>, that is, a CES.
+
+While the phrase C<character set> has lost this meaning
+in the MIME context since [RFC 2130], the abbreviation C<charset> has
+retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:
+
+
+ This document uses the term "charset" to mean a set of rules for
+ mapping from a sequence of octets to a sequence of characters, such
+ as the combination of a coded character set and a character encoding
+ scheme; this is also what is used as an identifier in MIME "charset="
+ parameters, and registered in the IANA charset registry ... (Note
+ that this is NOT a term used by other standards bodies, such as ISO).
+ [RFC 2277]
+
=item EUC
Extended Unix Character. See ISO-2022
=item ISO-2022
A CES that was carefully designed to coexist with ASCII. There are 7
-bit version and 8 bit version. 8 bit version can conform a CCS. EUC
-and ISO-8859 are two examples thereof.
+bit version and 8 bit version.
+
+The 7 bit version switches character sets via escape sequences, so it
+cannot form a CCS. Since this is more difficult to handle in programs
+than the 8 bit version, the 7 bit version is not very popular except for
+iso-2022-jp, the de facto standard CES for e-mails.
+
+The 8 bit version can form a CCS. EUC and ISO-8859 are two examples
+thereof. Pre-5.6 perl could use them as string literals.
=item UCS
=item Unicode
-A Character Set that aims to include all character character
-repertoire of the world. Many character sets in various national as
-well as industorial standards are therefore a subset thereof.
+A Character Set that aims to include all character repertoires of the
+world. Many character sets in various national as well as industrial
+standards have become, in a way, just subsets of Unicode.
=item UTF
-Short for I<Unicode Transformation Format>. Determinse how to map a
+Short for I<Unicode Transformation Format>. Determines how to map a
unicode character into byte sequnece.
=item UTF-16
A UTF in 16-bit encoding. Can either be in big endian or little
-endian. Big endian version is called UTF-16BE and little endian
-version is UTF-16LE.
+endian. Big endian version is called UTF-16BE (equal to UCS-2 plus
+surrogate support) and little endian version is UTF-16LE.
=back
=item RFC
Request For Comment -- need I say more?
-L<http://www.rfc.net/>
+L<http://www.rfc.net/>, L<http://www.faqs.org/rfcs/>
=item UC
=item czyborra.com
-<http://czyborra.com/>
+L<http://czyborra.com/>
Contains a a lot of useful information, especially gory details of ISO
vs. vendor mappings.
You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
+=item Jungshik Shin's Hangul FAQ
+
+L<http://jshin.net/faq>
+
+And especially its subject 8
+
+L<http://jshin.net/faq/qa8.html>
+
+a comprehensive overview of the Korean (C<KS *>) standards.
+
+=back
+
+=head2 Offline sources
+
+=over 2
+
+=item C<CJKV Information Processing> by Ken Lunde
+
+CJKV Information Processing
+1999 O'Reilly & Associates, ISBN : 1-56592-224-7
+
+The modern successor of the C<CJK.inf>.
+
+Features comprehensive coverage of CJKV character sets and
+encodings along with many other issues faced by anyone trying
+to better support CJKV languages/scripts in all the areas of
+information processing.
+
+To purchase this book visit
+L<http://www.oreilly.com/catalog/cjkvinfo/>
+
=back
=cut