X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=ext%2FEncode%2Flib%2FEncode%2FSupported.pod;h=a0beca319315591d139bc432a057692b03a8f447;hb=f2a2953c25503948c9a5e44b5ee7fe84a7da6b46;hp=1dc4df4a50a797b35fa6f964bbad0c59d24d610c;hpb=735b7a62d039909fa334af8e05d4788f54c2c65a;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/ext/Encode/lib/Encode/Supported.pod b/ext/Encode/lib/Encode/Supported.pod
index 1dc4df4..a0beca3 100644
--- a/ext/Encode/lib/Encode/Supported.pod
+++ b/ext/Encode/lib/Encode/Supported.pod
@@ -63,10 +63,19 @@ The following encodings are always available.
   ascii US-ascii [ECMA]
   iso-8859-1 latin1 [ISO]
   utf8 UTF-8 [RFC2279]
-  UCS-2 ucs2, iso-10646-1, UTF-16LE [IANA, UC]
-  UTF-16LE UCS-2LE [UC]
+  UCS-2BE UCS-2, iso-10646-1 [IANA, UC]
+  UCS-2LE [UC]
+  UTF-16 [UC]
+  UTF-16BE [UC]
+  UTF-16LE [UC]
+  UTF-32 [UC]
+  UTF-32BE [UC]
+  UTF-32LE [UC]
   ----------------------------------------------------------------

+To find how those (UCS-2|UTF-(16|32))(LE|BE)? differ to one another,
+see L.
+
 =head2 Encode::Byte -- Extended ASCII

 Encode::Byte implements most of single-byte encodings except for
@@ -146,8 +155,9 @@ details.

 GSM0338 is for GSM handsets. Though it shares alpanumerals with ASCII,
 control character ranges and other parts are mapped very differently,
-presumablly to store Cyrillics. This one is also covered in
-Encode::Byte even thought this one does not comply extended ASCII.
+presumablly to store Greek and Cyrillic alphabets. This one is also
+covered in Encode::Byte even thought this one does not comply extended
+ASCII.

 =back

@@ -162,41 +172,52 @@ respective document pages.
 =item Encode::CN -- Continental China

-  Standard DOS/Win Macintosh Comment
+  Standard DOS/Win Macintosh Comment/Reference
   ----------------------------------------------------------------
-  euc-cn MacChineseSimp GB2312 is aliased to this
-  (gbk) cp936 GBK is aliased to to this
-  gb12345-raw GB12345 as is
-  gb2312-raw GB2312 as is
+  euc-cn(*1) MacChineseSimp
+  (gbk) cp936 (*2)
+  gb12345-raw { GB12345 without CES }
+  gb2312-raw { GB2312 without CES }
   hz
   iso-ir-165
   ----------------------------------------------------------------

+  (*1) GB2312 is aliased to this. see L
+  (*2) gbk is aliased to this. see L
+
 =item Encode::JP -- Japan

-  Standard DOS/Win Macintosh Comment/Reference
+  Standard DOS/Win Macintosh Comment/Reference
   ----------------------------------------------------------------
   euc-jp
   shiftjis cp932 macJapanese
-  7bit-jis jis
-  euc-jp ujis
-  iso-2022-jp [RFC1468]
-  iso-2022-jp-1 [RFC2237]
+  7bit-jis
+  euc-jp
+  iso-2022-jp [RFC1468]
+  iso-2022-jp-1 [RFC2237]
+  jis0201-raw { JIS X 0201 (roman + halfwidth kana) without CES }
+  jis0208-raw { JIS X 0208 (Kanji + fullwidth kana) without CES }
+  jis0212-raw { JIS X 0212 (Extended Kanji) without CES }
   ----------------------------------------------------------------

 =item Encode::KR -- Korea

+  Standard DOS/Win Macintosh Comment/Reference
   ----------------------------------------------------------------
   euc-kr MacKorean [RFC1557]
-  cp949 ks_c_5601-1987 is an alias
-  thereof.
+  cp949 (*)
   iso-2022-kr [RFC1557]
   johab [KS X 1001:1998, Annex 3]
-  ksc5601-raw KSC5601 as is
+  ksc5601-raw { KSC5601 without CES }
   ----------------------------------------------------------------

+  (*) ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to
+  this. See below.
+
+
 =item Encode::TW -- Taiwan

+  Standard DOS/Win Macintosh Comment/Reference
   ----------------------------------------------------------------
   big5 cp950 MacChineseTrad
   big5-hkscs
@@ -207,6 +228,7 @@ respective document pages.
 Due to size concerns, additional Chinese encodings below are
 distributed separately on CPAN, under the name Encode::HanExtra.

+  Standard DOS/Win Macintosh Comment/Reference
   ----------------------------------------------------------------
   gb18030
   euc-tw
@@ -336,7 +358,7 @@ interchangeably. But just as using the term byte and character is
 dangerous and should be differenciated when needed, we need to
 differenciate I and I.

-To understand that, it's follow how we make computers grok our character.
+To understand that, it's follow how we make computers grok our characters.

 =over 4

@@ -418,16 +440,16 @@ such communication.

 =item *

-To (en|de) code Encodings marked as C<(*)>, You need
+To (en|de) code Encodings marked as C<(**)>, You need
 C, available from CPAN.

 =back

 Encoding names

-  US-ASCII UTF-8 ISO-8859-* KOI8-R
-  Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
-  EUC-KR Big5 GB2312
+  US-ASCII UTF-8 ISO-8859-* KOI8-R
+  Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
+  EUC-KR Big5 GB2312

 are registered to IANA as preferred MIME names and may probably
 be used over the Internet.
@@ -439,10 +461,10 @@ C is the IANA name for C.
 See L for details.

 C I encoding is available as C
-with Encode. See L for details.
+with Encode. See L for details.

   EUC-CN
-  KOI8-U (http://www.faqs.org/rfcs/rfc2319.html)
+  KOI8-U [RFC2319]

 have not been registered with IANA (as of March 2002) but
 seem to be supported by major web browsers.
@@ -454,30 +476,58 @@ is heavily misused.
 See L for details.

 C I encoding is available as C
-with Encode. See L for details.
+with Encode. See L for details.
+
+  UTF-16 UTF-16BE UTF-16LE
+
+are a IANA-registered Cs. See [RFC 2781] for details.
+Jungshik Shin reports that UTF-16 with a BOM is well accepted
+by MS IE 5/6 and NS 4/6.
+Beware however that
+
+=over 2
+
+=item *

-  UTF-16
+C support in any software you're going to be
+using/interoperating with has probably been less tested
+then C support

-=for comment
-waiting for comments from Jungshik Shin to soften this - Anton
+=item *
+
+data coded with C seamlessly passes traditional
+command piping (C, C, etc.) while UTF-16 coded
+data is likely to cause confusion (with it's zero bytes,
+for example)
+
+=item *
+
+it is beyond the power of words to describe the way HTML browsers
+encode non-C form data. To get a general impression refer to
+L.
+While encoding of form data has stabilzed for C coded pages
+(at least IE 5/6, NS 6, Opera 6 behave consitently), be sure to
+expect fun (and cross-browser discrepancies) with C coded
+pages!
+
+=back
+
+The rule of thumb is to use C unless you know what
+you're doing and unless you really need from using C.

-is a IANA-registered preferred MIME name
-but probably should be avoided as encoding for web pages due to
-the lack of browser support.

-  ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html)
+  ISO-IR-165 [RFC1345]
   GBK
   VISCII
   GB 12345
-  GB 18030 (*) (see links bellow)
-  EUC-TW (*)
+  GB 18030 (**) (see links bellow)
+  EUC-TW (**)

 are totally valid encodings but not registered at IANA.
 The names under which they are listed here are probably the most
 widely-known names for these encodings and are recommended names.

-  BIG5PLUS (*)
+  BIG5PLUS (**)

 is a bit proprietary name.

@@ -493,15 +543,14 @@ Microsoft extension to C.

 Proper name: C.

-See
-http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html
+See L
 for details.

-Encode aliases C to C to reflect
-this common misusage.
-I C encoding is available as C.
+Encode aliases C to C to reflect this common
+misusage. I C encoding is available as
+C.

-See L for details.
+See L for details.

 =item GB2312

@@ -515,9 +564,9 @@ C has become a superset of the official C.

 Encode aliases C to C in full agreement with IANA registration.
 C is supported separately.
-I C encoding is available as C.
+I C encoding is available as C.

-See L for details.
+See L for details.

 =item Big5

@@ -568,6 +617,23 @@ have to be able to tell which character set a given byte sequence
 belongs. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
 example of being both a CCS and CES.

+=item charset (in MIME context)
+
+has long been used in the meaning of C, CES.
+
+While C word combination has lost this meaning
+in MIME context since [RFC 2130], C abbreviation has
+retained it. This is how [RFC 2277], [RFC 2278] bless C:
+
+  This document uses the term "charset" to mean a set of rules for
+  mapping from a sequence of octets to a sequence of characters, such
+  as the combination of a coded character set and a character encoding
+  scheme; this is also what is used as an identifier in MIME "charset="
+  parameters, and registered in the IANA charset registry ... (Note
+  that this is NOT a term used by other standards bodies, such as ISO).
+  [RFC 2277]
+
 =item EUC

 Extended Unix Character. See ISO-2022
@@ -575,8 +641,15 @@ Extended Unix Character. See ISO-2022
 =item ISO-2022

 A CES that was carefully designed to coexist with ASCII. There are 7
-bit version and 8 bit version. 8 bit version can conform a CCS. EUC
-and ISO-8859 are two examples thereof.
+bit version and 8 bit version.
+
+7 bit version switches character set via escape sequence so this
+cannot form a CCS. Since this is more difficult to handle in programs
+than the 8 bit version, 7 bit version is not very popular except for
+iso-2022-jp, the de facto standard CES for e-mails.
+
+8 bit version can conform a CCS. EUC and ISO-8859 are two examples
+thereof. pre-5.6 perl could use them as string literals.

 =item UCS

@@ -590,20 +663,20 @@ octets.

 =item Unicode

-A Character Set that aims to include all character character
-repertoire of the world. Many character sets in various national as
-well as industorial standards are therefore a subset thereof.
+A Character Set that aims to include all character repertoire of the
+world. Many character sets in various national as well as industorial
+standards have become, in a way, just subsets of Unicode.

 =item UTF

-Short for I. Determinse how to map a
+Short for I. Determines how to map a
 unicode character into byte sequnece.

 =item UTF-16

 A UTF in 16-bit encoding. Can either be in big endian or little
-endian. Big endian version is called UTF-16BE and little endian
-version is UTF-16LE.
+endian. Big endian version is called UTF-16BE (equals to UCS-2 +
+Surrogate Support) and little endian version is UTF-16LE.

 =back

@@ -658,7 +731,7 @@ L
 =item RFC

 Request For Comment -- need I say more?
-L
+L, L

 =item UC

@@ -683,7 +756,7 @@ The glossary of this document is based opon this site.

 =item czyborra.com

-
+L

 Contains a a lot of useful information, especially gory details of ISO
 vs. vendor mappings.
@@ -698,6 +771,37 @@ L

 You will find brief info on C, C and mostly on C

+=item Jungshik Shin's Hangul FAQ
+
+L
+
+And especially it's subject 8
+
+L
+
+a comprehensive overview of the Korean (C) standards.
+
+=back
+
+=head2 Offline sources
+
+=over 2
+
+=item C by Ken Lunde
+
+CJKV Information Processing
+1999 O'Reilly & Associates, ISBN : 1-56592-224-7
+
+The modern successor of the C.
+
+Features a comprehensive coverage on CJKV character sets and
+encodings along with many other issues faced by anyone trying
+to better support CJKV languages/scripts in all the areas of
+information processing.
+
+To purchase this book visit
+L
+
 =back

 =cut