3 Encode::Supported -- Supported encodings by Encode
9 Encoding names are case insensitive. White space in names
10 is ignored. In addition, an encoding may have aliases.
11 Each encoding has one "canonical" name. The "canonical"
12 name is chosen from the names of the encoding by picking
13 the first in the following sequence:
15 o The MIME name as defined in IETF RFCs.
16 o The name in the IANA registry.
17 o The name used by the organization that defined it.
19 Because of all the alias issues, and because in the general case
20 encodings have state, "Encode" uses the encoding object internally
21 once an operation is in progress.
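The name handling described above can be sketched with C<Encode::resolve_alias> and C<find_encoding>, both documented in L<Encode>. This is a minimal illustration, not a complete tour of the alias rules; the sample names (C<latin1> and its variants) are chosen here only as an example.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Aliases, in any letter case, resolve to one canonical name.
for my $name (qw(latin1 Latin1 ISO-8859-1)) {
    my $canonical = Encode::resolve_alias($name);
    print "$name => $canonical\n";
}

# An unknown name resolves to a false value.
print "iso-8859-12 is not supported\n"
    unless Encode::resolve_alias("iso-8859-12");

# find_encoding returns the encoding object that Encode works with
# internally; its name method reports the canonical name.
my $enc = find_encoding("latin1");
print ref($enc), " named ", $enc->name, "\n";
```

Holding on to the object returned by C<find_encoding> avoids repeating the alias lookup on every call.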
23 =head1 Supported Encodings
25 As of Perl 5.8.0, at least the following encodings are recognized.
26 Note that unless otherwise specified, they are all case insensitive
27 (via alias) and all occurrences of spaces are replaced with '-'. In
28 other words, "ISO 8859 1" and "iso-8859-1" are identical.
30 Encodings are categorized and implemented in several different modules,
31 but you don't have to C<use Encode::XX> to make them available in
32 most cases. Encode.pm automatically loads those modules on demand.
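As a small sketch of the on-demand loading, the following script uses C<euc-jp>, which lives in C<Encode::JP>, without ever loading that module explicitly. The sample character is an arbitrary choice made for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);    # note: no "use Encode::JP" here

# U+3042 is HIRAGANA LETTER A. euc-jp is provided by Encode::JP,
# which Encode loads by itself the first time the name is looked up.
my $bytes = encode("euc-jp", "\x{3042}");
printf "euc-jp bytes: %s\n", unpack("H*", $bytes);

# Round trip back to a Perl character string.
my $text = decode("euc-jp", $bytes);
printf "decoded: U+%04X\n", ord($text);
```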
34 =head2 Built-in Encodings
36 The following encodings are always available.
39 -----------------------
42 UCS-2 ucs2, iso-10646-1
45 -----------------------
49 The following encodings are single-byte encodings implemented as
50 extended ASCII. Most of them use \x80-\xff (the upper half) to map
53 -----------------------
54 (iso-8859-1 is in built-in)
65 (iso-8859-12 is nonexistent)
75 viscii # ASCII + vietnamese
86 # all cp* are also available as ibm-* and ms-*
100 -----------------------
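To check which of these names a particular installation actually provides, C<< Encode->encodings >> can be queried. The sketch below assumes only the documented C<:all> argument; the filter on C<iso-8859-> is just an example.

```perl
use strict;
use warnings;
use Encode;

# Names of the encodings whose modules are already loaded.
my @loaded = Encode->encodings();

# Names of every encoding Encode can provide, loaded or not.
my @all = Encode->encodings(":all");

printf "%d loaded, %d available\n", scalar @loaded, scalar @all;
print "$_\n" for grep { /^iso-8859-/ } @all;
```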
102 =head2 The CJK: Chinese, Japanese, Korean (Multibyte)
104 Note that Vietnamese is listed above. Also read "Encoding vs. Charset"
105 below. These encodings are implemented in separate modules, one per
106 language, because of size concerns. Please also refer to their
107 respective documentation pages.
111 =item Encode::CN -- Continental China
113 -----------------------
120 -----------------------
122 =item Encode::JP -- Japan
124 -----------------------
131 shiftjis Shift_JIS, sjis
132 -----------------------
134 =item Encode::KR -- Korea
136 -----------------------
140 -----------------------
142 =item Encode::TW -- Taiwan
144 -----------------------
148 -----------------------
150 =item Encode::HanExtra -- More Chinese via CPAN
152 Due to size concerns, additional Chinese encodings below are
153 distributed separately on CPAN, under the name Encode::HanExtra.
155 -----------------------
159 -----------------------
163 =head2 Miscellaneous encodings
169 See perlebcdic for details.
171 -----------------------
175 -----------------------
177 =item Encode::Symbol
179 For symbols and dingbats.
181 -----------------------
184 -----------------------
188 =head1 Encoding vs. Charset
190 Character encoding (or just "encoding") and Character Set (or just
191 "charset") are often used interchangeably but they are different
194 A charset determines which characters may be included in a given text.
196 An encoding maps the characters of one or more charsets to a stream of bits.
198 Note that a given encoding may contain multiple charsets; complex CJK
199 encodings are usually implemented that way.
201 For instance, euc-jp contains ASCII, JIS X 0201-1978 (Hankaku Kana),
202 JIS X 0208-1997 (Zenkaku Kana and Kanji) and JIS X 0212-1990 (Extended
203 Kanji) in a single encoding.
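The following sketch encodes one character from each of those charsets with C<euc-jp> and prints the resulting bytes, so the different byte sequences per charset become visible. The four sample characters are picked for this illustration and are not taken from the text above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One sample character per charset; the picks are illustrative.
my @samples = (
    [ "ASCII",      "A"        ],    # LATIN CAPITAL LETTER A
    [ "JIS X 0201", "\x{FF71}" ],    # HALFWIDTH KATAKANA LETTER A
    [ "JIS X 0208", "\x{3042}" ],    # HIRAGANA LETTER A
    [ "JIS X 0212", "\x{4E02}" ],    # a kanji from the extended set
);

my %bytes_for;
for my $sample (@samples) {
    my ( $charset, $char ) = @$sample;
    my $bytes = encode( "euc-jp", $char );
    $bytes_for{$charset} = $bytes;
    printf "%-10s U+%04X -> %s\n", $charset, ord($char), unpack( "H*", $bytes );
}
```

All four characters end up in the same byte stream, but each charset is reached through its own byte pattern within the encoding.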
205 As the name suggests, the Encode module supports encodings, not
208 =head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
210 This section tries to classify the supported encodings by their
211 applicability for information exchange over the Internet and to
212 choose the most suitable aliases to name them in the context of
219 Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
223 are L<http://www.iana.org/assignments/character-sets>-registered as
224 preferred MIME names and can probably be used over the Internet.
226 C<Shift_JIS> is no longer Microsoft-proprietary since it has been
227 made official by JIS X 0208-1997. It is probably the most
228 widespread encoding for Japanese on the Internet.
232 has not been registered with IANA (as of March 2002) but
233 seems to be supported by major web browsers. (IANA has registered
234 this encoding as C<GB2312>, but C<gb2312> currently has a different
235 meaning to the C<Encode> module. It will probably become an alias for
236 C<EUC-CN> in the future; until then it is safer to avoid using
237 C<gb2312> as an encoding name within Perl).
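Because the meaning of C<gb2312> may differ between C<Encode> releases, one cautious approach is to ask the installed module what each name resolves to before relying on it. This is only a sketch; it makes no claim about what any particular version returns for C<gb2312>.

```perl
use strict;
use warnings;
use Encode;

# Report what each name resolves to on the running Perl. The result
# for "gb2312" depends on the installed Encode version.
my %resolved;
for my $name (qw(euc-cn gb2312)) {
    $resolved{$name} = Encode::resolve_alias($name);
    printf "%-7s => %s\n", $name, $resolved{$name} || "(not recognized)";
}
```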
240 KOI8-U (http://www.faqs.org/rfcs/rfc2319.html)
242 are IANA-registered (C<UTF-16> even as a preferred MIME name)
243 but probably should be avoided as encodings for web pages due to
244 lack of browser support.
246 ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html)
250 GB 18030 (*) (see links below)
253 are valid encodings but are not registered with IANA.
254 The names under which they are listed here are probably the
255 most widely-known names for these encodings and are recommended
259 =for comment this used to be listed as supported but
261 do not work @15457 when it's clear they will be uncommented
263 ISO-2022 (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
264 CNS 11643 (only planes 1 and 2 available)
268 is a somewhat proprietary name. C<(*)>-marked encodings belong to
269 C<Encode::HanExtra> available from CPAN.
271 You can probably find some information on CJK encodings at
273 brief description for most of the mentioned CJK encodings
274 L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>
276 several years old, but still useful
277 L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
279 and some in-depth reading for the heroes :-)
280 L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (eq C<ISO-2022>)
282 gives brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
283 F<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
285 The nature of information in this section is most fragile and
286 error-prone; I<probably> is the most popular adverb :)
287 Please feel free to send your comments, disagreements and
288 additions to L<...>. (Note, however,
289 that the mission of this document is to cover the
290 C<Encode>-supported encodings only.)
296 L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
297 L<Encode::EBCDIC>, L<Encode::Symbol>