3 Encode::Supports -- Supported encodings by Encode
9 Encoding names are case insensitive. White space in names
10 is ignored. In addition an encoding may have aliases.
11 Each encoding has one "canonical" name. The "canonical"
12 name is chosen from the names of the encoding by picking
13 the first in the following sequence:
15 o The MIME name as defined in IETF RFCs.
16 o The name in the IANA registry.
17 o The name used by the organization that defined it.
19 Because of all the alias issues, and because in the general case
20 encodings have state, "Encode" uses the encoding object internally
21 once an operation is in progress.
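As a minimal sketch of the name handling described above, the following looks up one encoding under several spellings and prints the canonical name each one resolves to. It assumes the C<find_encoding> function that Encode exports on request; the spellings chosen (including the C<latin1> alias) are only illustrations.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Different spellings of the same encoding: case differs, or an alias is used.
my @names = ('iso-8859-1', 'ISO-8859-1', 'latin1', 'Latin1');

for my $name (@names) {
    # find_encoding returns an encoding object, or undef if the name is unknown.
    my $enc = find_encoding($name)
        or die "unknown encoding: $name";
    # ->name reports the canonical name of that object.
    printf "%-12s => %s\n", $name, $enc->name;
}

# The object can be kept and reused, so the name is resolved only once.
my $latin1 = find_encoding('latin1');
my $octets = $latin1->encode("caf\x{e9}");
printf "encoded length: %d\n", length $octets;
```

Holding on to the object rather than the name mirrors what the text says Encode does internally once an operation is in progress.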
23 =head1 Supported Encodings
25 As of Perl 5.8.0, at least the following encodings are recognized.
26 Note that unless otherwise specified, they are all case insensitive
27 (via alias) and all occurrences of spaces are replaced with '-'. In
28 other words, "ISO 8859 1" and "iso-8859-1" are identical.
30 Encodings are categorized and implemented in several different modules,
31 but in most cases you don't have to C<use Encode::XX> to make them
32 available. Encode.pm will automatically load those modules on demand.
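The automatic loading can be illustrated with a short sketch: only C<Encode> itself is loaded explicitly, yet a CJK encoding implemented in a separate module is usable. The sample character and the variable names are illustrative choices, not part of the original text.

```perl
use strict;
use warnings;
use Encode qw(encode decode);    # note: no "use Encode::JP" here

# HIRAGANA LETTER A (U+3042) as a Perl character string.
my $text = "\x{3042}";

# Encode loads the module that implements euc-jp when it is first needed.
my $octets = encode('euc-jp', $text);
printf "euc-jp octets: %s\n",
    join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;

# Round trip back to characters.
my $back = decode('euc-jp', $octets);
print "round trip ok\n" if $back eq $text;
```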
34 =head2 Built-in Encodings
36 The following encodings are always available.
39 -----------------------
42 UCS-2 ucs2, iso-10646-1
45 -----------------------
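To see which encodings a particular installation actually provides, a sketch along the following lines may help. It assumes the C<encodings> class method of Encode, called once without arguments (encodings already loaded) and once with C<":all"> (everything Encode knows how to load); the output depends on the installed version.

```perl
use strict;
use warnings;
use Encode;

# Encodings already loaded into the running interpreter.
my @loaded = Encode->encodings();

# Every encoding Encode knows how to load, including those that live
# in separate Encode::XX modules.
my @all = Encode->encodings(':all');

printf "%d loaded, %d available in total\n", scalar @loaded, scalar @all;
print "  $_\n" for @loaded;
```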
49 The following encodings are based on single-byte encodings implemented as
50 extended ASCII. In most cases they use \x80-\xff (upper half) to map
53 -----------------------
54 (iso-8859-1 is in built-in)
65 (iso-8859-12 is nonexistent)
75 viscii # ASCII + Vietnamese
86 # all cp* are also available as ibm-* and ms-*
101 gsm0338 # used in GSM handsets
102 roman8 # what is this?
103 -----------------------
105 =head2 The CJK: Chinese, Japanese, Korean (Multibyte)
107 Note that Vietnamese is listed above. Also read "Encoding vs. Charset"
108 below. Also note that these are implemented in distinct modules by
109 language, due to size concerns. Please also refer to their
110 respective documentation pages.
114 =item Encode::CN -- Continental China
116 -----------------------
123 -----------------------
125 =item Encode::JP -- Japan
127 -----------------------
134 shiftjis Shift_JIS, sjis
135 -----------------------
137 =item Encode::KR -- Korea
139 -----------------------
143 -----------------------
145 =item Encode::TW -- Taiwan
147 -----------------------
151 -----------------------
153 =item Encode::HanExtra -- More Chinese via CPAN
155 Due to size concerns, additional Chinese encodings below are
156 distributed separately on CPAN, under the name Encode::HanExtra.
158 -----------------------
162 -----------------------
166 =head2 Miscellaneous encodings
172 See perlebcdic for details.
174 -----------------------
178 -----------------------
180 =item Encode::Symbols
182 For symbols and dingbats.
184 -----------------------
188 -----------------------
192 =head1 Encoding vs. Charset
194 Character encoding (or just "encoding") and Character Set (or just
195 "charset") are often used interchangeably but they are different
198 A charset determines which characters may be included in a given text.
200 An encoding maps the characters of one or more charsets to a stream of bits.
202 Note that a given encoding may contain multiple charsets; complex CJK
203 encodings are usually implemented that way.
205 For instance, euc-jp contains ASCII, JIS X 0201-1978 (Hankaku Kana),
206 JIS X 0208-1997 (Zenkaku Kana and Kanji) and JIS X 0212-1990 (Extended
207 Kanji) in a single encoding.
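The way one encoding carries several charsets can be seen in the octets it produces. The sketch below encodes one sample character from three of the charsets named above into euc-jp and prints the result in hex; the helper C<euc_jp_hex> and the sample characters are hypothetical names and choices made for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode one character to euc-jp and format the octets.
sub euc_jp_hex {
    my ($char) = @_;
    return join ' ', map { sprintf '%02X', $_ }
        unpack 'C*', encode('euc-jp', $char);
}

# One sample character from each component charset (illustrative choices).
my @samples = (
    [ 'ASCII',                     "A"        ],   # LATIN CAPITAL LETTER A
    [ 'JIS X 0201 (Hankaku Kana)', "\x{FF71}" ],   # HALFWIDTH KATAKANA LETTER A
    [ 'JIS X 0208',                "\x{3042}" ],   # HIRAGANA LETTER A
);

for my $sample (@samples) {
    my ($charset, $char) = @$sample;
    printf "%-26s %s\n", $charset, euc_jp_hex($char);
}
```

Each charset occupies its own region of the byte space, which is why the octet count differs from one sample to the next.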
209 As the name suggests, the Encode module supports encodings, not
212 =head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
214 This section tries to classify the supported encodings by their
215 applicability for information exchange over the Internet and to
216 choose the most suitable aliases to name them in the context of
223 Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
227 are L<http://www.iana.org/assignments/character-sets>-registered as
228 preferred MIME names and may probably be used over the Internet.
230 C<Shift_JIS> is no longer Microsoft proprietary since it has been
231 made official by JIS X 0208-1997. It is probably the most widespread
232 encoding for Japanese on the Internet.
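When a name is needed for use on the Internet (for example in a C<charset> parameter), the preferred MIME name can be queried from the encoding object rather than typed by hand. This is only a sketch, and it assumes an Encode release recent enough to provide the C<mime_name> method on encoding objects, which is probably newer than the version this text was written for.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

for my $name (qw(shiftjis euc-jp iso-2022-jp)) {
    my $enc = find_encoding($name)
        or die "unknown encoding: $name";
    # mime_name may return undef when no MIME name is known for the encoding.
    my $mime = $enc->mime_name;
    printf "%-12s => %s\n", $enc->name, defined $mime ? $mime : '(none)';
}
```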
236 has not been registered with IANA (as of March 2002) but
237 seems to be supported by major web browsers. (IANA has registered
238 this encoding as C<GB2312>, but C<gb2312> currently has a different
239 meaning to the C<Encode> module. It will probably become an alias to
240 C<EUC-CN> in the future; until then it is safer to avoid using
241 C<gb2312> as an encoding name within Perl).
244 KOI8-U (http://www.faqs.org/rfcs/rfc2319.html)
246 are IANA-registered (C<UTF-16> even as a preferred MIME name)
247 but probably should be avoided as an encoding for web pages due to
248 lack of browser support.
250 ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html)
254 GB 18030 (*) (see links below)
257 are totally valid encodings but not registered at IANA.
258 The names under which they are listed here are probably the
259 most widely-known names for these encodings and are recommended
263 =for comment this used to be listed as supported but
265 do not work @15457 when it's clear they will be uncommented
267 ISO-2022 (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
268 CNS 11643 (only planes 1 and 2 available)
272 is a somewhat proprietary name. C<(*)>-marked encodings belong to
273 C<Encode::HanExtra> available from CPAN.
275 You may probably get some info on CJK encodings at
277 brief description for most of the mentioned CJK encodings
278 L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>
280 several years old, but still useful
281 L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
283 and some in-depth reading for the heroes :-)
284 L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (eq C<ISO-2022>)
286 gives brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
287 F<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
289 The nature of information in this section is most fragile and
290 error-prone; I<probably> is the most popular adverb :)
291 Please feel free to send your comments, disagreements and
292 additions to L<...>. (Note however,
293 that the mission of this document is to cover the
294 C<Encode>-supported encodings only.
300 L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
301 L<Encode::EBCDIC>, L<Encode::Symbol>