3 Encode::Supports -- Supported encodings by Encode
9 Encoding names are case insensitive. White space in names
10 is ignored. In addition an encoding may have aliases.
11 Each encoding has one "canonical" name. The "canonical"
12 name is chosen from the names of the encoding by picking
13 the first in the following sequence:
15 o The MIME name as defined in IETF RFCs.
16 o The name in the IANA registry.
17 o The name used by the organization that defined it.
19 Because of all the alias issues, and because in the general case
20 encodings have state, "Encode" uses the encoding object internally
21 once an operation is in progress.
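As a minimal sketch of the name handling described above, the snippet below looks up one encoding under several spellings with C<find_encoding> and C<resolve_alias> from Encode and prints the canonical name each spelling maps to. The spellings tried here are illustrative choices, not an exhaustive list of aliases.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# Different spellings of the same encoding (illustrative choices).
for my $name (qw(latin1 Latin1 ISO-8859-1 iso-8859-1)) {
    my $enc       = find_encoding($name);            # encoding object, or undef
    my $canonical = $enc ? $enc->name : '(unknown)';
    print "$name => $canonical\n";
}

# resolve_alias() returns the canonical name directly, or false if unknown.
my $resolved = resolve_alias('latin1');
print "resolve_alias: $resolved\n";
```

Holding on to the object returned by C<find_encoding> avoids repeating the alias lookup for every call.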
23 =head1 Supported Encodings
25 As of Perl 5.8.0, at least the following encodings are recognized.
26 Note that unless otherwise specified, they are all case insensitive
27 (via alias) and all occurrences of spaces are replaced with '-'. In
28 other words, "ISO 8859 1" and "iso-8859-1" are identical.
30 Encodings are categorized and implemented in several different modules
31 but in most cases you don't have to C<use Encode::XX> to make them
32 available. Encode.pm will automatically load those modules on demand.
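The on-demand loading can be observed with a small sketch like the one below. It assumes that C<koi8-r> is provided by Encode::Byte (as in the standard Encode distribution) and checks C<%INC> after the first use. The sample character is an arbitrary choice.

```perl
use strict;
use warnings;
use Encode qw(encode);    # note: no explicit "use Encode::Byte"

my $cyrillic_a = "\x{0410}";                    # CYRILLIC CAPITAL LETTER A
my $bytes      = encode('koi8-r', $cyrillic_a); # the providing module is loaded here
print unpack('H*', $bytes), "\n";               # e1

# %INC records every module file that has been loaded so far.
print $INC{'Encode/Byte.pm'}
    ? "Encode::Byte was loaded on demand\n"
    : "Encode::Byte is not loaded\n";
```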
34 =head2 Built-in Encodings
36 The following encodings are always available.
38   Canonical   Aliases              Comments & References
39  ----------------------------------------------------------------
40   iso-8859-1  latin1               [ISO]
42   UCS-2       ucs2, iso-10646-1    [IANA, et al]
45  ----------------------------------------------------------------
49 The following encodings are single-byte encodings implemented as
50 extended ASCII. In most cases they use \x80-\xff (upper half) to map
53 ----------------------------------------------------------------
55 (iso-8859-1 is in built-in)
56 iso-8859-2 latin2 [ISO]
57 iso-8859-3 latin3 [ISO]
58 iso-8859-4 latin4 [ISO]
63 iso-8859-9 latin5 [ISO]
64 iso-8859-10 latin6 [ISO]
66 (iso-8859-12 is nonexistent)
67 iso-8859-13 latin7 [ISO]
68 iso-8859-14 latin8 [ISO]
69 iso-8859-15 latin9 [ISO]
70 iso-8859-16 latin10 [ISO]
80 # all cp* are also available as ibm-*, ms-*, and windows-*
81 # also see L<http://msdn.microsoft.com/workshop/author/dhtml/reference/charsets/charset4.asp>
93 # Also see L<http://developer.apple.com/technotes/tn/tn1150.html>
106 # More vendor encodings
108 gsm0338 # used in GSM handsets
110 ----------------------------------------------------------------
112 =head2 The CJK: Chinese, Japanese, Korean (Multibyte)
114 Note that Vietnamese is listed above. Also read "Encoding vs. Charset"
115 below. Also note that these are implemented in separate modules by
116 language, due to size concerns. Please also refer to their
117 respective documentation pages.
121 =item Encode::CN -- Continental China
123 ----------------------------------------------------------------
130 ----------------------------------------------------------------
132 =item Encode::JP -- Japan
134 ----------------------------------------------------------------
138 iso-2022-jp [RFC1468]
139 iso-2022-jp-1 [RFC2237]
141 shiftjis Shift_JIS, sjis
142 ----------------------------------------------------------------
144 =item Encode::KR -- Korea
146 ----------------------------------------------------------------
148 cp949 ks_c_5601-1987 x-windows-949 uhc
149 iso-2022-kr [RFC1557]
152 ----------------------------------------------------------------
154 =item Encode::TW -- Taiwan
156 ----------------------------------------------------------------
160 ----------------------------------------------------------------
162 =item Encode::HanExtra -- More Chinese via CPAN
164 Due to size concerns, additional Chinese encodings below are
165 distributed separately on CPAN, under the name Encode::HanExtra.
167 ----------------------------------------------------------------
171 ----------------------------------------------------------------
175 =head2 Miscellaneous encodings
181 See perlebcdic for details.
183 ----------------------------------------------------------------
187 ----------------------------------------------------------------
189 =item Encode::Symbol
191 For symbols and dingbats.
193 ----------------------------------------------------------------
197 ----------------------------------------------------------------
201 =head1 Unsupported encodings
203 The following are not supported as yet. Some because they are rarely
204 used, some because of technical difficulty. They may be supported by
205 external modules via CPAN in the future, however.
209 =item ISO-2022-JP-2 [RFC1554]
211 Not very popular yet. Needs Unicode Database or equivalent to
212 implement encode() (because it includes JIS X 0208/0212, KSC5601, and
213 GB2312 simultaneously, whose code points in Unicode overlap, so you
214 need to look up the database to determine which character set a given
215 Unicode character should belong to).
217 =item ISO-2022-CN [RFC1922]
219 Not very popular. Needs CNS 11643-1 and 2 which are not available in
220 this module. CNS 11643 is supported (via euc-tw) in
221 Encode::HanExtra. Autrijus may add support for this encoding in his
224 =item various HP-UX encodings
226 The following are unsupported due to the lack of mapping data.
228 '8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
229 '15' - japanese15, korean15, and roi15
231 =item Cyrillic encoding ISO-IR-111
233 Anton doubts its usefulness.
235 =item ISO-8859-8-1 [Hebrew]
237 None of the Encode team knows Hebrew well enough. Contributions welcome.
239 =item Thai encoding TCVN
243 =item Vietnamese encodings VPS
247 =item various Mac encodings
249 The following are unsupported due to the lack of mapping data. The "Mac"
250 prefix of each encoding name is omitted.
252 Arabic, Armenian, Bengali, Burmese
253 ChineseSimp, ChineseTrad, Devanagari, Ethiopic, ExtArabic
254 Farsi, Georgian, Gujarati, Gurmukhi, Hebrew
255 Kannada, Khmer, Korean, Laotian, Malayalam, Mongolian
256 Oriya, Sinhalese, Symbol, Tamil, Telugu, Tibetan, Vietnamese
258 The rest, which are already available, are based upon the vendor mappings
259 available at L<http://www.unicode.org/>
263 =head1 Encoding vs. Charset
265 Character encoding (or just "encoding") and Character Set (or just
266 "charset") are often used interchangeably but they are different
271 =item Character I<Set> (I<charset> for short)
273 Is a collection of characters in which each character is distinguished
274 by a unique ID (in most cases, the ID is a number).
276 =item Character I<Encoding>
278 Is a way to represent character set(s) in a stream of bits.
282 A character encoding may contain a single character set
283 (e.g. US-ASCII) or multiple character sets (e.g. EUC-JP, which contains
284 US-ASCII, JIS X 0201 Kana, JIS X 0208 and JIS X 0212).
286 A character encoding may also encode a character set as-is (also called
287 a I<raw> encoding, e.g. US-ASCII) or processed (e.g. EUC-JP, where US-ASCII is
288 as-is, JIS X 0201 kana is prefixed with \x8E, JIS X 0208 has 0x8080 added,
289 and JIS X 0212 has 0x8080 added and is then prefixed with \x8F).
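The EUC-JP processing described above can be checked with a short worked example. This sketch builds the bytes by hand from the raw JIS codes and compares them with what C<encode> produces. The two sample characters are arbitrary choices, and their raw codes (0x2422 in JIS X 0208, 0xB1 in JIS X 0201) are taken from the respective code charts.

```perl
use strict;
use warnings;
use Encode qw(encode);

# JIS X 0208: HIRAGANA LETTER A (U+3042) has the raw code 0x2422.
my $jis0208   = 0x2422;
my $by_hand   = pack('n', $jis0208 + 0x8080);    # add 0x8080 => bytes A4 A2
my $by_encode = encode('euc-jp', "\x{3042}");
print unpack('H*', $by_hand), ' ', unpack('H*', $by_encode), "\n";

# JIS X 0201 kana: HALFWIDTH KATAKANA LETTER A (U+FF71) has the raw code 0xB1.
my $kana_hand   = "\x8E" . chr(0xB1);            # prefix with \x8E
my $kana_encode = encode('euc-jp', "\x{FF71}");
print unpack('H*', $kana_hand), ' ', unpack('H*', $kana_encode), "\n";
```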
291 As the name suggests, the Encode module supports encodings, not
294 However, the word I<charset> is casually used, even by the Internet
295 Assigned Numbers Authority, to actually mean I<encoding>. Encode tries
296 to accommodate this usage via aliases. For instance,
297 C<gb2312> is aliased to C<euc-cn>, while the "raw" encoded version is
298 available as C<gb2312-raw>.
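A brief sketch of the C<gb2312> aliasing follows. It resolves the alias and then encodes one character under both names so the EUC-coded and raw forms can be compared. The sample character is an arbitrary choice, and the raw output is printed rather than asserted.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# The casual "charset" name resolves to the EUC-coded encoding.
my $canonical = find_encoding('gb2312')->name;
print "gb2312 => $canonical\n";                  # gb2312 => euc-cn

my $zhong  = "\x{4E2D}";                         # CJK ideograph meaning "middle"
my $cooked = encode('gb2312',     $zhong);       # EUC-coded bytes
my $raw    = encode('gb2312-raw', $zhong);       # raw two-byte code
print unpack('H*', $cooked), "\n";               # d6d0
print unpack('H*', $raw), "\n";
```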
300 =head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
302 This section tries to classify the supported encodings by their
303 applicability for information exchange over the Internet and to
304 choose the most suitable aliases to name them in the context of
311 To (en|de)code encodings marked as C<*>, you need C<Encode::HanExtra>,
312 available from CPAN.
318 US-ASCII UTF-8 ISO-8859-* KOI8-R
319 Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
322 are registered with IANA as preferred MIME names and can probably be used over the Internet.
324 C<Shift_JIS> is no longer Microsoft proprietary since it has been
325 officially defined in JIS X 0208-1997.
329 has not been registered with IANA (as of March 2002) but
330 seems to be supported by major web browsers. In Encode, GB2312
331 is aliased to EUC-CN, with the "uncooked" version of GB2312 canonicalized
332 as gb2312-raw. See L<Encode::CN> for details.
336 has been registered with IANA but when they are used, they are
337 EUC-coded. The Internet community in Korea is not happy with this,
338 so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version
339 of C<euc-kr>, with ksc5601-raw for the "uncooked" version.
342 KOI8-U (http://www.faqs.org/rfcs/rfc2319.html)
344 are IANA-registered (C<UTF-16> even as a preferred MIME name)
345 but probably should be avoided as encodings for web pages due to
346 the lack of browser support.
348 ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html)
352 GB 18030 (*) (see links below)
355 are perfectly valid encodings but are not registered with IANA.
356 The names under which they are listed here are probably the
357 most widely-known names for these encodings and are recommended
362 is a somewhat proprietary name.
368 =item Assigned Charset Names by IANA
370 L<http://www.iana.org/assignments/character-sets>
372 Most of the C<canonical names> in Encode derive from this list
373 so you can directly apply the string you have extracted from the MIME
374 headers of mail and web pages.
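As a hedged illustration of applying a MIME charset string directly, the sketch below extracts the C<charset> parameter from a header and hands it to C<find_encoding>. The header, the body, and the extraction pattern are hypothetical and deliberately simple, and the pattern is not a full MIME parser.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# A hypothetical header and body, as they might arrive in a mail or HTTP response.
my $header = 'Content-Type: text/plain; charset="ISO-8859-1"';
my $body   = "caf\xE9";                          # "cafe" with e-acute, as ISO-8859-1 bytes

# Pull out the charset parameter (simple pattern, assumed adequate for this input).
my ($charset) = $header =~ /charset\s*=\s*"?([^\s";]+)"?/i;

# Guard against names Encode does not know before decoding.
my $enc = defined $charset ? find_encoding($charset) : undef;
die "unsupported charset\n" unless $enc;

my $text = $enc->decode($body);                  # Perl character string
printf "%s: %d characters, last is U+%04X\n",
    $enc->name, length $text, ord substr($text, -1);
```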
378 L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
380 Somewhat obsolete (last update in 1996), but still useful. Also try
382 L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
384 You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
386 =item ECMA-035 (eq C<ISO-2022>)
388 L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
390 The very specification of ISO-2022 is available from the link above.
398 L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
399 L<Encode::EBCDIC>, L<Encode::Symbol>
403 I could not find this page because the hostname doesn't resolve!
405 Brief description for most of the mentioned CJK encodings
406 L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>