Upgrade to Encode 1.11, from Dan Kogai.
=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored. In addition, an encoding may have aliases.
Each encoding has one "canonical" name. The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence:

  o The MIME name as defined in IETF RFCs.
  o The name in the IANA registry.
  o The name used by the organization that defined it.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses the encoding object internally
once an operation is in progress.
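
The naming rules above can be sketched as follows, assuming a Perl with
the bundled Encode module. C<find_encoding> returns the encoding object,
and its C<name> method reports the canonical name. The three spellings
used here are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up the same encoding under an alias, a differently cased alias,
# and an upper-case spelling of the canonical name.
my @canonical = map { find_encoding($_)->name } qw(latin1 Latin1 ISO-8859-1);

# Each lookup should report the same canonical name.
print "$_\n" for @canonical;
```

Holding on to the object returned by C<find_encoding> avoids repeating
the alias lookup for every operation.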

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'. In
other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available. Encode.pm will automatically load those modules on demand.
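
A minimal sketch of on-demand loading, assuming the bundled Encode
distribution: only C<Encode> itself is loaded, yet C<koi8-r> (which is
listed under Encode::Byte) is usable. The sample string is an arbitrary
choice.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Four Cyrillic letters given as Unicode code points.
my $text  = "\x{041F}\x{0435}\x{0440}\x{043B}";

# No "use Encode::Byte" appears above; Encode is expected to load
# the implementing module when the encoding is first requested.
my $bytes = encode("koi8-r", $text);
my $back  = decode("koi8-r", $bytes);

printf "%d characters -> %d bytes\n", length($text), length($bytes);
print "round trip ok\n" if $back eq $text;
```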

=head2 Built-in Encodings

The following encodings are always available.

  Canonical     Aliases               Comments & References
  ----------------------------------------------------------------
  iso-8859-1    latin1                [ISO]
  US-ascii      ascii                 [ECMA]
  UCS-2         ucs2, iso-10646-1     [IANA, et al]
  UCS-2l
  UTF-8         utf8                  [RFC2279]
  ----------------------------------------------------------------

=head2 Encode::Byte

The following encodings are single-byte encodings implemented as
extensions of ASCII. In most cases they use \x80-\xff (the upper half)
to map non-ASCII characters.

  ----------------------------------------------------------------
  # ISO 8859 series
  (iso-8859-1 is in built-in)
  iso-8859-2    latin2                [ISO]
  iso-8859-3    latin3                [ISO]
  iso-8859-4    latin4                [ISO]
  iso-8859-5                          [ISO]
  iso-8859-6                          [ISO]
  iso-8859-7                          [ISO]
  iso-8859-8                          [ISO]
  iso-8859-9    latin5                [ISO]
  iso-8859-10   latin6                [ISO]
  iso-8859-11
  (iso-8859-12 is nonexistent)
  iso-8859-13   latin7                [ISO]
  iso-8859-14   latin8                [ISO]
  iso-8859-15   latin9                [ISO]
  iso-8859-16   latin10               [ISO]

  # Cyrillic
  koi8-f
  koi8-r                              [RFC1489]
  koi8-u                              [RFC2319]

  # Vietnamese
  viscii

  # all cp* are also available as ibm-*, ms-*, and windows-*
  # also see L<http://msdn.microsoft.com/workshop/author/dhtml/reference/charsets/charset4.asp>
  cp1250        WinLatin2
  cp1251        WinCyrillic
  cp1252        WinLatin1
  cp1253        WinGreek
  cp1254        WinTurkish
  cp1255        WinHebrew
  cp1256        WinArabic
  cp1257        WinBaltic
  cp1258        WinVietnamese

  # Macintosh
  # Also see L<http://developer.apple.com/technotes/tn/tn1150.html>
  MacCentralEurRoman
  MacCroatian
  MacRoman
  MacCyrillic
  MacRomanian
  MacSami
  MacGreek
  MacThai
  MacIceland
  MacTurkish
  MacUkrainian

  # More vendor encodings
  nextstep
  gsm0338       # used in GSM handsets
  hp-roman8
  ----------------------------------------------------------------
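
To illustrate the "upper half" mapping described at the top of this
section, here is a small sketch that encodes one ASCII character and
one non-ASCII character. The choice of C<iso-8859-15>, C<cp1252> and
the EURO SIGN is only an example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# An ASCII character keeps its code point in these two encodings.
my $ascii = encode("iso-8859-15", "A");

# A non-ASCII character lands in the upper half (\x80-\xff), and the
# byte chosen differs between encodings: EURO SIGN is U+20AC.
my $euro_latin9 = encode("iso-8859-15", "\x{20AC}");
my $euro_cp1252 = encode("cp1252",      "\x{20AC}");

printf "A    -> %s\n", unpack("H*", $ascii);
printf "euro -> %s (iso-8859-15), %s (cp1252)\n",
    unpack("H*", $euro_latin9), unpack("H*", $euro_cp1252);
```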

=head2 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs. Charset"
below. Also note that these are implemented in a distinct module for
each language, due to size concerns. Please also refer to their
respective documentation pages.

=over 4

=item Encode::CN -- Continental China

  ----------------------------------------------------------------
  cp936         gbk
  euc-cn        gb2312
  gb12345-raw
  gb2312-raw
  hz
  iso-ir-165
  ----------------------------------------------------------------

=item Encode::JP -- Japan

  ----------------------------------------------------------------
  7bit-jis      jis
  cp932         ms_Kanji
  euc-jp        ujis
  iso-2022-jp                         [RFC1468]
  iso-2022-jp-1                       [RFC2237]
  macJapan
  shiftjis      Shift_JIS, sjis
  ----------------------------------------------------------------

=item Encode::KR -- Korea

  ----------------------------------------------------------------
  euc-kr
  cp949         ks_c_5601-1987, x-windows-949, uhc
  iso-2022-kr                         [RFC1557]
  johab
  ksc5601-raw
  ----------------------------------------------------------------

=item Encode::TW -- Taiwan

  ----------------------------------------------------------------
  big5
  big5-hkscs
  cp950
  ----------------------------------------------------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  ----------------------------------------------------------------
  gb18030
  euc-tw
  big5plus
  ----------------------------------------------------------------

=back

=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See L<perlebcdic> for details.

  ----------------------------------------------------------------
  cp1047
  cp37
  posix-bc
  ----------------------------------------------------------------

=item Encode::Symbol

For symbols and dingbats.

  ----------------------------------------------------------------
  symbol
  dingbats
  macDingbats
  ----------------------------------------------------------------

=back

=head1 Unsupported encodings

The following are not supported as yet, some because they are rarely
used, some because of technical difficulty. They may be supported by
external modules via CPAN in the future, however.

=over 4

=item ISO-2022-JP-2 [RFC1554]

Not very popular yet. Needs the Unicode Database or equivalent to
implement encode(), because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap. So you
need to look up the database to determine which character set a given
Unicode character should belong to.

=item ISO-2022-CN [RFC1922]

Not very popular. Needs CNS 11643-1 and 2, which are not available in
this module. CNS 11643 is supported (via euc-tw) in
Encode::HanExtra. Autrijus may add support for this encoding in his
module in the future.

=item various HP-UX encodings

The following are unsupported due to the lack of mapping data.

  '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
  '15' - japanese15, korean15, and roi15

=item Cyrillic encoding ISO-IR-111

Anton doubts its usefulness.

=item ISO-8859-8-1 [Hebrew]

None of the Encode team knows Hebrew well enough. Contributions welcome.

=item Thai encoding TCVN

Ditto.

=item Vietnamese encodings VPS

Ditto.

=item various Mac encodings

The following are unsupported due to the lack of mapping data. The
"Mac" prefix of the encoding names is omitted.

  Arabic, Armenian, Bengali, Burmese
  ChineseSimp, ChineseTrad, Devanagari, Ethiopic, ExtArabic
  Farsi, Georgian, Gujarati, Gurmukhi, Hebrew
  Kannada, Khmer, Korean, Laotian, Malayalam, Mongolian
  Oriya, Sinhalese, Symbol, Tamil, Telugu, Tibetan, Vietnamese

The Mac encodings that are already available are based upon the vendor
mappings available at L<http://www.unicode.org/>

=back

=head1 Encoding vs. Charset

Character encoding (or just "encoding") and Character Set (or just
"charset") are often used interchangeably, but they are different
concepts.

=over 2

=item Character I<Set> (I<charset> for short)

Is a collection of characters in which each character is distinguished
by a unique ID (in most cases, the ID is a number).

=item Character I<Encoding>

Is a way to represent character set(s) in a stream of bits.

=back

A character encoding may contain a single character set
(e.g. US-ascii) or multiple character sets (e.g. EUC-JP, which contains
US-ascii, JIS X 0201 Kana, JIS X 0208 and JIS X 0212).

A character encoding may also encode a character set as-is (also called
a I<raw> encoding, e.g. US-ascii) or processed (e.g. in EUC-JP, US-ascii
is as-is, JIS X 0201 is prepended with \x8E, JIS X 0208 has 0x8080
added to it, and JIS X 0212 has 0x8080 added to it and is then
prepended with \x8F).
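
The EUC-JP rules above can be checked by hand. The sketch below, which
assumes Encode::JP is installed alongside Encode, builds the bytes for
one JIS X 0208 character and one JIS X 0201 Kana character by
arithmetic and compares them with what C<encode> produces. The two
sample characters are arbitrary choices.

```perl
use strict;
use warnings;
use Encode qw(encode);

# HIRAGANA LETTER A (U+3042) is 0x2422 in JIS X 0208.
my $jis0208     = 0x2422;
my $kanji_hand  = pack("n", $jis0208 + 0x8080);   # 16-bit big-endian: a4 a2
my $kanji_enc   = encode("euc-jp", "\x{3042}");

# HALFWIDTH KATAKANA LETTER A (U+FF71) is 0xB1 in JIS X 0201 Kana.
my $kana_hand   = "\x8E\xB1";                     # prefix byte, then the code
my $kana_enc    = encode("euc-jp", "\x{FF71}");

printf "JIS X 0208: by hand %s, by Encode %s\n",
    unpack("H*", $kanji_hand), unpack("H*", $kanji_enc);
printf "JIS X 0201: by hand %s, by Encode %s\n",
    unpack("H*", $kana_hand), unpack("H*", $kana_enc);
```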

As the name suggests, the Encode module supports encodings, not
individual charsets.

However, the word I<charset> is casually used, even by the Internet
Assigned Numbers Authority, to actually mean I<encoding>. Encode tries
to accommodate this usage via aliases. For instance,
C<gb2312> is aliased to C<euc-cn>, while the "raw" encoded version is
available as C<gb2312-raw>.
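
As a rough illustration of the C<gb2312> alias, assuming Encode::CN is
installed: the sketch resolves the alias, then encodes one Han
character (chosen arbitrarily) in both the EUC form and the raw form
so the two can be compared.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Canonical name behind the "gb2312" alias.
my $alias_target = find_encoding("gb2312")->name;

my $han    = "\x{4E2D}";                    # U+4E2D, a common Han character
my $cooked = encode("euc-cn",     $han);    # EUC form, high bit set on each byte
my $raw    = encode("gb2312-raw", $han);    # bare GB 2312 code

# Difference between the two 16-bit values.
my $offset = unpack("n", $cooked) - unpack("n", $raw);

printf "gb2312 resolves to %s\n", $alias_target;
printf "euc-cn %s, gb2312-raw %s, offset 0x%04X\n",
    unpack("H*", $cooked), unpack("H*", $raw), $offset;
```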

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

=over 2

=item *

To encode or decode the encodings marked with C<(*)>, you need
C<Encode::HanExtra>, available from CPAN.

=back

Encoding names

  US-ASCII    UTF-8    ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
  EUC-KR      Big5

are registered with IANA as preferred MIME names and may
be used over the Internet.

C<Shift_JIS> is no longer Microsoft proprietary since it has been
made official by JIS X 0208-1997.

  EUC-CN

has not been registered with IANA (as of March 2002) but
seems to be supported by major web browsers. In Encode, GB2312
is aliased to EUC-CN, with the "uncooked" version of GB2312 canonicalized
as gb2312-raw. See L<Encode::CN> for details.

  KS_C_5601-1987

has been registered with IANA, but when it is used, it is
EUC-coded. The Internet community in Korea is not happy with this,
so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version
of C<euc-kr>, with ksc5601-raw for the "uncooked" version.
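
The C<KS_C_5601-1987> alias can be inspected in the same way. This is
only a sketch, assuming Encode::KR is installed; the Hangul syllable is
an arbitrary sample that exists in both C<euc-kr> and C<cp949>.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Canonical name behind the IANA-registered name.
my $canonical = find_encoding("KS_C_5601-1987")->name;

# For a character covered by euc-kr, the enhanced cp949 is expected
# to produce the same bytes.
my $hangul   = "\x{D55C}";
my $as_euckr = encode("euc-kr", $hangul);
my $as_cp949 = encode("cp949",  $hangul);

printf "KS_C_5601-1987 resolves to %s\n", $canonical;
printf "euc-kr %s, cp949 %s\n", unpack("H*", $as_euckr), unpack("H*", $as_cp949);
```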

  UTF-16
  KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)

are IANA-registered (C<UTF-16> even as a preferred MIME name)
but probably should be avoided as encodings for web pages due to
the lack of browser support.

  ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
  GBK
  VISCII
  GB 12345
  GB 18030 (*)  (see links below)
  EUC-TW   (*)

are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are the recommended
names.

  BIG5PLUS (*)

is a somewhat proprietary name.

=head1 Bookmarks

=over 2

=item Assigned Charset Names by IANA

L<http://www.iana.org/assignments/character-sets>

Most of the C<canonical names> in Encode derive from this list,
so you can directly apply the strings you have extracted from the MIME
headers of mail and web pages.

=item CJK.inf

L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Somewhat obsolete (last update in 1996), but still useful. Also try

L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

You will find brief info on C<EUC-CN> and C<GBK>, and mostly on C<GB 18030>.

=item ECMA-035 (eq C<ISO-2022>)

L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

The very specification of ISO-2022 is available from the link above.

=back

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>

=cut

I could not find this page because the hostname doesn't resolve!

 Brief description for most of the mentioned CJK encodings
L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>