Commit | Line | Data |
5d030b67 |
1 | =head1 NAME |
2 | |
64ffdd5e |
3 | Encode::Supports -- Supported encodings by Encode |
5d030b67 |
4 | |
5 | =head1 DESCRIPTION |
6 | |
5129552c |
7 | =head2 Encoding Names |
5d030b67 |
8 | |
9 | Encoding names are case insensitive. White space in names |
10 | is ignored. In addition an encoding may have aliases. |
11 | Each encoding has one "canonical" name. The "canonical" |
12 | name is chosen from the names of the encoding by picking |
13 | the first in the following sequence: |
14 | |
15 | o The MIME name as defined in IETF RFCs. |
16 | o The name in the IANA registry. |
17 | o The name used by the organization that defined it. |
18 | |
5129552c |
19 | Because of all the alias issues, and because in the general case |
20 | encodings have state, "Encode" uses the encoding object internally |
21 | once an operation is in progress. |
5d030b67 |
22 | |
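The alias handling described above can be observed from the outside. The following is a minimal sketch, assuming a Perl installation with the Encode module; the helper name canonical_name is hypothetical and exists only for this illustration. It asks Encode for the encoding object behind several spellings and prints the canonical name of each.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# canonical_name() is a hypothetical helper for this sketch. It returns
# the canonical name Encode resolves a spelling to, or undef if unknown.
sub canonical_name {
    my ($name) = @_;
    my $enc = find_encoding($name);    # an encoding object, or undef
    return defined $enc ? $enc->name : undef;
}

for my $name (qw(latin1 ISO-8859-1 iso-8859-1 no-such-encoding)) {
    my $canon = canonical_name($name);
    printf "%-18s => %s\n", $name, defined $canon ? $canon : '(unknown)';
}
```

Working with the object returned by find_encoding(), rather than with the name string, mirrors what the text says Encode does internally once an operation is in progress.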
5129552c |
23 | =head1 Supported Encodings |
5d030b67 |
24 | |
25 | As of Perl 5.8.0, at least the following encodings are recognized. |
26 | Note that unless otherwise specified, they are all case insensitive |
a63c962f |
27 | (via alias) and all occurrences of spaces are replaced with '-'. In |
5d030b67 |
28 | other words, "ISO 8859 1" and "iso-8859-1" are identical. |
29 | |
5129552c |
30 | Encodings are categorized and implemented in several different modules, |
31 | but in most cases you don't have to C<use Encode::XX> to make them |
32 | available. Encode.pm will automatically load those modules on demand. |
5d030b67 |
33 | |
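As a rough illustration of the automatic loading mentioned above, the sketch below uses euc-jp without an explicit C<use Encode::JP>. It assumes the standard Encode distribution is installed; the sample string is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# No "use Encode::JP" here: Encode is expected to load the module that
# implements euc-jp the first time the encoding is requested.
my $text  = "\x{65E5}\x{672C}\x{8A9E}";    # three kanji, as characters
my $bytes = encode("euc-jp", $text);       # characters -> bytes
my $back  = decode("euc-jp", $bytes);      # bytes -> characters

print "round trip ok\n"     if $back eq $text;
print "Encode::JP loaded\n" if $INC{"Encode/JP.pm"};
```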
5129552c |
34 | =head2 Built-in Encodings |
5d030b67 |
35 | |
5129552c |
36 | The following encodings are always available. |
5d030b67 |
37 | |
67d7b5ef |
38 | Canonical Aliases Comments & References |
39 | ---------------------------------------------------------------- |
40 | iso-8859-1 latin1 [ISO] |
41 | US-ascii ascii [ECMA] |
42 | UCS-2 ucs2, iso-10646-1 [IANA, et al] |
43 | UCS-2l |
44 | UTF-8 utf8 [RFC2279] |
45 | ---------------------------------------------------------------- |
5d030b67 |
46 | |
5129552c |
47 | =head2 Encode::Byte |
5d030b67 |
48 | |
5129552c |
49 | The following encodings are single-byte encodings implemented as |
50 | extensions of ASCII. In most cases they use \x80-\xff (the upper half) |
51 | to map non-ASCII characters. |
5d030b67 |
52 | |
67d7b5ef |
53 | ---------------------------------------------------------------- |
54 | # ISO 8859 series |
a63c962f |
55 | (iso-8859-1 is in built-in) |
67d7b5ef |
56 | iso-8859-2 latin2 [ISO] |
57 | iso-8859-3 latin3 [ISO] |
58 | iso-8859-4 latin4 [ISO] |
59 | iso-8859-5 [ISO] |
60 | iso-8859-6 [ISO] |
61 | iso-8859-7 [ISO] |
62 | iso-8859-8 [ISO] |
63 | iso-8859-9 latin5 [ISO] |
64 | iso-8859-10 latin6 [ISO] |
5d030b67 |
65 | iso-8859-11 |
66 | (iso-8859-12 is nonexistent) |
67d7b5ef |
67 | iso-8859-13 latin7 [ISO] |
68 | iso-8859-14 latin8 [ISO] |
69 | iso-8859-15 latin9 [ISO] |
70 | iso-8859-16 latin10 [ISO] |
71 | |
72 | # Cyrillic |
73 | koi8-f |
74 | koi8-r [RFC1489] |
75 | koi8-u [RFC2319] |
76 | |
77 | # Vietnamese |
78 | viscii |
79 | |
80 | # all cp* are also available as ibm-*, ms-*, and windows-* |
81 | # also see L<http://msdn.microsoft.com/workshop/author/dhtml/reference/charsets/charset4.asp> |
5d030b67 |
82 | cp1250 WinLatin2 |
83 | cp1251 WinCyrillic |
84 | cp1252 WinLatin1 |
85 | cp1253 WinGreek |
86 | cp1254 WinTurkish |
87 | cp1255 WinHebrew |
88 | cp1256 WinArabic |
89 | cp1257 WinBaltic |
90 | cp1258 WinVietnamese |
64ffdd5e |
91 | |
67d7b5ef |
92 | # Macintosh |
93 | # Also see L<http://developer.apple.com/technotes/tn/tn1150.html> |
94 | MacCentralEurRoman |
95 | MacCroatian |
96 | MacRoman |
97 | MacCyrillic |
98 | MacRomanian |
99 | MacSami |
3ef515df |
100 | MacGreek |
67d7b5ef |
101 | MacThai |
3ef515df |
102 | MacIceland |
67d7b5ef |
103 | MacTurkish |
104 | MacUkrainian |
105 | |
106 | # More vendor encodings |
64ffdd5e |
107 | nextstep |
108 | gsm0338 # used in GSM handsets |
67d7b5ef |
109 | hp-roman8 |
110 | ---------------------------------------------------------------- |
5d030b67 |
111 | |
5129552c |
112 | =head2 The CJK: Chinese, Japanese, Korean (Multibyte) |
5d030b67 |
113 | |
114 | Note that Vietnamese is listed above. Also read "Encoding vs. Charset" |
a63c962f |
115 | below. Also note that these are implemented in distinct modules by |
116 | language, due to size concerns. Please also refer to their |
117 | respective documentation pages. |
5d030b67 |
118 | |
5129552c |
119 | =over 4 |
120 | |
121 | =item Encode::CN -- Continental China |
122 | |
67d7b5ef |
123 | ---------------------------------------------------------------- |
5129552c |
124 | cp936 gbk |
67d7b5ef |
125 | euc-cn gb2312 |
126 | gb12345-raw |
127 | gb2312-raw |
5129552c |
128 | hz |
129 | iso-ir-165 |
67d7b5ef |
130 | ---------------------------------------------------------------- |
5129552c |
131 | |
132 | =item Encode::JP -- Japan |
133 | |
67d7b5ef |
134 | ---------------------------------------------------------------- |
5129552c |
135 | 7bit-jis jis |
67d7b5ef |
136 | cp932 ms_Kanji |
5129552c |
137 | euc-jp ujis |
67d7b5ef |
138 | iso-2022-jp [RFC1468] |
139 | iso-2022-jp-1 [RFC2237] |
140 | macJapan |
5129552c |
141 | shiftjis Shift_JIS, sjis |
67d7b5ef |
142 | ---------------------------------------------------------------- |
5129552c |
143 | |
144 | =item Encode::KR -- Korea |
145 | |
67d7b5ef |
146 | ---------------------------------------------------------------- |
5129552c |
147 | euc-kr |
67d7b5ef |
148 | cp949 ks_c_5601-1987 x-windows-949 uhc |
149 | iso-2022-kr [RFC1557] |
150 | johab |
151 | ksc5601-raw |
152 | ---------------------------------------------------------------- |
5129552c |
153 | |
154 | =item Encode::TW -- Taiwan |
155 | |
67d7b5ef |
156 | ---------------------------------------------------------------- |
5129552c |
157 | big5 |
158 | big5-hkscs |
159 | cp950 |
67d7b5ef |
160 | ---------------------------------------------------------------- |
5129552c |
161 | |
162 | =item Encode::HanExtra -- More Chinese via CPAN |
163 | |
164 | Due to size concerns, the additional Chinese encodings below are |
165 | distributed separately on CPAN, under the name Encode::HanExtra. |
166 | |
67d7b5ef |
167 | ---------------------------------------------------------------- |
5129552c |
168 | gb18030 |
169 | euc-tw |
170 | big5plus |
67d7b5ef |
171 | ---------------------------------------------------------------- |
5129552c |
172 | |
173 | =back |
174 | |
175 | =head2 Miscellaneous encodings |
176 | |
177 | =over 4 |
178 | |
179 | =item Encode::EBCDIC |
5d030b67 |
180 | |
181 | See perlebcdic for details. |
182 | |
67d7b5ef |
183 | ---------------------------------------------------------------- |
5d030b67 |
184 | cp1047 |
185 | cp37 |
186 | posix-bc |
67d7b5ef |
187 | ---------------------------------------------------------------- |
5129552c |
188 | |
a63c962f |
189 | =item Encode::Symbol |
5d030b67 |
190 | |
5129552c |
191 | For symbols and dingbats. |
5d030b67 |
192 | |
67d7b5ef |
193 | ---------------------------------------------------------------- |
5d030b67 |
194 | symbol |
195 | dingbats |
67d7b5ef |
196 | macDingbats |
197 | ---------------------------------------------------------------- |
198 | |
199 | =back |
200 | |
201 | =head1 Unsupported encodings |
202 | |
203 | The following are not supported as yet. Some because they are rarely |
204 | used, some because of technical difficulty. They may be supported by |
205 | external modules via CPAN in the future, however. |
206 | |
207 | =over 4 |
208 | |
209 | =item ISO-2022-JP-2 [RFC1554] |
210 | |
211 | Not very popular yet. Implementing encode() needs the Unicode Database |
212 | or an equivalent, because the encoding includes JIS X 0208/0212, |
213 | KSC5601, and GB2312 simultaneously, and their code points overlap in |
214 | Unicode. You therefore need to look up the database to determine which |
215 | character set a given Unicode character should belong to. |
216 | |
217 | =item ISO-2022-CN [RFC1922] |
218 | |
219 | Not very popular. Needs CNS 11643-1 and 2 which are not available in |
220 | this module. CNS 11643 is supported (via euc-tw) in |
221 | Encode::HanExtra. Autrijus may add support for this encoding in his |
222 | module in the future. |
223 | |
224 | =item various HP-UX encodings |
225 | |
226 | The following are unsupported due to the lack of mapping data. |
227 | |
228 | '8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8 |
229 | '15' - japanese15, korean15, and roi15 |
230 | |
231 | =item Cyrillic encoding ISO-IR-111 |
232 | |
233 | Anton doubts its usefulness. |
234 | |
235 | =item ISO-8859-8-1 [Hebrew] |
236 | |
237 | None of the Encode team knows Hebrew well enough. Contributions are welcome. |
238 | |
239 | =item Thai encoding TCVN |
240 | |
241 | Ditto. |
242 | |
243 | =item Vietnamese encodings VPS |
244 | |
245 | Ditto. |
246 | |
247 | =item various Mac encodings |
248 | |
249 | The following are unsupported due to the lack of mapping data. The |
250 | "Mac" prefix of the encoding names is omitted. |
251 | |
252 | Arabic, Armenian, Bengali, Burmese |
253 | ChineseSimp, ChineseTrad, Devanagari, Ethiopic, ExtArabic |
254 | Farsi, Georgian, Gujarati, Gurmukhi, Hebrew |
255 | Kannada, Khmer, Korean, Laotian, Malayalam, Mongolian |
256 | Oriya, Sinhalese, Symbol, Tamil, Telugu, Tibetan, Vietnamese |
257 | |
258 | The Mac encodings that are already available are based upon the vendor |
259 | mappings available at L<http://www.unicode.org/>. |
5129552c |
260 | |
261 | =back |
5d030b67 |
262 | |
263 | =head1 Encoding vs. Charset |
264 | |
265 | Character encoding (or just "encoding") and Character Set (or just |
266 | "charset") are often used interchangeably but they are different |
267 | concepts. |
268 | |
67d7b5ef |
269 | =over 2 |
270 | |
271 | =item Character I<Set> (I<charset> for short) |
5d030b67 |
272 | |
67d7b5ef |
273 | Is a collection of characters in which each character is distinguished |
274 | by a unique ID (in most cases, the ID is a number). |
5d030b67 |
275 | |
67d7b5ef |
276 | =item Character I<Encoding> |
a63c962f |
277 | |
67d7b5ef |
278 | Is a way to represent character set(s) in a stream of bits. |
279 | |
280 | =back |
281 | |
282 | A character encoding may contain a single character set |
283 | (e.g. US-ascii) or multiple character sets (e.g. EUC-JP: |
284 | US-ascii, JIS X 0201 Kana, JIS X 0208 and JIS X 0212). |
285 | |
286 | A character encoding may also encode a character set as-is (also called |
287 | a I<raw> encoding, e.g. US-ascii) or processed (e.g. EUC-JP: US-ascii is |
288 | as-is, JIS X 0201 is prefixed with \x8E, JIS X 0208 has 0x8080 added, |
289 | and JIS X 0212 has 0x8080 added and is then prefixed with \x8F). |
5d030b67 |
290 | |
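The EUC-JP processing rules above can be checked by hand. This is a sketch under stated assumptions: the two helper subs are hypothetical names introduced only for this illustration, and the charset code points used (0x2422 for HIRAGANA LETTER A in JIS X 0208, 0xB1 for HALFWIDTH KATAKANA LETTER A in JIS X 0201) are supplied here, not taken from the text. The hand-built bytes are compared with what Encode itself produces.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: JIS X 0208 code point -> EUC-JP bytes
# (add 0x8080 to the two-byte code, then emit it big-endian).
sub euc_from_jisx0208 {
    my ($code) = @_;
    return pack("n", $code + 0x8080);
}

# Hypothetical helper: JIS X 0201 kana byte -> EUC-JP bytes
# (prefix the single byte with \x8E).
sub euc_from_jisx0201_kana {
    my ($byte) = @_;
    return "\x8E" . chr($byte);
}

my $hiragana_a  = euc_from_jisx0208(0x2422);       # U+3042
my $halfwidth_a = euc_from_jisx0201_kana(0xB1);    # U+FF71

print "JIS X 0208 rule matches Encode\n"
    if $hiragana_a eq encode("euc-jp", "\x{3042}");
print "JIS X 0201 rule matches Encode\n"
    if $halfwidth_a eq encode("euc-jp", "\x{FF71}");
```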
291 | As the name suggests, the Encode module supports encodings, not |
292 | individual charsets. |
293 | |
67d7b5ef |
294 | However, the word I<charset> is casually used, even by the Internet |
295 | Assigned Numbers Authority (IANA), to actually mean I<encoding>. Encode |
296 | tries to accommodate this usage via aliases. For instance, |
297 | C<gb2312> is aliased to C<euc-cn>, while the "raw" encoded version is |
298 | available as C<gb2312-raw>. |
299 | |
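The difference between the C<gb2312> alias and C<gb2312-raw> can be made visible with a short sketch. It assumes the standard Encode distribution (which ships Encode::CN); the sample character is arbitrary, and the relationship checked at the end (each byte of the raw code with its high bit set) is the usual description of EUC-CN rather than something stated in this document.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

my $char   = "\x{4E2D}";                     # a common CJK ideograph
my $cooked = encode("gb2312", $char);        # alias, expected to act as euc-cn
my $raw    = encode("gb2312-raw", $char);    # the bare GB2312 charset code

printf "gb2312 resolves to %s\n", find_encoding("gb2312")->name;
printf "cooked: %s\n", unpack("H*", $cooked);
printf "raw:    %s\n", unpack("H*", $raw);

# Set the high bit of every byte of the raw code and compare.
my $raw_shifted = join "", map { chr(ord($_) | 0x80) } split //, $raw;
print "raw code with high bits set equals the euc-cn bytes\n"
    if $raw_shifted eq $cooked;
```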
a63c962f |
300 | =head1 Encoding Classification (by Anton Tagunov and Dan Kogai) |
301 | |
302 | This section tries to classify the supported encodings by their |
303 | applicability for information exchange over the Internet and to |
304 | choose the most suitable aliases to name them in the context of |
305 | such communication. |
306 | |
67d7b5ef |
307 | =over 2 |
308 | |
309 | =item * |
310 | |
311 | To (en|de)code encodings marked with C<(*)>, you need C<Encode::HanExtra>, |
312 | available from CPAN. |
313 | |
314 | =back |
315 | |
a63c962f |
316 | Encoding names |
5d030b67 |
317 | |
67d7b5ef |
318 | US-ASCII UTF-8 ISO-8859-* KOI8-R |
a63c962f |
319 | Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1 |
67d7b5ef |
320 | EUC-KR Big5 |
5d030b67 |
321 | |
67d7b5ef |
322 | are registered with IANA as preferred MIME names and can probably be used over the Internet. |
5d030b67 |
323 | |
a63c962f |
324 | C<Shift_JIS> is no longer Microsoft proprietary since it has been |
67d7b5ef |
325 | made official by JIS X 0208-1997. |
5d030b67 |
326 | |
a63c962f |
327 | EUC-CN |
5d030b67 |
328 | |
a63c962f |
329 | has not been registered with IANA (as of March 2002) but |
67d7b5ef |
330 | seems to be supported by major web browsers. In Encode, GB2312 |
331 | is aliased to EUC-CN, with the "uncooked" version of GB2312 canonicalized |
332 | as gb2312-raw. See L<Encode::CN> for details. |
333 | |
334 | KS_C_5601-1987 |
335 | |
336 | has been registered with IANA, but when it is used, the text is actually |
337 | EUC-encoded. The Internet community in Korea is not happy with this, |
338 | so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version |
339 | of C<euc-kr>, with C<ksc5601-raw> for the "uncooked" version. |
5d030b67 |
340 | |
a63c962f |
341 | UTF-16 |
342 | KOI8-U (http://www.faqs.org/rfcs/rfc2319.html) |
5d030b67 |
343 | |
a63c962f |
344 | are IANA-registered (C<UTF-16> even as a preferred MIME name) |
345 | but probably should be avoided as encoding for web pages due to |
67d7b5ef |
346 | the lack of browser support. |
5d030b67 |
347 | |
5d030b67 |
348 | ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html) |
349 | GBK |
350 | VISCII |
a63c962f |
351 | GB 12345 |
352 | GB 18030 (*) (see links below) |
353 | EUC-TW (*) |
5d030b67 |
354 | |
355 | are perfectly valid encodings but are not registered with IANA. |
a63c962f |
356 | The names under which they are listed here are probably the |
357 | most widely-known names for these encodings and are recommended |
358 | names. |
359 | |
67d7b5ef |
360 | BIG5PLUS (*) |
a63c962f |
361 | |
67d7b5ef |
362 | is a somewhat proprietary name. |
5d030b67 |
363 | |
67d7b5ef |
364 | =head1 Bookmarks |
5d030b67 |
365 | |
67d7b5ef |
366 | =over 2 |
a63c962f |
367 | |
67d7b5ef |
368 | =item Assigned Charset Names by IANA |
5d030b67 |
369 | |
67d7b5ef |
370 | L<http://www.iana.org/assignments/character-sets> |
5d030b67 |
371 | |
67d7b5ef |
372 | Most of the C<canonical names> in Encode derive from this list, |
373 | so you can directly apply the string you have extracted from the MIME |
374 | headers of mail messages and web pages. |
375 | |
376 | =item CJK.inf |
5d030b67 |
377 | |
a63c962f |
378 | L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html> |
5d030b67 |
379 | |
67d7b5ef |
380 | Somewhat obsolete (last update in 1996), but still useful. Also try |
381 | |
382 | L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf> |
383 | |
384 | There you will find brief information on C<EUC-CN> and C<GBK>, and mostly on C<GB 18030>. |
a63c962f |
385 | |
67d7b5ef |
386 | =item ECMA-035 (eq C<ISO-2022>) |
a63c962f |
387 | |
67d7b5ef |
388 | L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> |
389 | |
390 | The very specification of ISO-2022 is available from the link above. |
391 | |
392 | =back |
5d030b67 |
393 | |
394 | =head1 See Also |
395 | |
5129552c |
396 | L<Encode>, |
397 | L<Encode::Byte>, |
a63c962f |
398 | L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>, |
5129552c |
399 | L<Encode::EBCDIC>, L<Encode::Symbol> |
5d030b67 |
400 | |
401 | =cut |
67d7b5ef |
402 | |
403 | I could not find this page because the hostname doesn't resolve! |
404 | |
405 | Brief description for most of the mentioned CJK encodings |
406 | L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html> |