Commit | Line | Data |
5d030b67 |
1 | =head1 NAME |
2 | |
0ab8f81e |
3 | Encode::Supported -- Encodings supported by Encode |
5d030b67 |
4 | |
5 | =head1 DESCRIPTION |
6 | |
5129552c |
7 | =head2 Encoding Names |
5d030b67 |
8 | |
9 | Encoding names are case insensitive. White space in names |
0ab8f81e |
10 | is ignored. In addition, an encoding may have aliases. |
5d030b67 |
11 | Each encoding has one "canonical" name. The "canonical" |
12 | name is chosen from the names of the encoding by picking |
a999c27c |
13 | the first in the following sequence (with a few exceptions). |
5d030b67 |
14 | |
44b3b9c7 |
15 | =over 2 |
a999c27c |
16 | |
17 | =item * |
18 | |
962111ca |
19 | The name used by the Perl community. That includes 'utf8' and 'ascii'. |
20 | Unlike aliases, canonical names reach the encoding directly, so |
21 | frequently used names such as 'utf8' don't need alias lookups. |
a999c27c |
22 | |
23 | =item * |
24 | |
0ab8f81e |
25 | The MIME name as defined in IETF RFCs. This includes all "iso-"s. |
a999c27c |
26 | |
27 | =item * |
28 | |
29 | The name in the IANA registry. |
962111ca |
30 | |
a999c27c |
31 | =item * |
32 | |
33 | The name used by the organization that defined it. |
34 | |
35 | =back |
36 | |
37 | Where a I<de jure> canonical name differs from the one used by the |
38 | Encode module, it is always aliased whenever the encoding is implemented. |
39 | So you can safely tell whether a given encoding is implemented just by |
40 | passing the canonical name. |
5d030b67 |
41 | |
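The lookup described above can be tried out with a small sketch. It assumes only C<find_encoding> and C<resolve_alias> as documented in L<Encode>; the sample names (C<latin1>, C<iso-8859-12>) are chosen for illustration.

```perl
use strict;
use warnings;
use Encode ();

# An alias and its canonical name resolve to the same encoding object.
my $enc = Encode::find_encoding('latin1');
print $enc->name, "\n";

# resolve_alias() returns the canonical name, or a false value
# when the encoding is not implemented.
for my $name ('latin1', 'iso-8859-1', 'iso-8859-12') {
    my $canon = Encode::resolve_alias($name);
    printf "%-12s => %s\n", $name, $canon ? $canon : '(not implemented)';
}
```

A false return from C<resolve_alias> is a cheap way to test for support before calling C<encode> or C<decode>.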
5129552c |
42 | Because of all the alias issues, and because in the general case |
962111ca |
43 | encodings have state, "Encode" uses an encoding object internally |
5129552c |
44 | once an operation is in progress. |
5d030b67 |
45 | |
5129552c |
46 | =head1 Supported Encodings |
5d030b67 |
47 | |
48 | As of Perl 5.8.0, at least the following encodings are recognized. |
49 | Note that unless otherwise specified, they are all case insensitive |
962111ca |
50 | (via alias) and all occurrences of spaces are replaced with '-'. |
51 | In other words, "ISO 8859 1" and "iso-8859-1" are identical. |
5d030b67 |
52 | |
5129552c |
53 | Encodings are categorized and implemented in several different modules |
54 | but you don't have to C<use Encode::XX> to make them available in |
962111ca |
55 | most cases. Encode.pm will automatically load those modules on demand. |
5d030b67 |
56 | |
5129552c |
57 | =head2 Built-in Encodings |
5d030b67 |
58 | |
5129552c |
59 | The following encodings are always available. |
5d030b67 |
60 | |
962111ca |
61 | Canonical Aliases Comments & References |
67d7b5ef |
62 | ---------------------------------------------------------------- |
2d06ad02 |
63 | ascii US-ascii ISO-646-US [ECMA] |
f0a41339 |
64 | ascii-ctrl Special Encoding |
962111ca |
65 | iso-8859-1 latin1 [ISO] |
f0a41339 |
66 | null Special Encoding |
962111ca |
67 | utf8 UTF-8 [RFC2279] |
c731e18e |
68 | ---------------------------------------------------------------- |
69 | |
f0a41339 |
70 | I<null> and I<ascii-ctrl> are special. "null" fails for all characters, |
71 | so when you set fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL |
72 | CHARACTERS will fall back to character references. Ditto for |
73 | "ascii-ctrl" except for control characters. For fallback modes, see |
74 | L<Encode>. |
75 | |
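As a minimal sketch of the fallback behaviour described above, the following encodes a string containing one non-ASCII character under each of the three modes, then runs plain ASCII text through C<null>. The sample strings are assumptions made for illustration; the C<FB_*> constants are those documented in L<Encode>.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{e9}";    # U+00E9 has no mapping in ascii

# Each fallback mode substitutes a different escape for the unmappable character.
my %mode = (
    PERLQQ   => Encode::FB_PERLQQ(),
    HTMLCREF => Encode::FB_HTMLCREF(),
    XMLCREF  => Encode::FB_XMLCREF(),
);
for my $name (sort keys %mode) {
    print "$name: ", encode('ascii', $text, $mode{$name}), "\n";
}

# With "null" nothing maps, so every character takes the fallback path.
print encode('null', 'abc', Encode::FB_HTMLCREF()), "\n";
```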
c731e18e |
76 | =head2 Encode::Unicode -- other Unicode encodings |
77 | |
78 | Unicode coding schemes other than native utf8 are supported by |
0ab8f81e |
79 | Encode::Unicode, which will be autoloaded on demand. |
c731e18e |
80 | |
81 | ---------------------------------------------------------------- |
f2a2953c |
82 | UCS-2BE UCS-2, iso-10646-1 [IANA, UC] |
83 | UCS-2LE [UC] |
84 | UTF-16 [UC] |
85 | UTF-16BE [UC] |
86 | UTF-16LE [UC] |
87 | UTF-32 [UC] |
126bf8bf |
88 | UTF-32BE UCS-4 [UC] |
f2a2953c |
89 | UTF-32LE [UC] |
1485817e |
90 | UTF-7 [RFC2152] |
67d7b5ef |
91 | ---------------------------------------------------------------- |
5d030b67 |
92 | |
0ab8f81e |
93 | To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another, |
f2a2953c |
94 | see L<Encode::Unicode>. |
95 | |
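To get a feel for the differences, this short sketch dumps the bytes that several of the schemes listed above produce for the single character "A" (U+0041). It relies only on C<encode> from L<Encode>; the choice of sample character is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Print the encoded bytes in hex for each coding scheme.
for my $name (qw(UTF-16 UTF-16BE UTF-16LE UTF-32BE UTF-32LE)) {
    my $bytes = encode($name, 'A');
    printf "%-9s %s\n", $name,
        join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}
```

The variant without an explicit byte order prepends a BOM when encoding, while the BE and LE variants emit no BOM.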
1485817e |
96 | UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit |
c2878c71 |
97 | encoding. It is implemented separately by Encode::Unicode::UTF7. |
1485817e |
98 | |
a999c27c |
99 | =head2 Encode::Byte -- Extended ASCII |
5d030b67 |
100 | |
0ab8f81e |
101 | Encode::Byte implements most single-byte encodings except for |
102 | Symbols and EBCDIC. The following encodings are based on single-byte |
103 | encodings implemented as extended ASCII. Most of them map |
104 | \x80-\xff (upper half) to non-ASCII characters. |
a999c27c |
105 | |
44b3b9c7 |
106 | =over 2 |
a999c27c |
107 | |
108 | =item ISO-8859 and corresponding vendor mappings |
109 | |
962111ca |
110 | Since there are so many, they are presented in table format with |
0ab8f81e |
111 | languages and corresponding encoding names by vendors. Note that |
112 | the table is sorted in order of ISO-8859 and the corresponding vendor |
113 | mappings are slightly different from those of ISO. See |
a999c27c |
114 | L<http://czyborra.com/charsets/iso8859.html> for details. |
115 | |
962111ca |
116 | Lang/Regions ISO/Other Std. DOS Windows Macintosh Others |
a999c27c |
117 | ---------------------------------------------------------------- |
962111ca |
118 | N. America (ASCII) cp437 AdobeStandardEncoding |
119 | cp863 (DOSCanadaF) |
0ab8f81e |
120 | W. Europe iso-8859-1 cp850 cp1252 MacRoman nextstep |
962111ca |
121 | hp-roman8 |
122 | cp860 (DOSPortuguese) |
123 | Cntrl. Europe iso-8859-2 cp852 cp1250 MacCentralEurRoman |
124 | MacCroatian |
125 | MacRomanian |
126 | MacRumanian |
ab3374e4 |
127 | Latin3[1] iso-8859-3 |
128 | Latin4[2] iso-8859-4 |
962111ca |
129 | Cyrillics iso-8859-5 cp855 cp1251 MacCyrillic |
0ab8f81e |
130 | (See also next section) cp866 MacUkrainian |
962111ca |
131 | Arabic iso-8859-6 cp864 cp1256 MacArabic |
132 | cp1006 MacFarsi |
133 | Greek iso-8859-7 cp737 cp1253 MacGreek |
134 | cp869 (DOSGreek2) |
135 | Hebrew iso-8859-8 cp862 cp1255 MacHebrew |
136 | Turkish iso-8859-9 cp857 cp1254 MacTurkish |
137 | Nordics iso-8859-10 cp865 |
138 | cp861 MacIcelandic |
139 | MacSami |
ab3374e4 |
140 | Thai iso-8859-11[3] cp874 MacThai |
a999c27c |
141 | (iso-8859-12 is nonexistent. Reserved for Indics?) |
962111ca |
142 | Baltics iso-8859-13 cp775 cp1257 |
a999c27c |
143 | Celtics iso-8859-14 |
962111ca |
144 | Latin9 [4] iso-8859-15 |
a999c27c |
145 | Latin10 iso-8859-16 |
962111ca |
146 | Vietnamese viscii cp1258 MacVietnamese |
a999c27c |
147 | ---------------------------------------------------------------- |
148 | |
0ab8f81e |
149 | [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9. |
150 | [2] Baltics. Now on 8859-10, except for Latvian. |
ab3374e4 |
151 | [3] TIS 620 + Non-Breaking Space (0xA0 / U+00A0) |
0ab8f81e |
152 | [4] Nicknamed Latin0; the Euro sign as well as French and Finnish |
153 | letters that are missing from 8859-1 were added. |
a999c27c |
154 | |
155 | All cp* are also available as ibm-*, ms-*, and windows-* . See also |
156 | L<http://czyborra.com/charsets/codepages.html>. |
157 | |
158 | Macintosh encodings don't seem to be registered in such entities as |
159 | IANA. "Canonical" names in Encode are based upon Apple's Tech Note |
160 | 1150. See L<http://developer.apple.com/technotes/tn/tn1150.html> |
0ab8f81e |
161 | for details. |
a999c27c |
162 | |
0ab8f81e |
163 | =item KOI8 - De Facto Standard for the Cyrillic world |
a999c27c |
164 | |
0ab8f81e |
165 | Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more |
166 | popular on the Net. L<Encode> comes with the following KOI charsets. |
962111ca |
167 | For gory details, see L<http://czyborra.com/charsets/cyrillic.html>. |
5d030b67 |
168 | |
67d7b5ef |
169 | ---------------------------------------------------------------- |
962111ca |
170 | koi8-f |
171 | koi8-r cp878 [RFC1489] |
172 | koi8-u [RFC2319] |
85982a32 |
173 | ---------------------------------------------------------------- |
962111ca |
174 | |
44b3b9c7 |
175 | =back |
176 | |
177 | =head2 gsm0338 - Hentai Latin 1 |
a999c27c |
178 | |
962111ca |
179 | GSM0338 is for GSM handsets. Though it shares alphanumerals with |
180 | ASCII, control character ranges and other parts are mapped very |
e74d7437 |
181 | differently, mainly to store Greek characters. There are also escape |
44b3b9c7 |
182 | sequences (starting with 0x1B) to cover e.g. the Euro sign. |
183 | |
184 | This was once handled by L<Encode::Byte> but because of all those |
185 | unusual specifications, Encode 2.20 has relocated the support to |
186 | L<Encode::GSM0338>. See L<Encode::GSM0338> for details. |
187 | |
188 | =over 2 |
189 | |
190 | =item gsm0338 support before 2.19 |
191 | |
192 | Some special cases like a trailing 0x00 byte or a lone 0x1B byte are not |
e74d7437 |
193 | well-defined and decode() will return an empty string for them. |
194 | One possible workaround is |
195 | |
196 | $gsm =~ s/\x00\z/\x00\x00/; |
197 | $uni = decode("gsm0338", $gsm); |
198 | $uni .= "\xA0" if $gsm =~ /\x1B\z/; |
199 | |
200 | Note that the Encode implementation of GSM0338 does not implement the |
201 | reuse of Latin capital letters as Greek capital letters (for example, |
202 | the 0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL |
203 | LETTER ZETA)). |
204 | |
205 | The GSM0338 is also covered in Encode::Byte even though it is not |
206 | an "extended ASCII" encoding. |
a999c27c |
207 | |
208 | =back |
5d030b67 |
209 | |
0ab8f81e |
210 | =head2 CJK: Chinese, Japanese, Korean (Multibyte) |
5d030b67 |
211 | |
962111ca |
212 | Note that Vietnamese is listed above. Also read "Encoding vs Charset" |
0ab8f81e |
213 | below. Also note that these are implemented in distinct modules by |
ab3374e4 |
214 | countries, due to size concerns (simplified Chinese is mapped |
0ab8f81e |
215 | to 'CN', continental China, while traditional Chinese is mapped to |
ab3374e4 |
216 | 'TW', Taiwan). Please refer to their respective documentation pages. |
5d030b67 |
217 | |
44b3b9c7 |
218 | =over 2 |
5129552c |
219 | |
220 | =item Encode::CN -- Continental China |
221 | |
962111ca |
222 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef |
223 | ---------------------------------------------------------------- |
962111ca |
224 | euc-cn [1] MacChineseSimp |
225 | (gbk) cp936 [2] |
226 | gb12345-raw { GB12345 without CES } |
227 | gb2312-raw { GB2312 without CES } |
5129552c |
228 | hz |
229 | iso-ir-165 |
67d7b5ef |
230 | ---------------------------------------------------------------- |
5129552c |
231 | |
0ab8f81e |
232 | [1] GB2312 is aliased to this. See L<Microsoft-related naming mess> |
233 | [2] gbk is aliased to this. See L<Microsoft-related naming mess> |
f2a2953c |
234 | |
5129552c |
235 | =item Encode::JP -- Japan |
236 | |
962111ca |
237 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef |
238 | ---------------------------------------------------------------- |
a999c27c |
239 | euc-jp |
962111ca |
240 | shiftjis cp932 macJapanese |
f2a2953c |
241 | 7bit-jis |
962111ca |
242 | iso-2022-jp [RFC1468] |
243 | iso-2022-jp-1 [RFC2237] |
f2a2953c |
244 | jis0201-raw { JIS X 0201 (roman + halfwidth kana) without CES } |
245 | jis0208-raw { JIS X 0208 (Kanji + fullwidth kana) without CES } |
246 | jis0212-raw { JIS X 0212 (Extended Kanji) without CES } |
67d7b5ef |
247 | ---------------------------------------------------------------- |
5129552c |
248 | |
249 | =item Encode::KR -- Korea |
250 | |
962111ca |
251 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef |
252 | ---------------------------------------------------------------- |
962111ca |
253 | euc-kr MacKorean [RFC1557] |
254 | cp949 [1] |
255 | iso-2022-kr [RFC1557] |
a999c27c |
256 | johab [KS X 1001:1998, Annex 3] |
f2a2953c |
257 | ksc5601-raw { KSC5601 without CES } |
67d7b5ef |
258 | ---------------------------------------------------------------- |
5129552c |
259 | |
962111ca |
260 | [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this. |
261 | See below. |
262 | |
5129552c |
263 | =item Encode::TW -- Taiwan |
264 | |
962111ca |
265 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef |
266 | ---------------------------------------------------------------- |
b0b300a3 |
267 | big5-eten cp950 MacChineseTrad {big5 aliased to big5-eten} |
268 | big5-hkscs |
67d7b5ef |
269 | ---------------------------------------------------------------- |
5129552c |
270 | |
271 | =item Encode::HanExtra -- More Chinese via CPAN |
272 | |
ab3374e4 |
273 | Due to size concerns, additional Chinese encodings below are |
5129552c |
274 | distributed separately on CPAN, under the name Encode::HanExtra. |
275 | |
962111ca |
276 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef |
277 | ---------------------------------------------------------------- |
e8c86ba6 |
278 | big5ext CMEX's Big5e Extension |
279 | big5plus CMEX's Big5+ Extension |
280 | cccii Chinese Character Code for Information Interchange |
281 | euc-tw EUC (Extended Unix Character) |
282 | gb18030 GBK with Traditional Characters |
283 | ---------------------------------------------------------------- |
284 | |
285 | =item Encode::JIS2K -- JIS X 0213 encodings via CPAN |
286 | |
287 | Due to size concerns, additional Japanese encodings below are |
288 | distributed separately on CPAN, under the name Encode::JIS2K. |
289 | |
290 | Standard DOS/Win Macintosh Comment/Reference |
291 | ---------------------------------------------------------------- |
292 | euc-jisx0213 |
293 | shiftjisx0213 |
294 | iso-2022-jp-3 |
295 | jis0213-1-raw |
296 | jis0213-2-raw |
67d7b5ef |
297 | ---------------------------------------------------------------- |
5129552c |
298 | |
299 | =back |
300 | |
301 | =head2 Miscellaneous encodings |
302 | |
44b3b9c7 |
303 | =over 2 |
5129552c |
304 | |
305 | =item Encode::EBCDIC |
5d030b67 |
306 | |
a999c27c |
307 | See L<perlebcdic> for details. |
5d030b67 |
308 | |
67d7b5ef |
309 | ---------------------------------------------------------------- |
5d030b67 |
310 | cp37 |
a999c27c |
311 | cp500 |
312 | cp875 |
313 | cp1026 |
314 | cp1047 |
5d030b67 |
315 | posix-bc |
67d7b5ef |
316 | ---------------------------------------------------------------- |
5129552c |
317 | |
a63c962f |
318 | =item Encode::Symbol |
5d030b67 |
319 | |
5129552c |
320 | For symbols and dingbats. |
5d030b67 |
321 | |
67d7b5ef |
322 | ---------------------------------------------------------------- |
5d030b67 |
323 | symbol |
324 | dingbats |
a999c27c |
325 | MacDingbats |
326 | AdobeZdingbat |
327 | AdobeSymbol |
67d7b5ef |
328 | ---------------------------------------------------------------- |
329 | |
e8c86ba6 |
330 | =item Encode::MIME::Header |
331 | |
332 | Strictly speaking, MIME header encoding documented in RFC 2047 is more |
ab3374e4 |
333 | of encapsulation than encoding. However, supporting it is imperative in |
334 | the modern world, so these encodings are supported. |
e8c86ba6 |
335 | |
336 | ---------------------------------------------------------------- |
337 | MIME-Header [RFC2047] |
338 | MIME-B [RFC2047] |
339 | MIME-Q [RFC2047] |
340 | ---------------------------------------------------------------- |
341 | |
342 | =item Encode::Guess |
343 | |
344 | This is not an encoding name but a utility that lets you pick |
345 | the most appropriate encoding for given data out of a list of I<suspects>. See |
346 | L<Encode::Guess> for details. |
347 | |
67d7b5ef |
348 | =back |
349 | |
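For Encode::Guess, listed above, a minimal usage sketch follows. It assumes the C<guess_encoding> function that C<use Encode::Guess> exports; the sample octets and the single suspect C<euc-jp> are chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding()

# Sample data: HIRAGANA LETTER A (U+3042) encoded as EUC-JP.
my $octets = encode('euc-jp', "\x{3042}");

# guess_encoding() returns an encoding object on success and a
# plain error string when it fails or the data is ambiguous.
my $guess = guess_encoding($octets, 'euc-jp');
if (ref $guess) {
    print "looks like ", $guess->name, "\n";
    my $text = $guess->decode($octets);    # decoded character string
}
else {
    print "cannot guess: $guess\n";
}
```

Checking C<ref> on the result matters because short or ambiguous input can match more than one suspect, in which case no object is returned.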
350 | =head1 Unsupported encodings |
351 | |
0ab8f81e |
352 | The following encodings are not supported as yet; some because they |
353 | are rarely used, some because of technical difficulties. They may |
354 | be supported by external modules via CPAN in the future, however. |
67d7b5ef |
355 | |
44b3b9c7 |
356 | =over 2 |
67d7b5ef |
357 | |
358 | =item ISO-2022-JP-2 [RFC1554] |
359 | |
360 | Not very popular yet. Needs Unicode Database or equivalent to |
0ab8f81e |
361 | implement encode() (because it includes JIS X 0208/0212, KSC5601, and |
362 | GB2312 simultaneously, whose code points in Unicode overlap. So you |
363 | need to look up the database to determine to what character set a given |
67d7b5ef |
364 | Unicode character should belong). |
365 | |
962111ca |
366 | =item ISO-2022-CN [RFC1922] |
67d7b5ef |
367 | |
0ab8f81e |
368 | Not very popular. Needs CNS 11643-1 and -2 which are not available in |
962111ca |
369 | this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra. |
0ab8f81e |
370 | Autrijus Tang may add support for this encoding in his module in the future. |
67d7b5ef |
371 | |
0ab8f81e |
372 | =item Various HP-UX encodings |
67d7b5ef |
373 | |
962111ca |
374 | The following are unsupported due to the lack of mapping data. |
375 | |
67d7b5ef |
376 | '8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8 |
962111ca |
377 | '15' - japanese15, korean15, and roi15 |
67d7b5ef |
378 | |
379 | =item Cyrillic encoding ISO-IR-111 |
380 | |
0ab8f81e |
381 | Anton Tagunov doubts its usefulness. |
67d7b5ef |
382 | |
383 | =item ISO-8859-8-1 [Hebrew] |
384 | |
a999c27c |
385 | None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and |
386 | MacHebrew are supported only because there were mappings |
962111ca |
387 | available at L<http://www.unicode.org/>). Contributions welcome. |
388 | |
389 | =item ISIRI 3342, Iran System, ISIRI 2900 [Farsi] |
390 | |
391 | Ditto. |
67d7b5ef |
392 | |
393 | =item Thai encoding TCVN |
394 | |
395 | Ditto. |
396 | |
397 | =item Vietnamese encodings VPS |
398 | |
0ab8f81e |
399 | Though Jungshik Shin has reported that Mozilla supports this encoding, |
400 | it was too late before 5.8.0 for us to add it. In the future, it |
401 | may be available via a separate module. See |
962111ca |
402 | L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf> |
403 | and |
a999c27c |
404 | L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut> |
405 | if you are interested in helping us. |
67d7b5ef |
406 | |
962111ca |
407 | =item Various Mac encodings |
67d7b5ef |
408 | |
962111ca |
409 | The following are unsupported due to the lack of mapping data. |
a999c27c |
410 | |
411 | MacArmenian, MacBengali, MacBurmese, MacEthiopic |
412 | MacExtArabic, MacGeorgian, MacKannada, MacKhmer |
413 | MacLaotian, MacMalayalam, MacMongolian, MacOriya |
414 | MacSinhalese, MacTamil, MacTelugu, MacTibetan |
415 | MacVietnamese |
416 | |
0ab8f81e |
417 | The rest which are already available are based upon the vendor mappings |
962111ca |
418 | at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> . |
a999c27c |
419 | |
420 | =item (Mac) Indic encodings |
421 | |
0ab8f81e |
422 | The maps for the following are available at L<http://www.unicode.org/> |
423 | but remain unsupported because those encodings need an algorithmic |
424 | approach, currently unsupported by F<enc2xs>: |
67d7b5ef |
425 | |
a999c27c |
426 | MacDevanagari |
427 | MacGurmukhi |
428 | MacGujarati |
67d7b5ef |
429 | |
a999c27c |
430 | For details, please see C<Unicode mapping issues and notes:> at |
431 | L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> . |
432 | |
433 | I believe this issue is prevalent not only for Mac Indics but also in |
962111ca |
434 | other Indic encodings, but the above were the only Indic encoding |
a999c27c |
435 | maps that I could find at L<http://www.unicode.org/> . |
5129552c |
436 | |
437 | =back |
5d030b67 |
438 | |
a999c27c |
439 | =head1 Encoding vs. Charset -- terminology |
5d030b67 |
440 | |
0ab8f81e |
441 | We are used to using the term (character) I<encoding> and I<character |
442 | set> interchangeably. But just as confusing the terms byte and |
443 | character is dangerous and the terms should be differentiated when |
444 | needed, we need to differentiate I<encoding> and I<character set>. |
5d030b67 |
445 | |
0ab8f81e |
446 | To understand that, here is a description of how we make computers |
447 | grok our characters. |
a999c27c |
448 | |
44b3b9c7 |
449 | =over 2 |
a999c27c |
450 | |
451 | =item * |
67d7b5ef |
452 | |
a999c27c |
453 | First we start with which characters to include. We call this |
454 | collection of characters I<character repertoire>. |
5d030b67 |
455 | |
a999c27c |
456 | =item * |
5d030b67 |
457 | |
a999c27c |
458 | Then we have to give each character a unique ID so your computer can |
0ab8f81e |
459 | tell the difference between 'a' and 'A'. This itemized character |
962111ca |
460 | repertoire is now a I<character set>. |
a63c962f |
461 | |
a999c27c |
462 | =item * |
463 | |
464 | If your computer can grok the character set without further |
0ab8f81e |
465 | processing, you can go ahead and use it. This is called a I<coded |
a999c27c |
466 | character set> (CCS) or I<raw character encoding>. ASCII is used this |
467 | way for most cases. |
468 | |
469 | =item * |
470 | |
0ab8f81e |
471 | But in many cases, especially multi-byte CJK encodings, you have to |
a999c27c |
472 | tweak a little more. Your network connection may not accept any data |
0ab8f81e |
473 | with the Most Significant Bit set, and your computer may not be able to |
a999c27c |
474 | tell if a given byte is a whole character or just half of it. So you |
475 | have to I<encode> the character set to use it. |
476 | |
477 | A I<character encoding scheme> (CES) determines how to encode a given |
478 | character set, or a set of multiple character sets. 7bit ISO-2022 is |
0ab8f81e |
479 | an example of a CES. You switch between character sets via I<escape |
480 | sequences>. |
67d7b5ef |
481 | |
482 | =back |
483 | |
0ab8f81e |
484 | Technically, or mathematically, speaking, a character set encoded in |
a999c27c |
485 | such a CES that maps character by character may form a CCS. EUC is such |
0ab8f81e |
486 | an example. The CES of EUC is as follows: |
67d7b5ef |
487 | |
44b3b9c7 |
488 | =over 2 |
5d030b67 |
489 | |
a999c27c |
490 | =item * |
5d030b67 |
491 | |
a999c27c |
492 | Map ASCII unchanged. |
493 | |
494 | =item * |
495 | |
496 | Map a character set that consists of 94 or 96 to the power of N |
497 | members by adding 0x80 to each byte. |
498 | |
499 | =item * |
500 | |
0ab8f81e |
501 | You can also use 0x8e and 0x8f to indicate that the following sequence of |
502 | characters belongs to yet another character set. The value 0x80 is added |
503 | to each following byte. |
a999c27c |
504 | |
505 | =back |
506 | |
0ab8f81e |
507 | By carefully looking at the encoded byte sequence, you can find that the |
508 | byte sequence corresponds to a unique number. In that sense, EUC is a CCS |
a999c27c |
509 | generated by a CES above from up to four CCS (complicated?). UTF-8 |
0ab8f81e |
510 | falls into this category. See L<perlunicode/"UTF-8"> to find out how |
a999c27c |
511 | UTF-8 maps Unicode to a byte sequence. |
512 | |
0ab8f81e |
513 | You may also have found out by now why 7bit ISO-2022 cannot comprise |
514 | a CCS. If you look at a byte sequence \x21\x21, you can't tell if |
515 | it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1 |
516 | so you have no trouble differentiating between "!!" and S<" ">. |
67d7b5ef |
517 | |
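The "add 0x80 to each byte" rule for EUC can be checked by hand. The sketch below takes the raw JIS X 0208 code of IDEOGRAPHIC SPACE via C<jis0208-raw>, applies the rule manually, and compares the result with what C<euc-jp> produces. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{3000}";    # IDEOGRAPHIC SPACE

# The raw JIS X 0208 code (the CCS with no CES applied).
my $raw = encode('jis0208-raw', $char);

# Apply the EUC rule by hand: add 0x80 to every byte.
my $by_hand = join '', map { chr(ord($_) + 0x80) } split //, $raw;

# Compare with what the euc-jp encoding itself produces.
my $euc = encode('euc-jp', $char);
printf "raw=%s euc=%s match=%s\n",
    unpack('H*', $raw), unpack('H*', $euc),
    ($by_hand eq $euc ? 'yes' : 'no');
```

The raw bytes are the same as those of two ASCII exclamation marks, which is exactly the ambiguity that the high-bit form removes.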
a63c962f |
518 | =head1 Encoding Classification (by Anton Tagunov and Dan Kogai) |
519 | |
520 | This section tries to classify the supported encodings by their |
521 | applicability for information exchange over the Internet and to |
522 | choose the most suitable aliases to name them in the context of |
523 | such communication. |
524 | |
44b3b9c7 |
525 | =over 2 |
67d7b5ef |
526 | |
527 | =item * |
528 | |
0ab8f81e |
529 | To (en|de)code encodings marked by C<(**)>, you need |
a999c27c |
530 | C<Encode::HanExtra>, available from CPAN. |
67d7b5ef |
531 | |
532 | =back |
533 | |
a63c962f |
534 | Encoding names |
5d030b67 |
535 | |
f2a2953c |
536 | US-ASCII UTF-8 ISO-8859-* KOI8-R |
537 | Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1 |
538 | EUC-KR Big5 GB2312 |
a999c27c |
539 | |
0ab8f81e |
540 | are registered with IANA as preferred MIME names and may |
a999c27c |
541 | be used over the Internet. |
5d030b67 |
542 | |
c731e18e |
543 | C<Shift_JIS> has been officialized by JIS X 0208:1997. |
a999c27c |
544 | L<Microsoft-related naming mess> gives details. |
5d030b67 |
545 | |
a999c27c |
546 | C<GB2312> is the IANA name for C<EUC-CN>. |
547 | See L<Microsoft-related naming mess> for details. |
548 | |
549 | C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw> |
f2a2953c |
550 | with Encode. See L<Encode::CN> for details. |
5d030b67 |
551 | |
a63c962f |
552 | EUC-CN |
f2a2953c |
553 | KOI8-U [RFC2319] |
5d030b67 |
554 | |
a999c27c |
555 | have not been registered with IANA (as of March 2002) but |
556 | seem to be supported by major web browsers. |
0ab8f81e |
557 | The IANA name for C<EUC-CN> is C<GB2312>. |
67d7b5ef |
558 | |
559 | KS_C_5601-1987 |
560 | |
a999c27c |
561 | is heavily misused. |
562 | See L<Microsoft-related naming mess> for details. |
563 | |
564 | C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw> |
f2a2953c |
565 | with Encode. See L<Encode::KR> for details. |
566 | |
567 | UTF-16 UTF-16BE UTF-16LE |
568 | |
448e90bb |
569 | are IANA-registered C<charset>s. See [RFC 2781] for details. |
f2a2953c |
570 | Jungshik Shin reports that UTF-16 with a BOM is well accepted |
571 | by MS IE 5/6 and NS 4/6. Beware however that |
572 | |
44b3b9c7 |
573 | =over 2 |
f2a2953c |
574 | |
575 | =item * |
5d030b67 |
576 | |
f2a2953c |
577 | C<UTF-16> support in any software you're going to be |
578 | using/interoperating with has probably been less tested |
579 | than C<UTF-8> support |
5d030b67 |
580 | |
f2a2953c |
581 | =item * |
582 | |
c731e18e |
583 | C<UTF-8> coded data seamlessly passes traditional |
584 | command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded |
0ab8f81e |
585 | data is likely to cause confusion (with its zero bytes, |
f2a2953c |
586 | for example) |
587 | |
588 | =item * |
589 | |
590 | it is beyond the power of words to describe the way HTML browsers |
0ab8f81e |
591 | encode non-C<ASCII> form data. To get a general impression, visit |
b2deda17 |
592 | L<http://www.alanflavell.org.uk/charset/form-i18n.html>. |
0ab8f81e |
593 | While encoding of form data has stabilized for C<UTF-8> encoded pages |
594 | (at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to |
595 | expect fun (and cross-browser discrepancies) with C<UTF-16> encoded |
f2a2953c |
596 | pages! |
597 | |
598 | =back |
599 | |
600 | The rule of thumb is to use C<UTF-8> unless you know what |
c731e18e |
601 | you're doing and unless you really benefit from using C<UTF-16>. |
a999c27c |
602 | |
f2a2953c |
603 | ISO-IR-165 [RFC1345] |
5d030b67 |
604 | VISCII |
a63c962f |
605 | GB 12345 |
f2a2953c |
606 | GB 18030 (**) (see links below) |
607 | EUC-TW (**) |
5d030b67 |
608 | |
609 | are totally valid encodings but not registered at IANA. |
a63c962f |
610 | The names under which they are listed here are probably the |
611 | most widely-known names for these encodings and are recommended |
612 | names. |
613 | |
f2a2953c |
614 | BIG5PLUS (**) |
a63c962f |
615 | |
0ab8f81e |
616 | is a proprietary name. |
5d030b67 |
617 | |
a999c27c |
618 | =head2 Microsoft-related naming mess |
619 | |
620 | Microsoft products misuse the following names: |
5d030b67 |
621 | |
44b3b9c7 |
622 | =over 2 |
a63c962f |
623 | |
a999c27c |
624 | =item KS_C_5601-1987 |
5d030b67 |
625 | |
a999c27c |
626 | Microsoft extension to C<EUC-KR>. |
5d030b67 |
627 | |
c731e18e |
628 | Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla). |
67d7b5ef |
629 | |
f2a2953c |
630 | See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html> |
a999c27c |
631 | for details. |
5d030b67 |
632 | |
f2a2953c |
633 | Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common |
634 | misuse. I<Raw> C<KS_C_5601-1987> encoding is available as |
635 | C<ksc5601-raw>. |
5d030b67 |
636 | |
f2a2953c |
637 | See L<Encode::KR> for details. |
67d7b5ef |
638 | |
a999c27c |
639 | =item GB2312 |
67d7b5ef |
640 | |
a999c27c |
641 | Microsoft extension to C<EUC-CN>. |
a63c962f |
642 | |
a999c27c |
643 | Proper names: C<CP936>, C<GBK>. |
a63c962f |
644 | |
a999c27c |
645 | C<GB2312> has been registered in the C<EUC-CN> meaning at |
646 | IANA. This has partially repaired the situation: Microsoft's |
647 | C<GB2312> has become a superset of the official C<GB2312>. |
67d7b5ef |
648 | |
a999c27c |
649 | Encode aliases C<GB2312> to C<euc-cn> in full agreement with |
650 | IANA registration. C<cp936> is supported separately. |
f2a2953c |
651 | I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>. |
a999c27c |
652 | |
f2a2953c |
653 | See L<Encode::CN> for details. |
a999c27c |
654 | |
655 | =item Big5 |
656 | |
657 | Microsoft extension to C<Big5>. |
658 | |
659 | Proper name: C<CP950>. |
660 | |
661 | Encode separately supports C<Big5> and C<cp950>. |
662 | |
663 | =item Shift_JIS |
664 | |
665 | Microsoft's understanding of C<Shift_JIS>. |
666 | |
667 | JIS has not endorsed the full Microsoft standard however. |
668 | The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208 |
0ab8f81e |
669 | character sets, while Microsoft has always used C<Shift_JIS> |
85982a32 |
670 | to encode a wider character repertoire. See C<IANA> registration for |
c731e18e |
671 | C<Windows-31J>. |
a999c27c |
672 | |
0ab8f81e |
673 | As a historical predecessor, Microsoft's variant |
674 | probably has more rights for the name, though it may be objected |
a999c27c |
675 | that Microsoft shouldn't have used JIS as part of the name |
676 | in the first place. |
677 | |
8f1ed24a |
678 | Unambiguous name: C<CP932>. C<IANA> name (also used by Mozilla, and |
679 | provided as an alias by Encode): C<Windows-31J>. |
a999c27c |
680 | |
681 | Encode separately supports C<Shift_JIS> and C<cp932>. |
682 | |
683 | =back |
684 | |
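The alias handling described in this section can be inspected with C<Encode::resolve_alias>. This is a small sketch; the list of names is taken from the entries above and is not exhaustive.

```perl
use strict;
use warnings;
use Encode ();

# Names as they appear in the wild, mapped to Encode's canonical names.
for my $name (qw(KS_C_5601-1987 UHC GB2312 GBK Windows-31J)) {
    my $canon = Encode::resolve_alias($name);
    printf "%-15s => %s\n", $name, $canon || '(not implemented)';
}
```

Resolving a name first makes it explicit which variant will be used: C<GB2312> yields the IANA meaning (C<euc-cn>) rather than the Microsoft extension, while C<KS_C_5601-1987> yields the Microsoft one (C<cp949>).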
685 | =head1 Glossary |
686 | |
44b3b9c7 |
687 | =over 2 |
a999c27c |
688 | |
689 | =item character repertoire |
690 | |
0ab8f81e |
691 | A collection of unique characters. A I<character> set in the strictest |
692 | sense. At this stage, characters are not numbered. |
a999c27c |
693 | |
694 | =item coded character set (CCS) |
695 | |
696 | A character set that is mapped in a way computers can use directly. |
0ab8f81e |
697 | Many character encodings, including EUC, fall in this category. |
a999c27c |
698 | |
699 | =item character encoding scheme (CES) |
700 | |
701 | An algorithm to map a character set to a byte sequence. You don't |
702 | have to be able to tell which character set a given byte sequence |
703 | belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an |
704 | example of being both a CCS and CES. |
705 | |
f2a2953c |
706 | =item charset (in MIME context) |
707 | |
708 | has long been used in the meaning of C<encoding>, CES. |
709 | |
0ab8f81e |
710 | While the word combination C<character set> has lost this meaning |
711 | in MIME context since [RFC 2130], the C<charset> abbreviation has |
712 | retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>: |
f2a2953c |
713 | |
714 | This document uses the term "charset" to mean a set of rules for |
715 | mapping from a sequence of octets to a sequence of characters, such |
716 | as the combination of a coded character set and a character encoding |
717 | scheme; this is also what is used as an identifier in MIME "charset=" |
718 | parameters, and registered in the IANA charset registry ... (Note |
719 | that this is NOT a term used by other standards bodies, such as ISO). |
ab3374e4 |
720 | [RFC 2277] |
f2a2953c |
721 | |
a999c27c |
722 | =item EUC |
723 | |
0ab8f81e |
724 | Extended Unix Character. See ISO-2022. |
a999c27c |
725 | |
726 | =item ISO-2022 |
727 | |
0ab8f81e |
728 | A CES that was carefully designed to coexist with ASCII. There are a 7 |
729 | bit version and an 8 bit version. |
f2a2953c |
730 | |
0ab8f81e |
731 | The 7-bit version switches character sets via escape sequences, so it |
f2a2953c |
732 | cannot form a CCS. Since this is more difficult to handle in programs |
0ab8f81e |
733 | than the 8-bit version, the 7-bit version is not very popular except for |
734 | iso-2022-jp, the I<de facto> standard CES for e-mails. |
f2a2953c |
735 | |
0ab8f81e |
736 | The 8-bit version can form a CCS. EUC and ISO-8859 are two examples |
962111ca |
737 | thereof. Pre-5.6 perl could use them as string literals. |
a999c27c |
738 | |
739 | =item UCS |
740 | |
741 | Short for I<Universal Character Set>. When you say just UCS, it means |
0ab8f81e |
742 | I<Unicode>. |
a999c27c |
743 | |
744 | =item UCS-2 |
745 | |
746 | ISO/IEC 10646 encoding form: Universal Character Set coded in two |
747 | octets. |
748 | |
749 | =item Unicode |
750 | |
0ab8f81e |
751 | A character set that aims to include all character repertoires of the |
962111ca |
752 | world. Many character sets in various national as well as industrial |
f2a2953c |
753 | standards have become, in a way, just subsets of Unicode. |
a999c27c |
754 | |
755 | =item UTF |
756 | |
f2a2953c |
757 | Short for I<Unicode Transformation Format>. Determines how to map a |
0ab8f81e |
758 | Unicode character into a byte sequence. |
a999c27c |
759 | |
760 | =item UTF-16 |
761 | |
762 | A UTF that uses 16-bit code units. It can be either big endian or little |
0ab8f81e |
763 | endian. The big endian version is called UTF-16BE (equal to UCS-2 + |
764 | surrogate support) and the little endian version is called UTF-16LE. |
67d7b5ef |
765 | |
766 | =back |
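
The difference between the 7-bit and 8-bit versions of ISO-2022 described above can be seen by encoding the same string both ways. The following is only a minimal sketch: it assumes the C<iso-2022-jp> and C<euc-jp> encodings from L<Encode::JP> are installed, and the sample string and the C<hex_dump> helper are made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: HIRAGANA LETTER A (U+3042) between two ASCII letters.
my $text = "a\x{3042}b";

# 7-bit ISO-2022: escape sequences switch between character sets,
# so every byte stays below 0x80.
my $jis = encode('iso-2022-jp', $text);

# 8-bit EUC: no escape sequences; bytes with the high bit set
# carry the non-ASCII character.
my $euc = encode('euc-jp', $text);

# Illustrative helper (not part of Encode): show bytes as hex.
sub hex_dump { return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $_[0] }

print 'iso-2022-jp: ', hex_dump($jis), "\n";
print 'euc-jp:      ', hex_dump($euc), "\n";
```

The C<iso-2022-jp> output contains the ESC byte (1B) that announces each switch of character set, which is why a program has to track state while reading it. The C<euc-jp> output needs no such state.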
5d030b67 |
767 | |
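
The byte order and surrogate behaviour described under UTF-16 can be checked with a short script. This is only a sketch: the two sample characters and the C<hex_bytes> helper are chosen for illustration and are not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative helper (not part of Encode): show bytes as hex.
sub hex_bytes { return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $_[0] }

my $bmp    = "\x{3042}";   # inside the Basic Multilingual Plane
my $astral = "\x{1F600}";  # outside the BMP, so UTF-16 needs a surrogate pair

my $be   = hex_bytes(encode('UTF-16BE', $bmp));     # high-order byte first
my $le   = hex_bytes(encode('UTF-16LE', $bmp));     # low-order byte first
my $pair = hex_bytes(encode('UTF-16BE', $astral));  # two 16-bit code units

print "UTF-16BE U+3042:  $be\n";
print "UTF-16LE U+3042:  $le\n";
print "UTF-16BE U+1F600: $pair\n";
```

The same character yields the same two bytes in opposite order under C<UTF-16BE> and C<UTF-16LE>, while the character outside the BMP takes four bytes.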
768 | =head1 See Also |
769 | |
5129552c |
770 | L<Encode>, |
771 | L<Encode::Byte>, |
a63c962f |
772 | L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>, |
5129552c |
773 | L<Encode::EBCDIC>, L<Encode::Symbol>, |
e8c86ba6 |
774 | L<Encode::MIME::Header>, L<Encode::Guess> |
5d030b67 |
775 | |
a999c27c |
776 | =head1 References |
777 | |
44b3b9c7 |
778 | =over 2 |
a999c27c |
779 | |
780 | =item ECMA |
781 | |
782 | European Computer Manufacturers Association |
783 | L<http://www.ecma.ch> |
784 | |
44b3b9c7 |
785 | =over 2 |
a999c27c |
786 | |
0ab8f81e |
787 | =item ECMA-035 (eq C<ISO-2022>) |
a999c27c |
788 | |
789 | L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> |
790 | |
0ab8f81e |
791 | The specification of ISO-2022 is available from the link above. |
a999c27c |
792 | |
793 | =back |
794 | |
795 | =item IANA |
796 | |
797 | Internet Assigned Numbers Authority |
798 | L<http://www.iana.org/> |
799 | |
44b3b9c7 |
800 | =over 2 |
a999c27c |
801 | |
802 | =item Assigned Charset Names by IANA |
803 | |
804 | L<http://www.iana.org/assignments/character-sets> |
805 | |
806 | Most of the C<canonical names> in Encode derive from this list, |
807 | so you can directly apply the string you have extracted from the MIME |
448e90bb |
808 | headers of mail messages and web pages. |
a999c27c |
809 | |
810 | =back |
811 | |
812 | =item ISO |
813 | |
814 | International Organization for Standardization |
815 | L<http://www.iso.ch/> |
816 | |
817 | =item RFC |
818 | |
962111ca |
819 | Request For Comments -- need I say more? |
b2deda17 |
820 | L<http://www.rfc-editor.org/>, L<http://www.ietf.org/rfc.html>, |
0ab8f81e |
821 | L<http://www.faqs.org/rfcs/> |
a999c27c |
822 | |
823 | =item UC |
824 | |
825 | Unicode Consortium |
826 | L<http://www.unicode.org/> |
827 | |
44b3b9c7 |
828 | =over 2 |
a999c27c |
829 | |
830 | =item Unicode Glossary |
831 | |
832 | L<http://www.unicode.org/glossary/> |
833 | |
962111ca |
834 | The glossary of this document is based upon this site. |
a999c27c |
835 | |
836 | =back |
837 | |
838 | =back |
839 | |
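
As noted under the IANA entry above, a charset name taken from a MIME header can usually be handed to Encode as is. The following sketch assumes a made-up C<Content-Type> value and a deliberately simple regular expression. Real header parsing has more cases than this handles.

```perl
use strict;
use warnings;
use Encode qw(find_encoding decode);

# Hypothetical header value, made up for this example.
my $content_type = 'text/plain; charset=ISO-8859-1';

# Simplified extraction of the charset parameter (an assumption:
# real MIME parameters can be quoted or folded in other ways).
my ($charset) = $content_type =~ /charset="?([^\s";]+)"?/i;
die "No charset parameter found\n" unless defined $charset;

# find_encoding() returns an encoding object, or a false value
# when the name is not known to Encode.
my $enc = find_encoding($charset)
    or die "Unsupported charset: $charset\n";

print 'canonical name: ', $enc->name, "\n";

# Decode some octets using the name found in the header.
my $bytes = "caf\xE9";
my $text  = decode($enc->name, $bytes);
print 'decoded length: ', length($text), "\n";
```

Checking the return value of C<find_encoding> before decoding keeps an unknown or misspelled charset from turning into a runtime error later.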
840 | =head2 Other Notable Sites |
841 | |
44b3b9c7 |
842 | =over 2 |
a999c27c |
843 | |
844 | =item czyborra.com |
845 | |
f2a2953c |
846 | L<http://czyborra.com/> |
a999c27c |
847 | |
cf525c36 |
848 | Contains a lot of useful information, especially gory details of ISO |
a999c27c |
849 | vs. vendor mappings. |
850 | |
851 | =item CJK.inf |
852 | |
b2deda17 |
853 | L<http://examples.oreilly.com/cjkvinfo/doc/cjk.inf> |
a999c27c |
854 | |
855 | Somewhat obsolete (last update in 1996), but still useful. Also try |
856 | |
857 | L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf> |
858 | |
0ab8f81e |
859 | You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>. |
a999c27c |
860 | |
f2a2953c |
861 | =item Jungshik Shin's Hangul FAQ |
862 | |
863 | L<http://jshin.net/faq> |
864 | |
0ab8f81e |
865 | And especially its subject 8. |
f2a2953c |
866 | |
867 | L<http://jshin.net/faq/qa8.html> |
868 | |
962111ca |
869 | A comprehensive overview of the Korean (C<KS *>) standards. |
f2a2953c |
870 | |
0ab8f81e |
871 | =item debian.org: "Introduction to i18n" |
872 | |
873 | A brief description for most of the mentioned CJK encodings is |
874 | contained in |
875 | L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html> |
876 | |
f2a2953c |
877 | =back |
878 | |
879 | =head2 Offline sources |
880 | |
44b3b9c7 |
881 | =over 2 |
f2a2953c |
882 | |
883 | =item C<CJKV Information Processing> by Ken Lunde |
884 | |
885 | CJKV Information Processing |
886 | 1999 O'Reilly & Associates, ISBN: 1-56592-224-7 |
887 | |
0ab8f81e |
888 | The modern successor of C<CJK.inf>. |
f2a2953c |
889 | |
0ab8f81e |
890 | Features comprehensive coverage of CJKV character sets and |
f2a2953c |
891 | encodings along with many other issues faced by anyone trying |
892 | to better support CJKV languages/scripts in all the areas of |
893 | information processing. |
894 | |
0ab8f81e |
895 | To purchase this book, visit |
b2deda17 |
896 | L<http://oreilly.com/catalog/9780596514471/> |
0ab8f81e |
897 | or your favourite bookstore. |
f2a2953c |
898 | |
a999c27c |
899 | =back |
900 | |
5d030b67 |
901 | =cut |