Commit | Line | Data |
5d030b67 |
1 | =head1 NAME |
2 | |
a999c27c |
3 | Encode::Supported -- Supported encodings by Encode |
5d030b67 |
4 | |
5 | =head1 DESCRIPTION |
6 | |
5129552c |
7 | =head2 Encoding Names |
5d030b67 |
8 | |
9 | Encoding names are case insensitive. White space in names |
10 | is ignored. In addition an encoding may have aliases. |
11 | Each encoding has one "canonical" name. The "canonical" |
12 | name is chosen from the names of the encoding by picking |
a999c27c |
13 | the first in the following sequence (with a few exceptions). |
5d030b67 |
14 | |
a999c27c |
15 | =over |
16 | |
17 | =item * |
18 | |
19 | The name used by the Perl community. That includes 'utf8' and 'ascii'. |
20 | Unlike aliases, canonical names reach the method directly, so |
21 | frequently used names like 'utf8' need no alias lookup. |
22 | |
23 | =item * |
24 | |
25 | The MIME name as defined in IETF RFCs. This includes all "iso-"'s. |
26 | |
27 | =item * |
28 | |
29 | The name in the IANA registry. |
30 | |
31 | =item * |
32 | |
33 | The name used by the organization that defined it. |
34 | |
35 | =back |
36 | |
37 | Where a I<de jure> canonical name differs from the one used by the |
38 | Encode module, it is always aliased whenever the encoding is |
39 | implemented. So you can safely tell whether a given encoding is |
40 | implemented just by passing its canonical name. |
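As a rough illustration of the naming rules above, the following sketch asks Encode to resolve a few spellings. It assumes a Perl with the Encode module installed. The canonical name returned for some encodings can vary between Encode versions, so the sketch prints whatever comes back rather than assuming it.

```perl
use strict;
use warnings;
use Encode ();

# Spellings to try: an alias, a name with spaces, and a bogus name.
my @names = ('latin1', 'ISO 8859 1', 'no-such-encoding');

for my $name (@names) {
    # resolve_alias() returns the canonical name, or a false value
    # when no implemented encoding matches.
    my $canonical = Encode::resolve_alias($name);
    printf "%-18s => %s\n", $name,
        $canonical ? $canonical : '(not implemented)';
}
```

A false return value is the simplest way to test whether an encoding is implemented before trying to use it.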
5d030b67 |
41 | |
5129552c |
42 | Because of all the alias issues, and because in the general case |
43 | encodings have state, "Encode" uses the encoding object internally |
44 | once an operation is in progress. |
5d030b67 |
45 | |
5129552c |
46 | =head1 Supported Encodings |
5d030b67 |
47 | |
48 | As of Perl 5.8.0, at least the following encodings are recognized. |
49 | Note that unless otherwise specified, they are all case insensitive |
a63c962f |
50 | (via alias) and all occurrences of spaces are replaced with '-'. In |
5d030b67 |
51 | other words, "ISO 8859 1" and "iso-8859-1" are identical. |
52 | |
5129552c |
53 | Encodings are categorized and implemented in several different modules |
54 | but in most cases you don't have to C<use Encode::XX> to make them |
55 | available. Encode.pm will automatically load those modules on demand. |
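The on-demand loading described above can be observed with a short sketch like the one below. It assumes a standard Encode installation that includes Encode::JP; the counts printed depend on the installed version, so no specific numbers are claimed.

```perl
use strict;
use warnings;
use Encode ();

# Encodings whose modules are already loaded.
my @loaded = Encode->encodings();

# Every encoding Encode knows how to load on demand.
my @all = Encode->encodings(':all');

printf "%d loaded, %d available\n", scalar @loaded, scalar @all;

# No "use Encode::JP" is needed; the module is loaded on first use.
my $bytes = Encode::encode('euc-jp', "\x{3042}");   # HIRAGANA LETTER A
printf "euc-jp bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $bytes;
```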
5d030b67 |
56 | |
5129552c |
57 | =head2 Built-in Encodings |
5d030b67 |
58 | |
5129552c |
59 | The following encodings are always available. |
5d030b67 |
60 | |
67d7b5ef |
61 | Canonical Aliases Comments & References |
62 | ---------------------------------------------------------------- |
a999c27c |
63 | ascii US-ascii [ECMA] |
67d7b5ef |
64 | iso-8859-1 latin1 [ISO] |
a999c27c |
65 | utf8 UTF-8 [RFC2279] |
80a5d8e7 |
66 | UCS-2BE UCS-2, iso-10646-1 [IANA, UC] |
67 | UCS-2LE [UC] |
68 | UTF-16 [UC] |
69 | UTF-16BE [UC] |
70 | UTF-16LE [UC] |
71 | UTF-32 [UC] |
72 | UTF-32BE [UC] |
73 | UTF-32LE [UC] |
67d7b5ef |
74 | ---------------------------------------------------------------- |
5d030b67 |
75 | |
80a5d8e7 |
76 | To find how those (UCS-2|UTF-(16|32))(LE|BE)? differ from one another, |
77 | see L<Encode::Unicode>. |
78 | |
a999c27c |
79 | =head2 Encode::Byte -- Extended ASCII |
5d030b67 |
80 | |
a999c27c |
81 | Encode::Byte implements most single-byte encodings except for |
82 | Symbols and EBCDIC. The following encodings are single-byte |
83 | encodings implemented as extended ASCII. In most cases they use |
84 | \x80-\xff (upper half) to map non-ASCII characters. |
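To make the "upper half" point concrete, here is a small sketch that encodes one non-ASCII character with a few of the single-byte encodings listed in this section. The choice of character and of encodings is only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $e_acute = "\x{E9}";   # LATIN SMALL LETTER E WITH ACUTE

# Each encoding yields exactly one byte, and that byte lies in
# \x80-\xff, though not necessarily at the same position.
for my $enc (qw(iso-8859-1 cp437 cp1252 MacRoman)) {
    my $bytes = encode($enc, $e_acute);
    printf "%-10s 0x%02X\n", $enc, ord $bytes;
}
```

The differing byte values are the reason vendor code pages cannot be treated as interchangeable with their ISO-8859 counterparts.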
85 | |
86 | =over 2 |
87 | |
88 | =item ISO-8859 and corresponding vendor mappings |
89 | |
90 | Since there are so many, they are presented in table format with |
91 | languages and the corresponding encoding names by vendor. Note that the |
92 | table is sorted in order of ISO-8859 and that the corresponding vendor |
93 | mappings are slightly different from those of ISO. See |
94 | L<http://czyborra.com/charsets/iso8859.html> for details. |
95 | |
96 | Lang/Regions ISO/Other Std. DOS Windows Macintosh Others |
97 | ---------------------------------------------------------------- |
98 | N. America (ASCII) cp437 AdobeStandardEncoding |
99 | cp863 (DOSCanadaF) |
100 | W. Europe (iso-8859-1) cp850 cp1252 MacRoman nextstep |
101 | hp-roman8 |
102 | cp860 (DOSPortuguese) |
103 | CE. Europe iso-8859-2 cp852 cp1250 MacCentralEurRoman |
104 | MacCroatian |
105 | MacRomanian |
106 | MacRumanian |
107 | Latin3(*3) iso-8859-3 |
108 | Latin4(*4) iso-8859-4 |
109 | Cyrillics iso-8859-5 cp855 cp1251 MacCyrillic |
110 | (Also see next section) cp866 MacUkrainian |
111 | Arabic iso-8859-6 cp864 cp1256 MacArabic |
112 | cp1006 MacFarsi |
113 | Greek iso-8859-7 cp737 cp1253 MacGreek |
114 | cp869 (DOSGreek2) |
115 | Hebrew iso-8859-8 cp862 cp1255 MacHebrew |
116 | Turkish iso-8859-9 cp857 cp1254 MacTurkish |
117 | Nordics iso-8859-10 cp865 |
118 | cp861 MacIcelandic |
119 | MacSami |
120 | Thai iso-8859-11 cp874 MacThai |
121 | (iso-8859-12 is nonexistent. Reserved for Indics?) |
122 | Baltics iso-8859-13 cp775 cp1257 |
123 | Celtics iso-8859-14 |
124 | Latin9(*15) iso-8859-15 |
125 | Latin10 iso-8859-16 |
126 | Vietnamese viscii cp1258 MacVietnamese |
127 | ---------------------------------------------------------------- |
128 | |
129 | (*3) Esperanto, Maltese, and Turkish. Turkish is now on 8859-9 |
130 | (*4) Baltics. Now on 8859-10 |
131 | (*15) Nicknamed Latin0; Euro sign as well as French and Finnish |
132 | letters that are missing from 8859-1 are added. |
133 | |
134 | All cp* are also available as ibm-*, ms-*, and windows-* . See also |
135 | L<http://czyborra.com/charsets/codepages.html>. |
136 | |
137 | Macintosh encodings don't seem to be registered in such entities as |
138 | IANA. "Canonical" names in Encode are based upon Apple's Tech Note |
139 | 1150. See L<http://developer.apple.com/technotes/tn/tn1150.html> |
140 | for details. |
141 | |
142 | =item KOI8 - De Facto Standard for Cyrillic world |
143 | |
144 | Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more |
145 | popular on the Net. L<Encode> comes with the following KOI charsets. |
146 | For gory details, see |
147 | L<http://czyborra.com/charsets/cyrillic.html>. |
5d030b67 |
148 | |
67d7b5ef |
149 | ---------------------------------------------------------------- |
67d7b5ef |
150 | koi8-f |
a999c27c |
151 | koi8-r cp878 [RFC1489] |
67d7b5ef |
152 | koi8-u [RFC2319] |
67d7b5ef |
153 | |
a999c27c |
154 | =item gsm0338 - Hentai Latin 1 |
155 | |
156 | GSM0338 is for GSM handsets. Though it shares alphanumerics with ASCII, |
157 | control character ranges and other parts are mapped very differently, |
80a5d8e7 |
158 | presumably to store Greek and Cyrillic alphabets. This one is also |
159 | covered in Encode::Byte even though it does not conform to extended |
160 | ASCII. |
a999c27c |
161 | |
162 | =back |
5d030b67 |
163 | |
5129552c |
164 | =head2 The CJK: Chinese, Japanese, Korean (Multibyte) |
5d030b67 |
165 | |
166 | Note Vietnamese is listed above. Also read "Encoding vs Charset" |
a63c962f |
167 | below. Also note these are implemented in distinct modules by |
168 | language, due to size concerns. Please also refer to their |
169 | respective documentation pages. |
5d030b67 |
170 | |
5129552c |
171 | =over 4 |
172 | |
173 | =item Encode::CN -- Continental China |
174 | |
80a5d8e7 |
175 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef |
176 | ---------------------------------------------------------------- |
80a5d8e7 |
177 | euc-cn(*1) MacChineseSimp |
178 | (gbk) cp936 (*2) |
179 | gb12345-raw { GB12345 without CES } |
180 | gb2312-raw { GB2312 without CES } |
5129552c |
181 | hz |
182 | iso-ir-165 |
67d7b5ef |
183 | ---------------------------------------------------------------- |
5129552c |
184 | |
80a5d8e7 |
185 | (*1) GB2312 is aliased to this. see L<Microsoft-related naming mess> |
186 | (*2) gbk is aliased to this. see L<Microsoft-related naming mess> |
187 | |
5129552c |
188 | =item Encode::JP -- Japan |
189 | |
80a5d8e7 |
190 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef |
191 | ---------------------------------------------------------------- |
a999c27c |
192 | euc-jp |
193 | shiftjis cp932 macJapanese |
80a5d8e7 |
194 | 7bit-jis |
196 | iso-2022-jp [RFC1468] |
197 | iso-2022-jp-1 [RFC2237] |
198 | jis0201-raw { JIS X 0201 (roman + halfwidth kana) without CES } |
199 | jis0208-raw { JIS X 0208 (Kanji + fullwidth kana) without CES } |
200 | jis0212-raw { JIS X 0212 (Extended Kanji) without CES } |
67d7b5ef |
201 | ---------------------------------------------------------------- |
5129552c |
202 | |
203 | =item Encode::KR -- Korea |
204 | |
80a5d8e7 |
205 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef |
206 | ---------------------------------------------------------------- |
a999c27c |
207 | euc-kr MacKorean [RFC1557] |
80a5d8e7 |
208 | cp949 (*) |
a999c27c |
209 | iso-2022-kr [RFC1557] |
210 | johab [KS X 1001:1998, Annex 3] |
80a5d8e7 |
211 | ksc5601-raw { KSC5601 without CES } |
67d7b5ef |
212 | ---------------------------------------------------------------- |
5129552c |
213 | |
80a5d8e7 |
214 | (*) ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to |
215 | this. See below. |
216 | |
217 | |
5129552c |
218 | =item Encode::TW -- Taiwan |
219 | |
80a5d8e7 |
220 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef |
221 | ---------------------------------------------------------------- |
a999c27c |
222 | big5 cp950 MacChineseTrad |
5129552c |
223 | big5-hkscs |
67d7b5ef |
224 | ---------------------------------------------------------------- |
5129552c |
225 | |
226 | =item Encode::HanExtra -- More Chinese via CPAN |
227 | |
228 | Due to size concerns, additional Chinese encodings below are |
229 | distributed separately on CPAN, under the name Encode::HanExtra. |
230 | |
80a5d8e7 |
231 | Standard DOS/Win Macintosh Comment/Reference |
67d7b5ef |
232 | ---------------------------------------------------------------- |
5129552c |
233 | gb18030 |
234 | euc-tw |
235 | big5plus |
67d7b5ef |
236 | ---------------------------------------------------------------- |
5129552c |
237 | |
238 | =back |
239 | |
240 | =head2 Miscellaneous encodings |
241 | |
242 | =over 4 |
243 | |
244 | =item Encode::EBCDIC |
5d030b67 |
245 | |
a999c27c |
246 | See L<perlebcdic> for details. |
5d030b67 |
247 | |
67d7b5ef |
248 | ---------------------------------------------------------------- |
5d030b67 |
249 | cp37 |
a999c27c |
250 | cp500 |
251 | cp875 |
252 | cp1026 |
253 | cp1047 |
5d030b67 |
254 | posix-bc |
67d7b5ef |
255 | ---------------------------------------------------------------- |
5129552c |
256 | |
a63c962f |
257 | =item Encode::Symbols |
5d030b67 |
258 | |
5129552c |
259 | For symbols and dingbats. |
5d030b67 |
260 | |
67d7b5ef |
261 | ---------------------------------------------------------------- |
5d030b67 |
262 | symbol |
263 | dingbats |
a999c27c |
264 | MacDingbats |
265 | AdobeZdingbat |
266 | AdobeSymbol |
67d7b5ef |
267 | ---------------------------------------------------------------- |
268 | |
269 | =back |
270 | |
271 | =head1 Unsupported encodings |
272 | |
273 | The following are not supported as yet, some because they are rarely |
274 | used, some because of technical difficulty. They may, however, be |
275 | supported by external modules via CPAN in the future. |
276 | |
277 | =over 4 |
278 | |
279 | =item ISO-2022-JP-2 [RFC1554] |
280 | |
281 | Not very popular yet. Needs Unicode Database or equivalent to |
282 | implement encode() (because it includes JIS X 0208/0212, KSC5601, and |
283 | GB2312 simultaneously, whose code points in Unicode overlap, so you |
284 | need to look up the database to determine which character set a given |
285 | Unicode character should belong to). |
286 | |
287 | =item ISO-2022-CN [RFC1922] |
288 | |
289 | Not very popular. Needs CNS 11643-1 and 2 which are not available in |
290 | this module. CNS 11643 is supported (via euc-tw) in |
291 | Encode::HanExtra. Autrijus may add support for this encoding to his |
292 | module in the future. |
293 | |
294 | =item various HP-UX encodings |
295 | |
296 | The following are unsupported due to the lack of mapping data. |
297 | |
298 | '8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8 |
299 | '15' - japanese15, korean15, and roi15 |
300 | |
301 | =item Cyrillic encoding ISO-IR-111 |
302 | |
303 | Anton doubts its usefulness. |
304 | |
305 | =item ISO-8859-8-1 [Hebrew] |
306 | |
a999c27c |
307 | None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and |
308 | MacHebrew are supported only because there were mappings |
309 | available at L<http://www.unicode.org/>). Contributions welcome. |
67d7b5ef |
310 | |
311 | =item Thai encoding TCVN |
312 | |
313 | Ditto. |
314 | |
315 | =item Vietnamese encodings VPS |
316 | |
a999c27c |
317 | Though Jungshik has reported that Mozilla supports this encoding, it was too late for us to add it. It may be added in the future via a separate module. See |
318 | L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf> and |
319 | L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut> |
320 | if you are interested in helping us. |
67d7b5ef |
321 | |
322 | =item various Mac encodings |
323 | |
a999c27c |
324 | The following are unsupported due to the lack of mapping data. |
325 | |
326 | MacArmenian, MacBengali, MacBurmese, MacEthiopic |
327 | MacExtArabic, MacGeorgian, MacKannada, MacKhmer |
328 | MacLaotian, MacMalayalam, MacMongolian, MacOriya |
329 | MacSinhalese, MacTamil, MacTelugu, MacTibetan |
330 | MacVietnamese |
331 | |
332 | The rest, which are already available, are based upon the vendor mappings at |
333 | L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> . |
334 | |
335 | =item (Mac) Indic encodings |
336 | |
337 | The maps for the following are available at L<http://www.unicode.org/> |
338 | but they remain unsupported because those encodings need an algorithmic |
339 | approach, which F<enc2xs> does not support. |
67d7b5ef |
340 | |
a999c27c |
341 | MacDevanagari |
342 | MacGurmukhi |
343 | MacGujarati |
67d7b5ef |
344 | |
a999c27c |
345 | For details, please see C<Unicode mapping issues and notes:> at |
346 | L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> . |
347 | |
348 | I believe this issue is prevalent not only in Mac Indics but also in |
349 | other Indic encodings, but those mentioned were the only Indic encoding |
350 | maps that I could find at L<http://www.unicode.org/> . |
5129552c |
351 | |
352 | =back |
5d030b67 |
353 | |
a999c27c |
354 | =head1 Encoding vs. Charset -- terminology |
5d030b67 |
355 | |
a999c27c |
356 | We are used to using the terms (character) I<encoding> and I<character set> |
357 | interchangeably. But just as confusing the terms byte and character is |
358 | dangerous and the two should be differentiated when needed, we need to |
359 | differentiate I<encoding> and I<character set>. |
5d030b67 |
360 | |
80a5d8e7 |
361 | To understand that, let's follow how we make computers grok our characters. |
a999c27c |
362 | |
363 | =over 4 |
364 | |
365 | =item * |
67d7b5ef |
366 | |
a999c27c |
367 | First we start with which characters to include. We call this |
368 | collection of characters a I<character repertoire>. |
5d030b67 |
369 | |
a999c27c |
370 | =item * |
5d030b67 |
371 | |
a999c27c |
372 | Then we have to give each character a unique ID so your computer can |
373 | tell the difference between 'a' and 'A'. This itemized character |
374 | repertoire is now a I<character set>. |
a63c962f |
375 | |
a999c27c |
376 | =item * |
377 | |
378 | If your computer can grok the character set without further |
379 | processing, you can go ahead and use it. This is called a I<coded |
380 | character set> (CCS) or I<raw character encoding>. ASCII is used this |
381 | way in most cases. |
382 | |
383 | =item * |
384 | |
385 | But in many cases, especially with multi-byte CJK encodings, you have to |
386 | tweak a little more. Your network connection may not accept any data |
387 | with the Most Significant Bit set, or your computer may not be able to |
388 | tell if a given byte is a whole character or just half of it. So you |
389 | have to I<encode> the character set to use it. |
390 | |
391 | A I<character encoding scheme> (CES) determines how to encode a given |
392 | character set, or a set of multiple character sets. 7bit ISO-2022 is |
393 | an example of a CES. You switch between character sets via I<escape |
394 | sequences>. |
67d7b5ef |
395 | |
396 | =back |
397 | |
a999c27c |
398 | Technically, or mathematically speaking, a character set encoded in |
399 | a CES that maps character by character may itself form a CCS. EUC is such |
400 | an example. The CES of EUC is as follows: |
67d7b5ef |
401 | |
a999c27c |
402 | =over 4 |
5d030b67 |
403 | |
a999c27c |
404 | =item * |
5d030b67 |
405 | |
a999c27c |
406 | Map ASCII unchanged. |
407 | |
408 | =item * |
409 | |
410 | Map a character set that consists of 94 or 96 to the power of N |
411 | members by adding 0x80 to each byte. |
412 | |
413 | =item * |
414 | |
415 | You can also use 0x8e and 0x8f to tell that the following sequence of |
416 | characters belongs to yet another character set. 0x80 is added to each |
417 | following byte. |
418 | |
419 | =back |
420 | |
421 | By carefully looking at the encoded byte sequence, you may find that |
422 | each byte sequence forms a unique number. In that sense EUC is a CCS |
423 | generated by the CES above from up to four CCSs (complicated?). UTF-8 |
424 | falls into this category. See L<perlunicode/"UTF-8"> to find how |
425 | UTF-8 maps Unicode to a byte sequence. |
426 | |
427 | You may also see by now why 7bit ISO-2022 cannot form a CCS. If |
428 | you look at a byte sequence \x21\x21, you can't tell if it is two !'s |
429 | or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1 so you have no |
430 | trouble telling it apart from "!!". |
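The \x21\x21 ambiguity can be seen by encoding IDEOGRAPHIC SPACE (U+3000) both ways. This is a minimal sketch assuming Encode::JP is available; hex_dump is a helper defined here for illustration and is not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative helper: render a byte string as space-separated hex.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, shift }

my $space = "\x{3000}";   # IDEOGRAPHIC SPACE

# EUC sets the high bit of each byte, so the result cannot be
# mistaken for ASCII "!!" (21 21).
print 'euc-jp:      ', hex_dump(encode('euc-jp', $space)), "\n";

# 7bit ISO-2022 keeps the bytes 21 21 and relies on the surrounding
# escape sequences to say which character set is in effect.
print 'iso-2022-jp: ', hex_dump(encode('iso-2022-jp', $space)), "\n";
```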
67d7b5ef |
431 | |
a63c962f |
432 | =head1 Encoding Classification (by Anton Tagunov and Dan Kogai) |
433 | |
434 | This section tries to classify the supported encodings by their |
435 | applicability for information exchange over the Internet and to |
436 | choose the most suitable aliases to name them in the context of |
437 | such communication. |
438 | |
67d7b5ef |
439 | =over 2 |
440 | |
441 | =item * |
442 | |
80a5d8e7 |
443 | To (en|de)code encodings marked as C<(**)>, you need |
a999c27c |
444 | C<Encode::HanExtra>, available from CPAN. |
67d7b5ef |
445 | |
446 | =back |
447 | |
a63c962f |
448 | Encoding names |
5d030b67 |
449 | |
80a5d8e7 |
450 | US-ASCII UTF-8 ISO-8859-* KOI8-R |
451 | Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1 |
452 | EUC-KR Big5 GB2312 |
a999c27c |
453 | |
454 | are registered with IANA as preferred MIME names and can probably |
455 | be used over the Internet. |
5d030b67 |
456 | |
a999c27c |
457 | C<Shift_JIS> has been made official by JIS X 0208-1997. |
458 | L<Microsoft-related naming mess> gives details. |
5d030b67 |
459 | |
a999c27c |
460 | C<GB2312> is the IANA name for C<EUC-CN>. |
461 | See L<Microsoft-related naming mess> for details. |
462 | |
463 | C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw> |
80a5d8e7 |
464 | with Encode. See L<Encode::CN> for details. |
5d030b67 |
465 | |
a63c962f |
466 | EUC-CN |
80a5d8e7 |
467 | KOI8-U [RFC2319] |
5d030b67 |
468 | |
a999c27c |
469 | have not been registered with IANA (as of March 2002) but |
470 | seem to be supported by major web browsers. |
471 | The IANA name for C<EUC-CN> is C<GB2312>. |
67d7b5ef |
472 | |
473 | KS_C_5601-1987 |
474 | |
a999c27c |
475 | is heavily misused. |
476 | See L<Microsoft-related naming mess> for details. |
477 | |
478 | C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw> |
80a5d8e7 |
479 | with Encode. See L<Encode::KR> for details. |
480 | |
481 | UTF-16 UTF-16BE UTF-16LE |
482 | |
483 | are IANA-registered C<charset>s. See [RFC 2781] for details. |
484 | Jungshik Shin reports that UTF-16 with a BOM is well accepted |
485 | by MS IE 5/6 and NS 4/6. Beware however that |
486 | |
487 | =over 2 |
488 | |
489 | =item * |
5d030b67 |
490 | |
80a5d8e7 |
491 | C<UTF-16> support in any software you're going to be |
492 | using/interoperating with has probably been less tested |
493 | than C<UTF-8> support |
5d030b67 |
494 | |
80a5d8e7 |
495 | =item * |
496 | |
497 | data coded with C<UTF-8> seamlessly passes traditional |
498 | command piping (C<cat>, C<more>, etc.) while UTF-16 coded |
499 | data is likely to cause confusion (with its zero bytes, |
500 | for example) |
501 | |
502 | =item * |
503 | |
504 | it is beyond the power of words to describe the way HTML browsers |
505 | encode non-C<ASCII> form data. To get a general impression refer to |
506 | L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>. |
507 | While encoding of form data has stabilized for C<UTF-8> coded pages |
508 | (at least IE 5/6, NS 6, Opera 6 behave consistently), be sure to |
509 | expect fun (and cross-browser discrepancies) with C<UTF-16> coded |
510 | pages! |
511 | |
512 | =back |
513 | |
514 | The rule of thumb is to use C<UTF-8> unless you know what |
515 | you're doing and unless you really benefit from using C<UTF-16>. |
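The zero bytes mentioned above are easy to see by dumping the same ASCII text in each encoding. This is a small sketch; hex_dump is a helper defined here for illustration and is not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative helper: render a byte string as space-separated hex.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, shift }

my $text = 'Hi';

# UTF-8 leaves ASCII untouched; the UTF-16 variants interleave 00
# bytes, and plain UTF-16 also prepends a byte order mark.
for my $enc (qw(UTF-8 UTF-16 UTF-16BE UTF-16LE)) {
    printf "%-9s %s\n", $enc, hex_dump(encode($enc, $text));
}
```

Tools that treat a zero byte as a string terminator will truncate or mangle the UTF-16 output, which is one practical reason behind the rule of thumb.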
a999c27c |
516 | |
5d030b67 |
517 | |
80a5d8e7 |
518 | ISO-IR-165 [RFC1345] |
5d030b67 |
519 | GBK |
520 | VISCII |
a63c962f |
521 | GB 12345 |
80a5d8e7 |
522 | GB 18030 (**) (see links below) |
523 | EUC-TW (**) |
5d030b67 |
524 | |
525 | are totally valid encodings but not registered at IANA. |
a63c962f |
526 | The names under which they are listed here are probably the |
527 | most widely-known names for these encodings and are recommended |
528 | names. |
529 | |
80a5d8e7 |
530 | BIG5PLUS (**) |
a63c962f |
531 | |
67d7b5ef |
532 | is a somewhat proprietary name. |
5d030b67 |
533 | |
a999c27c |
534 | =head2 Microsoft-related naming mess |
535 | |
536 | Microsoft products misuse the following names: |
5d030b67 |
537 | |
67d7b5ef |
538 | =over 2 |
a63c962f |
539 | |
a999c27c |
540 | =item KS_C_5601-1987 |
5d030b67 |
541 | |
a999c27c |
542 | Microsoft extension to C<EUC-KR>. |
5d030b67 |
543 | |
a999c27c |
544 | Proper name: C<CP949>. |
67d7b5ef |
545 | |
80a5d8e7 |
546 | See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html> |
a999c27c |
547 | for details. |
5d030b67 |
548 | |
80a5d8e7 |
549 | Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common |
550 | misuse. I<Raw> C<KS_C_5601-1987> encoding is available as |
551 | C<ksc5601-raw>. |
5d030b67 |
552 | |
80a5d8e7 |
553 | See L<Encode::KR> for details. |
67d7b5ef |
554 | |
a999c27c |
555 | =item GB2312 |
67d7b5ef |
556 | |
a999c27c |
557 | Microsoft extension to C<EUC-CN>. |
a63c962f |
558 | |
a999c27c |
559 | Proper names: C<CP936>, C<GBK>. |
a63c962f |
560 | |
a999c27c |
561 | C<GB2312> has been registered in the C<EUC-CN> meaning at |
562 | IANA. This has partially repaired the situation: Microsoft's |
563 | C<GB2312> has become a superset of the official C<GB2312>. |
67d7b5ef |
564 | |
a999c27c |
565 | Encode aliases C<GB2312> to C<euc-cn> in full agreement with |
566 | IANA registration. C<cp936> is supported separately. |
80a5d8e7 |
567 | I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>. |
a999c27c |
568 | |
80a5d8e7 |
569 | See L<Encode::CN> for details. |
a999c27c |
570 | |
571 | =item Big5 |
572 | |
573 | Microsoft extension to C<Big5>. |
574 | |
575 | Proper name: C<CP950>. |
576 | |
577 | Encode separately supports C<Big5> and C<cp950>. |
578 | |
579 | =item Shift_JIS |
580 | |
581 | Microsoft's understanding of C<Shift_JIS>. |
582 | |
583 | JIS has not endorsed the full Microsoft standard however. |
584 | The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208 |
585 | subsets, while Microsoft has always used C<Shift_JIS> to mean an |
586 | encoding of a wider character repertoire. |
587 | |
588 | As the historical predecessor, Microsoft's variant |
589 | probably has more right to the name, though it may be objected |
590 | that Microsoft shouldn't have used JIS as part of the name |
591 | in the first place. |
592 | |
593 | Unambiguous name: C<CP932>. |
594 | |
595 | Encode separately supports C<Shift_JIS> and C<cp932>. |
596 | |
597 | =back |
598 | |
599 | =head1 Glossary |
600 | |
601 | =over 2 |
602 | |
603 | =item character repertoire |
604 | |
605 | A collection of unique characters. A I<character> set in the |
606 | strictest sense. At this stage characters are not numbered. |
607 | |
608 | =item coded character set (CCS) |
609 | |
610 | A character set that is mapped in a way computers can use directly. |
611 | Many character encodings, including EUC, fall into this category. |
612 | |
613 | =item character encoding scheme (CES) |
614 | |
615 | An algorithm to map a character set to a byte sequence. You don't |
616 | have to be able to tell which character set a given byte sequence |
617 | belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an |
618 | example of being both a CCS and a CES. |
619 | |
80a5d8e7 |
620 | =item charset (in MIME context) |
621 | |
622 | has long been used to mean C<encoding>, that is, a CES. |
623 | |
624 | While the word combination C<character set> has lost this meaning |
625 | in MIME context since [RFC 2130], the abbreviation C<charset> has |
626 | retained it. This is how [RFC 2277], [RFC 2278] bless C<charset>: |
627 | |
628 | |
629 | This document uses the term "charset" to mean a set of rules for |
630 | mapping from a sequence of octets to a sequence of characters, such |
631 | as the combination of a coded character set and a character encoding |
632 | scheme; this is also what is used as an identifier in MIME "charset=" |
633 | parameters, and registered in the IANA charset registry ... (Note |
634 | that this is NOT a term used by other standards bodies, such as ISO). |
635 | [RFC 2277] |
636 | |
a999c27c |
637 | =item EUC |
638 | |
639 | Extended Unix Code. See ISO-2022. |
640 | |
641 | =item ISO-2022 |
642 | |
643 | A CES that was carefully designed to coexist with ASCII. There are a 7 |
80a5d8e7 |
644 | bit version and an 8 bit version. |
645 | |
646 | The 7 bit version switches character sets via escape sequences so it |
647 | cannot form a CCS. Since this is more difficult to handle in programs |
648 | than the 8 bit version, the 7 bit version is not very popular except for |
649 | iso-2022-jp, the de facto standard CES for e-mails. |
650 | |
651 | The 8 bit version can form a CCS. EUC and ISO-8859 are two examples |
652 | thereof. Pre-5.6 perl could use them as string literals. |
a999c27c |
653 | |
654 | =item UCS |
655 | |
656 | Short for I<Universal Character Set>. When you say just UCS, it means |
657 | I<Unicode>. |
658 | |
659 | =item UCS-2 |
660 | |
661 | ISO/IEC 10646 encoding form: Universal Character Set coded in two |
662 | octets. |
663 | |
664 | =item Unicode |
665 | |
80a5d8e7 |
666 | A character set that aims to include all character repertoires of the |
667 | world. Many character sets in various national as well as industrial |
668 | standards have become, in a way, just subsets of Unicode. |
a999c27c |
669 | |
670 | =item UTF |
671 | |
80a5d8e7 |
672 | Short for I<Unicode Transformation Format>. Determines how to map a |
a999c27c |
673 | Unicode character into a byte sequence. |
674 | |
675 | =item UTF-16 |
676 | |
677 | A UTF in 16-bit encoding. Can either be in big endian or little |
80a5d8e7 |
678 | endian. The big endian version is called UTF-16BE (equal to UCS-2 plus |
679 | surrogate support) and the little endian version is UTF-16LE. |
67d7b5ef |
680 | |
681 | =back |
5d030b67 |
682 | |
683 | =head1 See Also |
684 | |
5129552c |
685 | L<Encode>, |
686 | L<Encode::Byte>, |
a63c962f |
687 | L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>, |
5129552c |
688 | L<Encode::EBCDIC>, L<Encode::Symbol> |
5d030b67 |
689 | |
a999c27c |
690 | =head1 References |
691 | |
692 | =over 2 |
693 | |
694 | =item ECMA |
695 | |
696 | European Computer Manufacturers Association |
697 | L<http://www.ecma.ch> |
698 | |
699 | =over 2 |
700 | |
701 | =item ECMA-035 (eq C<ISO-2022>) |
702 | |
703 | L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> |
704 | |
705 | The very specification of ISO-2022 is available from the link above. |
706 | |
707 | =back |
708 | |
709 | =item IANA |
710 | |
711 | Internet Assigned Numbers Authority |
712 | L<http://www.iana.org/> |
713 | |
714 | =over 2 |
715 | |
716 | =item Assigned Charset Names by IANA |
717 | |
718 | L<http://www.iana.org/assignments/character-sets> |
719 | |
720 | Most of the C<canonical names> in Encode derive from this list |
721 | so you can directly apply the string you have extracted from the MIME |
722 | headers of mails and web pages. |
723 | |
724 | =back |
725 | |
726 | =item ISO |
727 | |
728 | International Organization for Standardization |
729 | L<http://www.iso.ch/> |
730 | |
731 | =item RFC |
732 | |
733 | Request For Comments -- need I say more? |
80a5d8e7 |
734 | L<http://www.rfc.net/>, L<http://www.faqs.org/rfcs/> |
a999c27c |
735 | |
736 | =item UC |
737 | |
738 | Unicode Consortium |
739 | L<http://www.unicode.org/> |
740 | |
741 | =over 2 |
742 | |
743 | =item Unicode Glossary |
744 | |
745 | L<http://www.unicode.org/glossary/> |
746 | |
747 | The glossary of this document is based upon this site. |
748 | |
749 | =back |
750 | |
751 | =back |
752 | |
753 | =head2 Other Notable Sites |
754 | |
755 | =over 2 |
756 | |
757 | =item czyborra.com |
758 | |
80a5d8e7 |
759 | L<http://czyborra.com/> |
a999c27c |
760 | |
761 | Contains a lot of useful information, especially gory details of ISO |
762 | vs. vendor mappings. |
763 | |
764 | =item CJK.inf |
765 | |
766 | L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html> |
767 | |
768 | Somewhat obsolete (last update in 1996), but still useful. Also try |
769 | |
770 | L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf> |
771 | |
772 | You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>. |
773 | |
80a5d8e7 |
774 | =item Jungshik Shin's Hangul FAQ |
775 | |
776 | L<http://jshin.net/faq> |
777 | |
778 | And especially its subject 8 |
779 | |
780 | L<http://jshin.net/faq/qa8.html> |
781 | |
782 | a comprehensive overview of the Korean (C<KS *>) standards. |
783 | |
784 | =back |
785 | |
786 | =head2 Offline sources |
787 | |
788 | =over 2 |
789 | |
790 | =item C<CJKV Information Processing> by Ken Lunde |
791 | |
792 | CJKV Information Processing |
793 | 1999 O'Reilly & Associates, ISBN : 1-56592-224-7 |
794 | |
795 | The modern successor of the C<CJK.inf>. |
796 | |
797 | Features comprehensive coverage of CJKV character sets and |
798 | encodings along with many other issues faced by anyone trying |
799 | to better support CJKV languages/scripts in all the areas of |
800 | information processing. |
801 | |
802 | To purchase this book visit |
803 | L<http://www.oreilly.com/catalog/cjkvinfo/> |
804 | |
a999c27c |
805 | =back |
806 | |
5d030b67 |
807 | =cut |
67d7b5ef |
808 | |
809 | I could not find this page because the hostname doesn't resolve! |
810 | |
811 | Brief description for most of the mentioned CJK encodings |
812 | L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html> |