Commit | Line | Data |
5d030b67 |
1 | =head1 NAME |
2 | |
a999c27c |
3 | Encode::Supported -- Encodings supported by Encode |
5d030b67 |
4 | |
5 | =head1 DESCRIPTION |
6 | |
5129552c |
7 | =head2 Encoding Names |
5d030b67 |
8 | |
9 | Encoding names are case insensitive. White space in names |
10 | is ignored. In addition an encoding may have aliases. |
11 | Each encoding has one "canonical" name. The "canonical" |
12 | name is chosen from the names of the encoding by picking |
a999c27c |
13 | the first in the following sequence (with a few exceptions). |
5d030b67 |
14 | |
a999c27c |
15 | =over |
16 | |
17 | =item * |
18 | |
19 | The name used by the Perl community. That includes 'utf8' and 'ascii'. |
20 | Unlike aliases, canonical names reach the method directly, so |
21 | frequently used names such as 'utf8' need no alias lookup. |
22 | |
23 | =item * |
24 | |
25 | The MIME name as defined in IETF RFCs. This includes all "iso-"'s. |
26 | |
27 | =item * |
28 | |
29 | The name in the IANA registry. |
30 | |
31 | =item * |
32 | |
33 | The name used by the organization that defined it. |
34 | |
35 | =back |
36 | |
37 | Where I<de jure> canonical names differ from those of the Encode |
38 | module, they are always aliased if the encoding is implemented. So you |
39 | can safely tell whether a given encoding is implemented just by passing |
40 | the canonical name. |
5d030b67 |
41 | |
5129552c |
42 | Because of all the alias issues, and because in the general case |
43 | encodings have state, "Encode" uses the encoding object internally |
44 | once an operation is in progress. |
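The name rules above can be tried out with a short sketch. It assumes the C<find_encoding> and C<resolve_alias> functions exported on request by Encode; the helper C<canonical_name> is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# Hypothetical helper: canonical name for any spelling or alias,
# or undef when Encode does not know the encoding.
sub canonical_name {
    my ($name) = @_;
    my $enc = find_encoding($name);    # encoding object, or undef
    return $enc ? $enc->name : undef;
}

for my $name ('latin1', 'ISO-8859-1', 'US-ascii', 'no-such-encoding') {
    my $canon = canonical_name($name);
    printf "%-18s => %s\n", $name,
        defined $canon ? $canon : '(not implemented)';
}

# resolve_alias() answers the same question without keeping the object:
# it returns the canonical name, or a false value for unknown names.
print "latin1 is implemented\n" if resolve_alias('latin1');
```

Keeping the object returned by C<find_encoding> and calling its methods avoids repeating the alias lookup on every operation.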
5d030b67 |
45 | |
5129552c |
46 | =head1 Supported Encodings |
5d030b67 |
47 | |
48 | As of Perl 5.8.0, at least the following encodings are recognized. |
49 | Note that unless otherwise specified, they are all case insensitive |
a63c962f |
50 | (via alias) and all occurrences of spaces are replaced with '-'. In |
5d030b67 |
51 | other words, "ISO 8859 1" and "iso-8859-1" are identical. |
52 | |
5129552c |
53 | Encodings are categorized and implemented in several different modules |
54 | but you don't have to C<use Encode::XX> to make them available in |
55 | most cases. Encode.pm will automatically load those modules on demand. |
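As a minimal sketch of this on-demand loading, the following uses a CJK encoding without naming Encode::JP anywhere. It assumes a standard Encode installation in which C<euc-jp> is provided by Encode::JP.

```perl
use strict;
use warnings;
use Encode qw(encode decode);    # note: no explicit "use Encode::JP"

my $text  = "\x{3042}";                 # HIRAGANA LETTER A
my $bytes = encode('euc-jp', $text);    # Encode loads the providing module itself
my $back  = decode('euc-jp', $bytes);

printf "euc-jp bytes: %s\n", unpack('H*', $bytes);
print  "round trip ok\n" if $back eq $text;
```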
5d030b67 |
56 | |
5129552c |
57 | =head2 Built-in Encodings |
5d030b67 |
58 | |
5129552c |
59 | The following encodings are always available. |
5d030b67 |
60 | |
67d7b5ef |
61 | Canonical   Aliases                      Comments & References |
62 | ---------------------------------------------------------------- |
a999c27c |
63 | ascii       US-ascii                     [ECMA] |
67d7b5ef |
64 | iso-8859-1  latin1                       [ISO] |
a999c27c |
65 | utf8        UTF-8                        [RFC2279] |
66 | UCS-2       ucs2, iso-10646-1, UTF-16LE  [IANA, UC] |
67 | UTF-16LE    UCS-2LE                      [UC] |
67d7b5ef |
68 | ---------------------------------------------------------------- |
5d030b67 |
69 | |
a999c27c |
70 | =head2 Encode::Byte -- Extended ASCII |
5d030b67 |
71 | |
a999c27c |
72 | Encode::Byte implements most single-byte encodings except for |
73 | Symbols and EBCDIC. The following encodings are single-byte |
74 | encodings implemented as extended ASCII. In most cases they use |
75 | \x80-\xff (the upper half) to map non-ASCII characters. |
76 | |
77 | =over 2 |
78 | |
79 | =item ISO-8859 and corresponding vendor mappings |
80 | |
81 | Since there are so many, they are presented in table format with |
82 | languages and the corresponding encoding names by vendor. Note that the |
83 | table is sorted in ISO-8859 order and that the corresponding vendor |
84 | mappings differ slightly from those of ISO. See |
85 | L<http://czyborra.com/charsets/iso8859.html> for details. |
86 | |
87 | Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others |
88 | ---------------------------------------------------------------- |
89 | N. America    (ASCII)         cp437        AdobeStandardEncoding |
90 |                               cp863 (DOSCanadaF) |
91 | W. Europe     (iso-8859-1)    cp850   cp1252  MacRoman  nextstep |
92 |                                                        hp-roman8 |
93 |                               cp860 (DOSPortuguese) |
94 | CE. Europe    iso-8859-2      cp852   cp1250  MacCentralEurRoman |
95 |                                               MacCroatian |
96 |                                               MacRomanian |
97 |                                               MacRumanian |
98 | Latin3(*3)    iso-8859-3 |
99 | Latin4(*4)    iso-8859-4 |
100 | Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic |
101 | (Also see next section)       cp866           MacUkrainian |
102 | Arabic        iso-8859-6      cp864   cp1256  MacArabic |
103 |                               cp1006          MacFarsi |
104 | Greek         iso-8859-7      cp737   cp1253  MacGreek |
105 |                               cp869 (DOSGreek2) |
106 | Hebrew        iso-8859-8      cp862   cp1255  MacHebrew |
107 | Turkish       iso-8859-9      cp857   cp1254  MacTurkish |
108 | Nordics       iso-8859-10     cp865 |
109 |                               cp861           MacIcelandic |
110 |                                               MacSami |
111 | Thai          iso-8859-11     cp874           MacThai |
112 | (iso-8859-12 is nonexistent. Reserved for Indics?) |
113 | Baltics       iso-8859-13     cp775   cp1257 |
114 | Celtics       iso-8859-14 |
115 | Latin9(*15)   iso-8859-15 |
116 | Latin10       iso-8859-16 |
117 | Vietnamese    viscii                  cp1258  MacVietnamese |
118 | ---------------------------------------------------------------- |
119 | |
120 | (*3) Esperanto, Maltese, and Turkish. Turkish is now on 8859-9. |
121 | (*4) Baltics. Now on 8859-10. |
122 | (*15) Nicknamed Latin0; the Euro sign as well as French and Finnish |
123 | letters that are missing from 8859-1 are added. |
124 | |
125 | All cp* are also available as ibm-*, ms-*, and windows-* . See also |
126 | L<http://czyborra.com/charsets/codepages.html>. |
127 | |
128 | Macintosh encodings don't seem to be registered with bodies such as |
129 | IANA. "Canonical" names in Encode are based upon Apple's Tech Note |
130 | 1150. See L<http://developer.apple.com/technotes/tn/tn1150.html> |
131 | for details. |
132 | |
133 | =item KOI8 - De Facto Standard for the Cyrillic world |
134 | |
135 | Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more |
136 | popular on the Net. L<Encode> comes with the following KOI charsets. |
137 | For gory details, see L<http://czyborra.com/charsets/cyrillic.html>. |
138 | |
5d030b67 |
139 | |
67d7b5ef |
140 | ---------------------------------------------------------------- |
67d7b5ef |
141 | koi8-f |
a999c27c |
142 | koi8-r cp878 [RFC1489] |
67d7b5ef |
143 | koi8-u [RFC2319] |
67d7b5ef |
144 | |
a999c27c |
145 | =item gsm0338 - Hentai Latin 1 |
146 | |
147 | GSM0338 is for GSM handsets. Though it shares alphanumerics with ASCII, |
148 | control character ranges and other parts are mapped very differently, |
149 | mainly to store Greek letters. This one is also covered in |
150 | Encode::Byte even though it does not comply with extended ASCII. |
151 | |
152 | =back |
5d030b67 |
153 | |
5129552c |
154 | =head2 The CJK: Chinese, Japanese, Korean (Multibyte) |
5d030b67 |
155 | |
156 | Note Vietnamese is listed above. Also read "Encoding vs Charset" |
a63c962f |
157 | below. Also note that these are implemented in distinct modules by |
158 | language, due to size concerns. Please also refer to their |
159 | respective documentation pages. |
5d030b67 |
160 | |
5129552c |
161 | =over 4 |
162 | |
163 | =item Encode::CN -- Continental China |
164 | |
a999c27c |
165 | Standard      DOS/Win Macintosh       Comment |
67d7b5ef |
166 | ---------------------------------------------------------------- |
a999c27c |
167 | euc-cn                MacChineseSimp  GB2312 is aliased to this |
168 | (gbk)         cp936                   GBK is aliased to this |
169 | gb12345-raw                           GB12345 as is |
170 | gb2312-raw                            GB2312 as is |
5129552c |
171 | hz |
172 | iso-ir-165 |
67d7b5ef |
173 | ---------------------------------------------------------------- |
5129552c |
174 | |
175 | =item Encode::JP -- Japan |
176 | |
a999c27c |
177 | Standard      DOS/Win Macintosh       Comment/Reference |
67d7b5ef |
178 | ---------------------------------------------------------------- |
a999c27c |
179 | euc-jp |
180 | shiftjis      cp932   macJapanese |
5129552c |
181 | 7bit-jis                              jis |
5129552c |
182 | euc-jp                                ujis |
a999c27c |
183 | iso-2022-jp                           [RFC1468] |
184 | iso-2022-jp-1                         [RFC2237] |
67d7b5ef |
185 | ---------------------------------------------------------------- |
5129552c |
186 | |
187 | =item Encode::KR -- Korea |
188 | |
67d7b5ef |
189 | ---------------------------------------------------------------- |
a999c27c |
190 | euc-kr                MacKorean       [RFC1557] |
191 |               cp949                   ks_c_5601-1987 is an alias |
192 |                                       thereof. |
193 | iso-2022-kr                           [RFC1557] |
194 | johab                                 [KS X 1001:1998, Annex 3] |
195 | ksc5601-raw                           KSC5601 as is |
67d7b5ef |
196 | ---------------------------------------------------------------- |
5129552c |
197 | |
198 | =item Encode::TW -- Taiwan |
199 | |
67d7b5ef |
200 | ---------------------------------------------------------------- |
a999c27c |
201 | big5 cp950 MacChineseTrad |
5129552c |
202 | big5-hkscs |
67d7b5ef |
203 | ---------------------------------------------------------------- |
5129552c |
204 | |
205 | =item Encode::HanExtra -- More Chinese via CPAN |
206 | |
207 | Due to size concerns, additional Chinese encodings below are |
208 | distributed separately on CPAN, under the name Encode::HanExtra. |
209 | |
67d7b5ef |
210 | ---------------------------------------------------------------- |
5129552c |
211 | gb18030 |
212 | euc-tw |
213 | big5plus |
67d7b5ef |
214 | ---------------------------------------------------------------- |
5129552c |
215 | |
216 | =back |
217 | |
218 | =head2 Miscellaneous encodings |
219 | |
220 | =over 4 |
221 | |
222 | =item Encode::EBCDIC |
5d030b67 |
223 | |
a999c27c |
224 | See L<perlebcdic> for details. |
5d030b67 |
225 | |
67d7b5ef |
226 | ---------------------------------------------------------------- |
5d030b67 |
227 | cp37 |
a999c27c |
228 | cp500 |
229 | cp875 |
230 | cp1026 |
231 | cp1047 |
5d030b67 |
232 | posix-bc |
67d7b5ef |
233 | ---------------------------------------------------------------- |
5129552c |
234 | |
a63c962f |
235 | =item Encode::Symbols |
5d030b67 |
236 | |
5129552c |
237 | For symbols and dingbats. |
5d030b67 |
238 | |
67d7b5ef |
239 | ---------------------------------------------------------------- |
5d030b67 |
240 | symbol |
241 | dingbats |
a999c27c |
242 | MacDingbats |
243 | AdobeZdingbat |
244 | AdobeSymbol |
67d7b5ef |
245 | ---------------------------------------------------------------- |
246 | |
247 | =back |
248 | |
249 | =head1 Unsupported encodings |
250 | |
251 | The following are not supported as yet. Some because they are rarely |
252 | used, some because of technical difficulty. They may be supported by |
253 | external modules via CPAN in the future, however. |
254 | |
255 | =over 4 |
256 | |
257 | =item ISO-2022-JP-2 [RFC1554] |
258 | |
259 | Not very popular yet. Needs the Unicode Database or an equivalent to |
260 | implement encode(), because it includes JIS X 0208/0212, KSC5601, and |
261 | GB2312 simultaneously, whose code points in Unicode overlap. So you |
262 | need to look up the database to determine which character set a given |
263 | Unicode character should belong to. |
264 | |
265 | =item ISO-2022-CN [RFC1922] |
266 | |
267 | Not very popular. Needs CNS 11643-1 and 2, which are not available in |
268 | this module. CNS 11643 is supported (via euc-tw) in |
269 | Encode::HanExtra. Autrijus may add support for this encoding in his |
270 | module in the future. |
271 | |
272 | =item various HP-UX encodings |
273 | |
274 | The following are unsupported due to the lack of mapping data. |
275 | |
276 | '8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8 |
277 | '15' - japanese15, korean15, and roi15 |
278 | |
279 | =item Cyrillic encoding ISO-IR-111 |
280 | |
281 | Anton doubts its usefulness. |
282 | |
283 | =item ISO-8859-8-1 [Hebrew] |
284 | |
a999c27c |
285 | None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and |
286 | MacHebrew are supported only because mappings were |
287 | available at L<http://www.unicode.org/>). Contributions welcome. |
67d7b5ef |
288 | |
289 | =item Thai encoding TCVN |
290 | |
291 | Ditto. |
292 | |
293 | =item Vietnamese encodings VPS |
294 | |
a999c27c |
295 | Though Jungshik has reported that Mozilla supports this encoding, it was too late for us to add it. It may be supported in the future via a separate module. See |
296 | L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf> and |
297 | L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut> |
298 | if you are interested in helping us. |
67d7b5ef |
299 | |
300 | =item various Mac encodings |
301 | |
a999c27c |
302 | The following are unsupported due to the lack of mapping data. |
303 | |
304 | MacArmenian, MacBengali, MacBurmese, MacEthiopic |
305 | MacExtArabic, MacGeorgian, MacKannada, MacKhmer |
306 | MacLaotian, MacMalayalam, MacMongolian, MacOriya |
307 | MacSinhalese, MacTamil, MacTelugu, MacTibetan |
308 | MacVietnamese |
309 | |
310 | The Mac encodings that are already available are based upon the vendor mappings at |
311 | L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> . |
312 | |
313 | =item (Mac) Indic encodings |
314 | |
315 | The maps for the following are available at L<http://www.unicode.org/> |
316 | but they remain unsupported because those encodings need an algorithmic |
317 | approach, which is not supported by F<enc2xs>. |
67d7b5ef |
318 | |
a999c27c |
319 | MacDevanagari |
320 | MacGurmukhi |
321 | MacGujarati |
67d7b5ef |
322 | |
a999c27c |
323 | For details, please see C<Unicode mapping issues and notes:> at |
324 | L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> . |
325 | |
326 | I believe this issue is prevalent not only for Mac Indics but also in |
327 | other Indic encodings, but those mentioned above were the only Indic |
328 | encoding maps that I could find at L<http://www.unicode.org/> . |
5129552c |
329 | |
330 | =back |
5d030b67 |
331 | |
a999c27c |
332 | =head1 Encoding vs. Charset -- terminology |
5d030b67 |
333 | |
a999c27c |
334 | We are used to using the terms (character) I<encoding> and I<character set> |
335 | interchangeably. But just as confusing the terms byte and character is |
336 | dangerous and the two should be differentiated when needed, we need to |
337 | differentiate I<encoding> and I<character set>. |
5d030b67 |
338 | |
a999c27c |
339 | To understand that, let's follow how we make computers grok our characters. |
340 | |
341 | =over 4 |
342 | |
343 | =item * |
67d7b5ef |
344 | |
a999c27c |
345 | First we start with which characters to include. We call this |
346 | collection of characters a I<character repertoire>. |
5d030b67 |
347 | |
a999c27c |
348 | =item * |
5d030b67 |
349 | |
a999c27c |
350 | Then we have to give each character a unique ID so your computer can |
351 | tell the difference between 'a' and 'A'. This itemized character |
352 | repertoire is now a I<character set>. |
a63c962f |
353 | |
a999c27c |
354 | =item * |
355 | |
356 | If your computer can grok the character set without further |
357 | processing, you can go ahead and use it. This is called a I<coded |
358 | character set> (CCS) or I<raw character encoding>. ASCII is used this |
359 | way in most cases. |
360 | |
361 | =item * |
362 | |
363 | But in many cases, especially with multi-byte CJK encodings, you have |
364 | to tweak a little more. Your network connection may not accept any data |
365 | with the Most Significant Bit set, and your computer may not be able to |
366 | tell if a given byte is a whole character or just half of it. So you |
367 | have to I<encode> the character set to use it. |
368 | |
369 | A I<character encoding scheme> (CES) determines how to encode a given |
370 | character set, or a set of multiple character sets. 7bit ISO-2022 is |
371 | an example of a CES. You switch between character sets via I<escape |
372 | sequences>. |
67d7b5ef |
373 | |
374 | =back |
375 | |
a999c27c |
376 | Technically, or mathematically speaking, a character set encoded in |
377 | a CES that maps character by character may form a CCS. EUC is one |
378 | such example. The CES of EUC is as follows: |
67d7b5ef |
379 | |
a999c27c |
380 | =over 4 |
5d030b67 |
381 | |
a999c27c |
382 | =item * |
5d030b67 |
383 | |
a999c27c |
384 | Map ASCII unchanged. |
385 | |
386 | =item * |
387 | |
388 | Map a character set that consists of 94 or 96 to the power of N |
389 | members by adding 0x80 to each byte. |
390 | |
391 | =item * |
392 | |
393 | You can also use 0x8e and 0x8f to indicate that the following sequence |
394 | of characters belongs to yet another character set. 0x80 is added to |
395 | each following byte. |
396 | |
397 | =back |
398 | |
399 | By carefully looking at the encoded byte sequence, you may find that |
400 | the byte sequence forms a unique number. In that sense EUC is a CCS |
401 | generated by the CES above from up to four CCSs (complicated?). UTF-8 |
402 | falls into this category. See L<perlunicode/"UTF-8"> to find how |
403 | UTF-8 maps Unicode to a byte sequence. |
404 | |
405 | By now you may also see why 7bit ISO-2022 cannot form a CCS. If |
406 | you look at a byte sequence \x21\x21, you can't tell if it is two !'s |
407 | or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1 so you have no |
408 | trouble telling "!!" from an IDEOGRAPHIC SPACE. |
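The "add 0x80 to each byte" rule can be sketched in a few lines of plain Perl. The helper C<to_euc_bytes> is hypothetical and only illustrates the mapping for bytes in the 0x21-0x7E range, where adding 0x80 and setting the most significant bit are the same thing; it is not how Encode itself is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: turn 7-bit character set bytes (0x21-0x7E)
# into their EUC form by setting the most significant bit of each byte.
sub to_euc_bytes {
    my ($seven_bit) = @_;
    return join '', map { chr(ord($_) | 0x80) } split //, $seven_bit;
}

my $raw = "\x21\x21";            # IDEOGRAPHIC SPACE in JIS X 0208, or "!!" in ASCII
my $euc = to_euc_bytes($raw);    # "\xA1\xA1", no longer confusable with "!!"

printf "%s -> %s\n", unpack('H*', $raw), unpack('H*', $euc);    # prints "2121 -> a1a1"
```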
67d7b5ef |
409 | |
a63c962f |
410 | =head1 Encoding Classification (by Anton Tagunov and Dan Kogai) |
411 | |
412 | This section tries to classify the supported encodings by their |
413 | applicability for information exchange over the Internet and to |
414 | choose the most suitable aliases to name them in the context of |
415 | such communication. |
416 | |
67d7b5ef |
417 | =over 2 |
418 | |
419 | =item * |
420 | |
a999c27c |
421 | To (en|de)code encodings marked as C<(*)>, you need |
422 | C<Encode::HanExtra>, available from CPAN. |
67d7b5ef |
423 | |
424 | =back |
425 | |
a63c962f |
426 | Encoding names |
5d030b67 |
427 | |
67d7b5ef |
428 | US-ASCII UTF-8 ISO-8859-* KOI8-R |
a63c962f |
429 | Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1 |
a999c27c |
430 | EUC-KR Big5 GB2312 |
431 | |
432 | are registered with IANA as preferred MIME names and may |
433 | be used over the Internet. |
5d030b67 |
434 | |
a999c27c |
435 | C<Shift_JIS> has been made official by JIS X 0208-1997. |
436 | L<Microsoft-related naming mess> gives details. |
5d030b67 |
437 | |
a999c27c |
438 | C<GB2312> is the IANA name for C<EUC-CN>. |
439 | See L<Microsoft-related naming mess> for details. |
440 | |
441 | C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw> |
442 | with Encode. See L<Encode::CN -- Continental China> for details. |
5d030b67 |
443 | |
a63c962f |
444 | EUC-CN |
a999c27c |
445 | KOI8-U (http://www.faqs.org/rfcs/rfc2319.html) |
5d030b67 |
446 | |
a999c27c |
447 | have not been registered with IANA (as of March 2002) but |
448 | seem to be supported by major web browsers. |
449 | The IANA name for C<EUC-CN> is C<GB2312>. |
67d7b5ef |
450 | |
451 | KS_C_5601-1987 |
452 | |
a999c27c |
453 | is heavily misused. |
454 | See L<Microsoft-related naming mess> for details. |
455 | |
456 | C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw> |
457 | with Encode. See L<Encode::KR -- Korea> for details. |
5d030b67 |
458 | |
a63c962f |
459 | UTF-16 |
5d030b67 |
460 | |
a999c27c |
461 | =for comment |
462 | waiting for comments from Jungshik Shin to soften this - Anton |
463 | |
464 | is an IANA-registered preferred MIME name |
a63c962f |
465 | but probably should be avoided as encoding for web pages due to |
a999c27c |
466 | the lack of browser support. |
5d030b67 |
467 | |
5d030b67 |
468 | ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html) |
469 | GBK |
470 | VISCII |
a63c962f |
471 | GB 12345 |
472 | GB 18030 (*) (see links below) |
473 | EUC-TW (*) |
5d030b67 |
474 | |
475 | are totally valid encodings but not registered at IANA. |
a63c962f |
476 | The names under which they are listed here are probably the |
477 | most widely-known names for these encodings and are recommended |
478 | names. |
479 | |
67d7b5ef |
480 | BIG5PLUS (*) |
a63c962f |
481 | |
67d7b5ef |
482 | is a somewhat proprietary name. |
5d030b67 |
483 | |
a999c27c |
484 | =head2 Microsoft-related naming mess |
485 | |
486 | Microsoft products misuse the following names: |
5d030b67 |
487 | |
67d7b5ef |
488 | =over 2 |
a63c962f |
489 | |
a999c27c |
490 | =item KS_C_5601-1987 |
5d030b67 |
491 | |
a999c27c |
492 | Microsoft extension to C<EUC-KR>. |
5d030b67 |
493 | |
a999c27c |
494 | Proper name: C<CP949>. |
67d7b5ef |
495 | |
a999c27c |
496 | See |
497 | http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html |
498 | for details. |
5d030b67 |
499 | |
a999c27c |
500 | Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect |
501 | this common misuse. |
502 | I<Raw> C<KS_C_5601-1987> encoding is available as C<ksc5601-raw>. |
5d030b67 |
503 | |
a999c27c |
504 | See L<Encode::KR -- Korea> for details. |
67d7b5ef |
505 | |
a999c27c |
506 | =item GB2312 |
67d7b5ef |
507 | |
a999c27c |
508 | Microsoft extension to C<EUC-CN>. |
a63c962f |
509 | |
a999c27c |
510 | Proper names: C<CP936>, C<GBK>. |
a63c962f |
511 | |
a999c27c |
512 | C<GB2312> has been registered in the C<EUC-CN> meaning at |
513 | IANA. This has partially repaired the situation: Microsoft's |
514 | C<GB2312> has become a superset of the official C<GB2312>. |
67d7b5ef |
515 | |
a999c27c |
516 | Encode aliases C<GB2312> to C<euc-cn> in full agreement with |
517 | IANA registration. C<cp936> is supported separately. |
518 | I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>. |
519 | |
520 | See L<Encode::CN -- Continental China> for details. |
521 | |
522 | =item Big5 |
523 | |
524 | Microsoft extension to C<Big5>. |
525 | |
526 | Proper name: C<CP950>. |
527 | |
528 | Encode separately supports C<Big5> and C<cp950>. |
529 | |
530 | =item Shift_JIS |
531 | |
532 | Microsoft's understanding of C<Shift_JIS>. |
533 | |
534 | JIS has not endorsed the full Microsoft standard, however. |
535 | The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208 |
536 | subsets, while Microsoft has always used C<Shift_JIS> to mean an |
537 | encoding of a wider character repertoire. |
538 | |
539 | As a historical predecessor, Microsoft's variant |
540 | probably has more right to the name, though it may be objected |
541 | that Microsoft shouldn't have used JIS as part of the name |
542 | in the first place. |
543 | |
544 | Unambiguous name: C<CP932>. |
545 | |
546 | Encode separately supports C<Shift_JIS> and C<cp932>. |
547 | |
548 | =back |
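To see how the names discussed above are treated on a given installation, a small sketch along these lines may help. It assumes the C<resolve_alias> function of Encode; the list of names is only an illustration, and the result for each name depends on the installed version.

```perl
use strict;
use warnings;
use Encode qw(resolve_alias);

# Names discussed in this section, as they might appear in a MIME header.
my @names = ('KS_C_5601-1987', 'GB2312', 'Big5', 'Shift_JIS', 'cp932');

for my $name (@names) {
    my $canon = resolve_alias($name);    # canonical name, or false if unknown
    printf "%-16s => %s\n", $name, $canon ? $canon : '(unknown)';
}
```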
549 | |
550 | =head1 Glossary |
551 | |
552 | =over 2 |
553 | |
554 | =item character repertoire |
555 | |
556 | A collection of unique characters. A I<character> set in the |
557 | strictest sense. At this stage characters are not numbered. |
558 | |
559 | =item coded character set (CCS) |
560 | |
561 | A character set that is mapped in a way computers can use directly. |
562 | Many character encodings, including EUC, fall into this category. |
563 | |
564 | =item character encoding scheme (CES) |
565 | |
566 | An algorithm to map a character set to a byte sequence. You don't |
567 | have to be able to tell which character set a given byte sequence |
568 | belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an |
569 | example of being both a CCS and a CES. |
570 | |
571 | =item EUC |
572 | |
573 | Extended Unix Code. See ISO-2022. |
574 | |
575 | =item ISO-2022 |
576 | |
577 | A CES that was carefully designed to coexist with ASCII. There are a |
578 | 7-bit version and an 8-bit version. The 8-bit version can form a CCS. |
579 | EUC and ISO-8859 are two examples thereof. |
580 | |
581 | =item UCS |
582 | |
583 | Short for I<Universal Character Set>. When you say just UCS, it means |
584 | I<Unicode>. |
585 | |
586 | =item UCS-2 |
587 | |
588 | ISO/IEC 10646 encoding form: Universal Character Set coded in two |
589 | octets. |
590 | |
591 | =item Unicode |
592 | |
593 | A character set that aims to include all character |
594 | repertoires of the world. Many character sets in various national as |
595 | well as industrial standards are therefore subsets thereof. |
596 | |
597 | =item UTF |
598 | |
599 | Short for I<Unicode Transformation Format>. Determines how to map a |
600 | Unicode character into a byte sequence. |
601 | |
602 | =item UTF-16 |
603 | |
604 | A UTF in 16-bit encoding. Can be either big endian or little |
605 | endian. The big endian version is called UTF-16BE and the little |
606 | endian version is UTF-16LE. |
67d7b5ef |
607 | |
608 | =back |
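The difference between the two UTF-16 byte orders from the glossary can be seen in a minimal sketch. It assumes that both the C<UTF-16BE> and C<UTF-16LE> names are available in the installed Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{3042}";    # HIRAGANA LETTER A, a single 16-bit code unit

my $be = encode('UTF-16BE', $char);    # most significant byte first
my $le = encode('UTF-16LE', $char);    # least significant byte first

printf "UTF-16BE: %s\n", unpack('H*', $be);    # prints "UTF-16BE: 3042"
printf "UTF-16LE: %s\n", unpack('H*', $le);    # prints "UTF-16LE: 4230"
```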
5d030b67 |
609 | |
610 | =head1 See Also |
611 | |
5129552c |
612 | L<Encode>, |
613 | L<Encode::Byte>, |
a63c962f |
614 | L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>, |
5129552c |
615 | L<Encode::EBCDIC>, L<Encode::Symbol> |
5d030b67 |
616 | |
a999c27c |
617 | =head1 References |
618 | |
619 | =over 2 |
620 | |
621 | =item ECMA |
622 | |
623 | European Computer Manufacturers Association |
624 | L<http://www.ecma.ch> |
625 | |
626 | =over 2 |
627 | |
628 | =item ECMA-035 (eq C<ISO-2022>) |
629 | |
630 | L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> |
631 | |
632 | The very specification of ISO-2022 is available from the link above. |
633 | |
634 | =back |
635 | |
636 | =item IANA |
637 | |
638 | Internet Assigned Numbers Authority |
639 | L<http://www.iana.org/> |
640 | |
641 | =over 2 |
642 | |
643 | =item Assigned Charset Names by IANA |
644 | |
645 | L<http://www.iana.org/assignments/character-sets> |
646 | |
647 | Most of the C<canonical names> in Encode derive from this list |
648 | so you can directly apply the string you have extracted from the MIME |
649 | headers of mail and web pages. |
650 | |
651 | =back |
652 | |
653 | =item ISO |
654 | |
655 | International Organization for Standardization |
656 | L<http://www.iso.ch/> |
657 | |
658 | =item RFC |
659 | |
660 | Request For Comments -- need I say more? |
661 | L<http://www.rfc.net/> |
662 | |
663 | =item UC |
664 | |
665 | Unicode Consortium |
666 | L<http://www.unicode.org/> |
667 | |
668 | =over 2 |
669 | |
670 | =item Unicode Glossary |
671 | |
672 | L<http://www.unicode.org/glossary/> |
673 | |
674 | The glossary of this document is based upon this site. |
675 | |
676 | =back |
677 | |
678 | =back |
679 | |
680 | =head2 Other Notable Sites |
681 | |
682 | =over 2 |
683 | |
684 | =item czyborra.com |
685 | |
686 | L<http://czyborra.com/> |
687 | |
688 | Contains a lot of useful information, especially gory details of ISO |
689 | vs. vendor mappings. |
690 | |
691 | =item CJK.inf |
692 | |
693 | L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html> |
694 | |
695 | Somewhat obsolete (last update in 1996), but still useful. Also try |
696 | |
697 | L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf> |
698 | |
699 | You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>. |
700 | |
701 | =back |
702 | |
5d030b67 |
703 | =cut |
67d7b5ef |
704 | |
705 | I could not find this page because the hostname doesn't resolve! |
706 | |
707 | Brief description for most of the mentioned CJK encodings |
708 | L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html> |