Commit | Line | Data |
5d030b67 |
1 | =head1 NAME |
2 | |
3 | Encode::Supported -- Encodings supported by Encode |
4 | |
5 | =head1 DESCRIPTION |
6 | |
5129552c |
7 | =head2 Encoding Names |
5d030b67 |
8 | |
9 | Encoding names are case insensitive. White space in names |
10 | is ignored. In addition, an encoding may have aliases. |
11 | Each encoding has one "canonical" name. The "canonical" |
12 | name is chosen from the names of the encoding by picking |
13 | the first in the following sequence: |
14 | |
15 | o The MIME name as defined in IETF RFCs. |
16 | o The name in the IANA registry. |
17 | o The name used by the organization that defined it. |
18 | |
5129552c |
19 | Because of all the alias issues, and because in the general case |
20 | encodings have state, "Encode" uses the encoding object internally |
21 | once an operation is in progress. |
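As a rough illustration of the naming rules above, the following sketch asks C<Encode> for the encoding object behind a few spellings and prints the canonical name each object reports. The spellings in C<@spellings> are arbitrary examples chosen for this sketch; a spelling that cannot be resolved is reported as such rather than assumed to work.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Spellings to try; these particular strings are illustrative only.
my @spellings = ('latin1', 'ISO-8859-1', 'iso 8859 1', 'Latin1');

my %canonical;
for my $spelling (@spellings) {
    # find_encoding() returns an encoding object, or undef if the
    # name cannot be resolved.
    my $enc = find_encoding($spelling);
    $canonical{$spelling} = $enc ? $enc->name : undef;
    printf "%-12s => %s\n", $spelling,
        defined $canonical{$spelling} ? $canonical{$spelling} : '(not found)';
}
```

Holding on to the object returned by C<find_encoding> avoids resolving the alias again on every call.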
5d030b67 |
22 | |
5129552c |
23 | =head1 Supported Encodings |
5d030b67 |
24 | |
25 | As of Perl 5.8.0, at least the following encodings are recognized. |
26 | Note that unless otherwise specified, they are all case insensitive |
a63c962f |
27 | (via alias) and all occurrences of spaces are replaced with '-'. In |
5d030b67 |
28 | other words, "ISO 8859 1" and "iso-8859-1" are identical. |
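The list actually available depends on the installed version of C<Encode>. A minimal sketch for checking it on a given system, assuming only the C<encodings> class method, might look like this; the C<%is_available> lookup hash is a helper introduced for the example.

```perl
use strict;
use warnings;
use Encode ();

# ":all" asks for every encoding Encode knows how to provide,
# including those whose modules have not been loaded yet.
my @available    = Encode->encodings(':all');
my %is_available = map { $_ => 1 } @available;

print scalar(@available), " encodings available\n";
print "$_\n" for @available;
```

The names returned are canonical names, so aliases such as C<latin1> do not appear in the list.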
29 | |
5129552c |
30 | Encodings are categorized and implemented in several different modules, |
31 | but in most cases you don't have to C<use Encode::XX> to make them |
32 | available. Encode.pm automatically loads those modules on demand. |
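The on-demand loading can be observed with a short sketch such as the one below. It uses an encoding from C<Encode::Byte> without loading that module explicitly, then inspects C<%INC> to see whether the module was pulled in. The sample character is an arbitrary choice for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);    # note: no "use Encode::Byte"

# U+0141 LATIN CAPITAL LETTER L WITH STROKE, which iso-8859-2 can represent.
my $text       = "\x{0141}";
my $bytes      = encode('iso-8859-2', $text);
my $round_trip = decode('iso-8859-2', $bytes);

printf "encoded to %d byte(s): 0x%02X\n", length($bytes), ord($bytes);
print "round trip ", ($round_trip eq $text ? "ok" : "FAILED"), "\n";
print "Encode::Byte loaded: ",
    (exists $INC{'Encode/Byte.pm'} ? "yes" : "no"), "\n";
```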
5d030b67 |
33 | |
5129552c |
34 | =head2 Built-in Encodings |
5d030b67 |
35 | |
5129552c |
36 | The following encodings are always available. |
5d030b67 |
37 | |
5129552c |
38 | Canonical Aliases |
39 | ----------------------- |
40 | iso-8859-1 latin1 |
41 | US-ascii ascii |
42 | UCS-2 ucs2, iso-10646-1 |
43 | UCS-2le |
44 | UTF-8 utf8 |
45 | ----------------------- |
5d030b67 |
46 | |
5129552c |
47 | =head2 Encode::Byte |
5d030b67 |
48 | |
5129552c |
49 | The following encodings are single-byte encodings implemented as |
50 | extensions of ASCII. In most cases they use \x80-\xff (the upper half) |
51 | to map non-ASCII characters. |
5d030b67 |
52 | |
5129552c |
53 | ----------------------- |
a63c962f |
54 | (iso-8859-1 is built in) |
5d030b67 |
55 | iso-8859-2 latin2 |
56 | iso-8859-3 latin3 |
57 | iso-8859-4 latin4 |
a63c962f |
58 | iso-8859-5 |
59 | iso-8859-6 |
5d030b67 |
60 | iso-8859-7 |
61 | iso-8859-8 |
62 | iso-8859-9 latin5 |
63 | iso-8859-10 latin6 |
64 | iso-8859-11 |
65 | (iso-8859-12 is nonexistent) |
66 | iso-8859-13 latin7 |
67 | iso-8859-14 latin8 |
68 | iso-8859-15 latin9 |
69 | iso-8859-16 latin10 |
70 | |
71 | koi8-f |
72 | koi8-r |
73 | koi8-u |
74 | |
75 | viscii # ASCII + Vietnamese |
76 | |
77 | cp1250 WinLatin2 |
78 | cp1251 WinCyrillic |
79 | cp1252 WinLatin1 |
80 | cp1253 WinGreek |
81 | cp1254 WinTurkish |
82 | cp1255 WinHebrew |
83 | cp1256 WinArabic |
84 | cp1257 WinBaltic |
85 | cp1258 WinVietnamese |
86 | # all cp* are also available as ibm-* and ms-* |
87 | |
88 | maccentraleuropean |
89 | maccroatian |
90 | macroman |
91 | maccyrillic |
92 | macromanian |
93 | macdingbats |
94 | macsami |
95 | macgreek |
96 | macthai |
97 | macicelandic |
98 | macturkish |
99 | macukraine |
5129552c |
100 | ----------------------- |
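To make the "upper half" idea concrete, the sketch below encodes one ASCII letter and one Cyrillic letter in three of the encodings listed above. It is only an illustration under the assumption that these three encodings are installed; the choice of characters and the C<%byte_for> hash are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $ascii    = 'A';           # U+0041, part of ASCII
my $cyrillic = "\x{0410}";    # U+0410 CYRILLIC CAPITAL LETTER A

my %byte_for;
for my $name (qw(iso-8859-5 koi8-r cp1251)) {
    my $plain = encode($name, $ascii);
    my $upper = encode($name, $cyrillic);
    $byte_for{$name} = { ascii => $plain, cyrillic => $upper };
    printf "%-10s  A => 0x%02X   U+0410 => 0x%02X\n",
        $name, ord($plain), ord($upper);
}
```

The ASCII letter keeps the same byte value in all three encodings, while the Cyrillic letter lands on a different byte in the \x80-\xff range in each one.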
5d030b67 |
101 | |
5129552c |
102 | =head2 The CJK: Chinese, Japanese, Korean (Multibyte) |
5d030b67 |
103 | |
104 | Note that Vietnamese is listed above. Also read "Encoding vs. Charset" |
a63c962f |
105 | below. Also note that these are implemented in distinct modules by |
106 | language, due to size concerns. Please also refer to their |
107 | respective documentation pages. |
5d030b67 |
108 | |
5129552c |
109 | =over 4 |
110 | |
111 | =item Encode::CN -- Mainland China |
112 | |
113 | ----------------------- |
114 | cp936 gbk |
115 | euc-cn |
116 | gb12345 |
117 | gb2312 |
118 | hz |
119 | iso-ir-165 |
120 | ----------------------- |
121 | |
122 | =item Encode::JP -- Japan |
123 | |
124 | ----------------------- |
125 | 7bit-jis jis |
126 | cp932 |
127 | euc-jp ujis |
128 | iso-2022-jp |
a63c962f |
129 | iso-2022-jp-1 |
5129552c |
130 | macjapan |
131 | shiftjis Shift_JIS, sjis |
132 | ----------------------- |
133 | |
134 | =item Encode::KR -- Korea |
135 | |
136 | ----------------------- |
137 | euc-kr |
138 | ksc5601 |
139 | cp949 |
140 | ----------------------- |
141 | |
142 | =item Encode::TW -- Taiwan |
143 | |
144 | ----------------------- |
145 | big5 |
146 | big5-hkscs |
147 | cp950 |
148 | ----------------------- |
149 | |
150 | =item Encode::HanExtra -- More Chinese via CPAN |
151 | |
152 | Due to size concerns, the additional Chinese encodings below are |
153 | distributed separately on CPAN, under the name Encode::HanExtra. |
154 | |
155 | ----------------------- |
156 | gb18030 |
157 | euc-tw |
158 | big5plus |
159 | ----------------------- |
160 | |
161 | =back |
162 | |
163 | =head2 Miscellaneous encodings |
164 | |
165 | =over 4 |
166 | |
167 | =item Encode::EBCDIC |
5d030b67 |
168 | |
169 | See L<perlebcdic> for details. |
170 | |
5129552c |
171 | ----------------------- |
5d030b67 |
172 | cp1047 |
173 | cp37 |
174 | posix-bc |
5129552c |
175 | ----------------------- |
176 | |
a63c962f |
177 | =item Encode::Symbol |
5d030b67 |
178 | |
5129552c |
179 | For symbols and dingbats. |
5d030b67 |
180 | |
5129552c |
181 | ----------------------- |
5d030b67 |
182 | symbol |
183 | dingbats |
5129552c |
184 | ----------------------- |
185 | |
186 | =back |
5d030b67 |
187 | |
188 | =head1 Encoding vs. Charset |
189 | |
190 | Character encoding (or just "encoding") and Character Set (or just |
191 | "charset") are often used interchangeably but they are different |
192 | concepts. |
193 | |
194 | A charset determines which characters may be included in a given text. |
195 | |
196 | An encoding maps the characters of one or more charsets to a stream of bits. |
197 | |
a63c962f |
198 | Note that a given encoding may contain multiple charsets; complex CJK |
199 | encodings are usually implemented that way. |
200 | |
201 | For instance, euc-jp contains ASCII, JIS X 0201-1978 (Hankaku Kana), |
202 | JIS X 0208-1997 (Zenkaku Kana and Kanji) and JIS X 0212-1990 (Extended |
203 | Kanji) in a single encoding. |
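The way one encoding carries several charsets can be seen in the byte sequences it produces. The following sketch, which assumes C<Encode::JP> is installed, encodes one sample character from three of the charsets named above; the sample characters and the C<%hex_for> hash are illustrative choices.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One sample character from each of three charsets carried by euc-jp.
my @samples = (
    [ 'ASCII',                   'A'        ],
    [ 'JIS X 0201 Hankaku Kana', "\x{FF71}" ],  # HALFWIDTH KATAKANA LETTER A
    [ 'JIS X 0208 Zenkaku Kana', "\x{3042}" ],  # HIRAGANA LETTER A
);

my %hex_for;
for my $sample (@samples) {
    my ($charset, $char) = @$sample;
    my $bytes = encode('euc-jp', $char);
    # Render the encoded bytes as space-separated hex for display.
    $hex_for{$charset} = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    printf "%-24s => %s\n", $charset, $hex_for{$charset};
}
```

The ASCII character stays a single byte, while characters from the other charsets are written as multi-byte sequences, with a distinct lead byte marking the Hankaku Kana charset.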
5d030b67 |
204 | |
205 | As the name suggests, the Encode module supports encodings, not |
206 | individual charsets. |
207 | |
a63c962f |
208 | =head1 Encoding Classification (by Anton Tagunov and Dan Kogai) |
209 | |
210 | This section tries to classify the supported encodings by their |
211 | applicability for information exchange over the Internet and to |
212 | choose the most suitable aliases to name them in the context of |
213 | such communication. |
214 | |
215 | Encoding names |
5d030b67 |
216 | |
a63c962f |
217 | US-ASCII UTF-8 |
218 | ISO-8859-* KOI8-R |
219 | Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1 |
220 | EUC-KR |
221 | Big5 |
5d030b67 |
222 | |
a63c962f |
223 | are registered with IANA (L<http://www.iana.org/assignments/character-sets>) |
224 | as preferred MIME names and may probably be used over the Internet. |
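Recent releases of C<Encode> can report a MIME name for an encoding object through a C<mime_name> method. That method postdates the Perl 5.8.0 baseline this document describes, so the sketch below treats its presence as an assumption and checks for it before calling it. The three encoding names are arbitrary examples.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my %mime_for;
for my $name (qw(shiftjis euc-jp iso-8859-1)) {
    my $enc = find_encoding($name) or next;
    # mime_name() is assumed to exist here; older Encode releases lack it.
    next unless $enc->can('mime_name');
    $mime_for{$name} = $enc->mime_name;
    printf "%-12s => %s\n", $name,
        defined $mime_for{$name} ? $mime_for{$name} : '(no MIME name)';
}
```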
5d030b67 |
225 | |
a63c962f |
226 | C<Shift_JIS> is no longer Microsoft proprietary since it has been |
227 | standardized by JIS X 0208-1997. It is probably the most widespread |
228 | encoding for Japanese on the Internet. |
5d030b67 |
229 | |
a63c962f |
230 | EUC-CN |
5d030b67 |
231 | |
a63c962f |
232 | has not been registered with IANA under this name (as of March 2002) but |
233 | seems to be supported by major web browsers. (IANA has registered |
234 | this encoding as C<GB2312>, but C<gb2312> currently has a different |
235 | meaning to the C<Encode> module. It will probably become an alias for |
236 | C<EUC-CN> in the future; until then it is safer to avoid using |
237 | C<gb2312> as an encoding name within Perl.) |
5d030b67 |
238 | |
a63c962f |
239 | UTF-16 |
240 | KOI8-U (http://www.faqs.org/rfcs/rfc2319.html) |
5d030b67 |
241 | |
a63c962f |
242 | are IANA-registered (C<UTF-16> even as a preferred MIME name) |
243 | but probably should be avoided as encodings for web pages due to |
244 | lack of browser support. |
5d030b67 |
245 | |
5d030b67 |
246 | ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html) |
247 | GBK |
248 | VISCII |
a63c962f |
249 | GB 12345 |
250 | GB 18030 (*) (see links below) |
251 | EUC-TW (*) |
5d030b67 |
252 | |
253 | are valid encodings but are not registered with IANA. |
a63c962f |
254 | The names under which they are listed here are probably the |
255 | most widely-known names for these encodings and are recommended |
256 | names. |
257 | |
258 | |
259 | =for comment this used to be listed as supported but |
5d030b67 |
260 | |
a63c962f |
261 | do not work @15457 when it's clear they will be uncommented |
262 | or deleted - Anton |
263 | ISO-2022 (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM) |
264 | CNS 11643 (only planes 1 and 2 available) |
5d030b67 |
265 | |
a63c962f |
266 | BIG5PLUS (*) |
267 | |
268 | is a somewhat proprietary name. C<(*)>-marked encodings belong to |
269 | C<Encode::HanExtra> available from CPAN. |
5d030b67 |
270 | |
271 | You can probably find some information on CJK encodings at: |
272 | |
273 | brief description for most of the mentioned CJK encodings |
a63c962f |
274 | L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html> |
5d030b67 |
275 | |
276 | several years old, but still useful |
a63c962f |
277 | L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html> |
5d030b67 |
278 | |
279 | and some in-depth reading for the heroes :-) |
a63c962f |
280 | L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (eq C<ISO-2022>) |
281 | |
282 | gives brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030> |
283 | F<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf> |
284 | |
285 | The nature of information in this section is most fragile and |
286 | error-prone; I<probably> is the most popular adverb :) |
287 | Please feel free to send your comments, disagreements and |
288 | additions to L<...>. (Note however, |
289 | that the mission of this document is to cover the |
290 | C<Encode>-supported encodings only.) |
5d030b67 |
291 | |
292 | =head1 See Also |
293 | |
5129552c |
294 | L<Encode>, |
295 | L<Encode::Byte>, |
a63c962f |
296 | L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>, |
5129552c |
297 | L<Encode::EBCDIC>, L<Encode::Symbol> |
5d030b67 |
298 | |
299 | =cut |