Commit | Line | Data |
5d030b67 |
1 | =head1 NAME |
2 | |
64ffdd5e |
3 | Encode::Supports -- Supported encodings by Encode |
5d030b67 |
4 | |
5 | =head1 DESCRIPTION |
6 | |
5129552c |
7 | =head2 Encoding Names |
5d030b67 |
8 | |
9 | Encoding names are case insensitive. White space in names |
10 | is ignored. In addition an encoding may have aliases. |
11 | Each encoding has one "canonical" name. The "canonical" |
12 | name is chosen from the names of the encoding by picking |
13 | the first in the following sequence: |
14 | |
15 | o The MIME name as defined in IETF RFCs. |
16 | o The name in the IANA registry. |
17 | o The name used by the organization that defined it. |
18 | |
5129552c |
19 | Because of all the alias issues, and because in the general case |
20 | encodings have state, "Encode" uses the encoding object internally |
21 | once an operation is in progress. |
5d030b67 |
22 | |
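The lookup rules above can be tried out with a short sketch. It assumes a reasonably recent C<Encode> that provides C<Encode::find_encoding> and the C<name> method on encoding objects; C<canonical_name> is a hypothetical helper introduced only for this illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the canonical name for any known name or
# alias, or undef when Encode does not recognize it.
sub canonical_name {
    my ($name) = @_;
    my $enc = Encode::find_encoding($name);    # encoding object, or undef
    return defined $enc ? $enc->name : undef;
}

# The same encoding reached through an alias and through different case.
for my $name (qw(latin1 LATIN1 ISO-8859-1)) {
    my $canon = canonical_name($name);
    printf "%-12s => %s\n", $name, defined $canon ? $canon : '(unknown)';
}
```

Working with the object returned by C<find_encoding>, rather than repeating the name lookup, mirrors what the text says C<Encode> does internally once an operation is in progress.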
5129552c |
23 | =head1 Supported Encodings |
5d030b67 |
24 | |
25 | As of Perl 5.8.0, at least the following encodings are recognized. |
26 | Note that unless otherwise specified, they are all case insensitive |
a63c962f |
27 | (via alias) and all occurrences of spaces are replaced with '-'. In |
5d030b67 |
28 | other words, "ISO 8859 1" and "iso-8859-1" are identical. |
29 | |
5129552c |
30 | Encodings are categorized and implemented in several different modules, |
31 | but in most cases you don't have to C<use Encode::XX> to make them |
32 | available. Encode.pm automatically loads those modules as needed. |
5d030b67 |
33 | |
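A minimal sketch of the on-demand loading described above, assuming a current C<Encode> in which C<< Encode->encodings() >> lists the encodings already loaded and C<< Encode->encodings(":all") >> lists every encoding it knows how to load. C<is_listed> is a hypothetical helper, not part of C<Encode>.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $name appears in the list of canonical
# names that Encode reports. With no extra argument only loaded encodings
# are listed; with ':all' every loadable encoding is listed.
sub is_listed {
    my ($name, @scope) = @_;
    return scalar grep { $_ eq $name } Encode->encodings(@scope);
}

# No "use Encode::JP" anywhere in this script.
printf "known to Encode:  %s\n", is_listed('euc-jp', ':all') ? 'yes' : 'no';

my $enc = Encode::find_encoding('euc-jp');    # triggers on-demand loading

printf "loaded after use: %s\n", is_listed('euc-jp') ? 'yes' : 'no';
```

An explicit C<use Encode::JP> is still harmless; it only makes the dependency visible at the top of the script.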
5129552c |
34 | =head2 Built-in Encodings |
5d030b67 |
35 | |
5129552c |
36 | The following encodings are always available. |
5d030b67 |
37 | |
5129552c |
38 | Canonical Aliases |
39 | ----------------------- |
40 | iso-8859-1 latin1 |
41 | US-ascii ascii |
42 | UCS-2 ucs2, iso-10646-1 |
43 | UCS-2le |
44 | UTF-8 utf8 |
45 | ----------------------- |
5d030b67 |
46 | |
5129552c |
47 | =head2 Encode::Byte |
5d030b67 |
48 | |
5129552c |
49 | The following encodings are single-byte encodings implemented as |
50 | extensions of ASCII. In most cases they use \x80-\xff (upper half) to map |
51 | non-ASCII characters. |
5d030b67 |
52 | |
5129552c |
53 | ----------------------- |
a63c962f |
54 | (iso-8859-1 is in built-in) |
5d030b67 |
55 | iso-8859-2 latin2 |
56 | iso-8859-3 latin3 |
57 | iso-8859-4 latin4 |
a63c962f |
58 | iso-8859-5 |
59 | iso-8859-6 |
5d030b67 |
60 | iso-8859-7 |
61 | iso-8859-8 |
62 | iso-8859-9 latin5 |
63 | iso-8859-10 latin6 |
64 | iso-8859-11 |
65 | (iso-8859-12 is nonexistent) |
66 | iso-8859-13 latin7 |
67 | iso-8859-14 latin8 |
68 | iso-8859-15 latin9 |
69 | iso-8859-16 latin10 |
70 | |
71 | koi8-f |
72 | koi8-r |
73 | koi8-u |
74 | |
75 | viscii # ASCII + vietnamese |
76 | |
77 | cp1250 WinLatin2 |
78 | cp1251 WinCyrillic |
79 | cp1252 WinLatin1 |
80 | cp1253 WinGreek |
81 | cp1254 WinTurkish |
82 | cp1255 WinHebrew |
83 | cp1256 WinArabic |
84 | cp1257 WinBaltic |
85 | cp1258 WinVietnamese |
86 | # all cp* are also available as ibm-* and ms-* |
87 | |
88 | maccentraleuropean |
89 | maccroatian |
90 | macroman |
91 | maccyrillic |
92 | macromanian |
5d030b67 |
93 | macsami |
94 | macgreek |
95 | macthai |
96 | macicelandic |
97 | macturkish |
98 | macukraine |
64ffdd5e |
99 | |
100 | nextstep |
101 | gsm0338 # used in GSM handsets |
102 | roman8 # HP Roman-8 (Hewlett-Packard) |
5129552c |
103 | ----------------------- |
5d030b67 |
104 | |
5129552c |
105 | =head2 The CJK: Chinese, Japanese, Korean (Multibyte) |
5d030b67 |
106 | |
107 | Note that Vietnamese is listed above. Also read "Encoding vs. Charset" |
a63c962f |
108 | below. Also note that these are implemented in distinct modules by |
109 | language, due to size concerns. Please also refer to their |
110 | respective documentation pages. |
5d030b67 |
111 | |
5129552c |
112 | =over 4 |
113 | |
114 | =item Encode::CN -- Continental China |
115 | |
116 | ----------------------- |
117 | cp936 gbk |
118 | euc-cn |
119 | gb12345 |
120 | gb2312 |
121 | hz |
122 | iso-ir-165 |
123 | ----------------------- |
124 | |
125 | =item Encode::JP -- Japan |
126 | |
127 | ----------------------- |
128 | 7bit-jis jis |
129 | cp932 |
130 | euc-jp ujis |
131 | iso-2022-jp |
a63c962f |
132 | iso-2022-jp-1 |
5129552c |
133 | macjapan |
134 | shiftjis Shift_JIS, sjis |
135 | ----------------------- |
136 | |
137 | =item Encode::KR -- Korea |
138 | |
139 | ----------------------- |
140 | euc-kr |
141 | ksc5601 |
142 | cp949 |
143 | ----------------------- |
144 | |
145 | =item Encode::TW -- Taiwan |
146 | |
147 | ----------------------- |
148 | big5 |
149 | big5-hkscs |
150 | cp950 |
151 | ----------------------- |
152 | |
153 | =item Encode::HanExtra -- More Chinese via CPAN |
154 | |
155 | Due to size concerns, additional Chinese encodings below are |
156 | distributed separately on CPAN, under the name Encode::HanExtra. |
157 | |
158 | ----------------------- |
159 | gb18030 |
160 | euc-tw |
161 | big5plus |
162 | ----------------------- |
163 | |
164 | =back |
165 | |
166 | =head2 Miscellaneous encodings |
167 | |
168 | =over 4 |
169 | |
170 | =item Encode::EBCDIC |
5d030b67 |
171 | |
172 | See perlebcdic for details. |
173 | |
5129552c |
174 | ----------------------- |
5d030b67 |
175 | cp1047 |
176 | cp37 |
177 | posix-bc |
5129552c |
178 | ----------------------- |
179 | |
a63c962f |
180 | =item Encode::Symbol |
5d030b67 |
181 | |
5129552c |
182 | For symbols and dingbats. |
5d030b67 |
183 | |
5129552c |
184 | ----------------------- |
5d030b67 |
185 | symbol |
186 | dingbats |
64ffdd5e |
187 | macdingbats |
5129552c |
188 | ----------------------- |
189 | |
190 | =back |
5d030b67 |
191 | |
192 | =head1 Encoding vs. Charset |
193 | |
194 | Character encoding (or just "encoding") and Character Set (or just |
195 | "charset") are often used interchangeably but they are different |
196 | concepts. |
197 | |
198 | A charset determines which characters can be included in a given text. |
199 | |
200 | An encoding maps the characters of one or more charsets to a stream of bits. |
201 | |
a63c962f |
202 | Note that a given encoding may contain multiple charsets; complex CJK |
203 | encodings are usually implemented that way. |
204 | |
205 | For instance, euc-jp contains ASCII, JIS X 0201-1978 (Hankaku Kana), |
206 | JIS X 0208-1997 (Zenkaku Kana and Kanji) and JIS X 0212-1990 (Extended |
207 | Kanji) in a single encoding. |
5d030b67 |
208 | |
209 | As the name suggests, the Encode module supports encodings, not |
210 | individual charsets. |
211 | |
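To make the distinction concrete, here is a small sketch that encodes one ASCII character and one JIS X 0208 character with the single encoding C<euc-jp>. It assumes C<Encode::JP> is installed alongside C<Encode>; C<hex_bytes> is a hypothetical helper used only to display the result.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: encode a character string and return the resulting
# bytes as space-separated hex, to make the byte layout visible.
sub hex_bytes {
    my ($encoding, $text) = @_;
    my $bytes = Encode::encode($encoding, $text);
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# "A" comes from the ASCII charset and stays a single byte.
# U+3042 (HIRAGANA LETTER A) comes from JIS X 0208 and takes two bytes.
print hex_bytes('euc-jp', "A\x{3042}"), "\n";
```

Both characters go through the same C<encode> call with the same encoding name; the caller never selects a charset.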
a63c962f |
212 | =head1 Encoding Classification (by Anton Tagunov and Dan Kogai) |
213 | |
214 | This section tries to classify the supported encodings by their |
215 | applicability for information exchange over the Internet and to |
216 | choose the most suitable aliases to name them in the context of |
217 | such communication. |
218 | |
219 | Encoding names |
5d030b67 |
220 | |
a63c962f |
221 | US-ASCII UTF-8 |
222 | ISO-8859-* KOI8-R |
223 | Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1 |
224 | EUC-KR |
225 | Big5 |
5d030b67 |
226 | |
a63c962f |
227 | are L<http://www.iana.org/assignments/character-sets>-registered as |
228 | preferred MIME names and can probably be used over the Internet. |
5d030b67 |
229 | |
a63c962f |
230 | C<Shift_JIS> is no longer Microsoft proprietary since it has been |
231 | standardized by JIS X 0208-1997. It is probably the most |
232 | widespread encoding for Japanese on the Internet. |
5d030b67 |
233 | |
a63c962f |
234 | EUC-CN |
5d030b67 |
235 | |
a63c962f |
236 | has not been registered with IANA (as of March 2002) but |
237 | seems to be supported by major web browsers. (IANA has registered |
238 | this encoding as C<GB2312>, but C<gb2312> currently has a different |
239 | meaning to the C<Encode> module. It will probably become an alias to |
240 | C<EUC-CN> in the future; until then it is safer to avoid using |
241 | C<gb2312> as an encoding name within Perl.) |
5d030b67 |
242 | |
a63c962f |
243 | UTF-16 |
244 | KOI8-U (http://www.faqs.org/rfcs/rfc2319.html) |
5d030b67 |
245 | |
a63c962f |
246 | are IANA-registered (C<UTF-16> even as a preferred MIME name) |
247 | but should probably be avoided as encodings for web pages due to |
248 | lack of browser support. |
5d030b67 |
249 | |
5d030b67 |
250 | ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html) |
251 | GBK |
252 | VISCII |
a63c962f |
253 | GB 12345 |
254 | GB 18030 (*) (see links below) |
255 | EUC-TW (*) |
5d030b67 |
256 | |
257 | are totally valid encodings but not registered at IANA. |
a63c962f |
258 | The names under which they are listed here are probably the |
259 | most widely-known names for these encodings and are recommended |
260 | names. |
261 | |
262 | |
263 | =for comment this used to be listed as supported but |
5d030b67 |
264 | |
a63c962f |
265 | do not work @15457 when it's clear they will be uncommented |
266 | or deleted - Anton |
267 | ISO-2022 (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM) |
268 | CNS 11643 (only planes 1 and 2 available) |
5d030b67 |
269 | |
a63c962f |
270 | BIG5PLUS (*) |
271 | |
272 | is a somewhat proprietary name. C<(*)>-marked encodings belong to |
273 | C<Encode::HanExtra> available from CPAN. |
5d030b67 |
274 | |
275 | You can probably find some information on CJK encodings at |
276 | |
277 | brief description for most of the mentioned CJK encodings |
a63c962f |
278 | L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html> |
5d030b67 |
279 | |
280 | several years old, but still useful |
a63c962f |
281 | L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html> |
5d030b67 |
282 | |
283 | and some in-depth reading for the heroes :-) |
a63c962f |
284 | L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (eq C<ISO-2022>) |
285 | |
286 | gives brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030> |
287 | F<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf> |
288 | |
289 | The nature of information in this section is most fragile and |
290 | error-prone; I<probably> is the most popular adverb :) |
291 | Please feel free to send your comments, disagreements and |
292 | additions to L<...>. (Note, however, |
293 | that the mission of this document is to cover the |
294 | C<Encode>-supported encodings only.) |
5d030b67 |
295 | |
296 | =head1 See Also |
297 | |
5129552c |
298 | L<Encode>, |
299 | L<Encode::Byte>, |
a63c962f |
300 | L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>, |
5129552c |
301 | L<Encode::EBCDIC>, L<Encode::Symbol> |
5d030b67 |
302 | |
303 | =cut |