Commit | Line | Data |
5d030b67 |
1 | =head1 NAME |
2 | |
3 | Encode::Supported -- Supported encodings by Encode |
4 | |
5 | =head1 DESCRIPTION |
6 | |
5129552c |
7 | =head2 Encoding Names |
5d030b67 |
8 | |
9 | Encoding names are case insensitive. White space in names |
10 | is ignored. In addition an encoding may have aliases. |
11 | Each encoding has one "canonical" name. The "canonical" |
12 | name is chosen from the names of the encoding by picking |
13 | the first in the following sequence: |
14 | |
15 | o The MIME name as defined in IETF RFCs. |
16 | o The name in the IANA registry. |
17 | o The name used by the organization that defined it. |
18 | |
5129552c |
19 | Because of all the alias issues, and because in the general case |
20 | encodings have state, "Encode" uses the encoding object internally |
21 | once an operation is in progress. |
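As a minimal sketch of the alias handling described above, the snippet below asks Encode for the canonical name behind several spellings. It assumes the C<find_encoding> and C<resolve_alias> functions exported on request by current releases of Encode; the spellings chosen and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# Each spelling is looked up independently. find_encoding() returns an
# encoding object (or undef), and ->name yields its canonical name.
my @spellings = ('latin1', 'Latin1', 'ISO-8859-1', 'iso-8859-1');
my %canonical;
for my $spelling (@spellings) {
    my $enc = find_encoding($spelling);
    $canonical{$spelling} = defined $enc ? $enc->name : undef;
    printf "%-12s => %s\n", $spelling, $canonical{$spelling} // '(unknown)';
}

# resolve_alias() returns the canonical name directly, or a false value
# when the name is not recognized.
my $resolved = resolve_alias('latin1');
my $missing  = resolve_alias('no-such-encoding');
print "latin1 resolves to $resolved\n";
print "no-such-encoding is not recognized\n" unless $missing;
```

Holding on to the object returned by C<find_encoding> avoids repeating the alias lookup on every call.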
5d030b67 |
22 | |
5129552c |
23 | =head1 Supported Encodings |
5d030b67 |
24 | |
25 | As of Perl 5.8.0, at least the following encodings are recognized. |
26 | Note that unless otherwise specified, they are all case insensitive |
27 | (via alias) and all occurrences of spaces are replaced with '-'. In |
28 | other words, "ISO 8859 1" and "iso-8859-1" are identical. |
29 | |
5129552c |
30 | Encodings are categorized and implemented in several different modules, |
31 | but in most cases you don't have to C<use Encode::XX> to make them |
32 | available. Encode.pm automatically loads those modules as needed. |
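The following sketch illustrates the on-demand loading: it encodes one character to euc-jp while loading only Encode itself. It assumes the Encode::JP module that ships with Encode is installed; the sample character and variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);   # note: no explicit "use Encode::JP"

# U+3042 is HIRAGANA LETTER A
my $text   = "\x{3042}";
my $octets = encode('euc-jp', $text);

# Show the resulting octets as hexadecimal
my $hex = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
print "euc-jp octets: $hex\n";

# Decoding the octets should give back the original character
my $roundtrip = decode('euc-jp', $octets);
print "round trip ok\n" if $roundtrip eq $text;
```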
5d030b67 |
33 | |
5129552c |
34 | =head2 Built-in Encodings |
5d030b67 |
35 | |
5129552c |
36 | The following encodings are always available. |
5d030b67 |
37 | |
5129552c |
38 | Canonical Aliases |
39 | ----------------------- |
40 | iso-8859-1 latin1 |
41 | US-ascii ascii |
42 | UCS-2 ucs2, iso-10646-1 |
43 | UCS-2le |
44 | UTF-8 utf8 |
45 | ----------------------- |
5d030b67 |
46 | |
5129552c |
47 | =head2 Encode::Byte |
5d030b67 |
48 | |
5129552c |
49 | The following encodings are single-byte encodings implemented as |
50 | extended ASCII. In most cases they use \x80-\xff (upper half) to map |
51 | non-ASCII characters. |
5d030b67 |
52 | |
5129552c |
53 | ----------------------- |
54 | iso-8859-1 latin1 |
5d030b67 |
55 | iso-8859-2 latin2 |
56 | iso-8859-3 latin3 |
57 | iso-8859-4 latin4 |
58 | iso-8859-5 cyrillic |
59 | iso-8859-6 arabic |
60 | iso-8859-7 |
61 | iso-8859-8 |
62 | iso-8859-9 latin5 |
63 | iso-8859-10 latin6 |
64 | iso-8859-11 |
65 | (iso-8859-12 is nonexistent) |
66 | iso-8859-13 latin7 |
67 | iso-8859-14 latin8 |
68 | iso-8859-15 latin9 |
69 | iso-8859-16 latin10 |
70 | |
71 | koi8-f |
72 | koi8-r |
73 | koi8-u |
74 | |
75 | viscii # ASCII + Vietnamese |
76 | |
77 | cp1250 WinLatin2 |
78 | cp1251 WinCyrillic |
79 | cp1252 WinLatin1 |
80 | cp1253 WinGreek |
81 | cp1254 WinTurkish |
82 | cp1255 WinHebrew |
83 | cp1256 WinArabic |
84 | cp1257 WinBaltic |
85 | cp1258 WinVietnamese |
86 | # all cp* are also available as ibm-* and ms-* |
87 | |
88 | maccentraleuropean |
89 | maccroatian |
90 | macroman |
91 | maccyrillic |
92 | macromanian |
93 | macdingbats |
94 | macsami |
95 | macgreek |
96 | macthai |
97 | macicelandic |
98 | macturkish |
99 | macukraine |
5129552c |
100 | ----------------------- |
5d030b67 |
101 | |
5129552c |
102 | =head2 The CJK: Chinese, Japanese, Korean (Multibyte) |
5d030b67 |
103 | |
104 | Note that Vietnamese is listed above. Also read "Encoding vs. Charset" |
105 | below. These encodings are implemented in a distinct module per |
106 | language because of size concerns. See the perldoc of each module. |
107 | |
5129552c |
108 | =over 4 |
109 | |
110 | =item Encode::CN -- Continental China |
111 | |
112 | ----------------------- |
113 | cp936 gbk |
114 | euc-cn |
115 | gb12345 |
116 | gb2312 |
117 | hz |
118 | iso-ir-165 |
119 | ----------------------- |
120 | |
121 | =item Encode::JP -- Japan |
122 | |
123 | ----------------------- |
124 | 7bit-jis jis |
125 | cp932 |
126 | euc-jp ujis |
127 | iso-2022-jp |
128 | macjapan |
129 | shiftjis Shift_JIS, sjis |
130 | ----------------------- |
131 | |
132 | =item Encode::KR -- Korea |
133 | |
134 | ----------------------- |
135 | euc-kr |
136 | ksc5601 |
137 | cp949 |
138 | ----------------------- |
139 | |
140 | =item Encode::TW -- Taiwan |
141 | |
142 | ----------------------- |
143 | big5 |
144 | big5-hkscs |
145 | cp950 |
146 | ----------------------- |
147 | |
148 | =item Encode::HanExtra -- More Chinese via CPAN |
149 | |
150 | Due to size concerns, additional Chinese encodings below are |
151 | distributed separately on CPAN, under the name Encode::HanExtra. |
152 | |
153 | ----------------------- |
154 | gb18030 |
155 | euc-tw |
156 | big5plus |
157 | ----------------------- |
158 | |
159 | =back |
160 | |
161 | =head2 Miscellaneous encodings |
162 | |
163 | =over 4 |
164 | |
165 | =item Encode::EBCDIC |
5d030b67 |
166 | |
167 | See perlebcdic for details. |
168 | |
5129552c |
169 | ----------------------- |
5d030b67 |
170 | cp1047 |
171 | cp37 |
172 | posix-bc |
5129552c |
173 | ----------------------- |
174 | |
175 | =item Encode::Symbol |
5d030b67 |
176 | |
5129552c |
177 | For symbols and dingbats. |
5d030b67 |
178 | |
5129552c |
179 | ----------------------- |
5d030b67 |
180 | symbol |
181 | dingbats |
5129552c |
182 | ----------------------- |
183 | |
184 | =back |
5d030b67 |
185 | |
186 | =head1 Encoding vs. Charset |
187 | |
188 | Character encoding (or just "encoding") and Character Set (or just |
189 | "charset") are often used interchangeably but they are different |
190 | concepts. |
191 | |
192 | A charset determines which characters may appear in a given text. |
193 | |
194 | An encoding maps the characters of one or more charsets to a stream of bits. |
195 | |
196 | Note that a given encoding may contain multiple charsets. For instance, |
197 | euc-jp contains ASCII, JIS X 0201 (Hankaku Kana), JIS X 0208 (Zenkaku |
198 | Kana and Kanji) and JIS X 0212 (Extended Kanji) in a single encoding. |
199 | |
200 | As the name suggests, the Encode module supports encodings, not |
201 | individual charsets. |
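To make the euc-jp example concrete, here is a rough sketch that encodes one character from each of the four charsets and reports how many octets each takes. The sample characters are my own picks and the variable names are hypothetical; it assumes the euc-jp encoding bundled with Encode covers JIS X 0212.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One sample character per charset carried by euc-jp
my @samples = (
    [ 'ASCII',      "A"        ],   # U+0041
    [ 'JIS X 0201', "\x{FF71}" ],   # HALFWIDTH KATAKANA LETTER A
    [ 'JIS X 0208', "\x{6F22}" ],   # a common kanji
    [ 'JIS X 0212', "\x{4E02}" ],   # a supplementary kanji
);

my %length_of;
for my $sample (@samples) {
    my ($charset, $char) = @$sample;
    my $octets = encode('euc-jp', $char);
    $length_of{$charset} = length $octets;
    my $hex = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
    printf "%-10s U+%04X => %s\n", $charset, ord($char), $hex;
}
```

The differing octet counts show a single encoding switching between charsets on a per-character basis.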
202 | |
203 | =head1 Encoding Classification (by Anton Tagunov) |
204 | |
205 | Encodings |
206 | |
207 | US-ASCII UTF-8 KOI8-R ISO-8859-* |
208 | ISO-2022-CN ISO-2022-JP Big5 |
209 | EUC-CN EUC-JP EUC-KR |
210 | |
211 | are registered at <http://www.iana.org/assignments/character-sets> as |
212 | preferred MIME names and may be used over the Internet. So is |
213 | |
214 | Shift_JIS |
215 | |
216 | but despite its widespread use it was long labeled as |
217 | Microsoft proprietary. Shift_JIS is now official as of |
218 | JIS X 0208-1997. |
219 | |
220 | UTF-16 KOI8-U |
221 | |
222 | are IANA-registered preferred MIME names but should probably |
223 | be avoided as encodings for web pages due to lack of |
224 | browser support. |
225 | |
226 | ISO-2022 (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM) |
227 | ISO-2022-JP-1 (http://www.faqs.org/rfcs/rfc2237.html) |
228 | ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html) |
229 | GBK |
230 | VISCII |
231 | GB 12345 (only planes 1 and 2 available) |
232 | GB 18030 |
233 | CNS 11643 |
234 | |
235 | are totally valid encodings but not registered at IANA. |
236 | |
237 | BIG5PLUS |
238 | EUC-JP-0212 (Encode::lib::Encode::Tcl::Extended) |
239 | |
240 | are somewhat proprietary. |
241 | |
242 | More information on CJK encodings is available at the following URLs: |
243 | |
244 | a brief description of most of the CJK encodings mentioned above |
245 | |
246 | F<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html> |
247 | |
248 | several years old, but still useful |
249 | |
250 | F<http://www.oreilly.com/people/authors/lunde/cjk_inf.html> |
251 | |
252 | and some in-depth reading for the heroes :-) |
253 | F<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (eq ISO-2022) |
254 | |
255 | =head1 See Also |
256 | |
5129552c |
257 | L<Encode>, |
258 | L<Encode::Byte>, |
259 | L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>, |
260 | L<Encode::EBCDIC>, L<Encode::Symbol> |
5d030b67 |
261 | |
262 | =cut |