Commit | Line | Data |
5d030b67 |
1 | =head1 NAME |
2 | |
3 | Encode::Supported -- Encodings supported by Encode |
4 | |
5 | =head1 DESCRIPTION |
6 | |
7 | =head2 Encoding Names |
8 | |
9 | Encoding names are case insensitive. White space in names |
10 | is ignored. In addition an encoding may have aliases. |
11 | Each encoding has one "canonical" name. The "canonical" |
12 | name is chosen from the names of the encoding by picking |
13 | the first in the following sequence: |
14 | |
15 | o The MIME name as defined in IETF RFCs. |
16 | o The name in the IANA registry. |
17 | o The name used by the organization that defined it. |
18 | |
19 | Because of all the alias issues, and because in the general |
20 | case encodings have state, "Encode" uses the encoding |
21 | object internally once an operation is in progress. |
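
As a rough illustration of the alias handling described above, the following sketch looks up a few spellings with find_encoding() from the Encode module and prints the canonical name reported by each encoding object. The spellings tried here are only examples chosen for illustration, and the exact set of aliases accepted may differ between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Example spellings only; other aliases may or may not be recognized.
my @spellings = qw(latin1 Latin1 ISO-8859-1 iso-8859-1);

my %canonical;
for my $name (@spellings) {
    # find_encoding() returns an encoding object, or undef if unknown.
    my $enc = find_encoding($name);
    $canonical{$name} = defined $enc ? $enc->name : '(unknown)';
    printf "%-12s => %s\n", $name, $canonical{$name};
}
```

Holding on to the object returned by find_encoding() avoids repeating the alias lookup on every call.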
22 | |
23 | =head2 Supported Encodings |
24 | |
25 | As of Perl 5.8.0, at least the following encodings are recognized. |
26 | Note that unless otherwise specified, they are all case insensitive |
27 | (via alias) and all occurrences of spaces are replaced with '-'. In |
28 | other words, "ISO 8859 1" and "iso-8859-1" are identical. |
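
To check which encodings a particular installation recognizes, a short sketch along these lines may help. It relies on Encode->encodings(), which by default reports the encodings already loaded and, when given ":all", also reports those available from modules not yet loaded. The filter on "iso-8859-" is just an example.

```perl
use strict;
use warnings;
use Encode;

# Encodings already loaded into this process.
my @loaded = Encode->encodings();

# All encodings Encode knows how to load, by canonical name.
my @all = Encode->encodings(":all");

printf "%d loaded, %d available\n", scalar @loaded, scalar @all;

# Example filter: show only the ISO 8859 family.
my @iso = grep { /^iso-8859-/ } @all;
print "$_\n" for @iso;
```

The list will vary with the Perl and Encode versions installed, so it is best treated as a snapshot of one system and not as a definitive catalogue.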
29 | |
30 | =head3 ASCII |
31 | |
32 | Canonical Aliases |
33 | ----------------------- |
34 | ascii uc-ascii |
35 | |
36 | =head3 The Unicode |
37 | |
38 | utf8 UTF-8 |
39 | utf16 UTF-16 |
40 | ucs2 UCS-2, iso-10646-1 |
41 | |
42 | =head3 The ISO 8859, KOI, and other 1-byte encodings |
43 | |
44 | The following encodings are based upon ASCII. In most cases they use |
45 | \x80-\xff (the upper half) to map non-ASCII characters. |
46 | |
47 | iso-8859-1 latin1 |
48 | iso-8859-2 latin2 |
49 | iso-8859-3 latin3 |
50 | iso-8859-4 latin4 |
51 | iso-8859-5 cyrillic |
52 | iso-8859-6 arabic |
53 | iso-8859-7 |
54 | iso-8859-8 |
55 | iso-8859-9 latin5 |
56 | iso-8859-10 latin6 |
57 | iso-8859-11 |
58 | (iso-8859-12 is nonexistent) |
59 | iso-8859-13 latin7 |
60 | iso-8859-14 latin8 |
61 | iso-8859-15 latin9 |
62 | iso-8859-16 latin10 |
63 | |
64 | koi8-f |
65 | koi8-r |
66 | koi8-u |
67 | |
68 | viscii # ASCII + Vietnamese |
69 | |
70 | cp1250 WinLatin2 |
71 | cp1251 WinCyrillic |
72 | cp1252 WinLatin1 |
73 | cp1253 WinGreek |
74 | cp1254 WinTurkish |
75 | cp1255 WinHebrew |
76 | cp1256 WinArabic |
77 | cp1257 WinBaltic |
78 | cp1258 WinVietnamese |
79 | # all cp* are also available as ibm-* and ms-* |
80 | |
81 | maccentraleuropean |
82 | maccroatian |
83 | macroman |
84 | maccyrillic |
85 | macromanian |
86 | macdingbats |
87 | macsami |
88 | macgreek |
89 | macthai |
90 | macicelandic |
91 | macturkish |
92 | macukraine |
93 | |
94 | =head3 The CJK: Chinese, Japanese, Korean (Multibyte) |
95 | |
96 | Note that Vietnamese is listed above. Also read "Encoding vs. Charset" |
97 | below. These encodings are implemented in separate modules, one per |
98 | language, due to size concerns. See the perldocs of those modules also. |
99 | |
100 | cp936 gbk # Encode::CN |
101 | euc-cn # Encode::CN |
102 | gb12345 # Encode::CN |
103 | gb2312 # Encode::CN |
104 | gb2312 # Encode::CN |
105 | hz # Encode::CN |
106 | iso-ir-165 # Encode::CN |
107 | |
108 | 7bit-jis jis # Encode::JP |
109 | cp932 # Encode::JP |
110 | euc-jp ujis # Encode::JP |
111 | iso-2022-jp # Encode::JP |
112 | macjapan # Encode::JP |
113 | shiftjis Shift_JIS, sjis # Encode::JP |
114 | |
115 | euc-kr # Encode::KR |
116 | ksc5601 # Encode::KR |
117 | cp949 # Encode::KR |
118 | |
119 | big5 # Encode::TW |
120 | big5-hkscs # Encode::TW |
121 | cp950 # Encode::TW |
122 | |
123 | Due to size concerns, additional Chinese encodings including "GB |
124 | 18030", "EUC-TW" and "BIG5PLUS" are distributed separately on CPAN, |
125 | under the name Encode::HanExtra. |
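
Because the CJK encodings live in separate modules, Encode loads the relevant module when one of its encodings is first requested. The sketch below observes this for euc-jp by inspecting %INC before and after the lookup. It assumes Encode::JP is installed (as it is with a standard Perl distribution) and that nothing else in the process has loaded it already.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Has Encode::JP been loaded yet? (Normally not, in a fresh process.)
my $before = exists $INC{'Encode/JP.pm'} ? 'yes' : 'no';

# Requesting a Japanese encoding triggers loading of Encode::JP.
my $enc = find_encoding('euc-jp');

my $after = exists $INC{'Encode/JP.pm'} ? 'yes' : 'no';

print "Encode::JP loaded before lookup: $before\n";
print "Encode::JP loaded after lookup:  $after\n";
print "canonical name: ", (defined $enc ? $enc->name : '(unknown)'), "\n";
```

Encodings from Encode::HanExtra are not covered by this sketch, since that module has to be installed from CPAN first.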
126 | |
127 | =head3 EBCDIC |
128 | |
129 | See perlebcdic for details. |
130 | |
131 | cp1047 |
132 | cp37 |
133 | posix-bc |
134 | |
135 | =head3 Symbols and dingbats |
136 | |
137 | symbol |
138 | dingbats |
139 | |
140 | =head1 Encoding vs. Charset |
141 | |
142 | Character encoding (or just "encoding") and Character Set (or just |
143 | "charset") are often used interchangeably but they are different |
144 | concepts. |
145 | |
146 | A charset determines which characters may appear in a given text. |
147 | |
148 | An encoding maps the characters of one or more charsets to a stream of bits. |
149 | |
150 | Note that a given encoding may contain multiple charsets. For instance, |
151 | euc-jp contains ASCII, JIS X 0201 (Hankaku Kana), JIS X 0208 (Zenkaku |
152 | Kana and Kanji) and JIS X 0212 (Extended Kanji) in a single encoding. |
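
The euc-jp case can be made concrete with a small sketch that encodes one character from each of the four charsets and prints the resulting bytes in hexadecimal. The sample characters and labels are chosen here purely for illustration. The point to observe is that a single encoding emits byte sequences of different lengths and lead bytes depending on the charset the character comes from.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One illustrative character per charset (choices are examples only).
my @samples = (
    [ 'ASCII',                   "\x{0041}" ],   # LATIN CAPITAL LETTER A
    [ 'JIS X 0201 hankaku kana', "\x{FF71}" ],   # HALFWIDTH KATAKANA LETTER A
    [ 'JIS X 0208 zenkaku kana', "\x{3042}" ],   # HIRAGANA LETTER A
    [ 'JIS X 0212 kanji',        "\x{4E02}" ],   # a supplementary kanji
);

my %hex;
for my $sample (@samples) {
    my ($label, $char) = @$sample;
    my $bytes = encode('euc-jp', $char);
    # Render each byte as two uppercase hex digits.
    $hex{$label} = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    printf "%-24s U+%04X => %s\n", $label, ord($char), $hex{$label};
}
```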
153 | |
154 | As the name suggests, the Encode module supports encodings, not |
155 | individual charsets. |
156 | |
157 | =head1 Encoding Classification (by Anton Tagunov) |
158 | |
159 | Encodings |
160 | |
161 | US-ASCII UTF-8 KOI8-R ISO-8859-* |
162 | ISO-2022-CN ISO-2022-JP Big5 |
163 | EUC-CN EUC-JP EUC-KR |
164 | |
165 | are <http://www.iana.org/assignments/character-sets>-registered as |
166 | preferred MIME names and can generally be used over the Internet. So is |
167 | |
168 | Shift_JIS |
169 | |
170 | but despite its wide use it long bore the label of being |
171 | Microsoft proprietary. Shift_JIS is now official as of |
172 | JIS X 0208-1997. |
173 | |
174 | UTF-16 KOI8-U |
175 | |
176 | are IANA-registered preferred MIME names but probably |
177 | should be avoided as encodings for web pages due to lack of |
178 | browser support. |
179 | |
180 | ISO-2022 (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM) |
181 | ISO-2022-JP-1 (http://www.faqs.org/rfcs/rfc2237.html) |
182 | ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html) |
183 | GBK |
184 | VISCII |
185 | GB 12345 (only planes 1 and 2 available) |
186 | GB 18030 |
187 | CNS 11643 |
188 | |
189 | are totally valid encodings but not registered at IANA. |
190 | |
191 | BIG5PLUS |
192 | EUC-JP-0212 (Encode::lib::Encode::Tcl::Extended) |
193 | |
194 | are somewhat proprietary. |
195 | |
196 | More information on CJK encodings is available from the following sources: |
197 | |
198 | a brief description of most of the CJK encodings mentioned above |
199 | |
200 | F<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html> |
201 | |
202 | several years old, but still useful |
203 | |
204 | F<http://www.oreilly.com/people/authors/lunde/cjk_inf.html> |
205 | |
206 | and some in-depth reading for the heroes :-) |
207 | F<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (equivalent to ISO-2022) |
208 | |
209 | =head1 See Also |
210 | |
211 | L<Encode>, L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW> |
212 | |
213 | =cut |