Upgrade to Encode 0.99, from Dan Kogai.

=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored. In addition, an encoding may have aliases.
Each encoding has one "canonical" name. The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence:

  o The MIME name as defined in IETF RFCs.
  o The name in the IANA registry.
  o The name used by the organization that defined it.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses the encoding object internally
once an operation is in progress.
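
As a rough illustration of the naming rules above, the following sketch
looks up one encoding under several spellings and prints the canonical
name each one resolves to. It assumes a reasonably recent Encode in which
C<find_encoding> and C<Encode::resolve_alias> are available; the list of
spellings is our own choice, and which of them resolve may vary between
Encode versions.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Spellings to try; which ones resolve depends on the installed Encode.
my @names = ("latin1", "Latin1", "ISO-8859-1", "iso 8859 1");

for my $name (@names) {
    # find_encoding returns an encoding object, or undef if the name is unknown.
    my $enc = find_encoding($name);
    printf "%-12s => %s\n", $name, defined $enc ? $enc->name : "(not found)";
}

# resolve_alias returns the canonical name (false if the name is unknown).
my $canonical = Encode::resolve_alias("latin1");
print "canonical: $canonical\n";
```

The object returned by C<find_encoding> is the "encoding object" mentioned
above; its C<name> method reports the canonical name.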

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'. In
other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available. Encode.pm will automatically load those modules on demand.
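
A minimal sketch of the on-demand loading described above: the script
only says C<use Encode>, yet it encodes to C<euc-jp>, which lives in
Encode::JP. The sample character and variable names are ours, and the
sketch assumes Encode::JP is installed alongside Encode.

```perl
use strict;
use warnings;
use Encode qw(encode decode);    # note: no "use Encode::JP" here

# HIRAGANA LETTER A (U+3042) as a Perl character string.
my $string = "\x{3042}";

# Encode.pm is expected to load Encode::JP behind the scenes.
my $bytes = encode("euc-jp", $string);
print unpack("H*", $bytes), "\n";    # hex dump of the encoded bytes

# Round trip back to characters.
my $back = decode("euc-jp", $bytes);
print $back eq $string ? "round trip ok\n" : "round trip failed\n";
```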

=head2 Built-in Encodings

The following encodings are always available.

  Canonical      Aliases
  -----------------------------
  iso-8859-1     latin1
  US-ascii       ascii
  UCS-2          ucs2, iso-10646-1
  UCS-2le
  UTF-8          utf8
  -----------------------------
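
To check which encodings a particular installation offers, something
like the following may help. It is a sketch that relies on the
C<encodings> class method of Encode: with no argument it lists the
encodings already loaded, and with C<":all"> it lists everything that
can be loaded. The names printed will depend on the Encode version and
may not match the table above exactly.

```perl
use strict;
use warnings;
use Encode;

# With no argument: encodings that are already loaded (the built-in ones
# right after "use Encode").
my @loaded = Encode->encodings();

# With ":all": every encoding Encode knows how to load on this installation.
my @all = Encode->encodings(":all");

print "loaded:    @loaded\n";
printf "available: %d encodings\n", scalar @all;
```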

=head2 Encode::Byte

The following encodings are single-byte encodings implemented as
extensions of ASCII. In most cases they use \x80-\xff (the upper half)
to map non-ASCII characters.

  -----------------------------
  (iso-8859-1 is built in)
  iso-8859-2   latin2
  iso-8859-3   latin3
  iso-8859-4   latin4
  iso-8859-5
  iso-8859-6
  iso-8859-7
  iso-8859-8
  iso-8859-9   latin5
  iso-8859-10  latin6
  iso-8859-11
  (iso-8859-12 is nonexistent)
  iso-8859-13  latin7
  iso-8859-14  latin8
  iso-8859-15  latin9
  iso-8859-16  latin10

  koi8-f
  koi8-r
  koi8-u

  viscii       # ASCII + Vietnamese

  cp1250       WinLatin2
  cp1251       WinCyrillic
  cp1252       WinLatin1
  cp1253       WinGreek
  cp1254       WinTurkish
  cp1255       WinHebrew
  cp1256       WinArabic
  cp1257       WinBaltic
  cp1258       WinVietnamese
  # all cp* are also available as ibm-* and ms-*

  maccentraleuropean
  maccroatian
  macroman
  maccyrillic
  macromanian
  macsami
  macgreek
  macthai
  macicelandic
  macturkish
  macukraine

  nextstep
  gsm0338      # used in GSM handsets
  roman8       # what is this?
  -----------------------------
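
The "upper half" idea can be seen by decoding the same byte under a few
of the encodings listed above. This is only a sketch: the three
encodings and the byte C<\xC1> were picked by us for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same upper-half byte decodes to a different character under each
# encoding, while an ASCII byte decodes to the same character everywhere.
my %upper;
for my $enc ("iso-8859-5", "iso-8859-7", "koi8-r") {
    my $high  = decode($enc, "\xC1");    # a byte in the \x80-\xff range
    my $ascii = decode($enc, "A");       # a plain ASCII byte
    $upper{$enc} = ord($high);
    printf "%-12s \\xC1 => U+%04X, 'A' => U+%04X\n",
        $enc, ord($high), ord($ascii);
}
```

Because each byte maps to exactly one character, C<length> of the
decoded string equals the number of input bytes for these encodings.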

=head2 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs. Charset"
below. Also note that these encodings are implemented in a distinct
module for each language, due to size concerns. Please also refer to
their respective documentation pages.

=over 4

=item Encode::CN -- Continental China

  -----------------------------
  cp936        gbk
  euc-cn
  gb12345
  gb2312
  hz
  iso-ir-165
  -----------------------------

=item Encode::JP -- Japan

  -----------------------------
  7bit-jis     jis
  cp932
  euc-jp       ujis
  iso-2022-jp
  iso-2022-jp-1
  macjapan
  shiftjis     Shift_JIS, sjis
  -----------------------------

=item Encode::KR -- Korea

  -----------------------------
  euc-kr
  ksc5601
  cp949
  -----------------------------

=item Encode::TW -- Taiwan

  -----------------------------
  big5
  big5-hkscs
  cp950
  -----------------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  -----------------------------
  gb18030
  euc-tw
  big5plus
  -----------------------------

=back

=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See L<perlebcdic> for details.

  -----------------------------
  cp1047
  cp37
  posix-bc
  -----------------------------

=item Encode::Symbol

For symbols and dingbats.

  -----------------------------
  symbol
  dingbats
  macdingbats
  -----------------------------

=back

=head1 Encoding vs. Charset

Character encoding (or just "encoding") and character set (or just
"charset") are often used interchangeably, but they are different
concepts.

A charset determines which characters may be included in a given text.

An encoding maps the characters of one or more charsets to a stream of bits.

Note that a given encoding may contain multiple charsets, and complex CJK
encodings are usually implemented that way.

For instance, euc-jp contains ASCII, JIS X 0201-1978 (Hankaku Kana),
JIS X 0208-1997 (Zenkaku Kana and Kanji) and JIS X 0212-1990 (Extended
Kanji) in a single encoding.
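
To make the euc-jp example concrete, the sketch below encodes one
character from each of the charsets named above and prints the
resulting bytes. The sample characters are our own picks; in
particular, U+4E02 is assumed (not verified here) to belong to
JIS X 0212. The point is only that a single encoding uses different
byte layouts for different charsets.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One sample character per charset (the choice of samples is ours).
my @samples = (
    [ "ASCII",      "A"        ],
    [ "JIS X 0201", "\x{FF71}" ],    # HALFWIDTH KATAKANA LETTER A
    [ "JIS X 0208", "\x{3042}" ],    # HIRAGANA LETTER A
    [ "JIS X 0212", "\x{4E02}" ],    # a kanji assumed to be in JIS X 0212
);

my %hex;
for my $sample (@samples) {
    my ($charset, $char) = @$sample;
    my $bytes = encode("euc-jp", $char);
    $hex{$charset} = unpack("H*", $bytes);    # hex dump of the bytes
    printf "%-10s U+%04X => %s (%d byte(s))\n",
        $charset, ord($char), $hex{$charset}, length $bytes;
}
```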

As the name suggests, the Encode module supports encodings, not
individual charsets.

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

Encoding names

  US-ASCII    UTF-8
  ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP    ISO-2022-JP    ISO-2022-JP-1
  EUC-KR
  Big5

are registered at L<http://www.iana.org/assignments/character-sets> as
preferred MIME names and can probably be used over the Internet.

C<Shift_JIS> is no longer Microsoft proprietary since it has been
made official by JIS X 0208-1997. It is probably the most widespread
encoding for Japanese on the Internet.

  EUC-CN

is not registered with IANA under that name (as of March 2002) but
seems to be supported by major web browsers. (IANA has registered
this encoding as C<GB2312>, but C<gb2312> currently has a different
meaning to the C<Encode> module. It will probably become an alias for
C<EUC-CN> in the future; until then it is safer to avoid using
C<gb2312> as an encoding name within Perl.)
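
Following that advice, a script can name the encoding explicitly as
C<euc-cn>. The sketch below is only an illustration with sample text of
our choosing; it says nothing about what C<gb2312> means in any
particular Encode version, which may differ from the description above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+4E2D U+6587, the two characters of the word for "Chinese (language)".
my $text = "\x{4E2D}\x{6587}";

# Name the encoding explicitly as euc-cn rather than gb2312.
my $bytes = encode("euc-cn", $text);
print unpack("H*", $bytes), "\n";    # hex dump of the encoded bytes

# Round trip back to characters.
my $back = decode("euc-cn", $bytes);
print $back eq $text ? "round trip ok\n" : "round trip failed\n";
```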

  UTF-16
  KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)

are IANA-registered (C<UTF-16> even as a preferred MIME name)
but should probably be avoided as encodings for web pages due to
lack of browser support.

  ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
  GBK
  VISCII
  GB 12345
  GB 18030 (*)  (see links below)
  EUC-TW   (*)

are perfectly valid encodings but are not registered at IANA.
The names under which they are listed here are probably the
most widely known names for these encodings and are the recommended
names.

=for comment
These used to be listed as supported but do not work @15457.
When it's clear, they will be uncommented or deleted. - Anton
  ISO-2022      (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
  CNS 11643     (only planes 1 and 2 available)

  BIG5PLUS (*)

is a somewhat proprietary name. C<(*)>-marked encodings belong to
C<Encode::HanExtra>, available from CPAN.

You can probably find some information on CJK encodings at the following:

A brief description of most of the CJK encodings mentioned:
L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>

Several years old, but still useful:
L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Some in-depth reading for the heroes :-)
L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (equivalent to C<ISO-2022>)

Brief information on C<EUC-CN> and C<GBK>, and mostly on C<GB 18030>:
F<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

The information in this section is fragile and
error-prone; I<probably> is the most popular adverb :)
Please feel free to send your comments, disagreements and
additions to L<...>. (Note, however,
that the mission of this document is to cover the
C<Encode>-supported encodings only.)

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>

=cut