=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored. In addition, an encoding may have aliases.
Each encoding has one "canonical" name. The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence:

  o The MIME name as defined in IETF RFCs.
  o The name in the IANA registry.
  o The name used by the organization that defined it.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses the encoding object internally
once an operation is in progress.
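
The name resolution described above can be tried out directly. The
following is a minimal sketch, assuming a Perl with the Encode module
installed. It uses C<find_encoding> and C<resolve_alias>, which Encode
exports on request, and the C<latin1> alias of C<iso-8859-1>. The
helper C<canonical_name> is a made-up name used only for this
illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# Hypothetical helper: return the canonical name for any spelling
# of an encoding name, or undef if Encode does not know it.
sub canonical_name {
    my ($name) = @_;
    my $enc = find_encoding($name);    # encoding object, or undef
    return $enc ? $enc->name : undef;
}

# An alias and a differently-cased spelling resolve to the same
# canonical name.
print canonical_name("latin1"),     "\n";
print canonical_name("ISO-8859-1"), "\n";

# resolve_alias() returns the canonical name directly.
print resolve_alias("Latin1"), "\n";
```

All three lines printed should be the same canonical name. Code that
needs to compare encoding names is probably better off comparing the
resolved names than the strings a user typed.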

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'. In
other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available. Encode.pm automatically loads those modules on demand.
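
As an illustration of the automatic loading, the sketch below encodes
and decodes a Japanese character without ever loading C<Encode::JP>
explicitly. It assumes a standard Perl installation where the
C<euc-jp> encoding listed later in this document is available; the
variable names are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode decode);    # note: no "use Encode::JP" here

# U+3042 HIRAGANA LETTER A as a Perl character string.
my $chars = "\x{3042}";

# Asking for euc-jp by name is enough; Encode loads the module
# that implements it on demand.
my $octets = encode("euc-jp", $chars);
print unpack("H*", $octets), "\n";    # hex dump of the encoded octets

# Decoding the octets gives back the original character.
my $back = decode("euc-jp", $octets);
print $back eq $chars ? "round trip ok\n" : "round trip failed\n";
```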

=head2 Built-in Encodings

The following encodings are always available.

  Canonical     Aliases
  --------------------------------
  iso-8859-1    latin1
  US-ascii      ascii
  UCS-2         ucs2, iso-10646-1
  UCS-2le
  UTF-8         utf8
  --------------------------------

=head2 Encode::Byte

The following encodings are single-byte encodings implemented as
extensions of ASCII. Most of them use the upper half (\x80-\xff) to
map non-ASCII characters.
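
To make the "upper half" concrete, here is a small sketch using
C<iso-8859-2>, one of the encodings in this group. It assumes the
Encode module is installed; the sample string is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "a" is plain ASCII; U+0142 is LATIN SMALL LETTER L WITH STROKE.
my $octets = encode("iso-8859-2", "a\x{0142}");

# One byte per character: ASCII keeps its value, while the
# non-ASCII letter lands in the upper half (\x80-\xff).
print length($octets), " bytes: ", unpack("H*", $octets), "\n";
```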

  -----------------------------------
  (iso-8859-1 is in built-in)
  iso-8859-2    latin2
  iso-8859-3    latin3
  iso-8859-4    latin4
  iso-8859-5
  iso-8859-6
  iso-8859-7
  iso-8859-8
  iso-8859-9    latin5
  iso-8859-10   latin6
  iso-8859-11
  (iso-8859-12 is nonexistent)
  iso-8859-13   latin7
  iso-8859-14   latin8
  iso-8859-15   latin9
  iso-8859-16   latin10

  koi8-f
  koi8-r
  koi8-u

  viscii        # ASCII + Vietnamese

  cp1250        WinLatin2
  cp1251        WinCyrillic
  cp1252        WinLatin1
  cp1253        WinGreek
  cp1254        WinTurkish
  cp1255        WinHebrew
  cp1256        WinArabic
  cp1257        WinBaltic
  cp1258        WinVietnamese
  # all cp* are also available as ibm-* and ms-*

  maccentraleuropean
  maccroatian
  macroman
  maccyrillic
  macromanian
  macdingbats
  macsami
  macgreek
  macthai
  macicelandic
  macturkish
  macukraine
  -----------------------------------

=head2 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs. Charset"
below. Also note that these are implemented in distinct modules by
language, due to size concerns. Please also refer to their
respective documentation pages.

=over 4

=item Encode::CN -- Continental China

  -----------------------
  cp936         gbk
  euc-cn
  gb12345
  gb2312
  hz
  iso-ir-165
  -----------------------

=item Encode::JP -- Japan

  -----------------------
  7bit-jis      jis
  cp932
  euc-jp        ujis
  iso-2022-jp
  iso-2022-jp-1
  macjapan
  shiftjis      Shift_JIS, sjis
  -----------------------

=item Encode::KR -- Korea

  -----------------------
  euc-kr
  ksc5601
  cp949
  -----------------------

=item Encode::TW -- Taiwan

  -----------------------
  big5
  big5-hkscs
  cp950
  -----------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  -----------------------
  gb18030
  euc-tw
  big5plus
  -----------------------

=back

=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See L<perlebcdic> for details.

  -----------------------
  cp1047
  cp37
  posix-bc
  -----------------------

=item Encode::Symbol

For symbols and dingbats.

  -----------------------
  symbol
  dingbats
  -----------------------

=back

=head1 Encoding vs. Charset

Character encoding (or just "encoding") and character set (or just
"charset") are often used interchangeably, but they are different
concepts.

A charset determines which characters can appear in a given text.

An encoding maps the characters of one or more charsets to a stream
of bits.

Note that a given encoding may contain multiple charsets; complex CJK
encodings are usually implemented that way.

For instance, euc-jp contains ASCII, JIS X 0201-1978 (Hankaku Kana),
JIS X 0208-1997 (Zenkaku Kana and Kanji) and JIS X 0212-1990 (Extended
Kanji) in a single encoding.

As the name suggests, the Encode module supports encodings, not
individual charsets.
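
The euc-jp case can be observed from Perl. The sketch below, which
assumes the Encode module with its C<euc-jp> encoding is installed,
encodes one sample character from each of three of the charsets named
above and prints the resulting octets in hex. The sample characters
and the C<%sample> hash are choices made for this illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character from each of three charsets carried by euc-jp.
my %sample = (
    "ASCII"      => "A",           # LATIN CAPITAL LETTER A
    "JIS X 0201" => "\x{FF71}",    # HALFWIDTH KATAKANA LETTER A
    "JIS X 0208" => "\x{6F22}",    # the kanji "kan" as in "kanji"
);

# The same encoding name handles all three; only the octet
# sequences differ from charset to charset.
for my $charset (sort keys %sample) {
    my $octets = encode("euc-jp", $sample{$charset});
    printf "%-10s -> %s\n", $charset, unpack("H*", $octets);
}
```

A single call to C<encode> with a single encoding name covers
characters from all of these charsets, which is why the module deals
in encodings and not in individual charsets.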

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

Encoding names

  US-ASCII    UTF-8
  ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP  ISO-2022-JP  ISO-2022-JP-1
  EUC-KR
  Big5

are registered with IANA
(L<http://www.iana.org/assignments/character-sets>) as
preferred MIME names and can probably be used over the Internet.

C<Shift_JIS> is no longer Microsoft proprietary since it has been
made official by JIS X 0208-1997. It is probably the most widespread
encoding for Japanese on the Internet.

  EUC-CN

has not been registered with IANA under this name (as of March 2002)
but seems to be supported by major web browsers. (IANA has registered
this encoding as C<GB2312>, but C<gb2312> currently has a different
meaning to the C<Encode> module. It will probably become an alias of
C<EUC-CN> in the future; until then it is safer to avoid using
C<gb2312> as an encoding name within Perl.)

  UTF-16
  KOI8-U  (http://www.faqs.org/rfcs/rfc2319.html)

are IANA-registered (C<UTF-16> even as a preferred MIME name)
but should probably be avoided as encodings for web pages due to
lack of browser support.

  ISO-IR-165  (http://www.faqs.org/rfcs/rfc1345.html)
  GBK
  VISCII
  GB 12345
  GB 18030 (*)  (see links below)
  EUC-TW   (*)

are perfectly valid encodings but are not registered with IANA.
The names under which they are listed here are probably the
most widely known names for these encodings and are the recommended
names.

=for comment these used to be listed as supported but
do not work @15457; when it's clear, they will be uncommented
or deleted - Anton
ISO-2022 (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
CNS 11643 (only planes 1 and 2 available)

  BIG5PLUS (*)

is a somewhat proprietary name. C<(*)>-marked encodings belong to
C<Encode::HanExtra>, which is available from CPAN.

More information on CJK encodings can probably be found at the
following places.

A brief description of most of the mentioned CJK encodings:
L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>

Several years old, but still useful:
L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Some in-depth reading for the heroes :-)
L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (eq C<ISO-2022>)

Brief information on C<EUC-CN> and C<GBK>, and mostly on C<GB 18030>:
F<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

The nature of information in this section is most fragile and
error-prone; I<probably> is the most popular adverb :)
Please feel free to send your comments, disagreements and
additions to L<...>. (Note, however,
that the mission of this document is to cover the
C<Encode>-supported encodings only.)

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>

=cut