ext/Encode/lib/Encode/Supported.pod

   1 =head1 NAME
   2
   3 Encode::Supported -- Encodings supported by Encode
   4
   5 =head1 DESCRIPTION
   6
   7 =head2 Encoding Names
   8
   9 Encoding names are case insensitive. White space in names
  10 is ignored.  In addition an encoding may have aliases.
  11 Each encoding has one "canonical" name.  The "canonical"
  12 name is chosen from the names of the encoding by picking
  13 the first in the following sequence:
  14
  15        o The MIME name as defined in IETF RFCs.
  16        o The name in the IANA registry.
  17        o The name used by the organization that defined it.
  18
  19 Because of all the alias issues, and because in the general case
  20 encodings have state, "Encode" uses the encoding object internally
  21 once an operation is in progress.
  22
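As a minimal sketch of the naming rules above (not taken from the
original document), the following looks up one encoding under several
spellings with C<find_encoding> and prints the canonical name reported
by the resulting encoding object.  The sample spellings are chosen only
for illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up the same encoding under several spellings.  Each lookup
# returns an encoding object; name() reports its canonical name.
my @spellings = ("latin1", "Latin1", "ISO-8859-1");
my @canonical;
for my $name (@spellings) {
    my $enc = find_encoding($name)
        or die "Unknown encoding: $name";
    push @canonical, $enc->name;
    printf "%-12s => %s\n", $name, $enc->name;
}

# The object can be used directly, so the name is resolved only once.
my $latin1 = find_encoding("latin1");
my $bytes  = $latin1->encode("caf\x{e9}");   # "cafe" with e-acute
printf "%d bytes\n", length $bytes;
```

Holding on to the object returned by C<find_encoding> avoids repeating
the alias lookup for every call.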
  23 =head1 Supported Encodings
  24
  25 As of Perl 5.8.0, at least the following encodings are recognized.
  26 Note that unless otherwise specified, they are all case insensitive
  27 (via alias) and all occurrences of spaces are replaced with '-'.  In
  28 other words, "ISO 8859 1" and "iso-8859-1" are identical.
  29
  30 Encodings are categorized and implemented in several different modules,
  31 but you don't have to C<use Encode::XX> to make them available in
  32 most cases.  Encode.pm will automatically load those modules on demand.
  33
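The on-demand loading described above can be sketched as follows.  This
is an illustration only: it loads just C<Encode>, asks for C<euc-jp>,
and then checks C<%INC> to see whether C<Encode::JP> was pulled in.  The
sample text is an arbitrary choice.

```perl
use strict;
use warnings;
use Encode qw(encode decode);    # note: no "use Encode::JP"

# Two kanji (U+65E5 U+672C), chosen as arbitrary sample text.
my $text  = "\x{65e5}\x{672c}";
my $bytes = encode("euc-jp", $text);    # the JP module is loaded here
my $back  = decode("euc-jp", $bytes);

my $round_trip = ($back eq $text) ? 1 : 0;
my $jp_loaded  = exists $INC{"Encode/JP.pm"} ? 1 : 0;
print "round trip ok\n"     if $round_trip;
print "Encode::JP loaded\n" if $jp_loaded;

# Encode->encodings(":all") lists every encoding name Encode knows
# about, including those whose modules have not been loaded yet.
my @all = Encode->encodings(":all");
printf "%d encodings known\n", scalar @all;
```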
  34 =head2 Built-in Encodings
  35
  36 The following encodings are always available.
  37
  38   Canonical     Aliases
  39   -----------------------
  40   iso-8859-1    latin1
  41   US-ascii      ascii
  42   UCS-2         ucs2, iso-10646-1
  43   UCS-2le
  44   UTF-8         utf8
  45   -----------------------
  46
  47 =head2 Encode::Byte
  48
  49 The following encodings are single-byte encodings implemented as
  50 extensions of ASCII.  In most cases they use \x80-\xff (the upper half)
  51 to map non-ASCII characters.
  52
  53   -----------------------
  54   (iso-8859-1   is in built-in)
  55   iso-8859-2    latin2
  56   iso-8859-3    latin3
  57   iso-8859-4    latin4
  58   iso-8859-5
  59   iso-8859-6
  60   iso-8859-7
  61   iso-8859-8
  62   iso-8859-9    latin5
  63   iso-8859-10   latin6
  64   iso-8859-11
  65   (iso-8859-12 is nonexistent)
  66   iso-8859-13   latin7
  67   iso-8859-14   latin8
  68   iso-8859-15   latin9
  69   iso-8859-16   latin10
  70
  71   koi8-f
  72   koi8-r
  73   koi8-u
  74
  75   viscii        # ASCII + Vietnamese
  76
  77   cp1250        WinLatin2
  78   cp1251        WinCyrillic
  79   cp1252        WinLatin1
  80   cp1253        WinGreek
  81   cp1254        WinTurkish
  82   cp1255        WinHebrew
  83   cp1256        WinArabic
  84   cp1257        WinBaltic
  85   cp1258        WinVietnamese
  86   # all cp* are also available as ibm-* and ms-*
  87
  88   maccentraleuropean
  89   maccroatian
  90   macroman
  91   maccyrillic
  92   macromanian
  93   macsami
  94   macgreek
  95   macthai
  96   macicelandic
  97   macturkish
  98   macukraine
  99
 100   nextstep
 101   gsm0338       # used in GSM handsets
 102   roman8        # HP Roman-8 character set
 103   -----------------------
 104
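To illustrate how the single-byte encodings listed above use the upper
half, here is a small sketch that encodes one Cyrillic letter with three
of them.  The choice of letter and encodings is an assumption made for
the example; each encoding places the letter at a different byte value,
but always as a single byte in the \x80-\xff range.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+0410 CYRILLIC CAPITAL LETTER A, a non-ASCII character that all
# three sample encodings can represent.
my $char = "\x{0410}";

my %byte_for;
for my $name ("iso-8859-5", "koi8-r", "cp1251") {
    my $bytes = encode($name, $char);
    $byte_for{$name} = $bytes;
    printf "%-10s 0x%02X\n", $name, ord $bytes;
}

# ASCII characters keep their usual values in all of them.
my $ascii = encode("koi8-r", "A");
printf "%-10s 0x%02X\n", "ascii 'A'", ord $ascii;
```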
 105 =head2 The CJK: Chinese, Japanese, Korean (Multibyte)
 106
 107 Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
 108 below.  These encodings are implemented in a distinct module for each
 109 language because of size concerns.  Please also refer to their
 110 respective documentation pages.
 111
 112 =over 4
 113
 114 =item Encode::CN -- Continental China
 115
 116   -----------------------
 117   cp936      gbk
 118   euc-cn
 119   gb12345
 120   gb2312
 121   hz
 122   iso-ir-165
 123   -----------------------
 124
 125 =item Encode::JP -- Japan
 126
 127   -----------------------
 128   7bit-jis        jis
 129   cp932
 130   euc-jp          ujis
 131   iso-2022-jp
 132   iso-2022-jp-1
 133   macjapan
 134   shiftjis        Shift_JIS, sjis
 135   -----------------------
 136
 137 =item Encode::KR -- Korea
 138
 139   -----------------------
 140   euc-kr
 141   ksc5601
 142   cp949
 143   -----------------------
 144
 145 =item Encode::TW -- Taiwan
 146
 147   -----------------------
 148   big5
 149   big5-hkscs
 150   cp950
 151   -----------------------
 152
 153 =item Encode::HanExtra -- More Chinese via CPAN
 154
 155 Due to size concerns, additional Chinese encodings below are
 156 distributed separately on CPAN, under the name Encode::HanExtra.
 157
 158   -----------------------
 159   gb18030
 160   euc-tw
 161   big5plus
 162   -----------------------
 163
 164 =back
 165
 166 =head2 Miscellaneous encodings
 167
 168 =over 4
 169
 170 =item Encode::EBCDIC
 171
 172 See L<perlebcdic> for details.
 173
 174   -----------------------
 175   cp1047
 176   cp37
 177   posix-bc
 178   -----------------------
 179
 180 =item Encode::Symbol
 181
 182 For symbols and dingbats.
 183
 184   -----------------------
 185   symbol
 186   dingbats
 187   macdingbats
 188   -----------------------
 189
 190 =back
 191
 192 =head1 Encoding vs. Charset
 193
 194 Character encoding (or just "encoding") and Character Set (or just
 195 "charset") are often used interchangeably but they are different
 196 concepts.
 197
 198 A charset determines which characters may be included in a given text.
 199
 200 An encoding maps the characters of one or more charsets to a stream of bits.
 201
 202 Note that a given encoding may cover multiple charsets; complex CJK
 203 encodings are usually implemented that way.
 204
 205 For instance, euc-jp contains ASCII, JIS X 0201-1978 (Hankaku Kana),
 206 JIS X 0208-1997 (Zenkaku Kana and Kanji) and JIS X 0212-1990 (Extended
 207 Kanji) in a single encoding.
 208
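The euc-jp case can be made concrete with a short sketch.  The sample
characters below are assumptions chosen for illustration, one from each
of the charsets named above; the code encodes each with C<euc-jp> and
prints the resulting bytes, which differ in length and lead byte
depending on the charset the character comes from.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One sample character per charset (illustrative choices).
my @samples = (
    [ "ASCII",       "A"         ],
    [ "JIS X 0201",  "\x{ff71}"  ],   # halfwidth katakana A
    [ "JIS X 0208",  "\x{6f22}"  ],   # a common kanji
    [ "JIS X 0212",  "\x{4e02}"  ],   # an extended kanji
);

my %encoded;
for my $sample (@samples) {
    my ($charset, $char) = @$sample;
    my $bytes = encode("euc-jp", $char);
    $encoded{$charset} = $bytes;
    my $hex = join " ", map { sprintf "%02X", ord } split //, $bytes;
    printf "%-11s U+%04X => %s\n", $charset, ord $char, $hex;
}
```

All four characters go through the same encoding, which is why
C<Encode> deals in encodings rather than in individual charsets.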
 209 As the name suggests, the Encode module supports encodings, not
 210 individual charsets.
 211
 212 =head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
 213
 214 This section tries to classify the supported encodings by their
 215 applicability for information exchange over the Internet and to
 216 choose the most suitable aliases to name them in the context of
 217 such communication.
 218
 219 Encoding names
 220
 221   US-ASCII    UTF-8
 222   ISO-8859-*  KOI8-R
 223   Shift_JIS   EUC-JP  ISO-2022-JP ISO-2022-JP-1
 224   EUC-KR
 225   Big5
 226
 227 are registered with IANA (L<http://www.iana.org/assignments/character-sets>)
 228 as preferred MIME names and can probably be used over the Internet.
 229
 230 C<Shift_JIS> is no longer Microsoft-proprietary since it has been
 231 standardized in JIS X 0208-1997. It is probably the most widespread
 232 encoding for Japanese on the Internet.
 233
 234   EUC-CN
 235
 236 has not been registered with IANA (as of March 2002) but
 237 seems to be supported by major web browsers. (IANA has registered
 238 this encoding as C<GB2312>, but C<gb2312> currently has a different
 239 meaning to the C<Encode> module. It will probably become an alias for
 240 C<EUC-CN> in the future; until then it is safer to avoid using
 241 C<gb2312> as an encoding name within Perl.)
 242
 243   UTF-16
 244   KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)
 245
 246 are IANA-registered (C<UTF-16> even as a preferred MIME name)
 247 but should probably be avoided as encodings for web pages due to
 248 the lack of browser support.
 249
 250   ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
 251   GBK
 252   VISCII
 253   GB 12345
 254   GB 18030 (*)  (see links below)
 255   EUC-TW   (*)
 256
 257 are perfectly valid encodings but are not registered with IANA.
 258 The names under which they are listed here are probably the
 259 most widely known names for these encodings and are the recommended
 260 names.
 261
 262
 263 =for comment these used to be listed as supported but
 265 do not work @15457; when it's clear, they will be uncommented
 266 or deleted - Anton
 267 ISO-2022      (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
 268 CNS 11643     (only planes 1 and 2 available)
 269
 270   BIG5PLUS (*)
 271
 272 is a somewhat proprietary name. C<(*)>-marked encodings belong to
 273 C<Encode::HanExtra>, available from CPAN.
 274
 275 Some information on CJK encodings is probably available at the following locations:
 276
 277 brief description for most of the mentioned CJK encodings
 278 L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>
 279
 280 several years old, but still useful
 281 L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
 282
 283 and some in-depth reading for the heroes :-)
 284 L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (equivalent to C<ISO-2022>)
 285
 286 gives brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
 287 F<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
 288
 289 The nature of information in this section is most fragile and
 290 error-prone; I<probably> is the most popular adverb :)
 291 Please feel free to send your comments, disagreements and
 292 additions to L<...>.  (Note, however,
 293 that the mission of this document is to cover the
 294 C<Encode>-supported encodings only.)
 295
 296 =head1 See Also
 297
 298 L<Encode>,
 299 L<Encode::Byte>,
 300 L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
 301 L<Encode::EBCDIC>, L<Encode::Symbol>
 302
 303 =cut