ext/Encode/lib/Encode/Supported.pod

   1 =head1 NAME
   2
   3 Encode::Supported -- Encodings supported by Encode
   4
   5 =head1 DESCRIPTION
   6
   7 =head2 Encoding Names
   8
   9 Encoding names are case insensitive.  White space in names
  10 is ignored.  In addition, an encoding may have aliases.
  11 Each encoding has one "canonical" name.  The "canonical"
  12 name is chosen from the names of the encoding by picking
  13 the first in the following sequence:
  14
  15        o The MIME name as defined in IETF RFCs.
  16        o The name in the IANA registry.
  17        o The name used by the organization that defined it.
  18
  19 Because of all the alias issues, and because in the general case
  20 encodings have state, "Encode" uses the encoding object internally
  21 once an operation is in progress.
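
     The naming rules above can be tried out with a minimal sketch like
     the one below.  It assumes a Perl with the Encode module installed
     and uses C<Encode::resolve_alias> and C<find_encoding>; the names
     passed in are taken from the tables in this document, and the output
     is not shown because it may differ between Encode versions.

       use strict;
       use warnings;
       use Encode;

       # Different spellings of a name should resolve to one canonical name.
       for my $name (qw(latin1 Latin1 ISO-8859-1 iso-8859-1)) {
           my $canonical = Encode::resolve_alias($name);
           print "$name => ", ($canonical || "(not found)"), "\n";
       }

       # An unknown name yields a false value instead of dying.
       print "iso-8859-12 is unknown\n"
           unless Encode::resolve_alias("iso-8859-12");

       # find_encoding() returns the encoding object that Encode works
       # with internally; its name() method gives the canonical name.
       my $enc   = find_encoding("latin1");
       my $shown = $enc ? $enc->name : "(not found)";
       print "latin1 object is named $shown\n";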
  22
  23 =head1 Supported Encodings
  24
  25 As of Perl 5.8.0, at least the following encodings are recognized.
  26 Note that unless otherwise specified, they are all case insensitive
  27 (via alias) and all occurrences of spaces are replaced with '-'.  In
  28 other words, "ISO 8859 1" and "iso-8859-1" are identical.
  29
  30 Encodings are categorized and implemented in several different modules,
  31 but in most cases you don't have to C<use Encode::XX> to make them
  32 available.  Encode.pm automatically loads those modules on demand.
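
     As a hedged illustration of the automatic loading described above,
     the sketch below uses C<koi8-r> (listed under Encode::Byte in this
     document) without an explicit C<use Encode::Byte>.  The sample text
     is an arbitrary Cyrillic word chosen for this example.

       use strict;
       use warnings;
       use Encode;     # note: no "use Encode::Byte" here

       my $text  = "\x{0416}\x{0438}\x{0437}\x{043D}\x{044C}";
       my $bytes = encode("koi8-r", $text);   # characters to bytes
       my $back  = decode("koi8-r", $bytes);  # bytes back to characters

       print "round trip ok\n" if $back eq $text;
       printf "%d characters, %d bytes\n", length($text), length($bytes);

       # Count every encoding name Encode knows about, loaded or not.
       my @all = Encode->encodings(":all");
       printf "%d encodings known\n", scalar @all;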
  33
  34 =head2 Built-in Encodings
  35
  36 The following encodings are always available.
  37
  38   Canonical     Aliases
  39   -----------------------
  40   iso-8859-1    latin1
  41   US-ascii      ascii
  42   UCS-2         ucs2, iso-10646-1
  43   UCS-2le
  44   UTF-8         utf8
  45   -----------------------
  46
  47 =head2 Encode::Byte
  48
  49 The following are single-byte encodings implemented as extensions of
  50 ASCII.  Most of them use \x80-\xff (the upper half) to map
  51 non-ASCII characters.
  52
  53   -----------------------
  54   (iso-8859-1   is in built-in)
  55   iso-8859-2    latin2
  56   iso-8859-3    latin3
  57   iso-8859-4    latin4
  58   iso-8859-5
  59   iso-8859-6
  60   iso-8859-7
  61   iso-8859-8
  62   iso-8859-9    latin5
  63   iso-8859-10   latin6
  64   iso-8859-11
  65   (iso-8859-12 is nonexistent)
  66   iso-8859-13   latin7
  67   iso-8859-14   latin8
  68   iso-8859-15   latin9
  69   iso-8859-16   latin10
  70
  71   koi8-f
  72   koi8-r
  73   koi8-u
  74
  75   viscii        # ASCII + Vietnamese
  76
  77   cp1250        WinLatin2
  78   cp1251        WinCyrillic
  79   cp1252        WinLatin1
  80   cp1253        WinGreek
  81   cp1254        WinTurkish
  82   cp1255        WinHebrew
  83   cp1256        WinArabic
  84   cp1257        WinBaltic
  85   cp1258        WinVietnamese
  86   # all cp* are also available as ibm-* and ms-*
  87
  88   maccentraleuropean
  89   maccroatian
  90   macroman
  91   maccyrillic
  92   macromanian
  93   macdingbats
  94   macsami
  95   macgreek
  96   macthai
  97   macicelandic
  98   macturkish
  99   macukraine
 100   -----------------------
 101
 102 =head2 The CJK: Chinese, Japanese, Korean (Multibyte)
 103
 104 Note that Vietnamese is listed above.  See also "Encoding vs. Charset"
 105 below.  These encodings are implemented in a separate module for each
 106 language because of size concerns.  Please also refer to their
 107 respective documentation pages.
 108
 109 =over 4
 110
 111 =item Encode::CN -- Continental China
 112
 113   -----------------------
 114   cp936      gbk
 115   euc-cn
 116   gb12345
 117   gb2312
 118   hz
 119   iso-ir-165
 120   -----------------------
 121
 122 =item Encode::JP -- Japan
 123
 124   -----------------------
 125   7bit-jis        jis
 126   cp932
 127   euc-jp          ujis
 128   iso-2022-jp
 129   iso-2022-jp-1
 130   macjapan
 131   shiftjis        Shift_JIS, sjis
 132   -----------------------
 133
 134 =item Encode::KR -- Korea
 135
 136   -----------------------
 137   euc-kr
 138   ksc5601
 139   cp949
 140   -----------------------
 141
 142 =item Encode::TW -- Taiwan
 143
 144   -----------------------
 145   big5
 146   big5-hkscs
 147   cp950
 148   -----------------------
 149
 150 =item Encode::HanExtra -- More Chinese via CPAN
 151
 152 Due to size concerns, the additional Chinese encodings below are
 153 distributed separately on CPAN, under the name Encode::HanExtra.
 154
 155   -----------------------
 156   gb18030
 157   euc-tw
 158   big5plus
 159   -----------------------
 160
 161 =back
 162
 163 =head2 Miscellaneous encodings
 164
 165 =over 4
 166
 167 =item Encode::EBCDIC
 168
 169 See L<perlebcdic> for details.
 170
 171   -----------------------
 172   cp1047
 173   cp37
 174   posix-bc
 175   -----------------------
 176
 177 =item Encode::Symbol
 178
 179 For symbols and dingbats.
 180
 181   -----------------------
 182   symbol
 183   dingbats
 184   -----------------------
 185
 186 =back
 187
 188 =head1 Encoding vs. Charset
 189
 190 Character encoding (or just "encoding") and character set (or just
 191 "charset") are often used interchangeably, but they are different
 192 concepts.
 193
 194 A charset determines which characters may appear in a given text.
 195
 196 An encoding maps the characters of its charset(s) to a stream of bits.
 197
 198 Note that a given encoding may contain multiple charsets; complex CJK
 199 encodings are usually implemented that way.
 200
 201 For instance, euc-jp contains ASCII, JIS X 0201-1978 (Hankaku Kana),
 202 JIS X 0208-1997 (Zenkaku Kana and Kanji) and JIS X 0212-1990 (Extended
 203 Kanji) in a single encoding.
 204
 205 As the name suggests, the Encode module supports encodings, not
 206 individual charsets.
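
     The euc-jp case can be made concrete with a small sketch.  It encodes
     one character per charset and prints the resulting bytes in hex.  The
     sample code points are chosen for this illustration and are assumed
     to belong to the charsets named in the labels; the byte values are
     left for the reader to inspect.

       use strict;
       use warnings;
       use Encode;

       # One sample character per charset (illustrative choices).
       my @samples = (
           [ "ASCII",      "A"        ],
           [ "JIS X 0201", "\x{FF71}" ],   # halfwidth katakana A
           [ "JIS X 0208", "\x{6F22}" ],   # a common kanji
           [ "JIS X 0212", "\x{4E02}" ],   # an extended kanji
       );

       for my $sample (@samples) {
           my ($charset, $char) = @$sample;
           my $bytes = encode("euc-jp", $char);
           my $hex   = join " ", map { sprintf "%02X", $_ } unpack "C*", $bytes;
           printf "%-10s U+%04X => %s\n", $charset, ord($char), $hex;
       }

     All four characters go through the same C<encode("euc-jp", ...)>
     call, which is the point: the caller names an encoding, never an
     individual charset.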
 207
 208 =head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
 209
 210 This section tries to classify the supported encodings by their
 211 applicability for information exchange over the Internet and to
 212 choose the most suitable aliases to name them in the context of
 213 such communication.
 214
 215 Encoding names
 216
 217   US-ASCII    UTF-8
 218   ISO-8859-*  KOI8-R
 219   Shift_JIS   EUC-JP  ISO-2022-JP ISO-2022-JP-1
 220   EUC-KR
 221   Big5
 222
 223 are registered with IANA (L<http://www.iana.org/assignments/character-sets>)
 224 as preferred MIME names and can probably be used over the Internet.
 225
 226 C<Shift_JIS> is no longer Microsoft proprietary since it has been
 227 officially defined in JIS X 0208-1997.  It is probably the most
 228 widespread encoding for Japanese on the Internet.
 229
 230   EUC-CN
 231
 232 has not been registered with IANA (as of March 2002) but
 233 seems to be supported by major web browsers. (IANA has registered
 234 this encoding as C<GB2312>, but C<gb2312> currently has a different
 235 meaning to the C<Encode> module. It will probably become an alias to
 236 C<EUC-CN> in the future; until then it is safer to avoid using
 237 C<gb2312> as an encoding name within Perl.)
 238
 239   UTF-16
 240   KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)
 241
 242 are IANA-registered (C<UTF-16> even as a preferred MIME name)
 243 but should probably be avoided as encodings for web pages due to
 244 lack of browser support.
 245
 246   ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
 247   GBK
 248   VISCII
 249   GB 12345
 250   GB 18030 (*)  (see links below)
 251   EUC-TW   (*)
 252
 253 are perfectly valid encodings but are not registered with IANA.
 254 The names under which they are listed here are probably the
 255 most widely-known names for these encodings and are recommended
 256 names.
 257
 258
 259 =for comment
 260 These used to be listed as supported but do not work @15457;
 261 once the situation is clear they will be uncommented
 262 or deleted. - Anton
 263 ISO-2022      (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
 264 CNS 11643     (only planes 1 and 2 available)
 265
 266   BIG5PLUS (*)
 267
 268 is a somewhat proprietary name. C<(*)>-marked encodings belong to
 269 C<Encode::HanExtra> available from CPAN.
 270
 271 Some information on CJK encodings can probably be found at:
 272
 273 brief description for most of the mentioned CJK encodings
 274 L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>
 275
 276 several years old, but still useful
 277 L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
 278
 279 and some in-depth reading for the heroes :-)
 280 L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (eq C<ISO-2022>)
 281
 282 gives brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
 283 F<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
 284
 285 The nature of information in this section is most fragile and
 286 error-prone; I<probably> is the most popular adverb :)
 287 Please feel free to send your comments, disagreements and
 288 additions to L<...>.  (Note, however,
 289 that the mission of this document is to cover the
 290 C<Encode>-supported encodings only.)
 291
 292 =head1 See Also
 293
 294 L<Encode>,
 295 L<Encode::Byte>,
 296 L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
 297 L<Encode::EBCDIC>, L<Encode::Symbol>
 298
 299 =cut