ext/Encode/lib/Encode/Supported.pod

=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive.  White space in names
is ignored.  In addition, an encoding may have aliases.
Each encoding has one "canonical" name.  The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence:

       o The MIME name as defined in IETF RFCs.
       o The name in the IANA registry.
       o The name used by the organization that defined it.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses the encoding object internally
once an operation is in progress.
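The alias handling described above can be observed from user code.
The following is a minimal sketch, assuming the C<resolve_alias> and
C<find_encoding> functions exported by Encode; C<canonical_name> is a
hypothetical helper introduced only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# canonical_name() is a hypothetical helper for this sketch: it returns
# the canonical name for an encoding name or alias, or undef if unknown.
sub canonical_name {
    my ($name) = @_;
    my $canonical = resolve_alias($name);
    return $canonical ? $canonical : undef;
}

# Different spellings of one encoding resolve to a single canonical name.
for my $name (qw(latin1 Latin1 ISO-8859-1 no-such-encoding)) {
    my $canonical = canonical_name($name);
    printf "%-18s => %s\n", $name,
        defined $canonical ? $canonical : '(unknown)';
}

# find_encoding() returns the encoding object that Encode works with;
# its name() method reports the canonical name.
my $latin1 = find_encoding('latin1');
print 'object name: ', $latin1->name, "\n";
```

Holding on to the object returned by C<find_encoding> avoids repeating
the alias lookup for every call.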

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.  In
other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available.  Encode.pm automatically loads those modules on demand.
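As a rough illustration of on-demand loading, the sketch below uses an
encoding provided by Encode::JP without ever loading that module
explicitly.  C<hex_bytes> is a hypothetical helper, and the sample
character is an arbitrary choice.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Note: there is no "use Encode::JP" here.  Encode is expected to load
# the module the first time one of its encodings is requested.

# hex_bytes() is a hypothetical helper that renders a byte string as hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $text  = "\x{3042}";                # HIRAGANA LETTER A
my $bytes = encode('euc-jp', $text);   # triggers loading of Encode::JP
print 'euc-jp bytes: ', hex_bytes($bytes), "\n";

my $back = decode('euc-jp', $bytes);
print "round trip ok\n" if $back eq $text;
```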

=head2 Built-in Encodings

The following encodings are always available.

  Canonical     Aliases
  -----------------------
  iso-8859-1    latin1
  US-ascii      ascii
  UCS-2         ucs2, iso-10646-1
  UCS-2le
  UTF-8         utf8
  -----------------------
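To see which encodings a given installation actually provides, the
C<encodings> class method of Encode can be queried.  This is a minimal
sketch; the exact names and counts it prints depend on the installed
version of Encode and may differ from the table above.

```perl
use strict;
use warnings;
use Encode;

# With no argument, encodings() lists the encodings that are already
# loaded, which at startup are the built-in ones.
my @loaded = Encode->encodings();
print 'loaded:    ', join(', ', @loaded), "\n";

# With ":all", it also lists encodings whose modules can be loaded
# on demand.
my @all = Encode->encodings(':all');
print 'available: ', scalar(@all), " encodings\n";
```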

=head2 Encode::Byte

The following encodings are single-byte encodings implemented as
extended ASCII.  In most cases they use \x80-\xff (the upper half) to
map non-ASCII characters.

  -----------------------
  iso-8859-1    latin1
  iso-8859-2    latin2
  iso-8859-3    latin3
  iso-8859-4    latin4
  iso-8859-5    cyrillic
  iso-8859-6    arabic
  iso-8859-7
  iso-8859-8
  iso-8859-9    latin5
  iso-8859-10   latin6
  iso-8859-11
  (iso-8859-12 is nonexistent)
  iso-8859-13   latin7
  iso-8859-14   latin8
  iso-8859-15   latin9
  iso-8859-16   latin10

  koi8-f
  koi8-r
  koi8-u

  viscii        # ASCII + Vietnamese

  cp1250        WinLatin2
  cp1251        WinCyrillic
  cp1252        WinLatin1
  cp1253        WinGreek
  cp1254        WinTurkish
  cp1255        WinHebrew
  cp1256        WinArabic
  cp1257        WinBaltic
  cp1258        WinVietnamese
  # all cp* are also available as ibm-* and ms-*

  maccentraleuropean
  maccroatian
  macroman
  maccyrillic
  macromanian
  macdingbats
  macsami
  macgreek
  macthai
  macicelandic
  macturkish
  macukraine
  -----------------------
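The "extended ASCII" layout can be sketched by decoding the same two
bytes under several of the encodings listed above: the lower-half byte
maps to the same character everywhere, while the upper-half byte
depends on the encoding.  C<code_point> is a hypothetical helper, and
the choice of bytes and encodings is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(decode);

# code_point() is a hypothetical helper: it decodes one byte under the
# given encoding and returns the resulting character as U+XXXX.
sub code_point {
    my ($encoding, $byte) = @_;
    return sprintf 'U+%04X', ord(decode($encoding, $byte));
}

# 0x41 is in the ASCII half; 0xC0 is in the upper half.
for my $encoding (qw(iso-8859-1 iso-8859-5 koi8-r cp1251)) {
    printf "%-11s 0x41 => %s   0xC0 => %s\n",
        $encoding,
        code_point($encoding, "\x41"),
        code_point($encoding, "\xC0");
}
```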

=head2 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
below.  Also note that these encodings are implemented in separate
modules by language, due to size concerns.  See the perldoc of each
module as well.

=over 4

=item Encode::CN -- Continental China

  -----------------------
  cp936      gbk
  euc-cn
  gb12345
  gb2312
  hz
  iso-ir-165
  -----------------------

=item Encode::JP -- Japan

  -----------------------
  7bit-jis        jis
  cp932
  euc-jp          ujis
  iso-2022-jp
  macjapan
  shiftjis        Shift_JIS, sjis
  -----------------------

=item Encode::KR -- Korea

  -----------------------
  euc-kr
  ksc5601
  cp949
  -----------------------

=item Encode::TW -- Taiwan

  -----------------------
  big5
  big5-hkscs
  cp950
  -----------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  -----------------------
  gb18030
  euc-tw
  big5plus
  -----------------------

=back

=head2 Miscellaneous Encodings

=over 4

=item Encode::EBCDIC

See L<perlebcdic> for details.

  -----------------------
  cp1047
  cp37
  posix-bc
  -----------------------

=item Encode::Symbol

For symbols and dingbats.

  -----------------------
  symbol
  dingbats
  -----------------------

=back

=head1 Encoding vs. Charset

Character encoding (or just "encoding") and character set (or just
"charset") are often used interchangeably, but they are different
concepts.

A charset determines which characters may be included in a given text.

An encoding maps the characters of one or more charsets to a stream
of bits.

Note that a given encoding may contain multiple charsets.  For instance,
euc-jp contains ASCII, JIS X 0201 (Hankaku Kana), JIS X 0208 (Zenkaku
Kana and Kanji) and JIS X 0212 (Extended Kanji) in a single encoding.

As the name suggests, the Encode module supports encodings, not
individual charsets.
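The euc-jp case can be sketched by encoding one character from each of
its charsets and counting the bytes produced.  C<euc_jp_length> is a
hypothetical helper, and the sample characters are assumptions chosen
for illustration, in particular the one attributed to JIS X 0212.

```perl
use strict;
use warnings;
use Encode qw(encode);

# euc_jp_length() is a hypothetical helper: the number of bytes that
# euc-jp uses for a single character.
sub euc_jp_length {
    my ($char) = @_;
    return length(encode('euc-jp', $char));
}

my @samples = (
    [ 'ASCII',      "A"        ],   # LATIN CAPITAL LETTER A
    [ 'JIS X 0201', "\x{FF71}" ],   # HALFWIDTH KATAKANA LETTER A
    [ 'JIS X 0208', "\x{3042}" ],   # HIRAGANA LETTER A
    [ 'JIS X 0212', "\x{4E02}" ],   # a kanji assumed to be in JIS X 0212
);

for my $sample (@samples) {
    my ($charset, $char) = @$sample;
    printf "%-10s U+%04X => %d byte(s)\n",
        $charset, ord($char), euc_jp_length($char);
}
```

A single byte stream can therefore mix characters from several
charsets, which is why Encode works in terms of encodings.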

=head1 Encoding Classification (by Anton Tagunov)

The encodings

  US-ASCII    UTF-8       KOI8-R      ISO-8859-*
  ISO-2022-CN ISO-2022-JP Big5
  EUC-CN      EUC-JP      EUC-KR

are registered at F<http://www.iana.org/assignments/character-sets> as
preferred MIME names and can probably be used over the Internet.  So is

  Shift_JIS

Despite its wide use, Shift_JIS once carried the label of being
Microsoft proprietary.  That is no longer the case: Shift JIS is
official as of JIS X 0208-1997.

  UTF-16      KOI8-U

are IANA-registered preferred MIME names but should probably
be avoided as encodings for web pages due to lack of
browser support.

  ISO-2022      (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
  ISO-2022-JP-1 (http://www.faqs.org/rfcs/rfc2237.html)
  ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
  GBK
  VISCII
  GB 12345      (only planes 1 and 2 available)
  GB 18030
  CNS 11643

are perfectly valid encodings but are not registered at IANA.

  BIG5PLUS
  EUC-JP-0212   (Encode::lib::Encode::Tcl::Extended)

are a bit proprietary.
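Where the installed Encode provides the C<mime_name> method on
encoding objects (an assumption; it is not available in early
releases), the preferred MIME name of an encoding can be looked up
programmatically.  C<preferred_mime_name> is a hypothetical helper
for this sketch.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# preferred_mime_name() is a hypothetical helper: it returns the MIME
# name Encode reports for an encoding, or undef if there is none.
sub preferred_mime_name {
    my ($name) = @_;
    my $encoding = find_encoding($name) or return undef;
    return $encoding->mime_name;
}

for my $name (qw(iso-8859-1 euc-jp shiftjis koi8-r)) {
    my $mime = preferred_mime_name($name);
    printf "%-11s => %s\n", $name, defined $mime ? $mime : '(none)';
}
```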

You can probably find some information on CJK encodings at the
following locations.

A brief description of most of the CJK encodings mentioned above:

F<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>

Several years old, but still useful:

F<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

And some in-depth reading for the heroes :-)

F<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (equivalent to ISO-2022)

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>

=cut