ext/Encode/lib/Encode/Supported.pod

   1 =head1 NAME
   2
   3 Encode::Supported -- Encodings supported by Encode
   4
   5 =head1 DESCRIPTION
   6
   7 =head2 Encoding Names
   8
   9 Encoding names are case insensitive.  White space in names
  10 is ignored.  In addition, an encoding may have aliases.
  11 Each encoding has one "canonical" name.  The "canonical"
  12 name is chosen from the names of the encoding by picking
  13 the first in the following sequence:
  14
  15        o The MIME name as defined in IETF RFCs.
  16        o The name in the IANA registry.
  17        o The name used by the organization that defined it.
  18
  19 Because of all the alias issues, and because in the general
  20 case encodings have state, "Encode" uses the encoding
  21 object internally once an operation is in progress.
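
The lookup rules above can be tried out directly.  The following is a minimal sketch, assuming an Encode installation that exports C<find_encoding> by default.  The helper name C<canonical_name> is hypothetical and not part of Encode.  It resolves a name or alias to the encoding object and returns that object's canonical name.

```perl
use strict;
use warnings;
use Encode;    # exports find_encoding() by default

# Hypothetical helper: return the canonical name for a name or alias,
# or undef when Encode does not recognize it.
sub canonical_name {
    my ($name) = @_;
    my $enc = find_encoding($name);    # encoding object, or undef
    return defined $enc ? $enc->name : undef;
}

# Try a few spellings of the same encoding, plus one unknown name.
for my $name ("latin1", "Latin1", "ISO-8859-1", "ISO 8859 1", "no-such-encoding") {
    my $canon = canonical_name($name);
    printf "%-18s => %s\n", $name, defined $canon ? $canon : "(not recognized)";
}
```

Code that converts many strings may prefer to keep the object returned by C<find_encoding> and call its methods, so that the alias lookup happens only once.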
  22
  23 =head2 Supported Encodings
  24
  25 As of Perl 5.8.0, at least the following encodings are recognized.
  26 Note that unless otherwise specified, they are all case insensitive
  27 (via alias) and all occurrences of spaces are replaced with '-'.  In
  28 other words, "ISO 8859 1" and "iso-8859-1" are identical.
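
The set of encodings actually available depends on the installed Encode version, so it can be useful to ask Encode itself.  The following is a minimal sketch using the C<encodings> class method; the variable names are illustrative only.  The lists contain canonical names, not aliases.

```perl
use strict;
use warnings;
use Encode;

# Encodings whose implementation is already loaded.
my @loaded = Encode->encodings();

# Every encoding Encode knows about, including those that live in
# modules which have not been loaded yet (for example the CJK ones).
my @all = Encode->encodings(":all");

printf "%d loaded, %d known in total\n", scalar @loaded, scalar @all;
print "$_\n" for @all;
```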
  29
  30 =head3 ASCII
  31
  32   Canonical     Aliases
  33   -----------------------
  34   ascii         us-ascii
  35
  36 =head3 Unicode
  37
  38   utf8          UTF-8
  39   utf16         UTF-16
  40   ucs2          UCS-2, iso-10646-1
  41
  42 =head3 The ISO 8859, KOI, and other 1-byte encodings
  43
  44 The following encodings are based upon ASCII.  In most cases they use
  45 \x80-\xff (the upper half) to map non-ASCII characters.
  46
  47   iso-8859-1    latin1
  48   iso-8859-2    latin2
  49   iso-8859-3    latin3
  50   iso-8859-4    latin4
  51   iso-8859-5    cyrillic
  52   iso-8859-6    arabic
  53   iso-8859-7
  54   iso-8859-8
  55   iso-8859-9    latin5
  56   iso-8859-10   latin6
  57   iso-8859-11
  58   (iso-8859-12 is nonexistent)
  59   iso-8859-13   latin7
  60   iso-8859-14   latin8
  61   iso-8859-15   latin9
  62   iso-8859-16   latin10
  63
  64   koi8-f
  65   koi8-r
  66   koi8-u
  67
  68   viscii        # ASCII + Vietnamese
  69
  70   cp1250        WinLatin2
  71   cp1251        WinCyrillic
  72   cp1252        WinLatin1
  73   cp1253        WinGreek
  74   cp1254        WinTurkish
  75   cp1255        WinHebrew
  76   cp1256        WinArabic
  77   cp1257        WinBaltic
  78   cp1258        WinVietnamese
  79   # all cp* are also available as ibm-* and ms-*
  80
  81   maccentraleuropean
  82   maccroatian
  83   macroman
  84   maccyrillic
  85   macromanian
  86   macdingbats
  87   macsami
  88   macgreek
  89   macthai
  90   macicelandic
  91   macturkish
  92   macukraine
  93
  94 =head3 The CJK: Chinese, Japanese, Korean (Multibyte)
  95
  96 Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
  97 below.  Also note that these are implemented in a distinct module per
  98 language, due to size concerns.  See the perldoc of each module as well.
  99
 100   cp936      gbk                    # Encode::CN
 101   euc-cn                            # Encode::CN
 102   gb12345                           # Encode::CN
 103   gb2312                            # Encode::CN
 105   hz                                # Encode::CN
 106   iso-ir-165                        # Encode::CN
 107
 108   7bit-jis        jis               # Encode::JP
 109   cp932                             # Encode::JP
 110   euc-jp          ujis              # Encode::JP
 111   iso-2022-jp                       # Encode::JP
 112   macjapan                          # Encode::JP
 113   shiftjis        Shift_JIS, sjis   # Encode::JP
 114
 115   euc-kr                            # Encode::KR
 116   ksc5601                           # Encode::KR
 117   cp949                             # Encode::KR
 118
 119   big5                              # Encode::TW
 120   big5-hkscs                        # Encode::TW
 121   cp950                             # Encode::TW
 122
 123 Due to size concerns, additional Chinese encodings including "GB
 124 18030", "EUC-TW" and "BIG5PLUS" are distributed separately on CPAN,
 125 under the name Encode::HanExtra.
 126
 127 =head3 EBCDIC
 128
 129 See L<perlebcdic> for details.
 130
 131   cp1047
 132   cp37
 133   posix-bc
 134
 135 =head3 Symbols and dingbats
 136
 137   symbol
 138   dingbats
 139
 140 =head1 Encoding vs. Charset
 141
 142 Character encoding (or just "encoding") and Character Set (or just
 143 "charset") are often used interchangeably but they are different
 144 concepts.
 145
 146 A charset determines which characters may be included in a given text.
 147
 148 An encoding maps the characters of one or more charsets to a stream of bits.
 149
 150 Note that a given encoding may contain multiple charsets.  For instance,
 151 euc-jp contains ASCII, JIS X 0201 (Hankaku Kana), JIS X 0208 (Zenkaku
 152 Kana and Kanji) and JIS X 0212 (Extended Kanji) in a single encoding.
 153
 154 As the name suggests, the Encode module supports encodings, not
 155 individual charsets.
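
The euc-jp case can be observed by encoding one character from each of several charsets and counting the resulting bytes.  The following is a minimal sketch, assuming Encode::JP is installed.  The helper name C<eucjp_bytes> and the sample characters are chosen for illustration only.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: number of bytes euc-jp uses for one character.
sub eucjp_bytes {
    my ($char) = @_;
    return length( encode("euc-jp", $char) );
}

# One sample character per charset carried by euc-jp.
my %sample = (
    "ASCII letter A"                  => "A",
    "JIS X 0201 halfwidth katakana A" => "\x{FF71}",
    "JIS X 0208 hiragana A"           => "\x{3042}",
);

for my $label (sort keys %sample) {
    printf "%-34s %d byte(s)\n", $label, eucjp_bytes($sample{$label});
}
```

All three characters pass through the same encoding, yet the byte sequences differ by charset: the ASCII letter takes one byte, and the other two take two bytes each.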
 156
 157 =head1 Encoding Classification (by Anton Tagunov)
 158
 159 Encodings
 160
 161   US-ASCII    UTF-8       KOI8-R      ISO-8859-*
 162   ISO-2022-CN ISO-2022-JP Big5
 163   EUC-CN      EUC-JP      EUC-KR
 164
 165 are registered at <http://www.iana.org/assignments/character-sets> as
 166 preferred MIME names and can probably be used over the Internet.  So is
 167
 168   Shift_JIS
 169
 170 but despite its widespread use it long bore the label of being
 171 Microsoft proprietary.  Shift_JIS is now official as of
 172 JIS X 0208-1997.
 173
 174   UTF-16      KOI8-U
 175
 176 are IANA-registered preferred MIME names but probably
 177 should be avoided as encodings for web pages due to lack of
 178 browser support.
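
When a preferred MIME name is needed, for instance for a C<charset=> parameter, it can be looked up from the encoding object.  The following is a minimal sketch, assuming an Encode release recent enough to provide the C<mime_name> method, which the Encode shipped with Perl 5.8.0 does not have.  The helper name C<preferred_mime_name> is hypothetical.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: preferred MIME name for a name or alias, or undef
# when the encoding is unknown or Encode records no MIME name for it.
sub preferred_mime_name {
    my ($name) = @_;
    my $enc = find_encoding($name) or return undef;
    return $enc->mime_name;    # assumes mime_name() is available
}

for my $name (qw(latin1 euc-jp shiftjis koi8-u)) {
    my $mime = preferred_mime_name($name);
    printf "%-10s => %s\n", $name, defined $mime ? $mime : "(none)";
}
```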
 179
 180   ISO-2022      (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
 181   ISO-2022-JP-1 (http://www.faqs.org/rfcs/rfc2237.html)
 182   ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
 183   GBK
 184   VISCII
 185   GB 12345      (only planes 1 and 2 available)
 186   GB 18030
 187   CNS 11643
 188
 189 are totally valid encodings but not registered at IANA.
 190
 191   BIG5PLUS
 192   EUC-JP-0212   (Encode::lib::Encode::Tcl::Extended)
 193
 194 are a bit proprietary.
 195
 196 More information on CJK encodings can be found at the following:
 197
 198 A brief description of most of the CJK encodings mentioned above:
 199
 200 F<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>
 201
 202 Several years old, but still useful:
 203
 204 F<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
 205
 206 And some in-depth reading for the heroes :-)
 207 F<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (equivalent to ISO-2022)
 208
 209 =head1 See Also
 210
 211 L<Encode>, L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>
 212
 213 =cut