ext/Encode/lib/Encode/Supported.pod

   1 =head1 NAME
   2
   3 Encode::Supported -- Encodings supported by Encode
   4
   5 =head1 DESCRIPTION
   6
   7 =head2 Encoding Names
   8
   9 Encoding names are case insensitive.  White space in names
  10 is ignored.  In addition, an encoding may have aliases.
  11 Each encoding has one "canonical" name.  The "canonical"
  12 name is chosen from the names of the encoding by picking
  13 the first in the following sequence:
  14
  15        o The MIME name as defined in IETF RFCs.
  16        o The name in the IANA registry.
  17        o The name used by the organization that defined it.
  18
  19 Because of all the alias issues, and because in the general case
  20 encodings have state, "Encode" uses the encoding object internally
  21 once an operation is in progress.
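
As a minimal sketch of the naming rules above (assuming the C<find_encoding> function exported by Encode and the C<name>, C<encode> and C<decode> methods of the object it returns), several spellings of one encoding can be resolved to the same canonical name, and the resulting object can then do the conversion itself:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Different spellings and an alias of the same encoding.
my @spellings = ('latin1', 'Latin1', 'ISO-8859-1', 'iso-8859-1');

# find_encoding() returns an encoding object; ->name is its canonical name.
my %canonical = map { $_ => find_encoding($_)->name } @spellings;
printf "%-12s => %s\n", $_, $canonical{$_} for @spellings;

# Once resolved, the object (not the name string) performs the work.
my $enc   = find_encoding('latin1');
my $str   = "caf\x{e9}";            # "cafe" with e-acute
my $bytes = $enc->encode($str);     # one octet per character
my $text  = $enc->decode($bytes);   # back to a character string
printf "%d octets, round trip %s\n",
    length($bytes), ($text eq $str ? 'ok' : 'failed');
```

Holding on to the object avoids repeating the alias lookup for every call, which is one reason to resolve the name once and reuse the result.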
  22
  23 =head1 Supported Encodings
  24
  25 As of Perl 5.8.0, at least the following encodings are recognized.
  26 Note that unless otherwise specified, they are all case insensitive
  27 (via alias) and all occurrences of spaces are replaced with '-'.  In
  28 other words, "ISO 8859 1" and "iso-8859-1" are identical.
  29
  30 Encodings are categorized and implemented in several different modules,
  31 but in most cases you don't have to C<use Encode::XX> to make them
  32 available.  Encode.pm automatically loads those modules on demand.
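
The following sketch illustrates the on-demand loading described above. It decodes EUC-JP octets without an explicit C<use Encode::JP>. The check against C<%INC> is only an illustrative way to observe whether the module was loaded; the exact loading mechanism is an implementation detail and may differ between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(decode);    # note: no "use Encode::JP" here

# Was Encode::JP loaded before we asked for a Japanese encoding?
my $loaded_before = exists $INC{'Encode/JP.pm'} ? 1 : 0;

# "\xA4\xA2" is HIRAGANA LETTER A (U+3042) in EUC-JP.
my $octets = "\xA4\xA2";
my $char   = decode('euc-jp', $octets);

# And after?
my $loaded_after = exists $INC{'Encode/JP.pm'} ? 1 : 0;

printf "decoded U+%04X; Encode::JP loaded before: %d, after: %d\n",
    ord($char), $loaded_before, $loaded_after;
```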
  33
  34 =head2 Built-in Encodings
  35
  36 The following encodings are always available.
  37
  38   Canonical     Aliases                      Comments & References
  39   ----------------------------------------------------------------
  40   iso-8859-1    latin1                                       [ISO]
  41   US-ascii      ascii                                       [ECMA]
  42   UCS-2         ucs2, iso-10646-1                    [IANA, et al]
  43   UCS-2l
  44   UTF-8         utf8                                     [RFC2279]
  45   ----------------------------------------------------------------
  46
  47 =head2 Encode::Byte
  48
  49 The following are single-byte encodings implemented as extensions
  50 of ASCII.  In most cases they use \x80-\xff (the upper half) to map
  51 non-ASCII characters.
  52
  53   ----------------------------------------------------------------
  54   # ISO 8859 series
  55   (iso-8859-1 is built-in)
  56   iso-8859-2    latin2                                       [ISO]
  57   iso-8859-3    latin3                                       [ISO]
  58   iso-8859-4    latin4                                       [ISO]
  59   iso-8859-5                                                 [ISO]
  60   iso-8859-6                                                 [ISO]
  61   iso-8859-7                                                 [ISO]
  62   iso-8859-8                                                 [ISO]
  63   iso-8859-9    latin5                                       [ISO]
  64   iso-8859-10   latin6                                       [ISO]
  65   iso-8859-11
  66   (iso-8859-12 is nonexistent)
  67   iso-8859-13   latin7                                       [ISO]
  68   iso-8859-14   latin8                                       [ISO]
  69   iso-8859-15   latin9                                       [ISO]
  70   iso-8859-16   latin10                                      [ISO]
  71
  72   # Cyrillic
  73   koi8-f
  74   koi8-r                                                 [RFC1489]
  75   koi8-u                                                 [RFC2319]
  76
  77   # Vietnamese
  78   viscii
  79
  80   # all cp* are also available as ibm-*, ms-*, and windows-*
  81   # also see L<http://msdn.microsoft.com/workshop/author/dhtml/reference/charsets/charset4.asp>
  82   cp1250        WinLatin2
  83   cp1251        WinCyrillic
  84   cp1252        WinLatin1
  85   cp1253        WinGreek
  86   cp1254        WinTurkish
  87   cp1255        WinHebrew
  88   cp1256        WinArabic
  89   cp1257        WinBaltic
  90   cp1258        WinVietnamese
  91
  92   # Macintosh
  93   # Also see L<http://developer.apple.com/technotes/tn/tn1150.html>
  94   MacCentralEurRoman
  95   MacCroatian
  96   MacRoman
  97   MacCyrillic
  98   MacRomanian
  99   MacSami
 100   MacGreek
 101   MacThai
 102   MacIcelandic
 103   MacTurkish
 104   MacUkrainian
 105
 106   # More vendor encodings
 107   nextstep
 108   gsm0338       # used in GSM handsets
 109   hp-roman8
 110   ----------------------------------------------------------------
 111
 112 =head2 The CJK: Chinese, Japanese, Korean (Multibyte)
 113
 114 Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
 115 below.  Also note that these are implemented in distinct modules by
 116 language, due to size concerns.  Please also refer to their
 117 respective documentation pages.
 118
 119 =over 4
 120
 121 =item Encode::CN -- Continental China
 122
 123   ----------------------------------------------------------------
 124   cp936      gbk
 125   euc-cn     gb2312
 126   gb12345-raw
 127   gb2312-raw
 128   hz
 129   iso-ir-165
 130   ----------------------------------------------------------------
 131
 132 =item Encode::JP -- Japan
 133
 134   ----------------------------------------------------------------
 135   7bit-jis        jis
 136   cp932           ms_Kanji
 137   euc-jp          ujis
 138   iso-2022-jp                                            [RFC1468]
 139   iso-2022-jp-1                                          [RFC2237]
 140   macJapan
 141   shiftjis        Shift_JIS, sjis
 142   ----------------------------------------------------------------
 143
 144 =item Encode::KR -- Korea
 145
 146   ----------------------------------------------------------------
 147   euc-kr
 148   cp949         ks_c_5601-1987 x-windows-949 uhc
 149   iso-2022-kr                                            [RFC1557]
 150   johab
 151   ksc5601-raw
 152   ----------------------------------------------------------------
 153
 154 =item Encode::TW -- Taiwan
 155
 156   ----------------------------------------------------------------
 157   big5
 158   big5-hkscs
 159   cp950
 160   ----------------------------------------------------------------
 161
 162 =item Encode::HanExtra -- More Chinese via CPAN
 163
 164 Due to size concerns, additional Chinese encodings below are
 165 distributed separately on CPAN, under the name Encode::HanExtra.
 166
 167   ----------------------------------------------------------------
 168   gb18030
 169   euc-tw
 170   big5plus
 171   ----------------------------------------------------------------
 172
 173 =back
 174
 175 =head2 Miscellaneous encodings
 176
 177 =over 4
 178
 179 =item Encode::EBCDIC
 180
 181 See L<perlebcdic> for details.
 182
 183   ----------------------------------------------------------------
 184   cp1047
 185   cp37
 186   posix-bc
 187   ----------------------------------------------------------------
 188
 189 =item Encode::Symbol
 190
 191 For symbols and dingbats.
 192
 193   ----------------------------------------------------------------
 194   symbol
 195   dingbats
 196   macDingbats
 197   ----------------------------------------------------------------
 198
 199 =back
 200
 201 =head1 Unsupported encodings
 202
 203 The following are not supported as yet.  Some because they are rarely
 204 used, some because of technical difficulty.  They may be supported by
 205 external modules via CPAN in the future, however.
 206
 207 =over 4
 208
 209 =item   ISO-2022-JP-2 [RFC1554]
 210
 211 Not very popular yet.  Needs Unicode Database or equivalent to
 212 implement encode() (because it includes JIS X 0208/0212, KSC5601, and
 213 GB2312 simultaneously, whose code points in Unicode overlap, so you
 214 need to look up the database to determine which character set a given
 215 Unicode character should belong to).
 216
 217 =item   ISO-2022-CN [RFC1922]
 218
 219 Not very popular.  Needs CNS 11643-1 and 2 which are not available in
 220 this module.  CNS 11643 is supported (via euc-tw) in
 221 Encode::HanExtra.  Autrijus may add support for this encoding in his
 222 module in the future.
 223
 224 =item various HP-UX encodings
 225
 226 The following are unsupported due to the lack of mapping data.
 227
 228   '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
 229   '15' - japanese15, korean15, and  roi15
 230
 231 =item Cyrillic encoding ISO-IR-111
 232
 233 Anton doubts its usefulness.
 234
 235 =item ISO-8859-8-1 [Hebrew]
 236
 237 None of the Encode team knows Hebrew well enough.  Contributions welcome.
 238
 239 =item Thai encoding TCVN
 240
 241 Ditto.
 242
 243 =item Vietnamese encodings VPS
 244
 245 Ditto.
 246
 247 =item various Mac encodings
 248
 249 The following are unsupported due to the lack of mapping data.  The
 250 "Mac" prefix of each encoding name is omitted.
 251
 252  Arabic, Armenian, Bengali, Burmese
 253  ChineseSimp, ChineseTrad, Devanagari, Ethiopic, ExtArabic
 254  Farsi, Georgian, Gujarati, Gurmukhi, Hebrew
 255  Kannada, Khmer, Korean, Laotian, Malayalam, Mongolian
 256  Oriya, Sinhalese, Symbol, Tamil, Telugu, Tibetan, Vietnamese
 257
 258 The Mac encodings that are already available are based upon the vendor
 259 mappings available at L<http://www.unicode.org/>
 260
 261 =back
 262
 263 =head1 Encoding vs. Charset
 264
 265 Character encoding (or just "encoding") and Character Set (or just
 266 "charset") are often used interchangeably but they are different
 267 concepts.
 268
 269 =over 2
 270
 271 =item Character I<Set> (I<charset> for short)
 272
 273 Is a collection of characters in which each character is identified
 274 by a unique ID (in most cases, the ID is a number).
 275
 276 =item Character I<Encoding>
 277
 278 Is a way to represent character set(s) in a stream of bits.
 279
 280 =back
 281
 282 A character encoding may contain a single character set
 283 (e.g. US-ascii) or multiple character sets (e.g. EUC-JP, which
 284 contains US-ascii, JIS X 0201 Kana, JIS X 0208 and JIS X 0212).
 285
 286 A character encoding may also encode a character set as-is (also called
 287 a I<raw> encoding, e.g. US-ascii) or processed (e.g. in EUC-JP, US-ascii
 288 is as-is, JIS X 0201 is prefixed with \x8E, JIS X 0208 has 0x8080 added,
 289 and JIS X 0212 has 0x8080 added and is then prefixed with \x8F).
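
The EUC-JP rules above can be checked with a short worked example. This is a sketch only: the sample characters (U+3042 with raw JIS X 0208 code 0x2422, and U+FF71 with JIS X 0201 code 0xB1) were picked for illustration, and the expected octets are built by hand from the rules in the text and then compared with what C<encode> produces.

```perl
use strict;
use warnings;
use Encode qw(encode);

# US-ascii is stored as-is.
my $expected_ascii = 'A';

# JIS X 0208: HIRAGANA LETTER A has the raw code 0x2422; add 0x8080.
my $expected_0208 = pack('n', 0x2422 + 0x8080);    # octets A4 A2

# JIS X 0201 Kana: HALFWIDTH KATAKANA LETTER A is 0xB1; prefix with \x8E.
my $expected_0201 = "\x8E\xB1";

my $got_ascii = encode('euc-jp', 'A');
my $got_0208  = encode('euc-jp', "\x{3042}");
my $got_0201  = encode('euc-jp', "\x{FF71}");

for my $pair ( [ 'US-ascii',   $expected_ascii, $got_ascii ],
               [ 'JIS X 0208', $expected_0208,  $got_0208 ],
               [ 'JIS X 0201', $expected_0201,  $got_0201 ] ) {
    my ($label, $want, $got) = @$pair;
    printf "%-10s want %s got %s\n",
        $label, unpack('H*', $want), unpack('H*', $got);
}
```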
 290
 291 As the name suggests, the Encode module supports encodings, not
 292 individual charsets.
 293
 294 However, the word I<charset> is casually used, even by the Internet
 295 Assigned Numbers Authority (IANA), to actually mean I<encoding>.  Encode
 296 tries to smooth over this confusion via aliases.  For instance,
 297 C<gb2312> is aliased to C<euc-cn>, while the "raw" encoded version is
 298 available as C<gb2312-raw>.
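
A small sketch of that alias in practice, assuming Encode::CN is installed alongside Encode. The sample character U+4E2D is an illustrative choice; its raw GB 2312 code is 0x5650, and EUC-CN sets the high bit of each octet.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# "gb2312" resolves to the EUC-coded form; "gb2312-raw" is a separate encoding.
my $cooked_name = find_encoding('gb2312')->name;
my $raw_name    = find_encoding('gb2312-raw')->name;

# Encode the same character both ways and compare the octets.
my $cooked = encode('gb2312',     "\x{4E2D}");
my $raw    = encode('gb2312-raw', "\x{4E2D}");

printf "%-10s (%s): %s\n", 'gb2312',     $cooked_name, unpack('H*', $cooked);
printf "%-10s (%s): %s\n", 'gb2312-raw', $raw_name,    unpack('H*', $raw);
```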
 299
 300 =head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
 301
 302 This section tries to classify the supported encodings by their
 303 applicability for information exchange over the Internet and to
 304 choose the most suitable aliases to name them in the context of
 305 such communication.
 306
 307 =over 2
 308
 309 =item *
 310
 311 To (en|de)code encodings marked with C<(*)>, you need C<Encode::HanExtra>,
 312 available from CPAN.
 313
 314 =back
 315
 316 Encoding names
 317
 318   US-ASCII    UTF-8     ISO-8859-*  KOI8-R
 319   Shift_JIS   EUC-JP  ISO-2022-JP ISO-2022-JP-1
 320   EUC-KR      Big5
 321
 322 are registered with IANA as preferred MIME names and can probably be used over the Internet.
 323
 324 C<Shift_JIS> is no longer Microsoft proprietary since it has been
 325 made official by JIS X 0208-1997.
 326
 327   EUC-CN
 328
 329 has not been registered with IANA (as of March 2002) but
 330 seems to be supported by major web browsers.  In Encode, GB2312
 331 is aliased to EUC-CN, with the "uncooked" version of GB2312 canonicalized
 332 as gb2312-raw.  See L<Encode::CN> for details.
 333
 334   KS_C_5601-1987
 335
 336 has been registered with IANA, but when it is used, it is
 337 EUC-coded.  The Internet community in Korea is not happy with this,
 338 so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version
 339 of C<euc-kr>, with ksc5601-raw for the "uncooked" version.
 340
 341   UTF-16
 342   KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)
 343
 344 are IANA-registered (C<UTF-16> even as a preferred MIME name)
 345 but should probably be avoided as encodings for web pages due to
 346 the lack of browser support.
 347
 348   ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
 349   GBK
 350   VISCII
 351   GB 12345
 352   GB 18030 (*)  (see links below)
 353   EUC-TW   (*)
 354
 355 are totally valid encodings but not registered at IANA.
 356 The names under which they are listed here are probably the
 357 most widely-known names for these encodings and are recommended
 358 names.
 359
 360   BIG5PLUS (*)
 361
 362 is a somewhat proprietary name.
 363
 364 =head1 Bookmarks
 365
 366 =over 2
 367
 368 =item Assigned Charset Names by IANA
 369
 370 L<http://www.iana.org/assignments/character-sets>
 371
 372 Most of the C<canonical names> in Encode derive from this list,
 373 so you can directly apply the string you have extracted from the MIME
 374 headers of mails and web pages.
 375
 376 =item CJK.inf
 377
 378 L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
 379
 380 Somewhat obsolete (last update in 1996), but still useful.  Also try
 381
 382 L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
 383
 384 You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.
 385
 386 =item ECMA-035 (eq C<ISO-2022>)
 387
 388 L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
 389
 390 The very specification of ISO-2022 is available from the link above.
 391
 392 =back
 393
 394 =head1 See Also
 395
 396 L<Encode>,
 397 L<Encode::Byte>,
 398 L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
 399 L<Encode::EBCDIC>, L<Encode::Symbol>
 400
 401 =cut
 402
 403 I could not find this page because the hostname doesn't resolve!
 404
 405  Brief description for most of the mentioned CJK encodings
 406 L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>