=head1 NAME

Encode::Supports -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored.  In addition an encoding may have aliases.
Each encoding has one "canonical" name.  The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence:

       o The MIME name as defined in IETF RFCs.
       o The name in the IANA registry.
       o The name used by the organization that defined it.

Because of all the alias issues, and because in the general case 
encodings have state, "Encode" uses the encoding object internally 
once an operation is in progress.
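
As a minimal sketch of the name lookup described above, the following
resolves an alias to its canonical name and obtains the encoding object.
It assumes the C<find_encoding> and C<resolve_alias> functions of
C<Encode>, which this document does not otherwise name.

  use strict;
  use warnings;
  use Encode qw(find_encoding resolve_alias);

  # Look up the encoding object through an alias; lookup ignores case.
  my $obj = find_encoding("Latin1")
      or die "latin1 is not available\n";

  # The encoding object reports its canonical name.
  print $obj->name, "\n";

  # resolve_alias() gives the canonical name, or false if unknown.
  my $canon = resolve_alias("latin1");
  print $canon ? $canon : "(unknown)", "\n";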

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.  In
other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available.  Encode.pm automatically loads those modules on demand.
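
The on-demand loading described above can be sketched as follows.  This
is an illustration only; it assumes the C<encode> function and the
C<< Encode->encodings(":all") >> class method, neither of which is
described in this document.

  use strict;
  use warnings;
  use Encode qw(encode);

  # No "use Encode::JP" here: Encode loads the module when it is needed.
  my $bytes = encode("euc-jp", "\x{3042}");   # HIRAGANA LETTER A

  printf "%d bytes\n", length $bytes;

  # Names known to Encode, including encodings that are not loaded yet.
  my @all = Encode->encodings(":all");
  printf "%d encodings available\n", scalar @all;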

=head2 Built-in Encodings

The following encodings are always available.

  Canonical	Aliases
  -----------------------
  iso-8859-1	latin1
  US-ascii	ascii
  UCS-2		ucs2, iso-10646-1
  UCS-2le
  UTF-8		utf8
  -----------------------

=head2 Encode::Byte

The following encodings are single-byte encodings implemented as
extended ASCII.  In most cases they use \x80-\xff (the upper half) to
map non-ASCII characters.

  -----------------------
  (iso-8859-1	is in built-in)
  iso-8859-2	latin2
  iso-8859-3	latin3
  iso-8859-4	latin4
  iso-8859-5
  iso-8859-6
  iso-8859-7
  iso-8859-8
  iso-8859-9	latin5
  iso-8859-10	latin6
  iso-8859-11
  (iso-8859-12 is nonexistent)
  iso-8859-13   latin7
  iso-8859-14	latin8
  iso-8859-15	latin9
  iso-8859-16	latin10

  koi8-f
  koi8-r
  koi8-u

  viscii	# ASCII + vietnamese

  cp1250	WinLatin2
  cp1251	WinCyrillic
  cp1252	WinLatin1
  cp1253	WinGreek
  cp1254	WinTurkish
  cp1255	WinHebrew
  cp1256	WinArabic
  cp1257	WinBaltic
  cp1258	WinVietnamese
  # all cp* are also available as ibm-* and ms-*

  maccentraleuropean  
  maccroatian
  macroman
  maccyrillic
  macromanian
  macsami
  macgreek 
  macthai
  macicelandic    
  macturkish
  macukraine

  nextstep
  gsm0338	# used in GSM handsets
  roman8	# HP Roman-8, used on Hewlett-Packard systems
  -----------------------
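
As a small sketch of what "extended ASCII" means for the encodings
listed above, the following encodes one non-ASCII character and one
ASCII character with C<iso-8859-2>.  It assumes the C<encode> and
C<decode> functions of C<Encode>; the sample characters are chosen for
illustration only.

  use strict;
  use warnings;
  use Encode qw(encode decode);

  # U+0141 LATIN CAPITAL LETTER L WITH STROKE maps into the upper half.
  my $latin2 = encode("iso-8859-2", "\x{0141}");
  printf "iso-8859-2:    0x%02X\n", ord $latin2;

  # ASCII characters keep their ASCII byte values.
  my $ascii = encode("iso-8859-2", "A");
  printf "ASCII 'A':     0x%02X\n", ord $ascii;

  # The same byte is a different character under another encoding.
  my $char = decode("iso-8859-1", $latin2);
  printf "as iso-8859-1: U+%04X\n", ord $char;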

=head2 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
below.  These encodings are implemented in separate modules by
language, due to size concerns.  Please also refer to their
respective documentation pages.

=over 4

=item Encode::CN -- Continental China

  -----------------------
  cp936      gbk		    
  euc-cn
  gb12345
  gb2312
  hz
  iso-ir-165
  -----------------------

=item Encode::JP -- Japan

  -----------------------
  7bit-jis	  jis
  cp932
  euc-jp	  ujis
  iso-2022-jp
  iso-2022-jp-1
  macjapan
  shiftjis	  Shift_JIS, sjis
  -----------------------

=item Encode::KR -- Korea

  -----------------------
  euc-kr
  ksc5601
  cp949
  -----------------------

=item Encode::TW -- Taiwan

  -----------------------
  big5
  big5-hkscs
  cp950
  -----------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  -----------------------
  gb18030
  euc-tw
  big5plus
  -----------------------

=back

=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See perlebcdic for details.

  -----------------------
  cp1047
  cp37
  posix-bc
  -----------------------

=item Encode::Symbols

For symbols and dingbats.

  -----------------------
  symbol
  dingbats
  macdingbats
  -----------------------

=back

=head1 Encoding vs. Charset

Character encoding (or just "encoding") and Character Set (or just
"charset") are often used interchangeably but they are different
concepts.

A charset determines which characters may be included in a given text.

An encoding maps the characters of one or more charsets to a stream
of bits.

Note that a given encoding may contain multiple charsets; complex CJK
encodings are usually implemented that way.

For instance, euc-jp contains ASCII, JIS X 0201-1978 (Hankaku Kana),
JIS X 0208-1997 (Zenkaku Kana and Kanji) and JIS X 0212-1990 (Extended
Kanji) in a single encoding.
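
The euc-jp case can be sketched as follows: one character from each of
three of these charsets is encoded, and the resulting bytes are printed
in hexadecimal.  The sample characters and the C<encode> function are
assumptions made for this illustration.

  use strict;
  use warnings;
  use Encode qw(encode);

  # One character from each of three charsets carried by euc-jp.
  my @samples = (
      [ "ASCII",      "A"        ],
      [ "JIS X 0201", "\x{FF71}" ],   # HALFWIDTH KATAKANA LETTER A
      [ "JIS X 0208", "\x{6F22}" ],   # a kanji
  );

  for my $s (@samples) {
      my ($charset, $char) = @$s;
      my $bytes = encode("euc-jp", $char);
      my $hex   = join " ", map { sprintf "%02X", ord($_) } split //, $bytes;
      printf "%-10s %s\n", $charset, $hex;
  }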

As the name suggests, the Encode module supports encodings, not
individual charsets.

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their 
applicability for information exchange over the Internet and to 
choose the most suitable aliases to name them in the context of 
such communication.

Encoding names

  US-ASCII    UTF-8       
  ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP  ISO-2022-JP ISO-2022-JP-1
  EUC-KR 
  Big5

are registered with IANA (L<http://www.iana.org/assignments/character-sets>)
as preferred MIME names and can probably be used over the Internet.

C<Shift_JIS> is no longer Microsoft proprietary since it has been
made official by JIS X 0208-1997. It is probably the most widespread
encoding for Japanese on the Internet.

  EUC-CN

has not been registered with IANA (as of March 2002) but
seems to be supported by major web browsers. (IANA has registered
this encoding as C<GB2312>, but C<gb2312> currently has a different
meaning to the C<Encode> module. It will probably become an alias for
C<EUC-CN> in the future; until then it is safer to avoid using
C<gb2312> as an encoding name within Perl.)
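
A minimal sketch of that advice, assuming the C<encode> and
C<resolve_alias> functions of C<Encode>: encode with the unambiguous
name, and ask the installed module what C<gb2312> currently means
rather than relying on it.

  use strict;
  use warnings;
  use Encode qw(encode resolve_alias);

  # Use the unambiguous name when encoding simplified Chinese text.
  my $bytes = encode("euc-cn", "\x{4E2D}");
  printf "%d bytes\n", length $bytes;

  # What "gb2312" means depends on the installed Encode version.
  my $meaning = resolve_alias("gb2312");
  print "gb2312 resolves to: ", ($meaning || "(nothing)"), "\n";

Using C<euc-cn> keeps the code independent of how C<gb2312> is resolved.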

  UTF-16 
  KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)

are IANA-registered (C<UTF-16> even as a preferred MIME name)
but probably should be avoided as encoding for web pages due to 
lack of browser support.

  ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
  GBK
  VISCII
  GB 12345
  GB 18030 (*)  (see links below)
  EUC-TW   (*)

are perfectly valid encodings but are not registered with IANA.
The names under which they are listed here are probably the
most widely known names for these encodings and are the recommended
names.


=begin comment

These used to be listed as supported but do not work @15457.
Kept here until it's clear whether they will be uncommented
or deleted - Anton

  ISO-2022      (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
  CNS 11643     (only planes 1 and 2 available)

=end comment

  BIG5PLUS (*)

is a somewhat proprietary name. Encodings marked C<(*)> belong to
C<Encode::HanExtra>, available from CPAN.
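
Since these encodings are optional, a program may want to check for
them before use.  The following is a sketch under the assumption that
C<find_encoding> returns an undefined value for a name that C<Encode>
cannot provide:

  use strict;
  use warnings;
  use Encode qw(find_encoding);

  # find_encoding() yields undef for names Encode cannot provide,
  # for example when Encode::HanExtra is not installed.
  for my $name (qw(gb18030 euc-tw big5plus)) {
      my $obj = find_encoding($name);
      printf "%-10s %s\n", $name,
          $obj ? "available as " . $obj->name : "not available";
  }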

You can probably find some information on CJK encodings at the
following locations:

=over 4

=item L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>

A brief description of most of the CJK encodings mentioned here.

=item L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Several years old, but still useful.

=item L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (equivalent to C<ISO-2022>)

Some in-depth reading for the heroes :-)

=item F<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

Gives brief information on C<EUC-CN> and C<GBK>, and mostly on C<GB 18030>.

=back

The information in this section is by nature fragile and
error-prone; I<probably> is the most popular adverb :)
Please feel free to send your comments, disagreements and
additions to L<...>. (Note, however,
that the mission of this document is to cover the
C<Encode>-supported encodings only.)

=head1 See Also

L<Encode>, 
L<Encode::Byte>, 
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>

=cut