=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive, and white space in names
is ignored.  In addition, an encoding may have aliases.
Each encoding has one "canonical" name.  The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence:

       o The MIME name as defined in IETF RFCs.
       o The name in the IANA registry.
       o The name used by the organization that defined it.

Because of all the alias issues, and because in the general case 
encodings have state, "Encode" uses the encoding object internally 
once an operation is in progress.
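
As a minimal sketch of how aliases resolve to a canonical name, the
following uses C<find_encoding> from C<Encode> and the C<name> method of
the encoding object it returns.  The helper C<canonical_name> is
hypothetical and exists only for this illustration; the exact canonical
spelling reported may differ between versions of C<Encode>.

    use strict;
    use warnings;
    use Encode qw(find_encoding);

    # canonical_name() is a hypothetical helper, not part of Encode.
    # It returns the canonical name for an alias, or undef if the
    # name is not recognized.
    sub canonical_name {
        my ($name) = @_;
        my $enc = find_encoding($name);
        return defined $enc ? $enc->name : undef;
    }

    for my $name ('latin1', 'Latin1', 'ISO-8859-1', 'ujis') {
        my $canonical = canonical_name($name);
        printf "%-12s => %s\n", $name,
            defined $canonical ? $canonical : '(not recognized)';
    }

Checking for C<undef> before calling C<name> keeps the sketch safe for
names that a given installation does not recognize.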

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.  In
other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available.  Encode.pm will automatically load those modules on demand.
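
The on-demand loading described above can be sketched as follows,
assuming C<Encode::JP> is installed alongside C<Encode>.  Only C<Encode>
itself is loaded explicitly; the check of C<%INC> at the end is just one
way to observe whether the implementing module was pulled in.

    use strict;
    use warnings;
    use Encode qw(encode decode);

    # No "use Encode::JP" here: Encode is expected to load the module
    # that implements euc-jp the first time the name is looked up.
    my $text  = "\x{65E5}\x{672C}";    # two kanji characters
    my $bytes = encode('euc-jp', $text);
    my $back  = decode('euc-jp', $bytes);

    print "round trip ok\n" if $back eq $text;
    print "Encode::JP loaded: ",
        (exists $INC{'Encode/JP.pm'} ? 'yes' : 'no'), "\n";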

=head2 Built-in Encodings

The following encodings are always available.

  Canonical	Aliases
  -----------------------
  iso-8859-1	latin1
  US-ascii	ascii
  UCS-2		ucs2, iso-10646-1
  UCS-2le
  UTF-8		utf8
  -----------------------

=head2 Encode::Byte

The following encodings are single-byte encodings implemented as
extensions of ASCII.  In most cases they use \x80-\xff (the upper half)
to map non-ASCII characters.

  -----------------------
  (iso-8859-1	is in built-in)
  iso-8859-2	latin2
  iso-8859-3	latin3
  iso-8859-4	latin4
  iso-8859-5
  iso-8859-6
  iso-8859-7
  iso-8859-8
  iso-8859-9	latin5
  iso-8859-10	latin6
  iso-8859-11
  (iso-8859-12 is nonexistent)
  iso-8859-13   latin7
  iso-8859-14	latin8
  iso-8859-15	latin9
  iso-8859-16	latin10

  koi8-f
  koi8-r
  koi8-u

  viscii	# ASCII + Vietnamese

  cp1250	WinLatin2
  cp1251	WinCyrillic
  cp1252	WinLatin1
  cp1253	WinGreek
  cp1254	WinTurkish
  cp1255	WinHebrew
  cp1256	WinArabic
  cp1257	WinBaltic
  cp1258	WinVietnamese
  # all cp* are also available as ibm-* and ms-*

  maccentraleuropean  
  maccroatian
  macroman
  maccyrillic
  macromanian
  macsami
  macgreek 
  macthai
  macicelandic    
  macturkish
  macukraine

  nextstep
  gsm0338	# used in GSM handsets
  roman8	# HP Roman-8
  -----------------------
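
To illustrate the "upper half" mapping used by these single-byte
encodings, here is a short sketch that encodes one Cyrillic letter in
three of the encodings listed above and prints the resulting byte.  The
choice of character and encodings is arbitrary and made only for this
example.

    use strict;
    use warnings;
    use Encode qw(encode);

    # CYRILLIC SMALL LETTER A, a character outside ASCII.
    my $char = "\x{0430}";

    for my $name ('iso-8859-5', 'koi8-r', 'cp1251') {
        my $bytes = encode($name, $char);
        printf "%-10s %d byte(s): 0x%02X\n",
            $name, length($bytes), ord($bytes);
    }

    # ASCII characters keep their usual single-byte values.
    printf "%-10s 0x%02X\n", 'koi8-r', ord(encode('koi8-r', 'A'));

Each encoding should produce a single byte in the \x80-\xff range for
the Cyrillic letter, though the byte value differs from one encoding to
the next.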

=head2 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
below.  Also note that these are implemented in a distinct module for
each language, due to size concerns.  Please also refer to their
respective documentation pages.

=over 4

=item Encode::CN -- Mainland China

  -----------------------
  cp936      gbk		    
  euc-cn
  gb12345
  gb2312
  hz
  iso-ir-165
  -----------------------

=item Encode::JP -- Japan

  -----------------------
  7bit-jis	  jis
  cp932
  euc-jp	  ujis
  iso-2022-jp
  iso-2022-jp-1
  macjapan
  shiftjis	  Shift_JIS, sjis
  -----------------------

=item Encode::KR -- Korea

  -----------------------
  euc-kr
  ksc5601
  cp949
  -----------------------

=item Encode::TW -- Taiwan

  -----------------------
  big5
  big5-hkscs
  cp950
  -----------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  -----------------------
  gb18030
  euc-tw
  big5plus
  -----------------------

=back

=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See L<perlebcdic> for details.

  -----------------------
  cp1047
  cp37
  posix-bc
  -----------------------

=item Encode::Symbol

For symbols and dingbats.

  -----------------------
  symbol
  dingbats
  macdingbats
  -----------------------

=back

=head1 Encoding vs. Charset

Character encoding (or just "encoding") and character set (or just
"charset") are often used interchangeably, but they are different
concepts.

A charset determines which characters may be included in a given text.

An encoding maps the characters of one or more charsets to a stream
of bits.

Note that a given encoding may contain multiple charsets; complex CJK
encodings are usually implemented that way.

For instance, euc-jp contains ASCII, JIS X 0201-1978 (Hankaku Kana),
JIS X 0208-1997 (Zenkaku Kana and Kanji) and JIS X 0212-1990 (Extended
Kanji) in a single encoding.

As the name suggests, the Encode module supports encodings, not
individual charsets.
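
The euc-jp case can be made concrete with a small sketch that encodes
one character from each of three of the charsets named above and prints
the resulting bytes in hex.  The helper C<euc_jp_hex> and the sample
characters are assumptions chosen for this illustration only.

    use strict;
    use warnings;
    use Encode qw(encode);

    # euc_jp_hex() is a hypothetical helper, not part of Encode.
    # It returns the euc-jp bytes of a string as space-separated hex.
    sub euc_jp_hex {
        my ($string) = @_;
        my $bytes = encode('euc-jp', $string);
        return join ' ', map { sprintf '%02X', ord } split //, $bytes;
    }

    my @samples = (
        [ 'ASCII letter A',       "A"        ],
        [ 'halfwidth katakana A', "\x{FF71}" ],
        [ 'kanji U+65E5',         "\x{65E5}" ],
    );

    for my $sample (@samples) {
        my ($label, $char) = @$sample;
        printf "%-22s %s\n", $label, euc_jp_hex($char);
    }

The byte sequences differ in length and in their leading byte depending
on which charset the character comes from, which is how a single
encoding can carry several charsets at once.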

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their 
applicability for information exchange over the Internet and to 
choose the most suitable aliases to name them in the context of 
such communication.

Encoding names

  US-ASCII    UTF-8       
  ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP  ISO-2022-JP ISO-2022-JP-1
  EUC-KR 
  Big5

are registered at L<http://www.iana.org/assignments/character-sets> as
preferred MIME names and can probably be used over the Internet.

C<Shift_JIS> is no longer Microsoft proprietary since it has been
standardized by JIS X 0208-1997.  It is probably the most widespread
encoding for Japanese on the Internet.

  EUC-CN

has not been registered with IANA (as of March 2002) but
seems to be supported by major web browsers.  (IANA has registered
this encoding as C<GB2312>, but C<gb2312> currently has a different
meaning to the C<Encode> module.  It will probably become an alias for
C<EUC-CN> in the future; until then it is safer to avoid using
C<gb2312> as an encoding name within Perl.)

  UTF-16 
  KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)

are IANA-registered (C<UTF-16> even as a preferred MIME name)
but should probably be avoided as an encoding for web pages due to
lack of browser support.

  ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
  GBK
  VISCII
  GB 12345
  GB 18030 (*)  (see links below)
  EUC-TW   (*)

are perfectly valid encodings but are not registered at IANA.
The names under which they are listed here are probably the
most widely known names for these encodings and are the recommended
names.


=begin comment

These used to be listed as supported but do not work @15457.
They stay here until it is clear whether they will be uncommented
or deleted - Anton

  ISO-2022      (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
  CNS 11643     (only planes 1 and 2 available)

=end comment

  BIG5PLUS (*)

is a somewhat proprietary name.  C<(*)>-marked encodings belong to
C<Encode::HanExtra>, available from CPAN.

More information on CJK encodings can probably be found at the
following locations:

=over 4

=item L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>

A brief description of most of the CJK encodings mentioned here.

=item L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Several years old, but still useful.

=item L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

Some in-depth reading for the heroes :-) (equivalent to C<ISO-2022>).

=item F<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

Gives brief info on C<EUC-CN> and C<GBK>, and mostly on C<GB 18030>.

=back

The information in this section is most fragile and
error-prone; I<probably> is the most popular adverb :)
Please feel free to send your comments, disagreements and
additions to L<...>.  (Note, however,
that the mission of this document is to cover the
C<Encode>-supported encodings only.)

=head1 See Also

L<Encode>, 
L<Encode::Byte>, 
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>

=cut