=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive, and white space in names
is ignored.  In addition, an encoding may have aliases.
Each encoding has one "canonical" name, which is chosen from
the names of the encoding by picking the first that applies
in the following sequence:

       o The MIME name as defined in IETF RFCs.
       o The name in the IANA registry.
       o The name used by the organization that defined it.

Because of all the alias issues, and because in the general case 
encodings have state, "Encode" uses the encoding object internally 
once an operation is in progress.
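
The alias handling described above can be sketched with
C<find_encoding>, which returns the encoding object for a name or
alias.  This is only an illustration; the sample names and the test
string are arbitrary choices, not part of the original text.

    use strict;
    use warnings;
    use Encode qw(find_encoding);

    # Look up the same encoding by alias and in different letter cases.
    for my $name (qw(latin1 Latin1 ISO-8859-1)) {
        my $enc = find_encoding($name)
            or die "unknown encoding: $name";
        printf "%s => %s\n", $name, $enc->name;
    }
    # latin1 => iso-8859-1
    # Latin1 => iso-8859-1
    # ISO-8859-1 => iso-8859-1

    # The object itself can do the conversion.
    my $bytes = find_encoding("latin1")->encode("\x{e9}");
    printf "%d byte(s)\n", length $bytes;    # 1 byte(s)

Whichever name is passed in, C<< $enc->name >> reports the canonical
name.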

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.  In
other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you do not have to C<use Encode::XX> to make them
available.  Encode.pm automatically loads those modules on demand.
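
As a minimal sketch of that on-demand loading, the following script
converts a string to euc-jp without an explicit C<use Encode::JP>.
The sample character is an arbitrary choice, and the sketch assumes
a standard Encode installation that includes Encode::JP.

    use strict;
    use warnings;
    use Encode qw(encode decode);    # note: no "use Encode::JP"

    my $text  = "\x{3042}";                  # HIRAGANA LETTER A
    my $bytes = encode("euc-jp", $text);     # Encode::JP is loaded here
    print unpack("H*", $bytes), "\n";        # a4a2

    print "round trip ok\n" if decode("euc-jp", $bytes) eq $text;

    # ":all" also lists encodings whose modules are not loaded yet.
    my @available = Encode->encodings(":all");
    printf "%d encodings available\n", scalar @available;

An explicit C<use Encode::JP> does no harm, but it should rarely be
needed.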

=head2 Built-in Encodings

The following encodings are always available.

  Canonical	Aliases
  -----------------------
  iso-8859-1	latin1
  US-ascii	ascii
  UCS-2		ucs2, iso-10646-1
  UCS-2le
  UTF-8		utf8
  -----------------------

=head2 Encode::Byte

The following encodings are single-byte encodings implemented as
extended ASCII.  In most cases they use \x80-\xff (the upper half) to
map non-ASCII characters.

  -----------------------
  iso-8859-1	latin1
  iso-8859-2	latin2
  iso-8859-3	latin3
  iso-8859-4	latin4
  iso-8859-5
  iso-8859-6
  iso-8859-7
  iso-8859-8
  iso-8859-9	latin5
  iso-8859-10	latin6
  iso-8859-11
  (iso-8859-12 is nonexistent)
  iso-8859-13   latin7
  iso-8859-14	latin8
  iso-8859-15	latin9
  iso-8859-16	latin10

  koi8-f
  koi8-r
  koi8-u

  viscii	# ASCII + Vietnamese

  cp1250	WinLatin2
  cp1251	WinCyrillic
  cp1252	WinLatin1
  cp1253	WinGreek
  cp1254	WinTurkish
  cp1255	WinHebrew
  cp1256	WinArabic
  cp1257	WinBaltic
  cp1258	WinVietnamese
  # all cp* are also available as ibm-* and ms-*

  maccentraleuropean  
  maccroatian
  macroman
  maccyrillic
  macromanian
  macdingbats       
  macsami
  macgreek 
  macthai
  macicelandic    
  macturkish
  macukraine
  -----------------------

=head2 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
below.  These encodings are implemented in a distinct module for each
language because of size concerns.  See the perldoc of each module
as well.

=over 4

=item Encode::CN -- Continental China

  -----------------------
  cp936      gbk		    
  euc-cn
  gb12345
  gb2312
  hz
  iso-ir-165
  -----------------------

=item Encode::JP -- Japan

  -----------------------
  7bit-jis	  jis
  cp932
  euc-jp	  ujis
  iso-2022-jp
  macjapan
  shiftjis	  Shift_JIS, sjis
  -----------------------

=item Encode::KR -- Korea

  -----------------------
  euc-kr
  ksc5601
  cp949
  -----------------------

=item Encode::TW -- Taiwan

  -----------------------
  big5
  big5-hkscs
  cp950
  -----------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  -----------------------
  gb18030
  euc-tw
  big5plus
  -----------------------

=back

=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See L<perlebcdic> for details.

  -----------------------
  cp1047
  cp37
  posix-bc
  -----------------------

=item Encode::Symbol

For symbols and dingbats.

  -----------------------
  symbol
  dingbats
  -----------------------

=back

=head1 Encoding vs. Charset

Character encoding (or just "encoding") and Character Set (or just
"charset") are often used interchangeably but they are different
concepts.

A charset determines which characters may appear in a given text.

An encoding maps the characters of one or more charsets to a stream
of bits.

Note that a given encoding may contain multiple charsets.  For instance,
euc-jp contains ASCII, JIS X 0201 (Hankaku Kana), JIS X 0208 (Zenkaku
Kana and Kanji) and JIS X 0212 (Extended Kanji) in a single encoding.

As the name suggests, the Encode module supports encodings, not
individual charsets.
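
To illustrate the euc-jp example above, the following sketch encodes
one character from each of three of those charsets and prints the
resulting bytes in hexadecimal.  The sample characters are arbitrary
choices made for this illustration.

    use strict;
    use warnings;
    use Encode qw(encode);

    # One character from each of three charsets that euc-jp carries.
    my @samples = (
        [ 'ASCII',      "A"        ],    # LATIN CAPITAL LETTER A
        [ 'JIS X 0201', "\x{FF71}" ],    # HALFWIDTH KATAKANA LETTER A
        [ 'JIS X 0208', "\x{3042}" ],    # HIRAGANA LETTER A
    );

    my %hex;
    for my $sample (@samples) {
        my ($charset, $char) = @$sample;
        $hex{$charset} = unpack("H*", encode("euc-jp", $char));
        printf "%s => %s\n", $charset, $hex{$charset};
    }
    # ASCII => 41
    # JIS X 0201 => 8eb1
    # JIS X 0208 => a4a2

All three conversions go through the single encoding name C<euc-jp>;
the charset a character belongs to only shows up in the byte pattern
that comes out.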

=head1 Encoding Classification (by Anton Tagunov)

Encodings

  US-ASCII    UTF-8       KOI8-R      ISO-8859-*
  ISO-2022-CN ISO-2022-JP Big5
  EUC-CN      EUC-JP      EUC-KR

are registered with IANA (F<http://www.iana.org/assignments/character-sets>)
as preferred MIME names and can probably be used over the Internet.  So is

  Shift_JIS

which, despite its wide use, used to be labeled as Microsoft
proprietary.  Shift JIS is now official as of JIS X 0208-1997.

         UTF-16 KOI8-U

are IANA-registered preferred MIME names but should probably
be avoided as encodings for web pages due to lack of browser support.

  ISO-2022      (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
  ISO-2022-JP-1 (http://www.faqs.org/rfcs/rfc2237.html)
  ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
  GBK
  VISCII
  GB 12345      (only planes 1 and 2 available)
  GB 18030
  CNS 11643

are totally valid encodings but not registered at IANA.

   BIG5PLUS
   EUC-JP-0212   (Encode::lib::Encode::Tcl::Extended)

are a bit proprietary.

More information on CJK encodings can be found at the following
locations.

A brief description of most of the CJK encodings mentioned above:

F<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>

Several years old, but still useful:

F<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

And some in-depth reading for the heroes :-)

F<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (equivalent to ISO-2022)

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>

=cut