
=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive.  White space in names
is ignored.  In addition, an encoding may have aliases.
Each encoding has one "canonical" name.  The "canonical"
name is chosen from the names of the encoding by picking
the first that applies in the following sequence:

       o The MIME name as defined in IETF RFCs.
       o The name in the IANA registry.
       o The name used by the organization that defined it.

Because of all the alias issues, and because in the general
case encodings have state, "Encode" uses the encoding
object internally once an operation is in progress.
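
The name resolution described above can be observed from Perl.  The
following is only a minimal sketch.  It assumes an Encode release that
provides C<find_encoding> and C<resolve_alias>, which may be newer
than the release this document was written against.

    use strict;
    use warnings;
    use Encode ();

    # Differently cased spellings of one alias resolve to the same
    # encoding object, whose name() is the canonical name.
    for my $name (qw(latin1 Latin1 LATIN1)) {
        my $enc = Encode::find_encoding($name)
            or die "unknown encoding: $name";
        printf "%-8s => %s\n", $name, $enc->name;
    }

    # resolve_alias() returns the canonical name directly, or a
    # false value when the name is not recognized.
    print Encode::resolve_alias("latin1"), "\n";    # iso-8859-1

Holding on to the object returned by C<find_encoding> avoids repeating
the alias lookup on every call.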

=head2 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Unless otherwise specified, their names are all case insensitive
(via alias) and every occurrence of a space is replaced with '-'.  In
other words, "ISO 8859 1" and "iso-8859-1" are identical.
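
The lists in this document are a snapshot.  To see what a particular
installation provides, the encodings can be listed at run time.  This
is a sketch that assumes the C<encodings> class method of a reasonably
recent Encode; the output depends on the installed version.

    use strict;
    use warnings;
    use Encode ();

    # encodings() lists what is already loaded; ":all" also includes
    # encodings that can be loaded on demand.
    my @loaded = Encode->encodings();
    my @all    = Encode->encodings(":all");

    printf "%d loaded, %d available\n", scalar @loaded, scalar @all;
    print "$_\n" for @all;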

=head3 ASCII

  Canonical	Aliases
  -----------------------
  ascii	        uc-ascii

=head3 The Unicode

  utf8		UTF-8
  utf16		UTF-16
  ucs2		UCS-2, iso-10646-1

=head3 The ISO 8859, KOI, and other 1-byte encodings

The following encodings are based upon ASCII.  In most cases they use
\x80-\xff (the upper half) to map non-ASCII characters.

  iso-8859-1	latin1
  iso-8859-2	latin2
  iso-8859-3	latin3
  iso-8859-4	latin4
  iso-8859-5	cyrillic
  iso-8859-6	arabic
  iso-8859-7
  iso-8859-8
  iso-8859-9	latin5
  iso-8859-10	latin6
  iso-8859-11
  (iso-8859-12 is nonexistent)
  iso-8859-13   latin7
  iso-8859-14	latin8
  iso-8859-15	latin9
  iso-8859-16	latin10

  koi8-f
  koi8-r
  koi8-u

  viscii	# ASCII + Vietnamese

  cp1250	WinLatin2
  cp1251	WinCyrillic
  cp1252	WinLatin1
  cp1253	WinGreek
  cp1254	WinTurkish
  cp1255	WinHebrew
  cp1256	WinArabic
  cp1257	WinBaltic
  cp1258	WinVietnamese
  # all cp* are also available as ibm-* and ms-*

  maccentraleuropean  
  maccroatian
  macroman
  maccyrillic
  macromanian
  macdingbats       
  macsami
  macgreek 
  macthai
  macicelandic    
  macturkish
  macukraine
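
For the single-byte encodings listed above, the use of the upper half
can be illustrated by decoding one high byte under two encodings.  This
is a minimal sketch using C<decode> from Encode; the choice of byte is
arbitrary.

    use strict;
    use warnings;
    use Encode qw(decode);

    # The same high byte decodes to different characters depending
    # on the encoding, while an ASCII byte stays the same.
    printf "iso-8859-1 0xC1: U+%04X\n",
        ord(decode("iso-8859-1", "\xC1"));      # U+00C1
    printf "koi8-r     0xC1: U+%04X\n",
        ord(decode("koi8-r", "\xC1"));          # U+0430
    printf "koi8-r     0x41: U+%04X\n",
        ord(decode("koi8-r", "A"));             # U+0041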

=head3 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
below.  Because of size concerns, these encodings are implemented in a
separate module for each language.  See the perldoc of each module also.

  cp936      gbk		    # Encode::CN
  euc-cn			    # Encode::CN
  gb12345			    # Encode::CN
  gb2312			    # Encode::CN
  hz				    # Encode::CN
  iso-ir-165			    # Encode::CN

  7bit-jis	  jis		    # Encode::JP
  cp932				    # Encode::JP
  euc-jp	  ujis		    # Encode::JP
  iso-2022-jp			    # Encode::JP
  macjapan			    # Encode::JP
  shiftjis	  Shift_JIS, sjis   # Encode::JP
  
  euc-kr			    # Encode::KR
  ksc5601			    # Encode::KR
  cp949                             # Encode::KR

  big5				    # Encode::TW
  big5-hkscs			    # Encode::TW
  cp950                             # Encode::TW

Due to size concerns, additional Chinese encodings including "GB
18030", "EUC-TW" and "BIG5PLUS" are distributed separately on CPAN,
under the name Encode::HanExtra.
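
As an illustration of the multibyte encodings above, the sketch below
resolves the C<gbk> alias and encodes one Chinese character.  It
assumes that Encode loads the language module on demand; if it does
not, add an explicit C<use Encode::CN;>.

    use strict;
    use warnings;
    use Encode qw(encode decode find_encoding);

    # "gbk" is listed above as an alias of the canonical name cp936.
    print find_encoding("gbk")->name, "\n";             # cp936

    # U+4E2D takes two bytes in euc-cn.
    my $octets = encode("euc-cn", "\x{4E2D}");
    print unpack("H*", $octets), "\n";                  # d6d0

    # Decoding the octets gives the original character back.
    print "round trip ok\n"
        if decode("euc-cn", $octets) eq "\x{4E2D}";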

=head3 EBCDIC

See L<perlebcdic> for details.

  cp1047
  cp37
  posix-bc
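
Unlike the encodings in the previous sections, these are not based
upon ASCII.  A small sketch that compares the byte used for the letter
"A", assuming the EBCDIC encodings are loaded on demand:

    use strict;
    use warnings;
    use Encode qw(encode);

    # "A" is 0x41 in ASCII but 0xC1 in these EBCDIC code pages.
    for my $name (qw(ascii cp37 cp1047 posix-bc)) {
        printf "%-8s %s\n", $name, unpack("H*", encode($name, "A"));
    }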

=head3 Symbols and dingbats

  symbol
  dingbats

=head1 Encoding vs. Charset

Character encoding (or just "encoding") and Character Set (or just
"charset") are often used interchangeably but they are different
concepts.

A charset determines which characters may appear in a given text.

An encoding maps the characters of one or more charsets to a stream
of bits.

Note that a given encoding may contain multiple charsets.  For instance,
euc-jp contains ASCII, JIS X 0201 (Hankaku Kana), JIS X 0208 (Zenkaku
Kana and Kanji) and JIS X 0212 (Extended Kanji) in a single encoding.

As the name suggests, the Encode module supports encodings, not
individual charsets.
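
The euc-jp case can be made concrete by encoding one character from
each of the four charsets.  This is a sketch only; the sample
characters are arbitrary picks chosen for illustration.

    use strict;
    use warnings;
    use Encode qw(encode);

    # One character from each charset that euc-jp is said to
    # contain; the encoded length differs by charset.
    my @samples = (
        [ "ASCII",      "A"        ],
        [ "JIS X 0201", "\x{FF71}" ],   # halfwidth katakana A
        [ "JIS X 0208", "\x{3042}" ],   # hiragana A
        [ "JIS X 0212", "\x{4E02}" ],   # a kanji not in JIS X 0208
    );
    for my $s (@samples) {
        my ($charset, $char) = @$s;
        my $octets = encode("euc-jp", $char);
        printf "%-10s %d byte(s): %s\n",
            $charset, length($octets), unpack("H*", $octets);
    }

All four characters go through the single encoding name C<euc-jp>;
there is no separate encoding object for each charset.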

=head1 Encoding Classification (by Anton Tagunov)

Encodings

  US-ASCII    UTF-8       KOI8-R      ISO-8859-*
  ISO-2022-CN ISO-2022-JP Big5
  EUC-CN      EUC-JP      EUC-KR

are registered at F<http://www.iana.org/assignments/character-sets> as
preferred MIME names and can probably be used over the Internet.  So is

  Shift_JIS

but despite its wide use it long bore the label of being
Microsoft proprietary.  That is no longer the case: Shift_JIS is
official as of JIS X 0208-1997.

  UTF-16      KOI8-U

are IANA-registered preferred MIME names but should probably
be avoided as encodings for web pages due to lack of
browser support.

  ISO-2022      (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
  ISO-2022-JP-1 (http://www.faqs.org/rfcs/rfc2237.html)
  ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
  GBK
  VISCII
  GB 12345      (only planes 1 and 2 available)
  GB 18030
  CNS 11643

are totally valid encodings but not registered at IANA.

   BIG5PLUS
   EUC-JP-0212   (Encode::lib::Encode::Tcl::Extended)

are a bit proprietary.
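
The relationship between canonical names and preferred MIME names can
be inspected from Perl.  The sketch below assumes an Encode release
recent enough to provide the C<mime_name> method on encoding objects,
which is later than the release this document describes.

    use strict;
    use warnings;
    use Encode qw(find_encoding);

    # Print the canonical name next to the MIME name, if any.
    for my $name (qw(ascii euc-jp shiftjis koi8-r)) {
        my $enc  = find_encoding($name) or next;
        my $mime = $enc->mime_name;
        printf "%-10s %s\n", $enc->name,
            defined $mime ? $mime : "(no MIME name known)";
    }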

More information on CJK encodings is available from the following
sources.

A brief description of most of the CJK encodings mentioned here:

F<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>

Several years old, but still useful:

F<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Some in-depth reading for the heroes :-)

F<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM> (equivalent to ISO-2022)

=head1 See Also

L<Encode>, L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>

=cut