=head1 NAME

EncodeFormat - the format of encoding tables of the Encode extension

=head1 DESCRIPTION

I<The format used in the encoding tables of the Encode extension has
been borrowed from Tcl, and the following documentation has been borrowed
from the same source.  It has been reformatted as Perl pod.>

Space would prohibit precompiling into Tcl every possible encoding
algorithm, so many encodings are stored on disk as dynamically-loadable
encoding files.  This behavior also allows the user to create additional
encoding files that can be loaded using the same mechanism.  These
encoding files contain information about the tables and/or escape
sequences used to map between an external encoding and Unicode.  The
external encoding may consist of single-byte, multi-byte, or double-byte
characters.

Each dynamically-loadable encoding is represented as a text file.  The
initial line of the file, beginning with a ``#'' symbol, is a comment
that provides a human-readable description of the file.  The next line
identifies the type of encoding file.  It can be one of the following
letters:

=over 4

=item [1]   B<S>

A single-byte encoding, where one character is always one byte long in
the encoding.  An example is B<iso8859-1>, used by many European languages.

=item [2]   B<D>

A double-byte encoding, where one character is always two bytes long in the
encoding.  An example is B<big5>, used for Chinese text.

=item [3]   B<M>

A multi-byte encoding, where one character may be either one or two
bytes long.  Certain bytes are lead bytes, indicating that another
byte must follow and that together the two bytes represent one
character.  Other bytes are not lead bytes and represent themselves.
An example is B<shiftjis>, used by many Japanese computers.

=item [4]   B<E>

An escape-sequence encoding, specifying that certain sequences of
bytes do not represent characters, but commands that describe how
following bytes should be interpreted.

=back

The rest of the lines in the file depend on the type.
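
The first two lines can be read as in the following minimal Python sketch.
The helper name B<read_header> and its return shape are assumptions made
for illustration; they are not part of the Encode extension or of Tcl.

```python
# Hypothetical helper: split an encoding file into its description,
# its type letter, and the remaining type-dependent lines.
ENCODING_TYPES = {
    "S": "single-byte",
    "D": "double-byte",
    "M": "multi-byte",
    "E": "escape-sequence",
}


def read_header(text):
    """Return (description, type_letter, remaining_lines)."""
    lines = text.splitlines()
    if len(lines) < 2 or not lines[0].startswith("#"):
        raise ValueError("first line must be a '#' comment")
    type_letter = lines[1].strip()
    if type_letter not in ENCODING_TYPES:
        raise ValueError("unknown encoding type: %r" % type_letter)
    # Drop the leading '#' and surrounding whitespace from the comment.
    description = lines[0][1:].strip()
    return description, type_letter, lines[2:]


sample = "# Encoding file: shiftjis, multi-byte\nM\n003F 0 40\n"
description, type_letter, rest = read_header(sample)
print(description, "->", ENCODING_TYPES[type_letter])
```

The remaining lines are returned untouched because their meaning depends
on the type letter.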

Cases [1], [2], and [3] are collectively referred to as table-based
encoding files.  The lines in a table-based encoding file are in the
same format as this example taken from the B<shiftjis> encoding (this
is not the complete file):

 # Encoding file: shiftjis, multi-byte
 M
 003F 0 40
 00
 0000000100020003000400050006000700080009000A000B000C000D000E000F
 0010001100120013001400150016001700180019001A001B001C001D001E001F
 0020002100220023002400250026002700280029002A002B002C002D002E002F
 0030003100320033003400350036003700380039003A003B003C003D003E003F
 0040004100420043004400450046004700480049004A004B004C004D004E004F
 0050005100520053005400550056005700580059005A005B005C005D005E005F
 0060006100620063006400650066006700680069006A006B006C006D006E006F
 0070007100720073007400750076007700780079007A007B007C007D203E007F
 0080000000000000000000000000000000000000000000000000000000000000
 0000000000000000000000000000000000000000000000000000000000000000
 0000FF61FF62FF63FF64FF65FF66FF67FF68FF69FF6AFF6BFF6CFF6DFF6EFF6F
 FF70FF71FF72FF73FF74FF75FF76FF77FF78FF79FF7AFF7BFF7CFF7DFF7EFF7F
 FF80FF81FF82FF83FF84FF85FF86FF87FF88FF89FF8AFF8BFF8CFF8DFF8EFF8F
 FF90FF91FF92FF93FF94FF95FF96FF97FF98FF99FF9AFF9BFF9CFF9DFF9EFF9F
 0000000000000000000000000000000000000000000000000000000000000000
 0000000000000000000000000000000000000000000000000000000000000000
 81
 0000000000000000000000000000000000000000000000000000000000000000
 0000000000000000000000000000000000000000000000000000000000000000
 0000000000000000000000000000000000000000000000000000000000000000
 0000000000000000000000000000000000000000000000000000000000000000
 300030013002FF0CFF0E30FBFF1AFF1BFF1FFF01309B309C00B4FF4000A8FF3E
 FFE3FF3F30FD30FE309D309E30034EDD30053006300730FC20152010FF0F005C
 301C2016FF5C2026202520182019201C201DFF08FF0930143015FF3BFF3DFF5B
 FF5D30083009300A300B300C300D300E300F30103011FF0B221200B100D70000
 00F7FF1D2260FF1CFF1E22662267221E22342642264000B0203220332103FFE5
 FF0400A200A3FF05FF03FF06FF0AFF2000A72606260525CB25CF25CE25C725C6
 25A125A025B325B225BD25BC203B301221922190219121933013000000000000
 000000000000000000000000000000002208220B2286228722822283222A2229
 000000000000000000000000000000002227222800AC21D221D4220022030000
 0000000000000000000000000000000000000000222022A52312220222072261
 2252226A226B221A223D221D2235222B222C0000000000000000000000000000
 212B2030266F266D266A2020202100B6000000000000000025EF000000000000

The third line of the file contains three numbers.  The first number is the
fallback character (in base 16) to use when converting from UTF-8 to
this encoding.  The second number is a B<1> if this file represents
the encoding for a symbol font, or B<0> otherwise.  The last number
(in base 10) is how many pages of data follow.
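
As a small illustration, the three fields of that line could be decoded
as follows.  The function name and the dictionary keys are hypothetical;
only the number bases come from the description above.

```python
def parse_table_header(line):
    """Decode the third line of a table-based encoding file."""
    fallback_hex, symbol_flag, page_count = line.split()
    return {
        "fallback": int(fallback_hex, 16),  # fallback character, base 16
        "symbol": symbol_flag == "1",       # 1 means symbol font
        "pages": int(page_count, 10),       # number of pages, base 10
    }


header = parse_table_header("003F 0 40")
print(header)
```

For the B<shiftjis> line shown earlier, the fallback character 003F is the
question mark, and 40 pages of data follow.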

Subsequent lines in the example above are pages that describe how to
map from the encoding into 2-byte Unicode.  The first line in a page
identifies the page number.  Following it are 256 double-byte numbers,
arranged as 16 rows of 16 numbers.  Given a character in the encoding,
the high byte of that character is used to select which page, and the
low byte of that character is used as an index to select one of the
double-byte numbers in that page - the value obtained being the
corresponding Unicode character.  By examination of the example above,
one can see that the characters 0x7E and 0x8163 in B<shiftjis> map to
203E and 2026 in Unicode, respectively.
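
The lookup can be checked with a short Python sketch.  To keep it brief,
only the two rows needed for these characters are copied from the example;
every other row is filled with zeros, which is an assumption of the sketch
and not the real table contents.  The function names are hypothetical.

```python
ZERO_ROW = "0000" * 16  # placeholder row, not real table data


def parse_page(rows):
    """Turn 16 rows of 16 four-digit hex numbers into a list of 256 ints."""
    if len(rows) != 16:
        raise ValueError("a page has exactly 16 rows")
    values = []
    for row in rows:
        if len(row) != 64:
            raise ValueError("a row has exactly 16 four-digit numbers")
        values.extend(int(row[i:i + 4], 16) for i in range(0, 64, 4))
    return values


def lookup(pages, code):
    """High byte selects the page, low byte indexes into it."""
    return pages[code >> 8][code & 0xFF]


# Row 7 of page 00 and row 6 of page 81, as printed in the example.
row_00_7 = "0070007100720073007400750076007700780079007A007B007C007D203E007F"
row_81_6 = "301C2016FF5C2026202520182019201C201DFF08FF0930143015FF3BFF3DFF5B"

pages = {
    0x00: parse_page([ZERO_ROW] * 7 + [row_00_7] + [ZERO_ROW] * 8),
    0x81: parse_page([ZERO_ROW] * 6 + [row_81_6] + [ZERO_ROW] * 9),
}

print("%04X" % lookup(pages, 0x7E))
print("%04X" % lookup(pages, 0x8163))
```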

Following the first page will be all the other pages, each in the same
format as the first: one number identifying the page followed by 256
double-byte Unicode characters.  If a character in the encoding maps
to the Unicode character 0000, it means that the character doesn't
actually exist.  If all characters on a page would map to 0000, that
page can be omitted.
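
A reader for the page section might look like the sketch below.  It
assumes that page numbers are written in base 16, as "81" in the example
suggests, and it follows the text literally by reporting an entry of 0000
as unmapped, which also applies to byte 00 itself; how a real loader treats
NUL is not covered here.  The function names and the sample data are made
up for illustration.

```python
def read_pages(lines, page_count):
    """Read page_count pages, each one page-number line plus 16 data rows."""
    pages = {}
    pos = 0
    for _ in range(page_count):
        page_number = int(lines[pos], 16)  # assumed to be hexadecimal
        digits = "".join(lines[pos + 1:pos + 17])
        if len(digits) != 256 * 4:
            raise ValueError("page %02X is truncated" % page_number)
        pages[page_number] = [
            int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)
        ]
        pos += 17
    return pages


def to_unicode(pages, code):
    """Return the mapped character, or None if the code does not exist."""
    page = pages.get(code >> 8)  # an omitted page means nothing maps
    if page is None:
        return None
    value = page[code & 0xFF]
    return chr(value) if value else None  # 0000 means "no such character"


# Synthetic two-page input: page 00 maps 0x40 and 0x41, page 81 maps 0x63.
zero_row = "0000" * 16
lines = (
    ["00"] + [zero_row] * 4 + ["00400041" + "0000" * 14] + [zero_row] * 11
    + ["81"] + [zero_row] * 6 + ["0000" * 3 + "2026" + "0000" * 12]
    + [zero_row] * 9
)
pages = read_pages(lines, 2)

print(to_unicode(pages, 0x41))
print(to_unicode(pages, 0x42))
print(to_unicode(pages, 0x8263))
```

Storing the pages in a dictionary keyed by page number makes an omitted
page and an all-zero page behave the same way on lookup.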

Case [4] is the escape-sequence encoding file.  The lines in this
type of file are in the same format as this example taken from the
B<iso2022-jp> encoding:

 # Encoding file: iso2022-jp, escape-driven
 E
 init		{}
 final		{}
 iso8859-1	\x1b(B
 jis0201		\x1b(J
 jis0208		\x1b$@
 jis0208		\x1b$B
 jis0212		\x1b$(D
 gb2312		\x1b$A
 ksc5601		\x1b$(C

In the file, the first column represents an option and the second
column is the associated value.  B<init> is a string to emit or expect
before the first character is converted, while B<final> is a string to
emit or expect after the last character.  All other options are names
of table-based encodings; the associated value is the escape sequence
that marks that encoding.  Tcl syntax is used for the values; in the
above example, for instance, ``B<{}>'' represents the empty string and
``B<\x1b>'' represents character 27.
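
The following Python sketch shows one way such a file could be parsed.
It handles only the two pieces of Tcl syntax mentioned above, the empty
braces and the hexadecimal backslash escape; it is not a general Tcl
parser.  The sample input writes the escape as a single backslash followed
by x1b, which is how Tcl reads character 27.  All function names and the
result layout are assumptions for illustration.

```python
import re


def decode_tcl_value(value):
    """Decode the small subset of Tcl syntax used in the example."""
    if value == "{}":
        return ""
    # Replace each \xHH escape with the character it names.
    return re.sub(
        r"\\x([0-9A-Fa-f]{1,2})",
        lambda match: chr(int(match.group(1), 16)),
        value,
    )


def parse_escape_file(text):
    """Return init, final, and the (encoding, escape sequence) pairs."""
    lines = text.splitlines()
    if len(lines) < 2 or lines[1].strip() != "E":
        raise ValueError("not an escape-sequence encoding file")
    result = {"init": "", "final": "", "escapes": []}
    for line in lines[2:]:
        if not line.strip():
            continue
        option, value = line.split(None, 1)
        value = decode_tcl_value(value.strip())
        if option in ("init", "final"):
            result[option] = value
        else:
            # A list keeps repeated names such as jis0208.
            result["escapes"].append((option, value))
    return result


sample = r"""# Encoding file: iso2022-jp, escape-driven
E
init      {}
final     {}
iso8859-1 \x1b(B
jis0208   \x1b$@
jis0208   \x1b$B
"""

result = parse_escape_file(sample)
for name, sequence in result["escapes"]:
    print(name, repr(sequence))
```

The pairs are kept in a list and not in a dictionary because the same
table-based encoding may be selected by more than one escape sequence,
as B<jis0208> is in the example.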

B<Completely Tcl-specific paragraph, ignore in the context of Perl>
When B<Tcl_GetEncoding> encounters an encoding I<name> that has not
been loaded, it attempts to load an encoding file called
I<name>B<.enc> from the B<encoding> subdirectory of each directory
specified in the library path B<$tcl_libPath>.  If the encoding file
exists, but is malformed, an error message will be left in I<interp>.

=head1 KEYWORDS

utf, encoding, convert

=head1 COPYRIGHT

  #  Copyright (c) 1997-1998 Sun Microsystems, Inc.
  #  See the file "license.terms" for information on usage and redistribution
  #  of this file, and for a DISCLAIMER OF ALL WARRANTIES.