1 | =head1 NAME |
2 | |
3 | Encode::EncFormat - the format of encoding tables of the Encode/*.enc files |
4 | |
5 | =head1 DESCRIPTION |
6 | |
7 | I<The format used in the encoding tables of the Encode extension was |
8 | borrowed from Tcl, as was the documentation that follows, which has |
9 | been reformatted as Perl pod.> |
10 | |
11 | Space would prohibit precompiling into Tcl every possible encoding |
12 | algorithm, so many encodings are stored on disk as dynamically-loadable |
13 | encoding files. This behavior also allows the user to create additional |
14 | encoding files that can be loaded using the same mechanism. These |
15 | encoding files contain information about the tables and/or escape |
16 | sequences used to map between an external encoding and Unicode. The |
17 | external encoding may consist of single-byte, multi-byte, or double-byte |
18 | characters. |
19 | |
20 | Each dynamically-loadable encoding is represented as a text file. The |
21 | initial line of the file, beginning with a ``#'' symbol, is a comment |
22 | that provides a human-readable description of the file. The next line |
23 | identifies the type of encoding file. It can be one of the following |
24 | letters: |
25 | |
26 | =over 4 |
27 | |
28 | =item [1] B<S> |
29 | |
30 | A single-byte encoding, where one character is always one byte long in |
31 | the encoding. An example is B<iso8859-1>, used by many European languages. |
32 | |
33 | =item [2] B<D> |
34 | |
35 | A double-byte encoding, where one character is always two bytes long in the |
36 | encoding. An example is B<big5>, used for Chinese text. |
37 | |
38 | =item [3] B<M> |
39 | |
40 | A multi-byte encoding, where one character may be either one or two |
41 | bytes long. Certain bytes are lead bytes, indicating that another |
42 | byte must follow and that together the two bytes represent one |
43 | character. Other bytes are not lead bytes and represent themselves. |
44 | An example is B<shiftjis>, used by many Japanese computers. |
45 | |
46 | =item [4] B<E> |
47 | |
48 | An escape-sequence encoding, specifying that certain sequences of |
49 | bytes do not represent characters, but commands that describe how |
50 | following bytes should be interpreted. |
51 | |
52 | =back |
53 | |
54 | The rest of the lines in the file depend on the type. |
55 | |
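The first two lines described above can be read with very little code. The following is a minimal sketch in Python, not part of Encode or Tcl; the helper name B<read_encoding_type> and the B<ENCODING_TYPES> table are hypothetical and exist only for illustration.

```python
# Hypothetical helper for illustration; not part of Encode or Tcl.
ENCODING_TYPES = {
    "S": "single-byte",
    "D": "double-byte",
    "M": "multi-byte",
    "E": "escape-sequence",
}


def read_encoding_type(text):
    """Return (type_letter, description) from the first two lines of a file."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("an encoding file needs a comment line and a type line")
    if not lines[0].startswith("#"):
        raise ValueError("the first line must be a '#' comment")
    letter = lines[1].strip()
    if letter not in ENCODING_TYPES:
        raise ValueError(f"unknown encoding type: {letter!r}")
    # Drop the leading '#' and surrounding blanks from the description.
    return letter, lines[0][1:].strip()


letter, description = read_encoding_type(
    "# Encoding file: shiftjis, multi-byte\nM\n003F 0 40\n"
)
print(letter, ENCODING_TYPES[letter], "|", description)
```

A caller would branch on the returned letter: B<S>, B<D>, and B<M> lead to the table-based layout, and B<E> to the escape-sequence layout.
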
56 | Cases [1], [2], and [3] are collectively referred to as table-based |
57 | encoding files. The lines in a table-based encoding file are in the |
58 | same format as this example taken from the B<shiftjis> encoding (this |
59 | is not the complete file): |
60 | |
61 | # Encoding file: shiftjis, multi-byte |
62 | M |
63 | 003F 0 40 |
64 | 00 |
65 | 0000000100020003000400050006000700080009000A000B000C000D000E000F |
66 | 0010001100120013001400150016001700180019001A001B001C001D001E001F |
67 | 0020002100220023002400250026002700280029002A002B002C002D002E002F |
68 | 0030003100320033003400350036003700380039003A003B003C003D003E003F |
69 | 0040004100420043004400450046004700480049004A004B004C004D004E004F |
70 | 0050005100520053005400550056005700580059005A005B005C005D005E005F |
71 | 0060006100620063006400650066006700680069006A006B006C006D006E006F |
72 | 0070007100720073007400750076007700780079007A007B007C007D203E007F |
73 | 0080000000000000000000000000000000000000000000000000000000000000 |
74 | 0000000000000000000000000000000000000000000000000000000000000000 |
75 | 0000FF61FF62FF63FF64FF65FF66FF67FF68FF69FF6AFF6BFF6CFF6DFF6EFF6F |
76 | FF70FF71FF72FF73FF74FF75FF76FF77FF78FF79FF7AFF7BFF7CFF7DFF7EFF7F |
77 | FF80FF81FF82FF83FF84FF85FF86FF87FF88FF89FF8AFF8BFF8CFF8DFF8EFF8F |
78 | FF90FF91FF92FF93FF94FF95FF96FF97FF98FF99FF9AFF9BFF9CFF9DFF9EFF9F |
79 | 0000000000000000000000000000000000000000000000000000000000000000 |
80 | 0000000000000000000000000000000000000000000000000000000000000000 |
81 | 81 |
82 | 0000000000000000000000000000000000000000000000000000000000000000 |
83 | 0000000000000000000000000000000000000000000000000000000000000000 |
84 | 0000000000000000000000000000000000000000000000000000000000000000 |
85 | 0000000000000000000000000000000000000000000000000000000000000000 |
86 | 300030013002FF0CFF0E30FBFF1AFF1BFF1FFF01309B309C00B4FF4000A8FF3E |
87 | FFE3FF3F30FD30FE309D309E30034EDD30053006300730FC20152010FF0F005C |
88 | 301C2016FF5C2026202520182019201C201DFF08FF0930143015FF3BFF3DFF5B |
89 | FF5D30083009300A300B300C300D300E300F30103011FF0B221200B100D70000 |
90 | 00F7FF1D2260FF1CFF1E22662267221E22342642264000B0203220332103FFE5 |
91 | FF0400A200A3FF05FF03FF06FF0AFF2000A72606260525CB25CF25CE25C725C6 |
92 | 25A125A025B325B225BD25BC203B301221922190219121933013000000000000 |
93 | 000000000000000000000000000000002208220B2286228722822283222A2229 |
94 | 000000000000000000000000000000002227222800AC21D221D4220022030000 |
95 | 0000000000000000000000000000000000000000222022A52312220222072261 |
96 | 2252226A226B221A223D221D2235222B222C0000000000000000000000000000 |
97 | 212B2030266F266D266A2020202100B6000000000000000025EF000000000000 |
98 | |
99 | The third line of the file contains three numbers. The first number is the |
100 | fallback character (in base 16) to use when converting from UTF-8 to |
101 | this encoding. The second number is a B<1> if this file represents |
102 | the encoding for a symbol font, or B<0> otherwise. The last number |
103 | (in base 10) is how many pages of data follow. |
104 | |
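As a rough illustration of the three fields, the sketch below parses the header line of the B<shiftjis> example in Python. The names B<TableHeader> and B<parse_table_header> are hypothetical; the number bases follow the description above (hexadecimal fallback, decimal page count).

```python
from typing import NamedTuple


class TableHeader(NamedTuple):
    """Hypothetical container for the third line of a table-based file."""
    fallback: int      # fallback character, parsed from base 16
    is_symbol: bool    # True when the flag is 1 (symbol font)
    page_count: int    # number of pages that follow, parsed from base 10


def parse_table_header(line):
    fields = line.split()
    if len(fields) != 3:
        raise ValueError("expected three whitespace-separated numbers")
    if fields[1] not in ("0", "1"):
        raise ValueError("the symbol-font flag must be 0 or 1")
    return TableHeader(
        fallback=int(fields[0], 16),
        is_symbol=(fields[1] == "1"),
        page_count=int(fields[2], 10),
    )


header = parse_table_header("003F 0 40")
print(hex(header.fallback), header.is_symbol, header.page_count)
```

For the example line, the fallback 003F is the question mark, the file does not describe a symbol font, and 40 pages follow.
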
105 | Subsequent lines in the example above are pages that describe how to |
106 | map from the encoding into 2-byte Unicode. The first line in a page |
107 | identifies the page number. Following it are 256 double-byte numbers, |
108 | arranged as 16 rows of 16 numbers. Given a character in the encoding, |
109 | the high byte of that character is used to select which page, and the |
110 | low byte of that character is used as an index to select one of the |
111 | double-byte numbers in that page - the value obtained being the |
112 | corresponding Unicode character. By examination of the example above, |
113 | one can see that the characters 0x7E and 0x8163 in B<shiftjis> map to |
114 | 203E and 2026 in Unicode, respectively. |
115 | |
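The lookup just described can be sketched in Python as follows. The two rows are copied from the B<shiftjis> example above (row 7 of page 00 and row 6 of page 81); every other row is left as zeros for brevity, so only a few codes resolve. The helper names B<parse_row>, B<make_page>, and B<lookup> are hypothetical.

```python
# Rows copied from the shiftjis example: page 00 row 7, page 81 row 6.
ROW_00_7 = "0070007100720073007400750076007700780079007A007B007C007D203E007F"
ROW_81_6 = "301C2016FF5C2026202520182019201C201DFF08FF0930143015FF3BFF3DFF5B"


def parse_row(row):
    """Split one row of 64 hex digits into 16 integers."""
    if len(row) != 64:
        raise ValueError("a row holds 16 four-digit hex numbers")
    return [int(row[i:i + 4], 16) for i in range(0, 64, 4)]


def make_page(rows_by_index):
    """Build a 256-entry page; rows not supplied stay zero (sketch only)."""
    page = [0] * 256
    for row_index, text in rows_by_index.items():
        page[row_index * 16:(row_index + 1) * 16] = parse_row(text)
    return page


def lookup(pages, code):
    """Map a character code to a Unicode code point, or None if unmapped."""
    high, low = code >> 8, code & 0xFF   # high byte: page, low byte: index
    page = pages.get(high)
    if page is None:
        return None                      # page omitted from the file
    return page[low] or None             # 0000 means "does not exist"


pages = {0x00: make_page({7: ROW_00_7}), 0x81: make_page({6: ROW_81_6})}
for code in (0x7E, 0x8163):
    print(f"{code:04X} -> {lookup(pages, code):04X}")
```

A one-byte character such as 0x7E has a high byte of zero and therefore selects page 00.
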
116 | Following the first page will be all the other pages, each in the same |
117 | format as the first: one number identifying the page followed by 256 |
118 | double-byte Unicode characters. If a character in the encoding maps |
119 | to the Unicode character 0000, it means that the character doesn't |
120 | actually exist. If all characters on a page would map to 0000, that |
121 | page can be omitted. |
122 | |
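The omission rule matters mostly when writing a file. The Python sketch below renders a set of pages as text and skips any page whose entries are all 0000; the returned count is the value that would go in the third field of the header line. The function name B<format_pages> is hypothetical, and the two-digit hexadecimal page number is inferred from the "00" and "81" lines of the B<shiftjis> example.

```python
def format_pages(pages):
    """Return (page_count, lines) for a dict of page number -> 256 integers.

    Pages in which every entry is 0 are omitted, as the format allows.
    """
    kept = {n: p for n, p in sorted(pages.items()) if any(p)}
    lines = []
    for number, page in kept.items():
        if len(page) != 256:
            raise ValueError("a page holds exactly 256 entries")
        lines.append(f"{number:02X}")                 # page number line
        for start in range(0, 256, 16):               # 16 rows of 16 numbers
            lines.append("".join(f"{v:04X}" for v in page[start:start + 16]))
    return len(kept), lines


# Page 00 maps printable ASCII to itself, page 81 is empty,
# and page 82 has a single entry.
ascii_page = [b if 0x20 <= b < 0x7F else 0 for b in range(256)]
sparse_page = [0] * 256
sparse_page[0xA0] = 0x3042
count, lines = format_pages({0x00: ascii_page, 0x81: [0] * 256, 0x82: sparse_page})
print(count, lines[0], lines[17])
```

Here page 81 is dropped, so two pages and 34 lines are produced.
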
123 | Case [4] is the escape-sequence encoding file. The lines in this |
124 | type of file are in the same format as this example taken from the |
125 | B<iso2022-jp> encoding: |
126 | |
127 | # Encoding file: iso2022-jp, escape-driven |
128 | E |
129 | init {} |
130 | final {} |
131 | iso8859-1 \x1b(B |
132 | jis0201 \x1b(J |
133 | jis0208 \x1b$@ |
134 | jis0208 \x1b$B |
135 | jis0212 \x1b$(D |
136 | gb2312 \x1b$A |
137 | ksc5601 \x1b$(C |
138 | |
139 | In the file, the first column represents an option and the second |
140 | column is the associated value. B<init> is a string to emit or expect |
141 | before the first character is converted, while B<final> is a string to |
142 | emit or expect after the last character. All other options are names |
143 | of table-based encodings; the associated value is the escape sequence |
144 | that marks that encoding. Tcl syntax is used for the values; in the |
145 | above example, for instance, ``B<{}>'' represents the empty string and |
146 | ``B<\x1b>'' represents character 27. |
147 | |
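A minimal Python sketch of reading such a file follows. It handles only the two pieces of Tcl syntax mentioned above, the braces for an empty string and the backslash-x hexadecimal escape, and is not a general Tcl parser. As a leniency that is purely an assumption of this sketch, a run of several backslashes before the B<x> is treated like a single one. The names B<decode_value> and B<parse_escape_file> are hypothetical. Table entries are kept as a list because an encoding name, such as jis0208, may appear more than once.

```python
import re

# One or more backslashes, 'x', then two hex digits (sketch-level leniency).
_HEX_ESCAPE = re.compile(r"\\+x([0-9A-Fa-f]{2})")

SAMPLE = r"""# Encoding file: iso2022-jp, escape-driven
E
init {}
final {}
iso8859-1 \x1b(B
jis0201 \x1b(J
jis0208 \x1b$@
jis0208 \x1b$B
"""


def decode_value(raw):
    """Decode the small subset of Tcl syntax used in the example."""
    if raw == "{}":
        return ""
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), raw)


def parse_escape_file(text):
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2 or not lines[0].startswith("#") or lines[1] != "E":
        raise ValueError("not an escape-sequence encoding file")
    result = {"init": "", "final": "", "tables": []}
    for line in lines[2:]:
        parts = line.split(None, 1)          # option, then its value
        if len(parts) != 2:
            raise ValueError(f"expected 'option value': {line!r}")
        option, value = parts[0], decode_value(parts[1])
        if option in ("init", "final"):
            result[option] = value
        else:
            result["tables"].append((option, value))
    return result


parsed = parse_escape_file(SAMPLE)
for name, sequence in parsed["tables"]:
    print(name, repr(sequence))
```
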
148 | B<Completely Tcl-specific paragraph, ignore in the context of Perl> |
149 | When B<Tcl_GetEncoding> encounters an encoding I<name> that has not |
150 | been loaded, it attempts to load an encoding file called |
151 | I<name>B<.enc> from the B<encoding> subdirectory of each directory |
152 | specified in the library path B<$tcl_libPath>. If the encoding file |
153 | exists, but is malformed, an error message will be left in I<interp>. |
154 | |
155 | =head1 KEYWORDS |
156 | |
157 | utf, encoding, convert |
158 | |
159 | =head1 COPYRIGHT |
160 | |
161 | # Copyright (c) 1997-1998 Sun Microsystems, Inc. |
162 | # See the file "license.terms" for information on usage and redistribution |
163 | # of this file, and for a DISCLAIMER OF ALL WARRANTIES. |