Commit | Line | Data |
18586f54 |
1 | package Encode::Encoding; |
2 | # Base class for classes which implement encodings |
3 | use strict; |
1b2c56c8 |
4 | our $VERSION = do { my @r = (q$Revision: 0.94 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r }; |
18586f54 |
5 | |
6 | sub Define |
7 | { |
8 | my $obj = shift; |
9 | my $canonical = shift; |
10 | $obj = bless { Name => $canonical },$obj unless ref $obj; |
11 | # warn "$canonical => $obj\n"; |
12 | Encode::define_encoding($obj, $canonical, @_); |
13 | } |
14 | |
15 | sub name { shift->{'Name'} } |
16 | |
17 | # Temporary legacy methods |
18 | sub toUnicode { shift->decode(@_) } |
19 | sub fromUnicode { shift->encode(@_) } |
20 | |
21 | sub new_sequence { return $_[0] } |
22 | |
284ee456 |
23 | sub DESTROY {} |
24 | |
18586f54 |
25 | 1; |
26 | __END__ |
1b2c56c8 |
27 | |
28 | =head1 NAME |
29 | |
30 | Encode::Encoding - Encode Implementation Base Class |
31 | |
32 | =head1 SYNOPSIS |
33 | |
34 | package Encode::MyEncoding; |
35 | use base qw(Encode::Encoding); |
36 | |
37 | __PACKAGE__->Define(qw(myCanonical myAlias)); |
38 | |
39 | =head1 DESCRIPTION |
40 | |
41 | As mentioned in L<Encode>, encodings are (in the current |
42 | implementation at least) defined by objects. The mapping of encoding |
43 | name to object is via the C<%encodings> hash. |
44 | |
45 | The values of the hash can currently be either strings or objects. |
46 | The string form may go away in the future. The string form occurs |
47 | when C<encodings()> has scanned C<@INC> for loadable encodings but has |
48 | not actually loaded the encoding in question. This is because the |
49 | current "loading" process is all Perl and a bit slow. |
50 | |
51 | Once an encoding is loaded, the value of the hash is the object that |
52 | implements the encoding. The object should provide the following |
53 | interface: |
54 | |
55 | =over 4 |
56 | |
57 | =item -E<gt>name |
58 | |
59 | Should return the string representing the canonical name of the encoding. |
60 | |
61 | =item -E<gt>new_sequence |
62 | |
63 | This is a placeholder for encodings with state. It should return an |
64 | object which implements this interface; all current implementations |
65 | return the original object. |
66 | |
67 | =item -E<gt>encode($string,$check) |
68 | |
69 | Should return the octet sequence representing I<$string>. If I<$check> |
70 | is true it should modify I<$string> in place to remove the converted |
71 | part (i.e. the whole string unless there is an error). If an error |
72 | occurs it should return the octet sequence for the fragment of string |
73 | that has been converted, and modify I<$string> in place to remove the |
74 | converted part leaving it starting with the problem fragment. |
75 | |
76 | If I<$check> is false then C<encode> should make a "best effort" to |
77 | convert the string - for example by using a replacement character. |
78 | |
79 | =item -E<gt>decode($octets,$check) |
80 | |
81 | Should return the string that I<$octets> represents. If I<$check> is |
82 | true it should modify I<$octets> in place to remove the converted part |
83 | (i.e. the whole sequence unless there is an error). If an error |
84 | occurs it should return the fragment of string that has been |
85 | converted, and modify I<$octets> in place to remove the converted part |
86 | leaving it starting with the problem fragment. |
87 | |
88 | If I<$check> is false then C<decode> should make a "best effort" to |
89 | convert the string - for example by using Unicode's "\x{FFFD}" as a |
90 | replacement character. |
91 | |
92 | =back |
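The interface above can be sketched as a small stand-alone class. This is only an illustration under stated assumptions: the class name C<My::AsciiSketch>, its C<new> constructor and the choice of 7-bit ASCII with "?" as the replacement octet are all hypothetical, and the class deliberately does not inherit from C<Encode::Encoding> so that the example stays self-contained.

```perl
use strict;
use warnings;

package My::AsciiSketch;    # hypothetical class, not part of Encode

sub new          { my ($class) = @_; return bless { Name => 'ascii-sketch' }, $class }
sub name         { $_[0]{Name} }
sub new_sequence { $_[0] }    # stateless, so return the original object

# Characters above 0x7F cannot be represented in this sketch.
sub encode {
    my ($self, $string, $check) = @_;
    my ($octets, $done) = ('', 0);
    for my $ch (split //, $string) {
        my $ord = ord $ch;
        if ($ord > 0x7F) {
            last if $check;      # checked: stop at the problem fragment
            $ord = ord '?';      # unchecked: best effort replacement
        }
        $octets .= pack 'C', $ord;
        $done++;
    }
    # $_[1] aliases the caller's variable, so this edits it in place.
    substr($_[1], 0, $done, '') if $check;
    return $octets;
}

# Octets above 0x7F are invalid in this sketch.
sub decode {
    my ($self, $octets, $check) = @_;
    my ($string, $done) = ('', 0);
    for my $byte (unpack 'C*', $octets) {
        if ($byte > 0x7F) {
            last if $check;
            $string .= "\x{FFFD}";    # Unicode replacement character
        }
        else {
            $string .= chr $byte;
        }
        $done++;
    }
    substr($_[1], 0, $done, '') if $check;
    return $string;
}

package main;

my $enc  = My::AsciiSketch->new;
my $text = "ab\x{263A}cd";
my $out  = $enc->encode($text, 1);
print "converted: $out\n";                      # the part before the problem
print "left over: ", length($text), " chars\n"; # starts at the problem fragment
```

With I<$check> true the caller can tell how far conversion got by inspecting what is left in its own variable; with I<$check> false the variable is left untouched and the whole input is converted on a best-effort basis.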
93 | |
94 | It should be noted that the check behaviour is different from the |
95 | outer public API. The logic is that the "unchecked" case is useful |
96 | when encoding is part of a stream which may be reporting errors |
97 | (e.g. STDERR). In such cases it is desirable to get everything |
98 | through somehow without causing additional errors which obscure the |
99 | original one. Also the encoding is best placed to know what the |
100 | correct replacement character is, so if that is the desired behaviour |
101 | then letting low level code do it is the most efficient. |
102 | |
103 | In contrast if check is true, the scheme above allows the encoding to |
104 | do as much as it can and tell the layer above how much that was. What is |
105 | lacking at present is a mechanism to report what went wrong. The most |
106 | likely interface will be an additional method call to the object, or |
107 | perhaps (to avoid forcing per-stream objects on otherwise stateless |
108 | encodings) an additional parameter. |
109 | |
110 | It is also highly desirable that encoding classes inherit from |
111 | C<Encode::Encoding> as a base class. This allows that class to define |
112 | additional behaviour for all encoding objects. For example, the built-in |
113 | Unicode, UCS-2 and UTF-8 classes use: |
114 | |
115 | package Encode::MyEncoding; |
116 | use base qw(Encode::Encoding); |
117 | |
118 | __PACKAGE__->Define(qw(myCanonical myAlias)); |
119 | |
120 | This creates an object with C<< bless {Name => ...}, $class >> and calls |
121 | C<Encode::define_encoding>. They inherit their C<name> method from |
122 | C<Encode::Encoding>. |
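As a rough check of the registration step, the following sketch defines a class and looks it up again by its canonical name. The class name and both encoding names are made up for illustration, and the use of C<Encode::find_encoding> assumes the Encode distribution shipped with current Perls rather than the exact revision shown here.

```perl
use strict;
use warnings;
use Encode ();

package My::SketchEncoding;    # hypothetical class name
use base qw(Encode::Encoding);

# Register one canonical name and one alias (both names are invented).
__PACKAGE__->Define(qw(x-sketch-canonical x-sketch-alias));

package main;

# Look the object up again by its canonical name.
my $obj = Encode::find_encoding('x-sketch-canonical');
print ref($obj), "\n";     # class of the registered object
print $obj->name, "\n";    # name() is inherited from Encode::Encoding
```

The sketch defines no C<encode> or C<decode>, so it only demonstrates the name-to-object mapping, not a usable encoding.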
123 | |
124 | =head2 Compiled Encodings |
125 | |
126 | F<Encode.xs> provides a class C<Encode::XS> which provides the |
127 | interface described above. It calls a generic octet-sequence to |
128 | octet-sequence "engine" that is driven by tables (defined in |
129 | F<encengine.c>). The same engine is used for both encode and |
130 | decode. C<Encode::XS>'s C<encode> forces Perl's characters to their |
131 | UTF-8 form and then treats them as just another multibyte |
132 | encoding. C<Encode::XS>'s C<decode> transforms the sequence and then |
133 | turns on the UTF-8-ness flag, as that is the form that the tables are |
134 | defined to produce. For details of the engine see the comments in |
135 | F<encengine.c>. |
136 | |
137 | The tables are produced by the Perl script F<compile> (the name needs |
138 | to change so we can eventually install it somewhere). F<compile> can |
139 | currently read two formats: |
140 | |
141 | =over 4 |
142 | |
143 | =item *.enc |
144 | |
145 | This is a coined format used by Tcl. It is documented in |
146 | F<Encode/EncodeFormat.pod>. |
147 | |
148 | =item *.ucm |
149 | |
150 | This is the semi-standard format used by IBM's ICU package. |
151 | |
152 | =back |
153 | |
154 | F<compile> can write the following forms: |
155 | |
156 | =over 4 |
157 | |
158 | =item *.ucm |
159 | |
160 | See above - the F<Encode/*.ucm> files provided with the distribution have |
161 | been created from the original Tcl .enc files using this approach. |
162 | |
163 | =item *.c |
164 | |
165 | Produces tables as C data structures - this is used to build encodings |
166 | into F<Encode.so>/F<Encode.dll>. |
167 | |
168 | =item *.xs |
169 | |
170 | In theory this allows encodings to be stand-alone loadable Perl |
171 | extensions. The process has not yet been tested. The plan is to use |
172 | this approach for large East Asian encodings. |
173 | |
174 | =back |
175 | |
176 | The set of encodings built into F<Encode.so>/F<Encode.dll> is |
177 | determined by F<Makefile.PL>. The current set is as follows: |
178 | |
179 | =over 4 |
180 | |
181 | =item ascii and iso-8859-* |
182 | |
183 | That is, all the common 8-bit "western" encodings. |
184 | |
185 | =item IBM-1047 and two other variants of EBCDIC. |
186 | |
187 | These are the same variants that are supported by EBCDIC Perl as |
188 | "native" encodings. They are included to prove "reversibility" of |
189 | some constructs in EBCDIC Perl. |
190 | |
191 | =item symbol and dingbats as used by Tk on X11. |
192 | |
193 | (The reason Encode got started was to support Perl/Tk.) |
194 | |
195 | =back |
196 | |
197 | That set is rather ad hoc and has been driven by the needs of the |
198 | tests rather than the needs of typical applications. It is likely |
199 | to be rationalized. |
200 | |
201 | =cut |