Commit | Line | Data |
18586f54 |
1 | package Encode::Encoding; |
2 | # Base class for classes which implement encodings |
3 | use strict; |
6d1c0808 |
4 | our $VERSION = do { my @r = (q$Revision: 1.26 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r }; |
18586f54 |
5 | |
6 | sub Define |
7 | { |
8 | my $obj = shift; |
9 | my $canonical = shift; |
10 | $obj = bless { Name => $canonical },$obj unless ref $obj; |
11 | # warn "$canonical => $obj\n"; |
f2a2953c |
12 | Encode::define_encoding($obj, $canonical, @_); |
18586f54 |
13 | } |
14 | |
15 | sub name { shift->{'Name'} } |
16 | |
17 | # Temporary legacy methods |
18 | sub toUnicode { shift->decode(@_) } |
19 | sub fromUnicode { shift->encode(@_) } |
20 | |
21 | sub new_sequence { return $_[0] } |
22 | |
6d1c0808 |
23 | sub needs_lines { 0 } |
24 | |
284ee456 |
25 | sub DESTROY {} |
26 | |
18586f54 |
27 | 1; |
28 | __END__ |
1b2c56c8 |
29 | |
30 | =head1 NAME |
31 | |
32 | Encode::Encoding - Encode Implementation Base Class |
33 | |
34 | =head1 SYNOPSIS |
35 | |
36 | package Encode::MyEncoding; |
37 | use base qw(Encode::Encoding); |
38 | |
39 | __PACKAGE__->Define(qw(myCanonical myAlias)); |
40 | |
5129552c |
41 | =head1 DESCRIPTION |
1b2c56c8 |
42 | |
43 | As mentioned in L<Encode>, encodings are (in the current |
44 | implementation at least) defined by objects. The mapping of encoding |
45 | name to object is via the C<%encodings> hash. |
46 | |
47 | The values of the hash can currently be either strings or objects. |
48 | The string form may go away in the future. The string form occurs |
49 | when C<encodings()> has scanned C<@INC> for loadable encodings but has |
50 | not actually loaded the encoding in question. This is because the |
51 | current "loading" process is all Perl and a bit slow. |
52 | |
53 | Once an encoding is loaded, the value of the hash is the object which |
54 | implements the encoding. The object should provide the following |
55 | interface: |
56 | |
57 | =over 4 |
58 | |
59 | =item -E<gt>name |
60 | |
61 | Should return the string representing the canonical name of the encoding. |
62 | |
63 | =item -E<gt>new_sequence |
64 | |
65 | This is a placeholder for encodings with state. It should return an |
66 | object which implements this interface; all current implementations |
67 | return the original object. |
68 | |
69 | =item -E<gt>encode($string,$check) |
70 | |
71 | Should return the octet sequence representing I<$string>. If I<$check> |
72 | is true it should modify I<$string> in place to remove the converted |
73 | part (i.e. the whole string unless there is an error). If an error |
74 | occurs it should return the octet sequence for the fragment of string |
75 | that has been converted, and modify I<$string> in place to remove the |
76 | converted part leaving it starting with the problem fragment. |
77 | |
78 | If I<$check> is false then C<encode> should make a "best effort" to |
79 | convert the string - for example by using a replacement character. |
80 | |
81 | =item -E<gt>decode($octets,$check) |
82 | |
83 | Should return the string that I<$octets> represents. If I<$check> is |
84 | true it should modify I<$octets> in place to remove the converted part |
85 | (i.e. the whole sequence unless there is an error). If an error |
86 | occurs it should return the fragment of string that has been |
87 | converted, and modify I<$octets> in place to remove the converted part |
88 | leaving it starting with the problem fragment. |
89 | |
90 | If I<$check> is false then C<decode> should make a "best effort" to |
91 | convert the string - for example by using Unicode's "\x{FFFD}" as a |
92 | replacement character. |
93 | |
94 | =back |
95 | |
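The interface above can be sketched as a small pure-Perl encoding. This is an illustration only, not how any shipped encoding is implemented: the class name C<Encode::AsciiSketch>, the canonical name C<x-ascii-sketch>, the helper C<_convert> and the choice of C<?> as the encode-side replacement are all assumptions made for the example. It treats anything at or above 0x80 as unconvertible and follows the I<$check> contract described above.

```perl
package Encode::AsciiSketch;    # hypothetical class, for illustration only
use strict;
use warnings;
use Encode ();
use base qw(Encode::Encoding);

__PACKAGE__->Define('x-ascii-sketch');    # hypothetical canonical name

# Hypothetical helper: copy units below 0x80, handle the rest per $check.
sub _convert {
    my ($input, $check, $replacement) = @_;
    my $output = '';
    while (length $input) {
        my $unit = substr($input, 0, 1);
        if    (ord($unit) < 0x80) { $output .= $unit }
        elsif ($check)            { last }                       # stop at the problem fragment
        else                      { $output .= $replacement }    # best effort
        substr($input, 0, 1, '');                                # consume the unit
    }
    return ($output, $input);
}

sub encode {
    my ($self, $string, $check) = @_;
    my ($octets, $rest) = _convert($string, $check, '?');
    $_[1] = $rest if $check;    # modify the caller's $string in place
    return $octets;
}

sub decode {
    my ($self, $octets, $check) = @_;
    my ($string, $rest) = _convert($octets, $check, "\x{FFFD}");
    $_[1] = $rest if $check;    # modify the caller's $octets in place
    return $string;
}

package main;

my $enc = Encode::find_encoding('x-ascii-sketch');

my $text    = "abc\x{263A}def";
my $checked = $enc->encode($text, 1);               # converted prefix; $text keeps the rest
my $lossy   = $enc->encode("abc\x{263A}def", 0);    # problem character replaced by '?'
my $decoded = $enc->decode("hi\xFF", 0);            # problem octet replaced by U+FFFD
```

With I<$check> true the caller can inspect what is left in C<$text> to see where conversion stopped; with I<$check> false the input is left alone and the result always covers the whole input.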
96 | It should be noted that the check behaviour is different from the |
97 | outer public API. The logic is that the "unchecked" case is useful |
98 | when encoding is part of a stream which may be reporting errors |
99 | (e.g. STDERR). In such cases it is desirable to get everything |
100 | through somehow without causing additional errors which obscure the |
101 | original one. Also the encoding is best placed to know what the |
102 | correct replacement character is, so if that is the desired behaviour |
103 | then letting low level code do it is the most efficient. |
104 | |
105 | In contrast, if check is true, the scheme above allows the encoding to |
106 | do as much as it can and tell the layer above how much that was. What is |
107 | lacking at present is a mechanism to report what went wrong. The most |
108 | likely interface will be an additional method call to the object, or |
109 | perhaps (to avoid forcing per-stream objects on otherwise stateless |
110 | encodings) an additional parameter. |
111 | |
112 | It is also highly desirable that encoding classes inherit from |
113 | C<Encode::Encoding> as a base class. This allows that class to define |
114 | additional behaviour for all encoding objects. For example, the built-in |
115 | Unicode, UCS-2 and UTF-8 classes use: |
116 | |
117 | package Encode::MyEncoding; |
118 | use base qw(Encode::Encoding); |
119 | |
120 | __PACKAGE__->Define(qw(myCanonical myAlias)); |
121 | |
122 | This creates an object with C<< bless { Name => ... }, $class >> and |
123 | calls C<define_encoding>. These classes inherit their C<name> method |
124 | from C<Encode::Encoding>. |
125 | |
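As a rough check of what C<Define> registers, the sketch below defines an otherwise empty subclass and looks it up again through C<Encode::find_encoding>. The class name C<Encode::DefineSketch> and both encoding names are hypothetical; only C<Define>, C<name> and C<find_encoding> come from Encode itself.

```perl
package Encode::DefineSketch;    # hypothetical class, for illustration only
use strict;
use warnings;
use Encode ();
use base qw(Encode::Encoding);

# First name is the canonical one, any further names are aliases.
__PACKAGE__->Define(qw(x-define-sketch x-define-sketch-alias));

package main;

my $by_name  = Encode::find_encoding('x-define-sketch');
my $by_alias = Encode::find_encoding('x-define-sketch-alias');

print ref($by_name), "\n";       # class of the registered object
print $by_alias->name, "\n";     # canonical name, inherited name() method
```

Both lookups should resolve to the same blessed object, and C<name> reports the canonical name even when the object was found through the alias.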
126 | =head2 Compiled Encodings |
127 | |
67d7b5ef |
128 | For the sake of speed and efficiency, most of the encodings are now |
129 | supported via a I<compiled form>: XS modules generated from UCM |
130 | files. Encode provides the enc2xs tool to achieve that. Please see |
131 | L<enc2xs> for more details. |
1b2c56c8 |
132 | |
67d7b5ef |
133 | =head1 SEE ALSO |
1b2c56c8 |
134 | |
67d7b5ef |
135 | L<perlmod>, L<enc2xs> |
1b2c56c8 |
136 | |
f2a2953c |
137 | =for future |
138 | |
139 | |
140 | =over 4 |
141 | |
142 | =item Scheme 1 |
143 | |
144 | Passed remaining fragment of string being processed. |
145 | Modifies it in place to remove bytes/characters it can understand |
146 | and returns a string used to represent them. |
147 | e.g. |
148 | |
149 | sub fixup { |
150 | my $ch = substr($_[0],0,1,''); |
151 | return sprintf("\\x{%02X}", ord($ch)); |
152 | } |
153 | |
154 | This scheme is close to how the underlying C code for Encode works, but gives |
155 | the fixup routine very little context. |
156 | |
157 | =item Scheme 2 |
158 | |
159 | Passed original string, and an index into it of the problem area, and |
160 | output string so far. Appends what it will to output string and |
161 | returns new index into original string. For example: |
162 | |
163 | sub fixup { |
164 | # my ($s,$i,$d) = @_; |
165 | my $ch = substr($_[0],$_[1],1); |
166 | $_[2] .= sprintf("\\x{%02X}", ord($ch)); |
167 | return $_[1]+1; |
168 | } |
169 | |
170 | This scheme gives maximal control to the fixup routine but is more |
171 | complicated to code, and may need internals of Encode to be tweaked to |
172 | keep original string intact. |
173 | |
174 | =item Other Schemes |
175 | |
176 | Hybrids of above. |
177 | |
178 | Multiple return values rather than in-place modifications. |
179 | |
180 | Index into the string could be C<pos($str)> allowing C<s/\G...//>. |
181 | |
182 | =back |
183 | |
1b2c56c8 |
184 | =cut |