Commit | Line | Data |
18586f54 |
1 | package Encode::Encoding; |
2 | # Base class for classes which implement encodings |
3 | use strict; |
f2a2953c |
4 | our $VERSION = do { my @r = (q$Revision: 1.25 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r }; |
18586f54 |
5 | |
6 | sub Define |
7 | { |
8 | my $obj = shift; |
9 | my $canonical = shift; |
10 | $obj = bless { Name => $canonical },$obj unless ref $obj; |
11 | # warn "$canonical => $obj\n"; |
f2a2953c |
12 | Encode::define_encoding($obj, $canonical, @_); |
18586f54 |
13 | } |
14 | |
15 | sub name { shift->{'Name'} } |
16 | |
17 | # Temporary legacy methods |
18 | sub toUnicode { shift->decode(@_) } |
19 | sub fromUnicode { shift->encode(@_) } |
20 | |
21 | sub new_sequence { return $_[0] } |
22 | |
284ee456 |
23 | sub DESTROY {} |
24 | |
18586f54 |
25 | 1; |
26 | __END__ |
1b2c56c8 |
27 | |
28 | =head1 NAME |
29 | |
30 | Encode::Encoding - Encode Implementation Base Class |
31 | |
32 | =head1 SYNOPSIS |
33 | |
34 | package Encode::MyEncoding; |
35 | use base qw(Encode::Encoding); |
36 | |
37 | __PACKAGE__->Define(qw(myCanonical myAlias)); |
38 | |
5129552c |
39 | =head1 DESCRIPTION |
1b2c56c8 |
40 | |
41 | As mentioned in L<Encode>, encodings are (in the current |
42 | implementation at least) defined by objects. The mapping of encoding |
43 | name to object is via the C<%encodings> hash. |
44 | |
45 | The values of the hash can currently be either strings or objects. |
46 | The string form may go away in the future. The string form occurs |
47 | when C<encodings()> has scanned C<@INC> for loadable encodings but has |
48 | not actually loaded the encoding in question. This is because the |
49 | current "loading" process is all Perl and a bit slow. |
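
The string-or-object arrangement can be pictured with a small sketch. Everything here is hypothetical: the C<%encodings> contents, C<load_encoding>, C<find_encoding_sketch> and the C<My::*> class names are stand-ins, and the assumption that the string form holds a module name is only for illustration. It is not the actual Encode code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the registry: values are either objects
# (already loaded) or plain strings (seen but not yet loaded).
my %encodings = (
    'loaded-enc' => bless({ Name => 'loaded-enc' }, 'My::Loaded'),
    'lazy-enc'   => 'My::Lazy',    # assumed here to be a module name
);

# Hypothetical loader. A real one would require() the module, which would
# then register its own object; this one just builds the object directly.
sub load_encoding {
    my ($class, $name) = @_;
    return bless { Name => $name }, $class;
}

# Look up an encoding by name, loading it on first use.
sub find_encoding_sketch {
    my ($name) = @_;
    my $entry = $encodings{$name};
    return undef unless defined $entry;
    # Replace a string placeholder with the loaded object.
    $entry = $encodings{$name} = load_encoding($entry, $name)
        unless ref $entry;
    return $entry;
}

my $obj = find_encoding_sketch('lazy-enc');
print ref($obj), " handles $obj->{Name}\n";
```

After the first lookup the hash holds the object itself, so the slow loading step is paid only once per encoding.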
50 | |
51 | Once an encoding is loaded, the value of the hash is the object which |
52 | implements the encoding. The object should provide the following |
53 | interface: |
54 | |
55 | =over 4 |
56 | |
57 | =item -E<gt>name |
58 | |
59 | Should return the string representing the canonical name of the encoding. |
60 | |
61 | =item -E<gt>new_sequence |
62 | |
63 | This is a placeholder for encodings with state. It should return an |
64 | object which implements this interface; all current implementations |
65 | return the original object. |
66 | |
67 | =item -E<gt>encode($string,$check) |
68 | |
69 | Should return the octet sequence representing I<$string>. If I<$check> |
70 | is true it should modify I<$string> in place to remove the converted |
71 | part (i.e. the whole string unless there is an error). If an error |
72 | occurs it should return the octet sequence for the fragment of string |
73 | that has been converted, and modify I<$string> in place to remove the |
74 | converted part leaving it starting with the problem fragment. |
75 | |
76 | If I<$check> is false then C<encode> should make a "best effort" to |
77 | convert the string - for example by using a replacement character. |
78 | |
79 | =item -E<gt>decode($octets,$check) |
80 | |
81 | Should return the string that I<$octets> represents. If I<$check> is |
82 | true it should modify I<$octets> in place to remove the converted part |
83 | (i.e. the whole sequence unless there is an error). If an error |
84 | occurs it should return the fragment of string that has been |
85 | converted, and modify I<$octets> in place to remove the converted part |
86 | leaving it starting with the problem fragment. |
87 | |
88 | If I<$check> is false then C<decode> should make a "best effort" to |
89 | convert the string - for example by using Unicode's "\x{FFFD}" as a |
90 | replacement character. |
91 | |
92 | =back |
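
As a minimal sketch of the interface above, the following toy class treats only code points 0 to 127 as convertible. The class name C<My::AsciiSketch>, its C<new> constructor and the choice of C<?> as the replacement octet are assumptions made for illustration. It deliberately does not inherit from C<Encode::Encoding>, so that it runs on its own.

```perl
use strict;
use warnings;

package My::AsciiSketch;    # hypothetical class name

sub new          { my ($class) = @_; return bless { Name => 'ascii-sketch' }, $class }
sub name         { return $_[0]{Name} }
sub new_sequence { return $_[0] }    # stateless, so reuse the object

# Characters to octets. $_[1] is used directly so that the caller's
# string can be modified in place when $check is true.
sub encode {
    my ($self, undef, $check) = @_;
    if ($check) {
        # Remove the leading convertible run from the caller's string
        # and return it, leaving the problem fragment at the front.
        my $done = '';
        $done = $1 if $_[1] =~ s/\A([\x00-\x7F]+)//;
        return $done;
    }
    # Best effort: substitute '?' for anything unrepresentable.
    (my $octets = $_[1]) =~ s/[^\x00-\x7F]/?/g;
    return $octets;
}

# Octets to characters, with the same convention for $check.
sub decode {
    my ($self, undef, $check) = @_;
    if ($check) {
        my $done = '';
        $done = $1 if $_[1] =~ s/\A([\x00-\x7F]+)//;
        return $done;
    }
    # Best effort: use U+FFFD as the replacement character.
    (my $string = $_[1]) =~ s/[^\x00-\x7F]/\x{FFFD}/g;
    return $string;
}

package main;

my $enc    = My::AsciiSketch->new;
my $text   = "abc\x{263A}def";
my $octets = $enc->encode($text, 1);    # $text is shortened in place
print "converted: $octets, left over: ", length($text), " character(s)\n";
```

With a true check value the caller can tell how far conversion got by inspecting what remains in its own variable. With a false value the input is left untouched and a complete, possibly lossy, result comes back.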
93 | |
94 | It should be noted that the check behaviour is different from the |
95 | outer public API. The logic is that the "unchecked" case is useful |
96 | when encoding is part of a stream which may be reporting errors |
97 | (e.g. STDERR). In such cases it is desirable to get everything |
98 | through somehow without causing additional errors which obscure the |
99 | original one. Also the encoding is best placed to know what the |
100 | correct replacement character is, so if that is the desired behaviour |
101 | then letting low level code do it is the most efficient. |
102 | |
103 | In contrast, if I<$check> is true, the scheme above allows the encoding to |
104 | do as much as it can and tell the layer above how much that was. What is |
105 | lacking at present is a mechanism to report what went wrong. The most |
106 | likely interface will be an additional method call to the object, or |
107 | perhaps (to avoid forcing per-stream objects on otherwise stateless |
108 | encodings) an additional parameter. |
109 | |
110 | It is also highly desirable that encoding classes inherit from |
111 | C<Encode::Encoding> as a base class. This allows that class to define |
112 | additional behaviour for all encoding objects. For example, the built-in |
113 | Unicode, UCS-2 and UTF-8 classes use: |
114 | |
115 | package Encode::MyEncoding; |
116 | use base qw(Encode::Encoding); |
117 | |
118 | __PACKAGE__->Define(qw(myCanonical myAlias)); |
119 | |
120 | This creates an object with C<< bless {Name => ...}, $class >> and calls |
121 | C<define_encoding>. The classes inherit their C<name> method from |
122 | C<Encode::Encoding>. |
123 | |
124 | =head2 Compiled Encodings |
125 | |
67d7b5ef |
126 | For the sake of speed and efficiency, most of the encodings are now |
127 | supported via a I<compiled form>: XS modules generated from UCM |
128 | files. Encode provides the enc2xs tool to achieve that. Please see |
129 | L<enc2xs> for more details. |
1b2c56c8 |
130 | |
67d7b5ef |
131 | =head1 SEE ALSO |
1b2c56c8 |
132 | |
67d7b5ef |
133 | L<perlmod>, L<enc2xs> |
1b2c56c8 |
134 | |
f2a2953c |
135 | =for future |
136 | |
137 | |
138 | =over 4 |
139 | |
140 | =item Scheme 1 |
141 | |
142 | Passed remaining fragment of string being processed. |
143 | Modifies it in place to remove bytes/characters it can understand |
144 | and returns a string used to represent them. |
145 | e.g. |
146 | |
147 | sub fixup { |
148 | my $ch = substr($_[0],0,1,''); |
149 | return sprintf('\x{%02X}', ord($ch)); |
150 | } |
151 | |
152 | This scheme is close to how the underlying C code for Encode works, but |
153 | gives the fixup routine very little context. |
154 | |
155 | =item Scheme 2 |
156 | |
157 | Passed original string, and an index into it of the problem area, and |
158 | output string so far. Appends what it will to output string and |
159 | returns new index into original string. For example: |
160 | |
161 | sub fixup { |
162 | # my ($s,$i,$d) = @_; |
163 | my $ch = substr($_[0],$_[1],1); |
164 | $_[2] .= sprintf('\x{%02X}', ord($ch)); |
165 | return $_[1]+1; |
166 | } |
167 | |
168 | This scheme gives maximal control to the fixup routine but is more |
169 | complicated to code, and may need the internals of Encode to be tweaked |
170 | to keep the original string intact. |
171 | |
172 | =item Other Schemes |
173 | |
174 | Hybrids of the above. |
175 | |
176 | Multiple return values rather than in-place modifications. |
177 | |
178 | Index into the string could be C<pos($str)> allowing C<s/\G...//>. |
179 | |
180 | =back |
181 | |
1b2c56c8 |
182 | =cut |