Upgrade to Encode 1.56, from Dan Kogai.
package Encode::Encoding;
# Base class for classes which implement encodings
use strict;
our $VERSION = do { my @r = (q$Revision: 1.27 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };

sub Define
{
    my $obj = shift;
    my $canonical = shift;
    $obj = bless { Name => $canonical },$obj unless ref $obj;
    # warn "$canonical => $obj\n";
    Encode::define_encoding($obj, $canonical, @_);
}

sub name { shift->{'Name'} }

# Temporary legacy methods
sub toUnicode { shift->decode(@_) }
sub fromUnicode { shift->encode(@_) }

sub new_sequence { return $_[0] }

sub perlio_ok { 0 }

sub needs_lines { 0 }

sub DESTROY {}

1;
__END__

=head1 NAME

Encode::Encoding - Encode Implementation Base Class

=head1 SYNOPSIS

  package Encode::MyEncoding;
  use base qw(Encode::Encoding);

  __PACKAGE__->Define(qw(myCanonical myAlias));

=head1 DESCRIPTION

As mentioned in L<Encode>, encodings are (in the current
implementation at least) defined by objects. The mapping of encoding
name to object is via the C<%encodings> hash.

The values of the hash can currently be either strings or objects.
The string form may go away in the future. The string form occurs
when C<encodings()> has scanned C<@INC> for loadable encodings but has
not actually loaded the encoding in question. This is because the
current "loading" process is all Perl and a bit slow.

Once an encoding is loaded, the value of the hash is the object which
implements the encoding. The object should provide the following
interface:

=over 4

=item -E<gt>name

MUST return the string representing the canonical name of the encoding.

=item -E<gt>new_sequence

This is a placeholder for encodings with state. It should return an
object which implements this interface. All current implementations
return the original object.

=item -E<gt>encode($string,$check)

MUST return the octet sequence representing I<$string>.

=over 2

=item *

If I<$check> is true, it SHOULD modify I<$string> in place to remove
the converted part (i.e. the whole string unless there is an error).
If perlio_ok() is true, SHOULD becomes MUST.

=item *

If an error occurs, it SHOULD return the octet sequence for the
fragment of string that has been converted and modify I<$string> in
place to remove the converted part, leaving it starting with the
problem fragment. If perlio_ok() is true, SHOULD becomes MUST.

=item *

If I<$check> is false then C<encode> MUST make a "best effort" to
convert the string - for example, by using a replacement character.

=back

=item -E<gt>decode($octets,$check)

MUST return the string that I<$octets> represents.

=over 2

=item *

If I<$check> is true, it SHOULD modify I<$octets> in place to remove
the converted part (i.e. the whole sequence unless there is an
error). If perlio_ok() is true, SHOULD becomes MUST.

=item *

If an error occurs, it SHOULD return the fragment of string that has
been converted and modify I<$octets> in place to remove the converted
part, leaving it starting with the problem fragment. If perlio_ok() is
true, SHOULD becomes MUST.

=item *

If I<$check> is false then C<decode> should make a "best effort" to
convert the string - for example, by using Unicode's "\x{FFFD}" as a
replacement character.

=back

=item -E<gt>perlio_ok()

If you want your encoding to work with PerlIO, you MUST define this
method so that it returns 1 when PerlIO is enabled. Here is an
example:

  sub perlio_ok { exists $INC{"PerlIO/encoding.pm"} }

By default, this method is defined as follows:

  sub perlio_ok { 0 }

=item -E<gt>needs_lines()

If your encoding can work with PerlIO but needs line buffering, you
MUST define this method so it returns true. 7-bit ISO-2022 encodings
are one example that needs this. When this method is missing, false
is assumed.

=back

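As a rough illustration of the interface above, the following sketch
defines a toy 7-bit encoding. The class name C<Encode::MyAscii>, the
canonical name C<x-my-ascii> and the helper C<_convert> are
hypothetical, and the code treats I<$check> as a plain boolean. It is
a minimal sketch of the contract described above, not how the
built-in encodings are implemented.

```perl
use strict;
use warnings;

package Encode::MyAscii;    # hypothetical class name, for illustration only
use Encode ();              # provides Encode::define_encoding
use base qw(Encode::Encoding);

__PACKAGE__->Define('x-my-ascii');    # hypothetical canonical name

# Hypothetical shared worker: copy code points below 0x80 and, on
# anything else, either stop (checked) or emit $subst (best effort).
sub _convert {
    my ($input, $check, $subst) = @_;
    my ($out, $done) = ('', 0);
    for my $ch (split //, $input) {
        if    (ord($ch) < 0x80) { $out .= $ch }
        elsif ($check)          { last }
        else                    { $out .= $subst }
        $done++;
    }
    return ($out, $done);
}

sub encode {
    my ($self, $string, $check) = @_;
    my ($octets, $done) = _convert($string, $check, '?');
    # $_[1] aliases the caller's variable, so this removes the
    # converted part in place (the caller must pass a variable).
    $_[1] = substr($string, $done) if $check;
    return $octets;
}

sub decode {
    my ($self, $octets, $check) = @_;
    my ($string, $done) = _convert($octets, $check, "\x{FFFD}");
    $_[1] = substr($octets, $done) if $check;
    return $string;
}

package main;

my $enc  = Encode::find_encoding('x-my-ascii');
my $text = "abc\x{263A}def";
my $out  = $enc->encode($text, 1);
# $out holds "abc"; $text now starts with the unconvertible character
```

With I<$check> false, the same call would instead return the whole
string with C<?> substituted for the character it cannot represent
and leave the caller's variable untouched.
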
It should be noted that the I<$check> behaviour is different from the
outer public API. The logic is that the "unchecked" case is useful
when the encoding is part of a stream which may be reporting errors
(e.g. STDERR). In such cases, it is desirable to get everything
through somehow without causing additional errors which obscure the
original one. Also, the encoding is best placed to know what the
correct replacement character is, so if that is the desired behaviour
then letting low-level code do it is the most efficient.

By contrast, if I<$check> is true, the scheme above allows the
encoding to do as much as it can and tell the layer above how much
that was. What is lacking at present is a mechanism to report what
went wrong. The most likely interface will be an additional method
call to the object, or perhaps (to avoid forcing per-stream objects
on otherwise stateless encodings) an additional parameter.

It is also highly desirable that encoding classes inherit from
C<Encode::Encoding> as a base class. This allows that class to define
additional behaviour for all encoding objects. For example, built-in
Unicode, UCS-2, and UTF-8 classes use

  package Encode::MyEncoding;
  use base qw(Encode::Encoding);

  __PACKAGE__->Define(qw(myCanonical myAlias));

to create an object with C<< bless {Name => ...}, $class >>, and call
C<define_encoding>. They inherit their C<name> method from
C<Encode::Encoding>.

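The effect of C<Define> can be observed with a short script. This is
only a sketch: it reuses the placeholder names from the snippet above,
loads L<Encode> explicitly so that C<Encode::define_encoding> is
available, and assumes that C<Encode::find_encoding> is the way to
look the object up again.

```perl
use strict;
use warnings;

package Encode::MyEncoding;    # placeholder class name
use Encode ();                 # provides Encode::define_encoding
use base qw(Encode::Encoding);

# Blesses { Name => 'myCanonical' } into this class and registers it;
# 'myAlias' is passed through to define_encoding as an extra name.
__PACKAGE__->Define(qw(myCanonical myAlias));

package main;

my $obj = Encode::find_encoding('myCanonical');
print ref($obj), "\n";     # class of the registered object
print $obj->name, "\n";    # name() is inherited from Encode::Encoding
```

A real encoding class would also supply its own C<encode> and
C<decode> methods; this one only demonstrates registration and the
inherited C<name>.
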
=head2 Compiled Encodings

For the sake of speed and efficiency, most of the encodings are now
supported via a I<compiled form>: XS modules generated from UCM
files. Encode provides the enc2xs tool to achieve that. Please see
L<enc2xs> for more details.

=head1 SEE ALSO

L<perlmod>, L<enc2xs>

=begin future

=over 4

=item Scheme 1

The fixup routine gets passed the remaining fragment of string being
processed. It modifies it in place to remove bytes/characters it can
understand and returns a string used to represent them. For example:

  sub fixup {
      # Remove the first character from the caller's string in place
      my $ch = substr($_[0],0,1,'');
      # Single quotes keep \x{...} literal in the sprintf format
      return sprintf('\x{%02X}',ord($ch));
  }

This scheme is close to how the underlying C code for Encode works,
but gives the fixup routine very little context.

=item Scheme 2

The fixup routine gets passed the original string, an index into
it of the problem area, and the output string so far. It appends
what it wants to the output string and returns a new index into the
original string. For example:

  sub fixup {
      # my ($s,$i,$d) = @_;
      my $ch = substr($_[0],$_[1],1);
      # Single quotes keep \x{...} literal in the sprintf format
      $_[2] .= sprintf('\x{%02X}',ord($ch));
      return $_[1]+1;
  }

This scheme gives maximal control to the fixup routine but is more
complicated to code, and may require that the internals of Encode be
tweaked to keep the original string intact.

=item Other Schemes

Hybrids of the above.

Multiple return values rather than in-place modifications.

Index into the string could be C<pos($str)> allowing C<s/\G...//>.

=back

=end future

=cut