Add the Encode extension. The code is still largely just skeleton.
package Encode;

$VERSION = 0.01;

require DynaLoader;
require Exporter;

@ISA = qw(Exporter DynaLoader);

@EXPORT_OK =
    qw(
       bytes_to_utf8
       utf8_to_bytes
       chars_to_utf8
       utf8_to_chars
       utf8_to_chars_check
       bytes_to_chars
       chars_to_bytes
       from_to
       is_utf8
       on_utf8
       off_utf8
       utf_to_utf
      );

bootstrap Encode ();

=pod

=head1 NAME

Encode - character encodings

=head2 TERMINOLOGY

=over

=item *

I<char>: a character in the range 0..maxint (at least 2**32-1)

=item *

I<byte>: a character in the range 0..255

=back

The marker [INTERNAL] flags internal implementation details. These are
meant only for those who think they know what they are doing, and they
may change in future releases.

=head2 bytes

=over 4

=item *

    bytes_to_utf8(STRING [, FROM])

The bytes in STRING are recoded in-place into UTF-8. If no FROM is
specified, the bytes are expected to be encoded in US-ASCII or ISO
8859-1 (Latin 1). Returns the new size of STRING, or C<undef> if
there's a failure.

[INTERNAL] The UTF-8 flag of STRING is also turned on.

=item *

    utf8_to_bytes(STRING [, TO [, CHECK]])

The UTF-8 in STRING is decoded in-place into bytes. If no TO encoding
is specified, the bytes are expected to be encoded in US-ASCII or ISO
8859-1 (Latin 1). Returns the new size of STRING, or C<undef> if
there's a failure.

What if there are characters > 255? What if the UTF-8 in STRING is
malformed? See L</"Handling Malformed Data">.

[INTERNAL] The UTF-8 flag of STRING is not checked.

=back
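
As a rough illustration of the default (Latin 1) in-place recoding
described above, the following sketch uses Perl's built-in
C<utf8::upgrade>, C<utf8::downgrade> and C<utf8::is_utf8> instead of
this module's functions, which are still only skeletons. Treating these
built-ins as stand-ins for C<bytes_to_utf8> and C<utf8_to_bytes> is an
assumption of the example, not a description of this module's
implementation.

    # Assumption: utf8::upgrade/utf8::downgrade stand in for
    # bytes_to_utf8/utf8_to_bytes called without FROM or TO.
    my $string = "caf\xE9";               # four Latin 1 bytes

    my $size = utf8::upgrade($string);    # internal form becomes UTF-8, flag on
    print "octets as UTF-8: $size\n";     # the e-acute now takes two octets
    print "flag on\n" if utf8::is_utf8($string);

    # Going back fails if a character does not fit into one byte;
    # a true second argument makes that failure non-fatal.
    my $wide = "\x{263A}";
    utf8::downgrade($wide, 1) or print "cannot downgrade\n";
    utf8::downgrade($string) and print "back to bytes\n";

The character count of C<$string> stays the same throughout; only the
internal representation and the UTF-8 flag change.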

=head2 chars

=over 4

=item *

    chars_to_utf8(STRING)

The chars in STRING are encoded in-place into UTF-8. Returns the new
size of STRING, or C<undef> if there's a failure.

No assumptions are made on the encoding of the chars. If you want to
assume that the chars are Unicode and to trap illegal Unicode
characters, you must use C<from_to('Unicode', ...)>.

[INTERNAL] The UTF-8 flag of STRING is also turned on.

=item *

    utf8_to_chars(STRING)

The UTF-8 in STRING is decoded in-place into chars. Returns the new
size of STRING, or C<undef> if there's a failure.

If the UTF-8 in STRING is malformed, C<undef> is returned and an
optional lexical warning (category utf8) is given.

[INTERNAL] The UTF-8 flag of STRING is not checked.

=item *

    utf8_to_chars_check(STRING [, CHECK])

(Note the special naming of this interface: a two-argument
utf8_to_chars() has different semantics.)

The UTF-8 in STRING is decoded in-place into chars. Returns the new
size of STRING, or C<undef> if there is a failure.

What if the UTF-8 in STRING is malformed? See L</"Handling Malformed Data">.

[INTERNAL] The UTF-8 flag of STRING is not checked.

=back

=head2 chars With Encoding

=over 4

=item *

    chars_to_utf8(STRING, FROM [, CHECK])

The chars in STRING encoded in FROM are recoded in-place into UTF-8.
Returns the new size of STRING, or C<undef> if there's a failure.

No assumptions are made on the encoding of the chars. If you want to
assume that the chars are Unicode and to trap illegal Unicode
characters, you must use C<from_to('Unicode', ...)>.

[INTERNAL] The UTF-8 flag of STRING is also turned on.

=item *

    utf8_to_chars(STRING, TO [, CHECK])

The UTF-8 in STRING is decoded in-place into chars encoded in TO.
Returns the new size of STRING, or C<undef> if there's a failure.

What if the UTF-8 in STRING is malformed? See L</"Handling Malformed Data">.

[INTERNAL] The UTF-8 flag of STRING is not checked.

=item *

    bytes_to_chars(STRING, FROM [, CHECK])

The bytes in STRING encoded in FROM are recoded in-place into chars.
Returns the new size of STRING in bytes, or C<undef> if there's a
failure.

What if the mapping is impossible? See L</"Handling Malformed Data">.

=item *

    chars_to_bytes(STRING, TO [, CHECK])

The chars in STRING are recoded in-place to bytes encoded in TO.
Returns the new size of STRING in bytes, or C<undef> if there's a
failure.

What if the mapping is impossible? See L</"Handling Malformed Data">.

=item *

    from_to(STRING, FROM, TO [, CHECK])

The chars in STRING encoded in FROM are recoded in-place into TO.
Returns the new size of STRING, or C<undef> if there's a failure.

What if mapping between the encodings is impossible?
See L</"Handling Malformed Data">.

[INTERNAL] If TO is UTF-8, the UTF-8 flag of STRING is also turned on.

=back

=head2 Testing For UTF-8

=over 4

=item *

    is_utf8(STRING [, CHECK])

[INTERNAL] Tests whether the UTF-8 flag is turned on in STRING.
If CHECK is true, also checks that the data in STRING is
well-formed UTF-8. Returns true if successful, false otherwise.

=back

=head2 Toggling UTF-8-ness

=over 4

=item *

    on_utf8(STRING)

[INTERNAL] Turns on the UTF-8 flag in STRING. The data in STRING is
B<not> checked for being well-formed UTF-8. Do not use unless you
B<know> that STRING is well-formed UTF-8. Returns the previous
state of the UTF-8 flag (so do I<not> treat the return value as an
indicator of success or failure), or C<undef> if STRING is not a string.

=item *

    off_utf8(STRING)

[INTERNAL] Turns off the UTF-8 flag in STRING. Do not use frivolously.
Returns the previous state of the UTF-8 flag (so do I<not> treat the
return value as an indicator of success or failure), or C<undef> if
STRING is not a string.

=back

=head2 UTF-16 and UTF-32 Encodings

=over 4

=item *

    utf_to_utf(STRING, FROM, TO [, CHECK])

The data in STRING is converted from Unicode Transfer Encoding FROM to
Unicode Transfer Encoding TO. Both FROM and TO may be any of the
following tags (case-insensitive, with or without 'utf' or 'utf-' prefix):

    tag     meaning

    '7'     UTF-7
    '8'     UTF-8
    '16be'  UTF-16 big-endian
    '16le'  UTF-16 little-endian
    '16'    UTF-16 native-endian
    '32be'  UTF-32 big-endian
    '32le'  UTF-32 little-endian
    '32'    UTF-32 native-endian

UTF-16 is also known as UCS-2 (16-bit or 2-byte chunks), and UTF-32 as
UCS-4 (32-bit or 4-byte chunks). Returns the new size of STRING, or
C<undef> if there's a failure.

What if FROM is UTF-8 and the UTF-8 in STRING is malformed? See
L</"Handling Malformed Data">.

[INTERNAL] Even if CHECK is true and FROM is UTF-8, the UTF-8 flag of
STRING is not checked. If TO is UTF-8, the UTF-8 flag of STRING is also
turned on. Identical FROM and TO are fine.

=back
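
The tag rules above (case-insensitive, optional 'utf' or 'utf-'
prefix) could be normalized along the following lines. The helper name
C<normalize_utf_tag> and the C<%KNOWN_TAG> table are hypothetical and
only illustrate the rule; they are not part of this module.

    # Hypothetical helper, not part of Encode: reduce a tag such as
    # 'UTF-16LE', 'utf16le' or '16le' to the bare form used in the table.
    my %KNOWN_TAG = map { $_ => 1 } qw(7 8 16be 16le 16 32be 32le 32);

    sub normalize_utf_tag {
        my ($tag) = @_;
        my $bare = lc $tag;       # tags are case-insensitive
        $bare =~ s/^utf-?//;      # drop an optional 'utf' or 'utf-' prefix
        return $KNOWN_TAG{$bare} ? $bare : undef;
    }

    print normalize_utf_tag('UTF-16LE'), "\n";    # prints 16le
    print normalize_utf_tag('utf8'), "\n";        # prints 8

An unknown tag yields C<undef>, mirroring the failure convention used by
the functions documented here.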

=head2 Handling Malformed Data

If CHECK is not set, C<undef> is returned. If the data is supposed to
be UTF-8, an optional lexical warning (category utf8) is given. If
CHECK is true but not a code reference, the function dies. If CHECK is
a code reference, it is called with the arguments

    (MALFORMED_STRING, STRING_FROM_SO_FAR, STRING_TO_SO_FAR)

Two return values are expected from the call: the string to be used in
the result string in place of the malformed section, and the length of
the malformed section in bytes.
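
A CHECK code reference following the calling convention above might
look like the sketch below. The replacement policy (substitute U+FFFD,
skip one byte) is an assumption made for illustration, and the callback
is invoked by hand here because the conversion routines are still
skeletons.

    # Illustrative policy (an assumption): replace each malformed byte
    # with U+FFFD REPLACEMENT CHARACTER, encoded as UTF-8.
    my $check = sub {
        my ($malformed, $from_so_far, $to_so_far) = @_;
        my $replacement = "\xEF\xBF\xBD";    # U+FFFD as UTF-8 bytes
        my $skip        = 1;                 # malformed section length in bytes
        return ($replacement, $skip);
    };

    # Call the callback directly to show the two return values.
    my ($replacement, $skip) = $check->("\xFF", "abc", "abc");
    printf "skip %d byte(s), insert %d byte(s)\n", $skip, length $replacement;

Such a code reference would then be passed as the CHECK argument, for
example to C<utf8_to_chars_check(STRING, CHECK)>.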

=cut

sub bytes_to_utf8 {
    &_bytes_to_utf8;
}

sub utf8_to_bytes {
    &_utf8_to_bytes;
}

sub chars_to_utf8 {
    &_chars_to_utf8;
}

sub utf8_to_chars {
    &_utf8_to_chars;
}

sub utf8_to_chars_check {
    &_utf8_to_chars_check;
}

sub bytes_to_chars {
    &_bytes_to_chars;
}

sub chars_to_bytes {
    &_chars_to_bytes;
}

sub from_to {
    &_from_to;
}

sub is_utf8 {
    &_is_utf8;
}

sub on_utf8 {
    &_on_utf8;
}

sub off_utf8 {
    &_off_utf8;
}

sub utf_to_utf {
    &_utf_to_utf;
}
