1 | =head1 NAME |
2 | |
3 | Encode::Details - implementation details of Encode.pm |
4 | |
5 | =head1 DESCRIPTION |
6 | |
7 | The C<Encode> module provides the interfaces between Perl's strings |
8 | and the rest of the system. Perl strings are sequences of B<characters>. |
9 | |
10 | The repertoire of characters that Perl can represent is at least that |
11 | defined by the Unicode Consortium. On most platforms the ordinal |
12 | value of a character (as returned by C<ord(ch)>) is the "Unicode |
13 | codepoint" for the character (the exceptions are those platforms where |
14 | the legacy encoding is some variant of EBCDIC rather than a super-set |
15 | of ASCII - see L<perlebcdic>). |
16 | |
17 | Traditionally computer data has been moved around in 8-bit chunks |
18 | often called "bytes". These chunks are also known as "octets" in |
19 | networking standards. Perl is widely used to manipulate data of |
20 | many types - not only strings of characters representing human or |
21 | computer languages but also "binary" data being the machine's representation |
22 | of numbers, pixels in an image - or just about anything. |
23 | |
24 | When Perl is processing "binary data" the programmer wants Perl to process |
25 | "sequences of bytes". This is not a problem for Perl - as a byte has 256 |
26 | possible values it easily fits in Perl's much larger "logical character". |
27 | |
28 | Due to size concerns, each of the B<CJK> (Chinese, Japanese & Korean) modules |
29 | is not loaded into memory until the first time it is used. Although you |
30 | don't have to C<use> the corresponding B<Encode::>(B<TW>|B<CN>|B<JP>|B<KR>) |
31 | modules first, be aware that those encodings will not be in C<%encodings> |
32 | until their module is loaded (either implicitly through using encodings |
33 | contained in the same module, or via an explicit C<use>). |
34 | |
35 | =head2 TERMINOLOGY |
36 | |
37 | =over 4 |
38 | |
39 | =item * |
40 | |
41 | I<character>: a character in the range 0..(2**32-1) (or more). |
42 | (What Perl's strings are made of.) |
43 | |
44 | =item * |
45 | |
46 | I<byte>: a character in the range 0..255 |
47 | (A special case of a Perl character.) |
48 | |
49 | =item * |
50 | |
51 | I<octet>: 8 bits of data, with ordinal values 0..255 |
52 | (Term for bytes passed to or from a non-Perl context, e.g. disk file.) |
53 | |
54 | =back |
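The distinction between characters and octets can be seen with a small
sketch. It assumes the C<encode> and C<decode> functions described under
L</"PERL ENCODING API">; the sample string is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Three characters; the last one (U+263A) is well above 255.
my $string = "A\x{E9}\x{263A}";

# Encoding turns characters into octets suitable for a file or socket.
my $octets = encode("UTF-8", $string);

printf "characters: %d\n", length $string;    # counts characters
printf "octets:     %d\n", length $octets;    # counts octets

# Decoding reverses the process.
my $round_trip = decode("UTF-8", $octets);
print "round trip ok\n" if $round_trip eq $string;
```

Here C<length> reports 3 for the character string and 6 for the octet
sequence, because the three characters need 1, 2 and 3 octets in UTF-8.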
55 | |
56 | The marker [INTERNAL] marks Internal Implementation Details, in |
57 | general meant only for those who think they know what they are doing, |
58 | and such details may change in future releases. |
59 | |
60 | =head1 ENCODINGS |
61 | |
62 | =head2 Characteristics of an Encoding |
63 | |
64 | An encoding has a "repertoire" of characters that it can represent, |
65 | and for each representable character there is at least one sequence of |
66 | octets that represents it. |
67 | |
68 | =head2 Types of Encodings |
69 | |
70 | Encodings can be divided into the following types: |
71 | |
72 | =over 4 |
73 | |
74 | =item * Fixed length 8-bit (or less) encodings. |
75 | |
76 | Each character is a single octet so may have a repertoire of up to |
77 | 256 characters. ASCII and iso-8859-* are typical examples. |
78 | |
79 | =item * Fixed length 16-bit encodings |
80 | |
81 | Each character is two octets so may have a repertoire of up to |
82 | 65 536 characters. Unicode's UCS-2 is an example. Also used for |
83 | encodings for East Asian languages. |
84 | |
85 | =item * Fixed length 32-bit encodings. |
86 | |
87 | Not really very "encoded" encodings. The Unicode code points |
88 | are just represented as 4-octet integers. None the less because |
89 | different architectures use different representations of integers |
90 | (so called "endian") there are at least two distinct encodings. |
91 | |
92 | =item * Multi-byte encodings |
93 | |
94 | The number of octets needed to represent a character varies. |
95 | UTF-8 is a particularly complex but regular case of a multi-byte |
96 | encoding. Several East Asian countries use a multi-byte encoding |
97 | where 1-octet is used to cover western roman characters and Asian |
98 | characters get 2-octets. |
99 | (UTF-16 is strictly a multi-byte encoding taking either 2 or 4 octets |
100 | to represent a Unicode code point.) |
101 | |
102 | =item * "Escape" encodings. |
103 | |
104 | These encodings embed "escape sequences" into the octet sequence |
105 | which describe how the following octets are to be interpreted. |
106 | The iso-2022-* family is typical. Following the escape sequence |
107 | octets are encoded by an "embedded" encoding (which will be one |
108 | of the above types) until another escape sequence switches to |
109 | a different "embedded" encoding. |
110 | |
111 | These schemes are very flexible and can handle mixed languages but are |
112 | very complex to process (and have state). No escape encodings are |
113 | implemented for Perl yet. |
114 | |
115 | =back |
116 | |
117 | =head2 Specifying Encodings |
118 | |
119 | Encodings can be specified to the API described below in two ways: |
120 | |
121 | =over 4 |
122 | |
123 | =item 1. By name |
124 | |
125 | Encoding names are strings with characters taken from a restricted |
126 | repertoire. See L</"Encoding Names">. |
127 | |
128 | =item 2. As an object |
129 | |
130 | Encoding objects are returned by C<find_encoding($name, [$skip_external])>. |
131 | If the second parameter is true, Encode will refrain from loading external |
132 | modules for CJK encodings. |
133 | |
134 | =back |
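As a minimal sketch of both ways of specifying an encoding, the following
looks an encoding up by an alias and then uses the returned object
directly. It assumes the C<name> and C<encode> methods described in
L</"IMPLEMENTATION CLASSES">; the sample text is illustrative.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look the encoding up once by name (here via the alias "latin1") ...
my $enc = find_encoding("latin1")
    or die "encoding not available";

# ... and ask the object for its canonical name.
my $canonical = $enc->name;
print "canonical name: $canonical\n";

# The object can then be used in place of the name.
my $text   = "caf\x{E9}";
my $octets = $enc->encode($text);
printf "%d octets\n", length $octets;
```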
135 | |
136 | =head2 Encoding Names |
137 | |
138 | Encoding names are case insensitive. White space in names is ignored. |
139 | In addition an encoding may have aliases. Each encoding has one |
140 | "canonical" name. The "canonical" name is chosen from the names of |
141 | the encoding by picking the first in the following sequence: |
142 | |
143 | =over 4 |
144 | |
145 | =item * The MIME name as defined in IETF RFCs. |
146 | |
147 | =item * The name in the IANA registry. |
148 | |
149 | =item * The name used by the organization that defined it. |
150 | |
151 | =back |
152 | |
153 | Because of all the alias issues, and because in the general case |
154 | encodings have state, C<Encode> uses the encoding object internally |
155 | once an operation is in progress. |
156 | |
157 | As of Perl 5.8.0, at least the following encodings are recognized |
158 | (the => marks aliases): |
159 | |
160 | ASCII |
161 | |
162 | US-ASCII => ASCII |
163 | |
164 | The Unicode: |
165 | |
166 | UTF-8 |
167 | UTF-16 |
168 | UCS-2 |
169 | |
170 | ISO 10646-1 => UCS-2 |
171 | |
172 | The ISO 8859 and KOI: |
173 | |
174 | ISO 8859-1 ISO 8859-6 ISO 8859-11 KOI8-F |
175 | ISO 8859-2 ISO 8859-7 (12 doesn't exist) KOI8-R |
176 | ISO 8859-3 ISO 8859-8 ISO 8859-13 KOI8-U |
177 | ISO 8859-4 ISO 8859-9 ISO 8859-14 |
178 | ISO 8859-5 ISO 8859-10 ISO 8859-15 |
179 | ISO 8859-16 |
180 | |
181 | Latin1 => 8859-1 Latin6 => 8859-10 |
182 | Latin2 => 8859-2 Latin7 => 8859-13 |
183 | Latin3 => 8859-3 Latin8 => 8859-14 |
184 | Latin4 => 8859-4 Latin9 => 8859-15 |
185 | Latin5 => 8859-9 Latin10 => 8859-16 |
186 | |
187 | Cyrillic => 8859-5 |
188 | Arabic => 8859-6 |
189 | Greek => 8859-7 |
190 | Hebrew => 8859-8 |
191 | Thai => 8859-11 |
192 | TIS620 => 8859-11 |
193 | |
194 | The CJKV: Chinese, Japanese, Korean, Vietnamese: |
195 | |
196 | ISO 2022 ISO 2022 JP-1 JIS 0201 GB 1988 Big5 EUC-CN |
197 | ISO 2022 CN ISO 2022 JP-2 JIS 0208 GB 2312 HZ EUC-JP |
198 | ISO 2022 JP ISO 2022 KR JIS 0210 GB 12345 CNS 11643 EUC-JP-0212 |
199 | Shift-JIS GBK Big5-HKSCS EUC-KR |
200 | VISCII ISO-IR-165 |
201 | |
202 | (Due to size concerns, additional Chinese encodings including C<GB 18030>, |
203 | C<EUC-TW> and C<BIG5PLUS> are distributed separately on CPAN, under the name |
204 | L<Encode::HanExtra>.) |
205 | |
206 | The PC codepages: |
207 | |
208 | CP37 CP852 CP861 CP866 CP949 CP1251 CP1256 |
209 | CP424 CP855 CP862 CP869 CP950 CP1252 CP1257 |
210 | CP737 CP856 CP863 CP874 CP1006 CP1253 CP1258 |
211 | CP775 CP857 CP864 CP932 CP1047 CP1254 |
212 | CP850 CP860 CP865 CP936 CP1250 CP1255 |
213 | |
214 | WinLatin1 => CP1252 |
215 | WinLatin2 => CP1250 |
216 | WinCyrillic => CP1251 |
217 | WinGreek => CP1253 |
218 | WinTurkish => CP1254 |
219 | WinHebrew => CP1255 |
220 | WinArabic => CP1256 |
221 | WinBaltic => CP1257 |
222 | WinVietnamese => CP1258 |
223 | |
224 | (All the CPI<NNN...> are available also as IBMI<NNN...>.) |
225 | |
226 | The Mac codepages: |
227 | |
228 | MacCentralEuropean MacJapanese |
229 | MacCroatian MacRoman |
230 | MacCyrillic MacRomanian |
231 | MacDingbats MacSami |
232 | MacGreek MacThai |
233 | MacIcelandic MacTurkish |
234 | MacUkraine |
235 | |
236 | Miscellaneous: |
237 | |
238 | 7bit-greek IR-197 |
239 | 7bit-kana NeXTstep |
240 | 7bit-latin1 POSIX-BC |
241 | DingBats Roman8 |
242 | GSM 0338 Symbol |
243 | |
244 | =head2 Encoding Classification |
245 | |
246 | Encodings |
247 | |
248 | US-ASCII UTF-8 KOI8-R ISO-8859-* |
249 | ISO-2022-CN ISO-2022-JP ISO-2022-KR Big5 |
250 | EUC-CN EUC-JP EUC-KR |
251 | |
252 | are L<http://www.iana.org/assignments/character-sets>-registered |
253 | as preferred MIME names and may probably be used over the Internet. |
254 | So is |
255 | |
256 | Shift_JIS |
257 | |
258 | but despite its wide spread it bears the label of being |
259 | Microsoft proprietary. |
260 | |
261 | UTF-16 KOI8-U ISO-2022-JP-2 |
262 | |
263 | are IANA-registered preferred MIME names but probably should |
264 | be avoided as encoding for web pages due to lack of browser |
265 | support. |
266 | |
267 | |
268 | ISO-2022 (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM) |
269 | ISO-2022-JP-1 (http://www.faqs.org/rfcs/rfc2237.html) |
270 | ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html) |
271 | GBK |
272 | VISCII |
273 | GB 12345 (only planes 1 and 2 available) |
274 | GB 18030 |
275 | CNS 11643 |
276 | |
277 | are totally valid encodings but not registered at IANA. |
278 | |
279 | BIG5PLUS |
280 | EUC-JP-0212 (Encode::lib::Encode::Tcl::Extended) |
281 | |
282 | are a bit proprietary |
283 | |
284 | You may probably get some info on CJK encodings at |
285 | |
286 | brief description for most of the mentioned CJK encodings |
287 | http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html |
288 | |
289 | several years old, but still useful |
290 | http://www.oreilly.com/people/authors/lunde/cjk_inf.html |
291 | |
292 | and some in-depth reading for the heroes :-) |
293 | http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM (eq ISO-2022) |
294 | http://www.faqs.org/rfcs/rfc1345.txt |
295 | |
296 | |
297 | =head1 PERL ENCODING API |
298 | |
299 | =head2 Generic Encoding Interface |
300 | |
301 | =over 4 |
302 | |
303 | =item * |
304 | |
305 | $bytes = encode(ENCODING, $string[, CHECK]) |
306 | |
307 | Encodes string from Perl's internal form into I<ENCODING> and returns |
308 | a sequence of octets. For CHECK see L</"Handling Malformed Data">. |
309 | |
310 | For example to convert (internally UTF-8 encoded) Unicode data |
311 | to octets: |
312 | |
313 | $octets = encode("utf8", $unicode); |
314 | |
315 | =item * |
316 | |
317 | $string = decode(ENCODING, $bytes[, CHECK]) |
318 | |
319 | Decode sequence of octets assumed to be in I<ENCODING> into Perl's |
320 | internal form and returns the resulting string. For CHECK see |
321 | L</"Handling Malformed Data">. |
322 | |
323 | For example to convert ISO-8859-1 data to UTF-8: |
324 | |
325 | $utf8 = decode("latin1", $latin1); |
326 | |
327 | =item * |
328 | |
329 | from_to($string, FROM_ENCODING, TO_ENCODING[, CHECK]) |
330 | |
331 | Convert B<in-place> the data between two encodings. How did the data |
332 | in $string originally get to be in FROM_ENCODING? Either using |
333 | encode() or through PerlIO: See L</"Encoding and IO">. For CHECK |
334 | see L</"Handling Malformed Data">. |
335 | |
336 | For example to convert ISO-8859-1 data to UTF-8: |
337 | |
338 | from_to($data, "iso-8859-1", "utf-8"); |
339 | |
340 | and to convert it back: |
341 | |
342 | from_to($data, "utf-8", "iso-8859-1"); |
343 | |
344 | Note that because the conversion happens in place, the data to be |
345 | converted cannot be a string constant; it must be a scalar variable. |
346 | |
347 | =back |
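The three functions above fit together as in the following sketch. The
input octets stand in for data read from an ISO-8859-1 source; the
variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode from_to);

# Octets as they might arrive from an ISO-8859-1 file.
my $latin1_octets = "caf\xE9";

# decode(): octets -> Perl characters.
my $string = decode("iso-8859-1", $latin1_octets);

# encode(): Perl characters -> octets in another encoding.
my $utf8_octets = encode("UTF-8", $string);

# from_to() performs both steps in place on a scalar variable.
my $data = $latin1_octets;
from_to($data, "iso-8859-1", "UTF-8");

print "same result\n" if $data eq $utf8_octets;
printf "%d characters, %d UTF-8 octets\n",
    length $string, length $utf8_octets;
```

Going through C<decode> and C<encode> separately keeps the character
string available for further processing, whereas C<from_to> is convenient
when only the converted octets are needed.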
348 | |
349 | =head2 Handling Malformed Data |
350 | |
351 | If the data is malformed and CHECK is not set, C<undef> is returned. If |
352 | the data is supposed to be UTF-8, an optional lexical warning (category |
353 | utf8) is also given. If CHECK is true but not a code reference, the call dies. |
354 | |
355 | It would be desirable to have a way to indicate that the transform should use |
356 | the encoding's "replacement character" - no such mechanism is defined yet. |
357 | |
358 | It is also planned to allow I<CHECK> to be a code reference. |
359 | |
360 | This is not yet implemented as there are design issues with what its |
361 | arguments should be and how it returns its results. |
362 | |
363 | =over 4 |
364 | |
365 | =item Scheme 1 |
366 | |
367 | Passed remaining fragment of string being processed. |
368 | Modifies it in place to remove bytes/characters it can understand |
369 | and returns a string used to represent them. |
370 | e.g. |
371 | |
372 | sub fixup { |
373 | my $ch = substr($_[0],0,1,''); |
374 | return sprintf('\x{%02X}',ord($ch)); |
375 | } |
376 | |
377 | This scheme is close to how underlying C code for Encode works, but gives |
378 | the fixup routine very little context. |
379 | |
380 | =item Scheme 2 |
381 | |
382 | Passed original string, and an index into it of the problem area, and |
383 | output string so far. Appends what it will to output string and |
384 | returns new index into original string. For example: |
385 | |
386 | sub fixup { |
387 | # my ($s,$i,$d) = @_; |
388 | my $ch = substr($_[0],$_[1],1); |
389 | $_[2] .= sprintf('\x{%02X}',ord($ch)); |
390 | return $_[1]+1; |
391 | } |
392 | |
393 | This scheme gives maximal control to the fixup routine but is more |
394 | complicated to code, and may need internals of Encode to be tweaked to |
395 | keep original string intact. |
396 | |
397 | =item Other Schemes |
398 | |
399 | Hybrids of above. |
400 | |
401 | Multiple return values rather than in-place modifications. |
402 | |
403 | Index into the string could be pos($str) allowing s/\G...//. |
404 | |
405 | =back |
406 | |
407 | =head2 UTF-8 / utf8 |
408 | |
409 | The Unicode consortium defines the UTF-8 standard as a way of encoding |
410 | the entire Unicode repertoire as sequences of octets. This encoding is |
411 | expected to become very widespread. Perl can use this form internally |
412 | to represent strings, so conversions to and from this form are |
413 | particularly efficient (as octets in memory do not have to change, |
414 | just the meta-data that tells Perl how to treat them). |
415 | |
416 | =over 4 |
417 | |
418 | =item $bytes = encode_utf8($string); |
419 | |
420 | The characters that comprise $string are encoded in Perl's superset of UTF-8 |
421 | and the resulting octets returned as a sequence of bytes. All possible |
422 | characters have a UTF-8 representation so this function cannot fail. |
423 | |
424 | =item $string = decode_utf8($bytes [,CHECK]); |
425 | |
426 | The sequence of octets represented by $bytes is decoded from UTF-8 |
427 | into a sequence of logical characters. Not all sequences of octets |
428 | form valid UTF-8 encodings, so it is possible for this call to fail. |
429 | For CHECK see L</"Handling Malformed Data">. |
430 | |
431 | =back |
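A short sketch of the two functions, using a single character that needs
three octets in UTF-8. The hex dump is only there to make the octets
visible.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $string = "\x{263A}";              # one character
my $octets = encode_utf8($string);    # its UTF-8 octets

# Show the octets as hex.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $octets;
print "octets: $hex\n";               # octets: E2 98 BA

my $back = decode_utf8($octets);
print "round trip ok\n" if $back eq $string;
```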
432 | |
433 | =head2 Other Encodings of Unicode |
434 | |
435 | UTF-16 is similar to UCS-2, 16 bit or 2-byte chunks. UCS-2 can only |
436 | represent 0..0xFFFF, while UTF-16 has a I<surrogate pair> scheme which |
437 | allows it to cover the whole Unicode range. |
438 | |
439 | Surrogates are code points set aside to encode the 0x10000..0x10FFFF |
440 | range of Unicode code points in pairs of 16-bit units. The I<high |
441 | surrogates> are the range 0xD800..0xDBFF, and the I<low surrogates> |
442 | are the range 0xDC00..0xDFFF. The surrogate encoding is |
443 | |
444 | $hi = int(($uni - 0x10000) / 0x400) + 0xD800; |
445 | $lo = ($uni - 0x10000) % 0x400 + 0xDC00; |
446 | |
447 | and the decoding is |
448 | |
449 | $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00); |
450 | |
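As a worked example of the surrogate arithmetic, the sketch below splits
U+1F600 into a surrogate pair and joins it again. The helper names
C<to_surrogates> and C<from_surrogates> are hypothetical and not part of
C<Encode>; note the C<int()>, which is needed because Perl's C</> is not
integer division.

```perl
use strict;
use warnings;

# Split a code point in 0x10000..0x10FFFF into (high, low) surrogates.
sub to_surrogates {
    my ($uni) = @_;
    my $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    my $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
    return ($hi, $lo);
}

# Combine a (high, low) surrogate pair back into one code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600 -> 0x%04X 0x%04X\n", $hi, $lo;    # 0xD83D 0xDE00
printf "back    -> U+%X\n", from_surrogates($hi, $lo);
```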
451 | Encode implements big-endian UCS-2 aliased to "iso-10646-1" as that |
452 | happens to be the name used by that representation when used with X11 |
453 | fonts. |
454 | |
455 | UTF-32 or UCS-4 is 32-bit or 4-byte chunks. Perl's logical characters |
456 | can be considered as being in this form without encoding. An encoding |
457 | to transfer strings in this form (e.g. to write them to a file) would |
458 | need to |
459 | |
460 | pack('L*', unpack('U*', $string)); # native |
461 | or |
462 | pack('V*', unpack('U*', $string)); # little-endian |
463 | or |
464 | pack('N*', unpack('U*', $string)); # big-endian |
465 | |
466 | depending on the endianness required. |
467 | |
468 | No UTF-32 encodings are implemented yet. |
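Until such encodings exist, the C<pack> approach can be tried directly.
The following sketch packs a two-character string in both byte orders and
prints the result as hex; the sample string is illustrative.

```perl
use strict;
use warnings;

my $string = "A\x{263A}";

# One 32-bit integer per character, in each byte order.
my $be = pack('N*', unpack('U*', $string));    # big-endian
my $le = pack('V*', unpack('U*', $string));    # little-endian

printf "big-endian:    %s\n", unpack('H*', $be);    # 000000410000263a
printf "little-endian: %s\n", unpack('H*', $le);    # 410000003a260000
```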
469 | |
470 | Both UCS-2 and UCS-4 style encodings can have "byte order marks" by |
471 | representing the code point 0xFEFF as the very first thing in a file. |
472 | |
473 | =head2 Listing available encodings |
474 | |
475 | use Encode qw(encodings); |
476 | @list = encodings(); |
477 | |
478 | Returns a list of the canonical names of the available encodings. |
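A small sketch of listing encodings. It assumes the class-method form
C<< Encode->encodings(":all") >> found in released versions of C<Encode>,
which also reports encodings whose modules have not been loaded yet; the
exact list depends on the installation, so no output is shown.

```perl
use strict;
use warnings;
use Encode;

# Encodings that are already loaded.
my @loaded = Encode->encodings();

# ":all" also includes encodings that would be loaded on demand.
my @all = Encode->encodings(":all");

printf "%d loaded, %d available in total\n", scalar @loaded, scalar @all;
print "$_\n" for @all;
```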
479 | |
480 | =head2 Defining Aliases |
481 | |
482 | use Encode qw(define_alias); |
483 | define_alias( newName => ENCODING); |
484 | |
485 | Allows newName to be used as an alias for ENCODING. ENCODING may be |
486 | either the name of an encoding or an encoding object (as above). |
487 | |
488 | Currently I<newName> can be specified in the following ways: |
489 | |
490 | =over 4 |
491 | |
492 | =item As a simple string. |
493 | |
494 | =item As a qr// compiled regular expression, e.g.: |
495 | |
496 | define_alias( qr/^iso8859-(\d+)$/i => '"iso-8859-$1"' ); |
497 | |
498 | In this case if I<ENCODING> is not a reference it is C<eval>-ed to |
499 | allow C<$1> etc. to be substituted. The example is one way to alias names as |
500 | used in X11 fonts to the MIME names for the iso-8859-* |
501 | family. Note the double quotes inside the single quotes. If you are |
502 | using a regex here, you have to quote this way or the substitution won't work. |
503 | |
504 | =item As a code reference, e.g.: |
505 | |
506 | define_alias( sub { return /^iso8859-(\d+)$/i ? "iso-8859-$1" : undef } , ''); |
507 | |
508 | In this case C<$_> will be set to the name that is being looked up and |
509 | I<ENCODING> is passed to the sub as its first argument. The example |
510 | is another way to alias names as used in X11 fonts to the MIME |
511 | names for the iso-8859-* family. |
512 | |
513 | =back |
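The string and regex forms can be exercised as in the sketch below. The
alias names C<my-latin> and C<x11-iso8859-N> are made up for this example
so that they do not collide with aliases C<Encode> may already define.

```perl
use strict;
use warnings;
use Encode qw(define_alias find_encoding);

# Simple string alias (hypothetical name).
define_alias( "my-latin" => "iso-8859-1" );

# Regex alias: the right-hand side is eval-ed so that $1 is substituted.
define_alias( qr/^x11-iso8859-(\d+)$/i => '"iso-8859-$1"' );

for my $name ("my-latin", "x11-iso8859-2") {
    my $enc = find_encoding($name) or die "no encoding for $name";
    printf "%-14s => %s\n", $name, $enc->name;
}
```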
514 | |
515 | =head2 Defining Encodings |
516 | |
517 | use Encode qw(define_encoding); |
518 | define_encoding( $object, 'canonicalName' [,alias...]); |
519 | |
520 | Causes I<canonicalName> to be associated with I<$object>. The object |
521 | should provide the interface described in L</"IMPLEMENTATION CLASSES"> |
522 | below. If more than two arguments are provided then additional |
523 | arguments are taken as aliases for I<$object> as for C<define_alias>. |
524 | |
525 | =head1 Encoding and IO |
526 | |
527 | It is very common to want to do encoding transformations when |
528 | reading or writing files, network connections, pipes etc. |
529 | If Perl is configured to use the new 'perlio' IO system then |
530 | C<Encode> provides a "layer" (See L<perliol>) which can transform |
531 | data as it is read or written. |
532 | |
533 | Here is how the blind poet would modernise the encoding: |
534 | |
535 | use Encode; |
536 | open(my $iliad,'<:encoding(iso-8859-7)','iliad.greek'); |
537 | open(my $utf8,'>:utf8','iliad.utf8'); |
538 | my @epic = <$iliad>; |
539 | print $utf8 @epic; |
540 | close($utf8); |
541 | close($iliad); |
542 | |
543 | In addition the new IO system can also be configured to read/write |
544 | UTF-8 encoded characters (as noted above this is efficient): |
545 | |
546 | open(my $fh,'>:utf8','anything'); |
547 | print $fh "Any \x{0021} string \N{SMILEY FACE}\n"; |
548 | |
549 | Either of the above forms of "layer" specifications can be made the default |
550 | for a lexical scope with the C<use open ...> pragma. See L<open>. |
551 | |
552 | Once a handle is open, its layers can be altered using C<binmode>. |
553 | |
554 | Without any such configuration, or if Perl itself is built using |
555 | the system's own IO, then write operations assume that the file handle accepts |
556 | only I<bytes> and will C<die> if a character larger than 255 is |
557 | written to the handle. When reading, each octet from the handle |
558 | becomes a byte-in-a-character. Note that this default is the same |
559 | behaviour as bytes-only languages (including Perl before v5.6) would |
560 | have, and is sufficient to handle native 8-bit encodings |
561 | e.g. iso-8859-1, EBCDIC etc. and any legacy mechanisms for handling |
562 | other encodings and binary data. |
563 | |
564 | In other cases it is the program's responsibility to transform |
565 | characters into bytes using the API above before doing writes, and to |
566 | transform the bytes read from a handle into characters before doing |
567 | "character operations" (e.g. C<lc>, C</\W+/>, ...). |
568 | |
569 | You can also use PerlIO to convert larger amounts of data you don't |
570 | want to bring into memory. For example to convert between ISO-8859-1 |
571 | (Latin 1) and UTF-8 (or UTF-EBCDIC in EBCDIC machines): |
572 | |
573 | open(F, "<:encoding(iso-8859-1)", "data.txt") or die $!; |
574 | open(G, ">:utf8", "data.utf") or die $!; |
575 | while (<F>) { print G } |
576 | |
577 | # Could also do "print G <F>" but that would pull |
578 | # the whole file into memory just to write it out again. |
579 | |
580 | More examples: |
581 | |
582 | open(my $f, "<:encoding(cp1252)") |
583 | open(my $g, ">:encoding(iso-8859-2)") |
584 | open(my $h, ">:encoding(latin9)") # iso-8859-15 |
585 | |
586 | See L<PerlIO> for more information. |
587 | |
588 | See also L<encoding> for how to change the default encoding of the |
589 | data in your script. |
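The layer mechanism can be tried without touching the disk by opening
handles on in-memory scalars. This is only a sketch: the scalars stand in
for the files used in the examples above, and it assumes a Perl built
with the 'perlio' IO system.

```perl
use strict;
use warnings;

# Stand-in for a file on disk: ISO-8859-1 octets.
my $latin1_octets = "caf\xE9\n";
my $utf8_octets   = '';

open(my $in,  '<:encoding(iso-8859-1)', \$latin1_octets) or die "open: $!";
open(my $out, '>:encoding(UTF-8)',      \$utf8_octets)   or die "open: $!";

# Each line is decoded on input and encoded again on output.
while (my $line = <$in>) {
    print {$out} $line;
}
close($in);
close($out) or die "close: $!";

printf "%d octets in, %d octets out\n",
    length $latin1_octets, length $utf8_octets;
```

Inside the loop C<$line> holds characters, so character operations such
as C<lc> or regular expressions behave as expected there.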
590 | |
591 | =head1 Encoding How to ... |
592 | |
593 | To do: |
594 | |
595 | =over 4 |
596 | |
597 | =item * IO with mixed content (faking iso-2022-*) |
598 | |
599 | Encode::JP implements its own iso-2022 routines, however. |
600 | |
601 | =item * MIME's Content-Length: |
602 | |
603 | =item * UTF-8 strings in binary data. |
604 | |
605 | =item * Perl/Encode wrappers on non-Unicode XS modules. |
606 | |
607 | =back |
608 | |
609 | =head1 Messing with Perl's Internals |
610 | |
611 | The following API uses parts of Perl's internals in the current |
612 | implementation. As such they are efficient, but may change. |
613 | |
614 | =over 4 |
615 | |
616 | =item is_utf8(STRING [, CHECK]) |
617 | |
618 | [INTERNAL] Test whether the UTF-8 flag is turned on in the STRING. |
619 | If CHECK is true, also checks the data in STRING for being well-formed |
620 | UTF-8. Returns true if successful, false otherwise. |
621 | |
622 | =item _utf8_on(STRING) |
623 | |
624 | [INTERNAL] Turn on the UTF-8 flag in STRING. The data in STRING is |
625 | B<not> checked for being well-formed UTF-8. Do not use unless you |
626 | B<know> that the STRING is well-formed UTF-8. Returns the previous |
627 | state of the UTF-8 flag (so please don't test the return value as |
628 | I<not> success or failure), or C<undef> if STRING is not a string. |
629 | |
630 | =item _utf8_off(STRING) |
631 | |
632 | [INTERNAL] Turn off the UTF-8 flag in STRING. Do not use frivolously. |
633 | Returns the previous state of the UTF-8 flag (so please don't test the |
634 | return value as I<not> success or failure), or C<undef> if STRING is |
635 | not a string. |
636 | |
637 | =back |
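To see what the flag reports, the following sketch compares a sequence of
UTF-8 octets with the character string decoded from it. It only inspects
the flag with C<is_utf8> and does not change it.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

my $octets = "caf\xC3\xA9";               # UTF-8 octets, flag off
my $string = decode("UTF-8", $octets);    # characters, flag on

printf "octets: flag %s, length %d\n",
    (is_utf8($octets) ? "on" : "off"), length $octets;
printf "string: flag %s, length %d\n",
    (is_utf8($string) ? "on" : "off"), length $string;
```

The same five octets sit in memory in both cases; the flag decides
whether Perl treats them as five bytes or as four characters.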
638 | |
639 | =head1 IMPLEMENTATION CLASSES |
640 | |
641 | As mentioned above encodings are (in the current implementation at least) |
642 | defined by objects. The mapping of encoding name to object is via the |
643 | C<%encodings> hash. |
644 | |
645 | The values of the hash can currently be either strings or objects. |
646 | The string form may go away in the future. The string form occurs |
647 | when C<encodings()> has scanned C<@INC> for loadable encodings but has |
648 | not actually loaded the encoding in question. This is because the |
649 | current "loading" process is all Perl and a bit slow. |
650 | |
651 | Once an encoding is loaded, the value of the hash is the object which |
652 | implements the encoding. The object should provide the following |
653 | interface: |
654 | |
655 | =over 4 |
656 | |
657 | =item -E<gt>name |
658 | |
659 | Should return the string representing the canonical name of the encoding. |
660 | |
661 | =item -E<gt>new_sequence |
662 | |
663 | This is a placeholder for encodings with state. It should return an |
664 | object which implements this interface; all current implementations |
665 | return the original object. |
666 | |
667 | =item -E<gt>encode($string,$check) |
668 | |
669 | Should return the octet sequence representing I<$string>. If I<$check> |
670 | is true it should modify I<$string> in place to remove the converted |
671 | part (i.e. the whole string unless there is an error). If an error |
672 | occurs it should return the octet sequence for the fragment of string |
673 | that has been converted, and modify $string in-place to remove the |
674 | converted part leaving it starting with the problem fragment. |
675 | |
676 | If check is false then C<encode> should make a "best effort" to |
677 | convert the string - for example by using a replacement character. |
678 | |
679 | =item -E<gt>decode($octets,$check) |
680 | |
681 | Should return the string that I<$octets> represents. If I<$check> is |
682 | true it should modify I<$octets> in place to remove the converted part |
683 | (i.e. the whole sequence unless there is an error). If an error |
684 | occurs it should return the fragment of string that has been |
685 | converted, and modify $octets in-place to remove the converted part |
686 | leaving it starting with the problem fragment. |
687 | |
688 | If check is false then C<decode> should make a "best effort" to |
689 | convert the string - for example by using Unicode's "\x{FFFD}" as a |
690 | replacement character. |
691 | |
692 | =back |
693 | |
694 | It should be noted that the check behaviour is different from the |
695 | outer public API. The logic is that the "unchecked" case is useful |
696 | when encoding is part of a stream which may be reporting errors |
697 | (e.g. STDERR). In such cases it is desirable to get everything |
698 | through somehow without causing additional errors which obscure the |
699 | original one. Also the encoding is best placed to know what the |
700 | correct replacement character is, so if that is the desired behaviour |
701 | then letting low level code do it is the most efficient. |
702 | |
703 | In contrast if check is true, the scheme above allows the encoding to |
704 | do as much as it can and tell the layer above how much that was. What is |
705 | lacking at present is a mechanism to report what went wrong. The most |
706 | likely interface will be an additional method call to the object, or |
707 | perhaps (to avoid forcing per-stream objects on otherwise stateless |
708 | encodings) an additional parameter. |
709 | |
710 | It is also highly desirable that encoding classes inherit from |
711 | C<Encode::Encoding> as a base class. This allows that class to define |
712 | additional behaviour for all encoding objects. For example built in |
713 | Unicode, UCS-2 and UTF-8 classes use: |
714 | |
715 | package Encode::MyEncoding; |
716 | use base qw(Encode::Encoding); |
717 | |
718 | __PACKAGE__->Define(qw(myCanonical myAlias)); |
719 | |
720 | to create an object with bless {Name => ...},$class, and call |
721 | define_encoding. They inherit their C<name> method from |
722 | C<Encode::Encoding>. |
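Putting the interface together, a complete pure-Perl encoding class might
look like the sketch below. The class name C<Encode::Rot13Sketch> and the
encoding name C<x-rot13-sketch> are hypothetical, ROT13 is used only
because it is trivial to implement, and the sketch assumes the
C<Encode::Encoding> base class as shipped with released versions of
C<Encode>.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

package Encode::Rot13Sketch;    # hypothetical class name
use parent qw(Encode::Encoding);

# Creates the object and registers it under a made-up canonical name.
__PACKAGE__->Define('x-rot13-sketch');

sub encode {
    my ($obj, $str, $chk) = @_;
    $str =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    $_[1] = '' if $chk;    # the whole string was converted
    return $str;
}

sub decode {
    my ($obj, $buf, $chk) = @_;
    $buf =~ tr/A-Za-z/N-ZA-Mn-za-m/;    # ROT13 is its own inverse
    $_[1] = '' if $chk;    # the whole sequence was converted
    return $buf;
}

package main;

my $octets = encode('x-rot13-sketch', 'Hello');
my $string = decode('x-rot13-sketch', $octets);
print "$octets\n";    # Uryyb
print "$string\n";    # Hello
```

Because the class inherits from C<Encode::Encoding>, it does not need to
supply its own C<name> method.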
723 | |
724 | =head2 Compiled Encodings |
725 | |
726 | F<Encode.xs> provides a class C<Encode::XS> which provides the |
727 | interface described above. It calls a generic octet-sequence to |
728 | octet-sequence "engine" that is driven by tables (defined in |
729 | F<encengine.c>). The same engine is used for both encode and |
730 | decode. C<Encode::XS>'s C<encode> forces Perl's characters to their |
731 | UTF-8 form and then treats them as just another multibyte |
732 | encoding. C<Encode::XS>'s C<decode> transforms the sequence and then |
733 | turns on the UTF-8 flag, as that is the form that the tables are |
734 | defined to produce. For details of the engine see the comments in |
735 | F<encengine.c>. |
736 | |
737 | The tables are produced by the Perl script F<compile> (the name needs |
738 | to change so we can eventually install it somewhere). F<compile> can |
739 | currently read two formats: |
740 | |
741 | =over 4 |
742 | |
743 | =item *.enc |
744 | |
745 | This is a coined format used by Tcl. It is documented in |
746 | Encode/EncodeFormat.pod. |
747 | |
748 | =item *.ucm |
749 | |
750 | This is the semi-standard format used by IBM's ICU package. |
751 | |
752 | =back |
753 | |
754 | F<compile> can write the following forms: |
755 | |
756 | =over 4 |
757 | |
758 | =item *.ucm |
759 | |
760 | See above - the F<Encode/*.ucm> files provided with the distribution have |
761 | been created from the original Tcl .enc files using this approach. |
762 | |
763 | =item *.c |
764 | |
765 | Produces tables as C data structures - this is used to build in encodings |
766 | into F<Encode.so>/F<Encode.dll>. |
767 | |
768 | =item *.xs |
769 | |
770 | In theory this allows encodings to be stand-alone loadable Perl |
771 | extensions. The process has not yet been tested. The plan is to use |
772 | this approach for large East Asian encodings. |
773 | |
774 | =back |
775 | |
776 | The set of encodings built-in to F<Encode.so>/F<Encode.dll> is |
777 | determined by F<Makefile.PL>. The current set is as follows: |
778 | |
779 | =over 4 |
780 | |
781 | =item ascii and iso-8859-* |
782 | |
783 | That is all the common 8-bit "western" encodings. |
784 | |
785 | =item IBM-1047 and two other variants of EBCDIC. |
786 | |
787 | These are the same variants that are supported by EBCDIC Perl as |
788 | "native" encodings. They are included to prove "reversibility" of |
789 | some constructs in EBCDIC Perl. |
790 | |
791 | =item symbol and dingbats as used by Tk on X11. |
792 | |
793 | (The reason Encode got started was to support Perl/Tk.) |
794 | |
795 | =back |
796 | |
797 | That set is rather ad hoc and has been driven by the needs of the |
798 | tests rather than the needs of typical applications. It is likely |
799 | to be rationalized. |
800 | |
801 | =head1 SEE ALSO |
802 | |
803 | L<perlunicode>, L<perlebcdic>, L<perlfunc/open>, L<PerlIO>, L<encoding>, |
804 | L<utf8>, the Perl Unicode Mailing List E<lt>perl-unicode@perl.orgE<gt> |
805 | |
806 | =cut |