Commit | Line | Data |
85982a32 |
1 | =head1 NAME |
2 | |
3 | Encode::PerlIO -- a detailed document on Encode and PerlIO |
4 | |
5 | =head1 Overview |
6 | |
7 | It is very common to want to do encoding transformations when |
8 | reading or writing files, network connections, pipes etc. |
9 | If Perl is configured to use the new 'perlio' IO system then |
10 | C<Encode> provides a "layer" (See L<PerlIO>) which can transform |
11 | data as it is read or written. |
12 | |
13 | Here is how the blind poet would modernise the encoding: |
14 | |
15 | use Encode; |
16 | open(my $iliad,'<:encoding(iso-8859-7)','iliad.greek'); |
17 | open(my $utf8,'>:utf8','iliad.utf8'); |
18 | my @epic = <$iliad>; |
19 | print $utf8 @epic; |
20 | close($utf8); |
21 | close($iliad); |
22 | |
23 | In addition, the new IO system can be configured to read/write |
24 | UTF-8 encoded characters (which is efficient because Perl stores characters as UTF-8 internally): |
25 | |
26 | open(my $fh,'>:utf8','anything'); |
27 | print $fh "Any \x{0021} string \N{WHITE SMILING FACE}\n"; |
28 | |
29 | Either of the above forms of "layer" specifications can be made the default |
30 | for a lexical scope with the C<use open ...> pragma. See L<open>. |
31 | |
32 | Once a handle is open, its layers can be altered using C<binmode>. |
33 | |
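As a minimal sketch of altering layers with C<binmode>, the following writes the same string through two different C<:encoding()> layers. The helper name C<write_with_layer> and the in-memory target C<\$buf> are assumptions made only to keep the example self-contained; a real file name would be handled the same way.

```perl
use strict;
use warnings;

# Hypothetical helper: open a handle with no encoding layer, then push
# one with binmode() before writing. Returns the octets that were written.
sub write_with_layer {
    my ($layer, $text) = @_;
    my $buf = '';
    open(my $fh, '>', \$buf) or die "open: $!";    # in-memory "file"
    binmode($fh, $layer)     or die "binmode: $!"; # alter the layers
    print {$fh} $text;
    close($fh) or die "close: $!";
    return $buf;
}

my $text   = "caf\x{e9}";    # 4 characters, the last one is U+00E9
my $latin1 = write_with_layer(':encoding(iso-8859-1)', $text);
my $utf8   = write_with_layer(':encoding(UTF-8)',      $text);

printf "latin1: %d bytes, utf8: %d bytes\n", length($latin1), length($utf8);
```

The character U+00E9 occupies one octet in iso-8859-1 and two in UTF-8, so the two results differ in length even though the input string is identical.
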
34 | Without any such configuration, or if Perl itself is built using |
35 | the system's own IO, then write operations assume that the file handle accepts |
36 | only I<bytes> and will C<die> if a character larger than 255 is |
37 | written to the handle. When reading, each octet from the handle |
38 | becomes a byte-in-a-character. Note that this default is the same |
39 | behaviour as bytes-only languages (including Perl before v5.6) would |
40 | have, and is sufficient to handle native 8-bit encodings |
41 | e.g. iso-8859-1, EBCDIC etc. and any legacy mechanisms for handling |
42 | other encodings and binary data. |
43 | |
44 | In other cases it is the program's responsibility to transform |
45 | characters into bytes using the C<Encode> API (see L<Encode>) before doing writes, and to |
46 | transform the bytes read from a handle into characters before doing |
47 | "character operations" (e.g. C<lc>, C</\W+/>, ...). |
48 | |
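When no layer does the work, the transformation has to be done by hand. The following is a small sketch using C<decode> and C<encode> from C<Encode>; the sample octets and variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from a handle without an encoding layer:
# the UTF-8 encoding of GREEK SMALL LETTER ALPHA and BETA.
my $octets = "\xce\xb1\xce\xb2";

# Bytes -> characters before any "character operation".
my $chars = decode('UTF-8', $octets);

# A character operation: uc() now sees two characters, not four bytes.
my $upper = uc $chars;

# Characters -> bytes before writing to a bytes-only handle.
my $out = encode('UTF-8', $upper);

printf "%d octets in, %d characters, %d octets out\n",
    length($octets), length($chars), length($out);
```

Applying C<uc> directly to C<$octets> would leave the data unchanged, because the individual bytes are not lowercase letters in any sense Perl knows about.
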
49 | You can also use PerlIO to convert larger amounts of data you don't |
50 | want to bring into memory. For example, to convert between ISO-8859-1 |
51 | (Latin 1) and UTF-8 (or UTF-EBCDIC on EBCDIC machines): |
52 | |
53 | open(F, "<:encoding(iso-8859-1)", "data.txt") or die $!; |
54 | open(G, ">:utf8", "data.utf") or die $!; |
55 | while (<F>) { print G } |
56 | |
57 | # Could also do "print G <F>" but that would pull |
58 | # the whole file into memory just to write it out again. |
59 | |
60 | More examples: |
61 | |
62 | open(my $f, "<:encoding(cp1252)") |
63 | open(my $g, ">:encoding(iso-8859-2)") |
64 | open(my $h, ">:encoding(latin9)") # iso-8859-15 |
65 | |
66 | See also L<encoding> for how to change the default encoding of the |
67 | data in your script. |
68 | |
69 | =head1 How does it work? |
70 | |
71 | Here is a crude diagram of how filehandle, PerlIO, and Encode |
72 | interact. |
73 | |
74 | filehandle <-> PerlIO PerlIO <-> scalar (read/printed) |
75 | \ / |
76 | Encode |
77 | |
78 | When PerlIO receives data from either direction, it fills the buffer |
79 | (currently 1024 bytes) and passes the buffer to Encode. Encode tries |
80 | to convert the valid part and passes it back to PerlIO, leaving invalid |
81 | parts (usually a partial character) in the buffer. PerlIO then appends more |
82 | data to the buffer, calls Encode again, and so on until the data stream ends. |
83 | |
84 | To do so, PerlIO always calls the (de|en)code methods with CHECK set to 1. |
85 | This ensures that the method stops at the right place when it |
86 | encounters a partial character. The following is what happens when |
87 | PerlIO and Encode try to encode (from utf8) a string more than 1024 bytes |
88 | long and the buffer boundary happens to fall in the middle of a character. |
89 | |
90 | A B C .... ~ \x{3000} .... |
91 | 41 42 43 .... 7E e3 80 80 .... |
92 | <- buffer ---------------> |
93 | << encoded >>>>>>>>>> |
94 | <- next buffer ------ |
95 | |
96 | Encode converts from the beginning to \x7E, leaving \xe3 in the buffer |
97 | because it is invalid (a partial character). |
98 | |
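The same stop-at-the-partial-character behaviour can be imitated from user code. The sketch below is not PerlIO's internal implementation; it reads fixed-size chunks and calls C<decode> with C<Encode::FB_QUIET>, which returns what could be converted and leaves the unprocessed octets in the source buffer. The helper name C<decode_in_chunks> and the tiny chunk size are assumptions chosen to force a split character.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 octets chunk by chunk, carrying any
# trailing partial character over to the next chunk.
sub decode_in_chunks {
    my ($octets, $chunk_size) = @_;
    my ($buffer, $string) = ('', '');
    open(my $fh, '<:raw', \$octets) or die "open: $!";
    # Append each new chunk after whatever was left over last time.
    while (read($fh, $buffer, $chunk_size, length $buffer)) {
        # FB_QUIET: convert the valid prefix, keep the remainder in $buffer.
        $string .= decode('UTF-8', $buffer, Encode::FB_QUIET);
    }
    close($fh);
    return ($string, $buffer);
}

# "AB", then U+3000 encoded as e3 80 80, then "C". With 3-byte chunks the
# first chunk ends after \xe3, in the middle of the character.
my ($string, $leftover) = decode_in_chunks("AB\xe3\x80\x80C", 3);

printf "%d characters decoded, %d octets left over\n",
    length($string), length($leftover);
```

After the first chunk only "AB" is decoded and C<\xe3> stays in the buffer; the second read completes the character, so nothing is lost at the boundary.
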
99 | Unfortunately, this scheme does not work well with escape-based |
100 | encodings such as ISO-2022-JP. Let's see what happens in that case |
101 | in the next section. |
102 | |
103 | =head1 BUGS |
104 | |
105 | Now let's see what happens when you try to decode from ISO-2022-JP and |
106 | the buffer cuts in the middle of a character: |
107 | |
108 | JIS208-ESC \x{5f3e} |
109 | A B C .... ~ \e $ B |DAN | .... |
110 | 41 42 43 .... 7E 1b 24 42 43 46 .... |
111 | <- buffer ---------------------------> |
112 | << encoded >>>>>>>>>>>>>>>>>>>>>>> |
113 | |
114 | As you see, the next buffer begins with \x43. But \x43 is 'C' in |
115 | ASCII, which is wrong in this case because we are now in the JIS X 0208 |
116 | area, so it has to convert \x43\x46, not \x43. Unlike utf8 and EUC, |
117 | in an escape-based encoding you can't tell whether a given octet is a whole |
118 | character or just part of one. |
119 | |
120 | There are actually several ways to solve this problem, but none of |
121 | them is fast enough to be practical. From Encode's point of view |
122 | the easiest solution is for PerlIO to implement a line buffer instead |
123 | of a fixed-length buffer, but that would make PerlIO really complicated. |
124 | |
125 | So for the time being, using escape-based encodings in the ":encoding()" |
126 | layer of PerlIO does not work well. |
127 | |
128 | =head2 Workaround |
129 | |
130 | If you still insist, you can at least use ":encoding()" by making sure |
131 | the buffer never gets full. Here is an example. |
132 | |
133 | use FileHandle; |
134 | binmode(STDOUT, ":encoding(7bit-jis)"); |
135 | STDOUT->autoflush(1); # don't forget this! |
136 | for my $l (@lines){ # $l cannot be longer than 1023 bytes |
137 | print $l; |
138 | } |
139 | |
140 | =head2 How can I tell whether my encoding fully supports PerlIO? |
141 | |
142 | As of this writing, any encoding whose class belongs to Encode::XS or |
143 | Encode::Unicode works. The Encode module has a C<perlio_ok> method which you |
144 | can call before applying a PerlIO encoding layer to the filehandle. Here is |
145 | an example: |
146 | |
147 | my $use_perlio = perlio_ok($enc); |
148 | my $layer = $use_perlio ? "<:encoding($enc)" : "<:raw"; |
149 | open my $fh, $layer, $file or die "$file : $!"; |
150 | while(<$fh>){ |
151 | $_ = decode($enc, $_) unless $use_perlio; |
152 | # .... |
153 | } |
154 | |
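A self-contained variant of the same idea is sketched below. The helper name C<read_lines> and the in-memory sample data are hypothetical; C<perlio_ok> has to be imported explicitly because C<Encode> does not export it by default.

```perl
use strict;
use warnings;
use Encode qw(decode perlio_ok);

# Hypothetical helper: return the decoded lines of $source, which may be a
# file name or a reference to a scalar holding raw octets.
sub read_lines {
    my ($enc, $source) = @_;
    my $use_perlio = perlio_ok($enc);
    # Let PerlIO decode when the encoding supports it, else read raw octets.
    my $layer = $use_perlio ? "<:encoding($enc)" : "<:raw";
    open(my $fh, $layer, $source) or die "open: $!";
    my @lines;
    while (my $line = <$fh>) {
        # Fall back to decoding each line by hand.
        $line = decode($enc, $line) unless $use_perlio;
        push @lines, $line;
    }
    close($fh);
    return @lines;
}

# Two lines of iso-8859-1 text held in memory instead of a file.
my $octets = "na\xefve\ncaf\xe9\n";
my @lines  = read_lines('iso-8859-1', \$octets);

printf "%d lines read\n", scalar @lines;
```

Either branch yields the same character strings, so the caller does not need to know which path was taken.
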
155 | =head1 SEE ALSO |
156 | |
157 | L<Encode::Encoding>, |
158 | L<Encode::Supported>, |
159 | L<Encode>, |
160 | L<encoding>, |
161 | L<perlebcdic>, |
162 | L<perlfunc/open>, |
163 | L<perlunicode>, |
164 | L<utf8>, |
165 | the Perl Unicode Mailing List E<lt>perl-unicode@perl.orgE<gt> |
166 | |
167 | |
168 | =cut |
169 | |