=head1 NAME

Encode::PerlIO -- a detailed document on Encode and PerlIO

=head1 Overview

It is very common to want to do encoding transformations when
reading or writing files, network connections, pipes, etc.
If Perl is configured to use the new 'perlio' IO system, then
C<Encode> provides a "layer" (see L<PerlIO>) which can transform
data as it is read or written.

Here is how the blind poet would modernise the encoding:

    use Encode;
    open(my $iliad, '<:encoding(iso-8859-7)', 'iliad.greek') or die $!;
    open(my $utf8,  '>:utf8',                 'iliad.utf8')  or die $!;
    my @epic = <$iliad>;
    print $utf8 @epic;
    close($utf8);
    close($iliad);

In addition, the new IO system can be configured to read and write
UTF-8 encoded characters directly, which is efficient:

    use charnames ':full';   # needed for \N{...} on older perls
    open(my $fh,'>:utf8','anything') or die $!;
    print $fh "Any \x{0021} string \N{WHITE SMILING FACE}\n";

Either of the above forms of "layer" specifications can be made the default
for a lexical scope with the C<use open ...> pragma. See L<open>.
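
As a minimal sketch of the pragma, the following writes one character
through a handle that picks up C<:utf8> from C<use open>, then reads the
raw octets back.  The file name is a made-up scratch file, and the sketch
assumes no other default layers (such as C<:crlf>) are in effect.

    use strict;
    use warnings;

    my $file = 'open_pragma_demo.txt';   # hypothetical scratch file

    {
        # Output handles opened in this block default to the :utf8 layer.
        use open OUT => ':utf8';
        open(my $out, '>', $file) or die "$file: $!";
        print $out "\x{263A}\n";
        close($out) or die "$file: $!";
    }

    # Read the file back as plain octets to see what was written.
    open(my $in, '<:raw', $file) or die "$file: $!";
    my $octets = <$in>;
    close($in);
    unlink $file;

    printf "%d octets: %s\n", length($octets),
        join ' ', map { sprintf '%02x', ord } split //, $octets;

Because the pragma is lexically scoped, handles opened outside the block
are unaffected.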

Once a handle is open, its layers can be altered using C<binmode>.
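
A short sketch of altering layers after the fact: the handle is opened
without any layer, and an encoding layer is pushed with C<binmode> before
anything is written.  The file name is hypothetical, and
C<PerlIO::get_layers> is used only to show the resulting layer stack.

    use strict;
    use warnings;

    my $file = 'binmode_demo.txt';   # hypothetical scratch file

    open(my $out, '>', $file) or die "$file: $!";
    binmode($out, ':encoding(iso-8859-1)');   # push a layer after opening
    my @layers = PerlIO::get_layers($out);    # inspect the layer stack
    print $out "caf\x{E9}\n";
    close($out) or die "$file: $!";

    # Read the result back as plain octets.
    open(my $in, '<:raw', $file) or die "$file: $!";
    my $octets = <$in>;
    close($in);
    unlink $file;

    print "layers: @layers\n";
    print length($octets), " octets written\n";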

Without any such configuration, or if Perl itself is built using the
system's own IO, write operations assume that the file handle accepts
only I<bytes> and will C<die> if a character larger than 255 is
written to the handle. When reading, each octet from the handle
becomes a byte-in-a-character. Note that this default is the same
behaviour as bytes-only languages (including Perl before v5.6) would
have, and is sufficient to handle native 8-bit encodings
(e.g. iso-8859-1, EBCDIC) and any legacy mechanisms for handling
other encodings and binary data.

In other cases, it is the program's responsibility to transform
characters into bytes using the C<Encode> API (see L<Encode>) before
doing writes, and to transform the bytes read from a handle into
characters before doing "character operations" (e.g. C<lc>, C</\W+/>, ...).

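The manual approach can be sketched as follows, assuming raw (layer-free)
handles and the C<encode> and C<decode> functions of L<Encode>.  The file
name and the sample text are made up for illustration.

    use strict;
    use warnings;
    use Encode qw(encode decode);

    my $file = 'manual_demo.txt';             # hypothetical scratch file
    my $text = "\x{3B1}\x{3B2}\x{3B3}\n";     # Greek alpha, beta, gamma

    # Writing: turn characters into bytes ourselves.
    open(my $out, '>:raw', $file) or die "$file: $!";
    print $out encode('iso-8859-7', $text);
    close($out) or die "$file: $!";

    # Reading: the handle yields bytes, which we decode by hand.
    open(my $in, '<:raw', $file) or die "$file: $!";
    my $octets = <$in>;
    close($in);
    unlink $file;

    my $chars = decode('iso-8859-7', $octets);

    # Character operations are only meaningful on the decoded string.
    printf "first character upper-cased: U+%04X\n", ord(uc substr($chars, 0, 1));
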
You can also use PerlIO to convert larger amounts of data that you don't
want to bring into memory.  For example, to convert between ISO-8859-1
(Latin 1) and UTF-8 (or UTF-EBCDIC on EBCDIC machines):

    open(F, "<:encoding(iso-8859-1)", "data.txt") or die $!;
    open(G, ">:utf8",                 "data.utf") or die $!;
    while (<F>) { print G }

    # Could also do "print G <F>" but that would pull
    # the whole file into memory just to write it out again.

More examples:

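    # Independent examples; $file is a placeholder for the path to open.
    open(my $f, "<:encoding(cp1252)",     $file) or die $!;
    open(my $g, ">:encoding(iso-8859-2)", $file) or die $!;
    open(my $h, ">:encoding(latin9)",     $file) or die $!;  # iso-8859-15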

See also L<encoding> for how to change the default encoding of the
data in your script.

=head1 How does it work?

Here is a crude diagram of how filehandle, PerlIO, and Encode
interact.

  filehandle <-> PerlIO       PerlIO <-> scalar (read/printed)
                       \     /
                        Encode   

When PerlIO receives data from either direction, it fills the buffer
(currently 1024 bytes) and passes the buffer to Encode.  Encode tries
to convert the valid part and passes it back to PerlIO, leaving the invalid
part (usually a partial character) in the buffer.  PerlIO then appends more
data to the buffer, calls Encode again, and so on until the data stream ends.

To do so, PerlIO always calls the (de|en)code methods with CHECK set to 1.
This ensures that the method stops at the right place when it
encounters a partial character.  The following is what happens when
PerlIO and Encode try to encode (from utf8) data more than 1024 bytes
long and the buffer boundary happens to fall in the middle of a character.

   A   B   C   ....   ~     \x{3000}    ....
  41  42  43   ....  7E   e3   80   80  ....
  <- buffer --------------->
  << encoded >>>>>>>>>>
                       <- next buffer ------

Encode converts from the beginning to \x7E, leaving \xe3 in the buffer
because it is invalid (partial character).
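
The same "convert the valid part, keep the rest" behaviour can be observed
by calling C<decode> directly.  This is only a sketch of the idea: it
decodes rather than encodes, and it uses C<FB_QUIET> as the CHECK value,
which is an assumption chosen for illustration and not necessarily what the
PerlIO layer passes internally.

    use strict;
    use warnings;
    use Encode qw(decode FB_QUIET);

    # "ABC~" followed by U+3000 in UTF-8 (e3 80 80), with the "buffer"
    # boundary falling after the first octet of U+3000.
    my $buffer = "ABC~\xE3";
    my $rest   = "\x80\x80";

    # With FB_QUIET, decode() stops at the partial character and
    # leaves the unprocessed octets in $buffer.
    my $chars = decode('UTF-8', $buffer, FB_QUIET);
    printf "decoded %d characters, %d octet(s) left over\n",
        length($chars), length($buffer);

    # Append more data and call decode() again, as PerlIO would.
    $buffer .= $rest;
    $chars  .= decode('UTF-8', $buffer, FB_QUIET);
    printf "last character: U+%04X\n", ord(substr($chars, -1));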

Unfortunately, this scheme does not work well with escape-based
encodings such as ISO-2022-JP.  Let's see what happens in that case
in the next section.

=head1 BUGS

Now let's see what happens when you try to decode from ISO-2022-JP and
the buffer cuts in the middle of a character.
 
                          JIS208-ESC   \x{5f3e}
   A   B   C   ....   ~   \e   $   B  |DAN | ....
  41  42  43   ....  7E   1b  24  42  43  46 ....
  <- buffer --------------------------->
  << decoded >>>>>>>>>>>>>>>>>>>>>>>

As you see, the next buffer begins with \x43.  But \x43 is 'C' in
ASCII, which is wrong in this case because we are now in the JIS X 0208
area, so it has to convert \x43\x46, not \x43.  Unlike utf8 and EUC,
in an escape-based encoding you can't tell whether a given octet is a whole
character or just part of one.
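
The loss of shift state can be reproduced without PerlIO by decoding the
two halves of the stream separately.  This is a sketch using plain
C<decode> calls, not the layer itself; the chunk boundary is chosen by hand
to fall right after the escape sequence.

    use strict;
    use warnings;
    use Encode qw(decode);

    my $chunk1 = "ABC~\e\$B";       # ends right after the JIS X 0208 escape
    my $chunk2 = "\x43\x46\e(B";    # the two-octet character, then back to ASCII

    # Decoded in one piece, the escape sequence governs \x43\x46.
    my $whole = decode('iso-2022-jp', $chunk1 . $chunk2);

    # Decoded piecewise, the second call starts over in ASCII mode,
    # so \x43 and \x46 are taken as two separate ASCII characters.
    my $pieces = decode('iso-2022-jp', $chunk1)
               . decode('iso-2022-jp', $chunk2);

    for my $result ($whole, $pieces) {
        print join(' ', map { sprintf 'U+%04X', ord } split //, $result), "\n";
    }

The two printed lines differ, which is exactly the problem a fixed-length
buffer runs into.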

There are actually several ways to solve this problem, but none of
them is fast enough to be practical.  From Encode's point of view,
the easiest solution is for PerlIO to implement a line buffer instead
of a fixed-length buffer, but that makes PerlIO really complicated.

So for the time being, using escape-based encodings in the ":encoding()"
layer of PerlIO does not work well.

=head2 Workaround

If you still want to use an escape-based encoding with ":encoding()",
you can do so by making sure the buffer never gets full.  Here is an example.

  use FileHandle;
  binmode(STDOUT, ":encoding(7bit-jis)");
  STDOUT->autoflush(1); # don't forget this!
  for my $l (@lines){   # $l cannot be longer than 1023 bytes
    print $l;
  } 

=head2 How can you tell whether an encoding fully supports PerlIO?

As of this writing, any encoding whose class belongs to Encode::XS or
Encode::Unicode works.  The Encode module has a C<perlio_ok> method, so you
can check before applying a PerlIO encoding layer to the filehandle.  Here is
an example:

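  use Encode qw(decode perlio_ok);

  my $use_perlio = perlio_ok($enc);
  # Use the layer only when it is safe; otherwise read raw octets
  # and decode each line by hand.
  my $layer = $use_perlio ? "<:encoding($enc)" : "<:raw";
  open my $fh, $layer, $file or die "$file : $!";
  while(<$fh>){
    $_ = decode($enc, $_) unless $use_perlio;
    # ....
  }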

=head1 SEE ALSO

L<Encode::Encoding>,
L<Encode::Supported>,
L<Encode>,
L<encoding>,
L<perlebcdic>, 
L<perlfunc/open>, 
L<perlunicode>, 
L<utf8>, 
the Perl Unicode Mailing List E<lt>perl-unicode@perl.orgE<gt>


=cut