@ISA = qw(Exporter DynaLoader);
Encode - character encodings

I<char>: a character in the range 0..maxint (at least 2**32-1)

I<byte>: a character in the range 0..255

The marker [INTERNAL] marks Internal Implementation Details, in
general meant only for those who think they know what they are doing,
and such details may change in future releases.
    bytes_to_utf8(STRING [, FROM])

The bytes in STRING are recoded in-place into UTF-8. If no FROM is
specified the bytes are expected to be encoded in US-ASCII or ISO
8859-1 (Latin 1). Returns the new size of STRING, or C<undef> if
there's a failure.

[INTERNAL] Also the UTF-8 flag of STRING is turned on.
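
For comparison, the core C<utf8::upgrade> function performs a similar
in-place recoding of Latin 1 bytes into UTF-8 and turns the flag on.
The sketch below uses it as a stand-in for bytes_to_utf8(); that
substitution is an assumption made for illustration, not the
function documented above.

```perl
use strict;
use warnings;

# Four Latin 1 bytes: "caf" followed by 0xE9 (e with acute accent).
my $string = "caf\xE9";

# utf8::upgrade() recodes the internal representation in-place and
# returns the number of octets needed to hold the string as UTF-8.
my $octets = utf8::upgrade($string);

# The character count is unchanged; only the storage grew.
my $chars   = length($string);
my $flagged = utf8::is_utf8($string) ? 1 : 0;

print "octets=$octets chars=$chars flag=$flagged\n";
# prints "octets=5 chars=4 flag=1"
```

The returned size counts octets, while C<length> keeps counting
characters, which is the distinction the "new size" wording above
relies on.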
    utf8_to_bytes(STRING [, TO [, CHECK]])

The UTF-8 in STRING is decoded in-place into bytes. If no TO encoding
is specified the bytes are expected to be encoded in US-ASCII or ISO
8859-1 (Latin 1). Returns the new size of STRING, or C<undef> if
there's a failure.

What if there are characters > 255? What if the UTF-8 in STRING is
malformed? See L</"Handling Malformed Data">.

[INTERNAL] The UTF-8 flag of STRING is not checked.
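
The "characters > 255" question can be illustrated with the core
C<utf8::downgrade> function, used here as an assumed stand-in for
utf8_to_bytes() (it is not the function documented above). Its second
argument asks for a false return instead of an exception when a
character does not fit into a byte.

```perl
use strict;
use warnings;

# A string whose characters all fit into 0..255, stored as UTF-8.
my $latin = "caf\xE9";
utf8::upgrade($latin);

# Downgrading succeeds and the string is stored as bytes again.
my $latin_ok   = utf8::downgrade($latin, 1) ? 1 : 0;
my $latin_flag = utf8::is_utf8($latin)      ? 1 : 0;

# U+263A is above 255, so it cannot be represented as a single byte.
my $wide    = "\x{263A}";
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "latin_ok=$latin_ok latin_flag=$latin_flag wide_ok=$wide_ok\n";
# prints "latin_ok=1 latin_flag=0 wide_ok=0"
```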
The chars in STRING are encoded in-place into UTF-8. Returns the new
size of STRING, or C<undef> if there's a failure.

No assumptions are made about the encoding of the chars. If you want
to assume that the chars are Unicode and to trap illegal Unicode
characters, you must use C<from_to('Unicode', ...)>.

[INTERNAL] Also the UTF-8 flag of STRING is turned on.
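
As a rough illustration of encoding chars into UTF-8 in-place, the
sketch below uses the core C<utf8::encode> function. This is an
assumed analogue only: unlike the function described above,
C<utf8::encode> leaves the UTF-8 flag off, so the result is a plain
string of octets.

```perl
use strict;
use warnings;

# One char above 255: U+263A (WHITE SMILING FACE).
my $string       = "\x{263A}";
my $chars_before = length($string);

# Encode in-place; the string now holds the UTF-8 octets.
utf8::encode($string);

my $octets_after = length($string);
my $hex          = unpack('H*', $string);

print "before=$chars_before after=$octets_after hex=$hex\n";
# prints "before=1 after=3 hex=e298ba"
```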
    utf8_to_chars(STRING)

The UTF-8 in STRING is decoded in-place into chars. Returns the new
size of STRING, or C<undef> if there's a failure.

If the UTF-8 in STRING is malformed, C<undef> is returned and an
optional lexical warning (category utf8) is given.

[INTERNAL] The UTF-8 flag of STRING is not checked.
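
The failure case for malformed input can be sketched with the core
C<utf8::decode> function, again as an assumed stand-in rather than
the function documented above. It returns false and leaves the string
alone when the octets are not well-formed UTF-8.

```perl
use strict;
use warnings;

# Well-formed: the three-octet UTF-8 encoding of U+263A.
my $good    = "\xE2\x98\xBA";
my $good_ok = utf8::decode($good) ? 1 : 0;
my $good_cp = sprintf('U+%04X', ord($good));

# Malformed: the same sequence with its last octet missing.
my $bad     = "\xE2\x98";
my $bad_ok  = utf8::decode($bad) ? 1 : 0;
my $bad_len = length($bad);    # still two octets, nothing was decoded

print "good_ok=$good_ok good_cp=$good_cp bad_ok=$bad_ok bad_len=$bad_len\n";
# prints "good_ok=1 good_cp=U+263A bad_ok=0 bad_len=2"
```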
    utf8_to_chars_check(STRING [, CHECK])

(Note the special naming of this interface: a two-argument
utf8_to_chars() has different semantics.)

The UTF-8 in STRING is decoded in-place into chars. Returns the new
size of STRING, or C<undef> if there is a failure.

What if the UTF-8 in STRING is malformed? See
L</"Handling Malformed Data">.

[INTERNAL] The UTF-8 flag of STRING is not checked.
=head2 chars With Encoding

    chars_to_utf8(STRING, FROM [, CHECK])

The chars in STRING encoded in FROM are recoded in-place into UTF-8.
Returns the new size of STRING, or C<undef> if there's a failure.

No assumptions are made about the encoding of the chars. If you want
to assume that the chars are Unicode and to trap illegal Unicode
characters, you must use C<from_to('Unicode', ...)>.

[INTERNAL] Also the UTF-8 flag of STRING is turned on.
    utf8_to_chars(STRING, TO [, CHECK])

The UTF-8 in STRING is decoded in-place into chars encoded in TO.
Returns the new size of STRING, or C<undef> if there's a failure.

What if the UTF-8 in STRING is malformed? See
L</"Handling Malformed Data">.

[INTERNAL] The UTF-8 flag of STRING is not checked.
    bytes_to_chars(STRING, FROM [, CHECK])

The bytes in STRING encoded in FROM are recoded in-place into chars.
Returns the new size of STRING in bytes, or C<undef> if there's a
failure.

What if the mapping is impossible? See L</"Handling Malformed Data">.
    chars_to_bytes(STRING, TO [, CHECK])

The chars in STRING are recoded in-place to bytes encoded in TO.
Returns the new size of STRING in bytes, or C<undef> if there's a
failure.

What if the mapping is impossible? See L</"Handling Malformed Data">.
    from_to(STRING, FROM, TO [, CHECK])

The chars in STRING encoded in FROM are recoded in-place into TO.
Returns the new size of STRING, or C<undef> if there's a failure.

What if mapping between the encodings is impossible?
See L</"Handling Malformed Data">.

[INTERNAL] If TO is UTF-8, also the UTF-8 flag of STRING is turned on.
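
A minimal usage sketch of an in-place conversion follows. It assumes
the C<from_to> exported by the Encode module shipped with present-day
perls, which takes the same argument list; note that this later
C<from_to> converts octets to octets and does not turn on the UTF-8
flag, so the [INTERNAL] note above may not apply to it.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Latin 1 octets: "caf" followed by 0xE9.
my $data = "caf\xE9";

# Recode in-place; the return value is the new size in octets.
my $size = from_to($data, 'iso-8859-1', 'UTF-8');
my $hex  = unpack('H*', $data);

# An unknown encoding name is reported by an exception.
my $other = "abc";
my $died  = defined(eval { from_to($other, 'no-such-encoding', 'UTF-8'); 1 }) ? 0 : 1;

print "size=$size hex=$hex died=$died\n";
# prints "size=5 hex=636166c3a9 died=1"
```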
=head2 Testing For UTF-8

    is_utf8(STRING [, CHECK])

[INTERNAL] Test whether the UTF-8 flag is turned on in the STRING.
If CHECK is true, also checks the data in STRING for being
well-formed UTF-8. Returns true if successful, false otherwise.
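
The difference between testing the flag alone and testing with CHECK
can be sketched as follows. The example assumes the C<is_utf8>,
C<decode> and C<_utf8_on> functions of the Encode module shipped with
present-day perls; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode ();

# Plain octets: the flag is off, whatever the octets happen to contain.
my $octets     = "caf\xC3\xA9";
my $octets_flg = Encode::is_utf8($octets) ? 1 : 0;

# Decoded chars: the flag is on and the data is well-formed.
my $chars       = Encode::decode('UTF-8', $octets);
my $chars_flg   = Encode::is_utf8($chars)    ? 1 : 0;
my $chars_check = Encode::is_utf8($chars, 1) ? 1 : 0;

# A lone 0xE9 octet with the flag forced on: flagged but malformed.
my $broken = "\xE9";
Encode::_utf8_on($broken);
my $broken_flg   = Encode::is_utf8($broken)    ? 1 : 0;
my $broken_check = Encode::is_utf8($broken, 1) ? 1 : 0;

print "$octets_flg $chars_flg $chars_check $broken_flg $broken_check\n";
# prints "0 1 1 1 0"
```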
=head2 Toggling UTF-8-ness

[INTERNAL] Turn on the UTF-8 flag in STRING. The data in STRING is
B<not> checked for being well-formed UTF-8. Do not use unless you
B<know> that the STRING is well-formed UTF-8. Returns the previous
state of the UTF-8 flag (so please do not treat the return value as
an indication of success or failure), or C<undef> if STRING is not a
string.

[INTERNAL] Turn off the UTF-8 flag in STRING. Do not use frivolously.
Returns the previous state of the UTF-8 flag (so please do not treat
the return value as an indication of success or failure), or C<undef>
if STRING is not a string.
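
Why the return value must not be read as success or failure can be
seen in a short sketch. It assumes the C<_utf8_on> and C<_utf8_off>
functions of the Encode module shipped with present-day perls, which
may differ in name from the functions documented above.

```perl
use strict;
use warnings;
use Encode ();

# Two octets that happen to be the UTF-8 encoding of U+00E9.
my $string = "\xC3\xA9";

# The flag was off, so the "previous state" returned is false,
# even though the call itself did what was asked.
my $before_on = Encode::_utf8_on($string);
my $len_on    = length($string);    # now counted as one char

# Turning it off again returns the previous state, which is now true.
my $before_off = Encode::_utf8_off($string);
my $len_off    = length($string);   # back to two bytes

printf "on: prev=%d len=%d, off: prev=%d len=%d\n",
    ($before_on ? 1 : 0), $len_on, ($before_off ? 1 : 0), $len_off;
# prints "on: prev=0 len=1, off: prev=1 len=2"
```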
=head2 UTF-16 and UTF-32 Encodings

    utf_to_utf(STRING, FROM, TO [, CHECK])

The data in STRING is converted from Unicode Transfer Encoding FROM to
Unicode Transfer Encoding TO. Both FROM and TO may be any of the
following tags (case-insensitive, with or without 'utf' or 'utf-' prefix):

    '16be'    UTF-16 big-endian
    '16le'    UTF-16 little-endian
    '16'      UTF-16 native-endian
    '32be'    UTF-32 big-endian
    '32le'    UTF-32 little-endian
    '32'      UTF-32 native-endian

UTF-16 is also loosely known as UCS-2 (16-bit, or 2-byte, chunks), and
UTF-32 as UCS-4 (32-bit, or 4-byte, chunks). Returns the new size of
STRING, or C<undef> if there's a failure.

What if FROM is UTF-8 and the UTF-8 in STRING is malformed? See
L</"Handling Malformed Data">.

[INTERNAL] Even if CHECK is true and FROM is UTF-8, the UTF-8 flag of
STRING is not checked. If TO is UTF-8, also the UTF-8 flag of STRING is
turned on. Identical FROM and TO are fine.
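
The effect of the endianness tags can be sketched with the C<encode>
and C<from_to> functions of the Encode module shipped with
present-day perls. This is an assumption made for illustration:
those functions take full encoding names such as 'UTF-16BE' rather
than the short tags listed above.

```perl
use strict;
use warnings;
use Encode qw(encode from_to);

# The same char, U+263A, in two byte orders.
my $be16 = unpack('H*', encode('UTF-16BE', "\x{263A}"));
my $le16 = unpack('H*', encode('UTF-16LE', "\x{263A}"));

# Convert little-endian UTF-16 octets to big-endian UTF-32 in-place.
my $data = "\x3A\x26";
my $size = from_to($data, 'UTF-16LE', 'UTF-32BE');
my $be32 = unpack('H*', $data);

print "be16=$be16 le16=$le16 size=$size be32=$be32\n";
# prints "be16=263a le16=3a26 size=4 be32=0000263a"
```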
=head2 Handling Malformed Data

If CHECK is not set, C<undef> is returned. If the data is supposed to
be UTF-8, an optional lexical warning (category utf8) is given. If
CHECK is true but not a code reference, the function dies. If CHECK
is a code reference, it is called with the arguments

    (MALFORMED_STRING, STRING_FROM_SO_FAR, STRING_TO_SO_FAR)

Two return values are expected from the call: the string to be used in
the result string in place of the malformed section, and the length of
the malformed section in bytes.
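
The callback protocol described above can be sketched in plain Perl.
Both C<question_mark> and C<ascii_only> are hypothetical names
invented for this illustration; C<ascii_only> is a toy stand-in for
a converter and is not how the module itself is implemented.

```perl
use strict;
use warnings;

# Hypothetical CHECK callback: receives the three documented arguments
# and returns the replacement text plus the number of bytes to skip.
sub question_mark {
    my ($malformed, $from_so_far, $to_so_far) = @_;
    return ('?', 1);
}

# Hypothetical converter that accepts only US-ASCII and consults the
# callback for everything else.
sub ascii_only {
    my ($input, $check) = @_;
    my ($consumed, $output) = ('', '');
    while (length $input) {
        # Copy the longest run of acceptable bytes.
        if ($input =~ s/^([\x00-\x7F]+)//) {
            $consumed .= $1;
            $output   .= $1;
            next;
        }
        # Ask the callback what to do with the offending section.
        my ($replacement, $skip) = $check->($input, $consumed, $output);
        die "callback must skip at least one byte" unless $skip && $skip > 0;
        $output   .= $replacement;
        $consumed .= substr($input, 0, $skip, '');
    }
    return $output;
}

my $result = ascii_only("caf\xE9!", \&question_mark);
print "$result\n";    # prints "caf?!"
```

Returning the length alongside the replacement lets the converter
resume after the malformed section without re-examining it.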
sub utf8_to_chars_check {
&_utf8_to_chars_check;
my ($string,$from,$to,$check) = @_;
my $f = __PACKAGE__->getEncoding($from);
croak("Unknown encoding '$from'") unless $f;
my $t = __PACKAGE__->getEncoding($to);
croak("Unknown encoding '$to'") unless $t;
my $uni = $f->toUnicode($string,$check);
return undef if ($check && length($string));
$string = $t->fromUnicode($uni,$check);
return undef if ($check && length($uni));
return length($_[0] = $string);
my ($dir) = __FILE__ =~ /^(.*)\.pm$/;
my @names = ('Unicode');
if (opendir(my $dh,$dir))
while (defined(my $name = readdir($dh)))
push(@names,$1) if ($name =~ /^(.*)\.enc$/);
die "Cannot open $dir:$!";
my %encoding = ( Unicode => bless({},'Encode::Unicode'),
'iso10646-1' => bless({},'Encode::iso10646_1'),
my ($class,$name) = @_;
unless (exists $encoding{$name})
foreach my $dir (@INC)
last if -f ($file = "$dir/Encode/$name.enc");
if (open(my $fh,$file))
$type = substr($line,0,1);
last unless $type eq '#';
$class .= ('::'.(($type eq 'E') ? 'Escape' : 'Table'));
$encoding{$name} = $class->read($fh,$name,$type);
$encoding{$name} = undef;
return $encoding{$name};
package Encode::Unicode;
# Dummy package that provides the encode interface
sub name { 'Unicode' }
sub toUnicode { $_[1] }
sub fromUnicode { $_[1] }
package Encode::Table;
my ($class,$fh,$name,$type) = @_;
my $rep = $class->can("rep_$type");
my ($def,$sym,$pages) = split(/\s+/,scalar(<$fh>));
my $page = hex($line);
my $ch = $page * 256;
for (my $i = 0; $i < 16; $i++)
for (my $j = 0; $j < 16; $j++)
my $val = hex(substr($line,0,4,''));
$touni[$page] = \@page;
return bless {Name => $name,
sub name { shift->{'Name'} }
sub rep_M { ($_[0] > 255) ? 'S' : 'C' }
$ch = 0 unless @_ > 1;
my ($obj,$str,$chk) = @_;
my $rep = $obj->{'Rep'};
my $touni = $obj->{'ToUni'};
my $ch = ord(substr($str,0,1,''));
if (&$rep($ch) eq 'C')
$x = $touni->[0][$ch];
$x = $touni->[$ch][ord(substr($str,0,1,''))];
# What do we do here ?
$_[1] = $str if $chk;
my ($obj,$uni,$chk) = @_;
my $fmuni = $obj->{'FmUni'};
my $def = $obj->{'Def'};
my $rep = $obj->{'Rep'};
my $ch = substr($uni,0,1,'');
my $x = $fmuni->{$ch};
$str .= pack(&$rep($x),$x);
$_[1] = $uni if $chk;
package Encode::iso10646_1;
sub name { 'iso10646-1' }
my ($obj,$str,$chk) = @_;
my $code = unpack('S',substr($str,0,2,''));
$_[1] = $str if $chk;
my ($obj,$uni,$chk) = @_;
my $ch = substr($uni,0,1,'');
$str .= pack('S',$x);
$_[1] = $uni if $chk;
package Encode::Escape;
my ($class,$fh,$name) = @_;
my %self = (Name => $name, Num => 0);
my ($key,$val) = /^(\S+)\s+(.*)$/;
$val =~ s/^\{(.*?)\}/$1/g;
$val =~ s/\\x([0-9a-f]{2})/chr(hex($1))/ge;
return bless \%self,$class;
sub name { shift->{'Name'} }
croak("Not implemented yet");
croak("Not implemented yet");