@ISA = qw(Exporter DynaLoader);
Encode - character encodings

I<char>: a character in the range 0..maxint (at least 2**32-1)

I<byte>: a character in the range 0..255

The marker [INTERNAL] marks Internal Implementation Details, in
general meant only for those who think they know what they are doing,
and such details may change in future releases.
    bytes_to_utf8(STRING [, FROM])

The bytes in STRING are recoded in-place into UTF-8. If no FROM is
specified, the bytes are expected to be encoded in US-ASCII or ISO
8859-1 (Latin 1). Returns the new size of STRING, or C<undef> if
there's a failure.

[INTERNAL] Also the UTF-8 flag of STRING is turned on.
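
The size change described for bytes_to_utf8() can be illustrated with a
minimal sketch. It assumes a Perl where the core C<utf8::encode> function is
available and uses it only as a stand-in for bytes_to_utf8() with Latin 1
input; the [INTERNAL] flag behaviour described above is not reproduced, and
the variable names are illustrative.

```perl
use strict;
use warnings;

# Four Latin 1 bytes: 'c', 'a', 'f' and e-acute (0xE9).
my $string   = "caf\xE9";
my $old_size = length($string);

# Stand-in for bytes_to_utf8(): recode the string in place as UTF-8 octets.
# (Assumption: utf8::encode is used here purely for illustration.)
utf8::encode($string);
my $new_size = length($string);

printf "%d -> %d bytes\n", $old_size, $new_size;
```

The e-acute needs two bytes in UTF-8, which is why the recoded string is one
byte longer than the original.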
    utf8_to_bytes(STRING [, TO [, CHECK]])

The UTF-8 in STRING is decoded in-place into bytes. If no TO encoding
is specified, the bytes are expected to be encoded in US-ASCII or ISO
8859-1 (Latin 1). Returns the new size of STRING, or C<undef> if
there's a failure.

What if there are characters > 255? What if the UTF-8 in STRING is
malformed? See L</"Handling Malformed Data">.

[INTERNAL] The UTF-8 flag of STRING is not checked.
The chars in STRING are encoded in-place into UTF-8. Returns the new
size of STRING, or C<undef> if there's a failure.

No assumptions are made on the encoding of the chars. If you want to
assume that the chars are Unicode and to trap illegal Unicode
characters, you must use C<from_to('Unicode', ...)>.

[INTERNAL] Also the UTF-8 flag of STRING is turned on.
    utf8_to_chars(STRING)

The UTF-8 in STRING is decoded in-place into chars. Returns the new
size of STRING, or C<undef> if there's a failure.

If the UTF-8 in STRING is malformed, C<undef> is returned and an
optional lexical warning (category utf8) is given.

[INTERNAL] The UTF-8 flag of STRING is not checked.
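
The difference between well-formed and malformed input can be sketched as
follows. This is not utf8_to_chars() itself: it assumes the core
C<utf8::decode> function as a stand-in, which reports failure with a false
return value rather than C<undef>. The sample strings are illustrative.

```perl
use strict;
use warnings;

my $good = "caf\xC3\xA9";   # well-formed UTF-8: "caf" plus e-acute
my $bad  = "caf\xC3\x28";   # malformed: 0xC3 must be followed by 0x80..0xBF

# Stand-in for utf8_to_chars(): decode in place, true on success.
my $good_ok  = utf8::decode($good) ? 1 : 0;
my $bad_ok   = utf8::decode($bad)  ? 1 : 0;

# After a successful decode the length is counted in chars, not bytes.
my $good_len = length($good);

print "good: ok=$good_ok length=$good_len\n";
print "bad:  ok=$bad_ok\n";
```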
    utf8_to_chars_check(STRING [, CHECK])

(Note the special naming of this interface: a two-argument
utf8_to_chars() has different semantics.)

The UTF-8 in STRING is decoded in-place into chars. Returns the new
size of STRING, or C<undef> if there is a failure.

What if the UTF-8 in STRING is malformed? See L</"Handling Malformed Data">.

[INTERNAL] The UTF-8 flag of STRING is not checked.
=head2 chars With Encoding

    chars_to_utf8(STRING, FROM [, CHECK])

The chars in STRING encoded in FROM are recoded in-place into UTF-8.
Returns the new size of STRING, or C<undef> if there's a failure.

No assumptions are made on the encoding of the chars. If you want to
assume that the chars are Unicode and to trap illegal Unicode
characters, you must use C<from_to('Unicode', ...)>.

[INTERNAL] Also the UTF-8 flag of STRING is turned on.
    utf8_to_chars(STRING, TO [, CHECK])

The UTF-8 in STRING is decoded in-place into chars encoded in TO.
Returns the new size of STRING, or C<undef> if there's a failure.

What if the UTF-8 in STRING is malformed? See L</"Handling Malformed Data">.

[INTERNAL] The UTF-8 flag of STRING is not checked.
    bytes_to_chars(STRING, FROM [, CHECK])

The bytes in STRING encoded in FROM are recoded in-place into chars.
Returns the new size of STRING in bytes, or C<undef> if there's a
failure.

What if the mapping is impossible? See L</"Handling Malformed Data">.
    chars_to_bytes(STRING, TO [, CHECK])

The chars in STRING are recoded in-place to bytes encoded in TO.
Returns the new size of STRING in bytes, or C<undef> if there's a
failure.

What if the mapping is impossible? See L</"Handling Malformed Data">.
    from_to(STRING, FROM, TO [, CHECK])

The chars in STRING encoded in FROM are recoded in-place into TO.
Returns the new size of STRING, or C<undef> if there's a failure.

What if mapping between the encodings is impossible?
See L</"Handling Malformed Data">.

[INTERNAL] If TO is UTF-8, also the UTF-8 flag of STRING is turned on.
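
A short usage sketch of the in-place recoding and its return value. It
assumes an installed Encode release that exports C<from_to> on request and
knows the encoding names C<iso-8859-1> and C<utf-8>; such releases treat
STRING as a sequence of octets, which may differ in detail from the
interface documented here.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# ISO 8859-1 octets: "caf" followed by e-acute (0xE9).
my $string = "caf\xE9";

# Recode in place; the return value is the new size, or undef on failure.
my $size = from_to($string, 'iso-8859-1', 'utf-8');

if (defined $size) {
    print "recoded to $size bytes\n";
}
else {
    print "conversion failed\n";
}
```

Testing the result with C<defined> rather than for truth avoids treating a
successful conversion of an empty string (size 0) as a failure.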
=head2 Testing For UTF-8

    is_utf8(STRING [, CHECK])

[INTERNAL] Test whether the UTF-8 flag is turned on in the STRING.
If CHECK is true, also checks the data in STRING for being
well-formed UTF-8. Returns true if successful, false otherwise.
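
A minimal sketch of the flag test, assuming an installed Encode release that
exports C<is_utf8> on request. The variable names are illustrative, and the
results depend on how Perl stores each string internally, which is exactly
why the function is marked [INTERNAL].

```perl
use strict;
use warnings;
use Encode qw(is_utf8);

my $bytes = "abc";          # plain ASCII literal, stored as bytes
my $chars = "\x{263A}";     # a char above 255 must be stored as UTF-8

# Normalise the results to 0/1 so they are easy to print and compare.
my $flag_bytes = is_utf8($bytes)    ? 1 : 0;
my $flag_chars = is_utf8($chars)    ? 1 : 0;
my $checked    = is_utf8($chars, 1) ? 1 : 0;   # flag on and data well-formed

print "bytes=$flag_bytes chars=$flag_chars checked=$checked\n";
```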
=head2 Toggling UTF-8-ness

[INTERNAL] Turn on the UTF-8 flag in STRING. The data in STRING is
B<not> checked for being well-formed UTF-8. Do not use unless you
B<know> that the STRING is well-formed UTF-8. Returns the previous
state of the UTF-8 flag (so please don't treat the return value as
indicating success or failure), or C<undef> if STRING is not a string.

[INTERNAL] Turn off the UTF-8 flag in STRING. Do not use frivolously.
Returns the previous state of the UTF-8 flag (so please don't treat the
return value as indicating success or failure), or C<undef> if STRING is
not a string.
=head2 UTF-16 and UTF-32 Encodings

    utf_to_utf(STRING, FROM, TO [, CHECK])

The data in STRING is converted from Unicode Transfer Encoding FROM to
Unicode Transfer Encoding TO. Both FROM and TO may be any of the
following tags (case-insensitive, with or without 'utf' or 'utf-' prefix):

    '16be'    UTF-16 big-endian
    '16le'    UTF-16 little-endian
    '16'      UTF-16 native-endian
    '32be'    UTF-32 big-endian
    '32le'    UTF-32 little-endian
    '32'      UTF-32 native-endian

UTF-16 is also known as UCS-2, 16-bit or 2-byte chunks, and UTF-32 as
UCS-4, 32-bit or 4-byte chunks. Returns the new size of STRING, or
C<undef> if there's a failure.

What if FROM is UTF-8 and the UTF-8 in STRING is malformed? See
L</"Handling Malformed Data">.

[INTERNAL] Even if CHECK is true and FROM is UTF-8, the UTF-8 flag of
STRING is not checked. If TO is UTF-8, also the UTF-8 flag of STRING is
turned on. Identical FROM and TO are fine.
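
The tag matching rule (case-insensitive, optional 'utf' or 'utf-' prefix)
could be sketched like this. The helper C<normalize_utf_tag> is hypothetical
and not part of this module, and the tag '8' for UTF-8 is an assumption
based on the text allowing FROM to be UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a user-supplied tag to its canonical form,
# or return undef if the tag is not recognised.
sub normalize_utf_tag {
    my ($tag) = @_;
    my $t = lc $tag;          # tags are case-insensitive
    $t =~ s/^utf-?//;         # drop an optional 'utf' or 'utf-' prefix
    # '8' is an assumption; the other six tags come from the list above.
    my %known = map { $_ => 1 } qw(8 16be 16le 16 32be 32le 32);
    return $known{$t} ? $t : undef;
}

for my $tag ('UTF-16LE', 'utf32', '16be', 'latin1') {
    my $norm = normalize_utf_tag($tag);
    printf "%-8s => %s\n", $tag, defined $norm ? $norm : '(unknown)';
}
```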
=head2 Handling Malformed Data

If CHECK is not set, C<undef> is returned. If the data is supposed to
be UTF-8, an optional lexical warning (category utf8) is given. If
CHECK is true but not a code reference, the function dies. If CHECK is
a code reference, it is called with the arguments

    (MALFORMED_STRING, STRING_FROM_SO_FAR, STRING_TO_SO_FAR)

Two return values are expected from the call: the string to be used in
the result string in place of the malformed section, and the length of
the malformed section in bytes.
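
A CHECK handler that follows the calling convention above might look like
the following sketch. The handler name and the choice of C<?> as the
replacement are illustrative assumptions, and the handler is called directly
here rather than through any of the conversion functions.

```perl
use strict;
use warnings;

# Hypothetical CHECK handler: substitute '?' for each malformed section
# and report that one byte of the input should be skipped.
sub replace_with_question_mark {
    my ($malformed, $from_so_far, $to_so_far) = @_;
    return ('?', 1);    # (replacement string, malformed length in bytes)
}

# Direct call with sample arguments, mimicking what a converter would pass.
my ($replacement, $skip) =
    replace_with_question_mark("\xC3\x28", "caf", "caf");

print "replace with '$replacement', skip $skip byte(s)\n";
```

Such a handler would be passed as a code reference, for example
C<\&replace_with_question_mark>, in the CHECK position.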
sub utf8_to_chars_check {
    # Pass the caller's @_ straight through to the internal implementation.
    &_utf8_to_chars_check;
}
    my ($string,$from,$to,$check) = @_;
    my $f = __PACKAGE__->getEncoding($from);
    croak("Unknown encoding '$from'") unless $f;
    my $t = __PACKAGE__->getEncoding($to);
    croak("Unknown encoding '$to'") unless $t;
    # Pivot through Unicode: decode from FROM, then encode into TO.
    my $uni = $f->toUnicode($string,$check);
    # With CHECK set, leftover input means the conversion was incomplete.
    return undef if ($check && length($string));
    $string = $t->fromUnicode($uni,$check);
    return undef if ($check && length($uni));
    # Write the result back into the caller's variable and return its size.
    return length($_[0] = $string);
# The global hash is declared in XS code
$encoding{Unicode} = bless({},'Encode::Unicode');
$encoding{'iso10646-1'} = bless({},'Encode::iso10646_1');
    foreach my $dir (@INC)
        if (opendir(my $dh,"$dir/Encode"))
            while (defined(my $name = readdir($dh)))
                if ($name =~ /^(.*)\.enc$/)
                    next if exists $encoding{$1};
                    $encoding{$1} = "$dir/$name";
    return keys %encoding;
    my ($class,$name,$file) = @_;
    if (open(my $fh,$file))
            $type = substr($line,0,1);
            last unless $type eq '#';
        $class .= ('::'.(($type eq 'E') ? 'Escape' : 'Table'));
        #warn "Loading $file";
        return $class->read($fh,$name,$type);
    my ($class,$name) = @_;
    unless (ref($enc = $encoding{$name}))
        $enc = $class->loadEncoding($name,$enc) if defined $enc;
        foreach my $dir (@INC)
            last if ($enc = $class->loadEncoding($name,"$dir/Encode/$name.enc"));
        $encoding{$name} = $enc;
package Encode::Unicode;

# Dummy package that provides the encode interface but leaves data
# as UTF-8 encoded. It is here so that from_to() works.

sub name { 'Unicode' }

    my ($obj,$str,$chk) = @_;
    Encode::utf8_upgrade($str);

*fromUnicode = \&toUnicode;
package Encode::Table;

    my ($class,$fh,$name,$type) = @_;
    my $rep = $class->can("rep_$type");
    my ($def,$sym,$pages) = split(/\s+/,scalar(<$fh>));
        my $page = hex($line);
        my $ch = $page * 256;
        for (my $i = 0; $i < 16; $i++)
            for (my $j = 0; $j < 16; $j++)
                my $val = hex(substr($line,0,4,''));
        $touni[$page] = \@page;
    return bless {Name => $name,

sub name { shift->{'Name'} }

sub rep_M { ($_[0] > 255) ? 'n' : 'C' }

    $ch = 0 unless @_ > 1;
    my ($obj,$str,$chk) = @_;
    my $rep = $obj->{'Rep'};
    my $touni = $obj->{'ToUni'};
        my $ch = ord(substr($str,0,1,''));
        if (&$rep($ch) eq 'C')
            $x = $touni->[0][$ch];
            $x = $touni->[$ch][ord(substr($str,0,1,''))];
            # What do we do here ?
    $_[1] = $str if $chk;
    my ($obj,$uni,$chk) = @_;
    my $fmuni = $obj->{'FmUni'};
    my $def = $obj->{'Def'};
    my $rep = $obj->{'Rep'};
        my $ch = substr($uni,0,1,'');
        my $x = $fmuni->{chr(ord($ch))};
        $str .= pack(&$rep($x),$x);
    $_[1] = $uni if $chk;
package Encode::iso10646_1;
# Encoding is 16-bit network order Unicode
# Used for X font encodings

sub name { 'iso10646-1' }

    my ($obj,$str,$chk) = @_;
        my $code = unpack('n',substr($str,0,2,'')) & 0xffff;
    $_[1] = $str if $chk;
    Encode::utf8_upgrade($uni);

    my ($obj,$uni,$chk) = @_;
        my $ch = substr($uni,0,1,'');
        $str .= pack('n',$x);
    $_[1] = $uni if $chk;
package Encode::Escape;

    my ($class,$fh,$name) = @_;
    my %self = (Name => $name, Num => 0);
        my ($key,$val) = /^(\S+)\s+(.*)$/;
        $val =~ s/^\{(.*?)\}/$1/g;
        $val =~ s/\\x([0-9a-f]{2})/chr(hex($1))/ge;
    return bless \%self,$class;

sub name { shift->{'Name'} }

    croak("Not implemented yet");

    croak("Not implemented yet");