@ISA = qw(Exporter DynaLoader);

Encode - character encodings

I<char>: a character in the range 0..maxint (at least 2**32-1)

I<byte>: a character in the range 0..255

The marker [INTERNAL] marks Internal Implementation Details, which are
in general meant only for those who think they know what they are
doing. Such details may change in future releases.
bytes_to_utf8(STRING [, FROM])

The bytes in STRING are recoded in-place into UTF-8. If no FROM is
specified, the bytes are expected to be encoded in US-ASCII or ISO
8859-1 (Latin 1). Returns the new size of STRING, or C<undef> if
there's a failure.

[INTERNAL] The UTF-8 flag of STRING is also turned on.
utf8_to_bytes(STRING [, TO [, CHECK]])

The UTF-8 in STRING is decoded in-place into bytes. If no TO encoding
is specified, the bytes are expected to be encoded in US-ASCII or ISO
8859-1 (Latin 1). Returns the new size of STRING, or C<undef> if
there's a failure.

What if there are characters > 255? What if the UTF-8 in STRING is
malformed? See L</"Handling Malformed Data">.

[INTERNAL] The UTF-8 flag of STRING is not checked.
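The byte-oriented pair above has a rough analogue in the core
C<utf8::upgrade> and C<utf8::downgrade> functions of a current perl.
The following is only a sketch under that assumption: it does not call
C<bytes_to_utf8> or C<utf8_to_bytes> themselves, and the variable names
are illustrative.

```perl
use strict;
use warnings;

# Assumption: a current perl, whose core utf8:: functions stand in for
# the bytes_to_utf8()/utf8_to_bytes() pair documented above.
my $string = "caf\xE9";                  # 4 bytes, Latin 1 e-acute

# Recode the string in-place to UTF-8; the return value is the number
# of octets the string occupies as UTF-8.
my $size = utf8::upgrade($string);

# Recode in-place back to one byte per character. The second argument
# asks for a false return value instead of an exception on failure.
my $ok = utf8::downgrade($string, 1);

# A character above 255 cannot be represented as a single byte.
my $wide    = "\x{263A}";
my $wide_ok = utf8::downgrade($wide, 1);

print "size=$size ok=", ($ok ? 1 : 0), " wide_ok=", ($wide_ok ? 1 : 0), "\n";
```

The failing downgrade of C<$wide> corresponds to the "characters > 255"
question raised above.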
The chars in STRING are encoded in-place into UTF-8. Returns the new
size of STRING, or C<undef> if there's a failure.

No assumptions are made about the encoding of the chars. If you want to
assume that the chars are Unicode and to trap illegal Unicode
characters, you must use C<from_to('Unicode', ...)>.

[INTERNAL] The UTF-8 flag of STRING is also turned on.
utf8_to_chars(STRING)

The UTF-8 in STRING is decoded in-place into chars. Returns the new
size of STRING, or C<undef> if there's a failure.

If the UTF-8 in STRING is malformed, C<undef> is returned and an
optional lexical warning (category utf8) is given.

[INTERNAL] The UTF-8 flag of STRING is not checked.
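The in-place encode and decode steps above can be approximated on a
current perl with the core C<utf8::encode> and C<utf8::decode>
functions. This is a sketch under that assumption, not a call to the
functions documented here; note that C<utf8::decode> reports malformed
input with a false return value rather than C<undef> plus a warning.

```perl
use strict;
use warnings;

# Assumption: core utf8::encode()/utf8::decode() as stand-ins for the
# single-argument chars_to_utf8()/utf8_to_chars() style of interface.
my $string = "\x{263A}";                 # one char, U+263A

utf8::encode($string);                   # in-place: chars -> UTF-8 bytes
my $encoded_length = length $string;     # length in bytes

my $decoded_ok     = utf8::decode($string);   # in-place: bytes -> chars
my $decoded_length = length $string;          # length in chars

# 0xFF can never appear in well-formed UTF-8.
my $bad    = "\xFF";
my $bad_ok = utf8::decode($bad);         # false; $bad is left unchanged

print "$encoded_length $decoded_length\n";
```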
utf8_to_chars_check(STRING [, CHECK])

(Note the special naming of this interface: a two-argument
utf8_to_chars() has different semantics.)

The UTF-8 in STRING is decoded in-place into chars. Returns the new
size of STRING, or C<undef> if there's a failure.

What if the UTF-8 in STRING is malformed? See L</"Handling Malformed Data">.

[INTERNAL] The UTF-8 flag of STRING is not checked.
=head2 chars With Encoding

chars_to_utf8(STRING, FROM [, CHECK])

The chars in STRING encoded in FROM are recoded in-place into UTF-8.
Returns the new size of STRING, or C<undef> if there's a failure.

No assumptions are made about the encoding of the chars. If you want to
assume that the chars are Unicode and to trap illegal Unicode
characters, you must use C<from_to('Unicode', ...)>.

[INTERNAL] The UTF-8 flag of STRING is also turned on.
utf8_to_chars(STRING, TO [, CHECK])

The UTF-8 in STRING is decoded in-place into chars encoded in TO.
Returns the new size of STRING, or C<undef> if there's a failure.

What if the UTF-8 in STRING is malformed? See L</"Handling Malformed Data">.

[INTERNAL] The UTF-8 flag of STRING is not checked.
bytes_to_chars(STRING, FROM [, CHECK])

The bytes in STRING encoded in FROM are recoded in-place into chars.
Returns the new size of STRING in bytes, or C<undef> if there's a
failure.

What if the mapping is impossible? See L</"Handling Malformed Data">.
chars_to_bytes(STRING, TO [, CHECK])

The chars in STRING are recoded in-place to bytes encoded in TO.
Returns the new size of STRING in bytes, or C<undef> if there's a
failure.

What if the mapping is impossible? See L</"Handling Malformed Data">.
from_to(STRING, FROM, TO [, CHECK])

The chars in STRING encoded in FROM are recoded in-place into TO.
Returns the new size of STRING, or C<undef> if there's a failure.

What if mapping between the encodings is impossible?
See L</"Handling Malformed Data">.

[INTERNAL] If TO is UTF-8, the UTF-8 flag of STRING is also turned on.
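A minimal sketch of an in-place conversion with C<from_to>, assuming
the C<from_to> exported by a currently released Encode module. That
version treats STRING as octets before and after the call, so the
[INTERNAL] note about the UTF-8 flag may not apply to it.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Assumption: the from_to() of a currently released Encode.
my $octets = "caf\xE9";                  # "cafe" with e-acute, Latin 1

# Recode in-place; the return value is the new size in bytes,
# or undef on failure.
my $size = from_to($octets, 'iso-8859-1', 'utf-8');

# $octets now holds the UTF-8 bytes "caf\xC3\xA9".
print defined $size ? "new size: $size\n" : "conversion failed\n";
```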
=head2 Testing For UTF-8

is_utf8(STRING [, CHECK])

[INTERNAL] Test whether the UTF-8 flag is turned on in the STRING.
If CHECK is true, also checks the data in STRING for being
well-formed UTF-8. Returns true if successful, false otherwise.
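The flag test can be sketched as follows, assuming the C<is_utf8> and
C<decode> functions of a currently released Encode module. The variable
names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Assumption: is_utf8() and decode() from a currently released Encode.
my $bytes = "caf\xC3\xA9";               # raw UTF-8 bytes, flag off
my $chars = decode('UTF-8', $bytes);     # decoded chars, flag on

my $bytes_flag = is_utf8($bytes)    ? 1 : 0;
my $chars_flag = is_utf8($chars)    ? 1 : 0;
my $checked    = is_utf8($chars, 1) ? 1 : 0;   # also validate the data

print "bytes=$bytes_flag chars=$chars_flag checked=$checked\n";
```

The test looks at the flag, not at the contents: C<$bytes> holds
perfectly valid UTF-8 and still tests false.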
=head2 Toggling UTF-8-ness

[INTERNAL] Turn on the UTF-8 flag in STRING. The data in STRING is
B<not> checked for being well-formed UTF-8. Do not use unless you
B<know> that the STRING is well-formed UTF-8. Returns the previous
state of the UTF-8 flag (so do B<not> treat the return value as
success or failure), or C<undef> if STRING is not a string.

[INTERNAL] Turn off the UTF-8 flag in STRING. Do not use frivolously.
Returns the previous state of the UTF-8 flag (so do B<not> treat the
return value as success or failure), or C<undef> if STRING is
not a string.
=head2 UTF-16 and UTF-32 Encodings

utf_to_utf(STRING, FROM, TO [, CHECK])

The data in STRING is converted from Unicode Transfer Encoding FROM to
Unicode Transfer Encoding TO. Both FROM and TO may be any of the
following tags (case-insensitive, with or without 'utf' or 'utf-' prefix):

    '16be'    UTF-16 big-endian
    '16le'    UTF-16 little-endian
    '16'      UTF-16 native-endian
    '32be'    UTF-32 big-endian
    '32le'    UTF-32 little-endian
    '32'      UTF-32 native-endian

UTF-16 works in 16-bit (2-byte) chunks and is referred to here also as
UCS-2; UTF-32 works in 32-bit (4-byte) chunks and is referred to here
also as UCS-4. Returns the new size of STRING, or C<undef> if there's
a failure.

What if FROM is UTF-8 and the UTF-8 in STRING is malformed? See
L</"Handling Malformed Data">.

[INTERNAL] Even if CHECK is true and FROM is UTF-8, the UTF-8 flag of
STRING is not checked. If TO is UTF-8, the UTF-8 flag of STRING is
also turned on. Identical FROM and TO are fine.
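Conversion between the Unicode transfer encodings can be sketched with
C<from_to> from a currently released Encode module, used here as a
stand-in for C<utf_to_utf>. The encoding names C<UTF-16BE> and
C<UTF-16LE> are those of the current module, not the short tags listed
above.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Assumption: from_to() of a currently released Encode, standing in
# for utf_to_utf(); encoding names differ from the tags listed above.
my $data = "A\xE2\x98\xBA";              # UTF-8 for "A" and U+263A

# UTF-8 -> UTF-16 big-endian, in-place.
my $size_be = from_to($data, 'UTF-8', 'UTF-16BE');
my $as_be   = $data;                     # bytes 00 41 26 3A

# UTF-16 big-endian -> UTF-16 little-endian, in-place.
my $size_le = from_to($data, 'UTF-16BE', 'UTF-16LE');
my $as_le   = $data;                     # bytes 41 00 3A 26

print join(' ', map { sprintf '%02X', ord } split //, $as_le), "\n";
```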
=head2 Handling Malformed Data

If CHECK is not set, C<undef> is returned. If the data is supposed to
be UTF-8, an optional lexical warning (category utf8) is given. If
CHECK is true but not a code reference, the function dies. If CHECK is
a code reference, it is called with the arguments

    (MALFORMED_STRING, STRING_FROM_SO_FAR, STRING_TO_SO_FAR)

Two return values are expected from the call: the string to be used in
the result string in place of the malformed section, and the length of
the malformed section in bytes.
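The two non-default CHECK behaviours can be sketched with a currently
released Encode module. Treat this as an illustration of the idea only:
in that module the dying behaviour is requested with the C<FB_CROAK>
constant, and a code reference receives a single argument (the ordinal
of the offending character) and returns the replacement, which is not
the three-argument, two-return-value protocol described above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: CHECK handling of a currently released Encode, which
# differs in detail from the callback protocol documented above.

# CHECK true but not a code reference: the call dies on malformed data.
my $bad  = "ab\xFFcd";                   # 0xFF is never valid UTF-8
my $died = eval {
    decode('UTF-8', $bad, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
} ? 0 : 1;

# CHECK as a code reference: called for each unmappable character,
# its return value is used in place of that character.
my $chars = "caf\x{E9}";
my $ascii = encode('ascii', $chars, sub { sprintf '<U+%04X>', shift });

print "died=$died ascii=$ascii\n";
```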
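sub utf8_to_chars_check {
    # Delegate to the internal implementation; calling with & and no
    # parentheses passes the caller's @_ through unchanged.
    &_utf8_to_chars_check;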