Otherwise, it uses UTF-8.
A user of Perl does not normally need to know nor care how Perl
-happens to encodes its internal strings, but it becomes relevant when
+happens to encode its internal strings, but it becomes relevant when
outputting Unicode strings to a stream without a discipline (one with
the "default default"). In such a case, the raw bytes used internally
(the native character set or UTF-8, as appropriate for each string)
constants: you cannot use variables in them. If you want similar
run-time functionality, use C<chr()> and C<charnames::vianame()>.
+Also note that if all the code points for pack "U" are below 0x100,
+bytes will be generated, just as if you were using C<chr()>.
+
+ my $bytes = pack("U*", 0x80, 0xFF);
+
+If you want to force the result to Unicode characters, use the special
+C<"U0"> prefix. It consumes no arguments but forces the result to be
+in Unicode characters, instead of bytes.
+
+ my $chars = pack("U0U*", 0x80, 0xFF);
+
=head2 Handling Unicode
Handling Unicode is for the most part transparent: just use the
=over 4
-=item Will My Old Scripts Break?
+=item
+
+Will My Old Scripts Break?
Very probably not. Unless you are generating Unicode characters
somehow, any old behaviour should be preserved. About the only
than 255 produced a character modulo 255 (for example, C<chr(300)>
was equal to C<chr(45)>).
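The changed C<chr()> behaviour can be checked with a few lines. This is only a minimal sketch: the variable names are illustrative, and it assumes a Perl with the Unicode model described here.

```perl
use strict;
use warnings;

# chr() with an argument above 255 now yields a single Unicode
# character instead of wrapping around into the 0..255 range.
my $char = chr(300);
my $len  = length($char);   # counted in characters, not bytes
my $code = ord($char);      # the code point we asked for

printf "length=%d code point=U+%04X\n", $len, $code;
```

Only numbers are printed here, so no output discipline is needed on STDOUT.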
-=item How Do I Make My Scripts Work With Unicode?
+=item
+
+How Do I Make My Scripts Work With Unicode?
Very little work should be needed since nothing changes until you
somehow generate Unicode data. The greatest trick will be getting
input as Unicode, and for that see the earlier I/O discussion.
-=item How Do I Know Whether My String Is In Unicode?
+=item
+
+How Do I Know Whether My String Is In Unicode?
You shouldn't care. No, you really shouldn't. If you have
to care (beyond the cases described above), it means that we
use bytes;
print length($unicode), "\n"; # will print 2 (the 0xC4 0x80 of the UTF-8)
-=item How Do I Detect Data That's Not Valid In a Particular Encoding
+=item
+
+How Do I Detect Data That's Not Valid In a Particular Encoding?
Use the C<Encode> package to try converting it.
For example,
If invalid, a C<Malformed UTF-8 character (byte 0x##) in
unpack> is produced. The "U0" means "expect strictly UTF-8
-encoded Unicode". Without that the C<unpack("U*", ...)>
-would accept also data like C<chr(0xFF>).
+encoded Unicode". Without that the C<unpack("U*", ...)>
+would also accept data like C<chr(0xFF)>, similarly to the
+C<pack> as we saw earlier.
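One way to "try converting" with C<Encode> might look like the following sketch. The helper name C<is_valid_utf8> is hypothetical (it is not part of C<Encode>), and the sketch assumes the strict C<UTF-8> encoding name and the C<Encode::FB_CROAK> check value, which makes C<decode> die on malformed input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its input when a CHECK is given
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

print is_valid_utf8("\xC4\x80") ? "valid\n" : "invalid\n";   # U+0100 as UTF-8
print is_valid_utf8("\xFF")     ? "valid\n" : "invalid\n";   # lone 0xFF byte
```

Wrapping the call in C<eval> keeps the failure local, so the caller sees a simple true or false value rather than an exception.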
+
+=item
-=item How Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa?
+How Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa?
This probably isn't as useful as you might think.
Normally, you shouldn't need to.
use C<unpack("C*", $string)> for the former, and you can create
well-formed Unicode data by C<pack("U*", 0xff, ...)>.
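The two directions mentioned above can be sketched as follows. The variable names are illustrative only, and the byte string is deliberately kept below 0x100 so that C<unpack("C*", ...)> sees plain octets.

```perl
use strict;
use warnings;

# Binary data to a list of octet values.
my $binary = "AB\xFF";
my @octets = unpack("C*", $binary);    # one number per byte

# A list of code points to a string of characters.
my $string = pack("U*", 0x41, 0x100);
my @points = map { ord } split //, $string;

print "octets: @octets\n";
printf "code points: %s\n", join " ", map { sprintf "U+%04X", $_ } @points;
```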
-=item How Do I Display Unicode? How Do I Input Unicode?
+=item
+
+How Do I Display Unicode? How Do I Input Unicode?
See http://www.hclrss.demon.co.uk/unicode/ and
http://www.cl.cam.ac.uk/~mgk25/unicode.html
-=item How Does Unicode Work With Traditional Locales?
+=item
+
+How Does Unicode Work With Traditional Locales?
In Perl, not very well. Avoid using locales through the C<locale>
pragma. Use only one or the other.