You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.
+=item C<use encoding> needed to upgrade non-Latin-1 byte strings
+
+By default, there is a fundamental asymmetry in Perl's Unicode model:
+implicit upgrading from byte strings to Unicode strings assumes that
+they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
+downgraded with UTF-8 encoding.  This happens because the first 256
+code points in Unicode happen to agree with Latin-1.
+
+If you wish to interpret byte strings as UTF-8 instead, use the
+C<encoding> pragma:
+
+ use encoding 'utf8';
+
+See L</"Byte and Character Semantics"> for more details.
+
=back
=head2 Byte and Character Semantics
be used to force byte semantics on Unicode data.
If strings operating under byte semantics and strings with Unicode
-character data are concatenated, the new string will be upgraded to
-I<ISO 8859-1 (Latin-1)>, even if the old Unicode string used EBCDIC.
-This translation is done without regard to the system's native 8-bit
-encoding, so to change this for systems with non-Latin-1 and
-non-EBCDIC native encodings use the C<encoding> pragma. See
-L<encoding>.
+character data are concatenated, the new string will be created by
+decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
+old Unicode string used EBCDIC. This translation is done without
+regard to the system's native 8-bit encoding. To change this for
+systems with non-Latin-1 and non-EBCDIC native encodings, use the
+C<encoding> pragma. See L<encoding>.
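
The default behaviour described in this hunk can be checked with a short script. This is only a minimal sketch, assuming an ASCII (non-EBCDIC) platform and no C<encoding> pragma in effect; the variable names are illustrative and are not part of the patch.

```perl
use strict;
use warnings;

# A byte string holding the single byte 0xE9.
my $bytes = "\xE9";

# A string with a code point above 0xFF, which forces character semantics.
my $unicode = "\x{263A}";

# Concatenation decodes the byte string as Latin-1, so the byte 0xE9
# becomes the character U+00E9 rather than being read as UTF-8.
my $joined = $bytes . $unicode;

printf "length: %d\n", length $joined;
printf "code points: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $joined;
```

Under these assumptions the script reports a length of 2 and the code points U+00E9 and U+263A, which is the Latin-1 interpretation the new paragraph documents.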
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.
+(However, and as a limitation of the current implementation, using
+C<\w> or C<\W> I<inside> a C<[...]> character class will still match
+with byte semantics.)
+
=item *
Named Unicode properties, scripts, and block ranges may be used like
encoding or another) could be given as arguments or received as
results, or both, but it is not.
-The following are such interfaces. For all of these Perl currently
-(as of 5.8.1) simply assumes byte strings both as arguments and results.
+The following are such interfaces. For all of these interfaces Perl
+currently (as of 5.8.3) simply assumes byte strings both as arguments
+and results, or UTF-8 strings if the C<encoding> pragma has been used.
One reason why Perl does not attempt to resolve the role of Unicode in
this cases is that the answers are highly dependent on the operating
=item *
-chmod, chmod, chown, chroot, exec, link, mkdir
-rename, rmdir stat, symlink, truncate, unlink, utime
+chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
+rename, rmdir, stat, symlink, truncate, unlink, utime, -X
=item *
=item *
-C<uvuni_to_utf8(buf, chr>) writes a Unicode character code point into
+C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes.
In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
-somewhat less spectacular. Operations with UTF-8 encoded strings are
-still slower, though.
+somewhat less spectacular, at least for some operations. In general,
+operations with UTF-8 encoded strings are still slower. As an example,
+the Unicode properties (character classes) like C<\p{Nd}> are known to
+be quite a bit slower (5-20 times) than their simpler counterparts
+like C<\d> (then again, there are 268 Unicode characters matching
+C<Nd> compared with the 10 ASCII characters matching C<\d>).
=head2 Porting code from perl-5.6.X