presented with byte data. The implementation is still new and
(particularly on EBCDIC platforms) may need further work.
-=item C<use utf8> still needed to enable a few features
+=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
The C<utf8> pragma implements the tables used for Unicode support.
These tables are automatically loaded on demand, so the C<utf8> pragma
need not normally be used.
However, as a compatibility measure, this pragma must be explicitly
-used to enable recognition of UTF-8 encoded literals and identifiers
-in the source text on ASCII based machines or recognize UTF-EBCDIC
-encoded literals and identifiers on EBCDIC based machines.
+used to enable recognition of UTF-8 in the Perl scripts themselves on
+ASCII-based machines, or of UTF-EBCDIC on EBCDIC-based machines.
+B<NOTE: this should be the only place where an explicit C<use utf8> is
+needed>.
=back
The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
-It may also be used for enabling some of the more experimental Unicode
-support features. Note that this pragma is only required until a
-future version of Perl in which character semantics will become the
-default. This pragma may then become a no-op. See L<utf8>.
+Note that this pragma is only required until a future version of Perl
+in which character semantics will become the default. This pragma may
+then become a no-op. See L<utf8>.
Unless mentioned otherwise, Perl operators will use character semantics
when they are dealing with Unicode data, and byte semantics otherwise.
apply; otherwise, byte semantics are in effect. To force byte semantics
on Unicode data, the C<bytes> pragma should be used.
+Note that if you have a string with byte semantics and you then
+add character data to it, the bytes will be upgraded I<as if they
+were ISO 8859-1 (Latin-1)> (or, on EBCDIC platforms, after a translation
+to ISO 8859-1).
+
Under character semantics, many operations that formerly operated on
bytes change to operating on characters. For ASCII data this makes no
difference, because UTF-8 stores ASCII in single bytes, but for any