unless defined $locale_encoding;
(warnings::warnif("layer", "Cannot figure out an encoding to use"), last)
unless defined $locale_encoding;
- if ($locale_encoding =~ /^utf-?8$/i) {
- $layer = "utf8";
- } else {
- $layer = "encoding($locale_encoding)";
- }
+ $layer = "encoding($locale_encoding)";
$std = 1;
} else {
my $target = $layer; # the layer name itself
use open IO => ':locale';
- use open ':utf8';
+ use open ':encoding(utf8)';
use open ':locale';
use open ':encoding(iso-8859-7)';
These are equivalent
- use open ':utf8';
- use open IO => ':utf8';
+ use open ':encoding(utf8)';
+ use open IO => ':encoding(utf8)';
as are these
many encodings have several aliases. See L<Encode::Supported> for
details and the list of supported locales.
-Note that C<:utf8> PerlIO layer must always be specified exactly like
-that, it is not subject to the loose matching of encoding names.
-
When open() is given an explicit list of layers (with the three-arg
syntax), they override the list declared using this pragma.
the C<:utf8> or C<:encoding> subpragmas, it converts the standard
filehandles (STDIN, STDOUT, STDERR) to comply with encoding selected
for input/output handles. For example, if both input and out are
-chosen to be C<:utf8>, a C<:std> will mean that STDIN, STDOUT, and
-STDERR are also in C<:utf8>. On the other hand, if only output is
-chosen to be in C<< :encoding(koi8r) >>, a C<:std> will cause only the
-STDOUT and STDERR to be in C<koi8r>. The C<:locale> subpragma
+chosen to be C<:encoding(utf8)>, a C<:std> will mean that STDIN, STDOUT,
+and STDERR are also in C<:encoding(utf8)>. On the other hand, if only
+output is chosen to be in C<< :encoding(koi8r) >>, a C<:std> will cause
+only the STDOUT and STDERR to be in C<koi8r>. The C<:locale> subpragma
implicitly turns on C<:std>.
The logic of C<:locale> is described in full in L<encoding>,
Note the I<characters>: depending on the status of the socket, either
(8-bit) bytes or characters are received. By default all sockets
operate on bytes, but for example if the socket has been changed using
-binmode() to operate with the C<:utf8> I/O layer (see the C<open>
-pragma, L<open>), the I/O will operate on UTF-8 encoded Unicode
-characters, not bytes. Similarly for the C<:encoding> pragma:
-in that case pretty much any characters can be read.
+binmode() to operate with the C<:encoding(utf8)> I/O layer (see the
+C<open> pragma, L<open>), the I/O will operate on UTF-8 encoded Unicode
+characters, not bytes. Similarly for the C<:encoding> pragma: in that
+case pretty much any characters can be read.
=item redo LABEL
X<redo>
otherwise.
Note the I<in bytes>: even if the filehandle has been set to
-operate on characters (for example by using the C<:utf8> open
+operate on characters (for example by using the C<:encoding(utf8)> open
layer), tell() will return byte offsets, not character offsets
(because implementing that would render seek() and tell() rather slow).
Note the I<characters>: depending on the status of the socket, either
(8-bit) bytes or characters are sent. By default all sockets operate
on bytes, but for example if the socket has been changed using
-binmode() to operate with the C<:utf8> I/O layer (see L</open>, or the
-C<open> pragma, L<open>), the I/O will operate on UTF-8 encoded
-Unicode characters, not bytes. Similarly for the C<:encoding> pragma:
-in that case pretty much any characters can be sent.
+binmode() to operate with the C<:encoding(utf8)> I/O layer (see
+L</open>, or the C<open> pragma, L<open>), the I/O will operate on UTF-8
+encoded Unicode characters, not bytes. Similarly for the C<:encoding>
+pragma: in that case pretty much any characters can be sent.
=item setpgrp PID,PGRP
X<setpgrp> X<group>
negative).
Note the I<in bytes>: even if the filehandle has been set to operate
-on characters (for example by using the C<:utf8> I/O layer), tell()
-will return byte offsets, not character offsets (because implementing
-that would render sysseek() very slow).
+on characters (for example by using the C<:encoding(utf8)> I/O layer),
+tell() will return byte offsets, not character offsets (because
+implementing that would render sysseek() very slow).
sysseek() bypasses normal buffered IO, so mixing this with reads (other
than C<sysread>, for example C<< <> >> or read()) C<print>, C<write>,
last read.
Note the I<in bytes>: even if the filehandle has been set to
-operate on characters (for example by using the C<:utf8> open
-layer), tell() will return byte offsets, not character offsets
-(because that would render seek() and tell() rather slow).
+operate on characters (for example by using the C<:encoding(utf8)> open
+layer), tell() will return byte offsets, not character offsets (because
+implementing that would render seek() and tell() rather slow).
The return value of tell() for the standard streams like the STDIN
depends on the operating system: it may return -1 or something else.
perlunifaq - Perl Unicode FAQ
-=head1 DESCRIPTION
+=head1 Q and A
This is a list of questions and answers about Unicode in Perl, intended to be
read after L<perlunitut>.
think that Unicode is special and magical, and I didn't want to disappoint
them, so I decided to call the document a Unicode tutorial.
+=head2 What character encodings does Perl support?
+
+To find out which character encodings your Perl supports, run:
+
+ perl -MEncode -le "print for Encode->encodings(':all')"
+
+=head2 Which version of perl should I use?
+
+Well, if you can, upgrade to the most recent, but certainly C<5.8.1> or newer.
+The tutorial and FAQ are based on the status quo as of C<5.8.8>.
+
+You should also check your modules, and upgrade them if necessary. For example,
+HTML::Entities requires version >= 1.32 to function correctly, even though the
+changelog is silent about this.
+
=head2 What about binary data, like images?
Well, apart from a bare C<binmode $fh>, you shouldn't treat them specially.
appropriate encoding, then join them with binary strings. See also: "What if I
don't encode?".
-=head2 What about the UTF8 flag?
-
-Please, unless you're hacking the internals, or debugging weirdness, don't
-think about the UTF8 flag at all. That means that you very probably shouldn't
-use C<is_utf8>, C<_utf8_on> or C<_utf8_off> at all.
-
-Perl's internal format happens to be UTF-8. Unfortunately, Perl can't keep a
-secret, so everyone knows about this. That is the source of much confusion.
-It's better to pretend that the internal format is some unknown encoding,
-and that you always have to encode and decode explicitly.
-
=head2 When should I decode or encode?
-Whenever you're communicating with anything that is external to your perl
+Whenever you're communicating text with anything that is external to your perl
process, like a database, a text file, a socket, or another program. Even if
the thing you're communicating with is also written in Perl.
binmode $fh, ':encoding(UTF-8)';
Some database drivers for DBI can also automatically encode and decode, but
-that is typically limited to the UTF-8 encoding, because they cheat.
-
-=head2 Cheat?! Tell me, how can I cheat?
-
-Well, because Perl's internal format is UTF-8, you can just skip the encoding
-or decoding step, and manipulate the UTF8 flag directly.
-
-Instead of C<:encoding(UTF-8)>, you can simply use C<:utf8>. This is widely
-accepted as good behavior when you're writing, but it can be dangerous when
-reading, because it causes internal inconsistency when you have invalid byte
-sequences.
-
-Instead of C<decode> and C<encode>, you could use C<_utf8_on> and C<_utf8_off>,
-but this is considered bad style. Especially C<_utf8_on> can be dangerous, for
-the same reason that C<:utf8> can.
-
-There are some shortcuts for oneliners; see C<-C> in L<perlrun>.
+that is sometimes limited to the UTF-8 encoding.
=head2 What if I don't know which encoding was used?
If you properly encode your strings for output, none of this is of your
concern, and you can just C<eval> dumped data as always.
+=head2 Why do regex character classes sometimes match only in the ASCII range?
+
+=head2 Why do some characters not uppercase or lowercase correctly?
+
+It seemed like a good idea at the time, to keep the semantics the same for
+standard strings, when Perl got Unicode support. While it might be repaired
+in the future, we now have to deal with the fact that Perl treats equal
+strings differently, depending on the internal state.
+
+Affected are C<uc>, C<lc>, C<ucfirst>, C<lcfirst>, C<\U>, C<\L>, C<\u>, C<\l>,
+C<\d>, C<\s>, C<\w>, C<\D>, C<\S>, C<\W>, C</.../i>, C<(?i:...)>,
+C</[[:posix:]]/>.
+
+To force Unicode semantics, you can upgrade the internal representation by
+doing C<utf8::upgrade($string)>. This does not change strings that were
+already upgraded.
+
+For a more detailed discussion, see L<Unicode::Semantics> on CPAN.
+
=head2 How can I determine if a string is a text string or a binary string?
You can't. Some use the UTF8 flag for this, but that's misuse, and makes well
open my $barfh, '>:encoding(BAR)', 'example.bar.txt';
print { $barfh } $_ while <$foofh>;
+=head2 What are C<decode_utf8> and C<encode_utf8>?
+
+These are alternate syntaxes for C<decode('utf8', ...)> and C<encode('utf8',
+...)>.
+
+=head2 What is a "wide character"?
+
+This term is used for characters with an ordinal value greater than 127, for
+characters with an ordinal value greater than 255, or for any character
+occupying more than one byte, depending on the context.
+
+The Perl warning "Wide character in ..." is caused by a character with an
+ordinal value greater than 255. With no specified encoding layer, Perl tries to
+fit things in ISO-8859-1 for backward compatibility reasons. When it can't, it
+emits this warning (if warnings are enabled), and outputs UTF-8 encoded data
+instead.
+
+To avoid this warning and to avoid having different output encodings in a single
+stream, always specify an encoding explicitly, for example with a PerlIO layer:
+
+ binmode STDOUT, ":encoding(UTF-8)";
+
+=head1 INTERNALS
+
+=head2 What is "the UTF8 flag"?
+
+Please, unless you're hacking the internals, or debugging weirdness, don't
+think about the UTF8 flag at all. That means that you very probably shouldn't
+use C<is_utf8>, C<_utf8_on> or C<_utf8_off> at all.
+
+The UTF8 flag, also called SvUTF8, is an internal flag that indicates that the
+current internal representation is UTF-8. Without the flag, it is assumed to be
+ISO-8859-1. Perl converts between these automatically.
+
+One of Perl's internal formats happens to be UTF-8. Unfortunately, Perl can't
+keep a secret, so everyone knows about this. That is the source of much
+confusion. It's better to pretend that the internal format is some unknown
+encoding, and that you always have to encode and decode explicitly.
+
=head2 What about the C<use bytes> pragma?
Don't use it. It makes no sense to deal with bytes in a text string, and it
C<use bytes> is usually a failed attempt to do something useful. Just forget
about it.
-=head2 What are C<decode_utf8> and C<encode_utf8>?
+=head2 What about the C<use encoding> pragma?
-These are alternate syntaxes for C<decode('utf8', ...)> and C<encode('utf8',
-...)>.
+Don't use it. Unfortunately, it assumes that the programmer's environment and
+that of the user will use the same encoding. It will use the same encoding for
+the source code and for STDIN and STDOUT. When a program is copied to another
+machine, the source code does not change, but the STDIO environment might.
+
+If you need non-ASCII characters in your source code, make it a UTF-8 encoded
+file and C<use utf8>.
+
+If you need to set the encoding for STDIN, STDOUT, and STDERR, for example
+based on the user's locale, C<use open>.
+
+=head2 What is the difference between C<:encoding> and C<:utf8>?
+
+Because UTF-8 is one of Perl's internal formats, you can often just skip the
+encoding or decoding step, and manipulate the UTF8 flag directly.
+
+Instead of C<:encoding(UTF-8)>, you can simply use C<:utf8>, which skips the
+encoding step if the data was already represented as UTF-8 internally. This is
+widely accepted as good behavior when you're writing, but it can be dangerous
+when reading, because it causes internal inconsistency when you have invalid
+byte sequences. Using C<:utf8> for input can sometimes result in security
+breaches, so please use C<:encoding(UTF-8)> instead.
+
+Instead of C<decode> and C<encode>, you could use C<_utf8_on> and C<_utf8_off>,
+but this is considered bad style. Especially C<_utf8_on> can be dangerous, for
+the same reason that C<:utf8> can.
+
+There are some shortcuts for oneliners; see C<-C> in L<perlrun>.
=head2 What's the difference between C<UTF-8> and C<utf8>?
encoding for a certain string is, but instead just encode it into the encoding
that you want.
-=head2 What character encodings does Perl support?
-
-To find out which character encodings your Perl supports, run:
-
- perl -MEncode -le "print for Encode->encodings(':all')"
-
-=head2 Which version of perl should I use?
-
-Well, if you can, upgrade to the most recent, but certainly C<5.8.1> or newer.
-The tutorial and FAQ are based on the status quo as of C<5.8.8>.
-
-You should also check your modules, and upgrade them if necessary. For example,
-HTML::Entities requires version >= 1.32 to function correctly, even though the
-changelog is silent about this.
-
=head1 AUTHOR
-Juerd Waalboer <juerd@cpan.org>
+Juerd Waalboer <#####@juerd.nl>
=head1 SEE ALSO
Wide character in print at ...
-To output UTF-8, use the C<:utf8> output layer. Prepending
+To output UTF-8, use the C<:encoding> or C<:utf8> output layer. Prepending
binmode(STDOUT, ":utf8");
The matching of encoding names is loose: case does not matter, and
many encodings have several aliases. Note that the C<:utf8> layer
must always be specified exactly like that; it is I<not> subject to
-the loose matching of encoding names.
+the loose matching of encoding names. Also note that C<:utf8> is unsafe for
+input, because it accepts the data without checking that it is valid UTF-8.
See L<PerlIO> for the C<:utf8> layer, L<PerlIO::encoding> and
L<Encode::PerlIO> for the C<:encoding()> layer, and
Unicode in Perl's eyes. To do that, specify the appropriate
layer when opening files
- open(my $fh,'<:utf8', 'anything');
+ open(my $fh,'<:encoding(utf8)', 'anything');
my $line_of_unicode = <$fh>;
open(my $fh,'<:encoding(Big5)', 'anything');
The I/O layers can also be specified more flexibly with
the C<open> pragma. See L<open>, or look at the following example.
- use open ':utf8'; # input and output default layer will be UTF-8
+ use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
open X, ">file";
print X chr(0x100), "\n";
close X;
printf "%#x\n", ord(<I>), "\n"; # this should print 0xc1
close I;
-or you can also use the C<':encoding(...)'> layer
-
- open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
- my $line_of_unicode = <$epic>;
-
These methods install a transparent filter on the I/O stream that
converts data from the specified encoding when it is read in from the
stream. The result is always Unicode.
local $/; ## read in the whole file of 8-bit characters
$t = <F>;
close F;
- open F, ">:utf8", "file";
+ open F, ">:encoding(utf8)", "file";
print F $t; ## convert to UTF-8 on output
close F;
If you run this code twice, the contents of the F<file> will be twice
-UTF-8 encoded. A C<use open ':utf8'> would have avoided the bug, or
-explicitly opening also the F<file> for input as UTF-8.
+UTF-8 encoded. A C<use open ':encoding(utf8)'> would have avoided the
+bug, as would explicitly opening the F<file> for input as UTF-8 too.
B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
Perl has been built with the new PerlIO feature (which is the default