Advertise the encoding pragma at the utf8 pragma.
package utf8;

# Bit in $^H that marks "use utf8" as being in effect for the lexical scope.
$utf8::hint_bits = 0x00800000;

our $VERSION = '1.02';

sub import {
    $^H |= $utf8::hint_bits;
    # Remember any argument given to "use utf8", keyed by the calling package.
    $enc{caller()} = $_[1] if $_[1];
}

sub unimport {
    $^H &= ~$utf8::hint_bits;
}

# Load the heavyweight helpers on first use, then retry the original call.
sub AUTOLOAD {
    require "utf8_heavy.pl";
    goto &$AUTOLOAD if defined &$AUTOLOAD;
    Carp::croak("Undefined subroutine $AUTOLOAD called");
}

1;
__END__

=head1 NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

=head1 SYNOPSIS

    use utf8;
    no utf8;

    $num_octets = utf8::upgrade($string);
    $success    = utf8::downgrade($string[, FAIL_OK]);

    utf8::encode($string);
    utf8::decode($string);

    $flag = utf8::is_utf8(STRING); # since Perl 5.8.1
    $flag = utf8::valid(STRING);

=head1 DESCRIPTION

The C<use utf8> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based
platforms). The C<no utf8> pragma tells Perl to switch back to treating
the source text as literal bytes in the current lexical scope.

This pragma is primarily a compatibility device. Perl versions
earlier than 5.6 allowed arbitrary bytes in source code, whereas
in future we would like to standardize on the UTF-8 encoding for
source text.

Until UTF-8 becomes the default format for source text, either this
pragma or the L<encoding> pragma should be used to recognize UTF-8
in the source. When UTF-8 becomes the standard source format, this
pragma will effectively become a no-op. For convenience, in what
follows the term I<UTF-X> is used to refer to UTF-8 on ASCII and ISO
Latin based platforms and UTF-EBCDIC on EBCDIC based platforms.

Enabling the C<utf8> pragma has the following effect:

=over 4

=item *

Bytes in the source text that have their high-bit set will be treated
as being part of a literal UTF-8 character. This includes most
literals such as identifier names, string constants, and constant
regular expression patterns.

On EBCDIC platforms characters in the Latin 1 character set are
treated as being part of a literal UTF-EBCDIC character.

=back

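As an illustration of the effect described above, the following minimal
sketch compiles the same string literal with and without the pragma. It
assumes an ASCII-based platform. The variable names are made up for this
example, and the source text is built from escapes and compiled with a
string C<eval> only so that the result does not depend on how the file
holding the example is stored.

```perl
use strict;
use warnings;

# "\xC3\xA9" is the two-octet UTF-8 encoding of U+00E9 (e with acute).
# $literal holds source text: a double-quoted literal containing those
# two raw octets. (Hypothetical name, used only in this sketch.)
my $literal = qq{"\xC3\xA9"};

# Without the pragma, each source octet becomes one character.
my $len_bytes = eval "no utf8; length($literal)";
die $@ if $@;

# With the pragma, the two octets are parsed as one UTF-8 character.
my $len_chars = eval "use utf8; length($literal)";
die $@ if $@;

print "no utf8:  length = $len_bytes\n";
print "use utf8: length = $len_chars\n";
```

In ordinary code the literal would simply appear in a file saved as UTF-8
with C<use utf8;> near the top; the C<eval> is only a device for showing
both cases side by side.
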
Note that if you have bytes with the eighth bit on in your script
(for example embedded Latin-1 in your string literals), C<use utf8>
will be unhappy since the bytes are most probably not well-formed
UTF-8. If you want to have such bytes and use utf8, you can disable
utf8 until the end of the block (or file, if at top level) by C<no utf8;>.

If you want to automatically upgrade your 8-bit legacy bytes to UTF-8,
use the L<encoding> pragma instead of this pragma. For example, if
you want to implicitly upgrade your ISO 8859-1 (Latin-1) bytes to UTF-8
as used in e.g. C<chr()> and C<\x{...}>, try this:

    use encoding "latin-1";
    my $c = chr(0xc4);
    my $x = "\x{c5}";

In case you are wondering: yes, C<use encoding 'utf8';> works much
the same as C<use utf8;>.

=head2 Utility functions

The following functions are defined in the C<utf8::> package by the
Perl core. You do not need to say C<use utf8> to use these and in fact
you should not unless you really want to have UTF-8 source code.

=over 4

=item * $num_octets = utf8::upgrade($string)

Converts (in-place) the internal representation of the string to Perl's
internal I<UTF-X> form. Returns the number of octets necessary to
represent the string as I<UTF-X>. Can be used to make sure that the
UTF-8 flag is on, so that C<\w> or C<lc()> work as expected on strings
containing characters in the range 0x80-0xFF (on ASCII and
derivatives). Note that this should not be used to convert a legacy
byte encoding to Unicode: use Encode for that. Affected by the
encoding pragma.

=item * $success = utf8::downgrade($string[, FAIL_OK])

Converts (in-place) the internal representation of the string to
un-encoded bytes. Returns true on success. On failure it dies or, if
the value of FAIL_OK is true, returns false. Can be used to make sure
that the UTF-8 flag is off, e.g. when you want to make sure that the
substr() or length() function works with the usually faster byte
algorithm. Note that this should not be used to convert Unicode back
to a legacy byte encoding: use Encode for that. B<Not> affected by the
encoding pragma.

=item * utf8::encode($string)

Converts (in-place) I<$string> from logical characters to the octet
sequence representing it in Perl's I<UTF-X> encoding. Returns
nothing. Same as Encode::encode_utf8(). Note that this should not be
used to convert a legacy byte encoding to Unicode: use Encode for
that.

=item * utf8::decode($string)

Attempts to convert I<$string> in-place from Perl's I<UTF-X> encoding
into logical characters. Returns nothing. Same as Encode::decode_utf8().
Note that this should not be used to convert Unicode back to a legacy
byte encoding: use Encode for that.

=item * $flag = utf8::is_utf8(STRING)

(Since Perl 5.8.1) Tests whether STRING is in UTF-8. Functionally
the same as Encode::is_utf8().

=item * $flag = utf8::valid(STRING)

[INTERNAL] Tests whether STRING is in a consistent state regarding
UTF-8. Returns true if the string is well-formed UTF-8 and has the
UTF-8 flag on B<or> if the string is held as bytes (both these states
are 'consistent'). The main reason for this routine is to allow Perl's
testsuite to check that operations have left strings in a consistent
state. You most probably want to use utf8::is_utf8() instead.

=back

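The following minimal sketch shows how C<utf8::upgrade>,
C<utf8::downgrade> and C<utf8::is_utf8> interact. It assumes an
ASCII-based platform and a Perl that provides C<utf8::is_utf8> (5.8.1 or
later); the variable names are made up for this example. Note that only
the internal representation changes: the characters and the length of
the string stay the same.

```perl
use strict;
use warnings;

# Four characters, all below 0x100, so Perl holds the string as bytes.
my $string = "caf\xE9";
my $flag_before = utf8::is_utf8($string) ? 1 : 0;

# upgrade() switches to the internal UTF-X form and reports how many
# octets that form needs; the characters themselves are unchanged.
my $num_octets   = utf8::upgrade($string);
my $flag_after   = utf8::is_utf8($string) ? 1 : 0;
my $length_after = length $string;

# downgrade() switches back to the byte form.
my $success   = utf8::downgrade($string) ? 1 : 0;
my $flag_down = utf8::is_utf8($string) ? 1 : 0;

# A character above 0xFF cannot be held as bytes. With FAIL_OK true,
# downgrade() returns false instead of dying.
my $wide    = "\x{263A}";
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "flag before upgrade:   $flag_before\n";
print "octets after upgrade:  $num_octets\n";
print "flag after upgrade:    $flag_after\n";
print "length after upgrade:  $length_after\n";
print "downgrade succeeded:   $success (flag now $flag_down)\n";
print "wide downgrade result: $wide_ok\n";
```
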
C<utf8::encode> is like C<utf8::upgrade>, but the UTF8 flag is
cleared. See L<perlunicode> for more on the UTF8 flag and the C API
functions C<sv_utf8_upgrade>, C<sv_utf8_downgrade>, C<sv_utf8_encode>,
and C<sv_utf8_decode>, which are wrapped by the Perl functions
C<utf8::upgrade>, C<utf8::downgrade>, C<utf8::encode> and
C<utf8::decode>. Note that in the Perl 5.8.0 and 5.8.1 implementations
the functions utf8::is_utf8, utf8::valid, utf8::encode, utf8::decode,
utf8::upgrade, and utf8::downgrade are always available, without a
C<require utf8> statement; this may change in future releases.

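The difference between C<utf8::encode> and C<utf8::upgrade> can be seen
in a small sketch like the one below, which also round-trips the encoded
octets through C<utf8::decode>. It assumes an ASCII-based platform, and
the variable names are chosen only for illustration.

```perl
use strict;
use warnings;

# upgrade(): still one logical character, UTF8 flag on.
my $upgraded = "\xE9";
utf8::upgrade($upgraded);

# encode(): the character is replaced by its UTF-8 octets (C3 A9),
# so the string now has two "characters" and the UTF8 flag is off.
my $encoded = "\xE9";
utf8::encode($encoded);

# decode(): turns those octets back into the original character.
my $decoded = $encoded;
utf8::decode($decoded);

printf "upgraded: length=%d flag=%d\n",
    length($upgraded), utf8::is_utf8($upgraded) ? 1 : 0;
printf "encoded:  length=%d flag=%d\n",
    length($encoded), utf8::is_utf8($encoded) ? 1 : 0;
printf "decoded:  length=%d same-as-original=%d\n",
    length($decoded), $decoded eq "\xE9" ? 1 : 0;
```
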
=head1 BUGS

One can have Unicode in identifier names, but not in package/class or
subroutine names. While some limited functionality towards this does
exist as of Perl 5.8.0, that is more accidental than designed; use of
Unicode for the said purposes is unsupported.

One reason for this unfinished state is its (currently) inherent
unportability: since both package names and subroutine names may need
to be mapped to file and directory names, the Unicode capability of
the filesystem becomes important, and unfortunately there aren't
portable answers.

=head1 SEE ALSO

L<perluniintro>, L<encoding>, L<perlunicode>, L<bytes>

=cut