[p5sagit/p5-mst-13.2.git] / lib / utf8.pm
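package utf8;

# Turn the pragma on: set bit 0x00000008 in the compile-time hints ($^H).
sub import {
    $^H |= 0x00000008;
    # Remember the optional argument (if any) for the calling package.
    $enc{caller()} = $_[1] if $_[1];
}

# Turn the pragma off: clear the same hint bit.
sub unimport {
    $^H &= ~0x00000008;
}

# Load the bulk of the implementation on first use, then re-dispatch
# to the routine that was originally called.
sub AUTOLOAD {
    require "utf8_heavy.pl";
    goto &$AUTOLOAD;
}

1;
__END__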

=head1 NAME

utf8 - Perl pragma to turn on UTF-8 and Unicode support

=head1 SYNOPSIS

    use utf8;
    no utf8;

=head1 DESCRIPTION

The utf8 pragma tells Perl to use UTF-8 as its internal string
representation for the rest of the enclosing block. (The "no utf8"
pragma tells Perl to switch back to ordinary byte-oriented processing
for the rest of the enclosing block.) Under utf8, many operations that
formerly operated on bytes change to operating on characters. For
ASCII data this makes no difference, because UTF-8 stores ASCII in
single bytes, but for any character greater than C<chr(127)>, the
character is stored in a sequence of two or more bytes, all of which
have the high bit set. But by and large, the user need not worry about
this, because the utf8 pragma hides it from the user. A character
under utf8 is logically just a number ranging from 0 to 2**32 or so.
Larger characters encode to longer sequences of bytes, but again, this
is hidden.
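
The byte layout described above can be illustrated with a minimal sketch.
It assumes the standard UTF-8 bit patterns and handles only code points up
to 0x10FFFF (at most four bytes), which is narrower than the range
mentioned above. The helper C<utf8_bytes> is hypothetical and is not part
of this module; it only shows that ASCII stays in one byte while larger
characters become longer sequences whose bytes all have the high bit set.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of utf8.pm): return the list of byte
# values that standard UTF-8 uses for one code point.
# Assumption: only code points up to 0x10FFFF are handled.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                       # ASCII: one byte
    return (0xC0 | ($cp >> 6),
            0x80 | ($cp & 0x3F)) if $cp < 0x800;      # two bytes
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;    # three bytes
    die "code point out of range\n" if $cp > 0x10FFFF;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));                     # four bytes
}

# Show the encoded bytes, in hex, for an ASCII letter, a Latin-1
# letter, and the smiley face U+263A.
for my $cp (0x41, 0xE9, 0x263A) {
    printf "U+%04X => %s\n", $cp,
        join " ", map { sprintf "%02X", $_ } utf8_bytes($cp);
}
```

Run as a script, this should print one line per code point, for example
C<U+263A =E<gt> E2 98 BA> for the smiley.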

Use of the utf8 pragma has the following effects:

=over 4

=item *

Strings and patterns may contain characters that have an ordinal value
larger than 255. Presuming you use a Unicode editor to edit your
program, these will typically occur directly within the literal strings
as UTF-8 characters, but you can also specify a particular character
with an extension of the C<\x> notation. UTF-8 characters are
specified by putting the hexadecimal code within curlies after the
C<\x>. For instance, a Unicode smiley face is C<\x{263A}>. A
character in the Latin-1 range (128..255) should be written C<\x{ab}>
rather than C<\xab>, since the former will turn into a two-byte UTF-8
code, while the latter will continue to be interpreted as generating an
8-bit byte rather than a character. In fact, if -w is turned on, C<\xab>
will produce a warning that you might be generating invalid UTF-8.

=item *

Identifiers within the Perl script may contain Unicode alphanumeric
characters, including ideographs. (You are currently on your own when
it comes to using the canonical forms of characters--Perl doesn't (yet)
attempt to canonicalize variable names for you.)

=item *

Regular expressions match characters instead of bytes. For instance,
"." matches a character instead of a byte. (However, the C<\O> pattern
is provided to force a match of a single byte ("octet", hence C<\O>).)

=item *

Character classes in regular expressions match characters instead of
bytes, and match against the character properties specified in the
Unicode properties database. So C<\w> can be used to match an ideograph,
for instance.

=item *

Named Unicode properties and block ranges may be used as character
classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
match property) constructs. For instance, C<\p{Lu}> matches any
character with the Unicode uppercase property, while C<\p{M}> matches
any mark character. Single-letter properties may omit the braces, so
the latter can also be written C<\pM>. Many predefined character classes
are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.

=item *

The special pattern C<\X> matches any extended Unicode sequence
(a "combining character sequence" in Standardese), where the first
character is a base character and subsequent characters are mark
characters that apply to the base character. It is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes. It can also
be forced to translate between 8-bit codes and UTF-8 regardless of the
surrounding utf8 state. For instance, if you know your input is in
Latin-1, you can say:

    use utf8;
    while (<>) {
        tr/\0-\xff//CU;         # latin1 char to utf8
        ...
    }

Similarly you could translate your output with

    tr/\0-\x{ff}//UC;           # utf8 to latin1 char

No, C<s///> doesn't take /U or /C (yet?).

=item *

Case translation operators use the Unicode case translation tables.
Note that C<uc()> translates to uppercase, while C<ucfirst> translates
to titlecase (for languages that make the distinction). Naturally
the corresponding backslash sequences have the same semantics.

=item *

Most operators that deal with positions or lengths in the string will
automatically switch to using character positions, including C<chop()>,
C<substr()>, C<pos()>, C<index()>, C<rindex()>, C<sprintf()>,
C<write()>, and C<length()>. Operators that specifically don't switch
include C<vec()>, C<pack()>, and C<unpack()>. Operators that really
don't care include C<chomp()>, as well as any other operator that
treats a string as a bucket of bits, such as C<sort()>, and the
operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change,
since they're often used for byte-oriented formats. (Again, think
"C<char>" in the C language.) However, there is a new "C<U>" specifier
that will convert between UTF-8 characters and integers. (It works
outside of the utf8 pragma too.)

=item *

The C<chr()> and C<ord()> functions work on characters. This is like
C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and
C<unpack("C")>. In fact, the latter are how you now emulate
byte-oriented C<chr()> and C<ord()> under utf8.

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back
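
Several of the effects listed above can be checked with a short script.
This is only a sketch. It assumes a perl on which C<\x{...}> escapes
produce character strings as described, and that C<\X> treats a base
character plus its marks as one unit. All helper names are hypothetical.
The script prints only numbers, so no output encoding is involved.

```perl
use strict;
use warnings;

# Hypothetical helpers, each exercising one effect from the list.
sub char_count    { return length $_[0] }                    # length in characters
sub starts_upper  { return $_[0] =~ /^\p{Lu}/ ? 1 : 0 }      # uppercase property
sub cluster_count { my @c = $_[0] =~ /\X/g; return scalar @c }   # base + marks
sub reversed      { return scalar reverse $_[0] }            # reverse by character

my $smiley = "\x{263A}";          # one character with an ordinal above 255
my $packed = pack("U", 0x263A);   # the same character built from its number

print char_count($smiley), "\n";                # one character, not three bytes
print ord($smiley), "\n";                       # its code point as an integer
print(($packed eq $smiley ? 1 : 0), "\n");      # pack("U") agrees with \x{...}
print starts_upper("\x{0394}x"), "\n";          # GREEK CAPITAL LETTER DELTA
print cluster_count("e\x{301}a"), "\n";         # "e" + combining acute, then "a"
print((reversed("a\x{263A}") eq "\x{263A}a" ? 1 : 0), "\n");
```

Under those assumptions the six lines printed should be 1, 9786, 1, 1, 2
and 1.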

=head1 CAVEATS

As yet, there is no method for automatically coercing input and
output to some encoding other than UTF-8. This is planned in the near
future, however.

In any event, you'll need to keep track of whether interfaces to other
modules expect UTF-8 data or something else. The utf8 pragma does not
magically mark strings for you in order to remember their encoding, nor
will any automatic coercion happen (other than that eventually planned
for I/O). If you want such automatic coercion, you can build yourself
a set of pretty object-oriented modules. Expect it to run considerably
slower than this low-level support.

Use of locales with utf8 may lead to odd results. Currently there is
some attempt to apply 8-bit locale info to characters in the range
0..255, but this is demonstrably incorrect for locales that use
characters above that range (when mapped into Unicode). It will also
tend to run slower. Avoidance of locales is strongly encouraged.

=cut