Commit | Line | Data |
a0ed51b3 |
1 | package utf8; |
2 | |
3 | sub import { |
4 |     $^H |= 0x00000008; |
5 |     $enc{caller()} = $_[1] if $_[1]; |
6 | } |
7 | |
8 | sub unimport { |
9 |     $^H &= ~0x00000008; |
10 | } |
11 | |
12 | sub AUTOLOAD { |
13 |     require "utf8_heavy.pl"; |
14 |     goto &$AUTOLOAD; |
15 | } |
16 | |
17 | 1; |
18 | __END__ |
19 | |
20 | =head1 NAME |
21 | |
22 | utf8 - Perl pragma to turn on UTF-8 and Unicode support |
23 | |
24 | =head1 SYNOPSIS |
25 | |
26 |     use utf8; |
27 |     no utf8; |
28 | |
29 | =head1 DESCRIPTION |
30 | |
31 | The utf8 pragma tells Perl to use UTF-8 as its internal string |
32 | representation for the rest of the enclosing block. (The "no utf8" |
33 | pragma tells Perl to switch back to ordinary byte-oriented processing |
34 | for the rest of the enclosing block.) Under utf8, many operations that |
35 | formerly operated on bytes change to operating on characters. For |
36 | ASCII data this makes no difference, because UTF-8 stores ASCII in |
37 | single bytes, but for any character greater than C<chr(127)>, the |
38 | character is stored in a sequence of two or more bytes, all of which |
39 | have the high bit set. But by and large, the user need not worry about |
40 | this, because the utf8 pragma hides it from the user. A character |
41 | under utf8 is logically just a number ranging from 0 to 2**32 or so. |
42 | Larger characters encode to longer sequences of bytes, but again, this |
43 | is hidden. |
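
The byte-level layout described above can be inspected directly. The following is only a sketch: it assumes a modern Perl (5.8 or later) with the core Encode module, which postdates this text, and C<utf8_octets()> is a hypothetical helper that is not part of the utf8 pragma.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 octets that encode one code point.
# utf8_octets() is a hypothetical helper, not part of the utf8 pragma.
sub utf8_octets {
    my ($code_point) = @_;
    return unpack('C*', encode('UTF-8', chr($code_point)));
}

# Sample code points chosen only for illustration.
for my $code_point (0x41, 0xE9, 0x263A) {
    my @octets   = utf8_octets($code_point);
    my $high_set = grep { $_ & 0x80 } @octets;   # octets with the high bit set
    printf "U+%04X -> %d octet(s), %d with the high bit set\n",
        $code_point, scalar(@octets), $high_set;
}
```

The ASCII letter occupies a single octet with the high bit clear, while the two characters above C<chr(127)> occupy two and three octets, every one of which has the high bit set.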
44 | |
45 | Use of the utf8 pragma has the following effects: |
46 | |
47 | =over 4 |
48 | |
49 | =item * |
50 | |
51 | Strings and patterns may contain characters that have an ordinal value |
52 | larger than 255. Presuming you use a Unicode editor to edit your |
53 | program, these will typically occur directly within the literal strings |
54 | as UTF-8 characters, but you can also specify a particular character |
55 | with an extension of the C<\x> notation. UTF-8 characters are |
56 | specified by putting the hexadecimal code within curlies after the |
57 | C<\x>. For instance, a Unicode smiley face is C<\x{263A}>. A |
58 | character in the Latin-1 range (128..255) should be written C<\x{ab}> |
59 | rather than C<\xab>, since the former will turn into a two-byte UTF-8 |
60 | code, while the latter will continue to be interpreted as generating an |
61 | 8-bit byte rather than a character. In fact, if -w is turned on, it will |
62 | produce a warning that you might be generating invalid UTF-8. |
63 | |
64 | =item * |
65 | |
66 | Identifiers within the Perl script may contain Unicode alphanumeric |
67 | characters, including ideographs. (You are currently on your own when |
68 | it comes to using the canonical forms of characters--Perl doesn't (yet) |
69 | attempt to canonicalize variable names for you.) |
70 | |
71 | =item * |
72 | |
73 | Regular expressions match characters instead of bytes. For instance, |
74 | "." matches a character instead of a byte. (However, the C<\C> pattern |
75 | is provided to force a match of a single byte ("C<char>" in C, hence |
76 | C<\C>).) |
77 | |
78 | =item * |
79 | |
80 | Character classes in regular expressions match characters instead of |
81 | bytes, and match against the character properties specified in the |
82 | Unicode properties database. So C<\w> can be used to match an ideograph, |
83 | for instance. |
84 | |
85 | =item * |
86 | |
87 | Named Unicode properties and block ranges may be used as character |
88 | classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't |
89 | match property) constructs. For instance, C<\p{Lu}> matches any |
90 | character with the Unicode uppercase property, while C<\p{M}> matches |
91 | any mark character. Single letter properties may omit the curlies, so |
92 | that can be written C<\pM> also. Many predefined character classes are |
93 | available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>. |
94 | |
95 | =item * |
96 | |
97 | The special pattern C<\X> matches any extended Unicode sequence |
98 | (a "combining character sequence" in Standardese), where the first |
99 | character is a base character and subsequent characters are mark |
100 | characters that apply to the base character. It is equivalent to |
22244bdb |
101 | C<(?:\PM\pM*)>. |
a0ed51b3 |
102 | |
103 | =item * |
104 | |
105 | The C<tr///> operator translates characters instead of bytes. It can also |
106 | be forced to translate between 8-bit codes and UTF-8 regardless of the |
107 | surrounding utf8 state. For instance, if you know your input is in Latin-1, |
108 | you can say: |
109 | |
110 |     use utf8; |
111 |     while (<>) { |
112 |         tr/\0-\xff//CU; # latin1 char to utf8 |
113 |         ... |
114 |     } |
115 | |
116 | Similarly you could translate your output with |
117 | |
118 |     tr/\0-\x{ff}//UC; # utf8 to latin1 char |
119 | |
120 | No, C<s///> doesn't take /U or /C (yet?). |
121 | |
122 | =item * |
123 | |
124 | Case translation operators use the Unicode case translation tables. |
125 | Note that C<uc()> translates to uppercase, while C<ucfirst> translates |
126 | to titlecase (for languages that make the distinction). Naturally |
127 | the corresponding backslash sequences have the same semantics. |
128 | |
129 | =item * |
130 | |
131 | Most operators that deal with positions or lengths in the string will |
132 | automatically switch to using character positions, including C<chop()>, |
133 | C<substr()>, C<pos()>, C<index()>, C<rindex()>, C<sprintf()>, |
134 | C<write()>, and C<length()>. Operators that specifically don't switch |
135 | include C<vec()>, C<pack()>, and C<unpack()>. Operators that really |
136 | don't care include C<chomp()>, as well as any other operator that |
137 | treats a string as a bucket of bits, such as C<sort()>, and the |
138 | operators dealing with filenames. |
139 | |
140 | =item * |
141 | |
142 | The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change, |
143 | since they're often used for byte-oriented formats. (Again, think |
144 | "C<char>" in the C language.) However, there is a new "C<U>" specifier |
145 | that will convert between UTF-8 characters and integers. (It works |
146 | outside of the utf8 pragma too.) |
147 | |
148 | =item * |
149 | |
150 | The C<chr()> and C<ord()> functions work on characters. This is like |
151 | C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and |
152 | C<unpack("C")>. In fact, the latter are how you now emulate |
153 | byte-oriented C<chr()> and C<ord()> under utf8. |
154 | |
155 | =item * |
156 | |
157 | And finally, C<scalar reverse()> reverses by character rather than by byte. |
158 | |
159 | =back |
160 | |
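Several of the effects listed above can be checked with a short script. This is a sketch under one loud assumption: it targets a modern Perl (5.8 or later), where these character semantics apply to any string holding such characters and no longer depend on the pragma, so the details may differ from the implementation documented here. The variable names are illustrative only, and the C<tr///CU> and C<\C> forms are left out because later Perls do not accept them.

```perl
use strict;
use warnings;

# \x{...} names a character by code point; length() counts characters.
my $smiley = "\x{263A}";
my $length = length($smiley);

# chr() and ord() work on characters, like pack("U") and unpack("U").
my $packed      = pack('U', 0x263A);
my ($unpacked)  = unpack('U', $smiley);
my $ordinal     = ord($smiley);

# \p{Lu} matches any uppercase letter, here U+00C9 (E with acute).
my $is_upper = ("\x{C9}" =~ /^\p{Lu}$/) ? 1 : 0;

# \X matches a base character together with its mark characters.
my $text     = "e\x{301}x";            # "e", COMBINING ACUTE ACCENT, "x"
my @clusters = $text =~ /(\X)/g;

# uc() gives uppercase, ucfirst() gives titlecase (U+01C6 is the DZ-caron digraph).
my $upper = uc("\x{1C6}");
my $title = ucfirst("\x{1C6}");

# scalar reverse() reverses by character, not by byte.
my $reversed = scalar reverse("a\x{263A}b");

# Print code points rather than raw characters to avoid wide-character warnings.
printf "length=%d ord=U+%04X unpack=U+%04X clusters=%d\n",
    $length, $ordinal, $unpacked, scalar(@clusters);
printf "uc=U+%04X ucfirst=U+%04X upper-match=%d\n",
    ord($upper), ord($title), $is_upper;
```

The results are reported as code points, so the script does not depend on the encoding of the terminal.
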
161 | =head1 CAVEATS |
162 | |
163 | As of yet, there is no method for automatically coercing input and |
164 | output to some encoding other than UTF-8. This is planned in the near |
165 | future, however. |
166 | |
167 | In any event, you'll need to keep track of whether interfaces to other |
168 | modules expect UTF-8 data or something else. The utf8 pragma does not |
169 | magically mark strings for you in order to remember their encoding, nor |
170 | will any automatic coercion happen (other than that eventually planned |
171 | for I/O). If you want such automatic coercion, you can build yourself |
172 | a set of pretty object-oriented modules. Expect it to run considerably |
173 | slower than this low-level support. |
174 | |
175 | Use of locales with utf8 may lead to odd results. Currently there is |
176 | some attempt to apply 8-bit locale info to characters in the range |
177 | 0..255, but this is demonstrably incorrect for locales that use |
178 | characters above that range (when mapped into Unicode). It will also |
179 | tend to run slower. Avoidance of locales is strongly encouraged. |
180 | |
181 | =cut |