Commit | Line | Data |
393fec97 |
1 | =head1 NAME |
2 | |
3 | perlunicode - Unicode support in Perl |
4 | |
5 | =head1 DESCRIPTION |
6 | |
0a1f2d14 |
7 | =head2 Important Caveats |
21bad921 |
8 | |
376d9008 |
9 | Unicode support is an extensive requirement. While Perl does not |
c349b1b9 |
10 | implement the Unicode standard or the accompanying technical reports |
11 | from cover to cover, Perl does support many Unicode features. |
21bad921 |
12 | |
2575c402 |
13 | People who want to learn to use Unicode in Perl should probably read |
e4911a48 |
14 | L<the Perl Unicode tutorial, perlunitut|perlunitut>, before reading |
15 | this reference document. |
2575c402 |
16 | |
13a2d996 |
17 | =over 4 |
21bad921 |
18 | |
fae2c0fb |
19 | =item Input and Output Layers |
21bad921 |
20 | |
376d9008 |
21 | Perl knows when a filehandle uses Perl's internal Unicode encodings |
1bfb14c4 |
22 | (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with |
23 | the ":utf8" layer. Other encodings can be converted to Perl's |
24 | encoding on input or from Perl's encoding on output by use of the |
25 | ":encoding(...)" layer. See L<open>. |
c349b1b9 |
26 | |
2575c402 |
27 | To indicate that Perl source itself is in UTF-8, use C<use utf8;>. |
21bad921 |
28 | |
29 | =item Regular Expressions |
30 | |
c349b1b9 |
31 | The regular expression compiler produces polymorphic opcodes. That is, |
376d9008 |
32 | the pattern adapts to the data and automatically switches to the Unicode |
2575c402 |
33 | character scheme when presented with data that is internally encoded in |
34 | UTF-8 -- or instead uses a traditional byte scheme when presented with |
35 | byte data. |
21bad921 |
36 | |
ad0029c4 |
37 | =item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts |
21bad921 |
38 | |
376d9008 |
39 | As a compatibility measure, the C<use utf8> pragma must be explicitly |
40 | included to enable recognition of UTF-8 in the Perl scripts themselves |
1bfb14c4 |
41 | (in string or regular expression literals, or in identifier names) on |
42 | ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based |
376d9008 |
43 | machines. B<These are the only times when an explicit C<use utf8> |
8f8cf39c |
44 | is needed.> See L<utf8>. |
21bad921 |
45 | |
7aa207d6 |
46 | =item BOM-marked scripts and UTF-16 scripts autodetected |
47 | |
48 | If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE, |
49 | or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either |
50 | endianness, Perl will correctly read in the script as Unicode. |
51 | (BOMless UTF-8 cannot be effectively recognized or differentiated from |
52 | ISO 8859-1 or other eight-bit encodings.) |
53 | |
990e18f7 |
54 | =item C<use encoding> needed to upgrade non-Latin-1 byte strings |
55 | |
38a44b82 |
56 | By default, there is a fundamental asymmetry in Perl's Unicode model: |
990e18f7 |
57 | implicit upgrading from byte strings to Unicode strings assumes that |
58 | they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are |
59 | downgraded with UTF-8 encoding. This happens because the first 256 |
51f494cc |
60 | codepoints in Unicode happen to agree with Latin-1. |
990e18f7 |
61 | |
990e18f7 |
62 | See L</"Byte and Character Semantics"> for more details. |
63 | |
21bad921 |
64 | =back |
65 | |
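As a minimal sketch of the input and output layers described in the list above, the following reads UTF-8 octets through an C<:encoding(UTF-8)> layer and writes the decoded character back out through the same kind of layer. In-memory filehandles stand in for real files here, and the variable names are illustrative only.

```perl
use strict;
use warnings;

# Three octets: the UTF-8 encoding of U+263A, followed by a newline.
my $octets = "\xE2\x98\xBA\n";

# Read through a decoding layer; an in-memory filehandle stands in for a file.
open(my $in, '<:encoding(UTF-8)', \$octets) or die "open failed: $!";
my $line = <$in>;
close($in);
chomp $line;

# Write through an encoding layer into another in-memory buffer.
my $encoded = '';
open(my $out, '>:encoding(UTF-8)', \$encoded) or die "open failed: $!";
print {$out} $line;
close($out);

printf "read %d character(s), U+%04X; wrote %d octet(s)\n",
    length($line), ord($line), length($encoded);
```

The decoded string holds one character, while the re-encoded buffer holds the original octets.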
376d9008 |
66 | =head2 Byte and Character Semantics |
393fec97 |
67 | |
376d9008 |
68 | Beginning with version 5.6, Perl uses logically-wide characters to |
3e4dbfed |
69 | represent strings internally. |
393fec97 |
70 | |
376d9008 |
71 | In future, Perl-level operations will be expected to work with |
72 | characters rather than bytes. |
393fec97 |
73 | |
376d9008 |
74 | However, as an interim compatibility measure, Perl aims to |
75daf61c |
75 | provide a safe migration path from byte semantics to character |
76 | semantics for programs. For operations where Perl can unambiguously |
376d9008 |
77 | decide that the input data are characters, Perl switches to |
75daf61c |
78 | character semantics. For operations where this determination cannot |
79 | be made without additional information from the user, Perl decides in |
376d9008 |
80 | favor of compatibility and chooses to use byte semantics. |
8cbd9a7a |
81 | |
51f494cc |
82 | Under byte semantics, when C<use locale> is in effect, Perl uses the |
e1b711da |
83 | semantics associated with the current locale. Absent a C<use locale>, and |
84 | absent a C<use feature 'unicode_strings'> pragma, Perl currently uses US-ASCII |
85 | (or Basic Latin in Unicode terminology) byte semantics, meaning that characters |
86 | whose ordinal numbers are in the range 128 - 255 are undefined except for their |
87 | ordinal numbers. This means that none have case (upper and lower), nor are any |
88 | a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong |
89 | to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.) |
2bbc8d55 |
90 | |
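The following sketch illustrates the default byte semantics described above for a character in the range 128 - 255. It assumes no C<use locale> is in effect, and it switches C<unicode_strings> off explicitly so that the legacy default applies even if the surrounding code enabled the feature. The variable names are illustrative.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make sure the legacy default applies here

my $byte_string = "\xE9";       # ordinal 233, stored as a single byte
my $char_string = $byte_string;
utf8::upgrade($char_string);    # same ordinal, now stored as UTF-8 internally

# Under byte semantics the character is in no class except the negated ones.
my $byte_is_word    = $byte_string =~ /\w/ ? 1 : 0;
my $byte_is_nonword = $byte_string =~ /\W/ ? 1 : 0;

# Once the string carries Unicode character data, \w recognizes the letter.
my $char_is_word    = $char_string =~ /\w/ ? 1 : 0;

print "byte string:      \\w=$byte_is_word \\W=$byte_is_nonword\n";
print "character string: \\w=$char_is_word\n";
```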
8cbd9a7a |
91 | This behavior preserves compatibility with earlier versions of Perl, |
376d9008 |
92 | which allowed byte semantics in Perl operations only if |
e1b711da |
93 | none of the program's inputs were marked as being a source of Unicode |
8cbd9a7a |
94 | character data. Such data may come from filehandles, from calls to |
95 | external programs, from information provided by the system (such as %ENV), |
21bad921 |
96 | or from literals and constants in the source text. |
8cbd9a7a |
97 | |
376d9008 |
98 | The C<bytes> pragma will always, regardless of platform, force byte |
99 | semantics in a particular lexical scope. See L<bytes>. |
8cbd9a7a |
100 | |
e1b711da |
101 | The C<use feature 'unicode_strings'> pragma is intended to always, regardless |
102 | of platform, force Unicode semantics in a particular lexical scope. In |
103 | release 5.12, it is partially implemented, applying only to case changes. |
104 | See L</The "Unicode Bug"> below. |
105 | |
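A small sketch of the case-change behavior that C<use feature 'unicode_strings'> controls: the same byte string is uppercased once in a scope without the feature and once in a scope with it. The scopes and variable names are illustrative.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";   # LATIN SMALL LETTER E WITH ACUTE as a byte string
my ($legacy, $unicode);

{
    no feature 'unicode_strings';
    $legacy = uc($e_acute);     # byte semantics: no case is defined
}
{
    use feature 'unicode_strings';
    $unicode = uc($e_acute);    # Unicode semantics: maps to U+00C9
}

printf "without feature: U+%04X, with feature: U+%04X\n",
    ord($legacy), ord($unicode);
```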
8cbd9a7a |
106 | The C<utf8> pragma is primarily a compatibility device that enables |
75daf61c |
107 | recognition of UTF-(8|EBCDIC) in literals encountered by the parser. |
376d9008 |
108 | Note that this pragma is only required while Perl defaults to byte |
109 | semantics; when character semantics become the default, this pragma |
110 | may become a no-op. See L<utf8>. |
111 | |
112 | Unless explicitly stated, Perl operators use character semantics |
113 | for Unicode data and byte semantics for non-Unicode data. |
114 | The decision to use character semantics is made transparently. If |
115 | input data comes from a Unicode source--for example, if a character |
fae2c0fb |
116 | encoding layer is added to a filehandle or a literal Unicode |
376d9008 |
117 | string constant appears in a program--character semantics apply. |
118 | Otherwise, byte semantics are in effect. The C<bytes> pragma should |
e1b711da |
119 | be used to force byte semantics on Unicode data, and the C<use feature |
120 | 'unicode_strings'> pragma to force Unicode semantics on byte data (though in |
121 | 5.12 it isn't fully implemented). |
376d9008 |
122 | |
123 | If strings operating under byte semantics and strings with Unicode |
51f494cc |
124 | character data are concatenated, the new string will have |
42bde815 |
125 | character semantics. This can cause surprises: see L</BUGS>, below. |
7dedd01f |
126 | |
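The kind of surprise meant here can be sketched as follows. A byte string is concatenated with a string holding a character above 255; the same character that did not match C<\w> on its own does match once it is part of the combined string. The example assumes no C<use locale> and turns C<unicode_strings> off explicitly; the variable names are illustrative.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # keep the legacy default for this sketch

my $bytes  = "\xE9";            # byte string
my $chars  = "\x{263A}";        # needs character semantics
my $joined = $bytes . $chars;   # result has character semantics

my $was_word = $bytes =~ /\w/ ? 1 : 0;

# The first character of the joined string is still ordinal 233 ...
my $first    = substr($joined, 0, 1);
# ... but it is now matched under character semantics.
my $now_word = $first =~ /\w/ ? 1 : 0;

printf "length=%d same_char=%d before=%d after=%d\n",
    length($joined), ($first eq $bytes ? 1 : 0), $was_word, $now_word;
```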
feda178f |
127 | Under character semantics, many operations that formerly operated on |
376d9008 |
128 | bytes now operate on characters. A character in Perl is |
feda178f |
129 | logically just a number ranging from 0 to 2**31 or so. Larger |
376d9008 |
130 | characters may encode into longer sequences of bytes internally, but |
131 | this internal detail is mostly hidden for Perl code. |
132 | See L<perluniintro> for more. |
393fec97 |
133 | |
376d9008 |
134 | =head2 Effects of Character Semantics |
393fec97 |
135 | |
136 | Character semantics have the following effects: |
137 | |
138 | =over 4 |
139 | |
140 | =item * |
141 | |
376d9008 |
142 | Strings--including hash keys--and regular expression patterns may |
574c8022 |
143 | contain characters that have an ordinal value larger than 255. |
393fec97 |
144 | |
2575c402 |
145 | If you use a Unicode editor to edit your program, Unicode characters may |
146 | occur directly within the literal strings in UTF-8 encoding, or UTF-16. |
147 | (The former requires a BOM or C<use utf8>, the latter requires a BOM.) |
3e4dbfed |
148 | |
2575c402 |
149 | Unicode characters can also be added to a string by using the C<\x{...}> |
150 | notation. The Unicode code for the desired character, in hexadecimal, |
151 | should be placed in the braces. For instance, a smiley face is |
2bbc8d55 |
152 | C<\x{263A}>. This encoding scheme works for all characters, but |
2575c402 |
153 | for characters under 0x100, note that Perl may use an 8 bit encoding |
154 | internally, for optimization and/or backward compatibility. |
3e4dbfed |
155 | |
156 | Additionally, if you |
574c8022 |
157 | |
3e4dbfed |
158 | use charnames ':full'; |
574c8022 |
159 | |
1bfb14c4 |
160 | you can use the C<\N{...}> notation and put the official Unicode |
161 | character name within the braces, such as C<\N{WHITE SMILING FACE}>. |
376d9008 |
162 | |
393fec97 |
163 | =item * |
164 | |
574c8022 |
165 | If an appropriate L<encoding> is specified, identifiers within the |
166 | Perl script may contain Unicode alphanumeric characters, including |
376d9008 |
167 | ideographs. Perl does not currently attempt to canonicalize variable |
168 | names. |
393fec97 |
169 | |
393fec97 |
170 | =item * |
171 | |
1bfb14c4 |
172 | Regular expressions match characters instead of bytes. "." matches |
2575c402 |
173 | a character instead of a byte. |
393fec97 |
174 | |
393fec97 |
175 | =item * |
176 | |
177 | Character classes in regular expressions match characters instead of |
376d9008 |
178 | bytes and match against the character properties specified in the |
1bfb14c4 |
179 | Unicode properties database. C<\w> can be used to match a Japanese |
75daf61c |
180 | ideograph, for instance. |
393fec97 |
181 | |
393fec97 |
182 | =item * |
183 | |
eb0cc9e3 |
184 | Named Unicode properties, scripts, and block ranges may be used like |
376d9008 |
185 | character classes via the C<\p{}> "matches property" construct and |
822502e5 |
186 | the C<\P{}> negation, "doesn't match property". |
2575c402 |
187 | See L</"Unicode Character Properties"> for more details. |
822502e5 |
188 | |
189 | You can define your own character properties and use them |
190 | in the regular expression with the C<\p{}> or C<\P{}> construct. |
822502e5 |
191 | See L</"User-Defined Character Properties"> for more details. |
192 | |
193 | =item * |
194 | |
51f494cc |
195 | The special pattern C<\X> matches a logical character, an C<extended grapheme |
196 | cluster> in Standardese. In Unicode what appears to the user to be a single |
197 | character, for example an accented C<G>, may in fact be composed of a sequence |
198 | of characters, in this case a C<G> followed by an accent character. C<\X> |
199 | will match the entire sequence. |
822502e5 |
200 | |
201 | =item * |
202 | |
203 | The C<tr///> operator translates characters instead of bytes. Note |
204 | that the C<tr///CU> functionality has been removed. For similar |
205 | functionality see pack('U0', ...) and pack('C0', ...). |
206 | |
207 | =item * |
208 | |
209 | Case translation operators use the Unicode case translation tables |
210 | when character input is provided. Note that C<uc()>, or C<\U> in |
211 | interpolated strings, translates to uppercase, while C<ucfirst>, |
212 | or C<\u> in interpolated strings, translates to titlecase in languages |
e1b711da |
213 | that make the distinction (which is equivalent to uppercase in languages |
214 | without the distinction). |
822502e5 |
215 | |
216 | =item * |
217 | |
218 | Most operators that deal with positions or lengths in a string will |
219 | automatically switch to using character positions, including |
220 | C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>, |
221 | C<sprintf()>, C<write()>, and C<length()>. An operator that |
51f494cc |
222 | specifically does not switch is C<vec()>. Operators that really don't |
223 | care include operators that treat strings as a bucket of bits such as |
822502e5 |
224 | C<sort()>, and operators dealing with filenames. |
225 | |
226 | =item * |
227 | |
51f494cc |
228 | The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often |
822502e5 |
229 | used for byte-oriented formats. Again, think C<char> in the C language. |
230 | |
231 | There is a new C<U> specifier that converts between Unicode characters |
232 | and code points. There is also a C<W> specifier that is the equivalent of |
233 | C<chr>/C<ord> and properly handles character values even if they are above 255. |
234 | |
235 | =item * |
236 | |
237 | The C<chr()> and C<ord()> functions work on characters, similar to |
238 | C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and |
239 | C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for |
240 | emulating byte-oriented C<chr()> and C<ord()> on Unicode strings. |
241 | While these methods reveal the internal encoding of Unicode strings, |
242 | that is not something one normally needs to care about at all. |
243 | |
244 | =item * |
245 | |
246 | The bit string operators, C<& | ^ ~>, can operate on character data. |
247 | However, for backward compatibility, such as when using bit string |
248 | operations when characters are all less than 256 in ordinal value, one |
249 | should not use C<~> (the bit complement) with characters of both |
250 | values less than 256 and values greater than 256. Most importantly, |
251 | DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) |
252 | will not hold. The reason for this mathematical I<faux pas> is that |
253 | the complement cannot return B<both> the 8-bit (byte-wide) bit |
254 | complement B<and> the full character-wide bit complement. |
255 | |
256 | =item * |
257 | |
e1b711da |
258 | You can define your own mappings to be used in lc(), |
822502e5 |
259 | lcfirst(), uc(), and ucfirst() (or their string-inlined versions). |
822502e5 |
260 | See L</"User-Defined Case Mappings"> for more details. |
261 | |
262 | =back |
263 | |
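Several of the effects listed above can be sketched in a few lines: building strings with C<\x{...}> and C<\N{...}>, character-based C<length()>, C<index()> and C<substr()>, the uppercase versus titlecase distinction, and the C<W> pack specifier. The sample strings are illustrative; U+01C6 is used because its uppercase and titlecase forms differ.

```perl
use strict;
use warnings;
use charnames ':full';

my $smiley = "\x{263A}";
my $named  = "\N{WHITE SMILING FACE}";      # same character, by name
my $text   = "caf\x{E9} $smiley";           # six characters in total

my $len  = length($text);                   # counts characters, not bytes
my $pos  = index($text, $smiley);           # character position
my $last = substr($text, -1);               # last character

my $dz    = "\x{01C6}";                     # LATIN SMALL LETTER DZ WITH CARON
my $upper = uc($dz);                        # uppercase form
my $title = ucfirst($dz);                   # titlecase form

my $packed = pack("W", 0x263A);             # like chr()
my $code   = unpack("W", $smiley);          # like ord()

printf "len=%d pos=%d upper=U+%04X title=U+%04X code=U+%04X\n",
    $len, $pos, ord($upper), ord($title), $code;
```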
264 | =over 4 |
265 | |
266 | =item * |
267 | |
268 | And finally, C<scalar reverse()> reverses by character rather than by byte. |
269 | |
270 | =back |
271 | |
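A short sketch of three more of the effects listed above: C<scalar reverse()> working by character, C<\X> matching an extended grapheme cluster, and C<tr///> translating characters. The sample strings are illustrative.

```perl
use strict;
use warnings;

my $word     = "a\x{263A}b";
my $reversed = scalar reverse($word);       # reversed character by character

# "G" followed by COMBINING ACUTE ACCENT is one cluster but two characters.
my $accented = "G\x{0301}";
my @clusters = "${accented}o" =~ /(\X)/g;

# tr/// maps whole characters, including those above 255.
(my $swapped = $word) =~ tr/\x{263A}/\x{2639}/;

printf "clusters=%d first_cluster_length=%d\n",
    scalar(@clusters), length($clusters[0]);
```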
272 | =head2 Unicode Character Properties |
273 | |
51f494cc |
274 | Most Unicode character properties are accessible by using regular expressions. |
275 | They are used like character classes via the C<\p{}> "matches property" |
276 | construct and the C<\P{}> negation, "doesn't match property". |
277 | |
278 | For instance, C<\p{Uppercase}> matches any character with the Unicode |
279 | "Uppercase" property, while C<\p{L}> matches any character with a |
280 | General_Category of "L" (letter) property. Braces are not |
281 | required for single letter properties, so C<\p{L}> is equivalent to C<\pL>. |
282 | |
e1b711da |
283 | More formally, C<\p{Uppercase}> matches any character whose Unicode Uppercase |
284 | property value is True, and C<\P{Uppercase}> matches any character whose |
285 | Uppercase property value is False, and they could have been written as |
51f494cc |
286 | C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively. |
287 | |
288 | This formality is needed when properties are not binary, that is if they can |
289 | take on more values than just True and False. For example, the Bidi_Class (see |
290 | L</"Bidirectional Character Types"> below), can take on a number of different |
291 | values, such as Left, Right, Whitespace, and others. To match these, one needs |
e1b711da |
292 | to specify the property name (Bidi_Class), and the value being matched against |
51f494cc |
293 | (Left, Right, etc.). This is done, as in the examples above, by having the two |
294 | components separated by an equal sign (or interchangeably, a colon), like |
295 | C<\p{Bidi_Class: Left}>. |
296 | |
297 | All Unicode-defined character properties may be written in these compound forms |
298 | of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some |
299 | additional properties that are written only in the single form, as well as |
300 | single-form short-cuts for all binary properties and certain others described |
301 | below, in which you may omit the property name and the equals or colon |
302 | separator. |
303 | |
304 | Most Unicode character properties have at least two synonyms (or aliases if you |
305 | prefer), a short one that is easier to type, and a longer one that is more |
306 | descriptive and hence easier to understand. Thus the "L" |
307 | and "Letter" above are equivalent and can be used interchangeably. Likewise, |
308 | "Upper" is a synonym for "Uppercase", and we could have written |
309 | C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically |
310 | various synonyms for the values the property can be. For binary properties, |
311 | "True" has 3 synonyms: "T", "Yes", and "Y"; and "False" has correspondingly "F", |
312 | "No", and "N". But be careful. A short form of a value for one property may |
e1b711da |
313 | not mean the same thing as the same short form for another. Thus, for the |
51f494cc |
314 | General_Category property, "L" means "Letter", but for the Bidi_Class property, |
315 | "L" means "Left". A complete list of properties and synonyms is in |
316 | L<perluniprops>. |
317 | |
318 | Upper/lower case differences in the property names and values are irrelevant, |
319 | thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>. |
320 | Similarly, you can add or subtract underscores anywhere in the middle of a |
321 | word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space |
322 | is irrelevant adjacent to non-word characters, such as the braces and the equals |
323 | or colon separators so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are |
324 | equivalent to these as well. In fact, in most cases, white space and even |
325 | hyphens can be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is |
326 | equivalent. All this is called "loose-matching" by Unicode. The few places |
327 | where stricter matching is employed are in the middle of numbers, and the Perl |
328 | extension properties that begin or end with an underscore. Stricter matching |
329 | cares about white space (except adjacent to the non-word characters) and |
330 | hyphens, and non-interior underscores. |
4193bef7 |
331 | |
376d9008 |
332 | You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret |
333 | (^) between the first brace and the property name: C<\p{^Tamil}> is |
eb0cc9e3 |
334 | equal to C<\P{Tamil}>. |
4193bef7 |
335 | |
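The single and compound forms, the case-insensitive matching of property names, and the caret negation described above can be sketched as follows. The C<has_prop> helper and the sample characters are illustrative; U+0B85 is TAMIL LETTER A.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the character matches the pattern, else 0.
sub has_prop { my ($char, $re) = @_; return $char =~ $re ? 1 : 0; }

my $tamil_a = "\x{0B85}";

my %result = (
    single_form    => has_prop("A", qr/\p{Uppercase}/),
    synonym        => has_prop("A", qr/\p{Upper}/),
    mixed_case     => has_prop("A", qr/\p{uPpEr}/),
    compound_true  => has_prop("A", qr/\p{Uppercase=True}/),
    compound_false => has_prop("a", qr/\p{Uppercase=False}/),
    big_p_negation => has_prop("a", qr/\P{Uppercase}/),
    no_braces      => has_prop("a", qr/\pL/),
    tamil          => has_prop($tamil_a, qr/\p{Tamil}/),
    caret_negation => has_prop("a", qr/\p{^Tamil}/),
    caret_on_tamil => has_prop($tamil_a, qr/\p{^Tamil}/),
);

print "$_: $result{$_}\n" for sort keys %result;
```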
51f494cc |
336 | =head3 B<General_Category> |
14bb0a9a |
337 | |
51f494cc |
338 | Every Unicode character is assigned a general category, which is the "most |
339 | usual categorization of a character" (from |
340 | L<http://www.unicode.org/reports/tr44>). |
822502e5 |
341 | |
51f494cc |
342 | The compound way of writing these is like C<\p{General_Category=Number}> |
343 | (short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up |
344 | through the equal or colon separator is omitted. So you can instead just write |
345 | C<\pN>. |
822502e5 |
346 | |
51f494cc |
347 | Here are the short and long forms of the General Category properties: |
393fec97 |
348 | |
d73e5302 |
349 | Short Long |
350 | |
351 | L Letter |
51f494cc |
352 | LC, L& Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}]) |
353 | Lu Uppercase_Letter |
354 | Ll Lowercase_Letter |
355 | Lt Titlecase_Letter |
356 | Lm Modifier_Letter |
357 | Lo Other_Letter |
d73e5302 |
358 | |
359 | M Mark |
51f494cc |
360 | Mn Nonspacing_Mark |
361 | Mc Spacing_Mark |
362 | Me Enclosing_Mark |
d73e5302 |
363 | |
364 | N Number |
51f494cc |
365 | Nd Decimal_Number (also Digit) |
366 | Nl Letter_Number |
367 | No Other_Number |
368 | |
369 | P Punctuation (also Punct) |
370 | Pc Connector_Punctuation |
371 | Pd Dash_Punctuation |
372 | Ps Open_Punctuation |
373 | Pe Close_Punctuation |
374 | Pi Initial_Punctuation |
d73e5302 |
375 | (may behave like Ps or Pe depending on usage) |
51f494cc |
376 | Pf Final_Punctuation |
d73e5302 |
377 | (may behave like Ps or Pe depending on usage) |
51f494cc |
378 | Po Other_Punctuation |
d73e5302 |
379 | |
380 | S Symbol |
51f494cc |
381 | Sm Math_Symbol |
382 | Sc Currency_Symbol |
383 | Sk Modifier_Symbol |
384 | So Other_Symbol |
d73e5302 |
385 | |
386 | Z Separator |
51f494cc |
387 | Zs Space_Separator |
388 | Zl Line_Separator |
389 | Zp Paragraph_Separator |
d73e5302 |
390 | |
391 | C Other |
51f494cc |
392 | Cc Control (also Cntrl) |
e150c829 |
393 | Cf Format |
eb0cc9e3 |
394 | Cs Surrogate (not usable) |
51f494cc |
395 | Co Private_Use |
e150c829 |
396 | Cn Unassigned |
1ac13f9a |
397 | |
376d9008 |
398 | Single-letter properties match all characters in any of the |
3e4dbfed |
399 | two-letter sub-properties starting with the same letter. |
12ac2576 |
400 | C<LC> and C<L&> are special cases, which are aliases for the set of |
401 | C<Ll>, C<Lu>, and C<Lt>. |
32293815 |
402 | |
eb0cc9e3 |
403 | Because Perl hides the need for the user to understand the internal |
1bfb14c4 |
404 | representation of Unicode characters, there is no need to implement |
405 | the somewhat messy concept of surrogates. C<Cs> is therefore not |
eb0cc9e3 |
406 | supported. |
d73e5302 |
407 | |
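As a brief sketch of the General_Category forms in the table above, the following checks a few characters against one-letter, two-letter, and compound property names. The C<has_prop> helper and the choice of sample characters are illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the character matches the pattern, else 0.
sub has_prop { my ($char, $re) = @_; return $char =~ $re ? 1 : 0; }

my $titlecase = "\x{01C5}";   # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
my $ideograph = "\x{4E2D}";   # a CJK ideograph (Other_Letter)
my $roman_one = "\x{2160}";   # ROMAN NUMERAL ONE (Letter_Number)

my %result = (
    upper_is_Lu        => has_prop("A", qr/\p{Lu}/),
    title_is_Lt        => has_prop($titlecase, qr/\p{Lt}/),
    title_is_LC        => has_prop($titlecase, qr/\p{LC}/),
    ideograph_is_L     => has_prop($ideograph, qr/\p{L}/),
    ideograph_is_LC    => has_prop($ideograph, qr/\p{LC}/),
    digit_is_N         => has_prop("5", qr/\pN/),
    digit_is_Nd        => has_prop("5", qr/\p{Nd}/),
    digit_compound     => has_prop("5", qr/\p{General_Category=Number}/),
    digit_short        => has_prop("5", qr/\p{gc:n}/),
    roman_is_Nl        => has_prop($roman_one, qr/\p{Nl}/),
    roman_is_Nd        => has_prop($roman_one, qr/\p{Nd}/),
);

print "$_: $result{$_}\n" for sort keys %result;
```

A single-letter property such as C<L> covers all of its two-letter sub-properties, while C<LC> covers only the cased letters, so the ideograph matches the former but not the latter.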
51f494cc |
408 | =head3 B<Bidirectional Character Types> |
822502e5 |
409 | |
376d9008 |
410 | Because scripts differ in their directionality--Hebrew is |
12ac2576 |
411 | written right to left, for example--Unicode supplies these properties in |
51f494cc |
412 | the Bidi_Class class: |
32293815 |
413 | |
eb0cc9e3 |
414 | Property Meaning |
92e830a9 |
415 | |
12ac2576 |
416 | L Left-to-Right |
417 | LRE Left-to-Right Embedding |
418 | LRO Left-to-Right Override |
419 | R Right-to-Left |
51f494cc |
420 | AL Arabic Letter |
12ac2576 |
421 | RLE Right-to-Left Embedding |
422 | RLO Right-to-Left Override |
423 | PDF Pop Directional Format |
424 | EN European Number |
51f494cc |
425 | ES European Separator |
426 | ET European Terminator |
12ac2576 |
427 | AN Arabic Number |
51f494cc |
428 | CS Common Separator |
12ac2576 |
429 | NSM Non-Spacing Mark |
430 | BN Boundary Neutral |
431 | B Paragraph Separator |
432 | S Segment Separator |
433 | WS Whitespace |
434 | ON Other Neutrals |
435 | |
51f494cc |
436 | This property is always written in the compound form. |
437 | For example, C<\p{Bidi_Class:R}> matches characters that are normally |
eb0cc9e3 |
438 | written right to left. |
439 | |
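A minimal sketch of matching on Bidi_Class, using only the short value names from the table above. The C<has_prop> helper and the sample characters (HEBREW LETTER ALEF and ARABIC LETTER ALEF) are illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the character matches the pattern, else 0.
sub has_prop { my ($char, $re) = @_; return $char =~ $re ? 1 : 0; }

my $hebrew_alef = "\x{05D0}";
my $arabic_alef = "\x{0627}";

my %result = (
    latin_is_L      => has_prop("a", qr/\p{Bidi_Class:L}/),
    hebrew_is_R     => has_prop($hebrew_alef, qr/\p{Bidi_Class:R}/),
    hebrew_is_L     => has_prop($hebrew_alef, qr/\p{Bidi_Class:L}/),
    arabic_is_AL    => has_prop($arabic_alef, qr/\p{Bidi_Class:AL}/),
    digit_is_EN     => has_prop("7", qr/\p{Bidi_Class:EN}/),
    space_is_WS     => has_prop(" ", qr/\p{Bidi_Class:WS}/),
);

print "$_: $result{$_}\n" for sort keys %result;
```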
51f494cc |
440 | =head3 B<Scripts> |
441 | |
e1b711da |
442 | The world's languages are written in a number of scripts. This sentence |
443 | (unless you're reading it in translation) is written in Latin, while Russian is |
444 | written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in |
445 | Hiragana or Katakana. There are many more. |
51f494cc |
446 | |
447 | The Unicode Script property gives what script a given character is in, |
448 | and can be matched with the compound form like C<\p{Script=Hebrew}> (short: |
449 | C<\p{sc=hebr}>). Perl furnishes shortcuts for all script names. You can omit |
450 | everything up through the equals (or colon), and simply write C<\p{Latin}> or |
451 | C<\P{Cyrillic}>. |
452 | |
453 | A complete list of scripts and their shortcuts is in L<perluniprops>. |
454 | |
455 | =head3 B<Extended property classes> |
456 | |
457 | There are many more property classes than the basic ones described here, |
458 | including some Perl extensions. |
459 | A complete list is in L<perluniprops>. |
460 | The extensions are more fully described in L<perlrecharclass>. |
461 | |
462 | =head3 B<Use of "Is" Prefix> |
822502e5 |
463 | |
1bfb14c4 |
464 | For backward compatibility (with Perl 5.6), all properties mentioned |
51f494cc |
465 | so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for |
466 | example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to |
467 | C<\p{Arabic}>. |
eb0cc9e3 |
468 | |
51f494cc |
469 | =head3 B<Blocks> |
2796c109 |
470 | |
1bfb14c4 |
471 | In addition to B<scripts>, Unicode also defines B<blocks> of |
472 | characters. The difference between scripts and blocks is that the |
473 | concept of scripts is closer to natural languages, while the concept |
51f494cc |
474 | of blocks is more of an artificial grouping based on groups of Unicode |
475 | characters with consecutive ordinal values. For example, the C<Basic Latin> |
476 | block is all characters whose ordinals are between 0 and 127, inclusive, in |
477 | other words, the ASCII characters. The C<Latin> script contains some letters |
478 | from this block as well as several more, like C<Latin-1 Supplement>, |
479 | C<Latin Extended-A>, I<etc.>, but it does not contain all the characters from |
480 | those blocks. It does not, for example, contain digits, because digits are |
481 | shared across many scripts. Digits and similar groups, like punctuation, are in |
482 | the script called C<Common>. There is also a script called C<Inherited> for |
483 | characters that modify other characters, and inherit the script value of the |
484 | controlling character. |
485 | |
486 | For more about scripts versus blocks, see UAX#24 "Unicode Script Property": |
487 | L<http://www.unicode.org/reports/tr24> |
488 | |
489 | The Script property is likely to be the one you want to use when processing |
490 | natural language; the Block property may be useful in working with the nuts and |
491 | bolts of Unicode. |
492 | |
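The difference between scripts and blocks can be sketched with the examples mentioned above: an ASCII digit lies in the C<Basic Latin> block but belongs to the C<Common> script, and a combining accent belongs to C<Inherited>. The C<has_prop> helper and variable names are illustrative, and the compound forms are used throughout to avoid any shortcut ambiguity.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the character matches the pattern, else 0.
sub has_prop { my ($char, $re) = @_; return $char =~ $re ? 1 : 0; }

my $digit     = "1";
my $e_acute   = "\x{E9}";     # LATIN SMALL LETTER E WITH ACUTE
my $combining = "\x{0301}";   # COMBINING ACUTE ACCENT

my %result = (
    digit_in_basic_latin_block => has_prop($digit,     qr/\p{Block=Basic_Latin}/),
    digit_in_latin_script      => has_prop($digit,     qr/\p{Script=Latin}/),
    digit_in_common_script     => has_prop($digit,     qr/\p{Script=Common}/),
    e_acute_in_latin1_block    => has_prop($e_acute,   qr/\p{Block=Latin_1_Supplement}/),
    e_acute_in_latin_script    => has_prop($e_acute,   qr/\p{Script=Latin}/),
    accent_in_inherited_script => has_prop($combining, qr/\p{Script=Inherited}/),
);

print "$_: $result{$_}\n" for sort keys %result;
```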
493 | Block names are matched in the compound form, like C<\p{Block: Arrows}> or |
494 | C<\p{Blk=Hebrew}>. Unlike most other properties only a few block names have a |
495 | Unicode-defined short name. But Perl does provide a (slight) shortcut: You |
496 | can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards |
497 | compatibility, the C<In> prefix may be omitted if there is no naming conflict |
498 | with a script or any other property, and you can even use an C<Is> prefix |
499 | instead in those cases. But it is not a good idea to do this, for a couple |
500 | of reasons: |
501 | |
502 | =over 4 |
503 | |
504 | =item 1 |
505 | |
506 | It is confusing. There are many naming conflicts, and you may forget some. |
507 | For example, \p{Hebrew} means the I<script> Hebrew, and NOT the I<block> |
508 | Hebrew. But would you remember that 6 months from now? |
509 | |
510 | =item 2 |
511 | |
512 | It is unstable. A new version of Unicode may pre-empt the current meaning by |
513 | creating a property with the same name. There was a time in very early Unicode |
514 | releases when \p{Hebrew} would have matched the I<block> Hebrew; now it |
515 | doesn't. |
32293815 |
516 | |
393fec97 |
517 | =back |
518 | |
51f494cc |
519 | Some people just prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}> |
520 | instead of the shortcuts, for clarity, and because they can't remember the |
521 | difference between 'In' and 'Is' anyway (or aren't confident that those who |
522 | eventually will read their code will know). |
523 | |
524 | A complete list of blocks and their shortcuts is in L<perluniprops>. |
525 | |
376d9008 |
526 | =head2 User-Defined Character Properties |
491fd90a |
527 | |
51f494cc |
528 | You can define your own binary character properties by defining subroutines |
529 | whose names begin with "In" or "Is". The subroutines can be defined in any |
530 | package. The user-defined properties can be used in the regular expression |
531 | C<\p> and C<\P> constructs; if you are using a user-defined property from a |
532 | package other than the one you are in, you must specify its package in the |
533 | C<\p> or C<\P> construct. |
bac0b425 |
534 | |
51f494cc |
535 | # assuming property IsForeign defined in Lang:: |
bac0b425 |
536 | package main; # property package name required |
537 | if ($txt =~ /\p{Lang::IsForeign}+/) { ... } |
538 | |
539 | package Lang; # property package name not required |
540 | if ($txt =~ /\p{IsForeign}+/) { ... } |
541 | |
542 | |
543 | Note that the effect is compile-time and immutable once defined. |
491fd90a |
544 | |
376d9008 |
545 | The subroutines must return a specially-formatted string, with one |
546 | or more newline-separated lines. Each line must be one of the following: |
491fd90a |
547 | |
548 | =over 4 |
549 | |
550 | =item * |
551 | |
510254c9 |
552 | A single hexadecimal number denoting a Unicode code point to include. |
553 | |
554 | =item * |
555 | |
99a6b1f0 |
556 | Two hexadecimal numbers separated by horizontal whitespace (space or |
376d9008 |
557 | tab characters) denoting a range of Unicode code points to include. |
491fd90a |
558 | |
559 | =item * |
560 | |
376d9008 |
561 | Something to include, prefixed by "+": a built-in character |
bac0b425 |
562 | property (prefixed by "utf8::") or a user-defined character property, |
563 | to represent all the characters in that property; two hexadecimal code |
564 | points for a range; or a single hexadecimal code point. |
491fd90a |
565 | |
566 | =item * |
567 | |
376d9008 |
568 | Something to exclude, prefixed by "-": an existing character |
bac0b425 |
569 | property (prefixed by "utf8::") or a user-defined character property, |
570 | to represent all the characters in that property; two hexadecimal code |
571 | points for a range; or a single hexadecimal code point. |
491fd90a |
572 | |
573 | =item * |
574 | |
376d9008 |
575 | Something to negate, prefixed by "!": an existing character |
bac0b425 |
576 | property (prefixed by "utf8::") or a user-defined character property, |
577 | to represent all the characters in that property; two hexadecimal code |
578 | points for a range; or a single hexadecimal code point. |
579 | |
580 | =item * |
581 | |
582 | Something to intersect with, prefixed by "&": an existing character |
583 | property (prefixed by "utf8::") or a user-defined character property, |
584 | to keep only the characters that are also in that property; two |
585 | hexadecimal code points for a range; or a single hexadecimal code point. |
491fd90a |
586 | |
587 | =back |
588 | |
589 | For example, to define a property that covers both the Japanese |
590 | syllabaries (hiragana and katakana), you can define |
591 | |
592 | sub InKana { |
d5822f25 |
593 | return <<END; |
594 | 3040\t309F |
595 | 30A0\t30FF |
491fd90a |
596 | END |
597 | } |
598 | |
d5822f25 |
599 | Imagine that the here-doc end marker is at the beginning of the line. |
600 | Now you can use C<\p{InKana}> and C<\P{InKana}>. |
491fd90a |
601 | |
602 | You could also have used the existing block property names: |
603 | |
604 | sub InKana { |
605 | return <<'END'; |
606 | +utf8::InHiragana |
607 | +utf8::InKatakana |
608 | END |
609 | } |
610 | |
611 | Suppose you wanted to match only the allocated characters, |
d5822f25 |
612 | not the raw block ranges: in other words, you want to remove |
491fd90a |
613 | the non-characters: |
614 | |
615 | sub InKana { |
616 | return <<'END'; |
617 | +utf8::InHiragana |
618 | +utf8::InKatakana |
619 | -utf8::IsCn |
620 | END |
621 | } |
622 | |
623 | The negation is useful for defining (surprise!) negated classes. |
624 | |
625 | sub InNotKana { |
626 | return <<'END'; |
627 | !utf8::InHiragana |
628 | -utf8::InKatakana |
629 | +utf8::IsCn |
630 | END |
631 | } |
632 | |
bac0b425 |
633 | Intersection is useful for getting the common characters matched by |
634 | two (or more) classes. |
635 | |
636 | sub InFooAndBar { |
637 | return <<'END'; |
638 | +main::Foo |
639 | &main::Bar |
640 | END |
641 | } |
642 | |
643 | It's important to remember not to use "&" for the first set -- that |
644 | would be intersecting with nothing (resulting in an empty set). |
645 | |
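As a minimal sketch of how such a property is used once defined, the
following defines the range-based C<InKana> from above (written with
ordinary string concatenation instead of a here-document) and matches
three sample characters against it. The variable names and the sample
characters are chosen for illustration only.

```perl
use strict;
use warnings;

# User-defined property: two tab-separated hexadecimal ranges,
# one per line, covering the Hiragana and Katakana blocks.
sub InKana {
    return "3040\t309F\n"
         . "30A0\t30FF\n";
}

my $hiragana_a  = "\x{3042}";   # HIRAGANA LETTER A
my $katakana_ka = "\x{30AB}";   # KATAKANA LETTER KA
my $latin_a     = "A";

# 1 where the character is in InKana, 0 otherwise.
my @in_kana = map { /\p{InKana}/ ? 1 : 0 }
              ($hiragana_a, $katakana_ka, $latin_a);

# \P{InKana} is the complement of the same property.
my $latin_is_not_kana = ($latin_a =~ /\P{InKana}/) ? 1 : 0;

print "@in_kana $latin_is_not_kana\n";
```

The property subroutine is looked up by name in the current package, so
it has to be visible from wherever the pattern is used.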
822502e5 |
646 | =head2 User-Defined Case Mappings |
647 | |
3a2263fe |
648 | You can also define your own mappings to be used in the lc(), |
649 | lcfirst(), uc(), and ucfirst() functions (or their string-inlined versions). |
822502e5 |
650 | The principle is similar to that of user-defined character |
51f494cc |
651 | properties: define subroutines |
3a2263fe |
652 | with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for |
653 | the first character in ucfirst()), and C<ToUpper> (for uc(), and the |
654 | rest of the characters in ucfirst()). |
655 | |
51f494cc |
656 | The string returned by the subroutines needs to be two hexadecimal numbers |
e1b711da |
657 | separated by two tabulators: the two numbers being, respectively, the source |
658 | code point and the destination code point. For example: |
3a2263fe |
659 | |
660 | sub ToUpper { |
661 | return <<END; |
51f494cc |
662 | 0061\t\t0041 |
3a2263fe |
663 | END |
664 | } |
665 | |
51f494cc |
666 | defines a uc() mapping that causes only the character "a" |
667 | to be mapped to "A"; all other characters will remain unchanged. |
3a2263fe |
668 | |
51f494cc |
669 | (For serious hackers only) The above means you have to furnish a complete |
670 | mapping; you can't just override a couple of characters and leave the rest |
671 | unchanged. You can find all the mappings in the directory |
672 | C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as the |
673 | here-document, and the C<utf8::ToSpecFoo> are special exception mappings |
674 | derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>. The C<Digit> and |
675 | C<Fold> mappings that one can see in the directory are not directly |
676 | user-accessible; instead, one can use either the C<Unicode::UCD> module or just match |
677 | case-insensitively (that's when the C<Fold> mapping is used). |
3a2263fe |
678 | |
51f494cc |
679 | The mappings will only take effect on scalars that have been marked as having |
680 | Unicode characters, for example by using C<utf8::upgrade()>. |
681 | Old byte-style strings are not affected. |
3a2263fe |
682 | |
51f494cc |
683 | The mappings are in effect for the package they are defined in. |
3a2263fe |
684 | |
376d9008 |
685 | =head2 Character Encodings for Input and Output |
8cbd9a7a |
686 | |
7221edc9 |
687 | See L<Encode>. |
8cbd9a7a |
688 | |
c29a771d |
689 | =head2 Unicode Regular Expression Support Level |
776f8809 |
690 | |
376d9008 |
691 | The following list of Unicode support for regular expressions describes |
692 | all the features currently supported. The references to "Level N" |
8158862b |
693 | and the section numbers refer to the Unicode Technical Standard #18, |
694 | "Unicode Regular Expressions", version 11, in May 2005. |
776f8809 |
695 | |
696 | =over 4 |
697 | |
698 | =item * |
699 | |
700 | Level 1 - Basic Unicode Support |
701 | |
8158862b |
702 | RL1.1 Hex Notation - done [1] |
703 | RL1.2 Properties - done [2][3] |
704 | RL1.2a Compatibility Properties - done [4] |
705 | RL1.3 Subtraction and Intersection - MISSING [5] |
706 | RL1.4 Simple Word Boundaries - done [6] |
707 | RL1.5 Simple Loose Matches - done [7] |
708 | RL1.6 Line Boundaries - MISSING [8] |
709 | RL1.7 Supplementary Code Points - done [9] |
710 | |
711 | [1] \x{...} |
712 | [2] \p{...} \P{...} |
e1b711da |
713 | [3] supports not only minimal list, but all Unicode character |
714 | properties (see L</Unicode Character Properties>) |
8158862b |
715 | [4] \d \D \s \S \w \W \X [:prop:] [:^prop:] |
716 | [5] can use regular expression look-ahead [a] or |
717 | user-defined character properties [b] to emulate set operations |
718 | [6] \b \B |
e1b711da |
719 | [7] note that Perl does Full case-folding in matching (but with bugs), |
720 | not Simple: for example U+1F88 is equivalent to U+1F00 U+03B9, |
2bbc8d55 |
721 | not to U+1F80. This difference matters mainly for certain Greek |
376d9008 |
722 | capital letters with certain modifiers: the Full case-folding |
723 | decomposes the letter, while the Simple case-folding would map |
e0f9d4a8 |
724 | it to a single character. |
8158862b |
725 | [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r), |
726 | CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029); |
727 | should also affect <>, $., and script line numbers; |
728 | should not split lines within CRLF [c] (i.e. there is no empty |
729 | line between \r and \n) |
730 | [9] UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to U+10FFFF |
731 | but also beyond U+10FFFF [d] |
7207e29d |
732 | |
237bad5b |
733 | [a] You can mimic class subtraction using lookahead. |
8158862b |
734 | For example, what UTS#18 might write as |
29bdacb8 |
735 | |
dbe420b4 |
736 | [{Greek}-[{UNASSIGNED}]] |
737 | |
738 | in Perl can be written as: |
739 | |
1d81abf3 |
740 | (?!\p{Unassigned})\p{InGreekAndCoptic} |
741 | (?=\p{Assigned})\p{InGreekAndCoptic} |
dbe420b4 |
742 | |
743 | But in this particular example, you probably really want |
744 | |
1bfb14c4 |
745 | \p{Greek} |
dbe420b4 |
746 | |
747 | which will match assigned characters known to be part of the Greek script. |
29bdacb8 |
748 | |
5ca1ac52 |
749 | Also see the Unicode::Regex::Set module; it implements the full |
8158862b |
750 | UTS#18 grouping, intersection, union, and removal (subtraction) syntax. |
751 | |
752 | [b] '+' for union, '-' for removal (set-difference), '&' for intersection |
753 | (see L</"User-Defined Character Properties">) |
754 | |
755 | [c] Try the C<:crlf> layer (see L<PerlIO>). |
5ca1ac52 |
756 | |
c670e63a |
757 | [d] U+FFFF will currently generate a warning message if 'utf8' warnings are |
758 | enabled |
237bad5b |
759 | |
776f8809 |
760 | =item * |
761 | |
762 | Level 2 - Extended Unicode Support |
763 | |
8158862b |
764 | RL2.1 Canonical Equivalents - MISSING [10][11] |
c670e63a |
765 | RL2.2 Default Grapheme Clusters - MISSING [12] |
8158862b |
766 | RL2.3 Default Word Boundaries - MISSING [14] |
767 | RL2.4 Default Loose Matches - MISSING [15] |
768 | RL2.5 Name Properties - MISSING [16] |
769 | RL2.6 Wildcard Properties - MISSING |
770 | |
771 | [10] see UAX#15 "Unicode Normalization Forms" |
772 | [11] have Unicode::Normalize but not integrated to regexes |
e1b711da |
773 | [12] have \X but we don't have a "Grapheme Cluster Mode" |
8158862b |
774 | [14] see UAX#29, Word Boundaries |
775 | [15] see UAX#21 "Case Mappings" |
776 | [16] have \N{...} but neither compute names of CJK Ideographs |
777 | and Hangul Syllables nor use a loose match [e] |
778 | |
779 | [e] C<\N{...}> allows namespaces (see L<charnames>). |
776f8809 |
780 | |
781 | =item * |
782 | |
8158862b |
783 | Level 3 - Tailored Support |
784 | |
785 | RL3.1 Tailored Punctuation - MISSING |
786 | RL3.2 Tailored Grapheme Clusters - MISSING [17][18] |
787 | RL3.3 Tailored Word Boundaries - MISSING |
788 | RL3.4 Tailored Loose Matches - MISSING |
789 | RL3.5 Tailored Ranges - MISSING |
790 | RL3.6 Context Matching - MISSING [19] |
791 | RL3.7 Incremental Matches - MISSING |
792 | ( RL3.8 Unicode Set Sharing ) |
793 | RL3.9 Possible Match Sets - MISSING |
794 | RL3.10 Folded Matching - MISSING [20] |
795 | RL3.11 Submatchers - MISSING |
796 | |
797 | [17] see UTS#10 "Unicode Collation Algorithm" |
798 | [18] have Unicode::Collate but not integrated to regexes |
799 | [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see |
800 | outside of the target substring |
801 | [20] need insensitive matching for linguistic features other than case; |
802 | for example, hiragana to katakana, wide and narrow, simplified Han |
803 | to traditional Han (see UTR#30 "Character Foldings") |
776f8809 |
804 | |
805 | =back |
806 | |
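The look-ahead emulation of set subtraction mentioned in note [a] can be
tried out with a short sketch like the one below. It spells the block as
C<\p{Block=Greek_And_Coptic}>, assumed here to be the explicit form of the
C<\p{InGreekAndCoptic}> used above, and uses U+03A2 as a sample code point
that lies in the block but has no character assigned to it.

```perl
use strict;
use warnings;

# "Greek and Coptic block minus unassigned code points", emulated with
# a negative look-ahead in front of the block property.
my $block          = qr/\p{Block=Greek_And_Coptic}/;
my $assigned_greek = qr/(?!\p{Unassigned})\p{Block=Greek_And_Coptic}/;

my $alpha    = "\x{3B1}";   # GREEK SMALL LETTER ALPHA (assigned)
my $reserved = "\x{3A2}";   # in the block, but unassigned

my $alpha_in_block    = ($alpha    =~ $block)          ? 1 : 0;
my $reserved_in_block = ($reserved =~ $block)          ? 1 : 0;
my $alpha_kept        = ($alpha    =~ $assigned_greek) ? 1 : 0;
my $reserved_kept     = ($reserved =~ $assigned_greek) ? 1 : 0;

print "block: $alpha_in_block $reserved_in_block\n";
print "block minus unassigned: $alpha_kept $reserved_kept\n";
```

Both code points are in the block, but only the assigned one survives the
look-ahead.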
c349b1b9 |
807 | =head2 Unicode Encodings |
808 | |
376d9008 |
809 | Unicode characters are assigned to I<code points>, which are abstract |
810 | numbers. To use these numbers, various encodings are needed. |
c349b1b9 |
811 | |
812 | =over 4 |
813 | |
c29a771d |
814 | =item * |
5cb3728c |
815 | |
816 | UTF-8 |
c349b1b9 |
817 | |
3e4dbfed |
818 | UTF-8 is a variable-length (1 to 6 bytes, current character allocations |
376d9008 |
819 | require at most 4 bytes), byte-order independent encoding. For ASCII (and we |
820 | really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is |
821 | transparent. |
c349b1b9 |
822 | |
8c007b5a |
823 | The following table is from Unicode 3.2. |
05632f9a |
824 | |
e1b711da |
825 |  Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte |
05632f9a |
826 | |
e1b711da |
827 |    U+0000..U+007F       00..7F |
828 |    U+0080..U+07FF     * C2..DF    80..BF |
829 |    U+0800..U+0FFF       E0      * A0..BF    80..BF |
ec90690f |
830 |    U+1000..U+CFFF       E1..EC    80..BF    80..BF |
831 |    U+D000..U+D7FF       ED        80..9F    80..BF |
e1b711da |
832 |    U+D800..U+DFFF       +++++++ utf16 surrogates, not legal utf8 +++++++ |
ec90690f |
833 |    U+E000..U+FFFF       EE..EF    80..BF    80..BF |
e1b711da |
834 |   U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF |
835 |   U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF |
836 |  U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF |
837 | |
838 | Note the gaps before several of the byte entries above marked by '*'. These are |
839 | caused by legal UTF-8 avoiding non-shortest encodings: it is technically |
840 | possible to UTF-8-encode a single code point in different ways, but that is |
841 | explicitly forbidden, and the shortest possible encoding should always be used |
842 | (and that is what Perl does). |
37361303 |
843 | |
376d9008 |
844 | Another way to look at it is via bits: |
05632f9a |
845 | |
846 |  Code Points                    1st Byte  2nd Byte  3rd Byte  4th Byte |
847 | |
848 |                     0aaaaaaa    0aaaaaaa |
849 |             00000bbbbbaaaaaa    110bbbbb  10aaaaaa |
850 |             ccccbbbbbbaaaaaa    1110cccc  10bbbbbb  10aaaaaa |
851 |   00000dddccccccbbbbbbaaaaaa    11110ddd  10cccccc  10bbbbbb  10aaaaaa |
852 | |
853 | As you can see, the continuation bytes all begin with C<10>, and the |
e1b711da |
854 | leading bits of the start byte tell how many bytes there are in the |
05632f9a |
855 | encoded character. |
856 | |
c29a771d |
857 | =item * |
5cb3728c |
858 | |
859 | UTF-EBCDIC |
dbe420b4 |
860 | |
376d9008 |
861 | Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe. |
dbe420b4 |
862 | |
c29a771d |
863 | =item * |
5cb3728c |
864 | |
1e54db1a |
865 | UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks) |
c349b1b9 |
866 | |
1bfb14c4 |
867 | The following items are mostly for reference and general Unicode |
868 | knowledge; Perl doesn't use these constructs internally. |
dbe420b4 |
869 | |
c349b1b9 |
870 | UTF-16 is a 2 or 4 byte encoding. The Unicode code points |
1bfb14c4 |
871 | C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code |
872 | points C<U+10000..U+10FFFF> in two 16-bit units. The latter case |
c349b1b9 |
873 | uses I<surrogates>, the first 16-bit unit being the I<high |
874 | surrogate>, and the second being the I<low surrogate>. |
875 | |
376d9008 |
876 | Surrogates are code points set aside to encode the C<U+10000..U+10FFFF> |
c349b1b9 |
877 | range of Unicode code points in pairs of 16-bit units. The I<high |
376d9008 |
878 | surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates> |
879 | are the range C<U+DC00..U+DFFF>. The surrogate encoding is |
c349b1b9 |
880 | |
881 | $hi = int(($uni - 0x10000) / 0x400) + 0xD800; |
882 | $lo = ($uni - 0x10000) % 0x400 + 0xDC00; |
883 | |
884 | and the decoding is |
885 | |
1a3fa709 |
886 | $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00); |
c349b1b9 |
887 | |
feda178f |
888 | If you try to generate surrogates (for example by using chr()), you |
e1b711da |
889 | will get a warning, if warnings are turned on, because those code |
376d9008 |
890 | points are not valid for a Unicode character. |
9466bab6 |
891 | |
376d9008 |
892 | Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16 |
c349b1b9 |
893 | itself can be used for in-memory computations, but if storage or |
376d9008 |
894 | transfer is required either UTF-16BE (big-endian) or UTF-16LE |
895 | (little-endian) encodings must be chosen. |
c349b1b9 |
896 | |
897 | This introduces another problem: what if you just know that your data |
376d9008 |
898 | is UTF-16, but you don't know which endianness? Byte Order Marks, or |
899 | BOMs, are a solution to this. A special character has been reserved |
86bbd6d1 |
900 | in Unicode to function as a byte order marker: the character with the |
376d9008 |
901 | code point C<U+FEFF> is the BOM. |
042da322 |
902 | |
c349b1b9 |
903 | The trick is that if you read a BOM, you will know the byte order, |
376d9008 |
904 | since if it was written on a big-endian platform, you will read the |
905 | bytes C<0xFE 0xFF>, but if it was written on a little-endian platform, |
906 | you will read the bytes C<0xFF 0xFE>. (And if the originating platform |
907 | was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.) |
042da322 |
908 | |
86bbd6d1 |
909 | The way this trick works is that the character with the code point |
376d9008 |
910 | C<U+FFFE> is guaranteed not to be a valid Unicode character, so the |
911 | sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in |
1bfb14c4 |
912 | little-endian format" and cannot be "C<U+FFFE>, represented in big-endian |
e1b711da |
913 | format". (Actually, C<U+FFFE> is legal for use by your program, even for |
914 | input/output, but better not use it if you need a BOM. But it is "illegal for |
915 | interchange", so that an unsuspecting program won't get confused.) |
c349b1b9 |
916 | |
c29a771d |
917 | =item * |
5cb3728c |
918 | |
1e54db1a |
919 | UTF-32, UTF-32BE, UTF-32LE |
c349b1b9 |
920 | |
921 | The UTF-32 family is pretty much like the UTF-16 family, except that |
042da322 |
922 | the units are 32-bit, and therefore the surrogate scheme is not |
376d9008 |
923 | needed. The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and |
924 | C<0xFF 0xFE 0x00 0x00> for LE. |
c349b1b9 |
925 | |
c29a771d |
926 | =item * |
5cb3728c |
927 | |
928 | UCS-2, UCS-4 |
c349b1b9 |
929 | |
86bbd6d1 |
930 | Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit |
376d9008 |
931 | encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>, |
339cfa0e |
932 | because it does not use surrogates. UCS-4 is a 32-bit encoding, |
933 | functionally identical to UTF-32. |
c349b1b9 |
934 | |
c29a771d |
935 | =item * |
5cb3728c |
936 | |
937 | UTF-7 |
c349b1b9 |
938 | |
376d9008 |
939 | A seven-bit safe (non-eight-bit) encoding, which is useful if the |
940 | transport or storage is not eight-bit safe. Defined by RFC 2152. |
c349b1b9 |
941 | |
95a1a48b |
942 | =back |
943 | |
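The surrogate arithmetic and the BOM signatures listed above can be put
together in a small sketch. The helper names C<to_surrogates>,
C<from_surrogates> and C<sniff_bom> are hypothetical, invented for this
illustration; they are not part of Perl or of any module. The sketch uses
shifts and masks, which are equivalent to dividing by and taking the
remainder modulo 0x400.

```perl
use strict;
use warnings;

# Split a code point in U+10000..U+10FFFF into (high, low) surrogates.
sub to_surrogates {
    my ($uni) = @_;
    my $offset = $uni - 0x10000;
    my $hi = ($offset >> 10)   + 0xD800;   # upper 10 bits
    my $lo = ($offset & 0x3FF) + 0xDC00;   # lower 10 bits
    return ($hi, $lo);
}

# Recombine a surrogate pair into the original code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

# Guess an encoding from a leading BOM in a byte string; returns undef
# if none is recognized. The UTF-32LE signature is tested before the
# UTF-16LE one because it begins with the same two bytes.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600 -> %04X %04X\n", $hi, $lo;   # D83D DE00
printf "round trip: U+%04X\n", from_surrogates($hi, $lo);
```

Note that the ordering in C<sniff_bom> is only a heuristic: UTF-16LE data
whose first character after the BOM is U+0000 would be reported as
UTF-32LE.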
0d7c09bb |
944 | =head2 Security Implications of Unicode |
945 | |
e1b711da |
946 | Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>. |
947 | Also, note the following: |
948 | |
0d7c09bb |
949 | =over 4 |
950 | |
951 | =item * |
952 | |
953 | Malformed UTF-8 |
bf0fa0b2 |
954 | |
955 | Unfortunately, the specification of UTF-8 leaves some room for |
956 | interpretation of how many bytes of encoded output one should generate |
376d9008 |
957 | from one input Unicode character. Strictly speaking, the shortest |
958 | possible sequence of UTF-8 bytes should be generated, |
959 | because otherwise there is potential for an input buffer overflow at |
feda178f |
960 | the receiving end of a UTF-8 connection. Perl always generates the |
e1b711da |
961 | shortest length UTF-8, and with warnings on, Perl will warn about |
376d9008 |
962 | non-shortest length UTF-8 along with other malformations, such as the |
963 | surrogates, which are not real Unicode code points. |
bf0fa0b2 |
964 | |
0d7c09bb |
965 | =item * |
966 | |
967 | Regular expressions behave slightly differently between byte data and |
376d9008 |
968 | character (Unicode) data. For example, the "word character" character |
969 | class C<\w> will work differently depending on whether the data is eight-bit bytes |
970 | or Unicode. |
0d7c09bb |
971 | |
376d9008 |
972 | In the first case, the set of C<\w> characters is either small--the |
973 | default set of alphabetic characters, digits, and the "_"--or, if you |
0d7c09bb |
974 | are using a locale (see L<perllocale>), the C<\w> might contain a few |
975 | more letters according to your language and country. |
976 | |
376d9008 |
977 | In the second case, the C<\w> set of characters is much, much larger. |
1bfb14c4 |
978 | Most importantly, even in the set of the first 256 characters, it will |
979 | probably match different characters: unlike most locales, which are |
980 | specific to a language and country pair, Unicode classifies all the |
981 | characters that are letters I<somewhere> as C<\w>. For example, your |
982 | locale might not think that LATIN SMALL LETTER ETH is a letter (unless |
983 | you happen to speak Icelandic), but Unicode does. |
0d7c09bb |
984 | |
376d9008 |
985 | As discussed elsewhere, Perl has one foot (two hooves?) planted in |
1bfb14c4 |
986 | each of two worlds: the old world of bytes and the new world of |
987 | characters, upgrading from bytes to characters when necessary. |
376d9008 |
988 | If your legacy code does not explicitly use Unicode, no automatic |
989 | switch-over to characters should happen. Characters shouldn't get |
1bfb14c4 |
990 | downgraded to bytes, either. It is possible to accidentally mix bytes |
991 | and characters, however (see L<perluniintro>), in which case C<\w> in |
992 | regular expressions might start behaving differently. Review your |
993 | code. Use warnings and the C<strict> pragma. |
0d7c09bb |
994 | |
995 | =back |
996 | |
c349b1b9 |
997 | =head2 Unicode in Perl on EBCDIC |
998 | |
376d9008 |
999 | The way Unicode is handled on EBCDIC platforms is still |
1000 | experimental. On such platforms, references to UTF-8 encoding in this |
1001 | document and elsewhere should be read as meaning the UTF-EBCDIC |
1002 | specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues |
c349b1b9 |
1003 | are specifically discussed. There is no C<utfebcdic> pragma or |
376d9008 |
1004 | ":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean |
86bbd6d1 |
1005 | the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic> |
1006 | for more discussion of the issues. |
c349b1b9 |
1007 | |
b310b053 |
1008 | =head2 Locales |
1009 | |
4616122b |
1010 | Usually locale settings and Unicode do not affect each other, but |
b310b053 |
1011 | there are a couple of exceptions: |
1012 | |
1013 | =over 4 |
1014 | |
1015 | =item * |
1016 | |
8aa8f774 |
1017 | You can enable automatic UTF-8-ification of your standard file |
1018 | handles, default C<open()> layer, and C<@ARGV> by using either |
1019 | the C<-C> command line switch or the C<PERL_UNICODE> environment |
1020 | variable; see L<perlrun> for the documentation of the C<-C> switch. |
b310b053 |
1021 | |
1022 | =item * |
1023 | |
376d9008 |
1024 | Perl tries really hard to work both with Unicode and the old |
1025 | byte-oriented world. Most often this is nice, but sometimes Perl's |
1026 | straddling of the proverbial fence causes problems. |
b310b053 |
1027 | |
1028 | =back |
1029 | |
1aad1664 |
1030 | =head2 When Unicode Does Not Happen |
1031 | |
1032 | While Perl does have extensive ways to input and output in Unicode, |
1033 | and a few other 'entry points' like C<@ARGV> that can be interpreted |
1034 | as Unicode (UTF-8), there still are many places where Unicode (in some |
1035 | encoding or another) could be given as arguments or received as |
1036 | results, or both, but it is not. |
1037 | |
e1b711da |
1038 | The following are such interfaces. Also, see L</The "Unicode Bug">. |
1039 | For all of these interfaces Perl |
6cd4dd6c |
1040 | currently (as of 5.8.3) simply assumes byte strings both as arguments |
1041 | and results, or UTF-8 strings if the C<encoding> pragma has been used. |
1aad1664 |
1042 | |
1043 | One reason why Perl does not attempt to resolve the role of Unicode in |
e1b711da |
1044 | these cases is that the answers are highly dependent on the operating |
1aad1664 |
1045 | system and the file system(s). For example, whether filenames can be |
1046 | in Unicode, and in exactly what kind of encoding, is not exactly a |
1047 | portable concept. Similarly for C<qx> and C<system>: how well will the |
1048 | 'command line interface' (and which of them?) handle Unicode? |
1049 | |
1050 | =over 4 |
1051 | |
557a2462 |
1052 | =item * |
1053 | |
51f494cc |
1054 | chdir, chmod, chown, chroot, exec, link, lstat, mkdir, |
1e8e8236 |
1055 | rename, rmdir, stat, symlink, truncate, unlink, utime, -X |
557a2462 |
1056 | |
1057 | =item * |
1058 | |
1059 | %ENV |
1060 | |
1061 | =item * |
1062 | |
1063 | glob (aka the <*>) |
1064 | |
1065 | =item * |
1aad1664 |
1066 | |
557a2462 |
1067 | open, opendir, sysopen |
1aad1664 |
1068 | |
557a2462 |
1069 | =item * |
1aad1664 |
1070 | |
557a2462 |
1071 | qx (aka the backtick operator), system |
1aad1664 |
1072 | |
557a2462 |
1073 | =item * |
1aad1664 |
1074 | |
557a2462 |
1075 | readdir, readlink |
1aad1664 |
1076 | |
1077 | =back |
1078 | |
e1b711da |
1079 | =head2 The "Unicode Bug" |
1080 | |
1081 | The term "Unicode bug" has been applied to an inconsistency with the |
1082 | Unicode characters whose code points are in the Latin-1 Supplement block, that |
1083 | is, between 128 and 255. Without a locale specified, unlike all other |
1084 | characters or code points, these characters have very different semantics in |
1085 | byte semantics versus character semantics. |
1086 | |
1087 | In character semantics they are interpreted as Unicode code points, which means |
1088 | they have the same semantics as Latin-1 (ISO-8859-1). |
1089 | |
1090 | In byte semantics, they are considered to be unassigned characters, meaning |
1091 | that the only semantics they have is their ordinal numbers, and that they are |
1092 | not members of various character classes. None are considered to match C<\w> |
1093 | for example, but all match C<\W>. (On EBCDIC platforms, the behavior may |
1094 | be different from this, depending on the underlying C language library |
1095 | functions.) |
1096 | |
1097 | The behavior is known to have effects on these areas: |
1098 | |
1099 | =over 4 |
1100 | |
1101 | =item * |
1102 | |
1103 | Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>, |
1104 | and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in regular expression |
1105 | substitutions. |
1106 | |
1107 | =item * |
1108 | |
1109 | Using caseless (C</i>) regular expression matching |
1110 | |
1111 | =item * |
1112 | |
1113 | Matching a number of properties in regular expressions, such as C<\w> |
1114 | |
1115 | =item * |
1116 | |
1117 | User-defined case change mappings. You can create a C<ToUpper()> function, for |
1118 | example, which overrides Perl's built-in case mappings. The scalar must be |
1119 | encoded in utf8 for your function to actually be invoked. |
1120 | |
1121 | =back |
1122 | |
1123 | This behavior can lead to unexpected results in which a string's semantics |
1124 | suddenly change if a code point above 255 is appended to or removed from it, |
1125 | which changes the string's semantics from byte to character or vice versa. As |
1126 | an example, consider the following program and its output: |
1127 | |
1128 | $ perl -le' |
1129 | $s1 = "\xC2"; |
1130 | $s2 = "\x{2660}"; |
1131 | for ($s1, $s2, $s1.$s2) { |
1132 | print /\w/ || 0; |
1133 | } |
1134 | ' |
1135 | 0 |
1136 | 0 |
1137 | 1 |
1138 | |
1139 | If there's no \w in s1 or in s2, why does their concatenation have one? |
1140 | |
1141 | This anomaly stems from Perl's attempt to not disturb older programs that |
1142 | didn't use Unicode, and hence had no semantics for characters outside of the |
1143 | ASCII range (except in a locale), along with Perl's desire to add Unicode |
1144 | support seamlessly. The result wasn't seamless: these characters were |
1145 | orphaned. |
1146 | |
1147 | Work is being done to correct this, but only some of it was complete in time |
1148 | for the 5.12 release. What has been finished is the important part of the case |
1149 | changing component. Due to concerns, and some evidence, that older code might |
1150 | have come to rely on the existing behavior, the new behavior must be explicitly |
1151 | enabled by the feature C<unicode_strings> in the L<feature> pragma, even though |
1152 | no new syntax is involved. |
1153 | |
1154 | See L<perlfunc/lc> for details on how this pragma works in combination with |
1155 | various others for casing. Even though the pragma only affects casing |
1156 | operations in the 5.12 release, it is planned to have it affect all the |
1157 | problematic behaviors in later releases: you can't have one without them all. |
1158 | |
1159 | In the meantime, a workaround is to always call utf8::upgrade($string), or to |
1160 | use the standard modules L<Encode> or L<charnames>. |
1161 | |
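The C<utf8::upgrade()> workaround can be illustrated with a minimal
sketch. It assumes that neither a locale nor the C<unicode_strings>
feature is in effect in the enclosing scope, so that the first match is
done with byte semantics.

```perl
use strict;
use warnings;

my $s = "\xC2";    # LATIN CAPITAL LETTER A WITH CIRCUMFLEX, stored as a byte

# Byte semantics: the character is not treated as a word character.
my $before = ($s =~ /\w/) ? 1 : 0;

# Same character, same length; only the internal representation changes.
utf8::upgrade($s);

# Character semantics: the character is now a Unicode letter.
my $after = ($s =~ /\w/) ? 1 : 0;

print "before=$before after=$after\n";    # before=0 after=1
```

The string compares equal before and after the upgrade; only operations
that depend on character semantics behave differently.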
1aad1664 |
1162 | =head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl) |
1163 | |
e1b711da |
1164 | Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">) |
1165 | there are situations where you simply need to force a byte |
2bbc8d55 |
1166 | string into UTF-8, or vice versa. The low-level calls |
1167 | utf8::upgrade($bytestring) and utf8::downgrade($utf8string[, FAIL_OK]) are |
1aad1664 |
1168 | the answers. |
1169 | |
2bbc8d55 |
1170 | Note that utf8::downgrade() can fail if the string contains characters |
1171 | that don't fit into a byte. |
1aad1664 |
1172 | |
e1b711da |
1173 | Calling either function on a string that already is in the desired state is a |
1174 | no-op. |
1175 | |
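A short sketch of both calls follows, using C<utf8::is_utf8()> to inspect
the internal state. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $s = "caf\xE9";                        # every character fits in a byte

utf8::upgrade($s);                        # force the UTF-8 representation
my $is_upgraded = utf8::is_utf8($s) ? 1 : 0;
my $length      = length($s);             # still 4 characters

my $downgraded  = utf8::downgrade($s) ? 1 : 0;   # succeeds

# A character above 255 cannot be stored in a byte. With a true
# FAIL_OK argument, utf8::downgrade() returns false instead of dying
# and leaves the string untouched.
my $wide    = "sm\x{263A}";
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "upgraded=$is_upgraded length=$length ",
      "downgraded=$downgraded wide_ok=$wide_ok\n";
```

Without the C<FAIL_OK> argument, the failing C<utf8::downgrade()> call
would raise an exception instead of returning false.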
95a1a48b |
1176 | =head2 Using Unicode in XS |
1177 | |
3a2263fe |
1178 | If you want to handle Perl Unicode in XS extensions, you may find the |
1179 | following C APIs useful. See also L<perlguts/"Unicode Support"> for an |
1180 | explanation about Unicode at the XS level, and L<perlapi> for the API |
1181 | details. |
95a1a48b |
1182 | |
1183 | =over 4 |
1184 | |
1185 | =item * |
1186 | |
1bfb14c4 |
1187 | C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes |
2bbc8d55 |
1188 | pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8> |
1bfb14c4 |
1189 | flag is on; the bytes pragma is ignored. The C<UTF8> flag being on |
1190 | does B<not> mean that there are any characters of code points greater |
1191 | than 255 (or 127) in the scalar or that there are even any characters |
1192 | in the scalar. What the C<UTF8> flag means is that the sequence of |
1193 | octets in the representation of the scalar is the sequence of UTF-8 |
1194 | encoded code points of the characters of a string. The C<UTF8> flag |
1195 | being off means that each octet in this representation encodes a |
1196 | single character with code point 0..255 within the string. Perl's |
1197 | Unicode model is not to use UTF-8 until it is absolutely necessary. |
95a1a48b |
1198 | |
1199 | =item * |
1200 | |
2bbc8d55 |
1201 | C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into |
1bfb14c4 |
1202 | a buffer encoding the code point as UTF-8, and returns a pointer |
2bbc8d55 |
1203 | pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines. |
95a1a48b |
1204 | |
1205 | =item * |
1206 | |
2bbc8d55 |
1207 | C<utf8_to_uvchr(buf, lenp)> reads UTF-8 encoded bytes from a buffer and |
376d9008 |
1208 | returns the Unicode character code point and, optionally, the length of |
2bbc8d55 |
1209 | the UTF-8 byte sequence. It works appropriately on EBCDIC machines. |
95a1a48b |
1210 | |
1211 | =item * |
1212 | |
376d9008 |
1213 | C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer |
1214 | in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded |
95a1a48b |
1215 | scalar. |
1216 | |
1217 | =item * |
1218 | |
376d9008 |
1219 | C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8 |
1220 | encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if |
1221 | possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that |
1222 | it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the |
1223 | opposite of C<sv_utf8_encode()>. Note that none of these are to be |
1224 | used as general-purpose encoding or decoding interfaces: C<use Encode> |
1225 | for that. C<sv_utf8_upgrade()> is affected by the encoding pragma |
1226 | but C<sv_utf8_downgrade()> is not (since the encoding pragma is |
1227 | designed to be a one-way street). |
95a1a48b |
1228 | |
1229 | =item * |
1230 | |
376d9008 |
1231 | C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8 |
90f968e0 |
1232 | character. |
95a1a48b |
1233 | |
1234 | =item * |
1235 | |
376d9008 |
1236 | C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer |
95a1a48b |
1237 | are valid UTF-8. |
1238 | |
1239 | =item * |
1240 | |
376d9008 |
1241 | C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded |
1242 | character in the buffer. C<UNISKIP(chr)> will return the number of bytes |
1243 | required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()> |
90f968e0 |
1244 | is useful for example for iterating over the characters of a UTF-8 |
376d9008 |
1245 | encoded buffer; C<UNISKIP()> is useful, for example, in computing |
90f968e0 |
1246 | the size required for a UTF-8 encoded buffer. |
95a1a48b |
1247 | |
1248 | =item * |
1249 | |
376d9008 |
1250 | C<utf8_distance(a, b)> will tell the distance in characters between the |
95a1a48b |
1251 | two pointers pointing to the same UTF-8 encoded buffer. |
1252 | |
1253 | =item * |
1254 | |
2bbc8d55 |
1255 | C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer |
376d9008 |
1256 | that is C<off> (positive or negative) Unicode characters displaced |
1257 | from the UTF-8 buffer C<s>. Be careful not to overstep the buffer: |
1258 | C<utf8_hop()> will merrily run off the end or the beginning of the |
1259 | buffer if told to do so. |
95a1a48b |
1260 | |
d2cc3551 |
1261 | =item * |
1262 | |
376d9008 |
1263 | C<pv_uni_display(dsv, spv, len, pvlim, flags)> and |
1264 | C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the |
1265 | output of Unicode strings and scalars. By default they are useful |
1266 | only for debugging--they display B<all> characters as hexadecimal code |
1bfb14c4 |
1267 | points--but with the flags C<UNI_DISPLAY_ISPRINT>, |
1268 | C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the |
1269 | output more readable. |
d2cc3551 |
1270 | |
1271 | =item * |
1272 | |
2bbc8d55 |
1273 | C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to |
376d9008 |
1274 | compare two strings case-insensitively in Unicode. For case-sensitive |
1275 | comparisons you can just use C<memEQ()> and C<memNE()> as usual. |
d2cc3551 |
1276 | |
c349b1b9 |
1277 | =back |
1278 | |
95a1a48b |
1279 | For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h> |
1280 | in the Perl source code distribution. |
1281 | |
e1b711da |
1282 | =head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only) |
1283 | |
1284 | Perl by default comes with the latest supported Unicode version built in, but |
1285 | you can change to use any earlier one. |
1286 | |
1287 | Download the files in the version of Unicode that you want from the Unicode web |
1288 | site (L<http://www.unicode.org>). These should replace the existing files in |
1289 | C<$Config{privlib}>/F<unicore>. (C<%Config> is available from the Config |
1290 | module.) Follow the instructions in F<README.perl> in that directory to change |
1291 | some of their names, and then run F<make>. |
1292 | |
1293 | It is even possible to download them to a different directory, and then change |
1294 | F<utf8_heavy.pl> in the directory C<$Config{privlib}> to point to the new |
1295 | directory, or maybe make a copy of that directory before making the change, and |
1296 | using C<@INC> or the C<-I> run-time flag to switch between versions at will |
1297 | (but because of caching, not in the middle of a process), but all this is |
1298 | beyond the scope of these instructions. |
1299 | |
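As a minimal sketch of where those files live, the following prints the
F<unicore> directory of the running perl and the Unicode version it was
built with. It assumes a standard installation in which the core
C<Config>, C<File::Spec> and C<Unicode::UCD> modules are available; the
variable names are illustrative only.

```perl
use strict;
use warnings;
use Config;          # exports %Config
use File::Spec;
use Unicode::UCD ();

# Directory holding the Unicode data files used by this perl.
my $unicore = File::Spec->catdir($Config{privlib}, 'unicore');

# Version of the Unicode database this perl was built with.
my $version = Unicode::UCD::UnicodeVersion();

print "unicore directory: $unicore\n";
print "Unicode version:   $version\n";
```

Running this before and after replacing the files is one way to check
that the change took effect.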
c29a771d |
1300 | =head1 BUGS |
1301 | |
376d9008 |
1302 | =head2 Interaction with Locales |
7eabb34d |
1303 | |
376d9008 |
1304 | Use of locales with Unicode data may lead to odd results. Currently, |
1305 | Perl attempts to attach 8-bit locale info to characters in the range |
1306 | 0..255, but this technique is demonstrably incorrect for locales that |
1307 | use characters above that range when mapped into Unicode. Perl's |
1308 | Unicode support will also tend to run slower when locales are in use. Use of locales with |
1309 | Unicode is discouraged. |
c29a771d |
1310 | |
e1b711da |
1311 | =head2 Problems with characters in the C<Latin-1 Supplement> range |
2bbc8d55 |
1312 | |
e1b711da |
1313 | See L</The "Unicode Bug">. |
1314 | |
1315 | =head2 Problems with case-insensitive regular expression matching |
1316 | |
1317 | There are problems with case-insensitive matches, including those involving |
1318 | character classes (enclosed in [square brackets]), characters whose fold |
1319 | is to multiple characters (for example, the single character C<LATIN SMALL LIGATURE |
1320 | FFL> matches case-insensitively with the 3-character string C<ffl>), and |
1321 | characters in the C<Latin-1 Supplement>. |
2bbc8d55 |
1322 | |
376d9008 |
1323 | =head2 Interaction with Extensions |
7eabb34d |
1324 | |
376d9008 |
1325 | When Perl exchanges data with an extension, the extension should be |
2575c402 |
1326 | able to understand the UTF8 flag and act accordingly. If the |
376d9008 |
1327 | extension doesn't know about the flag, it's likely that the extension |
1328 | will return incorrectly-flagged data. |
7eabb34d |
1329 | |
1330 | So if you're working with Unicode data, consult the documentation of |
1331 | every module you're using to see whether there are any issues with Unicode data |
1332 | exchange. If the documentation does not talk about Unicode at all, |
a73d23f6 |
1333 | suspect the worst and probably look at the source to learn how the |
376d9008 |
1334 | module is implemented. Modules written completely in Perl shouldn't |
a73d23f6 |
1335 | cause problems. Modules that directly or indirectly access code written |
1336 | in other programming languages are at risk. |
7eabb34d |
1337 | |
376d9008 |
1338 | For affected functions, the simple strategy to avoid data corruption is |
7eabb34d |
1339 | to always make the encoding of the exchanged data explicit. Choose an |
376d9008 |
1340 | encoding that you know the extension can handle. Convert arguments passed |
7eabb34d |
1341 | to the extensions to that encoding and convert results back from that |
1342 | encoding. Write wrapper functions that do the conversions for you, so |
1343 | you can later change the functions when the extension catches up. |
1344 | |
376d9008 |
1345 | To provide an example, let's say the popular Foo::Bar::escape_html |
7eabb34d |
1346 | function doesn't deal with Unicode data yet. The wrapper function |
1347 | would convert the argument to raw UTF-8 and convert the result back to |
376d9008 |
1348 | Perl's internal representation like so: |
7eabb34d |
1349 | |
1350 | sub my_escape_html ($) { |
1351 | my($what) = shift; |
1352 | return unless defined $what; |
1353 | Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what))); |
1354 | } |
1355 | |
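The wrapper above can be tried out in isolation. The sketch below
substitutes a hypothetical byte-oriented function,
C<legacy_escape_html>, for C<Foo::Bar::escape_html> (which is itself
only an example name), so that the round trip through raw UTF-8 can be
observed without any non-core module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for an extension function that only handles octets.
sub legacy_escape_html {
    my ($octets) = @_;
    $octets =~ s/&/&amp;/g;
    $octets =~ s/</&lt;/g;
    $octets =~ s/>/&gt;/g;
    return $octets;
}

# Encode to raw UTF-8 before the call, decode the result afterwards.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    return Encode::decode_utf8(legacy_escape_html(Encode::encode_utf8($what)));
}

# U+263A (WHITE SMILING FACE) survives the trip as a single character.
my $escaped = my_escape_html("<b>\x{263A}</b>");
```

Because the conversion is confined to the wrapper, it can be dropped in
one place once the extension handles Unicode itself.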
1356 | Sometimes, when the extension does not convert data but just stores |
1357 | and retrieves them, you will be in a position to use the otherwise |
1358 | dangerous Encode::_utf8_on() function. Let's say the popular |
66b79f27 |
1359 | C<Foo::Bar> extension, written in C, provides a C<param> method that |
7eabb34d |
1360 | lets you store and retrieve data according to these prototypes: |
1361 | |
1362 | $self->param($name, $value); # set a scalar |
1363 | $value = $self->param($name); # retrieve a scalar |
1364 | |
1365 | If it does not yet provide support for any encoding, one could write a |
1366 | derived class with such a C<param> method: |
1367 | |
1368 | sub param { |
1369 | my($self,$name,$value) = @_; |
1370 | utf8::upgrade($name); # make sure it is UTF-8 encoded |
af55fc6a |
1371 | if (defined $value) { |
7eabb34d |
1372 | utf8::upgrade($value); # make sure it is UTF-8 encoded |
1373 | return $self->SUPER::param($name,$value); |
1374 | } else { |
1375 | my $ret = $self->SUPER::param($name); |
1376 | Encode::_utf8_on($ret); # we know, it is UTF-8 encoded |
1377 | return $ret; |
1378 | } |
1379 | } |
1380 | |
a73d23f6 |
1381 | Some extensions provide filters on data entry/exit points, such as |
1382 | DB_File::filter_store_key and family. Look out for such filters in |
66b79f27 |
1383 | the documentation of your extensions; they can make the transition to |
7eabb34d |
1384 | Unicode data much easier. |
1385 | |
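As a sketch of how such filters might be used, the following installs
the four C<DB_File> filters so that keys and values are encoded to
UTF-8 on the way into the database and decoded on the way out. It
assumes that C<DB_File> is installed; the temporary file and the sample
key and value are made up for the illustration.

```perl
use strict;
use warnings;
use DB_File;
use Encode ();
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/demo.db";

my %h;
my $db = tie %h, 'DB_File', $file, O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot open $file: $!";

# Each filter receives the datum in $_ and modifies it in place.
$db->filter_store_key  (sub { $_ = Encode::encode_utf8($_) });
$db->filter_store_value(sub { $_ = Encode::encode_utf8($_) });
$db->filter_fetch_key  (sub { $_ = Encode::decode_utf8($_) });
$db->filter_fetch_value(sub { $_ = Encode::decode_utf8($_) });

my $key = "sm\x{f8}rrebr\x{f8}d";
$h{$key} = "\x{263A}";            # stored as octets on disk
my $value = $h{$key};             # comes back as characters

undef $db;                        # drop the extra reference before untie
untie %h;
```

With the filters in place, the rest of the program can treat the tied
hash as holding character strings.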
376d9008 |
1386 | =head2 Speed |
7eabb34d |
1387 | |
c29a771d |
1388 | Some functions are slower when working on UTF-8 encoded strings than |
574c8022 |
1389 | on byte encoded strings. All functions that need to hop over |
7c17141f |
1390 | characters such as length(), substr() or index(), or matching regular |
1391 | expressions can work B<much> faster when the underlying data are |
1392 | byte-encoded. |
1393 | |
1394 | In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 |
1395 | a caching scheme was introduced which will hopefully make the slowness |
a104b433 |
1396 | somewhat less spectacular, at least for some operations. In general, |
1397 | operations with UTF-8 encoded strings are still slower. As an example, |
1398 | the Unicode properties (character classes) like C<\p{Nd}> are known to |
1399 | be quite a bit slower (5-20 times) than their simpler counterparts |
1400 | like C<\d> (then again, there are 268 Unicode characters matching C<Nd> |
1401 | compared with the 10 ASCII characters matching C<\d>). |
666f95b9 |
1402 | |
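To see whether this matters for a particular workload, one can measure
it. The sketch below uses the core C<Benchmark> module to compare
C<substr()> on a byte-encoded string and on a copy that has been
upgraded to the internal UTF-8 representation. The string contents and
offsets are arbitrary, and the figures will vary with the perl version
and with the caching mentioned above, so treat the output as indicative
only.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Two strings with identical characters but different internal storage.
my $bytes = "\xE9" x 100_000;     # one byte per character
my $chars = $bytes;
utf8::upgrade($chars);            # same characters, stored as UTF-8

# Run each sub for about one CPU second and print a comparison table.
cmpthese(-1, {
    bytes => sub { my $c = substr($bytes, 90_000, 1) },
    utf8  => sub { my $c = substr($chars, 90_000, 1) },
});
```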
e1b711da |
1403 | =head2 Problems on EBCDIC platforms |
1404 | |
1405 | There are a number of known problems with Perl on EBCDIC platforms. If you |
1406 | want to use Perl there, send email to perlbug@perl.org. |
fe749c9a |
1407 | |
1408 | In earlier versions, when byte and character data were concatenated, |
1409 | the new string was sometimes created by |
1410 | decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the |
1411 | old Unicode string used EBCDIC. |
1412 | |
1413 | If you find any of these, please report them as bugs. |
1414 | |
c8d992ba |
1415 | =head2 Porting code from perl-5.6.X |
1416 | |
1417 | Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer |
1418 | was required to use the C<utf8> pragma to declare that a given scope |
1419 | expected to deal with Unicode data and had to make sure that only |
1420 | Unicode data were reaching that scope. If you have code that is |
1421 | working with 5.6, you will need some of the following adjustments to |
1422 | your code. The examples are written such that the code will continue |
1423 | to work under 5.6, so you should be safe to try them out. |
1424 | |
1425 | =over 4 |
1426 | |
1427 | =item * |
1428 | |
1429 | A filehandle that should read or write UTF-8 |
1430 | |
1431 | if ($] > 5.007) { |
740d4bb2 |
1432 | binmode $fh, ":encoding(utf8)"; |
c8d992ba |
1433 | } |
1434 | |
1435 | =item * |
1436 | |
1437 | A scalar that is going to be passed to some extension |
1438 | |
1439 | Be it Compress::Zlib, Apache::Request or any extension that has no |
1440 | mention of Unicode in the manpage, you need to make sure that the |
2575c402 |
1441 | UTF8 flag is stripped off. Note that at the time of this writing |
c8d992ba |
1442 | (October 2002) the mentioned modules are not UTF-8-aware. Please |
1443 | check the documentation to verify if this is still true. |
1444 | |
1445 | if ($] > 5.007) { |
1446 | require Encode; |
1447 | $val = Encode::encode_utf8($val); # make octets |
1448 | } |
1449 | |
1450 | =item * |
1451 | |
1452 | A scalar we got back from an extension |
1453 | |
1454 | If you believe the scalar comes back as UTF-8, you will most likely |
2575c402 |
1455 | want the UTF8 flag restored: |
c8d992ba |
1456 | |
1457 | if ($] > 5.007) { |
1458 | require Encode; |
1459 | $val = Encode::decode_utf8($val); |
1460 | } |
1461 | |
1462 | =item * |
1463 | |
1464 | Same thing, if you are really sure it is UTF-8 |
1465 | |
1466 | if ($] > 5.007) { |
1467 | require Encode; |
1468 | Encode::_utf8_on($val); |
1469 | } |
1470 | |
1471 | =item * |
1472 | |
1473 | A wrapper for fetchrow_array and fetchrow_hashref |
1474 | |
1475 | When the database contains only UTF-8, a wrapper function or method is |
1476 | a convenient way to replace all your fetchrow_array and |
1477 | fetchrow_hashref calls. A wrapper function will also make it easier to |
1478 | adapt to future enhancements in your database driver. Note that at the |
1479 | time of this writing (October 2002), the DBI has no standardized way |
1480 | to deal with UTF-8 data. Please check the documentation to verify if |
1481 | that is still true. |
1482 | |
1483 | sub fetchrow { |
1484 | my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref} |
1485 | if ($] < 5.007) { |
1486 | return $sth->$what; |
1487 | } else { |
1488 | require Encode; |
1489 | if (wantarray) { |
1490 | my @arr = $sth->$what; |
1491 | for (@arr) { |
1492 | defined && /[^\000-\177]/ && Encode::_utf8_on($_); |
1493 | } |
1494 | return @arr; |
1495 | } else { |
1496 | my $ret = $sth->$what; |
1497 | if (ref $ret) { |
1498 | for my $k (keys %$ret) { |
1499 | defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k}; |
1500 | } |
1501 | return $ret; |
1502 | } else { |
1503 | defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret; |
1504 | return $ret; |
1505 | } |
1506 | } |
1507 | } |
1508 | } |
1509 | |
1510 | |
1511 | =item * |
1512 | |
1513 | A large scalar that you know can only contain ASCII |
1514 | |
1515 | Scalars that contain only ASCII and are marked as UTF-8 are sometimes |
1516 | a drag on your program. If you recognize such a situation, just remove |
2575c402 |
1517 | the UTF8 flag: |
c8d992ba |
1518 | |
1519 | utf8::downgrade($val) if $] > 5.007; |
1520 | |
1521 | =back |
1522 | |
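The last item can be checked with a short script. This sketch upgrades
an ASCII-only scalar, removes the UTF8 flag again with
C<utf8::downgrade>, and then shows that passing a true second argument
makes C<utf8::downgrade> return false instead of dying when the string
contains a character above 255. The variable names and sample strings
are illustrative.

```perl
use strict;
use warnings;

my $val = "plain ASCII text";
utf8::upgrade($val);                       # now flagged as UTF-8
my $flag_before = utf8::is_utf8($val) ? 1 : 0;

utf8::downgrade($val);                     # same characters, flag removed
my $flag_after = utf8::is_utf8($val) ? 1 : 0;

print "flag before: $flag_before, after: $flag_after\n";

# A string with a wide character cannot be downgraded.
my $wide = "\x{263A}";
my $ok   = utf8::downgrade($wide, 1);      # true second argument: fail quietly
print "wide string downgraded: ", ($ok ? "yes" : "no"), "\n";
```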
393fec97 |
1523 | =head1 SEE ALSO |
1524 | |
51f494cc |
1525 | L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>, |
a05d7ebb |
1526 | L<perlretut>, L<perlvar/"${^UNICODE}">, |
51f494cc |
1527 | L<http://www.unicode.org/reports/tr44>. |
393fec97 |
1528 | |
1529 | =cut |