X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=230c105e1b85345e6856ea9e8d8737df5d56943f;hb=e969ef56d2ce92c4add247c1cf714c73bc057bc5;hp=eed2066d26d554736391932d8851f37cedd35a50;hpb=68cd2d32d62fbb83ad09510797bfdbcf6ec6309e;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index eed2066..230c105 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -780,13 +780,13 @@ Level 1 - Basic Unicode Support
              capital letters with certain modifiers: the Full case-folding
              decomposes the letter, while the Simple case-folding would map
              it to a single character.
-        [ 9] see UTR#13 Unicode Newline Guidelines
+        [ 9] see UTR #13 Unicode Newline Guidelines
         [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
              (should also affect <>, $., and script line numbers)
              (the \x{85}, \x{2028} and \x{2029} do match \s)
 
 [a] You can mimic class subtraction using lookahead.
-For example, what TR18 might write as
+For example, what UTR #18 might write as
 
     [{Greek}-[{UNASSIGNED}]]
 
@@ -801,6 +801,9 @@ But in this particular example, you probably really want
 which will match assigned characters known to be part of the Greek
 script.
 
+Also see the Unicode::Regex::Set module, it does implement the full
+UTR #18 grouping, intersection, union, and removal (subtraction) syntax.
+
 [b] See L.
 
 =item *
@@ -1043,14 +1046,10 @@ there are a couple of exceptions:
 
 =item *
 
-If your locale environment variables (LC_ALL, LC_CTYPE, LANG)
-contain the strings 'UTF-8' or 'UTF8' (matched case-insensitively)
-B you enable using UTF-8 either by using the C<-C> command line
-switch or setting the PERL_UTF8_LOCALE environment variable to a true
-value, then the default encodings of your STDIN, STDOUT, and STDERR,
-and of B, are considered to be UTF-8.
-See L, L, and L for more
-information. The magic variable C<${^UTF8_LOCALE}> will also be set.
+You can enable automatic UTF-8-ification of your standard file
+handles, default C layer, and C<@ARGV> by using either
+the C<-C> command line switch or the C environment
+variable, see L for the documentation of the C<-C> switch.
 
 =item *
 
@@ -1060,6 +1059,66 @@ straddling of the proverbial fence causes problems.
 
 =back
 
+=head2 When Unicode Does Not Happen
+
+While Perl does have extensive ways to input and output in Unicode,
+and few other 'entry points' like the @ARGV which can be interpreted
+as Unicode (UTF-8), there still are many places where Unicode (in some
+encoding or another) could be given as arguments or received as
+results, or both, but it is not.
+
+The following are such interfaces. For all of these Perl currently
+(as of 5.8.1) simply assumes byte strings both as arguments and results.
+
+One reason why Perl does not attempt to resolve the role of Unicode in
+this cases is that the answers are highly dependent on the operating
+system and the file system(s). For example, whether filenames can be
+in Unicode, and in exactly what kind of encoding, is not exactly a
+portable concept. Similarly for the qx and system: how well will the
+'command line interface' (and which of them?) handle Unicode?
+
+=over 4
+
+=item *
+
+chmod, chmod, chown, chroot, exec, link, mkdir
+rename, rmdir stat, symlink, truncate, unlink, utime
+
+=item *
+
+%ENV
+
+=item *
+
+glob (aka the <*>)
+
+=item *
+
+open, opendir, sysopen
+
+=item *
+
+qx (aka the backtick operator), system
+
+=item *
+
+readdir, readlink
+
+=back
+
+=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
+
+Sometimes (see L) there are
+situations where you simply need to force Perl to believe that a byte
+string is UTF-8, or vice versa. The low-level calls
+utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
+the answers.
+
+Do not use them without careful thought, though: Perl may easily get
+very confused, angry, or even crash, if you suddenly change the 'nature'
+of scalar like that. Especially careful you have to be if you use the
+utf8::upgrade(): any random byte string is not valid UTF-8.
+
 =head2 Using Unicode in XS
 
 If you want to handle Perl Unicode in XS extensions, you may find the
@@ -1244,57 +1303,14 @@ Unicode data much easier.
 
 Some functions are slower when working on UTF-8 encoded strings than
 on byte encoded strings. All functions that need to hop over
-characters such as length(), substr() or index() can work B
-faster when the underlying data are byte-encoded. Witness the
-following benchmark:
-
-  % perl -e '
-  use Benchmark;
-  use strict;
-  our $l = 10000;
-  our $u = our $b = "x" x $l;
-  substr($u,0,1) = "\x{100}";
-  timethese(-2,{
-      LENGTH_B => q{ length($b) },
-      LENGTH_U => q{ length($u) },
-      SUBSTR_B => q{ substr($b, $l/4, $l/2) },
-      SUBSTR_U => q{ substr($u, $l/4, $l/2) },
-  });
-  '
-  Benchmark: running LENGTH_B, LENGTH_U, SUBSTR_B, SUBSTR_U for at least 2 CPU seconds...
-  LENGTH_B: 2 wallclock secs ( 2.36 usr + 0.00 sys = 2.36 CPU) @ 5649983.05/s (n=13333960)
-  LENGTH_U: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 12155.45/s (n=25648)
-  SUBSTR_B: 3 wallclock secs ( 2.16 usr + 0.00 sys = 2.16 CPU) @ 374480.09/s (n=808877)
-  SUBSTR_U: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 6791.00/s (n=14329)
-
-The numbers show an incredible slowness on long UTF-8 strings. You
-should carefully avoid using these functions in tight loops. If you
-want to iterate over characters, the superior coding technique would
-split the characters into an array instead of using substr, as the following
-benchmark shows:
-
-  % perl -e '
-  use Benchmark;
-  use strict;
-  our $l = 10000;
-  our $u = our $b = "x" x $l;
-  substr($u,0,1) = "\x{100}";
-  timethese(-5,{
-      SPLIT_B => q{ for my $c (split //, $b){} },
-      SPLIT_U => q{ for my $c (split //, $u){} },
-      SUBSTR_B => q{ for my $i (0..length($b)-1){my $c = substr($b,$i,1);} },
-      SUBSTR_U => q{ for my $i (0..length($u)-1){my $c = substr($u,$i,1);} },
-  });
-  '
-  Benchmark: running SPLIT_B, SPLIT_U, SUBSTR_B, SUBSTR_U for at least 5 CPU seconds...
-  SPLIT_B: 6 wallclock secs ( 5.29 usr + 0.00 sys = 5.29 CPU) @ 56.14/s (n=297)
-  SPLIT_U: 5 wallclock secs ( 5.17 usr + 0.01 sys = 5.18 CPU) @ 55.21/s (n=286)
-  SUBSTR_B: 5 wallclock secs ( 5.34 usr + 0.00 sys = 5.34 CPU) @ 123.22/s (n=658)
-  SUBSTR_U: 7 wallclock secs ( 6.20 usr + 0.00 sys = 6.20 CPU) @ 0.81/s (n=5)
-
-Even though the algorithm based on C is faster than
-C for byte-encoded data, it pales in comparison to the speed
-of C when used with UTF-8 data.
+characters such as length(), substr() or index(), or matching regular
+expressions can work B faster when the underlying data are
+byte-encoded.
+
+In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
+a caching scheme was introduced which will hopefully make the slowness
+somewhat less spectacular. Operations with UTF-8 encoded strings are
+still slower, though.
 
 =head2 Porting code from perl-5.6.X
 
@@ -1407,6 +1423,6 @@ the UTF-8 flag:
 =head1 SEE ALSO
 
 L, L, L, L, L, L,
-L, L
+L, L
 
 =cut