X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlunicode.pod;h=190247aea796369a877a6603ef29db9832a4cbf3;hb=209071589ddd827372bd46e1358d1d13f6b4dbcb;hp=4508de7bca8dd3bfa0939cc21938a0099d6d9a5f;hpb=1aad1664cf756e015147414b107d6e07ef43c6bc;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 4508de7..190247a 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -177,6 +177,10 @@ You can also use negation in both C<\p{}> and C<\P{}> by
 introducing a caret (^) between the first brace and the property name:
 C<\p{^Tamil}> is equal to C<\P{Tamil}>.
 
+B
+
 Here are the basic Unicode General Category properties, followed by their
 long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
 for instance, are identical.
@@ -750,7 +754,8 @@ See L.
 The following list of Unicode support for regular expressions describes
 all the features currently supported. The references to "Level N"
 and the section numbers refer to the Unicode Technical Report 18,
-"Unicode Regular Expression Guidelines".
+"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
+Perl 5.8.0).
 
 =over 4
 
@@ -780,13 +785,13 @@ Level 1 - Basic Unicode Support
             capital letters with certain modifiers: the Full case-folding
             decomposes the letter, while the Simple case-folding would map
             it to a single character.
-        [ 9] see UTR#13 Unicode Newline Guidelines
+        [ 9] see UTR #13 Unicode Newline Guidelines
         [10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
             (should also affect <>, $., and script line numbers)
             (the \x{85}, \x{2028} and \x{2029} do match \s)
 
 [a] You can mimic class subtraction using lookahead.
-For example, what TR18 might write as
+For example, what UTR #18 might write as
 
     [{Greek}-[{UNASSIGNED}]]
 
@@ -801,6 +806,9 @@ But in this particular example, you probably really want
 
 which will match assigned characters known to be part of the Greek script.
 
+Also see the Unicode::Regex::Set module, it does implement the full
+UTR #18 grouping, intersection, union, and removal (subtraction) syntax.
+
 [b] See L.
 
 =item *
 
@@ -896,7 +904,7 @@ Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
 
 =item *
 
-UTF-16, UTF-16BE, UTF16-LE, Surrogates, and BOMs (Byte Order Marks)
+UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
 
 The followings items are mostly for reference and general Unicode
 knowledge, Perl doesn't use these constructs internally.
@@ -948,7 +956,7 @@ format".
 
 =item *
 
-UTF-32, UTF-32BE, UTF32-LE
+UTF-32, UTF-32BE, UTF-32LE
 
 The UTF-32 family is pretty much like the UTF-16 family, expect that
 the units are 32-bit, and therefore the surrogate scheme is not
@@ -1076,17 +1084,30 @@ portable concept. Similarly for the qx and system: how well will the
 
 =over 4
 
-=item chmod, chmod, chown, chroot, exec, link, mkdir, rename, rmdir, stat, symlink, truncate, unlink, utime
+=item *
+
+chmod, chmod, chown, chroot, exec, link, mkdir
+rename, rmdir stat, symlink, truncate, unlink, utime
+
+=item *
+
+%ENV
 
-=item %ENV
+=item *
+
+glob (aka the <*>)
 
-=item glob (aka the <*>)
+=item *
 
-=item open, opendir, sysopen
+open, opendir, sysopen
 
-=item qx (aka the backtick operator), system
+=item *
+
+qx (aka the backtick operator), system
+
+=item *
 
-=item readdir, readlink
+readdir, readlink
 
 =back
 
@@ -1128,7 +1149,7 @@ Unicode model is not to use UTF-8 until it is absolutely necessary.
 
 =item *
 
-C) writes a Unicode character code point into
+C writes a Unicode character code point into
 a buffer encoding the code point as UTF-8, and returns a pointer
 pointing after the UTF-8 bytes.
 
@@ -1293,8 +1314,12 @@ byte-encoded.
 
 In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
 a caching scheme was introduced which will hopefully make the slowness
-somewhat less spectacular. Operations with UTF-8 encoded strings are
-still slower, though.
+somewhat less spectacular, at least for some operations. In general,
+operations with UTF-8 encoded strings are still slower. As an example,
+the Unicode properties (character classes) like C<\p{Nd}> are known to
+be quite a bit slower (5-20 times) than their simpler counterparts
+like C<\d> (then again, there 268 Unicode characters matching C<\p{Nd}>
+compared with the 10 ASCII characters matching C<\d>).
 
 =head2 Porting code from perl-5.6.X