From: Yves Orton
Date: Wed, 2 Sep 2009 18:29:13 +0000 (+0200)
Subject: update perlre and perldelta to document change in behaviour of \w and \d and POSIX...
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=6fa80ea2f46c3527db9bfc3ba3a52c12826c85c7;p=p5sagit%2Fp5-mst-13.2.git

update perlre and perldelta to document change in behaviour of \w and \d and POSIX charclasses
---

diff --git a/pod/perl5110delta.pod b/pod/perl5110delta.pod
index 7336692..16266b1 100644
--- a/pod/perl5110delta.pod
+++ b/pod/perl5110delta.pod
@@ -11,6 +11,61 @@ the 5.11.0 development release.
 
 =head1 Incompatible Changes
 
+=head2 Unicode interpretation of \w, \d, \s, and the POSIX character classes redefined.
+
+Previous versions of Perl tried to map POSIX style character class definitions onto
+Unicode property names so that patterns would "dwim" when matches were made against latin-1 or
+unicode strings. This proved to be a mistake, breaking character class negation, causing
+forward compatibility problems (as Unicode keeps updating their property definitions and adding
+new characters), and other problems.
+
+Therefore we have now defined a new set of artificial "unicode" property names which will be
+used to do unicode matching of patterns using POSIX style character classes and perl short-form
+escape character classes like \w and \d.
+
+The key change here is that \d will no longer match every digit in the unicode standard
+(there are thousands) nor will \w match every word character in the standard, instead they
+will match precisely their POSIX or Perl definition.
+
+Those needing to match based on Unicode properties can continue to do so by using the \p{} syntax
+to match whichever property they like, including the new artificial definitions.
+
+B<WARNING:> This is a backwards incompatible no-warning change in behaviour. If you are upgrading
+and you process large volumes of text look for POSIX and Perl style character classes and
+change them to the relevant property name (by removing the word 'Posix' from the current name).
+
+The following table maps the POSIX character class names, the escapes and the old and new
+Unicode property mappings:
+
+    POSIX   Esc Class               New-Property  ! Old-Property
+    ----------------------------------------------+-------------
+    alnum       [0-9A-Za-z]         IsPosixAlnum  ! IsAlnum
+    alpha       [A-Za-z]            IsPosixAlpha  ! IsAlpha
+    ascii       [\000-\177]         IsASCII       = IsASCII
+    blank       [\011 ]             IsPosixBlank  !
+    cntrl       [\0-\37\177]        IsPosixCntrl  ! IsCntrl
+    digit   \d  [0-9]               IsPosixDigit  ! IsDigit
+    graph       [!-~]               IsPosixGraph  ! IsGraph
+    lower       [a-z]               IsPosixLower  ! IsLower
+    print       [ -~]               IsPosixPrint  ! IsPrint
+    punct       [!-/:-@[-`{-~]      IsPosixPunct  ! IsPunct
+    space       [\11-\15 ]          IsPosixSpace  ! IsSpace
+            \s  [\11\12\14\15 ]     IsPerlSpace   ! IsSpacePerl
+    upper       [A-Z]               IsPosixUpper  ! IsUpper
+    word    \w  [0-9A-Z_a-z]        IsPerlWord    ! IsWord
+    xdigit      [0-9A-Fa-f]         IsXDigit      = IsXDigit
+
+If you wish to build perl with the old mapping you may do so by setting
+
+    #define PERL_LEGACY_UNICODE_CHARCLASS_MAPPINGS 1
+
+in regcomp.h, and then setting
+
+    PERL_TEST_LEGACY_POSIX_CC
+
+to true in your environment when testing.
+
+
 =head2 In @INC, move ARCHLIB and PRIVLIB after the current version's site_perl and vendor_perl.
 
 =head2 Switch statement changes
@@ -2294,3 +2349,6 @@ when creation of a temporary file in it fails
 
 =head2 Add a pluggable hook in op_free()
 
+
+
+
diff --git a/pod/perlre.pod b/pod/perlre.pod
index ee1c2cb..1336c5c 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -316,26 +316,34 @@ they must always be used within a character class expression.
     # this is not, and will generate a warning:
     $string =~ /[:alpha:]/;
 
-The available classes and their backslash equivalents (if available) are
-as follows:
-X<character class>
+The following table shows the mapping of POSIX character class
+names, common escapes, literal escape sequences and their equivalent
+Unicode style property names.
+X<character class> X<\p> X<\p{}>
 X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
 X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
 
-    alpha
-    alnum
-    ascii
-    blank               [1]
-    cntrl
-    digit       \d
-    graph
-    lower
-    print
-    punct
-    space       \s      [2]
-    upper
-    word        \w      [3]
-    xdigit
+B<NOTE:> up to Perl 5.10 the property names used were shared with
+standard Unicode properties, this was changed in Perl 5.11, see
+L<perl5110delta> for details.
+
+    POSIX   Esc Class               Property        Note
+    --------------------------------------------------------
+    alnum       [0-9A-Za-z]         IsPosixAlnum
+    alpha       [A-Za-z]            IsPosixAlpha
+    ascii       [\000-\177]         IsASCII
+    blank       [\011 ]             IsPosixBlank    [1]
+    cntrl       [\0-\37\177]        IsPosixCntrl
+    digit   \d  [0-9]               IsPosixDigit
+    graph       [!-~]               IsPosixGraph
+    lower       [a-z]               IsPosixLower
+    print       [ -~]               IsPosixPrint
+    punct       [!-/:-@[-`{-~]      IsPosixPunct
+    space       [\11-\15 ]          IsPosixSpace    [2]
+            \s  [\11\12\14\15 ]     IsPerlSpace     [2]
+    upper       [A-Z]               IsPosixUpper
+    word    \w  [0-9A-Z_a-z]        IsPerlWord      [3]
+    xdigit      [0-9A-Fa-f]         IsXDigit
 
 =over
 
@@ -345,8 +353,9 @@ A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".
 
 =item [2]
 
-Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
-also the (very rare) "vertical tabulator", "\cK" or chr(11) in ASCII.
+Note that C<\s> and C<[[:space:]]> are B<not> equivalent as C<[[:space:]]>
+includes also the (very rare) "vertical tabulator", "\cK" or chr(11) in
+ASCII.
 
 =item [3]
 
@@ -362,58 +371,6 @@ whole character class. For example:
 
 matches zero, one, any alphabetic character, and the percent sign.
 
-The following equivalences to Unicode \p{} constructs and equivalent
-backslash character classes (if available), will hold:
-X<character class> X<\p> X<\p{}>
-
-    [[:...:]]   \p{...}         backslash
-
-    alpha       IsAlpha
-    alnum       IsAlnum
-    ascii       IsASCII
-    blank
-    cntrl       IsCntrl
-    digit       IsDigit         \d
-    graph       IsGraph
-    lower       IsLower
-    print       IsPrint         (but see [2] below)
-    punct       IsPunct         (but see [3] below)
-    space       IsSpace
-                IsSpacePerl     \s
-    upper       IsUpper
-    word        IsWord          \w
-    xdigit      IsXDigit
-
-For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
-
-However, the equivalence between C<[[:xxxxx:]]> and C<\p{IsXxxxx}>
-is not exact.
-
-=over 4
-
-=item [1]
-
-If the C<utf8> pragma is not used but the C<locale> pragma is, the
-classes correlate with the usual isalpha(3) interface (except for
-"word" and "blank").
-
-But if the C<locale> or C<encoding> pragmas are not used and
-the string is not C<utf8>, then C<[[:xxxxx:]]> (and C<\w>, etc.)
-will not match characters 0x80-0xff; whereas C<\p{IsXxxxx}> will
-force the string to C<utf8> and can match these characters
-(as Unicode).
-
-=item [2]
-
-C<\p{IsPrint}> matches characters 0x09-0x0d but C<[[:print:]]> does not.
-
-=item [3]
-
-C<[[:punct::]]> matches the following but C<\p{IsPunct}> does not,
-because they are classed as symbols (not punctuation) in Unicode.
-
-=over 4
-
 =item C<$>
 
 Currency symbol
@@ -473,9 +430,9 @@ X
 
     POSIX         traditional  Unicode
 
-    [[:^digit:]]      \D      \P{IsDigit}
-    [[:^space:]]      \S      \P{IsSpace}
-    [[:^word:]]       \W      \P{IsWord}
+    [[:^digit:]]      \D      \P{IsPosixDigit}
+    [[:^space:]]      \S      \P{IsPosixSpace}
+    [[:^word:]]       \W      \P{IsPerlWord}
 
 Perl respects the POSIX standard in that POSIX character classes are
 only supported within a character class. The POSIX character classes