X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlre.pod;h=1df6ba3d8a34c1f8e1b6131c8735df3e00c8cfd6;hb=06eaf0bc49fea082c8b8358680815d807a7a925e;hp=b7fda54061250d7487749bdeeabb725daee2440e;hpb=a0ed51b321531af4b47cce24205ab9656f043f0f;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlre.pod b/pod/perlre.pod
index b7fda54..1df6ba3 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -7,12 +7,13 @@ perlre - Perl regular expressions
 This page describes the syntax of regular expressions in Perl. For a
 description of how to I regular expressions in matching
 operations, plus various examples of the same, see discussion
-of C, C, C and C in L.
+of C, C, C and C in L.
 
 The matching operations can have various modifiers. The modifiers
 that relate to the interpretation of the regular expression inside
-are listed below. For the modifiers that alter the way regular expression
-is used by Perl, see L.
+are listed below. For the modifiers that alter the way a regular expression
+is used by Perl, see L and
+L.
 
 =over 4
 
@@ -168,7 +169,8 @@ In addition, Perl defines the following:
     \D Match a non-digit character
     \pP Match P, named property. Use \p{Prop} for longer names.
     \PP Match non-P
-    \X Match eXtended Unicode "combining character sequence", \pM\pm*
+    \X Match eXtended Unicode "combining character sequence",
+       equivalent to C<(?:\PM\pM*)>
     \C Match a single C char (octet) even under utf8.
 
 A C<\w> matches a single alphanumeric character, not a whole
@@ -346,10 +348,6 @@ Experimental "evaluate any Perl code" zero-width assertion. Always
 succeeds. C is not interpolated. Currently the rules to
 determine where the C ends are somewhat convoluted.
 
-Owing to the risks to security, this is only available when the
-C pragma is used, and then only for patterns that don't
-have any variables that must be interpolated at run time.
-
 The C is properly scoped in the following sense: if the assertion
 is backtracked (compare L<"Backtracking">), all the changes introduced after
 Cisation are undone, so
@@ -380,6 +378,50 @@ other C<(?{ code })> assertions inside the same regular expression.
 The above assignment to $^R is properly localized, thus the old value of $^R
 is restored if the assertion is backtracked (compare L<"Backtracking">).
 
+Due to security concerns, this construction is not allowed if the regular
+expression involves run-time interpolation of variables, unless
+C pragma is used (see L), or the variables contain
+results of qr() operator (see L).
+
+This restriction is due to the wide-spread (questionable) practice of
+using the construct
+
+    $re = <>;
+    chomp $re;
+    $string =~ /$re/;
+
+without tainting. While this code is frowned upon from security point
+of view, when C<(?{})> was introduced, it was considered bad to add
+I security holes to existing scripts.
+
+B Use of the above insecure snippet without also enabling taint mode
+is to be severely frowned upon. C does not disable tainting
+checks, thus to allow $re in the above snippet to contain C<(?{})>
+I, one needs both C and untaint
+the $re.
+
+=item C<(?p{ code })>
+
+I "postponed" regular subexpression. C is evaluated
+at runtime, at the moment this subexpression may match. The result of
+evaluation is considered as a regular expression, and matched as if it
+were inserted instead of this construct.
+
+C is not interpolated. Currently the rules to
+determine where the C ends are somewhat convoluted.
+
+The following regular expression matches matching parenthesized group:
+
+    $re = qr{
+              \(
+              (?:
+                 (?> [^()]+ )    # Non-parens without backtracking
+               |
+                 (?p{ $re })     # Group with matching parens
+              )*
+              \)
+           }x;
+
 =item C<(?Epattern)>
 
 An "independent" subexpression. Matches the substring that a
@@ -392,7 +434,7 @@ C at the beginning of string, leaving no C for C to match.
 In contrast, C will match the same as C, since the match
 of the subgroup C is influenced by the following group C (see
 L<"Backtracking">). In particular, C inside C will match
-less characters that a standalone C, since this makes the tail match.
+fewer characters than a standalone C, since this makes the tail match.
 
 An effect similar to C<(?Epattern)> may be achieved by
 
@@ -401,40 +443,42 @@ An effect similar to C<(?Epattern)> may be achieved by
 since the lookahead is in I<"logical"> context, thus matches the same
 substring as a standalone C. The following C<\1> eats the matched
 string, thus making a zero-length assertion into an analogue of
-C<(?>...)>. (The difference between these two constructs is that the
+C<(?E...)>. (The difference between these two constructs is that the
 second one uses a catching group, thus shifting ordinals of
 backreferences in the rest of a regular expression.)
 
 This construct is useful for optimizations of "eternal"
 matches, because it will not backtrack (see L<"Backtracking">).
 
-    m{ \( (
-        [^()]+
-        |
-        \( [^()]* \)
-        )+
-        \)
-    }x
+    m{ \(
+        (
+        [^()]+
+        |
+        \( [^()]* \)
+        )+
+        \)
+    }x
 
 That will efficiently match a nonempty group with matching
 two-or-less-level-deep parentheses. However, if there is no such group,
 it will take virtually forever on a long string. That's because there are
 so many different ways to split a long string into several substrings.
-This is essentially what C<(.+)+> is doing, and this is a subpattern
-of the above pattern. Consider that C<((()aaaaaaaaaaaaaaaaaa> on the
-pattern above detects no-match in several seconds, but that each extra
+This is what C<(.+)+> is doing, and C<(.+)+> is similar to a subpattern
+of the above pattern. Consider that the above pattern detects no-match
+on C<((()aaaaaaaaaaaaaaaaaa> in several seconds, but that each extra
 letter doubles this time. This exponential performance will make it
 appear that your program has hung.
 
 However, a tiny modification of this pattern
 
-    m{ \( (
-        (?> [^()]+ )
-        |
-        \( [^()]* \)
-        )+
-        \)
-    }x
+    m{ \(
+        (
+        (?> [^()]+ )
+        |
+        \( [^()]* \)
+        )+
+        \)
+    }x
 
 which uses C<(?E...)> matches exactly when the one above does (verifying
 this yourself would be a productive exercise), but finishes in a fourth
@@ -442,7 +486,7 @@ the time when used on a similar string with 1000000 Cs. Be aware,
 however, that this pattern currently triggers a warning message under
 B<-w> saying it C<"matches the null string many times">):
 
-On simple groups, such as the pattern C<(?> [^()]+ )>, a comparable
+On simple groups, such as the pattern C<(?E [^()]+ )>, a comparable
 effect may be achieved by negative lookahead, as in C<[^()]+ (?! [^()] )>.
 This was only 4 times slower on a string with 1000000 Cs.
 
@@ -457,9 +501,9 @@ matched), or lookahead/lookbehind/evaluate zero-width assertion.
 Say,
 
     m{ ( \( )?
-       [^()]+
+       [^()]+
        (?(1) \) )
-     }x
+     }x
 
 matches a chunk of non-parentheses, possibly included in parentheses
 themselves.
@@ -608,10 +652,10 @@ When using lookahead assertions and negations, this can all get even
 tricker. Imagine you'd like to find a sequence of non-digits not
 followed by "123". You might try to write that as
 
-    $_ = "ABC123";
-    if ( /^\D*(?!123)/ ) { # Wrong!
-        print "Yup, no 123 in $_\n";
-    }
+    $_ = "ABC123";
+    if ( /^\D*(?!123)/ ) { # Wrong!
+        print "Yup, no 123 in $_\n";
+    }
 
 But that isn't going to match; at least, not the way you're hoping. It
 claims that there is no 123 in the string. Here's a clearer picture of
@@ -714,6 +758,13 @@ following all specify the same class of three characters: C<[-az]>,
 C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
 specifies a class containing twenty-six characters.)
 
+Note also that the whole range idea is rather unportable between
+character sets--and even within character sets they may cause results
+you probably didn't expect. A sound principle is to use only ranges
+that begin from and end at either alphabets of equal case ([a-e],
+[A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt,
+spell out the character sets in full.
+
 Characters may be specified using a metacharacter syntax much like that
 used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
 "\f" a form feed, etc. More generally, \I, where I is a string
@@ -904,6 +955,8 @@ part of this regular expression needs to be converted explicitly
 
 L.
 
+L.
+
 L.
 
 L.
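
The -401,40 hunk above states that the pattern using the independent
subexpression "matches exactly when the one above does". That claim can be
spot-checked with a short script. This is a minimal sketch and not part of
the patch: the variable names and the sample strings are assumptions chosen
for illustration, and the inputs are kept short so that the exponential
no-match case described in the hunk is never reached.

```perl
use strict;
use warnings;

# Plain version from the hunk: the inner run [^()]+ may be split in many ways.
my $plain = qr{ \(
                (
                  [^()]+
                |
                  \( [^()]* \)
                )+
                \)
              }x;

# Same pattern, with the inner run wrapped in an independent (atomic) group
# so that it is never re-split during backtracking.
my $atomic = qr{ \(
                 (
                   (?> [^()]+ )
                 |
                   \( [^()]* \)
                 )+
                 \)
               }x;

# Hypothetical sample inputs: some contain a suitable group, some do not.
my @samples = ('(a(b)c)', 'x(ab)y', '((a))', '(()', 'abc');

for my $s (@samples) {
    my $plain_hit  = ($s =~ $plain)  ? 1 : 0;
    my $atomic_hit = ($s =~ $atomic) ? 1 : 0;
    printf "%-10s plain=%d atomic=%d\n", $s, $plain_hit, $atomic_hit;
}
```

On these inputs the two columns should agree on every row. The script says
nothing about timing; to observe the slowdown described in the hunk, the
no-match input would have to be made progressively longer.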