X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlre.pod;h=7d0ba542f8b3d5aee6ceffffe9df311c561ad0c9;hb=c69f112c145fabe210a7e2c5c2406baeea71af2f;hp=2b24379c8bce8f0621dad035b9dc82d19e2994de;hpb=9fa51da436f491e012feddf8b112d19d02b94784;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlre.pod b/pod/perlre.pod index 2b24379..7d0ba54 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -136,7 +136,7 @@ also work: \L lowercase till \E (think vi) \U uppercase till \E (think vi) \E end case modification (think vi) - \Q quote regexp metacharacters till \E + \Q quote (disable) regexp metacharacters till \E If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u> and C<\U> is taken from the current locale. See L<perllocale>. @@ -226,19 +226,20 @@ you've used them once, use them at will, because you've already paid the price. You will note that all backslashed metacharacters in Perl are -alphanumeric, such as C<\b>, C<\w>, C<\n>. Unlike some other regular expression -languages, there are no backslashed symbols that aren't alphanumeric. -So anything that looks like \\, \(, \), \E<lt>, \E<gt>, \{, or \} is always -interpreted as a literal character, not a metacharacter. This makes it -simple to quote a string that you want to use for a pattern but that -you are afraid might contain metacharacters. Quote simply all the +alphanumeric, such as C<\b>, C<\w>, C<\n>. Unlike some other regular +expression languages, there are no backslashed symbols that aren't +alphanumeric. So anything that looks like \\, \(, \), \E<lt>, \E<gt>, +\{, or \} is always interpreted as a literal character, not a +metacharacter. This was once used in a common idiom to disable or +quote the special meanings of regular expression metacharacters in a +string that you want to use for a pattern. Simply quote all the non-alphanumeric characters: $pattern =~ s/(\W)/\\$1/g; -You can also use the builtin quotemeta() function to do this. 
-An even easier way to quote metacharacters right in the match operator -is to say +Now it is much more common to see either the quotemeta() function or +the \Q escape sequence used to disable the metacharacters' special +meanings like this: /$unquoted\Q$quoted\E$unquoted/ @@ -288,6 +289,104 @@ easier just to say: if (/foo/ && $` =~ /bar$/) +For lookbehind see below. + +=item (?<=regexp) + +A zero-width positive lookbehind assertion. For example, C</(?<=\t)\w+/> +matches a word following a tab, without including the tab in C<$&>. +Works only for fixed-width lookbehind. + +=item (?<!regexp) + +A zero-width negative lookbehind assertion. For example C</(?<!bar)foo/> +matches any occurrence of "foo" that isn't following "bar". +Works only for fixed-width lookbehind. + +=item (?{ code }) + +Experimental "evaluate any Perl code" zero-width assertion. Always +succeeds. Currently the quoting rules are somewhat convoluted, as is the +determination where the C<code> ends. + + +=item C<(?E<gt>regexp)> + +An "independent" subexpression. Matches the substring which a +I<standalone> C<regexp> would match if anchored at the given position, +B<and only this substring>. + +Say, C<^(?E<gt>a*)ab> will never match, since C<(?E<gt>a*)> (anchored +at the beginning of string, as above) will match I<all> the characters +C<a> at the beginning of string, leaving no C<a> for C<ab> to match. +In contrast, C<a*ab> will match the same as C<a+b>, since the match of +the subgroup C<a*> is influenced by the following group C<ab> (see +L<"Backtracking">). In particular, C<a*> inside C<a*ab> will match +fewer characters than a standalone C<a*>, since this makes the tail match. + +Note that a similar effect to C<(?E<gt>regexp)> may be achieved by + + (?=(regexp))\1 + +since the lookahead is in I<"logical"> context, thus matches the same +substring as a standalone C<regexp>. The following C<\1> eats the matched +string, thus making a zero-length assertion into an analogue of +C<(?E<gt>...)>. (The difference between these two constructions is that the +second one uses a capturing group, which shifts the ordinals of +backreferences in the rest of a regular expression.) 
+ +This construction is very useful for optimizations of "eternal" +matches, since it will not backtrack (see L<"Backtracking">). Say, + + / \( ( + [^()]+ + | + \( [^()]* \) + )+ + \) /x + +will match a nonempty group with matching two-or-less-level-deep +parentheses. It is very efficient in finding such groups. However, +if there is no such group, it is going to take forever (on a reasonably +long string), since there are so many different ways to split a long +string into several substrings (this is essentially what C<(.+)+> is +doing, and this is a subpattern of the above pattern). Say, on +C<((()aaaaaaaaaaaaaaaaaa> the above pattern detects no-match in 5sec +(on kitchentop'96 processor), and each extra letter doubles this time. + +However, a tiny modification of this + + / \( ( + (?> [^()]+ ) + | + \( [^()]* \) + )+ + \) /x + +which uses (?>...) matches exactly when the above one does (it is a +good exercise to check this), but finishes in a fourth of the above +time on a similar string with 1000000 C<a>s. + +Note that on simple groups like the above C<(?E<gt> [^()]+ )> a similar +effect may be achieved by negative lookahead, as in C<[^()]+ (?! [^()] )>. +This was only 4 times slower on a string with 1000000 C<a>s. + +=item (?(condition)yes-regexp|no-regexp) + +=item (?(condition)yes-regexp) + +Conditional expression. C<(condition)> should be either an integer in +parentheses (which is valid if the corresponding pair of parentheses +matched), or lookahead/lookbehind/evaluate zero-width assertion. + +Say, + + / ( \( )? + [^()]+ + (?(1) \) )/x + +matches a chunk of non-parentheses, possibly enclosed in parentheses +themselves. =item (?imsx) @@ -305,6 +404,15 @@ pattern. For example: $pattern = "(?i)foobar"; if ( /$pattern/ ) +Note that these modifiers are localized inside an enclosing group (if +any). Say, + + ( (?i) blah ) \s+ \1 + +(assuming C<x> modifier, and no C<i> modifier outside of this group) +will match a repeated (I<including the case>!) word C<blah> in any +case. 
+ =back The specific choice of question mark for this and the new minimal @@ -314,10 +422,10 @@ and "question" exactly what is going on. That's psychology... =head2 Backtracking -A fundamental feature of regular expression matching involves the notion -called I<backtracking>. which is used (when needed) by all regular -expression quantifiers, namely C<*>, C<*?>, C<+>, C<+?>, C<{n,m}>, and -C<{n,m}?>. +A fundamental feature of regular expression matching involves the +notion called I<backtracking>, which is currently used (when needed) +by all regular expression quantifiers, namely C<*>, C<*?>, C<+>, +C<+?>, C<{n,m}>, and C<{n,m}?>. For a regular expression to match, the I<entire> regular expression must match, not just part of it. So if the beginning of a pattern containing a @@ -497,6 +605,14 @@ time to run And if you used C<*>'s instead of limiting it to 0 through 5 matches, then it would take literally forever--or until you ran out of stack space. +A powerful tool for optimizing such beasts is "independent" groups, +which do not backtrack (see L<C<(?E<gt>regexp)>>). Note also that +zero-length lookahead/lookbehind assertions will not backtrack to make +the tail match, since they are in "logical" context: only +whether they match or not is considered relevant. For an example +where side-effects of a lookahead I<might> have influenced the +following match, see L<C<(?E<gt>regexp)>>. + =head2 Version 8 Regular Expressions In case you're not familiar with the "regular" Version 8 regexp @@ -515,7 +631,11 @@ in C<[]>, which will match any one of the characters in the list. If the first character after the "[" is "^", the class matches any character not in the list. Within a list, the "-" character is used to specify a range, so that C<a-z> represents all the characters between "a" and "z", -inclusive. +inclusive. If you want "-" itself to be a member of a class, put it +at the start or end of the list, or escape it with a backslash. 
(The +following all specify the same class of three characters: C<[-az]>, +C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which +specifies a class containing twenty-six characters.) Characters may be specified using a metacharacter syntax much like that used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,