X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlre.pod;h=7d0ba542f8b3d5aee6ceffffe9df311c561ad0c9;hb=c69f112c145fabe210a7e2c5c2406baeea71af2f;hp=f881a3bcc778ad206ed8e638056b13a0a744af3c;hpb=4a6725af9146bd7faaa10aa5429ff009d393fd6d;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlre.pod b/pod/perlre.pod
index f881a3b..7d0ba54 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -136,7 +136,7 @@ also work:
     \L          lowercase till \E (think vi)
     \U          uppercase till \E (think vi)
     \E          end case modification (think vi)
-    \Q          quote regexp metacharacters till \E
+    \Q          quote (disable) regexp metacharacters till \E
 
 If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u>
 and <\U> is taken from the current locale.  See L<perllocale>.
@@ -163,7 +163,7 @@ Perl defines the following zero-width assertions:
     \B  Match a non-(word boundary)
     \A  Match at only beginning of string
     \Z  Match at only end of string (or before newline at the end)
-    \G  Match only where previous m//g left off
+    \G  Match only where previous m//g left off (works only with /g)
 
 A word boundary (C<\b>) is defined as a spot between two characters that
 has a C<\w> on one side of it and a C<\W> on the other side of it (in
@@ -173,9 +173,10 @@ represents backspace rather than a word boundary.)  The C<\A> and C<\Z> are
 just like "^" and "$" except that they won't match multiple times when the
 C</m> modifier is used, while "^" and "$" will match at every internal line
 boundary.  To match the actual end of the string, not ignoring newline,
-you can use C<\Z(?!\n)>.  The C<\G> assertion can be used to mix global
-matches (using C<m//g>) and non-global ones, as described in
+you can use C<\Z(?!\n)>.  The C<\G> assertion can be used to chain global
+matches (using C<m//g>), as described in
 L<perlop/"Regexp Quote-Like Operators">.
+
 It is also useful when writing C<lex>-like scanners, when you have several
 regexps which you want to match against consequent substrings of your
 string, see the previous reference.
@@ -225,19 +226,20 @@ you've used them once, use them at will, because you've already paid
 the price.
 
 You will note that all backslashed metacharacters in Perl are
-alphanumeric, such as C<\b>, C<\w>, C<\n>.  Unlike some other regular expression
-languages, there are no backslashed symbols that aren't alphanumeric.
-So anything that looks like \\, \(, \), \E<lt>, \E<gt>, \{, or \} is always
-interpreted as a literal character, not a metacharacter.  This makes it
-simple to quote a string that you want to use for a pattern but that
-you are afraid might contain metacharacters.  Quote simply all the
+alphanumeric, such as C<\b>, C<\w>, C<\n>.  Unlike some other regular
+expression languages, there are no backslashed symbols that aren't
+alphanumeric.  So anything that looks like \\, \(, \), \E<lt>, \E<gt>,
+\{, or \} is always interpreted as a literal character, not a
+metacharacter.  This was once used in a common idiom to disable or
+quote the special meanings of regular expression metacharacters in a
+string that you want to use for a pattern.  Simply quote all the
 non-alphanumeric characters:
 
     $pattern =~ s/(\W)/\\$1/g;
 
-You can also use the builtin quotemeta() function to do this.
-An even easier way to quote metacharacters right in the match operator
-is to say
+Now it is much more common to see either the quotemeta() function or
+the \Q escape sequence used to disable the metacharacters special
+meanings like this:
 
     /$unquoted\Q$quoted\E$unquoted/
 
@@ -287,6 +289,104 @@ easier just to say:
 
     if (/foo/ && $` =~ /bar$/)
 
+For lookbehind see below.
+
+=item (?<=regexp)
+
+A zero-width positive lookbehind assertion.  For example, C</(?<=\t)\w+/>
+matches a word following a tab, without including the tab in C<$&>.
+Works only for fixed-width lookbehind.
+
+=item (?<!regexp)
+
+A zero-width negative lookbehind assertion.  For example C</(?<!bar)foo/>
+matches any occurrence of "foo" that isn't following "bar".
+Works only for fixed-width lookbehind.
+
+=item (?{ code })
+
+Experimental "evaluate any Perl code" zero-width assertion.  Always
+succeeds.
+Currently the quoting rules are somewhat convoluted, as is the
+determination where the C<code> ends.
+
+
+=item C<(?E<gt>regexp)>
+
+An "independend" subexpression.  Matches the substring which a
+I<standalone> C<regexp> would match if anchored at the given position,
+B<and only this substring>.
+
+Say, C<^(?E<gt>a*)ab> will never match, since C<(?E<gt>a*)> (anchored
+at the beginning of string, as above) will match I<all> the characters
+C<a> at the beginning of string, leaving no C<a> for C<ab> to match.
+In contrast, C<a*ab> will match the same as C<a+b>, since the match of
+the subgroup C<a*> is influenced by the following group C<ab> (see
+L<"Backtracking">).  In particular, C<a*> inside C<a*ab> will match
+less characters that a standalone C<a*>, since this makes the tail match.
+
+Note that a similar effect to C<(?E<gt>regexp)> may be achieved by
+
+    (?=(regexp))\1
+
+since the lookahead is in I<"logical"> context, thus matches the same
+substring as a standalone C.  The following C<\1> eats the matched
+string, thus making a zero-length assertion into an analogue of
+C<(?>...)>.  (The difference of these two constructions is that the
+second one uses a catching group, thus shifts ordinals of
+backreferences in the rest of a regular expression.)
+
+This construction is very useful for optimizations of "eternal"
+matches, since it will not backtrack (see L<"Backtracking">).  Say,
+
+    / \( (
+      [^()]+
+    |
+      \( [^()]* \)
+    )+
+    \) /x
+
+will match a nonempty group with matching two-or-less-level-deep
+parentheses.  It is very efficient in finding such groups.  However,
+if there is no such group, it is going to take forever (on reasonably
+long string), since there are so many different ways to split a long
+string into several substrings (this is essentially what C<(.+)+> is
+doing, and this is a subpattern of the above pattern).  Say, on
+C<((()aaaaaaaaaaaaaaaaaa> the above pattern detects no-match in 5sec
+(on kitchentop'96 processor), and each extra letter doubles this time.
+
+However, a tiny modification of this
+
+    / \( (
+      (?> [^()]+ )
+    |
+      \( [^()]* \)
+    )+
+    \) /x
+
+which uses (?>...) matches exactly when the above one does (it is a
+good excercise to check this), but finishes in a fourth of the above
+time on a similar string with 1000000 C<a>s.
+
+Note that on simple groups like the above C<(?> [^()]+ )> a similar
+effect may be achieved by negative lookahead, as in C<[^()]+ (?! [^()] )>.
+This was only 4 times slower on a string with 1000000 C<a>s.
+
+=item (?(condition)yes-regexp|no-regexp)
+
+=item (?(condition)yes-regexp)
+
+Conditional expression.  C<(condition)> should be either an integer in
+parentheses (which is valid if the corresponding pair of parentheses
+matched), or lookahead/lookbehind/evaluate zero-width assertion.
+
+Say,
+
+    / ( \( )?
+      [^()]+
+      (?(1) \) )/x
+
+matches a chunk of non-parentheses, possibly included in parentheses
+themselves.
 
 =item (?imsx)
 
@@ -304,6 +404,15 @@ pattern.  For example:
     $pattern = "(?i)foobar";
     if ( /$pattern/ )
 
+Note that these modifiers are localized inside an enclosing group (if
+any).  Say,
+
+    ( (?i) blah ) \s+ \1
+
+(assuming C</x> modifier, and no C</i> modifier outside of this group)
+will match a repeated (I<including the case>!) word C<blah> in any
+case.
+
 =back
 
 The specific choice of question mark for this and the new minimal
@@ -313,10 +422,10 @@ and "question" exactly what is going on.  That's psychology...
 
 =head2 Backtracking
 
-A fundamental feature of regular expression matching involves the notion
-called I<backtracking>. which is used (when needed) by all regular
-expression quantifiers, namely C<*>, C<*?>, C<+>, C<+?>, C<{n,m}>, and
-C<{n,m}?>.
+A fundamental feature of regular expression matching involves the
+notion called I<backtracking>. which is currently used (when needed)
+by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
+C<+?>, C<{n,m}>, and C<{n,m}?>.
 
 For a regular expression to match, the I<entire> regular expression must
 match, not just part of it.
 So if the beginning of a pattern containing a
@@ -496,6 +605,14 @@ time to run
 And if you used C<*>'s instead of limiting it to 0 through 5 matches, then
 it would take literally forever--or until you ran out of stack space.
 
+A powerful tool for optimizing such beasts is "independent" groups,
+which do not backtrace (see L<C<(?E<gt>regexp)>>).  Note also that
+zero-length lookahead/lookbehind assertions will not backtrace to make
+the tail match, since they are in "logical" context: only the fact
+whether they match or not is considered relevant.  For an example
+where side-effects of a lookahead I<might> have influenced the
+following match, see L<C<(?E<gt>regexp)>>.
+
 =head2 Version 8 Regular Expressions
 
 In case you're not familiar with the "regular" Version 8 regexp
@@ -514,7 +631,11 @@ in C<[]>, which will match any one of the characters in the list.  If the
 first character after the "[" is "^", the class matches any character not
 in the list.  Within a list, the "-" character is used to specify a range,
 so that C<a-z> represents all the characters between "a" and "z",
-inclusive.
+inclusive.  If you want "-" itself to be a member of a class, put it
+at the start or end of the list, or escape it with a backslash.  (The
+following all specify the same class of three characters: C<[-az]>,
+C<[az-]>, and C<[a\-z]>.  All are different from C<[a-z]>, which
+specifies a class containing twenty-six characters.)
 
 Characters may be specified using a metacharacter syntax much like that
 used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
@@ -572,3 +693,7 @@ You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
 C<${1}000>.  Basically, the operation of interpolation should not be confused
 with the operation of matching a backreference.  Certainly they mean two
 different things on the I<left> side of the C<s///>.
+
+=head2 SEE ALSO
+
+"Mastering Regular Expressions" (see L) by Jeffrey Friedl.
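
Reviewer's note, not part of the patch: the constructs this change documents (\Q quoting, fixed-width lookbehind, the independent subexpression, the conditional, and a literal "-" in a character class) can be checked with a short script. The following is only a minimal sketch. The variable names and sample strings are illustrative assumptions, and it assumes a perl recent enough to support these constructs, which were experimental when the patch was written.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# \Q...\E disables metacharacters in an interpolated string,
# so the "." in $literal matches only a literal dot.
my $literal = 'a.b';
my $q_exact = ('a.b' =~ /^\Q$literal\E\z/) ? 1 : 0;
my $q_other = ('axb' =~ /^\Q$literal\E\z/) ? 1 : 0;

# Fixed-width lookbehind: a word following a tab; the tab is not matched.
my ($after_tab) = "key\tvalue" =~ /(?<=\t)(\w+)/;

# Independent subexpression: (?>a*) takes every "a" and never gives one
# back, so the following "ab" cannot match.  Plain a* backtracks.
my $independent = ('aaab' =~ /^(?>a*)ab/) ? 1 : 0;
my $backtracks  = ('aaab' =~ /^a*ab/)     ? 1 : 0;

# Conditional: ")" is required only if group 1 matched "(".
my $chunk = qr/^ ( \( )? [^()]+ (?(1) \) ) \z/x;
my @cond  = map { ($_ =~ $chunk) ? 1 : 0 } ('abc', '(abc)', '(abc', 'abc)');

# A leading "-" in a class is literal; "a-z" in a class is a range.
my $hyphen_class = ('m' =~ /^[-az]\z/) ? 1 : 0;
my $range_class  = ('m' =~ /^[a-z]\z/) ? 1 : 0;

print "quoted:      $q_exact $q_other\n";
print "lookbehind:  $after_tab\n";
print "independent: $independent (plain a*: $backtracks)\n";
print "conditional: @cond\n";
print "classes:     [-az]=$hyphen_class [a-z]=$range_class\n";
```

The anchored patterns are deliberate: without "^" the engine could retry the independent subexpression at a later start position and still find a match, which would hide the effect being shown.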