From: Ilya Zakharevich
Date: Sat, 18 Jul 1998 23:11:13 +0000 (-0400)
Subject: applied RE doc patches, with tweaks to the prose
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=c84d73f1f9f9cc1f8dd8d87123bb87479c2f2754;p=p5sagit%2Fp5-mst-13.2.git

applied RE doc patches, with tweaks to the prose
    Date: Sat, 18 Jul 1998 23:11:13 -0400 (EDT)
    Message-Id: <199807190311.XAA25080@monk.mps.ohio-state.edu>
    Subject: [PATCH 5.004_72] Document irregular zero-length matches
    --
    Date: Sun, 19 Jul 1998 00:38:44 -0400 (EDT)
    Message-Id: <199807190438.AAA26226@monk.mps.ohio-state.edu>
    Subject: [PATCH 5.004_72] Another irregularity of expressions documented

p4raw-id: //depot/perl@1598
---

diff --git a/pod/perlre.pod b/pod/perlre.pod
index fc4d969..924a2c4 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -775,6 +775,126 @@ C<${1}000>.  Basically, the operation of interpolation should not be confused
 with the operation of matching a backreference.  Certainly they mean two
 different things on the I<left> side of the C<s///>.
 
+=head2 Repeated patterns matching zero-length substring
+
+WARNING: Difficult material (and prose) ahead.  This section needs a rewrite.
+
+Regular expressions provide a terse and powerful programming language.  As
+with most other power tools, power comes together with the ability
+to wreak havoc.
+
+A common abuse of this power stems from the ability to make infinite
+loops using regular expressions, with something as innocuous as:
+
+    'foo' =~ m{ ( o? )* }x;
+
+The C<o?> can match at the beginning of C<'foo'>, and since the position
+in the string is not moved by the match, C<o?> would match again and again
+due to the C<*> modifier.  Another common way to create a similar cycle
+is with the looping modifier C<//g>:
+
+    @matches = ( 'foo' =~ m{ o? }xg );
+
+or
+
+    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
+
+or the loop implied by split().
+
+However, long experience has shown that many programming tasks may
+be significantly simplified by using repeated subexpressions which
+may match zero-length substrings, with a simple example being:
+
+    @chars = split //, $string;           # // is not magic in split
+    ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
+
+Thus Perl allows the C</()/> construct, which I<forcefully breaks
+the infinite loop>.  The rules for this are different for lower-level
+loops given by the greedy modifiers C<*+{}>, and for higher-level
+ones like the C</g> modifier or split() operator.
+
+The lower-level loops are I<interrupted> when it is detected that a
+repeated expression did match a zero-length substring, thus
+
+   m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
+
+is made equivalent to
+
+   m{   (?: NON_ZERO_LENGTH )*
+      |
+        (?: ZERO_LENGTH )?
+    }x;
+
+The higher-level loops preserve an additional state between iterations:
+whether the last match was zero-length.  To break the loop, the match
+that follows a zero-length match is not allowed to be of zero length.
+This prohibition interacts with backtracking (see L<"Backtracking">),
+and so the I<second best> match is chosen if the I<best> match is of
+zero length.
+
+Say,
+
+    $_ = 'bar';
+    s/\w??/<$&>/g;
+
+results in C<"<><b><><a><><r><>">.  At each position of the string the best
+match given by non-greedy C<??> is the zero-length match, and the I<second
+best> match is what is matched by C<\w>.  Thus zero-length matches
+alternate with one-character-long matches.
+
+Similarly, for repeated C<m/()/g> the second-best match is the match at the
+position one notch further in the string.
+
+The additional state of being I<matched with zero-length> is associated with
+the matched string, and is reset by each assignment to pos().
+
+=head2 Creating custom RE engines
+
+Overloaded constants (see L<overload>) provide a simple way to extend
+the functionality of the RE engine.
+
+Suppose that we want to enable a new RE escape-sequence C<\Y|> which
+matches at a boundary between whitespace characters and non-whitespace
+characters.  Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
+at these positions, so we want to have each C<\Y|> in the place of the
+more complicated version.  We can create a module C<customre> to do
+this:
+
+    package customre;
+    use overload;
+
+    sub import {
+      shift;
+      die "No argument to customre::import allowed" if @_;
+      overload::constant 'qr' => \&convert;
+    }
+
+    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}
+
+    my %rules = ( '\\' => '\\',
+                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
+    sub convert {
+      my $re = shift;
+      $re =~ s{
+                \\ ( \\ | Y . )
+              }
+              { $rules{$1} or invalid($re,$1) }sgex;
+      return $re;
+    }
+
+Now C<use customre> enables the new escape in constant regular
+expressions, i.e., those without any runtime variable interpolations.
+As documented in L<overload>, this conversion will work only over
+literal parts of regular expressions.  For C<\Y|$re\Y|> the variable
+part of this regular expression needs to be converted explicitly
+(but only if the special meaning of C<\Y|> should be enabled inside $re):
+
+    use customre;
+    $re = <>;
+    chomp $re;
+    $re = customre::convert $re;
+    /\Y|$re\Y|/;
+
 =head2 SEE ALSO
 
 L<perlfunc/pos>.
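
The zero-length matching rules documented by this patch can be checked with a short script. What follows is a minimal sketch and not part of the patch: the variable names are illustrative, and a capture group with C<$1> stands in for C<$&> so that the script stays clean under C<use strict> and C<use warnings>. The last part applies the whitespace-boundary pattern from the custom-engine section directly, without the C<customre> module.

```perl
use strict;
use warnings;

# Lower-level loop: the starred group can match the empty string forever,
# but Perl interrupts the repetition and the match as a whole succeeds.
my $terminates = ( 'foo' =~ m{ ( o? )* }x ) ? 1 : 0;

# Higher-level loop (s///g): after an empty match, the next match at the
# same position must be non-empty, so empty and one-character matches
# alternate along the string.
( my $marked = 'bar' ) =~ s/(\w??)/<$1>/g;

# Repeated m/()/g: the pattern can only match the empty string, so each
# match lands one position further along than the previous one.
my $string = 'bar';
my @positions;
push @positions, pos($string) while $string =~ m/()/g;

# split with an empty pattern relies on the same rule.
my @chars = split //, 'bar';

# The boundary pattern from the custom-engine section, used as written.
my $text = 'foo bar';
my @boundaries;
push @boundaries, pos($text)
    while $text =~ m/(?=\S)(?<!\S)|(?!\S)(?<=\S)/g;

print "terminates: $terminates\n";
print "marked:     $marked\n";
print "positions:  @positions\n";
print "chars:      @chars\n";
print "boundaries: @boundaries\n";
```

Run with a plain C<perl> invocation, the script should report C<< <><b><><a><><r><> >> for the substitution, positions 0 through 3 for the repeated empty match, and boundaries at offsets 0, 3, 4 and 7 of C<'foo bar'>.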