X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlre.pod;h=1df6ba3d8a34c1f8e1b6131c8735df3e00c8cfd6;hb=06eaf0bc49fea082c8b8358680815d807a7a925e;hp=b7fda54061250d7487749bdeeabb725daee2440e;hpb=a0ed51b321531af4b47cce24205ab9656f043f0f;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlre.pod b/pod/perlre.pod
index b7fda54..1df6ba3 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -7,12 +7,13 @@ perlre - Perl regular expressions
 This page describes the syntax of regular expressions in Perl.  For a
 description of how to I<use> regular expressions in matching
 operations, plus various examples of the same, see discussion
-of C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/Regexp Quote-Like Operators>.
+of C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like Operators">.
 
 The matching operations can have various modifiers.  The modifiers
 that relate to the interpretation of the regular expression inside
-are listed below.  For the modifiers that alter the way regular expression
-is used by Perl, see L<perlop/Regexp Quote-Like Operators>.
+are listed below.  For the modifiers that alter the way a regular expression
+is used by Perl, see L<perlop/"Regexp Quote-Like Operators"> and 
+L<perlop/"Gory details of parsing quoted constructs">.
 
 =over 4
 
@@ -168,7 +169,8 @@ In addition, Perl defines the following:
     \D	Match a non-digit character
     \pP	Match P, named property.  Use \p{Prop} for longer names.
     \PP	Match non-P
-    \X	Match eXtended Unicode "combining character sequence", \pM\pm*
+    \X	Match eXtended Unicode "combining character sequence",
+        equivalent to (?:\PM\pM*)
     \C	Match a single C char (octet) even under utf8.
 
 A C<\w> matches a single alphanumeric character, not a whole
@@ -346,10 +348,6 @@ Experimental "evaluate any Perl code" zero-width assertion.  Always
 succeeds.  C<code> is not interpolated.  Currently the rules to
 determine where the C<code> ends are somewhat convoluted.
 
-Owing to the risks to security, this is only available when the
-C<use re 'eval'> pragma is used, and then only for patterns that don't
-have any variables that must be interpolated at run time.
-
 The C<code> is properly scoped in the following sense: if the assertion
 is backtracked (compare L<"Backtracking">), all the changes introduced after
 C<local>isation are undone, so
@@ -380,6 +378,50 @@ other C<(?{ code })> assertions inside the same regular expression.
 The above assignment to $^R is properly localized, thus the old value of $^R
 is restored if the assertion is backtracked (compare L<"Backtracking">).
 
+Due to security concerns, this construct is not allowed if the regular
+expression involves run-time interpolation of variables, unless the
+C<use re 'eval'> pragma is used (see L<re>), or the variables contain
+results of the C<qr//> operator (see L<perlop/"qr/STRING/imosx">).
+
+This restriction is due to the widespread (questionable) practice of
+using the construct
+
+    $re = <>;
+    chomp $re;
+    $string =~ /$re/;
+
+without tainting.  While this code is frowned upon from a security point
+of view, when C<(?{})> was introduced, it was considered bad to add
+I<new> security holes to existing scripts.
+
+B<NOTE:>  Use of the above insecure snippet without also enabling taint mode
+is to be severely frowned upon.  C<use re 'eval'> does not disable tainting
+checks, thus to allow $re in the above snippet to contain C<(?{})>
+I<with tainting enabled>, one needs both to C<use re 'eval'> and to untaint
+$re.
+
+=item C<(?p{ code })>
+
+I<Very experimental> "postponed" regular subexpression.  C<code> is evaluated
+at run time, at the moment this subexpression may match.  The result of
+the evaluation is treated as a regular expression and matched as if it
+were inserted in place of this construct.
+
+C<code> is not interpolated.  Currently the rules to
+determine where the C<code> ends are somewhat convoluted.
+
+The following regular expression matches a group with balanced parentheses:
+
+  $re = qr{
+	     \(
+	     (?:
+		(?> [^()]+ )	# Non-parens without backtracking
+	      |
+		(?p{ $re })	# Group with matching parens
+	     )*
+	     \)
+	  }x;
+
 =item C<(?E<gt>pattern)>
 
 An "independent" subexpression.  Matches the substring that a
@@ -392,7 +434,7 @@ C<a> at the beginning of string, leaving no C<a> for C<ab> to match.
 In contrast, C<a*ab> will match the same as C<a+b>, since the match of
 the subgroup C<a*> is influenced by the following group C<ab> (see
 L<"Backtracking">).  In particular, C<a*> inside C<a*ab> will match
-less characters that a standalone C<a*>, since this makes the tail match.
+fewer characters than a standalone C<a*>, since this makes the tail match.
 
 An effect similar to C<(?E<gt>pattern)> may be achieved by
 
@@ -401,40 +443,42 @@ An effect similar to C<(?E<gt>pattern)> may be achieved by
 since the lookahead is in I<"logical"> context, thus matches the same
 substring as a standalone C<a+>.  The following C<\1> eats the matched
 string, thus making a zero-length assertion into an analogue of
-C<(?>...)>.  (The difference between these two constructs is that the
+C<(?E<gt>...)>.  (The difference between these two constructs is that the
 second one uses a catching group, thus shifting ordinals of
 backreferences in the rest of a regular expression.)
 
 This construct is useful for optimizations of "eternal"
 matches, because it will not backtrack (see L<"Backtracking">).  
 
-    m{  \( ( 
-	 [^()]+ 
-       | 
-         \( [^()]* \)
-       )+
-	\) 
-    }x
+    m{ \(
+	  ( 
+	    [^()]+ 
+          | 
+            \( [^()]* \)
+          )+
+       \) 
+     }x
 
 That will efficiently match a nonempty group with matching
 two-or-less-level-deep parentheses.  However, if there is no such group,
 it will take virtually forever on a long string.  That's because there are
 so many different ways to split a long string into several substrings.
-This is essentially what C<(.+)+> is doing, and this is a subpattern
-of the above pattern.  Consider that C<((()aaaaaaaaaaaaaaaaaa> on the
-pattern above detects no-match in several seconds, but that  each extra
+This is what C<(.+)+> is doing, and C<(.+)+> is similar to a subpattern
+of the above pattern.  Consider that the above pattern detects no-match
+on C<((()aaaaaaaaaaaaaaaaaa> in several seconds, but that each extra
 letter doubles this time.  This exponential performance will make it
 appear that your program has hung.
 
 However, a tiny modification of this pattern 
 
-    m{ \( ( 
-	 (?> [^()]+ )
-       | 
-         \( [^()]* \)
-       )+
-	\) 
-    }x
+    m{ \( 
+	  ( 
+	    (?> [^()]+ )
+          | 
+            \( [^()]* \)
+          )+
+       \) 
+     }x
 
 which uses C<(?E<gt>...)> matches exactly when the one above does (verifying
 this yourself would be a productive exercise), but finishes in a fourth
@@ -442,7 +486,7 @@ the time when used on a similar string with 1000000 C<a>s.  Be aware,
 however, that this pattern currently triggers a warning message under
 B<-w> saying it C<"matches the null string many times">):
 
-On simple groups, such as the pattern C<(?> [^()]+ )>, a comparable
+On simple groups, such as the pattern C<(?E<gt> [^()]+ )>, a comparable
 effect may be achieved by negative lookahead, as in C<[^()]+ (?! [^()] )>.
 This was only 4 times slower on a string with 1000000 C<a>s.
 
@@ -457,9 +501,9 @@ matched), or lookahead/lookbehind/evaluate zero-width assertion.
 Say,
 
     m{ ( \( )? 
-      [^()]+ 
+       [^()]+ 
        (?(1) \) ) 
-    }x
+     }x
 
 matches a chunk of non-parentheses, possibly included in parentheses
 themselves.
@@ -608,10 +652,10 @@ When using lookahead assertions and negations, this can all get even
 tricker.  Imagine you'd like to find a sequence of non-digits not
 followed by "123".  You might try to write that as
 
-	$_ = "ABC123";
-	if ( /^\D*(?!123)/ ) {				# Wrong!
-	    print "Yup, no 123 in $_\n";
-	}
+    $_ = "ABC123";
+    if ( /^\D*(?!123)/ ) {		# Wrong!
+	print "Yup, no 123 in $_\n";
+    }
 
 But that isn't going to match; at least, not the way you're hoping.  It
 claims that there is no 123 in the string.  Here's a clearer picture of
@@ -714,6 +758,13 @@ following all specify the same class of three characters: C<[-az]>,
 C<[az-]>, and C<[a\-z]>.  All are different from C<[a-z]>, which
 specifies a class containing twenty-six characters.)
 
+Note also that the whole range idea is rather unportable between
+character sets, and even within a character set a range may produce
+results you probably didn't expect.  A sound principle is to use only
+ranges that begin and end at letters of the same case (C<[a-e]>,
+C<[A-E]>) or at digits (C<[0-9]>).  Anything else is unsafe.  If in
+doubt, spell out the character sets in full.
+
 Characters may be specified using a metacharacter syntax much like that
 used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
 "\f" a form feed, etc.  More generally, \I<nnn>, where I<nnn> is a string
@@ -904,6 +955,8 @@ part of this regular expression needs to be converted explicitly
 
 L<perlop/"Regexp Quote-Like Operators">.
 
+L<perlop/"Gory details of parsing quoted constructs">.
+
 L<perlfunc/pos>.
 
 L<perllocale>.