This page describes the syntax of regular expressions in Perl. For a
description of how to I<use> regular expressions in matching
operations, plus various examples of the same, see discussion
-of C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/Regexp Quote-Like Operators>.
+of C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like Operators">.
The matching operations can have various modifiers. The modifiers
that relate to the interpretation of the regular expression inside
-are listed below. For the modifiers that alter the way regular expression
-is used by Perl, see L<perlop/Regexp Quote-Like Operators>.
+are listed below. For the modifiers that alter the way a regular expression
+is used by Perl, see L<perlop/"Regexp Quote-Like Operators"> and
+L<perlop/"Gory details of parsing quoted constructs">.
=over 4
Treat string as multiple lines. That is, change "^" and "$" from matching
at only the very start or end of the string to the start or end of any
-line anywhere within the string,
+line anywhere within the string.
=item s
(If a curly bracket occurs in any other context, it is treated
as a regular character.) The "*" modifier is equivalent to C<{0,}>, the "+"
modifier to C<{1,}>, and the "?" modifier to C<{0,1}>. n and m are limited
-to integral values less than 65536.
+to integral values less than a preset limit defined when perl is built.
+This is usually 32766 on the most common platforms. The actual limit can
+be seen in the error message generated by code such as this:
+
+ $_ **= $_ , / {$_} / for 2 .. 42;
By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
\e escape (think troff) (ESC)
\033 octal char (think of a PDP-11)
\x1B hex char
+ \x{263a} wide hex char (Unicode SMILEY)
\c[ control char
\l lowercase next char (think vi)
\u uppercase next char (think vi)
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
+ \pP Match P, named property. Use \p{Prop} for longer names.
+ \PP Match non-P
+ \X Match eXtended Unicode "combining character sequence",
+ equivalent to C<(?:\PM\pM*)>
+ \C Match a single C char (octet) even under utf8.
A C<\w> matches a single alphanumeric character, not a whole
word. To match a word you'd need to say C<\w+>. If C<use locale> is in
succeeds. C<code> is not interpolated. Currently the rules to
determine where the C<code> ends are somewhat convoluted.
-Owing to the risks to security, this is only available when the
-C<use re 'eval'> pragma is used, and then only for patterns that don't
-have any variables that must be interpolated at run time.
-
The C<code> is properly scoped in the following sense: if the assertion
is backtracked (compare L<"Backtracking">), all the changes introduced after
C<local>isation are undone, so
The above assignment to $^R is properly localized, thus the old value of $^R
is restored if the assertion is backtracked (compare L<"Backtracking">).
+Due to security concerns, this construction is not allowed if the regular
+expression involves run-time interpolation of variables, unless the
+C<use re 'eval'> pragma is used (see L<re>), or the variables contain
+the results of the qr() operator (see L<perlop/"qr/STRING/imosx">).
+
+This restriction is due to the wide-spread (questionable) practice of
+using the construct
+
+ $re = <>;
+ chomp $re;
+ $string =~ /$re/;
+
+without tainting. While this code is frowned upon from a security point
+of view, when C<(?{})> was introduced, it was considered bad to add
+I<new> security holes to existing scripts.
+
+B<NOTE:> Use of the above insecure snippet without also enabling taint mode
+is to be severely frowned upon. C<use re 'eval'> does not disable tainting
+checks, thus to allow $re in the above snippet to contain C<(?{})>
+I<with tainting enabled>, one needs both to C<use re 'eval'> and to untaint
+$re.
+
+=item C<(?p{ code })>
+
+I<Very experimental> "postponed" regular subexpression. C<code> is evaluated
+at runtime, at the moment this subexpression may match. The result of
+evaluation is treated as a regular expression and matched as if it
+were inserted in place of this construct.
+
+C<code> is not interpolated. Currently the rules to
+determine where the C<code> ends are somewhat convoluted.
+
+The following regular expression matches a balanced parenthesized group:
+
+ $re = qr{
+ \(
+ (?:
+ (?> [^()]+ ) # Non-parens without backtracking
+ |
+ (?p{ $re }) # Group with matching parens
+ )*
+ \)
+ }x;
+
=item C<(?E<gt>pattern)>
An "independent" subexpression. Matches the substring that a
In contrast, C<a*ab> will match the same as C<a+b>, since the match of
the subgroup C<a*> is influenced by the following group C<ab> (see
L<"Backtracking">). In particular, C<a*> inside C<a*ab> will match
-less characters that a standalone C<a*>, since this makes the tail match.
+fewer characters than a standalone C<a*>, since this makes the tail match.
An effect similar to C<(?E<gt>pattern)> may be achieved by
since the lookahead is in I<"logical"> context, thus matches the same
substring as a standalone C<a+>. The following C<\1> eats the matched
string, thus making a zero-length assertion into an analogue of
-C<(?>...)>. (The difference between these two constructs is that the
+C<(?E<gt>...)>. (The difference between these two constructs is that the
second one uses a catching group, thus shifting ordinals of
backreferences in the rest of a regular expression.)
This construct is useful for optimizations of "eternal"
matches, because it will not backtrack (see L<"Backtracking">).
- m{ \( (
- [^()]+
- |
- \( [^()]* \)
- )+
- \)
- }x
+ m{ \(
+ (
+ [^()]+
+ |
+ \( [^()]* \)
+ )+
+ \)
+ }x
That will efficiently match a nonempty group with matching
two-or-less-level-deep parentheses. However, if there is no such group,
it will take virtually forever on a long string. That's because there are
so many different ways to split a long string into several substrings.
-This is essentially what C<(.+)+> is doing, and this is a subpattern
-of the above pattern. Consider that C<((()aaaaaaaaaaaaaaaaaa> on the
-pattern above detects no-match in several seconds, but that each extra
+This is what C<(.+)+> is doing, and C<(.+)+> is similar to a subpattern
+of the above pattern. Consider that the above pattern detects no-match
+on C<((()aaaaaaaaaaaaaaaaaa> in several seconds, but that each extra
letter doubles this time. This exponential performance will make it
appear that your program has hung.
However, a tiny modification of this pattern
- m{ \( (
- (?> [^()]+ )
- |
- \( [^()]* \)
- )+
- \)
- }x
+ m{ \(
+ (
+ (?> [^()]+ )
+ |
+ \( [^()]* \)
+ )+
+ \)
+ }x
which uses C<(?E<gt>...)> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a fourth
however, that this pattern currently triggers a warning message under
B<-w> saying it C<"matches the null string many times">):
-On simple groups, such as the pattern C<(?> [^()]+ )>, a comparable
+On simple groups, such as the pattern C<(?E<gt> [^()]+ )>, a comparable
effect may be achieved by negative lookahead, as in C<[^()]+ (?! [^()] )>.
This was only 4 times slower on a string with 1000000 C<a>s.
Say,
m{ ( \( )?
- [^()]+
+ [^()]+
(?(1) \) )
- }x
+ }x
matches a chunk of non-parentheses, possibly included in parentheses
themselves.
tricker. Imagine you'd like to find a sequence of non-digits not
followed by "123". You might try to write that as
- $_ = "ABC123";
- if ( /^\D*(?!123)/ ) { # Wrong!
- print "Yup, no 123 in $_\n";
- }
+ $_ = "ABC123";
+ if ( /^\D*(?!123)/ ) { # Wrong!
+ print "Yup, no 123 in $_\n";
+ }
But that isn't going to match; at least, not the way you're hoping. It
claims that there is no 123 in the string. Here's a clearer picture of
C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
specifies a class containing twenty-six characters.)
+Note also that the whole range idea is rather unportable between
+character sets--and even within a character set, ranges may cause results
+you probably didn't expect. A sound principle is to use only ranges
+that begin and end at either letters of the same case ([a-e],
+[A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt,
+spell out the character sets in full.
+
Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
Alternatives are tried from left to right, so the first
alternative found for which the entire expression matches, is the one that
is chosen. This means that alternatives are not necessarily greedy. For
-example: when mathing C<foo|foot> against "barefoot", only the "foo"
+example: when matching C<foo|foot> against "barefoot", only the "foo"
part will match, as that is the first alternative tried, and it successfully
matches the target string. (This might not seem important, but it is
important when you are capturing matched text using parentheses.)
to wreak havoc.
A common abuse of this power stems from the ability to make infinite
-loops using regular expressions, with something as innocous as:
+loops using regular expressions, with something as innocuous as:
'foo' =~ m{ ( o? )* }x;
L<perlop/"Regexp Quote-Like Operators">.
+L<perlop/"Gory details of parsing quoted constructs">.
+
L<perlfunc/pos>.
L<perllocale>.