This page describes the syntax of regular expressions in Perl. For a
description of how to I<use> regular expressions in matching
-operations, plus various examples of the same, see C<m//> and C<s///> in
-L<perlop>.
+operations, plus various examples of the same, see the discussion
+of C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like Operators">.
The matching operations can have various modifiers. The modifiers
that relate to the interpretation of the regular expression inside
-are listed below. For the modifiers that alter the behaviour of the
-operation, see L<perlop/"m//"> and L<perlop/"s//">.
+are listed below. For the modifiers that alter the way a regular expression
+is used by Perl, see L<perlop/"Regexp Quote-Like Operators"> and
+L<perlop/"Gory details of parsing quoted constructs">.
=over 4
Treat string as multiple lines. That is, change "^" and "$" from matching
at only the very start or end of the string to the start or end of any
-line anywhere within the string,
+line anywhere within the string.
=item s
(If a curly bracket occurs in any other context, it is treated
as a regular character.) The "*" modifier is equivalent to C<{0,}>, the "+"
modifier to C<{1,}>, and the "?" modifier to C<{0,1}>. n and m are limited
-to integral values less than 65536.
+to integral values less than a preset limit defined when perl is built.
+This is usually 32766 on the most common platforms. The actual limit can
+be seen in the error message generated by code such as this:
+
+ $_ **= $_ , / {$_} / for 2 .. 42;
By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
\e escape (think troff) (ESC)
\033 octal char (think of a PDP-11)
\x1B hex char
+ \x{263a} wide hex char (Unicode SMILEY)
\c[ control char
\l lowercase next char (think vi)
\u uppercase next char (think vi)
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
+ \pP Match P, named property. Use \p{Prop} for longer names.
+ \PP Match non-P
+ \X Match eXtended Unicode "combining character sequence",
+ equivalent to C<(?:\PM\pM*)>
+ \C Match a single C char (octet) even under utf8.
A C<\w> matches a single alphanumeric character, not a whole
word. To match a word you'd need to say C<\w+>. If C<use locale> is in
\b Match a word boundary
\B Match a non-(word boundary)
- \A Match at only beginning of string
- \Z Match at only end of string (or before newline at the end)
+ \A Match only at beginning of string
+ \Z Match only at end of string, or before newline at the end
+ \z Match only at end of string
\G Match only where previous m//g left off (works only with /g)
A word boundary (C<\b>) is defined as a spot between two characters that
just like "^" and "$", except that they won't match multiple times when the
C</m> modifier is used, while "^" and "$" will match at every internal line
boundary. To match the actual end of the string, not ignoring newline,
-you can use C<\Z(?!\n)>. The C<\G> assertion can be used to chain global
+you can use C<\z>. The C<\G> assertion can be used to chain global
matches (using C<m//g>), as described in
L<perlop/"Regexp Quote-Like Operators">.
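+
+A minimal sketch of the difference between C<\Z> and C<\z>, assuming a
+string that ends in a newline (the variable C<$s> is only illustrative):
+
+    my $s = "foo\n";
+    print "Z matches\n" if     $s =~ /foo\Z/;  # \Z tolerates the trailing newline
+    print "z fails\n"   unless $s =~ /foo\z/;  # \z demands the very end of string
+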
=item C<(?#text)>
A comment. The text is ignored. If the C</x> switch is used to enable
-whitespace formatting, a simple C<#> will suffice.
+whitespace formatting, a simple C<#> will suffice. Note that perl closes
+the comment as soon as it sees a C<)>, so there is no way to put a literal
+C<)> in the comment.
=item C<(?:pattern)>
+=item C<(?imsx-imsx:pattern)>
+
This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does. So
but doesn't spit out extra fields.
+The letters between C<?> and C<:> act as flag modifiers; see
+L<C<(?imsx-imsx)>>. In particular,
+
+ /(?s-i:more.*than).*million/i
+
+is equivalent to the more verbose
+
+ /(?:(?s-i)more.*than).*million/i
+
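+As a rough illustration (the test strings are made up for this example),
+the inner C<-i> keeps C<more> and C<than> case-sensitive and the inner
+C<s> lets the first C<.*> cross a newline, while C<million> still obeys
+the outer C</i>:
+
+    my $re = qr/(?s-i:more.*than).*million/i;
+    print "1 ok\n" if     "more\nthan a MILLION" =~ $re;  # outer /i applies to "million"
+    print "2 ok\n" unless "MORE\nthan a MILLION" =~ $re;  # inner part is case-sensitive
+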
=item C<(?=pattern)>
A zero-width positive lookahead assertion. For example, C</\w+(?=\t)/>
succeeds. C<code> is not interpolated. Currently the rules to
determine where the C<code> ends are somewhat convoluted.
-B<WARNING>: This is a grave security risk for arbitrarily interpolated
-patterns. It introduces security holes in previously safe programs.
-A fix to Perl, and to this documentation, will be forthcoming prior
-to the actual 5.005 release.
+The C<code> is properly scoped in the following sense: if the assertion
+is backtracked (compare L<"Backtracking">), all the changes introduced after
+C<local>isation are undone, so
+
+ $_ = 'a' x 8;
+ m<
+ (?{ $cnt = 0 }) # Initialize $cnt.
+ (
+ a
+ (?{
+ local $cnt = $cnt + 1; # Update $cnt, backtracking-safe.
+ })
+ )*
+ aaaa
+ (?{ $res = $cnt }) # On success copy to non-localized
+ # location.
+ >x;
+
+will set C<$res = 4>. Note that after the match $cnt returns to the globally
+introduced value 0, since the scopes which restrict C<local> statements
+are unwound.
+
+This assertion may be used as a L<C<(?(condition)yes-pattern|no-pattern)>>
+switch. If I<not> used in this way, the result of evaluating C<code>
+is put into the variable $^R. This happens immediately, so $^R can be used from
+other C<(?{ code })> assertions inside the same regular expression.
+
+The above assignment to $^R is properly localized, thus the old value of $^R
+is restored if the assertion is backtracked (compare L<"Backtracking">).
+
+Due to security concerns, this construction is not allowed if the regular
+expression involves run-time interpolation of variables, unless the
+C<use re 'eval'> pragma is in effect (see L<re>), or the variables contain
+results of the qr() operator (see L<perlop/"qr/STRING/imosx">).
+
+This restriction is due to the widespread (questionable) practice of
+using the construct
+
+ $re = <>;
+ chomp $re;
+ $string =~ /$re/;
+
+without tainting. While this code is frowned upon from a security point
+of view, when C<(?{})> was introduced, it was considered bad to add
+I<new> security holes to existing scripts.
+
+B<NOTE:> Use of the above insecure snippet without also enabling taint mode
+is to be severely frowned upon. C<use re 'eval'> does not disable tainting
+checks, thus to allow $re in the above snippet to contain C<(?{})>
+I<with tainting enabled>, one needs both to C<use re 'eval'> and to untaint
+$re.
+
+=item C<(?p{ code })>
+
+I<Very experimental> "postponed" regular subexpression. C<code> is evaluated
+at runtime, at the moment this subexpression may match. The result of
+evaluation is considered as a regular expression, and matched as if it
+were inserted instead of this construct.
+
+C<code> is not interpolated. Currently the rules to
+determine where the C<code> ends are somewhat convoluted.
+
+The following regular expression matches a parenthesized group whose nested
+parentheses are balanced:
+
+ $re = qr{
+ \(
+ (?:
+ (?> [^()]+ ) # Non-parens without backtracking
+ |
+ (?p{ $re }) # Group with matching parens
+ )*
+ \)
+ }x;
=item C<(?E<gt>pattern)>
In contrast, C<a*ab> will match the same as C<a+b>, since the match of
the subgroup C<a*> is influenced by the following group C<ab> (see
L<"Backtracking">). In particular, C<a*> inside C<a*ab> will match
-less characters that a standalone C<a*>, since this makes the tail match.
+fewer characters than a standalone C<a*>, since this makes the tail match.
An effect similar to C<(?E<gt>pattern)> may be achieved by
since the lookahead is in I<"logical"> context, thus matches the same
substring as a standalone C<a+>. The following C<\1> eats the matched
string, thus making a zero-length assertion into an analogue of
-C<(?>...)>. (The difference between these two constructs is that the
+C<(?E<gt>...)>. (The difference between these two constructs is that the
second one uses a catching group, thus shifting ordinals of
backreferences in the rest of a regular expression.)
This construct is useful for optimizations of "eternal"
matches, because it will not backtrack (see L<"Backtracking">).
- m{ \( (
- [^()]+
- |
- \( [^()]* \)
- )+
- \)
- }x
+ m{ \(
+ (
+ [^()]+
+ |
+ \( [^()]* \)
+ )+
+ \)
+ }x
That will efficiently match a nonempty group with matching
two-or-less-level-deep parentheses. However, if there is no such group,
it will take virtually forever on a long string. That's because there are
so many different ways to split a long string into several substrings.
-This is essentially what C<(.+)+> is doing, and this is a subpattern
-of the above pattern. Consider that C<((()aaaaaaaaaaaaaaaaaa> on the
-pattern above detects no-match in several seconds, but that each extra
+This is what C<(.+)+> is doing, and C<(.+)+> is similar to a subpattern
+of the above pattern. Consider that the above pattern detects no-match
+on C<((()aaaaaaaaaaaaaaaaaa> in several seconds, but that each extra
letter doubles this time. This exponential performance will make it
appear that your program has hung.
However, a tiny modification of this pattern
- m{ \( (
- (?> [^()]+ )
- |
- \( [^()]* \)
- )+
- \)
- }x
+ m{ \(
+ (
+ (?> [^()]+ )
+ |
+ \( [^()]* \)
+ )+
+ \)
+ }x
which uses C<(?E<gt>...)> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a fourth
however, that this pattern currently triggers a warning message under
B<-w> saying it C<"matches the null string many times">):
-On simple groups, such as the pattern C<(?> [^()]+ )>, a comparable
+On simple groups, such as the pattern C<(?E<gt> [^()]+ )>, a comparable
effect may be achieved by negative lookahead, as in C<[^()]+ (?! [^()] )>.
This was only 4 times slower on a string with 1000000 C<a>s.
Say,
m{ ( \( )?
- [^()]+
+ [^()]+
(?(1) \) )
- }x
+ }x
matches a chunk of non-parentheses, possibly included in parentheses
themselves.
-=item C<(?imsx)>
+=item C<(?imsx-imsx)>
One or more embedded pattern-match modifiers. This is particularly
useful for patterns that are specified in a table somewhere, some of
$pattern = "(?i)foobar";
if ( /$pattern/ ) { }
+Letters after a C<-> switch the corresponding modifiers off.
+
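+For instance, in this small sketch (the strings are chosen only for
+illustration), C<(?-i)> turns case-insensitivity back off for the rest
+of the pattern:
+
+    print "1 ok\n" if     "FOObar" =~ /(?i)foo(?-i)bar/;  # "bar" must match exactly
+    print "2 ok\n" unless "FOOBAR" =~ /(?i)foo(?-i)bar/;  # "BAR" is rejected
+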
These modifiers are localized inside an enclosing group (if any). Say,
( (?i) blah ) \s+ \1
tricker. Imagine you'd like to find a sequence of non-digits not
followed by "123". You might try to write that as
- $_ = "ABC123";
- if ( /^\D*(?!123)/ ) { # Wrong!
- print "Yup, no 123 in $_\n";
- }
+ $_ = "ABC123";
+ if ( /^\D*(?!123)/ ) { # Wrong!
+ print "Yup, no 123 in $_\n";
+ }
But that isn't going to match; at least, not the way you're hoping. It
claims that there is no 123 in the string. Here's a clearer picture of
C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
specifies a class containing twenty-six characters.)
+Note also that the whole range idea is rather unportable between
+character sets, and even within a character set a range may give results
+you probably didn't expect. A sound principle is to use only ranges
+that begin and end at either letters of the same case ([a-e],
+[A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt,
+spell out the character sets in full.
+
Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
Alternatives are tried from left to right, so the first
alternative found for which the entire expression matches, is the one that
is chosen. This means that alternatives are not necessarily greedy. For
-example: when mathing C<foo|foot> against "barefoot", only the "foo"
+example: when matching C<foo|foot> against "barefoot", only the "foo"
part will match, as that is the first alternative tried, and it successfully
matches the target string. (This might not seem important, but it is
important when you are capturing matched text using parentheses.)
with the operation of matching a backreference. Certainly they mean two
different things on the I<left> side of the C<s///>.
+=head2 Repeated patterns matching zero-length substring
+
+WARNING: Difficult material (and prose) ahead. This section needs a rewrite.
+
+Regular expressions provide a terse and powerful programming language. As
+with most other power tools, power comes together with the ability
+to wreak havoc.
+
+A common abuse of this power stems from the ability to make infinite
+loops using regular expressions, with something as innocuous as:
+
+ 'foo' =~ m{ ( o? )* }x;
+
+The C<o?> can match at the beginning of C<'foo'>, and since the position
+in the string is not moved by the match, C<o?> would match again and again
+due to the C<*> modifier. Another common way to create a similar cycle
+is with the looping modifier C<//g>:
+
+ @matches = ( 'foo' =~ m{ o? }xg );
+
+or
+
+ print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
+
+or the loop implied by split().
+
+However, long experience has shown that many programming tasks may
+be significantly simplified by using repeated subexpressions which
+may match zero-length substrings, with a simple example being:
+
+ @chars = split //, $string; # // is not magic in split
+ ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
+
+Thus Perl allows the C</()/> construct, which I<forcefully breaks
+the infinite loop>. The rules for this are different for lower-level
+loops given by the greedy modifiers C<*+{}>, and for higher-level
+ones like the C</g> modifier or split() operator.
+
+The lower-level loops are I<interrupted> when it is detected that a
+repeated expression did match a zero-length substring, thus
+
+ m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
+
+is made equivalent to
+
+ m{ (?: NON_ZERO_LENGTH )*
+ |
+ (?: ZERO_LENGTH )?
+ }x;
+
+The higher-level loops preserve an additional state between iterations:
+whether the last match was zero-length. To break the loop, the match
+following a zero-length match is prohibited from having a length of zero.
+This prohibition interacts with backtracking (see L<"Backtracking">),
+and so the I<second best> match is chosen if the I<best> match is of
+zero length.
+
+Say,
+
+ $_ = 'bar';
+ s/\w??/<$&>/g;
+
+results in C<"<><b><><a><><r><>">. At each position of the string the best
+match given by non-greedy C<??> is the zero-length match, and the I<second
+best> match is what is matched by C<\w>. Thus zero-length matches
+alternate with one-character-long matches.
+
+Similarly, for repeated C<m/()/g> the second-best match is the match at the
+position one notch further in the string.
+
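+A short sketch of this rule, assuming a two-character string (the names
+C<@where> and C<@chars> are illustrative only): the position reported by
+pos() advances by one on each iteration instead of looping forever.
+
+    $_ = 'ab';
+    my @where;
+    push @where, pos() while /()/g;   # zero-length matches at 0, 1 and 2
+    my @chars = split //, 'abc';      # likewise yields 'a', 'b', 'c'
+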
+The additional state of being I<matched with zero-length> is associated with
+the matched string, and is reset by each assignment to pos().
+
+=head2 Creating custom RE engines
+
+Overloaded constants (see L<overload>) provide a simple way to extend
+the functionality of the RE engine.
+
+Suppose that we want to enable a new RE escape-sequence C<\Y|> which
+matches at a boundary between whitespace characters and non-whitespace
+characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
+at these positions, so we want to have each C<\Y|> in the place of the
+more complicated version. We can create a module C<customre> to do
+this:
+
+ package customre;
+ use overload;
+
+ sub import {
+ shift;
+ die "No argument to customre::import allowed" if @_;
+ overload::constant 'qr' => \&convert;
+ }
+
+ sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}
+
+ my %rules = ( '\\' => '\\',
+ 'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
+ sub convert {
+ my $re = shift;
+ $re =~ s{
+ \\ ( \\ | Y . )
+ }
+ { $rules{$1} or invalid($re,$1) }sgex;
+ return $re;
+ }
+
+Now C<use customre> enables the new escape in constant regular
+expressions, i.e., those without any runtime variable interpolations.
+As documented in L<overload>, this conversion will work only over
+literal parts of regular expressions. For C<\Y|$re\Y|> the variable
+part of this regular expression needs to be converted explicitly
+(but only if the special meaning of C<\Y|> should be enabled inside $re):
+
+ use customre;
+ $re = <>;
+ chomp $re;
+ $re = customre::convert $re;
+ /\Y|$re\Y|/;
+
=head2 SEE ALSO
L<perlop/"Regexp Quote-Like Operators">.
+L<perlop/"Gory details of parsing quoted constructs">.
+
L<perlfunc/pos>.
L<perllocale>.