This page describes the syntax of regular expressions in Perl. For a
description of how to I<use> regular expressions in matching
-operations, plus various examples of the same, see C<m//> and C<s///> in
-L<perlop>.
+operations, plus various examples of the same, see the discussion
+of C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like Operators">.
The matching operations can have various modifiers. The modifiers
that relate to the interpretation of the regular expression inside
-are listed below. For the modifiers that alter the behaviour of the
-operation, see L<perlop/"m//"> and L<perlop/"s//">.
+are listed below. For the modifiers that alter the way a regular expression
+is used by Perl, see L<perlop/"Regexp Quote-Like Operators"> and
+L<perlop/"Gory details of parsing quoted constructs">.
=over 4
\e escape (think troff) (ESC)
\033 octal char (think of a PDP-11)
\x1B hex char
+ \x{263a} wide hex char (Unicode SMILEY)
\c[ control char
\l lowercase next char (think vi)
\u uppercase next char (think vi)
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
+ \pP Match P, named property. Use \p{Prop} for longer names.
+ \PP Match non-P
+ \X Match eXtended Unicode "combining character sequence",
+ equivalent to (?:\PM\pM*)
+ \C Match a single C char (octet) even under utf8.
A C<\w> matches a single alphanumeric character, not a whole
word. To match a word you'd need to say C<\w+>. If C<use locale> is in
\b Match a word boundary
\B Match a non-(word boundary)
- \A Match at only beginning of string
- \Z Match at only end of string (or before newline at the end)
+ \A Match only at beginning of string
+ \Z Match only at end of string, or before newline at the end
+ \z Match only at end of string
\G Match only where previous m//g left off (works only with /g)
A word boundary (C<\b>) is defined as a spot between two characters that
just like "^" and "$", except that they won't match multiple times when the
C</m> modifier is used, while "^" and "$" will match at every internal line
boundary. To match the actual end of the string, not ignoring newline,
-you can use C<\Z(?!\n)>. The C<\G> assertion can be used to chain global
+you can use C<\z>. The C<\G> assertion can be used to chain global
matches (using C<m//g>), as described in
L<perlop/"Regexp Quote-Like Operators">.
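+
+As a minimal sketch of the difference between the two end-of-string
+assertions (the sample strings are made up for illustration), C<\Z>
+tolerates one trailing newline while C<\z> does not:
+
+    print "Z matches\n" if     "foo\n" =~ /foo\Z/;   # trailing newline tolerated
+    print "z fails\n"   unless "foo\n" =~ /foo\z/;   # must be the very end
+    print "z matches\n" if     "foo"   =~ /foo\z/;
+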
=item C<(?#text)>
A comment. The text is ignored. If the C</x> switch is used to enable
-whitespace formatting, a simple C<#> will suffice.
+whitespace formatting, a simple C<#> will suffice. Note that perl closes
+the comment as soon as it sees a C<)>, so there is no way to put a literal
+C<)> in the comment.
=item C<(?:pattern)>
+=item C<(?imsx-imsx:pattern)>
+
This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does. So
but doesn't spit out extra fields.
+The letters between C<?> and C<:> act as flag modifiers; see
+L<C<(?imsx-imsx)>>. In particular,
+
+ /(?s-i:more.*than).*million/i
+
+is equivalent to the more verbose
+
+ /(?:(?s-i)more.*than).*million/i
+
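+As a rough illustration (the sample strings are invented for this sketch),
+the inner flags apply only inside the group, while C</i> still governs
+the rest of the pattern:
+
+    # "more...than" must be lowercase but may span a newline;
+    # "million" is still matched case-insensitively.
+    my $re = qr/(?s-i:more.*than).*million/i;
+    print "matches\n"  if     "more\nthan a MILLION" =~ $re;
+    print "no match\n" unless "MORE than a million"  =~ $re;
+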
=item C<(?=pattern)>
A zero-width positive lookahead assertion. For example, C</\w+(?=\t)/>
succeeds. C<code> is not interpolated. Currently the rules to
determine where the C<code> ends are somewhat convoluted.
-B<WARNING>: This is a grave security risk for arbitrarily interpolated
-patterns. It introduces security holes in previously safe programs.
-A fix to Perl, and to this documentation, will be forthcoming prior
-to the actual 5.005 release.
+The C<code> is properly scoped in the following sense: if the assertion
+is backtracked (compare L<"Backtracking">), all the changes introduced after
+C<local>isation are undone, so
+
+    $_ = 'a' x 8;
+    m<
+       (?{ $cnt = 0 })                  # Initialize $cnt.
+       (
+         a
+         (?{
+             local $cnt = $cnt + 1;     # Update $cnt, backtracking-safe.
+         })
+       )*
+       aaaa
+       (?{ $res = $cnt })               # On success copy to non-localized
+                                        # location.
+     >x;
+
+will set C<$res = 4>. Note that after the match $cnt returns to the globally
+introduced value 0, since the scopes which restrict C<local> statements
+are unwound.
+
+This assertion may be used as a L<C<(?(condition)yes-pattern|no-pattern)>>
+switch. If I<not> used in this way, the result of evaluation of C<code>
+is put into variable $^R. This happens immediately, so $^R can be used from
+other C<(?{ code })> assertions inside the same regular expression.
+
+The above assignment to $^R is properly localized, thus the old value of $^R
+is restored if the assertion is backtracked (compare L<"Backtracking">).
+
+Due to security concerns, this construction is not allowed if the regular
+expression involves run-time interpolation of variables, unless the
+C<use re 'eval'> pragma is used (see L<re>), or the variables contain
+results of the qr() operator (see L<perlop/"qr/STRING/imosx">).
+
+This restriction is due to the wide-spread (questionable) practice of
+using the construct
+
+ $re = <>;
+ chomp $re;
+ $string =~ /$re/;
+
+without tainting. While this code is frowned upon from a security point
+of view, when C<(?{})> was introduced, it was considered bad to add
+I<new> security holes to existing scripts.
+
+B<NOTE:> Use of the above insecure snippet without also enabling taint mode
+is to be severely frowned upon. C<use re 'eval'> does not disable tainting
+checks, thus to allow $re in the above snippet to contain C<(?{})>
+I<with tainting enabled>, one needs both to C<use re 'eval'> and to untaint
+$re.
+
+=item C<(?p{ code })>
+
+I<Very experimental> "postponed" regular subexpression. C<code> is evaluated
+at runtime, at the moment this subexpression may match. The result of
+evaluation is considered as a regular expression, and matched as if it
+were inserted instead of this construct.
+
+C<code> is not interpolated. Currently the rules to
+determine where the C<code> ends are somewhat convoluted.
+
+The following regular expression matches a balanced parenthesized group:
+
+    $re = qr{
+              \(
+              (?:
+                (?> [^()]+ )    # Non-parens without backtracking
+              |
+                (?p{ $re })     # Group with matching parens
+              )*
+              \)
+            }x;
=item C<(?E<gt>pattern)>
In contrast, C<a*ab> will match the same as C<a+b>, since the match of
the subgroup C<a*> is influenced by the following group C<ab> (see
L<"Backtracking">). In particular, C<a*> inside C<a*ab> will match
-less characters that a standalone C<a*>, since this makes the tail match.
+fewer characters than a standalone C<a*>, since this makes the tail match.
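+
+A small sketch of this difference (the test string is chosen only for
+illustration): the plain C<a*> gives back one C<a> so that the tail can
+match, while the independent subexpression keeps everything it grabbed.
+
+    print "a*ab matches\n"   if     "aaab" =~ /a*ab/;      # a* gives back one "a"
+    print "(?>a*)ab fails\n" unless "aaab" =~ /(?>a*)ab/;  # (?>a*) keeps all the "a"s
+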
An effect similar to C<(?E<gt>pattern)> may be achieved by
since the lookahead is in I<"logical"> context, thus matches the same
substring as a standalone C<a+>. The following C<\1> eats the matched
string, thus making a zero-length assertion into an analogue of
-C<(?>...)>. (The difference between these two constructs is that the
+C<(?E<gt>...)>. (The difference between these two constructs is that the
second one uses a catching group, thus shifting ordinals of
backreferences in the rest of a regular expression.)
This construct is useful for optimizations of "eternal"
matches, because it will not backtrack (see L<"Backtracking">).
- m{ \( (
- [^()]+
- |
- \( [^()]* \)
- )+
- \)
- }x
+ m{ \(
+ (
+ [^()]+
+ |
+ \( [^()]* \)
+ )+
+ \)
+ }x
That will efficiently match a nonempty group with matching
two-or-less-level-deep parentheses. However, if there is no such group,
it will take virtually forever on a long string. That's because there are
so many different ways to split a long string into several substrings.
-This is essentially what C<(.+)+> is doing, and this is a subpattern
-of the above pattern. Consider that C<((()aaaaaaaaaaaaaaaaaa> on the
-pattern above detects no-match in several seconds, but that each extra
+This is what C<(.+)+> is doing, and C<(.+)+> is similar to a subpattern
+of the above pattern. Consider that the above pattern detects no-match
+on C<((()aaaaaaaaaaaaaaaaaa> in several seconds, but that each extra
letter doubles this time. This exponential performance will make it
appear that your program has hung.
However, a tiny modification of this pattern
- m{ \( (
- (?> [^()]+ )
- |
- \( [^()]* \)
- )+
- \)
- }x
+ m{ \(
+ (
+ (?> [^()]+ )
+ |
+ \( [^()]* \)
+ )+
+ \)
+ }x
which uses C<(?E<gt>...)> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a fourth
however, that this pattern currently triggers a warning message under
B<-w> saying it C<"matches the null string many times">):
-On simple groups, such as the pattern C<(?> [^()]+ )>, a comparable
+On simple groups, such as the pattern C<(?E<gt> [^()]+ )>, a comparable
effect may be achieved by negative lookahead, as in C<[^()]+ (?! [^()] )>.
This was only 4 times slower on a string with 1000000 C<a>s.
Say,
m{ ( \( )?
- [^()]+
+ [^()]+
(?(1) \) )
- }x
+ }x
matches a chunk of non-parentheses, possibly included in parentheses
themselves.
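+
+For instance, in a sketch with made-up input strings (C<$chunk> is a name
+used only here), the closing parenthesis is required exactly when the
+opening one was matched:
+
+    my $chunk = qr{ ( \( )? [^()]+ (?(1) \) ) }x;
+    print "<$&>\n" if "(abc)" =~ $chunk;    # prints "<(abc)>"
+    print "<$&>\n" if "abc"   =~ $chunk;    # prints "<abc>"
+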
-=item C<(?imsx)>
+=item C<(?imsx-imsx)>
One or more embedded pattern-match modifiers. This is particularly
useful for patterns that are specified in a table somewhere, some of
$pattern = "(?i)foobar";
if ( /$pattern/ ) { }
+Letters after C<-> switch modifiers off.
+
These modifiers are localized inside an enclosing group (if any). Say,
( (?i) blah ) \s+ \1
tricker. Imagine you'd like to find a sequence of non-digits not
followed by "123". You might try to write that as
- $_ = "ABC123";
- if ( /^\D*(?!123)/ ) { # Wrong!
- print "Yup, no 123 in $_\n";
- }
+ $_ = "ABC123";
+ if ( /^\D*(?!123)/ ) { # Wrong!
+ print "Yup, no 123 in $_\n";
+ }
But that isn't going to match; at least, not the way you're hoping. It
claims that there is no 123 in the string. Here's a clearer picture of
C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
specifies a class containing twenty-six characters.)
+Note also that the whole range idea is rather unportable between
+character sets, and even within a character set ranges may produce results
+you probably didn't expect. A sound principle is to use only ranges
+that begin and end at letters of the same case ([a-e], [A-E]), or at
+digits ([0-9]). Anything else is unsafe. If in doubt,
+spell out the character sets in full.
+
Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
with the operation of matching a backreference. Certainly they mean two
different things on the I<left> side of the C<s///>.
+=head2 Repeated patterns matching a zero-length substring
+
+WARNING: Difficult material (and prose) ahead. This section needs a rewrite.
+
+Regular expressions provide a terse and powerful programming language. As
+with most other power tools, power comes together with the ability
+to wreak havoc.
+
+A common abuse of this power stems from the ability to make infinite
+loops using regular expressions, with something as innocuous as:
+
+ 'foo' =~ m{ ( o? )* }x;
+
+The C<o?> can match at the beginning of C<'foo'>, and since the position
+in the string is not moved by the match, C<o?> would match again and again
+due to the C<*> modifier. Another common way to create a similar cycle
+is with the looping modifier C<//g>:
+
+ @matches = ( 'foo' =~ m{ o? }xg );
+
+or
+
+ print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
+
+or the loop implied by split().
+
+However, long experience has shown that many programming tasks may
+be significantly simplified by using repeated subexpressions which
+may match zero-length substrings, with a simple example being:
+
+ @chars = split //, $string; # // is not magic in split
+ ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
+
+Thus Perl allows the C</()/> construct, which I<forcefully breaks
+the infinite loop>. The rules for this are different for lower-level
+loops given by the greedy modifiers C<*+{}>, and for higher-level
+ones like the C</g> modifier or split() operator.
+
+The lower-level loops are I<interrupted> when it is detected that a
+repeated expression did match a zero-length substring, thus
+
+ m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
+
+is made equivalent to
+
+ m{ (?: NON_ZERO_LENGTH )*
+ |
+ (?: ZERO_LENGTH )?
+ }x;
+
+The higher-level loops preserve an additional state between iterations:
+whether the last match was zero-length. To break the loop, the match
+following a zero-length match is prohibited from having a length of zero.
+This prohibition interacts with backtracking (see L<"Backtracking">),
+and so the I<second best> match is chosen if the I<best> match is of
+zero length.
+
+Say,
+
+ $_ = 'bar';
+ s/\w??/<$&>/g;
+
+results in C<"<><b><><a><><r><>">. At each position of the string the best
+match given by non-greedy C<??> is the zero-length match, and the I<second
+best> match is what is matched by C<\w>. Thus zero-length matches
+alternate with one-character-long matches.
+
+Similarly, for repeated C<m/()/g> the second-best match is the match at the
+position one notch further in the string.
+
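+As a sketch of the same rule applied to the earlier C<m{ o? }xg> example
+(the C<print> formatting is only for illustration), the list of matches
+is finite, and a zero-length match is never repeated at the same position:
+
+    my @matches = ( 'foo' =~ m{ o? }xg );
+    print join(',', map { "<$_>" } @matches), "\n";   # <>,<o>,<o>,<>
+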
+The additional state of being I<matched with zero-length> is associated with
+the matched string, and is reset by each assignment to pos().
+
+=head2 Creating custom RE engines
+
+Overloaded constants (see L<overload>) provide a simple way to extend
+the functionality of the RE engine.
+
+Suppose that we want to enable a new RE escape-sequence C<\Y|> which
+matches at a boundary between whitespace characters and non-whitespace
+characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
+at these positions, so we want to have each C<\Y|> in the place of the
+more complicated version. We can create a module C<customre> to do
+this:
+
+    package customre;
+    use overload;
+
+    sub import {
+      shift;
+      die "No argument to customre::import allowed" if @_;
+      overload::constant 'qr' => \&convert;
+    }
+
+    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}
+
+    # An escaped backslash must stay escaped, so it maps to two backslashes.
+    my %rules = ( '\\' => '\\\\',
+                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
+    sub convert {
+      my $re = shift;
+      $re =~ s{
+                \\ ( \\ | Y . )
+              }
+              { $rules{$1} or invalid($re,$1) }sgex;
+      return $re;
+    }
+
+Now C<use customre> enables the new escape in constant regular
+expressions, i.e., those without any runtime variable interpolations.
+As documented in L<overload>, this conversion will work only over
+literal parts of regular expressions. For C<\Y|$re\Y|> the variable
+part of this regular expression needs to be converted explicitly
+(but only if the special meaning of C<\Y|> should be enabled inside $re):
+
+ use customre;
+ $re = <>;
+ chomp $re;
+ $re = customre::convert $re;
+ /\Y|$re\Y|/;
+
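+To see where such a boundary assertion matches, here is a sketch that uses
+the expanded pattern directly, without the C<customre> module (the sample
+string and the variable names are illustrative only):
+
+    my $string = "ab  cd";
+    my @positions;
+    push @positions, pos($string)
+        while $string =~ /(?=\S)(?<!\S)|(?!\S)(?<=\S)/g;
+    print "@positions\n";          # 0 2 4 6
+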
=head2 SEE ALSO
L<perlop/"Regexp Quote-Like Operators">.
+L<perlop/"Gory details of parsing quoted constructs">.
+
L<perlfunc/pos>.
L<perllocale>.