C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.
+
+=head2 Modifiers
+
Matching operations can have various modifiers. Modifiers
that relate to the interpretation of the regular expression inside
are listed below. Modifiers that alter the way a regular expression
=head3 Metacharacters
-The patterns used in Perl pattern matching derive from supplied in
+The patterns used in Perl pattern matching evolved from the ones supplied in
the Version 8 regex routines. (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.) See L<Version 8 Regular Expressions> for
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":
-X<metacharacter> X<greedy> X<greedyness>
+X<metacharacter> X<greedy> X<greediness>
X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>
- *? Match 0 or more times
- +? Match 1 or more times
- ?? Match 0 or 1 time
- {n}? Match exactly n times
- {n,}? Match at least n times
- {n,m}? Match at least n but not more than m times
+ *? Match 0 or more times, not greedily
+ +? Match 1 or more times, not greedily
+ ?? Match 0 or 1 time, not greedily
+ {n}? Match exactly n times, not greedily
+ {n,}? Match at least n times, not greedily
+ {n,m}? Match at least n but not more than m times, not greedily
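+
+As a rough illustration of the difference (a sketch only; the sample
+string and the variable names are made up for this example):
+
+    my $text = "<b>one</b> and <b>two</b>";
+    my ($greedy) = $text =~ m{<b>(.*)</b>};     # captures "one</b> and <b>two"
+    my ($lazy)   = $text =~ m{<b>(.*?)</b>};    # captures "one"
+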
By default, when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack. However, this behaviour is
-sometimes undesirable. Thus Perl provides the "possesive" quantifier form
+sometimes undesirable. Thus Perl provides the "possessive" quantifier form
as well.
- *+ Match 0 or more times and give nothing back
- ++ Match 1 or more times and give nothing back
- ?+ Match 0 or 1 time and give nothing back
+ *+ Match 0 or more times and give nothing back
+ ++ Match 1 or more times and give nothing back
+ ?+ Match 0 or 1 time and give nothing back
{n}+ Match exactly n times and give nothing back (redundant)
{n,}+ Match at least n times and give nothing back
{n,m}+ Match at least n but not more than m times and give nothing back
/"(?:[^"\\]++|\\.)*+"/
-as we know that if the final quote does not match, bactracking will not
+as we know that if the final quote does not match, backtracking will not
help. See the independent subexpression C<< (?>...) >> for more details;
possessive quantifiers are just syntactic sugar for that construct. For
instance the above example could also be written as follows:
Because patterns are processed as double quoted strings, the following
also work:
-X<\t> X<\n> X<\r> X<\f> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
+X<\t> X<\n> X<\r> X<\f> X<\e> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
X<\0> X<\c> X<\N> X<\x>
\t tab (HT, TAB)
\f form feed (FF)
\a alarm (bell) (BEL)
\e escape (think troff) (ESC)
- \033 octal char (think of a PDP-11)
- \x1B hex char
- \x{263a} wide hex char (Unicode SMILEY)
- \c[ control char
+ \033 octal char (example: ESC)
+ \x1B hex char (example: ESC)
+ \x{263a} wide hex char (example: Unicode SMILEY)
+ \cK control char (example: VT)
\N{name} named char
\l lowercase next char (think vi)
\u uppercase next char (think vi)
=head3 Character classes
In addition, Perl defines the following:
-X<metacharacter>
X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
-X<word> X<whitespace>
+X<\g> X<\k> X<\N> X<\K> X<\v> X<\V>
+X<word> X<whitespace> X<character class> X<backreference>
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-"word" character
as matching an English word). If C<use locale> is in effect, the list
of alphabetic characters generated by C<\w> is taken from the current
locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
-C<\d>, and C<\D> within character classes, but if you try to use them
-as endpoints of a range, that's not a range, the "-" is understood
-literally. If Unicode is in effect, C<\s> matches also "\x{85}",
-"\x{2028}, and "\x{2029}", see L<perlunicode> for more details about
-C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
-You can define your own C<\p> and C<\P> properties, see L<perlunicode>.
+C<\d>, and C<\D> within character classes, but they aren't usable
+as either end of a range. If any of them precedes or follows a "-",
+the "-" is understood literally. If Unicode is in effect, C<\s> also matches
+"\x{85}", "\x{2028}", and "\x{2029}". See L<perlunicode> for more
+details about C<\pP>, C<\PP>, C<\X> and the possibility of defining
+your own C<\p> and C<\P> properties, and L<perluniintro> about Unicode
+in general.
X<\w> X<\W> X<word>
The POSIX character class syntax
[:class:]
-is also available. Note that the C<[> and C<]> braces are I<literal>;
+is also available. Note that the C<[> and C<]> brackets are I<literal>;
they must always be used within a character class expression.
# this is correct:
=item [2]
Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
-also the (very rare) "vertical tabulator", "\ck", chr(11).
+also the (very rare) "vertical tabulator", "\cK" or chr(11) in ASCII.
=item [3]
[01[:alpha:]%]
-matches zero, one, any alphabetic character, and the percentage sign.
+matches zero, one, any alphabetic character, and the percent sign.
The following equivalences to Unicode \p{} constructs and equivalent
backslash character classes (if available), will hold:
alpha IsAlpha
alnum IsAlnum
ascii IsASCII
- blank IsSpace
+ blank
cntrl IsCntrl
digit IsDigit \d
graph IsGraph
Any control character. Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
backspace are control characters. All characters with ord() less than
-32 are most often classified as control characters (assuming ASCII,
+32 are usually classified as control characters (assuming ASCII,
the ISO Latin character sets, and Unicode), as is the character with
the ord() value of 127 (C<DEL>).
X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>
\b Match a word boundary
- \B Match a non-(word boundary)
+ \B Match except at a word boundary
\A Match only at beginning of string
\Z Match only at end of string, or before newline at the end
\z Match only at end of string
=head3 Capture buffers
-The bracketing construct C<( ... )> creates capture buffers. To
-refer to the digit'th buffer use \<digit> within the
-match. Outside the match use "$" instead of "\". (The
+The bracketing construct C<( ... )> creates capture buffers. To refer
+to the current contents of a buffer later on, within the same pattern,
+use \1 for the first, \2 for the second, and so on.
+Outside the match use "$" instead of "\". (The
\<digit> notation works in certain circumstances outside
the match. See the warning below about \1 vs $1 for details.)
Referring back to another part of the match is called a
X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
In order to provide a safer and easier way to construct patterns using
-backrefs, in Perl 5.10 the C<\g{N}> notation is provided. The curly
+backreferences, Perl 5.10 provides the C<\g{N}> notation. The curly
brackets are optional, however omitting them is less safe as the meaning
of the pattern can be changed by text (such as digits) following it.
When N is a positive integer the C<\g{N}> notation is exactly equivalent
Additionally, as of Perl 5.10 you may use named capture buffers and named
backreferences. The notation is C<< (?<name>...) >> to declare and C<< \k<name> >>
-to reference. You may also use single quotes instead of angle brackets to quote the
-name; and you may use the bracketed C<< \g{name} >> back reference syntax.
-The only difference between named capture buffers and unnamed ones is
-that multiple buffers may have the same name and that the contents of
-named capture buffers are available via the C<%+> hash. When multiple
-groups share the same name C<$+{name}> and C<< \k<name> >> refer to the
-leftmost defined group, thus it's possible to do things with named capture
-buffers that would otherwise require C<(??{})> code to accomplish. Named
-capture buffers are numbered just as normal capture buffers are and may be
-referenced via the magic numeric variables or via numeric backreferences
-as well as by name.
+to reference. You may also use apostrophes instead of angle brackets to delimit the
+name; and you may use the bracketed C<< \g{name} >> backreference syntax.
+It's possible to refer to a named capture buffer by absolute and relative number as well.
+Outside the pattern, a named capture buffer is available via the C<%+> hash.
+When different buffers within the same pattern have the same name, C<$+{name}>
+and C<< \k<name> >> refer to the leftmost defined group. (Thus it's possible
+to do things with named capture buffers that would otherwise require C<(??{})>
+code to accomplish.)
+X<named capture buffer> X<regular expression, named capture buffer>
+X<%+> X<$+{name}> X<\k{name}>
Examples:
/(?<char>.)\k<char>/ # ... a different way
and print "'$+{char}' is the first doubled character\n";
- /(?<char>.)\1/ # ... mix and match
+ /(?'char'.)\1/ # ... mix and match
and print "'$1' is the first doubled character\n";
if (/Time: (..):(..):(..)/) { # parse out values
X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>
-B<NOTE>: failed matches in Perl do not reset the match variables,
+B<NOTE>: Failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.
turned off, if preceded by C<->) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any). This is
particularly useful for dynamic patterns, such as those read in from a
-configuration file, read in as an argument, are specified in a table
-somewhere, etc. Consider the case that some of which want to be case
-sensitive and some do not. The case insensitive ones need to include
-merely C<(?i)> at the front of the pattern. For example:
+configuration file, taken from an argument, or specified in a table
+somewhere. Consider the case where some patterns want to be case
+sensitive and some do not: The case insensitive ones merely need to
+include C<(?i)> at the front of the pattern. For example:
$pattern = "foobar";
if ( /$pattern/i ) { }
( (?i) blah ) \s+ \1
-will match a repeated (I<including the case>!) word C<blah> in any
-case, assuming C<x> modifier, and no C<i> modifier outside this
-group.
+will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
+repetition of the previous word, assuming the C</x> modifier, and no C</i>
+modifier outside this group.
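+
+A small sketch of that behaviour (the sample strings and the C<$re>
+variable are invented for illustration):
+
+    my $re = qr/( (?i) blah ) \s+ \1/x;
+    print "matched\n" if "BLAH BLAH" =~ $re;    # repetition has the same case
+    print "failed\n"  if "Blah blah" !~ $re;    # repetition differs in case
+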
Note that the C<k> modifier is special in that it can only be enabled,
not disabled, and that its presence anywhere in a pattern has a global
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>
A named capture buffer. Identical in every respect to normal capturing
-parens C<()> but for the additional fact that C<%+> may be used after
+parentheses C<()> but for the additional fact that C<%+> may be used after
a successful match to refer to a named buffer. See C<perlvar> for more
details on the C<%+> hash.
If multiple distinct capture buffers have the same name then the
$+{NAME} will refer to the leftmost defined buffer in the match.
-The forms C<(?'NAME'pattern)> and C<(?<NAME>pattern)> are equivalent.
+The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.
B<NOTE:> While the notation of this construct is the same as the similar
-function in .NET regexes, the behavior is not, in Perl the buffers are
+function in .NET regexes, the behavior is not. In Perl the buffers are
numbered sequentially regardless of being named or not. Thus in the
pattern
though it isn't extended by the locale (see L<perllocale>).
B<NOTE:> In order to make things easier for programmers with experience
-with the Python or PCRE regex engines the pattern C<< (?PE<lt>NAMEE<gt>pattern) >>
-maybe be used instead of C<< (?<NAME>pattern) >>; however this form does not
+with the Python or PCRE regex engines, the pattern C<< (?PE<lt>NAMEE<gt>pattern) >>
+may be used instead of C<< (?<NAME>pattern) >>; however this form does not
support the use of single quotes as a delimiter for the name. This is
only available in Perl 5.10 or later.
have the same name then it refers to the leftmost defined group in
the current match.
-It is an error to refer to a name not defined by a C<(?<NAME>)>
+It is an error to refer to a name not defined by a C<< (?<NAME>) >>
earlier in the pattern.
Both forms are equivalent.
B<NOTE:> In order to make things easier for programmers with experience
-with the Python or PCRE regex engines the pattern C<< (?P=NAME) >>
-maybe be used instead of C<< \k<NAME> >> in Perl 5.10 or later.
+with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
+may be used instead of C<< \k<NAME> >> in Perl 5.10 or later.
=item C<(?{ code })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>
# location.
>x;
-will set C<$res = 4>. Note that after the match, $cnt returns to the globally
+will set C<$res = 4>. Note that after the match, C<$cnt> returns to the globally
introduced value, because the scopes that restrict C<local> operators
are unwound.
variables contain results of C<qr//> operator (see
L<perlop/"qr/STRING/imosx">).
-This restriction is because of the wide-spread and remarkably convenient
+This restriction is due to the widespread and remarkably convenient
custom of using run-time determined strings as patterns. For example:
$re = <>;
Better yet, use the carefully constrained evaluation within a Safe
compartment. See L<perlsec> for details about both these mechanisms.
-Because perl's regex engine is not currently re-entrant, interpolated
+Because Perl's regex engine is currently not re-entrant, interpolated
code may not invoke the regex engine either directly (with C<m//> or C<s///>),
or indirectly with functions such as C<split>.
}
B<Note> that this pattern does not behave the same way as the equivalent
-PCRE or Python construct of the same form. In perl you can backtrack into
+PCRE or Python construct of the same form. In Perl you can backtrack into
a recursed group; in PCRE and Python the recursed-into group is treated
as atomic. Also, modifiers are resolved at compile time, so constructs
like (?i:(?1)) or (?:(?i)(?1)) do not affect how the sub-pattern will
=item C<(?&NAME)>
X<(?&NAME)>
-Recurse to a named subpattern. Identical to (?PARNO) except that the
-parenthesis to recurse to is determined by name. If multiple parens have
+Recurse to a named subpattern. Identical to C<(?PARNO)> except that the
+parenthesis to recurse to is determined by name. If multiple parentheses have
the same name, then it recurses to the leftmost.
It is an error to refer to a name that is not declared somewhere in the
B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines the pattern C<< (?P>NAME) >>
-maybe be used instead of C<< (?&NAME) >> as of Perl 5.10.
+may be used instead of C<< (?&NAME) >> in Perl 5.10 or later.
=item C<(?(condition)yes-pattern|no-pattern)>
X<(?()>
)/x
Note that capture buffers matched inside of recursion are not accessible
-after the recursion returns, so the extra layer of capturing buffers are
+after the recursion returns, so the extra layer of capturing buffers is
necessary. Thus C<$+{NAME_PAT}> would not be defined even though
C<$+{NAME}> would be.
=head2 Special Backtracking Control Verbs
B<WARNING:> These patterns are experimental and subject to change or
-removal in a future version of perl. Their usage in production code should
+removal in a future version of Perl. Their usage in production code should
be noted to avoid problems during upgrades.
These special patterns are generally of the form C<(*VERB:ARG)>. Unless
not match, then no further backtracking will take place, and the pattern
will fail outright at the current starting position.
-As a shortcut, X<\v> is exactly equivalent to C<(*PRUNE)>.
+As a shortcut, C<\v> is exactly equivalent to C<(*PRUNE)>.
The following example counts all the possible matching strings in a
pattern (without actually matching any of them).
to this position on failure and tries to match again, (assuming that
there is sufficient room to match).
-As a shortcut X<\V> is exactly equivalent to C<(*SKIP)>.
+As a shortcut C<\V> is exactly equivalent to C<(*SKIP)>.
The name of the C<(*SKIP:NAME)> pattern has special significance. If a
C<(*MARK:NAME)> was encountered while matching, then it is that position
This pattern matches nothing and causes the end of successful matching at
the point at which the C<(*ACCEPT)> pattern was encountered, regardless of
whether there is actually more to match in the string. When inside of a
-nested pattern, such as recursion or a dynamically generated subbpattern
+nested pattern, such as recursion, or in a subpattern dynamically generated
via C<(??{})>, only the innermost pattern is ended immediately.
If the C<(*ACCEPT)> is inside of capturing buffers then the buffers are
'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
will match, and C<$1> will be C<AB> and C<$2> will be C<B>, C<$3> will not
-be set. If another branch in the inner parens were matched, such as in the
+be set. If another branch in the inner parentheses were matched, such as in the
string 'ACDE', then the C<D> and C<E> would have to be matched as well.
=back
NOTE: This section presents an abstract approximation of regular
expression behavior. For a more rigorous (and complicated) view of
the rules involved in selecting a match among possible alternatives,
-see L<Combining pieces together>.
+see L<Combining RE Pieces>.
A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
-by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
+by all non-possessive regular expression quantifiers, namely C<*>, C<*?>, C<+>,
C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized
internally, but the general principle outlined here is valid.
if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
got <d is under the >
-Here's another example: let's say you'd like to match a number at the end
+Here's another example. Let's say you'd like to match a number at the end
of a string, and you also want to keep the preceding part of the match.
So you write this:
although the attempted matches are made at different positions because "a"
is not a zero-width assertion, but a one-width assertion.
-B<WARNING>: particularly complicated regular expressions can take
+B<WARNING>: Particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
-ways they can use backtracking to try match. For example, without
+ways they can use backtracking to try for a match. For example, without
internal optimizations done by the regular expression engine, this will
take a painfully long time to run:
with a special meaning described here or above. You can cause
characters that normally function as metacharacters to be interpreted
literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
-character; "\\" matches a "\"). A series of characters matches that
-series of characters in the target string, so the pattern C<blurfl>
-would match "blurfl" in the target string.
+character; "\\" matches a "\"). This escape mechanism is also required
+for the character used as the pattern delimiter.
+
+A series of characters matches that series of characters in the target
+string, so the pattern C<blurfl> would match "blurfl" in the target
+string.
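+
+For instance, with the default C</> delimiter a literal slash has to be
+escaped, while choosing another delimiter avoids the escapes (a minimal
+sketch; the path is just sample data):
+
+    my $path = "/usr/bin";
+    print "escaped\n" if $path =~ /^\/usr\/bin$/;
+    print "braces\n"  if $path =~ m{^/usr/bin$};
+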
You can specify a character class, by enclosing a list of characters
in C<[]>, which will match any character from the list. If the
Note also that the whole range idea is rather unportable between
character sets--and even within character sets they may cause results
you probably didn't expect. A sound principle is to use only ranges
-that begin from and end at either alphabets of equal case ([a-e],
+that begin from and end at either alphabetics of equal case ([a-e],
[A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt,
spell out the character sets in full.
1 matched "0x", even though the rule C<0|0x> could potentially match
the leading 0 in the second number.
-=head2 Warning on \1 vs $1
+=head2 Warning on \1 Instead of $1
Some people get too used to writing things like:
with the operation of matching a backreference. Certainly they mean two
different things on the I<left> side of the C<s///>.
-=head2 Repeated patterns matching zero-length substring
+=head2 Repeated Patterns Matching a Zero-length Substring
B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.
'foo' =~ m{ ( o? )* }x;
-The C<o?> can match at the beginning of C<'foo'>, and since the position
+The C<o?> matches at the beginning of C<'foo'>, and since the position
in the string is not moved by the match, C<o?> would match again and again
because of the C<*> modifier. Another common way to create a similar cycle
is with the looping modifier C<//g>:
Zero-length matches at the end of the previous match are ignored
during C<split>.
-=head2 Combining pieces together
+=head2 Combining RE Pieces
Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
whole regular expression: a match at an earlier position is always better
than a match at a later position.
-=head2 Creating custom RE engines
+=head2 Creating Custom RE Engines
Overloaded constants (see L<overload>) provide a simple way to extend
the functionality of the RE engine.
Suppose that we want to enable a new RE escape-sequence C<\Y|> which
-matches at boundary between whitespace characters and non-whitespace
+matches at a boundary between whitespace characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version. We can create a module C<customre> to do