C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.
+
+=head2 Modifiers
+
Matching operations can have various modifiers. Modifiers
that relate to the interpretation of the regular expression inside
are listed below. Modifiers that alter the way a regular expression
=over 4
-=item i
-X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
-X<regular expression, case-insensitive>
-
-Do case-insensitive pattern matching.
-
-If C<use locale> is in effect, the case map is taken from the current
-locale. See L<perllocale>.
-
=item m
X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.
+=item i
+X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
+X<regular expression, case-insensitive>
+
+Do case-insensitive pattern matching.
+
+If C<use locale> is in effect, the case map is taken from the current
+locale. See L<perllocale>.
+
=item x
X</x>
Extend your pattern's legibility by permitting whitespace and comments.
+=item p
+X</p> X<regex, preserve> X<regexp, preserve>
+
+Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
+${^POSTMATCH} are available for use after matching.
+
+=item g and c
+X</g> X</c>
+
+Global matching, and keep the Current position after failed matching.
+Unlike i, m, s and x, these two flags affect the way the regex is used
+rather than the regex itself. See
+L<perlretut/"Using regular expressions in Perl"> for further explanation
+of the g and c modifiers.
+
=back
These are usually written as "the C</x> modifier", even though the delimiter
=head3 Metacharacters
-The patterns used in Perl pattern matching derive from supplied in
+The patterns used in Perl pattern matching evolved from those supplied in
the Version 8 regex routines. (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.) See L<Version 8 Regular Expressions> for
(If a curly bracket occurs in any other context, it is treated
as a regular character. In particular, the lower bound
-is not optional.) The "*" modifier is equivalent to C<{0,}>, the "+"
-modifier to C<{1,}>, and the "?" modifier to C<{0,1}>. n and m are limited
+is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
+quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
be seen in the error message generated by code such as this:
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":
-X<metacharacter> X<greedy> X<greedyness>
+X<metacharacter> X<greedy> X<greediness>
X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>
- *? Match 0 or more times
- +? Match 1 or more times
- ?? Match 0 or 1 time
- {n}? Match exactly n times
- {n,}? Match at least n times
- {n,m}? Match at least n but not more than m times
+ *? Match 0 or more times, not greedily
+ +? Match 1 or more times, not greedily
+ ?? Match 0 or 1 time, not greedily
+ {n}? Match exactly n times, not greedily
+ {n,}? Match at least n times, not greedily
+ {n,m}? Match at least n but not more than m times, not greedily
By default, when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack. However, this behaviour is
-sometimes undesirable. Thus Perl provides the "possesive" quantifier form
+sometimes undesirable. Thus Perl provides the "possessive" quantifier form
as well.
- *+ Match 0 or more times and give nothing back
- ++ Match 1 or more times and give nothing back
- ?+ Match 0 or 1 time and give nothing back
+ *+ Match 0 or more times and give nothing back
+ ++ Match 1 or more times and give nothing back
+ ?+ Match 0 or 1 time and give nothing back
{n}+ Match exactly n times and give nothing back (redundant)
{n,}+ Match at least n times and give nothing back
{n,m}+ Match at least n but not more than m times and give nothing back
/"(?:[^"\\]++|\\.)*+"/
-as we know that if the final quote does not match, bactracking will not
+as we know that if the final quote does not match, backtracking will not
help. See the independent subexpression C<< (?>...) >> for more details;
possessive quantifiers are just syntactic sugar for that construct. For
instance the above example could also be written as follows:
Because patterns are processed as double quoted strings, the following
also work:
-X<\t> X<\n> X<\r> X<\f> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
+X<\t> X<\n> X<\r> X<\f> X<\e> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
X<\0> X<\c> X<\N> X<\x>
\t tab (HT, TAB)
\f form feed (FF)
\a alarm (bell) (BEL)
\e escape (think troff) (ESC)
- \033 octal char (think of a PDP-11)
- \x1B hex char
- \x{263a} wide hex char (Unicode SMILEY)
- \c[ control char
- \N{name} named char
+ \033 octal char (example: ESC)
+ \x1B hex char (example: ESC)
+ \x{263a} long hex char (example: Unicode SMILEY)
+ \cK control char (example: VT)
+ \N{name} named Unicode character
\l lowercase next char (think vi)
\u uppercase next char (think vi)
\L lowercase till \E (think vi)
while escaping will cause the literal string C<\$> to be matched.
You'll need to write something like C<m/\Quser\E\@\Qhost/>.
-=head3 Character classes
+=head3 Character Classes and other Special Escapes
In addition, Perl defines the following:
-X<metacharacter>
X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
-X<word> X<whitespace>
+X<\g> X<\k> X<\N> X<\K> X<\v> X<\V> X<\h> X<\H>
+X<word> X<whitespace> X<character class> X<backreference>
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-"word" character
\D Match a non-digit character
\pP Match P, named property. Use \p{Prop} for longer names.
\PP Match non-P
- \X Match eXtended Unicode "combining character sequence",
- equivalent to (?:\PM\pM*)
+ \X Match Unicode "eXtended grapheme cluster"
\C Match a single C char (octet) even under Unicode.
NOTE: breaks up characters into their UTF-8 bytes,
so you may end up with malformed pieces of UTF-8.
optionally be wrapped in curly brackets for safer parsing.
\g{name} Named backreference
\k<name> Named backreference
- \N{name} Named unicode character, or unicode escape
- \x12 Hexadecimal escape sequence
- \x{1234} Long hexadecimal escape sequence
+ \K Keep the stuff left of the \K, don't include it in $&
+ \N Any character but \n
+ \v Vertical whitespace
+ \V Not vertical whitespace
+ \h Horizontal whitespace
+ \H Not horizontal whitespace
+ \R Linebreak
A C<\w> matches a single alphanumeric character (an alphabetic
character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
as matching an English word). If C<use locale> is in effect, the list
of alphabetic characters generated by C<\w> is taken from the current
locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
-C<\d>, and C<\D> within character classes, but if you try to use them
-as endpoints of a range, that's not a range, the "-" is understood
-literally. If Unicode is in effect, C<\s> matches also "\x{85}",
-"\x{2028}, and "\x{2029}", see L<perlunicode> for more details about
-C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
-You can define your own C<\p> and C<\P> properties, see L<perlunicode>.
+C<\d>, and C<\D> within character classes, but they aren't usable
+as either end of a range. If any of them precedes or follows a "-",
+the "-" is understood literally. If Unicode is in effect, C<\s> matches
+also "\x{85}", "\x{2028}", and "\x{2029}". See L<perlunicode> for more
+details about C<\pP>, C<\PP>, C<\X> and the possibility of defining
+your own C<\p> and C<\P> properties, and L<perluniintro> about Unicode
+in general.
X<\w> X<\W> X<word>
+C<\R> will atomically match a linebreak, including the network line-ending
+"\x0D\x0A". Specifically, X<\R> is exactly equivalent to
+
+ (?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
+
+B<Note:> C<\R> has no special meaning inside of a character class;
+use C<\v> instead (vertical whitespace).
+X<\R>
+
The POSIX character class syntax
X<character class>
[:class:]
-is also available. Note that the C<[> and C<]> braces are I<literal>;
+is also available. Note that the C<[> and C<]> brackets are I<literal>;
they must always be used within a character class expression.
# this is correct:
# this is not, and will generate a warning:
$string =~ /[:alpha:]/;
-The available classes and their backslash equivalents (if available) are
-as follows:
-X<character class>
+The following table shows the mapping of POSIX character class
+names, common escapes, literal escape sequences and their equivalent
+Unicode style property names.
+X<character class> X<\p> X<\p{}>
X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
- alpha
- alnum
- ascii
- blank [1]
- cntrl
- digit \d
- graph
- lower
- print
- punct
- space \s [2]
- upper
- word \w [3]
- xdigit
+B<Note:> up to Perl 5.10 the property names used were shared with
+standard Unicode properties, this was changed in Perl 5.11, see
+L<perl5110delta> for details.
+
+ POSIX Esc Class Property Note
+ --------------------------------------------------------
+ alnum [0-9A-Za-z] IsPosixAlnum
+ alpha [A-Za-z] IsPosixAlpha
+ ascii [\000-\177] IsASCII
+ blank [\011 ] IsPosixBlank [1]
+ cntrl [\0-\37\177] IsPosixCntrl
+ digit \d [0-9] IsPosixDigit
+ graph [!-~] IsPosixGraph
+ lower [a-z] IsPosixLower
+ print [ -~] IsPosixPrint
+ punct [!-/:-@[-`{-~] IsPosixPunct
+ space [\11-\15 ] IsPosixSpace [2]
+ \s [\11\12\14\15 ] IsPerlSpace [2]
+ upper [A-Z] IsPosixUpper
+ word \w [0-9A-Z_a-z] IsPerlWord [3]
+ xdigit [0-9A-Fa-f] IsXDigit
=over
=item [2]
-Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
-also the (very rare) "vertical tabulator", "\ck", chr(11).
+Note that C<\s> and C<[[:space:]]> are B<not> equivalent as C<[[:space:]]>
+includes also the (very rare) "vertical tabulator", "\cK" or chr(11) in
+ASCII.
=item [3]
[01[:alpha:]%]
-matches zero, one, any alphabetic character, and the percentage sign.
+matches zero, one, any alphabetic character, and the percent sign.
-The following equivalences to Unicode \p{} constructs and equivalent
-backslash character classes (if available), will hold:
-X<character class> X<\p> X<\p{}>
+=over 4
+
+=item C<$>
+
+Currency symbol
- [[:...:]] \p{...} backslash
+=item C<+> C<< < >> C<=> C<< > >> C<|> C<~>
- alpha IsAlpha
- alnum IsAlnum
- ascii IsASCII
- blank IsSpace
- cntrl IsCntrl
- digit IsDigit \d
- graph IsGraph
- lower IsLower
- print IsPrint
- punct IsPunct
- space IsSpace
- IsSpacePerl \s
- upper IsUpper
- word IsWord
- xdigit IsXDigit
+Mathematical symbols
-For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
+=item C<^> C<`>
-If the C<utf8> pragma is not used but the C<locale> pragma is, the
-classes correlate with the usual isalpha(3) interface (except for
-"word" and "blank").
+Modifier symbols (accents)
-The assumedly non-obviously named classes are:
+
+=back
+
+The other named classes are:
=over 4
Any control character. Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
backspace are control characters. All characters with ord() less than
-32 are most often classified as control characters (assuming ASCII,
+32 are usually classified as control characters (assuming ASCII,
the ISO Latin character sets, and Unicode), as is the character with
the ord() value of 127 (C<DEL>).
POSIX traditional Unicode
- [[:^digit:]] \D \P{IsDigit}
- [[:^space:]] \S \P{IsSpace}
- [[:^word:]] \W \P{IsWord}
+ [[:^digit:]] \D \P{IsPosixDigit}
+ [[:^space:]] \S \P{IsPosixSpace}
+ [[:^word:]] \W \P{IsPerlWord}
Perl respects the POSIX standard in that POSIX character classes are
only supported within a character class. The POSIX character classes
X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>
\b Match a word boundary
- \B Match a non-(word boundary)
+ \B Match except at a word boundary
\A Match only at beginning of string
\Z Match only at end of string, or before newline at the end
\z Match only at end of string
=head3 Capture buffers
-The bracketing construct C<( ... )> creates capture buffers. To
-refer to the digit'th buffer use \<digit> within the
-match. Outside the match use "$" instead of "\". (The
+The bracketing construct C<( ... )> creates capture buffers. To refer
+to the current contents of a buffer later on, within the same pattern,
+use \1 for the first, \2 for the second, and so on.
+Outside the match use "$" instead of "\". (The
\<digit> notation works in certain circumstances outside
the match. See the warning below about \1 vs $1 for details.)
Referring back to another part of the match is called a
backreference only if at least 11 left parentheses have opened
before it. And so on. \1 through \9 are always interpreted as
backreferences.
+If the bracketing group did not match, the associated backreference won't
+match either. (This can happen if the bracketing group is optional, or
+in a different branch of an alternation.)
X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
In order to provide a safer and easier way to construct patterns using
-backrefs, in Perl 5.10 the C<\g{N}> notation is provided. The curly
-brackets are optional, however omitting them is less safe as the meaning
-of the pattern can be changed by text (such as digits) following it.
-When N is a positive integer the C<\g{N}> notation is exactly equivalent
-to using normal backreferences. When N is a negative integer then it is
-a relative backreference referring to the previous N'th capturing group.
-When the bracket form is used and N is not an integer, it is treated as a
-reference to a named buffer.
+backreferences, Perl provides the C<\g{N}> notation (starting with perl
+5.10.0). The curly brackets are optional, however omitting them is less
+safe as the meaning of the pattern can be changed by text (such as digits)
+following it. When N is a positive integer the C<\g{N}> notation is
+exactly equivalent to using normal backreferences. When N is a negative
+integer then it is a relative backreference referring to the previous N'th
+capturing group. When the bracket form is used and N is not an integer, it
+is treated as a reference to a named buffer.
Thus C<\g{-1}> refers to the last buffer, C<\g{-2}> refers to the
buffer before that. For example:
and would match the same as C</(Y) ( (X) \3 \1 )/x>.
-Additionally, as of Perl 5.10 you may use named capture buffers and named
+Additionally, as of Perl 5.10.0 you may use named capture buffers and named
backreferences. The notation is C<< (?<name>...) >> to declare and C<< \k<name> >>
-to reference. You may also use single quotes instead of angle brackets to quote the
-name; and you may use the bracketed C<< \g{name} >> back reference syntax.
-The only difference between named capture buffers and unnamed ones is
-that multiple buffers may have the same name and that the contents of
-named capture buffers are available via the C<%+> hash. When multiple
-groups share the same name C<$+{name}> and C<< \k<name> >> refer to the
-leftmost defined group, thus it's possible to do things with named capture
-buffers that would otherwise require C<(??{})> code to accomplish. Named
-capture buffers are numbered just as normal capture buffers are and may be
-referenced via the magic numeric variables or via numeric backreferences
-as well as by name.
+to reference. You may also use apostrophes instead of angle brackets to delimit the
+name; and you may use the bracketed C<< \g{name} >> backreference syntax.
+It's possible to refer to a named capture buffer by absolute and relative number as well.
+Outside the pattern, a named capture buffer is available via the C<%+> hash.
+When different buffers within the same pattern have the same name, C<$+{name}>
+and C<< \k<name> >> refer to the leftmost defined group. (Thus it's possible
+to do things with named capture buffers that would otherwise require C<(??{})>
+code to accomplish.)
+X<named capture buffer> X<regular expression, named capture buffer>
+X<%+> X<$+{name}> X<< \k<name> >>
Examples:
/(?<char>.)\k<char>/ # ... a different way
and print "'$+{char}' is the first doubled character\n";
- /(?<char>.)\1/ # ... mix and match
+ /(?'char'.)\1/ # ... mix and match
and print "'$1' is the first doubled character\n";
if (/Time: (..):(..):(..)/) { # parse out values
X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>
-B<NOTE>: failed matches in Perl do not reset the match variables,
+B<NOTE>: Failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.
other two.
X<$&> X<$`> X<$'>
+As a workaround for this problem, Perl 5.10.0 introduces C<${^PREMATCH}>,
+C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
+and C<$'>, B<except> that they are only guaranteed to be defined after a
+successful match that was executed with the C</p> (preserve) modifier.
+The use of these variables incurs no global performance penalty, unlike
+their punctuation char equivalents, however at the trade-off that you
+have to tell perl when you want to use them.
+X</p> X<p modifier>
+
Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric. So anything
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment.
-=item C<(?imsx-imsx)>
+=item C<(?pimsx-imsx)>
X<(?)>
One or more embedded pattern-match modifiers, to be turned on (or
turned off, if preceded by C<->) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any). This is
particularly useful for dynamic patterns, such as those read in from a
-configuration file, read in as an argument, are specified in a table
-somewhere, etc. Consider the case that some of which want to be case
-sensitive and some do not. The case insensitive ones need to include
-merely C<(?i)> at the front of the pattern. For example:
+configuration file, taken from an argument, or specified in a table
+somewhere. Consider the case where some patterns want to be case
+sensitive and some do not: The case insensitive ones merely need to
+include C<(?i)> at the front of the pattern. For example:
$pattern = "foobar";
if ( /$pattern/i ) { }
( (?i) blah ) \s+ \1
-will match a repeated (I<including the case>!) word C<blah> in any
-case, assuming C<x> modifier, and no C<i> modifier outside this
-group.
+will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
+repetition of the previous word, assuming the C</x> modifier, and no C</i>
+modifier outside this group.
+
+These modifiers do not carry over into named subpatterns called in the
+enclosing group. In other words, a pattern such as C<((?i)(&NAME))> does not
+change the case-sensitivity of the "NAME" pattern.
+
+Note that the C<p> modifier is special in that it can only be enabled,
+not disabled, and that its presence anywhere in a pattern has a global
+effect. Thus C<(?-p)> and C<(?-p:...)> are meaningless and will warn
+when executed under C<use warnings>.
=item C<(?:pattern)>
X<(?:)>
/(?:(?s-i)more.*than).*million/i
+=item C<(?|pattern)>
+X<(?|)> X<Branch reset>
+
+This is the "branch reset" pattern, which has the special property
+that the capture buffers are numbered from the same starting point
+in each alternation branch. It is available starting from perl 5.10.0.
+
+Capture buffers are numbered from left to right, but inside this
+construct the numbering is restarted for each branch.
+
+The numbering within each branch will be as normal, and any buffers
+following this construct will be numbered as though the construct
+contained only one branch, that being the one with the most capture
+buffers in it.
+
+This construct will be useful when you want to capture one of a
+number of alternative matches.
+
+Consider the following pattern. The numbers underneath show in
+which buffer the captured content will be stored.
+
+
+ # before ---------------branch-reset----------- after
+ / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
+ # 1 2 2 3 2 3 4
+
+Be careful when using the branch reset pattern in combination with
+named captures. Named captures are implemented as being aliases to
+numbered buffers holding the captures, and that interferes with the
+implementation of the branch reset pattern. If you are using named
+captures in a branch reset pattern, it's best to use the same names,
+in the same order, in each of the alternations:
+
+ /(?| (?<a> x ) (?<b> y )
+ | (?<a> z ) (?<b> w )) /x
+
+Not doing so may lead to surprises:
+
+ "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
+ say $+ {a}; # Prints '12'
+ say $+ {b}; # *Also* prints '12'.
+
+The problem here is that both the buffer named C<< a >> and the buffer
+named C<< b >> are aliases for the buffer belonging to C<< $1 >>.
+
+=item Look-Around Assertions
+X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>
+
+Look-around assertions are zero width patterns which match a specific
+pattern without including it in C<$&>. Positive assertions match when
+their subpattern matches, negative assertions match when their subpattern
+fails. Look-behind matches text up to the current match position,
+look-ahead matches text following the current match position.
+
+=over 4
+
=item C<(?=pattern)>
X<(?=)> X<look-ahead, positive> X<lookahead, positive>
For look-behind see below.
-=item C<(?<=pattern)>
-X<(?<=)> X<look-behind, positive> X<lookbehind, positive>
+=item C<(?<=pattern)> C<\K>
+X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>
A zero-width positive look-behind assertion. For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.
+There is a special form of this construct, called C<\K>, which causes the
+regex engine to "keep" everything it had matched prior to the C<\K> and
+not include it in C<$&>. This effectively provides variable length
+look-behind. The use of C<\K> inside of another look-around assertion
+is allowed, but the behaviour is currently not well defined.
+
+For various reasons C<\K> may be significantly more efficient than the
+equivalent C<< (?<=...) >> construct, and it is especially useful in
+situations where you want to efficiently remove something following
+something else in a string. For instance
+
+ s/(foo)bar/$1/g;
+
+can be rewritten as the much more efficient
+
+ s/foo\Kbar//g;
+
=item C<(?<!pattern)>
X<(?<!)> X<look-behind, negative> X<lookbehind, negative>
matches any occurrence of "foo" that does not follow "bar". Works
only for fixed-width look-behind.
+=back
+
=item C<(?'NAME'pattern)>
=item C<< (?<NAME>pattern) >>
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>
A named capture buffer. Identical in every respect to normal capturing
-parens C<()> but for the additional fact that C<%+> may be used after
-a succesful match to refer to a named buffer. See C<perlvar> for more
-details on the C<%+> hash.
+parentheses C<()> but for the additional fact that C<%+> or C<%-> may be
+used after a successful match to refer to a named buffer. See C<perlvar>
+for more details on the C<%+> and C<%-> hashes.
If multiple distinct capture buffers have the same name then the
$+{NAME} will refer to the leftmost defined buffer in the match.
-The forms C<(?'NAME'pattern)> and C<(?<NAME>pattern)> are equivalent.
+The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.
B<NOTE:> While the notation of this construct is the same as the similar
-function in .NET regexes, the behavior is not, in Perl the buffers are
+function in .NET regexes, the behavior is not. In Perl the buffers are
numbered sequentially regardless of being named or not. Thus in the
pattern
though it isn't extended by the locale (see L<perllocale>).
B<NOTE:> In order to make things easier for programmers with experience
-with the Python or PCRE regex engines the pattern C<< (?P<NAME>pattern) >>
-maybe be used instead of C<< (?<NAME>pattern) >>; however this form does not
-support the use of single quotes as a delimiter for the name. This is
-only available in Perl 5.10 or later.
+with the Python or PCRE regex engines, the pattern C<< (?PE<lt>NAMEE<gt>pattern) >>
+may be used instead of C<< (?<NAME>pattern) >>; however this form does not
+support the use of single quotes as a delimiter for the name.
=item C<< \k<NAME> >>
have the same name then it refers to the leftmost defined group in
the current match.
-It is an error to refer to a name not defined by a C<(?<NAME>)>
+It is an error to refer to a name not defined by a C<< (?<NAME>) >>
earlier in the pattern.
Both forms are equivalent.
B<NOTE:> In order to make things easier for programmers with experience
-with the Python or PCRE regex engines the pattern C<< (?P=NAME) >>
-maybe be used instead of C<< \k<NAME> >> in Perl 5.10 or later.
+with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
+may be used instead of C<< \k<NAME> >>.
=item C<(?{ code })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>
# location.
>x;
-will set C<$res = 4>. Note that after the match, $cnt returns to the globally
+will set C<$res = 4>. Note that after the match, C<$cnt> returns to the globally
introduced value, because the scopes that restrict C<local> operators
are unwound.
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.
-Due to an unfortunate implementation issue, the Perl code contained in these
-blocks is treated as a compile time closure that can have seemingly bizarre
-consequences when used with lexically scoped variables inside of subroutines
-or loops. There are various workarounds for this, including simply using
-global variables instead. If you are using this construct and strange results
-occur then check for the use of lexically scoped variables.
-
For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of C<qr//> operator (see
L<perlop/"qr/STRING/imosx">).
-This restriction is because of the wide-spread and remarkably convenient
+This restriction is due to the wide-spread and remarkably convenient
custom of using run-time determined strings as patterns. For example:
$re = <>;
Better yet, use the carefully constrained evaluation within a Safe
compartment. See L<perlsec> for details about both these mechanisms.
-Because perl's regex engine is not currently re-entrant, interpolated
-code may not invoke the regex engine either directly with C<m//> or C<s///>),
-or indirectly with functions such as C<split>.
+B<WARNING>: Use of lexical (C<my>) variables in these blocks is
+broken. The result is unpredictable and will make perl unstable. The
+workaround is to use global (C<our>) variables.
+
+B<WARNING>: Because Perl's regex engine is currently not re-entrant,
+interpolated code may not invoke the regex engine either directly with
+C<m//> or C<s///>), or indirectly with functions such as
+C<split>. Invoking the regex engine in these blocks will make perl
+unstable.
=item C<(??{ code })>
X<(??{})>
See also C<(?PARNO)> for a different, more efficient way to accomplish
the same task.
+For reasons of security, this construct is forbidden if the regular
+expression involves run-time interpolation of variables, unless the
+perilous C<use re 'eval'> pragma has been used (see L<re>), or the
+variables contain results of C<qr//> operator (see
+L<perlop/"qr/STRING/imosx">).
+
Because perl's regex engine is not currently re-entrant, delayed
code may not invoke the regex engine either directly with C<m//> or C<s///>),
or indirectly with functions such as C<split>.
}
B<Note> that this pattern does not behave the same way as the equivalent
-PCRE or Python construct of the same form. In perl you can backtrack into
+PCRE or Python construct of the same form. In Perl you can backtrack into
a recursed group, in PCRE and Python the recursed into group is treated
as atomic. Also, modifiers are resolved at compile time, so constructs
like (?i:(?1)) or (?:(?i)(?1)) do not affect how the sub-pattern will
=item C<(?&NAME)>
X<(?&NAME)>
-Recurse to a named subpattern. Identical to (?PARNO) except that the
-parenthesis to recurse to is determined by name. If multiple parens have
+Recurse to a named subpattern. Identical to C<(?PARNO)> except that the
+parenthesis to recurse to is determined by name. If multiple parentheses have
the same name, then it recurses to the leftmost.
It is an error to refer to a name that is not declared somewhere in the
B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines the pattern C<< (?P>NAME) >>
-maybe be used instead of C<< (?&NAME) >> as of Perl 5.10.
+may be used instead of C<< (?&NAME) >>.
=item C<(?(condition)yes-pattern|no-pattern)>
X<(?()>
)/x
Note that capture buffers matched inside of recursion are not accessible
-after the recursion returns, so the extra layer of capturing buffers are
+after the recursion returns, so the extra layer of capturing buffers is
necessary. Thus C<$+{NAME_PAT}> would not be defined even though
C<$+{NAME}> would be.
=head2 Special Backtracking Control Verbs
B<WARNING:> These patterns are experimental and subject to change or
-removal in a future version of perl. Their usage in production code should
+removal in a future version of Perl. Their usage in production code should
be noted to avoid problems during upgrades.
These special patterns are generally of the form C<(*VERB:ARG)>. Unless
forbidden.
Any pattern containing a special backtracking verb that allows an argument
-has the special behaviour that when executed it sets the current packages'
+has the special behaviour that when executed it sets the current package's
C<$REGERROR> and C<$REGMARK> variables. When doing so the following
rules apply:
print "Count=$count\n";
we prevent backtracking and find the count of the longest matching
-at each matching startpoint like so:
+at each matching starting point like so:
aaab
aab
Count=2
Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)>
-executed, the next startpoint will be where the cursor was when the
+executed, the next starting point will be where the cursor was when the
C<(*SKIP)> was executed.
=item C<(*MARK:NAME)> C<(*:NAME)>
when a certain part of the pattern has been successfully matched. This
mark may be given a name. A later C<(*SKIP)> pattern will then skip
forward to that point if backtracked into on failure. Any number of
-C<(*MARK)> patterns are allowed, and the NAME portion is optional and may
-be duplicated.
+C<(*MARK)> patterns are allowed, and the NAME portion may be duplicated.
In addition to interacting with the C<(*SKIP)> pattern, C<(*MARK:NAME)>
can be used to "label" a pattern branch, so that after matching, the
in the match.
This can be used to determine which branch of a pattern was matched
-without using a seperate capture buffer for each branch, which in turn
+without using a separate capture buffer for each branch, which in turn
can result in a performance improvement, as perl cannot optimize
C</(?:(x)|(y)|(z))/> as efficiently as something like
C</(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/>.
=item C<(*THEN)> C<(*THEN:NAME)>
-This is similar to the "cut group" operator C<::> from Perl6. Like
+This is similar to the "cut group" operator C<::> from Perl 6. Like
C<(*PRUNE)>, this verb always matches, and when backtracked into on
failure, it causes the regex engine to try the next alternation in the
innermost enclosing group (capturing or otherwise).
=item C<(*COMMIT)>
X<(*COMMIT)>
-This is the Perl6 "commit pattern" C<< <commit> >> or C<:::>. It's a
+This is the Perl 6 "commit pattern" C<< <commit> >> or C<:::>. It's a
zero-width pattern similar to C<(*SKIP)>, except that when backtracked
into on failure it causes the match to fail outright. No further attempts
to find a valid match by advancing the start pointer will occur again.
This pattern matches nothing and causes the end of successful matching at
the point at which the C<(*ACCEPT)> pattern was encountered, regardless of
whether there is actually more to match in the string. When inside of a
-nested pattern, such as recursion or a dynamically generated subbpattern
+nested pattern, such as recursion, or in a subpattern dynamically generated
via C<(??{})>, only the innermost pattern is ended immediately.
If the C<(*ACCEPT)> is inside of capturing buffers then the buffers are
'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
will match, and C<$1> will be C<AB> and C<$2> will be C<B>, C<$3> will not
-be set. If another branch in the inner parens were matched, such as in the
+be set. If another branch in the inner parentheses were matched, such as in the
string 'ACDE', then the C<D> and C<E> would have to be matched as well.
=back
NOTE: This section presents an abstract approximation of regular
expression behavior. For a more rigorous (and complicated) view of
the rules involved in selecting a match among possible alternatives,
-see L<Combining pieces together>.
+see L<Combining RE Pieces>.
A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
-by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
+by all regular non-possessive expression quantifiers, namely C<*>, C<*?>, C<+>,
C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized
internally, but the general principle outlined here is valid.
if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
got <d is under the >
-Here's another example: let's say you'd like to match a number at the end
+Here's another example. Let's say you'd like to match a number at the end
of a string, and you also want to keep the preceding part of the match.
So you write this:
although the attempted matches are made at different positions because "a"
is not a zero-width assertion, but a one-width assertion.
-B<WARNING>: particularly complicated regular expressions can take
+B<WARNING>: Particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
-ways they can use backtracking to try match. For example, without
+ways they can use backtracking to try for a match. For example, without
internal optimizations done by the regular expression engine, this will
take a painfully long time to run:
with a special meaning described here or above. You can cause
characters that normally function as metacharacters to be interpreted
literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
-character; "\\" matches a "\"). A series of characters matches that
-series of characters in the target string, so the pattern C<blurfl>
-would match "blurfl" in the target string.
+character; "\\" matches a "\"). This escape mechanism is also required
+for the character used as the pattern delimiter.
+
+A series of characters matches that series of characters in the target
+string, so the pattern C<blurfl> would match "blurfl" in the target
+string.
You can specify a character class, by enclosing a list of characters
in C<[]>, which will match any character from the list. If the
Note also that the whole range idea is rather unportable between
character sets--and even within character sets they may cause results
you probably didn't expect. A sound principle is to use only ranges
-that begin from and end at either alphabets of equal case ([a-e],
+that begin from and end at either alphabetics of equal case ([a-e],
[A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt,
spell out the character sets in full.
1 matched "0x", even though the rule C<0|0x> could potentially match
the leading 0 in the second number.
-=head2 Warning on \1 vs $1
+=head2 Warning on \1 Instead of $1
Some people get too used to writing things like:
with the operation of matching a backreference. Certainly they mean two
different things on the I<left> side of the C<s///>.
-=head2 Repeated patterns matching zero-length substring
+=head2 Repeated Patterns Matching a Zero-length Substring
B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.
'foo' =~ m{ ( o? )* }x;
-The C<o?> can match at the beginning of C<'foo'>, and since the position
+The C<o?> matches at the beginning of C<'foo'>, and since the position
in the string is not moved by the match, C<o?> would match again and again
-because of the C<*> modifier. Another common way to create a similar cycle
+because of the C<*> quantifier. Another common way to create a similar cycle
is with the looping modifier C<//g>:
@matches = ( 'foo' =~ m{ o? }xg );
Thus Perl allows such constructs, by I<forcefully breaking
the infinite loop>. The rules for this are different for lower-level
-loops given by the greedy modifiers C<*+{}>, and for higher-level
+loops given by the greedy quantifiers C<*+{}>, and for higher-level
ones like the C</g> modifier or split() operator.
The lower-level loops are I<interrupted> (that is, the loop is
Zero-length matches at the end of the previous match are ignored
during C<split>.
-=head2 Combining pieces together
+=head2 Combining RE Pieces
Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
whole regular expression: a match at an earlier position is always better
than a match at a later position.
-=head2 Creating custom RE engines
+=head2 Creating Custom RE Engines
Overloaded constants (see L<overload>) provide a simple way to extend
the functionality of the RE engine.
Suppose that we want to enable a new RE escape-sequence C<\Y|> which
-matches at boundary between whitespace characters and non-whitespace
+matches at a boundary between whitespace characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version. We can create a module C<customre> to do
=head1 PCRE/Python Support
-As of Perl 5.10 Perl supports several Python/PCRE specific extensions
+As of Perl 5.10.0, Perl supports several Python/PCRE specific extensions
to the regex syntax. While Perl programmers are encouraged to use the
-Perl specific syntax, the following are legal in Perl 5.10:
+Perl specific syntax, the following are also accepted:
=over 4
-=item C<< (?P<NAME>pattern) >>
+=item C<< (?PE<lt>NAMEE<gt>pattern) >>
Define a named capture buffer. Equivalent to C<< (?<NAME>pattern) >>.
Subroutine call to a named capture buffer. Equivalent to C<< (?&NAME) >>.
-=back 4
+=back
=head1 BUGS