C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
Operators">.
+
+=head2 Modifiers
+
Matching operations can have various modifiers. Modifiers
that relate to the interpretation of the regular expression inside
are listed below. Modifiers that alter the way a regular expression
=over 4
-=item i
-X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
-X<regular expression, case-insensitive>
-
-Do case-insensitive pattern matching.
-
-If C<use locale> is in effect, the case map is taken from the current
-locale. See L<perllocale>.
-
=item m
X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.
+=item i
+X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
+X<regular expression, case-insensitive>
+
+Do case-insensitive pattern matching.
+
+If C<use locale> is in effect, the case map is taken from the current
+locale. See L<perllocale>.
+
=item x
X</x>
Extend your pattern's legibility by permitting whitespace and comments.
+=item p
+X</p> X<regex, preserve> X<regexp, preserve>
+
+Preserve the string matched such that ${^PREMATCH}, {$^MATCH}, and
+${^POSTMATCH} are available for use after matching.
+
=back
These are usually written as "the C</x> modifier", even though the delimiter
=head3 Metacharacters
-The patterns used in Perl pattern matching derive from supplied in
+The patterns used in Perl pattern matching evolved from the ones supplied in
the Version 8 regex routines. (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.) See L<Version 8 Regular Expressions> for
(If a curly bracket occurs in any other context, it is treated
as a regular character. In particular, the lower bound
-is not optional.) The "*" modifier is equivalent to C<{0,}>, the "+"
-modifier to C<{1,}>, and the "?" modifier to C<{0,1}>. n and m are limited
+is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
+quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
be seen in the error message generated by code such as this:
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":
-X<metacharacter> X<greedy> X<greedyness>
+X<metacharacter> X<greedy> X<greediness>
X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>
- *? Match 0 or more times
- +? Match 1 or more times
- ?? Match 0 or 1 time
- {n}? Match exactly n times
- {n,}? Match at least n times
- {n,m}? Match at least n but not more than m times
+ *? Match 0 or more times, not greedily
+ +? Match 1 or more times, not greedily
+ ?? Match 0 or 1 time, not greedily
+ {n}? Match exactly n times, not greedily
+ {n,}? Match at least n times, not greedily
+ {n,m}? Match at least n but not more than m times, not greedily
By default, when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack. However, this behaviour is
-sometimes undesirable. Thus Perl provides the "possesive" quantifier form
+sometimes undesirable. Thus Perl provides the "possessive" quantifier form
as well.
- *+ Match 0 or more times and give nothing back
- ++ Match 1 or more times and give nothing back
- ?+ Match 0 or 1 time and give nothing back
+ *+ Match 0 or more times and give nothing back
+ ++ Match 1 or more times and give nothing back
+ ?+ Match 0 or 1 time and give nothing back
{n}+ Match exactly n times and give nothing back (redundant)
{n,}+ Match at least n times and give nothing back
{n,m}+ Match at least n but not more than m times and give nothing back
/"(?:[^"\\]++|\\.)*+"/
-as we know that if the final quote does not match, bactracking will not
+as we know that if the final quote does not match, backtracking will not
help. See the independent subexpression C<< (?>...) >> for more details;
possessive quantifiers are just syntactic sugar for that construct. For
instance the above example could also be written as follows:
Because patterns are processed as double quoted strings, the following
also work:
-X<\t> X<\n> X<\r> X<\f> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
+X<\t> X<\n> X<\r> X<\f> X<\e> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
X<\0> X<\c> X<\N> X<\x>
\t tab (HT, TAB)
\f form feed (FF)
\a alarm (bell) (BEL)
\e escape (think troff) (ESC)
- \033 octal char (think of a PDP-11)
- \x1B hex char
- \x{263a} wide hex char (Unicode SMILEY)
- \c[ control char
+ \033 octal char (example: ESC)
+ \x1B hex char (example: ESC)
+ \x{263a} wide hex char (example: Unicode SMILEY)
+ \cK control char (example: VT)
\N{name} named char
\l lowercase next char (think vi)
\u uppercase next char (think vi)
while escaping will cause the literal string C<\$> to be matched.
You'll need to write something like C<m/\Quser\E\@\Qhost/>.
-=head3 Character classes
+=head3 Character Classes and other Special Escapes
In addition, Perl defines the following:
-X<metacharacter>
X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
-X<word> X<whitespace>
+X<\g> X<\k> X<\N> X<\K> X<\v> X<\V> X<\h> X<\H>
+X<word> X<whitespace> X<character class> X<backreference>
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-"word" character
Unsupported in lookbehind.
\1 Backreference to a specific group.
'1' may actually be any positive integer.
- \R1 Relative backreference to a preceding closed group.
- '1' may actually be any positive integer.
+ \g1 Backreference to a specific or previous group,
+ \g{-1} number may be negative indicating a previous buffer and may
+ optionally be wrapped in curly brackets for safer parsing.
+ \g{name} Named backreference
\k<name> Named backreference
- \N{name} Named unicode character, or unicode escape
+ \N{name} Named Unicode character, or Unicode escape
\x12 Hexadecimal escape sequence
\x{1234} Long hexadecimal escape sequence
+ \K Keep the stuff left of the \K, don't include it in $&
+ \v Vertical whitespace
+ \V Not vertical whitespace
+ \h Horizontal whitespace
+ \H Not horizontal whitespace
+ \R Linebreak
A C<\w> matches a single alphanumeric character (an alphabetic
character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
as matching an English word). If C<use locale> is in effect, the list
of alphabetic characters generated by C<\w> is taken from the current
locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
-C<\d>, and C<\D> within character classes, but if you try to use them
-as endpoints of a range, that's not a range, the "-" is understood
-literally. If Unicode is in effect, C<\s> matches also "\x{85}",
-"\x{2028}, and "\x{2029}", see L<perlunicode> for more details about
-C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
-You can define your own C<\p> and C<\P> properties, see L<perlunicode>.
+C<\d>, and C<\D> within character classes, but they aren't usable
+as either end of a range. If any of them precedes or follows a "-",
+the "-" is understood literally. If Unicode is in effect, C<\s> matches
+also "\x{85}", "\x{2028}", and "\x{2029}". See L<perlunicode> for more
+details about C<\pP>, C<\PP>, C<\X> and the possibility of defining
+your own C<\p> and C<\P> properties, and L<perluniintro> about Unicode
+in general.
X<\w> X<\W> X<word>
+C<\R> will atomically match a linebreak, including the network line-ending
+"\x0D\x0A". Specifically, X<\R> is exactly equivelent to
+
+ (?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
+
+B<Note:> C<\R> has no special meaning inside of a character class;
+use C<\v> instead (vertical whitespace).
+X<\R>
+
The POSIX character class syntax
X<character class>
[:class:]
-is also available. Note that the C<[> and C<]> braces are I<literal>;
+is also available. Note that the C<[> and C<]> brackets are I<literal>;
they must always be used within a character class expression.
# this is correct:
=item [2]
Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
-also the (very rare) "vertical tabulator", "\ck", chr(11).
+also the (very rare) "vertical tabulator", "\cK" or chr(11) in ASCII.
=item [3]
[01[:alpha:]%]
-matches zero, one, any alphabetic character, and the percentage sign.
+matches zero, one, any alphabetic character, and the percent sign.
The following equivalences to Unicode \p{} constructs and equivalent
backslash character classes (if available), will hold:
alpha IsAlpha
alnum IsAlnum
ascii IsASCII
- blank IsSpace
+ blank
cntrl IsCntrl
digit IsDigit \d
graph IsGraph
Any control character. Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
backspace are control characters. All characters with ord() less than
-32 are most often classified as control characters (assuming ASCII,
+32 are usually classified as control characters (assuming ASCII,
the ISO Latin character sets, and Unicode), as is the character with
the ord() value of 127 (C<DEL>).
X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>
\b Match a word boundary
- \B Match a non-(word boundary)
+ \B Match except at a word boundary
\A Match only at beginning of string
\Z Match only at end of string, or before newline at the end
\z Match only at end of string
=head3 Capture buffers
-The bracketing construct C<( ... )> creates capture buffers. To
-refer to the digit'th buffer use \<digit> within the
-match. Outside the match use "$" instead of "\". (The
+The bracketing construct C<( ... )> creates capture buffers. To refer
+to the current contents of a buffer later on, within the same pattern,
+use \1 for the first, \2 for the second, and so on.
+Outside the match use "$" instead of "\". (The
\<digit> notation works in certain circumstances outside
the match. See the warning below about \1 vs $1 for details.)
Referring back to another part of the match is called a
before it. And so on. \1 through \9 are always interpreted as
backreferences.
-X<relative backreference>
-In Perl 5.10 it is possible to relatively address a capture buffer by
-using the C<\RNNN> notation, where C<NNN> is negative offset to a
-preceding capture buffer. Thus C<\R1> refers to the last buffer,
-C<\R2> refers to the buffer before that. For example:
+X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
+In order to provide a safer and easier way to construct patterns using
+backreferences, Perl 5.10 provides the C<\g{N}> notation. The curly
+brackets are optional, however omitting them is less safe as the meaning
+of the pattern can be changed by text (such as digits) following it.
+When N is a positive integer the C<\g{N}> notation is exactly equivalent
+to using normal backreferences. When N is a negative integer then it is
+a relative backreference referring to the previous N'th capturing group.
+When the bracket form is used and N is not an integer, it is treated as a
+reference to a named buffer.
+
+Thus C<\g{-1}> refers to the last buffer, C<\g{-2}> refers to the
+buffer before that. For example:
/
(Y) # buffer 1
( # buffer 2
(X) # buffer 3
- \R1 # backref to buffer 3
- \R3 # backref to buffer 1
+ \g{-1} # backref to buffer 3
+ \g{-3} # backref to buffer 1
)
/x
-and would match the same as C</(Y) ( (X) $3 $1 )/x>.
+and would match the same as C</(Y) ( (X) \3 \1 )/x>.
Additionally, as of Perl 5.10 you may use named capture buffers and named
-backreferences. The notation is C<< (?<name>...) >> and C<< \k<name> >>
-(you may also use single quotes instead of angle brackets to quote the
-name). The only difference with named capture buffers and unnamed ones is
-that multiple buffers may have the same name and that the contents of
-named capture buffers is available via the C<%+> hash. When multiple
-groups share the same name C<$+{name}> and C<< \k<name> >> refer to the
-leftmost defined group, thus it's possible to do things with named capture
-buffers that would otherwise require C<(??{})> code to accomplish. Named
-capture buffers are numbered just as normal capture buffers are and may be
-referenced via the magic numeric variables or via numeric backreferences
-as well as by name.
+backreferences. The notation is C<< (?<name>...) >> to declare and C<< \k<name> >>
+to reference. You may also use apostrophes instead of angle brackets to delimit the
+name; and you may use the bracketed C<< \g{name} >> backreference syntax.
+It's possible to refer to a named capture buffer by absolute and relative number as well.
+Outside the pattern, a named capture buffer is available via the C<%+> hash.
+When different buffers within the same pattern have the same name, C<$+{name}>
+and C<< \k<name> >> refer to the leftmost defined group. (Thus it's possible
+to do things with named capture buffers that would otherwise require C<(??{})>
+code to accomplish.)
+X<named capture buffer> X<regular expression, named capture buffer>
+X<%+> X<$+{name}> X<\k{name}>
Examples:
/(?<char>.)\k<char>/ # ... a different way
and print "'$+{char}' is the first doubled character\n";
- /(?<char>.)\1/ # ... mix and match
+ /(?'char'.)\1/ # ... mix and match
and print "'$1' is the first doubled character\n";
if (/Time: (..):(..):(..)/) { # parse out values
X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>
-B<NOTE>: failed matches in Perl do not reset the match variables,
+B<NOTE>: Failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.
other two.
X<$&> X<$`> X<$'>
+As a workaround for this problem, Perl 5.10 introduces C<${^PREMATCH}>,
+C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
+and C<$'>, B<except> that they are only guaranteed to be defined after a
+successful match that was executed with the C</p> (preserve) modifier.
+The use of these variables incurs no global performance penalty, unlike
+their punctuation char equivalents, however at the trade-off that you
+have to tell perl when you want to use them.
+X</p> X<p modifier>
+
Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric. So anything
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment.
-=item C<(?imsx-imsx)>
+=item C<(?pimsx-imsx)>
X<(?)>
One or more embedded pattern-match modifiers, to be turned on (or
turned off, if preceded by C<->) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any). This is
particularly useful for dynamic patterns, such as those read in from a
-configuration file, read in as an argument, are specified in a table
-somewhere, etc. Consider the case that some of which want to be case
-sensitive and some do not. The case insensitive ones need to include
-merely C<(?i)> at the front of the pattern. For example:
+configuration file, taken from an argument, or specified in a table
+somewhere. Consider the case where some patterns want to be case
+sensitive and some do not: The case insensitive ones merely need to
+include C<(?i)> at the front of the pattern. For example:
$pattern = "foobar";
if ( /$pattern/i ) { }
( (?i) blah ) \s+ \1
-will match a repeated (I<including the case>!) word C<blah> in any
-case, assuming C<x> modifier, and no C<i> modifier outside this
-group.
+will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
+repetition of the previous word, assuming the C</x> modifier, and no C</i>
+modifier outside this group.
+
+Note that the C<p> modifier is special in that it can only be enabled,
+not disabled, and that its presence anywhere in a pattern has a global
+effect. Thus C<(?-p)> and C<(?-p:...)> are meaningless and will warn
+when executed under C<use warnings>.
=item C<(?:pattern)>
X<(?:)>
/(?:(?s-i)more.*than).*million/i
+=item C<(?|pattern)>
+X<(?|)> X<Branch reset>
+
+This is the "branch reset" pattern, which has the special property
+that the capture buffers are numbered from the same starting point
+in each alternation branch. It is available starting from perl 5.10.
+
+Capture buffers are numbered from left to right, but inside this
+construct the numbering is restarted for each branch.
+
+The numbering within each branch will be as normal, and any buffers
+following this construct will be numbered as though the construct
+contained only one branch, that being the one with the most capture
+buffers in it.
+
+This construct will be useful when you want to capture one of a
+number of alternative matches.
+
+Consider the following pattern. The numbers underneath show in
+which buffer the captured content will be stored.
+
+
+ # before ---------------branch-reset----------- after
+ / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
+ # 1 2 2 3 2 3 4
+
+=item Look-Around Assertions
+X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>
+
+Look-around assertions are zero width patterns which match a specific
+pattern without including it in C<$&>. Positive assertions match when
+their subpattern matches, negative assertions match when their subpattern
+fails. Look-behind matches text up to the current match position,
+look-ahead matches text following the current match position.
+
+=over 4
+
=item C<(?=pattern)>
X<(?=)> X<look-ahead, positive> X<lookahead, positive>
For look-behind see below.
-=item C<(?<=pattern)>
-X<(?<=)> X<look-behind, positive> X<lookbehind, positive>
+=item C<(?<=pattern)> C<\K>
+X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>
A zero-width positive look-behind assertion. For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.
+There is a special form of this construct, called C<\K>, which causes the
+regex engine to "keep" everything it had matched prior to the C<\K> and
+not include it in C<$&>. This effectively provides variable length
+look-behind. The use of C<\K> inside of another look-around assertion
+is allowed, but the behaviour is currently not well defined.
+
+For various reasons C<\K> may be significantly more efficient than the
+equivalent C<< (?<=...) >> construct, and it is especially useful in
+situations where you want to efficiently remove something following
+something else in a string. For instance
+
+ s/(foo)bar/$1/g;
+
+can be rewritten as the much more efficient
+
+ s/foo\Kbar//g;
+
=item C<(?<!pattern)>
X<(?<!)> X<look-behind, negative> X<lookbehind, negative>
matches any occurrence of "foo" that does not follow "bar". Works
only for fixed-width look-behind.
+=back
+
=item C<(?'NAME'pattern)>
=item C<< (?<NAME>pattern) >>
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>
A named capture buffer. Identical in every respect to normal capturing
-parens C<()> but for the additional fact that C<%+> may be used after
-a succesful match to refer to a named buffer. See C<perlvar> for more
+parentheses C<()> but for the additional fact that C<%+> may be used after
+a successful match to refer to a named buffer. See C<perlvar> for more
details on the C<%+> hash.
If multiple distinct capture buffers have the same name then the
$+{NAME} will refer to the leftmost defined buffer in the match.
-The forms C<(?'NAME'pattern)> and C<(?<NAME>pattern)> are equivalent.
+The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.
B<NOTE:> While the notation of this construct is the same as the similar
-function in .NET regexes, the behavior is not, in Perl the buffers are
+function in .NET regexes, the behavior is not. In Perl the buffers are
numbered sequentially regardless of being named or not. Thus in the
pattern
$+{foo} will be the same as $2, and $3 will contain 'z' instead of
the opposite which is what a .NET regex hacker might expect.
-Currently NAME is restricted to word chars only. In other words, it
-must match C</^\w+$/>.
+Currently NAME is restricted to simple identifiers only.
+In other words, it must match C</^[_A-Za-z][_A-Za-z0-9]*\z/> or
+its Unicode extension (see L<utf8>),
+though it isn't extended by the locale (see L<perllocale>).
+
+B<NOTE:> In order to make things easier for programmers with experience
+with the Python or PCRE regex engines, the pattern C<< (?PE<lt>NAMEE<gt>pattern) >>
+may be used instead of C<< (?<NAME>pattern) >>; however this form does not
+support the use of single quotes as a delimiter for the name. This is
+only available in Perl 5.10 or later.
-=item C<< \k<name> >>
+=item C<< \k<NAME> >>
-=item C<< \k'name' >>
+=item C<< \k'NAME' >>
Named backreference. Similar to numeric backreferences, except that
the group is designated by name and not number. If multiple groups
have the same name then it refers to the leftmost defined group in
the current match.
-It is an error to refer to a name not defined by a C<(?<NAME>)>
+It is an error to refer to a name not defined by a C<< (?<NAME>) >>
earlier in the pattern.
Both forms are equivalent.
+B<NOTE:> In order to make things easier for programmers with experience
+with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
+may be used instead of C<< \k<NAME> >> in Perl 5.10 or later.
+
=item C<(?{ code })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>
# location.
>x;
-will set C<$res = 4>. Note that after the match, $cnt returns to the globally
+will set C<$res = 4>. Note that after the match, C<$cnt> returns to the globally
introduced value, because the scopes that restrict C<local> operators
are unwound.
variables contain results of C<qr//> operator (see
L<perlop/"qr/STRING/imosx">).
-This restriction is because of the wide-spread and remarkably convenient
+This restriction is due to the wide-spread and remarkably convenient
custom of using run-time determined strings as patterns. For example:
$re = <>;
Better yet, use the carefully constrained evaluation within a Safe
compartment. See L<perlsec> for details about both these mechanisms.
-Because perl's regex engine is not currently re-entrant, interpolated
+Because Perl's regex engine is currently not re-entrant, interpolated
code may not invoke the regex engine either directly with C<m//> or C<s///>),
or indirectly with functions such as C<split>.
}
B<Note> that this pattern does not behave the same way as the equivalent
-PCRE or Python construct of the same form. In perl you can backtrack into
+PCRE or Python construct of the same form. In Perl you can backtrack into
a recursed group, in PCRE and Python the recursed into group is treated
as atomic. Also, modifiers are resolved at compile time, so constructs
like (?i:(?1)) or (?:(?i)(?1)) do not affect how the sub-pattern will
=item C<(?&NAME)>
X<(?&NAME)>
-Recurse to a named subpattern. Identical to (?PARNO) except that the
-parenthesis to recurse to is determined by name. If multiple parens have
+Recurse to a named subpattern. Identical to C<(?PARNO)> except that the
+parenthesis to recurse to is determined by name. If multiple parentheses have
the same name, then it recurses to the leftmost.
It is an error to refer to a name that is not declared somewhere in the
pattern.
+B<NOTE:> In order to make things easier for programmers with experience
+with the Python or PCRE regex engines the pattern C<< (?P>NAME) >>
+may be used instead of C<< (?&NAME) >> in Perl 5.10 or later.
+
=item C<(?(condition)yes-pattern|no-pattern)>
X<(?()>
An example of how this might be used is as follows:
- /(?<NAME>(&NAME_PAT))(?<ADDR>(&ADDRESS_PAT))
+ /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
(?(DEFINE)
- (<NAME_PAT>....)
- (<ADRESS_PAT>....)
+ (?<NAME_PAT>....)
+ (?<ADRESS_PAT>....)
)/x
Note that capture buffers matched inside of recursion are not accessible
-after the recursion returns, so the extra layer of capturing buffers are
+after the recursion returns, so the extra layer of capturing buffers is
necessary. Thus C<$+{NAME_PAT}> would not be defined even though
C<$+{NAME}> would be.
=head2 Special Backtracking Control Verbs
B<WARNING:> These patterns are experimental and subject to change or
-removal in a future version of perl. Their usage in production code should
+removal in a future version of Perl. Their usage in production code should
be noted to avoid problems during upgrades.
These special patterns are generally of the form C<(*VERB:ARG)>. Unless
executed, the next startpoint will be where the cursor was when the
C<(*SKIP)> was executed.
-As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>.
-
=item C<(*MARK:NAME)> C<(*:NAME)>
X<(*MARK)> C<(*MARK:NAME)> C<(*:NAME)>
in the match.
This can be used to determine which branch of a pattern was matched
-without using a seperate capture buffer for each branch, which in turn
+without using a separate capture buffer for each branch, which in turn
can result in a performance improvement, as perl cannot optimize
C</(?:(x)|(y)|(z))/> as efficiently as something like
C</(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/>.
See C<(*SKIP)> for more details.
+As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>.
+
=item C<(*THEN)> C<(*THEN:NAME)>
This is similar to the "cut group" operator C<::> from Perl6. Like
This pattern matches nothing and causes the end of successful matching at
the point at which the C<(*ACCEPT)> pattern was encountered, regardless of
whether there is actually more to match in the string. When inside of a
-nested pattern, such as recursion or a dynamically generated subbpattern
+nested pattern, such as recursion, or in a subpattern dynamically generated
via C<(??{})>, only the innermost pattern is ended immediately.
If the C<(*ACCEPT)> is inside of capturing buffers then the buffers are
'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
will match, and C<$1> will be C<AB> and C<$2> will be C<B>, C<$3> will not
-be set. If another branch in the inner parens were matched, such as in the
+be set. If another branch in the inner parentheses were matched, such as in the
string 'ACDE', then the C<D> and C<E> would have to be matched as well.
=back
NOTE: This section presents an abstract approximation of regular
expression behavior. For a more rigorous (and complicated) view of
the rules involved in selecting a match among possible alternatives,
-see L<Combining pieces together>.
+see L<Combining RE Pieces>.
A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
-by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
+by all regular non-possessive expression quantifiers, namely C<*>, C<*?>, C<+>,
C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized
internally, but the general principle outlined here is valid.
if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
got <d is under the >
-Here's another example: let's say you'd like to match a number at the end
+Here's another example. Let's say you'd like to match a number at the end
of a string, and you also want to keep the preceding part of the match.
So you write this:
although the attempted matches are made at different positions because "a"
is not a zero-width assertion, but a one-width assertion.
-B<WARNING>: particularly complicated regular expressions can take
+B<WARNING>: Particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
-ways they can use backtracking to try match. For example, without
+ways they can use backtracking to try for a match. For example, without
internal optimizations done by the regular expression engine, this will
take a painfully long time to run:
with a special meaning described here or above. You can cause
characters that normally function as metacharacters to be interpreted
literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
-character; "\\" matches a "\"). A series of characters matches that
-series of characters in the target string, so the pattern C<blurfl>
-would match "blurfl" in the target string.
+character; "\\" matches a "\"). This escape mechanism is also required
+for the character used as the pattern delimiter.
+
+A series of characters matches that series of characters in the target
+string, so the pattern C<blurfl> would match "blurfl" in the target
+string.
You can specify a character class, by enclosing a list of characters
in C<[]>, which will match any character from the list. If the
Note also that the whole range idea is rather unportable between
character sets--and even within character sets they may cause results
you probably didn't expect. A sound principle is to use only ranges
-that begin from and end at either alphabets of equal case ([a-e],
+that begin from and end at either alphabetics of equal case ([a-e],
[A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt,
spell out the character sets in full.
1 matched "0x", even though the rule C<0|0x> could potentially match
the leading 0 in the second number.
-=head2 Warning on \1 vs $1
+=head2 Warning on \1 Instead of $1
Some people get too used to writing things like:
with the operation of matching a backreference. Certainly they mean two
different things on the I<left> side of the C<s///>.
-=head2 Repeated patterns matching zero-length substring
+=head2 Repeated Patterns Matching a Zero-length Substring
B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.
'foo' =~ m{ ( o? )* }x;
-The C<o?> can match at the beginning of C<'foo'>, and since the position
+The C<o?> matches at the beginning of C<'foo'>, and since the position
in the string is not moved by the match, C<o?> would match again and again
-because of the C<*> modifier. Another common way to create a similar cycle
+because of the C<*> quantifier. Another common way to create a similar cycle
is with the looping modifier C<//g>:
@matches = ( 'foo' =~ m{ o? }xg );
Thus Perl allows such constructs, by I<forcefully breaking
the infinite loop>. The rules for this are different for lower-level
-loops given by the greedy modifiers C<*+{}>, and for higher-level
+loops given by the greedy quantifiers C<*+{}>, and for higher-level
ones like the C</g> modifier or split() operator.
The lower-level loops are I<interrupted> (that is, the loop is
Zero-length matches at the end of the previous match are ignored
during C<split>.
-=head2 Combining pieces together
+=head2 Combining RE Pieces
Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
whole regular expression: a match at an earlier position is always better
than a match at a later position.
-=head2 Creating custom RE engines
+=head2 Creating Custom RE Engines
Overloaded constants (see L<overload>) provide a simple way to extend
the functionality of the RE engine.
Suppose that we want to enable a new RE escape-sequence C<\Y|> which
-matches at boundary between whitespace characters and non-whitespace
+matches at a boundary between whitespace characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version. We can create a module C<customre> to do
$re = customre::convert $re;
/\Y|$re\Y|/;
+=head1 PCRE/Python Support
+
+As of Perl 5.10 Perl supports several Python/PCRE specific extensions
+to the regex syntax. While Perl programmers are encouraged to use the
+Perl specific syntax, the following are legal in Perl 5.10:
+
+=over 4
+
+=item C<< (?PE<lt>NAMEE<gt>pattern) >>
+
+Define a named capture buffer. Equivalent to C<< (?<NAME>pattern) >>.
+
+=item C<< (?P=NAME) >>
+
+Backreference to a named capture buffer. Equivalent to C<< \g{NAME} >>.
+
+=item C<< (?P>NAME) >>
+
+Subroutine call to a named capture buffer. Equivalent to C<< (?&NAME) >>.
+
+=back
+
=head1 BUGS
This document varies from difficult to understand to completely