=item p
X</p> X<regex, preserve> X<regexp, preserve>
-Preserve the string matched such that ${^PREMATCH}, {$^MATCH}, and
+Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
${^POSTMATCH} are available for use after matching.
+=item g and c
+X</g> X</c>
+
+Global matching, and keep the Current position after failed matching.
+Unlike i, m, s and x, these two flags affect the way the regex is used
+rather than the regex itself. See
+L<perlretut/"Using regular expressions in Perl"> for further explanation
+of the g and c modifiers.
+
=back
These are usually written as "the C</x> modifier", even though the delimiter
=head3 Metacharacters
-The patterns used in Perl pattern matching evolved from the ones supplied in
+The patterns used in Perl pattern matching evolved from those supplied in
the Version 8 regex routines. (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.) See L<Version 8 Regular Expressions> for
\e escape (think troff) (ESC)
\033 octal char (example: ESC)
\x1B hex char (example: ESC)
- \x{263a} wide hex char (example: Unicode SMILEY)
+ \x{263a} long hex char (example: Unicode SMILEY)
\cK control char (example: VT)
- \N{name} named char
+ \N{name} named Unicode character
\l lowercase next char (think vi)
\u uppercase next char (think vi)
\L lowercase till \E (think vi)
In addition, Perl defines the following:
X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
-X<\g> X<\k> X<\N> X<\K> X<\v> X<\V>
+X<\g> X<\k> X<\N> X<\K> X<\v> X<\V> X<\h> X<\H>
X<word> X<whitespace> X<character class> X<backreference>
\w Match a "word" character (alphanumeric plus "_")
\D Match a non-digit character
\pP Match P, named property. Use \p{Prop} for longer names.
\PP Match non-P
- \X Match eXtended Unicode "combining character sequence",
- equivalent to (?:\PM\pM*)
+ \X Match Unicode "eXtended grapheme cluster"
\C Match a single C char (octet) even under Unicode.
NOTE: breaks up characters into their UTF-8 bytes,
so you may end up with malformed pieces of UTF-8.
optionally be wrapped in curly brackets for safer parsing.
\g{name} Named backreference
\k<name> Named backreference
- \N{name} Named unicode character, or unicode escape
- \x12 Hexadecimal escape sequence
- \x{1234} Long hexadecimal escape sequence
\K Keep the stuff left of the \K, don't include it in $&
+ \N Any character but \n
\v Vertical whitespace
\V Not vertical whitespace
\h Horizontal whitespace
X<\w> X<\W> X<word>
C<\R> will atomically match a linebreak, including the network line-ending
-"\x0D\x0A". Specifically, X<\R> is exactly equivelent to
+"\x0D\x0A". Specifically, X<\R> is exactly equivalent to
(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
# this is not, and will generate a warning:
$string =~ /[:alpha:]/;
-The available classes and their backslash equivalents (if available) are
-as follows:
-X<character class>
+The following table shows the mapping of POSIX character class
+names, common escapes, literal escape sequences and their equivalent
+Unicode style property names.
+X<character class> X<\p> X<\p{}>
X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
- alpha
- alnum
- ascii
- blank [1]
- cntrl
- digit \d
- graph
- lower
- print
- punct
- space \s [2]
- upper
- word \w [3]
- xdigit
+B<Note:> up to Perl 5.10 the property names used were shared with
+standard Unicode properties, this was changed in Perl 5.11, see
+L<perl5110delta> for details.
+
+ POSIX Esc Class Property Note
+ --------------------------------------------------------
+ alnum [0-9A-Za-z] IsPosixAlnum
+ alpha [A-Za-z] IsPosixAlpha
+ ascii [\000-\177] IsASCII
+ blank [\011 ] IsPosixBlank [1]
+ cntrl [\0-\37\177] IsPosixCntrl
+ digit \d [0-9] IsPosixDigit
+ graph [!-~] IsPosixGraph
+ lower [a-z] IsPosixLower
+ print [ -~] IsPosixPrint
+ punct [!-/:-@[-`{-~] IsPosixPunct
+ space [\11-\15 ] IsPosixSpace [2]
+ \s [\11\12\14\15 ] IsPerlSpace [2]
+ upper [A-Z] IsPosixUpper
+ word \w [0-9A-Z_a-z] IsPerlWord [3]
+ xdigit [0-9A-Fa-f] IsXDigit
=over
=item [2]
-Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
-also the (very rare) "vertical tabulator", "\cK" or chr(11) in ASCII.
+Note that C<\s> and C<[[:space:]]> are B<not> equivalent as C<[[:space:]]>
+includes also the (very rare) "vertical tabulator", "\cK" or chr(11) in
+ASCII.
=item [3]
matches zero, one, any alphabetic character, and the percent sign.
-The following equivalences to Unicode \p{} constructs and equivalent
-backslash character classes (if available), will hold:
-X<character class> X<\p> X<\p{}>
+=over 4
+
+=item C<$>
+
+Currency symbol
- [[:...:]] \p{...} backslash
+=item C<+> C<< < >> C<=> C<< > >> C<|> C<~>
- alpha IsAlpha
- alnum IsAlnum
- ascii IsASCII
- blank
- cntrl IsCntrl
- digit IsDigit \d
- graph IsGraph
- lower IsLower
- print IsPrint
- punct IsPunct
- space IsSpace
- IsSpacePerl \s
- upper IsUpper
- word IsWord
- xdigit IsXDigit
+Mathematical symbols
-For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
+=item C<^> C<`>
-If the C<utf8> pragma is not used but the C<locale> pragma is, the
-classes correlate with the usual isalpha(3) interface (except for
-"word" and "blank").
+Modifier symbols (accents)
-The assumedly non-obviously named classes are:
+
+=back
+
+The other named classes are:
=over 4
POSIX traditional Unicode
- [[:^digit:]] \D \P{IsDigit}
- [[:^space:]] \S \P{IsSpace}
- [[:^word:]] \W \P{IsWord}
+ [[:^digit:]] \D \P{IsPosixDigit}
+ [[:^space:]] \S \P{IsPosixSpace}
+ [[:^word:]] \W \P{IsPerlWord}
Perl respects the POSIX standard in that POSIX character classes are
only supported within a character class. The POSIX character classes
backreference only if at least 11 left parentheses have opened
before it. And so on. \1 through \9 are always interpreted as
backreferences.
+If the bracketing group did not match, the associated backreference won't
+match either. (This can happen if the bracketing group is optional, or
+in a different branch of an alternation.)
X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
In order to provide a safer and easier way to construct patterns using
-backreferences, Perl 5.10 provides the C<\g{N}> notation. The curly
-brackets are optional, however omitting them is less safe as the meaning
-of the pattern can be changed by text (such as digits) following it.
-When N is a positive integer the C<\g{N}> notation is exactly equivalent
-to using normal backreferences. When N is a negative integer then it is
-a relative backreference referring to the previous N'th capturing group.
-When the bracket form is used and N is not an integer, it is treated as a
-reference to a named buffer.
+backreferences, Perl provides the C<\g{N}> notation (starting with perl
+5.10.0). The curly brackets are optional, however omitting them is less
+safe as the meaning of the pattern can be changed by text (such as digits)
+following it. When N is a positive integer the C<\g{N}> notation is
+exactly equivalent to using normal backreferences. When N is a negative
+integer then it is a relative backreference referring to the previous N'th
+capturing group. When the bracket form is used and N is not an integer, it
+is treated as a reference to a named buffer.
Thus C<\g{-1}> refers to the last buffer, C<\g{-2}> refers to the
buffer before that. For example:
and would match the same as C</(Y) ( (X) \3 \1 )/x>.
-Additionally, as of Perl 5.10 you may use named capture buffers and named
+Additionally, as of Perl 5.10.0 you may use named capture buffers and named
backreferences. The notation is C<< (?<name>...) >> to declare and C<< \k<name> >>
to reference. You may also use apostrophes instead of angle brackets to delimit the
name; and you may use the bracketed C<< \g{name} >> backreference syntax.
to do things with named capture buffers that would otherwise require C<(??{})>
code to accomplish.)
X<named capture buffer> X<regular expression, named capture buffer>
-X<%+> X<$+{name}> X<\k{name}>
+X<%+> X<$+{name}> X<< \k<name> >>
Examples:
other two.
X<$&> X<$`> X<$'>
-As a workaround for this problem, Perl 5.10 introduces C<${^PREMATCH}>,
+As a workaround for this problem, Perl 5.10.0 introduces C<${^PREMATCH}>,
C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
and C<$'>, B<except> that they are only guaranteed to be defined after a
successful match that was executed with the C</p> (preserve) modifier.
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment.
-=item C<(?kimsx-imsx)>
+=item C<(?pimsx-imsx)>
X<(?)>
One or more embedded pattern-match modifiers, to be turned on (or
repetition of the previous word, assuming the C</x> modifier, and no C</i>
modifier outside this group.
-Note that the C<k> modifier is special in that it can only be enabled,
+These modifiers do not carry over into named subpatterns called in the
+enclosing group. In other words, a pattern such as C<((?i)(&NAME))> does not
+change the case-sensitivity of the "NAME" pattern.
+
+Note that the C<p> modifier is special in that it can only be enabled,
not disabled, and that its presence anywhere in a pattern has a global
-effect. Thus C<(?-k)> and C<(?-k:...)> are meaningless and will warn
+effect. Thus C<(?-p)> and C<(?-p:...)> are meaningless and will warn
when executed under C<use warnings>.
=item C<(?:pattern)>
This is the "branch reset" pattern, which has the special property
that the capture buffers are numbered from the same starting point
-in each alternation branch. It is available starting from perl 5.10.
+in each alternation branch. It is available starting from perl 5.10.0.
Capture buffers are numbered from left to right, but inside this
construct the numbering is restarted for each branch.
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1 2 2 3 2 3 4
+Be careful when using the branch reset pattern in combination with
+named captures. Named captures are implemented as being aliases to
+numbered buffers holding the captures, and that interferes with the
+implementation of the branch reset pattern. If you are using named
+captures in a branch reset pattern, it's best to use the same names,
+in the same order, in each of the alternations:
+
+ /(?| (?<a> x ) (?<b> y )
+ | (?<a> z ) (?<b> w )) /x
+
+Not doing so may lead to surprises:
+
+ "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
+ say $+ {a}; # Prints '12'
+ say $+ {b}; # *Also* prints '12'.
+
+The problem here is that both the buffer named C<< a >> and the buffer
+named C<< b >> are aliases for the buffer belonging to C<< $1 >>.
+
=item Look-Around Assertions
X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>
A named capture buffer. Identical in every respect to normal capturing
-parentheses C<()> but for the additional fact that C<%+> may be used after
-a successful match to refer to a named buffer. See C<perlvar> for more
-details on the C<%+> hash.
+parentheses C<()> but for the additional fact that C<%+> or C<%-> may be
+used after a successful match to refer to a named buffer. See C<perlvar>
+for more details on the C<%+> and C<%-> hashes.
If multiple distinct capture buffers have the same name then the
$+{NAME} will refer to the leftmost defined buffer in the match.
B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?PE<lt>NAMEE<gt>pattern) >>
may be used instead of C<< (?<NAME>pattern) >>; however this form does not
-support the use of single quotes as a delimiter for the name. This is
-only available in Perl 5.10 or later.
+support the use of single quotes as a delimiter for the name.
=item C<< \k<NAME> >>
B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
-may be used instead of C<< \k<NAME> >> in Perl 5.10 or later.
+may be used instead of C<< \k<NAME> >>.
=item C<(?{ code })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.
-Due to an unfortunate implementation issue, the Perl code contained in these
-blocks is treated as a compile time closure that can have seemingly bizarre
-consequences when used with lexically scoped variables inside of subroutines
-or loops. There are various workarounds for this, including simply using
-global variables instead. If you are using this construct and strange results
-occur then check for the use of lexically scoped variables.
-
For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
Better yet, use the carefully constrained evaluation within a Safe
compartment. See L<perlsec> for details about both these mechanisms.
-Because Perl's regex engine is currently not re-entrant, interpolated
-code may not invoke the regex engine either directly with C<m//> or C<s///>),
-or indirectly with functions such as C<split>.
+B<WARNING>: Use of lexical (C<my>) variables in these blocks is
+broken. The result is unpredictable and will make perl unstable. The
+workaround is to use global (C<our>) variables.
+
+B<WARNING>: Because Perl's regex engine is currently not re-entrant,
+interpolated code may not invoke the regex engine either directly with
+C<m//> or C<s///>), or indirectly with functions such as
+C<split>. Invoking the regex engine in these blocks will make perl
+unstable.
=item C<(??{ code })>
X<(??{})>
See also C<(?PARNO)> for a different, more efficient way to accomplish
the same task.
+For reasons of security, this construct is forbidden if the regular
+expression involves run-time interpolation of variables, unless the
+perilous C<use re 'eval'> pragma has been used (see L<re>), or the
+variables contain results of C<qr//> operator (see
+L<perlop/"qr/STRING/imosx">).
+
Because perl's regex engine is not currently re-entrant, delayed
code may not invoke the regex engine either directly with C<m//> or C<s///>),
or indirectly with functions such as C<split>.
B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines the pattern C<< (?P>NAME) >>
-may be used instead of C<< (?&NAME) >> in Perl 5.10 or later.
+may be used instead of C<< (?&NAME) >>.
=item C<(?(condition)yes-pattern|no-pattern)>
X<(?()>
forbidden.
Any pattern containing a special backtracking verb that allows an argument
-has the special behaviour that when executed it sets the current packages'
+has the special behaviour that when executed it sets the current package's
C<$REGERROR> and C<$REGMARK> variables. When doing so the following
rules apply:
=over 4
=item C<(*PRUNE)> C<(*PRUNE:NAME)>
-X<(*PRUNE)> X<(*PRUNE:NAME)> X<\v>
+X<(*PRUNE)> X<(*PRUNE:NAME)>
This zero-width pattern prunes the backtracking tree at the current point
when backtracked into on failure. Consider the pattern C<A (*PRUNE) B>,
not match, then no further backtracking will take place, and the pattern
will fail outright at the current starting position.
-As a shortcut, C<\v> is exactly equivalent to C<(*PRUNE)>.
-
The following example counts all the possible matching strings in a
pattern (without actually matching any of them).
print "Count=$count\n";
we prevent backtracking and find the count of the longest matching
-at each matching startpoint like so:
+at each matching starting point like so:
aaab
aab
to this position on failure and tries to match again, (assuming that
there is sufficient room to match).
-As a shortcut C<\V> is exactly equivalent to C<(*SKIP)>.
-
The name of the C<(*SKIP:NAME)> pattern has special significance. If a
C<(*MARK:NAME)> was encountered while matching, then it is that position
which is used as the "skip point". If no C<(*MARK)> of that name was
Count=2
Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)>
-executed, the next startpoint will be where the cursor was when the
+executed, the next starting point will be where the cursor was when the
C<(*SKIP)> was executed.
=item C<(*MARK:NAME)> C<(*:NAME)>
when a certain part of the pattern has been successfully matched. This
mark may be given a name. A later C<(*SKIP)> pattern will then skip
forward to that point if backtracked into on failure. Any number of
-C<(*MARK)> patterns are allowed, and the NAME portion is optional and may
-be duplicated.
+C<(*MARK)> patterns are allowed, and the NAME portion may be duplicated.
In addition to interacting with the C<(*SKIP)> pattern, C<(*MARK:NAME)>
can be used to "label" a pattern branch, so that after matching, the
=item C<(*THEN)> C<(*THEN:NAME)>
-This is similar to the "cut group" operator C<::> from Perl6. Like
+This is similar to the "cut group" operator C<::> from Perl 6. Like
C<(*PRUNE)>, this verb always matches, and when backtracked into on
failure, it causes the regex engine to try the next alternation in the
innermost enclosing group (capturing or otherwise).
=item C<(*COMMIT)>
X<(*COMMIT)>
-This is the Perl6 "commit pattern" C<< <commit> >> or C<:::>. It's a
+This is the Perl 6 "commit pattern" C<< <commit> >> or C<:::>. It's a
zero-width pattern similar to C<(*SKIP)>, except that when backtracked
into on failure it causes the match to fail outright. No further attempts
to find a valid match by advancing the start pointer will occur again.
=head1 PCRE/Python Support
-As of Perl 5.10 Perl supports several Python/PCRE specific extensions
+As of Perl 5.10.0, Perl supports several Python/PCRE specific extensions
to the regex syntax. While Perl programmers are encouraged to use the
-Perl specific syntax, the following are legal in Perl 5.10:
+Perl specific syntax, the following are also accepted:
=over 4