=over 4
-=item i
-X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
-X<regular expression, case-insensitive>
-
-Do case-insensitive pattern matching.
-
-If C<use locale> is in effect, the case map is taken from the current
-locale. See L<perllocale>.
-
=item m
X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.
+=item i
+X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
+X<regular expression, case-insensitive>
+
+Do case-insensitive pattern matching.
+
+If C<use locale> is in effect, the case map is taken from the current
+locale. See L<perllocale>.
+
=item x
X</x>
Extend your pattern's legibility by permitting whitespace and comments.
+=item p
+X</p> X<regex, preserve> X<regexp, preserve>
+
+Preserve the string matched such that ${^PREMATCH}, {$^MATCH}, and
+${^POSTMATCH} are available for use after matching.
+
+=item g and c
+X</g> X</c>
+
+Global matching, and keep the Current position after failed matching.
+Unlike i, m, s and x, these two flags affect the way the regex is used
+rather than the regex itself. See
+L<perlretut/"Using regular expressions in Perl"> for further explanation
+of the g and c modifiers.
+
=back
These are usually written as "the C</x> modifier", even though the delimiter
=head3 Metacharacters
-The patterns used in Perl pattern matching evolved from the ones supplied in
+The patterns used in Perl pattern matching evolved from those supplied in
the Version 8 regex routines. (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.) See L<Version 8 Regular Expressions> for
(If a curly bracket occurs in any other context, it is treated
as a regular character. In particular, the lower bound
-is not optional.) The "*" modifier is equivalent to C<{0,}>, the "+"
-modifier to C<{1,}>, and the "?" modifier to C<{0,1}>. n and m are limited
+is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
+quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
to integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
be seen in the error message generated by code such as this:
\e escape (think troff) (ESC)
\033 octal char (example: ESC)
\x1B hex char (example: ESC)
- \x{263a} wide hex char (example: Unicode SMILEY)
+ \x{263a} long hex char (example: Unicode SMILEY)
\cK control char (example: VT)
- \N{name} named char
+ \N{name} named Unicode character
\l lowercase next char (think vi)
\u uppercase next char (think vi)
\L lowercase till \E (think vi)
while escaping will cause the literal string C<\$> to be matched.
You'll need to write something like C<m/\Quser\E\@\Qhost/>.
-=head3 Character classes
+=head3 Character Classes and other Special Escapes
In addition, Perl defines the following:
X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
-X<\g> X<\k> X<\N> X<\K> X<\v> X<\V>
+X<\g> X<\k> X<\N> X<\K> X<\v> X<\V> X<\h> X<\H>
X<word> X<whitespace> X<character class> X<backreference>
\w Match a "word" character (alphanumeric plus "_")
\pP Match P, named property. Use \p{Prop} for longer names.
\PP Match non-P
\X Match eXtended Unicode "combining character sequence",
- equivalent to (?:\PM\pM*)
+ equivalent to (?>\PM\pM*)
\C Match a single C char (octet) even under Unicode.
NOTE: breaks up characters into their UTF-8 bytes,
so you may end up with malformed pieces of UTF-8.
optionally be wrapped in curly brackets for safer parsing.
\g{name} Named backreference
\k<name> Named backreference
- \N{name} Named unicode character, or unicode escape
- \x12 Hexadecimal escape sequence
- \x{1234} Long hexadecimal escape sequence
\K Keep the stuff left of the \K, don't include it in $&
- \v Shortcut for (*PRUNE)
- \V Shortcut for (*SKIP)
+ \N Any character but \n
+ \v Vertical whitespace
+ \V Not vertical whitespace
+ \h Horizontal whitespace
+ \H Not horizontal whitespace
+ \R Linebreak
A C<\w> matches a single alphanumeric character (an alphabetic
character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
C<\d>, and C<\D> within character classes, but they aren't usable
as either end of a range. If any of them precedes or follows a "-",
the "-" is understood literally. If Unicode is in effect, C<\s> matches
-also "\x{85}", "\x{2028}, and "\x{2029}". See L<perlunicode> for more
+also "\x{85}", "\x{2028}", and "\x{2029}". See L<perlunicode> for more
details about C<\pP>, C<\PP>, C<\X> and the possibility of defining
your own C<\p> and C<\P> properties, and L<perluniintro> about Unicode
in general.
X<\w> X<\W> X<word>
+C<\R> will atomically match a linebreak, including the network line-ending
+"\x0D\x0A". Specifically, X<\R> is exactly equivalent to
+
+ (?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
+
+B<Note:> C<\R> has no special meaning inside of a character class;
+use C<\v> instead (vertical whitespace).
+X<\R>
+
The POSIX character class syntax
X<character class>
# this is not, and will generate a warning:
$string =~ /[:alpha:]/;
-The available classes and their backslash equivalents (if available) are
-as follows:
-X<character class>
+The following table shows the mapping of POSIX character class
+names, common escapes, literal escape sequences and their equivalent
+Unicode style property names.
+X<character class> X<\p> X<\p{}>
X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
- alpha
- alnum
- ascii
- blank [1]
- cntrl
- digit \d
- graph
- lower
- print
- punct
- space \s [2]
- upper
- word \w [3]
- xdigit
+B<Note:> up to Perl 5.10 the property names used were shared with
+standard Unicode properties, this was changed in Perl 5.11, see
+L<perl5110delta> for details.
+
+ POSIX Esc Class Property Note
+ --------------------------------------------------------
+ alnum [0-9A-Za-z] IsPosixAlnum
+ alpha [A-Za-z] IsPosixAlpha
+ ascii [\000-\177] IsASCII
+ blank [\011 ] IsPosixBlank [1]
+ cntrl [\0-\37\177] IsPosixCntrl
+ digit \d [0-9] IsPosixDigit
+ graph [!-~] IsPosixGraph
+ lower [a-z] IsPosixLower
+ print [ -~] IsPosixPrint
+ punct [!-/:-@[-`{-~] IsPosixPunct
+ space [\11-\15 ] IsPosixSpace [2]
+ \s [\11\12\14\15 ] IsPerlSpace [2]
+ upper [A-Z] IsPosixUpper
+ word \w [0-9A-Z_a-z] IsPerlWord [3]
+ xdigit [0-9A-Fa-f] IsXDigit
=over
=item [2]
-Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
-also the (very rare) "vertical tabulator", "\cK" or chr(11) in ASCII.
+Note that C<\s> and C<[[:space:]]> are B<not> equivalent as C<[[:space:]]>
+includes also the (very rare) "vertical tabulator", "\cK" or chr(11) in
+ASCII.
=item [3]
matches zero, one, any alphabetic character, and the percent sign.
-The following equivalences to Unicode \p{} constructs and equivalent
-backslash character classes (if available), will hold:
-X<character class> X<\p> X<\p{}>
+=item C<$>
- [[:...:]] \p{...} backslash
+Currency symbol
- alpha IsAlpha
- alnum IsAlnum
- ascii IsASCII
- blank
- cntrl IsCntrl
- digit IsDigit \d
- graph IsGraph
- lower IsLower
- print IsPrint
- punct IsPunct
- space IsSpace
- IsSpacePerl \s
- upper IsUpper
- word IsWord
- xdigit IsXDigit
+=item C<+> C<< < >> C<=> C<< > >> C<|> C<~>
-For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
+Mathematical symbols
-If the C<utf8> pragma is not used but the C<locale> pragma is, the
-classes correlate with the usual isalpha(3) interface (except for
-"word" and "blank").
+=item C<^> C<`>
+
+Modifier symbols (accents)
+
+=back
+
+=back
-The assumedly non-obviously named classes are:
+The other named classes are:
=over 4
POSIX traditional Unicode
- [[:^digit:]] \D \P{IsDigit}
- [[:^space:]] \S \P{IsSpace}
- [[:^word:]] \W \P{IsWord}
+ [[:^digit:]] \D \P{IsPosixDigit}
+ [[:^space:]] \S \P{IsPosixSpace}
+ [[:^word:]] \W \P{IsPerlWord}
Perl respects the POSIX standard in that POSIX character classes are
only supported within a character class. The POSIX character classes
X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
In order to provide a safer and easier way to construct patterns using
-backreferences, Perl 5.10 provides the C<\g{N}> notation. The curly
-brackets are optional, however omitting them is less safe as the meaning
-of the pattern can be changed by text (such as digits) following it.
-When N is a positive integer the C<\g{N}> notation is exactly equivalent
-to using normal backreferences. When N is a negative integer then it is
-a relative backreference referring to the previous N'th capturing group.
-When the bracket form is used and N is not an integer, it is treated as a
-reference to a named buffer.
+backreferences, Perl provides the C<\g{N}> notation (starting with perl
+5.10.0). The curly brackets are optional, however omitting them is less
+safe as the meaning of the pattern can be changed by text (such as digits)
+following it. When N is a positive integer the C<\g{N}> notation is
+exactly equivalent to using normal backreferences. When N is a negative
+integer then it is a relative backreference referring to the previous N'th
+capturing group. When the bracket form is used and N is not an integer, it
+is treated as a reference to a named buffer.
Thus C<\g{-1}> refers to the last buffer, C<\g{-2}> refers to the
buffer before that. For example:
and would match the same as C</(Y) ( (X) \3 \1 )/x>.
-Additionally, as of Perl 5.10 you may use named capture buffers and named
+Additionally, as of Perl 5.10.0 you may use named capture buffers and named
backreferences. The notation is C<< (?<name>...) >> to declare and C<< \k<name> >>
to reference. You may also use apostrophes instead of angle brackets to delimit the
name; and you may use the bracketed C<< \g{name} >> backreference syntax.
to do things with named capture buffers that would otherwise require C<(??{})>
code to accomplish.)
X<named capture buffer> X<regular expression, named capture buffer>
-X<%+> X<$+{name}> X<\k{name}>
+X<%+> X<$+{name}> X<< \k<name> >>
Examples:
other two.
X<$&> X<$`> X<$'>
-As a workaround for this problem, Perl 5.10 introduces C<${^PREMATCH}>,
+As a workaround for this problem, Perl 5.10.0 introduces C<${^PREMATCH}>,
C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
and C<$'>, B<except> that they are only guaranteed to be defined after a
-successful match that was executed with the C</k> (keep-copy) modifier.
+successful match that was executed with the C</p> (preserve) modifier.
The use of these variables incurs no global performance penalty, unlike
their punctuation char equivalents, however at the trade-off that you
have to tell perl when you want to use them.
-X</k> X<k modifier>
+X</p> X<p modifier>
Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
the comment as soon as it sees a C<)>, so there is no way to put a literal
C<)> in the comment.
-=item C<(?kimsx-imsx)>
+=item C<(?pimsx-imsx)>
X<(?)>
One or more embedded pattern-match modifiers, to be turned on (or
repetition of the previous word, assuming the C</x> modifier, and no C</i>
modifier outside this group.
-Note that the C<k> modifier is special in that it can only be enabled,
+Note that the C<p> modifier is special in that it can only be enabled,
not disabled, and that its presence anywhere in a pattern has a global
-effect. Thus C<(?-k)> and C<(?-k:...)> are meaningless and will warn
+effect. Thus C<(?-p)> and C<(?-p:...)> are meaningless and will warn
when executed under C<use warnings>.
=item C<(?:pattern)>
/(?:(?s-i)more.*than).*million/i
+=item C<(?|pattern)>
+X<(?|)> X<Branch reset>
+
+This is the "branch reset" pattern, which has the special property
+that the capture buffers are numbered from the same starting point
+in each alternation branch. It is available starting from perl 5.10.0.
+
+Capture buffers are numbered from left to right, but inside this
+construct the numbering is restarted for each branch.
+
+The numbering within each branch will be as normal, and any buffers
+following this construct will be numbered as though the construct
+contained only one branch, that being the one with the most capture
+buffers in it.
+
+This construct will be useful when you want to capture one of a
+number of alternative matches.
+
+Consider the following pattern. The numbers underneath show in
+which buffer the captured content will be stored.
+
+
+ # before ---------------branch-reset----------- after
+ / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
+ # 1 2 2 3 2 3 4
+
+Note: as of Perl 5.10.0, branch resets interfere with the contents of
+the C<%+> hash, that holds named captures. Consider using C<%-> instead.
+
=item Look-Around Assertions
X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>
look-behind. The use of C<\K> inside of another look-around assertion
is allowed, but the behaviour is currently not well defined.
-For various reasons C<\K> may be signifigantly more efficient than the
+For various reasons C<\K> may be significantly more efficient than the
equivalent C<< (?<=...) >> construct, and it is especially useful in
situations where you want to efficiently remove something following
something else in a string. For instance
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>
A named capture buffer. Identical in every respect to normal capturing
-parentheses C<()> but for the additional fact that C<%+> may be used after
-a succesful match to refer to a named buffer. See C<perlvar> for more
-details on the C<%+> hash.
+parentheses C<()> but for the additional fact that C<%+> or C<%-> may be
+used after a successful match to refer to a named buffer. See C<perlvar>
+for more details on the C<%+> and C<%-> hashes.
If multiple distinct capture buffers have the same name then the
$+{NAME} will refer to the leftmost defined buffer in the match.
B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?PE<lt>NAMEE<gt>pattern) >>
may be used instead of C<< (?<NAME>pattern) >>; however this form does not
-support the use of single quotes as a delimiter for the name. This is
-only available in Perl 5.10 or later.
+support the use of single quotes as a delimiter for the name.
=item C<< \k<NAME> >>
B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
-may be used instead of C<< \k<NAME> >> in Perl 5.10 or later.
+may be used instead of C<< \k<NAME> >>.
=item C<(?{ code })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>
B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines the pattern C<< (?P>NAME) >>
-may be used instead of C<< (?&NAME) >> in Perl 5.10 or later.
+may be used instead of C<< (?&NAME) >>.
=item C<(?(condition)yes-pattern|no-pattern)>
X<(?()>
=over 4
=item C<(*PRUNE)> C<(*PRUNE:NAME)>
-X<(*PRUNE)> X<(*PRUNE:NAME)> X<\v>
+X<(*PRUNE)> X<(*PRUNE:NAME)>
This zero-width pattern prunes the backtracking tree at the current point
when backtracked into on failure. Consider the pattern C<A (*PRUNE) B>,
not match, then no further backtracking will take place, and the pattern
will fail outright at the current starting position.
-As a shortcut, C<\v> is exactly equivalent to C<(*PRUNE)>.
-
The following example counts all the possible matching strings in a
pattern (without actually matching any of them).
print "Count=$count\n";
we prevent backtracking and find the count of the longest matching
-at each matching startpoint like so:
+at each matching starting point like so:
aaab
aab
to this position on failure and tries to match again, (assuming that
there is sufficient room to match).
-As a shortcut C<\V> is exactly equivalent to C<(*SKIP)>.
-
The name of the C<(*SKIP:NAME)> pattern has special significance. If a
C<(*MARK:NAME)> was encountered while matching, then it is that position
which is used as the "skip point". If no C<(*MARK)> of that name was
Count=2
Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)>
-executed, the next startpoint will be where the cursor was when the
+executed, the next starting point will be where the cursor was when the
C<(*SKIP)> was executed.
=item C<(*MARK:NAME)> C<(*:NAME)>
in the match.
This can be used to determine which branch of a pattern was matched
-without using a seperate capture buffer for each branch, which in turn
+without using a separate capture buffer for each branch, which in turn
can result in a performance improvement, as perl cannot optimize
C</(?:(x)|(y)|(z))/> as efficiently as something like
C</(?:x(*MARK:x)|y(*MARK:y)|z(*MARK:z))/>.
=item C<(*THEN)> C<(*THEN:NAME)>
-This is similar to the "cut group" operator C<::> from Perl6. Like
+This is similar to the "cut group" operator C<::> from Perl 6. Like
C<(*PRUNE)>, this verb always matches, and when backtracked into on
failure, it causes the regex engine to try the next alternation in the
innermost enclosing group (capturing or otherwise).
=item C<(*COMMIT)>
X<(*COMMIT)>
-This is the Perl6 "commit pattern" C<< <commit> >> or C<:::>. It's a
+This is the Perl 6 "commit pattern" C<< <commit> >> or C<:::>. It's a
zero-width pattern similar to C<(*SKIP)>, except that when backtracked
into on failure it causes the match to fail outright. No further attempts
to find a valid match by advancing the start pointer will occur again.
The C<o?> matches at the beginning of C<'foo'>, and since the position
in the string is not moved by the match, C<o?> would match again and again
-because of the C<*> modifier. Another common way to create a similar cycle
+because of the C<*> quantifier. Another common way to create a similar cycle
is with the looping modifier C<//g>:
@matches = ( 'foo' =~ m{ o? }xg );
Thus Perl allows such constructs, by I<forcefully breaking
the infinite loop>. The rules for this are different for lower-level
-loops given by the greedy modifiers C<*+{}>, and for higher-level
+loops given by the greedy quantifiers C<*+{}>, and for higher-level
ones like the C</g> modifier or split() operator.
The lower-level loops are I<interrupted> (that is, the loop is
=head1 PCRE/Python Support
-As of Perl 5.10 Perl supports several Python/PCRE specific extensions
+As of Perl 5.10.0, Perl supports several Python/PCRE specific extensions
to the regex syntax. While Perl programmers are encouraged to use the
-Perl specific syntax, the following are legal in Perl 5.10:
+Perl specific syntax, the following are also accepted:
=over 4