{n,m} Match at least n but not more than m times
(If a curly bracket occurs in any other context, it is treated
-as a regular character.) The "*" modifier is equivalent to C<{0,}>, the "+"
+as a regular character. In particular, the lower bound
+is not optional.) The "*" modifier is equivalent to C<{0,}>, the "+"
modifier to C<{1,}>, and the "?" modifier to C<{0,1}>. n and m are limited
to integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
\pP Match P, named property. Use \p{Prop} for longer names.
\PP Match non-P
\X Match eXtended Unicode "combining character sequence",
- equivalent to C<(?:\PM\pM*)>
- \C Match a single C char (octet) even under utf8.
-
-A C<\w> matches a single alphanumeric character or C<_>, not a whole word.
-Use C<\w+> to match a string of Perl-identifier characters (which isn't
-the same as matching an English word). If C<use locale> is in effect, the
-list of alphabetic characters generated by C<\w> is taken from the
-current locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
+ equivalent to (?:\PM\pM*)
+ \C Match a single C char (octet) even under Unicode.
+ NOTE: breaks up characters into their UTF-8 bytes,
+ so you may end up with malformed pieces of UTF-8.
+
+A C<\w> matches a single alphanumeric character (an alphabetic
+character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
+to match a string of Perl-identifier characters (which isn't the same
+as matching an English word). If C<use locale> is in effect, the list
+of alphabetic characters generated by C<\w> is taken from the current
+locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
C<\d>, and C<\D> within character classes, but if you try to use them
-as endpoints of a range, that's not a range, the "-" is understood literally.
-See L<utf8> for details about C<\pP>, C<\PP>, and C<\X>.
+as endpoints of a range, that's not a range, the "-" is understood
+literally. If Unicode is in effect, C<\s> also matches "\x{85}",
+"\x{2028}", and "\x{2029}". See L<perlunicode> for more details about
+C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
+You can define your own C<\p> and C<\P> properties; see L<perlunicode>.
The POSIX character class syntax
word \w [3]
xdigit
- [1] A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
- [2] Not I<exactly equivalent> to C<\s> since the C<[[:space:]]> includes
- also the (very rare) `vertical tabulator', "\ck", chr(11).
- [3] A Perl extension.
+=over
+
+=item [1]
+
+A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
+
+=item [2]
+
+Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
+also the (very rare) `vertical tabulator', "\ck", chr(11).
+
+=item [3]
+
+A Perl extension, see above.
+
+=back
For example use C<[:upper:]> to match all the uppercase characters.
Note that the C<[]> are part of the C<[::]> construct, not part of the
matches zero, one, any alphabetic character, and the percentage sign.
-If the C<utf8> pragma is used, the following equivalences to Unicode
-\p{} constructs and equivalent backslash character classes (if available),
-will hold:
+The following equivalences to Unicode \p{} constructs and equivalent
+backslash character classes (if available), will hold:
+
+ [:...:] \p{...} backslash
alpha IsAlpha
alnum IsAlnum
You can negate the [::] character classes by prefixing the class name
with a '^'. This is a Perl extension. For example:
- POSIX trad. Perl utf8 Perl
+ POSIX traditional Unicode
[:^digit:] \D \P{IsDigit}
[:^space:] \S \P{IsSpace}
[:^word:] \W \P{IsWord}
-The POSIX character classes [.cc.] and [=cc=] are recognized but
-B<not> supported and trying to use them will cause an error.
+Perl respects the POSIX standard in that POSIX character classes are
+only supported within a character class. The POSIX character classes
+[.cc.] and [=cc=] are recognized but B<not> supported and trying to
+use them will cause an error.
Perl defines the following zero-width assertions:
several patterns that you want to match against consequent substrings
of your string, see the previous reference. The actual location
where C<\G> will match can also be influenced by using C<pos()> as
-an lvalue. See L<perlfunc/pos>.
+an lvalue: see L<perlfunc/pos>. Currently C<\G> is only fully
+supported when anchored to the start of the pattern. While it
+is permitted to use it elsewhere, as in C</(?<=\G..)./g>, some
+such uses (C</.\G/g>, for example) currently cause problems, and
+it is recommended that you avoid such usage for now.
The bracketing construct C<( ... )> creates capture buffers. To
refer to the digit'th buffer use \<digit> within the
match. C<$+> returns whatever the last bracket match matched.
C<$&> returns the entire matched string. (At one point C<$0> did
also, but now it returns the name of the program.) C<$`> returns
-everything before the matched string. And C<$'> returns everything
-after the matched string.
+everything before the matched string. C<$'> returns everything
+after the matched string. And C<$^N> contains whatever was matched by
+the most-recently closed group (submatch). C<$^N> can be used in
+extended patterns (see below), for example to assign a submatch to a
+variable.
The numbered variables ($1, $2, $3, etc.) and the related punctuation
-set (C<$+>, C<$&>, C<$`>, and C<$'>) are all dynamically scoped
+set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
=item C<(?imsx-imsx)>
-One or more embedded pattern-match modifiers. This is particularly
-useful for dynamic patterns, such as those read in from a configuration
-file, read in as an argument, are specified in a table somewhere,
-etc. Consider the case that some of which want to be case sensitive
-and some do not. The case insensitive ones need to include merely
-C<(?i)> at the front of the pattern. For example:
+One or more embedded pattern-match modifiers, to be turned on (or
+turned off, if preceded by C<->) for the remainder of the pattern or
+the remainder of the enclosing pattern group (if any). This is
+particularly useful for dynamic patterns, such as those read in from a
+configuration file, taken from an argument, or specified in a table
+somewhere. Consider the case where some patterns want to be case
+sensitive and some do not. The case insensitive ones merely need to
+include C<(?i)> at the front of the pattern. For example:
$pattern = "foobar";
if ( /$pattern/i ) { }
$pattern = "(?i)foobar";
if ( /$pattern/ ) { }
-Letters after a C<-> turn those modifiers off. These modifiers are
-localized inside an enclosing group (if any). For example,
+These modifiers are restored at the end of the enclosing group. For example,
( (?i) blah ) \s+ \1
always succeeds, and its C<code> is not interpolated. Currently,
the rules to determine where the C<code> ends are somewhat convoluted.
+This feature can be used together with the special variable C<$^N> to
+capture the results of submatches in variables without having to keep
+track of the number of nested parentheses. For example:
+
+ $_ = "The brown fox jumps over the lazy dog";
+ /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
+ print "color = $color, animal = $animal\n";
+
The C<code> is properly scoped in the following sense: If the assertion
is backtracked (compare L<"Backtracking">), all changes introduced after
C<local>ization are undone, so that
know which variety of success you will achieve.
When using look-ahead assertions and negations, this can all get even
-tricker. Imagine you'd like to find a sequence of non-digits not
+trickier. Imagine you'd like to find a sequence of non-digits not
followed by "123". You might try to write that as
$_ = "ABC123";