This page describes the syntax of regular expressions in Perl.
-if you haven't used regular expressions before, a quick-start
+If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.
Treat string as single line. That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.
-The C</s> and C</m> modifiers both override the C<$*> setting. That
-is, no matter what C<$*> contains, C</s> without C</m> will force
-"^" to match only at the beginning of the string and "$" to match
-only at the end (or just before a newline at the end) of the string.
-Together, as /ms, they let the "." match any character whatsoever,
+Used together, as /ms, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.
newline within the string, and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
-but this practice is now deprecated.)
+but this practice has been removed in perl 5.9.)
To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
-the string is a single line--even if it isn't. The C</s> modifier also
-overrides the setting of C<$*>, in case you have some (badly behaved) older
-code that sets it in another module.
+the string is a single line--even if it isn't.
The following standard quantifiers are recognized:
{n,m} Match at least n but not more than m times
(If a curly bracket occurs in any other context, it is treated
-as a regular character.) The "*" modifier is equivalent to C<{0,}>, the "+"
+as a regular character. In particular, the lower bound
+is not optional.) The "*" modifier is equivalent to C<{0,}>, the "+"
modifier to C<{1,}>, and the "?" modifier to C<{0,1}>. n and m are limited
to integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms. The actual limit can
\C Match a single C char (octet) even under Unicode.
NOTE: breaks up characters into their UTF-8 bytes,
so you may end up with malformed pieces of UTF-8.
+ Unsupported in lookbehind.
A C<\w> matches a single alphanumeric character (an alphabetic
character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
literally. If Unicode is in effect, C<\s> matches also "\x{85}",
"\x{2028}, and "\x{2029}", see L<perlunicode> for more details about
C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
-You can define your own C<\p> and C<\P> propreties, see L<perlunicode>.
+You can define your own C<\p> and C<\P> properties, see L<perlunicode>.
The POSIX character class syntax
=item [1]
-A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
+A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".
=item [2]
Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
-also the (very rare) `vertical tabulator', "\ck", chr(11).
+also the (very rare) "vertical tabulator", "\ck", chr(11).
=item [3]
alpha IsAlpha
alnum IsAlnum
ascii IsASCII
- blank IsSpace
+ blank IsSpace
cntrl IsCntrl
digit IsDigit \d
graph IsGraph
If the C<utf8> pragma is not used but the C<locale> pragma is, the
classes correlate with the usual isalpha(3) interface (except for
-`word' and `blank').
+"word" and "blank").
The assumedly non-obviously named classes are:
extended patterns (see below), for example to assign a submatch to a
variable.
-The numbered variables ($1, $2, $3, etc.) and the related punctuation
+The numbered match variables ($1, $2, $3, etc.) and the related punctuation
set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
+B<NOTE>: failed matches in Perl do not reset the match variables,
+which makes it easier to write code that tests for a series of more
+specific cases and remembers the best match.
+
B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match. This may substantially slow your program. Perl
B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.
-This zero-width assertion evaluate any embedded Perl code. It
+This zero-width assertion evaluates any embedded Perl code. It
always succeeds, and its C<code> is not interpolated. Currently,
the rules to determine where the C<code> ends are somewhat convoluted.
/the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
print "color = $color, animal = $animal\n";
+Inside the C<(?{...})> block, C<$_> refers to the string the regular
+expression is matching against. You can also use C<pos()> to know what is
+the current position of matching within this string.
+
The C<code> is properly scoped in the following sense: If the assertion
is backtracked (compare L<"Backtracking">), all changes introduced after
C<local>ization are undone, so that
you turn on the C<use re 'eval'>, though, it is no longer secure,
so you should only do so if you are also using taint checking.
Better yet, use the carefully constrained evaluation within a Safe
-module. See L<perlsec> for details about both these mechanisms.
+compartment. See L<perlsec> for details about both these mechanisms.
=item C<(??{ code })>
the functionality of the RE engine.
Suppose that we want to enable a new RE escape-sequence C<\Y|> which
-matches at boundary between white-space characters and non-whitespace
+matches at boundary between whitespace characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version. We can create a module C<customre> to do
sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}
- my %rules = ( '\\' => '\\',
+ # We must also take care of not escaping the legitimate \\Y|
+ # sequence, hence the presence of '\\' in the conversion rules.
+ my %rules = ( '\\' => '\\\\',
'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
sub convert {
my $re = shift;