=head1 NAME
+X<regular expression> X<regex> X<regexp>
perlre - Perl regular expressions
=over 4
=item i
+X</i> X<regex, case-insensitive> X<regexp, case-insensitive>
+X<regular expression, case-insensitive>
Do case-insensitive pattern matching.
locale. See L<perllocale>.
=item m
+X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
Treat string as multiple lines. That is, change "^" and "$" from matching
the start or end of the string to matching the start or end of any
line anywhere within the string.
=item s
+X</s> X<regex, single-line> X<regexp, single-line>
+X<regular expression, single-line>
Treat string as single line. That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.
and just before newlines within the string.
=item x
+X</x>
Extend your pattern's legibility by permitting whitespace and comments.
pattern delimiter in the comment--perl has no way of knowing you did
not intend to close the pattern early. See the C-comment deletion code
in L<perlop>.
+X</x>
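
As a minimal sketch of the C</i> and C</x> modifiers described above (the sample string and variable names are made up for illustration), the following shows a case-insensitive match and a pattern laid out with whitespace and comments:

```perl
use strict;
use warnings;

my $text = "Hello World";

# /i: letters in the pattern match regardless of case.
my $ci = $text =~ /hello/i ? 1 : 0;

# /x: whitespace and "#" comments inside the pattern are ignored,
# so the pattern can be laid out one piece per line.
my ($first, $second) = $text =~ /
    (\w+)   # first word
    \s+     # separating whitespace
    (\w+)   # second word
/x;

print "ci=$ci first=$first second=$second\n";
```

Note that none of the comments inside the C</x> pattern contain the pattern delimiter, for the reason given above.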
=head2 Regular Expressions
In particular the following metacharacters have their standard I<egrep>-ish
meanings:
+X<metacharacter>
+X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]>
+
\ Quote the next metacharacter
^ Match the beginning of the line
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
but this practice has been removed in perl 5.9.)
+X<^> X<$> X</m>
To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't.
+X<.> X</s>
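
The effect of C</m> on "^" and "$", and of C</s> on ".", can be sketched as follows. The strings and variable names here are illustrative assumptions, not taken from the Perl distribution:

```perl
use strict;
use warnings;

my $str = "one\ntwo\nthree";

# Without /m, "^" matches only at the start of the string and "$"
# only at the end, so no single line can satisfy the pattern.
my @plain = $str =~ /^(\w+)$/g;

# With /m, "^" and "$" match at every line boundary.
my @lines = $str =~ /^(\w+)$/mg;

# "." refuses to match a newline unless /s is given.
my $dot_plain  = "a\nb" =~ /a.b/  ? 1 : 0;
my $dot_single = "a\nb" =~ /a.b/s ? 1 : 0;

print scalar(@plain), " ", scalar(@lines), " $dot_plain $dot_single\n";
```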
The following standard quantifiers are recognized:
+X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>
* Match 0 or more times
+ Match 1 or more times
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":
+X<metacharacter> X<greedy> X<greediness>
+X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>
*? Match 0 or more times
+? Match 1 or more times
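
To illustrate the difference in "greediness" (a sketch with an invented sample string), compare what the same quantifier captures with and without the trailing "?":

```perl
use strict;
use warnings;

my $s = "<b>bold</b>";

# Greedy: ".+" takes as much as it can while still letting ">" match.
my ($greedy) = $s =~ /<(.+)>/;

# Non-greedy: ".+?" takes as little as it can.
my ($lazy) = $s =~ /<(.+?)>/;

print "greedy=$greedy lazy=$lazy\n";
```

Both patterns match; only the amount of text consumed by the quantifier differs.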
Because patterns are processed as double quoted strings, the following
also work:
+X<\t> X<\n> X<\r> X<\f> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q>
+X<\0> X<\c> X<\N> X<\x>
\t tab (HT, TAB)
\n newline (LF, NL)
You'll need to write something like C<m/\Quser\E\@\Qhost/>.
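
As a small sketch of what C<\Q> and C<\E> do to an interpolated variable (the variable C<$literal> and the test strings are assumptions made for this example), compare matching with and without quoting:

```perl
use strict;
use warnings;

my $literal = "a.b*c";

# Unquoted: "." and "*" in the interpolated text act as metacharacters.
my $raw_hit = "aXbbbc" =~ /\A$literal\z/ ? 1 : 0;

# Quoted: \Q...\E backslashes the metacharacters, so only the
# literal text "a.b*c" matches.
my $quoted_miss = "aXbbbc" =~ /\A\Q$literal\E\z/ ? 1 : 0;
my $quoted_hit  = "a.b*c"  =~ /\A\Q$literal\E\z/ ? 1 : 0;

print "$raw_hit $quoted_miss $quoted_hit\n";
```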
In addition, Perl defines the following:
+X<metacharacter>
+X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
+X<word> X<whitespace>
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-"word" character
"\x{2028}", and "\x{2029}", see L<perlunicode> for more details about
C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
You can define your own C<\p> and C<\P> properties; see L<perlunicode>.
+X<\w> X<\W> X<word>
The POSIX character class syntax
+X<character class>
[:class:]
is also available. The available classes and their backslash
equivalents (if available) are as follows:
+X<character class>
+X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
+X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
alpha
alnum
=item [1]
-A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
+A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".
=item [2]
Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
-also the (very rare) `vertical tabulator', "\ck", chr(11).
+also the (very rare) "vertical tabulator", "\ck", chr(11).
=item [3]
The following equivalences to Unicode \p{} constructs and equivalent
backslash character classes (if available), will hold:
+X<character class> X<\p> X<\p{}>
[:...:] \p{...} backslash
alpha IsAlpha
alnum IsAlnum
ascii IsASCII
- blank IsSpace
+ blank IsSpace
cntrl IsCntrl
digit IsDigit \d
graph IsGraph
If the C<utf8> pragma is not used but the C<locale> pragma is, the
classes correlate with the usual isalpha(3) interface (except for
-`word' and `blank').
+"word" and "blank").
The classes whose names may not be self-explanatory are:
=over 4
=item cntrl
+X<cntrl>
Any control character. Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
the ord() value of 127 (C<DEL>).
=item graph
+X<graph>
Any alphanumeric or punctuation (special) character.
=item print
+X<print>
Any alphanumeric or punctuation (special) character or the space character.
=item punct
+X<punct>
Any punctuation (special) character.
=item xdigit
+X<xdigit>
Any hexadecimal digit. Though this may feel silly ([0-9A-Fa-f] would
work just fine) it is included for completeness.
You can negate the [::] character classes by prefixing the class name
with a '^'. This is a Perl extension. For example:
+X<character class, negation>
POSIX traditional Unicode
use them will cause an error.
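
A short sketch of a POSIX class and its negated form inside a bracketed character class (the sample string is invented for illustration):

```perl
use strict;
use warnings;

my $mixed = "abc123";

# [[:digit:]] matches digits; remove them.
(my $no_digits = $mixed) =~ s/[[:digit:]]//g;

# [[:^digit:]] is the negated class; remove everything else.
(my $only_digits = $mixed) =~ s/[[:^digit:]]//g;

print "$no_digits $only_digits\n";
```

Note that the C<[:class:]> syntax appears inside an enclosing pair of brackets.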
Perl defines the following zero-width assertions:
+X<zero-width assertion> X<assertion> X<regex, zero-width assertion>
+X<regexp, zero-width assertion>
+X<regular expression, zero-width assertion>
+X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G>
\b Match a word boundary
\B Match a non-(word boundary)
"^" and "$" will match at every internal line boundary. To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.
+X<\b> X<\A> X<\Z> X<\z> X</m>
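
The difference between C<\Z>, C<\z> and "$" only shows up when the string ends in a newline. A minimal sketch, using an invented sample string:

```perl
use strict;
use warnings;

my $line = "foo\n";

# \Z and "$" may match just before a trailing newline.
my $cap_z  = $line =~ /foo\Z/ ? 1 : 0;
my $dollar = $line =~ /foo$/  ? 1 : 0;

# \z matches only at the very end of the string.
my $low_z  = $line =~ /foo\z/ ? 1 : 0;

print "$cap_z $dollar $low_z\n";
```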
The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
is permitted to use it elsewhere, as in C</(?<=\G..)./g>, some
such uses (C</.\G/g>, for example) currently cause problems, and
it is recommended that you avoid such usage for now.
+X<\G>
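
One common way to chain matches with C<\G> is a small scanner that tries several patterns at the current position. The sketch below assumes the C</c> modifier documented in L<perlop>, which keeps C<pos()> intact when a match fails; the input string and token labels are made up for illustration:

```perl
use strict;
use warnings;

my $input = "12+34";
my @tokens;

pos($input) = 0;
while (pos($input) < length $input) {
    # Each alternative is anchored with \G to where the last match ended.
    if    ($input =~ /\G(\d+)/gc)  { push @tokens, "NUM:$1" }
    elsif ($input =~ /\G([-+])/gc) { push @tokens, "OP:$1"  }
    else                           { last }   # unrecognized input
}

print join(" ", @tokens), "\n";
```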
The bracketing construct C<( ... )> creates capture buffers. To
refer to the digit'th buffer use \<digit> within the
the match. See the warning below about \1 vs $1 for details.)
Referring back to another part of the match is called a
I<backreference>.
+X<regex, capture buffer> X<regexp, capture buffer>
+X<regular expression, capture buffer> X<backreference>
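
As an illustration of a backreference within a pattern (the sample sentence is invented), the following finds a word that is immediately repeated:

```perl
use strict;
use warnings;

my $dup = "this is is a test";

# \1 must match the same text that the first capture buffer matched.
my ($word) = $dup =~ /\b(\w+)\s+\1\b/;

print "repeated word: $word\n";
```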
There is no limit to the number of captured substrings that you may
use. However Perl also uses \10, \11, etc. as aliases for \010,
the most-recently closed group (submatch). C<$^N> can be used in
extended patterns (see below), for example to assign a submatch to a
variable.
+X<$+> X<$^N> X<$&> X<$`> X<$'>
The numbered match variables ($1, $2, $3, etc.) and the related punctuation
set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
+X<$+> X<$^N> X<$&> X<$`> X<$'>
+X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>
+
B<NOTE>: failed matches in Perl do not reset the match variables,
-which makes easier to write code that tests for a series of more
+which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.
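
A minimal sketch of this behavior, with invented strings: the second match fails, so C<$1> still holds the value set by the first.

```perl
use strict;
use warnings;

my $got;
if ("key=value" =~ /^(\w+)=/) {
    # This match fails, so the match variables are left alone.
    "no digits here" =~ /(\d+)/;
    $got = $1;
}

print "still have: $got\n";
```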
B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
them), once you've used them once, use them at will, because you've
already paid the price. As of 5.005, C<$&> is not so costly as the
other two.
+X<$&> X<$`> X<$'>
Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
=over 10
=item C<(?#text)>
+X<(?#)>
A comment. The text is ignored. If the C</x> modifier enables
whitespace formatting, a simple C<#> will suffice. Note that Perl closes
C<)> in the comment.
=item C<(?imsx-imsx)>
+X<(?)>
One or more embedded pattern-match modifiers, to be turned on (or
turned off, if preceded by C<->) for the remainder of the pattern or
group.
=item C<(?:pattern)>
+X<(?:)>
=item C<(?imsx-imsx:pattern)>
/(?:(?s-i)more.*than).*million/i
=item C<(?=pattern)>
+X<(?=)> X<look-ahead, positive> X<lookahead, positive>
A zero-width positive look-ahead assertion. For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.
=item C<(?!pattern)>
+X<(?!)> X<look-ahead, negative> X<lookahead, negative>
A zero-width negative look-ahead assertion. For example, C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar". Note
For look-behind see below.
=item C<(?<=pattern)>
+X<(?<=)> X<look-behind, positive> X<lookbehind, positive>
A zero-width positive look-behind assertion. For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.
=item C<(?<!pattern)>
+X<(?<!)> X<look-behind, negative> X<lookbehind, negative>
A zero-width negative look-behind assertion. For example, C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar". Works
only for fixed-width look-behind.
=item C<(?{ code })>
+X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>
B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.
compartment. See L<perlsec> for details about both these mechanisms.
=item C<(??{ code })>
+X<(??{})>
+X<regex, postponed> X<regexp, postponed> X<regular expression, postponed>
+X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.
}x;
=item C<< (?>pattern) >>
+X<backtrack> X<backtracking>
B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.
the above specification of comments.
=item C<(?(condition)yes-pattern|no-pattern)>
+X<(?()>
=item C<(?(condition)yes-pattern)>
=back
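
The four look-around assertions from the list above can be exercised with a short sketch. The test strings are assumptions chosen for illustration, and captures are used instead of C<$&>:

```perl
use strict;
use warnings;

# Positive look-ahead: a word followed by a tab (tab not captured).
my ($before_tab) = "word\tnext" =~ /(\w+)(?=\t)/;

# Negative look-ahead: "foo" not followed by "bar".
my @not_bar = "foobar foobaz" =~ /foo(?!bar)(\w+)/g;

# Positive look-behind: a word that follows a tab.
my ($after_tab) = "a\tword" =~ /(?<=\t)(\w+)/;

# Negative look-behind: "foo" that does not follow "bar".
my $after_bar = "barfoo" =~ /(?<!bar)foo/ ? 1 : 0;
my $after_baz = "bazfoo" =~ /(?<!bar)foo/ ? 1 : 0;

print "$before_tab @not_bar $after_tab $after_bar $after_baz\n";
```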
=head2 Backtracking
+X<backtrack> X<backtracking>
NOTE: This section presents an abstract approximation of regular
expression behavior. For a more rigorous (and complicated) view of
claims that there is no 123 in the string. Here's a clearer picture of
why that pattern matches, contrary to popular expectations:
- $x = 'ABC123' ;
- $y = 'ABC445' ;
+ $x = 'ABC123';
+ $y = 'ABC445';
- print "1: got $1\n" if $x =~ /^(ABC)(?!123)/ ;
- print "2: got $1\n" if $y =~ /^(ABC)(?!123)/ ;
+ print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
+ print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;
- print "3: got $1\n" if $x =~ /^(\D*)(?!123)/ ;
- print "4: got $1\n" if $y =~ /^(\D*)(?!123)/ ;
+ print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
+ print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
This prints
of the string in their match. So rewriting this way produces what
you'd expect; that is, case 5 will fail, but case 6 succeeds:
- print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/ ;
- print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/ ;
+ print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
+ print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;
6: got ABC
following match, see L<C<< (?>pattern) >>>.
=head2 Version 8 Regular Expressions
+X<regular expression, version 8> X<regex, version 8> X<regexp, version 8>
In case you're not familiar with the "regular" Version 8 regex
routines, here are the pattern-matching rules not described above.
the functionality of the RE engine.
Suppose that we want to enable a new RE escape-sequence C<\Y|> which
-matches at boundary between white-space characters and non-whitespace
+matches at the boundary between whitespace characters and non-whitespace
characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version. We can create a module C<customre> to do