"Hello World" =~ m!World!; # matches, delimited by '!'
"Hello World" =~ m{World}; # matches, note the matching '{}'
- "/usr/bin/perl" =~ m"/perl"; # matches, '/' becomes ordinary char
+ "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
+ # '/' becomes an ordinary char
C</World/>, C<m!World!>, and C<m{World}> all represent the
same thing. When, e.g., C<""> is used as a delimiter, the forward
$foo = 'house';
'housecat' =~ /$foo/; # matches
'cathouse' =~ /cat$foo/; # matches
- 'housecat' =~ /$foocat/; # doesn't match, there is no $foocat
'housecat' =~ /${foo}cat/; # matches
So far, so good. With the knowledge above you can already perform
/cat/; # matches 'cat'
/[bcr]at/; # matches 'bat', 'cat', or 'rat'
/item[0123456789]/; # matches 'item0' or ... or 'item9'
- "abc" =~ /[cab/; # matches 'a'
+ "abc" =~ /[cab]/; # matches 'a'
In the last statement, even though C<'c'> is the first character in
the class, C<'a'> matches because the first character position in the
/[\]c]def/; # matches ']def' or 'cdef'
$x = 'bcr';
- /[$x]at/; # matches 'bat, 'cat', or 'rat'
+ /[$x]at/; # matches 'bat', 'cat', or 'rat'
/[\$x]at/; # matches '$at' or 'xat'
/[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'
/[0-9bx-z]aa/; # matches '0aa', ..., '9aa',
# 'baa', 'xaa', 'yaa', or 'zaa'
/[0-9a-fA-F]/; # matches a hexadecimal digit
- /[0-9a-zA-Z_]/; # matches an alphanumeric character,
+ /[0-9a-zA-Z_]/; # matches a "word" character,
# like those in a perl variable name
If C<'-'> is the first or last character in a character class, it is
The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
-those in the bracket. Both C<[...]> and C<[^...]> must match a
+those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then
/[^a]at/; # doesn't match 'aat' or 'at', but matches
=over 4
=item *
+
\d is a digit and represents [0-9]
=item *
+
\s is a whitespace character and represents [\ \t\r\n\f]
=item *
+
\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]
=item *
+
\D is a negated \d; it represents any character but a digit [^0-9]
=item *
+
\S is a negated \s; it represents any non-whitespace character [^\s]
=item *
+
\W is a negated \w; it represents any non-word character [^\w]
=item *
+
The period '.' matches any character but "\n"
=back
=over 4
=item *
+
no modifiers (//): Default behavior. C<'.'> matches any character
except C<"\n">. C<^> matches only at the beginning of the string and
C<$> matches only at the end or before a newline at the end.
=item *
+
s modifier (//s): Treat string as a single long line. C<'.'> matches
any character, even C<"\n">. C<^> matches only at the beginning of
the string and C<$> matches only at the end or before a newline at the
end.
=item *
+
m modifier (//m): Treat string as a set of multiple lines. C<'.'>
matches any character except C<"\n">. C<^> and C<$> are able to match
at the start or end of I<any> line within the string.
=item *
+
both s and m modifiers (//sm): Treat string as a single long line, but
detect multiple lines. C<'.'> matches any character, even
C<"\n">. C<^> and C<$>, however, are able to match at the start or end
=over 4
-=item 0 Start with the first letter in the string 'a'.
+=item 0
-=item 1 Try the first alternative in the first group 'abd'.
+Start with the first letter in the string 'a'.
-=item 2 Match 'a' followed by 'b'. So far so good.
+=item 1
-=item 3 'd' in the regexp doesn't match 'c' in the string - a dead
+Try the first alternative in the first group 'abd'.
+
+=item 2
+
+Match 'a' followed by 'b'. So far so good.
+
+=item 3
+
+'d' in the regexp doesn't match 'c' in the string - a dead
end. So backtrack two characters and pick the second alternative in
the first group 'abc'.
-=item 4 Match 'a' followed by 'b' followed by 'c'. We are on a roll
+=item 4
+
+Match 'a' followed by 'b' followed by 'c'. We are on a roll
and have satisfied the first group. Set $1 to 'abc'.
-=item 5 Move on to the second group and pick the first alternative
+=item 5
+
+Move on to the second group and pick the first alternative
'df'.
-=item 6 Match the 'd'.
+=item 6
+
+Match the 'd'.
+
+=item 7
-=item 7 'f' in the regexp doesn't match 'e' in the string, so a dead
+'f' in the regexp doesn't match 'e' in the string, so a dead
end. Backtrack one character and pick the second alternative in the
second group 'd'.
-=item 8 'd' matches. The second grouping is satisfied, so set $2 to
+=item 8
+
+'d' matches. The second grouping is satisfied, so set $2 to
'd'.
-=item 9 We are at the end of the regexp, so we are done! We have
+=item 9
+
+We are at the end of the regexp, so we are done! We have
matched 'abcd' out of the string "abcde".
=back
In the second match, S<C<$` = ''> > because the regexp matched at the
first character position in the string and stopped; it never saw the
second 'the'. It is important to note that using C<$`> and C<$'>
-slows down regexp matching quite a bit, and C<$&> slows it down to a
+slows down regexp matching quite a bit, and C< $& > slows it down to a
lesser extent, because if they are used in one regexp in a program,
they are generated for I<all> regexps in the program. So if raw
performance is a goal of your application, they should be avoided.
=over 4
-=item * C<a?> = match 'a' 1 or 0 times
+=item *
+
+C<a?> = match 'a' 1 or 0 times
+
+=item *
+
+C<a*> = match 'a' 0 or more times, i.e., any number of times
-=item * C<a*> = match 'a' 0 or more times, i.e., any number of times
+=item *
+
+C<a+> = match 'a' 1 or more times, i.e., at least once
-=item * C<a+> = match 'a' 1 or more times, i.e., at least once
+=item *
-=item * C<a{n,m}> = match at least C<n> times, but not more than C<m>
+C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.
-=item * C<a{n,}> = match at least C<n> or more times
+=item *
+
+C<a{n,}> = match at least C<n> or more times
-=item * C<a{n}> = match exactly C<n> times
+=item *
+
+C<a{n}> = match exactly C<n> times
=back
stop there, but that wouldn't give the longest possible string to the
first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as
much of the string as possible while still having the regexp match. In
-this example, that means having the C<at> sequence with the final <at>
+this example, that means having the C<at> sequence with the final C<at>
in the string. The other important principle illustrated here is that
when there are two or more elements in a regexp, the I<leftmost>
quantifier, if there is one, gets to grab as much of the string as
=over 4
=item *
+
Principle 0: Taken as a whole, any regexp will be matched at the
earliest possible position in the string.
=item *
+
Principle 1: In an alternation C<a|b|c...>, the leftmost alternative
that allows a match for the whole regexp will be the one used.
=item *
+
Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
C<{n,m}> will in general match as much of the string as possible while
still allowing the whole regexp to match.
=item *
+
Principle 3: If there are two or more elements in a regexp, the
leftmost greedy quantifier, if any, will match as much of the string
as possible while still allowing the whole regexp to match. The next
=over 4
-=item * C<a??> = match 'a' 0 or 1 times. Try 0 first, then 1.
+=item *
+
+C<a??> = match 'a' 0 or 1 times. Try 0 first, then 1.
+
+=item *
-=item * C<a*?> = match 'a' 0 or more times, i.e., any number of times,
+C<a*?> = match 'a' 0 or more times, i.e., any number of times,
but as few times as possible
-=item * C<a+?> = match 'a' 1 or more times, i.e., at least once, but
+=item *
+
+C<a+?> = match 'a' 1 or more times, i.e., at least once, but
as few times as possible
-=item * C<a{n,m}?> = match at least C<n> times, not more than C<m>
+=item *
+
+C<a{n,m}?> = match at least C<n> times, not more than C<m>
times, as few times as possible
-=item * C<a{n,}?> = match at least C<n> times, but as few times as
+=item *
+
+C<a{n,}?> = match at least C<n> times, but as few times as
possible
-=item * C<a{n}?> = match exactly C<n> times. Because we match exactly
+=item *
+
+C<a{n}?> = match exactly C<n> times. Because we match exactly
C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
notational consistency.
=over 4
=item *
+
Principle 3: If there are two or more elements in a regexp, the
leftmost greedy (non-greedy) quantifier, if any, will match as much
(little) of the string as possible while still allowing the whole
=over 4
-=item 0 Start with the first letter in the string 't'.
+=item 0
+
+Start with the first letter in the string 't'.
+
+=item 1
-=item 1 The first quantifier '.*' starts out by matching the whole
+The first quantifier '.*' starts out by matching the whole
string 'the cat in the hat'.
-=item 2 'a' in the regexp element 'at' doesn't match the end of the
+=item 2
+
+'a' in the regexp element 'at' doesn't match the end of the
string. Backtrack one character.
-=item 3 'a' in the regexp element 'at' still doesn't match the last
+=item 3
+
+'a' in the regexp element 'at' still doesn't match the last
letter of the string 't', so backtrack one more character.
-=item 4 Now we can match the 'a' and the 't'.
+=item 4
+
+Now we can match the 'a' and the 't'.
-=item 5 Move on to the third element '.*'. Since we are at the end of
+=item 5
+
+Move on to the third element '.*'. Since we are at the end of
the string and '.*' can match 0 times, assign it the empty string.
-=item 6 We are done!
+=item 6
+
+We are done!
=back
=over 4
-=item * specifying the task in detail,
+=item *
-=item * breaking down the problem into smaller parts,
+specifying the task in detail,
-=item * translating the small parts into regexps,
+=item *
-=item * combining the regexps,
+breaking down the problem into smaller parts,
-=item * and optimizing the final combined regexp.
+=item *
+
+translating the small parts into regexps,
+
+=item *
+
+combining the regexps,
+
+=item *
+
+and optimizing the final combined regexp.
=back
$x =~ /\C/; # matches, but dangerous!
The last regexp matches, but is dangerous because the string
-I<character> position is no longer synchronized to the string <byte>
+I<character> position is no longer synchronized to the string I<byte>
position. This generates the warning 'Malformed UTF-8
character'. C<\C> is best used for matching the binary data in strings
with binary data intermixed with Unicode characters.
As if all those classes weren't enough, Perl also defines POSIX style
character classes. These have the form C<[:name:]>, with C<name> the
-name of the POSIX class. The POSIX classes are alpha, alnum, ascii,
-cntrl, digit, graph, lower, print, punct, space, upper, word, and
-xdigit. If C<utf8> is being used, then these classes are defined the
-same as their corresponding perl Unicode classes: C<[:upper:]> is the
-same as C<\p{IsUpper}>, etc. The POSIX character classes, however,
-don't require using C<utf8>. The C<[:digit:]>, C<[:word:]>, and
+name of the POSIX class. The POSIX classes are C<alpha>, C<alnum>,
+C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>,
+C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl
+extension to match C<\w>), and C<blank> (a GNU extension). If C<utf8>
+is being used, then these classes are defined the same as their
+corresponding perl Unicode classes: C<[:upper:]> is the same as
+C<\p{IsUpper}>, etc. The POSIX character classes, however, don't
+require using C<utf8>. The C<[:digit:]>, C<[:word:]>, and
C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
-character classes. To negate a POSIX class, put a C<^> in front of the
-name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
+character classes. To negate a POSIX class, put a C<^> in front of
+the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
C<utf8>, C<\P{IsDigit}>. The Unicode and POSIX character classes can
be used just like C<\d>, both inside and outside of character classes:
anchor concept. Lookahead and lookbehind are zero-width assertions
that let us specify which characters we want to test for. The
lookahead assertion is denoted by C<(?=regexp)> and the lookbehind
-assertion is denoted by C<(?<=fixed-regexp)>. Some examples are
+assertion is denoted by C<< (?<=fixed-regexp) >>. Some examples are
$x = "I catch the housecat 'Tom-cat' with catnip";
$x =~ /cat(?=\s+)/; # matches 'cat' in 'housecat'
$x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
# middle of $x
-Note that the parentheses in C<(?=regexp)> and C<(?<=regexp)> are
+Note that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are
non-capturing, since these are zero-width assertions. Thus in the
second regexp, the substrings captured are those of the whole regexp
-itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but
-lookbehind C<(?<=fixed-regexp)> only works for regexps of fixed
-width, i.e., a fixed number of characters long. Thus C<(?<=(ab|bc))>
-is fine, but C<(?<=(ab)*)> is not. The negated versions of the
-lookahead and lookbehind assertions are denoted by C<(?!regexp)>
-and C<(?<!fixed-regexp)> respectively. They evaluate true if the
-regexps do I<not> match:
+itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but
+lookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed
+width, i.e., a fixed number of characters long. Thus
+C<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not. The
+negated versions of the lookahead and lookbehind assertions are
+denoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively.
+They evaluate true if the regexps do I<not> match:
$x = "foobar";
$x =~ /foo(?!bar)/; # doesn't match, 'bar' follows 'foo'
matches a DNA sequence such that it either ends in C<AAG>, or some
other base pair combination and C<C>. Note that the form is
-C<(?(?<=AA)G|C)> and not C<(?((?<=AA))G|C)>; for the lookahead,
-lookbehind or code assertions, the parentheses around the conditional
-are not needed.
+C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the
+lookahead, lookbehind or code assertions, the parentheses around the
+conditional are not needed.
=head2 A bit of magic: executing Perl code in a regular expression
# prints 'Hi Mom!'
$x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
# no 'Hi Mom!'
+
+Pay careful attention to the next example:
+
$x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
# no 'Hi Mom!'
+ # but why not?
+
+At first glance, you'd think that it shouldn't print, because obviously
+the C<ddd> isn't going to match the target string. But look at this
+example:
+
+ $x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match,
+ # but _does_ print
+
+Hmm. What happened here? If you've been following along, you know that
+the above pattern should be effectively the same as the last one --
+enclosing the d in a character class isn't going to change what it
+matches. So why does the first not print while the second one does?
+
+The answer lies in the optimizations the REx engine makes. In the first
+case, all the engine sees are plain old characters (aside from the
+C<?{}> construct). It's smart enough to realize that the string 'ddd'
+doesn't occur in our target string before actually running the pattern
+through. But in the second case, we've tricked it into thinking that our
+pattern is more complicated than it is. It takes a look, sees our
+character class, and decides that it will have to actually run the
+pattern to determine whether or not it matches, and in the process of
+running it hits the print statement before it discovers that we don't
+have a match.
+
+To take a closer look at how the engine does optimizations, see the
+section L<"Pragmas and debugging"> below.
+
+More fun with C<?{}>:
+
$x =~ /(?{print "Hi Mom!";})/; # matches,
# prints 'Hi Mom!'
$x =~ /(?{$c = 1;})(?{print "$c";})/; # matches,
This example uses a code expression in a conditional to match the
article 'the' in either English or German:
- use re 'eval';
$lang = 'DE'; # use German
...
$text = "das";
code expression, we don't need the extra parentheses around the
conditional.
-The S<C<use re 'eval';> > statement is needed because we are both
-interpolating the variable C<$lang> I<and> evaluating code
-within the regexp. From a security point of view, this can be
-dangerous. It is dangerous because many programmers who write search
-engines often take user input and plug it directly into a regexp:
+If you try to use code expressions with interpolating variables, perl
+may surprise you:
+
+ $bar = 5;
+ $pat = '(?{ 1 })';
+ /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
+ /foo(?{ 1 })$bar/; # compile error!
+ /foo${pat}bar/; # compile error!
+
+ $pat = qr/(?{ $foo = 1 })/; # precompile code regexp
+ /foo${pat}bar/; # compiles ok
+
+If a regexp has (1) code expressions and interpolating variables, or
+(2) a variable that interpolates a code expression, perl treats the
+regexp as an error. If the code expression is precompiled into a
+variable, however, interpolating is ok. The question is, why is this
+an error?
+
+The reason is that variable interpolation and code expressions
+together pose a security risk. The combination is dangerous because
+many programmers who write search engines often take user input and
+plug it directly into a regexp:
$regexp = <>; # read user-supplied regexp
chomp $regexp; # get rid of possible newline
$text =~ /$regexp/; # search $text for the $regexp
-If the C<$regexp> variable is used in a code expression, the user
-could then execute arbitrary Perl code. For instance, some joker could
+If the C<$regexp> variable contains a code expression, the user could
+then execute arbitrary Perl code. For instance, some joker could
search for S<C<system('rm -rf *');> > to erase your files. In this
sense, the combination of interpolation and code expressions B<taints>
your regexp. So by default, using both interpolation and code
-expressions in the same regexp is not allowed. Only by invoking
-S<C<use re 'eval';> > can one use both interpolation and code
-expressions in the same regexp.
+expressions in the same regexp is not allowed. If you're not
+concerned about malicious users, it is possible to bypass this
+security check by invoking S<C<use re 'eval'> >:
+
+ use re 'eval'; # throw caution out the door
+ $bar = 5;
+ $pat = '(?{ 1 })';
+ /foo(?{ 1 })$bar/; # compiles ok
+ /foo${pat}bar/; # compiles ok
Another form of code expression is the S<B<pattern code expression> >.
The pattern code expression is like a regular code expression, except
expressions. It detects if a binary string C<1101010010001...> has a
Fibonacci spacing 0,1,1,2,3,5,... of the C<1>'s:
- use re 'eval';
$s0 = 0; $s1 = 1; # initial conditions
$x = "1101010010001000001";
print "It is a Fibonacci sequence\n"
Speaking of debugging, there are several pragmas available to control
and debug regexps in Perl. We have already encountered one pragma in
the previous section, S<C<use re 'eval';> >, that allows variable
-interpolation in a regexp with code expressions. The other pragmas are
+interpolation and code expressions to coexist in a regexp. The other
+pragmas are
use re 'taint';
$tainted = <>;
regexps are often used to extract the safe bits from a tainted
variable. Use C<taint> when you are not extracting safe bits, but are
performing some other processing. Both C<taint> and C<eval> pragmas
-are lexically scoped, which mean they have are in effect only until
+are lexically scoped, which means they are in effect only until
the end of the block enclosing the pragmas.
use re 'debug';
The inspiration for the stop codon DNA example came from the ZIP
code example in chapter 7 of I<Mastering Regular Expressions>.
-The author would like to thank
-Jeff Pinyan,
-Peter Haworth,
-Ronald J Kimball,
-and Joe Smith for all their helpful comments.
+The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
+Haworth, Ronald J Kimball, and Joe Smith for all their helpful
+comments.
=cut
+