"Hello World" =~ m!World!; # matches, delimited by '!'
"Hello World" =~ m{World}; # matches, note the matching '{}'
- "/usr/bin/perl" =~ m"/perl"; # matches, '/' becomes ordinary char
+ "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
+ # '/' becomes an ordinary char
C</World/>, C<m!World!>, and C<m{World}> all represent the
same thing. When, e.g., C<""> is used as a delimiter, the forward
$foo = 'house';
'housecat' =~ /$foo/; # matches
'cathouse' =~ /cat$foo/; # matches
- 'housecat' =~ /$foocat/; # doesn't match, there is no $foocat
'housecat' =~ /${foo}cat/; # matches
So far, so good. With the knowledge above you can already perform
/cat/; # matches 'cat'
/[bcr]at/; # matches 'bat, 'cat', or 'rat'
/item[0123456789]/; # matches 'item0' or ... or 'item9'
- "abc" =~ /[cab/; # matches 'a'
+ "abc" =~ /[cab]/; # matches 'a'
In the last statement, even though C<'c'> is the first character in
the class, C<'a'> matches because the first character position in the
/[\]c]def/; # matches ']def' or 'cdef'
$x = 'bcr';
- /[$x]at/; # matches 'bat, 'cat', or 'rat'
+ /[$x]at/; # matches 'bat', 'cat', or 'rat'
/[\$x]at/; # matches '$at' or 'xat'
/[\\$x]at/; # matches '\at', 'bat, 'cat', or 'rat'
/[0-9bx-z]aa/; # matches '0aa', ..., '9aa',
# 'baa', 'xaa', 'yaa', or 'zaa'
/[0-9a-fA-F]/; # matches a hexadecimal digit
- /[0-9a-zA-Z_]/; # matches an alphanumeric character,
+ /[0-9a-zA-Z_]/; # matches a "word" character,
# like those in a perl variable name
If C<'-'> is the first or last character in a character class, it is
The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
-those in the bracket. Both C<[...]> and C<[^...]> must match a
+those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then
/[^a]at/; # doesn't match 'aat' or 'at', but matches
=over 4
=item *
+
\d is a digit and represents [0-9]
=item *
+
\s is a whitespace character and represents [\ \t\r\n\f]
=item *
+
\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]
=item *
+
\D is a negated \d; it represents any character but a digit [^0-9]
=item *
+
\S is a negated \s; it represents any non-whitespace character [^\s]
=item *
+
\W is a negated \w; it represents any non-word character [^\w]
=item *
+
The period '.' matches any character but "\n"
=back
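
As a minimal sketch of the shorthand classes listed above, the following checks a few patterns against sample strings. The strings and variable names are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample string, chosen only for illustration.
my $time = "12:30:05";

# \d\d:\d\d:\d\d matches an hh:mm:ss time format
my $is_time = $time =~ /^\d\d:\d\d:\d\d$/ ? 1 : 0;

# [\d\s] matches any digit or whitespace character
my $digit_or_space = "a 1" =~ /[\d\s]/ ? 1 : 0;

# \w\W\w matches a word char, then a non-word char, then a word char
my $word_gap_word = "a-b" =~ /\w\W\w/ ? 1 : 0;

# '.' does not match "\n" when no //s modifier is given
my $dot_newline = "\n" =~ /^.$/ ? 1 : 0;

print "$is_time $digit_or_space $word_gap_word $dot_newline\n";
```

The first three tests succeed and the last one fails, because the period cannot consume the newline by default.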
=over 4
=item *
+
no modifiers (//): Default behavior. C<'.'> matches any character
except C<"\n">. C<^> matches only at the beginning of the string and
C<$> matches only at the end or before a newline at the end.
=item *
+
s modifier (//s): Treat string as a single long line. C<'.'> matches
any character, even C<"\n">. C<^> matches only at the beginning of
the string and C<$> matches only at the end or before a newline at the
end.
=item *
+
m modifier (//m): Treat string as a set of multiple lines. C<'.'>
matches any character except C<"\n">. C<^> and C<$> are able to match
at the start or end of I<any> line within the string.
=item *
+
both s and m modifiers (//sm): Treat string as a single long line, but
detect multiple lines. C<'.'> matches any character, even
C<"\n">. C<^> and C<$>, however, are able to match at the start or end
=over 4
-=item 0 Start with the first letter in the string 'a'.
+=item 0
-=item 1 Try the first alternative in the first group 'abd'.
+Start with the first letter in the string 'a'.
-=item 2 Match 'a' followed by 'b'. So far so good.
+=item 1
-=item 3 'd' in the regexp doesn't match 'c' in the string - a dead
+Try the first alternative in the first group 'abd'.
+
+=item 2
+
+Match 'a' followed by 'b'. So far so good.
+
+=item 3
+
+'d' in the regexp doesn't match 'c' in the string - a dead
end. So backtrack two characters and pick the second alternative in
the first group 'abc'.
-=item 4 Match 'a' followed by 'b' followed by 'c'. We are on a roll
+=item 4
+
+Match 'a' followed by 'b' followed by 'c'. We are on a roll
and have satisfied the first group. Set $1 to 'abc'.
-=item 5 Move on to the second group and pick the first alternative
+=item 5
+
+Move on to the second group and pick the first alternative
'df'.
-=item 6 Match the 'd'.
+=item 6
-=item 7 'f' in the regexp doesn't match 'e' in the string, so a dead
+Match the 'd'.
+
+=item 7
+
+'f' in the regexp doesn't match 'e' in the string, so a dead
end. Backtrack one character and pick the second alternative in the
second group 'd'.
-=item 8 'd' matches. The second grouping is satisfied, so set $2 to
+=item 8
+
+'d' matches. The second grouping is satisfied, so set $2 to
'd'.
-=item 9 We are at the end of the regexp, so we are done! We have
+=item 9
+
+We are at the end of the regexp, so we are done! We have
matched 'abcd' out of the string "abcde".
=back
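
The steps above can be checked with a short script. This is only a sketch; the variable names are hypothetical, and the whole match is rebuilt from the two groups because they are adjacent in the regexp.

```perl
use strict;
use warnings;

my ($first, $second, $whole);
if ("abcde" =~ /(abd|abc)(df|d|de)/) {
    ($first, $second) = ($1, $2);
    # The two groups are adjacent, so their concatenation is the whole match.
    $whole = $first . $second;
}

print "matched '$whole': \$1 = '$first', \$2 = '$second'\n";
```

Note that 'de' would also have matched in the second group, but 'd' comes earlier in the alternation and already lets the whole regexp succeed.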
/(ab(cd|ef)((gi)|j))/;
1 2 34
-so that if the regexp matched, e.g., C<$2> would contain 'cd' or 'ef'.
-For convenience, perl sets C<$+> to the highest numbered C<$1>, C<$2>,
-... that got assigned.
+so that if the regexp matched, e.g., C<$2> would contain 'cd' or 'ef'. For
+convenience, perl sets C<$+> to the string held by the highest numbered
+C<$1>, C<$2>, ... that got assigned (and, somewhat related, C<$^N> to the
+value of the C<$1>, C<$2>, ... most-recently assigned; i.e. the C<$1>,
+C<$2>, ... associated with the rightmost closing parenthesis used in the
+match).
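
A small sketch of the difference between C<$+> and C<$^N>, using the regexp above on a hypothetical input string (C<$^N> assumes perl 5.8 or later):

```perl
use strict;
use warnings;

my ($last_paren, $last_closed);
if ("abcdgi" =~ /(ab(cd|ef)((gi)|j))/) {
    # Here $1 = 'abcdgi', $2 = 'cd', $3 = 'gi', $4 = 'gi'
    $last_paren  = $+;     # highest-numbered group that matched: $4
    $last_closed = $^N;    # group whose ')' closed most recently: $1
}

print "\$+ = '$last_paren', \$^N = '$last_closed'\n";
```

The outermost group is numbered 1, but its closing parenthesis is the rightmost one, so it is the group that C<$^N> reports.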
Closely associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\1>, C<\2>, ... . Backreferences are simply
In the second match, S<C<$` = ''> > because the regexp matched at the
first character position in the string and stopped, it never saw the
second 'the'. It is important to note that using C<$`> and C<$'>
-slows down regexp matching quite a bit, and C<$&> slows it down to a
+slows down regexp matching quite a bit, and C< $& > slows it down to a
lesser extent, because if they are used in one regexp in a program,
they are generated for I<all> regexps in the program. So if raw
performance is a goal of your application, they should be avoided.
=over 4
-=item * C<a?> = match 'a' 1 or 0 times
+=item *
-=item * C<a*> = match 'a' 0 or more times, i.e., any number of times
+C<a?> = match 'a' 1 or 0 times
-=item * C<a+> = match 'a' 1 or more times, i.e., at least once
+=item *
+
+C<a*> = match 'a' 0 or more times, i.e., any number of times
+
+=item *
+
+C<a+> = match 'a' 1 or more times, i.e., at least once
-=item * C<a{n,m}> = match at least C<n> times, but not more than C<m>
+=item *
+
+C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.
-=item * C<a{n,}> = match at least C<n> or more times
+=item *
+
+C<a{n,}> = match at least C<n> or more times
+
+=item *
-=item * C<a{n}> = match exactly C<n> times
+C<a{n}> = match exactly C<n> times
=back
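
The quantifiers listed above can be tried out as follows. This is a sketch only; every sample string and variable name is hypothetical.

```perl
use strict;
use warnings;

# All sample strings below are hypothetical, chosen only for illustration.
my ($optional) = "color" =~ /(colou?r)/;    # 'u' matched 0 times
my ($any)      = "bd"    =~ /(ba*d)/;       # 'a' matched 0 times
my ($run)      = "baaad" =~ /(a+)/;         # 'a' matched 3 times
my ($exact)    = "aaaa"  =~ /(a{3})/;       # exactly 3 of the 4 'a's

my $in_range = "1999" =~ /^\d{2,4}$/ ? 1 : 0;    # 4 digits: within 2..4
my $too_few  = "7"    =~ /^\d{2,}$/  ? 1 : 0;    # 1 digit: fewer than 2

print "$optional $any $run $exact $in_range $too_few\n";
```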
stop there, but that wouldn't give the longest possible string to the
first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as
much of the string as possible while still having the regexp match. In
-this example, that means having the C<at> sequence with the final <at>
+this example, that means having the C<at> sequence with the final C<at>
in the string. The other important principle illustrated here is that
when there are two or more elements in a regexp, the I<leftmost>
quantifier, if there is one, gets to grab as much the string as
=over 4
=item *
+
Principle 0: Taken as a whole, any regexp will be matched at the
earliest possible position in the string.
=item *
+
Principle 1: In an alternation C<a|b|c...>, the leftmost alternative
that allows a match for the whole regexp will be the one used.
=item *
+
Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
C<{n,m}> will in general match as much of the string as possible while
still allowing the whole regexp to match.
=item *
+
Principle 3: If there are two or more elements in a regexp, the
leftmost greedy quantifier, if any, will match as much of the string
as possible while still allowing the whole regexp to match. The next
=over 4
-=item * C<a??> = match 'a' 0 or 1 times. Try 0 first, then 1.
+=item *
+
+C<a??> = match 'a' 0 or 1 times. Try 0 first, then 1.
+
+=item *
-=item * C<a*?> = match 'a' 0 or more times, i.e., any number of times,
+C<a*?> = match 'a' 0 or more times, i.e., any number of times,
but as few times as possible
-=item * C<a+?> = match 'a' 1 or more times, i.e., at least once, but
+=item *
+
+C<a+?> = match 'a' 1 or more times, i.e., at least once, but
as few times as possible
-=item * C<a{n,m}?> = match at least C<n> times, not more than C<m>
+=item *
+
+C<a{n,m}?> = match at least C<n> times, not more than C<m>
times, as few times as possible
-=item * C<a{n,}?> = match at least C<n> times, but as few times as
+=item *
+
+C<a{n,}?> = match at least C<n> times, but as few times as
possible
-=item * C<a{n}?> = match exactly C<n> times. Because we match exactly
+=item *
+
+C<a{n}?> = match exactly C<n> times. Because we match exactly
C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
notational consistency.
=over 4
=item *
+
Principle 3: If there are two or more elements in a regexp, the
leftmost greedy (non-greedy) quantifier, if any, will match as much
(little) of the string as possible while still allowing the whole
=over 4
-=item 0 Start with the first letter in the string 't'.
+=item 0
+
+Start with the first letter in the string 't'.
+
+=item 1
-=item 1 The first quantifier '.*' starts out by matching the whole
+The first quantifier '.*' starts out by matching the whole
string 'the cat in the hat'.
-=item 2 'a' in the regexp element 'at' doesn't match the end of the
+=item 2
+
+'a' in the regexp element 'at' doesn't match the end of the
string. Backtrack one character.
-=item 3 'a' in the regexp element 'at' still doesn't match the last
+=item 3
+
+'a' in the regexp element 'at' still doesn't match the last
letter of the string 't', so backtrack one more character.
-=item 4 Now we can match the 'a' and the 't'.
+=item 4
-=item 5 Move on to the third element '.*'. Since we are at the end of
+Now we can match the 'a' and the 't'.
+
+=item 5
+
+Move on to the third element '.*'. Since we are at the end of
the string and '.*' can match 0 times, assign it the empty string.
-=item 6 We are done!
+=item 6
+
+We are done!
=back
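
The walk-through above can be confirmed by capturing each element. As a sketch, the second match below swaps the first quantifier for its minimal form C<.*?> to show the contrast; the variable names are hypothetical.

```perl
use strict;
use warnings;

my $x = "the cat in the hat";

# Greedy: the first '.*' grabs as much as it can and backtracks to the last 'at'.
my ($g1, $g2, $g3) = $x =~ /^(.*)(at)(.*)$/;

# Minimal: the first '.*?' grabs as little as it can and stops at the first 'at'.
my ($n1, $n2, $n3) = $x =~ /^(.*?)(at)(.*)$/;

print "greedy:  '$g1' '$g2' '$g3'\n";
print "minimal: '$n1' '$n2' '$n3'\n";
```

In the greedy case the third element is left with the empty string, exactly as in step 5.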
=over 4
-=item * specifying the task in detail,
+=item *
-=item * breaking down the problem into smaller parts,
+specifying the task in detail,
-=item * translating the small parts into regexps,
+=item *
-=item * combining the regexps,
+breaking down the problem into smaller parts,
-=item * and optimizing the final combined regexp.
+=item *
+
+translating the small parts into regexps,
+
+=item *
+
+combining the regexps,
+
+=item *
+
+and optimizing the final combined regexp.
=back
than C<chr(127)> may be represented using the C<\x{hex}> notation,
with C<hex> a hexadecimal integer:
- use utf8; # We will be doing Unicode processing
/\x{263a}/; # match a Unicode smiley face :)
Unicode characters in the range of 128-255 use two hexadecimal digits
with braces: C<\x{ab}>. Note that this is different than C<\xab>,
-which is just a hexadecimal byte with no Unicode
-significance.
+which is just a hexadecimal byte with no Unicode significance.
+
+B<NOTE>: in perl 5.6.0 it used to be that one needed to say C<use utf8>
+to use any Unicode features. This is no longer the case: for almost all
+Unicode processing, the explicit C<utf8> pragma is not needed.
+(The only case where it matters is if your Perl script is in Unicode,
+that is, encoded in UTF-8/UTF-16/UTF-EBCDIC: then an explicit C<use utf8>
+is needed.)
Figuring out the hexadecimal sequence of a Unicode character you want
or deciphering someone else's hexadecimal Unicode regexp is about as
represent or match the astrological sign for the planet Mercury, we
could use
- use utf8; # We will be doing Unicode processing
use charnames ":full"; # use named chars with Unicode full names
$x = "abc\N{MERCURY}def";
$x =~ /\N{MERCURY}/; # matches
One can also use short names or restrict names to a certain alphabet:
- use utf8; # We will be doing Unicode processing
-
use charnames ':full';
print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";
print "\N{sigma} is Greek sigma\n";
A list of full names is found in the file Names.txt in the
-lib/perl5/5.6.0/unicode directory.
+lib/perl5/5.X.X/unicore directory.
The answer to requirement 2), as of 5.6.0, is that if a regexp
contains Unicode characters, the string is searched as a sequence of
escape sequence. C<\C> is a character class akin to C<.> except that
it matches I<any> byte 0-255. So
- use utf8; # We will be doing Unicode processing
use charnames ":full"; # use named chars with Unicode full names
$x = "a";
$x =~ /\C/; # matches 'a', eats one byte
$x =~ /\C/; # matches, but dangerous!
The last regexp matches, but is dangerous because the string
-I<character> position is no longer synchronized to the string <byte>
+I<character> position is no longer synchronized to the string I<byte>
position. This generates the warning 'Malformed UTF-8
character'. C<\C> is best used for matching the binary data in strings
with binary data intermixed with Unicode characters.
C<\p{name}> class. For example, to match lower and uppercase
characters,
- use utf8; # We will be doing Unicode processing
use charnames ":full"; # use named chars with Unicode full names
$x = "BOB";
$x =~ /^\p{IsUpper}/; # matches, uppercase char class
$x =~ /^\p{IsLower}/; # doesn't match, lowercase char class
$x =~ /^\P{IsLower}/; # matches, char class sans lowercase
-If a C<name> is just one letter, the braces can be dropped. For
-instance, C<\pM> is the character class of Unicode 'marks'. Here is
-the association between some Perl named classes and the traditional
-Unicode classes:
+Here is the association between some Perl named classes and the
+traditional Unicode classes:
- Perl class name Unicode class name
+ Perl class name Unicode class name or regular expression
- IsAlpha Lu, Ll, or Lo
- IsAlnum Lu, Ll, Lo, or Nd
- IsASCII $code le 127
- IsCntrl C
+ IsAlpha /^[LM]/
+ IsAlnum /^[LMN]/
+ IsASCII $code <= 127
+ IsCntrl /^C/
+ IsBlank $code =~ /^(0020|0009)$/ || /^Z[^lp]/
IsDigit Nd
- IsGraph [^C] and $code ne "0020"
+ IsGraph /^([LMNPS]|Co)/
IsLower Ll
- IsPrint [^C]
- IsPunct P
- IsSpace Z, or ($code lt "0020" and chr(hex $code) is a \s)
- IsUpper Lu
- IsWord Lu, Ll, Lo, Nd or $code eq "005F"
+ IsPrint /^([LMNPS]|Co|Zs)/
+ IsPunct /^P/
+ IsSpace /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/)
+ IsSpacePerl /^Z/ || ($code =~ /^(0009|000A|000C|000D)$/)
+ IsUpper /^L[ut]/
+ IsWord /^[LMN]/ || $code eq "005F"
IsXDigit $code =~ /^00(3[0-9]|[46][1-6])$/
-For a full list of Perl class names, consult the mktables.PL program
-in the lib/perl5/5.6.0/unicode directory.
+You can also use the official Unicode class names with C<\p> and
+C<\P>, such as C<\p{L}> for Unicode 'letters', or C<\p{Lu}> for uppercase
+letters, or C<\P{Nd}> for non-digits. If a C<name> is just one
+letter, the braces can be dropped. For instance, C<\pM> is the
+character class of Unicode 'marks', for example accent marks.
+For the full list see L<perlunicode>.
+
+Unicode has also been separated into various sets of characters
+which you can test with C<\p{In...}> (in) and C<\P{In...}> (not in),
+for example C<\p{InLatin}>, C<\p{InGreek}>, or C<\P{InKatakana}>.
+For the full list see L<perlunicode>.
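
A brief sketch of the official Unicode class names used with C<\p> and C<\P>. The sample strings and variable names are hypothetical; the combining mark is written as C<\x{301}> (COMBINING ACUTE ACCENT) so that the script itself stays plain ASCII.

```perl
use strict;
use warnings;

my $all_upper = "BOB" =~ /^\p{Lu}+$/ ? 1 : 0;    # every char is an uppercase letter
my $letter    = "x"   =~ /^\pL$/     ? 1 : 0;    # one-letter name, braces dropped
my $non_digit = "7"   =~ /^\P{Nd}$/  ? 1 : 0;    # '7' is a digit, so \P{Nd} fails

# 'e' followed by U+0301, a combining mark in the Unicode 'M' class
my $has_mark = "e\x{301}" =~ /^e\pM$/ ? 1 : 0;

print "$all_upper $letter $non_digit $has_mark\n";
```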
C<\X> is an abbreviation for a character class sequence that includes
the Unicode 'combining character sequences'. A 'combining character
atop it, as in the word Angstrom. C<\X> is equivalent to C<\PM\pM*>,
i.e., a non-mark followed by zero or more marks.
+For the full and latest information about Unicode see the latest
+Unicode standard, or the Unicode Consortium's website http://www.unicode.org/
+
As if all those classes weren't enough, Perl also defines POSIX style
character classes. These have the form C<[:name:]>, with C<name> the
-name of the POSIX class. The POSIX classes are alpha, alnum, ascii,
-cntrl, digit, graph, lower, print, punct, space, upper, word, and
-xdigit. If C<utf8> is being used, then these classes are defined the
-same as their corresponding perl Unicode classes: C<[:upper:]> is the
-same as C<\p{IsUpper}>, etc. The POSIX character classes, however,
-don't require using C<utf8>. The C<[:digit:]>, C<[:word:]>, and
+name of the POSIX class. The POSIX classes are C<alpha>, C<alnum>,
+C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>,
+C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl
+extension to match C<\w>), and C<blank> (a GNU extension). If C<utf8>
+is being used, then these classes are defined the same as their
+corresponding perl Unicode classes: C<[:upper:]> is the same as
+C<\p{IsUpper}>, etc. The POSIX character classes, however, don't
+require using C<utf8>. The C<[:digit:]>, C<[:word:]>, and
C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
-character classes. To negate a POSIX class, put a C<^> in front of the
-name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
+character classes. To negate a POSIX class, put a C<^> in front of
+the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
C<utf8>, C<\P{IsDigit}>. The Unicode and POSIX character classes can
be used just like C<\d>, both inside and outside of character classes:
/\s+[abc[:digit:]xyz]\s*/; # match a,b,c,x,y,z, or a digit
/^=item\s[:digit:]/; # match '=item',
# followed by a space and a digit
- use utf8;
use charnames ":full";
/\s+[abc\p{IsDigit}xyz]\s+/; # match a,b,c,x,y,z, or a digit
/^=item\s\p{IsDigit}/; # match '=item',
anchor concept. Lookahead and lookbehind are zero-width assertions
that let us specify which characters we want to test for. The
lookahead assertion is denoted by C<(?=regexp)> and the lookbehind
-assertion is denoted by C<(?<=fixed-regexp)>. Some examples are
+assertion is denoted by C<< (?<=fixed-regexp) >>. Some examples are
$x = "I catch the housecat 'Tom-cat' with catnip";
$x =~ /cat(?=\s+)/; # matches 'cat' in 'housecat'
$x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in
# middle of $x
-Note that the parentheses in C<(?=regexp)> and C<(?<=regexp)> are
+Note that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are
non-capturing, since these are zero-width assertions. Thus in the
second regexp, the substrings captured are those of the whole regexp
-itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but
-lookbehind C<(?<=fixed-regexp)> only works for regexps of fixed
-width, i.e., a fixed number of characters long. Thus C<(?<=(ab|bc))>
-is fine, but C<(?<=(ab)*)> is not. The negated versions of the
-lookahead and lookbehind assertions are denoted by C<(?!regexp)>
-and C<(?<!fixed-regexp)> respectively. They evaluate true if the
-regexps do I<not> match:
+itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but
+lookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed
+width, i.e., a fixed number of characters long. Thus
+C<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not. The
+negated versions of the lookahead and lookbehind assertions are
+denoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively.
+They evaluate true if the regexps do I<not> match:
$x = "foobar";
$x =~ /foo(?!bar)/; # doesn't match, 'bar' follows 'foo'
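
As a self-contained sketch of the assertions described above, the following collects a few results into variables. The variable names are hypothetical, and the second sample string is the one used earlier in this section.

```perl
use strict;
use warnings;

my $x = "foobar";
my $not_before_bar = $x =~ /foo(?!bar)/ ? 1 : 0;    # 'bar' follows 'foo': fails
my $not_before_baz = $x =~ /foo(?!baz)/ ? 1 : 0;    # 'baz' does not follow: matches
my $not_after_ws   = $x =~ /(?<!\s)foo/ ? 1 : 0;    # no whitespace before 'foo': matches

# Positive lookbehind: words that start with 'cat' right after whitespace
my $y = "I catch the housecat 'Tom-cat' with catnip";
my @catwords = ($y =~ /(?<=\s)(cat\w+)/g);

print "$not_before_bar $not_before_baz $not_after_ws @catwords\n";
```

Because the assertions are zero-width, the whitespace tested by the lookbehind is not part of the captured words.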
matches a DNA sequence such that it either ends in C<AAG>, or some
other base pair combination and C<C>. Note that the form is
-C<(?(?<=AA)G|C)> and not C<(?((?<=AA))G|C)>; for the lookahead,
-lookbehind or code assertions, the parentheses around the conditional
-are not needed.
+C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the
+lookahead, lookbehind or code assertions, the parentheses around the
+conditional are not needed.
=head2 A bit of magic: executing Perl code in a regular expression
# prints 'Hi Mom!'
$x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
# no 'Hi Mom!'
+
+Pay careful attention to the next example:
+
$x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match,
# no 'Hi Mom!'
+ # but why not?
+
+At first glance, you'd think that it shouldn't print, because obviously
+the C<ddd> isn't going to match the target string. But look at this
+example:
+
+ $x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match,
+ # but _does_ print
+
+Hmm. What happened here? If you've been following along, you know that
+the above pattern should be effectively the same as the last one --
+enclosing the d in a character class isn't going to change what it
+matches. So why does the first not print while the second one does?
+
+The answer lies in the optimizations the REx engine makes. In the first
+case, all the engine sees are plain old characters (aside from the
+C<?{}> construct). It's smart enough to realize that the string 'ddd'
+doesn't occur in our target string before actually running the pattern
+through. But in the second case, we've tricked it into thinking that our
+pattern is more complicated than it is. It takes a look, sees our
+character class, and decides that it will have to actually run the
+pattern to determine whether or not it matches, and in the process of
+running it hits the print statement before it discovers that we don't
+have a match.
+
+To take a closer look at how the engine does optimizations, see the
+section L<"Pragmas and debugging"> below.
+
+More fun with C<?{}>:
+
$x =~ /(?{print "Hi Mom!";})/; # matches,
# prints 'Hi Mom!'
$x =~ /(?{$c = 1;})(?{print "$c";})/; # matches,
This example uses a code expression in a conditional to match the
article 'the' in either English or German:
- use re 'eval';
$lang = 'DE'; # use German
...
$text = "das";
code expression, we don't need the extra parentheses around the
conditional.
-The S<C<use re 'eval';> > statement is needed because we are both
-interpolating the variable C<$lang> I<and> evaluating code
-within the regexp. From a security point of view, this can be
-dangerous. It is dangerous because many programmers who write search
-engines often take user input and plug it directly into a regexp:
+If you try to use code expressions with interpolating variables, perl
+may surprise you:
+
+ $bar = 5;
+ $pat = '(?{ 1 })';
+ /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated
+ /foo(?{ 1 })$bar/; # compile error!
+ /foo${pat}bar/; # compile error!
+
+ $pat = qr/(?{ $foo = 1 })/; # precompile code regexp
+ /foo${pat}bar/; # compiles ok
+
+If a regexp has (1) code expressions and interpolating variables, or
+(2) a variable that interpolates a code expression, perl treats the
+regexp as an error. If the code expression is precompiled into a
+variable, however, interpolating is ok. The question is, why is this
+an error?
+
+The reason is that variable interpolation and code expressions
+together pose a security risk. The combination is dangerous because
+many programmers who write search engines often take user input and
+plug it directly into a regexp:
$regexp = <>; # read user-supplied regexp
chomp $regexp; # get rid of possible newline
$text =~ /$regexp/; # search $text for the $regexp
-If the C<$regexp> variable is used in a code expression, the user
-could then execute arbitrary Perl code. For instance, some joker could
+If the C<$regexp> variable contains a code expression, the user could
+then execute arbitrary Perl code. For instance, some joker could
search for S<C<system('rm -rf *');> > to erase your files. In this
sense, the combination of interpolation and code expressions B<taints>
your regexp. So by default, using both interpolation and code
-expressions in the same regexp is not allowed. Only by invoking
-S<C<use re 'eval';> > can one use both interpolation and code
-expressions in the same regexp.
+expressions in the same regexp is not allowed. If you're not
+concerned about malicious users, it is possible to bypass this
+security check by invoking S<C<use re 'eval'> >:
+
+ use re 'eval'; # throw caution out the door
+ $bar = 5;
+ $pat = '(?{ 1 })';
+ /foo(?{ 1 })$bar/; # compiles ok
+ /foo${pat}bar/; # compiles ok
Another form of code expression is the S<B<pattern code expression> >.
The pattern code expression is like a regular code expression, except
expressions. It detects if a binary string C<1101010010001...> has a
Fibonacci spacing 0,1,1,2,3,5,... of the C<1>'s:
- use re 'eval';
$s0 = 0; $s1 = 1; # initial conditions
$x = "1101010010001000001";
print "It is a Fibonacci sequence\n"
Speaking of debugging, there are several pragmas available to control
and debug regexps in Perl. We have already encountered one pragma in
the previous section, S<C<use re 'eval';> >, that allows variable
-interpolation in a regexp with code expressions. The other pragmas are
+interpolation and code expressions to coexist in a regexp. The other
+pragmas are
use re 'taint';
$tainted = <>;
regexps are often used to extract the safe bits from a tainted
variable. Use C<taint> when you are not extracting safe bits, but are
performing some other processing. Both C<taint> and C<eval> pragmas
-are lexically scoped, which mean they have are in effect only until
+are lexically scoped, which means they are in effect only until
the end of the block enclosing the pragmas.
use re 'debug';
The inspiration for the stop codon DNA example came from the ZIP
code example in chapter 7 of I<Mastering Regular Expressions>.
-The author would like to thank
-Jeff Pinyan,
-Peter Haworth,
-Ronald J Kimball,
-and Joe Smith for all their helpful comments.
+The author would like to thank Jeff Pinyan, Andrew Johnson, Peter
+Haworth, Ronald J Kimball, and Joe Smith for all their helpful
+comments.
=cut
+