From: Gurusamy Sarathy
Date: Fri, 28 Apr 2000 07:44:17 +0000 (+0000)
Subject: add regular expressions tutorial and quick-start guide (from
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=47f9c88be9d598405922d17837ce4e9664e82b5a;p=p5sagit%2Fp5-mst-13.2.git

add regular expressions tutorial and quick-start guide (from
Mark Kvale)

p4raw-id: //depot/perl@5986
---

diff --git a/AUTHORS b/AUTHORS
index f978b51..aaa9b87 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -39,6 +39,7 @@ mbiggar Mark A Biggar mab@wdl.loral.com
 mbligh Martin J. Bligh mbligh@sequent.com
 mike Mike Stok mike@stok.co.uk
 millert Todd Miller millert@openbsd.org
+mkvale Mark Kvale kvale@phy.ucsf.edu
 laszlo.molnar Laszlo Molnar Laszlo.Molnar@eth.ericsson.se
 mpeix Mark Bixby markb@cccd.edu
 muir David Muir Sharnoff muir@idiom.com
diff --git a/MAINTAIN b/MAINTAIN
index 37ef489..0d80457 100644
--- a/MAINTAIN
+++ b/MAINTAIN
@@ -595,6 +595,8 @@ pod/perlport.pod pudge
 pod/perlre.pod regex
 pod/perlref.pod
 pod/perlreftut.pod mjd
+pod/perlrequick.pod mkvale
+pod/perlretut.pod mkvale
 pod/perlrun.pod
 pod/perlsec.pod
 pod/perlstyle.pod
diff --git a/MANIFEST b/MANIFEST
index 70932d8..8d4a8d3 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -1108,6 +1108,8 @@ pod/perlport.pod Portability guide
 pod/perlre.pod Regular expression info
 pod/perlref.pod References info
 pod/perlreftut.pod Mark's references tutorial
+pod/perlrequick.pod Quick start guide for Perl regular expressions
+pod/perlretut.pod Tutorial for Perl regular expressions
 pod/perlrun.pod Execution info
 pod/perlsec.pod Security info
 pod/perlstyle.pod Style info
diff --git a/pod/perlrequick.pod b/pod/perlrequick.pod
new file mode 100644
index 0000000..d151e26
--- /dev/null
+++ b/pod/perlrequick.pod
@@ -0,0 +1,473 @@
+=head1 NAME
+
+perlrequick - Perl regular expressions quick start
+
+=head1 DESCRIPTION
+
+This page covers the very basics of understanding, creating and
+using regular expressions ('regexps') in Perl.
+
+=head1 The Guide
+
+=head2 Simple word matching
+
+The simplest regexp is simply a word, or more generally, a string of
+characters. A regexp consisting of a word matches any string that
+contains that word:
+
+    "Hello World" =~ /World/;  # matches
+
+In this statement, C<World> is a regexp and the C<//> enclosing
+C</World/> tells perl to search a string for a match. The operator
+C<=~> associates the string with the regexp match and produces a true
+value if the regexp matched, or false if the regexp did not match. In
+our case, C<World> matches the second word in C<"Hello World">, so the
+expression is true. This idea has several variations.
+
+Expressions like this are useful in conditionals:
+
+    print "It matches\n" if "Hello World" =~ /World/;
+
+The sense of the match can be reversed by using the C<!~> operator:
+
+    print "It doesn't match\n" if "Hello World" !~ /World/;
+
+The literal string in the regexp can be replaced by a variable:
+
+    $greeting = "World";
+    print "It matches\n" if "Hello World" =~ /$greeting/;
+
+If you're matching against C<$_>, the C<$_ =~> part can be omitted:
+
+    $_ = "Hello World";
+    print "It matches\n" if /World/;
+
+Finally, the C<//> default delimiters for a match can be changed to
+arbitrary delimiters by putting an C<'m'> out front:
+
+    "Hello World" =~ m!World!;   # matches, delimited by '!'
+    "Hello World" =~ m{World};   # matches, note the matching '{}'
+    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
+                                 # '/' becomes an ordinary char
+
+Regexps must match a part of the string I<exactly> in order for the
+statement to be true:
+
+    "Hello World" =~ /world/;  # doesn't match, case sensitive
+    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
+    "Hello World" =~ /World /; # doesn't match, no ' ' at end
+
+perl will always match at the earliest possible point in the string:
+
+    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
+    "That hat is red" =~ /hat/; # matches 'hat' in 'That'
+
+Not all characters can be used 'as is' in a match.
+Some characters,
+called B<metacharacters>, are reserved for use in regexp notation.
+The metacharacters are
+
+    {}[]()^$.|*+?\
+
+A metacharacter can be matched by putting a backslash before it:
+
+    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
+    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
+    'C:\WIN32' =~ /C:\\WIN/;                # matches
+    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches
+
+In the last regexp, the forward slash C<'/'> is also backslashed,
+because it is used to delimit the regexp.
+
+Non-printable ASCII characters are represented by B<escape sequences>.
+Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
+for a carriage return. Arbitrary bytes are represented by octal
+escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
+e.g., C<\x1B>:
+
+    "1000\t2000" =~ m(0\t2);    # matches
+    "cat" =~ /\143\x61\x74/;    # matches, but a weird way to spell cat
+
+Regexps are treated mostly as double quoted strings, so variable
+substitution works:
+
+    $foo = 'house';
+    'cathouse' =~ /cat$foo/;   # matches
+    'housecat' =~ /${foo}cat/; # matches
+
+With all of the regexps above, if the regexp matched anywhere in the
+string, it was considered a match. To specify I<where> it should
+match, we would use the B<anchor> metacharacters C<^> and C<$>. The
+anchor C<^> means match at the beginning of the string and the anchor
+C<$> means match at the end of the string, or before a newline at the
+end of the string. Some examples:
+
+    "housekeeper" =~ /keeper/;      # matches
+    "housekeeper" =~ /^keeper/;     # doesn't match
+    "housekeeper" =~ /keeper$/;     # matches
+    "housekeeper\n" =~ /keeper$/;   # matches
+
+=head2 Using character classes
+
+A B<character class> allows a set of possible characters, rather than
+just a single character, to match at a particular point in a regexp.
+Character classes are denoted by brackets C<[...]>, with the set of
+characters to be possibly matched inside.
+Here are some examples:
+
+    /cat/;            # matches 'cat'
+    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
+    "abc" =~ /[cab]/; # matches 'a'
+
+In the last statement, even though C<'c'> is the first character in
+the class, the earliest point at which the regexp can match is C<'a'>.
+
+    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
+                    # 'yes', 'Yes', 'YES', etc.
+    /yes/i;         # also match 'yes' in a case-insensitive way
+
+The last example shows a match with an C<'i'> B<modifier>, which makes
+the match case-insensitive.
+
+Character classes also have ordinary and special characters, but the
+sets of ordinary and special characters inside a character class are
+different than those outside a character class. The special
+characters for a character class are C<-]\^$> and are matched using an
+escape:
+
+    /[\]c]def/; # matches ']def' or 'cdef'
+    $x = 'bcr';
+    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
+    /[\$x]at/;  # matches '$at' or 'xat'
+    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'
+
+The special character C<'-'> acts as a range operator within character
+classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
+become the svelte C<[0-9]> and C<[a-z]>:
+
+    /item[0-9]/;    # matches 'item0' or ... or 'item9'
+    /[0-9a-fA-F]/;  # matches a hexadecimal digit
+
+If C<'-'> is the first or last character in a character class, it is
+treated as an ordinary character.
+
+The special character C<^> in the first position of a character class
+denotes a B<negated character class>, which matches any character but
+those in the bracket. Both C<[...]> and C<[^...]> must match a
+character, or the match fails. Then
+
+    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
+               # all other 'bat', 'cat', '0at', '%at', etc.
+    /[^0-9]/;  # matches a non-numeric character
+    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary
+
+Perl has several abbreviations for common character classes:
+
+=over 4
+
+=item *
+\d is a digit and represents [0-9]
+
+=item *
+\s is a whitespace character and represents [\ \t\r\n\f]
+
+=item *
+\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]
+
+=item *
+\D is a negated \d; it represents any character but a digit [^0-9]
+
+=item *
+\S is a negated \s; it represents any non-whitespace character [^\s]
+
+=item *
+\W is a negated \w; it represents any non-word character [^\w]
+
+=item *
+The period '.' matches any character but "\n"
+
+=back
+
+The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
+of character classes. Here are some in use:
+
+    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
+    /[\d\s]/;         # matches any digit or whitespace character
+    /\w\W\w/;         # matches a word char, followed by a
+                      # non-word char, followed by a word char
+    /..rt/;           # matches any two chars, followed by 'rt'
+    /end\./;          # matches 'end.'
+    /end[.]/;         # same thing, matches 'end.'
+
+The S<B<word anchor> > C<\b> matches a boundary between a word
+character and a non-word character C<\w\W> or C<\W\w>:
+
+    $x = "Housecat catenates house and cat";
+    $x =~ /\bcat/;    # matches cat in 'catenates'
+    $x =~ /cat\b/;    # matches cat in 'housecat'
+    $x =~ /\bcat\b/;  # matches 'cat' at end of string
+
+In the last example, the end of the string is considered a word
+boundary.
+
+=head2 Matching this or that
+
+We can match different character strings with the B<alternation>
+metacharacter C<'|'>. To match C<dog> or C<cat>, we form the regexp
+C<dog|cat>. As before, perl will try to match the regexp at the
+earliest possible point in the string. At each character position,
+perl will first try to match the first alternative, C<dog>. If
+C<dog> doesn't match, perl will then try the next alternative, C<cat>.
+If C<cat> doesn't match either, then the match fails and perl moves to
+the next position in the string.
+Some examples:
+
+    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
+    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"
+
+Even though C<dog> is the first alternative in the second regexp,
+C<cat> is able to match earlier in the string.
+
+    "cats"          =~ /c|ca|cat|cats/; # matches "c"
+    "cats"          =~ /cats|cat|ca|c/; # matches "cats"
+
+At a given character position, the first alternative that allows the
+regexp match to succeed will be the one that matches. Here, all the
+alternatives match at the first string position, so the first matches.
+
+=head2 Grouping things and hierarchical matching
+
+The B<grouping> metacharacters C<()> allow a part of a regexp to be
+treated as a single unit. Parts of a regexp are grouped by enclosing
+them in parentheses. The regexp C<house(cat|keeper)> means match
+C<house> followed by either C<cat> or C<keeper>. Some more examples
+are
+
+    /(a|b)b/;    # matches 'ab' or 'bb'
+    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere
+
+    /house(cat|)/;      # matches either 'housecat' or 'house'
+    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
+                        # 'house'. Note groups can be nested.
+
+    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
+                             # because '20\d\d' can't match
+
+=head2 Extracting matches
+
+The grouping metacharacters C<()> also allow the extraction of the
+parts of a string that matched. For each grouping, the part that
+matched inside goes into the special variables C<$1>, C<$2>, etc.
+They can be used just as ordinary variables:
+
+    # extract hours, minutes, seconds
+    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
+    $hours = $1;
+    $minutes = $2;
+    $seconds = $3;
+
+In list context, a match C</regexp/> with groupings will return the
+list of matched values C<($1,$2,...)>. So we could rewrite it as
+
+    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
+
+If the groupings in a regexp are nested, C<$1> gets the group with the
+leftmost opening parenthesis, C<$2> the next opening parenthesis,
+etc.
+For example, here is a complex regexp and the matching variables
+indicated below it:
+
+    /(ab(cd|ef)((gi)|j))/;
+     1  2      34
+
+Associated with the matching variables C<$1>, C<$2>, ... are
+the B<backreferences> C<\1>, C<\2>, ... Backreferences are
+matching variables that can be used I<inside> a regexp:
+
+    /(\w\w\w)\s\1/; # find sequences like 'the the' in string
+
+C<$1>, C<$2>, ... should only be used outside of a regexp, and C<\1>,
+C<\2>, ... only inside a regexp.
+
+=head2 Matching repetitions
+
+The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
+to determine the number of repeats of a portion of a regexp we
+consider to be a match. Quantifiers are put immediately after the
+character, character class, or grouping that we want to specify. They
+have the following meanings:
+
+=over 4
+
+=item * C<a?> = match 'a' 1 or 0 times
+
+=item * C<a*> = match 'a' 0 or more times, i.e., any number of times
+
+=item * C<a+> = match 'a' 1 or more times, i.e., at least once
+
+=item * C<a{n,m}> = match at least C<n> times, but not more than C<m>
+times.
+
+=item * C<a{n,}> = match C<n> or more times
+
+=item * C<a{n}> = match exactly C<n> times
+
+=back
+
+Here are some examples:
+
+    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
+                     # any number of digits
+    /(\w+)\s+\1/;    # match doubled words of arbitrary length
+    $year =~ /\d{2,4}/;  # make sure year is at least 2 but not more
+                         # than 4 digits
+    $year =~ /\d{4}|\d{2}/;    # better match; throw out 3 digit dates
+
+These quantifiers will try to match as much of the string as possible,
+while still allowing the regexp to match. So we have
+
+    'the cat in the hat' =~ /^(.*)(at)(.*)$/; # matches,
+                            # $1 = 'the cat in the h'
+                            # $2 = 'at'
+                            # $3 = ''   (0 matches)
+
+The first quantifier C<.*> grabs as much of the string as possible
+while still having the regexp match. The second quantifier C<.*> has
+no string left to it, so it matches 0 times.
+
+=head2 More matching
+
+There are a few more things you might want to know about matching
+operators.
+In the code
+
+    $pattern = 'Seuss';
+    while (<>) {
+        print if /$pattern/;
+    }
+
+perl has to re-evaluate C<$pattern> each time through the loop. If
+C<$pattern> won't be changing, use the C<//o> modifier, to only
+perform variable substitutions once. If you don't want any
+substitutions at all, use the special delimiter C<m''>:
+
+    $pattern = 'Seuss';
+    m'$pattern'; # matches '$pattern', not 'Seuss'
+
+The global modifier C<//g> allows the matching operator to match
+within a string as many times as possible. In scalar context,
+successive matches against a string will have C<//g> jump from match
+to match, keeping track of position in the string as it goes along.
+You can get or set the position with the C<pos()> function.
+For example,
+
+    $x = "cat dog house"; # 3 words
+    while ($x =~ /(\w+)/g) {
+        print "Word is $1, ends at position ", pos $x, "\n";
+    }
+
+prints
+
+    Word is cat, ends at position 3
+    Word is dog, ends at position 7
+    Word is house, ends at position 13
+
+A failed match or changing the target string resets the position. If
+you don't want the position reset after failure to match, add the
+C<//c> modifier, as in C</regexp/gc>.
+
+In list context, C<//g> returns a list of matched groupings, or if
+there are no groupings, a list of matches to the whole regexp. So
+
+    @words = ($x =~ /(\w+)/g);  # matches,
+                                # $words[0] = 'cat'
+                                # $words[1] = 'dog'
+                                # $words[2] = 'house'
+
+=head2 Search and replace
+
+Search and replace is performed using C<s/regexp/replacement/modifiers>.
+The C<replacement> is a Perl double quoted string that replaces in the
+string whatever is matched with the C<regexp>. The operator C<=~> is
+also used here to associate a string with C<s///>. If matching
+against C<$_>, the S<C<$_ =~> > can be dropped. If there is a match,
+C<s///> returns the number of substitutions made, otherwise it returns
+false. Here are a few examples:
+
+    $x = "Time to feed the cat!";
+    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
+    $y = "'quoted words'";
+    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
+                           # $y contains "quoted words"
+
+With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
+are immediately available for use in the replacement expression. With
+the global modifier, C<s///g> will search and replace all occurrences
+of the regexp in the string:
+
+    $x = "I batted 4 for 4";
+    $x =~ s/4/four/;   # $x contains "I batted four for 4"
+    $x = "I batted 4 for 4";
+    $x =~ s/4/four/g;  # $x contains "I batted four for four"
+
+The evaluation modifier C<s///e> wraps an C<eval{...}> around the
+replacement string and the evaluated result is substituted for the
+matched substring. This counts character frequencies in a line:
+
+    $x = "the cat";
+    $x =~ s/(.)/$chars{$1}++;$1/eg;  # final $1 replaces char with itself
+    print "frequency of '$_' is $chars{$_}\n"
+        foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);
+
+This prints (entries with equal counts may appear in any order)
+
+    frequency of 't' is 2
+    frequency of 'e' is 1
+    frequency of ' ' is 1
+    frequency of 'h' is 1
+    frequency of 'a' is 1
+    frequency of 'c' is 1
+
+C<s///> can use other delimiters, such as C<s!!!> and C<s{}{}>, and
+even C<s{}//>. If single quotes are used C<s'''>, then the regexp and
+replacement are treated as single quoted strings.
+
+=head2 The split operator
+
+C<split /regexp/, string> splits C<string> into a list of substrings
+and returns that list. The regexp determines the character sequence
+that C<string> is split with respect to. For example, to split a
+string into words, use
+
+    $x = "Calvin and Hobbes";
+    @words = split /\s+/, $x;  # $words[0] = 'Calvin'
+                               # $words[1] = 'and'
+                               # $words[2] = 'Hobbes'
+
+If the empty regexp C<//> is used, the string is split into individual
+characters. If the regexp has groupings, then the list produced contains
+the matched substrings from the groupings as well:
+
+    $x = "/usr/bin";
+    @parts = split m!(/)!, $x;  # $parts[0] = ''
+                                # $parts[1] = '/'
+                                # $parts[2] = 'usr'
+                                # $parts[3] = '/'
+                                # $parts[4] = 'bin'
+
+Since the first character of $x matched the regexp, C<split> prepended
+an empty initial element to the list.
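The split behaviour described in the quick-start guide can be checked with a short standalone script. This is only a sketch, not part of the patch: the sample record and the variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical colon-separated record, invented for this example.
my $record = "alice:x:1001:/home/alice";

# No groupings in the regexp: only the fields are returned.
my @fields = split /:/, $record;

# A grouping in the regexp: the separators are returned as well.
my @with_seps = split /(:)/, "a:b";

# The string starts with the separator, so the first element is empty.
my @parts = split m!/!, "/usr/bin";

print join("|", @fields), "\n";       # alice|x|1001|/home/alice
print join("|", @with_seps), "\n";    # a|:|b
print scalar(@parts), " parts\n";     # 3 parts
```

The delimiter m!/! is used in the last call so that the forward slash does not need a backslash.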
+
+=head1 BUGS
+
+None.
+
+=head1 SEE ALSO
+
+This is just a quick start guide. For a more in-depth tutorial on
+regexps, see L<perlretut> and for the reference page, see L<perlre>.
+
+=head1 AUTHOR AND COPYRIGHT
+
+Copyright (c) 2000 Mark Kvale
+All rights reserved.
+
+This document may be distributed under the same terms as Perl itself.
+
+=cut
+
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
new file mode 100644
index 0000000..9ff41b2
--- /dev/null
+++ b/pod/perlretut.pod
@@ -0,0 +1,2361 @@
+=head1 NAME
+
+perlretut - Perl regular expressions tutorial
+
+=head1 DESCRIPTION
+
+This page provides a basic tutorial on understanding, creating and
+using regular expressions in Perl. It serves as a complement to the
+reference page on regular expressions L<perlre>. Regular expressions
+are an integral part of the C<m//>, C<s///>, C<qr//> and C<split>
+operators and so this tutorial also overlaps with
+L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>.
+
+Perl is widely renowned for excellence in text processing, and regular
+expressions are one of the big factors behind this fame. Perl regular
+expressions display an efficiency and flexibility unknown in most
+other computer languages. Mastering even the basics of regular
+expressions will allow you to manipulate text with surprising ease.
+
+What is a regular expression? A regular expression is simply a string
+that describes a pattern. Patterns are in common use these days;
+examples are the patterns typed into a search engine to find web pages
+and the patterns used to list files in a directory, e.g., C<ls *.txt>
+or C<dir *.*>. In Perl, the patterns described by regular expressions
+are used to search strings, extract desired parts of strings, and to
+do search and replace operations.
+
+Regular expressions have the undeserved reputation of being abstract
+and difficult to understand. Regular expressions are constructed using
+simple concepts like conditionals and loops and are no more difficult
+to understand than the corresponding C<if> conditionals and C<while>
+loops in the Perl language itself.
+In fact, the main challenge in
+learning regular expressions is just getting used to the terse
+notation used to express these concepts.
+
+This tutorial flattens the learning curve by discussing regular
+expression concepts, along with their notation, one at a time and with
+many examples. The first part of the tutorial will progress from the
+simplest word searches to the basic regular expression concepts. If
+you master the first part, you will have all the tools needed to solve
+about 98% of your needs. The second part of the tutorial is for those
+comfortable with the basics and hungry for more power tools. It
+discusses the more advanced regular expression operators and
+introduces the latest cutting edge innovations in 5.6.0.
+
+A note: to save time, 'regular expression' is often abbreviated as
+regexp or regex. Regexp is a more natural abbreviation than regex, but
+is harder to pronounce. The Perl pod documentation is evenly split on
+regexp vs regex; in Perl, there is more than one way to abbreviate it.
+We'll use regexp in this tutorial.
+
+=head1 Part 1: The basics
+
+=head2 Simple word matching
+
+The simplest regexp is simply a word, or more generally, a string of
+characters. A regexp consisting of a word matches any string that
+contains that word:
+
+    "Hello World" =~ /World/;  # matches
+
+What is this perl statement all about? C<"Hello World"> is a simple
+double quoted string. C<World> is the regular expression and the
+C<//> enclosing C</World/> tells perl to search a string for a match.
+The operator C<=~> associates the string with the regexp match and
+produces a true value if the regexp matched, or false if the regexp
+did not match. In our case, C<World> matches the second word in
+C<"Hello World">, so the expression is true. Expressions like this
+are useful in conditionals:
+
+    if ("Hello World" =~ /World/) {
+        print "It matches\n";
+    }
+    else {
+        print "It doesn't match\n";
+    }
+
+There are useful variations on this theme.
+The sense of the match can
+be reversed by using the C<!~> operator:
+
+    if ("Hello World" !~ /World/) {
+        print "It doesn't match\n";
+    }
+    else {
+        print "It matches\n";
+    }
+
+The literal string in the regexp can be replaced by a variable:
+
+    $greeting = "World";
+    if ("Hello World" =~ /$greeting/) {
+        print "It matches\n";
+    }
+    else {
+        print "It doesn't match\n";
+    }
+
+If you're matching against the special default variable C<$_>, the
+C<$_ =~> part can be omitted:
+
+    $_ = "Hello World";
+    if (/World/) {
+        print "It matches\n";
+    }
+    else {
+        print "It doesn't match\n";
+    }
+
+And finally, the C<//> default delimiters for a match can be changed
+to arbitrary delimiters by putting an C<'m'> out front:
+
+    "Hello World" =~ m!World!;   # matches, delimited by '!'
+    "Hello World" =~ m{World};   # matches, note the matching '{}'
+    "/usr/bin/perl" =~ m"/perl"; # matches, '/' becomes ordinary char
+
+C</World/>, C<m!World!>, and C<m{World}> all represent the
+same thing. When, e.g., C<""> is used as a delimiter, the forward
+slash C<'/'> becomes an ordinary character and can be used in a regexp
+without trouble.
+
+Let's consider how different regexps would match C<"Hello World">:
+
+    "Hello World" =~ /world/;  # doesn't match
+    "Hello World" =~ /o W/;    # matches
+    "Hello World" =~ /oW/;     # doesn't match
+    "Hello World" =~ /World /; # doesn't match
+
+The first regexp C<world> doesn't match because regexps are
+case-sensitive. The second regexp matches because the substring
+S<C<'o W'> > occurs in the string S<C<"Hello World"> >. The space
+character ' ' is treated like any other character in a regexp and is
+needed to match in this case. The lack of a space character is the
+reason the third regexp C<'oW'> doesn't match. The fourth regexp
+C<'World '> doesn't match because there is a space at the end of the
+regexp, but not at the end of the string. The lesson here is that
+regexps must match a part of the string I<exactly> in order for the
+statement to be true.
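The four "Hello World" cases above can be run in one go. The following is a minimal sketch, not part of the patch; the hash C<%matched> and the loop are assumptions made for illustration, and each pattern is interpolated into the regexp as a variable, as the tutorial describes.

```perl
use strict;
use warnings;

my $text = "Hello World";

# The four patterns discussed above, kept as plain strings.
my @patterns = ('world', 'o W', 'oW', 'World ');

my %matched;
for my $p (@patterns) {
    # $p is substituted into the regexp before matching.
    $matched{$p} = ($text =~ /$p/) ? 1 : 0;
    printf "%-10s %s\n", "'$p'", $matched{$p} ? "matches" : "doesn't match";
}
```

Only 'o W' matches: case, the space inside the pattern, and the trailing space all have to agree with the string exactly.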
+
+If a regexp matches in more than one place in the string, perl will
+always match at the earliest possible point in the string:
+
+    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
+    "That hat is red" =~ /hat/; # matches 'hat' in 'That'
+
+With respect to character matching, there are a few more points you
+need to know about. First of all, not all characters can be used 'as
+is' in a match. Some characters, called B<metacharacters>, are reserved
+for use in regexp notation. The metacharacters are
+
+    {}[]()^$.|*+?\
+
+The significance of each of these will be explained
+in the rest of the tutorial, but for now, it is important only to know
+that a metacharacter can be matched by putting a backslash before it:
+
+    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
+    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
+    "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
+    "The interval is [0,1)." =~ /\[0,1\)\./  # matches
+    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches
+
+In the last regexp, the forward slash C<'/'> is also backslashed,
+because it is used to delimit the regexp. This can lead to LTS
+(leaning toothpick syndrome), however, and it is often more readable
+to change delimiters.
+
+
+The backslash character C<'\'> is a metacharacter itself and needs to
+be backslashed:
+
+    'C:\WIN32' =~ /C:\\WIN/;   # matches
+
+In addition to the metacharacters, there are some ASCII characters
+which don't have printable character equivalents and are instead
+represented by B<escape sequences>. Common examples are C<\t> for a
+tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a
+bell. If your string is better thought of as a sequence of arbitrary
+bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
+sequence, e.g., C<\x1B> may be a more natural representation for your
+bytes.
+Here are some examples of escapes:
+
+    "1000\t2000" =~ m(0\t2)        # matches
+    "1000\n2000" =~ /0\n20/        # matches
+    "1000\t2000" =~ /\000\t2/      # doesn't match, "0" ne "\000"
+    "cat"        =~ /\143\x61\x74/ # matches, but a weird way to spell cat
+
+If you've been around Perl a while, all this talk of escape sequences
+may seem familiar. Similar escape sequences are used in double-quoted
+strings and in fact the regexps in Perl are mostly treated as
+double-quoted strings. This means that variables can be used in
+regexps as well. Just like double-quoted strings, the values of the
+variables in the regexp will be substituted in before the regexp is
+evaluated for matching purposes. So we have:
+
+    $foo = 'house';
+    'housecat' =~ /$foo/;      # matches
+    'cathouse' =~ /cat$foo/;   # matches
+    'housecat' =~ /$foocat/;   # doesn't match, there is no $foocat
+    'housecat' =~ /${foo}cat/; # matches
+
+So far, so good. With the knowledge above you can already perform
+searches with just about any literal string regexp you can dream up.
+Here is a I<very simple> emulation of the Unix grep program:
+
+    % cat > simple_grep
+    #!/usr/bin/perl
+    $regexp = shift;
+    while (<>) {
+        print if /$regexp/;
+    }
+    ^D
+
+    % chmod +x simple_grep
+
+    % simple_grep abba /usr/dict/words
+    Babbage
+    cabbage
+    cabbages
+    sabbath
+    Sabbathize
+    Sabbathizes
+    sabbatical
+    scabbard
+    scabbards
+
+This program is easy to understand. C<#!/usr/bin/perl> is the standard
+way to invoke a perl program from the shell.
+S<C< $regexp = shift; > > saves the first command line argument as the
+regexp to be used, leaving the rest of the command line arguments to
+be treated as files. S<C<< while (<>) >> > loops over all the lines in
+all the files. For each line, S<C< print if /$regexp/; > > prints the
+line if the regexp matches the line. In this line, both C<print> and
+C</$regexp/> use the default variable C<$_> implicitly.
+
+With all of the regexps above, if the regexp matched anywhere in the
+string, it was considered a match. Sometimes, however, we'd like to
+specify I<where> in the string the regexp should try to match.
+To do
+this, we would use the B<anchor> metacharacters C<^> and C<$>. The
+anchor C<^> means match at the beginning of the string and the anchor
+C<$> means match at the end of the string, or before a newline at the
+end of the string. Here is how they are used:
+
+    "housekeeper" =~ /keeper/;    # matches
+    "housekeeper" =~ /^keeper/;   # doesn't match
+    "housekeeper" =~ /keeper$/;   # matches
+    "housekeeper\n" =~ /keeper$/; # matches
+
+The second regexp doesn't match because C<^> constrains C<keeper> to
+match only at the beginning of the string, but C<"housekeeper"> has
+keeper starting in the middle. The third regexp does match, since the
+C<$> constrains C<keeper> to match only at the end of the string.
+
+When both C<^> and C<$> are used at the same time, the regexp has to
+match both the beginning and the end of the string, i.e., the regexp
+matches the whole string. Consider
+
+    "keeper" =~ /^keep$/;      # doesn't match
+    "keeper" =~ /^keeper$/;    # matches
+    ""       =~ /^$/;          # ^$ matches an empty string
+
+The first regexp doesn't match because the string has more to it than
+C<keep>. Since the second regexp is exactly the string, it
+matches. Using both C<^> and C<$> in a regexp forces the complete
+string to match, so it gives you complete control over which strings
+match and which don't. Suppose you are looking for a fellow named
+bert, off in a string by himself:
+
+    "dogbert" =~ /bert/;   # matches, but not what you want
+
+    "dilbert" =~ /^bert/;  # doesn't match, but ..
+    "bertram" =~ /^bert/;  # matches, so still not good enough
+
+    "bertram" =~ /^bert$/; # doesn't match, good
+    "dilbert" =~ /^bert$/; # doesn't match, good
+    "bert"    =~ /^bert$/; # matches, perfect
+
+Of course, in the case of a literal string, one could just as easily
+use the string equivalence S<C<$string eq 'bert'> > and it would be
+more efficient. The C<^...$> regexp really becomes useful when we
+add in the more powerful regexp tools below.
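One practical difference between the anchored regexp and a plain string comparison is the trailing newline. A small sketch, not part of the patch, with an assumed candidate list:

```perl
use strict;
use warnings;

# Hypothetical candidate strings; the last one carries a trailing newline.
my @candidates = ("dogbert", "bertram", "bert", "bert\n");

# ^bert$ accepts 'bert' and also 'bert' followed by a final newline,
# because $ can match before a newline at the end of the string.
my @whole = grep { /^bert$/ } @candidates;

# String equality accepts only the exact string 'bert'.
my @exact = grep { $_ eq 'bert' } @candidates;

print scalar(@whole), " anchored matches, ", scalar(@exact), " exact match\n";
```

This prints "2 anchored matches, 1 exact match", which is why the anchored form is convenient for lines read from a file that still carry their newline.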
+
+=head2 Using character classes
+
+Although one can already do quite a lot with the literal string
+regexps above, we've only scratched the surface of regular expression
+technology. In this and subsequent sections we will introduce regexp
+concepts (and associated metacharacter notations) that will allow a
+regexp to not just represent a single character sequence, but a I<whole class> of them.
+
+One such concept is that of a B<character class>. A character class
+allows a set of possible characters, rather than just a single
+character, to match at a particular point in a regexp. Character
+classes are denoted by brackets C<[...]>, with the set of characters
+to be possibly matched inside. Here are some examples:
+
+    /cat/;       # matches 'cat'
+    /[bcr]at/;   # matches 'bat', 'cat', or 'rat'
+    /item[0123456789]/;  # matches 'item0' or ... or 'item9'
+    "abc" =~ /[cab]/;    # matches 'a'
+
+In the last statement, even though C<'c'> is the first character in
+the class, C<'a'> matches because the first character position in the
+string is the earliest point at which the regexp can match.
+
+    /[yY][eE][sS]/;      # match 'yes' in a case-insensitive way
+                         # 'yes', 'Yes', 'YES', etc.
+
+This regexp displays a common task: perform a case-insensitive
+match. Perl provides a way of avoiding all those brackets by simply
+appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;>
+can be rewritten as C</yes/i;>. The C<'i'> stands for
+case-insensitive and is an example of a B<modifier> of the matching
+operation. We will meet other modifiers later in the tutorial.
+
+We saw in the section above that there were ordinary characters, which
+represented themselves, and special characters, which needed a
+backslash C<\> to represent themselves. The same is true in a
+character class, but the sets of ordinary and special characters
+inside a character class are different than those outside a character
+class. The special characters for a character class are C<-]\^$>. C<]>
+is special because it denotes the end of a character class.
+C<$> is
+special because it denotes a scalar variable. C<\> is special because
+it is used in escape sequences, just like above. Here is how the
+special characters C<]$\> are handled:
+
+    /[\]c]def/; # matches ']def' or 'cdef'
+    $x = 'bcr';
+    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
+    /[\$x]at/;  # matches '$at' or 'xat'
+    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'
+
+The last two are a little tricky. In C<[\$x]>, the backslash protects
+the dollar sign, so the character class has two members C<$> and C<x>.
+In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a
+variable and substituted in double quote fashion.
+
+The special character C<'-'> acts as a range operator within character
+classes, so that a contiguous set of characters can be written as a
+range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]>
+become the svelte C<[0-9]> and C<[a-z]>. Some examples are
+
+    /item[0-9]/;    # matches 'item0' or ... or 'item9'
+    /[0-9bx-z]aa/;  # matches '0aa', ..., '9aa',
+                    # 'baa', 'xaa', 'yaa', or 'zaa'
+    /[0-9a-fA-F]/;  # matches a hexadecimal digit
+    /[0-9a-zA-Z_]/; # matches an alphanumeric character,
+                    # like those in a perl variable name
+
+If C<'-'> is the first or last character in a character class, it is
+treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are
+all equivalent.
+
+The special character C<^> in the first position of a character class
+denotes a B<negated character class>, which matches any character but
+those in the bracket. Both C<[...]> and C<[^...]> must match a
+character, or the match fails. Then
+
+    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
+               # all other 'bat', 'cat', '0at', '%at', etc.
+    /[^0-9]/;  # matches a non-numeric character
+    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary
+
+Now, even C<[0-9]> can be a bother to write multiple times, so in the
+interest of saving keystrokes and making regexps more readable, Perl
+has several abbreviations for common character classes:
+
+=over 4
+
+=item *
+\d is a digit and represents [0-9]
+
+=item *
+\s is a whitespace character and represents [\ \t\r\n\f]
+
+=item *
+\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]
+
+=item *
+\D is a negated \d; it represents any character but a digit [^0-9]
+
+=item *
+\S is a negated \s; it represents any non-whitespace character [^\s]
+
+=item *
+\W is a negated \w; it represents any non-word character [^\w]
+
+=item *
+The period '.' matches any character but "\n"
+
+=back
+
+The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
+of character classes. Here are some in use:
+
+    /\d\d:\d\d:\d\d/;  # matches a hh:mm:ss time format
+    /[\d\s]/;          # matches any digit or whitespace character
+    /\w\W\w/;          # matches a word char, followed by a
+                       # non-word char, followed by a word char
+    /..rt/;            # matches any two chars, followed by 'rt'
+    /end\./;           # matches 'end.'
+    /end[.]/;          # same thing, matches 'end.'
+
+Because a period is a metacharacter, it needs to be escaped to match
+as an ordinary period. Because, for example, C<\d> and C<\w> are sets
+of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in
+fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as
+C<[\W]>. Think De Morgan's laws.
+
+An anchor useful in basic regexps is the B<word anchor>
+C<\b>.
+This matches a boundary between a word character and a non-word
+character C<\w\W> or C<\W\w>:
+
+    $x = "Housecat catenates house and cat";
+    $x =~ /cat/;      # matches cat in 'Housecat'
+    $x =~ /\bcat/;    # matches cat in 'catenates'
+    $x =~ /cat\b/;    # matches cat in 'Housecat'
+    $x =~ /\bcat\b/;  # matches 'cat' at end of string
+
+Note in the last example, the end of the string is considered a word
+boundary.
+
+You might wonder why C<'.'> matches everything but C<"\n"> - why not
+every character? The reason is that often one is matching against
+lines and would like to ignore the newline characters. For instance,
+while the string C<"\n"> represents one line, we would like to think
+of it as empty. Then
+
+    ""   =~ /^$/;    # matches
+    "\n" =~ /^$/;    # matches, "\n" is ignored
+
+    ""   =~ /./;     # doesn't match; it needs a char
+    ""   =~ /^.$/;   # doesn't match; it needs a char
+    "\n" =~ /^.$/;   # doesn't match; it needs a char other than "\n"
+    "a"  =~ /^.$/;   # matches
+    "a\n" =~ /^.$/;  # matches, ignores the "\n"
+
+This behavior is convenient, because we usually want to ignore
+newlines when we count and match characters in a line. Sometimes,
+however, we want to keep track of newlines. We might even want C<^>
+and C<$> to anchor at the beginning and end of lines within the
+string, rather than just the beginning and end of the string. Perl
+allows us to choose between ignoring and paying attention to newlines
+by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for
+single line and multi-line and they determine whether a string is to
+be treated as one continuous string, or as a set of lines. The two
+modifiers affect two aspects of how the regexp is interpreted: 1) how
+the C<'.'> character class is defined, and 2) where the anchors C<^>
+and C<$> are able to match. Here are the four possible combinations:
+
+=over 4
+
+=item *
+no modifiers (//): Default behavior. C<'.'> matches any character
+except C<"\n">.
+C<^> matches only at the beginning of the string and
+C<$> matches only at the end or before a newline at the end.
+
+=item *
+s modifier (//s): Treat string as a single long line. C<'.'> matches
+any character, even C<"\n">. C<^> matches only at the beginning of
+the string and C<$> matches only at the end or before a newline at the
+end.
+
+=item *
+m modifier (//m): Treat string as a set of multiple lines. C<'.'>
+matches any character except C<"\n">. C<^> and C<$> are able to match
+at the start or end of I<any> line within the string.
+
+=item *
+both s and m modifiers (//sm): Treat string as a single long line, but
+detect multiple lines. C<'.'> matches any character, even
+C<"\n">. C<^> and C<$>, however, are able to match at the start or end
+of I<any> line within the string.
+
+=back
+
+Here are examples of C<//s> and C<//m> in action:
+
+    $x = "There once was a girl\nWho programmed in Perl\n";
+
+    $x =~ /^Who/;    # doesn't match, "Who" not at start of string
+    $x =~ /^Who/s;   # doesn't match, "Who" not at start of string
+    $x =~ /^Who/m;   # matches, "Who" at start of second line
+    $x =~ /^Who/sm;  # matches, "Who" at start of second line
+
+    $x =~ /girl.Who/;    # doesn't match, "." doesn't match "\n"
+    $x =~ /girl.Who/s;   # matches, "." matches "\n"
+    $x =~ /girl.Who/m;   # doesn't match, "." doesn't match "\n"
+    $x =~ /girl.Who/sm;  # matches, "." matches "\n"
+
+Most of the time, the default behavior is what is wanted, but C<//s> and
+C<//m> are occasionally very useful.
+If C<//m> is being used, the start
+of the string can still be matched with C<\A> and the end of string
+can still be matched with the anchors C<\Z> (matches both the end and
+the newline before, like C<$>), and C<\z> (matches only the end):
+
+    $x =~ /^Who/m;    # matches, "Who" at start of second line
+    $x =~ /\AWho/m;   # doesn't match, "Who" is not at start of string
+
+    $x =~ /girl$/m;   # matches, "girl" at end of first line
+    $x =~ /girl\Z/m;  # doesn't match, "girl" is not at end of string
+
+    $x =~ /Perl\Z/m;  # matches, "Perl" is at newline before end
+    $x =~ /Perl\z/m;  # doesn't match, "Perl" is not at end of string
+
+We now know how to create choices among classes of characters in a
+regexp. What about choices among words or character strings? Such
+choices are described in the next section.
+
+=head2 Matching this or that
+
+Sometimes we would like our regexp to be able to match different
+possible words or character strings. This is accomplished by using
+the B<alternation> metacharacter C<|>. To match C<dog> or C<cat>, we
+form the regexp C<dog|cat>. As before, perl will try to match the
+regexp at the earliest possible point in the string. At each
+character position, perl will first try to match the first
+alternative, C<dog>. If C<dog> doesn't match, perl will then try the
+next alternative, C<cat>. If C<cat> doesn't match either, then the
+match fails and perl moves to the next position in the string. Some
+examples:
+
+    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
+    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"
+
+Even though C<dog> is the first alternative in the second regexp,
+C<cat> is able to match earlier in the string.
+
+    "cats" =~ /c|ca|cat|cats/;  # matches "c"
+    "cats" =~ /cats|cat|ca|c/;  # matches "cats"
+
+Here, all the alternatives match at the first string position, so the
+first alternative is the one that matches. If some of the
+alternatives are truncations of the others, put the longest ones first
+to give them a chance to match.
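
The effect of alternative order can be checked with a short script. This is only a sketch: C<match_text> is a hypothetical helper introduced for the illustration, not something provided by perl, and it uses a capture group rather than C<$&> to report what matched.

```perl
use strict;
use warnings;

# match_text is a hypothetical helper for this illustration only.
# It returns the text matched by $regexp in $string, or undef if
# there is no match.  The outer parentheses capture the whole match.
sub match_text {
    my ($string, $regexp) = @_;
    return $string =~ /($regexp)/ ? $1 : undef;
}

# All alternatives can match at position 0, so the leftmost one wins.
print match_text("cats", qr/c|ca|cat|cats/), "\n";   # prints "c"
print match_text("cats", qr/cats|cat|ca|c/), "\n";   # prints "cats"

# The earliest string position beats the order of the alternatives.
print match_text("cats and dogs", qr/dog|cat|bird/), "\n";   # prints "cat"
```

Putting the longest alternative first only matters when several alternatives can match at the same string position; it never overrides the rule that the earliest position wins.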
+
+    "cab" =~ /a|b|c/  # matches "c"
+                      # /a|b|c/ == /[abc]/
+
+The last example points out that character classes are like
+alternations of characters. At a given character position, the first
+alternative that allows the regexp match to succeed will be the one
+that matches.
+
+=head2 Grouping things and hierarchical matching
+
+Alternation allows a regexp to choose among alternatives, but by
+itself it is unsatisfying. The reason is that each alternative is a whole
+regexp, but sometimes we want alternatives for just part of a
+regexp. For instance, suppose we want to search for housecats or
+housekeepers. The regexp C<housecat|housekeeper> fits the bill, but is
+inefficient because we had to type C<house> twice. It would be nice to
+have parts of the regexp be constant, like C<house>, and some
+parts have alternatives, like C<cat|keeper>.
+
+The B<grouping> metacharacters C<()> solve this problem. Grouping
+allows parts of a regexp to be treated as a single unit. Parts of a
+regexp are grouped by enclosing them in parentheses. Thus we could solve
+the C<housecat|housekeeper> problem by forming the regexp as
+C<house(cat|keeper)>. The regexp C<house(cat|keeper)> means match
+C<house> followed by either C<cat> or C<keeper>. Some more examples
+are
+
+    /(a|b)b/;     # matches 'ab' or 'bb'
+    /(ac|b)b/;    # matches 'acb' or 'bb'
+    /(^a|b)c/;    # matches 'ac' at start of string or 'bc' anywhere
+    /(a|[bc])d/;  # matches 'ad', 'bd', or 'cd'
+
+    /house(cat|)/;      # matches either 'housecat' or 'house'
+    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
+                        # 'house'. Note groups can be nested.
+
+    /(19|20|)\d\d/;  # match years 19xx, 20xx, or the Y2K problem, xx
+    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
+                             # because '20\d\d' can't match
+
+Alternations behave the same way in groups as out of them: at a given
+string position, the leftmost alternative that allows the regexp to
+match is taken. So in the last example at the first string position,
+C<"20"> matches the second alternative, but there is nothing left over
+to match the next two digits C<\d\d>.
+So perl moves on to the next
+alternative, which is the null alternative and that works, since
+C<"20"> is two digits.
+
+The process of trying one alternative, seeing if it matches, and
+moving on to the next alternative if it doesn't, is called
+B<backtracking>. The term 'backtracking' comes from the idea that
+matching a regexp is like a walk in the woods. Successfully matching
+a regexp is like arriving at a destination. There are many possible
+trailheads, one for each string position, and each one is tried in
+order, left to right. From each trailhead there may be many paths,
+some of which get you there, and some of which are dead ends. When you
+walk along a trail and hit a dead end, you have to backtrack along the
+trail to an earlier point to try another trail. If you hit your
+destination, you stop immediately and forget about trying all the
+other trails. You are persistent, and only if you have tried all the
+trails from all the trailheads and not arrived at your destination, do
+you declare failure. To be concrete, here is a step-by-step analysis
+of what perl does when it tries to match the regexp
+
+    "abcde" =~ /(abd|abc)(df|d|de)/;
+
+=over 4
+
+=item 0 Start with the first letter in the string 'a'.
+
+=item 1 Try the first alternative in the first group 'abd'.
+
+=item 2 Match 'a' followed by 'b'. So far so good.
+
+=item 3 'd' in the regexp doesn't match 'c' in the string - a dead
+end. So backtrack two characters and pick the second alternative in
+the first group 'abc'.
+
+=item 4 Match 'a' followed by 'b' followed by 'c'. We are on a roll
+and have satisfied the first group. Set $1 to 'abc'.
+
+=item 5 Move on to the second group and pick the first alternative
+'df'.
+
+=item 6 Match the 'd'.
+
+=item 7 'f' in the regexp doesn't match 'e' in the string, so a dead
+end. Backtrack one character and pick the second alternative in the
+second group 'd'.
+
+=item 8 'd' matches. The second grouping is satisfied, so set $2 to
+'d'.
+
+=item 9 We are at the end of the regexp, so we are done! We have
+matched 'abcd' out of the string "abcde".
+
+=back
+
+There are a couple of things to note about this analysis. First, the
+third alternative in the second group 'de' also allows a match, but we
+stopped before we got to it - at a given character position, leftmost
+wins. Second, we were able to get a match at the first character
+position of the string 'a'. If there were no matches at the first
+position, perl would move to the second character position 'b' and
+attempt the match all over again. Only when all possible paths at all
+possible character positions have been exhausted does perl give
+up and declare C<$string =~ /(abd|abc)(df|d|de)/> to be false.
+
+Even with all this work, regexp matching happens remarkably fast. To
+speed things up, during the compilation stage, perl compiles the regexp
+into a compact sequence of opcodes that can often fit inside a
+processor cache. When the code is executed, these opcodes can then run
+at full throttle and search very quickly.
+
+=head2 Extracting matches
+
+The grouping metacharacters C<()> also serve another completely
+different function: they allow the extraction of the parts of a string
+that matched. This is very useful to find out what matched and for
+text processing in general. For each grouping, the part that matched
+inside goes into the special variables C<$1>, C<$2>, etc. They can be
+used just as ordinary variables:
+
+    # extract hours, minutes, seconds
+    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
+    $hours = $1;
+    $minutes = $2;
+    $seconds = $3;
+
+Now, we know that in scalar context,
+C<$time =~ /(\d\d):(\d\d):(\d\d)/> returns a true or false
+value. In list context, however, it returns the list of matched values
+C<($1,$2,$3)>.
+So we could write the code more compactly as
+
+    # extract hours, minutes, seconds
+    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
+
+If the groupings in a regexp are nested, C<$1> gets the group with the
+leftmost opening parenthesis, C<$2> the next opening parenthesis,
+etc. For example, here is a complex regexp and the matching variables
+indicated below it:
+
+    /(ab(cd|ef)((gi)|j))/;
+     1  2      34
+
+so that if the regexp matched, e.g., C<$2> would contain 'cd' or 'ef'.
+For convenience, perl sets C<$+> to the highest numbered C<$1>, C<$2>,
+... that got assigned.
+
+Closely associated with the matching variables C<$1>, C<$2>, ... are
+the B<backreferences> C<\1>, C<\2>, ... . Backreferences are simply
+matching variables that can be used I<inside> a regexp. This is a
+really nice feature - what matches later in a regexp can depend on
+what matched earlier in the regexp. Suppose we wanted to look
+for doubled words in text, like 'the the'. The following regexp finds
+all 3-letter doubles with a space in between:
+
+    /(\w\w\w)\s\1/;
+
+The grouping assigns a value to \1, so that the same 3-letter sequence
+is used for both parts. Here are some words with repeated parts:
+
+    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
+    beriberi
+    booboo
+    coco
+    mama
+    murmur
+    papa
+
+The regexp has a single grouping which considers 4-letter
+combinations, then 3-letter combinations, etc. and uses C<\1> to look for
+a repeat. Although C<$1> and C<\1> represent the same thing, care should be
+taken to use matched variables C<$1>, C<$2>, ... only outside a regexp
+and backreferences C<\1>, C<\2>, ... only inside a regexp; not doing
+so may lead to surprising and/or undefined results.
+
+In addition to what was matched, Perl 5.6.0 also provides the
+positions of what was matched with the C<@-> and C<@+>
+arrays. C<$-[0]> is the position of the start of the entire match and
+C<$+[0]> is the position of the end.
+Similarly, C<$-[n]> is the
+position of the start of the C<$n> match and C<$+[n]> is the position
+of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then
+this code
+
+    $x = "Mmm...donut, thought Homer";
+    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/;  # matches
+    foreach $expr (1..$#-) {
+        print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
+    }
+
+prints
+
+    Match 1: 'Mmm' at position (0,3)
+    Match 2: 'donut' at position (6,11)
+
+Even if there are no groupings in a regexp, it is still possible to
+find out what exactly matched in a string. If you use them, perl
+will set C<$`> to the part of the string before the match, will set C<$&>
+to the part of the string that matched, and will set C<$'> to the part
+of the string after the match. An example:
+
+    $x = "the cat caught the mouse";
+    $x =~ /cat/;  # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
+    $x =~ /the/;  # $` = '', $& = 'the', $' = ' cat caught the mouse'
+
+In the second match, C<$` = ''> because the regexp matched at the
+first character position in the string and stopped; it never saw the
+second 'the'. It is important to note that using C<$`> and C<$'>
+slows down regexp matching quite a bit, and C<$&> slows it down to a
+lesser extent, because if they are used in one regexp in a program,
+they are generated for I<all> regexps in the program. So if raw
+performance is a goal of your application, they should be avoided.
+If you need them, use C<@-> and C<@+> instead:
+
+    $` is the same as substr( $x, 0, $-[0] )
+    $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
+    $' is the same as substr( $x, $+[0] )
+
+=head2 Matching repetitions
+
+The examples in the previous section display an annoying weakness. We
+were only matching 3-letter words, or syllables of 4 letters or
+less. We'd like to be able to match words or syllables of any length,
+without writing out tedious alternatives like
+C<\w\w\w\w|\w\w\w|\w\w|\w>.
+
+This is exactly the problem the B<quantifier> metacharacters C<?>,
+C<*>, C<+>, and C<{}> were created for. They allow us to determine the
+number of repeats of a portion of a regexp we consider to be a
+match. Quantifiers are put immediately after the character, character
+class, or grouping that we want to specify. They have the following
+meanings:
+
+=over 4
+
+=item * C<a?> = match 'a' 1 or 0 times
+
+=item * C<a*> = match 'a' 0 or more times, i.e., any number of times
+
+=item * C<a+> = match 'a' 1 or more times, i.e., at least once
+
+=item * C<a{n,m}> = match at least C<n> times, but not more than C<m>
+times.
+
+=item * C<a{n,}> = match at least C<n> times
+
+=item * C<a{n}> = match exactly C<n> times
+
+=back
+
+Here are some examples:
+
+    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
+                     # any number of digits
+    /(\w+)\s+\1/;    # match doubled words of arbitrary length
+    /y(es)?/i;       # matches 'y', 'Y', or a case-insensitive 'yes'
+    $year =~ /\d{2,4}/;  # make sure year is at least 2 but not more
+                         # than 4 digits
+    $year =~ /\d{4}|\d{2}/;    # better match; throw out 3 digit dates
+    $year =~ /\d{2}(\d{2})?/;  # same thing written differently. However,
+                               # this produces $1 and the other does not.
+
+    % simple_grep '^(\w+)\1$' /usr/dict/words   # isn't this easier?
+    beriberi
+    booboo
+    coco
+    mama
+    murmur
+    papa
+
+For all of these quantifiers, perl will try to match as much of the
+string as possible, while still allowing the regexp to succeed. Thus
+with C</a?.../>, perl will first try to match the regexp with the C<a>
+present; if that fails, perl will try to match the regexp without the
+C<a> present. For the quantifier C<*>, we get the following:
+
+    $x = "the cat in the hat";
+    $x =~ /^(.*)(cat)(.*)$/;  # matches,
+                              # $1 = 'the '
+                              # $2 = 'cat'
+                              # $3 = ' in the hat'
+
+This is what we might expect: the match finds the only C<cat> in the
+string and locks onto it.
+Consider, however, this regexp:
+
+    $x =~ /^(.*)(at)(.*)$/;  # matches,
+                             # $1 = 'the cat in the h'
+                             # $2 = 'at'
+                             # $3 = ''   (0 matches)
+
+One might initially guess that perl would find the C<at> in C<cat> and
+stop there, but that wouldn't give the longest possible string to the
+first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as
+much of the string as possible while still having the regexp match. In
+this example, that means having the C<at> sequence with the final C<at>
+in the string. The other important principle illustrated here is that
+when there are two or more elements in a regexp, the I<leftmost>
+quantifier, if there is one, gets to grab as much of the string as
+possible, leaving the rest of the regexp to fight over scraps. Thus in
+our example, the first quantifier C<.*> grabs most of the string, while
+the second quantifier C<.*> gets the empty string. Quantifiers that
+grab as much of the string as possible are called B<maximal match> or
+B<greedy> quantifiers.
+
+When a regexp can match a string in several different ways, we can use
+the principles above to predict which way the regexp will match:
+
+=over 4
+
+=item *
+Principle 0: Taken as a whole, any regexp will be matched at the
+earliest possible position in the string.
+
+=item *
+Principle 1: In an alternation C<a|b|c...>, the leftmost alternative
+that allows a match for the whole regexp will be the one used.
+
+=item *
+Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
+C<{n,m}> will in general match as much of the string as possible while
+still allowing the whole regexp to match.
+
+=item *
+Principle 3: If there are two or more elements in a regexp, the
+leftmost greedy quantifier, if any, will match as much of the string
+as possible while still allowing the whole regexp to match. The next
+leftmost greedy quantifier, if any, will try to match as much of the
+string remaining available to it as possible, while still allowing the
+whole regexp to match. And so on, until all the regexp elements are
+satisfied.
+
+=back
+
+As we have seen above, Principle 0 overrides the others - the regexp
+will be matched as early as possible, with the other principles
+determining how the regexp matches at that earliest character
+position.
+
+Here is an example of these principles in action:
+
+    $x = "The programming republic of Perl";
+    $x =~ /^(.+)(e|r)(.*)$/;  # matches,
+                              # $1 = 'The programming republic of Pe'
+                              # $2 = 'r'
+                              # $3 = 'l'
+
+This regexp matches at the earliest string position, C<'T'>. One
+might think that C<e>, being leftmost in the alternation, would be
+matched, but C<r> produces the longest string in the first quantifier.
+
+    $x =~ /(m{1,2})(.*)$/;  # matches,
+                            # $1 = 'mm'
+                            # $2 = 'ing republic of Perl'
+
+Here, the earliest possible match is at the first C<'m'> in
+C<programming>. C<m{1,2}> is the first quantifier, so it gets to match
+a maximal C<mm>.
+
+    $x =~ /.*(m{1,2})(.*)$/;  # matches,
+                              # $1 = 'm'
+                              # $2 = 'ing republic of Perl'
+
+Here, the regexp matches at the start of the string. The first
+quantifier C<.*> grabs as much as possible, leaving just a single
+C<'m'> for the second quantifier C<m{1,2}>.
+
+    $x =~ /(.?)(m{1,2})(.*)$/;  # matches,
+                                # $1 = 'a'
+                                # $2 = 'mm'
+                                # $3 = 'ing republic of Perl'
+
+Here, C<.?> eats its maximal one character at the earliest possible
+position in the string, C<'a'> in C<programming>, leaving C<m{1,2}>
+the opportunity to match both C<m>'s. Finally,
+
+    "aXXXb" =~ /(X*)/;  # matches with $1 = ''
+
+because it can match zero copies of C<'X'> at the beginning of the
+string. If you definitely want to match at least one C<'X'>, use
+C<X+>, not C<X*>.
+
+Sometimes greed is not good. At times, we would like quantifiers to
+match a I<minimal> piece of string, rather than a maximal piece. For
+this purpose, Larry Wall created the B<minimal match> or
+B<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>. These are
+the usual quantifiers with a C<?> appended to them. They have the
+following meanings:
+
+=over 4
+
+=item * C<a??> = match 'a' 0 or 1 times. Try 0 first, then 1.
+
+=item * C<a*?> = match 'a' 0 or more times, i.e., any number of times,
+but as few times as possible
+
+=item * C<a+?> = match 'a' 1 or more times, i.e., at least once, but
+as few times as possible
+
+=item * C<a{n,m}?> = match at least C<n> times, not more than C<m>
+times, as few times as possible
+
+=item * C<a{n,}?> = match at least C<n> times, but as few times as
+possible
+
+=item * C<a{n}?> = match exactly C<n> times. Because we match exactly
+C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
+notational consistency.
+
+=back
+
+Let's look at the example above, but with minimal quantifiers:
+
+    $x = "The programming republic of Perl";
+    $x =~ /^(.+?)(e|r)(.*)$/;  # matches,
+                               # $1 = 'Th'
+                               # $2 = 'e'
+                               # $3 = ' programming republic of Perl'
+
+The minimal string that will allow both the start of the string C<^>
+and the alternation to match is C<Th>, with the alternation C<e|r>
+matching C<e>. The second quantifier C<.*> is free to gobble up the
+rest of the string.
+
+    $x =~ /(m{1,2}?)(.*?)$/;  # matches,
+                              # $1 = 'm'
+                              # $2 = 'ming republic of Perl'
+
+The first string position that this regexp can match is at the first
+C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?>
+matches just one C<'m'>. Although the second quantifier C<.*?> would
+prefer to match no characters, it is constrained by the end-of-string
+anchor C<$> to match the rest of the string.
+
+    $x =~ /(.*?)(m{1,2}?)(.*)$/;  # matches,
+                                  # $1 = 'The progra'
+                                  # $2 = 'm'
+                                  # $3 = 'ming republic of Perl'
+
+In this regexp, you might expect the first minimal quantifier C<.*?>
+to match the empty string, because it is not constrained by a C<^>
+anchor to match the beginning of the string. Principle 0 applies here,
+however. Because it is possible for the whole regexp to match at the
+start of the string, it I<will> match at the start of the string. Thus
+the first quantifier has to match everything up to the first C<m>. The
+second minimal quantifier matches just one C<m> and the third
+quantifier matches the rest of the string.
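
The difference between greedy and minimal quantifiers can be seen side by side by running the same pattern both ways. This is a minimal sketch; C<captures> is a hypothetical helper written for this illustration, and it relies only on the fact that a match in list context returns the captured groups.

```perl
use strict;
use warnings;

# captures is a hypothetical helper for this illustration only.
# Called in list context, it returns the groups captured by the
# match, or the empty list if the match fails.
sub captures {
    my ($string, $regexp) = @_;
    return $string =~ $regexp;
}

my $x = "The programming republic of Perl";

# Greedy .+ takes as much as it can while still letting (e|r) match.
my @greedy  = captures($x, qr/^(.+)(e|r)(.*)$/);

# Minimal .+? takes as little as it can while still letting (e|r) match.
my @minimal = captures($x, qr/^(.+?)(e|r)(.*)$/);

print "greedy:  [", join("][", @greedy),  "]\n";
print "minimal: [", join("][", @minimal), "]\n";
```

The greedy run should report the groups 'The programming republic of Pe', 'r' and 'l', while the minimal run should report 'Th', 'e' and ' programming republic of Perl'.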
+
+    $x =~ /(.??)(m{1,2})(.*)$/;  # matches,
+                                 # $1 = 'a'
+                                 # $2 = 'mm'
+                                 # $3 = 'ing republic of Perl'
+
+Just as in the previous regexp, the first quantifier C<.??> can match
+earliest at position C<'a'>, so it does. The second quantifier is
+greedy, so it matches C<mm>, and the third matches the rest of the
+string.
+
+We can modify principle 3 above to take into account non-greedy
+quantifiers:
+
+=over 4
+
+=item *
+Principle 3: If there are two or more elements in a regexp, the
+leftmost greedy (non-greedy) quantifier, if any, will match as much
+(little) of the string as possible while still allowing the whole
+regexp to match. The next leftmost greedy (non-greedy) quantifier, if
+any, will try to match as much (little) of the string remaining
+available to it as possible, while still allowing the whole regexp to
+match. And so on, until all the regexp elements are satisfied.
+
+=back
+
+Just like alternation, quantifiers are also susceptible to
+backtracking. Here is a step-by-step analysis of the example
+
+    $x = "the cat in the hat";
+    $x =~ /^(.*)(at)(.*)$/;  # matches,
+                             # $1 = 'the cat in the h'
+                             # $2 = 'at'
+                             # $3 = ''   (0 matches)
+
+=over 4
+
+=item 0 Start with the first letter in the string 't'.
+
+=item 1 The first quantifier '.*' starts out by matching the whole
+string 'the cat in the hat'.
+
+=item 2 'a' in the regexp element 'at' doesn't match the end of the
+string. Backtrack one character.
+
+=item 3 'a' in the regexp element 'at' still doesn't match the last
+letter of the string 't', so backtrack one more character.
+
+=item 4 Now we can match the 'a' and the 't'.
+
+=item 5 Move on to the third element '.*'. Since we are at the end of
+the string and '.*' can match 0 times, assign it the empty string.
+
+=item 6 We are done!
+
+=back
+
+Most of the time, all this moving forward and backtracking happens
+quickly and searching is fast.
+There are some pathological regexps,
+however, whose execution time grows exponentially with the size of the
+string. A typical structure that blows up in your face is of the form
+
+    /(a|b+)*/;
+
+The problem is the nested indeterminate quantifiers. There are many
+different ways of partitioning a string of length n between the C<+>
+and C<*>: one repetition with C<b+> of length n, two repetitions with
+the first C<b+> of length k and the second of length n-k, m repetitions
+whose bits add up to length n, etc. In fact there are an exponential
+number of ways to partition a string as a function of length. A
+regexp may get lucky and match early in the process, but if there is
+no match, perl will try I<every> possibility before giving up. So be
+careful with nested C<*>'s, C<{n,m}>'s, and C<+>'s. The book
+I<Mastering regular expressions> by Jeffrey Friedl gives a wonderful
+discussion of this and other efficiency issues.
+
+=head2 Building a regexp
+
+At this point, we have all the basic regexp concepts covered, so let's
+give a more involved example of a regular expression. We will build a
+regexp that matches numbers.
+
+The first task in building a regexp is to decide what we want to match
+and what we want to exclude. In our case, we want to match both
+integers and floating point numbers and we want to reject any string
+that isn't a number.
+
+The next task is to break the problem down into smaller problems that
+are easily converted into a regexp.
+
+The simplest case is integers. These consist of a sequence of digits,
+with an optional sign in front. The digits we can represent with
+C<\d+> and the sign can be matched with C<[+-]>. Thus the integer
+regexp is
+
+    /[+-]?\d+/;  # matches integers
+
+A floating point number potentially has a sign, an integral part, a
+decimal point, a fractional part, and an exponent. One or more of these
+parts is optional, so we need to check out the different
+possibilities. Floating point numbers which are in proper form include
+123., 0.345, .34, -1e6, and 25.4E-72.
+As with integers, the sign out
+front is completely optional and can be matched by C<[+-]?>. We can
+see that if there is no exponent, floating point numbers must have a
+decimal point, otherwise they are integers. We might be tempted to
+model these with C<\d*\.\d*>, but this would also match just a single
+decimal point, which is not a number. So the three cases of floating
+point number sans exponent are
+
+    /[+-]?\d+\./;     # 1., 321., etc.
+    /[+-]?\.\d+/;     # .1, .234, etc.
+    /[+-]?\d+\.\d+/;  # 1.0, 30.56, etc.
+
+These can be combined into a single regexp with a three-way alternation:
+
+    /[+-]?(\d+\.\d+|\d+\.|\.\d+)/;  # floating point, no exponent
+
+In this alternation, it is important to put C<'\d+\.\d+'> before
+C<'\d+\.'>. If C<'\d+\.'> were first, the regexp would happily match that
+and ignore the fractional part of the number.
+
+Now consider floating point numbers with exponents. The key
+observation here is that I<both> integers and numbers with decimal
+points are allowed in front of an exponent. Then exponents, like the
+overall sign, are independent of whether we are matching numbers with
+or without decimal points, and can be 'decoupled' from the
+mantissa. The overall form of the regexp now becomes clear:
+
+    /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;
+
+The exponent is an C<e> or C<E>, followed by an integer. So the
+exponent regexp is
+
+    /[eE][+-]?\d+/;  # exponent
+
+Putting all the parts together, we get a regexp that matches numbers:
+
+    /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;  # Ta da!
+
+Long regexps like this may impress your friends, but can be hard to
+decipher. In complex situations like this, the C<//x> modifier for a
+match is invaluable. It allows one to put nearly arbitrary whitespace
+and comments into a regexp without affecting its meaning. Using it,
+we can rewrite our 'extended' regexp in the more pleasing form
+
+    /^
+       [+-]?         # first, match an optional sign
+       (             # then match integers or f.p.
+                     # mantissas:
+           \d+\.\d+  # mantissa of the form a.b
+          |\d+\.     # mantissa of the form a.
+          |\.\d+     # mantissa of the form .b
+          |\d+       # integer of the form a
+       )
+       ([eE][+-]?\d+)?  # finally, optionally match an exponent
+    $/x;
+
+If whitespace is mostly irrelevant, how does one include space
+characters in an extended regexp? The answer is to backslash it
+C<'\ '> or put it in a character class C<[ ]>. The same thing
+goes for pound signs: use C<\#> or C<[#]>. For instance, Perl allows
+a space between the sign and the mantissa/integer, and we could add
+this to our regexp as follows:
+
+    /^
+       [+-]?\ *      # first, match an optional sign *and space*
+       (             # then match integers or f.p. mantissas:
+           \d+\.\d+  # mantissa of the form a.b
+          |\d+\.     # mantissa of the form a.
+          |\.\d+     # mantissa of the form .b
+          |\d+       # integer of the form a
+       )
+       ([eE][+-]?\d+)?  # finally, optionally match an exponent
+    $/x;
+
+In this form, it is easier to see a way to simplify the
+alternation. Alternatives 1, 2, and 4 all start with C<\d+>, so it
+could be factored out:
+
+    /^
+       [+-]?\ *      # first, match an optional sign
+       (             # then match integers or f.p. mantissas:
+           \d+       # start out with a ...
+           (
+               \.\d* # mantissa of the form a.b or a.
+           )?        # ? takes care of integers of the form a
+          |\.\d+     # mantissa of the form .b
+       )
+       ([eE][+-]?\d+)?  # finally, optionally match an exponent
+    $/x;
+
+or written in the compact form,
+
+    /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;
+
+This is our final regexp. To recap, we built a regexp by
+
+=over 4
+
+=item * specifying the task in detail,
+
+=item * breaking down the problem into smaller parts,
+
+=item * translating the small parts into regexps,
+
+=item * combining the regexps,
+
+=item * and optimizing the final combined regexp.
+
+=back
+
+These are also the typical steps involved in writing a computer
+program. This makes perfect sense, because regular expressions are
+essentially programs written in a little computer language that specifies
+patterns.
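
As a quick check, the final compact regexp can be run against the sample numbers mentioned earlier in this section plus a few strings that should be rejected. This is only a sketch: C<is_number> and the list of candidates are assumptions made for the illustration, not part of perl.

```perl
use strict;
use warnings;

# The final regexp from this section, stored with qr// for reuse.
my $number = qr/^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

# is_number is a hypothetical helper for this illustration only.
# It returns 1 if the whole string looks like a number, else 0.
sub is_number {
    my ($string) = @_;
    return $string =~ $number ? 1 : 0;
}

# Sample inputs: the well-formed numbers named in the text, an
# optional space after the sign, and a few strings to reject.
my @candidates = ('123.', '0.345', '.34', '-1e6', '25.4E-72',
                  '+ 7', '.', '1e', 'abc');

foreach my $candidate (@candidates) {
    printf "%-10s %s\n", "'$candidate'",
        is_number($candidate) ? "number" : "not a number";
}
```

The first six candidates should be reported as numbers. A lone decimal point, an exponent marker with no digits after it, and plain text should all be rejected.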
+ +=head2 Using regular expressions in Perl + +The last topic of Part 1 briefly covers how regexps are used in Perl +programs. Where do they fit into Perl syntax? + +We have already introduced the matching operator in its default +C</regexp/> and arbitrary delimiter C<m!regexp!> forms. We have used +the binding operator C<=~> and its negation C<!~> to test for string +matches. Associated with the matching operator, we have discussed the +single line C<//s>, multi-line C<//m>, case-insensitive C<//i> and +extended C<//x> modifiers. + +There are a few more things you might want to know about matching +operators. First, we pointed out earlier that variables in regexps are +substituted before the regexp is evaluated: + + $pattern = 'Seuss'; + while (<>) { + print if /$pattern/; + } + +This will print any lines containing the word C<Seuss>. It is not as +efficient as it could be, however, because perl has to re-evaluate +C<$pattern> each time through the loop. If C<$pattern> won't be +changing over the lifetime of the script, we can add the C<//o> +modifier, which directs perl to only perform variable substitutions +once: + + #!/usr/bin/perl + # Improved simple_grep + $regexp = shift; + while (<>) { + print if /$regexp/o; # a good deal faster + } + +If you change C<$pattern> after the first substitution happens, perl +will ignore it. If you don't want any substitutions at all, use the +special delimiter C<m''>: + + $pattern = 'Seuss'; + while (<>) { + print if m'$pattern'; # matches '$pattern', not 'Seuss' + } + +C<m''> acts like single quotes on a regexp; all other C<m> delimiters +act like double quotes. If the regexp evaluates to the empty string, +the regexp in the I<last successful match> is used instead. So we have + + "dog" =~ /d/; # 'd' matches + "dogbert" =~ //; # this matches the 'd' regexp used before + +The final two modifiers C<//g> and C<//c> concern multiple matches. +The modifier C<//g> stands for global matching and allows the +matching operator to match within a string as many times as possible. 
+In scalar context, successive invocations against a string will have +C<//g> jump from match to match, keeping track of position in the +string as it goes along. You can get or set the position with the +C<pos()> function. + +The use of C<//g> is shown in the following example. Suppose we have +a string that consists of words separated by spaces. If we know how +many words there are in advance, we could extract the words using +groupings: + + $x = "cat dog house"; # 3 words + $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches, + # $1 = 'cat' + # $2 = 'dog' + # $3 = 'house' + +But what if we had an indeterminate number of words? This is the sort +of task C<//g> was made for. To extract all words, form the simple +regexp C<(\w+)> and loop over all matches with C</(\w+)/g>: + + while ($x =~ /(\w+)/g) { + print "Word is $1, ends at position ", pos $x, "\n"; + } + +prints + + Word is cat, ends at position 3 + Word is dog, ends at position 7 + Word is house, ends at position 13 + +A failed match or changing the target string resets the position. If +you don't want the position reset after failure to match, add the +C<//c> modifier, as in C</regexp/gc>. The current position in the string is +associated with the string, not the regexp. This means that different +strings have different positions and their respective positions can be +set or read independently. + +In list context, C<//g> returns a list of matched groupings, or if +there are no groupings, a list of matches to the whole regexp. So if +we wanted just the words, we could use + + @words = ($x =~ /(\w+)/g); # matches, + # $words[0] = 'cat' + # $words[1] = 'dog' + # $words[2] = 'house' + +Closely associated with the C<//g> modifier is the C<\G> anchor. The +C<\G> anchor matches at the point where the previous C<//g> match left +off. C<\G> allows us to easily do context-sensitive matching: + + $metric = 1; # use metric units + ... + $x = <FILE>; # read in measurement + $x =~ /^([+-]?\d+)\s*/g; # get magnitude + $weight = $1; + if ($metric) { # error checking + print "Units error!" 
unless $x =~ /\Gkg\./g; + } + else { + print "Units error!" unless $x =~ /\Glbs\./g; + } + $x =~ /\G\s+(widget|sprocket)/g; # continue processing + +The combination of C and C<\G> allows us to process the string a +bit at a time and use arbitrary Perl logic to decide what to do next. + +C<\G> is also invaluable in processing fixed length records with +regexps. Suppose we have a snippet of coding region DNA, encoded as +base pair letters C and we want to find all the stop +codons C. In a coding region, codons are 3-letter sequences, so +we can think of the DNA snippet as a sequence of 3-letter records. The +naive regexp + + # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC" + $dna = "ATCGTTGAATGCAAATGACATGAC"; + $dna =~ /TGA/; + +doesn't work; it may match an C, but there is no guarantee that +the match is aligned with codon boundaries, e.g., the substring +S > gives a match. A better solution is + + while ($dna =~ /(\w\w\w)*?TGA/g) { # note the minimal *? + print "Got a TGA stop codon at position ", pos $dna, "\n"; + } + +which prints + + Got a TGA stop codon at position 18 + Got a TGA stop codon at position 23 + +Position 18 is good, but position 23 is bogus. What happened? + +The answer is that our regexp works well until we get past the last +real match. Then the regexp will fail to match a synchronized C +and start stepping ahead one character position at a time, not what we +want. The solution is to use C<\G> to anchor the match to the codon +alignment: + + while ($dna =~ /\G(\w\w\w)*?TGA/g) { + print "Got a TGA stop codon at position ", pos $dna, "\n"; + } + +This prints + + Got a TGA stop codon at position 18 + +which is the correct answer. This example illustrates that it is +important not only to match what is desired, but to reject what is not +desired. + +B + +Regular expressions also play a big role in B +operations in Perl. Search and replace is accomplished with the +C operator. 
The general form is +C, with everything we know about +regexps and modifiers applying in this case as well. The +C is a Perl double quoted string that replaces in the +string whatever is matched with the C. The operator C<=~> is +also used here to associate a string with C. If matching +against C<$_>, the S > can be dropped. If there is a match, +C returns the number of substitutions made, otherwise it returns +false. Here are a few examples: + + $x = "Time to feed the cat!"; + $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!" + if ($x =~ s/^(Time.*hacker)!$/$1 now!/) { + $more_insistent = 1; + } + $y = "'quoted words'"; + $y =~ s/^'(.*)'$/$1/; # strip single quotes, + # $y contains "quoted words" + +In the last example, the whole string was matched, but only the part +inside the single quotes was grouped. With the C operator, the +matched variables C<$1>, C<$2>, etc. are immediately available for use +in the replacement expression, so we use C<$1> to replace the quoted +string with just what was quoted. With the global modifier, C +will search and replace all occurrences of the regexp in the string: + + $x = "I batted 4 for 4"; + $x =~ s/4/four/; # doesn't do it all: + # $x contains "I batted four for 4" + $x = "I batted 4 for 4"; + $x =~ s/4/four/g; # does it all: + # $x contains "I batted four for four" + +If you prefer 'regex' over 'regexp' in this tutorial, you could use +the following program to replace it: + + % cat > simple_replace + #!/usr/bin/perl + $regexp = shift; + $replacement = shift; + while (<>) { + s/$regexp/$replacement/go; + print; + } + ^D + + % simple_replace regexp regex perlretut.pod + +In C we used the C modifier to replace all +occurrences of the regexp on each line and the C modifier to +compile the regexp only once. As with C, both the +C and the C use C<$_> implicitly. + +A modifier available specifically to search and replace is the +C evaluation modifier. 
C<s///e> wraps an C<eval{...}> around +the replacement string and the evaluated result is substituted for the +matched substring. C<s///e> is useful if you need to do a bit of +computation in the process of replacing text. This example counts +character frequencies in a line: + + $x = "Bill the cat"; + $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself + print "frequency of '$_' is $chars{$_}\n" + foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars); + +This prints + + frequency of ' ' is 2 + frequency of 't' is 2 + frequency of 'l' is 2 + frequency of 'B' is 1 + frequency of 'c' is 1 + frequency of 'e' is 1 + frequency of 'h' is 1 + frequency of 'i' is 1 + frequency of 'a' is 1 + +As with the match C<m//> operator, C<s///> can use other delimiters, +such as C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are +used C<s'''>, then the regexp and replacement are treated as single +quoted strings and there are no substitutions. C<s///> in list context +returns the same thing as in scalar context, i.e., the number of +matches. + +B<The split operator> + +The B<C<split> > function can also optionally use a matching operator +C<m//> to split a string. C<split /regexp/, string, limit> splits +C<string> into a list of substrings and returns that list. The regexp +is used to match the character sequence that the C<string> is split +with respect to. The C<limit>, if present, constrains splitting into +no more than C<limit> number of strings. For example, to split a +string into words, use + + $x = "Calvin and Hobbes"; + @words = split /\s+/, $x; # $words[0] = 'Calvin' + # $words[1] = 'and' + # $words[2] = 'Hobbes' + +If the empty regexp C<//> is used, the regexp always matches and +the string is split into individual characters. If the regexp has +groupings, then the list produced contains the matched substrings from the +groupings as well. 
For instance, + + $x = "/usr/bin/perl"; + @dirs = split m!/!, $x; # $dirs[0] = '' + # $dirs[1] = 'usr' + # $dirs[2] = 'bin' + # $dirs[3] = 'perl' + @parts = split m!(/)!, $x; # $parts[0] = '' + # $parts[1] = '/' + # $parts[2] = 'usr' + # $parts[3] = '/' + # $parts[4] = 'bin' + # $parts[5] = '/' + # $parts[6] = 'perl' + +Since the first character of $x matched the regexp, C<split> prepended +an empty initial element to the list. + +If you have read this far, congratulations! You now have all the basic +tools needed to use regular expressions to solve a wide range of text +processing problems. If this is your first time through the tutorial, +why not stop here and play around with regexps a while... S<Part 2> +concerns the more esoteric aspects of regular expressions and those +concepts certainly aren't needed right at the start. + +=head1 Part 2: Power tools + +OK, you know the basics of regexps and you want to know more. If +matching regular expressions is analogous to a walk in the woods, then +the tools discussed in Part 1 are analogous to topo maps and a +compass, basic tools we use all the time. Most of the tools in part 2 +are analogous to flare guns and satellite phones. They aren't used +too often on a hike, but when we are stuck, they can be invaluable. + +What follows are the more advanced, less used, or sometimes esoteric +capabilities of perl regexps. In Part 2, we will assume you are +comfortable with the basics and concentrate on the new features. + +=head2 More on characters, strings, and character classes + +There are a number of escape sequences and character classes that we +haven't covered yet. + +There are several escape sequences that convert characters or strings +between upper and lower case. 
C<\l> and C<\u> convert the next +character to lower or upper case, respectively: + + $x = "perl"; + $string =~ /\u$x/; # matches 'Perl' in $string + $x = "M(rs?|s)\\."; # note the double backslash + $string =~ /\l$x/; # matches 'mr.', 'mrs.', and 'ms.' + +C<\L> and C<\U> convert a whole substring, delimited by C<\L> or +C<\U> and C<\E>, to lower or upper case: + + $x = "This word is in lower case:\L SHOUT\E"; + $x =~ /shout/; # matches + $x = "I STILL KEYPUNCH CARDS FOR MY 360"; + $x =~ /\Ukeypunch/; # matches punch card string + +If there is no C<\E>, case is converted until the end of the +string. The regexps C<\L\u$word> or C<\u\L$word> convert the first +character of C<$word> to uppercase and the rest of the characters to +lowercase. + +Control characters can be escaped with C<\c>, so that a control-Z +character would be matched with C<\cZ>. The escape sequence +C<\Q>...C<\E> quotes, or protects, most non-alphabetic characters. For +instance, + + $x = "\QThat !^*&%~& cat!"; + $x =~ /\Q!^*&%~&\E/; # check for rough language + +It does not protect C<$> or C<@>, so that variables can still be +substituted. + +With the advent of 5.6.0, perl regexps can handle more than just the +standard ASCII character set. Perl now supports B<Unicode>, a standard +for encoding the character sets from many of the world's written +languages. Unicode does this by allowing characters to be more than +one byte wide. Perl uses the UTF-8 encoding, in which ASCII characters +are still encoded as one byte, but characters greater than C<chr(127)> +may be stored as two or more bytes. + +What does this mean for regexps? Well, regexp users don't need to know +much about perl's internal representation of strings. But they do need +to know 1) how to represent Unicode characters in a regexp and 2) when +a matching operation will treat the string to be searched as a +sequence of bytes (the old way) or as a sequence of Unicode characters +(the new way). 
The answer to 1) is that Unicode characters greater +than C may be represented using the C<\x{hex}> notation, +with C a hexadecimal integer: + + use utf8; # We will be doing Unicode processing + /\x{263a}/; # match a Unicode smiley face :) + +Unicode characters in the range of 128-255 use two hexadecimal digits +with braces: C<\x{ab}>. Note that this is different than C<\xab>, +which is just a hexadecimal byte with no Unicode +significance. + +Figuring out the hexadecimal sequence of a Unicode character you want +or deciphering someone else's hexadecimal Unicode regexp is about as +much fun as programming in machine code. So another way to specify +Unicode characters is to use the S > escape +sequence C<\N{name}>. C is a name for the Unicode character, as +specified in the Unicode standard. For instance, if we wanted to +represent or match the astrological sign for the planet Mercury, we +could use + + use utf8; # We will be doing Unicode processing + use charnames ":full"; # use named chars with Unicode full names + $x = "abc\N{MERCURY}def"; + $x =~ /\N{MERCURY}/; # matches + +One can also use short names or restrict names to a certain alphabet: + + use utf8; # We will be doing Unicode processing + + use charnames ':full'; + print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n"; + + use charnames ":short"; + print "\N{greek:Sigma} is an upper-case sigma.\n"; + + use charnames qw(greek); + print "\N{sigma} is Greek sigma\n"; + +A list of full names is found in the file Names.txt in the +lib/perl5/5.6.0/unicode directory. + +The answer to requirement 2), as of 5.6.0, is that if a regexp +contains Unicode characters, the string is searched as a sequence of +Unicode characters. Otherwise, the string is searched as a sequence of +bytes. If the string is being searched as a sequence of Unicode +characters, but matching a single byte is required, we can use the C<\C> +escape sequence. C<\C> is a character class akin to C<.> except that +it matches I byte 0-255. 
So + + use utf8; # We will be doing Unicode processing + use charnames ":full"; # use named chars with Unicode full names + $x = "a"; + $x =~ /\C/; # matches 'a', eats one byte + $x = ""; + $x =~ /\C/; # doesn't match, no bytes to match + $x = "\N{MERCURY}"; # two-byte Unicode character + $x =~ /\C/; # matches, but dangerous! + +The last regexp matches, but is dangerous because the string +I position is no longer synchronized to the string +position. This generates the warning 'Malformed UTF-8 +character'. C<\C> is best used for matching the binary data in strings +with binary data intermixed with Unicode characters. + +Let us now discuss the rest of the character classes. Just as with +Unicode characters, there are named Unicode character classes +represented by the C<\p{name}> escape sequence. Closely associated is +the C<\P{name}> character class, which is the negation of the +C<\p{name}> class. For example, to match lower and uppercase +characters, + + use utf8; # We will be doing Unicode processing + use charnames ":full"; # use named chars with Unicode full names + $x = "BOB"; + $x =~ /^\p{IsUpper}/; # matches, uppercase char class + $x =~ /^\P{IsUpper}/; # doesn't match, char class sans uppercase + $x =~ /^\p{IsLower}/; # doesn't match, lowercase char class + $x =~ /^\P{IsLower}/; # matches, char class sans lowercase + +If a C is just one letter, the braces can be dropped. For +instance, C<\pM> is the character class of Unicode 'marks'. 
Here is +the association between some Perl named classes and the traditional +Unicode classes: + + Perl class name Unicode class name + + IsAlpha Lu, Ll, or Lo + IsAlnum Lu, Ll, Lo, or Nd + IsASCII $code le 127 + IsCntrl C + IsDigit Nd + IsGraph [^C] and $code ne "0020" + IsLower Ll + IsPrint [^C] + IsPunct P + IsSpace Z, or ($code lt "0020" and chr(hex $code) is a \s) + IsUpper Lu + IsWord Lu, Ll, Lo, Nd or $code eq "005F" + IsXDigit $code =~ /^00(3[0-9]|[46][1-6])$/ + +For a full list of Perl class names, consult the mktables.PL program +in the lib/perl5/5.6.0/unicode directory. + +C<\X> is an abbreviation for a character class sequence that includes +the Unicode 'combining character sequences'. A 'combining character +sequence' is a base character followed by any number of combining +characters. An example of a combining character is an accent. Using +the Unicode full names, e.g., S<C<A + COMBINING RING> > is a combining +character sequence with base character C<A> and combining character +S<C<COMBINING RING> >, which translates in Danish to A with the circle +atop it, as in the word Angstrom. C<\X> is equivalent to C<\PM\pM*>, +i.e., a non-mark followed by zero or more marks. + +As if all those classes weren't enough, Perl also defines POSIX style +character classes. These have the form C<[:name:]>, with C<name> the +name of the POSIX class. The POSIX classes are alpha, alnum, ascii, +cntrl, digit, graph, lower, print, punct, space, upper, word, and +xdigit. If C<utf8> is being used, then these classes are defined the +same as their corresponding perl Unicode classes: C<[:upper:]> is the +same as C<\p{IsUpper}>, etc. The POSIX character classes, however, +don't require using C<utf8>. The C<[:digit:]>, C<[:word:]>, and +C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s> +character classes. To negate a POSIX class, put a C<^> in front of the +name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under +C<utf8>, C<\P{IsDigit}>. 
The Unicode character classes can +be used just like C<\d>, both inside and outside of character classes; +the POSIX classes are only valid inside a bracketed character class: + + /\s+[abc[:digit:]xyz]\s*/; # match a,b,c,x,y,z, or a digit + /^=item\s[[:digit:]]/; # match '=item', + # followed by a space and a digit + use utf8; + use charnames ":full"; + /\s+[abc\p{IsDigit}xyz]\s+/; # match a,b,c,x,y,z, or a digit + /^=item\s\p{IsDigit}/; # match '=item', + # followed by a space and a digit + +Whew! That is all the rest of the characters and character classes. + +=head2 Compiling and saving regular expressions + +In Part 1 we discussed the C<//o> modifier, which compiles a regexp +just once. This suggests that a compiled regexp is some data structure +that can be stored once and used again and again. The regexp quote +C<qr//> does exactly that: C<qr/string/> compiles the C<string> as a +regexp and transforms the result into a form that can be assigned to a +variable: + + $reg = qr/foo+bar?/; # reg contains a compiled regexp + +Then C<$reg> can be used as a regexp: + + $x = "fooooba"; + $x =~ $reg; # matches, just like /foo+bar?/ + $x =~ /$reg/; # same thing, alternate form + +C<$reg> can also be interpolated into a larger regexp: + + $x =~ /(abc)?$reg/; # still matches + +As with the matching operator, the regexp quote can use different +delimiters, e.g., C<qr!!>, C<qr{}> and C<qr~~>. The single quote +delimiters C<qr''> prevent any interpolation from taking place. + +Pre-compiled regexps are useful for creating dynamic matches that +don't need to be recompiled each time they are encountered. Using +pre-compiled regexps, the C<simple_grep> program can be expanded into a +program that matches multiple patterns: + + % cat > multi_grep + #!/usr/bin/perl + # multi_grep - match any of <number> regexps + # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ... 
+ + $number = shift; + $regexp[$_] = shift foreach (0..$number-1); + @compiled = map qr/$_/, @regexp; + while ($line = <>) { + foreach $pattern (@compiled) { + if ($line =~ /$pattern/) { + print $line; + last; # we matched, so move on to the next line + } + } + } + ^D + + % multi_grep 2 last for multi_grep + $regexp[$_] = shift foreach (0..$number-1); + foreach $pattern (@compiled) { + last; + +Storing pre-compiled regexps in an array C<@compiled> allows us to +simply loop through the regexps without any recompilation, thus gaining +flexibility without sacrificing speed. + +=head2 Embedding comments and modifiers in a regular expression + +Starting with this section, we will be discussing Perl's set of +B<extended patterns>. These are extensions to the traditional regular +expression syntax that provide powerful new tools for pattern +matching. We have already seen extensions in the form of the minimal +matching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>. The +rest of the extensions below have the form C<(?char...)>, where the +C<char> is a character that determines the type of extension. + +The first extension is an embedded comment C<(?#text)>. This embeds a +comment into the regular expression without affecting its meaning. The +comment should not have any closing parentheses in the text. An +example is + + /(?# Match an integer:)[+-]?\d+/; + +This style of commenting has been largely superseded by the raw, +freeform commenting that is allowed with the C<//x> modifier. + +The modifiers C<//i>, C<//m>, C<//s>, and C<//x> can also be embedded in +a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>. For instance, + + /(?i)yes/; # match 'yes' case insensitively + /yes/i; # same thing + /(?x)( # freeform version of an integer regexp + [+-]? # match an optional sign + \d+ # match a sequence of digits + ) + /x; + +Embedded modifiers can have two important advantages over the usual +modifiers. Embedded modifiers allow a custom set of modifiers to +I<each> regexp pattern. 
This is great for matching an array of regexps +that must have different modifiers: + + $pattern[0] = '(?i)doctor'; + $pattern[1] = 'Johnson'; + ... + while (<>) { + foreach $patt (@pattern) { + print if /$patt/; + } + } + +The second advantage is that embedded modifiers only affect the regexp +inside the group the embedded modifier is contained in. So grouping +can be used to localize the modifier's effects: + + /Answer: ((?i)yes)/; # matches 'Answer: yes', 'Answer: YES', etc. + +Embedded modifiers can also turn off any modifiers already present +by using, e.g., C<(?-i)>. Modifiers can also be combined into +a single expression, e.g., C<(?s-i)> turns on single line mode and +turns off case insensitivity. + +=head2 Non-capturing groupings + +We noted in Part 1 that groupings C<()> had two distinct functions: 1) +group regexp elements together as a single unit, and 2) extract, or +capture, substrings that matched the regexp in the +grouping. Non-capturing groupings, denoted by C<(?:regexp)>, allow the +regexp to be treated as a single unit, but don't extract substrings or +set matching variables C<$1>, etc. Both capturing and non-capturing +groupings are allowed to co-exist in the same regexp. Because there is +no extraction, non-capturing groupings are faster than capturing +groupings. 
Non-capturing groupings are also handy for choosing exactly +which parts of a regexp are to be extracted to matching variables: + + # match a number, $1-$4 are set, but we only want $1 + /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/; + + # match a number faster, only $1 is set + /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/; + + # match a number, get $1 = whole number, $2 = exponent + /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/; + +Non-capturing groupings are also useful for removing nuisance +elements gathered from a split operation: + + $x = '12a34b5'; + @num = split /(a|b)/, $x; # @num = ('12','a','34','b','5') + @num = split /(?:a|b)/, $x; # @num = ('12','34','5') + +Non-capturing groupings may also have embedded modifiers: +C<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp> +case insensitively and turns off multi-line mode. + +=head2 Looking ahead and looking behind + +This section concerns the lookahead and lookbehind assertions. First, +a little background. + +In Perl regular expressions, most regexp elements 'eat up' a certain +amount of string when they match. For instance, the regexp element +C<[abc]> eats up one character of the string when it matches, in the +sense that perl moves to the next character position in the string +after the match. There are some elements, however, that don't eat up +characters (advance the character position) if they match. The examples +we have seen so far are the anchors. The anchor C<^> matches the +beginning of the line, but doesn't eat any characters. Similarly, the +word boundary anchor C<\b> matches, e.g., if the character to the left +is a word character and the character to the right is a non-word +character, but it doesn't eat up any characters itself. Anchors are +examples of 'zero-width assertions'. Zero-width, because they consume +no characters, and assertions, because they test some property of the +string. 
In the context of our walk in the woods analogy to regexp +matching, most regexp elements move us along a trail, but anchors have +us stop a moment and check our surroundings. If the local environment +checks out, we can proceed forward. But if the local environment +doesn't satisfy us, we must backtrack. + +Checking the environment entails either looking ahead on the trail, +looking behind, or both. C<^> looks behind, to see that there are no +characters before. C<$> looks ahead, to see that there are no +characters after. C<\b> looks both ahead and behind, to see if the +characters on either side differ in their 'word'-ness. + +The lookahead and lookbehind assertions are generalizations of the +anchor concept. Lookahead and lookbehind are zero-width assertions +that let us specify which characters we want to test for. The +lookahead assertion is denoted by C<(?=regexp)> and the lookbehind +assertion is denoted by C<(?<=fixed-regexp)>. Some examples are + + $x = "I catch the housecat 'Tom-cat' with catnip"; + $x =~ /cat(?=\s+)/; # matches 'cat' in 'housecat' + @catwords = ($x =~ /(?<=\s)cat\w+/g); # matches, + # $catwords[0] = 'catch' + # $catwords[1] = 'catnip' + $x =~ /\bcat\b/; # matches 'cat' in 'Tom-cat' + $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in + # middle of $x + +Note that the parentheses in C<(?=regexp)> and C<(?<=regexp)> are +non-capturing, since these are zero-width assertions. Thus in the +second regexp, the substrings captured are those of the whole regexp +itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but +lookbehind C<(?<=fixed-regexp)> only works for regexps of fixed +width, i.e., a fixed number of characters long. Thus C<(?<=(ab|bc))> +is fine, but C<(?<=(ab)*)> is not. The negated versions of the +lookahead and lookbehind assertions are denoted by C<(?!regexp)> +and C<(?<!fixed-regexp)> respectively. 
They evaluate true if the +regexps do I<not> match: + + $x = "foobar"; + $x =~ /foo(?!bar)/; # doesn't match, 'bar' follows 'foo' + $x =~ /foo(?!baz)/; # matches, 'baz' doesn't follow 'foo' + $x =~ /(?<!\s)foo/; # matches, there is no \s before 'foo' + +=head2 Using independent subexpressions to prevent backtracking + +S<B<Independent subexpressions> > are regular expressions, in the +context of a larger regular expression, that function independently of +the larger regular expression. That is, they consume as much or as +little of the string as they wish without regard for the ability of +the larger regexp to match. Independent subexpressions are represented +by C<< (?>regexp) >>. We can illustrate their behavior by first +considering an ordinary regexp: + + $x = "ab"; + $x =~ /a*ab/; # matches + +This obviously matches, but in the process of matching, the +subexpression C<a*> first grabbed the C<a>. Doing so, however, +wouldn't allow the whole regexp to match, so after backtracking, C<a*> +eventually gave back the C<a> and matched the empty string. Here, what +C<a*> matched was I<dependent> on what the rest of the regexp matched. + +Contrast that with an independent subexpression: + + $x =~ /(?>a*)ab/; # doesn't match! + +The independent subexpression C<< (?>a*) >> doesn't care about the rest +of the regexp, so it sees an C<a> and grabs it. Then the rest of the +regexp C<ab> cannot match. Because C<< (?>a*) >> is independent, there +is no backtracking and the independent subexpression does not give +up its C<a>. Thus the match of the regexp as a whole fails. A similar +behavior occurs with completely independent regexps: + + $x = "ab"; + $x =~ /a*/g; # matches, eats an 'a' + $x =~ /\Gab/g; # doesn't match, no 'a' available + +Here C<//g> and C<\G> create a 'tag team' handoff of the string from +one regexp to the other. Regexps with an independent subexpression are +much like this, with a handoff of the string to the independent +subexpression, and a handoff of the string back to the enclosing +regexp. + +The ability of an independent subexpression to prevent backtracking +can be quite useful. 
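Before looking at a practical use, the contrast between the ordinary and the independent form of C<a*> can be verified directly. This is a minimal sketch; the variable names C<$ordinary> and C<$independent> are illustrative and do not come from the tutorial.

```perl
use strict;
use warnings;

my $x = "ab";

# Ordinary subexpression: a* first grabs the 'a', then gives it back
# on backtracking so that the trailing 'ab' can match.
my $ordinary = ($x =~ /a*ab/) ? 1 : 0;

# Independent subexpression: (?>a*) keeps whatever it grabbed,
# so the trailing 'ab' has no 'a' left to match.
my $independent = ($x =~ /(?>a*)ab/) ? 1 : 0;

print "ordinary: $ordinary, independent: $independent\n";
```

The first match succeeds and the second fails, which is the behavior described above.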
Suppose we want to match a non-empty string +enclosed in parentheses up to two levels deep. Then the following +regexp matches: + + $x = "abc(de(fg)h"; # unbalanced parentheses + $x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x; + +The regexp matches an open parenthesis, one or more copies of an +alternation, and a close parenthesis. The alternation is two-way, with +the first alternative C<[^()]+> matching a substring with no +parentheses and the second alternative C<\([^()]*\)> matching a +substring delimited by parentheses. The problem with this regexp is +that it is pathological: it has nested indeterminate quantifiers + of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers +like this could take an exponentially long time to execute if there +was no match possible. To prevent the exponential blowup, we need to +prevent useless backtracking at some point. This can be done by +enclosing the inner quantifier as an independent subexpression: + + $x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x; + +Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning +by gobbling up as much of the string as possible and keeping it. Then +match failures fail much more quickly. + +=head2 Conditional expressions + +A S > is a form of if-then-else statement +that allows one to choose which patterns are to be matched, based on +some condition. There are two types of conditional expression: +C<(?(condition)yes-regexp)> and +C<(?(condition)yes-regexp|no-regexp)>. C<(?(condition)yes-regexp)> is +like an S > statement in Perl. If the C is true, +the C will be matched. If the C is false, the +C will be skipped and perl will move onto the next regexp +element. The second form is like an S > statement +in Perl. If the C is true, the C will be +matched, otherwise the C will be matched. + +The C can have two forms. The first form is simply an +integer in parentheses C<(integer)>. It is true if the corresponding +backreference C<\integer> matched earlier in the regexp. 
The second +form is a bare zero width assertion C<(?...)>, either a +lookahead, a lookbehind, or a code assertion (discussed in the next +section). + +The integer form of the C<condition> allows us to choose, with more +flexibility, what to match based on what matched earlier in the +regexp. This searches for words of the form C<"$x$x"> or +C<"$x$y$y$x">: + + % simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words + beriberi + coco + couscous + deed + ... + toot + toto + tutu + +The lookbehind C<condition> allows, along with backreferences, +an earlier part of the match to influence a later part of the +match. For instance, + + /[ATGC]+(?(?<=AA)G|C)$/; + +matches a DNA sequence such that it either ends in C<AAG>, or some +other base pair combination and C<C>. Note that the form is +C<(?(?<=AA)G|C)> and not C<(?((?<=AA))G|C)>; for the lookahead, +lookbehind or code assertions, the parentheses around the conditional +are not needed. + +=head2 A bit of magic: executing Perl code in a regular expression + +Normally, regexps are a part of Perl expressions. +S<B<Code evaluation> > expressions turn that around by allowing +arbitrary Perl code to be a part of a regexp. A code evaluation +expression is denoted C<(?{code})>, with C<code> a string of Perl +statements. + +Code expressions are zero-width assertions, and the value they return +depends on their environment. There are two possibilities: either the +code expression is used as a conditional in a conditional expression +C<(?(condition)...)>, or it is not. If the code expression is a +conditional, the code is evaluated and the result (i.e., the result of +the last statement) is used to determine truth or falsehood. If the +code expression is not used as a conditional, the assertion always +evaluates true and the result is put into the special variable +C<$^R>. The variable C<$^R> can then be used in code expressions later +in the regexp. Here are some silly examples: + + $x = "abcdef"; + $x =~ /abc(?{print "Hi Mom!";})def/; # matches, + # prints 'Hi Mom!' 
+ $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match, + # no 'Hi Mom!' + $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match, + # no 'Hi Mom!' + $x =~ /(?{print "Hi Mom!";})/; # matches, + # prints 'Hi Mom!' + $x =~ /(?{$c = 1;})(?{print "$c";})/; # matches, + # prints '1' + $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches, + # prints '1' + +The bit of magic mentioned in the section title occurs when the regexp +backtracks in the process of searching for a match. If the regexp +backtracks over a code expression and if the variables used within are +localized using C<local>, the changes in the variables produced by the +code expression are undone! Thus, if we wanted to count how many times +a character got matched inside a group, we could use, e.g., + + $x = "aaaa"; + $count = 0; # initialize 'a' count + $c = "bob"; # test if $c gets clobbered + $x =~ /(?{local $c = 0;}) # initialize count + ( a # match 'a' + (?{local $c = $c + 1;}) # increment count + )* # do this any number of times, + aa # but match 'aa' at the end + (?{$count = $c;}) # copy local $c var into $count + /x; + print "'a' count is $count, \$c variable is '$c'\n"; + +This prints + + 'a' count is 2, $c variable is 'bob' + +If we replace the C<(?{local $c = $c + 1;})> with +C<(?{$c = $c + 1;})>, the variable changes are I<not> undone +during backtracking, and we get + + 'a' count is 4, $c variable is 'bob' + +Note that only localized variable changes are undone. Other side +effects of code expression execution are permanent. Thus + + $x = "aaaa"; + $x =~ /(a(?{print "Yow\n";}))*aa/; + +produces + + Yow + Yow + Yow + Yow + +The result C<$^R> is automatically localized, so that it will behave +properly in the presence of backtracking. + +This example uses a code expression in a conditional to match the +article 'the' in either English or German: + + use re 'eval'; + $lang = 'DE'; # use German + ... + $text = "das"; + print "matched\n" + if $text =~ /(?(?{ + $lang eq 'EN'; # is the language English?
+ }) + the | # if so, then match 'the' + (die|das|der) # else, match 'die|das|der' + ) + /xi; + +Note that the syntax here is C<(?(?{...})yes-regexp|no-regexp)>, not +C<(?((?{...}))yes-regexp|no-regexp)>. In other words, in the case of a +code expression, we don't need the extra parentheses around the +conditional. + +The C<use re 'eval';> statement is needed because we are both +interpolating the variable C<$lang> I<and> evaluating code +within the regexp. From a security point of view, this can be +dangerous, because many programmers who write search +engines take user input and plug it directly into a regexp: + + $regexp = <>; # read user-supplied regexp + chomp $regexp; # get rid of possible newline + $text =~ /$regexp/; # search $text for the $regexp + +If the C<$regexp> variable is used in a code expression, the user +could then execute arbitrary Perl code. For instance, some joker could +search for C<system('rm -rf *');> to erase your files. In this +sense, the combination of interpolation and code expressions B<taints> +your regexp. So by default, using both interpolation and code +expressions in the same regexp is not allowed. Only by invoking +C<use re 'eval';> can one use both interpolation and code +expressions in the same regexp. + +Another form of code expression is the B<pattern code expression>. +The pattern code expression is like a regular code expression, except +that the result of the code evaluation is treated as a regular +expression and matched immediately. A simple example is + + $length = 5; + $char = 'a'; + $x = 'aaaaabb'; + $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a' + + +This final example contains both ordinary and pattern code +expressions. It detects if a binary string C<1101010010001...> has a +Fibonacci spacing 0,1,1,2,3,5,...
of the C<1>'s: + + use re 'eval'; + $s0 = 0; $s1 = 1; # initial conditions + $x = "1101010010001000001"; + print "It is a Fibonacci sequence\n" + if $x =~ /^1 # match an initial '1' + ( + (??{'0' x $s0}) # match $s0 of '0' + 1 # and then a '1' + (?{ + $largest = $s0; # largest seq so far + $s2 = $s1 + $s0; # compute next term + $s0 = $s1; # in Fibonacci sequence + $s1 = $s2; + }) + )+ # repeat as needed + $ # that is all there is + /x; + print "Largest sequence matched was $largest\n"; + +This prints + + It is a Fibonacci sequence + Largest sequence matched was 5 + +Ha! Try that with your garden variety regexp package... + +Note that the variables C<$s0> and C<$s1> are not substituted when the +regexp is compiled, as happens for ordinary variables outside a code +expression. Rather, the code expressions are evaluated when perl +encounters them during the search for a match. + +The regexp without the C<//x> modifier is + + /^1((??{'0'x$s0})1(?{$largest=$s0;$s2=$s1+$s0;$s0=$s1;$s1=$s2;}))+$/; + +and is a great start on an Obfuscated Perl entry :-) When working with +code and conditional expressions, the extended form of regexps is +almost necessary in creating and debugging regexps. + +=head2 Pragmas and debugging + +Speaking of debugging, there are several pragmas available to control +and debug regexps in Perl. We have already encountered one pragma in +the previous section, C<use re 'eval';>, that allows variable +interpolation in a regexp with code expressions. The other pragmas are + + use re 'taint'; + $tainted = <>; + @parts = ($tainted =~ /(\w+)\s+(\w+)/); # @parts is now tainted + +The C<taint> pragma causes any substrings from a match with a tainted +variable to be tainted as well. This is not normally the case, as +regexps are often used to extract the safe bits from a tainted +variable. Use C<taint> when you are not extracting safe bits, but are +performing some other processing.
Both C<taint> and C<eval> pragmas +are lexically scoped, which means they are in effect only until +the end of the block enclosing the pragmas. + + use re 'debug'; + /^(.*)$/s; # output debugging info + + use re 'debugcolor'; + /^(.*)$/s; # output debugging info in living color + +The global C<debug> and C<debugcolor> pragmas allow one to get +detailed debugging info about regexp compilation and +execution. C<debugcolor> is the same as debug, except the debugging +information is displayed in color on terminals that can display +termcap color sequences. Here is example output: + + % perl -e 'use re "debug"; "abc" =~ /a*b+c/;' + Compiling REx `a*b+c' + size 9 first at 1 + 1: STAR(4) + 2: EXACT <a>(0) + 4: PLUS(7) + 5: EXACT <b>(0) + 7: EXACT <c>(9) + 9: END(0) + floating `bc' at 0..2147483647 (checking floating) minlen 2 + Guessing start of match, REx `a*b+c' against `abc'... + Found floating substr `bc' at offset 1... + Guessed: match at offset 0 + Matching REx `a*b+c' against `abc' + Setting an EVAL scope, savestack=3 + 0 <> <abc> | 1: STAR + EXACT <a> can match 1 times out of 32767... + Setting an EVAL scope, savestack=3 + 1 <a> <bc> | 4: PLUS + EXACT <b> can match 1 times out of 32767... + Setting an EVAL scope, savestack=3 + 2 <ab> <c> | 7: EXACT <c> + 3 <abc> <> | 9: END + Match successful! + Freeing REx: `a*b+c' + +If you have gotten this far into the tutorial, you can probably guess +what the different parts of the debugging output tell you. The first +part + + Compiling REx `a*b+c' + size 9 first at 1 + 1: STAR(4) + 2: EXACT <a>(0) + 4: PLUS(7) + 5: EXACT <b>(0) + 7: EXACT <c>(9) + 9: END(0) + +describes the compilation stage. C<STAR(4)> means that there is a +starred object, in this case C<'a'>, and if it matches, goto line 4, +i.e., C<PLUS(7)>. The middle lines describe some heuristics and +optimizations performed before a match: + + floating `bc' at 0..2147483647 (checking floating) minlen 2 + Guessing start of match, REx `a*b+c' against `abc'... + Found floating substr `bc' at offset 1...
+ Guessed: match at offset 0 + +Then the match is executed and the remaining lines describe the +process: + + Matching REx `a*b+c' against `abc' + Setting an EVAL scope, savestack=3 + 0 <> <abc> | 1: STAR + EXACT <a> can match 1 times out of 32767... + Setting an EVAL scope, savestack=3 + 1 <a> <bc> | 4: PLUS + EXACT <b> can match 1 times out of 32767... + Setting an EVAL scope, savestack=3 + 2 <ab> <c> | 7: EXACT <c> + 3 <abc> <> | 9: END + Match successful! + Freeing REx: `a*b+c' + +Each step is of the form C<< n <x> <y> >>, with C<< <x> >> the +part of the string matched and C<< <y> >> the part not yet +matched. The C<< | 1: STAR >> says that perl is at line number 1 +in the compilation list above. See +L<perldebguts/"Debugging regular expressions"> for much more detail. + +An alternative method of debugging regexps is to embed C<print> +statements within the regexp. This provides a blow-by-blow account of +the backtracking in an alternation: + + "that this" =~ m@(?{print "Start at position ", pos, "\n";}) + t(?{print "t1\n";}) + h(?{print "h1\n";}) + i(?{print "i1\n";}) + s(?{print "s1\n";}) + | + t(?{print "t2\n";}) + h(?{print "h2\n";}) + a(?{print "a2\n";}) + t(?{print "t2\n";}) + (?{print "Done at position ", pos, "\n";}) + @x; + +prints + + Start at position 0 + t1 + h1 + t2 + h2 + a2 + t2 + Done at position 4 + +=head1 BUGS + +Code expressions, conditional expressions, and independent expressions +are B<experimental>. Don't use them in production code. Yet. + +=head1 SEE ALSO + +This is just a tutorial. For the full story on perl regular +expressions, see the L<perlre> regular expressions reference page. + +For more information on the matching C<m//> and substitution C<s///> +operators, see L<perlop/"Regexp Quote-Like Operators">. For +information on the C<split> operation, see L<perlfunc/split>. + +For an excellent all-around resource on the care and feeding of +regular expressions, see the book I<Mastering Regular Expressions> by +Jeffrey Friedl (published by O'Reilly, ISBN 1-56592-257-3). + +=head1 AUTHOR AND COPYRIGHT + +Copyright (c) 2000 Mark Kvale +All rights reserved. + +This document may be distributed under the same terms as Perl itself.
+ +=head2 Acknowledgments + +The inspiration for the stop codon DNA example came from the ZIP +code example in chapter 7 of I<Mastering Regular Expressions>. + +The author would like to thank +Jeff Pinyan, +Peter Haworth, +Ronald J Kimball, +and Joe Smith for all their helpful comments. + +=cut