1 | =head1 NAME |
2 | |
3 | perlretut - Perl regular expressions tutorial |
4 | |
5 | =head1 DESCRIPTION |
6 | |
7 | This page provides a basic tutorial on understanding, creating and |
8 | using regular expressions in Perl. It serves as a complement to the |
9 | reference page on regular expressions L<perlre>. Regular expressions |
10 | are an integral part of the C<m//>, C<s///>, C<qr//> and C<split> |
11 | operators and so this tutorial also overlaps with |
12 | L<perlop/"Regexp Quote-Like Operators"> and L<perlfunc/split>. |
13 | |
14 | Perl is widely renowned for excellence in text processing, and regular |
15 | expressions are one of the big factors behind this fame. Perl regular |
16 | expressions display an efficiency and flexibility unknown in most |
17 | other computer languages. Mastering even the basics of regular |
18 | expressions will allow you to manipulate text with surprising ease. |
19 | |
20 | What is a regular expression? A regular expression is simply a string |
21 | that describes a pattern. Patterns are in common use these days; |
22 | examples are the patterns typed into a search engine to find web pages |
23 | and the patterns used to list files in a directory, e.g., C<ls *.txt> |
24 | or C<dir *.*>. In Perl, the patterns described by regular expressions |
25 | are used to search strings, extract desired parts of strings, and to |
26 | do search and replace operations. |
27 | |
28 | Regular expressions have the undeserved reputation of being abstract |
29 | and difficult to understand. Regular expressions are constructed using |
30 | simple concepts like conditionals and loops and are no more difficult |
31 | to understand than the corresponding C<if> conditionals and C<while> |
32 | loops in the Perl language itself. In fact, the main challenge in |
33 | learning regular expressions is just getting used to the terse |
34 | notation used to express these concepts. |
35 | |
36 | This tutorial flattens the learning curve by discussing regular |
37 | expression concepts, along with their notation, one at a time and with |
38 | many examples. The first part of the tutorial will progress from the |
39 | simplest word searches to the basic regular expression concepts. If |
40 | you master the first part, you will have all the tools needed to solve |
41 | about 98% of your needs. The second part of the tutorial is for those |
42 | comfortable with the basics and hungry for more power tools. It |
43 | discusses the more advanced regular expression operators and |
44 | introduces the latest cutting edge innovations in 5.6.0. |
45 | |
46 | A note: to save time, 'regular expression' is often abbreviated as |
47 | regexp or regex. Regexp is a more natural abbreviation than regex, but |
48 | is harder to pronounce. The Perl pod documentation is evenly split on |
49 | regexp vs regex; in Perl, there is more than one way to abbreviate it. |
50 | We'll use regexp in this tutorial. |
51 | |
52 | =head1 Part 1: The basics |
53 | |
54 | =head2 Simple word matching |
55 | |
56 | The simplest regexp is simply a word, or more generally, a string of |
57 | characters. A regexp consisting of a word matches any string that |
58 | contains that word: |
59 | |
60 | "Hello World" =~ /World/; # matches |
61 | |
62 | What is this perl statement all about? C<"Hello World"> is a simple |
63 | double quoted string. C<World> is the regular expression and the |
64 | C<//> enclosing C</World/> tells perl to search a string for a match. |
65 | The operator C<=~> associates the string with the regexp match and |
66 | produces a true value if the regexp matched, or false if the regexp |
67 | did not match. In our case, C<World> matches the second word in |
68 | C<"Hello World">, so the expression is true. Expressions like this |
69 | are useful in conditionals: |
70 | |
71 | if ("Hello World" =~ /World/) { |
72 | print "It matches\n"; |
73 | } |
74 | else { |
75 | print "It doesn't match\n"; |
76 | } |
77 | |
78 | There are useful variations on this theme. The sense of the match can |
79 | be reversed by using the C<!~> operator: |
80 | |
81 | if ("Hello World" !~ /World/) { |
82 | print "It doesn't match\n"; |
83 | } |
84 | else { |
85 | print "It matches\n"; |
86 | } |
87 | |
88 | The literal string in the regexp can be replaced by a variable: |
89 | |
90 | $greeting = "World"; |
91 | if ("Hello World" =~ /$greeting/) { |
92 | print "It matches\n"; |
93 | } |
94 | else { |
95 | print "It doesn't match\n"; |
96 | } |
97 | |
98 | If you're matching against the special default variable C<$_>, the |
99 | C<$_ =~> part can be omitted: |
100 | |
101 | $_ = "Hello World"; |
102 | if (/World/) { |
103 | print "It matches\n"; |
104 | } |
105 | else { |
106 | print "It doesn't match\n"; |
107 | } |
108 | |
109 | And finally, the C<//> default delimiters for a match can be changed |
110 | to arbitrary delimiters by putting an C<'m'> out front: |
111 | |
112 | "Hello World" =~ m!World!; # matches, delimited by '!' |
113 | "Hello World" =~ m{World}; # matches, note the matching '{}' |
114 | "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin', |
115 | # '/' becomes an ordinary char |
116 | |
117 | C</World/>, C<m!World!>, and C<m{World}> all represent the |
118 | same thing. When, e.g., C<""> is used as a delimiter, the forward |
119 | slash C<'/'> becomes an ordinary character and can be used in a regexp |
120 | without trouble. |
121 | |
122 | Let's consider how different regexps would match C<"Hello World">: |
123 | |
124 | "Hello World" =~ /world/; # doesn't match |
125 | "Hello World" =~ /o W/; # matches |
126 | "Hello World" =~ /oW/; # doesn't match |
127 | "Hello World" =~ /World /; # doesn't match |
128 | |
129 | The first regexp C<world> doesn't match because regexps are |
130 | case-sensitive. The second regexp matches because the substring |
131 | S<C<'o W'> > occurs in the string S<C<"Hello World"> >. The space |
132 | character ' ' is treated like any other character in a regexp and is |
133 | needed to match in this case. The lack of a space character is the |
134 | reason the third regexp C<'oW'> doesn't match. The fourth regexp |
135 | C<'World '> doesn't match because there is a space at the end of the |
136 | regexp, but not at the end of the string. The lesson here is that |
137 | regexps must match a part of the string I<exactly> in order for the |
138 | statement to be true. |
139 | |
140 | If a regexp matches in more than one place in the string, perl will |
141 | always match at the earliest possible point in the string: |
142 | |
143 | "Hello World" =~ /o/; # matches 'o' in 'Hello' |
144 | "That hat is red" =~ /hat/; # matches 'hat' in 'That' |
145 | |
146 | With respect to character matching, there are a few more points you |
147 | need to know about. First of all, not all characters can be used 'as |
148 | is' in a match. Some characters, called B<metacharacters>, are reserved |
149 | for use in regexp notation. The metacharacters are |
150 | |
151 | {}[]()^$.|*+?\ |
152 | |
153 | The significance of each of these will be explained |
154 | in the rest of the tutorial, but for now, it is important only to know |
155 | that a metacharacter can be matched by putting a backslash before it: |
156 | |
157 | "2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter |
158 | "2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary + |
159 | "The interval is [0,1)." =~ /[0,1)./ # is a syntax error! |
160 | "The interval is [0,1)." =~ /\[0,1\)\./ # matches |
161 | "/usr/bin/perl" =~ /\/usr\/bin\/perl/; # matches |
162 | |
163 | In the last regexp, the forward slash C<'/'> is also backslashed, |
164 | because it is used to delimit the regexp. This can lead to LTS |
165 | (leaning toothpick syndrome), however, and it is often more readable |
166 | to change delimiters. |
167 | |
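As a small sketch of that advice, the two matches below test the same pattern against the same string. The first keeps the default C<//> delimiters and backslashes every slash; the second switches to C<m{...}>, so the slashes are ordinary characters. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $path = "/usr/bin/perl";

# Default delimiters: every '/' in the pattern must be backslashed.
my $with_toothpicks = $path =~ /\/usr\/bin\/perl/ ? 1 : 0;

# Alternative delimiters: '/' is an ordinary character inside m{...}.
my $with_braces = $path =~ m{/usr/bin/perl} ? 1 : 0;

print "toothpicks: $with_toothpicks, braces: $with_braces\n";
```

Both forms succeed; the second is simply easier to read.
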
168 | |
169 | The backslash character C<'\'> is a metacharacter itself and needs to |
170 | be backslashed: |
171 | |
172 | 'C:\WIN32' =~ /C:\\WIN/; # matches |
173 | |
174 | In addition to the metacharacters, there are some ASCII characters |
175 | which don't have printable character equivalents and are instead |
176 | represented by B<escape sequences>. Common examples are C<\t> for a |
177 | tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a |
178 | bell. If your string is better thought of as a sequence of arbitrary |
179 | bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape |
180 | sequence, e.g., C<\x1B> may be a more natural representation for your |
181 | bytes. Here are some examples of escapes: |
182 | |
183 | "1000\t2000" =~ m(0\t2) # matches |
184 | "1000\n2000" =~ /0\n20/ # matches |
185 | "1000\t2000" =~ /\000\t2/ # doesn't match, "0" ne "\000" |
186 | "cat" =~ /\143\x61\x74/ # matches, but a weird way to spell cat |
187 | |
188 | If you've been around Perl a while, all this talk of escape sequences |
189 | may seem familiar. Similar escape sequences are used in double-quoted |
190 | strings and in fact the regexps in Perl are mostly treated as |
191 | double-quoted strings. This means that variables can be used in |
192 | regexps as well. Just like double-quoted strings, the values of the |
193 | variables in the regexp will be substituted in before the regexp is |
194 | evaluated for matching purposes. So we have: |
195 | |
196 | $foo = 'house'; |
197 | 'housecat' =~ /$foo/; # matches |
198 | 'cathouse' =~ /cat$foo/; # matches |
199 | 'housecat' =~ /${foo}cat/; # matches |
200 | |
201 | So far, so good. With the knowledge above you can already perform |
202 | searches with just about any literal string regexp you can dream up. |
203 | Here is a I<very simple> emulation of the Unix grep program: |
204 | |
205 | % cat > simple_grep |
206 | #!/usr/bin/perl |
207 | $regexp = shift; |
208 | while (<>) { |
209 | print if /$regexp/; |
210 | } |
211 | ^D |
212 | |
213 | % chmod +x simple_grep |
214 | |
215 | % simple_grep abba /usr/dict/words |
216 | Babbage |
217 | cabbage |
218 | cabbages |
219 | sabbath |
220 | Sabbathize |
221 | Sabbathizes |
222 | sabbatical |
223 | scabbard |
224 | scabbards |
225 | |
226 | This program is easy to understand. C<#!/usr/bin/perl> is the standard |
227 | way to invoke a perl program from the shell. |
228 | S<C<$regexp = shift;> > saves the first command line argument as the |
229 | regexp to be used, leaving the rest of the command line arguments to |
230 | be treated as files. S<C<< while (<>) >> > loops over all the lines in |
231 | all the files. For each line, S<C<print if /$regexp/;> > prints the |
232 | line if the regexp matches the line. In this line, both C<print> and |
233 | C</$regexp/> use the default variable C<$_> implicitly. |
234 | |
235 | With all of the regexps above, if the regexp matched anywhere in the |
236 | string, it was considered a match. Sometimes, however, we'd like to |
237 | specify I<where> in the string the regexp should try to match. To do |
238 | this, we would use the B<anchor> metacharacters C<^> and C<$>. The |
239 | anchor C<^> means match at the beginning of the string and the anchor |
240 | C<$> means match at the end of the string, or before a newline at the |
241 | end of the string. Here is how they are used: |
242 | |
243 | "housekeeper" =~ /keeper/; # matches |
244 | "housekeeper" =~ /^keeper/; # doesn't match |
245 | "housekeeper" =~ /keeper$/; # matches |
246 | "housekeeper\n" =~ /keeper$/; # matches |
247 | |
248 | The second regexp doesn't match because C<^> constrains C<keeper> to |
249 | match only at the beginning of the string, but C<"housekeeper"> has |
250 | keeper starting in the middle. The third regexp does match, since the |
251 | C<$> constrains C<keeper> to match only at the end of the string. |
252 | |
253 | When both C<^> and C<$> are used at the same time, the regexp has to |
254 | match both the beginning and the end of the string, i.e., the regexp |
255 | matches the whole string. Consider |
256 | |
257 | "keeper" =~ /^keep$/; # doesn't match |
258 | "keeper" =~ /^keeper$/; # matches |
259 | "" =~ /^$/; # ^$ matches an empty string |
260 | |
261 | The first regexp doesn't match because the string has more to it than |
262 | C<keep>. Since the second regexp is exactly the string, it |
263 | matches. Using both C<^> and C<$> in a regexp forces the complete |
264 | string to match, so it gives you complete control over which strings |
265 | match and which don't. Suppose you are looking for a fellow named |
266 | bert, off in a string by himself: |
267 | |
268 | "dogbert" =~ /bert/; # matches, but not what you want |
269 | |
270 | "dilbert" =~ /^bert/; # doesn't match, but .. |
271 | "bertram" =~ /^bert/; # matches, so still not good enough |
272 | |
273 | "bertram" =~ /^bert$/; # doesn't match, good |
274 | "dilbert" =~ /^bert$/; # doesn't match, good |
275 | "bert" =~ /^bert$/; # matches, perfect |
276 | |
277 | Of course, in the case of a literal string, one could just as easily |
278 | use the string equivalence S<C<$string eq 'bert'> > and it would be |
279 | more efficient. The C<^...$> regexp really becomes useful when we |
280 | add in the more powerful regexp tools below. |
281 | |
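The comparison with C<eq> has one subtlety worth a quick sketch. Because C<$> can also match before a newline at the end of the string, C</^bert$/> accepts C<"bert\n"> while C<eq> does not. The loop below is only an illustration; the test strings are made up.

```perl
use strict;
use warnings;

my @report;
for my $string ("bert", "bert\n", "bertram") {
    # 1 if the anchored regexp matches, 0 otherwise.
    my $by_regexp = $string =~ /^bert$/ ? 1 : 0;
    # 1 if the string is exactly 'bert', 0 otherwise.
    my $by_eq     = $string eq 'bert'   ? 1 : 0;
    push @report, "$by_regexp$by_eq";
}
print "@report\n";   # prints "11 10 00"
```

The middle entry shows the difference: the regexp matches the string with a trailing newline, the equality test does not.
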
282 | =head2 Using character classes |
283 | |
284 | Although one can already do quite a lot with the literal string |
285 | regexps above, we've only scratched the surface of regular expression |
286 | technology. In this and subsequent sections we will introduce regexp |
287 | concepts (and associated metacharacter notations) that will allow a |
288 | regexp to not just represent a single character sequence, but a I<whole |
289 | class> of them. |
290 | |
291 | One such concept is that of a B<character class>. A character class |
292 | allows a set of possible characters, rather than just a single |
293 | character, to match at a particular point in a regexp. Character |
294 | classes are denoted by brackets C<[...]>, with the set of characters |
295 | to be possibly matched inside. Here are some examples: |
296 | |
297 | /cat/; # matches 'cat' |
298 | /[bcr]at/; # matches 'bat', 'cat', or 'rat' |
299 | /item[0123456789]/; # matches 'item0' or ... or 'item9' |
300 | "abc" =~ /[cab]/; # matches 'a' |
301 | |
302 | In the last statement, even though C<'c'> is the first character in |
303 | the class, C<'a'> matches because the first character position in the |
304 | string is the earliest point at which the regexp can match. |
305 | |
306 | /[yY][eE][sS]/; # match 'yes' in a case-insensitive way |
307 | # 'yes', 'Yes', 'YES', etc. |
308 | |
309 | This regexp displays a common task: perform a case-insensitive |
310 | match. Perl provides a way of avoiding all those brackets by simply |
311 | appending an C<'i'> to the end of the match. Then C</[yY][eE][sS]/;> |
312 | can be rewritten as C</yes/i;>. The C<'i'> stands for |
313 | case-insensitive and is an example of a B<modifier> of the matching |
314 | operation. We will meet other modifiers later in the tutorial. |
315 | |
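To see that the two spellings agree, here is a minimal sketch that counts how many strings in a made-up list match each form. The list C<@answers> is hypothetical; C<grep> in scalar context returns the number of matching elements.

```perl
use strict;
use warnings;

# Hypothetical sample answers; any strings would do.
my @answers = ("yes", "Yes", "YES", "yEs", "no");

# Count matches with the bracketed form and with the //i form.
my $bracket_count  = grep { /[yY][eE][sS]/ } @answers;
my $modifier_count = grep { /yes/i } @answers;

print "brackets: $bracket_count, //i: $modifier_count\n";   # both are 4
```
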
316 | We saw in the section above that there were ordinary characters, which |
317 | represented themselves, and special characters, which needed a |
318 | backslash C<\> to represent themselves. The same is true in a |
319 | character class, but the sets of ordinary and special characters |
320 | inside a character class are different than those outside a character |
321 | class. The special characters for a character class are C<-]\^$>. C<]> |
322 | is special because it denotes the end of a character class. C<$> is |
323 | special because it denotes a scalar variable. C<\> is special because |
324 | it is used in escape sequences, just like above. Here is how the |
325 | special characters C<]$\> are handled: |
326 | |
327 | /[\]c]def/; # matches ']def' or 'cdef' |
328 | $x = 'bcr'; |
329 | /[$x]at/; # matches 'bat', 'cat', or 'rat' |
330 | /[\$x]at/; # matches '$at' or 'xat' |
331 | /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat' |
332 | |
333 | The last two are a little tricky. In C<[\$x]>, the backslash protects |
334 | the dollar sign, so the character class has two members C<$> and C<x>. |
335 | In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a |
336 | variable and substituted in double quote fashion. |
337 | |
338 | The special character C<'-'> acts as a range operator within character |
339 | classes, so that a contiguous set of characters can be written as a |
340 | range. With ranges, the unwieldy C<[0123456789]> and C<[abc...xyz]> |
341 | become the svelte C<[0-9]> and C<[a-z]>. Some examples are |
342 | |
343 | /item[0-9]/; # matches 'item0' or ... or 'item9' |
344 | /[0-9bx-z]aa/; # matches '0aa', ..., '9aa', |
345 | # 'baa', 'xaa', 'yaa', or 'zaa' |
346 | /[0-9a-fA-F]/; # matches a hexadecimal digit |
347 | /[0-9a-zA-Z_]/; # matches a "word" character, |
348 | # like those in a perl variable name |
349 | |
350 | If C<'-'> is the first or last character in a character class, it is |
351 | treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are |
352 | all equivalent. |
353 | |
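The equivalence of the three spellings can be checked with a short sketch. Each test character is matched against all three classes, anchored with C<^> and C<$> so that the class must account for the whole one-character string. The test characters are chosen only for illustration.

```perl
use strict;
use warnings;

my @hits;
for my $char ('a', 'b', '-', 'c') {
    my $first  = $char =~ /^[-ab]$/  ? 1 : 0;   # '-' first in the class
    my $last   = $char =~ /^[ab-]$/  ? 1 : 0;   # '-' last in the class
    my $escape = $char =~ /^[a\-b]$/ ? 1 : 0;   # '-' backslashed
    push @hits, join(':', $char, "$first$last$escape");
}
print "@hits\n";   # prints "a:111 b:111 -:111 c:000"
```

All three classes accept C<a>, C<b> and a literal C<->, and all three reject C<c>.
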
354 | The special character C<^> in the first position of a character class |
355 | denotes a B<negated character class>, which matches any character but |
356 | those in the brackets. Both C<[...]> and C<[^...]> must match a |
357 | character, or the match fails. Then |
358 | |
359 | /[^a]at/; # doesn't match 'aat' or 'at', but matches |
360 | # all other 'bat', 'cat', '0at', '%at', etc. |
361 | /[^0-9]/; # matches a non-numeric character |
362 | /[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary |
363 | |
364 | Now, even C<[0-9]> can be a bother to write multiple times, so in the |
365 | interest of saving keystrokes and making regexps more readable, Perl |
366 | has several abbreviations for common character classes: |
367 | |
368 | =over 4 |
369 | |
370 | =item * |
371 | |
372 | \d is a digit and represents [0-9] |
373 | |
374 | =item * |
375 | |
376 | \s is a whitespace character and represents [\ \t\r\n\f] |
377 | |
378 | =item * |
379 | |
380 | \w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_] |
381 | |
382 | =item * |
383 | |
384 | \D is a negated \d; it represents any character but a digit [^0-9] |
385 | |
386 | =item * |
387 | |
388 | \S is a negated \s; it represents any non-whitespace character [^\s] |
389 | |
390 | =item * |
391 | |
392 | \W is a negated \w; it represents any non-word character [^\w] |
393 | |
394 | =item * |
395 | |
396 | The period '.' matches any character but "\n" |
397 | |
398 | =back |
399 | |
400 | The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside |
401 | of character classes. Here are some in use: |
402 | |
403 | /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format |
404 | /[\d\s]/; # matches any digit or whitespace character |
405 | /\w\W\w/; # matches a word char, followed by a |
406 | # non-word char, followed by a word char |
407 | /..rt/; # matches any two chars, followed by 'rt' |
408 | /end\./; # matches 'end.' |
409 | /end[.]/; # same thing, matches 'end.' |
410 | |
411 | Because a period is a metacharacter, it needs to be escaped to match |
412 | as an ordinary period. Because, for example, C<\d> and C<\w> are sets |
413 | of characters, it is incorrect to think of C<[^\d\w]> as C<[\D\W]>; in |
414 | fact C<[^\d\w]> is the same as C<[^\w]>, which is the same as |
415 | C<[\W]>. Think DeMorgan's laws. |
416 | |
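The difference between C<[^\d\w]> and C<[\D\W]> is easy to get wrong, so here is a minimal sketch that tries both classes on three sample characters. The characters are picked for illustration: a letter, a digit, and a punctuation mark.

```perl
use strict;
use warnings;

my %result;
for my $char ('a', '5', '%') {
    # [^\d\w]: neither a digit nor a word character (same as \W).
    my $negated = $char =~ /[^\d\w]/ ? 1 : 0;
    # [\D\W]: a non-digit or a non-word character (same as \D).
    my $union = $char =~ /[\D\W]/ ? 1 : 0;
    $result{$char} = "$negated$union";
}
print "$_ => $result{$_}\n" for sort keys %result;
```

The letter C<a> is the telling case: it is a word character, so C<[^\d\w]> rejects it, but it is also a non-digit, so C<[\D\W]> accepts it.
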
417 | An anchor useful in basic regexps is the S<B<word anchor> > |
418 | C<\b>. This matches a boundary between a word character and a non-word |
419 | character C<\w\W> or C<\W\w>: |
420 | |
421 | $x = "Housecat catenates house and cat"; |
422 | $x =~ /cat/; # matches cat in 'Housecat' |
423 | $x =~ /\bcat/; # matches cat in 'catenates' |
424 | $x =~ /cat\b/; # matches cat in 'Housecat' |
425 | $x =~ /\bcat\b/; # matches 'cat' at end of string |
426 | |
427 | Note in the last example, the end of the string is considered a word |
428 | boundary. |
429 | |
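To confirm where each of these matches lands, the sketch below records the starting offset of every match. It relies on the built-in array C<@->, which this part of the tutorial does not otherwise cover: after a successful match, C<$-[0]> holds the offset at which the match began.

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";
my @starts;

# Record the offset at which each pattern first matches.
push @starts, $-[0] if $x =~ /cat/;      # inside 'Housecat'
push @starts, $-[0] if $x =~ /\bcat/;    # start of 'catenates'
push @starts, $-[0] if $x =~ /cat\b/;    # end of 'Housecat'
push @starts, $-[0] if $x =~ /\bcat\b/;  # the final word 'cat'

print "@starts\n";   # prints "5 9 5 29"
```
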
430 | You might wonder why C<'.'> matches everything but C<"\n"> - why not |
431 | every character? The reason is that often one is matching against |
432 | lines and would like to ignore the newline characters. For instance, |
433 | while the string C<"\n"> represents one line, we would like to think |
434 | of it as empty. Then |
435 | |
436 | "" =~ /^$/; # matches |
437 | "\n" =~ /^$/; # matches, "\n" is ignored |
438 | |
439 | "" =~ /./; # doesn't match; it needs a char |
440 | "" =~ /^.$/; # doesn't match; it needs a char |
441 | "\n" =~ /^.$/; # doesn't match; it needs a char other than "\n" |
442 | "a" =~ /^.$/; # matches |
443 | "a\n" =~ /^.$/; # matches, ignores the "\n" |
444 | |
445 | This behavior is convenient, because we usually want to ignore |
446 | newlines when we count and match characters in a line. Sometimes, |
447 | however, we want to keep track of newlines. We might even want C<^> |
448 | and C<$> to anchor at the beginning and end of lines within the |
449 | string, rather than just the beginning and end of the string. Perl |
450 | allows us to choose between ignoring and paying attention to newlines |
451 | by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for |
452 | single line and multi-line and they determine whether a string is to |
453 | be treated as one continuous string, or as a set of lines. The two |
454 | modifiers affect two aspects of how the regexp is interpreted: 1) how |
455 | the C<'.'> character class is defined, and 2) where the anchors C<^> |
456 | and C<$> are able to match. Here are the four possible combinations: |
457 | |
458 | =over 4 |
459 | |
460 | =item * |
461 | |
462 | no modifiers (//): Default behavior. C<'.'> matches any character |
463 | except C<"\n">. C<^> matches only at the beginning of the string and |
464 | C<$> matches only at the end or before a newline at the end. |
465 | |
466 | =item * |
467 | |
468 | s modifier (//s): Treat string as a single long line. C<'.'> matches |
469 | any character, even C<"\n">. C<^> matches only at the beginning of |
470 | the string and C<$> matches only at the end or before a newline at the |
471 | end. |
472 | |
473 | =item * |
474 | |
475 | m modifier (//m): Treat string as a set of multiple lines. C<'.'> |
476 | matches any character except C<"\n">. C<^> and C<$> are able to match |
477 | at the start or end of I<any> line within the string. |
478 | |
479 | =item * |
480 | |
481 | both s and m modifiers (//sm): Treat string as a single long line, but |
482 | detect multiple lines. C<'.'> matches any character, even |
483 | C<"\n">. C<^> and C<$>, however, are able to match at the start or end |
484 | of I<any> line within the string. |
485 | |
486 | =back |
487 | |
488 | Here are examples of C<//s> and C<//m> in action: |
489 | |
490 | $x = "There once was a girl\nWho programmed in Perl\n"; |
491 | |
492 | $x =~ /^Who/; # doesn't match, "Who" not at start of string |
493 | $x =~ /^Who/s; # doesn't match, "Who" not at start of string |
494 | $x =~ /^Who/m; # matches, "Who" at start of second line |
495 | $x =~ /^Who/sm; # matches, "Who" at start of second line |
496 | |
497 | $x =~ /girl.Who/; # doesn't match, "." doesn't match "\n" |
498 | $x =~ /girl.Who/s; # matches, "." matches "\n" |
499 | $x =~ /girl.Who/m; # doesn't match, "." doesn't match "\n" |
500 | $x =~ /girl.Who/sm; # matches, "." matches "\n" |
501 | |
502 | Most of the time, the default behavior is what is wanted, but C<//s> and |
503 | C<//m> are occasionally very useful. If C<//m> is being used, the start |
504 | of the string can still be matched with C<\A> and the end of string |
505 | can still be matched with the anchors C<\Z> (matches both the end and |
506 | the newline before, like C<$>), and C<\z> (matches only the end): |
507 | |
508 | $x =~ /^Who/m; # matches, "Who" at start of second line |
509 | $x =~ /\AWho/m; # doesn't match, "Who" is not at start of string |
510 | |
511 | $x =~ /girl$/m; # matches, "girl" at end of first line |
512 | $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string |
513 | |
514 | $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end |
515 | $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string |
516 | |
517 | We now know how to create choices among classes of characters in a |
518 | regexp. What about choices among words or character strings? Such |
519 | choices are described in the next section. |
520 | |
521 | =head2 Matching this or that |
522 | |
523 | Sometimes we would like our regexp to be able to match different |
524 | possible words or character strings. This is accomplished by using |
525 | the B<alternation> metacharacter C<|>. To match C<dog> or C<cat>, we |
526 | form the regexp C<dog|cat>. As before, perl will try to match the |
527 | regexp at the earliest possible point in the string. At each |
528 | character position, perl will first try to match the first |
529 | alternative, C<dog>. If C<dog> doesn't match, perl will then try the |
530 | next alternative, C<cat>. If C<cat> doesn't match either, then the |
531 | match fails and perl moves to the next position in the string. Some |
532 | examples: |
533 | |
534 | "cats and dogs" =~ /cat|dog|bird/; # matches "cat" |
535 | "cats and dogs" =~ /dog|cat|bird/; # matches "cat" |
536 | |
537 | Even though C<dog> is the first alternative in the second regexp, |
538 | C<cat> is able to match earlier in the string. |
539 | |
540 | "cats" =~ /c|ca|cat|cats/; # matches "c" |
541 | "cats" =~ /cats|cat|ca|c/; # matches "cats" |
542 | |
543 | Here, all the alternatives match at the first string position, so the |
544 | first alternative is the one that matches. If some of the |
545 | alternatives are truncations of the others, put the longest ones first |
546 | to give them a chance to match. |
547 | |
548 | "cab" =~ /a|b|c/ # matches "c" |
549 | # /a|b|c/ == /[abc]/ |
550 | |
551 | The last example points out that character classes are like |
552 | alternations of characters. At a given character position, the first |
553 | alternative that allows the regexp match to succeed will be the one |
554 | that matches. |
555 | |
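The ordering rules can be checked directly. The sketch below saves the text that each alternation actually matched, using the special variable C<$&>, which holds the whole matched string and is not introduced at this point in the tutorial.

```perl
use strict;
use warnings;

my @matched;
push @matched, $& if "cats and dogs" =~ /dog|cat|bird/;  # earliest position wins
push @matched, $& if "cats" =~ /c|ca|cat|cats/;          # first alternative wins
push @matched, $& if "cats" =~ /cats|cat|ca|c/;          # longest listed first

print join(", ", @matched), "\n";   # prints "cat, c, cats"
```
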
556 | =head2 Grouping things and hierarchical matching |
557 | |
558 | Alternation allows a regexp to choose among alternatives, but by |
559 | itself it is unsatisfying. The reason is that each alternative is a whole |
560 | regexp, but sometimes we want alternatives for just part of a |
561 | regexp. For instance, suppose we want to search for housecats or |
562 | housekeepers. The regexp C<housecat|housekeeper> fits the bill, but is |
563 | inefficient because we had to type C<house> twice. It would be nice to |
564 | have parts of the regexp be constant, like C<house>, and some |
565 | parts have alternatives, like C<cat|keeper>. |
566 | |
567 | The B<grouping> metacharacters C<()> solve this problem. Grouping |
568 | allows parts of a regexp to be treated as a single unit. Parts of a |
569 | regexp are grouped by enclosing them in parentheses. Thus we could solve |
570 | the C<housecat|housekeeper> problem by forming the regexp as |
571 | C<house(cat|keeper)>. The regexp C<house(cat|keeper)> means match |
572 | C<house> followed by either C<cat> or C<keeper>. Some more examples |
573 | are |
574 | |
575 | /(a|b)b/; # matches 'ab' or 'bb' |
576 | /(ac|b)b/; # matches 'acb' or 'bb' |
577 | /(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere |
578 | /(a|[bc])d/; # matches 'ad', 'bd', or 'cd' |
579 | |
580 | /house(cat|)/; # matches either 'housecat' or 'house' |
581 | /house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or |
582 | # 'house'. Note groups can be nested. |
583 | |
584 | /(19|20|)\d\d/; # match years 19xx, 20xx, or the Y2K problem, xx |
585 | "20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d', |
586 | # because '20\d\d' can't match |
587 | |
588 | Alternations behave the same way in groups as out of them: at a given |
589 | string position, the leftmost alternative that allows the regexp to |
590 | match is taken. So in the last example at the first string position, |
591 | C<"20"> matches the second alternative, but there is nothing left over |
592 | to match the next two digits C<\d\d>. So perl moves on to the next |
593 | alternative, which is the null alternative and that works, since |
594 | C<"20"> is two digits. |
595 | |
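A short sketch makes the choice of alternative visible. It records what the group matched for a few sample strings, using C<$1>, which holds the text matched by the first group and is explained later in this tutorial. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

my %prefix;
for my $year ("1999", "2005", "20", "99") {
    if ($year =~ /(19|20|)\d\d/) {
        $prefix{$year} = $1;   # text matched by the group, possibly empty
    }
}
print "$_ -> '$prefix{$_}'\n" for sort keys %prefix;
```

For C<"1999"> and C<"2005"> the group holds C<19> and C<20>. For C<"20"> and C<"99"> it holds the empty string, because only the null alternative leaves two digits for C<\d\d>.
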
596 | The process of trying one alternative, seeing if it matches, and |
597 | moving on to the next alternative if it doesn't, is called |
598 | B<backtracking>. The term 'backtracking' comes from the idea that |
599 | matching a regexp is like a walk in the woods. Successfully matching |
600 | a regexp is like arriving at a destination. There are many possible |
601 | trailheads, one for each string position, and each one is tried in |
602 | order, left to right. From each trailhead there may be many paths, |
603 | some of which get you there, and some of which are dead ends. When you |
604 | walk along a trail and hit a dead end, you have to backtrack along the |
605 | trail to an earlier point to try another trail. If you hit your |
606 | destination, you stop immediately and forget about trying all the |
607 | other trails. You are persistent, and only if you have tried all the |
608 | trails from all the trailheads and not arrived at your destination, do |
609 | you declare failure. To be concrete, here is a step-by-step analysis |
610 | of what perl does when it tries to match the regexp |
611 | |
612 | "abcde" =~ /(abd|abc)(df|d|de)/; |
613 | |
614 | =over 4 |
615 | |
616 | =item 0 |
617 | |
618 | Start with the first letter in the string 'a'. |
619 | |
620 | =item 1 |
621 | |
622 | Try the first alternative in the first group 'abd'. |
623 | |
624 | =item 2 |
625 | |
626 | Match 'a' followed by 'b'. So far so good. |
627 | |
628 | =item 3 |
629 | |
630 | 'd' in the regexp doesn't match 'c' in the string - a dead |
631 | end. So backtrack two characters and pick the second alternative in |
632 | the first group 'abc'. |
633 | |
634 | =item 4 |
635 | |
636 | Match 'a' followed by 'b' followed by 'c'. We are on a roll |
637 | and have satisfied the first group. Set $1 to 'abc'. |
638 | |
639 | =item 5 |
640 | |
641 | Move on to the second group and pick the first alternative |
642 | 'df'. |
643 | |
644 | =item 6 |
645 | |
646 | Match the 'd'. |
647 | |
648 | =item 7 |
649 | |
650 | 'f' in the regexp doesn't match 'e' in the string, so a dead |
651 | end. Backtrack one character and pick the second alternative in the |
652 | second group 'd'. |
653 | |
654 | =item 8 |
655 | |
656 | 'd' matches. The second grouping is satisfied, so set $2 to |
657 | 'd'. |
658 | |
659 | =item 9 |
660 | |
661 | We are at the end of the regexp, so we are done! We have |
662 | matched 'abcd' out of the string "abcde". |
663 | |
664 | =back |
665 | |
666 | There are a couple of things to note about this analysis. First, the |
667 | third alternative in the second group 'de' also allows a match, but we |
668 | stopped before we got to it - at a given character position, leftmost |
669 | wins. Second, we were able to get a match at the first character |
670 | position of the string 'a'. If there were no matches at the first |
671 | position, perl would move to the second character position 'b' and |
672 | attempt the match all over again. Only when all possible paths at all |
673 | possible character positions have been exhausted does perl give |
674 | up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;> > to be false. |
675 | |
676 | Even with all this work, regexp matching happens remarkably fast. To |
677 | speed things up, during the compilation stage, perl compiles the regexp |
678 | into a compact sequence of opcodes that can often fit inside a |
679 | processor cache. When the code is executed, these opcodes can then run |
680 | at full throttle and search very quickly. |
681 | |
682 | =head2 Extracting matches |
683 | |
684 | The grouping metacharacters C<()> also serve another completely |
685 | different function: they allow the extraction of the parts of a string |
686 | that matched. This is very useful to find out what matched and for |
687 | text processing in general. For each grouping, the part that matched |
688 | inside goes into the special variables C<$1>, C<$2>, etc. They can be |
689 | used just as ordinary variables: |
690 | |
691 | # extract hours, minutes, seconds |
692 | $time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format |
693 | $hours = $1; |
694 | $minutes = $2; |
695 | $seconds = $3; |
696 | |
697 | Now, we know that in scalar context, |
698 | S<C<$time =~ /(\d\d):(\d\d):(\d\d)/> > returns a true or false |
699 | value. In list context, however, it returns the list of matched values |
700 | C<($1,$2,$3)>. So we could write the code more compactly as |
701 | |
702 | # extract hours, minutes, seconds |
703 | ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/); |
704 | |
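Because the list-context form returns an empty list on failure, the
assignment itself can serve as the test for success. The following is
only a minimal sketch; the sample value of C<$time> is an assumption
made for illustration:

```perl
use strict;
use warnings;

my $time = "12:34:56";    # sample input, assumed for illustration

# In list context the match returns ($1, $2, $3) on success and an
# empty list on failure, so the list assignment doubles as the test.
my ($hours, $minutes, $seconds);
if ( ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/) ) {
    print "hours=$hours minutes=$minutes seconds=$seconds\n";
}
else {
    print "no hh:mm:ss time found\n";
}
```

Testing the match this way avoids using stale values of C<$1>, C<$2>,
... left over from an earlier successful match.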
705 | If the groupings in a regexp are nested, C<$1> gets the group with the |
706 | leftmost opening parenthesis, C<$2> the next opening parenthesis, |
707 | etc. For example, here is a complex regexp and the matching variables |
708 | indicated below it: |
709 | |
710 | /(ab(cd|ef)((gi)|j))/; |
711 |  1  2      34 |
712 | |
a01268b5 |
713 | so that if the regexp matched, e.g., C<$2> would contain 'cd' or 'ef'. For |
714 | convenience, perl sets C<$+> to the string held by the highest numbered |
715 | C<$1>, C<$2>, ... that got assigned (and, somewhat related, C<$^N> to the |
716 | value of the C<$1>, C<$2>, ... most-recently assigned; i.e. the C<$1>, |
717 | C<$2>, ... associated with the rightmost closing parenthesis used in the |
718 | match). |
47f9c88b |
719 | |
720 | Closely associated with the matching variables C<$1>, C<$2>, ... are |
721 | the B<backreferences> C<\1>, C<\2>, ... . Backreferences are simply |
722 | matching variables that can be used I<inside> a regexp. This is a |
723 | really nice feature - what matches later in a regexp can depend on |
724 | what matched earlier in the regexp. Suppose we wanted to look |
725 | for doubled words in text, like 'the the'. The following regexp finds |
726 | all 3-letter doubles with a space in between: |
727 | |
728 | /(\w\w\w)\s\1/; |
729 | |
730 | The grouping assigns a value to \1, so that the same 3 letter sequence |
731 | is used for both parts. Here are some words with repeated parts: |
732 | |
733 | % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words |
734 | beriberi |
735 | booboo |
736 | coco |
737 | mama |
738 | murmur |
739 | papa |
740 | |
741 | The regexp has a single grouping which considers 4-letter |
742 | combinations, then 3-letter combinations, etc. and uses C<\1> to look for |
743 | a repeat. Although C<$1> and C<\1> represent the same thing, care should be |
744 | taken to use matched variables C<$1>, C<$2>, ... only outside a regexp |
745 | and backreferences C<\1>, C<\2>, ... only inside a regexp; not doing |
746 | so may lead to surprising and/or undefined results. |
747 | |
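The division of labor between C<\1> and C<$1> might be sketched as
follows. The sample text is an assumption, and the C<\b> word
boundaries are added here only to keep the match from starting or
ending in the middle of a longer word:

```perl
use strict;
use warnings;

my $text = "Paris in the the spring";    # sample text, assumed

# \1 (inside the regexp) must match whatever group 1 matched;
# $1 (outside the regexp) holds that same text afterwards.
my $doubled;
if ( $text =~ /\b(\w\w\w)\s\1\b/ ) {
    $doubled = $1;
    print "doubled word: $doubled\n";
}
```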
748 | In addition to what was matched, Perl 5.6.0 also provides the |
749 | positions of what was matched with the C<@-> and C<@+> |
750 | arrays. C<$-[0]> is the position of the start of the entire match and |
751 | C<$+[0]> is the position of the end. Similarly, C<$-[n]> is the |
752 | position of the start of the C<$n> match and C<$+[n]> is the position |
753 | of the end. If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>. Then |
754 | this code |
755 | |
756 | $x = "Mmm...donut, thought Homer"; |
757 | $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches |
758 | foreach $expr (1..$#-) { |
759 | print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n"; |
760 | } |
761 | |
762 | prints |
763 | |
764 | Match 1: 'Mmm' at position (0,3) |
765 | Match 2: 'donut' at position (6,11) |
766 | |
767 | Even if there are no groupings in a regexp, it is still possible to |
768 | find out what exactly matched in a string. If you use them, perl |
769 | will set C<$`> to the part of the string before the match, will set C<$&> |
770 | to the part of the string that matched, and will set C<$'> to the part |
771 | of the string after the match. An example: |
772 | |
773 | $x = "the cat caught the mouse"; |
774 | $x =~ /cat/; # $` = 'the ', $& = 'cat', $' = ' caught the mouse' |
775 | $x =~ /the/; # $` = '', $& = 'the', $' = ' cat caught the mouse' |
776 | |
777 | In the second match, S<C<$` = ''> > because the regexp matched at the |
778 | first character position in the string and stopped; it never saw the |
779 | second 'the'. It is important to note that using C<$`> and C<$'> |
a6b2f353 |
780 | slows down regexp matching quite a bit, and C< $& > slows it down to a |
47f9c88b |
781 | lesser extent, because if they are used in one regexp in a program, |
782 | they are generated for I<all> regexps in the program. So if raw |
783 | performance is a goal of your application, they should be avoided. |
784 | If you need them, use C<@-> and C<@+> instead: |
785 | |
786 | $` is the same as substr( $x, 0, $-[0] ) |
787 | $& is the same as substr( $x, $-[0], $+[0]-$-[0] ) |
788 | $' is the same as substr( $x, $+[0] ) |
789 | |
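As a short illustration of these equivalences, the first match of the
example above can be taken apart with C<substr> alone. This is a
sketch; the variable names C<$before>, C<$matched> and C<$after> are
made up for the purpose:

```perl
use strict;
use warnings;

my $x = "the cat caught the mouse";
my ($before, $matched, $after);

if ( $x =~ /cat/ ) {
    $before  = substr( $x, 0, $-[0] );               # stands in for $`
    $matched = substr( $x, $-[0], $+[0] - $-[0] );   # stands in for $&
    $after   = substr( $x, $+[0] );                  # stands in for $'
    print "before='$before' matched='$matched' after='$after'\n";
}
```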
790 | =head2 Matching repetitions |
791 | |
792 | The examples in the previous section display an annoying weakness. We |
793 | were only matching 3-letter words, or syllables of 4 letters or |
794 | less. We'd like to be able to match words or syllables of any length, |
795 | without writing out tedious alternatives like |
796 | C<\w\w\w\w|\w\w\w|\w\w|\w>. |
797 | |
798 | This is exactly the problem the B<quantifier> metacharacters C<?>, |
799 | C<*>, C<+>, and C<{}> were created for. They allow us to determine the |
800 | number of repeats of a portion of a regexp we consider to be a |
801 | match. Quantifiers are put immediately after the character, character |
802 | class, or grouping that we want to specify. They have the following |
803 | meanings: |
804 | |
805 | =over 4 |
806 | |
551e1d92 |
807 | =item * |
47f9c88b |
808 | |
551e1d92 |
809 | C<a?> = match 'a' 1 or 0 times |
47f9c88b |
810 | |
551e1d92 |
811 | =item * |
812 | |
813 | C<a*> = match 'a' 0 or more times, i.e., any number of times |
814 | |
815 | =item * |
47f9c88b |
816 | |
551e1d92 |
817 | C<a+> = match 'a' 1 or more times, i.e., at least once |
818 | |
819 | =item * |
820 | |
821 | C<a{n,m}> = match at least C<n> times, but not more than C<m> |
47f9c88b |
822 | times. |
823 | |
551e1d92 |
824 | =item * |
825 | |
826 | C<a{n,}> = match at least C<n> times |
827 | |
828 | =item * |
47f9c88b |
829 | |
551e1d92 |
830 | C<a{n}> = match exactly C<n> times |
47f9c88b |
831 | |
832 | =back |
833 | |
834 | Here are some examples: |
835 | |
836 | /[a-z]+\s+\d*/; # match a lowercase word, at least some space, and |
837 | # any number of digits |
838 | /(\w+)\s+\1/; # match doubled words of arbitrary length |
839 | /y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes' |
840 | $year =~ /\d{2,4}/; # make sure year is at least 2 but not more |
841 | # than 4 digits |
842 | $year =~ /\d{4}|\d{2}/; # better match; throw out 3 digit dates |
843 | $year =~ /\d{2}(\d{2})?/; # same thing written differently. However, |
844 | # this produces $1 and the other does not. |
845 | |
846 | % simple_grep '^(\w+)\1$' /usr/dict/words # isn't this easier? |
847 | beriberi |
848 | booboo |
849 | coco |
850 | mama |
851 | murmur |
852 | papa |
853 | |
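Note that the C<$year> regexps above are not anchored, so they succeed
whenever enough digits appear anywhere in the string. The following
sketch, using sample inputs chosen for illustration, compares the
unanchored form with one anchored by C<^> and C<$>:

```perl
use strict;
use warnings;

my @results;
for my $year ( '99', '1999', '199', '19999' ) {
    # unanchored: succeeds if 2 or 4 digits occur anywhere
    my $loose  = $year =~ /\d{4}|\d{2}/     ? 1 : 0;
    # anchored: the whole string must be exactly 2 or 4 digits
    my $strict = $year =~ /^(\d{4}|\d{2})$/ ? 1 : 0;
    push @results, "$year:$loose$strict";
    print "$year loose=$loose strict=$strict\n";
}
```

Only the anchored form rejects the 3-digit and 5-digit strings.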
854 | For all of these quantifiers, perl will try to match as much of the |
855 | string as possible, while still allowing the regexp to succeed. Thus |
856 | with C</a?.../>, perl will first try to match the regexp with the C<a> |
857 | present; if that fails, perl will try to match the regexp without the |
858 | C<a> present. For the quantifier C<*>, we get the following: |
859 | |
860 | $x = "the cat in the hat"; |
861 | $x =~ /^(.*)(cat)(.*)$/; # matches, |
862 | # $1 = 'the ' |
863 | # $2 = 'cat' |
864 | # $3 = ' in the hat' |
865 | |
866 | This is what we might expect: the match finds the only C<cat> in the |
867 | string and locks onto it. Consider, however, this regexp: |
868 | |
869 | $x =~ /^(.*)(at)(.*)$/; # matches, |
870 | # $1 = 'the cat in the h' |
871 | # $2 = 'at' |
872 | # $3 = '' (0 matches) |
873 | |
874 | One might initially guess that perl would find the C<at> in C<cat> and |
875 | stop there, but that wouldn't give the longest possible string to the |
876 | first quantifier C<.*>. Instead, the first quantifier C<.*> grabs as |
877 | much of the string as possible while still having the regexp match. In |
a6b2f353 |
878 | this example, that means matching the C<at> element against the final C<at> |
47f9c88b |
879 | in the string. The other important principle illustrated here is that |
880 | when there are two or more elements in a regexp, the I<leftmost> |
881 | quantifier, if there is one, gets to grab as much of the string as |
882 | possible, leaving the rest of the regexp to fight over scraps. Thus in |
883 | our example, the first quantifier C<.*> grabs most of the string, while |
884 | the second quantifier C<.*> gets the empty string. Quantifiers that |
885 | grab as much of the string as possible are called B<maximal match> or |
886 | B<greedy> quantifiers. |
887 | |
888 | When a regexp can match a string in several different ways, we can use |
889 | the principles above to predict which way the regexp will match: |
890 | |
891 | =over 4 |
892 | |
893 | =item * |
551e1d92 |
894 | |
47f9c88b |
895 | Principle 0: Taken as a whole, any regexp will be matched at the |
896 | earliest possible position in the string. |
897 | |
898 | =item * |
551e1d92 |
899 | |
47f9c88b |
900 | Principle 1: In an alternation C<a|b|c...>, the leftmost alternative |
901 | that allows a match for the whole regexp will be the one used. |
902 | |
903 | =item * |
551e1d92 |
904 | |
47f9c88b |
905 | Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and |
906 | C<{n,m}> will in general match as much of the string as possible while |
907 | still allowing the whole regexp to match. |
908 | |
909 | =item * |
551e1d92 |
910 | |
47f9c88b |
911 | Principle 3: If there are two or more elements in a regexp, the |
912 | leftmost greedy quantifier, if any, will match as much of the string |
913 | as possible while still allowing the whole regexp to match. The next |
914 | leftmost greedy quantifier, if any, will try to match as much of the |
915 | string remaining available to it as possible, while still allowing the |
916 | whole regexp to match. And so on, until all the regexp elements are |
917 | satisfied. |
918 | |
919 | =back |
920 | |
921 | As we have seen above, Principle 0 overrides the others - the regexp |
922 | will be matched as early as possible, with the other principles |
923 | determining how the regexp matches at that earliest character |
924 | position. |
925 | |
926 | Here is an example of these principles in action: |
927 | |
928 | $x = "The programming republic of Perl"; |
929 | $x =~ /^(.+)(e|r)(.*)$/; # matches, |
930 | # $1 = 'The programming republic of Pe' |
931 | # $2 = 'r' |
932 | # $3 = 'l' |
933 | |
934 | This regexp matches at the earliest string position, C<'T'>. One |
935 | might think that C<e>, being leftmost in the alternation, would be |
936 | matched, but C<r> produces the longest string in the first quantifier. |
937 | |
938 | $x =~ /(m{1,2})(.*)$/; # matches, |
939 | # $1 = 'mm' |
940 | # $2 = 'ing republic of Perl' |
941 | |
942 | Here, the earliest possible match is at the first C<'m'> in |
943 | C<programming>. C<m{1,2}> is the first quantifier, so it gets to match |
944 | a maximal C<mm>. |
945 | |
946 | $x =~ /.*(m{1,2})(.*)$/; # matches, |
947 | # $1 = 'm' |
948 | # $2 = 'ing republic of Perl' |
949 | |
950 | Here, the regexp matches at the start of the string. The first |
951 | quantifier C<.*> grabs as much as possible, leaving just a single |
952 | C<'m'> for the second quantifier C<m{1,2}>. |
953 | |
954 | $x =~ /(.?)(m{1,2})(.*)$/; # matches, |
955 | # $1 = 'a' |
956 | # $2 = 'mm' |
957 | # $3 = 'ing republic of Perl' |
958 | |
959 | Here, C<.?> eats its maximal one character at the earliest possible |
960 | position in the string, C<'a'> in C<programming>, leaving C<m{1,2}> |
961 | the opportunity to match both C<m>'s. Finally, |
962 | |
963 | "aXXXb" =~ /(X*)/; # matches with $1 = '' |
964 | |
965 | because it can match zero copies of C<'X'> at the beginning of the |
966 | string. If you definitely want to match at least one C<'X'>, use |
967 | C<X+>, not C<X*>. |
968 | |
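The principles can also be checked one at a time on a shorter string.
The sketch below uses a sample string and an array C<@seen>, both
chosen only for illustration:

```perl
use strict;
use warnings;

my @seen;

# Principle 0: 'cat' occurs earlier in the string than 'dog', so it
# wins even though 'dog' is the leftmost alternative.
"cats and dogs" =~ /(dog|cat)/ and push @seen, $1;

# Principle 1: at the same position, the leftmost alternative that
# lets the whole regexp match is used, here 'cat' rather than 'cats'.
"cats and dogs" =~ /(cat|cats)/ and push @seen, $1;

# Principles 2 and 3: the leftmost greedy quantifier takes all it can,
# leaving the second \w* with the empty string.
"cats and dogs" =~ /^(\w*)(\w*)/ and push @seen, $1, "[$2]";

print join(", ", @seen), "\n";
```

This should print C<cat, cat, cats, []>.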
969 | Sometimes greed is not good. At times, we would like quantifiers to |
970 | match a I<minimal> piece of string, rather than a maximal piece. For |
971 | this purpose, Larry Wall created the S<B<minimal match> > or |
972 | B<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>. These are |
973 | the usual quantifiers with a C<?> appended to them. They have the |
974 | following meanings: |
975 | |
976 | =over 4 |
977 | |
551e1d92 |
978 | =item * |
979 | |
980 | C<a??> = match 'a' 0 or 1 times. Try 0 first, then 1. |
47f9c88b |
981 | |
551e1d92 |
982 | =item * |
983 | |
984 | C<a*?> = match 'a' 0 or more times, i.e., any number of times, |
47f9c88b |
985 | but as few times as possible |
986 | |
551e1d92 |
987 | =item * |
988 | |
989 | C<a+?> = match 'a' 1 or more times, i.e., at least once, but |
47f9c88b |
990 | as few times as possible |
991 | |
551e1d92 |
992 | =item * |
993 | |
994 | C<a{n,m}?> = match at least C<n> times, not more than C<m> |
47f9c88b |
995 | times, as few times as possible |
996 | |
551e1d92 |
997 | =item * |
998 | |
999 | C<a{n,}?> = match at least C<n> times, but as few times as |
47f9c88b |
1000 | possible |
1001 | |
551e1d92 |
1002 | =item * |
1003 | |
1004 | C<a{n}?> = match exactly C<n> times. Because we match exactly |
47f9c88b |
1005 | C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for |
1006 | notational consistency. |
1007 | |
1008 | =back |
1009 | |
1010 | Let's look at the example above, but with minimal quantifiers: |
1011 | |
1012 | $x = "The programming republic of Perl"; |
1013 | $x =~ /^(.+?)(e|r)(.*)$/; # matches, |
1014 | # $1 = 'Th' |
1015 | # $2 = 'e' |
1016 | # $3 = ' programming republic of Perl' |
1017 | |
1018 | The minimal string that will allow both the start of the string C<^> |
1019 | and the alternation to match is C<Th>, with the alternation C<e|r> |
1020 | matching C<e>. The second quantifier C<.*> is free to gobble up the |
1021 | rest of the string. |
1022 | |
1023 | $x =~ /(m{1,2}?)(.*?)$/; # matches, |
1024 | # $1 = 'm' |
1025 | # $2 = 'ming republic of Perl' |
1026 | |
1027 | The first string position that this regexp can match is at the first |
1028 | C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?> |
1029 | matches just one C<'m'>. Although the second quantifier C<.*?> would |
1030 | prefer to match no characters, it is constrained by the end-of-string |
1031 | anchor C<$> to match the rest of the string. |
1032 | |
1033 | $x =~ /(.*?)(m{1,2}?)(.*)$/; # matches, |
1034 | # $1 = 'The progra' |
1035 | # $2 = 'm' |
1036 | # $3 = 'ming republic of Perl' |
1037 | |
1038 | In this regexp, you might expect the first minimal quantifier C<.*?> |
1039 | to match the empty string, because it is not constrained by a C<^> |
1040 | anchor to match the beginning of the word. Principle 0 applies here, |
1041 | however. Because it is possible for the whole regexp to match at the |
1042 | start of the string, it I<will> match at the start of the string. Thus |
1043 | the first quantifier has to match everything up to the first C<m>. The |
1044 | second minimal quantifier matches just one C<m> and the third |
1045 | quantifier matches the rest of the string. |
1046 | |
1047 | $x =~ /(.??)(m{1,2})(.*)$/; # matches, |
1048 | # $1 = 'a' |
1049 | # $2 = 'mm' |
1050 | # $3 = 'ing republic of Perl' |
1051 | |
1052 | Just as in the previous regexp, the first quantifier C<.??> can match |
1053 | earliest at position C<'a'>, so it does. The second quantifier is |
1054 | greedy, so it matches C<mm>, and the third matches the rest of the |
1055 | string. |
1056 | |
1057 | We can modify principle 3 above to take into account non-greedy |
1058 | quantifiers: |
1059 | |
1060 | =over 4 |
1061 | |
1062 | =item * |
551e1d92 |
1063 | |
47f9c88b |
1064 | Principle 3: If there are two or more elements in a regexp, the |
1065 | leftmost greedy (non-greedy) quantifier, if any, will match as much |
1066 | (little) of the string as possible while still allowing the whole |
1067 | regexp to match. The next leftmost greedy (non-greedy) quantifier, if |
1068 | any, will try to match as much (little) of the string remaining |
1069 | available to it as possible, while still allowing the whole regexp to |
1070 | match. And so on, until all the regexp elements are satisfied. |
1071 | |
1072 | =back |
1073 | |
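A common place where the difference between greedy and non-greedy
quantifiers shows up is text between delimiters. In this sketch the
sample string and the variable names are assumptions:

```perl
use strict;
use warnings;

my $x = 'say "hi" and "bye"';    # sample string, assumed

my ($greedy) = $x =~ /"(.*)"/;     # .* runs on to the last quote
my ($lazy)   = $x =~ /"(.*?)"/;    # .*? stops at the first closing quote

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Both regexps start matching at the first double quote (Principle 0);
they differ only in how much the quantifier takes after that.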
1074 | Just like alternation, quantifiers are also susceptible to |
1075 | backtracking. Here is a step-by-step analysis of the example |
1076 | |
1077 | $x = "the cat in the hat"; |
1078 | $x =~ /^(.*)(at)(.*)$/; # matches, |
1079 | # $1 = 'the cat in the h' |
1080 | # $2 = 'at' |
1081 | # $3 = '' (0 matches) |
1082 | |
1083 | =over 4 |
1084 | |
551e1d92 |
1085 | =item 0 |
1086 | |
1087 | Start with the first letter in the string 't'. |
47f9c88b |
1088 | |
551e1d92 |
1089 | =item 1 |
1090 | |
1091 | The first quantifier '.*' starts out by matching the whole |
47f9c88b |
1092 | string 'the cat in the hat'. |
1093 | |
551e1d92 |
1094 | =item 2 |
1095 | |
1096 | 'a' in the regexp element 'at' doesn't match the end of the |
47f9c88b |
1097 | string. Backtrack one character. |
1098 | |
551e1d92 |
1099 | =item 3 |
1100 | |
1101 | 'a' in the regexp element 'at' still doesn't match the last |
47f9c88b |
1102 | letter of the string 't', so backtrack one more character. |
1103 | |
551e1d92 |
1104 | =item 4 |
1105 | |
1106 | Now we can match the 'a' and the 't'. |
47f9c88b |
1107 | |
551e1d92 |
1108 | =item 5 |
1109 | |
1110 | Move on to the third element '.*'. Since we are at the end of |
47f9c88b |
1111 | the string and '.*' can match 0 times, assign it the empty string. |
1112 | |
551e1d92 |
1113 | =item 6 |
1114 | |
1115 | We are done! |
47f9c88b |
1116 | |
1117 | =back |
1118 | |
1119 | Most of the time, all this moving forward and backtracking happens |
1120 | quickly and searching is fast. There are some pathological regexps, |
1121 | however, whose execution time grows exponentially with the size of the |
1122 | string. A typical structure that blows up in your face is of the form |
1123 | |
1124 | /(a|b+)*/; |
1125 | |
1126 | The problem is the nested indeterminate quantifiers. There are many |
1127 | different ways of partitioning a string of length n between the C<+> |
1128 | and C<*>: one repetition with C<b+> of length n, two repetitions with |
1129 | the first C<b+> length k and the second with length n-k, m repetitions |
1130 | whose bits add up to length n, etc. In fact there are an exponential |
1131 | number of ways to partition a string as a function of length. A |
1132 | regexp may get lucky and match early in the process, but if there is |
1133 | no match, perl will try I<every> possibility before giving up. So be |
1134 | careful with nested C<*>'s, C<{n,m}>'s, and C<+>'s. The book |
1135 | I<Mastering Regular Expressions> by Jeffrey Friedl gives a wonderful |
1136 | discussion of this and other efficiency issues. |
1137 | |
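When the nesting is not really needed, one possible way around the
problem is to flatten the regexp so that only one quantifier is
involved. The sketch below assumes the anchored forms C<^(a|b+)*$> and
C<^[ab]*$>, which accept the same strings, and merely checks that they
agree on a few short samples; it does not attempt to measure timing:

```perl
use strict;
use warnings;

my $nested = qr/^(a|b+)*$/;    # nested quantifiers: + inside *
my $flat   = qr/^[ab]*$/;      # same set of strings, one quantifier

my $agree = 1;
for my $s ( '', 'ab', 'bbbab', 'bbbbbbbbbbbbc', 'abc' ) {
    my $n = $s =~ $nested ? 1 : 0;
    my $f = $s =~ $flat   ? 1 : 0;
    $agree = 0 if $n != $f;
    print "'$s': nested=$n flat=$f\n";
}
print $agree ? "the two regexps agree\n" : "the two regexps differ\n";
```

The flat form leaves only one way to divide up a run of C<b>'s, so a
failing match has far fewer possibilities to try.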
1138 | =head2 Building a regexp |
1139 | |
1140 | At this point, we have all the basic regexp concepts covered, so let's |
1141 | give a more involved example of a regular expression. We will build a |
1142 | regexp that matches numbers. |
1143 | |
1144 | The first task in building a regexp is to decide what we want to match |
1145 | and what we want to exclude. In our case, we want to match both |
1146 | integers and floating point numbers and we want to reject any string |
1147 | that isn't a number. |
1148 | |
1149 | The next task is to break the problem down into smaller problems that |
1150 | are easily converted into a regexp. |
1151 | |
1152 | The simplest case is integers. These consist of a sequence of digits, |
1153 | with an optional sign in front. The digits we can represent with |
1154 | C<\d+> and the sign can be matched with C<[+-]>. Thus the integer |
1155 | regexp is |
1156 | |
1157 | /[+-]?\d+/; # matches integers |
1158 | |
1159 | A floating point number potentially has a sign, an integral part, a |
1160 | decimal point, a fractional part, and an exponent. One or more of these |
1161 | parts is optional, so we need to check out the different |
1162 | possibilities. Floating point numbers which are in proper form include |
1163 | 123., 0.345, .34, -1e6, and 25.4E-72. As with integers, the sign out |
1164 | front is completely optional and can be matched by C<[+-]?>. We can |
1165 | see that if there is no exponent, floating point numbers must have a |
1166 | decimal point, otherwise they are integers. We might be tempted to |
1167 | model these with C<\d*\.\d*>, but this would also match just a single |
1168 | decimal point, which is not a number. So the three cases of floating |
1169 | point number sans exponent are |
1170 | |
1171 | /[+-]?\d+\./; # 1., 321., etc. |
1172 | /[+-]?\.\d+/; # .1, .234, etc. |
1173 | /[+-]?\d+\.\d+/; # 1.0, 30.56, etc. |
1174 | |
1175 | These can be combined into a single regexp with a three-way alternation: |
1176 | |
1177 | /[+-]?(\d+\.\d+|\d+\.|\.\d+)/; # floating point, no exponent |
1178 | |
1179 | In this alternation, it is important to put C<'\d+\.\d+'> before |
1180 | C<'\d+\.'>. If C<'\d+\.'> were first, the regexp would happily match that |
1181 | and ignore the fractional part of the number. |
1182 | |
1183 | Now consider floating point numbers with exponents. The key |
1184 | observation here is that I<both> integers and numbers with decimal |
1185 | points are allowed in front of an exponent. Then exponents, like the |
1186 | overall sign, are independent of whether we are matching numbers with |
1187 | or without decimal points, and can be 'decoupled' from the |
1188 | mantissa. The overall form of the regexp now becomes clear: |
1189 | |
1190 | /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/; |
1191 | |
1192 | The exponent is an C<e> or C<E>, followed by an integer. So the |
1193 | exponent regexp is |
1194 | |
1195 | /[eE][+-]?\d+/; # exponent |
1196 | |
1197 | Putting all the parts together, we get a regexp that matches numbers: |
1198 | |
1199 | /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/; # Ta da! |
1200 | |
1201 | Long regexps like this may impress your friends, but can be hard to |
1202 | decipher. In complex situations like this, the C<//x> modifier for a |
1203 | match is invaluable. It allows one to put nearly arbitrary whitespace |
1204 | and comments into a regexp without affecting its meaning. Using it, |
1205 | we can rewrite our 'extended' regexp in the more pleasing form |
1206 | |
1207 | /^ |
1208 | [+-]? # first, match an optional sign |
1209 | ( # then match integers or f.p. mantissas: |
1210 | \d+\.\d+ # mantissa of the form a.b |
1211 | |\d+\. # mantissa of the form a. |
1212 | |\.\d+ # mantissa of the form .b |
1213 | |\d+ # integer of the form a |
1214 | ) |
1215 | ([eE][+-]?\d+)? # finally, optionally match an exponent |
1216 | $/x; |
1217 | |
1218 | If whitespace is mostly irrelevant, how does one include space |
1219 | characters in an extended regexp? The answer is to backslash it |
1220 | S<C<'\ '> > or put it in a character class S<C<[ ]> >. The same thing |
1221 | goes for pound signs: use C<\#> or C<[#]>. For instance, Perl allows |
1222 | a space between the sign and the mantissa/integer, and we could add |
1223 | this to our regexp as follows: |
1224 | |
1225 | /^ |
1226 | [+-]?\ * # first, match an optional sign *and space* |
1227 | ( # then match integers or f.p. mantissas: |
1228 | \d+\.\d+ # mantissa of the form a.b |
1229 | |\d+\. # mantissa of the form a. |
1230 | |\.\d+ # mantissa of the form .b |
1231 | |\d+ # integer of the form a |
1232 | ) |
1233 | ([eE][+-]?\d+)? # finally, optionally match an exponent |
1234 | $/x; |
1235 | |
1236 | In this form, it is easier to see a way to simplify the |
1237 | alternation. Alternatives 1, 2, and 4 all start with C<\d+>, so it |
1238 | could be factored out: |
1239 | |
1240 | /^ |
1241 | [+-]?\ * # first, match an optional sign |
1242 | ( # then match integers or f.p. mantissas: |
1243 | \d+ # start out with a ... |
1244 | ( |
1245 | \.\d* # mantissa of the form a.b or a. |
1246 | )? # ? takes care of integers of the form a |
1247 | |\.\d+ # mantissa of the form .b |
1248 | ) |
1249 | ([eE][+-]?\d+)? # finally, optionally match an exponent |
1250 | $/x; |
1251 | |
1252 | or written in the compact form, |
1253 | |
1254 | /^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/; |
1255 | |
1256 | This is our final regexp. To recap, we built a regexp by |
1257 | |
1258 | =over 4 |
1259 | |
551e1d92 |
1260 | =item * |
1261 | |
1262 | specifying the task in detail, |
47f9c88b |
1263 | |
551e1d92 |
1264 | =item * |
1265 | |
1266 | breaking down the problem into smaller parts, |
1267 | |
1268 | =item * |
47f9c88b |
1269 | |
551e1d92 |
1270 | translating the small parts into regexps, |
47f9c88b |
1271 | |
551e1d92 |
1272 | =item * |
1273 | |
1274 | combining the regexps, |
1275 | |
1276 | =item * |
47f9c88b |
1277 | |
551e1d92 |
1278 | and optimizing the final combined regexp. |
47f9c88b |
1279 | |
1280 | =back |
1281 | |
1282 | These are also the typical steps involved in writing a computer |
1283 | program. This makes perfect sense, because regular expressions are |
1284 | essentially programs written in a little computer language that specifies |
1285 | patterns. |
1286 | |
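As with any program, it is worth testing the result. Here is a small
sketch that runs the final number regexp against a few candidate
strings; the candidates and the arrays C<@ok> and C<@bad> are chosen
only for illustration:

```perl
use strict;
use warnings;

# the final number regexp, in its compact form
my $number = qr/^[+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

my (@ok, @bad);
for my $s ( '123.', '0.345', '.34', '-1e6', '25.4E-72', '- 7',
            '.', '1e', 'abc' ) {
    if ( $s =~ $number ) { push @ok,  $s }
    else                 { push @bad, $s }
}
print "numbers:     ", join(', ', @ok),  "\n";
print "not numbers: ", join(', ', @bad), "\n";
```

The lone decimal point, the dangling exponent C<1e>, and C<abc> are
rejected; the rest are accepted.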
1287 | =head2 Using regular expressions in Perl |
1288 | |
1289 | The last topic of Part 1 briefly covers how regexps are used in Perl |
1290 | programs. Where do they fit into Perl syntax? |
1291 | |
1292 | We have already introduced the matching operator in its default |
1293 | C</regexp/> and arbitrary delimiter C<m!regexp!> forms. We have used |
1294 | the binding operator C<=~> and its negation C<!~> to test for string |
1295 | matches. Associated with the matching operator, we have discussed the |
1296 | single line C<//s>, multi-line C<//m>, case-insensitive C<//i> and |
1297 | extended C<//x> modifiers. |
1298 | |
1299 | There are a few more things you might want to know about matching |
1300 | operators. First, we pointed out earlier that variables in regexps are |
1301 | substituted before the regexp is evaluated: |
1302 | |
1303 | $pattern = 'Seuss'; |
1304 | while (<>) { |
1305 | print if /$pattern/; |
1306 | } |
1307 | |
1308 | This will print any lines containing the word C<Seuss>. It is not as |
1309 | efficient as it could be, however, because perl has to re-evaluate |
1310 | C<$pattern> each time through the loop. If C<$pattern> won't be |
1311 | changing over the lifetime of the script, we can add the C<//o> |
1312 | modifier, which directs perl to only perform variable substitutions |
1313 | once: |
1314 | |
1315 | #!/usr/bin/perl |
1316 | # Improved simple_grep |
1317 | $regexp = shift; |
1318 | while (<>) { |
1319 | print if /$regexp/o; # a good deal faster |
1320 | } |
1321 | |
1322 | If you change the variable after the first substitution happens, perl |
1323 | will ignore it. If you don't want any substitutions at all, use the |
1324 | special delimiter C<m''>: |
1325 | |
1326 | $pattern = 'Seuss'; |
1327 | while (<>) { |
1328 | print if m'$pattern'; # matches '$pattern', not 'Seuss' |
1329 | } |
1330 | |
1331 | C<m''> acts like single quotes on a regexp; all other C<m> delimiters |
1332 | act like double quotes. If the regexp evaluates to the empty string, |
1333 | the regexp in the I<last successful match> is used instead. So we have |
1334 | |
1335 | "dog" =~ /d/; # 'd' matches |
1336 | "dogbert" =~ //; # this matches the 'd' regexp used before |
1337 | |
1338 | The final two modifiers C<//g> and C<//c> concern multiple matches. |
1339 | The modifier C<//g> stands for global matching and allows the |
1340 | matching operator to match within a string as many times as possible. |
1341 | In scalar context, successive invocations against a string will have |
1342 | C<//g> jump from match to match, keeping track of position in the |
1343 | string as it goes along. You can get or set the position with the |
1344 | C<pos()> function. |
1345 | |
1346 | The use of C<//g> is shown in the following example. Suppose we have |
1347 | a string that consists of words separated by spaces. If we know how |
1348 | many words there are in advance, we could extract the words using |
1349 | groupings: |
1350 | |
1351 | $x = "cat dog house"; # 3 words |
1352 | $x =~ /^\s*(\w+)\s+(\w+)\s+(\w+)\s*$/; # matches, |
1353 | # $1 = 'cat' |
1354 | # $2 = 'dog' |
1355 | # $3 = 'house' |
1356 | |
1357 | But what if we had an indeterminate number of words? This is the sort |
1358 | of task C<//g> was made for. To extract all words, form the simple |
1359 | regexp C<(\w+)> and loop over all matches with C</(\w+)/g>: |
1360 | |
1361 | while ($x =~ /(\w+)/g) { |
1362 | print "Word is $1, ends at position ", pos $x, "\n"; |
1363 | } |
1364 | |
1365 | prints |
1366 | |
1367 | Word is cat, ends at position 3 |
1368 | Word is dog, ends at position 7 |
1369 | Word is house, ends at position 13 |
1370 | |
1371 | A failed match or changing the target string resets the position. If |
1372 | you don't want the position reset after failure to match, add the |
1373 | C<//c> modifier, as in C</regexp/gc>. The current position in the string is |
1374 | associated with the string, not the regexp. This means that different |
1375 | strings have different positions and their respective positions can be |
1376 | set or read independently. |
1377 | |
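The effect of C<//c> on the position can be seen in a small sketch.
The variable names are made up for illustration, and each match is
assigned to a scalar to force scalar context:

```perl
use strict;
use warnings;

my $x = "cat dog house";
my $found;

$found = $x =~ /\w+/g;        # scalar context: matches 'cat'
my $after_match = pos($x);    # position just past 'cat'

$found = $x =~ /\d+/gc;       # fails, but //c keeps the position
my $after_gc = pos($x);

$found = $x =~ /\d+/g;        # fails without //c: position is reset
my $after_g = defined( pos($x) ) ? pos($x) : 'undef';

print "$after_match $after_gc $after_g\n";
```

This should print C<3 3 undef>: the failed C<//gc> match leaves the
position at 3, while the failed C<//g> match resets it.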
1378 | In list context, C<//g> returns a list of matched groupings, or if |
1379 | there are no groupings, a list of matches to the whole regexp. So if |
1380 | we wanted just the words, we could use |
1381 | |
1382 | @words = ($x =~ /(\w+)/g); # matches, |
1383 | # $words[0] = 'cat' |
1384 | # $words[1] = 'dog' |
1385 | # $words[2] = 'house' |
1386 | |
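When the regexp has more than one grouping, each match contributes one
list element per grouping. A sketch, assuming a made-up string of
C<name=value> pairs:

```perl
use strict;
use warnings;

my $conf = "width=80 height=24 depth=8";    # sample input, assumed

# With two groupings, every match adds two elements to the list, so
# the flat list can be assigned straight to a hash.
my %setting = $conf =~ /(\w+)=(\d+)/g;

print "$_ => $setting{$_}\n" for sort keys %setting;
```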
1387 | Closely associated with the C<//g> modifier is the C<\G> anchor. The |
1388 | C<\G> anchor matches at the point where the previous C<//g> match left |
1389 | off. C<\G> allows us to easily do context-sensitive matching: |
1390 | |
1391 | $metric = 1; # use metric units |
1392 | ... |
1393 | $x = <FILE>; # read in measurement |
1394 | $x =~ /^([+-]?\d+)\s*/g; # get magnitude |
1395 | $weight = $1; |
1396 | if ($metric) { # error checking |
1397 | print "Units error!" unless $x =~ /\Gkg\./g; |
1398 | } |
1399 | else { |
1400 | print "Units error!" unless $x =~ /\Glbs\./g; |
1401 | } |
1402 | $x =~ /\G\s+(widget|sprocket)/g; # continue processing |
1403 | |
1404 | The combination of C<//g> and C<\G> allows us to process the string a |
1405 | bit at a time and use arbitrary Perl logic to decide what to do next. |
1406 | |
1407 | C<\G> is also invaluable in processing fixed length records with |
1408 | regexps. Suppose we have a snippet of coding region DNA, encoded as |
1409 | base pair letters C<ATCGTTGAAT...> and we want to find all the stop |
1410 | codons C<TGA>. In a coding region, codons are 3-letter sequences, so |
1411 | we can think of the DNA snippet as a sequence of 3-letter records. The |
1412 | naive regexp |
1413 | |
1414 | # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC" |
1415 | $dna = "ATCGTTGAATGCAAATGACATGAC"; |
1416 | $dna =~ /TGA/; |
1417 | |
1418 | doesn't work; it may match a C<TGA>, but there is no guarantee that |
1419 | the match is aligned with codon boundaries, e.g., the substring |
1420 | S<C<GTT GAA> > gives a match. A better solution is |
1421 | |
1422 | while ($dna =~ /(\w\w\w)*?TGA/g) { # note the minimal *? |
1423 | print "Got a TGA stop codon at position ", pos $dna, "\n"; |
1424 | } |
1425 | |
1426 | which prints |
1427 | |
1428 | Got a TGA stop codon at position 18 |
1429 | Got a TGA stop codon at position 23 |
1430 | |
1431 | Position 18 is good, but position 23 is bogus. What happened? |
1432 | |
1433 | The answer is that our regexp works well until we get past the last |
1434 | real match. Then the regexp will fail to match a synchronized C<TGA> |
1435 | and start stepping ahead one character position at a time, not what we |
1436 | want. The solution is to use C<\G> to anchor the match to the codon |
1437 | alignment: |
1438 | |
1439 | while ($dna =~ /\G(\w\w\w)*?TGA/g) { |
1440 | print "Got a TGA stop codon at position ", pos $dna, "\n"; |
1441 | } |
1442 | |
1443 | This prints |
1444 | |
1445 | Got a TGA stop codon at position 18 |
1446 | |
1447 | which is the correct answer. This example illustrates that it is |
1448 | important not only to match what is desired, but to reject what is not |
1449 | desired. |
1450 | |
1451 | B<search and replace> |
1452 | |
1453 | Regular expressions also play a big role in B<search and replace> |
1454 | operations in Perl. Search and replace is accomplished with the |
1455 | C<s///> operator. The general form is |
1456 | C<s/regexp/replacement/modifiers>, with everything we know about |
1457 | regexps and modifiers applying in this case as well. The |
1458 | C<replacement> is a Perl double quoted string that replaces in the |
1459 | string whatever is matched with the C<regexp>. The operator C<=~> is |
1460 | also used here to associate a string with C<s///>. If matching |
1461 | against C<$_>, the S<C<$_ =~> > can be dropped. If there is a match, |
1462 | C<s///> returns the number of substitutions made, otherwise it returns |
1463 | false. Here are a few examples: |
1464 | |
1465 | $x = "Time to feed the cat!"; |
1466 | $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!" |
1467 | if ($x =~ s/^(Time.*hacker)!$/$1 now!/) { |
1468 | $more_insistent = 1; |
1469 | } |
1470 | $y = "'quoted words'"; |
1471 | $y =~ s/^'(.*)'$/$1/; # strip single quotes, |
1472 | # $y contains "quoted words" |
1473 | |
1474 | In the last example, the whole string was matched, but only the part |
1475 | inside the single quotes was grouped. With the C<s///> operator, the |
1476 | matched variables C<$1>, C<$2>, etc. are immediately available for use |
1477 | in the replacement expression, so we use C<$1> to replace the quoted |
1478 | string with just what was quoted. With the global modifier, C<s///g> |
1479 | will search and replace all occurrences of the regexp in the string: |
1480 | |
1481 | $x = "I batted 4 for 4"; |
1482 | $x =~ s/4/four/; # doesn't do it all: |
1483 | # $x contains "I batted four for 4" |
1484 | $x = "I batted 4 for 4"; |
1485 | $x =~ s/4/four/g; # does it all: |
1486 | # $x contains "I batted four for four" |
1487 | |
1488 | If you prefer 'regex' over 'regexp' in this tutorial, you could use |
1489 | the following program to replace it: |
1490 | |
1491 | % cat > simple_replace |
1492 | #!/usr/bin/perl |
1493 | $regexp = shift; |
1494 | $replacement = shift; |
1495 | while (<>) { |
1496 | s/$regexp/$replacement/go; |
1497 | print; |
1498 | } |
1499 | ^D |
1500 | |
1501 | % simple_replace regexp regex perlretut.pod |
1502 | |
1503 | In C<simple_replace> we used the C<s///g> modifier to replace all |
1504 | occurrences of the regexp on each line and the C<s///o> modifier to |
1505 | compile the regexp only once. As with C<simple_grep>, both the |
1506 | C<print> and the C<s/$regexp/$replacement/go> use C<$_> implicitly. |
1507 | |
1508 | A modifier available specifically to search and replace is the |
1509 | C<s///e> evaluation modifier. C<s///e> treats the replacement as Perl |
1510 | code rather than a string, and the value it returns is substituted for |
1511 | the matched substring. C<s///e> is useful if you need to do a bit of |
1512 | computation in the process of replacing text. This example counts |
1513 | character frequencies in a line: |
1514 | |
1515 | $x = "Bill the cat"; |
1516 | $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself |
1517 | print "frequency of '$_' is $chars{$_}\n" |
1518 | foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars); |
1519 | |
1520 | This prints |
1521 | |
1522 | frequency of ' ' is 2 |
1523 | frequency of 't' is 2 |
1524 | frequency of 'l' is 2 |
1525 | frequency of 'B' is 1 |
1526 | frequency of 'c' is 1 |
1527 | frequency of 'e' is 1 |
1528 | frequency of 'h' is 1 |
1529 | frequency of 'i' is 1 |
1530 | frequency of 'a' is 1 |
1531 | |
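As one more illustration of C<s///e>, here is a sketch that does arithmetic in the replacement. The sample text and the variable names are invented for this example, and braces are used as delimiters so that the division operator needs no escaping. It should print C<High 30C, low 10C>.

```perl
use strict;
use warnings;

my $report = "High 86F, low 50F";    # made-up sample text

# Copy $report, then rewrite each Fahrenheit reading in the copy.
# The replacement is evaluated as Perl code for every match.
(my $metric = $report) =~ s{(\d+)F}{ int(($1 - 32) * 5 / 9) . "C" }eg;

print "$metric\n";
```
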
1532 | As with the match C<m//> operator, C<s///> can use other delimiters, |
1533 | such as C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are |
1534 | used, as in C<s'''>, then the regexp and replacement are treated as |
1535 | single-quoted strings with no interpolation. C<s///> in list context |
1536 | returns the same thing as in scalar context, i.e., the number of |
1537 | matches. |
1538 | |
1539 | B<The split operator> |
1540 | |
1541 | The B<C<split> > function can also optionally use a matching operator |
1542 | C<m//> to split a string. C<split /regexp/, string, limit> splits |
1543 | C<string> into a list of substrings and returns that list. The regexp |
1544 | is used to match the character sequence that the C<string> is split |
1545 | with respect to. The C<limit>, if present, constrains splitting into |
1546 | no more than C<limit> number of strings. For example, to split a |
1547 | string into words, use |
1548 | |
1549 | $x = "Calvin and Hobbes"; |
1550 | @words = split /\s+/, $x; # $words[0] = 'Calvin' |
1551 | # $words[1] = 'and' |
1552 | # $words[2] = 'Hobbes' |
1553 | |
1554 | If the empty regexp C<//> is used, the regexp always matches and |
1555 | the string is split into individual characters. If the regexp has |
1556 | groupings, then the list produced contains the matched substrings from the |
1557 | groupings as well. For instance, |
1558 | |
1559 | $x = "/usr/bin/perl"; |
1560 | @dirs = split m!/!, $x; # $dirs[0] = '' |
1561 | # $dirs[1] = 'usr' |
1562 | # $dirs[2] = 'bin' |
1563 | # $dirs[3] = 'perl' |
1564 | @parts = split m!(/)!, $x; # $parts[0] = '' |
1565 | # $parts[1] = '/' |
1566 | # $parts[2] = 'usr' |
1567 | # $parts[3] = '/' |
1568 | # $parts[4] = 'bin' |
1569 | # $parts[5] = '/' |
1570 | # $parts[6] = 'perl' |
1571 | |
1572 | Since the first character of C<$x> matched the regexp, C<split> prepended |
1573 | an empty initial element to the list. |
1574 | |
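The C<limit> argument and the empty regexp are easy to try out. The record below is a made-up example line in the style of a password file; with a limit of 3, the last element keeps the unsplit remainder of the string.

```perl
use strict;
use warnings;

my $record = "root:x:0:0:System Administrator:/root:/bin/sh";   # sample data

# Split into at most three fields; the third holds everything left over.
my @head = split /:/, $record, 3;
print scalar(@head), " fields, last is '$head[2]'\n";

# The empty regexp splits the string into individual characters.
my @chars = split //, "perl";
print join('-', @chars), "\n";    # p-e-r-l
```
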
1575 | If you have read this far, congratulations! You now have all the basic |
1576 | tools needed to use regular expressions to solve a wide range of text |
1577 | processing problems. If this is your first time through the tutorial, |
1578 | why not stop here and play around with regexps a while... S<Part 2> |
1579 | concerns the more esoteric aspects of regular expressions and those |
1580 | concepts certainly aren't needed right at the start. |
1581 | |
1582 | =head1 Part 2: Power tools |
1583 | |
1584 | OK, you know the basics of regexps and you want to know more. If |
1585 | matching regular expressions is analogous to a walk in the woods, then |
1586 | the tools discussed in Part 1 are analogous to topo maps and a |
1587 | compass, basic tools we use all the time. Most of the tools in Part 2 |
1588 | are analogous to flare guns and satellite phones. They aren't used |
1589 | too often on a hike, but when we are stuck, they can be invaluable. |
1590 | |
1591 | What follows are the more advanced, less used, or sometimes esoteric |
1592 | capabilities of perl regexps. In Part 2, we will assume you are |
1593 | comfortable with the basics and concentrate on the new features. |
1594 | |
1595 | =head2 More on characters, strings, and character classes |
1596 | |
1597 | There are a number of escape sequences and character classes that we |
1598 | haven't covered yet. |
1599 | |
1600 | There are several escape sequences that convert characters or strings |
1601 | between upper and lower case. C<\l> and C<\u> convert the next |
1602 | character to lower or upper case, respectively: |
1603 | |
1604 | $x = "perl"; |
1605 | $string =~ /\u$x/; # matches 'Perl' in $string |
1606 | $x = "M(rs?|s)\\."; # note the double backslash |
1607 | $string =~ /\l$x/; # matches 'mr.', 'mrs.', and 'ms.' |
1608 | |
1609 | C<\L> and C<\U> convert a whole substring, delimited by C<\L> or |
1610 | C<\U> and C<\E>, to lower or upper case: |
1611 | |
1612 | $x = "This word is in lower case:\L SHOUT\E"; |
1613 | $x =~ /shout/; # matches |
1614 | $x = "I STILL KEYPUNCH CARDS FOR MY 360"; |
1615 | $x =~ /\Ukeypunch/; # matches punch card string |
1616 | |
1617 | If there is no C<\E>, case is converted until the end of the |
1618 | string. The regexps C<\L\u$word> or C<\u\L$word> convert the first |
1619 | character of C<$word> to uppercase and the rest of the characters to |
1620 | lowercase. |
1621 | |
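A short sketch of the C<\u\L> combination, using invented sample names. The escapes behave the same way in a double-quoted string and in a regexp, so both uses are shown.

```perl
use strict;
use warnings;

my @names = ("aLICE", "BOB", "carol");    # made-up sample data

# \L lowercases what follows; \u then uppercases the first character.
my @tidy = map { "\u\L$_" } @names;
print "@tidy\n";    # Alice Bob Carol

# The same escapes apply to a variable interpolated into a regexp.
my $word  = "pERL";
my $found = "I like Perl" =~ /\u\L$word/ ? 1 : 0;
print "found: $found\n";
```
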
1622 | Control characters can be escaped with C<\c>, so that a control-Z |
1623 | character would be matched with C<\cZ>. The escape sequence |
1624 | C<\Q>...C<\E> quotes, or protects most non-alphabetic characters. For |
1625 | instance, |
1626 | |
1627 | $x = "That !^*&%~& cat!"; |
1628 | $x =~ /\Q!^*&%~&\E/; # check for rough language |
1629 | |
1630 | It does not protect C<$> or C<@>, so that variables can still be |
1631 | substituted. |
1632 | |
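Because variables are still interpolated, C<\Q> is most often used to match the contents of a variable literally. In this sketch the strings and variable names are hypothetical; C<$price> stands in for text that arrives from outside the program.

```perl
use strict;
use warnings;

my $price = '$9.99 (sale)';                      # hypothetical user-supplied text
my $line  = 'Widget: $9.99 (sale) today only';

# Unquoted, the '$', '.', '(' and ')' inside $price act as metacharacters.
my $naive = $line =~ /$price/ ? 1 : 0;

# With \Q...\E, $price is interpolated and then its special characters
# are escaped, so the text is matched literally.
my $quoted = $line =~ /\Q$price\E/ ? 1 : 0;

print "naive: $naive, quoted: $quoted\n";    # naive: 0, quoted: 1
```
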
1633 | With the advent of 5.6.0, perl regexps can handle more than just the |
1634 | standard ASCII character set. Perl now supports B<Unicode>, a standard |
1635 | for encoding the character sets from many of the world's written |
1636 | languages. Unicode does this by allowing characters to be more than |
1637 | one byte wide. Perl uses the UTF-8 encoding, in which ASCII characters |
1638 | are still encoded as one byte, but characters greater than C<chr(127)> |
1639 | may be stored as two or more bytes. |
1640 | |
1641 | What does this mean for regexps? Well, regexp users don't need to know |
1642 | much about perl's internal representation of strings. But they do need |
1643 | to know 1) how to represent Unicode characters in a regexp and 2) when |
1644 | a matching operation will treat the string to be searched as a |
1645 | sequence of bytes (the old way) or as a sequence of Unicode characters |
1646 | (the new way). The answer to 1) is that Unicode characters greater |
1647 | than C<chr(127)> may be represented using the C<\x{hex}> notation, |
1648 | with C<hex> a hexadecimal integer: |
1649 | |
1650 | use utf8; # We will be doing Unicode processing |
1651 | /\x{263a}/; # match a Unicode smiley face :) |
1652 | |
1653 | Unicode characters in the range of 128-255 use two hexadecimal digits |
1654 | with braces: C<\x{ab}>. Note that this is different from C<\xab>, |
1655 | which is just a hexadecimal byte with no Unicode |
1656 | significance. |
1657 | |
1658 | Figuring out the hexadecimal sequence of a Unicode character you want |
1659 | or deciphering someone else's hexadecimal Unicode regexp is about as |
1660 | much fun as programming in machine code. So another way to specify |
1661 | Unicode characters is to use the S<B<named character> > escape |
1662 | sequence C<\N{name}>. C<name> is a name for the Unicode character, as |
55eda711 |
1663 | specified in the Unicode standard. For instance, if we wanted to |
1664 | represent or match the astrological sign for the planet Mercury, we |
1665 | could use |
47f9c88b |
1666 | |
1667 | use utf8; # We will be doing Unicode processing |
1668 | use charnames ":full"; # use named chars with Unicode full names |
1669 | $x = "abc\N{MERCURY}def"; |
1670 | $x =~ /\N{MERCURY}/; # matches |
1671 | |
1672 | One can also use short names or restrict names to a certain alphabet: |
1673 | |
1674 | use utf8; # We will be doing Unicode processing |
1675 | |
1676 | use charnames ':full'; |
1677 | print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n"; |
1678 | |
1679 | use charnames ":short"; |
1680 | print "\N{greek:Sigma} is an upper-case sigma.\n"; |
1681 | |
1682 | use charnames qw(greek); |
1683 | print "\N{sigma} is Greek sigma\n"; |
1684 | |
1685 | A list of full names is found in the file Names.txt in the |
55d7b906 |
1686 | lib/perl5/5.X.X/unicore directory. |
47f9c88b |
1687 | |
1688 | The answer to requirement 2), as of 5.6.0, is that if a regexp |
1689 | contains Unicode characters, the string is searched as a sequence of |
1690 | Unicode characters. Otherwise, the string is searched as a sequence of |
1691 | bytes. If the string is being searched as a sequence of Unicode |
1692 | characters, but matching a single byte is required, we can use the C<\C> |
1693 | escape sequence. C<\C> is a character class akin to C<.> except that |
1694 | it matches I<any> byte 0-255. So |
1695 | |
1696 | use utf8; # We will be doing Unicode processing |
1697 | use charnames ":full"; # use named chars with Unicode full names |
1698 | $x = "a"; |
1699 | $x =~ /\C/; # matches 'a', eats one byte |
1700 | $x = ""; |
1701 | $x =~ /\C/; # doesn't match, no bytes to match |
1702 | $x = "\N{MERCURY}"; # three-byte Unicode character (in UTF-8) |
1703 | $x =~ /\C/; # matches, but dangerous! |
1704 | |
1705 | The last regexp matches, but is dangerous because the string |
a6b2f353 |
1706 | I<character> position is no longer synchronized to the string I<byte> |
47f9c88b |
1707 | position. This generates the warning 'Malformed UTF-8 |
1708 | character'. C<\C> is best used for matching the binary data in strings |
1709 | with binary data intermixed with Unicode characters. |
1710 | |
1711 | Let us now discuss the rest of the character classes. Just as with |
1712 | Unicode characters, there are named Unicode character classes |
1713 | represented by the C<\p{name}> escape sequence. Closely associated is |
1714 | the C<\P{name}> character class, which is the negation of the |
1715 | C<\p{name}> class. For example, to match lower and uppercase |
1716 | characters, |
1717 | |
1718 | use utf8; # We will be doing Unicode processing |
1719 | use charnames ":full"; # use named chars with Unicode full names |
1720 | $x = "BOB"; |
1721 | $x =~ /^\p{IsUpper}/; # matches, uppercase char class |
1722 | $x =~ /^\P{IsUpper}/; # doesn't match, char class sans uppercase |
1723 | $x =~ /^\p{IsLower}/; # doesn't match, lowercase char class |
1724 | $x =~ /^\P{IsLower}/; # matches, char class sans lowercase |
1725 | |
86929931 |
1726 | Here is the association between some Perl named classes and the |
1727 | traditional Unicode classes: |
47f9c88b |
1728 | |
86929931 |
1729 | Perl class name Unicode class name or regular expression |
47f9c88b |
1730 | |
f5868911 |
1731 | IsAlpha /^[LM]/ |
1732 | IsAlnum /^[LMN]/ |
1733 | IsASCII $code <= 127 |
1734 | IsCntrl /^C/ |
1735 | IsBlank $code =~ /^(0020|0009)$/ || /^Z[^lp]/ |
47f9c88b |
1736 | IsDigit Nd |
f5868911 |
1737 | IsGraph /^([LMNPS]|Co)/ |
47f9c88b |
1738 | IsLower Ll |
f5868911 |
1739 | IsPrint /^([LMNPS]|Co|Zs)/ |
1740 | IsPunct /^P/ |
1741 | IsSpace /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/) |
1742 | IsSpacePerl /^Z/ || ($code =~ /^(0009|000A|000C|000D)$/) |
1743 | IsUpper /^L[ut]/ |
1744 | IsWord /^[LMN]/ || $code eq "005F" |
47f9c88b |
1745 | IsXDigit $code =~ /^00(3[0-9]|[46][1-6])$/ |
1746 | |
86929931 |
1747 | You can also use the official Unicode class names with the C<\p> and |
1748 | C<\P>, like C<\p{L}> for Unicode 'letters', or C<\p{Lu}> for uppercase |
1749 | letters, or C<\P{Nd}> for non-digits. If a C<name> is just one |
1750 | letter, the braces can be dropped. For instance, C<\pM> is the |
98f22ffc |
1751 | character class of Unicode 'marks', for example accent marks. |
32293815 |
1752 | For the full list see L<perlunicode>. |
1753 | |
5e42d7b4 |
1754 | Unicode has also been separated into various sets of characters |
1755 | which you can test with C<\p{In...}> (in) and C<\P{In...}> (not in), |
1756 | for example C<\p{InLatin}>, C<\p{InGreek}>, or C<\P{InKatakana}>. |
1757 | For the full list see L<perlunicode>. |
47f9c88b |
1758 | |
1759 | C<\X> is an abbreviation for a character class sequence that includes |
1760 | the Unicode 'combining character sequences'. A 'combining character |
1761 | sequence' is a base character followed by any number of combining |
1762 | characters. An example of a combining character is an accent. Using |
1763 | the Unicode full names, e.g., S<C<A + COMBINING RING> > is a combining |
1764 | character sequence with base character C<A> and combining character |
1765 | S<C<COMBINING RING> >, which translates in Danish to A with the circle |
1766 | atop it, as in the word Angstrom. C<\X> is equivalent to C<\PM\pM*>, |
1767 | i.e., a non-mark followed by zero or more marks. |
1768 | |
5e42d7b4 |
1769 | For the full and latest information about Unicode see the latest |
1770 | Unicode standard, or the Unicode Consortium's website http://www.unicode.org/ |
1771 | |
47f9c88b |
1772 | As if all those classes weren't enough, Perl also defines POSIX style |
1773 | character classes. These have the form C<[:name:]>, with C<name> the |
aaa51d5e |
1774 | name of the POSIX class. The POSIX classes are C<alpha>, C<alnum>, |
1775 | C<ascii>, C<cntrl>, C<digit>, C<graph>, C<lower>, C<print>, C<punct>, |
1776 | C<space>, C<upper>, and C<xdigit>, and two extensions, C<word> (a Perl |
1777 | extension to match C<\w>), and C<blank> (a GNU extension). If C<utf8> |
1778 | is being used, then these classes are defined the same as their |
1779 | corresponding perl Unicode classes: C<[:upper:]> is the same as |
1780 | C<\p{IsUpper}>, etc. The POSIX character classes, however, don't |
1781 | require using C<utf8>. The C<[:digit:]>, C<[:word:]>, and |
47f9c88b |
1782 | C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s> |
aaa51d5e |
1783 | character classes. To negate a POSIX class, put a C<^> in front of |
1784 | the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under |
47f9c88b |
1785 | C<utf8>, C<\P{IsDigit}>. The Unicode classes can be used just like |
1786 | C<\d>; the POSIX classes are valid only inside a bracketed class: |
1787 | |
1788 | /\s+[abc[:digit:]xyz]\s*/; # match a,b,c,x,y,z, or a digit |
1789 | /^=item\s[[:digit:]]/; # match '=item', |
1790 | # followed by a space and a digit |
1791 | use utf8; |
1792 | use charnames ":full"; |
1793 | /\s+[abc\p{IsDigit}xyz]\s+/; # match a,b,c,x,y,z, or a digit |
1794 | /^=item\s\p{IsDigit}/; # match '=item', |
1795 | # followed by a space and a digit |
1796 | |
1797 | Whew! That is all the rest of the characters and character classes. |
1798 | |
1799 | =head2 Compiling and saving regular expressions |
1800 | |
1801 | In Part 1 we discussed the C<//o> modifier, which compiles a regexp |
1802 | just once. This suggests that a compiled regexp is some data structure |
1803 | that can be stored once and used again and again. The regexp quote |
1804 | C<qr//> does exactly that: C<qr/string/> compiles the C<string> as a |
1805 | regexp and transforms the result into a form that can be assigned to a |
1806 | variable: |
1807 | |
1808 | $reg = qr/foo+bar?/; # reg contains a compiled regexp |
1809 | |
1810 | Then C<$reg> can be used as a regexp: |
1811 | |
1812 | $x = "fooooba"; |
1813 | $x =~ $reg; # matches, just like /foo+bar?/ |
1814 | $x =~ /$reg/; # same thing, alternate form |
1815 | |
1816 | C<$reg> can also be interpolated into a larger regexp: |
1817 | |
1818 | $x =~ /(abc)?$reg/; # still matches |
1819 | |
1820 | As with the matching operator, the regexp quote can use different |
1821 | delimiters, e.g., C<qr!!>, C<qr{}> and C<qr~~>. The single quote |
1822 | delimiters C<qr''> prevent any interpolation from taking place. |
1823 | |
1824 | Pre-compiled regexps are useful for creating dynamic matches that |
1825 | don't need to be recompiled each time they are encountered. Using |
1826 | pre-compiled regexps, the C<simple_grep> program can be expanded into a |
1827 | program that matches multiple patterns: |
1828 | |
1829 | % cat > multi_grep |
1830 | #!/usr/bin/perl |
1831 | # multi_grep - match any of <number> regexps |
1832 | # usage: multi_grep <number> regexp1 regexp2 ... file1 file2 ... |
1833 | |
1834 | $number = shift; |
1835 | $regexp[$_] = shift foreach (0..$number-1); |
1836 | @compiled = map qr/$_/, @regexp; |
1837 | while ($line = <>) { |
1838 | foreach $pattern (@compiled) { |
1839 | if ($line =~ /$pattern/) { |
1840 | print $line; |
1841 | last; # we matched, so move on to the next line |
1842 | } |
1843 | } |
1844 | } |
1845 | ^D |
1846 | |
1847 | % multi_grep 2 last for multi_grep |
1848 | $regexp[$_] = shift foreach (0..$number-1); |
1849 | foreach $pattern (@compiled) { |
1850 | last; |
1851 | |
1852 | Storing pre-compiled regexps in an array C<@compiled> allows us to |
1853 | simply loop through the regexps without any recompilation, thus gaining |
1854 | flexibility without sacrificing speed. |
1855 | |
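Since a C<qr//> object can be interpolated into a larger regexp, another option is to join several of them into a single alternation. This is a sketch only; the patterns and log lines are invented for illustration. Each compiled piece keeps its own modifiers when it is interpolated, so only the first alternative ignores case here.

```perl
use strict;
use warnings;

# Compile each alternative once; each carries its own modifiers.
my @compiled = (qr/error/i, qr/\bfail(?:ed|ure)?\b/, qr/^WARN/);

# Interpolating the qr// objects builds one combined regexp.
my $any      = join '|', @compiled;
my $combined = qr/(?:$any)/;

my @log  = ("All good", "Disk ERROR on sda", "WARN low memory", "backup failed");
my @hits = grep { $_ =~ $combined } @log;
print "$_\n" for @hits;    # every line except "All good"
```
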
1856 | =head2 Embedding comments and modifiers in a regular expression |
1857 | |
1858 | Starting with this section, we will be discussing Perl's set of |
1859 | B<extended patterns>. These are extensions to the traditional regular |
1860 | expression syntax that provide powerful new tools for pattern |
1861 | matching. We have already seen extensions in the form of the minimal |
1862 | matching constructs C<??>, C<*?>, C<+?>, C<{n,m}?>, and C<{n,}?>. The |
1863 | rest of the extensions below have the form C<(?char...)>, where the |
1864 | C<char> is a character that determines the type of extension. |
1865 | |
1866 | The first extension is an embedded comment C<(?#text)>. This embeds a |
1867 | comment into the regular expression without affecting its meaning. The |
1868 | comment should not have any closing parentheses in the text. An |
1869 | example is |
1870 | |
1871 | /(?# Match an integer:)[+-]?\d+/; |
1872 | |
1873 | This style of commenting has been largely superseded by the raw, |
1874 | freeform commenting that is allowed with the C<//x> modifier. |
1875 | |
1876 | The modifiers C<//i>, C<//m>, C<//s>, and C<//x> can also be embedded in |
1877 | a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>. For instance, |
1878 | |
1879 | /(?i)yes/; # match 'yes' case insensitively |
1880 | /yes/i; # same thing |
1881 | /(?x)( # freeform version of an integer regexp |
1882 | [+-]? # match an optional sign |
1883 | \d+ # match a sequence of digits |
1884 | ) |
1885 | /x; |
1886 | |
1887 | Embedded modifiers can have two important advantages over the usual |
1888 | modifiers. Embedded modifiers allow a custom set of modifiers for |
1889 | I<each> regexp pattern. This is great for matching an array of regexps |
1890 | that must have different modifiers: |
1891 | |
1892 | $pattern[0] = '(?i)doctor'; |
1893 | $pattern[1] = 'Johnson'; |
1894 | ... |
1895 | while (<>) { |
1896 | foreach $patt (@pattern) { |
1897 | print if /$patt/; |
1898 | } |
1899 | } |
1900 | |
1901 | The second advantage is that embedded modifiers only affect the regexp |
1902 | inside the group the embedded modifier is contained in. So grouping |
1903 | can be used to localize the modifier's effects: |
1904 | |
1905 | /Answer: ((?i)yes)/; # matches 'Answer: yes', 'Answer: YES', etc. |
1906 | |
1907 | Embedded modifiers can also turn off any modifiers already present |
1908 | by using, e.g., C<(?-i)>. Modifiers can also be combined into |
1909 | a single expression, e.g., C<(?s-i)> turns on single line mode and |
1910 | turns off case insensitivity. |
1911 | |
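The following sketch shows a modifier being switched on for one group and switched off for another. The patterns and test strings are made up for this example; the printed line lists 1 for a match and 0 for a failure, in the order of the checks.

```perl
use strict;
use warnings;

my $loose_tail  = qr/Answer: ((?i)yes)/;        # only 'yes' ignores case
my $strict_tail = qr/total: \d+ ((?-i)kg)/i;    # everything except 'kg' ignores case

my @checks = (
    [ "Answer: YES", $loose_tail  ],    # matches
    [ "ANSWER: yes", $loose_tail  ],    # fails: 'Answer' is case sensitive
    [ "TOTAL: 5 kg", $strict_tail ],    # matches
    [ "total: 5 KG", $strict_tail ],    # fails: 'kg' is case sensitive
);

my @results = map { $_->[0] =~ $_->[1] ? 1 : 0 } @checks;
print "@results\n";    # 1 0 1 0
```
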
1912 | =head2 Non-capturing groupings |
1913 | |
1914 | We noted in Part 1 that groupings C<()> had two distinct functions: 1) |
1915 | group regexp elements together as a single unit, and 2) extract, or |
1916 | capture, substrings that matched the regexp in the |
1917 | grouping. Non-capturing groupings, denoted by C<(?:regexp)>, allow the |
1918 | regexp to be treated as a single unit, but don't extract substrings or |
1919 | set matching variables C<$1>, etc. Both capturing and non-capturing |
1920 | groupings are allowed to co-exist in the same regexp. Because there is |
1921 | no extraction, non-capturing groupings are faster than capturing |
1922 | groupings. Non-capturing groupings are also handy for choosing exactly |
1923 | which parts of a regexp are to be extracted to matching variables: |
1924 | |
1925 | # match a number, $1-$4 are set, but we only want $1 |
1926 | /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/; |
1927 | |
1928 | # match a number faster, only $1 is set |
1929 | /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/; |
1930 | |
1931 | # match a number, get $1 = whole number, $2 = exponent |
1932 | /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/; |
1933 | |
1934 | Non-capturing groupings are also useful for removing nuisance |
1935 | elements gathered from a split operation: |
1936 | |
1937 | $x = '12a34b5'; |
1938 | @num = split /(a|b)/, $x; # @num = ('12','a','34','b','5') |
1939 | @num = split /(?:a|b)/, $x; # @num = ('12','34','5') |
1940 | |
1941 | Non-capturing groupings may also have embedded modifiers: |
1942 | C<(?i-m:regexp)> is a non-capturing grouping that matches C<regexp> |
1943 | case insensitively and turns off multi-line mode. |
1944 | |
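To see which groups end up in the matching variables, the third number regexp above can be run in list context. This is a small sketch with made-up inputs; an exponent group that takes no part in the match comes back undefined.

```perl
use strict;
use warnings;

# Capturing group 1 is the whole number, group 2 the exponent digits;
# all other groups are non-capturing.
my $number = qr/([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;

my @report;
for my $text ("-1.5e10", "3.14") {
    my ($whole, $exponent) = $text =~ $number;
    push @report, "$whole has exponent " . (defined $exponent ? $exponent : "none");
}
print "$_\n" for @report;
```
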
1945 | =head2 Looking ahead and looking behind |
1946 | |
1947 | This section concerns the lookahead and lookbehind assertions. First, |
1948 | a little background. |
1949 | |
1950 | In Perl regular expressions, most regexp elements 'eat up' a certain |
1951 | amount of string when they match. For instance, the regexp element |
1952 | C<[abc]> eats up one character of the string when it matches, in the |
1953 | sense that perl moves to the next character position in the string |
1954 | after the match. There are some elements, however, that don't eat up |
1955 | characters (advance the character position) if they match. The examples |
1956 | we have seen so far are the anchors. The anchor C<^> matches the |
1957 | beginning of the line, but doesn't eat any characters. Similarly, the |
1958 | word boundary anchor C<\b> matches, e.g., if the character to the left |
1959 | is a word character and the character to the right is a non-word |
1960 | character, but it doesn't eat up any characters itself. Anchors are |
1961 | examples of 'zero-width assertions'. Zero-width, because they consume |
1962 | no characters, and assertions, because they test some property of the |
1963 | string. In the context of our walk in the woods analogy to regexp |
1964 | matching, most regexp elements move us along a trail, but anchors have |
1965 | us stop a moment and check our surroundings. If the local environment |
1966 | checks out, we can proceed forward. But if the local environment |
1967 | doesn't satisfy us, we must backtrack. |
1968 | |
1969 | Checking the environment entails either looking ahead on the trail, |
1970 | looking behind, or both. C<^> looks behind, to see that there are no |
1971 | characters before. C<$> looks ahead, to see that there are no |
1972 | characters after. C<\b> looks both ahead and behind, to see if the |
1973 | characters on either side differ in their 'word'-ness. |
1974 | |
1975 | The lookahead and lookbehind assertions are generalizations of the |
1976 | anchor concept. Lookahead and lookbehind are zero-width assertions |
1977 | that let us specify which characters we want to test for. The |
1978 | lookahead assertion is denoted by C<(?=regexp)> and the lookbehind |
a6b2f353 |
1979 | assertion is denoted by C<< (?<=fixed-regexp) >>. Some examples are |
47f9c88b |
1980 | |
1981 | $x = "I catch the housecat 'Tom-cat' with catnip"; |
1982 | $x =~ /cat(?=\s+)/; # matches 'cat' in 'housecat' |
1983 | @catwords = ($x =~ /(?<=\s)cat\w+/g); # matches, |
1984 | # $catwords[0] = 'catch' |
1985 | # $catwords[1] = 'catnip' |
1986 | $x =~ /\bcat\b/; # matches 'cat' in 'Tom-cat' |
1987 | $x =~ /(?<=\s)cat(?=\s)/; # doesn't match; no isolated 'cat' in |
1988 | # middle of $x |
1989 | |
a6b2f353 |
1990 | Note that the parentheses in C<(?=regexp)> and C<< (?<=regexp) >> are |
47f9c88b |
1991 | non-capturing, since these are zero-width assertions. Thus in the |
1992 | second regexp, the substrings captured are those of the whole regexp |
a6b2f353 |
1993 | itself. Lookahead C<(?=regexp)> can match arbitrary regexps, but |
1994 | lookbehind C<< (?<=fixed-regexp) >> only works for regexps of fixed |
1995 | width, i.e., a fixed number of characters long. Thus |
1996 | C<< (?<=(ab|bc)) >> is fine, but C<< (?<=(ab)*) >> is not. The |
1997 | negated versions of the lookahead and lookbehind assertions are |
1998 | denoted by C<(?!regexp)> and C<< (?<!fixed-regexp) >> respectively. |
1999 | They evaluate true if the regexps do I<not> match: |
47f9c88b |
2000 | |
2001 | $x = "foobar"; |
2002 | $x =~ /foo(?!bar)/; # doesn't match, 'bar' follows 'foo' |
2003 | $x =~ /foo(?!baz)/; # matches, 'baz' doesn't follow 'foo' |
2004 | $x =~ /(?<!\s)foo/; # matches, there is no \s before 'foo' |
2005 | |
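Because these assertions consume nothing, they can be combined to find a I<position> rather than a substring. As a sketch, the hypothetical C<commify> routine below inserts commas into a string of digits: the lookbehind requires a digit to the left, and the lookahead requires a multiple of three digits to the right that is not followed by another digit. It is meant for plain integers only and would also put commas into a fractional part.

```perl
use strict;
use warnings;

# Hypothetical helper: add thousands separators to an integer string.
sub commify {
    my ($n) = @_;
    # Match the empty string at each point that has a digit before it
    # and exactly 3, 6, 9, ... digits after it.
    $n =~ s/(?<=\d)(?=(?:\d{3})+(?!\d))/,/g;
    return $n;
}

print commify("1234567"), "\n";    # 1,234,567
```
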
2006 | =head2 Using independent subexpressions to prevent backtracking |
2007 | |
2008 | The last few extended patterns in this tutorial are experimental as of |
2009 | 5.6.0. Play with them, use them in some code, but don't rely on them |
2010 | just yet for production code. |
2011 | |
2012 | S<B<Independent subexpressions> > are regular expressions, in the |
2013 | context of a larger regular expression, that function independently of |
2014 | the larger regular expression. That is, they consume as much or as |
2015 | little of the string as they wish without regard for the ability of |
2016 | the larger regexp to match. Independent subexpressions are represented |
2017 | by C<< (?>regexp) >>. We can illustrate their behavior by first |
2018 | considering an ordinary regexp: |
2019 | |
2020 | $x = "ab"; |
2021 | $x =~ /a*ab/; # matches |
2022 | |
2023 | This obviously matches, but in the process of matching, the |
2024 | subexpression C<a*> first grabbed the C<a>. Doing so, however, |
2025 | wouldn't allow the whole regexp to match, so after backtracking, C<a*> |
2026 | eventually gave back the C<a> and matched the empty string. Here, what |
2027 | C<a*> matched was I<dependent> on what the rest of the regexp matched. |
2028 | |
2029 | Contrast that with an independent subexpression: |
2030 | |
2031 | $x =~ /(?>a*)ab/; # doesn't match! |
2032 | |
2033 | The independent subexpression C<< (?>a*) >> doesn't care about the rest |
2034 | of the regexp, so it sees an C<a> and grabs it. Then the rest of the |
2035 | regexp C<ab> cannot match. Because C<< (?>a*) >> is independent, there |
2036 | is no backtracking and the independent subexpression does not give |
2037 | up its C<a>. Thus the match of the regexp as a whole fails. A similar |
2038 | behavior occurs with completely independent regexps: |
2039 | |
2040 | $x = "ab"; |
2041 | $x =~ /a*/g; # matches, eats an 'a' |
2042 | $x =~ /\Gab/g; # doesn't match, no 'a' available |
2043 | |
2044 | Here C<//g> and C<\G> create a 'tag team' handoff of the string from |
2045 | one regexp to the other. Regexps with an independent subexpression are |
2046 | much like this, with a handoff of the string to the independent |
2047 | subexpression, and a handoff of the string back to the enclosing |
2048 | regexp. |
2049 | |
2050 | The ability of an independent subexpression to prevent backtracking |
2051 | can be quite useful. Suppose we want to match a non-empty string |
2052 | enclosed in parentheses up to two levels deep. Then the following |
2053 | regexp matches: |
2054 | |
2055 | $x = "abc(de(fg)h"; # unbalanced parentheses |
2056 | $x =~ /\( ( [^()]+ | \([^()]*\) )+ \)/x; |
2057 | |
2058 | The regexp matches an open parenthesis, one or more copies of an |
2059 | alternation, and a close parenthesis. The alternation is two-way, with |
2060 | the first alternative C<[^()]+> matching a substring with no |
2061 | parentheses and the second alternative C<\([^()]*\)> matching a |
2062 | substring delimited by parentheses. The problem with this regexp is |
2063 | that it is pathological: it has nested indeterminate quantifiers |
2064 | of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers |
2065 | like this could take an exponentially long time to execute if there |
2066 | was no match possible. To prevent the exponential blowup, we need to |
2067 | prevent useless backtracking at some point. This can be done by |
2068 | enclosing the inner quantifier as an independent subexpression: |
2069 | |
2070 | $x =~ /\( ( (?>[^()]+) | \([^()]*\) )+ \)/x; |
2071 | |
2072 | Here, C<< (?>[^()]+) >> breaks the degeneracy of string partitioning |
2073 | by gobbling up as much of the string as possible and keeping it. Then |
2074 | match failures fail much more quickly. |
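As a rough check that the independent subexpression changes only how the engine searches, not what it finds, the sketch below runs both versions against the unbalanced string from above and against a balanced one. The second test string and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Plain and independent-subexpression versions of the regexp above.
my $plain  = qr/\( ( [^()]+ | \([^()]*\) )+ \)/x;
my $atomic = qr/\( ( (?>[^()]+) | \([^()]*\) )+ \)/x;

my @results;
for my $str ("abc(de(fg)h", "x(a(b)c)y") {
    # The outer capture group returns the whole matched text.
    my ($m_plain)  = $str =~ /($plain)/;
    my ($m_atomic) = $str =~ /($atomic)/;
    push @results, [$str, $m_plain, $m_atomic];
    printf "%-12s plain: %-8s atomic: %s\n",
        $str, $m_plain // 'none', $m_atomic // 'none';
}
```

For the unbalanced string, both versions settle on the inner C<(fg)>, because no match is possible starting at the first open parenthesis.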
2075 | |
2076 | =head2 Conditional expressions |
2077 | |
2078 | A S<B<conditional expression> > is a form of if-then-else statement |
2079 | that allows one to choose which patterns are to be matched, based on |
2080 | some condition. There are two types of conditional expression: |
2081 | C<(?(condition)yes-regexp)> and |
2082 | C<(?(condition)yes-regexp|no-regexp)>. C<(?(condition)yes-regexp)> is |
2083 | like an S<C<'if () {}'> > statement in Perl. If the C<condition> is true, |
2084 | the C<yes-regexp> will be matched. If the C<condition> is false, the |
2085 | C<yes-regexp> will be skipped and perl will move on to the next regexp |
2086 | element. The second form is like an S<C<'if () {} else {}'> > statement |
2087 | in Perl. If the C<condition> is true, the C<yes-regexp> will be |
2088 | matched, otherwise the C<no-regexp> will be matched. |
2089 | |
2090 | The C<condition> can have two forms. The first form is simply an |
2091 | integer in parentheses C<(integer)>. It is true if the corresponding |
2092 | backreference C<\integer> matched earlier in the regexp. The second |
2093 | form is a bare zero width assertion C<(?...)>, either a |
2094 | lookahead, a lookbehind, or a code assertion (discussed in the next |
2095 | section). |
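A small sketch of the integer form, using a made-up task: accept a number that is either bare or fully enclosed in parentheses, but not half-enclosed. The test strings and the C<%matched> hash are illustrative assumptions.

```perl
use strict;
use warnings;

# Group 1 optionally captures an opening parenthesis. The conditional
# (?(1)\)) demands a closing parenthesis only if group 1 matched.
my $re = qr/^(\()?\d+(?(1)\))$/;

my %matched;
for my $str ('42', '(42)', '(42', '42)') {
    $matched{$str} = $str =~ $re ? 1 : 0;
    printf "%-5s %s\n", $str,
        $matched{$str} ? 'matches' : 'does not match';
}
```

Only C<42> and C<(42)> match; when group 1 did not participate, the conditional matches the empty string and the trailing C<)> in C<42)> is left unaccounted for.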
2096 | |
2097 | The integer form of the C<condition> allows us to choose, with more |
2098 | flexibility, what to match based on what matched earlier in the |
2099 | regexp. This searches for words of the form C<"$x$x"> or |
2100 | C<"$x$y$y$x">: |
2101 | |
2102 | % simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words |
2103 | beriberi |
2104 | coco |
2105 | couscous |
2106 | deed |
2107 | ... |
2108 | toot |
2109 | toto |
2110 | tutu |
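If no dictionary file is at hand, the same regexp can be tried on a short list. This is a sketch only; the word list is an assumption chosen to include a few hits and a few misses.

```perl
use strict;
use warnings;

# Same regexp as in the simple_grep command above.
my $re = qr/^(\w+)(\w+)?(?(2)\2\1|\1)$/;

# A tiny stand-in for the dictionary file.
my @words = qw(beriberi coco deed toot hello perl);

# Keep only the words of the form "$x$x" or "$x$y$y$x".
my @hits = grep { /$re/ } @words;
print "$_\n" for @hits;
```

Here C<beriberi> and C<coco> match with group 2 unused, while C<deed> and C<toot> need both groups.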
2111 | |
2112 | The lookbehind C<condition> allows, along with backreferences, |
2113 | an earlier part of the match to influence a later part of the |
2114 | match. For instance, |
2115 | |
2116 | /[ATGC]+(?(?<=AA)G|C)$/; |
2117 | |
2118 | matches a DNA sequence such that it either ends in C<AAG>, or some |
2119 | other base pair combination and C<C>. Note that the form is |
a6b2f353 |
2120 | C<< (?(?<=AA)G|C) >> and not C<< (?((?<=AA))G|C) >>; for the |
2121 | lookahead, lookbehind or code assertions, the parentheses around the |
2122 | conditional are not needed. |
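The following sketch exercises the lookbehind conditional on a few made-up sequences; the sequences and the C<%ends_ok> hash are illustrative assumptions.

```perl
use strict;
use warnings;

# After the run of bases, the lookbehind asks whether the two
# preceding characters are 'AA'. If so, the final base must be G;
# otherwise it must be C.
my $re = qr/[ATGC]+(?(?<=AA)G|C)$/;

my %ends_ok;
for my $dna (qw(ATGAAG ATGC ATGAAC ATGCAG)) {
    $ends_ok{$dna} = $dna =~ $re ? 1 : 0;
    printf "%-7s %s\n", $dna,
        $ends_ok{$dna} ? 'matches' : 'does not match';
}
```

C<ATGAAC> fails because the C<AA> before the last base forces the C<G> branch, and C<ATGCAG> fails because its final C<G> is not preceded by C<AA>.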
47f9c88b |
2123 | |
2124 | =head2 A bit of magic: executing Perl code in a regular expression |
2125 | |
2126 | Normally, regexps are a part of Perl expressions. |
2127 | S<B<Code evaluation> > expressions turn that around by allowing |
2128 | arbitrary Perl code to be a part of a regexp. A code evaluation |
2129 | expression is denoted C<(?{code})>, with C<code> a string of Perl |
2130 | statements. |
2131 | |
2132 | Code expressions are zero-width assertions, and the value they return |
2133 | depends on their environment. There are two possibilities: either the |
2134 | code expression is used as a conditional in a conditional expression |
2135 | C<(?(condition)...)>, or it is not. If the code expression is a |
2136 | conditional, the code is evaluated and the result (i.e., the result of |
2137 | the last statement) is used to determine truth or falsehood. If the |
2138 | code expression is not used as a conditional, the assertion always |
2139 | evaluates true and the result is put into the special variable |
2140 | C<$^R>. The variable C<$^R> can then be used in code expressions later |
2141 | in the regexp. Here are some silly examples: |
2142 | |
2143 | $x = "abcdef"; |
2144 | $x =~ /abc(?{print "Hi Mom!";})def/; # matches, |
2145 | # prints 'Hi Mom!' |
2146 | $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match, |
2147 | # no 'Hi Mom!' |
745e1e41 |
2148 | |
2149 | Pay careful attention to the next example: |
2150 | |
47f9c88b |
2151 | $x =~ /abc(?{print "Hi Mom!";})ddd/; # doesn't match, |
2152 | # no 'Hi Mom!' |
745e1e41 |
2153 | # but why not? |
2154 | |
2155 | At first glance, you'd think that it shouldn't print, because obviously |
2156 | the C<ddd> isn't going to match the target string. But look at this |
2157 | example: |
2158 | |
2159 | $x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match, |
2160 | # but _does_ print |
2161 | |
2162 | Hmm. What happened here? If you've been following along, you know that |
2163 | the above pattern should be effectively the same as the last one -- |
2164 | enclosing the d in a character class isn't going to change what it |
2165 | matches. So why does the first not print while the second one does? |
2166 | |
2167 | The answer lies in the optimizations the REx engine makes. In the first |
2168 | case, all the engine sees are plain old characters (aside from the |
2169 | C<?{}> construct). It's smart enough to realize that the string 'ddd' |
2170 | doesn't occur in our target string before actually running the pattern |
2171 | through. But in the second case, we've tricked it into thinking that our |
2172 | pattern is more complicated than it is. It takes a look, sees our |
2173 | character class, and decides that it will have to actually run the |
2174 | pattern to determine whether or not it matches, and in the process of |
2175 | running it hits the print statement before it discovers that we don't |
2176 | have a match. |
2177 | |
2178 | To take a closer look at how the engine does optimizations, see the |
2179 | section L<"Pragmas and debugging"> below. |
2180 | |
2181 | More fun with C<?{}>: |
2182 | |
47f9c88b |
2183 | $x =~ /(?{print "Hi Mom!";})/; # matches, |
2184 | # prints 'Hi Mom!' |
2185 | $x =~ /(?{$c = 1;})(?{print "$c";})/; # matches, |
2186 | # prints '1' |
2187 | $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches, |
2188 | # prints '1' |
2189 | |
2190 | The bit of magic mentioned in the section title occurs when the regexp |
2191 | backtracks in the process of searching for a match. If the regexp |
2192 | backtracks over a code expression and if the variables used within are |
2193 | localized using C<local>, the changes in the variables produced by the |
2194 | code expression are undone! Thus, if we wanted to count how many times |
2195 | a character got matched inside a group, we could use, e.g., |
2196 | |
2197 | $x = "aaaa"; |
2198 | $count = 0; # initialize 'a' count |
2199 | $c = "bob"; # test if $c gets clobbered |
2200 | $x =~ /(?{local $c = 0;}) # initialize count |
2201 | ( a # match 'a' |
2202 | (?{local $c = $c + 1;}) # increment count |
2203 | )* # do this any number of times, |
2204 | aa # but match 'aa' at the end |
2205 | (?{$count = $c;}) # copy local $c var into $count |
2206 | /x; |
2207 | print "'a' count is $count, \$c variable is '$c'\n"; |
2208 | |
2209 | This prints |
2210 | |
2211 | 'a' count is 2, $c variable is 'bob' |
2212 | |
2213 | If we replace the S<C< (?{local $c = $c + 1;})> > with |
2214 | S<C< (?{$c = $c + 1;})> >, the variable changes are I<not> undone |
2215 | during backtracking, and we get |
2216 | |
2217 | 'a' count is 4, $c variable is 'bob' |
2218 | |
2219 | Note that only localized variable changes are undone. Other side |
2220 | effects of code expression execution are permanent. Thus |
2221 | |
2222 | $x = "aaaa"; |
2223 | $x =~ /(a(?{print "Yow\n";}))*aa/; |
2224 | |
2225 | produces |
2226 | |
2227 | Yow |
2228 | Yow |
2229 | Yow |
2230 | Yow |
2231 | |
2232 | The result C<$^R> is automatically localized, so that it will behave |
2233 | properly in the presence of backtracking. |
2234 | |
2235 | This example uses a code expression in a conditional to match the |
2236 | article 'the' in either English or German: |
2237 | |
47f9c88b |
2238 | $lang = 'DE'; # use German |
2239 | ... |
2240 | $text = "das"; |
2241 | print "matched\n" |
2242 | if $text =~ /(?(?{ |
2243 | $lang eq 'EN'; # is the language English? |
2244 | }) |
2245 | the | # if so, then match 'the' |
2246 | (die|das|der) # else, match 'die|das|der' |
2247 | ) |
2248 | /xi; |
2249 | |
2250 | Note that the syntax here is C<(?(?{...})yes-regexp|no-regexp)>, not |
2251 | C<(?((?{...}))yes-regexp|no-regexp)>. In other words, in the case of a |
2252 | code expression, we don't need the extra parentheses around the |
2253 | conditional. |
2254 | |
a6b2f353 |
2255 | If you try to use code expressions with interpolating variables, perl |
2256 | may surprise you: |
2257 | |
2258 | $bar = 5; |
2259 | $pat = '(?{ 1 })'; |
2260 | /foo(?{ $bar })bar/; # compiles ok, $bar not interpolated |
2261 | /foo(?{ 1 })$bar/; # compile error! |
2262 | /foo${pat}bar/; # compile error! |
2263 | |
2264 | $pat = qr/(?{ $foo = 1 })/; # precompile code regexp |
2265 | /foo${pat}bar/; # compiles ok |
2266 | |
2267 | If a regexp has (1) code expressions and interpolating variables, or |
2268 | (2) a variable that interpolates a code expression, perl treats the |
2269 | regexp as an error. If the code expression is precompiled into a |
2270 | variable, however, interpolating is ok. The question is, why is this |
2271 | an error? |
2272 | |
2273 | The reason is that variable interpolation and code expressions |
2274 | together pose a security risk. The combination is dangerous because |
2275 | programmers who write search engines often take user input and |
2276 | plug it directly into a regexp: |
47f9c88b |
2277 | |
2278 | $regexp = <>; # read user-supplied regexp |
2279 | chomp $regexp; # get rid of possible newline |
2280 | $text =~ /$regexp/; # search $text for the $regexp |
2281 | |
a6b2f353 |
2282 | If the C<$regexp> variable contains a code expression, the user could |
2283 | then execute arbitrary Perl code. For instance, some joker could |
47f9c88b |
2284 | search for S<C<(?{system('rm -rf *');})> > to erase your files. In this |
2285 | sense, the combination of interpolation and code expressions B<taints> |
2286 | your regexp. So by default, using both interpolation and code |
a6b2f353 |
2287 | expressions in the same regexp is not allowed. If you're not |
2288 | concerned about malicious users, it is possible to bypass this |
2289 | security check by invoking S<C<use re 'eval'> >: |
2290 | |
2291 | use re 'eval'; # throw caution out the door |
2292 | $bar = 5; |
2293 | $pat = '(?{ 1 })'; |
2294 | /foo(?{ 1 })$bar/; # compiles ok |
2295 | /foo${pat}bar/; # compiles ok |
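When user input is meant to be searched for as literal text rather than as a pattern, an alternative that avoids the question entirely is to escape it with Perl's C<quotemeta> function (or C<\Q...\E> inside the regexp). The sketch below assumes a hypothetical hostile input string; the variable names are illustrative.

```perl
use strict;
use warnings;

# Hypothetical user input that tries to smuggle in a code expression.
my $input = 'a.c(?{ die })';

# quotemeta backslash-escapes every regexp metacharacter, so the
# input can only match itself, character for character.
my $safe = quotemeta $input;

# The escaped pattern finds the literal text inside a larger string...
my $literal_hit = ('xx' . $input . 'yy') =~ /$safe/ ? 1 : 0;

# ...and the '.' no longer acts as a wildcard.
my $wild_hit = 'abc' =~ /$safe/ ? 1 : 0;

print "literal text found: $literal_hit\n";
print "wildcard match:     $wild_hit\n";
```

Because the parentheses and braces are escaped, the interpolated string contains no code expression and nothing is executed.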
47f9c88b |
2296 | |
2297 | Another form of code expression is the S<B<pattern code expression> >. |
2298 | The pattern code expression is like a regular code expression, except |
2299 | that the result of the code evaluation is treated as a regular |
2300 | expression and matched immediately. A simple example is |
2301 | |
2302 | $length = 5; |
2303 | $char = 'a'; |
2304 | $x = 'aaaaabb'; |
2305 | $x =~ /(??{$char x $length})/x; # matches, there are 5 of 'a' |
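Because the pattern is built while the match is in progress, it can depend on text captured earlier in the same regexp. A minimal sketch, assuming a made-up C<length:payload> format:

```perl
use strict;
use warnings;

# The digits captured in $1 decide how many characters must follow
# the colon: (??{...}) builds the sub-pattern .{N} at match time.
my $re = qr/^(\d+):(??{ ".{$1}" })$/;

my %ok;
for my $str ('3:abc', '3:ab', '3:abcd', '0:') {
    $ok{$str} = $str =~ $re ? 1 : 0;
    printf "%-7s %s\n", $str,
        $ok{$str} ? 'matches' : 'does not match';
}
```

Only the strings whose payload length agrees with the prefix match.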
2306 | |
2307 | |
2308 | This final example contains both ordinary and pattern code |
2309 | expressions. It detects if a binary string C<1101010010001...> has a |
2310 | Fibonacci spacing 0,1,1,2,3,5,... of the C<1>'s: |
2311 | |
47f9c88b |
2312 | $s0 = 0; $s1 = 1; # initial conditions |
2313 | $x = "1101010010001000001"; |
2314 | print "It is a Fibonacci sequence\n" |
2315 | if $x =~ /^1 # match an initial '1' |
2316 | ( |
2317 | (??{'0' x $s0}) # match $s0 of '0' |
2318 | 1 # and then a '1' |
2319 | (?{ |
2320 | $largest = $s0; # largest seq so far |
2321 | $s2 = $s1 + $s0; # compute next term |
2322 | $s0 = $s1; # in Fibonacci sequence |
2323 | $s1 = $s2; |
2324 | }) |
2325 | )+ # repeat as needed |
2326 | $ # that is all there is |
2327 | /x; |
2328 | print "Largest sequence matched was $largest\n"; |
2329 | |
2330 | This prints |
2331 | |
2332 | It is a Fibonacci sequence |
2333 | Largest sequence matched was 5 |
2334 | |
2335 | Ha! Try that with your garden variety regexp package... |
2336 | |
2337 | Note that the variables C<$s0> and C<$s1> are not substituted when the |
2338 | regexp is compiled, as happens for ordinary variables outside a code |
2339 | expression. Rather, the code expressions are evaluated when perl |
2340 | encounters them during the search for a match. |
2341 | |
2342 | The regexp without the C<//x> modifier is |
2343 | |
2344 | /^1((??{'0'x$s0})1(?{$largest=$s0;$s2=$s1+$s0;$s0=$s1;$s1=$s2;}))+$/; |
2345 | |
2346 | and is a great start on an Obfuscated Perl entry :-) When working with |
2347 | code and conditional expressions, the extended form of regexps is |
2348 | almost necessary in creating and debugging regexps. |
2349 | |
2350 | =head2 Pragmas and debugging |
2351 | |
2352 | Speaking of debugging, there are several pragmas available to control |
2353 | and debug regexps in Perl. We have already encountered one pragma in |
2354 | the previous section, S<C<use re 'eval';> >, that allows variable |
a6b2f353 |
2355 | interpolation and code expressions to coexist in a regexp. The other |
2356 | pragmas are |
47f9c88b |
2357 | |
2358 | use re 'taint'; |
2359 | $tainted = <>; |
2360 | @parts = ($tainted =~ /(\w+)\s+(\w+)/); # @parts is now tainted |
2361 | |
2362 | The C<taint> pragma causes any substrings from a match with a tainted |
2363 | variable to be tainted as well. This is not normally the case, as |
2364 | regexps are often used to extract the safe bits from a tainted |
2365 | variable. Use C<taint> when you are not extracting safe bits, but are |
2366 | performing some other processing. Both C<taint> and C<eval> pragmas |
a6b2f353 |
2367 | are lexically scoped, which means they are in effect only until |
47f9c88b |
2368 | the end of the block enclosing the pragmas. |
2369 | |
2370 | use re 'debug'; |
2371 | /^(.*)$/s; # output debugging info |
2372 | |
2373 | use re 'debugcolor'; |
2374 | /^(.*)$/s; # output debugging info in living color |
2375 | |
2376 | The global C<debug> and C<debugcolor> pragmas allow one to get |
2377 | detailed debugging info about regexp compilation and |
2378 | execution. C<debugcolor> is the same as debug, except the debugging |
2379 | information is displayed in color on terminals that can display |
2380 | termcap color sequences. Here is example output: |
2381 | |
2382 | % perl -e 'use re "debug"; "abc" =~ /a*b+c/;' |
2383 | Compiling REx `a*b+c' |
2384 | size 9 first at 1 |
2385 | 1: STAR(4) |
2386 | 2: EXACT <a>(0) |
2387 | 4: PLUS(7) |
2388 | 5: EXACT <b>(0) |
2389 | 7: EXACT <c>(9) |
2390 | 9: END(0) |
2391 | floating `bc' at 0..2147483647 (checking floating) minlen 2 |
2392 | Guessing start of match, REx `a*b+c' against `abc'... |
2393 | Found floating substr `bc' at offset 1... |
2394 | Guessed: match at offset 0 |
2395 | Matching REx `a*b+c' against `abc' |
2396 | Setting an EVAL scope, savestack=3 |
2397 | 0 <> <abc> | 1: STAR |
2398 | EXACT <a> can match 1 times out of 32767... |
2399 | Setting an EVAL scope, savestack=3 |
2400 | 1 <a> <bc> | 4: PLUS |
2401 | EXACT <b> can match 1 times out of 32767... |
2402 | Setting an EVAL scope, savestack=3 |
2403 | 2 <ab> <c> | 7: EXACT <c> |
2404 | 3 <abc> <> | 9: END |
2405 | Match successful! |
2406 | Freeing REx: `a*b+c' |
2407 | |
2408 | If you have gotten this far into the tutorial, you can probably guess |
2409 | what the different parts of the debugging output tell you. The first |
2410 | part |
2411 | |
2412 | Compiling REx `a*b+c' |
2413 | size 9 first at 1 |
2414 | 1: STAR(4) |
2415 | 2: EXACT <a>(0) |
2416 | 4: PLUS(7) |
2417 | 5: EXACT <b>(0) |
2418 | 7: EXACT <c>(9) |
2419 | 9: END(0) |
2420 | |
2421 | describes the compilation stage. C<STAR(4)> means that there is a |
2422 | starred object, in this case C<'a'>, and if it matches, goto line 4, |
2423 | i.e., C<PLUS(7)>. The middle lines describe some heuristics and |
2424 | optimizations performed before a match: |
2425 | |
2426 | floating `bc' at 0..2147483647 (checking floating) minlen 2 |
2427 | Guessing start of match, REx `a*b+c' against `abc'... |
2428 | Found floating substr `bc' at offset 1... |
2429 | Guessed: match at offset 0 |
2430 | |
2431 | Then the match is executed and the remaining lines describe the |
2432 | process: |
2433 | |
2434 | Matching REx `a*b+c' against `abc' |
2435 | Setting an EVAL scope, savestack=3 |
2436 | 0 <> <abc> | 1: STAR |
2437 | EXACT <a> can match 1 times out of 32767... |
2438 | Setting an EVAL scope, savestack=3 |
2439 | 1 <a> <bc> | 4: PLUS |
2440 | EXACT <b> can match 1 times out of 32767... |
2441 | Setting an EVAL scope, savestack=3 |
2442 | 2 <ab> <c> | 7: EXACT <c> |
2443 | 3 <abc> <> | 9: END |
2444 | Match successful! |
2445 | Freeing REx: `a*b+c' |
2446 | |
2447 | Each step is of the form S<C<< n <x> <y> >> >, with C<< <x> >> the |
2448 | part of the string matched and C<< <y> >> the part not yet |
2449 | matched. The S<C<< | 1: STAR >> > says that perl is at line number 1 |
2450 | in the compilation list above. See |
2451 | L<perldebguts/"Debugging regular expressions"> for much more detail. |
2452 | |
2453 | An alternative method of debugging regexps is to embed C<print> |
2454 | statements within the regexp. This provides a blow-by-blow account of |
2455 | the backtracking in an alternation: |
2456 | |
2457 | "that this" =~ m@(?{print "Start at position ", pos, "\n";}) |
2458 | t(?{print "t1\n";}) |
2459 | h(?{print "h1\n";}) |
2460 | i(?{print "i1\n";}) |
2461 | s(?{print "s1\n";}) |
2462 | | |
2463 | t(?{print "t2\n";}) |
2464 | h(?{print "h2\n";}) |
2465 | a(?{print "a2\n";}) |
2466 | t(?{print "t2\n";}) |
2467 | (?{print "Done at position ", pos, "\n";}) |
2468 | @x; |
2469 | |
2470 | prints |
2471 | |
2472 | Start at position 0 |
2473 | t1 |
2474 | h1 |
2475 | t2 |
2476 | h2 |
2477 | a2 |
2478 | t2 |
2479 | Done at position 4 |
2480 | |
2481 | =head1 BUGS |
2482 | |
2483 | Code expressions, conditional expressions, and independent expressions |
2484 | are B<experimental>. Don't use them in production code. Yet. |
2485 | |
2486 | =head1 SEE ALSO |
2487 | |
2488 | This is just a tutorial. For the full story on perl regular |
2489 | expressions, see the L<perlre> regular expressions reference page. |
2490 | |
2491 | For more information on the matching C<m//> and substitution C<s///> |
2492 | operators, see L<perlop/"Regexp Quote-Like Operators">. For |
2493 | information on the C<split> operation, see L<perlfunc/split>. |
2494 | |
2495 | For an excellent all-around resource on the care and feeding of |
2496 | regular expressions, see the book I<Mastering Regular Expressions> by |
2497 | Jeffrey Friedl (published by O'Reilly, ISBN 1-56592-257-3). |
2498 | |
2499 | =head1 AUTHOR AND COPYRIGHT |
2500 | |
2501 | Copyright (c) 2000 Mark Kvale |
2502 | All rights reserved. |
2503 | |
2504 | This document may be distributed under the same terms as Perl itself. |
2505 | |
2506 | =head2 Acknowledgments |
2507 | |
2508 | The inspiration for the stop codon DNA example came from the ZIP |
2509 | code example in chapter 7 of I<Mastering Regular Expressions>. |
2510 | |
a6b2f353 |
2511 | The author would like to thank Jeff Pinyan, Andrew Johnson, Peter |
2512 | Haworth, Ronald J Kimball, and Joe Smith for all their helpful |
2513 | comments. |
47f9c88b |
2514 | |
2515 | =cut |
a6b2f353 |
2516 | |