Commit | Line | Data |
47f9c88b |
1 | =head1 NAME |
2 | |
3 | perlrequick - Perl regular expressions quick start |
4 | |
5 | =head1 DESCRIPTION |
6 | |
7 | This page covers the very basics of understanding, creating and |
8 | using regular expressions ('regexps') in Perl. |
9 | |
10 | =head1 The Guide |
11 | |
12 | =head2 Simple word matching |
13 | |
14 | The simplest regexp is simply a word, or more generally, a string of |
15 | characters. A regexp consisting of a word matches any string that |
16 | contains that word: |
17 | |
18 | "Hello World" =~ /World/; # matches |
19 | |
20 | In this statement, C<World> is a regexp and the C<//> enclosing |
21 | C</World/> tells perl to search a string for a match. The operator |
22 | C<=~> associates the string with the regexp match and produces a true |
23 | value if the regexp matched, or false if the regexp did not match. In |
24 | our case, C<World> matches the second word in C<"Hello World">, so the |
25 | expression is true. This idea has several variations. |
26 | |
27 | Expressions like this are useful in conditionals: |
28 | |
29 | print "It matches\n" if "Hello World" =~ /World/; |
30 | |
31 | The sense of the match can be reversed by using the C<!~> operator: |
32 | |
33 | print "It doesn't match\n" if "Hello World" !~ /World/; |
34 | |
35 | The literal string in the regexp can be replaced by a variable: |
36 | |
37 | $greeting = "World"; |
38 | print "It matches\n" if "Hello World" =~ /$greeting/; |
39 | |
40 | If you're matching against C<$_>, the C<$_ =~> part can be omitted: |
41 | |
42 | $_ = "Hello World"; |
43 | print "It matches\n" if /World/; |
44 | |
45 | Finally, the C<//> default delimiters for a match can be changed to |
46 | arbitrary delimiters by putting an C<'m'> out front: |
47 | |
48 | "Hello World" =~ m!World!; # matches, delimited by '!' |
49 | "Hello World" =~ m{World}; # matches, note the matching '{}' |
50 | "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin', |
51 | # '/' becomes an ordinary char |
52 | |
53 | Regexps must match a part of the string I<exactly> in order for the |
54 | statement to be true: |
55 | |
56 | "Hello World" =~ /world/; # doesn't match, case sensitive |
57 | "Hello World" =~ /o W/; # matches, ' ' is an ordinary char |
58 | "Hello World" =~ /World /; # doesn't match, no ' ' at end |
59 | |
60 | perl will always match at the earliest possible point in the string: |
61 | |
62 | "Hello World" =~ /o/; # matches 'o' in 'Hello' |
63 | "That hat is red" =~ /hat/; # matches 'hat' in 'That' |
64 | |
65 | Not all characters can be used 'as is' in a match. Some characters, |
66 | called B<metacharacters>, are reserved for use in regexp notation. |
67 | The metacharacters are |
68 | |
69 | {}[]()^$.|*+?\ |
70 | |
71 | A metacharacter can be matched by putting a backslash before it: |
72 | |
73 | "2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter |
74 | "2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary + |
75 | 'C:\WIN32' =~ /C:\\WIN/; # matches |
76 | "/usr/bin/perl" =~ /\/usr\/bin\/perl/; # matches |
77 | |
78 | In the last regexp, the forward slash C<'/'> is also backslashed, |
79 | because it is used to delimit the regexp. |
80 | |
81 | Non-printable ASCII characters are represented by B<escape sequences>. |
82 | Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r> |
83 | for a carriage return. Arbitrary bytes are represented by octal |
84 | escape sequences, e.g., C<\033>, or hexadecimal escape sequences, |
85 | e.g., C<\x1B>: |
86 | |
87 | "1000\t2000" =~ m(0\t2); # matches |
88 | "cat" =~ /\143\x61\x74/; # matches, but a weird way to spell cat |
89 | |
90 | Regexps are treated mostly as double quoted strings, so variable |
91 | substitution works: |
92 | |
93 | $foo = 'house'; |
94 | 'cathouse' =~ /cat$foo/; # matches |
95 | 'housecat' =~ /${foo}cat/; # matches |
96 | |
97 | With all of the regexps above, if the regexp matched anywhere in the |
98 | string, it was considered a match. To specify I<where> it should |
99 | match, we would use the B<anchor> metacharacters C<^> and C<$>. The |
100 | anchor C<^> means match at the beginning of the string and the anchor |
101 | C<$> means match at the end of the string, or before a newline at the |
102 | end of the string. Some examples: |
103 | |
104 | "housekeeper" =~ /keeper/; # matches |
105 | "housekeeper" =~ /^keeper/; # doesn't match |
106 | "housekeeper" =~ /keeper$/; # matches |
107 | "housekeeper\n" =~ /keeper$/; # matches |
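
The anchor rules above can be tried out with a short script. This is only a minimal sketch: the sample strings and the variable names C<@samples> and C<@report> are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for illustration.
my @samples = ("housekeeper", "keeper of the house", "housekeeper\n", "keepers");

my @report;
for my $s (@samples) {
    my $at_start = ($s =~ /^keeper/) ? 1 : 0;   # 'keeper' at the beginning of the string
    my $at_end   = ($s =~ /keeper$/) ? 1 : 0;   # 'keeper' at the end, or just before a final newline
    push @report, "$at_start$at_end";
}

print join(" ", @report), "\n";   # prints "01 10 01 10"
```

Each pair of digits records whether C<^keeper> and C<keeper$> matched; note that the trailing newline in the third string does not stop C<$> from matching.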
108 | |
109 | =head2 Using character classes |
110 | |
111 | A B<character class> allows a set of possible characters, rather than |
112 | just a single character, to match at a particular point in a regexp. |
113 | Character classes are denoted by brackets C<[...]>, with the set of |
114 | characters to be possibly matched inside. Here are some examples: |
115 | |
116 | /cat/; # matches 'cat' |
117 | /[bcr]at/; # matches 'bat', 'cat', or 'rat' |
118 | "abc" =~ /[cab]/; # matches 'a' |
119 | |
120 | In the last statement, even though C<'c'> is the first character in |
121 | the class, the earliest point at which the regexp can match is C<'a'>. |
122 | |
123 | /[yY][eE][sS]/; # match 'yes' in a case-insensitive way |
124 | # 'yes', 'Yes', 'YES', etc. |
125 | /yes/i; # also match 'yes' in a case-insensitive way |
126 | |
127 | The last example shows a match with an C<'i'> B<modifier>, which makes |
128 | the match case-insensitive. |
129 | |
130 | Character classes also have ordinary and special characters, but the |
131 | sets of ordinary and special characters inside a character class are |
132 | different than those outside a character class. The special |
133 | characters for a character class are C<-]\^$> and are matched using an |
134 | escape: |
135 | |
136 | /[\]c]def/; # matches ']def' or 'cdef' |
137 | $x = 'bcr'; |
138 | /[$x]at/; # matches 'bat', 'cat', or 'rat' |
139 | /[\$x]at/; # matches '$at' or 'xat' |
140 | /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat' |
141 | |
142 | The special character C<'-'> acts as a range operator within character |
143 | classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]> |
144 | become the svelte C<[0-9]> and C<[a-z]>: |
145 | |
146 | /item[0-9]/; # matches 'item0' or ... or 'item9' |
147 | /[0-9a-fA-F]/; # matches a hexadecimal digit |
148 | |
149 | If C<'-'> is the first or last character in a character class, it is |
150 | treated as an ordinary character. |
151 | |
152 | The special character C<^> in the first position of a character class |
153 | denotes a B<negated character class>, which matches any character but |
154 | those in the bracket. Both C<[...]> and C<[^...]> must match a |
155 | character, or the match fails. Then |
156 | |
157 | /[^a]at/; # doesn't match 'aat' or 'at', but matches |
158 | # all other 'bat', 'cat', '0at', '%at', etc. |
159 | /[^0-9]/; # matches a non-numeric character |
160 | /[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary |
161 | |
162 | Perl has several abbreviations for common character classes: |
163 | |
164 | =over 4 |
165 | |
166 | =item * |
167 | \d is a digit and represents [0-9] |
168 | |
169 | =item * |
170 | \s is a whitespace character and represents [\ \t\r\n\f] |
171 | |
172 | =item * |
173 | \w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_] |
174 | |
175 | =item * |
176 | \D is a negated \d; it represents any character but a digit [^0-9] |
177 | |
178 | =item * |
179 | \S is a negated \s; it represents any non-whitespace character [^\s] |
180 | |
181 | =item * |
182 | \W is a negated \w; it represents any non-word character [^\w] |
183 | |
184 | =item * |
185 | The period '.' matches any character but "\n" |
186 | |
187 | =back |
188 | |
189 | The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside |
190 | of character classes. Here are some in use: |
191 | |
192 | /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format |
193 | /[\d\s]/; # matches any digit or whitespace character |
194 | /\w\W\w/; # matches a word char, followed by a |
195 | # non-word char, followed by a word char |
196 | /..rt/; # matches any two chars, followed by 'rt' |
197 | /end\./; # matches 'end.' |
198 | /end[.]/; # same thing, matches 'end.' |
199 | |
200 | The S<B<word anchor> > C<\b> matches a boundary between a word |
201 | character and a non-word character C<\w\W> or C<\W\w>: |
202 | |
203 | $x = "Housecat catenates house and cat"; |
204 | $x =~ /\bcat/; # matches cat in 'catenates' |
205 | $x =~ /cat\b/; # matches cat in 'Housecat' |
206 | $x =~ /\bcat\b/; # matches 'cat' at end of string |
207 | |
208 | In the last example, the end of the string is considered a word |
209 | boundary. |
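
As a rough sketch of how character classes, their abbreviations and the word anchor combine, the following script runs a few of the patterns from this section against made-up strings. The log line and the variable names are assumptions chosen for the example.

```perl
use strict;
use warnings;

# Hypothetical log line, used only for illustration.
my $log = "Backup at 02:15:09 copied 3 files; cat and Housecat skipped";

my $has_time  = ($log =~ /\d\d:\d\d:\d\d/)          ? 1 : 0;  # hh:mm:ss somewhere in the string
my $whole_cat = ($log =~ /\bcat\b/)                 ? 1 : 0;  # the standalone word 'cat'
my $no_whole  = ("Housecat catenates" =~ /\bcat\b/) ? 1 : 0;  # 'cat' occurs only inside longer words
my $non_digit = ("item7" =~ /item[^0-9]/)           ? 1 : 0;  # negated class needs a non-digit after 'item'
my $yes       = ("YeS" =~ /^[yY][eE][sS]$/)         ? 1 : 0;  # case-insensitive 'yes' via classes

print "$has_time $whole_cat $no_whole $non_digit $yes\n";   # prints "1 1 0 0 1"
```

The fourth test fails because C<[^0-9]> must still match one character, and the only character after C<item> is a digit.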
210 | |
211 | =head2 Matching this or that |
212 | |
213 | We can match different character strings with the B<alternation> |
214 | metacharacter C<'|'>. To match C<dog> or C<cat>, we form the regexp |
215 | C<dog|cat>. As before, perl will try to match the regexp at the |
216 | earliest possible point in the string. At each character position, |
217 | perl will first try to match the first alternative, C<dog>. If |
218 | C<dog> doesn't match, perl will then try the next alternative, C<cat>. |
219 | If C<cat> doesn't match either, then the match fails and perl moves to |
220 | the next position in the string. Some examples: |
221 | |
222 | "cats and dogs" =~ /cat|dog|bird/; # matches "cat" |
223 | "cats and dogs" =~ /dog|cat|bird/; # matches "cat" |
224 | |
225 | Even though C<dog> is the first alternative in the second regexp, |
226 | C<cat> is able to match earlier in the string. |
227 | |
228 | "cats" =~ /c|ca|cat|cats/; # matches "c" |
229 | "cats" =~ /cats|cat|ca|c/; # matches "cats" |
230 | |
231 | At a given character position, the first alternative that allows the |
232 | regexp match to succeed will be the one that matches. Here, all the |
233 | alternatives match at the first string position, so the first matches. |
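
To see which alternative wins, we can record the matched text for each regexp. This sketch goes slightly beyond what the guide has introduced so far: it assumes the C<qr//> operator to store regexps in variables and the special variable C<$&>, which holds the text matched by the last successful match.

```perl
use strict;
use warnings;

my $text = "cats and dogs";

my @matched;
for my $re (qr/c|ca|cat|cats/, qr/cats|cat|ca|c/, qr/dog|cat|bird/) {
    if ($text =~ $re) {
        push @matched, $&;    # text matched by the last successful match
    }
}

print join(", ", @matched), "\n";   # prints "c, cats, cat"
```

All three regexps match at the first string position; only the order of the alternatives decides how much of C<cats> is consumed.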
234 | |
235 | =head2 Grouping things and hierarchical matching |
236 | |
237 | The B<grouping> metacharacters C<()> allow a part of a regexp to be |
238 | treated as a single unit. Parts of a regexp are grouped by enclosing |
239 | them in parentheses. The regexp C<house(cat|keeper)> means match |
240 | C<house> followed by either C<cat> or C<keeper>. Some more examples |
241 | are |
242 | |
243 | /(a|b)b/; # matches 'ab' or 'bb' |
244 | /(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere |
245 | |
246 | /house(cat|)/; # matches either 'housecat' or 'house' |
247 | /house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or |
248 | # 'house'. Note groups can be nested. |
249 | |
250 | "20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d', |
251 | # because '20\d\d' can't match |
252 | |
253 | =head2 Extracting matches |
254 | |
255 | The grouping metacharacters C<()> also allow the extraction of the |
256 | parts of a string that matched. For each grouping, the part that |
257 | matched inside goes into the special variables C<$1>, C<$2>, etc. |
258 | They can be used just as ordinary variables: |
259 | |
260 | # extract hours, minutes, seconds |
261 | $time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format |
262 | $hours = $1; |
263 | $minutes = $2; |
264 | $seconds = $3; |
265 | |
266 | In list context, a match C</regexp/> with groupings will return the |
267 | list of matched values C<($1,$2,...)>. So we could rewrite it as |
268 | |
269 | ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/); |
270 | |
271 | If the groupings in a regexp are nested, C<$1> gets the group with the |
272 | leftmost opening parenthesis, C<$2> the next opening parenthesis, |
273 | etc. For example, here is a complex regexp and the matching variables |
274 | indicated below it: |
275 | |
276 | /(ab(cd|ef)((gi)|j))/; |
277 |  1  2      34 |
278 | |
279 | Associated with the matching variables C<$1>, C<$2>, ... are |
280 | the B<backreferences> C<\1>, C<\2>, ... Backreferences are |
281 | matching variables that can be used I<inside> a regexp: |
282 | |
283 | /(\w\w\w)\s\1/; # find sequences like 'the the' in string |
284 | |
285 | C<$1>, C<$2>, ... should only be used outside of a regexp, and C<\1>, |
286 | C<\2>, ... only inside a regexp. |
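
The following sketch pulls the extraction rules together: a list-context match, the numbering of nested groups, and a backreference. The input strings and variable names are invented for the example, and the C<\b> anchors in the last regexp are an addition so that C<is> inside C<This is> does not count as a doubled word.

```perl
use strict;
use warnings;

# Hypothetical time string in hh:mm:ss format.
my $time = "12:34:56";
my ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

# Nested groups are numbered by the position of their opening parenthesis.
my @nested;
if ("abcdgi" =~ /(ab(cd|ef)((gi)|j))/) {
    @nested = ($1, $2, $3, $4);
}

# A backreference \1 must match the same text that group 1 matched.
my ($doubled) = ("This is the the end" =~ /\b(\w+)\s+\1\b/);

print "$hours $minutes $seconds\n";   # prints "12 34 56"
print join("|", @nested), "\n";       # prints "abcdgi|cd|gi|gi"
print "$doubled\n";                   # prints "the"
```

Copying C<$1>, C<$2>, ... into ordinary variables right after the match, as done for C<@nested>, avoids surprises when a later match overwrites them.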
287 | |
288 | =head2 Matching repetitions |
289 | |
290 | The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us |
291 | to determine the number of repeats of a portion of a regexp we |
292 | consider to be a match. Quantifiers are put immediately after the |
293 | character, character class, or grouping that we want to specify. They |
294 | have the following meanings: |
295 | |
296 | =over 4 |
297 | |
298 | =item * C<a?> = match 'a' 1 or 0 times |
299 | |
300 | =item * C<a*> = match 'a' 0 or more times, i.e., any number of times |
301 | |
302 | =item * C<a+> = match 'a' 1 or more times, i.e., at least once |
303 | |
304 | =item * C<a{n,m}> = match at least C<n> times, but not more than C<m> |
305 | times. |
306 | |
307 | =item * C<a{n,}> = match at least C<n> times |
308 | |
309 | =item * C<a{n}> = match exactly C<n> times |
310 | |
311 | =back |
312 | |
313 | Here are some examples: |
314 | |
315 | /[a-z]+\s+\d*/; # match a lowercase word, at least some space, and |
316 | # any number of digits |
317 | /(\w+)\s+\1/; # match doubled words of arbitrary length |
318 | $year =~ /\d{2,4}/; # make sure year is at least 2 but not more |
319 | # than 4 digits |
320 | $year =~ /\d{4}|\d{2}/; # better match; throw out 3 digit dates |
321 | |
322 | These quantifiers will try to match as much of the string as possible, |
323 | while still allowing the regexp to match. With C<$x = 'the cat in the hat'>, we have |
324 | |
325 | $x =~ /^(.*)(at)(.*)$/; # matches, |
326 | # $1 = 'the cat in the h' |
327 | # $2 = 'at' |
328 | # $3 = '' (0 matches) |
329 | |
330 | The first quantifier C<.*> grabs as much of the string as possible |
331 | while still having the regexp match. The second quantifier C<.*> has |
332 | no string left to it, so it matches 0 times. |
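
A small script makes the greedy behaviour visible. This is a sketch under assumptions: the list of candidate years is made up, and the anchors C<^> and C<$> are added to the year regexp so that the whole string has to be two or four digits.

```perl
use strict;
use warnings;

my $x = 'the cat in the hat';

# The first .* takes as much as it can while still letting 'at' match.
my ($before, $at, $after) = ($x =~ /^(.*)(at)(.*)$/);

# Hypothetical candidate years; keep only those with exactly 2 or 4 digits.
my @years = grep { /^(\d{4}|\d{2})$/ } qw(99 1999 199 20001);

print "[$before] [$at] [$after]\n";   # prints "[the cat in the h] [at] []"
print "@years\n";                     # prints "99 1999"
```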
333 | |
334 | =head2 More matching |
335 | |
336 | There are a few more things you might want to know about matching |
337 | operators. In the code |
338 | |
339 | $pattern = 'Seuss'; |
340 | while (<>) { |
341 | print if /$pattern/; |
342 | } |
343 | |
344 | perl has to re-evaluate C<$pattern> each time through the loop. If |
345 | C<$pattern> won't be changing, use the C<//o> modifier, to only |
346 | perform variable substitutions once. If you don't want any |
347 | substitutions at all, use the special delimiter C<m''>: |
348 | |
349 | $pattern = 'Seuss'; |
350 | m'$pattern'; # matches '$pattern', not 'Seuss' |
351 | |
352 | The global modifier C<//g> allows the matching operator to match |
353 | within a string as many times as possible. In scalar context, |
354 | successive matches against a string will have C<//g> jump from match |
355 | to match, keeping track of position in the string as it goes along. |
356 | You can get or set the position with the C<pos()> function. |
357 | For example, |
358 | |
359 | $x = "cat dog house"; # 3 words |
360 | while ($x =~ /(\w+)/g) { |
361 | print "Word is $1, ends at position ", pos $x, "\n"; |
362 | } |
363 | |
364 | prints |
365 | |
366 | Word is cat, ends at position 3 |
367 | Word is dog, ends at position 7 |
368 | Word is house, ends at position 13 |
369 | |
370 | A failed match or changing the target string resets the position. If |
371 | you don't want the position reset after failure to match, add the |
372 | C<//c> modifier, as in C</regexp/gc>. |
373 | |
374 | In list context, C<//g> returns a list of matched groupings, or if |
375 | there are no groupings, a list of matches to the whole regexp. So |
376 | |
377 | @words = ($x =~ /(\w+)/g); # matches, |
378 | # $words[0] = 'cat' |
379 | # $words[1] = 'dog' |
380 | # $words[2] = 'house' |
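
Both uses of C<//g> can be compared side by side. The sketch below assumes an extra made-up string of C<key=value> pairs to show that, with two groupings, list context returns the groups of every match in order.

```perl
use strict;
use warnings;

my $x = "cat dog house";

# List context: all matches at once.
my @words = ($x =~ /(\w+)/g);

# Scalar context: one match per iteration, pos() reports where it ended.
my @ends;
while ($x =~ /\w+/g) {
    push @ends, pos($x);
}

# With two groupings, the list alternates between them (hypothetical input).
my %pairs = ("a=1 b=2" =~ /(\w+)=(\d+)/g);

print "@words\n";                     # prints "cat dog house"
print "@ends\n";                      # prints "3 7 13"
print "$pairs{a} $pairs{b}\n";        # prints "1 2"
```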
381 | |
382 | =head2 Search and replace |
383 | |
384 | Search and replace is performed using C<s/regexp/replacement/modifiers>. |
385 | The C<replacement> is a Perl double quoted string that replaces in the |
386 | string whatever is matched with the C<regexp>. The operator C<=~> is |
387 | also used here to associate a string with C<s///>. If matching |
388 | against C<$_>, the S<C<$_ =~> > can be dropped. If there is a match, |
389 | C<s///> returns the number of substitutions made, otherwise it returns |
390 | false. Here are a few examples: |
391 | |
392 | $x = "Time to feed the cat!"; |
393 | $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!" |
394 | $y = "'quoted words'"; |
395 | $y =~ s/^'(.*)'$/$1/; # strip single quotes, |
396 | # $y contains "quoted words" |
397 | |
398 | With the C<s///> operator, the matched variables C<$1>, C<$2>, etc. |
399 | are immediately available for use in the replacement expression. With |
400 | the global modifier, C<s///g> will search and replace all occurrences |
401 | of the regexp in the string: |
402 | |
403 | $x = "I batted 4 for 4"; |
404 | $x =~ s/4/four/; # $x contains "I batted four for 4" |
405 | $x = "I batted 4 for 4"; |
406 | $x =~ s/4/four/g; # $x contains "I batted four for four" |
407 | |
408 | The evaluation modifier C<s///e> wraps an C<eval{...}> around the |
409 | replacement string and the evaluated result is substituted for the |
410 | matched substring. This counts character frequencies in a line: |
411 | |
412 | $x = "the cat"; |
413 | $x =~ s/(.)/$chars{$1}++;$1/eg; # final $1 replaces char with itself |
414 | print "frequency of '$_' is $chars{$_}\n" |
415 | foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars); |
416 | |
417 | This prints the following (entries with equal counts may appear in any order): |
418 | |
419 | frequency of 't' is 2 |
420 | frequency of 'e' is 1 |
421 | frequency of ' ' is 1 |
422 | frequency of 'h' is 1 |
423 | frequency of 'a' is 1 |
424 | frequency of 'c' is 1 |
425 | |
426 | C<s///> can use other delimiters, such as C<s!!!> and C<s{}{}>, and |
427 | even C<s{}//>. If single quotes are used C<s'''>, then the regexp and |
428 | replacement are treated as single quoted strings. |
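
Here is a brief sketch of the return value of C<s///> and of the C<//g> and C<//e> modifiers. The second string and the variable names are assumptions made for the example; copying with C<(my $copy = $x) =~ s/.../.../> is a common idiom for substituting in a copy while leaving the original untouched.

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

(my $first_only = $x) =~ s/4/four/;    # substitute in a copy, first match only
(my $all        = $x) =~ s/4/four/g;   # substitute in a copy, every match

my $count = ($x =~ s/4/four/g);        # modifies $x, returns number of substitutions

# Hypothetical string; //e evaluates the replacement as Perl code.
my $y = "3 apples and 5 pears";
$y =~ s/(\d+)/$1 * 2/eg;

print "$first_only\n";   # prints "I batted four for 4"
print "$all\n";          # prints "I batted four for four"
print "$count\n";        # prints "2"
print "$y\n";            # prints "6 apples and 10 pears"
```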
429 | |
430 | =head2 The split operator |
431 | |
432 | C<split /regexp/, string> splits C<string> into a list of substrings |
433 | and returns that list. The regexp determines the character sequence |
434 | that C<string> is split with respect to. For example, to split a |
435 | string into words, use |
436 | |
437 | $x = "Calvin and Hobbes"; |
438 | @words = split /\s+/, $x; # $words[0] = 'Calvin' |
439 | # $words[1] = 'and' |
440 | # $words[2] = 'Hobbes' |
441 | |
442 | If the empty regexp C<//> is used, the string is split into individual |
443 | characters. If the regexp has groupings, then the list produced contains |
444 | the matched substrings from the groupings as well: |
445 | |
446 | $x = "/usr/bin"; |
447 | @parts = split m!(/)!, $x; # $parts[0] = '' |
448 | # $parts[1] = '/' |
449 | # $parts[2] = 'usr' |
450 | # $parts[3] = '/' |
451 | # $parts[4] = 'bin' |
452 | |
453 | Since the first character of C<$x> matched the regexp, C<split> prepended |
454 | an empty initial element to the list. |
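
The three forms of C<split> described above can be compared in one short script. This is a sketch only; the variable names and the bracketed output format are chosen for illustration.

```perl
use strict;
use warnings;

my @words = split /\s+/, "Calvin and Hobbes";   # split on runs of whitespace
my @chars = split //, "abc";                    # empty regexp: individual characters
my @parts = split m!(/)!, "/usr/bin";           # grouping: separators are kept in the list
my @plain = split m!/!, "/usr/bin";             # no grouping: separators are dropped

print join(",", map { "[$_]" } @words), "\n";   # prints "[Calvin],[and],[Hobbes]"
print join(",", map { "[$_]" } @chars), "\n";   # prints "[a],[b],[c]"
print join(",", map { "[$_]" } @parts), "\n";   # prints "[],[/],[usr],[/],[bin]"
print join(",", map { "[$_]" } @plain), "\n";   # prints "[],[usr],[bin]"
```

In both of the last two cases the leading C</> produces an empty first element.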
455 | |
456 | =head1 BUGS |
457 | |
458 | None. |
459 | |
460 | =head1 SEE ALSO |
461 | |
462 | This is just a quick start guide. For a more in-depth tutorial on |
463 | regexps, see L<perlretut> and for the reference page, see L<perlre>. |
464 | |
465 | =head1 AUTHOR AND COPYRIGHT |
466 | |
467 | Copyright (c) 2000 Mark Kvale |
468 | All rights reserved. |
469 | |
470 | This document may be distributed under the same terms as Perl itself. |
471 | |
472 | =cut |
473 | |