=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexps') in Perl.
=head2 Simple word matching

The simplest regexp is simply a word, or more generally, a string of
characters.  A regexp consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regexp and the C<//> enclosing
C</World/> tells perl to search a string for a match.  The operator
C<=~> associates the string with the regexp match and produces a true
value if the regexp matched, or false if the regexp did not match.  In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true.  This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regexp can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

Regexps must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'
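
As a small sketch of the ideas so far, the following checks a match with
C<=~>, a non-match with C<!~>, and uses Perl's built-in C<@-> array to see
where the earliest match started.  The variable names are ours, chosen
only for illustration.

```perl
use strict;
use warnings;

my $text = "That hat is red";

# =~ is true when the regexp matches; !~ is true when it does not
my $has_hat = ($text =~ /hat/) ? 1 : 0;
my $no_cat  = ($text !~ /cat/) ? 1 : 0;

# After a successful match, $-[0] holds the offset where the match began
my $start;
$start = $-[0] if $text =~ /hat/;

print "has_hat=$has_hat no_cat=$no_cat start=$start\n";
```

The reported offset is 1 rather than 5, because the 'hat' inside 'That'
is found before the separate word 'hat'.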

Not all characters can be used 'as is' in a match.  Some characters,
called B<metacharacters>, are reserved for use in regexp notation.
The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regexp, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regexp.

Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return.  Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);      # matches
    "cat" =~ /\143\x61\x74/;      # matches, but a weird way to spell cat

Regexps are treated mostly as double quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches
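
Because an interpolated variable is treated as part of the regexp, any
metacharacters it contains keep their special meaning.  The sketch below
shows one way to handle that, using the standard C<\Q...\E> escape and the
built-in C<quotemeta> function; neither is discussed elsewhere on this
page, and the variable names are illustrative only.

```perl
use strict;
use warnings;

my $needle = '2+2';     # contains the metacharacter '+'
my $string = '2+2=4';

# Interpolated as-is, '+' acts as a quantifier, so the regexp does not match
my $raw = ($string =~ /$needle/) ? 1 : 0;

# \Q...\E backslashes the metacharacters in the interpolated text
my $escaped = ($string =~ /\Q$needle\E/) ? 1 : 0;

# quotemeta() applies the same escaping as a function call
my $quoted = quotemeta($needle);

print "raw=$raw escaped=$escaped quoted=$quoted\n";
```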

With all of the regexps above, if the regexp matched anywhere in the
string, it was considered a match.  To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>.  The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string.  Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
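
Using both anchors together requires the regexp to account for the whole
string.  The following sketch, with an input list made up for
illustration, filters for strings that consist of C<keeper> and nothing
else, apart from the trailing newline that C<$> tolerates.

```perl
use strict;
use warnings;

my @inputs = ("keeper", "housekeeper", "keeper\n", "keepers");

# ^ pins the match to the start, $ to the end (or just before a final newline)
my @exact = grep { /^keeper$/ } @inputs;

print scalar(@exact), " of ", scalar(@inputs), " inputs matched\n";
```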

=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regexp.
Character classes are denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside.  Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regexp can match is C<'a'>.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different than those outside a character class.  The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the bracket.  Both C<[...]> and C<[^...]> must match a
character, or the match fails.  Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes:

=over 4

=item * \d is a digit and represents [0-9]

=item * \s is a whitespace character and represents [\ \t\r\n\f]

=item * \w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]

=item * \D is a negated \d; it represents any character but a digit [^0-9]

=item * \S is a negated \s; it represents any non-whitespace character [^\s]

=item * \W is a negated \w; it represents any non-word character [^\w]

=item * The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes.  Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'
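
Character classes are often combined with anchors to check the shape of
a whole value.  The sketch below wraps two of the regexps above in small
helper subroutines; C<looks_like_time> and C<is_hex_digit> are
hypothetical names introduced here, not part of Perl.

```perl
use strict;
use warnings;

# True if the string has the hh:mm:ss shape (digits only, no range check)
sub looks_like_time {
    my ($s) = @_;
    return $s =~ /^\d\d:\d\d:\d\d$/ ? 1 : 0;
}

# True if the string is a single hexadecimal digit
sub is_hex_digit {
    my ($s) = @_;
    return $s =~ /^[0-9a-fA-F]$/ ? 1 : 0;
}

for my $candidate ("12:34:56", "1:23:45") {
    print "$candidate: ", looks_like_time($candidate), "\n";
}
for my $candidate ("F", "g") {
    print "$candidate: ", is_hex_digit($candidate), "\n";
}
```

Note that C<\d\d> only checks that two digits are present, so a value
such as C<99:99:99> would also pass the first check.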

The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.
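
To make the effect of C<\b> concrete, this sketch counts how often each
of the three regexps above can match in the same string.  It relies on
the C<//g> modifier in list context, which is described later on this
page; the array names are ours.

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

my @word_start = ($x =~ /\bcat/g);    # 'cat' beginning a word
my @word_end   = ($x =~ /cat\b/g);    # 'cat' ending a word
my @whole_word = ($x =~ /\bcat\b/g);  # 'cat' as a complete word

printf "start=%d end=%d whole=%d\n",
    scalar(@word_start), scalar(@word_end), scalar(@whole_word);
```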

=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>.  To match C<dog> or C<cat>, we form the regexp
C<dog|cat>.  As before, perl will try to match the regexp at the
earliest possible point in the string.  At each character position,
perl will first try to match the first alternative, C<dog>.  If
C<dog> doesn't match, perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and perl moves to
the next position in the string.  Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regexp,
C<cat> is able to match earlier in the string.

    "cats" =~ /c|ca|cat|cats/; # matches "c"
    "cats" =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regexp match to succeed will be the one that matches.  Here, all the
alternatives match at the first string position, so the first matches.
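
One way to see which alternative actually matched is to wrap the
alternation in capturing parentheses, which are covered under
L</Extracting matches>.  This is only an illustrative sketch with
variable names of our own.

```perl
use strict;
use warnings;

# In list context a match returns the text captured by the parentheses
my ($short_first) = "cats" =~ /(c|ca|cat|cats)/;
my ($long_first)  = "cats" =~ /(cats|cat|ca|c)/;

# Position in the string wins over position in the list of alternatives
my ($earliest) = "cats and dogs" =~ /(dog|cat|bird)/;

print "short_first=$short_first long_first=$long_first earliest=$earliest\n";
```

Listing longer alternatives first is a common way to avoid an early,
shorter alternative hiding them.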

=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regexp to be
treated as a single unit.  Parts of a regexp are grouped by enclosing
them in parentheses.  The regexp C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>.  Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match
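
The nested-group and null-alternative examples can be tried out as
follows.  The word list and variable names are invented for this sketch,
and the anchors C<^> and C<$> are added so that only whole words count.

```perl
use strict;
use warnings;

my @candidates = qw(house housecat housecats houseboat);

# Anchored version of the nested grouping shown above
my @accepted = grep { /^house(cat(s|)|)$/ } @candidates;

# With only two digits available, the empty alternative is the one that
# lets \d\d succeed, so the group captures an empty string
my ($century) = "20" =~ /(19|20|)\d\d/;

print "accepted: @accepted\n";
print "century is '", $century, "'\n";
```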

=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched.  For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    ($hours, $minutes, $seconds) = ($1, $2, $3);

In list context, a match C</regexp/> with groupings will return the
list of matched values C<($1,$2,...)>.  So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regexp are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc.  For example, here is a complex regexp and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\1>, C<\2>, ...  Backreferences are
matching variables that can be used I<inside> a regexp:

    /(\w\w\w)\s\1/; # find sequences like 'the the' in string

C<$1>, C<$2>, ... should only be used outside of a regexp, and C<\1>,
C<\2>, ... only inside a regexp.
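
The following sketch pulls these pieces together: extraction in list
context, the numbering of nested groups, and a backreference.  The
sample strings and variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# List-context match returns the captured groups in order
my $time = "11:45:07";
my ($hours, $minutes, $seconds) = $time =~ /(\d\d):(\d\d):(\d\d)/;

# Nested groups are numbered by the position of their opening parenthesis
my @groups = "abcdgi" =~ /(ab(cd|ef)((gi)|j))/;

# \1 inside the regexp refers back to whatever the first group matched
my ($doubled) = "Paris in the the spring" =~ /\b(\w+)\s+\1\b/;

print "time parts: $hours $minutes $seconds\n";
print "groups: @groups\n";
print "doubled word: $doubled\n";
```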

=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regexp we
consider to be a match.  Quantifiers are put immediately after the
character, character class, or grouping that we want to specify.  They
have the following meanings:

=over 4

=item * C<a?> = match 'a' 1 or 0 times

=item * C<a*> = match 'a' 0 or more times, i.e., any number of times

=item * C<a+> = match 'a' 1 or more times, i.e., at least once

=item * C<a{n,m}> = match at least C<n> times, but not more than C<m>
times

=item * C<a{n,}> = match at least C<n> or more times

=item * C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    $year =~ /\d{2,4}/;  # make sure year is at least 2 but not more
                         # than 4 digits
    $year =~ /\d{4}|\d{2}/;    # better match; throw out 3 digit dates

These quantifiers will try to match as much of the string as possible,
while still allowing the regexp to match.  So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regexp match. The second quantifier C<.*> has
no string left to it, so it matches 0 times.
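
The greedy behaviour can be checked directly by capturing each part.
The sketch also shows an anchored variant of the year regexp from the
examples above; the anchors, the input list and the variable names are
additions made for illustration.

```perl
use strict;
use warnings;

my $x = 'the cat in the hat';

# The first .* takes as much as it can while still leaving an 'at' to match
my ($before, $at, $after) = $x =~ /^(.*)(at)(.*)$/;

# Anchored so the whole value must be exactly 4 or exactly 2 digits
my @years = grep { /^(\d{4}|\d{2})$/ } qw(99 1999 199 19999);

print "before='$before' at='$at' after='$after'\n";
print "accepted years: @years\n";
```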

=head2 More matching

There are a few more things you might want to know about matching
operators.  In a loop such as
C<< while (<>) { print if /$pattern/ } >>,
perl has to re-evaluate C<$pattern> each time through the loop.  If
C<$pattern> won't be changing, use the C<//o> modifier, to only
perform variable substitutions once.  If you don't want any
substitutions at all, use the special delimiter C<m''>:

    @pattern = ('Seuss');
    m'@pattern'; # matches the literal string '@pattern', not 'Seuss'

The global modifier C<//g> allows the matching operator to match
within a string as many times as possible.  In scalar context,
successive matches against a string will have C<//g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position.  If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regexp/gc>.

In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regexp.  So

    @words = ($x =~ /(\w+)/g);  # @words is ('cat', 'dog', 'house')
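
The two contexts can be compared side by side.  In this sketch the
scalar-context loop records each word together with the value of
C<pos()>, and the list-context match collects all the words at once; the
array names are ours.

```perl
use strict;
use warnings;

my $x = "cat dog house";

# Scalar context: each iteration resumes where the previous match ended
my @ends;
while ($x =~ /(\w+)/g) {
    push @ends, join(':', $1, pos($x));
}

# List context: all matches of the group are returned in one go
my @words = ($x =~ /(\w+)/g);

print "ends: @ends\n";
print "words: @words\n";
```

The second match starts from the beginning of the string because the
failed match that ended the loop reset the position.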

=head2 Search and replace

Search and replace is performed using C<s/regexp/replacement/modifiers>.
The C<replacement> is a Perl double quoted string that replaces in the
string whatever is matched with the C<regexp>.  The operator C<=~> is
also used here to associate a string with C<s///>.  If matching
against C<$_>, the S<C<$_ =~> > can be dropped.  If there is a match,
C<s///> returns the number of substitutions made, otherwise it returns
false.  Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression. With
the global modifier, C<s///g> will search and replace all occurrences
of the regexp in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring.  This counts character frequencies in a line:

    $x = "the cat";
    $x =~ s/(.)/$chars{$1}++;$1/eg;  # final $1 replaces char with itself
    print "frequency of '$_' is $chars{$_}\n"
        foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);

This prints output like the following; characters with equal counts may
appear in a different order:

    frequency of 't' is 2
    frequency of 'e' is 1
    frequency of ' ' is 1
    frequency of 'h' is 1
    frequency of 'a' is 1
    frequency of 'c' is 1

C<s///> can use other delimiters, such as C<s!!!> and C<s{}{}>, and
even C<s{}//>.  If single quotes are used C<s'''>, then the regexp and
replacement are treated as single quoted strings.
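
The return value of C<s///> and the C<//e> modifier can be seen in this
short sketch.  The sample strings and variable names are made up for the
example.

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

# With //g the return value is the number of replacements made
my $count = ($x =~ s/4/four/g);

# With no match the string is left alone and a false value is returned
my $none = ($x =~ s/cat/dog/);

# //e evaluates the replacement as Perl code; here each number is doubled.
# Copying into $doubled first leaves $y itself unchanged.
my $y = "3 apples and 5 pears";
(my $doubled = $y) =~ s/(\d+)/$1 * 2/eg;

print "count=$count x='$x'\n";
print "doubled='$doubled'\n";
```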

=head2 The split operator

C<split /regexp/, string> splits C<string> into a list of substrings
and returns that list.  The regexp determines the character sequence
that C<string> is split with respect to.  For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @words = split /\s+/, $x;  # $words[0] = 'Calvin'
                               # $words[1] = 'and'
                               # $words[2] = 'Hobbes'

If the empty regexp C<//> is used, the string is split into individual
characters.  If the regexp has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # @parts is ('', '/', 'usr', '/', 'bin')

Since the first character of C<$x> matched the regexp, C<split> prepended
an empty initial element to the list.
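
The three forms of C<split> described above can be compared in one short
sketch; the variable names and the C<"abc"> sample are ours.

```perl
use strict;
use warnings;

# Split on runs of whitespace
my @words = split /\s+/, "Calvin and Hobbes";

# An empty regexp splits into individual characters
my @chars = split //, "abc";

# A grouping in the regexp keeps the separators in the result
my @parts = split m!(/)!, "/usr/bin";

print scalar(@words), " words, ", scalar(@chars), " chars, ",
      scalar(@parts), " parts\n";
```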

=head1 SEE ALSO

This is just a quick start guide.  For a more in-depth tutorial on
regexps, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale

This document may be distributed under the same terms as Perl itself.