=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexps') in Perl.

=head1 The Guide

=head2 Simple word matching

The simplest regexp is simply a word, or more generally, a string of
characters. A regexp consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regexp and the C<//> enclosing
C</World/> tells perl to search a string for a match. The operator
C<=~> associates the string with the regexp match and produces a true
value if the regexp matched, or false if the regexp did not match. In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true. This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regexp can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;    # matches, delimited by '!'
    "Hello World" =~ m{World};    # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl";  # matches after '/usr/bin',
                                  # '/' becomes an ordinary char

Regexps must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;   # doesn't match, case sensitive
    "Hello World" =~ /o W/;     # matches, ' ' is an ordinary char
    "Hello World" =~ /World /;  # doesn't match, no ' ' at end

perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;        # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/;  # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match. Some characters,
called B<metacharacters>, are reserved for use in regexp notation.
The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;                       # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;                      # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regexp, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regexp.

Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return. Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);  # matches
    "cat" =~ /\143\x61\x74/;  # matches, but a weird way to spell cat

Regexps are treated mostly as double quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;    # matches
    'housecat' =~ /${foo}cat/;  # matches

With all of the regexps above, if the regexp matched anywhere in the
string, it was considered a match. To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>. The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string. Some examples:

    "housekeeper" =~ /keeper/;     # matches
    "housekeeper" =~ /^keeper/;    # doesn't match
    "housekeeper" =~ /keeper$/;    # matches
    "housekeeper\n" =~ /keeper$/;  # matches

=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regexp.
Character classes are denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside. Here are some examples:

    /cat/;             # matches 'cat'
    /[bcr]at/;         # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/;  # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regexp can match is C<'a'>.

    /[yY][eE][sS]/;  # match 'yes' in a case-insensitive way
                     # 'yes', 'Yes', 'YES', etc.
    /yes/i;          # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different from those outside a character class. The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/;  # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;    # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;   # matches '$at' or 'xat'
    /[\\$x]at/;  # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

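The rule about a leading or trailing C<'-'> is easy to check for yourself. The sketch below is only an illustration: the test strings and the variable names are made up for this example, and the patterns are anchored with C<^> and C<$> so that each one-character string is tested as a whole.

```perl
use strict;
use warnings;

# '-' between two characters forms a range; first or last it is literal.
my @results;
for my $str ('b', '-', 'c') {
    my $range   = ($str =~ /^[a-c]$/) ? 1 : 0;  # the range a..c
    my $literal = ($str =~ /^[ac-]$/) ? 1 : 0;  # 'a', 'c', or a literal '-'
    push @results, "$str:$range$literal";
    print "'$str': [a-c] ", ($range   ? "matches" : "fails"),
          ", [ac-] ",       ($literal ? "matches" : "fails"), "\n";
}
```

Here C<'b'> is matched only by the range, C<'-'> only by the class with the trailing hyphen, and C<'c'> by both.
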
The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes:

=over 4

=item *
\d is a digit and represents [0-9]

=item *
\s is a whitespace character and represents [\ \t\r\n\f]

=item *
\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]

=item *
\D is a negated \d; it represents any character but a digit [^0-9]

=item *
\S is a negated \s; it represents any non-whitespace character [^\s]

=item *
\W is a negated \w; it represents any non-word character [^\w]

=item *
The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes. Here are some in use:

    /\d\d:\d\d:\d\d/;  # matches a hh:mm:ss time format
    /[\d\s]/;          # matches any digit or whitespace character
    /\w\W\w/;          # matches a word char, followed by a
                       # non-word char, followed by a word char
    /..rt/;            # matches any two chars, followed by 'rt'
    /end\./;           # matches 'end.'
    /end[.]/;          # same thing, matches 'end.'

The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>. To match C<dog> or C<cat>, we form the regexp
C<dog|cat>. As before, perl will try to match the regexp at the
earliest possible point in the string. At each character position,
perl will first try to match the first alternative, C<dog>. If
C<dog> doesn't match, perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and perl moves to
the next position in the string. Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regexp,
C<cat> is able to match earlier in the string.

    "cats" =~ /c|ca|cat|cats/;  # matches "c"
    "cats" =~ /cats|cat|ca|c/;  # matches "cats"

At a given character position, the first alternative that allows the
regexp match to succeed will be the one that matches. Here, all the
alternatives match at the first string position, so the first matches.

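To see which alternative actually matched, one option is to print the matched text. The sketch below relies on the built-in variable C<$&> (the text of the last successful match), which this guide does not otherwise cover; the array name is just for illustration.

```perl
use strict;
use warnings;

my @matched;

# All alternatives can match at position 0, so the first one listed wins.
if ("cats" =~ /c|ca|cat|cats/) { push @matched, $& }
if ("cats" =~ /cats|cat|ca|c/) { push @matched, $& }

# 'dog' is listed first, but 'cat' can match earlier in the string.
if ("cats and dogs" =~ /dog|cat|bird/) { push @matched, $& }

print join(", ", @matched), "\n";
```

This prints C<c, cats, cat>: the order of the alternatives decides between candidates at the same position, while the position in the string takes priority overall.
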
=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regexp to be
treated as a single unit. Parts of a regexp are grouped by enclosing
them in parentheses. The regexp C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>. Some more examples
are

    /(a|b)b/;   # matches 'ab' or 'bb'
    /(^a|b)c/;  # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;      # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'. Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched. For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match C</regexp/> with groupings will return the
list of matched values C<($1,$2,...)>. So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regexp are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc. For example, here is a complex regexp and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

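A quick way to convince yourself of the numbering is to run the nested regexp against a sample string. The string C<"abcdj"> below is an arbitrary choice for illustration; note that a group that took no part in the match is left undefined.

```perl
use strict;
use warnings;

my @groups;
if ("abcdj" =~ /(ab(cd|ef)((gi)|j))/) {
    # $4 is undefined here because the (gi) branch was not used
    @groups = ($1, $2, $3, defined $4 ? $4 : '(undef)');
    print join(", ", @groups), "\n";
}
```

With this input, C<$1> holds the whole of C<abcdj>, C<$2> holds C<cd>, C<$3> holds C<j>, and C<$4> is undefined.
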
Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\1>, C<\2>, ... Backreferences are
matching variables that can be used I<inside> a regexp:

    /(\w\w\w)\s\1/;  # find sequences like 'the the' in string

C<$1>, C<$2>, ... should only be used outside of a regexp, and C<\1>,
C<\2>, ... only inside a regexp.

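As a small sketch of the backreference in action, the loop below applies the regexp above to two made-up strings. It also tries a variant wrapped in C<\b> word anchors, which is an addition for illustration: without them the three letters may be found inside longer words.

```perl
use strict;
use warnings;

my @found;
for my $text ("Paris in the the spring", "bathe theme") {
    # plain backreference: any three word chars, a space, the same three again
    my $loose  = ($text =~ /(\w\w\w)\s\1/)     ? $1 : 'none';
    # with word anchors: both copies must be whole words
    my $strict = ($text =~ /\b(\w\w\w)\s\1\b/) ? $1 : 'none';
    push @found, "$loose/$strict";
    print "$text: loose=$loose strict=$strict\n";
}
```

Both regexps find C<the> in the first string. In the second string only the unanchored regexp matches, because it picks up C<the the> from the middle of C<bathe theme>.
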
=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regexp we
consider to be a match. Quantifiers are put immediately after the
character, character class, or grouping that we want to specify. They
have the following meanings:

=over 4

=item * C<a?> = match 'a' 1 or 0 times

=item * C<a*> = match 'a' 0 or more times, i.e., any number of times

=item * C<a+> = match 'a' 1 or more times, i.e., at least once

=item * C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item * C<a{n,}> = match at least C<n> times

=item * C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    $year =~ /\d{2,4}/;      # make sure year is at least 2 but not more
                             # than 4 digits
    $year =~ /\d{4}|\d{2}/;  # better match; throw out 3 digit dates

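One caveat worth checking: the year regexps above are not anchored, so they succeed whenever enough digits occur anywhere in the string. If the whole string must be a two- or four-digit year, the C<^> and C<$> anchors are needed. The sketch below compares the two forms on a few made-up inputs; the grouping parentheses in the anchored version are there so that the anchors apply to both alternatives.

```perl
use strict;
use warnings;

my @checks;
for my $year ('99', '1999', '123', '12345') {
    my $loose    = ($year =~ /\d{4}|\d{2}/)     ? 1 : 0;  # digits anywhere
    my $anchored = ($year =~ /^(\d{4}|\d{2})$/) ? 1 : 0;  # whole string only
    push @checks, "$year:$loose$anchored";
    print "$year: unanchored=$loose anchored=$anchored\n";
}
```

The unanchored regexp accepts all four strings, since C<123> and C<12345> both contain two digits in a row, while the anchored one accepts only C<99> and C<1999>.
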
These quantifiers will try to match as much of the string as possible,
while still allowing the regexp to match. So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/;  # matches,
                             # $1 = 'the cat in the h'
                             # $2 = 'at'
                             # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regexp match. The second quantifier C<.*> has
no string left to it, so it matches 0 times.

=head2 More matching

There are a few more things you might want to know about matching
operators. In the code

    $pattern = 'Seuss';
    while (<>) {
        print if /$pattern/;
    }

perl has to re-evaluate C<$pattern> each time through the loop. If
C<$pattern> won't be changing, use the C<//o> modifier, to only
perform variable substitutions once. If you don't want any
substitutions at all, use the special delimiter C<m''>:

    $pattern = 'Seuss';
    m'$pattern';  # matches '$pattern', not 'Seuss'

The global modifier C<//g> allows the matching operator to match
within a string as many times as possible. In scalar context,
successive matches against a string will have C<//g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house";  # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regexp/gc>.

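The difference between C<//g> and C<//gc> after a failed match can be observed with C<pos()>. This is a minimal sketch with a made-up string; since C<pos()> returns an undefined value once the position has been reset, the code substitutes the word C<undef> for printing.

```perl
use strict;
use warnings;

my $x = "cat dog";

$x =~ /\w+/g;   # matches 'cat', position is now 3
$x =~ /\d+/g;   # fails, so the position is reset
my $after_g = defined pos($x) ? pos($x) : 'undef';

$x =~ /\w+/g;   # matches 'cat' again, starting from the beginning
$x =~ /\d+/gc;  # fails, but //c keeps the position
my $after_gc = defined pos($x) ? pos($x) : 'undef';

print "after failed //g:  $after_g\n";
print "after failed //gc: $after_gc\n";
```

The first line reports C<undef> and the second reports C<3>, so a later C<//g> match would resume after C<cat> only in the C<//gc> case.
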
In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regexp. So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

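The two cases described above, no groupings and several groupings, can be sketched as follows. The sample string and variable names are only for illustration.

```perl
use strict;
use warnings;

my $line = "10 apples, 20 pears";

# No groupings: each element is a match of the whole regexp.
my @numbers = ($line =~ /\d+/g);

# Two groupings: each match contributes two elements, $1 then $2.
my @pairs = ($line =~ /(\d+)\s+(\w+)/g);

print "numbers: @numbers\n";
print "pairs:   @pairs\n";
```

Here C<@numbers> ends up with two elements and C<@pairs> with four, in the order C<10>, C<apples>, C<20>, C<pears>.
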
=head2 Search and replace

Search and replace is performed using C<s/regexp/replacement/modifiers>.
The C<replacement> is a Perl double quoted string that replaces in the
string whatever is matched with the C<regexp>. The operator C<=~> is
also used here to associate a string with C<s///>. If matching
against C<$_>, the S<C<$_ =~> > can be dropped. If there is a match,
C<s///> returns the number of substitutions made, otherwise it returns
false. Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression. With
the global modifier, C<s///g> will search and replace all occurrences
of the regexp in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

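The return value of C<s///> mentioned above is handy when you need to know whether, or how often, a replacement happened. A minimal sketch, with variable names chosen for this example:

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

my $count = ($x =~ s/4/four/g);  # number of substitutions made
my $none  = ($x =~ s/9/nine/g);  # nothing to replace: a false value

print "replaced $count time(s): $x\n";
print "no 9 found\n" unless $none;
```

After this runs, C<$count> is 2 and C<$x> contains C<I batted four for four>; the second substitution leaves C<$x> untouched and returns false.
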
The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring. This counts character frequencies in a line:

    $x = "the cat";
    $x =~ s/(.)/$chars{$1}++;$1/eg;  # final $1 replaces char with itself
    print "frequency of '$_' is $chars{$_}\n"
        foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);

This prints the following, although characters with equal counts may
appear in a different order:

    frequency of 't' is 2
    frequency of 'e' is 1
    frequency of ' ' is 1
    frequency of 'h' is 1
    frequency of 'a' is 1
    frequency of 'c' is 1

C<s///> can use other delimiters, such as C<s!!!> and C<s{}{}>, and
even C<s{}//>. If single quotes are used, as in C<s'''>, then the
regexp and replacement are treated as single quoted strings.

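Here is a short sketch of these delimiter choices. The strings and variable names are invented for illustration; the last case shows that with single quotes the replacement is taken literally, so C<$pet> is not substituted.

```perl
use strict;
use warnings;

my $moved = "/usr/bin/perl";
$moved =~ s!/usr/bin!/opt/bin!;  # no need to backslash the '/'

my $braced = "housecat";
$braced =~ s{cat}{keeper};       # paired delimiters

my $pet     = 'dog';
my $literal = "cat";
$literal =~ s'cat'$pet';         # single quotes: no variable substitution

print "$moved\n$braced\n$literal\n";
```

The three results are C</opt/bin/perl>, C<housekeeper>, and the five literal characters C<$pet>.
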
=head2 The split operator

C<split /regexp/, string> splits C<string> into a list of substrings
and returns that list. The regexp determines the character sequence
that C<string> is split with respect to. For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @words = split /\s+/, $x;  # $words[0] = 'Calvin'
                               # $words[1] = 'and'
                               # $words[2] = 'Hobbes'

If the empty regexp C<//> is used, the string is split into individual
characters. If the regexp has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of C<$x> matched the regexp, C<split> prepended
an empty initial element to the list.

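Two cases mentioned above have no example of their own: splitting with the empty regexp, and splitting on a separator without groupings. The following sketch covers both, using made-up strings and variable names.

```perl
use strict;
use warnings;

# The empty regexp splits a string into individual characters.
my @chars = split //, "regexp";
print scalar(@chars), " characters: @chars\n";

# Without groupings the separators themselves are dropped,
# but the empty initial element is still there.
my @dirs = split m!/!, "/usr/bin";
print scalar(@dirs), " elements: '", join("', '", @dirs), "'\n";
```

The first C<split> yields six one-character elements, and the second yields three elements: an empty string, C<usr>, and C<bin>.
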
=head1 BUGS

None.

=head1 SEE ALSO

This is just a quick start guide. For a more in-depth tutorial on
regexps, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=cut