=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexps') in Perl.

=head1 The Guide

=head2 Simple word matching

The simplest regexp is simply a word, or more generally, a string of
characters.  A regexp consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regexp and the C<//> enclosing
C</World/> tells perl to search a string for a match.  The operator
C<=~> associates the string with the regexp match and produces a true
value if the regexp matched, or false if the regexp did not match.  In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true.  This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regexp can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

Regexps must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match.  Some characters,
called B<metacharacters>, are reserved for use in regexp notation.
The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                       # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;         # matches

In the last regexp, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regexp.
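
When the text to be matched comes from a variable, backslashing each
metacharacter by hand is error-prone.  One option, not otherwise
covered in this guide, is the C<\Q...\E> escape, which quotes the
metacharacters in the interpolated text.  A minimal sketch, with
illustrative variable names:

    use strict;
    use warnings;

    my $sum = "2+2";    # contains the metacharacter '+'

    # Interpolated as is, '+' acts as a quantifier and the match fails.
    my $plain   = ("2+2=4" =~ /$sum/)     ? "match" : "no match";

    # Inside \Q...\E the '+' is treated as an ordinary character.
    my $escaped = ("2+2=4" =~ /\Q$sum\E/) ? "match" : "no match";

    print "unescaped: $plain\n";
    print "escaped:   $escaped\n";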

Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return.  Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);        # matches
    "cat"        =~ /\143\x61\x74/; # matches, but a weird way to spell cat

Regexps are treated mostly as double quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

With all of the regexps above, if the regexp matched anywhere in the
string, it was considered a match.  To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>.  The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string.  Some examples:

    "housekeeper" =~ /keeper/;    # matches
    "housekeeper" =~ /^keeper/;   # doesn't match
    "housekeeper" =~ /keeper$/;   # matches
    "housekeeper\n" =~ /keeper$/; # matches

=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regexp.
Character classes are denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside.  Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regexp can match is C<'a'>.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different than those outside a character class.  The special
characters for a character class are C<-]\^$> and are matched using an
escape:

   /[\]c]def/; # matches ']def' or 'cdef'
   $x = 'bcr';
   /[$x]at/;   # matches 'bat', 'cat', or 'rat'
   /[\$x]at/;  # matches '$at' or 'xat'
   /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;  # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the bracket.  Both C<[...]> and C<[^...]> must match a
character, or the match fails.  Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes:

=over 4

=item *
\d is a digit and represents [0-9]

=item *
\s is a whitespace character and represents [\ \t\r\n\f]

=item *
\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]

=item *
\D is a negated \d; it represents any character but a digit [^0-9]

=item *
\S is a negated \s; it represents any non-whitespace character [^\s]

=item *
\W is a negated \w; it represents any non-word character [^\w]

=item *
The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes.  Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'
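
As a small sketch of the abbreviations at work, the loop below sorts a
few single characters into the classes described above.  The hash
C<%kind> and the sample characters are chosen only for illustration:

    use strict;
    use warnings;

    my %kind;
    for my $ch ("7", " ", "_", "%") {
        # Test the narrowest class first: every digit is also a word char.
        $kind{$ch} = $ch =~ /\d/ ? "digit"
                   : $ch =~ /\s/ ? "whitespace"
                   : $ch =~ /\w/ ? "word character"
                   :               "other";
        print "'$ch' is classified as: $kind{$ch}\n";
    }

The order of the tests matters here, because C<\w> also matches
digits.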

The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;  # matches cat in 'catenates'
    $x =~ /cat\b/;  # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.
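
To see how much difference C<\b> makes, the following sketch counts
the matches of each variant in the same string.  It relies on the
C<//g> modifier in list context, which is described later in this
guide; the array names are illustrative only:

    use strict;
    use warnings;

    my $x = "Housecat catenates house and cat";

    my @anywhere = $x =~ /cat/g;      # Housecat, catenates, cat
    my @at_start = $x =~ /\bcat/g;    # catenates, cat
    my @at_end   = $x =~ /cat\b/g;    # Housecat, cat
    my @whole    = $x =~ /\bcat\b/g;  # cat

    printf "anywhere=%d start=%d end=%d whole=%d\n",
        scalar @anywhere, scalar @at_start, scalar @at_end, scalar @whole;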

=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>.  To match C<dog> or C<cat>, we form the regexp
C<dog|cat>.  As before, perl will try to match the regexp at the
earliest possible point in the string.  At each character position,
perl will first try to match the first alternative, C<dog>.  If
C<dog> doesn't match, perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and perl moves to
the next position in the string.  Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regexp,
C<cat> is able to match earlier in the string.

    "cats"          =~ /c|ca|cat|cats/; # matches "c"
    "cats"          =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regexp match to succeed will be the one that matches. Here, all the
alternatives match at the first string position, so the first matches.
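
These rules can be checked directly by capturing what was matched.
The sketch below wraps each alternation in parentheses, a feature
described in the following sections, so that the matched text can be
inspected; the variable names are illustrative:

    use strict;
    use warnings;

    # Same position, different order: the first alternative that works wins.
    my ($short) = "cats" =~ /(c|ca|cat|cats)/;
    my ($long)  = "cats" =~ /(cats|cat|ca|c)/;

    # Different positions: the earliest position in the string wins.
    my ($early) = "cats and dogs" =~ /(dog|cat|bird)/;

    print "short='$short' long='$long' early='$early'\n";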

=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regexp to be
treated as a single unit.  Parts of a regexp are grouped by enclosing
them in parentheses.  The regexp C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>.  Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched.  For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match C</regexp/> with groupings will return the
list of matched values C<($1,$2,...)>.  So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
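
When a match fails, C<$1>, C<$2>, ... keep the values left by the last
successful match, so it is safest to use extracted values only after
checking that the match succeeded.  The sketch below does this by
placing the list-context match inside a condition; the subroutine name
C<parse_time> is made up for the illustration:

    use strict;
    use warnings;

    # Hypothetical helper: convert 'hh:mm:ss' into a number of seconds.
    sub parse_time {
        my ($time) = @_;
        # The assignment is true only if the match returned something.
        if (my ($hours, $minutes, $seconds) =
                $time =~ /(\d\d):(\d\d):(\d\d)/) {
            return $hours * 3600 + $minutes * 60 + $seconds;
        }
        return;    # no match: undef in scalar context
    }

    my $ok  = parse_time("12:34:56");
    my $bad = parse_time("noon");
    print "12:34:56 is $ok seconds\n";
    print "'noon' was rejected\n" unless defined $bad;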

If the groupings in a regexp are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc.  For example, here is a complex regexp and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\1>, C<\2>, ...  Backreferences are
matching variables that can be used I<inside> a regexp:

    /(\w\w\w)\s\1/; # find sequences like 'the the' in string

C<$1>, C<$2>, ... should only be used outside of a regexp, and C<\1>,
C<\2>, ... only inside a regexp.
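
A short sketch showing both forms side by side: C<\1> is used inside
the regexp to demand a repeat, and C<$1> is used afterwards to report
what was repeated.  The sample text and variable names are
illustrative:

    use strict;
    use warnings;

    my $text = "Paris in the the spring";
    my $doubled;
    if ($text =~ /\b(\w\w\w)\s\1\b/) {    # \1 inside the regexp
        $doubled = $1;                    # $1 outside the regexp
        print "doubled word: $doubled\n";
    }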

=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regexp we
consider to be a match.  Quantifiers are put immediately after the
character, character class, or grouping that we want to specify.  They
have the following meanings:

=over 4

=item * C<a?> = match 'a' 1 or 0 times

=item * C<a*> = match 'a' 0 or more times, i.e., any number of times

=item * C<a+> = match 'a' 1 or more times, i.e., at least once

=item * C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item * C<a{n,}> = match at least C<n> times

=item * C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit dates
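
Quantified patterns meant to validate a whole string usually need the
anchors C<^> and C<$>, since an unanchored C<\d{2,4}> is satisfied by
any two digits anywhere in the string.  The following sketch compares
the two on some sample values chosen for illustration:

    use strict;
    use warnings;

    my @candidates = ("99", "1999", "199", "19999", "in 1999");

    # Unanchored: succeeds whenever two digits appear somewhere.
    my @loose = grep { /\d{2,4}/ } @candidates;

    # Anchored: the whole string must be exactly 4 or exactly 2 digits.
    my @valid = grep { /^\d{4}$|^\d{2}$/ } @candidates;

    print scalar(@loose), " values pass the unanchored test\n";
    print "valid years: @valid\n";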

These quantifiers will try to match as much of the string as possible,
while still allowing the regexp to match.  So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regexp match. The second quantifier C<.*> has
no string left to it, so it matches 0 times.
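
This behavior can be surprising when a string contains more than one
candidate.  In the sketch below, C<.*> between two quotes runs on to
the last quote in the string, while a negated character class stops at
the first one.  The sample string is made up for the example:

    use strict;
    use warnings;

    my $line = 'say "hello" and "goodbye"';

    # .* takes as much as it can, so it spans both quoted strings.
    my ($greedy) = $line =~ /"(.*)"/;

    # [^"]* cannot cross a quote, so it stops at the first closing one.
    my ($tight)  = $line =~ /"([^"]*)"/;

    print "greedy: $greedy\n";
    print "tight:  $tight\n";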

=head2 More matching

There are a few more things you might want to know about matching
operators.  In the code

    $pattern = 'Seuss';
    while (<>) {
        print if /$pattern/;
    }

perl has to re-evaluate C<$pattern> each time through the loop.  If
C<$pattern> won't be changing, use the C<//o> modifier, to only
perform variable substitutions once.  If you don't want any
substitutions at all, use the special delimiter C<m''>:

    $pattern = 'Seuss';
    m'$pattern'; # matches '$pattern', not 'Seuss'

The global modifier C<//g> allows the matching operator to match
within a string as many times as possible.  In scalar context,
successive matches against a string will have C<//g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position.  If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regexp/gc>.

In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regexp.  So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'
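
With more than one grouping, the list returned by C<//g> contains the
groupings of every match in turn, which makes it convenient for
filling a hash.  A minimal sketch, assuming input in a simple
C<name=value> form invented for the example:

    use strict;
    use warnings;

    my $config = "host=example.org port=8080 user=guest";

    # Each match contributes two items: a name and its value.
    my %setting = $config =~ /(\w+)=(\S+)/g;

    for my $name (sort keys %setting) {
        print "$name is $setting{$name}\n";
    }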

=head2 Search and replace

Search and replace is performed using C<s/regexp/replacement/modifiers>.
The C<replacement> is a Perl double quoted string that replaces in the
string whatever is matched with the C<regexp>.  The operator C<=~> is
also used here to associate a string with C<s///>.  If matching
against C<$_>, the S<C<$_ =~> > can be dropped.  If there is a match,
C<s///> returns the number of substitutions made, otherwise it returns
false.  Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression. With
the global modifier, C<s///g> will search and replace all occurrences
of the regexp in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"
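
Because C<s///> returns the number of substitutions made, its result
can be saved or tested like any other value.  A brief sketch, with
illustrative variable names:

    use strict;
    use warnings;

    my $x = "I batted 4 for 4";

    my $count = ($x =~ s/4/four/g);    # both occurrences are replaced
    print "made $count substitutions: $x\n";

    # Nothing matches now, so the result is false.
    my $none = ($x =~ s/9/nine/g);
    print "no 9 found\n" unless $none;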

The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring.  This counts character frequencies in a line:

    $x = "the cat";
    $x =~ s/(.)/$chars{$1}++;$1/eg;  # final $1 replaces char with itself
    print "frequency of '$_' is $chars{$_}\n"
        foreach (sort {$chars{$b} <=> $chars{$a}} keys %chars);

This prints output like the following (characters with equal counts
may appear in a different order):

    frequency of 't' is 2
    frequency of 'e' is 1
    frequency of ' ' is 1
    frequency of 'h' is 1
    frequency of 'a' is 1
    frequency of 'c' is 1
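
As a further sketch of C<s///e>, the replacement below is an
arithmetic expression rather than a string.  The data and variable
names are invented for the example:

    use strict;
    use warnings;

    my $order = "apple 3, pear 5";

    # Copy $order, then double every number in the copy.
    (my $doubled = $order) =~ s/(\d+)/$1 * 2/eg;

    print "$order\n";
    print "$doubled\n";

Copying first and substituting on the copy leaves the original string
untouched.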

C<s///> can use other delimiters, such as C<s!!!> and C<s{}{}>, and
even C<s{}//>.  If single quotes are used C<s'''>, then the regexp and
replacement are treated as single quoted strings.

=head2 The split operator

C<split /regexp/, string> splits C<string> into a list of substrings
and returns that list.  The regexp determines the character sequence
that C<string> is split with respect to.  For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @words = split /\s+/, $x;  # $words[0] = 'Calvin'
                               # $words[1] = 'and'
                               # $words[2] = 'Hobbes'

If the empty regexp C<//> is used, the string is split into individual
characters.  If the regexp has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of C<$x> matched the regexp, C<split> prepended
an empty initial element to the list.
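
The cases above can be put side by side in a short sketch.  The sample
strings and array names are illustrative only:

    use strict;
    use warnings;

    # Split on a comma with optional surrounding whitespace.
    my @fields = split /\s*,\s*/, "red , green,blue";

    # The empty regexp splits into individual characters.
    my @chars = split //, "abc";

    # Without groupings the separators themselves are dropped,
    # but the empty leading element is still produced.
    my @parts = split m!/!, "/usr/bin";

    print join("|", @fields), "\n";
    print join("|", @chars), "\n";
    print scalar(@parts), " parts, the first is empty\n";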

=head1 BUGS

None.

=head1 SEE ALSO

This is just a quick start guide.  For a more in-depth tutorial on
regexps, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=cut