in the
@@ -327,59 +368,90 @@ original string.
To escape the special meaning of C<.>, we use C<\Q>:
- $string = "Placido P. Octopus";
- $regex = "P.";
+ $string = "Placido P. Octopus";
+ $regex = "P.";
- $string =~ s/\Q$regex/Polyp/;
- # $string is now "Placido Polyp Octopus"
+ $string =~ s/\Q$regex/Polyp/;
+ # $string is now "Placido Polyp Octopus"
The use of C<\Q> causes the C<.> in the regex to be treated as a
regular character, so that C<P.> matches a C<P> followed by a dot.
=head2 What is C</o> really for?
-X
-
-Using a variable in a regular expression match forces a re-evaluation
-(and perhaps recompilation) each time the regular expression is
-encountered. The C</o> modifier locks in the regex the first time
-it's used. This always happens in a constant regular expression, and
-in fact, the pattern was compiled into the internal format at the same
-time your entire program was.
-
-Use of C</o> is irrelevant unless variable interpolation is used in
-the pattern, and if so, the regex engine will neither know nor care
-whether the variables change after the pattern is evaluated the I time.
-
-C</o> is often used to gain an extra measure of efficiency by not
-performing subsequent evaluations when you know it won't matter
-(because you know the variables won't change), or more rarely, when
-you don't want the regex to notice if they do.
-
-For example, here's a "paragrep" program:
-
- $/ = ''; # paragraph mode
- $pat = shift;
- while (<>) {
- print if /$pat/o;
- }
+X X
+
+(contributed by brian d foy)
+
+The C</o> option for regular expressions (documented in L and
+L) tells Perl to compile the regular expression only once.
+This is only useful when the pattern contains a variable. Perls 5.6
+and later handle this automatically if the pattern does not change.
+
+Since the match operator C<m//>, the substitution operator C<s///>,
+and the regular expression quoting operator C<qr//> are double-quotish
+constructs, you can interpolate variables into the pattern. See the
+answer to "How can I quote a variable to use in a regex?" for more
+details.
+
+This example takes a regular expression from the argument list and
+prints the lines of input that match it:
+
+ my $pattern = shift @ARGV;
+
+ while( <> ) {
+ print if m/$pattern/;
+ }
+
+Versions of Perl prior to 5.6 would recompile the regular expression
+for each iteration, even if C<$pattern> had not changed. The C</o>
+would prevent this by telling Perl to compile the pattern the first
+time, then reuse that for subsequent iterations:
+
+ my $pattern = shift @ARGV;
+
+ while( <> ) {
+ print if m/$pattern/o; # useful for Perl < 5.6
+ }
+
+In versions 5.6 and later, Perl won't recompile the regular expression
+if the variable hasn't changed, so you probably don't need the C</o>
+option. It doesn't hurt, but it doesn't help either. If you want any
+version of Perl to compile the regular expression only once even if
+the variable changes (thus, only using its initial value), you still
+need the C</o>.
+
+You can watch Perl's regular expression engine at work to verify for
+yourself if Perl is recompiling a regular expression. The C
modifier.
+
=head2 What good is C<\G> in a regular expression?
X<\G>
@@ -677,14 +861,14 @@ string where the last match left off. The regular
expression engine cannot skip over any characters to find
the next match with this anchor, so C<\G> is similar to the
beginning of string anchor, C<^>. The C<\G> anchor is typically
-used with the C<g> flag. It uses the value of pos()
+used with the C<g> flag. It uses the value of C<pos()>
as the position to start the next match. As the match
-operator makes successive matches, it updates pos() with the
+operator makes successive matches, it updates C<pos()> with the
position of the next character past the last match (or the
first character of the next match, depending on how you like
-to look at it). Each string has its own pos() value.
+to look at it). Each string has its own C<pos()> value.
-Suppose you want to match all of consective pairs of digits
+Suppose you want to match all of consecutive pairs of digits
in a string like "1122a44" and stop matching when you
encounter non-digits. You want to match C<11> and C<22> but
the letter shows up between C<22> and C<44> and you want
@@ -694,7 +878,7 @@ the C<a> and still matches C<44>.
$_ = "1122a44";
my @pairs = m/(\d\d)/g; # qw( 11 22 44 )
-If you use the \G anchor, you force the match after C<22> to
+If you use the C<\G> anchor, you force the match after C<22> to
start with the C<a>. The regular expression cannot match
there since it does not find a digit, so the next match
fails and the match operator returns the pairs it already
@@ -712,7 +896,7 @@ still need the C<g> flag.
print "Found $1\n";
}
-After the match fails at the letter C<a>, perl resets pos()
+After the match fails at the letter C<a>, perl resets C<pos()>
and the next match on the same string starts at the beginning.
$_ = "1122a44";
@@ -723,13 +907,13 @@ and the next match on the same string starts at the beginning.
print "Found $1 after while" if m/(\d\d)/g; # finds "11"
-You can disable pos() resets on fail with the C<c> flag.
-Subsequent matches start where the last successful match
-ended (the value of pos()) even if a match on the same
-string as failed in the meantime. In this case, the match
-after the while() loop starts at the C<a> (where the last
-match stopped), and since it does not use any anchor it can
-skip over the C<a> to find "44".
+You can disable C<pos()> resets on fail with the C<c> flag, documented
+in L and L. Subsequent matches start where the last
+successful match ended (the value of C<pos()>) even if a match on the
+same string has failed in the meantime. In this case, the match after
+the C<while()> loop starts at the C<a> (where the last match stopped),
+and since it does not use any anchor it can skip over the C<a> to find
+C<44>.
$_ = "1122a44";
while( m/\G(\d\d)/gc )
@@ -744,17 +928,17 @@ when you want to try a different match if one fails,
such as in a tokenizer. Jeffrey Friedl offers this example
which works in 5.004 or later.
- while (<>) {
- chomp;
- PARSER: {
- m/ \G( \d+\b )/gcx && do { print "number: $1\n"; redo; };
- m/ \G( \w+ )/gcx && do { print "word: $1\n"; redo; };
- m/ \G( \s+ )/gcx && do { print "space: $1\n"; redo; };
- m/ \G( [^\w\d]+ )/gcx && do { print "other: $1\n"; redo; };
- }
- }
+ while (<>) {
+ chomp;
+ PARSER: {
+ m/ \G( \d+\b )/gcx && do { print "number: $1\n"; redo; };
+ m/ \G( \w+ )/gcx && do { print "word: $1\n"; redo; };
+ m/ \G( \s+ )/gcx && do { print "space: $1\n"; redo; };
+ m/ \G( [^\w\d]+ )/gcx && do { print "other: $1\n"; redo; };
+ }
+ }
-For each line, the PARSER loop first tries to match a series
+For each line, the C<PARSER> loop first tries to match a series
of digits followed by a word boundary. This match has to
start at the place the last match left off (or the beginning
of the string on the first match). Since C X
-X
+X X X
Starting from Perl 5.6 Perl has had some level of multibyte character
support. Perl 5.8 or later is recommended. Supported multibyte
@@ -826,32 +1010,33 @@ looks like it is because "SG" is next to "XX", but there's no real
Here are a few ways, all painful, to deal with it:
- $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent "martian"
- # bytes are no longer adjacent.
- print "found GX!\n" if $martian =~ /GX/;
+ # Make sure adjacent "martian" bytes are no longer adjacent.
+ $martian =~ s/([A-Z][A-Z])/ $1 /g;
+
+ print "found GX!\n" if $martian =~ /GX/;
Or like this:
- @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
- # above is conceptually similar to: @chars = $text =~ m/(.)/g;
- #
- foreach $char (@chars) {
- print "found GX!\n", last if $char eq 'GX';
- }
+ @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
+ # above is conceptually similar to: @chars = $text =~ m/(.)/g;
+ #
+ foreach $char (@chars) {
+ print "found GX!\n", last if $char eq 'GX';
+ }
Or like this:
- while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) { # \G probably unneeded
- print "found GX!\n", last if $1 eq 'GX';
- }
+ while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) { # \G probably unneeded
+ print "found GX!\n", last if $1 eq 'GX';
+ }
Here's another, slightly less painful, way to do it from Benjamin
Goldberg, who uses a zero-width negative look-behind assertion.
print "found GX!\n" if $martian =~ m/
- (? X X X X<\Q, regex>
+X<\E, regex>, X
-Well, if it's really a pattern, then just use
+(contributed by brian d foy)
- chomp($pattern = <STDIN>);
- if ($line =~ /$pattern/) { }
+We don't have to hard-code patterns into the match operator (or
+anything else that works with regular expressions). We can put the
+pattern in a variable for later use.
-Alternatively, since you have no guarantee that your user entered
-a valid regular expression, trap the exception this way:
+The match operator is a double quote context, so you can interpolate
+your variable just like a double quoted string. In this case, you
+read the regular expression as user input and store it in C<$regex>.
+Once you have the pattern in C<$regex>, you use that variable in the
+match operator.
- if (eval { $line =~ /$pattern/ }) { }
+ chomp( my $regex = <STDIN> );
-If all you really want is to search for a string, not a pattern,
-then you should either use the index() function, which is made for
-string searching, or, if you can't be disabused of using a pattern
-match on a non-pattern, then be sure to use C<\Q>...C<\E>, documented
-in L.
+ if( $string =~ m/$regex/ ) { ... }
- $pattern = <STDIN>;
+Any regular expression special characters in C<$regex> are still
+special, and the pattern still has to be valid or Perl will complain.
+For instance, in this pattern there is an unpaired parenthesis.
- open (FILE, $input) or die "Couldn't open input $input: $!; aborting";
- while (<FILE>) {
- print if /\Q$pattern\E/;
- }
- close FILE;
+ my $regex = "Unmatched ( paren";
+
+ "Two parens to bind them all" =~ m/$regex/;
+
+When Perl compiles the regular expression, it treats the parenthesis
+as the start of a memory match. When it doesn't find the closing
+parenthesis, it complains:
+
+ Unmatched ( in regex; marked by <-- HERE in m/Unmatched ( <-- HERE paren/ at script line 3.
+
+You can get around this in several ways depending on your situation.
+First, if you don't want any of the characters in the string to be
+special, you can escape them with C<quotemeta> before you use the string.
+
+ chomp( my $regex = <STDIN> );
+ $regex = quotemeta( $regex );
+
+ if( $string =~ m/$regex/ ) { ... }
+
+You can also do this directly in the match operator using the C<\Q>
+and C<\E> sequences. The C<\Q> tells Perl where to start escaping
+special characters, and the C<\E> tells it where to stop (see L
+for more details).
+
+ chomp( my $regex = <STDIN> );
+
+ if( $string =~ m/\Q$regex\E/ ) { ... }
+
+Alternately, you can use C<qr//>, the regular expression quote operator (see
+L for more details). It quotes and perhaps compiles the pattern,
+and you can apply regular expression flags to the pattern.
+
+ chomp( my $input = <STDIN> );
+
+ my $regex = qr/$input/is;
+
+ $string =~ m/$regex/; # same as m/$input/is
+
+You might also want to trap any errors by wrapping an C<eval> block
+around the whole thing.
+
+ chomp( my $input = <STDIN> );
+
+ eval {
+ if( $string =~ m/\Q$input\E/ ) { ... }
+ };
+ warn $@ if $@;
+
+Or...
+
+ my $regex = eval { qr/$input/is };
+ if( defined $regex ) {
+ $string =~ m/$regex/;
+ }
+ else {
+ warn $@;
+ }
=head1 REVISION
-Revision: $Revision: 3606 $
+Revision: $Revision$
-Date: $Date: 2006-03-06 12:05:47 +0100 (lun, 06 mar 2006) $
+Date: $Date$
See L for source control details and availability.
=head1 AUTHOR AND COPYRIGHT
-Copyright (c) 1997-2006 Tom Christiansen, Nathan Torkington, and
+Copyright (c) 1997-2009 Tom Christiansen, Nathan Torkington, and
other authors as noted. All rights reserved.
This documentation is free; you can redistribute it and/or modify it