X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlfaq6.pod;h=40965d09fcb8779125c70d743c4cccb578104795;hb=2cbeaf93ebca0381b43ae9b18f99fecb381ee394;hp=d21a11157b11c80dea2ae6361fda4836226ff05d;hpb=fc36a67e8855d031b2a6921819d899eb149eee2d;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlfaq6.pod b/pod/perlfaq6.pod
index d21a111..40965d0 100644
--- a/pod/perlfaq6.pod
+++ b/pod/perlfaq6.pod
@@ -1,6 +1,6 @@
 =head1 NAME
 
-perlfaq6 - Regexps ($Revision: 1.17 $, $Date: 1997/04/24 22:44:10 $)
+perlfaq6 - Regular Expressions
 
 =head1 DESCRIPTION
 
@@ -8,48 +8,51 @@ This section is surprisingly small because the rest of the FAQ is
 littered with answers involving regular expressions.  For example,
 decoding a URL and checking whether something is a number are handled
 with regular expressions, but those answers are found elsewhere in
-this document (in the section on Data and the Networking one on
-networking, to be precise).
+this document (in L<perlfaq9>: "How do I decode or create those %-encodings
+on the web" and L<perlfaq4>: "How do I determine whether a scalar is
+a number/whole/integer/float", to be precise).
 
 =head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
+X<regex, legibility> X<regexp, legibility>
+X<regular expression, legibility> X</x>
 
 Three techniques can make regular expressions maintainable and
 understandable.
 
 =over 4
 
-=item Comments Outside the Regexp
+=item Comments Outside the Regex
 
 Describe what you're doing and how you're doing it, using normal Perl
 comments.
 
-    # turn the line into the first word, a colon, and the
-    # number of characters on the rest of the line
-    s/^(\w+)(.*)/ lc($1) . ":" . length($2) /ge;
+	# turn the line into the first word, a colon, and the
+	# number of characters on the rest of the line
+	s/^(\w+)(.*)/ lc($1) . ":" . length($2) /meg;
 
-=item Comments Inside the Regexp
+=item Comments Inside the Regex
 
-The C</x> modifier causes whitespace to be ignored in a regexp pattern
-(except in a character class), and also allows you to use normal
-comments there, too.  As you can imagine, whitespace and comments help
-a lot.
+The C</x> modifier causes whitespace to be ignored in a regex pattern
+(except in a character class and a few other places), and also allows you to
+use normal comments there, too.  As you can imagine, whitespace and comments
+help a lot.
 
 C</x> lets you turn this:
 
-    s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;
+	s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;
 
 into this:
 
-    s{ <                    # opening angle bracket
-        (?:                 # Non-backreffing grouping paren
-             [^>'"] *       # 0 or more things that are neither > nor ' nor "
-                |           #    or else
-             ".*?"          # a section between double quotes (stingy match)
-                |           #    or else
-             '.*?'          # a section between single quotes (stingy match)
-        ) +                 #   all occurring one or more times
-       >                    # closing angle bracket
-    }{}gsx;                 # replace with nothing, i.e. delete
+	s{ <                    # opening angle bracket
+		(?:                 # Non-backreffing grouping paren
+			[^>'"] *        # 0 or more things that are neither > nor ' nor "
+				|           #    or else
+			".*?"           # a section between double quotes (stingy match)
+				|           #    or else
+			'.*?'           # a section between single quotes (stingy match)
+		) +                 #   all occurring one or more times
+		>                   # closing angle bracket
+	}{}gsx;                 # replace with nothing, i.e. delete
 
 It's still not quite so clear as prose, but it is very useful for
 describing the meaning of each part of the pattern.
@@ -62,15 +65,17 @@ describes this.  For example, the C<s///> above uses braces as
 delimiters.  Selecting another delimiter can avoid quoting the
 delimiter within the pattern:
 
-    s/\/usr\/local/\/usr\/share/g;	# bad delimiter choice
-    s#/usr/local#/usr/share#g;		# better
+	s/\/usr\/local/\/usr\/share/g;	# bad delimiter choice
+	s#/usr/local#/usr/share#g;		# better
 
 =back
 
 =head2 I'm having trouble matching over more than one line.  What's wrong?
+X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
 
-Either you don't have newlines in your string, or you aren't using the
-correct modifier(s) on your pattern.
+Either you don't have more than one line in the string you're looking
+at (probably), or else you aren't using the correct modifier(s) on
+your pattern (possibly).
 
 There are many ways to get multiline data into a string.  If you want
 it to happen automatically while reading input, you'll want to set $/
@@ -92,218 +97,539 @@ to newlines.  But it's imperative that $/ be set to something other
 than the default, or else we won't actually ever have a multiline
 record read in.
 
-    $/ = '';  		# read in more whole paragraph, not just one line
-    while ( <> ) {
-	while ( /\b(\w\S+)(\s+\1)+\b/gi ) {
-	    print "Duplicate $1 at paragraph $.\n";
+	$/ = '';  		# read in whole paragraph, not just one line
+	while ( <> ) {
+		while ( /\b([\w'-]+)(\s+\1)+\b/gi ) {  	# word starts alpha
+			print "Duplicate $1 at paragraph $.\n";
+		}
 	}
-    }
 
 Here's code that finds sentences that begin with "From " (which would
 be mangled by many mailers):
 
-    $/ = '';  		# read in more whole paragraph, not just one line
-    while ( <> ) {
-	while ( /^From /gm ) { # /m makes ^ match next to \n
-	    print "leading from in paragraph $.\n";
+	$/ = '';  		# read in whole paragraph, not just one line
+	while ( <> ) {
+		while ( /^From /gm ) { # /m makes ^ match next to \n
+		print "leading from in paragraph $.\n";
+		}
 	}
-    }
 
 Here's code that finds everything between START and END in a paragraph:
 
-    undef $/;  		# read in whole file, not just one line or paragraph
-    while ( <> ) {
-	while ( /START(.*?)END/sm ) { # /s makes . cross line boundaries
-	    print "$1\n";
+	undef $/;  		# read in whole file, not just one line or paragraph
+	while ( <> ) {
+		while ( /START(.*?)END/sgm ) { # /s makes . cross line boundaries
+		    print "$1\n";
+		}
 	}
-    }
 
 =head2 How can I pull out lines between two patterns that are themselves on different lines?
+X<..>
 
 You can use Perl's somewhat exotic C<..> operator (documented in
 L<perlop>):
 
-    perl -ne 'print if /START/ .. /END/' file1 file2 ...
+	perl -ne 'print if /START/ .. /END/' file1 file2 ...
 
 If you wanted text and not lines, you would use
 
-    perl -0777 -pe 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...
+	perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...
 
 But if you want nested occurrences of C<START> through C<END>, you'll
 run up against the problem described in the question in this section
 on matching balanced text.
 
+Here's another example of using C<..>:
+
+	while (<>) {
+		$in_header =   1  .. /^$/;
+		$in_body   = /^$/ .. eof;
+	# now choose between them
+	} continue {
+		$. = 0 if eof;	# fix $.
+	}
+
+=head2 How do I match XML, HTML, or other nasty, ugly things with a regex?
+X<regex, XML> X<regex, HTML> X<XML> X<HTML> X<pain> X<frustration>
+X<sucking out, will to live>
+
+(contributed by brian d foy)
+
+If you just want to get work done, use a module and forget about the
+regular expressions. The C<XML::Parser> and C<HTML::Parser> modules
+are good starts, although each namespace has other parsing modules
+specialized for certain tasks and different ways of doing it. Start at
+CPAN Search ( http://search.cpan.org ) and wonder at all the work people
+have done for you already! :)
+
+The problem with things such as XML is that they have balanced text
+containing multiple levels of balanced text, but sometimes it isn't
+balanced text, as in an empty tag (C<< <br/> >>, for instance). Even then,
+things can occur out-of-order. Just when you think you've got a
+pattern that matches your input, someone throws you a curveball.
+
+If you'd like to do it the hard way, scratching and clawing your way
+toward a right answer but constantly being disappointed, besieged by
+bug reports, and weary from the inordinate amount of time you have to
+spend reinventing a triangular wheel, then there are several things
+you can try before you give up in frustration:
+
+=over 4
+
+=item * Solve the balanced text problem from another question in L<perlfaq6>
+
+=item * Try the recursive regex features in Perl 5.10 and later. See L<perlre>
+
+=item * Try defining a grammar using Perl 5.10's C<(?DEFINE)> feature.
+
+=item * Break the problem down into sub-problems instead of trying to use a single regex
+
+=item * Convince everyone not to use XML or HTML in the first place
+
+=back
+
+Good luck!
+
 =head2 I put a regular expression into $/ but it didn't work. What's wrong?
+X<$/, regexes in> X<$INPUT_RECORD_SEPARATOR, regexes in>
+X<$RS, regexes in>
+
+$/ has to be a string.  You can use these examples if you really need to
+do this.
+
+If you have File::Stream, this is easy.
+
+	use File::Stream;
+
+	my $stream = File::Stream->new(
+		$filehandle,
+		separator => qr/\s*,\s*/,
+		);
+
+	print "$_\n" while <$stream>;
+
+If you don't have File::Stream, you have to do a little more work.
+
+You can use the four-argument form of sysread to continually add to
+a buffer.  After you add to the buffer, you check if you have a
+complete line (using your regular expression).
+
+	local $_ = "";
+	while( sysread FH, $_, 8192, length ) {
+		while( s/^((?s).*?)your_pattern// ) {
+			my $record = $1;
+			# do stuff here.
+		}
+	}
+
+You can do the same thing with foreach and a match using the
+c flag and the \G anchor, if you do not mind your entire file
+being in memory at the end.
 
-$/ must be a string, not a regular expression.  Awk has to be better
-for something. :-)
-
-Actually, you could do this if you don't mind reading the whole file
-into memory:
-
-    undef $/;
-    @records = split /your_pattern/, <FH>;
-
-The Net::Telnet module (available from CPAN) has the capability to
-wait for a pattern in the input stream, or timeout if it doesn't
-appear within a certain time.
-
-    ## Create a file with three lines.
-    open FH, ">file";
-    print FH "The first line\nThe second line\nThe third line\n";
-    close FH;
-
-    ## Get a read/write filehandle to it.
-    $fh = new FileHandle "+<file";
-
-    ## Attach it to a "stream" object.
-    use Net::Telnet;
-    $file = new Net::Telnet (-fhopen => $fh);
-
-    ## Search for the second line and print out the third.
-    $file->waitfor('/second line\n/');
-    print $file->getline;
-
-=head2 How do I substitute case insensitively on the LHS, but preserving case on the RHS?
-
-It depends on what you mean by "preserving case".  The following
-script makes the substitution have the same case, letter by letter, as
-the original.  If the substitution has more characters than the string
-being substituted, the case of the last character is used for the rest
-of the substitution.
-
-    # Original by Nathan Torkington, massaged by Jeffrey Friedl
-    #
-    sub preserve_case($$)
-    {
-        my ($old, $new) = @_;
-        my ($state) = 0; # 0 = no change; 1 = lc; 2 = uc
-        my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
-        my ($len) = $oldlen < $newlen ? $oldlen : $newlen;
-
-        for ($i = 0; $i < $len; $i++) {
-            if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
-                $state = 0;
-            } elsif (lc $c eq $c) {
-                substr($new, $i, 1) = lc(substr($new, $i, 1));
-                $state = 1;
-            } else {
-                substr($new, $i, 1) = uc(substr($new, $i, 1));
-                $state = 2;
-            }
-        }
-        # finish up with any remaining new (for when new is longer than old)
-        if ($newlen > $oldlen) {
-            if ($state == 1) {
-                substr($new, $oldlen) = lc(substr($new, $oldlen));
-            } elsif ($state == 2) {
-                substr($new, $oldlen) = uc(substr($new, $oldlen));
-            }
-        }
-        return $new;
+	local $_ = "";
+	while( sysread FH, $_, 8192, length ) {
+		foreach my $record ( m/\G((?s).*?)your_pattern/gc ) {
+			# do stuff here.
+		}
+	substr( $_, 0, pos ) = "" if pos;
+	}
+
+
+=head2 How do I substitute case insensitively on the LHS while preserving case on the RHS?
+X<replace, case preserving> X<substitute, case preserving>
+X<substitution, case preserving> X<s, case preserving>
+
+Here's a lovely Perlish solution by Larry Rosler.  It exploits
+properties of bitwise xor on ASCII strings.
+
+	$_= "this is a TEsT case";
+
+	$old = 'test';
+	$new = 'success';
+
+	s{(\Q$old\E)}
+	{ uc $new | (uc $1 ^ $1) .
+		(uc(substr $1, -1) ^ substr $1, -1) x
+		(length($new) - length $1)
+	}egi;
+
+	print;
+
+And here it is as a subroutine, modeled after the above:
+
+	sub preserve_case($$) {
+		my ($old, $new) = @_;
+		my $mask = uc $old ^ $old;
+
+		uc $new | $mask .
+			substr($mask, -1) x (length($new) - length($old))
     }
 
-    $a = "this is a TEsT case";
-    $a =~ s/(test)/preserve_case($1, "success")/gie;
-    print "$a\n";
+	$string = "this is a TEsT case";
+	$string =~ s/(test)/preserve_case($1, "success")/egi;
+	print "$string\n";
 
 This prints:
 
-    this is a SUcCESS case
+	this is a SUcCESS case
+
+As an alternative, to keep the case of the replacement word if it is
+longer than the original, you can use this code, by Jeff Pinyan:
+
+	sub preserve_case {
+		my ($from, $to) = @_;
+		my ($lf, $lt) = map length, @_;
+
+		if ($lt < $lf) { $from = substr $from, 0, $lt }
+		else { $from .= substr $to, $lf }
+
+		return uc $to | ($from ^ uc $from);
+		}
+
+This changes the sentence to "this is a SUcCess case."
+
+Just to show that C programmers can write C in any programming language,
+if you prefer a more C-like solution, the following script makes the
+substitution have the same case, letter by letter, as the original.
+(It also happens to run about 240% slower than the Perlish solution runs.)
+If the substitution has more characters than the string being substituted,
+the case of the last character is used for the rest of the substitution.
+
+	# Original by Nathan Torkington, massaged by Jeffrey Friedl
+	#
+	sub preserve_case($$)
+	{
+		my ($old, $new) = @_;
+		my ($state) = 0; # 0 = no change; 1 = lc; 2 = uc
+		my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
+		my ($len) = $oldlen < $newlen ? $oldlen : $newlen;
+
+		for ($i = 0; $i < $len; $i++) {
+			if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
+				$state = 0;
+			} elsif (lc $c eq $c) {
+				substr($new, $i, 1) = lc(substr($new, $i, 1));
+				$state = 1;
+			} else {
+				substr($new, $i, 1) = uc(substr($new, $i, 1));
+				$state = 2;
+			}
+		}
+		# finish up with any remaining new (for when new is longer than old)
+		if ($newlen > $oldlen) {
+			if ($state == 1) {
+				substr($new, $oldlen) = lc(substr($new, $oldlen));
+			} elsif ($state == 2) {
+				substr($new, $oldlen) = uc(substr($new, $oldlen));
+			}
+		}
+		return $new;
+	}
 
-=head2 How can I make C<\w> match accented characters?
+=head2 How can I make C<\w> match national character sets?
+X<\w>
 
-See L<perllocale>.
+Put C<use locale;> in your script.  The \w character class is taken
+from the current locale.
+
+See L<perllocale> for details.
 
 =head2 How can I match a locale-smart version of C</[a-zA-Z]/>?
+X<alpha>
+
+You can use the POSIX character class syntax C</[[:alpha:]]/>
+documented in L<perlre>.
 
-One alphabetic character would be C</[^\W\d_]/>, no matter what locale
-you're in.  Non-alphabetics would be C</[\W\d_]/> (assuming you don't
-consider an underscore a letter).
+No matter which locale you are in, the alphabetic characters are
+the characters in \w without the digits and the underscore.
+As a regex, that looks like C</[^\W\d_]/>.  Its complement,
+the non-alphabetics, is then everything in \W along with
+the digits and the underscore, or C</[\W\d_]/>.
 
-=head2 How can I quote a variable to use in a regexp?
+=head2 How can I quote a variable to use in a regex?
+X<regex, escaping> X<regexp, escaping> X<regular expression, escaping>
 
 The Perl parser will expand $variable and @variable references in
 regular expressions unless the delimiter is a single quote.  Remember,
 too, that the right-hand side of a C<s///> substitution is considered
 a double-quoted string (see L<perlop> for more details).  Remember
-also that any regexp special characters will be acted on unless you
+also that any regex special characters will be acted on unless you
 precede the substitution with \Q.  Here's an example:
 
-    $string = "to die?";
-    $lhs = "die?";
-    $rhs = "sleep no more";
+	$string = "Placido P. Octopus";
+	$regex  = "P.";
+
+	$string =~ s/$regex/Polyp/;
+	# $string is now "Polypacido P. Octopus"
 
-    $string =~ s/\Q$lhs/$rhs/;
-    # $string is now "to sleep no more"
+Because C<.> is special in regular expressions, and can match any
+single character, the regex C<P.> here has matched the <Pl> in the
+original string.
 
-Without the \Q, the regexp would also spuriously match "di".
+To escape the special meaning of C<.>, we use C<\Q>:
+
+	$string = "Placido P. Octopus";
+	$regex  = "P.";
+
+	$string =~ s/\Q$regex/Polyp/;
+	# $string is now "Placido Polyp Octopus"
+
+The use of C<\Q> causes the <.> in the regex to be treated as a
+regular character, so that C<P.> matches a C<P> followed by a dot.
 
 =head2 What is C</o> really for?
+X</o, regular expressions> X<compile, regular expressions>
 
-Using a variable in a regular expression match forces a re-evaluation
-(and perhaps recompilation) each time through.  The C</o> modifier
-locks in the regexp the first time it's used.  This always happens in a
-constant regular expression, and in fact, the pattern was compiled
-into the internal format at the same time your entire program was.
+(contributed by brian d foy)
 
-Use of C</o> is irrelevant unless variable interpolation is used in
-the pattern, and if so, the regexp engine will neither know nor care
-whether the variables change after the pattern is evaluated the I<very
-first> time.
+The C</o> option for regular expressions (documented in L<perlop> and
+L<perlreref>) tells Perl to compile the regular expression only once.
+This is only useful when the pattern contains a variable. Perls 5.6
+and later handle this automatically if the pattern does not change.
 
-C</o> is often used to gain an extra measure of efficiency by not
-performing subsequent evaluations when you know it won't matter
-(because you know the variables won't change), or more rarely, when
-you don't want the regexp to notice if they do.
+Since the match operator C<m//>, the substitution operator C<s///>,
+and the regular expression quoting operator C<qr//> are double-quotish
+constructs, you can interpolate variables into the pattern. See the
+answer to "How can I quote a variable to use in a regex?" for more
+details.
 
-For example, here's a "paragrep" program:
+This example takes a regular expression from the argument list and
+prints the lines of input that match it:
 
-    $/ = '';  # paragraph mode
-    $pat = shift;
-    while (<>) {
-        print if /$pat/o;
-    }
+	my $pattern = shift @ARGV;
+
+	while( <> ) {
+		print if m/$pattern/;
+		}
+
+Versions of Perl prior to 5.6 would recompile the regular expression
+for each iteration, even if C<$pattern> had not changed. The C</o>
+would prevent this by telling Perl to compile the pattern the first
+time, then reuse that for subsequent iterations:
+
+	my $pattern = shift @ARGV;
+
+	while( <> ) {
+		print if m/$pattern/o; # useful for Perl < 5.6
+		}
+
+In versions 5.6 and later, Perl won't recompile the regular expression
+if the variable hasn't changed, so you probably don't need the C</o>
+option. It doesn't hurt, but it doesn't help either. If you want any
+version of Perl to compile the regular expression only once even if
+the variable changes (thus, only using its initial value), you still
+need the C</o>.
+
+You can watch Perl's regular expression engine at work to verify for
+yourself if Perl is recompiling a regular expression. The C<use re
+'debug'> pragma (comes with Perl 5.005 and later) shows the details.
+With Perls before 5.6, you should see C<re> reporting that its
+compiling the regular expression on each iteration. With Perl 5.6 or
+later, you should only see C<re> report that for the first iteration.
+
+	use re 'debug';
+
+	$regex = 'Perl';
+	foreach ( qw(Perl Java Ruby Python) ) {
+		print STDERR "-" x 73, "\n";
+		print STDERR "Trying $_...\n";
+		print STDERR "\t$_ is good!\n" if m/$regex/;
+		}
 
 =head2 How do I use a regular expression to strip C style comments from a file?
 
 While this actually can be done, it's much harder than you'd think.
 For example, this one-liner
 
-    perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c
+	perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c
 
 will work in many but not all cases.  You see, it's too simple-minded for
 certain kinds of C programs, in particular, those with what appear to be
 comments in quoted strings.  For that, you'd need something like this,
-created by Jeffrey Friedl:
+created by Jeffrey Friedl and later modified by Fred Curtis.
 
-    $/ = undef;
-    $_ = <>;
-    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|\n+|.[^/"'\\]*)#$2#g;
-    print;
+	$/ = undef;
+	$_ = <>;
+	s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
+	print;
 
 This could, of course, be more legibly written with the C</x> modifier, adding
-whitespace and comments.
+whitespace and comments.  Here it is expanded, courtesy of Fred Curtis.
+
+    s{
+       /\*         ##  Start of /* ... */ comment
+       [^*]*\*+    ##  Non-* followed by 1-or-more *'s
+       (
+         [^/*][^*]*\*+
+       )*          ##  0-or-more things which don't start with /
+                   ##    but do end with '*'
+       /           ##  End of /* ... */ comment
+
+     |         ##     OR  various things which aren't comments:
+
+       (
+         "           ##  Start of " ... " string
+         (
+           \\.           ##  Escaped char
+         |               ##    OR
+           [^"\\]        ##  Non "\
+         )*
+         "           ##  End of " ... " string
+
+       |         ##     OR
+
+         '           ##  Start of ' ... ' string
+         (
+           \\.           ##  Escaped char
+         |               ##    OR
+           [^'\\]        ##  Non '\
+         )*
+         '           ##  End of ' ... ' string
+
+       |         ##     OR
+
+         .           ##  Anything other char
+         [^/"'\\]*   ##  Chars which doesn't start a comment, string or escape
+       )
+     }{defined $2 ? $2 : ""}gxse;
+
+A slight modification also removes C++ comments, possibly spanning multiple lines
+using a continuation character:
+
+ s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;
 
 =head2 Can I use Perl regular expressions to match balanced text?
-
-Although Perl regular expressions are more powerful than "mathematical"
-regular expressions, because they feature conveniences like backreferences
-(C<\1> and its ilk), they still aren't powerful enough. You still need
-to use non-regexp techniques to parse balanced text, such as the text
-enclosed between matching parentheses or braces, for example.
-
-An elaborate subroutine (for 7-bit ASCII only) to pull out balanced
-and possibly nested single chars, like C<`> and C<'>, C<{> and C<}>,
-or C<(> and C<)> can be found in
-http://www.perl.com/CPAN/authors/id/TOMC/scripts/pull_quotes.gz .
-
-The C::Scan module from CPAN contains such subs for internal usage,
-but they are undocumented.
-
-=head2 What does it mean that regexps are greedy?  How can I get around it?
-
-Most people mean that greedy regexps match as much as they can.
+X<regex, matching balanced test> X<regexp, matching balanced test>
+X<regular expression, matching balanced test> X<possessive> X<PARNO>
+X<Text::Balanced> X<Regexp::Common> X<backtracking> X<recursion>
+
+(contributed by brian d foy)
+
+Your first try should probably be the C<Text::Balanced> module, which
+is in the Perl standard library since Perl 5.8. It has a variety of
+functions to deal with tricky text. The C<Regexp::Common> module can
+also help by providing canned patterns you can use.
+
+As of Perl 5.10, you can match balanced text with regular expressions
+using recursive patterns. Before Perl 5.10, you had to resort to
+various tricks such as using Perl code in C<(??{})> sequences.
+
+Here's an example using a recursive regular expression. The goal is to
+capture all of the text within angle brackets, including the text in
+nested angle brackets. This sample text has two "major" groups: a
+group with one level of nesting and a group with two levels of
+nesting. There are five total groups in angle brackets:
+
+	I have some <brackets in <nested brackets> > and
+	<another group <nested once <nested twice> > >
+	and that's it.
+
+The regular expression to match the balanced text  uses two new (to
+Perl 5.10) regular expression features. These are covered in L<perlre>
+and this example is a modified version of one in that documentation.
+
+First, adding the new possessive C<+> to any quantifier finds the
+longest match and does not backtrack. That's important since you want
+to handle any angle brackets through the recursion, not backtracking.
+The group C<< [^<>]++ >> finds one or more non-angle brackets without
+backtracking.
+
+Second, the new C<(?PARNO)> refers to the sub-pattern in the
+particular capture buffer given by C<PARNO>. In the following regex,
+the first capture buffer finds (and remembers) the balanced text, and
+you  need that same pattern within the first buffer to get past the
+nested text. That's the recursive part. The C<(?1)> uses the pattern
+in the outer capture buffer as an independent part of the regex.
+
+Putting it all together, you have:
+
+	#!/usr/local/bin/perl5.10.0
+
+	my $string =<<"HERE";
+	I have some <brackets in <nested brackets> > and
+	<another group <nested once <nested twice> > >
+	and that's it.
+	HERE
+
+	my @groups = $string =~ m/
+			(                   # start of capture buffer 1
+			<                   # match an opening angle bracket
+				(?:
+					[^<>]++     # one or more non angle brackets, non backtracking
+					  |
+					(?1)        # found < or >, so recurse to capture buffer 1
+				)*
+			>                   # match a closing angle bracket
+			)                   # end of capture buffer 1
+			/xg;
+
+	$" = "\n\t";
+	print "Found:\n\t@groups\n";
+
+The output shows that Perl found the two major groups:
+
+	Found:
+		<brackets in <nested brackets> >
+		<another group <nested once <nested twice> > >
+
+With a little extra work, you can get the all of the groups in angle
+brackets even if they are in other angle brackets too. Each time you
+get a balanced match, remove its outer delimiter (that's the one you
+just matched so don't match it again) and add it to a queue of strings
+to process. Keep doing that until you get no matches:
+
+	#!/usr/local/bin/perl5.10.0
+
+	my @queue =<<"HERE";
+	I have some <brackets in <nested brackets> > and
+	<another group <nested once <nested twice> > >
+	and that's it.
+	HERE
+
+	my $regex = qr/
+			(                   # start of bracket 1
+			<                   # match an opening angle bracket
+				(?:
+					[^<>]++     # one or more non angle brackets, non backtracking
+					  |
+					(?1)        # recurse to bracket 1
+				)*
+			>                   # match a closing angle bracket
+			)                   # end of bracket 1
+			/x;
+
+	$" = "\n\t";
+
+	while( @queue )
+		{
+		my $string = shift @queue;
+
+		my @groups = $string =~ m/$regex/g;
+		print "Found:\n\t@groups\n\n" if @groups;
+
+		unshift @queue, map { s/^<//; s/>$//; $_ } @groups;
+		}
+
+The output shows all of the groups. The outermost matches show up
+first and the nested matches so up later:
+
+	Found:
+		<brackets in <nested brackets> >
+		<another group <nested once <nested twice> > >
+
+	Found:
+		<nested brackets>
+
+	Found:
+		<nested once <nested twice> >
+
+	Found:
+		<nested twice>
+
+=head2 What does it mean that regexes are greedy?  How can I get around it?
+X<greedy> X<greediness>
+
+Most people mean that greedy regexes match as much as they can.
 Technically speaking, it's actually the quantifiers (C<?>, C<*>, C<+>,
 C<{}>) that are greedy rather than the whole pattern; Perl prefers local
 greed and immediate gratification to overall greed.  To get non-greedy
@@ -311,9 +637,9 @@ versions of the same quantifiers, use (C<??>, C<*?>, C<+?>, C<{}?>).
 
 An example:
 
-        $s1 = $s2 = "I am very very cold";
-        $s1 =~ s/ve.*y //;      # I am cold
-        $s2 =~ s/ve.*?y //;     # I am very cold
+	$s1 = $s2 = "I am very very cold";
+	$s1 =~ s/ve.*y //;      # I am cold
+	$s2 =~ s/ve.*?y //;     # I am very cold
 
 Notice how the second substitution stopped matching as soon as it
 encountered "y ".  The C<*?> quantifier effectively tells the regular
@@ -321,26 +647,28 @@ expression engine to find a match as quickly as possible and pass
 control on to whatever is next in line, like you would if you were
 playing hot potato.
 
-=head2  How do I process each word on each line?
+=head2 How do I process each word on each line?
+X<word>
 
 Use the split function:
 
-    while (<>) {
-	foreach $word ( split ) { 
-	    # do something with $word here
-	} 
-    }
+	while (<>) {
+		foreach $word ( split ) {
+			# do something with $word here
+		}
+	}
 
 Note that this isn't really a word in the English sense; it's just
 chunks of consecutive non-whitespace characters.
 
-To work with only alphanumeric sequences, you might consider
+To work with only alphanumeric sequences (including underscores), you
+might consider
 
-    while (<>) {
-	foreach $word (m/(\w+)/g) {
-	    # do something with $word here
+	while (<>) {
+		foreach $word (m/(\w+)/g) {
+			# do something with $word here
+		}
 	}
-    }
 
 =head2 How can I print out a word-frequency or line-frequency summary?
 
@@ -349,175 +677,278 @@ pretend that by word you mean chunk of alphabetics, hyphens, or
 apostrophes, rather than the non-whitespace chunk idea of a word given
 in the previous question:
 
-    while (<>) {
-	while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
-	    $seen{$1}++;
+	while (<>) {
+		while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
+			$seen{$1}++;
+		}
 	}
-    }
-    while ( ($word, $count) = each %seen ) {
-	print "$count $word\n";
-    }
+
+	while ( ($word, $count) = each %seen ) {
+		print "$count $word\n";
+		}
 
 If you wanted to do the same thing for lines, you wouldn't need a
 regular expression:
 
-    while (<>) { 
-	$seen{$_}++;
-    }
-    while ( ($line, $count) = each %seen ) {
-	print "$count $line";
-    }
+	while (<>) {
+		$seen{$_}++;
+		}
+
+	while ( ($line, $count) = each %seen ) {
+		print "$count $line";
+	}
 
-If you want these output in a sorted order, see the section on Hashes.
+If you want these output in a sorted order, see L<perlfaq4>: "How do I
+sort a hash (optionally by value instead of key)?".
 
 =head2 How can I do approximate matching?
+X<match, approximate> X<matching, approximate>
 
 See the module String::Approx available from CPAN.
 
 =head2 How do I efficiently match many regular expressions at once?
+X<regex, efficiency> X<regexp, efficiency>
+X<regular expression, efficiency>
+
+( contributed by brian d foy )
+
+Avoid asking Perl to compile a regular expression every time
+you want to match it. In this example, perl must recompile
+the regular expression for every iteration of the C<foreach>
+loop since it has no way to know what $pattern will be.
+
+	@patterns = qw( foo bar baz );
+
+	LINE: while( <DATA> )
+		{
+		foreach $pattern ( @patterns )
+			{
+			if( /\b$pattern\b/i )
+				{
+				print;
+				next LINE;
+				}
+			}
+		}
+
+The C<qr//> operator showed up in perl 5.005.  It compiles a
+regular expression, but doesn't apply it.  When you use the
+pre-compiled version of the regex, perl does less work. In
+this example, I inserted a C<map> to turn each pattern into
+its pre-compiled form. The rest of the script is the same,
+but faster.
+
+	@patterns = map { qr/\b$_\b/i } qw( foo bar baz );
+
+	LINE: while( <> )
+		{
+		foreach $pattern ( @patterns )
+			{
+			if( /$pattern/ )
+				{
+				print;
+				next LINE;
+				}
+			}
+		}
+
+In some cases, you may be able to make several patterns into
+a single regular expression. Beware of situations that require
+backtracking though.
+
+	$regex = join '|', qw( foo bar baz );
+
+	LINE: while( <> )
+		{
+		print if /\b(?:$regex)\b/i;
+		}
+
+For more details on regular expression efficiency, see I<Mastering
+Regular Expressions> by Jeffrey Freidl.  He explains how regular
+expressions engine work and why some patterns are surprisingly
+inefficient.  Once you understand how perl applies regular
+expressions, you can tune them for individual situations.
 
-The following is super-inefficient:
-
-    while (<FH>) {
-        foreach $pat (@patterns) {
-            if ( /$pat/ ) {
-                # do something
-            }
-        }
-    }
-
-Instead, you either need to use one of the experimental Regexp extension
-modules from CPAN (which might well be overkill for your purposes),
-or else put together something like this, inspired from a routine
-in Jeffrey Friedl's book:
-
-    sub _bm_build {
-        my $condition = shift;
-        my @regexp = @_;  # this MUST not be local(); need my()
-        my $expr = join $condition => map { "m/\$regexp[$_]/o" } (0..$#regexp);
-        my $match_func = eval "sub { $expr }";
-        die if $@;  # propagate $@; this shouldn't happen!
-        return $match_func;
-    }
-
-    sub bm_and { _bm_build('&&', @_) }
-    sub bm_or  { _bm_build('||', @_) }
-
-    $f1 = bm_and qw{
-            xterm
-            (?i)window
-    };
+=head2 Why don't word-boundary searches with C<\b> work for me?
+X<\b>
 
-    $f2 = bm_or qw{
-            \b[Ff]ree\b
-            \bBSD\B
-            (?i)sys(tem)?\s*[V5]\b
-    };
+(contributed by brian d foy)
 
-    # feed me /etc/termcap, prolly
-    while ( <> ) {
-        print "1: $_" if &$f1;
-        print "2: $_" if &$f2;
-    }
+Ensure that you know what \b really does: it's the boundary between a
+word character, \w, and something that isn't a word character. That
+thing that isn't a word character might be \W, but it can also be the
+start or end of the string.
 
-=head2 Why don't word-boundary searches with C<\b> work for me?
+It's not (not!) the boundary between whitespace and non-whitespace,
+and it's not the stuff between words we use to create sentences.
 
-Two common misconceptions are that C<\b> is a synonym for C<\s+>, and
-that it's the edge between whitespace characters and non-whitespace
-characters.  Neither is correct.  C<\b> is the place between a C<\w>
-character and a C<\W> character (that is, C<\b> is the edge of a
-"word").  It's a zero-width assertion, just like C<^>, C<$>, and all
-the other anchors, so it doesn't consume any characters.  L<perlre>
-describes the behaviour of all the regexp metacharacters.
+In regex speak, a word boundary (\b) is a "zero width assertion",
+meaning that it doesn't represent a character in the string, but a
+condition at a certain position.
 
-Here are examples of the incorrect application of C<\b>, with fixes:
+For the regular expression, /\bPerl\b/, there has to be a word
+boundary before the "P" and after the "l".  As long as something other
+than a word character precedes the "P" and succeeds the "l", the
+pattern will match. These strings match /\bPerl\b/.
 
-    "two words" =~ /(\w+)\b(\w+)/;	    # WRONG
-    "two words" =~ /(\w+)\s+(\w+)/;	    # right
+	"Perl"    # no word char before P or after l
+	"Perl "   # same as previous (space is not a word char)
+	"'Perl'"  # the ' char is not a word char
+	"Perl's"  # no word char before P, non-word char after "l"
 
-    " =matchless= text" =~ /\b=(\w+)=\b/;   # WRONG
-    " =matchless= text" =~ /=(\w+)=/;       # right
+These strings do not match /\bPerl\b/.
 
-Although they may not do what you thought they did, C<\b> and C<\B>
-can still be quite useful.  For an example of the correct use of
-C<\b>, see the example of matching duplicate words over multiple
-lines.
+	"Perl_"   # _ is a word char!
+	"Perler"  # no word char before P, but one after l
 
-An example of using C<\B> is the pattern C<\Bis\B>.  This will find
-occurrences of "is" on the insides of words only, as in "thistle", but
-not "this" or "island".
+You don't have to use \b to match words though.  You can look for
+non-word characters surrounded by word characters.  These strings
+match the pattern /\b'\b/.
 
-=head2 Why does using $&, $`, or $' slow my program down?
+	"don't"   # the ' char is surrounded by "n" and "t"
+	"qep'a'"  # the ' char is surrounded by "p" and "a"
 
-Because once Perl sees that you need one of these variables anywhere
-in the program, it has to provide them on each and every pattern
-match.  The same mechanism that handles these provides for the use of
-$1, $2, etc., so you pay the same price for each regexp that contains
-capturing parentheses. But if you never use $&, etc., in your script,
-then regexps I<without> capturing parentheses won't be penalized. So
-avoid $&, $', and $` if you can, but if you can't (and some algorithms
-really appreciate them), once you've used them once, use them at will,
-because you've already paid the price.
+These strings do not match /\b'\b/.
 
-=head2 What good is C<\G> in a regular expression?
+	"foo'"    # there is no word char after non-word '
 
-The notation C<\G> is used in a match or substitution in conjunction the
-C</g> modifier (and ignored if there's no C</g>) to anchor the regular
-expression to the point just past where the last match occurred, i.e. the
-pos() point.
+You can also use the complement of \b, \B, to specify that there
+should not be a word boundary.
 
-For example, suppose you had a line of text quoted in standard mail
-and Usenet notation, (that is, with leading C<E<gt>> characters), and
-you want change each leading C<E<gt>> into a corresponding C<:>.  You
-could do so in this way:
+In the pattern /\Bam\B/, there must be a word character before the "a"
+and after the "m". These patterns match /\Bam\B/:
 
-     s/^(>+)/':' x length($1)/gem;
+	"llama"   # "am" surrounded by word chars
+	"Samuel"  # same
 
-Or, using C<\G>, the much simpler (and faster):
+These strings do not match /\Bam\B/
 
-    s/\G>/:/g;
+	"Sam"      # no word boundary before "a", but one after "m"
+	"I am Sam" # "am" surrounded by non-word chars
 
-A more sophisticated use might involve a tokenizer.  The following
-lex-like example is courtesy of Jeffrey Friedl.  It did not work in
-5.003 due to bugs in that release, but does work in 5.004 or better:
 
-    while (<>) {
-      chomp;
-      PARSER: {
-           m/ \G( \d+\b    )/gx     && do { print "number: $1\n";  redo; };
-           m/ \G( \w+      )/gx     && do { print "word:   $1\n";  redo; };
-           m/ \G( \s+      )/gx     && do { print "space:  $1\n";  redo; };
-           m/ \G( [^\w\d]+ )/gx     && do { print "other:  $1\n";  redo; };
-      }
-    }
+=head2 Why does using $&, $`, or $' slow my program down?
+X<$MATCH> X<$&> X<$POSTMATCH> X<$'> X<$PREMATCH> X<$`>
+
+(contributed by Anno Siegel)
+
+Once Perl sees that you need one of these variables anywhere in the
+program, it provides them on each and every pattern match. That means
+that on every pattern match the entire string will be copied, part of it
+to $`, part to $&, and part to $'. Thus the penalty is most severe with
+long strings and patterns that match often. Avoid $&, $', and $` if you
+can, but if you can't, once you've used them at all, use them at will
+because you've already paid the price. Remember that some algorithms
+really appreciate them. As of the 5.005 release, the $& variable is no
+longer "expensive" the way the other two are.
+
+Since Perl 5.6.1 the special variables @- and @+ can functionally replace
+$`, $& and $'.  These arrays contain pointers to the beginning and end
+of each match (see perlvar for the full story), so they give you
+essentially the same information, but without the risk of excessive
+string copying.
+
+Perl 5.10 added three specials, C<${^MATCH}>, C<${^PREMATCH}>, and
+C<${^POSTMATCH}> to do the same job but without the global performance
+penalty. Perl 5.10 only sets these variables if you compile or execute the
+regular expression with the C</p> modifier.
 
-Of course, that could have been written as
-
-    while (<>) {
-      chomp;
-      PARSER: {
-	   if ( /\G( \d+\b    )/gx  {
-		print "number: $1\n";
-		redo PARSER;
-	   }
-	   if ( /\G( \w+      )/gx  {
-		print "word: $1\n";
-		redo PARSER;
-	   }
-	   if ( /\G( \s+      )/gx  {
-		print "space: $1\n";
-		redo PARSER;
-	   }
-	   if ( /\G( [^\w\d]+ )/gx  {
-		print "other: $1\n";
-		redo PARSER;
-	   }
-      }
-    }
+=head2 What good is C<\G> in a regular expression?
+X<\G>
+
+You use the C<\G> anchor to start the next match on the same
+string where the last match left off.  The regular
+expression engine cannot skip over any characters to find
+the next match with this anchor, so C<\G> is similar to the
+beginning of string anchor, C<^>.  The C<\G> anchor is typically
+used with the C<g> flag.  It uses the value of C<pos()>
+as the position to start the next match.  As the match
+operator makes successive matches, it updates C<pos()> with the
+position of the next character past the last match (or the
+first character of the next match, depending on how you like
+to look at it). Each string has its own C<pos()> value.
+
+Suppose you want to match all of consecutive pairs of digits
+in a string like "1122a44" and stop matching when you
+encounter non-digits.  You want to match C<11> and C<22> but
+the letter <a> shows up between C<22> and C<44> and you want
+to stop at C<a>. Simply matching pairs of digits skips over
+the C<a> and still matches C<44>.
+
+	$_ = "1122a44";
+	my @pairs = m/(\d\d)/g;   # qw( 11 22 44 )
+
+If you use the C<\G> anchor, you force the match after C<22> to
+start with the C<a>.  The regular expression cannot match
+there since it does not find a digit, so the next match
+fails and the match operator returns the pairs it already
+found.
+
+	$_ = "1122a44";
+	my @pairs = m/\G(\d\d)/g; # qw( 11 22 )
+
+You can also use the C<\G> anchor in scalar context. You
+still need the C<g> flag.
+
+	$_ = "1122a44";
+	while( m/\G(\d\d)/g )
+		{
+		print "Found $1\n";
+		}
+
+After the match fails at the letter C<a>, perl resets C<pos()>
+and the next match on the same string starts at the beginning.
+
+	$_ = "1122a44";
+	while( m/\G(\d\d)/g )
+		{
+		print "Found $1\n";
+		}
+
+	print "Found $1 after while" if m/(\d\d)/g; # finds "11"
+
+You can disable C<pos()> resets on fail with the C<c> flag, documented
+in L<perlop> and L<perlreref>. Subsequent matches start where the last
+successful match ended (the value of C<pos()>) even if a match on the
+same string has failed in the meantime. In this case, the match after
+the C<while()> loop starts at the C<a> (where the last match stopped),
+and since it does not use any anchor it can skip over the C<a> to find
+C<44>.
+
+	$_ = "1122a44";
+	while( m/\G(\d\d)/gc )
+		{
+		print "Found $1\n";
+		}
+
+	print "Found $1 after while" if m/(\d\d)/g; # finds "44"
+
+Typically you use the C<\G> anchor with the C<c> flag
+when you want to try a different match if one fails,
+such as in a tokenizer. Jeffrey Friedl offers this example
+which works in 5.004 or later.
+
+	while (<>) {
+		chomp;
+		PARSER: {
+			m/ \G( \d+\b    )/gcx   && do { print "number: $1\n";  redo; };
+			m/ \G( \w+      )/gcx   && do { print "word:   $1\n";  redo; };
+			m/ \G( \s+      )/gcx   && do { print "space:  $1\n";  redo; };
+			m/ \G( [^\w\d]+ )/gcx   && do { print "other:  $1\n";  redo; };
+		}
+	}
 
-But then you lose the vertical alignment of the regular expressions.
+For each line, the C<PARSER> loop first tries to match a series
+of digits followed by a word boundary.  This match has to
+start at the place the last match left off (or the beginning
+of the string on the first match). Since C<m/ \G( \d+\b
+)/gcx> uses the C<c> flag, if the string does not match that
+regular expression, perl does not reset pos() and the next
+match starts at the same position to try a different
+pattern.
 
-=head2 Are Perl regexps DFAs or NFAs?  Are they POSIX compliant?
+=head2 Are Perl regexes DFAs or NFAs?  Are they POSIX compliant?
+X<DFA> X<NFA> X<POSIX>
 
 While it's true that Perl's regular expressions resemble the DFAs
 (deterministic finite automata) of the egrep(1) program, they are in
@@ -530,22 +961,37 @@ guaranteed is slowness.)  See the book "Mastering Regular Expressions"
 hope to know on these matters (a full citation appears in
 L<perlfaq2>).
 
-=head2 What's wrong with using grep or map in a void context?
+=head2 What's wrong with using grep in a void context?
+X<grep>
+
+The problem is that grep builds a return list, regardless of the context.
+This means you're making Perl go to the trouble of building a list that
+you then just throw away. If the list is large, you waste both time and space.
+If your intent is to iterate over the list, then use a for loop for this
+purpose.
 
-Strictly speaking, nothing.  Stylistically speaking, it's not a good
-way to write maintainable code.  That's because you're using these
-constructs not for their return values but rather for their
-side-effects, and side-effects can be mystifying.  There's no void
-grep() that's not better written as a C<for> (well, C<foreach>,
-technically) loop.
+In perls older than 5.8.1, map suffers from this problem as well.
+But since 5.8.1, this has been fixed, and map is context aware - in void
+context, no lists are constructed.
 
 =head2 How can I match strings with multibyte characters?
+X<regex, and multibyte characters> X<regexp, and multibyte characters>
+X<regular expression, and multibyte characters> X<martian> X<encoding, Martian>
+
+Starting from Perl 5.6 Perl has had some level of multibyte character
+support.  Perl 5.8 or later is recommended.  Supported multibyte
+character repertoires include Unicode, and legacy encodings
+through the Encode module.  See L<perluniintro>, L<perlunicode>,
+and L<Encode>.
 
-This is hard, and there's no good way.  Perl does not directly support
-wide characters.  It pretends that a byte and a character are
-synonymous.  The following set of approaches was offered by Jeffrey
-Friedl, whose article in issue #5 of The Perl Journal talks about this
-very matter.
+If you are stuck with older Perls, you can do Unicode with the
+C<Unicode::String> module, and character conversions using the
+C<Unicode::Map8> and C<Unicode::Map> modules.  If you are using
+Japanese encodings, you might try using the jperl 5.005_03.
+
+Finally, the following set of approaches was offered by Jeffrey
+Friedl, whose article in issue #5 of The Perl Journal talks about
+this very matter.
 
 Let's suppose you have some weird Martian encoding where pairs of
 ASCII uppercase letters encode single Martian letters (i.e. the two
@@ -564,40 +1010,134 @@ looks like it is because "SG" is next to "XX", but there's no real
 
 Here are a few ways, all painful, to deal with it:
 
-   $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian'' bytes
-                                      # are no longer adjacent.
-   print "found GX!\n" if $martian =~ /GX/;
+	# Make sure adjacent "martian" bytes are no longer adjacent.
+	$martian =~ s/([A-Z][A-Z])/ $1 /g;
+
+	print "found GX!\n" if $martian =~ /GX/;
 
 Or like this:
 
-   @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
-   # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
-   #
-   foreach $char (@chars) {
-       print "found GX!\n", last if $char eq 'GX';
-   }
+	@chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
+	# above is conceptually similar to:     @chars = $text =~ m/(.)/g;
+	#
+	foreach $char (@chars) {
+	print "found GX!\n", last if $char eq 'GX';
+	}
 
 Or like this:
 
-   while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
-       print "found GX!\n", last if $1 eq 'GX';
-   }
+	while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
+		print "found GX!\n", last if $1 eq 'GX';
+		}
 
-Or like this:
+Here's another, slightly less painful, way to do it from Benjamin
+Goldberg, who uses a zero-width negative look-behind assertion.
+
+	print "found GX!\n" if	$martian =~ m/
+		(?<![A-Z])
+		(?:[A-Z][A-Z])*?
+		GX
+		/x;
+
+This succeeds if the "martian" character GX is in the string, and fails
+otherwise.  If you don't like using (?<!), a zero-width negative
+look-behind assertion, you can replace (?<![A-Z]) with (?:^|[^A-Z]).
+
+It does have the drawback of putting the wrong thing in $-[0] and $+[0],
+but this usually can be worked around.
+
+=head2 How do I match a regular expression that's in a variable?
+X<regex, in variable> X<eval> X<regex> X<quotemeta> X<\Q, regex>
+X<\E, regex>, X<qr//>
+
+(contributed by brian d foy)
+
+We don't have to hard-code patterns into the match operator (or
+anything else that works with regular expressions). We can put the
+pattern in a variable for later use.
+
+The match operator is a double quote context, so you can interpolate
+your variable just like a double quoted string. In this case, you
+read the regular expression as user input and store it in C<$regex>.
+Once you have the pattern in C<$regex>, you use that variable in the
+match operator.
+
+	chomp( my $regex = <STDIN> );
+
+	if( $string =~ m/$regex/ ) { ... }
+
+Any regular expression special characters in C<$regex> are still
+special, and the pattern still has to be valid or Perl will complain.
+For instance, in this pattern there is an unpaired parenthesis.
+
+	my $regex = "Unmatched ( paren";
 
-   die "sorry, Perl doesn't (yet) have Martian support )-:\n";
+	"Two parens to bind them all" =~ m/$regex/;
 
-In addition, a sample program which converts half-width to full-width
-katakana (in Shift-JIS or EUC encoding) is available from CPAN as
+When Perl compiles the regular expression, it treats the parenthesis
+as the start of a memory match. When it doesn't find the closing
+parenthesis, it complains:
 
-=for Tom make it so
+	Unmatched ( in regex; marked by <-- HERE in m/Unmatched ( <-- HERE  paren/ at script line 3.
 
-There are many double- (and multi-) byte encodings commonly used these
-days.  Some versions of these have 1-, 2-, 3-, and 4-byte characters,
-all mixed.
+You can get around this in several ways depending on our situation.
+First, if you don't want any of the characters in the string to be
+special, you can escape them with C<quotemeta> before you use the string.
+
+	chomp( my $regex = <STDIN> );
+	$regex = quotemeta( $regex );
+
+	if( $string =~ m/$regex/ ) { ... }
+
+You can also do this directly in the match operator using the C<\Q>
+and C<\E> sequences. The C<\Q> tells Perl where to start escaping
+special characters, and the C<\E> tells it where to stop (see L<perlop>
+for more details).
+
+	chomp( my $regex = <STDIN> );
+
+	if( $string =~ m/\Q$regex\E/ ) { ... }
+
+Alternately, you can use C<qr//>, the regular expression quote operator (see
+L<perlop> for more details).  It quotes and perhaps compiles the pattern,
+and you can apply regular expression flags to the pattern.
+
+	chomp( my $input = <STDIN> );
+
+	my $regex = qr/$input/is;
+
+	$string =~ m/$regex/  # same as m/$input/is;
+
+You might also want to trap any errors by wrapping an C<eval> block
+around the whole thing.
+
+	chomp( my $input = <STDIN> );
+
+	eval {
+		if( $string =~ m/\Q$input\E/ ) { ... }
+		};
+	warn $@ if $@;
+
+Or...
+
+	my $regex = eval { qr/$input/is };
+	if( defined $regex ) {
+		$string =~ m/$regex/;
+		}
+	else {
+		warn $@;
+		}
 
 =head1 AUTHOR AND COPYRIGHT
 
-Copyright (c) 1997 Tom Christiansen and Nathan Torkington.
-All rights reserved.  See L<perlfaq> for distribution information.
+Copyright (c) 1997-2010 Tom Christiansen, Nathan Torkington, and
+other authors as noted. All rights reserved.
+
+This documentation is free; you can redistribute it and/or modify it
+under the same terms as Perl itself.
 
+Irrespective of its distribution, all code examples in this file
+are hereby placed into the public domain.  You are permitted and
+encouraged to use this code in your own programs for fun
+or for profit as you see fit.  A simple comment in the code giving
+credit would be courteous but is not required.