68dc0745 1=head1 NAME
2
c195e131 3perlfaq6 - Regular Expressions ($Revision: 10126 $)
68dc0745 4
5=head1 DESCRIPTION
6
7This section is surprisingly small because the rest of the FAQ is
8littered with answers involving regular expressions. For example,
9decoding a URL and checking whether something is a number are handled
10with regular expressions, but those answers are found elsewhere in
b432a672 11this document (in L<perlfaq9>: "How do I decode or create those %-encodings
12on the web" and L<perlfaq4>: "How do I determine whether a scalar is
13a number/whole/integer/float", to be precise).
68dc0745 14
54310121 15=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
d74e8afc 16X<regex, legibility> X<regexp, legibility>
17X<regular expression, legibility> X</x>
68dc0745 18
19Three techniques can make regular expressions maintainable and
20understandable.
21
22=over 4
23
d92eb7b0 24=item Comments Outside the Regex
68dc0745 25
26Describe what you're doing and how you're doing it, using normal Perl
27comments.
28
ac9dac7f 29 # turn the line into the first word, a colon, and the
30 # number of characters on the rest of the line
31 s/^(\w+)(.*)/ lc($1) . ":" . length($2) /meg;
68dc0745 32
d92eb7b0 33=item Comments Inside the Regex
68dc0745 34
d92eb7b0 35The C</x> modifier causes whitespace to be ignored in a regex pattern
68dc0745 36(except in a character class), and also allows you to use normal
37comments there, too. As you can imagine, whitespace and comments help
38a lot.
39
40C</x> lets you turn this:
41
ac9dac7f 42 s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;
68dc0745 43
44into this:
45
ac9dac7f 46 s{ < # opening angle bracket
47 (?: # Non-backreffing grouping paren
48 [^>'"] * # 0 or more things that are neither > nor ' nor "
49 | # or else
50 ".*?" # a section between double quotes (stingy match)
51 | # or else
52 '.*?' # a section between single quotes (stingy match)
53 ) + # all occurring one or more times
54 > # closing angle bracket
55 }{}gsx; # replace with nothing, i.e. delete
68dc0745 56
57It's still not quite so clear as prose, but it is very useful for
58describing the meaning of each part of the pattern.
59
60=item Different Delimiters
61
62While we normally think of patterns as being delimited with C</>
63characters, they can be delimited by almost any character. L<perlre>
64describes this. For example, the C<s///> above uses braces as
65delimiters. Selecting another delimiter can avoid quoting the
66delimiter within the pattern:
67
ac9dac7f 68 s/\/usr\/local/\/usr\/share/g; # bad delimiter choice
69 s#/usr/local#/usr/share#g; # better
68dc0745 70
71=back
72
73=head2 I'm having trouble matching over more than one line. What's wrong?
d74e8afc 74X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
68dc0745 75
3392b9ec 76Either you don't have more than one line in the string you're looking
77at (probably), or else you aren't using the correct modifier(s) on
78your pattern (possibly).
68dc0745 79
80There are many ways to get multiline data into a string. If you want
81it to happen automatically while reading input, you'll want to set $/
82(probably to '' for paragraphs or C<undef> for the whole file) to
83allow you to read more than one line at a time.
84
85Read L<perlre> to help you decide which of C</s> and C</m> (or both)
86you might want to use: C</s> allows dot to include newline, and C</m>
87allows caret and dollar to match next to a newline, not just at the
88end of the string. You do need to make sure that you've actually
89got a multiline string in there.
90
91For example, this program detects duplicate words, even when they span
92line breaks (but not paragraph ones). For this example, we don't need
93C</s> because we aren't using dot in a regular expression that we want
94to cross line boundaries. Neither do we need C</m> because we aren't
95wanting caret or dollar to match at any point inside the record next
96to newlines. But it's imperative that $/ be set to something other
97than the default, or else we won't actually ever have a multiline
98record read in.
99
    $/ = '';  # read in a whole paragraph, not just one line
    while ( <> ) {
        while ( /\b([\w'-]+)(\s+\1)+\b/gi ) {  # word starts alpha
            print "Duplicate $1 at paragraph $.\n";
        }
    }
68dc0745 106
107Here's code that finds sentences that begin with "From " (which would
108be mangled by many mailers):
109
    $/ = '';  # read in a whole paragraph, not just one line
    while ( <> ) {
        while ( /^From /gm ) {  # /m makes ^ match next to \n
            print "leading from in paragraph $.\n";
        }
    }
68dc0745 116
Here's code that finds everything between START and END anywhere in a file:
118
ac9dac7f 119 undef $/; # read in whole file, not just one line or paragraph
120 while ( <> ) {
121 while ( /START(.*?)END/sgm ) { # /s makes . cross line boundaries
122 print "$1\n";
123 }
68dc0745 124 }
68dc0745 125
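To see the two modifiers side by side without reading any files, here is a minimal sketch that works on an in-memory string. The sample text and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample record holding three lines.
my $text = "START one\ntwo END\nFrom here\n";

# Without /s the dot stops at the newline, so this fails to match.
my $without_s = $text =~ /START(.*?)END/ ? 1 : 0;

# With /s the dot also matches "\n", so the capture spans two lines.
my ($between) = $text =~ /START(.*?)END/s;

# With /m the caret matches after each embedded newline too.
my @from_lines = $text =~ /^(From \w+)/mg;

print "without /s: $without_s\n";
print "between: $between\n";
print "from: @from_lines\n";
```

If the string held only one line, neither modifier would change anything, which is why getting multiline data into the string comes first.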
126=head2 How can I pull out lines between two patterns that are themselves on different lines?
d74e8afc 127X<..>
68dc0745 128
129You can use Perl's somewhat exotic C<..> operator (documented in
130L<perlop>):
131
ac9dac7f 132 perl -ne 'print if /START/ .. /END/' file1 file2 ...
68dc0745 133
134If you wanted text and not lines, you would use
135
ac9dac7f 136 perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...
68dc0745 137
138But if you want nested occurrences of C<START> through C<END>, you'll
139run up against the problem described in the question in this section
140on matching balanced text.
141
5a964f20 142Here's another example of using C<..>:
143
ac9dac7f 144 while (<>) {
145 $in_header = 1 .. /^$/;
e573f903 146 $in_body = /^$/ .. eof;
5a964f20 147 # now choose between them
ac9dac7f 148 } continue {
e573f903 149 $. = 0 if eof; # fix $.
ac9dac7f 150 }
5a964f20 151
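If you want to experiment with C<..> before pointing it at real files, a small sketch over an in-memory list behaves the same way. The list contents here are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical input lines; a real program would read them from <>.
my @lines = ("intro", "START", "keep me", "END", "outro");

my @picked;
for (@lines) {
    # The flip-flop turns true on the line matching /START/ and stays
    # true through the line matching /END/, inclusive.
    push @picked, $_ if /START/ .. /END/;
}

print join(",", @picked), "\n";
```

Note that both boundary lines are included in the result; filter them out afterwards if you only want what lies between.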
68dc0745 152=head2 I put a regular expression into $/ but it didn't work. What's wrong?
d74e8afc 153X<$/, regexes in> X<$INPUT_RECORD_SEPARATOR, regexes in>
154X<$RS, regexes in>
68dc0745 155
c195e131 156$/ has to be a string. You can use these examples if you really need to
157do this.
49d635f9 158
28b41a80 159If you have File::Stream, this is easy.
160
ac9dac7f 161 use File::Stream;
162
163 my $stream = File::Stream->new(
164 $filehandle,
165 separator => qr/\s*,\s*/,
166 );
28b41a80 167
ac9dac7f 168 print "$_\n" while <$stream>;
28b41a80 169
170If you don't have File::Stream, you have to do a little more work.
171
172You can use the four argument form of sysread to continually add to
197aec24 173a buffer. After you add to the buffer, you check if you have a
49d635f9 174complete line (using your regular expression).
175
    local $_ = "";
    while( sysread FH, $_, 8192, length ) {
        while( s/^((?s).*?)your_pattern// ) {
            my $record = $1;
            # do stuff here.
        }
    }
197aec24 183
You can do the same thing with foreach and a match using the
c flag and the \G anchor, if you do not mind your entire file
being in memory at the end.
197aec24 187
ac9dac7f 188 local $_ = "";
189 while( sysread FH, $_, 8192, length ) {
190 foreach my $record ( m/\G((?s).*?)your_pattern/gc ) {
191 # do stuff here.
192 }
193 substr( $_, 0, pos ) = "" if pos;
194 }
68dc0745 195
3fe9a6f1 196
a6dd486b 197=head2 How do I substitute case insensitively on the LHS while preserving case on the RHS?
d74e8afc 198X<replace, case preserving> X<substitute, case preserving>
199X<substitution, case preserving> X<s, case preserving>
68dc0745 200
d92eb7b0 201Here's a lovely Perlish solution by Larry Rosler. It exploits
202properties of bitwise xor on ASCII strings.
203
ac9dac7f 204 $_= "this is a TEsT case";
d92eb7b0 205
ac9dac7f 206 $old = 'test';
207 $new = 'success';
d92eb7b0 208
ac9dac7f 209 s{(\Q$old\E)}
210 { uc $new | (uc $1 ^ $1) .
211 (uc(substr $1, -1) ^ substr $1, -1) x
212 (length($new) - length $1)
213 }egi;
d92eb7b0 214
ac9dac7f 215 print;
d92eb7b0 216
8305e449 217And here it is as a subroutine, modeled after the above:
d92eb7b0 218
ac9dac7f 219 sub preserve_case($$) {
220 my ($old, $new) = @_;
221 my $mask = uc $old ^ $old;
d92eb7b0 222
ac9dac7f 223 uc $new | $mask .
224 substr($mask, -1) x (length($new) - length($old))
d92eb7b0 225 }
226
ac9dac7f 227 $a = "this is a TEsT case";
228 $a =~ s/(test)/preserve_case($1, "success")/egi;
229 print "$a\n";
d92eb7b0 230
231This prints:
232
ac9dac7f 233 this is a SUcCESS case
d92eb7b0 234
74b9445a 235As an alternative, to keep the case of the replacement word if it is
236longer than the original, you can use this code, by Jeff Pinyan:
237
ac9dac7f 238 sub preserve_case {
239 my ($from, $to) = @_;
240 my ($lf, $lt) = map length, @_;
7207e29d 241
ac9dac7f 242 if ($lt < $lf) { $from = substr $from, 0, $lt }
243 else { $from .= substr $to, $lf }
7207e29d 244
ac9dac7f 245 return uc $to | ($from ^ uc $from);
246 }
74b9445a 247
248This changes the sentence to "this is a SUcCess case."
249
d92eb7b0 250Just to show that C programmers can write C in any programming language,
251if you prefer a more C-like solution, the following script makes the
252substitution have the same case, letter by letter, as the original.
253(It also happens to run about 240% slower than the Perlish solution runs.)
254If the substitution has more characters than the string being substituted,
255the case of the last character is used for the rest of the substitution.
68dc0745 256
ac9dac7f 257 # Original by Nathan Torkington, massaged by Jeffrey Friedl
258 #
259 sub preserve_case($$)
260 {
261 my ($old, $new) = @_;
262 my ($state) = 0; # 0 = no change; 1 = lc; 2 = uc
263 my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
264 my ($len) = $oldlen < $newlen ? $oldlen : $newlen;
265
266 for ($i = 0; $i < $len; $i++) {
267 if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
268 $state = 0;
269 } elsif (lc $c eq $c) {
270 substr($new, $i, 1) = lc(substr($new, $i, 1));
271 $state = 1;
272 } else {
273 substr($new, $i, 1) = uc(substr($new, $i, 1));
274 $state = 2;
275 }
276 }
277 # finish up with any remaining new (for when new is longer than old)
278 if ($newlen > $oldlen) {
279 if ($state == 1) {
280 substr($new, $oldlen) = lc(substr($new, $oldlen));
281 } elsif ($state == 2) {
282 substr($new, $oldlen) = uc(substr($new, $oldlen));
283 }
284 }
285 return $new;
286 }
68dc0745 287
5a964f20 288=head2 How can I make C<\w> match national character sets?
d74e8afc 289X<\w>
68dc0745 290
49d635f9 291Put C<use locale;> in your script. The \w character class is taken
292from the current locale.
293
294See L<perllocale> for details.
68dc0745 295
296=head2 How can I match a locale-smart version of C</[a-zA-Z]/>?
d74e8afc 297X<alpha>
68dc0745 298
49d635f9 299You can use the POSIX character class syntax C</[[:alpha:]]/>
300documented in L<perlre>.
301
302No matter which locale you are in, the alphabetic characters are
303the characters in \w without the digits and the underscore.
304As a regex, that looks like C</[^\W\d_]/>. Its complement,
197aec24 305the non-alphabetics, is then everything in \W along with
306the digits and the underscore, or C</[\W\d_]/>.
68dc0745 307
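As a quick check of the two character classes, this sketch splits a made-up string into alphabetic and non-alphabetic characters. It runs without C<use locale>, so only the default notion of C<\w> applies here.

```perl
use strict;
use warnings;

my $alpha    = qr/[^\W\d_]/;   # word characters minus digits and underscore
my $nonalpha = qr/[\W\d_]/;    # the complement

my $sample  = "ab_12-cd";      # hypothetical test string
my @letters = $sample =~ /($alpha)/g;
my @others  = $sample =~ /($nonalpha)/g;

print "letters: @letters\n";
print "others: @others\n";
```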
d92eb7b0 308=head2 How can I quote a variable to use in a regex?
d74e8afc 309X<regex, escaping> X<regexp, escaping> X<regular expression, escaping>
68dc0745 310
311The Perl parser will expand $variable and @variable references in
312regular expressions unless the delimiter is a single quote. Remember,
79a522f5 313too, that the right-hand side of a C<s///> substitution is considered
68dc0745 314a double-quoted string (see L<perlop> for more details). Remember
d92eb7b0 315also that any regex special characters will be acted on unless you
68dc0745 316precede the substitution with \Q. Here's an example:
317
ac9dac7f 318 $string = "Placido P. Octopus";
319 $regex = "P.";
68dc0745 320
ac9dac7f 321 $string =~ s/$regex/Polyp/;
322 # $string is now "Polypacido P. Octopus"
68dc0745 323
Because C<.> is special in regular expressions, and can match any
single character, the regex C<P.> here has matched the C<Pl> in the
original string.
327
328To escape the special meaning of C<.>, we use C<\Q>:
329
ac9dac7f 330 $string = "Placido P. Octopus";
331 $regex = "P.";
c83084d1 332
ac9dac7f 333 $string =~ s/\Q$regex/Polyp/;
334 # $string is now "Placido Polyp Octopus"
c83084d1 335
The use of C<\Q> causes the C<.> in the regex to be treated as a
regular character, so that C<P.> matches a C<P> followed by a dot.
68dc0745 338
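The built-in C<quotemeta> function does the same escaping as C<\Q>, which is handy when you want to look at the escaped text or build a pattern once and reuse it. A minimal sketch, reusing the Octopus example (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $regex = "P.";

# quotemeta() returns the escaped string that \Q would interpolate.
my $quoted = quotemeta $regex;      # the two characters P. become P\.

# Compile once with the literal text embedded, then reuse the pattern.
my $literal = qr/\Q$regex\E/;

my $string = "Placido P. Octopus";
(my $changed = $string) =~ s/$literal/Polyp/;

print "$quoted\n";
print "$changed\n";
```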
339=head2 What is C</o> really for?
ee891a00 340X</o, regular expressions> X<compile, regular expressions>
68dc0745 341
ee891a00 342(contributed by brian d foy)
68dc0745 343
ee891a00 344The C</o> option for regular expressions (documented in L<perlop> and
345L<perlreref>) tells Perl to compile the regular expression only once.
346This is only useful when the pattern contains a variable. Perls 5.6
347and later handle this automatically if the pattern does not change.
68dc0745 348
ee891a00 349Since the match operator C<m//>, the substitution operator C<s///>,
350and the regular expression quoting operator C<qr//> are double-quotish
351constructs, you can interpolate variables into the pattern. See the
352answer to "How can I quote a variable to use in a regex?" for more
353details.
68dc0745 354
ee891a00 355This example takes a regular expression from the argument list and
356prints the lines of input that match it:
68dc0745 357
ee891a00 358 my $pattern = shift @ARGV;
359
360 while( <> ) {
361 print if m/$pattern/;
362 }
363
364Versions of Perl prior to 5.6 would recompile the regular expression
365for each iteration, even if C<$pattern> had not changed. The C</o>
366would prevent this by telling Perl to compile the pattern the first
367time, then reuse that for subsequent iterations:
368
369 my $pattern = shift @ARGV;
370
371 while( <> ) {
372 print if m/$pattern/o; # useful for Perl < 5.6
373 }
374
375In versions 5.6 and later, Perl won't recompile the regular expression
376if the variable hasn't changed, so you probably don't need the C</o>
377option. It doesn't hurt, but it doesn't help either. If you want any
378version of Perl to compile the regular expression only once even if
379the variable changes (thus, only using its initial value), you still
380need the C</o>.
381
You can watch Perl's regular expression engine at work to verify for
yourself if Perl is recompiling a regular expression. The C<use re
'debug'> pragma (which comes with Perl 5.005 and later) shows the details.
With Perls before 5.6, you should see C<re> reporting that it's
compiling the regular expression on each iteration. With Perl 5.6 or
later, you should only see C<re> report that for the first iteration.
388
389 use re 'debug';
390
391 $regex = 'Perl';
392 foreach ( qw(Perl Java Ruby Python) ) {
393 print STDERR "-" x 73, "\n";
394 print STDERR "Trying $_...\n";
395 print STDERR "\t$_ is good!\n" if m/$regex/;
396 }
68dc0745 397
398=head2 How do I use a regular expression to strip C style comments from a file?
399
400While this actually can be done, it's much harder than you'd think.
401For example, this one-liner
402
ac9dac7f 403 perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c
68dc0745 404
405will work in many but not all cases. You see, it's too simple-minded for
406certain kinds of C programs, in particular, those with what appear to be
407comments in quoted strings. For that, you'd need something like this,
d92eb7b0 408created by Jeffrey Friedl and later modified by Fred Curtis.
68dc0745 409
ac9dac7f 410 $/ = undef;
411 $_ = <>;
412 s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
413 print;
68dc0745 414
415This could, of course, be more legibly written with the C</x> modifier, adding
d92eb7b0 416whitespace and comments. Here it is expanded, courtesy of Fred Curtis.
417
418 s{
419 /\* ## Start of /* ... */ comment
420 [^*]*\*+ ## Non-* followed by 1-or-more *'s
421 (
422 [^/*][^*]*\*+
423 )* ## 0-or-more things which don't start with /
424 ## but do end with '*'
425 / ## End of /* ... */ comment
426
427 | ## OR various things which aren't comments:
428
429 (
430 " ## Start of " ... " string
431 (
432 \\. ## Escaped char
433 | ## OR
434 [^"\\] ## Non "\
435 )*
436 " ## End of " ... " string
437
438 | ## OR
439
440 ' ## Start of ' ... ' string
441 (
442 \\. ## Escaped char
443 | ## OR
444 [^'\\] ## Non '\
445 )*
446 ' ## End of ' ... ' string
447
448 | ## OR
449
450 . ## Anything other char
451 [^/"'\\]* ## Chars which doesn't start a comment, string or escape
452 )
c98c5709 453 }{defined $2 ? $2 : ""}gxse;
d92eb7b0 454
A slight modification also removes C++ comments, as long as they are not
spread over multiple lines using a continuation character:
d92eb7b0 457
ac9dac7f 458 s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//[^\n]*|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
68dc0745 459
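To convince yourself that the longer pattern leaves quoted strings alone, you can try it on a single line held in a variable. This is only a sketch: the C fragment is invented, and the pattern is the C-only version shown earlier in this answer.

```perl
use strict;
use warnings;

# Hypothetical one-line C fragment: a real comment plus a string that
# merely looks like it contains one.
my $code = 'int x; /* gone */ char *s = "/* kept */";';

(my $stripped = $code) =~
    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;

print "$stripped\n";
```

The comment disappears (leaving the spaces that surrounded it), while the comment-like text inside the double-quoted string survives because the string alternative matches it first.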
460=head2 Can I use Perl regular expressions to match balanced text?
X<regex, matching balanced text> X<regexp, matching balanced text>
X<regular expression, matching balanced text>
68dc0745 463
Historically, Perl regular expressions were not capable of matching
balanced text. More recent versions of perl, including 5.6.1, added
experimental features that make it possible. Look at the documentation
for the C<(??{ })> construct in recent L<perlre> manual pages to see an
example of matching balanced parentheses. Be sure to take special notice
of the warnings present in the manual before making use of this feature.
471
472CPAN contains many modules that can be useful for matching text
473depending on the context. Damian Conway provides some useful
474patterns in Regexp::Common. The module Text::Balanced provides a
475general solution to this problem.
476
477One of the common applications of balanced text matching is working
478with XML and HTML. There are many modules available that support
479these needs. Two examples are HTML::Parser and XML::Parser. There
480are many others.
68dc0745 481
482An elaborate subroutine (for 7-bit ASCII only) to pull out balanced
483and possibly nested single chars, like C<`> and C<'>, C<{> and C<}>,
484or C<(> and C<)> can be found in
a93751fa 485http://www.cpan.org/authors/id/TOMC/scripts/pull_quotes.gz .
68dc0745 486
8305e449 487The C::Scan module from CPAN also contains such subs for internal use,
68dc0745 488but they are undocumented.
489
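For the common case of nested brackets, the Text::Balanced module mentioned above can be used roughly like this. This is a minimal sketch with a made-up input string; see the module's own documentation for the prefix handling and the other extraction functions.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = "(a (nested) group) and the rest";   # hypothetical input

# In list context extract_bracketed returns the balanced substring
# first and the remainder of the text second.
my ($group, $rest) = extract_bracketed($text, "()");

print "group: $group\n";
print "rest: $rest\n";
```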
d92eb7b0 490=head2 What does it mean that regexes are greedy? How can I get around it?
d74e8afc 491X<greedy> X<greediness>
68dc0745 492
d92eb7b0 493Most people mean that greedy regexes match as much as they can.
68dc0745 494Technically speaking, it's actually the quantifiers (C<?>, C<*>, C<+>,
495C<{}>) that are greedy rather than the whole pattern; Perl prefers local
496greed and immediate gratification to overall greed. To get non-greedy
497versions of the same quantifiers, use (C<??>, C<*?>, C<+?>, C<{}?>).
498
499An example:
500
ac9dac7f 501 $s1 = $s2 = "I am very very cold";
502 $s1 =~ s/ve.*y //; # I am cold
503 $s2 =~ s/ve.*?y //; # I am very cold
68dc0745 504
505Notice how the second substitution stopped matching as soon as it
506encountered "y ". The C<*?> quantifier effectively tells the regular
507expression engine to find a match as quickly as possible and pass
508control on to whatever is next in line, like you would if you were
509playing hot potato.
510
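The same difference shows up when you capture rather than substitute. Here is a small sketch with an invented scrap of markup:

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";   # hypothetical sample

my ($greedy) = $html =~ /<(.+)>/;    # grabs as much as it can
my ($stingy) = $html =~ /<(.+?)>/;   # stops at the first ">"

print "greedy: $greedy\n";
print "stingy: $stingy\n";
```

The greedy version runs all the way to the last C<< > >> in the string, while the stingy one captures just C<b>.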
f9ac83b8 511=head2 How do I process each word on each line?
d74e8afc 512X<word>
68dc0745 513
514Use the split function:
515
ac9dac7f 516 while (<>) {
517 foreach $word ( split ) {
518 # do something with $word here
519 }
197aec24 520 }
68dc0745 521
54310121 522Note that this isn't really a word in the English sense; it's just
523chunks of consecutive non-whitespace characters.
68dc0745 524
f1cbbd6e 525To work with only alphanumeric sequences (including underscores), you
526might consider
68dc0745 527
ac9dac7f 528 while (<>) {
529 foreach $word (m/(\w+)/g) {
530 # do something with $word here
531 }
68dc0745 532 }
68dc0745 533
534=head2 How can I print out a word-frequency or line-frequency summary?
535
536To do this, you have to parse out each word in the input stream. We'll
54310121 537pretend that by word you mean chunk of alphabetics, hyphens, or
538apostrophes, rather than the non-whitespace chunk idea of a word given
68dc0745 539in the previous question:
540
ac9dac7f 541 while (<>) {
542 while ( /(\b[^\W_\d][\w'-]+\b)/g ) { # misses "`sheep'"
543 $seen{$1}++;
544 }
54310121 545 }
ac9dac7f 546
547 while ( ($word, $count) = each %seen ) {
548 print "$count $word\n";
549 }
68dc0745 550
551If you wanted to do the same thing for lines, you wouldn't need a
552regular expression:
553
ac9dac7f 554 while (<>) {
555 $seen{$_}++;
556 }
557
558 while ( ($line, $count) = each %seen ) {
559 print "$count $line";
560 }
68dc0745 561
b432a672 562If you want these output in a sorted order, see L<perlfaq4>: "How do I
563sort a hash (optionally by value instead of key)?".
68dc0745 564
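As one possible way to order the report, this sketch counts words from an in-memory list and prints the most frequent first. The input lines and the tie-breaking rule are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical input; a real program would loop over <>.
my @input = ("the cat saw the dog\n", "the dog ran\n");

my %seen;
for my $line (@input) {
    while ( $line =~ /(\b[^\W_\d][\w'-]+\b)/g ) {
        $seen{$1}++;
    }
}

# Highest count first; ties broken alphabetically for stable output.
my @report = map  { "$seen{$_} $_" }
             sort { $seen{$b} <=> $seen{$a} || $a cmp $b } keys %seen;

print "$_\n" for @report;
```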
565=head2 How can I do approximate matching?
d74e8afc 566X<match, approximate> X<matching, approximate>
68dc0745 567
568See the module String::Approx available from CPAN.
569
570=head2 How do I efficiently match many regular expressions at once?
d74e8afc 571X<regex, efficiency> X<regexp, efficiency>
572X<regular expression, efficiency>
68dc0745 573
7678cced 574( contributed by brian d foy )
575
6670e5e7 576Avoid asking Perl to compile a regular expression every time
7678cced 577you want to match it. In this example, perl must recompile
578the regular expression for every iteration of the foreach()
579loop since it has no way to know what $pattern will be.
580
ac9dac7f 581 @patterns = qw( foo bar baz );
6670e5e7 582
ac9dac7f 583 LINE: while( <DATA> )
584 {
6670e5e7 585 foreach $pattern ( @patterns )
7678cced 586 {
ac9dac7f 587 if( /\b$pattern\b/i )
588 {
589 print;
590 next LINE;
591 }
592 }
7678cced 593 }
68dc0745 594
7678cced 595The qr// operator showed up in perl 5.005. It compiles a
596regular expression, but doesn't apply it. When you use the
597pre-compiled version of the regex, perl does less work. In
598this example, I inserted a map() to turn each pattern into
599its pre-compiled form. The rest of the script is the same,
600but faster.
601
    @patterns = map { qr/\b$_\b/i } qw( foo bar baz );

    LINE: while( <> )
        {
        foreach $pattern ( @patterns )
            {
            if( /$pattern/ )
                {
                print;
                next LINE;
                }
            }
        }
6670e5e7 612
7678cced 613In some cases, you may be able to make several patterns into
614a single regular expression. Beware of situations that require
615backtracking though.
65acb1b1 616
7678cced 617 $regex = join '|', qw( foo bar baz );
618
ac9dac7f 619 LINE: while( <> )
620 {
7678cced 621 print if /\b(?:$regex)\b/i;
622 }
623
For more details on regular expression efficiency, see Mastering
Regular Expressions by Jeffrey Friedl. He explains how regular
expression engines work and why some patterns are surprisingly
inefficient. Once you understand how perl applies regular
expressions, you can tune them for individual situations.
68dc0745 629
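If the words you join might contain regex metacharacters, it is safer to escape each one before building the alternation. The following is a sketch under that assumption; the word list and sample lines are invented.

```perl
use strict;
use warnings;

# Hypothetical word list; quotemeta keeps the dot in "bar.baz" literal.
my @words = ("foo", "bar.baz");

my $alternation = join '|', map { quotemeta($_) } @words;
my $regex       = qr/\b(?:$alternation)\b/i;   # compiled once, reused below

my @lines = ("some FOO here", "barxbaz", "a bar.baz b", "food");
my @hits  = grep { /$regex/ } @lines;

print "$_\n" for @hits;
```

Without the escaping, C<bar.baz> would also match C<barxbaz>.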
630=head2 Why don't word-boundary searches with C<\b> work for me?
d74e8afc 631X<\b>
68dc0745 632
7678cced 633(contributed by brian d foy)
634
635Ensure that you know what \b really does: it's the boundary between a
636word character, \w, and something that isn't a word character. That
637thing that isn't a word character might be \W, but it can also be the
638start or end of the string.
639
640It's not (not!) the boundary between whitespace and non-whitespace,
641and it's not the stuff between words we use to create sentences.
642
643In regex speak, a word boundary (\b) is a "zero width assertion",
644meaning that it doesn't represent a character in the string, but a
645condition at a certain position.
646
For the regular expression, /\bPerl\b/, there has to be a word
boundary before the "P" and after the "l". As long as something other
than a word character precedes the "P" and follows the "l", the
pattern will match. These strings match /\bPerl\b/.
651
652 "Perl" # no word char before P or after l
653 "Perl " # same as previous (space is not a word char)
654 "'Perl'" # the ' char is not a word char
655 "Perl's" # no word char before P, non-word char after "l"
656
657These strings do not match /\bPerl\b/.
658
659 "Perl_" # _ is a word char!
660 "Perler" # no word char before P, but one after l
6670e5e7 661
7678cced 662You don't have to use \b to match words though. You can look for
d7f8936a 663non-word characters surrounded by word characters. These strings
7678cced 664match the pattern /\b'\b/.
665
666 "don't" # the ' char is surrounded by "n" and "t"
667 "qep'a'" # the ' char is surrounded by "p" and "a"
6670e5e7 668
7678cced 669These strings do not match /\b'\b/.
68dc0745 670
7678cced 671 "foo'" # there is no word char after non-word '
6670e5e7 672
7678cced 673You can also use the complement of \b, \B, to specify that there
674should not be a word boundary.
68dc0745 675
In the pattern /\Bam\B/, there must be a word character before the "a"
and after the "m". These strings match /\Bam\B/:
68dc0745 678
7678cced 679 "llama" # "am" surrounded by word chars
680 "Samuel" # same
6670e5e7 681
7678cced 682These strings do not match /\Bam\B/
68dc0745 683
7678cced 684 "Sam" # no word boundary before "a", but one after "m"
685 "I am Sam" # "am" surrounded by non-word chars
68dc0745 686
68dc0745 687
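You can check the C<\b> examples above for yourself with a few lines of code. This sketch simply runs /\bPerl\b/ against the same strings and reports the outcome.

```perl
use strict;
use warnings;

my @strings = ("Perl", "Perl ", "'Perl'", "Perl's", "Perl_", "Perler");

# Record whether each string matches the word-boundary pattern.
my %matched;
$matched{$_} = /\bPerl\b/ ? 1 : 0 for @strings;

printf "%-10s %s\n", "[$_]", $matched{$_} ? "matches" : "does not match"
    for @strings;
```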
688=head2 Why does using $&, $`, or $' slow my program down?
d74e8afc 689X<$MATCH> X<$&> X<$POSTMATCH> X<$'> X<$PREMATCH> X<$`>
68dc0745 690
571e049f 691(contributed by Anno Siegel)
68dc0745 692
571e049f 693Once Perl sees that you need one of these variables anywhere in the
b68463f7 694program, it provides them on each and every pattern match. That means
695that on every pattern match the entire string will be copied, part of it
696to $`, part to $&, and part to $'. Thus the penalty is most severe with
697long strings and patterns that match often. Avoid $&, $', and $` if you
698can, but if you can't, once you've used them at all, use them at will
699because you've already paid the price. Remember that some algorithms
700really appreciate them. As of the 5.005 release, the $& variable is no
701longer "expensive" the way the other two are.
702
Since Perl 5.6.1 the special variables @- and @+ can functionally replace
$`, $& and $'. These arrays contain the offsets of the beginning and end
of each match (see L<perlvar> for the full story), so they give you
essentially the same information, but without the risk of excessive
string copying.
6670e5e7 708
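As a sketch of how the offsets in @- and @+ can stand in for the three variables, the following rebuilds the prematch, match, and postmatch pieces with C<substr>. The sample string and variable names are made up.

```perl
use strict;
use warnings;

my $string = "I am very cold";   # hypothetical sample

my ($pre, $match, $post);
if ( $string =~ /very/ ) {
    # $-[0] is the offset where the whole match starts and $+[0] is
    # the offset just past its end.
    $pre   = substr $string, 0, $-[0];
    $match = substr $string, $-[0], $+[0] - $-[0];
    $post  = substr $string, $+[0];
}

print "[$pre][$match][$post]\n";
```

Only the pieces you actually ask for get copied, and only for this one match.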
68dc0745 709=head2 What good is C<\G> in a regular expression?
d74e8afc 710X<\G>
68dc0745 711
49d635f9 712You use the C<\G> anchor to start the next match on the same
713string where the last match left off. The regular
714expression engine cannot skip over any characters to find
715the next match with this anchor, so C<\G> is similar to the
716beginning of string anchor, C<^>. The C<\G> anchor is typically
ee891a00 717used with the C<g> flag. It uses the value of C<pos()>
49d635f9 718as the position to start the next match. As the match
ee891a00 719operator makes successive matches, it updates C<pos()> with the
49d635f9 720position of the next character past the last match (or the
721first character of the next match, depending on how you like
ee891a00 722to look at it). Each string has its own C<pos()> value.
49d635f9 723
Suppose you want to match all consecutive pairs of digits
in a string like "1122a44" and stop matching when you
encounter non-digits. You want to match C<11> and C<22> but
the letter C<a> shows up between C<22> and C<44> and you want
to stop at C<a>. Simply matching pairs of digits skips over
the C<a> and still matches C<44>.
730
731 $_ = "1122a44";
732 my @pairs = m/(\d\d)/g; # qw( 11 22 44 )
733
ee891a00 734If you use the C<\G> anchor, you force the match after C<22> to
49d635f9 735start with the C<a>. The regular expression cannot match
736there since it does not find a digit, so the next match
737fails and the match operator returns the pairs it already
738found.
739
740 $_ = "1122a44";
741 my @pairs = m/\G(\d\d)/g; # qw( 11 22 )
742
743You can also use the C<\G> anchor in scalar context. You
744still need the C<g> flag.
745
746 $_ = "1122a44";
747 while( m/\G(\d\d)/g )
748 {
749 print "Found $1\n";
750 }
197aec24 751
ee891a00 752After the match fails at the letter C<a>, perl resets C<pos()>
49d635f9 753and the next match on the same string starts at the beginning.
754
755 $_ = "1122a44";
756 while( m/\G(\d\d)/g )
757 {
758 print "Found $1\n";
759 }
760
761 print "Found $1 after while" if m/(\d\d)/g; # finds "11"
762
ee891a00 763You can disable C<pos()> resets on fail with the C<c> flag, documented
764in L<perlop> and L<perlreref>. Subsequent matches start where the last
765successful match ended (the value of C<pos()>) even if a match on the
766same string has failed in the meantime. In this case, the match after
767the C<while()> loop starts at the C<a> (where the last match stopped),
768and since it does not use any anchor it can skip over the C<a> to find
769C<44>.
49d635f9 770
771 $_ = "1122a44";
772 while( m/\G(\d\d)/gc )
773 {
774 print "Found $1\n";
775 }
776
777 print "Found $1 after while" if m/(\d\d)/g; # finds "44"
778
779Typically you use the C<\G> anchor with the C<c> flag
780when you want to try a different match if one fails,
781such as in a tokenizer. Jeffrey Friedl offers this example
782which works in 5.004 or later.
68dc0745 783
ac9dac7f 784 while (<>) {
785 chomp;
786 PARSER: {
787 m/ \G( \d+\b )/gcx && do { print "number: $1\n"; redo; };
788 m/ \G( \w+ )/gcx && do { print "word: $1\n"; redo; };
789 m/ \G( \s+ )/gcx && do { print "space: $1\n"; redo; };
790 m/ \G( [^\w\d]+ )/gcx && do { print "other: $1\n"; redo; };
791 }
792 }
68dc0745 793
ee891a00 794For each line, the C<PARSER> loop first tries to match a series
49d635f9 795of digits followed by a word boundary. This match has to
796start at the place the last match left off (or the beginning
197aec24 797of the string on the first match). Since C<m/ \G( \d+\b
49d635f9 798)/gcx> uses the C<c> flag, if the string does not match that
799regular expression, perl does not reset pos() and the next
800match starts at the same position to try a different
801pattern.
68dc0745 802
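The same technique can be wrapped in a subroutine that returns the tokens instead of printing them. This is a sketch, not part of the original example: the sub name C<tokenize> and the token layout are assumptions, while the four patterns are the ones from the parser above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [ type, text ] pairs.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    for ($line) {    # alias $_ to $line so pos() tracks one string
        while (1) {
            if    ( /\G(\d+\b)/gc    ) { push @tokens, [ number => $1 ] }
            elsif ( /\G(\w+)/gc      ) { push @tokens, [ word   => $1 ] }
            elsif ( /\G(\s+)/gc      ) { push @tokens, [ space  => $1 ] }
            elsif ( /\G([^\w\d]+)/gc ) { push @tokens, [ other  => $1 ] }
            else                       { last }   # nothing left to match
        }
    }
    return @tokens;
}

my @tokens = tokenize("abc 42, x9");
print join(" ", map { "$_->[0]:[$_->[1]]" } @tokens), "\n";
```

Because every pattern uses the C<c> flag, a failed attempt leaves C<pos()> where it was and the next C<elsif> gets its turn at the same position.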
d92eb7b0 803=head2 Are Perl regexes DFAs or NFAs? Are they POSIX compliant?
d74e8afc 804X<DFA> X<NFA> X<POSIX>
68dc0745 805
806While it's true that Perl's regular expressions resemble the DFAs
807(deterministic finite automata) of the egrep(1) program, they are in
46fc3d4c 808fact implemented as NFAs (non-deterministic finite automata) to allow
68dc0745 809backtracking and backreferencing. And they aren't POSIX-style either,
810because those guarantee worst-case behavior for all cases. (It seems
811that some people prefer guarantees of consistency, even when what's
812guaranteed is slowness.) See the book "Mastering Regular Expressions"
813(from O'Reilly) by Jeffrey Friedl for all the details you could ever
814hope to know on these matters (a full citation appears in
815L<perlfaq2>).
816
788611b6 817=head2 What's wrong with using grep in a void context?
d74e8afc 818X<grep>

The problem is that grep builds a return list, regardless of the context.
This means you're making Perl go to the trouble of building a list that
you then just throw away. If the list is large, you waste both time and space.
If your intent is to iterate over the list, then use a for loop for this
purpose.

In perls older than 5.8.1, map suffers from this problem as well.
But since 5.8.1, this has been fixed, and map is context aware - in void
context, no lists are constructed.
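
As a rough sketch of the difference (the file names are made-up
sample data), compare a grep used only for its side effect with a
plain loop, and with a grep whose result is actually kept:

```perl
use strict;
use warnings;

my @files = qw(notes.txt build.log todo.txt);   # made-up sample data

# Discouraged: grep used only for its side effect, result discarded.
#   grep { print "$_\n" } @files;

# Preferred: say what you mean with a loop.
my @visited;
for my $file (@files) {
    push @visited, $file;
}

# grep is the right tool when you keep what it returns.
my @text_files = grep { /\.txt\z/ } @files;
print "text files: @text_files\n";
```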

=head2 How can I match strings with multibyte characters?
X<regex, and multibyte characters> X<regexp, and multibyte characters>
X<regular expression, and multibyte characters> X<martian> X<encoding, Martian>

Starting with Perl 5.6, Perl has had some level of multibyte character
support. Perl 5.8 or later is recommended. Supported multibyte
character repertoires include Unicode, and legacy encodings
through the Encode module. See L<perluniintro>, L<perlunicode>,
and L<Encode>.
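
For instance, with Perl 5.8 or later you can decode the bytes into
characters first and then match against characters. This is a small
sketch that assumes the input is UTF-8; the sample string and the
variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for "naive cafe" with i-diaeresis and e-acute.
my $bytes    = "na\xC3\xAFve caf\xC3\xA9";
my $byte_len = length $bytes;

# Decode once, then work with characters.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;

# On raw bytes, "." sees only half of the two-byte character.
my $bytes_match = $bytes =~ /na.ve/ ? 1 : 0;

# On decoded text, "." matches the whole character.
my ($between) = $chars =~ /na(.)ve/;

printf "bytes: %d, characters: %d\n", $byte_len, $char_len;
printf "byte match: %d, character code: U+%04X\n", $bytes_match, ord $between;
```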

If you are stuck with older Perls, you can do Unicode with the
C<Unicode::String> module, and character conversions using the
C<Unicode::Map8> and C<Unicode::Map> modules. If you are using
Japanese encodings, you might try jperl 5.005_03.

Finally, the following set of approaches was offered by Jeffrey
Friedl, whose article in issue #5 of The Perl Journal talks about
this very matter.

Let's suppose you have some weird Martian encoding where pairs of
ASCII uppercase letters encode single Martian letters (e.g. the two
bytes "CV" make a single Martian letter, as do the two bytes "SG",
"VS", "XX", etc.). Other bytes represent single characters, just like
ASCII.

So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

Now, say you want to search for the single character C</GX/>. Perl
doesn't know about Martian, so it'll find the two bytes "GX" in the "I
am CVSGXX!" string, even though that character isn't there: it just
looks like it is because "SG" is next to "XX", but there's no real
"GX". This is a big problem.

Here are a few ways, all painful, to deal with it:

    # Make sure adjacent "martian" bytes are no longer adjacent.
    $martian =~ s/([A-Z][A-Z])/ $1 /g;

    print "found GX!\n" if $martian =~ /GX/;

Or like this:

    @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
    # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
    #
    foreach $char (@chars) {
        # print and last are separate statements: in "print ..., last"
        # the last would leave the loop before anything was printed
        if ($char eq 'GX') {
            print "found GX!\n";
            last;
        }
    }

Or like this:

    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
        if ($1 eq 'GX') {
            print "found GX!\n";
            last;
        }
    }

Here's another, slightly less painful, way to do it from Benjamin
Goldberg, who uses a zero-width negative look-behind assertion.

    print "found GX!\n" if $martian =~ m/
        (?<![A-Z])
        (?:[A-Z][A-Z])*?
        GX
        /x;

This succeeds if the "martian" character GX is in the string, and fails
otherwise. If you don't like using (?<!), a zero-width negative
look-behind assertion, you can replace (?<![A-Z]) with (?:^|[^A-Z]).

It does have the drawback of putting the wrong thing in $-[0] and $+[0],
but this usually can be worked around.
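
One possible workaround, sketched here as an assumption and not as
part of Goldberg's original answer, is to capture the character and
read $-[1] and $+[1], which describe the capture and not the whole
match. The helper name C<martian_gx_offset> is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: byte offset of Martian character GX, or -1.
sub martian_gx_offset {
    my ($martian) = @_;
    return -1 unless $martian =~ m/
        (?<![A-Z])          # start where no Martian pair is in progress
        (?:[A-Z][A-Z])*?    # skip whole Martian characters
        (GX)                # capture the character itself
    /x;
    return $-[1];           # offset of the capture, not of the whole match
}

my $absent  = martian_gx_offset("I am CVSGXX!");
my $present = martian_gx_offset("I am CVGXSG!");
print "absent: $absent, present: $present\n";
```

In the second string the whole match starts at the "CV" pair, so
$-[0] would point there, while $-[1] points at the GX itself.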

=head2 How do I match a regular expression that's in a variable?
X<regex, in variable> X<eval> X<regex> X<quotemeta> X<\Q, regex>
X<\E, regex> X<qr//>

(contributed by brian d foy)

We don't have to hard-code patterns into the match operator (or
anything else that works with regular expressions). We can put the
pattern in a variable for later use.

The match operator is a double quote context, so you can interpolate
your variable just like a double quoted string. In this case, you
read the regular expression as user input and store it in C<$regex>.
Once you have the pattern in C<$regex>, you use that variable in the
match operator.

    chomp( my $regex = <STDIN> );

    if( $string =~ m/$regex/ ) { ... }

Any regular expression special characters in C<$regex> are still
special, and the pattern still has to be valid or Perl will complain.
For instance, in this pattern there is an unpaired parenthesis.

    my $regex = "Unmatched ( paren";

    "Two parens to bind them all" =~ m/$regex/;

When Perl compiles the regular expression, it treats the parenthesis
as the start of a memory match. When it doesn't find the closing
parenthesis, it complains:

    Unmatched ( in regex; marked by <-- HERE in m/Unmatched ( <-- HERE  paren/ at script line 3.

You can get around this in several ways depending on your situation.
First, if you don't want any of the characters in the string to be
special, you can escape them with C<quotemeta> before you use the string.

    chomp( my $regex = <STDIN> );
    $regex = quotemeta( $regex );

    if( $string =~ m/$regex/ ) { ... }

You can also do this directly in the match operator using the C<\Q>
and C<\E> sequences. The C<\Q> tells Perl where to start escaping
special characters, and the C<\E> tells it where to stop (see L<perlop>
for more details).

    chomp( my $regex = <STDIN> );

    if( $string =~ m/\Q$regex\E/ ) { ... }
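
To see what the escaping buys you, here is a small sketch with a
fixed string standing in for the user input; the sample text and the
variable names are made up for the example.

```perl
use strict;
use warnings;

my $input = '1+1=2 (really)';       # sample text standing in for user input

# Unescaped, "+" is a quantifier and "(...)" is a group, so the
# pattern matches a different string and not the text itself.
my $raw_matches_other = '11=2 really' =~ m/$input/ ? 1 : 0;
my $raw_matches_self  = $input        =~ m/$input/ ? 1 : 0;

# Escaped, every character stands for itself.
my $quoted_matches_self  = $input        =~ m/\Q$input\E/ ? 1 : 0;
my $quoted_matches_other = '11=2 really' =~ m/\Q$input\E/ ? 1 : 0;

print "raw:    other=$raw_matches_other self=$raw_matches_self\n";
print "quoted: other=$quoted_matches_other self=$quoted_matches_self\n";
```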

Alternately, you can use C<qr//>, the regular expression quote operator (see
L<perlop> for more details). It quotes and perhaps compiles the pattern,
and you can apply regular expression flags to the pattern.

    chomp( my $input = <STDIN> );

    my $regex = qr/$input/is;

    $string =~ m/$regex/;  # same as m/$input/is

You might also want to trap any errors by wrapping an C<eval> block
around the whole thing.

    chomp( my $input = <STDIN> );

    eval {
        # no \Q here: an invalid pattern in $input dies inside the eval
        if( $string =~ m/$input/ ) { ... }
    };
    warn $@ if $@;

Or...

    my $regex = eval { qr/$input/is };
    if( defined $regex ) {
        $string =~ m/$regex/;
    }
    else {
        warn $@;
    }

=head1 REVISION

Revision: $Revision: 10126 $

Date: $Date: 2007-10-27 21:29:20 +0200 (Sat, 27 Oct 2007) $

See L<perlfaq> for source control details and availability.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 1997-2007 Tom Christiansen, Nathan Torkington, and
other authors as noted. All rights reserved.

This documentation is free; you can redistribute it and/or modify it
under the same terms as Perl itself.

Irrespective of its distribution, all code examples in this file
are hereby placed into the public domain. You are permitted and
encouraged to use this code in your own programs for fun
or for profit as you see fit. A simple comment in the code giving
credit would be courteous but is not required.