X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlfaq6.pod;h=6b0f3bb9a49abab58bd8a00847ab650addf3210c;hb=8a2485f87de4ac33d6c8564ae6b27c5efc3e1430;hp=cf3a8fb7ca737822bae4e1212f58509a557623e1;hpb=49d635f9372392ae44fe4c5b62b06e41912ae0c9;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlfaq6.pod b/pod/perlfaq6.pod index cf3a8fb..6b0f3bb 100644 --- a/pod/perlfaq6.pod +++ b/pod/perlfaq6.pod @@ -1,6 +1,6 @@ =head1 NAME -perlfaq6 - Regular Expressions ($Revision: 1.18 $, $Date: 2002/10/30 18:44:21 $) +perlfaq6 - Regular Expressions ($Revision: 1.27 $, $Date: 2004/11/03 22:52:16 $) =head1 DESCRIPTION @@ -8,7 +8,7 @@ This section is surprisingly small because the rest of the FAQ is littered with answers involving regular expressions. For example, decoding a URL and checking whether something is a number are handled with regular expressions, but those answers are found elsewhere in -this document (in L: ``How do I decode or create those %-encodings +this document (in L: ``How do I decode or create those %-encodings on the web'' and L: ``How do I determine whether a scalar is a number/whole/integer/float'', to be precise). @@ -143,16 +143,28 @@ Here's another example of using C<..>: # now choose between them } continue { reset if eof(); # fix $. - } + } =head2 I put a regular expression into $/ but it didn't work. What's wrong? -As of Perl 5.8.0, $/ has to be a string. This may change in 5.10, +Up to Perl 5.8.0, $/ has to be a string. This may change in 5.10, but don't get your hopes up. Until then, you can use these examples if you really need to do this. -Use the four argument form of sysread to continually add to -a buffer. After you add to the buffer, you check if you have a +If you have File::Stream, this is easy. + + use File::Stream; + my $stream = File::Stream->new( + $filehandle, + separator => qr/\s*,\s*/, + ); + + print "$_\n" while <$stream>; + +If you don't have File::Stream, you have to do a little more work. 
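When the input is small enough to hold in memory, one low-effort alternative is to slurp the file and split it on the pattern. The sketch below is only an illustration under that assumption: the helper name `read_records` and the sample data are hypothetical and do not come from the FAQ. Unlike File::Stream, it reads everything before returning the first record.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the FAQ): return the records found on
# a filehandle, using a regex as the record separator.
# Assumption: the whole input fits in memory.
sub read_records {
    my ($fh, $separator) = @_;
    local $/;                        # undef $/ puts readline in slurp mode
    my $data = <$fh>;
    return () unless defined $data;  # nothing could be read
    return split $separator, $data;
}

# Demonstration with an in-memory filehandle (needs Perl 5.8 or later).
my $input = "alpha , beta,gamma ,  delta";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
my @records = read_records($fh, qr/\s*,\s*/);
close $fh;

print "$_\n" for @records;
```

Slurping trades memory for simplicity, so for large files or streams an incremental approach remains the better fit.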
+ +You can use the four argument form of sysread to continually add to +a buffer. After you add to the buffer, you check if you have a complete line (using your regular expression). local $_ = ""; @@ -162,11 +174,11 @@ complete line (using your regular expression). # do stuff here. } } - + You can do the same thing with foreach and a match using the c flag and the \G anchor, if you do not mind your entire file being in memory at the end. - + local $_ = ""; while( sysread FH, $_, 8192, length ) { foreach my $record ( m/\G((?s).*?)your_pattern/gc ) { @@ -201,7 +213,7 @@ And here it is as a subroutine, modeled after the above: my $mask = uc $old ^ $old; uc $new | $mask . - substr($mask, -1) x (length($new) - length($old)) + substr($mask, -1) x (length($new) - length($old)) } $a = "this is a TEsT case"; @@ -280,8 +292,8 @@ documented in L. No matter which locale you are in, the alphabetic characters are the characters in \w without the digits and the underscore. As a regex, that looks like C. Its complement, -the non-alphabetics, is then everything in \W along with -the digits and the underscore, or C. +the non-alphabetics, is then everything in \W along with +the digits and the underscore, or C. =head2 How can I quote a variable to use in a regex? @@ -292,14 +304,26 @@ a double-quoted string (see L for more details). Remember also that any regex special characters will be acted on unless you precede the substitution with \Q. Here's an example: - $string = "to die?"; - $lhs = "die?"; - $rhs = "sleep, no more"; + $string = "Placido P. Octopus"; + $regex = "P."; + + $string =~ s/$regex/Polyp/; + # $string is now "Polypacido P. Octopus" + +Because C<.> is special in regular expressions, and can match any +single character, the regex C here has matched the in the +original string. - $string =~ s/\Q$lhs/$rhs/; - # $string is now "to sleep no more" +To escape the special meaning of C<.>, we use C<\Q>: -Without the \Q, the regex would also spuriously match "di". 
+ $string = "Placido P. Octopus"; + $regex = "P."; + + $string =~ s/\Q$regex/Polyp/; + # $string is now "Placido Polyp Octopus" + +The use of C<\Q> causes the C<.> in the regex to be treated as a +regular character, so that C<P.> matches a C<P>
followed by a dot. =head2 What is C really for? @@ -342,7 +366,7 @@ created by Jeffrey Friedl and later modified by Fred Curtis. $/ = undef; $_ = <>; - s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#$2#gs + s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse; print; This could, of course, be more legibly written with the C modifier, adding @@ -383,11 +407,11 @@ whitespace and comments. Here it is expanded, courtesy of Fred Curtis. . ## Anything other char [^/"'\\]* ## Chars which doesn't start a comment, string or escape ) - }{$2}gxs; + }{defined $2 ? $2 : ""}gxse; A slight modification also removes C++ comments: - s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//[^\n]*|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#$2#gs; + s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//[^\n]*|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse; =head2 Can I use Perl regular expressions to match balanced text? @@ -442,9 +466,9 @@ playing hot potato. Use the split function: while (<>) { - foreach $word ( split ) { + foreach $word ( split ) { # do something with $word here - } + } } Note that this isn't really a word in the English sense; it's just @@ -478,7 +502,7 @@ in the previous question: If you wanted to do the same thing for lines, you wouldn't need a regular expression: - while (<>) { + while (<>) { $seen{$_}++; } while ( ($line, $count) = each %seen ) { @@ -500,12 +524,12 @@ The following is extremely inefficient: @popstates = qw(CO ON MI WI MN); while (defined($line = <>)) { for $state (@popstates) { - if ($line =~ /\b$state\b/i) { + if ($line =~ /\b$state\b/i) { print $line; last; } } - } + } That's because Perl has to recompile all those patterns for each of the lines of the file. As of the 5.005 release, there's a much better @@ -602,7 +626,7 @@ still need the C flag. 
{ print "Found $1\n"; } - + After the match fails at the letter C, perl resets pos() and the next match on the same string starts at the beginning. @@ -648,7 +672,7 @@ which works in 5.004 or later. For each line, the PARSER loop first tries to match a series of digits followed by a word boundary. This match has to start at the place the last match left off (or the beginning -of the string on the first match). Since C uses the C flag, if the string does not match that regular expression, perl does not reset pos() and the next match starts at the same position to try a different @@ -667,15 +691,18 @@ guaranteed is slowness.) See the book "Mastering Regular Expressions" hope to know on these matters (a full citation appears in L). -=head2 What's wrong with using grep or map in a void context? +=head2 What's wrong with using grep in a void context? -The problem is that both grep and map build a return list, -regardless of the context. This means you're making Perl go -to the trouble of building a list that you then just throw away. -If the list is large, you waste both time and space. If your -intent is to iterate over the list then use a for loop for this +The problem is that grep builds a return list, regardless of the context. +This means you're making Perl go to the trouble of building a list that +you then just throw away. If the list is large, you waste both time and space. +If your intent is to iterate over the list, then use a for loop for this purpose. +In perls older than 5.8.1, map suffers from this problem as well. +But since 5.8.1, this has been fixed, and map is context aware - in void +context, no lists are constructed. + =head2 How can I match strings with multibyte characters? Starting from Perl 5.6 Perl has had some level of multibyte character @@ -730,17 +757,17 @@ Or like this: } Here's another, slightly less painful, way to do it from Benjamin -Goldberg: - - $martian =~ m/ - (?!<[A-Z]) - (?:[A-Z][A-Z])*? 
- GX - /x; - +Goldberg, who uses a zero-width negative look-behind assertion. + + print "found GX!\n" if $martian =~ m/ + (?