X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlfaq6.pod;h=d21a11157b11c80dea2ae6361fda4836226ff05d;hb=3bf198a5e20d135d4136d3233d58cf49a70772d9;hp=589d89e495b3dc5d156558f12ec449508ed89111;hpb=68dc074516a6859e3424b48d1647bcb08b1a1a7d;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlfaq6.pod b/pod/perlfaq6.pod index 589d89e..d21a111 100644 --- a/pod/perlfaq6.pod +++ b/pod/perlfaq6.pod @@ -1,6 +1,6 @@ =head1 NAME -perlfaq6 - Regexps ($Revision: 1.14 $) +perlfaq6 - Regexps ($Revision: 1.17 $, $Date: 1997/04/24 22:44:10 $) =head1 DESCRIPTION @@ -11,7 +11,7 @@ with regular expressions, but those answers are found elsewhere in this document (in the section on Data and the Networking one on networking, to be precise). -=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code? +=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code? Three techniques can make regular expressions maintainable and understandable. @@ -96,8 +96,8 @@ record read in. while ( <> ) { while ( /\b(\w\S+)(\s+\1)+\b/gi ) { print "Duplicate $1 at paragraph $.\n"; - } - } + } + } Here's code that finds sentences that begin with "From " (which would be mangled by many mailers): @@ -138,11 +138,32 @@ on matching balanced text. $/ must be a string, not a regular expression. Awk has to be better for something. :-) -Actually, you could do this if you don't mind reading the whole file into +Actually, you could do this if you don't mind reading the whole file +into memory: undef $/; @records = split /your_pattern/, ; +The Net::Telnet module (available from CPAN) has the capability to +wait for a pattern in the input stream, or timeout if it doesn't +appear within a certain time. + + ## Create a file with three lines. + open FH, ">file"; + print FH "The first line\nThe second line\nThe third line\n"; + close FH; + + ## Get a read/write filehandle to it. + $fh = new FileHandle "+ $fh); + + ## Search for the second line and print out the third. + $file->waitfor('/second line\n/'); + print $file->getline; + =head2 How do I substitute case insensitively on the LHS, but preserving case on the RHS? It depends on what you mean by "preserving case". The following @@ -197,10 +218,10 @@ See L. =head2 How can I match a locale-smart version of C? One alphabetic character would be C, no matter what locale -you're in. Non-alphabetics would be C (assuming you don't +you're in. Non-alphabetics would be C (assuming you don't consider an underscore a letter). -=head2 How can I quote a variable to use in a regexp? +=head2 How can I quote a variable to use in a regexp? The Perl parser will expand $variable and @variable references in regular expressions unless the delimiter is a single quote. Remember, @@ -308,10 +329,10 @@ Use the split function: foreach $word ( split ) { # do something with $word here } - } + } -Note that this isn't really a word in the English sense; it's just -chunks of consecutive non-whitespace characters. +Note that this isn't really a word in the English sense; it's just +chunks of consecutive non-whitespace characters. To work with only alphanumeric sequences, you might consider @@ -324,25 +345,25 @@ To work with only alphanumeric sequences, you might consider =head2 How can I print out a word-frequency or line-frequency summary? To do this, you have to parse out each word in the input stream. We'll -pretend that by word you mean chunk of alphabetics, hyphens, or -apostrophes, rather than the non-whitespace chunk idea of a word given +pretend that by word you mean chunk of alphabetics, hyphens, or +apostrophes, rather than the non-whitespace chunk idea of a word given in the previous question: while (<>) { while ( /(\b[^\W_\d][\w'-]+\b)/g ) { # misses "`sheep'" $seen{$1}++; - } - } + } + } while ( ($word, $count) = each %seen ) { print "$count $word\n"; - } + } If you wanted to do the same thing for lines, you wouldn't need a regular expression: while (<>) { $seen{$_}++; - } + } while ( ($line, $count) = each %seen ) { print "$count $line"; } @@ -475,7 +496,7 @@ Of course, that could have been written as while (<>) { chomp; PARSER: { - if ( /\G( \d+\b )/gx { + if ( /\G( \d+\b )/gx { print "number: $1\n"; redo PARSER; } @@ -518,7 +539,7 @@ side-effects, and side-effects can be mystifying. There's no void grep() that's not better written as a C (well, C, technically) loop. -=head2 How can I match strings with multi-byte characters? +=head2 How can I match strings with multibyte characters? This is hard, and there's no good way. Perl does not directly support wide characters. It pretends that a byte and a character are @@ -526,23 +547,24 @@ synonymous. The following set of approaches was offered by Jeffrey Friedl, whose article in issue #5 of The Perl Journal talks about this very matter. -Let's suppose you have some weird Martian encoding where pairs of ASCII -uppercase letters encode single Martian letters (i.e. the two bytes -"CV" make a single Martian letter, as do the two bytes "SG", "VS", -"XX", etc.). Other bytes represent single characters, just like ASCII. +Let's suppose you have some weird Martian encoding where pairs of +ASCII uppercase letters encode single Martian letters (i.e. the two +bytes "CV" make a single Martian letter, as do the two bytes "SG", +"VS", "XX", etc.). Other bytes represent single characters, just like +ASCII. -So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the nine -characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'. +So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the +nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'. Now, say you want to search for the single character C. Perl -doesn't know about Martian, so it'll find the two bytes "GX" in the -"I am CVSGXX!" string, even though that character isn't there: it just -looks like it is because "SG" is next to "XX", but there's no real "GX". -This is a big problem. +doesn't know about Martian, so it'll find the two bytes "GX" in the "I +am CVSGXX!" string, even though that character isn't there: it just +looks like it is because "SG" is next to "XX", but there's no real +"GX". This is a big problem. Here are a few ways, all painful, to deal with it: - $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``maritan'' bytes + $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian'' bytes # are no longer adjacent. print "found GX!\n" if $martian =~ /GX/; @@ -558,7 +580,7 @@ Or like this: Or like this: while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) { # \G probably unneeded - print "found GX!\n", last if $1 eq 'GX'; + print "found GX!\n", last if $1 eq 'GX'; } Or like this: @@ -566,7 +588,7 @@ Or like this: die "sorry, Perl doesn't (yet) have Martian support )-:\n"; In addition, a sample program which converts half-width to full-width -katakana (in Shift-JIS or EUC encoding) is available from CPAN as +katakana (in Shift-JIS or EUC encoding) is available from CPAN as =for Tom make it so @@ -578,3 +600,4 @@ all mixed. Copyright (c) 1997 Tom Christiansen and Nathan Torkington. All rights reserved. See L for distribution information. +