X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlfaq6.pod;h=535e4644551f3e56872bd29f01a089a619b3b25d;hb=51cf62d8ec31d46fecbc8564c5b48c17f5776f7f;hp=0adebd72feb0ad9ee8b494ea4697ac7bc7c72aa7;hpb=3fe9a6f19eb206c685bd7389e54e2838fdfd04b7;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlfaq6.pod b/pod/perlfaq6.pod
index 0adebd7..535e464 100644
--- a/pod/perlfaq6.pod
+++ b/pod/perlfaq6.pod
@@ -1,6 +1,6 @@
 =head1 NAME

-perlfaq6 - Regexps ($Revision: 1.16 $, $Date: 1997/03/25 18:16:56 $)
+perlfaq6 - Regexps ($Revision: 1.17 $, $Date: 1997/04/24 22:44:10 $)

 =head1 DESCRIPTION

@@ -11,7 +11,7 @@ with regular expressions, but those answers are found elsewhere
 in this document (in the section on Data and the Networking one on
 networking, to be precise).

-=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
+=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?

 Three techniques can make regular expressions maintainable and
 understandable.
@@ -96,8 +96,8 @@ record read in.
     while ( <> ) {
         while ( /\b(\w\S+)(\s+\1)+\b/gi ) {
             print "Duplicate $1 at paragraph $.\n";
-        }
-    }
+        }
+    }

 Here's code that finds sentences that begin with "From " (which would
 be mangled by many mailers):
@@ -138,7 +138,8 @@ on matching balanced text.
 $/ must be a string, not a regular expression. Awk has to be better
 for something. :-)

-Actually, you could do this if you don't mind reading the whole file into
+Actually, you could do this if you don't mind reading the whole file
+into
 memory:

     undef $/;
     @records = split /your_pattern/, ;
@@ -217,10 +218,10 @@ See L<perllocale>.
 =head2 How can I match a locale-smart version of C</[a-zA-Z]/>?

 One alphabetic character would be C</[^\W\d_]/>, no matter what locale
-you're in. Non-alphabetics would be C</[\W\d_]/> (assuming you don't
+you're in. Non-alphabetics would be C</[\W\d_]/> (assuming you don't
 consider an underscore a letter).

-=head2 How can I quote a variable to use in a regexp?
+=head2 How can I quote a variable to use in a regexp?

 The Perl parser will expand $variable and @variable references in
 regular expressions unless the delimiter is a single quote. Remember,
@@ -328,10 +329,10 @@ Use the split function:
         foreach $word ( split ) {
             # do something with $word here
         }
-    }
+    }

-Note that this isn't really a word in the English sense; it's just
-chunks of consecutive non-whitespace characters.
+Note that this isn't really a word in the English sense; it's just
+chunks of consecutive non-whitespace characters.

 To work with only alphanumeric sequences, you might consider

@@ -344,25 +345,25 @@ To work with only alphanumeric sequences, you might consider
 =head2 How can I print out a word-frequency or line-frequency summary?

 To do this, you have to parse out each word in the input stream. We'll
-pretend that by word you mean chunk of alphabetics, hyphens, or
-apostrophes, rather than the non-whitespace chunk idea of a word given
+pretend that by word you mean chunk of alphabetics, hyphens, or
+apostrophes, rather than the non-whitespace chunk idea of a word given
 in the previous question:

     while (<>) {
         while ( /(\b[^\W_\d][\w'-]+\b)/g ) { # misses "`sheep'"
             $seen{$1}++;
-        }
-    }
+        }
+    }
     while ( ($word, $count) = each %seen ) {
         print "$count $word\n";
-    }
+    }

 If you wanted to do the same thing for lines, you wouldn't need a
 regular expression:

     while (<>) {
         $seen{$_}++;
-    }
+    }
     while ( ($line, $count) = each %seen ) {
         print "$count $line";
     }
@@ -478,15 +479,17 @@ Or, using C<\G>, the much simpler (and faster):

 A more sophisticated use might involve a tokenizer. The following
 lex-like example is courtesy of Jeffrey Friedl. It did not work in
-5.003 due to bugs in that release, but does work in 5.004 or better:
+5.003 due to bugs in that release, but does work in 5.004 or better.
+(Note the use of C</c>, which prevents a failed match with C</g> from
+resetting the search position back to the beginning of the string.)

     while (<>) {
       chomp;
       PARSER: {
-           m/ \G( \d+\b )/gx && do { print "number: $1\n"; redo; };
-           m/ \G( \w+ )/gx && do { print "word: $1\n"; redo; };
-           m/ \G( \s+ )/gx && do { print "space: $1\n"; redo; };
-           m/ \G( [^\w\d]+ )/gx && do { print "other: $1\n"; redo; };
+           m/ \G( \d+\b )/gcx && do { print "number: $1\n"; redo; };
+           m/ \G( \w+ )/gcx && do { print "word: $1\n"; redo; };
+           m/ \G( \s+ )/gcx && do { print "space: $1\n"; redo; };
+           m/ \G( [^\w\d]+ )/gcx && do { print "other: $1\n"; redo; };
       }
     }

@@ -495,19 +498,19 @@ Of course, that could have been written as
     while (<>) {
       chomp;
       PARSER: {
-           if ( /\G( \d+\b )/gx {
+           if ( /\G( \d+\b )/gcx {
                 print "number: $1\n";
                 redo PARSER;
            }
-           if ( /\G( \w+ )/gx {
+           if ( /\G( \w+ )/gcx {
                 print "word: $1\n";
                 redo PARSER;
            }
-           if ( /\G( \s+ )/gx {
+           if ( /\G( \s+ )/gcx {
                 print "space: $1\n";
                 redo PARSER;
            }
-           if ( /\G( [^\w\d]+ )/gx {
+           if ( /\G( [^\w\d]+ )/gcx {
                 print "other: $1\n";
                 redo PARSER;
            }
@@ -538,7 +541,7 @@ side-effects, and side-effects can be mystifying. There's no void
 grep() that's not better written as a C<for> (well, C<foreach>,
 technically) loop.

-=head2 How can I match strings with multi-byte characters?
+=head2 How can I match strings with multibyte characters?

 This is hard, and there's no good way. Perl does not directly
 support wide characters. It pretends that a byte and a character are
@@ -546,19 +549,20 @@ synonymous. The following set of approaches was offered by Jeffrey
 Friedl, whose article in issue #5 of The Perl Journal talks about this
 very matter.

-Let's suppose you have some weird Martian encoding where pairs of ASCII
-uppercase letters encode single Martian letters (i.e. the two bytes
-"CV" make a single Martian letter, as do the two bytes "SG", "VS",
-"XX", etc.). Other bytes represent single characters, just like ASCII.
+Let's suppose you have some weird Martian encoding where pairs of
+ASCII uppercase letters encode single Martian letters (i.e. the two
+bytes "CV" make a single Martian letter, as do the two bytes "SG",
+"VS", "XX", etc.). Other bytes represent single characters, just like
+ASCII.

-So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the nine
-characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
+So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
+nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

 Now, say you want to search for the single character C</GX/>. Perl
-doesn't know about Martian, so it'll find the two bytes "GX" in the
-"I am CVSGXX!" string, even though that character isn't there: it just
-looks like it is because "SG" is next to "XX", but there's no real "GX".
-This is a big problem.
+doesn't know about Martian, so it'll find the two bytes "GX" in the "I
+am CVSGXX!" string, even though that character isn't there: it just
+looks like it is because "SG" is next to "XX", but there's no real
+"GX". This is a big problem.

 Here are a few ways, all painful, to deal with it:

@@ -578,7 +582,7 @@ Or like this:
 Or like this:

     while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) { # \G probably unneeded
-        print "found GX!\n", last if $1 eq 'GX';
+        print "found GX!\n", last if $1 eq 'GX';
     }

 Or like this:
@@ -586,7 +590,7 @@ Or like this:
     die "sorry, Perl doesn't (yet) have Martian support )-:\n";

 In addition, a sample program which converts half-width to full-width
-katakana (in Shift-JIS or EUC encoding) is available from CPAN as
+katakana (in Shift-JIS or EUC encoding) is available from CPAN as

 =for Tom make it so

@@ -598,3 +602,4 @@ all mixed.

 Copyright (c) 1997 Tom Christiansen and Nathan Torkington. All rights
 reserved. See L<perlfaq> for distribution information.
+