X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlfaq6.pod;h=053e28496de72f8667604c62d8ae63b877dedb2e;hb=2d259d9294e79c03b1a69d3eaac3d6e5647468d7;hp=0adebd72feb0ad9ee8b494ea4697ac7bc7c72aa7;hpb=3fe9a6f19eb206c685bd7389e54e2838fdfd04b7;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlfaq6.pod b/pod/perlfaq6.pod
index 0adebd7..053e284 100644
--- a/pod/perlfaq6.pod
+++ b/pod/perlfaq6.pod
@@ -1,6 +1,6 @@
 =head1 NAME

-perlfaq6 - Regexps ($Revision: 1.16 $, $Date: 1997/03/25 18:16:56 $)
+perlfaq6 - Regexps ($Revision: 1.21 $, $Date: 1998/06/22 04:23:04 $)

 =head1 DESCRIPTION

@@ -11,7 +11,7 @@
 with regular expressions, but those answers are found elsewhere in
 this document (in the section on Data and the Networking one on
 networking, to be precise).

-=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
+=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?

 Three techniques can make regular expressions maintainable and understandable.
@@ -25,7 +25,7 @@ comments.

     # turn the line into the first word, a colon, and the
     # number of characters on the rest of the line
-    s/^(\w+)(.*)/ lc($1) . ":" . length($2) /ge;
+    s/^(\w+)(.*)/ lc($1) . ":" . length($2) /meg;

 =item Comments Inside the Regexp

@@ -69,8 +69,9 @@ delimiter within the pattern:

 =head2 I'm having trouble matching over more than one line.  What's wrong?

-Either you don't have newlines in your string, or you aren't using the
-correct modifier(s) on your pattern.
+Either you don't have more than one line in the string you're looking at
+(probably), or else you aren't using the correct modifier(s) on your
+pattern (possibly).

 There are many ways to get multiline data into a string.  If you want
 it to happen automatically while reading input, you'll want to set $/
@@ -94,10 +95,10 @@ record read in.
     $/ = '';          # read in more whole paragraph, not just one line
     while ( <> ) {
-        while ( /\b(\w\S+)(\s+\1)+\b/gi ) {
+        while ( /\b([\w'-]+)(\s+\1)+\b/gi ) {   # word starts alpha
             print "Duplicate $1 at paragraph $.\n";
-        }
-    }
+        }
+    }

 Here's code that finds sentences that begin with "From " (which would
 be mangled by many mailers):

@@ -133,12 +134,23 @@ But if you want nested occurrences of C<START> through C<END>, you'll run up
 against the problem described in the question in this section on
 matching balanced text.

+Here's another example of using C<..>:
+
+    while (<>) {
+        $in_header =   1  .. /^$/;
+        $in_body   = /^$/ .. eof();
+        # now choose between them
+    } continue {
+        reset if eof();         # fix $.
+    }
+
 =head2 I put a regular expression into $/ but it didn't work. What's wrong?

 $/ must be a string, not a regular expression.  Awk has to be better
 for something. :-)

-Actually, you could do this if you don't mind reading the whole file into
+Actually, you could do this if you don't mind reading the whole file
+into
 memory:

     undef $/;
     @records = split /your_pattern/, <FILE>;
@@ -210,17 +222,17 @@ This prints:

     this is a SUcCESS case

-=head2 How can I make C<\w> match accented characters?
+=head2 How can I make C<\w> match national character sets?

 See L<perllocale>.

 =head2 How can I match a locale-smart version of C</[a-zA-Z]/>?

 One alphabetic character would be C</[^\W\d_]/>, no matter what locale
-you're in.  Non-alphabetics would be C</[\W\d_]/> (assuming you don't
+you're in.  Non-alphabetics would be C</[\W\d_]/> (assuming you don't
 consider an underscore a letter).

-=head2 How can I quote a variable to use in a regexp?
+=head2 How can I quote a variable to use in a regexp?

 The Perl parser will expand $variable and @variable references in
 regular expressions unless the delimiter is a single quote.  Remember,
@@ -328,10 +340,10 @@ Use the split function:
         foreach $word ( split ) {
             # do something with $word here
         }
-    }
+    }

-Note that this isn't really a word in the English sense; it's just
-chunks of consecutive non-whitespace characters.
+Note that this isn't really a word in the English sense; it's just
+chunks of consecutive non-whitespace characters.

 To work with only alphanumeric sequences, you might consider

@@ -344,25 +356,25 @@ To work with only alphanumeric sequences, you might consider
 =head2 How can I print out a word-frequency or line-frequency summary?

 To do this, you have to parse out each word in the input stream.  We'll
-pretend that by word you mean chunk of alphabetics, hyphens, or
-apostrophes, rather than the non-whitespace chunk idea of a word given
+pretend that by word you mean chunk of alphabetics, hyphens, or
+apostrophes, rather than the non-whitespace chunk idea of a word given
 in the previous question:

     while (<>) {
         while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
             $seen{$1}++;
-        }
-    }
+        }
+    }
     while ( ($word, $count) = each %seen ) {
         print "$count $word\n";
-    }
+    }

 If you wanted to do the same thing for lines, you wouldn't need a
 regular expression:

     while (<>) {
         $seen{$_}++;
-    }
+    }
     while ( ($line, $count) = each %seen ) {
         print "$count $line";
     }
@@ -478,15 +490,17 @@ Or, using C<\G>, the much simpler (and faster):

 A more sophisticated use might involve a tokenizer.  The following
 lex-like example is courtesy of Jeffrey Friedl.  It did not work in
-5.003 due to bugs in that release, but does work in 5.004 or better:
+5.003 due to bugs in that release, but does work in 5.004 or better.
+(Note the use of C</c>, which prevents a failed match with C</g> from
+resetting the search position back to the beginning of the string.)

     while (<>) {
       chomp;
       PARSER: {
-           m/ \G( \d+\b    )/gx     && do { print "number: $1\n";  redo; };
-           m/ \G( \w+      )/gx     && do { print "word:   $1\n";  redo; };
-           m/ \G( \s+      )/gx     && do { print "space:  $1\n";  redo; };
-           m/ \G( [^\w\d]+ )/gx     && do { print "other:  $1\n";  redo; };
+           m/ \G( \d+\b    )/gcx    && do { print "number: $1\n";  redo; };
+           m/ \G( \w+      )/gcx    && do { print "word:   $1\n";  redo; };
+           m/ \G( \s+      )/gcx    && do { print "space:  $1\n";  redo; };
+           m/ \G( [^\w\d]+ )/gcx    && do { print "other:  $1\n";  redo; };
       }
     }

@@ -495,19 +509,19 @@ Of course, that could have been written as
     while (<>) {
       chomp;
       PARSER: {
-           if ( /\G( \d+\b    )/gx  {
+           if ( /\G( \d+\b    )/gcx  {
                 print "number: $1\n";
                 redo PARSER;
            }
-           if ( /\G( \w+      )/gx  {
+           if ( /\G( \w+      )/gcx  {
                 print "word: $1\n";
                 redo PARSER;
            }
-           if ( /\G( \s+      )/gx  {
+           if ( /\G( \s+      )/gcx  {
                 print "space: $1\n";
                 redo PARSER;
            }
-           if ( /\G( [^\w\d]+ )/gx  {
+           if ( /\G( [^\w\d]+ )/gcx  {
                 print "other: $1\n";
                 redo PARSER;
            }
@@ -538,7 +552,7 @@ side-effects, and side-effects can be mystifying.  There's no void
 grep() that's not better written as a C<for> (well, C<foreach>,
 technically) loop.

-=head2 How can I match strings with multi-byte characters?
+=head2 How can I match strings with multibyte characters?

 This is hard, and there's no good way.  Perl does not directly support
 wide characters.  It pretends that a byte and a character are
@@ -546,19 +560,20 @@ synonymous.  The following set of approaches was offered by Jeffrey
 Friedl, whose article in issue #5 of The Perl Journal talks about this
 very matter.

-Let's suppose you have some weird Martian encoding where pairs of ASCII
-uppercase letters encode single Martian letters (i.e. the two bytes
-"CV" make a single Martian letter, as do the two bytes "SG", "VS",
-"XX", etc.). Other bytes represent single characters, just like ASCII.
+Let's suppose you have some weird Martian encoding where pairs of
+ASCII uppercase letters encode single Martian letters (i.e.
the two
+bytes "CV" make a single Martian letter, as do the two bytes "SG",
+"VS", "XX", etc.). Other bytes represent single characters, just like
+ASCII.

-So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the nine
-characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
+So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
+nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

 Now, say you want to search for the single character C</GX/>.  Perl
-doesn't know about Martian, so it'll find the two bytes "GX" in the
-"I am CVSGXX!" string, even though that character isn't there: it just
-looks like it is because "SG" is next to "XX", but there's no real "GX".
-This is a big problem.
+doesn't know about Martian, so it'll find the two bytes "GX" in the "I
+am CVSGXX!" string, even though that character isn't there: it just
+looks like it is because "SG" is next to "XX", but there's no real
+"GX".  This is a big problem.

 Here are a few ways, all painful, to deal with it:

@@ -578,7 +593,7 @@ Or like this:

 Or like this:

    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
-       print "found GX!\n", last if $1 eq 'GX';
+       print "found GX!\n", last if $1 eq 'GX';
    }

@@ -586,7 +601,7 @@ Or like this:
    die "sorry, Perl doesn't (yet) have Martian support )-:\n";

 In addition, a sample program which converts half-width to full-width
-katakana (in Shift-JIS or EUC encoding) is available from CPAN as
+katakana (in Shift-JIS or EUC encoding) is available from CPAN as

 =for Tom make it so

@@ -596,5 +611,18 @@ all mixed.

 =head1 AUTHOR AND COPYRIGHT

-Copyright (c) 1997 Tom Christiansen and Nathan Torkington.
-All rights reserved.  See L<perlfaq> for distribution information.
+Copyright (c) 1997, 1998 Tom Christiansen and Nathan Torkington.
+All rights reserved.
+
+When included as part of the Standard Version of Perl, or as part of
+its complete documentation whether printed or otherwise, this work
+may be distributed only under the terms of Perl's Artistic License.
+Any distribution of this file or derivatives thereof I<outside>
+of that package require that special arrangements be made with
+copyright holder.
+
+Irrespective of its distribution, all code examples in this file
+are hereby placed into the public domain.  You are permitted and
+encouraged to use this code in your own programs for fun
+or for profit as you see fit.  A simple comment in the code giving
+credit would be courteous but is not required.