=head1 NAME
-perlfaq6 - Regexps ($Revision: 1.14 $)
+perlfaq6 - Regexps ($Revision: 1.17 $, $Date: 1997/04/24 22:44:10 $)
=head1 DESCRIPTION
this document (in the section on Data and the Networking one on
networking, to be precise).
-=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
+=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
Three techniques can make regular expressions maintainable and
understandable.
while ( <> ) {
while ( /\b(\w\S+)(\s+\1)+\b/gi ) {
print "Duplicate $1 at paragraph $.\n";
- }
- }
+ }
+ }
Here's code that finds sentences that begin with "From " (which would
be mangled by many mailers):
$/ must be a string, not a regular expression. Awk has to be better
for something. :-)
-Actually, you could do this if you don't mind reading the whole file into
+Actually, you could do this if you don't mind reading the whole file
+into memory:
undef $/;
@records = split /your_pattern/, <FH>;
+The Net::Telnet module (available from CPAN) has the capability to
+wait for a pattern in the input stream, or timeout if it doesn't
+appear within a certain time.
+
+ ## Create a file with three lines.
+ open FH, ">file";
+ print FH "The first line\nThe second line\nThe third line\n";
+ close FH;
+
+ ## Get a read/write filehandle to it.
+ $fh = new FileHandle "+<file";
+
+ ## Attach it to a "stream" object.
+ use Net::Telnet;
+ $file = new Net::Telnet (-fhopen => $fh);
+
+ ## Search for the second line and print out the third.
+ $file->waitfor('/second line\n/');
+ print $file->getline;
+
=head2 How do I substitute case insensitively on the LHS, but preserving case on the RHS?
It depends on what you mean by "preserving case". The following
=head2 How can I match a locale-smart version of C</[a-zA-Z]/>?
One alphabetic character would be C</[^\W\d_]/>, no matter what locale
-you're in. Non-alphabetics would be C</[\W\d_]/> (assuming you don't
+you're in. Non-alphabetics would be C</[\W\d_]/> (assuming you don't
consider an underscore a letter).
-=head2 How can I quote a variable to use in a regexp?
+=head2 How can I quote a variable to use in a regexp?
The Perl parser will expand $variable and @variable references in
regular expressions unless the delimiter is a single quote. Remember,
foreach $word ( split ) {
# do something with $word here
}
- }
+ }
-Note that this isn't really a word in the English sense; it's just
-chunks of consecutive non-whitespace characters.
+Note that this isn't really a word in the English sense; it's just
+chunks of consecutive non-whitespace characters.
To work with only alphanumeric sequences, you might consider
=head2 How can I print out a word-frequency or line-frequency summary?
To do this, you have to parse out each word in the input stream. We'll
-pretend that by word you mean chunk of alphabetics, hyphens, or
-apostrophes, rather than the non-whitespace chunk idea of a word given
+pretend that by word you mean chunk of alphabetics, hyphens, or
+apostrophes, rather than the non-whitespace chunk idea of a word given
in the previous question:
while (<>) {
while ( /(\b[^\W_\d][\w'-]+\b)/g ) { # misses "`sheep'"
$seen{$1}++;
- }
- }
+ }
+ }
while ( ($word, $count) = each %seen ) {
print "$count $word\n";
- }
+ }
If you wanted to do the same thing for lines, you wouldn't need a
regular expression:
while (<>) {
$seen{$_}++;
- }
+ }
while ( ($line, $count) = each %seen ) {
print "$count $line";
}
while (<>) {
chomp;
PARSER: {
- if ( /\G( \d+\b )/gx {
+ if ( /\G( \d+\b )/gx {
print "number: $1\n";
redo PARSER;
}
grep() that's not better written as a C<for> (well, C<foreach>,
technically) loop.
-=head2 How can I match strings with multi-byte characters?
+=head2 How can I match strings with multibyte characters?
This is hard, and there's no good way. Perl does not directly support
wide characters. It pretends that a byte and a character are
Friedl, whose article in issue #5 of The Perl Journal talks about this
very matter.
-Let's suppose you have some weird Martian encoding where pairs of ASCII
-uppercase letters encode single Martian letters (i.e. the two bytes
-"CV" make a single Martian letter, as do the two bytes "SG", "VS",
-"XX", etc.). Other bytes represent single characters, just like ASCII.
+Let's suppose you have some weird Martian encoding where pairs of
+ASCII uppercase letters encode single Martian letters (i.e. the two
+bytes "CV" make a single Martian letter, as do the two bytes "SG",
+"VS", "XX", etc.). Other bytes represent single characters, just like
+ASCII.
-So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the nine
-characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
+So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
+nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
Now, say you want to search for the single character C</GX/>. Perl
-doesn't know about Martian, so it'll find the two bytes "GX" in the
-"I am CVSGXX!" string, even though that character isn't there: it just
-looks like it is because "SG" is next to "XX", but there's no real "GX".
-This is a big problem.
+doesn't know about Martian, so it'll find the two bytes "GX" in the "I
+am CVSGXX!" string, even though that character isn't there: it just
+looks like it is because "SG" is next to "XX", but there's no real
+"GX". This is a big problem.
Here are a few ways, all painful, to deal with it:
- $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``maritan'' bytes
+ $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian'' bytes
# are no longer adjacent.
print "found GX!\n" if $martian =~ /GX/;
Or like this:
while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) { # \G probably unneeded
- print "found GX!\n", last if $1 eq 'GX';
+ print "found GX!\n", last if $1 eq 'GX';
}
Or like this:
die "sorry, Perl doesn't (yet) have Martian support )-:\n";
In addition, a sample program which converts half-width to full-width
-katakana (in Shift-JIS or EUC encoding) is available from CPAN as
+katakana (in Shift-JIS or EUC encoding) is available from CPAN as
=for Tom make it so
Copyright (c) 1997 Tom Christiansen and Nathan Torkington.
All rights reserved. See L<perlfaq> for distribution information.
+