=head1 NAME
-perlfaq6 - Regexps ($Revision: 1.14 $)
+perlfaq6 - Regexps ($Revision: 1.16 $, $Date: 1997/03/25 18:16:56 $)
=head1 DESCRIPTION
this document (in the section on Data and the Networking one on
networking, to be precise).
-=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
+=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
Three techniques can make regular expressions maintainable and
understandable.
while ( <> ) {
while ( /\b(\w\S+)(\s+\1)+\b/gi ) {
print "Duplicate $1 at paragraph $.\n";
- }
- }
+ }
+ }
Here's code that finds sentences that begin with "From " (which would
be mangled by many mailers):
$/ must be a string, not a regular expression. Awk has to be better
for something. :-)
-Actually, you could do this if you don't mind reading the whole file into
+Actually, you could do this if you don't mind reading the whole file into
undef $/;
@records = split /your_pattern/, <FH>;
+The Net::Telnet module (available from CPAN) has the capability to
+wait for a pattern in the input stream, or time out if it doesn't
+appear within a certain time.
+
+ ## Create a file with three lines.
+ open FH, ">file";
+ print FH "The first line\nThe second line\nThe third line\n";
+ close FH;
+
+ ## Get a read/write filehandle to it.
+ use FileHandle;
+ $fh = new FileHandle "+<file";
+
+ ## Attach it to a "stream" object.
+ use Net::Telnet;
+ $file = new Net::Telnet (-fhopen => $fh);
+
+ ## Search for the second line and print out the third.
+ $file->waitfor('/second line\n/');
+ print $file->getline;
+
=head2 How do I substitute case insensitively on the LHS, but preserving case on the RHS?
It depends on what you mean by "preserving case". The following
=head2 How can I match a locale-smart version of C</[a-zA-Z]/>?
One alphabetic character would be C</[^\W\d_]/>, no matter what locale
-you're in. Non-alphabetics would be C</[\W\d_]/> (assuming you don't
+you're in. Non-alphabetics would be C</[\W\d_]/> (assuming you don't
consider an underscore a letter).
-=head2 How can I quote a variable to use in a regexp?
+=head2 How can I quote a variable to use in a regexp?
The Perl parser will expand $variable and @variable references in
regular expressions unless the delimiter is a single quote. Remember,
=head2 What is C</o> really for?
-Using a variable in a regular expression match forces a re-evaluation
+Using a variable in a regular expression match forces a reevaluation
(and perhaps recompilation) each time through. The C</o> modifier
locks in the regexp the first time it's used. This always happens in a
constant regular expression, and in fact, the pattern was compiled
Use the split function:
while (<>) {
- foreach $word ( split ) {
+ foreach $word ( split ) {
# do something with $word here
- }
- }
+ }
+ }
-Note that this isn't really a word in the English sense; it's just
-chunks of consecutive non-whitespace characters.
+Note that this isn't really a word in the English sense; it's just
+chunks of consecutive non-whitespace characters.
To work with only alphanumeric sequences, you might consider
=head2 How can I print out a word-frequency or line-frequency summary?
To do this, you have to parse out each word in the input stream. We'll
-pretend that by word you mean chunk of alphabetics, hyphens, or
-apostrophes, rather than the non-whitespace chunk idea of a word given
+pretend that by word you mean chunk of alphabetics, hyphens, or
+apostrophes, rather than the non-whitespace chunk idea of a word given
in the previous question:
while (<>) {
while ( /(\b[^\W_\d][\w'-]+\b)/g ) { # misses "`sheep'"
$seen{$1}++;
- }
- }
+ }
+ }
while ( ($word, $count) = each %seen ) {
print "$count $word\n";
- }
+ }
If you wanted to do the same thing for lines, you wouldn't need a
regular expression:
- while (<>) {
+ while (<>) {
$seen{$_}++;
- }
+ }
while ( ($line, $count) = each %seen ) {
print "$count $line";
}
while (<>) {
chomp;
PARSER: {
- if ( /\G( \d+\b )/gx {
+ if ( /\G( \d+\b )/gx ) {
print "number: $1\n";
redo PARSER;
}
While it's true that Perl's regular expressions resemble the DFAs
(deterministic finite automata) of the egrep(1) program, they are in
-fact implemented as NFAs (non-deterministic finite automata) to allow
+fact implemented as NFAs (nondeterministic finite automata) to allow
backtracking and backreferencing. And they aren't POSIX-style either,
because those guarantee worst-case behavior for all cases. (It seems
that some people prefer guarantees of consistency, even when what's
grep() that's not better written as a C<for> (well, C<foreach>,
technically) loop.
-=head2 How can I match strings with multi-byte characters?
+=head2 How can I match strings with multibyte characters?
This is hard, and there's no good way. Perl does not directly support
wide characters. It pretends that a byte and a character are
Here are a few ways, all painful, to deal with it:
- $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``maritan'' bytes
+ $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent ``martian'' bytes
# are no longer adjacent.
print "found GX!\n" if $martian =~ /GX/;
Or like this:
while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) { # \G probably unneeded
- print "found GX!\n", last if $1 eq 'GX';
+ print("found GX!\n"), last if $1 eq 'GX';
}
Or like this:
die "sorry, Perl doesn't (yet) have Martian support )-:\n";
In addition, a sample program which converts half-width to full-width
-katakana (in Shift-JIS or EUC encoding) is available from CPAN as
+katakana (in Shift-JIS or EUC encoding) is available from CPAN as
=for Tom make it so
-There are many double- (and multi-) byte encodings commonly used these
+There are many double-byte (and multibyte) encodings commonly used these
days. Some versions of these have 1-, 2-, 3-, and 4-byte characters,
all mixed.