=head1 NAME
-perlfaq6 - Regexps ($Revision: 1.16 $, $Date: 1997/03/25 18:16:56 $)
+perlfaq6 - Regexps ($Revision: 1.17 $, $Date: 1997/04/24 22:44:10 $)
=head1 DESCRIPTION
$/ must be a string, not a regular expression. Awk has to be better
for something. :-)
-Actually, you could do this if you don't mind reading the whole file into
+Actually, you could do this if you don't mind reading the whole file
+into memory:
undef $/;
@records = split /your_pattern/, <FH>;
=head2 What is C</o> really for?
-Using a variable in a regular expression match forces a reevaluation
+Using a variable in a regular expression match forces a re-evaluation
(and perhaps recompilation) each time through. The C</o> modifier
locks in the regexp the first time it's used. This always happens in a
constant regular expression, and in fact, the pattern was compiled
Use the split function:
while (<>) {
- foreach $word ( split ) {
+ foreach $word ( split ) {
# do something with $word here
- }
+ }
}
Note that this isn't really a word in the English sense; it's just
If you wanted to do the same thing for lines, you wouldn't need a
regular expression:
- while (<>) {
+ while (<>) {
$seen{$_}++;
}
while ( ($line, $count) = each %seen ) {
A more sophisticated use might involve a tokenizer. The following
lex-like example is courtesy of Jeffrey Friedl. It did not work in
-5.003 due to bugs in that release, but does work in 5.004 or better:
+5.003 due to bugs in that release, but does work in 5.004 or better.
+(Note the use of C</c>, which prevents a failed match with C</g> from
+resetting the search position back to the beginning of the string.)
while (<>) {
chomp;
PARSER: {
- m/ \G( \d+\b )/gx && do { print "number: $1\n"; redo; };
- m/ \G( \w+ )/gx && do { print "word: $1\n"; redo; };
- m/ \G( \s+ )/gx && do { print "space: $1\n"; redo; };
- m/ \G( [^\w\d]+ )/gx && do { print "other: $1\n"; redo; };
+ m/ \G( \d+\b )/gcx && do { print "number: $1\n"; redo; };
+ m/ \G( \w+ )/gcx && do { print "word: $1\n"; redo; };
+ m/ \G( \s+ )/gcx && do { print "space: $1\n"; redo; };
+ m/ \G( [^\w\d]+ )/gcx && do { print "other: $1\n"; redo; };
}
}
while (<>) {
chomp;
PARSER: {
- if ( /\G( \d+\b )/gx {
+ if ( /\G( \d+\b )/gcx {
print "number: $1\n";
redo PARSER;
}
- if ( /\G( \w+ )/gx {
+ if ( /\G( \w+ )/gcx {
print "word: $1\n";
redo PARSER;
}
- if ( /\G( \s+ )/gx {
+ if ( /\G( \s+ )/gcx {
print "space: $1\n";
redo PARSER;
}
- if ( /\G( [^\w\d]+ )/gx {
+ if ( /\G( [^\w\d]+ )/gcx {
print "other: $1\n";
redo PARSER;
}
While it's true that Perl's regular expressions resemble the DFAs
(deterministic finite automata) of the egrep(1) program, they are in
-fact implemented as NFAs (nondeterministic finite automata) to allow
+fact implemented as NFAs (non-deterministic finite automata) to allow
backtracking and backreferencing. And they aren't POSIX-style either,
because those guarantee worst-case behavior for all cases. (It seems
that some people prefer guarantees of consistency, even when what's
Friedl, whose article in issue #5 of The Perl Journal talks about this
very matter.
-Let's suppose you have some weird Martian encoding where pairs of ASCII
-uppercase letters encode single Martian letters (i.e. the two bytes
-"CV" make a single Martian letter, as do the two bytes "SG", "VS",
-"XX", etc.). Other bytes represent single characters, just like ASCII.
+Let's suppose you have some weird Martian encoding where pairs of
+ASCII uppercase letters encode single Martian letters (i.e. the two
+bytes "CV" make a single Martian letter, as do the two bytes "SG",
+"VS", "XX", etc.). Other bytes represent single characters, just like
+ASCII.
-So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the nine
-characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
+So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
+nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
Now, say you want to search for the single character C</GX/>. Perl
-doesn't know about Martian, so it'll find the two bytes "GX" in the
-"I am CVSGXX!" string, even though that character isn't there: it just
-looks like it is because "SG" is next to "XX", but there's no real "GX".
-This is a big problem.
+doesn't know about Martian, so it'll find the two bytes "GX" in the "I
+am CVSGXX!" string, even though that character isn't there: it just
+looks like it is because "SG" is next to "XX", but there's no real
+"GX". This is a big problem.
Here are a few ways, all painful, to deal with it:
=for Tom make it so
-There are many double (and multi) byte encodings commonly used these
+There are many double- (and multi-) byte encodings commonly used these
days. Some versions of these have 1-, 2-, 3-, and 4-byte characters,
all mixed.
Copyright (c) 1997 Tom Christiansen and Nathan Torkington.
All rights reserved. See L<perlfaq> for distribution information.
+