"2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary +
"The interval is [0,1)." =~ /[0,1)./ # is a syntax error!
"The interval is [0,1)." =~ /\[0,1\)\./ # matches
- "/usr/bin/perl" =~ /\/usr\/local\/bin\/perl/; # matches
+ "/usr/bin/perl" =~ /\/usr\/bin\/perl/; # matches
In the last regexp, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regexp. This can lead to LTS
(leaning toothpick syndrome), however, and it is often more readable
to change delimiters.
+ "/usr/bin/perl" =~ m!/usr/bin/perl!; # easier to read
The backslash character C<'\'> is a metacharacter itself and needs to
be backslashed:
$x =~ /girl.Who/m; # doesn't match, "." doesn't match "\n"
$x =~ /girl.Who/sm; # matches, "." matches "\n"
-Most of the time, the default behavior is what is want, but C<//s> and
+Most of the time, the default behavior is what is wanted, but C<//s> and
C<//m> are occasionally very useful. If C<//m> is being used, the start
of the string can still be matched with C<\A> and the end of string
can still be matched with the anchors C<\Z> (matches both the end and
used just as ordinary variables:
# extract hours, minutes, seconds
- $time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format
- $hours = $1;
- $minutes = $2;
- $seconds = $3;
+ if ($time =~ /(\d\d):(\d\d):(\d\d)/) { # match hh:mm:ss format
+     $hours = $1;
+     $minutes = $2;
+     $seconds = $3;
+ }
Now, we know that in scalar context,
S<C<$time =~ /(\d\d):(\d\d):(\d\d)/> > returns a true or false
will ignore it. If you don't want any substitutions at all, use the
special delimiter C<m''>:
- $pattern = 'Seuss';
+ @pattern = ('Seuss');
while (<>) {
- print if m'$pattern'; # matches '$pattern', not 'Seuss'
+ print if m'@pattern'; # matches literal '@pattern', not 'Seuss'
}
C<m''> acts like single quotes on a regexp; all other C<m> delimiters
The combination of C<//g> and C<\G> allows us to process the string a
bit at a time and use arbitrary Perl logic to decide what to do next.
+Currently, the C<\G> anchor is only fully supported when it appears at
+the very start of the pattern.
C<\G> is also invaluable in processing fixed length records with
regexps. Suppose we have a snippet of coding region DNA, encoded as
The last regexp matches, but is dangerous because the string
I<character> position is no longer synchronized to the string I<byte>
position. This generates the warning 'Malformed UTF-8
-character'. C<\C> is best used for matching the binary data in strings
+character'. C<\C> is best used for matching binary data in strings
with binary data intermixed with Unicode characters.
Let us now discuss the rest of the character classes. Just as with
IsPrint /^([LMNPS]|Co|Zs)/
IsPunct /^P/
IsSpace /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/
- IsSpacePerl /^Z/ || ($code =~ /^(0009|000A|000C|000D)$/
+ IsSpacePerl /^Z/ || ($code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/
IsUpper /^L[ut]/
IsWord /^[LMN]/ || $code eq "005F"
IsXDigit $code =~ /^00(3[0-9]|[46][1-6])$/
character class of Unicode 'marks', for example accent marks.
For the full list see L<perlunicode>.
-The Unicode has also been separated into various sets of charaters
+Unicode has also been separated into various sets of characters
which you can test with C<\p{In...}> (in) and C<\P{In...}> (not in),
-for example C<\p{InLatin}>, C<\p{InGreek}>, or C<\P{InKatakana}>.
+for example C<\p{Latin}>, C<\p{Greek}>, or C<\P{Katakana}>.
For the full list see L<perlunicode>.
C<\X> is an abbreviation for a character class sequence that includes
character classes. To negate a POSIX class, put a C<^> in front of
the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
C<utf8>, C<\P{IsDigit}>. The Unicode and POSIX character classes can
-be used just like C<\d>, both inside and outside of character classes:
+be used just like C<\d>, with the exception that POSIX character
+classes can only be used inside of a character class:
/\s+[abc[:digit:]xyz]\s*/; # match a,b,c,x,y,z, or a digit
- /^=item\s[:digit:]/; # match '=item',
+ /^=item\s[[:digit:]]/; # match '=item',
# followed by a space and a digit
use charnames ":full";
/\s+[abc\p{IsDigit}xyz]\s+/; # match a,b,c,x,y,z, or a digit
$x =~ /foo(?!baz)/; # matches, 'baz' doesn't follow 'foo'
$x =~ /(?<!\s)foo/; # matches, there is no \s before 'foo'
+C<\C> is unsupported in lookbehind, because the already
+treacherous definition of C<\C> would become even more so
+when going backwards.
+
=head2 Using independent subexpressions to prevent backtracking
The last few extended patterns in this tutorial are experimental as of
$pat = qr/(?{ $foo = 1 })/; # precompile code regexp
/foo${pat}bar/; # compiles ok
-If a regexp has (1) code expressions and interpolating variables,or
+If a regexp has (1) code expressions and interpolating variables, or
(2) a variable that interpolates a code expression, perl treats the
regexp as an error. If the code expression is precompiled into a
variable, however, interpolating is ok. The question is, why is this