The last example points out that character classes are like
alternations of characters. At a given character position, the first
-alternative that allows the regexp match to succeed wil be the one
+alternative that allows the regexp match to succeed will be the one
that matches.
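
As an illustration of the point above, here is a minimal sketch (the sample words and variable names are made up for this example) showing that a character class accepts the same strings as an alternation of single characters, and that the first alternative that lets the whole regexp succeed is the one chosen:

```perl
use strict;
use warnings;

# A character class and an alternation of single characters
# accept exactly the same strings here.
my @agree;
for my $word (qw(cat bat rat hat)) {
    my $by_class = $word =~ /^[cbr]at$/     ? 1 : 0;
    my $by_alt   = $word =~ /^(?:c|b|r)at$/ ? 1 : 0;
    push @agree, $by_class == $by_alt;
    print "$word: class=$by_class alternation=$by_alt\n";
}

# Without a trailing anchor, the first alternative already lets
# the regexp succeed, so it is the one that matches.
my ($first) = "cats" =~ /^(c|ca|cat|cats)/;

# With a trailing anchor, the earlier alternatives no longer allow
# the regexp to succeed, so a later one is taken.
my ($anchored) = "cats" =~ /^(c|ca|cat|cats)$/;

print "unanchored: $first, anchored: $anchored\n";
```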
=head2 Grouping things and hierarchical matching
Alternations behave the same way in groups as out of them: at a given
string position, the leftmost alternative that allows the regexp to
-match is taken. So in the last example at tth first string position,
+match is taken. So in the last example at the first string position,
C<"20"> matches the second alternative, but there is nothing left over
to match the next two digits C<\d\d>. So perl moves on to the next
alternative, which is the null alternative and that works, since
The combination of C<//g> and C<\G> allows us to process the string a
bit at a time and use arbitrary Perl logic to decide what to do next.
+Currently, the C<\G> anchor is only fully supported when used to anchor
+to the start of the pattern.
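
A small sketch of this style of processing, with C<\G> placed at the start of every pattern as the note above recommends. The input string and token names are hypothetical, and the C</c> modifier is used so that a failed match does not reset the position:

```perl
use strict;
use warnings;

my $input = "x = 42 + y";   # hypothetical input for illustration
my @tokens;

for ($input) {              # alias $_ to $input so pos() is shared
    while (1) {
        # Each pattern starts with \G, so it can only match exactly
        # where the previous match left off.
        if    (/\G\s+/gc)            { next }
        elsif (/\G([A-Za-z_]\w*)/gc) { push @tokens, "IDENT($1)" }
        elsif (/\G(\d+)/gc)          { push @tokens, "NUM($1)" }
        elsif (/\G([=+])/gc)         { push @tokens, "OP($1)" }
        else                         { last }   # end of input or unknown character
    }
}

print join(" ", @tokens), "\n";
```

Ordinary Perl control flow between the matches decides which pattern is tried next, which is what makes this approach more flexible than one large regexp.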
C<\G> is also invaluable in processing fixed length records with
regexps. Suppose we have a snippet of coding region DNA, encoded as
with braces: C<\x{ab}>. Note that this is different than C<\xab>,
which is just a hexadecimal byte with no Unicode significance.
-B<NOTE>: in perl 5.6.0 it used to be that one needed to say C<use utf8>
-to use any Unicode features. This is no more the case: for almost all
-Unicode processing, the explicit C<utf8> pragma is not needed.
-(The only case where it matters is if your Perl script is in Unicode,
-that is, encoded in UTF-8/UTF-16/UTF-EBCDIC: then an explicit C<use utf8>
-is needed.)
+B<NOTE>: in Perl 5.6.0 it used to be that one needed to say C<use
+utf8> to use any Unicode features. This is no longer the case: for
+almost all Unicode processing, the explicit C<utf8> pragma is not
+needed. (The only case where it matters is if your Perl script is
+itself encoded in UTF-8; then an explicit C<use utf8> is needed.)
Figuring out the hexadecimal sequence of a Unicode character you want
or deciphering someone else's hexadecimal Unicode regexp is about as
The last regexp matches, but is dangerous because the string
I<character> position is no longer synchronized to the string I<byte>
position. This generates the warning 'Malformed UTF-8
-character'. C<\C> is best used for matching the binary data in strings
+character'. The C<\C> is best used for matching the binary data in strings
with binary data intermixed with Unicode characters.
Let us now discuss the rest of the character classes. Just as with
IsPrint /^([LMNPS]|Co|Zs)/
IsPunct /^P/
IsSpace /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/
- IsSpacePerl /^Z/ || ($code =~ /^(0009|000A|000C|000D)$/
+ IsSpacePerl /^Z/ || ($code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/
IsUpper /^L[ut]/
IsWord /^[LMN]/ || $code eq "005F"
IsXDigit $code =~ /^00(3[0-9]|[46][1-6])$/
The Unicode has also been separated into various sets of characters
which you can test with C<\p{In...}> (in) and C<\P{In...}> (not in),
-for example C<\p{InLatin}>, C<\p{InGreek}>, or C<\P{InKatakana}>.
+for example C<\p{Latin}>, C<\p{Greek}>, or C<\P{Katakana}>.
For the full list see L<perlunicode>.
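
A brief sketch of these script tests, assuming a reasonably modern Perl in which the short names C<\p{Latin}>, C<\p{Greek}>, and C<\p{Katakana}> are accepted (exact property names have varied between Perl versions). The sample characters are chosen for illustration and written as code points so the source file stays plain ASCII:

```perl
use strict;
use warnings;

my $latin    = "a";          # LATIN SMALL LETTER A
my $greek    = "\x{3b1}";    # GREEK SMALL LETTER ALPHA
my $katakana = "\x{30ab}";   # KATAKANA LETTER KA

# \p{...} matches a character that belongs to the named set,
# \P{...} matches a character that does not.
my $is_latin     = $latin    =~ /^\p{Latin}$/    ? 1 : 0;
my $is_greek     = $greek    =~ /^\p{Greek}$/    ? 1 : 0;
my $is_katakana  = $katakana =~ /^\p{Katakana}$/ ? 1 : 0;
my $not_katakana = $latin    =~ /^\P{Katakana}$/ ? 1 : 0;

print "latin=$is_latin greek=$is_greek ",
      "katakana=$is_katakana not_katakana=$not_katakana\n";
```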
C<\X> is an abbreviation for a character class sequence that includes
character classes. To negate a POSIX class, put a C<^> in front of
the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
C<utf8>, C<\P{IsDigit}>. The Unicode and POSIX character classes can
-be used just like C<\d>, both inside and outside of character classes:
+be used just like C<\d>, with the exception that POSIX character
+classes can only be used inside of a character class:
/\s+[abc[:digit:]xyz]\s*/; # match a,b,c,x,y,z, or a digit
- /^=item\s[:digit:]/; # match '=item',
+ /^=item\s[[:digit:]]/; # match '=item',
# followed by a space and a digit
use charnames ":full";
/\s+[abc\p{IsDigit}xyz]\s+/; # match a,b,c,x,y,z, or a digit
$x =~ /foo(?!baz)/; # matches, 'baz' doesn't follow 'foo'
$x =~ /(?<!\s)foo/; # matches, there is no \s before 'foo'
+The C<\C> is unsupported in lookbehind, because the already
+treacherous definition of C<\C> would become even more so
+when going backwards.
+
=head2 Using independent subexpressions to prevent backtracking
The last few extended patterns in this tutorial are experimental as of
parentheses and the second alternative C<\([^()]*\)> matching a
substring delimited by parentheses. The problem with this regexp is
that it is pathological: it has nested indeterminate quantifiers
- of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
+of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
like this could take an exponentially long time to execute if there
was no match possible. To prevent the exponential blowup, we need to
prevent useless backtracking at some point. This can be done by