"^" to match only at the beginning of the string and "$" to match
only at the end (or just before a newline at the end) of the string.
Together, as /ms, they let the "." match any character whatsoever,
-while yet allowing "^" and "$" to match, respectively, just after
+while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.
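
The interaction of the two modifiers can be sketched as follows. This is only an illustration: the sample string and the variable names are made up for the example and do not come from the documentation itself.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, "^" anchors only at the start of the whole string.
my @plain = $text =~ /^(\w+)/g;

# With /m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, "." stops at a newline; with /s it crosses it.
my $dot_plain = $text =~ /first.*second/  ? 1 : 0;
my $dot_s     = $text =~ /first.*second/s ? 1 : 0;

# With /ms together, "." may cross newlines, yet "$" can still match
# just before an embedded newline (seen with the non-greedy form).
my ($lazy)   = $text =~ /^(first.*?)$/ms;
my ($greedy) = $text =~ /^(first.*)$/ms;

print "plain: @plain\n";
print "multi: @multi\n";
print "lazy:  $lazy\n";
```

The non-greedy capture stops at the first place "$" can match, which is just before the embedded newline, while the greedy capture runs to the end of the string.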
=item x
matches zero, one, any alphabetic character, and the percentage sign.
If the C<utf8> pragma is used, the following equivalences to Unicode
-\p{} constructs hold:
+\p{} constructs and equivalent backslash character classes (if available)
+will hold:
alpha IsAlpha
alnum IsAlnum
ascii IsASCII
blank IsSpace
cntrl IsCntrl
- digit IsDigit
+ digit IsDigit \d
graph IsGraph
lower IsLower
print IsPrint
punct IsPunct
space IsSpace
+ IsSpacePerl \s
upper IsUpper
word IsWord
xdigit IsXDigit
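
A minimal sketch of a few rows of this table, assuming a Perl recent enough that C<\p{...}> properties are usable without any extra pragma. The helper C<count_matches> and the sample characters are hypothetical, introduced only for the illustration.

```perl
use strict;
use warnings;

# Count how many of the given one-character strings match a pattern.
sub count_matches {
    my ($re, @chars) = @_;
    return scalar grep { $_ =~ $re } @chars;
}

my @sample = ('7', 'x', ' ', "\t", '_', '%');

# digit / IsDigit / \d : only '7' qualifies in this sample.
my $posix_digit = count_matches(qr/^[[:digit:]]$/, @sample);
my $prop_digit  = count_matches(qr/^\p{IsDigit}$/, @sample);
my $bs_digit    = count_matches(qr/^\d$/,          @sample);

# space / IsSpace : the blank and the tab qualify.
my $posix_space = count_matches(qr/^[[:space:]]$/, @sample);
my $prop_space  = count_matches(qr/^\p{IsSpace}$/, @sample);

print "digit: $posix_digit $prop_digit $bs_digit\n";
print "space: $posix_space $prop_space\n";
```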
There is no limit to the number of captured substrings that you may
use. However Perl also uses \10, \11, etc. as aliases for \010,
-\011, etc. (Recall that 0 means octal, so \011 is the 9'th ASCII
-character, a tab.) Perl resolves this ambiguity by interpreting
-\10 as a backreference only if at least 10 left parentheses have
-opened before it. Likewise \11 is a backreference only if at least
-11 left parentheses have opened before it. And so on. \1 through
-\9 are always interpreted as backreferences."
+\011, etc. (Recall that 0 means octal, so \011 is the character at
+number 9 in your coded character set, which would be the 10th character,
+a horizontal tab under ASCII.) Perl resolves this
+ambiguity by interpreting \10 as a backreference only if at least 10
+left parentheses have opened before it. Likewise \11 is a
+backreference only if at least 11 left parentheses have opened
+before it. And so on. \1 through \9 are always interpreted as
+backreferences.
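
The disambiguation rule can be sketched like this, assuming an ASCII platform where octal 010 is the backspace character. The test strings and variable names are invented for the example.

```perl
use strict;
use warnings;

# Fewer than 10 groups opened: \10 is read as octal 010,
# which is a backspace under ASCII (an assumption of this sketch).
my $octal = "\x08" =~ /^\10$/ ? 1 : 0;

# Ten groups opened before it: \10 now refers to group 10.
my $ten_groups = '(.)' x 10;    # pattern source with ten capture groups
my $backref  = 'abcdefghijj' =~ /^$ten_groups\10$/ ? 1 : 0;
my $mismatch = 'abcdefghija' =~ /^$ten_groups\10$/ ? 1 : 0;

# \1 through \9 are always backreferences.
my $first = 'xx' =~ /^(.)\1$/ ? 1 : 0;

print "octal=$octal backref=$backref mismatch=$mismatch first=$first\n";
```

On Perls that support the C<\g{10}> notation, writing the backreference that way sidesteps the ambiguity altogether.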
Examples:
internal optimizations done by the regular expression engine, this will
take a painfully long time to run:
- 'aaaaaaaaaaaa' =~ /((a{0,5}){0,5}){0,5}[c]/
+ 'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/
-And if you used C<*>'s instead of limiting it to 0 through 5 matches,
-then it would take forever--or until you ran out of stack space.
+And if you used C<*>'s in the internal groups instead of limiting them
+to 0 through 5 matches, then it would take forever--or until you ran
+out of stack space. Moreover, these internal optimizations are not
+always applicable. For example, if you put C<{0,5}> instead of C<*>
+on the external group, no current optimization is applicable, and the
+match takes a long time to finish.
A powerful tool for optimizing such beasts is what is known as an
"independent group",
at the end of the list, just before the closing "]". (The
following all specify the same class of three characters: C<[-az]>,
C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
-specifies a class containing twenty-six characters.)
-Also, if you try to use the character classes C<\w>, C<\W>, C<\s>,
-C<\S>, C<\d>, or C<\D> as endpoints of a range, that's not a range,
-the "-" is understood literally.
+specifies a class containing twenty-six characters, even on EBCDIC-based
+coded character sets.) Also, if you try to use the character
+classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
+a range, that's not a range; the "-" is understood literally.
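
As a quick check of the hyphen placement rules, the following sketch probes each class with a handful of characters. The helper C<accepted> and the probe list are hypothetical names chosen for the illustration.

```perl
use strict;
use warnings;

# Return, joined into one string, the probe characters a class accepts.
sub accepted {
    my ($re, @chars) = @_;
    return join '', grep { $_ =~ $re } @chars;
}

my @probe = ('a', 'm', 'z', '-');

# Three spellings of the same three-character class.
my $leading  = accepted(qr/^[-az]$/,  @probe);
my $trailing = accepted(qr/^[az-]$/,  @probe);
my $escaped  = accepted(qr/^[a\-z]$/, @probe);

# A real range: accepts 'm', rejects the literal hyphen.
my $range    = accepted(qr/^[a-z]$/,  @probe);

print "$leading $trailing $escaped $range\n";
```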
Note also that the whole range idea is rather unportable between
character sets--and even within character sets they may cause results
Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
-of octal digits, matches the character whose ASCII value is I<nnn>.
-Similarly, \xI<nn>, where I<nn> are hexadecimal digits, matches the
-character whose ASCII value is I<nn>. The expression \cI<x> matches the
-ASCII character control-I<x>. Finally, the "." metacharacter matches any
-character except "\n" (unless you use C</s>).
+of octal digits, matches the character whose coded character set value
+is I<nnn>. Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
+matches the character whose numeric value is I<nn>. The expression \cI<x>
+matches the character control-I<x>. Finally, the "." metacharacter
+matches any character except "\n" (unless you use C</s>).
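
A small sketch of these notations, assuming an ASCII-based platform where the tab character has the numeric value 9 (on other coded character sets the octal and hexadecimal forms would name a different character). The variable names are illustrative only.

```perl
use strict;
use warnings;

my $tab = "\t";

# Four ways of naming the same character under ASCII.
my $by_name  = $tab =~ /^\t$/   ? 1 : 0;
my $by_octal = $tab =~ /^\011$/ ? 1 : 0;
my $by_hex   = $tab =~ /^\x09$/ ? 1 : 0;
my $by_ctrl  = $tab =~ /^\cI$/  ? 1 : 0;

# "." refuses a newline unless /s is in effect.
my $dot_nl   = "\n" =~ /^.$/  ? 1 : 0;
my $dot_nl_s = "\n" =~ /^.$/s ? 1 : 0;

print "tab: $by_name $by_octal $by_hex $by_ctrl\n";
print "dot: $dot_nl $dot_nl_s\n";
```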
You can specify a series of alternatives for a pattern using "|" to
separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
$_ = 'bar';
s/\w??/<$&>/g;
-results in C<"<><b><><a><><r><>">. At each position of the string the best
+results in C<< <><b><><a><><r><> >>. At each position of the string the best
match given by non-greedy C<??> is the zero-length match, and the I<second
best> match is what is matched by C<\w>. Thus zero-length matches
alternate with one-character-long matches.
L<perllocale>.
+L<perlebcdic>.
+
I<Mastering Regular Expressions> by Jeffrey Friedl, published
by O'Reilly and Associates.