In addition, Perl defines the following:
\w Match a "word" character (alphanumeric plus "_")
- \W Match a non-word character
+ \W Match a non-"word" character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
equivalent to C<(?:\PM\pM*)>
\C Match a single C char (octet) even under utf8.
-A C<\w> matches a single alphanumeric character, not a whole word.
+A C<\w> matches a single alphanumeric character or C<_>, not a whole word.
Use C<\w+> to match a string of Perl-identifier characters (which isn't
the same as matching an English word). If C<use locale> is in effect, the
list of alphabetic characters generated by C<\w> is taken from the
alpha
alnum
ascii
+ blank [1]
cntrl
digit \d
graph
lower
print
punct
- space \s
+ space \s [2]
upper
- word \w
+ word \w [3]
xdigit
+ [1] A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
+ [2] Not I<exactly equivalent> to C<\s>, since C<[[:space:]]> also
+ includes the (very rare) `vertical tabulator', "\ck", chr(11).
+ [3] A Perl extension.
+
For example use C<[:upper:]> to match all the uppercase characters.
-Note that the C<[]> are part of the C<[::]> construct, not part of the whole
-character class. For example:
+Note that the C<[]> are part of the C<[::]> construct, not part of the
+whole character class. For example:
[01[:alpha:]%]
-matches one, zero, any alphabetic character, and the percentage sign.
+matches zero, one, any alphabetic character, and the percentage sign.
If the C<utf8> pragma is used, the following equivalences to Unicode
\p{} constructs hold:
alpha IsAlpha
alnum IsAlnum
ascii IsASCII
+ blank IsSpace
cntrl IsCntrl
digit IsDigit
graph IsGraph
For example C<[:lower:]> and C<\p{IsLower}> are equivalent.
If the C<utf8> pragma is not used but the C<locale> pragma is, the
-classes correlate with the isalpha(3) interface (except for `word',
-which is a Perl extension, mirroring C<\w>).
+classes correlate with the usual isalpha(3) interface (except for
+`word' and `blank').
The assumedly non-obviously named classes are:
Any control character. Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
backspace are control characters. All characters with ord() less than
-32 are most often classified as control characters.
+32 are most often classified as control characters (assuming ASCII,
+the ISO Latin character sets, and Unicode).
=item graph
-Any alphanumeric or punctuation character.
+Any alphanumeric or punctuation (special) character.
=item print
-Any alphanumeric or punctuation character or space.
+Any alphanumeric or punctuation (special) character or space.
=item punct
-Any punctuation character.
+Any punctuation (special) character.
=item xdigit
-Any hexadecimal digit. Though this may feel silly (/0-9a-f/i would
+Any hexadecimal digit. Though this may feel silly ([0-9A-Fa-f] would
work just fine) it is included for completeness.
=back
after the matched string.
The numbered variables ($1, $2, $3, etc.) and the related punctuation
-set (C<<$+>, C<$&>, C<$`>, and C<$'>) are all dynamically scoped
+set (C<$+>, C<$&>, C<$`>, and C<$'>) are all dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
interpreted as a literal character, not a metacharacter. This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
-use for a pattern. Simply quote all non-alphanumeric characters:
+use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
+(If C<use locale> is set, then this depends on the current locale.)
Today it is more common to use the quotemeta() function or the C<\Q>
metaquoting escape sequence to disable all metacharacters' special
meanings like this: