or C<\P{Katakana}>. Other sets are the Unicode blocks, the names
of which begin with "In". One such block is dedicated to mathematical
operators, and its pattern formula is C<\p{InMathematicalOperators}>.
-For the full list see L<perlunicode>.
+For the full list see L<perluniprops>.
+
+What we have described so far is the single form of the C<\p{...}> character
+classes. There is also a compound form which you may run into. These
+look like C<\p{name=value}> or C<\p{name:value}> (the equals sign and colon
+can be used interchangeably). These are more general than the single form,
+and in fact most of the single forms are just Perl-defined shortcuts for common
+compound forms. For example, the script examples in the previous paragraph
+could be written equivalently as C<\p{Script=Latin}>, C<\p{Script:Greek}>, and
+C<\P{script=katakana}> (case is irrelevant between the C<{}> braces). You may
+never have to use the compound forms, but sometimes they are necessary, and their
+use can make your code easier to understand.
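
As a quick check of the equivalence described above, the following minimal sketch compares each single form with its compound counterpart. The helper C<matches> and the sample characters are assumptions chosen for illustration; they are not part of the patch.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the whole string matches, else 0.
sub matches {
    my ($str, $re) = @_;
    return $str =~ /\A$re\z/ ? 1 : 0;
}

my $alpha = "\x{03B1}";    # GREEK SMALL LETTER ALPHA
my $ka    = "\x{30AB}";    # KATAKANA LETTER KA

my @results = (
    matches('a',    qr/\p{Latin}/),            # single form
    matches('a',    qr/\p{Script=Latin}/),     # compound form, equals sign
    matches($alpha, qr/\p{Greek}/),
    matches($alpha, qr/\p{Script:Greek}/),     # compound form, colon
    matches('a',    qr/\P{Katakana}/),         # negated single form
    matches('a',    qr/\P{script=katakana}/),  # case is irrelevant in braces
    matches($ka,    qr/\P{script=katakana}/),  # a Katakana letter is rejected
);
print "@results\n";
```

Each single/compound pair should agree, and only the last test, which applies the negated class to a Katakana letter, should report 0.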
C<\X> is an abbreviation for a character class that comprises
-the Unicode I<combining character sequences>. A combining character
-sequence is a base character followed by any number of diacritics, i.e.,
-signs like accents used to indicate different sounds of a letter. Using
-the Unicode full names, e.g., S<C<A + COMBINING RING>> is a combining
-character sequence with base character C<A> and combining character
-S<C<COMBINING RING>>, which translates in Danish to A with the circle
-atop it, as in the word Angstrom. C<\X> is equivalent to C<\PM\pM*}>,
-i.e., a non-mark followed by one or more marks.
+a Unicode I<extended grapheme cluster>. This represents a "logical character",
+what appears to be a single character, but may be represented internally by more
+than one. As an example, using the Unicode full names, S<C<A + COMBINING
+RING>> is a grapheme cluster with base character C<A> and combining character
+S<C<COMBINING RING>>, which translates in Danish to A with the circle atop it,
+as in the word Angstrom.
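
The difference between code points and grapheme clusters can be seen by counting both in the same string. This is only a sketch: the variable names are illustrative, and it assumes the combining character meant above is U+030A, whose full Unicode name is COMBINING RING ABOVE.

```perl
use strict;
use warnings;

# "Angstrom" spelled with a decomposed first letter:
# LATIN CAPITAL LETTER A followed by U+030A COMBINING RING ABOVE.
my $angstrom = "A\x{030A}ngstrom";

my @code_points = $angstrom =~ /./gs;   # one match per code point
my @clusters    = $angstrom =~ /\X/g;   # one match per grapheme cluster

printf "%d code points, %d grapheme clusters\n",
    scalar @code_points, scalar @clusters;

# The first cluster holds both the base character and the combining mark.
printf "first cluster has %d code points\n", length $clusters[0];
```

The string has nine code points but only eight clusters, because C<\X> consumes the base character and its combining mark together.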
For the full and latest information about Unicode see the latest
-Unicode standard, or the Unicode Consortium's website http://www.unicode.org/
+Unicode standard, or the Unicode Consortium's website L<http://www.unicode.org>.
As if all those classes weren't enough, Perl also defines POSIX style
character classes. These have the form C<[:name:]>, with C<name> the
while( $command = <> ){
$command =~ s/^\s+|\s+$//g; # trim leading and trailing spaces
if( ( @matches = $kwds =~ /\b$command\w*/g ) == 1 ){
- print "command: '$matches'\n";
+ print "command: '@matches'\n";
} elsif( @matches == 0 ){
print "no such command: '$command'\n";
} else {
This style of commenting has been largely superseded by the raw,
freeform commenting that is allowed with the C<//x> modifier.
-The modifiers C<//i>, C<//m>, C<//s>, C<//x> and C<//k> (or any
-combination thereof) can also embedded in
+The modifiers C<//i>, C<//m>, C<//s> and C<//x> (or any
+combination thereof) can also be embedded in
a regexp using C<(?i)>, C<(?m)>, C<(?s)>, and C<(?x)>. For instance,
/(?i)yes/; # match 'yes' case insensitively
}
}
-The second advantage is that embedded modifiers (except C<//k>, which
+The second advantage is that embedded modifiers (except C<//p>, which
modifies the entire regexp) only affect the regexp
inside the group the embedded modifier is contained in. So grouping
can be used to localize the modifier's effects:
have a word character up front and the same at its end, with another
palindrome in between.
- /(?: (\w) (?...Here be a palindrome...) \{-1} | \w? )/x
+ /(?: (\w) (?...Here be a palindrome...) \g{-1} | \w? )/x
-Adding C<\W*> at either end to eliminate was is to be ignored, we already
+Adding C<\W*> at either end to eliminate what is to be ignored, we already
have the full pattern:
my $pp = qr/^(\W* (?: (\w) (?1) \g{-1} | \w? ) \W*)$/ix;
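
A small sketch of how the pattern might be exercised follows. The wrapper C<is_palindrome> and the test strings are assumptions for illustration, and the recursion construct C<(?1)> and the relative backreference C<\g{-1}> require Perl 5.10 or later.

```perl
use strict;
use warnings;

my $pp = qr/^(\W* (?: (\w) (?1) \g{-1} | \w? ) \W*)$/ix;

# Hypothetical wrapper around the pattern: 1 for a palindrome, 0 otherwise.
sub is_palindrome {
    my ($s) = @_;
    return $s =~ $pp ? 1 : 0;
}

for my $s ( "saippuakauppias",
            "A man, a plan, a canal: Panama!",
            "not a palindrome" ) {
    my $verdict = is_palindrome($s) ? "is" : "is not";
    print "'$s' $verdict a palindrome\n";
}
```

The second string is accepted because C<\W*> skips punctuation and spaces at every level of the recursion and C<//i> ignores case.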
the C<ddd> isn't going to match the target string. But look at this
example:
- $x =~ /abc(?{print "Hi Mom!";})[d]dd/; # doesn't match,
- # but _does_ print
+ $x =~ /abc(?{print "Hi Mom!";})[dD]dd/; # doesn't match,
+ # but _does_ print
Hmm. What happened here? If you've been following along, you know that
-the above pattern should be effectively the same as the last one --
+the above pattern should be effectively (almost) the same as the last one --
enclosing the d in a character class isn't going to change what it
matches. So why does the first not print while the second one does?
C<?{}> construct). It's smart enough to realize that the string 'ddd'
doesn't occur in our target string before actually running the pattern
through. But in the second case, we've tricked it into thinking that our
-pattern is more complicated than it is. It takes a look, sees our
+pattern is more complicated. It takes a look, sees our
character class, and decides that it will have to actually run the
pattern to determine whether or not it matches, and in the process of
running it hits the print statement before it discovers that we don't