\l Lowercase next character.
\L Lowercase till \E.
\n (Logical) newline character.
+ \N Any character but newline.
\N{} Named (Unicode) character.
- \p{}, \pP Character with a Unicode property.
- \P{}, \PP Character without a Unicode property.
+ \p{}, \pP Character with the given Unicode property.
+ \P{}, \PP Character without the given Unicode property.
\Q Quotemeta till \E.
\r Return character.
\R Generic new line.
\w Character class for word characters.
\W Character class for non-word characters.
\x{}, \x00 Hexadecimal escape sequence.
- \X Extended Unicode "combining character sequence".
+ \X Unicode "extended grapheme cluster".
\z End of string.
\Z End of string, or before a trailing newline.
Octal escapes consist of a backslash followed by two or three octal digits
matching the code point of the character you want to use. This allows for
-522 characters (C<\00> up to C<\777>) that can be expressed this way.
+512 characters (C<\00> up to C<\777>) that can be expressed this way.
Enough in pre-Unicode days, but most Unicode characters cannot be escaped
this way.
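
As a rough illustration of the range limit described above, the following sketch checks one octal escape near each end of the range and one code point that needs a different notation. The helper name C<escape_matches> is hypothetical and exists only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the compiled pattern matches $char.
sub escape_matches {
    my ($char, $pattern) = @_;
    return $char =~ $pattern ? 1 : 0;
}

# \101 is octal for code point 65, the letter "A".
my $octal_ok = escape_matches("A", qr/\101/);

# \777 is the largest escape of this form: code point 511.
my $max_ok = escape_matches(chr(0777), qr/\777/);

# Code points above 511 need another notation, such as \x{...}.
my $hex_ok = escape_matches(chr(0x263A), qr/\x{263A}/);

print "octal=$octal_ok max=$max_ok hex=$hex_ok\n";
```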
C<\w> is a character class that matches any I<word> character (letters,
digits, underscore). C<\d> is a character class that matches any digit,
while the character class C<\s> matches any white space character.
-New in perl 5.10 are the classes C<\h> and C<\v> which match horizontal
+New in perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal
and vertical white space characters.
The uppercase variants (C<\W>, C<\D>, C<\S>, C<\H>, and C<\V>) are
=head3 Relative referencing
-New in perl 5.10 is different way of referring to capture buffers: C<\g>.
+New in perl 5.10.0 is a different way of referring to capture buffers: C<\g>.
C<\g> takes a number as argument, with the number in curly braces (the
braces are optional). If the number (N) does not have a sign, it's a reference
to the Nth capture group (so C<\g{2}> is equivalent to C<\2> - except that
=head3 Named referencing
-Also new in perl 5.10 is the use of named capture buffers, which can be
+Also new in perl 5.10.0 is the use of named capture buffers, which can be
referred to by name. This is done with C<\g{name}>, which is a
backreference to the capture buffer with the name I<name>.
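
A minimal sketch of a named backreference, assuming the capture buffer is declared with the C<< (?<name>...) >> syntax. The sub name C<first_doubled_char> and the buffer name C<char> are made up for illustration.

```perl
use strict;
use warnings;

# Returns the first doubled word character in $text, or undef if none.
sub first_doubled_char {
    my ($text) = @_;
    # (?<char>\w) names the capture buffer "char";
    # \g{char} must then match the same text again.
    return $text =~ /(?<char>\w)\g{char}/ ? $+{char} : undef;
}

my $found = first_doubled_char("bookkeeper");
print defined $found ? "doubled: $found\n" : "no doubled character\n";
```

The named form keeps working if more capture groups are later added in front of it, which is the main reason to prefer it over a numbered reference.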
=item \K
-This is new in perl 5.10. Anything that is matched left of C<\K> is
+This is new in perl 5.10.0. Anything that is matched left of C<\K> is
not included in C<$&> - and will not be replaced if the pattern is
used in a substitution. This will allow you to write C<s/PAT1 \K PAT2/REPL/x>
instead of C<s/(PAT1) PAT2/${1}REPL/x> or C<s/(?<=PAT1) PAT2/REPL/x>.
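
The three spellings mentioned above can be compared side by side. This is only a sketch: C<foo> and C<bar> stand in for PAT1 and PAT2, C<BAZ> stands in for REPL, and the sub names are hypothetical.

```perl
use strict;
use warnings;

# Keep "foo" out of the replaced text with \K.
sub with_keep {
    my ($text) = @_;
    $text =~ s/foo \K bar/BAZ/x;
    return $text;
}

# Capture "foo" and put it back by hand.
sub with_capture {
    my ($text) = @_;
    $text =~ s/(foo) bar/${1}BAZ/x;
    return $text;
}

# Require "foo" before the match with a lookbehind.
sub with_lookbehind {
    my ($text) = @_;
    $text =~ s/(?<=foo) bar/BAZ/x;
    return $text;
}

print join(" ", map { $_->("foobar") }
    \&with_keep, \&with_capture, \&with_lookbehind), "\n";
```

Because of the C</x> modifier, the spaces in the patterns are ignored, so each pattern matches C<foobar> without intervening white space.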
the newline used in Windows text files). C<\R> is equivalent to
C<< (?>\x0D\x0A|\v) >>. Since C<\R> can match more than one character,
it cannot be put inside a bracketed character class; C</[\R]/> is an error.
-C<\R> is introduced in perl 5.10.
+C<\R> was introduced in perl 5.10.0.
-Mnemonic: none really. C<\R> was picked because PCRE already uses C<\R>.
+Mnemonic: none really. C<\R> was picked because PCRE already uses C<\R>,
+and more importantly because Unicode recommends such a regular expression
+metacharacter, and suggests C<\R> as the notation.
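
One possible use of C<\R>, shown here as a sketch, is splitting text that mixes line-ending conventions. The sub name C<split_lines> and the sample string are assumptions made for this example.

```perl
use strict;
use warnings;

# Splits $text on any generic newline matched by \R.
sub split_lines {
    my ($text) = @_;
    return split /\R/, $text;
}

# CRLF, LF and a bare CR all separate lines here.
my @lines = split_lines("one\x0D\x0Atwo\x0Athree\x0Dfour");
print scalar(@lines), " lines: @lines\n";
```

The CRLF pair is consumed as a single separator, so no empty field appears between C<one> and C<two>.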
=item \X
-This matches an extended Unicode I<combining character sequence>, and
-is equivalent to C<< (?>\PM\pM*) >>. C<\PM> matches any character that is
-not considered a Unicode mark character, while C<\pM> matches any character
-that is considered a Unicode mark character; so C<\X> matches any non
-mark character followed by zero or more mark characters. Mark characters
-include (but are not restricted to) I<combining characters> and
-I<vowel signs>.
+This matches a Unicode I<extended grapheme cluster>.
+
+C<\X> matches quite well what normal (non-Unicode-programmer) usage
+would consider a single character. As an example, consider a G with some sort
+of diacritic mark, such as an arrow. There is no such single character in
+Unicode, but one can be composed using a G followed by a Unicode "COMBINING
+UPWARDS ARROW BELOW", which Unicode-aware software would display as if it
+were a single character.
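
The G example above can be sketched as follows. The sub name C<grapheme_clusters> is hypothetical, and the combining character is referenced by name through C<\N{}> rather than by code point.

```perl
use strict;
use warnings;
use charnames ':full';

# Returns the list of substrings that \X treats as single clusters.
sub grapheme_clusters {
    my ($text) = @_;
    my @clusters = $text =~ /(\X)/g;
    return @clusters;
}

# A G followed by a combining mark: two code points, one cluster.
my $g_arrow  = "G\N{COMBINING UPWARDS ARROW BELOW}";
my @clusters = grapheme_clusters($g_arrow);

printf "%d code points, %d cluster(s)\n",
    length($g_arrow), scalar @clusters;
```

Here C<length> counts code points, while the number of C<\X> matches is closer to what a reader would count as characters.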
Mnemonic: eI<X>tended Unicode character.