\PP Match non-P
\X Match eXtended Unicode "combining character sequence",
equivalent to C<(?:\PM\pM*)>
- \C Match a single C char (octet) even under utf8.
+ \C Match a single C char (octet) even under Unicode.
+ B<NOTE:> breaks up characters into their UTF-8 bytes,
+ so you may end up with malformed pieces of UTF-8.
A C<\w> matches a single alphanumeric character or C<_>, not a whole word.
Use C<\w+> to match a string of Perl-identifier characters (which isn't
current locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
C<\d>, and C<\D> within character classes, but if you try to use them
as endpoints of a range, that's not a range, the "-" is understood literally.
-See L<utf8> for details about C<\pP>, C<\PP>, and C<\X>.
+See L<perlunicode> for details about C<\pP>, C<\PP>, and C<\X>.
The POSIX character class syntax
matches zero, one, any alphabetic character, and the percentage sign.
-If the C<utf8> pragma is used, the following equivalences to Unicode
-\p{} constructs and equivalent backslash character classes (if available),
-will hold:
+The following equivalences to Unicode \p{} constructs and equivalent
+backslash character classes (if available) will hold:
+
+ [:...:] \p{...} backslash
alpha IsAlpha
alnum IsAlnum
You can negate the [::] character classes by prefixing the class name
with a '^'. This is a Perl extension. For example:
- POSIX trad. Perl utf8 Perl
+ POSIX traditional Unicode
[:^digit:] \D \P{IsDigit}
[:^space:] \S \P{IsSpace}
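
As a reader's aid for the negation table in this hunk, here is a minimal sketch of the first two columns in use. It is not part of the patch. The sample string and all variable names are hypothetical, and the sample is deliberately limited to ASCII letters, digits, a space, and a tab so that edge cases where C<\s> and C<[:space:]> might differ do not arise.

```perl
use strict;
use warnings;

# Hypothetical sample: ASCII letters, digits, one space, one tab.
my $sample = "ab 12\tcd";

# Delete every non-digit, once with the negated POSIX class and
# once with the traditional backslash class.
(my $posix_digits = $sample) =~ s/[[:^digit:]]//g;
(my $trad_digits  = $sample) =~ s/\D//g;

# Collect runs of non-whitespace with each spelling.
my @posix_words = $sample =~ /([[:^space:]]+)/g;
my @trad_words  = $sample =~ /(\S+)/g;

print "$posix_digits\n";    # prints "12"
print "@posix_words\n";     # prints "ab 12 cd"
```

Note that the POSIX form is only valid inside a bracketed character class, hence the doubled brackets in C<[[:^digit:]]>. The C<\P{...}> column of the table is not exercised here.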
with braces: C<\x{ab}>. Note that this is different than C<\xab>,
which is just a hexadecimal byte with no Unicode significance.
-B<NOTE>: in perl 5.6.0 it used to be that one needed to say C<use utf8>
-to use any Unicode features. This is no more the case: for almost all
-Unicode processing, the explicit C<utf8> pragma is not needed.
-(The only case where it matters is if your Perl script is in Unicode,
-that is, encoded in UTF-8/UTF-16/UTF-EBCDIC: then an explicit C<use utf8>
-is needed.)
+B<NOTE>: in Perl 5.6.0 it used to be that one needed to say C<use
+utf8> to use any Unicode features. This is no longer the case: for
+almost all Unicode processing, the explicit C<utf8> pragma is not
+needed. (The only case where it matters is if your Perl script is in
+Unicode and encoded in UTF-8; then an explicit C<use utf8> is needed.)
Figuring out the hexadecimal sequence of a Unicode character you want
or deciphering someone else's hexadecimal Unicode regexp is about as
=head1 SEE ALSO
-L<encoding>, L<Encode>, L<open>, L<bytes>, L<utf8>, L<perlretut>,
-L<perlvar/"${^WIDE_SYSTEM_CALLS}">
+L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
+L<perlretut>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">
=cut