pod/perlrecharclass.pod

   1 =head1 NAME
   2 X<character class>
   3
   4 perlrecharclass - Perl Regular Expression Character Classes
   5
   6 =head1 DESCRIPTION
   7
   8 The top level documentation about Perl regular expressions
   9 is found in L<perlre>.
  10
  11 This manual page discusses the syntax and use of character
  12 classes in Perl Regular Expressions.
  13
  14 A character class is a way of denoting a set of characters,
  15 in such a way that one character of the set is matched.
  16 It's important to remember that matching a character class
  17 consumes exactly one character in the source string. (The source
  18 string is the string the regular expression is matched against.)
  19
  20 There are three types of character classes in Perl regular
  21 expressions: the dot, backslashed sequences, and the form enclosed in square
  22 brackets.  Keep in mind, though, that often the term "character class" is used
  23 to mean just the bracketed form.  This is true in other Perl documentation.
  24
  25 =head2 The dot
  26
  27 The dot (or period), C<.>, is probably the most used, and certainly
  28 the best-known, character class. By default, a dot matches any
  29 character, except for the newline. The default can be changed to
  30 add matching the newline with the I<single line> modifier: either
  31 for the entire regular expression using the C</s> modifier, or
  32 locally using C<(?s)>.
  33
  34 Here are some examples:
  35
  36  "a"  =~  /./       # Match
  37  "."  =~  /./       # Match
  38  ""   =~  /./       # No match (dot has to match a character)
  39  "\n" =~  /./       # No match (dot does not match a newline)
  40  "\n" =~  /./s      # Match (global 'single line' modifier)
  41  "\n" =~  /(?s:.)/  # Match (local 'single line' modifier)
  42  "ab" =~  /^.$/     # No match (dot matches one character)
  43
  44 =head2 Backslashed sequences
  45 X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P>
  46 X<\N> X<\v> X<\V> X<\h> X<\H>
  47 X<word> X<whitespace>
  48
  49 Perl regular expressions contain many backslashed sequences that
  50 constitute a character class. That is, they will match a single
  51 character, if that character belongs to a specific set of characters
  52 (defined by the sequence). A backslashed sequence is a sequence of
  53 characters starting with a backslash. Not all backslashed sequences
  54 are character classes; for a full list, see L<perlrebackslash>.
  55
  56 Here's a list of the backslashed sequences that are character classes.  They
  57 are discussed in more detail below.
  58
  59  \d             Match a digit character.
  60  \D             Match a non-digit character.
  61  \w             Match a "word" character.
  62  \W             Match a non-"word" character.
  63  \s             Match a whitespace character.
  64  \S             Match a non-whitespace character.
  65  \h             Match a horizontal whitespace character.
  66  \H             Match a character that isn't horizontal whitespace.
  67  \N             Match a character that isn't a newline.  Experimental.
  68  \v             Match a vertical whitespace character.
  69  \V             Match a character that isn't vertical whitespace.
  70  \pP, \p{Prop}  Match a character matching a Unicode property.
  71  \PP, \P{Prop}  Match a character that doesn't match a Unicode property.
  72
  73 =head3 Digits
  74
  75 C<\d> matches a single character that is considered to be a I<digit>.  What is
  76 considered a digit depends on the internal encoding of the source string and
  77 the locale that is in effect. If the source string is in UTF-8 format, C<\d>
  78 not only matches the digits '0' - '9', but also Arabic, Devanagari and digits
  79 from other scripts. Otherwise, if there is a locale in effect, it will match
  80 whatever characters the locale considers digits. Without a locale, C<\d>
  81 matches the digits '0' to '9'.  See L</Locale, EBCDIC, Unicode and UTF-8>.
  82
  83 Any character that isn't matched by C<\d> will be matched by C<\D>.
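
The difference between C<\d> and an explicit ASCII range can be seen with a
small sketch. It assumes a string that is in UTF-8 format because it holds a
code point above 255; the variable names are only illustrative.

```perl
use strict;
use warnings;

# U+0663 is ARABIC-INDIC DIGIT THREE.  Because the code point is above
# 255, Perl stores this string in UTF-8 format.
my $arabic_three = "\x{0663}";

my $matches_d     = $arabic_three =~ /\d/    ? 1 : 0;  # a Unicode digit
my $matches_ascii = $arabic_three =~ /[0-9]/ ? 1 : 0;  # not an ASCII digit
my $matches_D     = $arabic_three =~ /\D/    ? 1 : 0;  # \D is the complement

print "\\d: $matches_d, [0-9]: $matches_ascii, \\D: $matches_D\n";
```

If only the ten ASCII digits are wanted, the explicit range C<[0-9]> is the
safer choice.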
  84
  85 =head3 Word characters
  86
  87 A C<\w> matches a single alphanumeric character (an alphabetic character, or a
  88 decimal digit) or an underscore (C<_>), not a whole word.  Use C<\w+> to match
  89 a string of Perl-identifier characters (which isn't the same as matching an
  90 English word).  What is considered a word character depends on the internal
  91 encoding of the string and the locale or EBCDIC code page that is in effect. If
  92 it's in UTF-8 format, C<\w> matches those characters that are considered word
  93 characters in the Unicode database. That is, it not only matches ASCII letters,
  94 but also Thai letters, Greek letters, etc.  If the source string isn't in UTF-8
  95 format, C<\w> matches those characters that are considered word characters by
  96 the current locale or EBCDIC code page.  Without a locale or EBCDIC code page,
  97 C<\w> matches the ASCII letters, digits and the underscore.
  98 See L</Locale, EBCDIC, Unicode and UTF-8>.
  99
 100 Any character that isn't matched by C<\w> will be matched by C<\W>.
 101
 102 =head3 Whitespace
 103
 104 C<\s> matches any single character that is considered whitespace. In the ASCII
 105 range, C<\s> matches the horizontal tab (C<\t>), the new line (C<\n>), the form
 106 feed (C<\f>), the carriage return (C<\r>), and the space.  (The vertical tab,
 107 C<\cK>, is not matched by C<\s>.)  The exact set of characters matched by C<\s>
 108 depends on whether the source string is in UTF-8 format and the locale or
 109 EBCDIC code page that is in effect. If it's in UTF-8 format, C<\s> matches what
 110 is considered whitespace in the Unicode database; the complete list is in the
 111 table below. Otherwise, if there is a locale or EBCDIC code page in effect,
 112 C<\s> matches whatever is considered whitespace by the current locale or EBCDIC
 113 code page. Without a locale or EBCDIC code page, C<\s> matches the five
 114 characters mentioned in the beginning of this paragraph.  Perhaps the most
 115 notable possible surprise is that C<\s> matches a non-breaking space only if
 116 the non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC
 117 code page that is in effect has that character.
 118 See L</Locale, EBCDIC, Unicode and UTF-8>.
 119
 120 Any character that isn't matched by C<\s> will be matched by C<\S>.
 121
 122 C<\h> will match any character that is considered horizontal whitespace;
 123 this includes the space and the tab characters and 17 other characters that are
 124 listed in the table below. C<\H> will match any character
 125 that is not considered horizontal whitespace.
 126
 127 C<\N> is new in 5.12, and is experimental.  It, like the dot, will match any
 128 character that is not a newline. The difference is that C<\N> will not be
 129 influenced by the single line C</s> regular expression modifier. (Note that
 130 there is a second meaning of C<\N> when it has the form C<\N{...}>.  This form is
 131 for named characters.  See L<charnames> for those.  If C<\N> is followed by an
 132 opening brace and something that is not a quantifier, perl will assume that a
 133 character name is coming, and not this meaning of C<\N>.  For example, C<\N{3}>
 134 means to match 3 non-newlines; C<\N{5,}> means to match 5 or more non-newlines,
 135 but C<\N{4F}> and C<\N{F4}> are not legal quantifiers, and will cause perl to
 136 look for characters named C<4F> or C<F4>, respectively (and won't find them,
 137 thus raising an error, unless they have been defined using custom names).)
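
The following sketch contrasts C<\N> with the dot under C</s>, and shows the
quantified form. It assumes Perl 5.12 or later; the sample strings and
variable names are made up for illustration.

```perl
use strict;
use warnings;

my $newline = "\n";

my $dot_s = $newline =~ /./s  ? 1 : 0;  # /s lets the dot match a newline
my $N_s   = $newline =~ /\N/s ? 1 : 0;  # \N still refuses the newline

# \N{3} is \N with a quantifier: exactly three non-newline characters.
my ($first_three) = "abcd\nefg" =~ /^(\N{3})/;

# \N+ stops at the first newline, even under /s.
my ($first_line) = "abcd\nefg" =~ /^(\N+)/s;

print "dot: $dot_s, \\N: $N_s, three: $first_three, line: $first_line\n";
```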
 138
 139 C<\v> will match any character that is considered vertical whitespace;
 140 this includes the carriage return and line feed characters (newline) plus 5
 141 other characters listed in the table below.
 142 C<\V> will match any character that is not considered vertical whitespace.
 143
 144 C<\R> matches anything that can be considered a newline under Unicode
 145 rules. It's not a character class, as it can match a multi-character
 146 sequence. Therefore, it cannot be used inside a bracketed character
 147 class; use C<\v> instead (vertical whitespace).
 148 Details are discussed in L<perlrebackslash>.
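
As a short sketch of why C<\R> is not a character class, compare it with
C<\v> on a carriage return and line feed pair (the variable names are
illustrative only):

```perl
use strict;
use warnings;

my $crlf = "\r\n";

# \R consumes the two-character sequence CR LF as one unit.
my $R_whole = $crlf =~ /\A\R\z/    ? 1 : 0;

# \v is a true character class, so one \v matches one character only.
my $v_once  = $crlf =~ /\A\v\z/    ? 1 : 0;
my $v_twice = $crlf =~ /\A\v\v\z/  ? 1 : 0;

print "\\R: $R_whole, \\v: $v_once, \\v\\v: $v_twice\n";
```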
 149
 150 Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
 151 the same characters, regardless of whether the source string is in UTF-8
 152 format or not. The set of characters they match is also not influenced
 153 by locale or EBCDIC code page.
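
A minimal sketch of this stability, assuming an ASCII-based (non-EBCDIC)
platform: the same no-break space is tested once as a native string and once
after being upgraded to UTF-8 format.

```perl
use strict;
use warnings;

my $nbsp_native = "\xA0";          # NO-BREAK SPACE, not in UTF-8 format
my $nbsp_utf8   = $nbsp_native;
utf8::upgrade($nbsp_utf8);         # same character, now in UTF-8 format

my $h_native = $nbsp_native =~ /\h/ ? 1 : 0;
my $h_utf8   = $nbsp_utf8   =~ /\h/ ? 1 : 0;

my $nel_is_v   = "\x85" =~ /\v/ ? 1 : 0;  # NEXT LINE is vertical whitespace
my $vt_is_v    = "\x0b" =~ /\v/ ? 1 : 0;  # so is the vertical tab
my $space_is_v = " "    =~ /\v/ ? 1 : 0;  # a space is horizontal only

print "$h_native $h_utf8 $nel_is_v $vt_is_v $space_is_v\n";
```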
 154
 155 One might think that C<\s> is equivalent to C<[\h\v]>. This is not true.  The
 156 vertical tab (C<"\x0b">) is not matched by C<\s>, although it is considered
 157 vertical whitespace. Furthermore, if the source string is not in UTF-8 format,
 158 and any locale or EBCDIC code page that is in effect doesn't include them, the
 159 next line (C<"\x85">) and the no-break space (C<"\xA0">) characters are not
 160 matched by C<\s>, but are by C<\v> and C<\h> respectively.  If the source
 161 string is in UTF-8 format, both the next line and the no-break space are
 162 matched by C<\s>.
 163
 164 The following table is a complete listing of characters matched by
 165 C<\s>, C<\h> and C<\v> as of Unicode 5.2.
 166
 167 The first column gives the code point of the character (in hex format),
 168 the second column gives the (Unicode) name. The third column indicates
 169 by which class(es) the character is matched (assuming no locale or EBCDIC code
 170 page is in effect that changes the C<\s> matching).
 171
 172  0x00009        CHARACTER TABULATION   h s
 173  0x0000a              LINE FEED (LF)    vs
 174  0x0000b             LINE TABULATION    v
 175  0x0000c              FORM FEED (FF)    vs
 176  0x0000d        CARRIAGE RETURN (CR)    vs
 177  0x00020                       SPACE   h s
 178  0x00085             NEXT LINE (NEL)    vs  [1]
 179  0x000a0              NO-BREAK SPACE   h s  [1]
 180  0x01680            OGHAM SPACE MARK   h s
 181  0x0180e   MONGOLIAN VOWEL SEPARATOR   h s
 182  0x02000                     EN QUAD   h s
 183  0x02001                     EM QUAD   h s
 184  0x02002                    EN SPACE   h s
 185  0x02003                    EM SPACE   h s
 186  0x02004          THREE-PER-EM SPACE   h s
 187  0x02005           FOUR-PER-EM SPACE   h s
 188  0x02006            SIX-PER-EM SPACE   h s
 189  0x02007                FIGURE SPACE   h s
 190  0x02008           PUNCTUATION SPACE   h s
 191  0x02009                  THIN SPACE   h s
 192  0x0200a                  HAIR SPACE   h s
 193  0x02028              LINE SEPARATOR    vs
 194  0x02029         PARAGRAPH SEPARATOR    vs
 195  0x0202f       NARROW NO-BREAK SPACE   h s
 196  0x0205f   MEDIUM MATHEMATICAL SPACE   h s
 197  0x03000           IDEOGRAPHIC SPACE   h s
 198
 199 =over 4
 200
 201 =item [1]
 202
 203 NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in
 204 UTF-8 format, or the locale or EBCDIC code page that is in effect includes them.
 205
 206 =back
 207
 208 It is worth noting that C<\d>, C<\w>, etc, match single characters, not
 209 complete numbers or words. To match a number (that consists of digits),
 210 use C<\d+>; to match a word, use C<\w+>.
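
For example, a sketch that pulls the numbers and words out of a made-up
sample string with the C</g> modifier in list context:

```perl
use strict;
use warnings;

my $text = "Order 66 shipped 1138 units";

my @numbers = $text =~ /\d+/g;   # runs of digits
my @words   = $text =~ /\w+/g;   # runs of word characters
my @digits  = "66"  =~ /\d/g;    # without +, one character per match

print "numbers: @numbers\n";
print "words:   @words\n";
print "digits:  @digits\n";
```

Note that C<\w+> also picks up C<66> and C<1138>, since digits are word
characters.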
 211
 212
 213 =head3 Unicode Properties
 214
 215 C<\pP> and C<\p{Prop}> are character classes to match characters that
 216 fit given Unicode classes. One-letter classes can be used in the C<\pP>
 217 form, with the class name following the C<\p>; otherwise, braces are required.
 218 There is a single form, which is just the property name enclosed in the braces,
 219 and a compound form which looks like C<\p{name=value}>, which means to match
 220 if the property "name" for the character has the particular "value".
 221 For instance, a match for a number can be written as C</\pN/> or as
 222 C</\p{Number}/>, or as C</\p{Number=True}/>.
 223 Lowercase letters are matched by the property I<Lowercase_Letter>, whose
 224 short form is I<Ll>. They need the braces, so are written as C</\p{Ll}/> or
 225 C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/>
 226 (the underscores are optional).
 227 C</\pLl/> is valid, but means something different.
 228 It matches a two character string: a letter (Unicode property C<\pL>),
 229 followed by a lowercase C<l>.
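
The distinction between the braced and unbraced forms can be sketched as
follows (the test strings are arbitrary):

```perl
use strict;
use warnings;

# \pLl is \pL (any letter) followed by a literal "l".
my $pLl_on_Al = "Al" =~ /\A\pLl\z/ ? 1 : 0;
my $pLl_on_a  = "a"  =~ /\pLl/     ? 1 : 0;  # needs two characters

# \p{Ll} is a single lowercase letter, in any of its spellings.
my $short_on_a = "a" =~ /\A\p{Ll}\z/                                ? 1 : 0;
my $short_on_A = "A" =~ /\A\p{Ll}\z/                                ? 1 : 0;
my $long_on_a  = "a" =~ /\A\p{Lowercase_Letter}\z/                  ? 1 : 0;
my $full_on_a  = "a" =~ /\A\p{General_Category=Lowercase_Letter}\z/ ? 1 : 0;

print "$pLl_on_Al $pLl_on_a $short_on_a $short_on_A $long_on_a $full_on_a\n";
```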
 230
 231 For more details, see L<perlunicode/Unicode Character Properties>; for a
 232 complete list of possible properties, see
 233 L<perluniprops/Properties accessible through \p{} and \P{}>.
 234 It is also possible to define your own properties. This is discussed in
 235 L<perlunicode/User-Defined Character Properties>.
 236
 237
 238 =head4 Examples
 239
 240  "a"  =~  /\w/      # Match, "a" is a 'word' character.
 241  "7"  =~  /\w/      # Match, "7" is a 'word' character as well.
 242  "a"  =~  /\d/      # No match, "a" isn't a digit.
 243  "7"  =~  /\d/      # Match, "7" is a digit.
 244  " "  =~  /\s/      # Match, a space is whitespace.
 245  "a"  =~  /\D/      # Match, "a" is a non-digit.
 246  "7"  =~  /\D/      # No match, "7" is not a non-digit.
 247  " "  =~  /\S/      # No match, a space is not non-whitespace.
 248
 249  " "  =~  /\h/      # Match, space is horizontal whitespace.
 250  " "  =~  /\v/      # No match, space is not vertical whitespace.
 251  "\r" =~  /\v/      # Match, a return is vertical whitespace.
 252
 253  "a"  =~  /\pL/     # Match, "a" is a letter.
 254  "a"  =~  /\p{Lu}/  # No match, /\p{Lu}/ matches upper case letters.
 255
 256  "\x{0e0b}" =~ /\p{Thai}/  # Match, \x{0e0b} is the character
 257                            # 'THAI CHARACTER SO SO', and that's in
 258                            # the Thai Unicode class.
 259  "a"  =~  /\P{Lao}/ # Match, as "a" is not a Lao character.
 260
 261
 262 =head2 Bracketed Character Classes
 263
 264 The third form of character class you can use in Perl regular expressions
 265 is the bracketed form. In its simplest form, it lists the characters
 266 that may be matched inside square brackets, like this: C<[aeiou]>.
 267 This matches one of C<a>, C<e>, C<i>, C<o> or C<u>.  Like the other
 268 character classes, exactly one character will be matched. To match
 269 a longer string consisting of characters mentioned in the character
 270 class, follow the character class with a quantifier. For instance,
 271 C<[aeiou]+> matches a string of one or more lowercase ASCII vowels.
 272
 273 Repeating a character in a character class has no
 274 effect; it's considered to be in the set only once.
 275
 276 Examples:
 277
 278  "e"  =~  /[aeiou]/        # Match, as "e" is listed in the class.
 279  "p"  =~  /[aeiou]/        # No match, "p" is not listed in the class.
 280  "ae" =~  /^[aeiou]$/      # No match, a character class only matches
 281                            # a single character.
 282  "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.
 283
 284 =head3 Special Characters Inside a Bracketed Character Class
 285
 286 Most characters that are meta characters in regular expressions (that
 287 is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose
 288 their special meaning and can be used inside a character class without
 289 the need to escape them. For instance, C<[()]> matches either an opening
 290 parenthesis, or a closing parenthesis, and the parens inside the character
 291 class don't group or capture.
 292
 293 Characters that may carry a special meaning inside a character class are:
 294 C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be
 295 escaped with a backslash, although this is sometimes not needed, in which
 296 case the backslash may be omitted.
 297
 298 The sequence C<\b> is special inside a bracketed character class. While
 299 outside the character class C<\b> is an assertion indicating a point
 300 that does not have either two word characters or two non-word characters
 301 on either side, inside a bracketed character class, C<\b> matches a
 302 backspace character.
 303
 304 The sequences
 305 C<\a>,
 306 C<\c>,
 307 C<\e>,
 308 C<\f>,
 309 C<\n>,
 310 C<\N{I<NAME>}>,
 311 C<\N{U+I<wide hex char>}>,
 312 C<\r>,
 313 C<\t>,
 314 and
 315 C<\x>
 316 are also special and have the same meanings as they do outside a bracketed character
 317 class.
 318
 319 Also, a backslash followed by two or three octal digits is considered an octal
 320 number.
 321
 322 A C<[> is not special inside a character class, unless it's the start
 323 of a POSIX character class (see below). It normally does not need escaping.
 324
 325 A C<]> is either the end of a POSIX character class (see below), or it
 326 signals the end of the bracketed character class. Normally it needs
 327 escaping if you want to include a C<]> in the set of characters.
 328 However, if the C<]> is the I<first> (or the second if the first
 329 character is a caret) character of a bracketed character class, it
 330 does not denote the end of the class (as you cannot have an empty class)
 331 and is considered part of the set of characters that can be matched without
 332 escaping.
 333
 334 Examples:
 335
 336  "+"   =~ /[+?*]/     #  Match, "+" in a character class is not special.
 337  "\cH" =~ /[\b]/      #  Match, \b inside a character class
 338                       #  is equivalent to a backspace.
 339  "]"   =~ /[][]/      #  Match, as the character class contains
 340                       #  both [ and ].
 341  "[]"  =~ /[[]]/      #  Match, the pattern contains a character class
 342                       #  containing just [, and the character class is
 343                       #  followed by a ].
 344
 345 =head3 Character Ranges
 346
 347 It is not uncommon to want to match a range of characters. Luckily, instead
 348 of listing all the characters in the range, one may use the hyphen (C<->).
 349 If inside a bracketed character class you have two characters separated
 350 by a hyphen, it's treated as if all the characters between the two are in
 351 the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]>
 352 matches any lowercase letter from the first half of the ASCII alphabet.
 353
 354 Note that the two characters on either side of the hyphen are not
 355 necessarily both letters or both digits. Any character is possible,
 356 although not advisable.  C<['-?]> contains a range of characters, but
 357 most people will not know which characters that will be. Furthermore,
 358 such ranges may lead to portability problems if the code has to run on
 359 a platform that uses a different character set, such as EBCDIC.
 360
 361 If a hyphen in a character class cannot syntactically be part of a range, for
 362 instance because it is the first or the last character of the character class,
 363 or if it immediately follows a range, the hyphen isn't special, and will be
 364 considered a character that may be matched. You have to escape the hyphen with
 365 a backslash if you want to have a hyphen in your set of characters to be
 366 matched, and its position in the class is such that it could be considered part
 367 of a range.
 368
 369 Examples:
 370
 371  [a-z]       #  Matches a character that is a lower case ASCII letter.
 372  [a-fz]      #  Matches any letter between 'a' and 'f' (inclusive) or the
 373              #  letter 'z'.
 374  [-z]        #  Matches either a hyphen ('-') or the letter 'z'.
 375  [a-f-m]     #  Matches any letter between 'a' and 'f' (inclusive), the
 376              #  hyphen ('-'), or the letter 'm'.
 377  ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
 378              #  (But not on an EBCDIC platform).
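
The hyphen rules above can be checked with a short sketch; each pattern is
anchored so that it tests exactly one character.

```perl
use strict;
use warnings;

# In [a-f-m] the second hyphen follows a range, so it is a literal hyphen.
my $hyphen_after_range = "-" =~ /\A[a-f-m]\z/ ? 1 : 0;
my $g_in_a_f_m         = "g" =~ /\A[a-f-m]\z/ ? 1 : 0;  # no f-m range exists

# A leading hyphen is literal; so is an escaped one.
my $leading_hyphen = "-" =~ /\A[-z]\z/   ? 1 : 0;
my $escaped_hyphen = "-" =~ /\A[a\-z]\z/ ? 1 : 0;

# [a\-z] is the three characters a, - and z; [a-z] is the whole range.
my $m_in_escaped = "m" =~ /\A[a\-z]\z/ ? 1 : 0;
my $m_in_range   = "m" =~ /\A[a-z]\z/  ? 1 : 0;

print "$hyphen_after_range $g_in_a_f_m $leading_hyphen ",
      "$escaped_hyphen $m_in_escaped $m_in_range\n";
```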
 379
 380
 381 =head3 Negation
 382
 383 It is also possible to instead list the characters you do not want to
 384 match. You can do so by using a caret (C<^>) as the first character in the
 385 character class. For instance, C<[^a-z]> matches a character that is not a
 386 lowercase ASCII letter.
 387
 388 This syntax makes the caret a special character inside a bracketed character
 389 class, but only if it is the first character of the class. So if you want
 390 to have the caret as one of the characters you want to match, you either
 391 have to escape the caret, or not list it first.
 392
 393 Examples:
 394
 395  "e"  =~  /[^aeiou]/   #  No match, the 'e' is listed.
 396  "x"  =~  /[^aeiou]/   #  Match, as 'x' isn't a lowercase vowel.
 397  "^"  =~  /[^^]/       #  No match, matches anything that isn't a caret.
 398  "^"  =~  /[x^]/       #  Match, caret is not special here.
 399
 400 =head3 Backslash Sequences
 401
 402 You can put any backslash sequence character class (with the exception of
 403 C<\N>) inside a bracketed character class, and it will act just
 404 as if you put all the characters matched by the backslash sequence inside the
 405 character class. For instance, C<[a-f\d]> will match any digit, or any of the
 406 lowercase letters between 'a' and 'f' inclusive.
 407
 408 C<\N> within a bracketed character class must be of the forms C<\N{I<name>}>  or
 409 C<\N{U+I<wide hex char>}> for the same reason that a dot C<.> inside a
 410 bracketed character class loses its special meaning: it matches nearly
 411 anything, which generally isn't what you want to happen.
 412
 413 Examples:
 414
 415  /[\p{Thai}\d]/     # Matches a character that is either a Thai
 416                     # character, or a digit.
 417  /[^\p{Arabic}()]/  # Matches a character that is neither an Arabic
 418                     # character, nor a parenthesis.
 419
 420 Backslash sequence character classes cannot form one of the endpoints
 421 of a range.
 422
 423 =head3 Posix Character Classes
 424 X<character class> X<\p> X<\p{}>
 426 X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
 427 X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
 428
 429 Posix character classes have the form C<[:class:]>, where I<class> is the
 430 name, and C<[:> and C<:]> are the delimiters. Posix character classes only appear
 431 I<inside> bracketed character classes, and are a convenient and descriptive
 432 way of listing a group of characters. Be careful about the syntax:
 433
 434  # Correct:
 435  $string =~ /[[:alpha:]]/
 436
 437  # Incorrect (will warn):
 438  $string =~ /[:alpha:]/
 439
 440 The latter pattern would be a character class consisting of a colon,
 441 and the letters C<a>, C<l>, C<p> and C<h>.
 442 These character classes can be part of a larger bracketed character class.  For
 443 example,
 444
 445  [01[:alpha:]%]
 446
 447 is valid and matches '0', '1', any alphabetic character, and the percent sign.
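
A sketch of that class in use, filtering a made-up list of one-character
strings:

```perl
use strict;
use warnings;

# One character that is '0', '1', alphabetic, or a percent sign.
my $class = qr/\A[01[:alpha:]%]\z/;

my @candidates = ('0', '1', '2', 'x', 'Q', '%', ':');
my @accepted   = grep { /$class/ } @candidates;

print "accepted: @accepted\n";
```

The colon is rejected: inside the outer brackets, C<[:alpha:]> is read as a
POSIX class, not as a list containing a colon.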
 448
 449 Perl recognizes the following POSIX character classes:
 450
 451  alpha  Any alphabetical character ("[A-Za-z]").
 452  alnum  Any alphanumerical character ("[A-Za-z0-9]").
 453  ascii  Any character in the ASCII character set.
 454  blank  A GNU extension, equal to a space or a horizontal tab ("\t").
 455  cntrl  Any control character.  See Note [2] below.
 456  digit  Any decimal digit ("[0-9]"), equivalent to "\d".
 457  graph  Any printable character, excluding a space.  See Note [3] below.
 458  lower  Any lowercase character ("[a-z]").
 459  print  Any printable character, including a space.  See Note [4] below.
 460  punct  Any graphical character excluding "word" characters.  See Note [5] below.
 461  space  Any whitespace character. "\s" plus the vertical tab ("\cK").
 462  upper  Any uppercase character ("[A-Z]").
 463  word   A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
 464  xdigit Any hexadecimal digit ("[0-9a-fA-F]").
 465
 466 Most POSIX character classes have two Unicode-style C<\p> property
 467 counterparts.  (They are not official Unicode properties, but Perl extensions
 468 derived from official Unicode properties.)  The table below shows the relation
 469 between POSIX character classes and these counterparts.
 470
 471 One counterpart, in the column labelled "ASCII-range Unicode" in
 472 the table, will only match characters in the ASCII range.  (On EBCDIC platforms,
 473 they match those characters which have ASCII equivalents.)
 474
 475 The other counterpart, in the column labelled "Full-range Unicode", matches any
 476 appropriate characters in the full Unicode character set.  For example,
 477 C<\p{Alpha}> will match not just the ASCII alphabetic characters, but any
 478 character in the entire Unicode character set that is considered to be
 479 alphabetic.
 480
 481 (Each of the counterparts has various synonyms as well.
 482 L<perluniprops/Properties accessible through \p{} and \P{}> lists all the
 483 synonyms, plus all the characters matched by each of the ASCII-range
 484 properties.  For example C<\p{AHex}> is a synonym for C<\p{ASCII_Hex_Digit}>,
 485 and any C<\p> property name can be prefixed with "Is" such as C<\p{IsAlpha}>.)
 486
 487 Both the C<\p> forms are unaffected by any locale that is in effect, or whether
 488 the string is in UTF-8 format or not, or whether the platform is EBCDIC or not.
 489 In contrast, the POSIX character classes are affected.  If the source string is
 490 in UTF-8 format, the POSIX classes (with the exception of C<[[:punct:]]>, see
 491 Note [5]) behave like their "Full-range" Unicode counterparts.  If the source
 492 string is not in UTF-8 format, and no locale is in effect, and the platform is
 493 not EBCDIC, all the POSIX classes behave like their ASCII-range counterparts.
 494 Otherwise, they behave based on the rules of the locale or EBCDIC code page.
 495 It is proposed to change this behavior in a future release of Perl so that
 496 the UTF8ness of the source string will be irrelevant to the behavior of the
 497 POSIX character classes.  This means they will always behave in strict
 498 accordance with the official POSIX standard.  That is, if either locale or
 499 EBCDIC code page is present, they will behave in accordance with those; if
 500 absent, the classes will match only their ASCII-range counterparts.  If you
 501 disagree with this proposal, send email to C<perl5-porters@perl.org>.
 502
 503  [[:...:]]      ASCII-range        Full-range  backslash  Note
 504                  Unicode            Unicode    sequence
 505  -----------------------------------------------------
 506    alpha      \p{PosixAlpha}       \p{Alpha}
 507    alnum      \p{PosixAlnum}       \p{Alnum}
 508    ascii      \p{ASCII}
 509    blank      \p{PosixBlank}       \p{Blank} =
 510                                    \p{HorizSpace}  \h      [1]
 511    cntrl      \p{PosixCntrl}       \p{Cntrl}               [2]
 512    digit      \p{PosixDigit}       \p{Digit}       \d
 513    graph      \p{PosixGraph}       \p{Graph}               [3]
 514    lower      \p{PosixLower}       \p{Lower}
 515    print      \p{PosixPrint}       \p{Print}               [4]
 516    punct      \p{PosixPunct}       \p{Punct}               [5]
 517               \p{PerlSpace}        \p{SpacePerl}   \s      [6]
 518    space      \p{PosixSpace}       \p{Space}               [6]
 519    upper      \p{PosixUpper}       \p{Upper}
 520    word       \p{PerlWord}         \p{Word}        \w
 521    xdigit     \p{ASCII_Hex_Digit}  \p{XDigit}
 522
 523 =over 4
 524
 525 =item [1]
 526
 527 C<\p{Blank}> and C<\p{HorizSpace}> are synonyms.
 528
 529 =item [2]
 530
 531 Control characters don't produce output as such, but instead usually control
 532 the terminal somehow: for example newline and backspace are control characters.
 533 In the ASCII range, characters whose ordinals are between 0 and 31 inclusive,
 534 plus 127 (C<DEL>) are control characters.
 535
 536 On EBCDIC platforms, it is likely that the code page will define this character
 537 class to be the counterparts to the ASCII controls, plus the controls that in
 538 Unicode have ordinals from 128 through 159.
 539
 540 =item [3]
 541
 542 Any character that is I<graphical>, that is, visible. This class consists
 543 of all the alphanumerical characters and all punctuation characters.
 544
 545 =item [4]
 546
 547 All printable characters, which is the set of all the graphical characters
 548 plus whitespace characters that are not also controls.
 549
 550 =item [5]
 551
 552 C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all the
 553 non-controls, non-alphanumeric, non-space characters:
 554 C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect,
 555 it could alter the behavior of C<[[:punct:]]>).
 556
 557 When the matching string is in UTF-8 format, C<[[:punct:]]> matches the above
 558 set, plus whatever C<\p{Punct}> matches beyond the ASCII range.  It matches
 559 more than what C<\p{Punct}> matches in the ASCII range, because the POSIX
 560 definition of "Punct" includes more than what Unicode calls "Punct"; namely, it
 561 includes what Unicode calls "Symbol".  In other words, the Posix C<[[:punct:]]>
 562 lumps the Unicode "Punct" and "Symbol" together.
 563
 564 This character class does not match any characters of Unicode type "Symbol"
 565 outside the ASCII range when the matching string is in UTF-8 format.
 566
 567 =item [6]
 568
 569 C<\p{SpacePerl}> and C<\p{Space}> differ only in that C<\p{Space}> additionally
 570 matches the vertical tab, C<\cK>.   Same for the two ASCII-only range forms.
 571
 572 =back
 573
 574 =head4 Negation
 575 X<character class, negation>
 576
 577 A Perl extension to the POSIX character class is the ability to
 578 negate it. This is done by prefixing the class name with a caret (C<^>).
 579 Some examples:
 580
 581      POSIX         ASCII-range     Full-range  backslash
 582                     Unicode         Unicode    sequence
 583  -----------------------------------------------------
 584  [[:^digit:]]   \P{PosixDigit}     \P{Digit}     \D
 585  [[:^space:]]   \P{PosixSpace}     \P{Space}
 586  [[:^word:]]    \P{PerlWord}       \P{Word}      \W
 587
 588 =head4 [= =] and [. .]
 589
 590 Perl will recognize the POSIX character classes C<[=class=]>, and
 591 C<[.class.]>, but does not (yet?) support them.  Use of
 592 such a construct will lead to an error.
 593
 594
 595 =head4 Examples
 596
 597  /[[:digit:]]/            # Matches a character that is a digit.
 598  /[01[:lower:]]/          # Matches a character that is either a
 599                           # lowercase letter, or '0' or '1'.
 600  /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything,
 601                           # but the letters 'a' to 'f' in either case.
 602                           # This is because the character class contains
 603                           # all digits, and anything that isn't a
 604                           # hex digit, resulting in a class containing
 605                           # all characters, but the letters 'a' to 'f'
 606                           # and 'A' to 'F'.
 607
 608
 609 =head2 Locale, EBCDIC, Unicode and UTF-8
 610
 611 Some of the character classes have a somewhat different behaviour depending
 612 on the internal encoding of the source string, and the locale that is
 613 in effect, and if the program is running on an EBCDIC platform.
 614
 615 C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
 616 including C<\W>, C<\D>, C<\S>) suffer from this behaviour.  (This also affects
 617 the backslash sequences C<\b> and C<\B>.)
 618
 619 The rule is that if the source string is in UTF-8 format, the character
 620 classes match according to the Unicode properties. If the source string
 621 isn't, then the character classes match according to whatever locale or EBCDIC
 622 code page is in effect. If neither a locale nor an EBCDIC code page is in
 623 effect, they match the ASCII defaults (52 letters, 10 digits and underscore
 624 for C<\w>; 0 to 9 for C<\d>; L</Whitespace> above gives the list for C<\s>).
 625
 626 This usually means that if you are matching against characters whose C<ord()>
 627 values are between 128 and 255 inclusive, your character class may match
 628 or not depending on the current locale or EBCDIC code page, and whether the
 629 source string is in UTF-8 format. The string will be in UTF-8 format if it
 630 contains characters whose C<ord()> value exceeds 255. But a string may be in
 631 UTF-8 format without it having such characters.  See L<perlunicode/The
 632 "Unicode Bug">.
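
Whether a string is in UTF-8 format can be inspected with the core
C<utf8::is_utf8> function. The sketch below uses C<utf8::upgrade> to put a
plain ASCII string into UTF-8 format without changing its characters; the
variable names are illustrative.

```perl
use strict;
use warnings;

my $native = "abc";
my $before = utf8::is_utf8($native) ? 1 : 0;   # not in UTF-8 format

my $upgraded = $native;
utf8::upgrade($upgraded);                      # change the internal format only
my $after = utf8::is_utf8($upgraded) ? 1 : 0;  # now in UTF-8 format

# The two strings still hold the same characters.
my $same = $native eq $upgraded ? 1 : 0;

# A code point above 255 forces UTF-8 format.
my $wide = utf8::is_utf8("\x{0e0b}") ? 1 : 0;

print "before: $before, after: $after, same: $same, wide: $wide\n";
```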
 633
 634 For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
 635 or the POSIX character classes, and use the Unicode properties instead.
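
As a sketch of that advice, the properties C<\p{Alpha}> and
C<\p{PosixAlpha}> from the table above give the same answer for a character
in the 128 to 255 range whether or not the string is in UTF-8 format. An
ASCII-based platform is assumed.

```perl
use strict;
use warnings;

my $native = "\xDF";        # LATIN SMALL LETTER SHARP S, not in UTF-8 format
my $utf8   = $native;
utf8::upgrade($utf8);       # same character, in UTF-8 format

# Full-range property: alphabetic in both representations.
my $full_native = $native =~ /\p{Alpha}/ ? 1 : 0;
my $full_utf8   = $utf8   =~ /\p{Alpha}/ ? 1 : 0;

# ASCII-range property: never matches a character above 127.
my $ascii_native = $native =~ /\p{PosixAlpha}/ ? 1 : 0;
my $ascii_utf8   = $utf8   =~ /\p{PosixAlpha}/ ? 1 : 0;

print "$full_native $full_utf8 $ascii_native $ascii_utf8\n";
```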
 636
 637 =head4 Examples
 638
 639  $str =  "\xDF";      # $str is not in UTF-8 format.
 640  $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
 641  $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
 642  $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
 643  chop $str;
 644  $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.
 645
 646 =cut