=head1 NAME

perlrecharclass - Perl Regular Expression Character Classes

=head1 DESCRIPTION

The top level documentation about Perl regular expressions
is found in L<perlre>.

This manual page discusses the syntax and use of character
classes in Perl Regular Expressions.

A character class is a way of denoting a set of characters,
in such a way that one character of the set is matched.
It's important to remember that matching a character class
consumes exactly one character in the source string. (The source
string is the string the regular expression is matched against.)

There are three types of character classes in Perl regular
expressions: the dot, backslashed sequences, and the bracketed form.
=head2 The dot

The dot (or period), C<.>, is probably the most used, and certainly
the most well-known character class. By default, a dot matches any
character, except for the newline. The default can be changed to
add matching the newline with the I<single line> modifier: either
for the entire regular expression using the C</s> modifier, or
locally using C<(?s)>.

Here are some examples:

    ""   =~ /./       # No match (dot has to match a character)
    "\n" =~ /./       # No match (dot does not match a newline)
    "\n" =~ /./s      # Match (global 'single line' modifier)
    "\n" =~ /(?s:.)/  # Match (local 'single line' modifier)
    "ab" =~ /^.$/     # No match (dot matches one character)
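
The effect of the I<single line> modifier is easiest to see on a string
that spans more than one line. The following is a minimal sketch; the
sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# Without /s the dot stops at the newline, so only the first line is captured.
my ($default) = $text =~ /^(.*)/;

# With /s the dot also matches the newline, so the capture runs to the end.
my ($single_line) = $text =~ /^(.*)/s;

# (?s:...) switches the behaviour on for one part of the pattern only.
my ($local) = $text =~ /^(.*?line(?s:.)second)/;

print "default:     ", length($default), " characters\n";
print "single line: ", length($single_line), " characters\n";
print "local:       ", length($local), " characters\n";
```

Only the dot inside C<(?s:.)> is allowed to cross the newline in the last
pattern; the leading C<.*?> still stops at the end of the first line.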
=head2 Backslashed sequences

Perl regular expressions contain many backslashed sequences that
constitute a character class. That is, they will match a single
character, if that character belongs to a specific set of characters
(defined by the sequence). A backslashed sequence is a sequence of
characters starting with a backslash. Not all backslashed sequences
are character classes; for a full list, see L<perlrebackslash>.

Here's a list of the backslashed sequences, which are discussed in
more detail below.

    \d             Match a digit character.
    \D             Match a non-digit character.
    \w             Match a "word" character.
    \W             Match a non-"word" character.
    \s             Match a white space character.
    \S             Match a non-white space character.
    \h             Match a horizontal white space character.
    \H             Match a character that isn't horizontal white space.
    \N             Match a character that isn't a newline.
    \v             Match a vertical white space character.
    \V             Match a character that isn't vertical white space.
    \pP, \p{Prop}  Match a character matching a Unicode property.
    \PP, \P{Prop}  Match a character that doesn't match a Unicode property.
=head3 Digits

C<\d> matches a single character that is considered to be a I<digit>.
What is considered a digit depends on the internal encoding of
the source string. If the source string is in UTF-8 format, C<\d>
not only matches the digits '0' - '9', but also digits from Arabic,
Devanagari and other scripts. Otherwise, if there is a locale in effect,
it will match whatever characters the locale considers digits. Without
a locale, C<\d> matches the digits '0' to '9'.
See L</Locale, Unicode and UTF-8>.

Any character that isn't matched by C<\d> will be matched by C<\D>.
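
As a small illustration of the UTF-8 case, the sketch below compares
C<\d> with the explicit range C<[0-9]> on a non-ASCII digit. The choice of
ARABIC-INDIC DIGIT THREE and the variable names are only for illustration.

```perl
use strict;
use warnings;

# A code point above 255 forces the string into UTF-8 format.
my $arabic_three = "\x{0663}";    # ARABIC-INDIC DIGIT THREE

my $d_matches     = ($arabic_three =~ /^\d$/)    ? 1 : 0;
my $ascii_matches = ($arabic_three =~ /^[0-9]$/) ? 1 : 0;

print "\\d matches:    $d_matches\n";
print "[0-9] matches: $ascii_matches\n";
```

When only the ASCII digits are wanted, spelling out the range C<[0-9]> keeps
the match independent of the string's internal encoding.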
=head3 Word characters

C<\w> matches a single I<word> character: an alphanumeric character
(that is, an alphabetic character, or a digit), or the underscore (C<_>).
What is considered a word character depends on the internal encoding
of the string. If it's in UTF-8 format, C<\w> matches those characters
that are considered word characters in the Unicode database. That is, it
not only matches ASCII letters, but also Thai letters, Greek letters, etc.
If the source string isn't in UTF-8 format, C<\w> matches those characters
that are considered word characters by the current locale. Without
a locale in effect, C<\w> matches the ASCII letters, digits and the
underscore.

Any character that isn't matched by C<\w> will be matched by C<\W>.

=head3 White space

C<\s> matches any single character that is considered white space. In the
ASCII range, C<\s> matches the horizontal tab (C<\t>), the newline
(C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the
space. (The vertical tab, C<\cK>, is not matched by C<\s>.) The exact set
of characters matched by C<\s> depends on whether the source string is
in UTF-8 format. If it is, C<\s> matches what is considered white space
in the Unicode database. Otherwise, if there is a locale in effect, C<\s>
matches whatever is considered white space by the current locale. Without
a locale, C<\s> matches the five characters mentioned at the beginning
of this paragraph. Perhaps the most notable difference is that C<\s>
matches a non-breaking space only if the non-breaking space is in a
UTF-8 encoded string.

Any character that isn't matched by C<\s> will be matched by C<\S>.
C<\h> will match any character that is considered horizontal white space;
this includes the space and the tab characters. C<\H> will match any character
that is not considered horizontal white space.

C<\N>, like the dot, will match any character that is not a newline. The
difference is that C<\N> will not be influenced by the single line C</s>
regular expression modifier. (Note that, since C<\N{}> is also used for
named characters, if C<\N> is followed by an opening brace and something that
is not a quantifier, perl will assume that a character name is coming. For
example, C<\N{3}> means to match 3 non-newlines; C<\N{5,}> means to match 5 or
more non-newlines, but C<\N{4F}> is not a legal quantifier, and will cause
perl to look for a character named C<4F> (and won't find it unless custom names
are defined that include it).)
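
The difference between the dot and C<\N>, and the quantifier reading of
C<\N{3}> and C<\N{5,}>, can be checked with a short script. This is a sketch
that assumes a perl recent enough to support C<\N> as a character class; the
sample strings are invented.

```perl
use strict;
use warnings;

my $newline = "\n";

# Under /s the dot matches a newline, but \N still refuses to.
my $dot_matches = ($newline =~ /./s)  ? 1 : 0;
my $N_matches   = ($newline =~ /\N/s) ? 1 : 0;

# \N{3} is \N with a quantifier: exactly three non-newline characters.
my ($three) = "abcd\nef" =~ /(\N{3})/;

# \N{5,} needs a run of at least five non-newlines, so "ab" is skipped.
my ($run) = "ab\ncdefgh\n" =~ /(\N{5,})/;

print "dot with /s matches newline: $dot_matches\n";
print "\\N with /s matches newline:  $N_matches\n";
print "three: $three\n";
print "run:   $run\n";
```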
C<\v> will match any character that is considered vertical white space;
this includes the carriage return and line feed characters (newline).
C<\V> will match any character that is not considered vertical white space.

C<\R> matches anything that can be considered a newline under Unicode
rules. It's not a character class, as it can match a multi-character
sequence. Therefore, it cannot be used inside a bracketed character
class. Details are discussed in L<perlrebackslash>.

C<\h>, C<\H>, C<\v>, C<\V>, and C<\R> are new in perl 5.10.0.

Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
the same characters, regardless of whether the source string is in UTF-8
format or not. The set of characters they match is also not influenced
by locale.

One might think that C<\s> is equivalent to C<[\h\v]>. This is not true.
The vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however,
considered vertical white space. Furthermore, if the source string is
not in UTF-8 format, the next line (C<"\x85">) and the no-break space
(C<"\xA0">) are not matched by C<\s>, but are by C<\v> and C<\h> respectively.
If the source string is in UTF-8 format, both the next line and the
no-break space are matched by C<\s>.
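
The no-break space case can be sketched as follows. The example assumes
perl's default matching rules (no locale in effect and nothing that switches
on Unicode semantics for native 8-bit strings); C<utf8::upgrade> is used here
only as a convenient way to change the internal format without changing the
characters.

```perl
use strict;
use warnings;

my $nbsp = "\xA0";    # NO-BREAK SPACE, stored as a native 8-bit string

my $h_native = ($nbsp =~ /\h/) ? 1 : 0;    # \h ignores the internal format
my $s_native = ($nbsp =~ /\s/) ? 1 : 0;    # \s does not match here

utf8::upgrade($nbsp);    # same character, now in UTF-8 format

my $h_utf8 = ($nbsp =~ /\h/) ? 1 : 0;
my $s_utf8 = ($nbsp =~ /\s/) ? 1 : 0;

print "native: \\h=$h_native \\s=$s_native\n";
print "UTF-8:  \\h=$h_utf8 \\s=$s_utf8\n";
```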
The following table is a complete listing of characters matched by
C<\s>, C<\h> and C<\v>.

The first column gives the code point of the character (in hex format),
the second column gives the (Unicode) name. The third column indicates
by which class(es) the character is matched.

    0x00009  CHARACTER TABULATION       h s
    0x0000a  LINE FEED (LF)              vs
    0x0000b  LINE TABULATION             v
    0x0000c  FORM FEED (FF)              vs
    0x0000d  CARRIAGE RETURN (CR)        vs
    0x00020  SPACE                      h s
    0x00085  NEXT LINE (NEL)             vs  [1]
    0x000a0  NO-BREAK SPACE             h s  [1]
    0x01680  OGHAM SPACE MARK           h s
    0x0180e  MONGOLIAN VOWEL SEPARATOR  h s
    0x02000  EN QUAD                    h s
    0x02001  EM QUAD                    h s
    0x02002  EN SPACE                   h s
    0x02003  EM SPACE                   h s
    0x02004  THREE-PER-EM SPACE         h s
    0x02005  FOUR-PER-EM SPACE          h s
    0x02006  SIX-PER-EM SPACE           h s
    0x02007  FIGURE SPACE               h s
    0x02008  PUNCTUATION SPACE          h s
    0x02009  THIN SPACE                 h s
    0x0200a  HAIR SPACE                 h s
    0x02028  LINE SEPARATOR              vs
    0x02029  PARAGRAPH SEPARATOR         vs
    0x0202f  NARROW NO-BREAK SPACE      h s
    0x0205f  MEDIUM MATHEMATICAL SPACE  h s
    0x03000  IDEOGRAPHIC SPACE          h s

=over 4

=item [1]

NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in
UTF-8 format.

=back
It is worth noting that C<\d>, C<\w>, etc, match single characters, not
complete numbers or words. To match a number (that consists of digits),
use C<\d+>; to match a word, use C<\w+>.
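
For instance, a sketch along these lines pulls whole words and whole
numbers out of a line of text (the sample line is invented):

```perl
use strict;
use warnings;

my $line = "order_id 1047 shipped 3 items";

# In list context, /g returns every match of the whole pattern.
my @words   = $line =~ /\w+/g;
my @numbers = $line =~ /\d+/g;

print "words:   @words\n";
print "numbers: @numbers\n";
```

Note that C<\w+> also picks up the numbers, since digits are word
characters.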
=head3 Unicode Properties

C<\pP> and C<\p{Prop}> are character classes to match characters that
fit given Unicode classes. One-letter classes can be used in the C<\pP>
form, with the class name following the C<\p>; otherwise, braces are required.
There is a single form, which is just the property name enclosed in the braces,
and a compound form which looks like C<\p{name=value}>, which means to match
if the property C<name> for the character has the particular C<value>.
For instance, a match for a number can be written as C</\pN/> or as
C</\p{Number}/>, or as C</\p{Number=True}/>.
Lowercase letters are matched by the property I<Lowercase_Letter>, which
has the short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or
C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/>
(the underscores are optional).
C</\pLl/> is valid, but means something different.
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.
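
The difference between C<\p{Ll}> and C<\pLl> is easy to get wrong, so here is
a small sketch that checks both forms against short test strings (the strings
and variable names are illustrative only):

```perl
use strict;
use warnings;

# \p{Ll}: exactly one lowercase letter.
my $lower_a = ("a" =~ /^\p{Ll}$/) ? 1 : 0;
my $upper_a = ("A" =~ /^\p{Ll}$/) ? 1 : 0;

# \pLl: any letter (\pL) followed by a literal "l", so two characters.
my $two_chars = ("Al" =~ /^\pLl$/) ? 1 : 0;
my $one_char  = ("a"  =~ /^\pLl$/) ? 1 : 0;

# The compound form names the property and its value explicitly.
my $compound = ("a" =~ /^\p{General_Category=Lowercase_Letter}$/) ? 1 : 0;

print "\\p{Ll} on 'a': $lower_a, on 'A': $upper_a\n";
print "\\pLl on 'Al': $two_chars, on 'a': $one_char\n";
print "compound form on 'a': $compound\n";
```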
For more details, see L<perlunicode/Unicode Character Properties>; for a
complete list of possible properties, see
L<perluniprops/Properties accessible through \p{} and \P{}>.
It is also possible to define your own properties. This is discussed in
L<perlunicode/User-Defined Character Properties>.

=head4 Examples

    "a"  =~  /\w/      # Match, "a" is a 'word' character.
    "7"  =~  /\w/      # Match, "7" is a 'word' character as well.
    "a"  =~  /\d/      # No match, "a" isn't a digit.
    "7"  =~  /\d/      # Match, "7" is a digit.
    " "  =~  /\s/      # Match, a space is white space.
    "a"  =~  /\D/      # Match, "a" is a non-digit.
    "7"  =~  /\D/      # No match, "7" is not a non-digit.
    " "  =~  /\S/      # No match, a space is not non-white space.

    " "  =~  /\h/      # Match, space is horizontal white space.
    " "  =~  /\v/      # No match, space is not vertical white space.
    "\r" =~  /\v/      # Match, a return is vertical white space.

    "a"  =~  /\pL/     # Match, "a" is a letter.
    "a"  =~  /\p{Lu}/  # No match, /\p{Lu}/ matches upper case letters.

    "\x{0e0b}" =~ /\p{Thai}/  # Match, \x{0e0b} is the character
                              # 'THAI CHARACTER SO SO', and that's in
                              # the Thai Unicode class.
    "a"  =~  /\P{Lao}/ # Match, as "a" is not a Lao character.
=head2 Bracketed Character Classes

The third form of character class you can use in Perl regular expressions
is the bracketed form. In its simplest form, it lists the characters
that may be matched inside square brackets, like this: C<[aeiou]>.
This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Just as the other
character classes, exactly one character will be matched. To match
a longer string consisting of characters mentioned in the character
class, follow the character class with a quantifier. For instance,
C<[aeiou]+> matches a string of one or more lowercase ASCII vowels.

Repeating a character in a character class has no
effect; it's considered to be in the set only once.

=head4 Examples

    "e"  =~  /[aeiou]/        # Match, as "e" is listed in the class.
    "p"  =~  /[aeiou]/        # No match, "p" is not listed in the class.
    "ae" =~  /^[aeiou]$/      # No match, a character class only matches
                              # a single character.
    "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.
=head3 Special Characters Inside a Bracketed Character Class

Most characters that are meta characters in regular expressions (that
is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose
their special meaning and can be used inside a character class without
the need to escape them. For instance, C<[()]> matches either an opening
parenthesis, or a closing parenthesis, and the parens inside the character
class don't group or capture.

Characters that may carry a special meaning inside a character class are:
C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be
escaped with a backslash, although this is sometimes not needed, in which
case the backslash may be omitted.

The sequence C<\b> is special inside a bracketed character class. While
outside the character class C<\b> is an assertion indicating a point
that does not have either two word characters or two non-word characters
on either side, inside a bracketed character class, C<\b> matches a
backspace character.

The sequences C<\a>, C<\c>, C<\e>, C<\f>, C<\n>, C<\N{NAME}>,
C<\N{U+hex char}>, C<\r>, C<\t>, and C<\x>
are also special and have the same meanings as they do outside a bracketed character
class.

Also, a backslash followed by digits is considered an octal number.

A C<[> is not special inside a character class, unless it's the start
of a POSIX character class (see below). It normally does not need escaping.

A C<]> is either the end of a POSIX character class (see below), or it
signals the end of the bracketed character class. Normally it needs
escaping if you want to include a C<]> in the set of characters.
However, if the C<]> is the I<first> (or the second if the first
character is a caret) character of a bracketed character class, it
does not denote the end of the class (as you cannot have an empty class)
and is considered part of the set of characters that can be matched without
escaping.

=head4 Examples

    "+"   =~ /[+?*]/     # Match, "+" in a character class is not special.
    "\cH" =~ /[\b]/      # Match, \b inside a character class
                         # is equivalent to a backspace.
    "]"   =~ /[][]/      # Match, as the character class contains
                         # both [ and ].
    "[]"  =~ /[[]]/      # Match, the pattern contains a character class
                         # containing just [, and the character class is
                         # followed by a ].
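
As a practical use of these rules, the sketch below counts square brackets
with the class C<[][]> and looks for a backspace with C<[\b]>. The sample
strings are invented for the example.

```perl
use strict;
use warnings;

my $text = "matrix[2][10] = list[0]";

# "[][]" is a class holding "]" (first, so not a terminator) and "[".
# The "= () =" idiom counts the matches returned by /g.
my $brackets = () = $text =~ /[][]/g;

# Inside a class, \b is a backspace character, not a word boundary.
my $has_backspace = ("a\cHb" =~ /[\b]/) ? 1 : 0;
my $no_backspace  = ("a b"   =~ /[\b]/) ? 1 : 0;

print "square brackets: $brackets\n";
print "backspace found: $has_backspace, $no_backspace\n";
```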
=head3 Character Ranges

It is not uncommon to want to match a range of characters. Luckily, instead
of listing all the characters in the range, one may use the hyphen (C<->).
If inside a bracketed character class you have two characters separated
by a hyphen, it's treated as if all the characters between the two are in
the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]>
matches any lowercase letter from the first half of the ASCII alphabet.

Note that the two characters on either side of the hyphen are not
necessarily both letters or both digits. Any character is possible,
although not advisable. C<['-?]> contains a range of characters, but
most people will not know which characters that will be. Furthermore,
such ranges may lead to portability problems if the code has to run on
a platform that uses a different character set, such as EBCDIC.

If a hyphen in a character class cannot be part of a range, for instance
because it is the first or the last character of the character class,
or if it immediately follows a range, the hyphen isn't special, and will be
considered a character that may be matched. You have to escape the hyphen
with a backslash if you want to have a hyphen in your set of characters to
be matched, and its position in the class is such that it can be considered
part of a range.

=head4 Examples

    [a-z]       #  Matches a character that is a lower case ASCII letter.
    [a-fz]      #  Matches any letter between 'a' and 'f' (inclusive) or the
                #  letter 'z'.
    [-z]        #  Matches either a hyphen ('-') or the letter 'z'.
    [a-f-m]     #  Matches any letter between 'a' and 'f' (inclusive), the
                #  hyphen ('-'), or the letter 'm'.
    ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
                #  (But not on an EBCDIC platform).
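
The effect of the hyphen's position can be checked by running the same four
characters through a range, an escaped hyphen, and a leading hyphen. This is
a sketch only; the labels in C<%classes> are made up for the example.

```perl
use strict;
use warnings;

my %classes = (
    'range [a-z]'    => qr/^[a-z]$/,     # every letter from a to z
    'escaped [a\-z]' => qr/^[a\-z]$/,    # just "a", "-" and "z"
    'leading [-az]'  => qr/^[-az]$/,     # also just "-", "a" and "z"
);

my %matched;
for my $name (sort keys %classes) {
    # Keep the test characters that the class accepts, in order.
    $matched{$name} = join '', grep { $_ =~ $classes{$name} } qw(a m z -);
    print "$name accepts: $matched{$name}\n";
}
```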
=head3 Negation

It is also possible to instead list the characters you do not want to
match. You can do so by using a caret (C<^>) as the first character in the
character class. For instance, C<[^a-z]> matches a character that is not a
lowercase ASCII letter.

This syntax makes the caret a special character inside a bracketed character
class, but only if it is the first character of the class. So if you want
to have the caret as one of the characters you want to match, you either
have to escape the caret, or not list it first.

=head4 Examples

    "e"  =~  /[^aeiou]/   #  No match, the 'e' is listed.
    "x"  =~  /[^aeiou]/   #  Match, as 'x' isn't a lowercase vowel.
    "^"  =~  /[^^]/       #  No match, matches anything that isn't a caret.
    "^"  =~  /[x^]/       #  Match, caret is not special here.
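
A negated class is often used to strip everything except a known set of
characters. The sketch below does that for lowercase vowels and also checks
the three ways a caret can appear in a class (the sample phrase is arbitrary):

```perl
use strict;
use warnings;

# Delete every character that is not a lowercase ASCII vowel.
(my $vowels = "regular expression") =~ s/[^aeiou]//g;

my $negated = ("^" =~ /[^^]/)  ? 1 : 0;    # leading caret negates the class
my $later   = ("^" =~ /[x^]/)  ? 1 : 0;    # caret listed later is literal
my $escaped = ("^" =~ /[\^x]/) ? 1 : 0;    # escaped caret is literal too

print "vowels only: $vowels\n";
print "caret matches: negated=$negated later=$later escaped=$escaped\n";
```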
=head3 Backslash Sequences

You can put any backslash sequence character class (with one exception listed
in the next paragraph) inside a bracketed character class, and it will act just
as if you put all the characters matched by the backslash sequence inside the
character class. For instance, C<[a-f\d]> will match any digit, or any of the
lowercase letters between 'a' and 'f' inclusive.

C<\N> within a bracketed character class must be of the form C<\N{NAME}> for
the same reason that a dot C<.> inside a bracketed character class loses its
special meaning: it matches nearly anything, which generally isn't what you
want.

=head4 Examples

    /[\p{Thai}\d]/     # Matches a character that is either a Thai
                       # character, or a digit.
    /[^\p{Arabic}()]/  # Matches a character that is neither an Arabic
                       # character, nor a parenthesis.
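
The class C<[a-f\d]> can, for example, be used to filter strings made up
only of lowercase hexadecimal digits. A minimal sketch with invented
candidate strings:

```perl
use strict;
use warnings;

my @candidates = ("3f9a", "c0ffee", "3G", "BEEF");

# [a-f\d] is the union of the range a-f and everything \d matches.
my @lower_hex = grep { /^[a-f\d]+$/ } @candidates;

print "lowercase hex strings: @lower_hex\n";
```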
Backslash sequence character classes cannot form one of the endpoints
of a range.

=head3 POSIX Character Classes

POSIX character classes have the form C<[:class:]>, where I<class> is
the name, and C<[:> and C<:]> are the delimiters. POSIX character classes appear
I<inside> bracketed character classes, and are a convenient and descriptive
way of listing a group of characters. Be careful about the syntax:

    # Correct:
    $string =~ /[[:alpha:]]/

    # Incorrect (will warn):
    $string =~ /[:alpha:]/

The latter pattern would be a character class consisting of a colon,
and the letters C<a>, C<l>, C<p> and C<h>.
Perl recognizes the following POSIX character classes:

    alpha   Any alphabetical character.
    alnum   Any alphanumerical character.
    ascii   Any ASCII character.
    blank   A GNU extension, equal to a space or a horizontal tab ("\t").
    cntrl   Any control character.
    digit   Any digit, equivalent to "\d".
    graph   Any printable character, excluding a space.
    lower   Any lowercase character.
    print   Any printable character, including a space.
    punct   Any punctuation character.
    space   Any white space character. "\s" plus the vertical tab ("\cK").
    upper   Any uppercase character.
    word    Any "word" character, equivalent to "\w".
    xdigit  Any hexadecimal digit, '0' - '9', 'a' - 'f', 'A' - 'F'.

The exact set of characters matched depends on whether the source string
is internally in UTF-8 format or not. See L</Locale, Unicode and UTF-8>.

Most POSIX character classes have C<\p> counterparts. The difference
is that the C<\p> classes will always match according to the Unicode
properties, regardless of whether the string is in UTF-8 format or not.

The following table shows the relation between POSIX character classes
and the Unicode properties:

    [[:...:]]   \p{...}      backslash
Some of these names may not be obvious:

=over 4

=item cntrl

Any control character. Usually, control characters don't produce output
as such, but instead control the terminal somehow: for example newline
and backspace are control characters. All characters with C<ord()> less
than 32 are usually classified as control characters (in ASCII, the ISO
Latin character sets, and Unicode), as is the character whose C<ord()> value
is 127 (C<DEL>).

=item graph

Any character that is I<graphical>, that is, visible. This class consists
of all the alphanumerical characters and all punctuation characters.

=item print

All printable characters, which is the set of all the graphical characters
plus the space.

=item punct

Any punctuation (special) character.

=back
A Perl extension to the POSIX character class is the ability to
negate it. This is done by prefixing the class name with a caret (C<^>).
Some examples:

    POSIX         Unicode      Backslash
    [[:^digit:]]  \P{IsDigit}  \D
    [[:^space:]]  \P{IsSpace}  \S
    [[:^word:]]   \P{IsWord}   \W

=head4 [= =] and [. .]

Perl will recognize the POSIX character classes C<[=class=]> and
C<[.class.]>, but does not (yet?) support these constructs. Use of
such a construct will lead to an error.
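
Because the error is raised when the pattern is compiled, it can be trapped
by building the pattern at run time inside C<eval>. The following sketch
assumes nothing beyond that; the pattern string is just an example, and the
exact wording of the error message is left to perl.

```perl
use strict;
use warnings;

my $pattern = '[[=a=]]';    # an equivalence class, which perl does not support

# qr// compiles the pattern at run time, so eval can catch the failure.
my $compiled = eval { qr/$pattern/ };
my $error    = $@;

if (defined $compiled) {
    print "pattern compiled\n";
}
else {
    print "pattern rejected: $error";
}
```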
=head4 Examples

    /[[:digit:]]/            # Matches a character that is a digit.
    /[01[:lower:]]/          # Matches a character that is either a
                             # lowercase letter, or '0' or '1'.
    /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything
                             # except the letters 'a' to 'f' in either case.
                             # This is because the character class contains
                             # all digits, and anything that isn't a
                             # hex digit, resulting in a class containing
                             # all characters except the letters 'a' to 'f'
                             # and 'A' to 'F'.
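
The last pattern can be verified on a handful of ASCII characters, together
with the claim that C<[[:^digit:]]> and C<\D> agree. A small sketch, with an
arbitrary choice of test characters:

```perl
use strict;
use warnings;

my @chars = qw(5 a F g Z _);

# Digits pass via [:digit:]; everything that is not a hex digit
# passes via [:^xdigit:]. Only a-f and A-F are left out.
my @kept = grep { /^[[:digit:][:^xdigit:]]$/ } @chars;

# The negated POSIX class and \D should give the same answer each time.
my @agree = grep { (/[[:^digit:]]/ ? 1 : 0) == (/\D/ ? 1 : 0) } @chars;

print "kept: @kept\n";
print "[[:^digit:]] and \\D agree on ", scalar(@agree),
      " of ", scalar(@chars), " characters\n";
```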
=head2 Locale, Unicode and UTF-8

Some of the character classes have a somewhat different behaviour depending
on the internal encoding of the source string, and the locale that is
in effect.

C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
including C<\W>, C<\D>, C<\S>) suffer from this behaviour.

The rule is that if the source string is in UTF-8 format, the character
classes match according to the Unicode properties. If the source string
isn't, then the character classes match according to whatever locale is
in effect. If there is no locale, they match the ASCII defaults
(52 letters, 10 digits and underscore for C<\w>, 0 to 9 for C<\d>, etc).

This usually means that if you are matching against characters whose C<ord()>
values are between 128 and 255 inclusive, your character class may match
or not depending on the current locale, and whether the source string is
in UTF-8 format. The string will be in UTF-8 format if it contains
characters whose C<ord()> value exceeds 255. But a string may be in UTF-8
format without it having such characters.

For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
or the POSIX character classes, and use the Unicode properties instead.
=head4 Examples

    $str =  "\xDF";      # $str is not in UTF-8 format.
    $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
    $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
    $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
    chop $str;           # Remove the Thai character again.
    $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.
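
That a string can be in UTF-8 format without containing any character above
255 can also be shown directly. The sketch below uses C<utf8::is_utf8> to
inspect the internal format and C<utf8::upgrade> to change it; it assumes
perl's default matching rules, with no locale and nothing that enables
Unicode semantics for native 8-bit strings.

```perl
use strict;
use warnings;

my $str = "\xDF";    # LATIN SMALL LETTER SHARP S, ord() is 223

my $flag_before  = utf8::is_utf8($str) ? 1 : 0;
my $match_before = ($str =~ /^\w/)     ? 1 : 0;

# Change the internal format only; the character stays the same.
utf8::upgrade($str);

my $flag_after  = utf8::is_utf8($str) ? 1 : 0;
my $match_after = ($str =~ /^\w/)     ? 1 : 0;

print "before: UTF-8 format=$flag_before, \\w matches=$match_before\n";
print "after:  UTF-8 format=$flag_after, \\w matches=$match_after\n";
print "ord is still ", ord($str), "\n";
```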