=head1 NAME

perlrecharclass - Perl Regular Expression Character Classes

=head1 DESCRIPTION

The top level documentation about Perl regular expressions
is found in L<perlre>.

This manual page discusses the syntax and use of character
classes in Perl Regular Expressions.

A character class is a way of denoting a set of characters,
in such a way that one character of the set is matched.
It's important to remember that matching a character class
consumes exactly one character in the source string. (The source
string is the string the regular expression is matched against.)

There are three types of character classes in Perl regular
expressions: the dot, backslashed sequences, and the bracketed form.

=head2 The dot

The dot (or period), C<.>, is probably the most used, and certainly
the best-known, character class. By default, a dot matches any
character except the newline. The default can be changed to
also match the newline with the I<single line> modifier: either
for the entire regular expression using the C</s> modifier, or
locally using C<(?s)>.

Here are some examples:

 "a"  =~  /./       # Match
 "."  =~  /./       # Match
 ""   =~  /./       # No match (dot has to match a character)
 "\n" =~  /./       # No match (dot does not match a newline)
 "\n" =~  /./s      # Match (global 'single line' modifier)
 "\n" =~  /(?s:.)/  # Match (local 'single line' modifier)
 "ab" =~  /^.$/     # No match (dot matches one character)
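
The same checks can be run as a script. The following is a minimal
sketch; the helper name C<dot_matches> is hypothetical and exists only
for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $string is exactly one character that
# the dot matches, with or without the 'single line' modifier.
sub dot_matches {
    my ($string, $single_line) = @_;
    if ($single_line) {
        return $string =~ /\A.\z/s ? 1 : 0;
    }
    return $string =~ /\A.\z/ ? 1 : 0;
}

for my $case ( [ "a", 0 ], [ "\n", 0 ], [ "\n", 1 ], [ "ab", 1 ] ) {
    my ($string, $single_line) = @$case;
    (my $shown = $string) =~ s/\n/\\n/g;    # make the newline visible
    printf "%-5s with%s /s: %s\n", qq{"$shown"},
        ($single_line ? "" : "out"),
        (dot_matches($string, $single_line) ? "match" : "no match");
}
```

The anchors C<\A> and C<\z> are used instead of C<^> and C<$> so that a
trailing newline is never silently ignored.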

=head2 Backslashed sequences

Perl regular expressions contain many backslashed sequences that
constitute a character class. That is, they will match a single
character, if that character belongs to a specific set of characters
(defined by the sequence). A backslashed sequence is a sequence of
characters starting with a backslash. Not all backslashed sequences
are character classes; for a full list, see L<perlrebackslash>.

Here's a list of the backslashed sequences, which are discussed in
more detail below.

 \d             Match a digit character.
 \D             Match a non-digit character.
 \w             Match a "word" character.
 \W             Match a non-"word" character.
 \s             Match a white space character.
 \S             Match a non-white space character.
 \h             Match a horizontal white space character.
 \H             Match a character that isn't horizontal white space.
 \N             Match a character that isn't newline.
 \v             Match a vertical white space character.
 \V             Match a character that isn't vertical white space.
 \pP, \p{Prop}  Match a character matching a Unicode property.
 \PP, \P{Prop}  Match a character that doesn't match a Unicode property.

=head3 Digits

C<\d> matches a single character that is considered to be a I<digit>.
What is considered a digit depends on the internal encoding of
the source string. If the source string is in UTF-8 format, C<\d>
not only matches the digits '0' - '9', but also Arabic, Devanagari and
digits from other scripts. Otherwise, if there is a locale in effect,
it will match whatever characters the locale considers digits. Without
a locale, C<\d> matches the digits '0' to '9'.
See L</Locale, Unicode and UTF-8>.

Any character that isn't matched by C<\d> will be matched by C<\D>.

=head3 Word characters

C<\w> matches a single I<word> character: an alphanumeric character
(that is, an alphabetic character, or a digit), or the underscore (C<_>).
What is considered a word character depends on the internal encoding
of the string. If it's in UTF-8 format, C<\w> matches those characters
that are considered word characters in the Unicode database. That is, it
not only matches ASCII letters, but also Thai letters, Greek letters, etc.
If the source string isn't in UTF-8 format, C<\w> matches those characters
that are considered word characters by the current locale. Without
a locale in effect, C<\w> matches the ASCII letters, digits and the
underscore.

Any character that isn't matched by C<\w> will be matched by C<\W>.

=head3 White space

C<\s> matches any single character that is considered white space. In the
ASCII range, C<\s> matches the horizontal tab (C<\t>), the new line
(C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the
space (the vertical tab, C<\cK>, is not matched by C<\s>). The exact set
of characters matched by C<\s> depends on whether the source string is
in UTF-8 format. If it is, C<\s> matches what is considered white space
in the Unicode database. Otherwise, if there is a locale in effect, C<\s>
matches whatever is considered white space by the current locale. Without
a locale, C<\s> matches the five characters mentioned at the beginning
of this paragraph. Perhaps the most notable difference is that C<\s>
matches a non-breaking space only if the non-breaking space is in a
UTF-8 encoded string.

Any character that isn't matched by C<\s> will be matched by C<\S>.

C<\h> will match any character that is considered horizontal white space;
this includes the space and the tab characters. C<\H> will match any character
that is not considered horizontal white space.

C<\N>, like the dot, will match any character that is not a newline. The
difference is that C<\N> will not be influenced by the single line C</s>
regular expression modifier. (Note that, since C<\N{}> is also used for
named characters, if C<\N> is followed by an opening brace and something that
is not a quantifier, perl will assume that a character name is coming. For
example, C<\N{3}> means to match 3 non-newlines; C<\N{5,}> means to match 5 or
more non-newlines, but C<\N{4F}> is not a legal quantifier, and will cause
perl to look for a character named C<4F> (and won't find one unless custom names
have been defined that include it).)
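
As a small illustration of the difference between the dot and C<\N>
under C</s>, and of C<\N{3}> acting as a quantified C<\N>, consider the
following sketch. It assumes a perl recent enough to support C<\N> as a
character class; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = "abc\ndef";

# Under /s the dot crosses the newline, but \N still stops at it.
my ($dot_run)         = $text =~ /\A(.+)/s;
my ($non_newline_run) = $text =~ /\A(\N+)/s;

# \N{3} is a quantifier here: exactly three non-newline characters.
my $three_ok = "abc" =~ /\A\N{3}\z/ ? 1 : 0;
my $two_ok   = "ab"  =~ /\A\N{3}\z/ ? 1 : 0;

printf "dot run length: %d\n", length $dot_run;
printf "\\N run length:  %d\n", length $non_newline_run;
print  "three characters: $three_ok, two characters: $two_ok\n";
```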

C<\v> will match any character that is considered vertical white space;
this includes the carriage return and line feed characters (newline).
C<\V> will match any character that is not considered vertical white space.

C<\R> matches anything that can be considered a newline under Unicode
rules. It's not a character class, as it can match a multi-character
sequence. Therefore, it cannot be used inside a bracketed character
class. Details are discussed in L<perlrebackslash>.

C<\h>, C<\H>, C<\v>, C<\V>, and C<\R> are new in perl 5.10.0.

Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
the same characters, regardless of whether the source string is in UTF-8
format or not. The set of characters they match is also not influenced
by locale.

One might think that C<\s> is equivalent to C<[\h\v]>. This is not true.
The vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however,
considered vertical white space. Furthermore, if the source string is
not in UTF-8 format, the next line (C<"\x85">) and the no-break space
(C<"\xA0">) are not matched by C<\s>, but are by C<\v> and C<\h> respectively.
If the source string is in UTF-8 format, both the next line and the
no-break space are matched by C<\s>.

The following table is a complete listing of characters matched by
C<\s>, C<\h> and C<\v>.

The first column gives the code point of the character (in hex format),
the second column gives the (Unicode) name. The third column indicates
by which class(es) the character is matched.

 0x00009  CHARACTER TABULATION       h s
 0x0000a  LINE FEED (LF)              vs
 0x0000b  LINE TABULATION             v
 0x0000c  FORM FEED (FF)              vs
 0x0000d  CARRIAGE RETURN (CR)        vs
 0x00020  SPACE                      h s
 0x00085  NEXT LINE (NEL)             vs  [1]
 0x000a0  NO-BREAK SPACE             h s  [1]
 0x01680  OGHAM SPACE MARK           h s
 0x0180e  MONGOLIAN VOWEL SEPARATOR  h s
 0x02000  EN QUAD                    h s
 0x02001  EM QUAD                    h s
 0x02002  EN SPACE                   h s
 0x02003  EM SPACE                   h s
 0x02004  THREE-PER-EM SPACE         h s
 0x02005  FOUR-PER-EM SPACE          h s
 0x02006  SIX-PER-EM SPACE           h s
 0x02007  FIGURE SPACE               h s
 0x02008  PUNCTUATION SPACE          h s
 0x02009  THIN SPACE                 h s
 0x0200a  HAIR SPACE                 h s
 0x02028  LINE SEPARATOR              vs
 0x02029  PARAGRAPH SEPARATOR         vs
 0x0202f  NARROW NO-BREAK SPACE      h s
 0x0205f  MEDIUM MATHEMATICAL SPACE  h s
 0x03000  IDEOGRAPHIC SPACE          h s

=over 4

=item [1]

NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in
UTF-8 format.

=back

It is worth noting that C<\d>, C<\w>, etc, match single characters, not
complete numbers or words. To match a number (that consists of digits),
use C<\d+>; to match a word, use C<\w+>.
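
The difference between the bare class and its quantified form can be
seen in a short sketch such as the following; the sample text and
variable names are invented for the illustration.

```perl
use strict;
use warnings;

my $text = "Order 66 shipped 12 crates";

# \d on its own yields one digit per match.
my @digits = $text =~ /\d/g;

# \d+ and \w+ yield whole runs of digits and word characters.
my @numbers = $text =~ /\d+/g;
my @words   = $text =~ /\w+/g;

print "digits:  @digits\n";
print "numbers: @numbers\n";
print "words:   @words\n";
```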


=head3 Unicode Properties

C<\pP> and C<\p{Prop}> are character classes to match characters that
fit given Unicode classes. One-letter classes can be used in the C<\pP>
form, with the class name following the C<\p>; otherwise, braces are required.
There is a single form, which is just the property name enclosed in the braces,
and a compound form which looks like C<\p{name=value}>, which means to match
if the property C<name> for the character has the particular C<value>.
For instance, a match for a number can be written as C</\pN/> or as
C</\p{Number}/>, or as C</\p{Number=True}/>.
Lowercase letters are matched by the property I<Lowercase_Letter>, which
has the short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or
C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/>
(the underscores are optional).
C</\pLl/> is valid, but means something different.
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.
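
The distinction between C</\p{Ll}/> and C</\pLl/> is easy to get wrong,
so here is a minimal sketch that exercises both. The variable names are
chosen only for this example.

```perl
use strict;
use warnings;

# Braced forms: a single lowercase letter.
my $braced_short = "a" =~ /\A\p{Ll}\z/               ? 1 : 0;
my $braced_long  = "a" =~ /\A\p{Lowercase_Letter}\z/ ? 1 : 0;

# Unbraced \pLl: any letter (\pL) followed by a literal "l".
my $letter_then_l = "Al" =~ /\A\pLl\z/ ? 1 : 0;
my $single_letter = "a"  =~ /\A\pLl\z/ ? 1 : 0;

print "\\p{Ll} on 'a':               $braced_short\n";
print "\\p{Lowercase_Letter} on 'a': $braced_long\n";
print "\\pLl on 'Al':                $letter_then_l\n";
print "\\pLl on 'a':                 $single_letter\n";
```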

For more details, see L<perlunicode/Unicode Character Properties>; for a
complete list of possible properties, see
L<perluniprops/Properties accessible through \p{} and \P{}>.
It is also possible to define your own properties. This is discussed in
L<perlunicode/User-Defined Character Properties>.


=head4 Examples

 "a"  =~  /\w/      # Match, "a" is a 'word' character.
 "7"  =~  /\w/      # Match, "7" is a 'word' character as well.
 "a"  =~  /\d/      # No match, "a" isn't a digit.
 "7"  =~  /\d/      # Match, "7" is a digit.
 " "  =~  /\s/      # Match, a space is white space.
 "a"  =~  /\D/      # Match, "a" is a non-digit.
 "7"  =~  /\D/      # No match, "7" is not a non-digit.
 " "  =~  /\S/      # No match, a space is not non-white space.

 " "  =~  /\h/      # Match, space is horizontal white space.
 " "  =~  /\v/      # No match, space is not vertical white space.
 "\r" =~  /\v/      # Match, a return is vertical white space.

 "a"  =~  /\pL/     # Match, "a" is a letter.
 "a"  =~  /\p{Lu}/  # No match, /\p{Lu}/ matches upper case letters.

 "\x{0e0b}" =~ /\p{Thai}/  # Match, \x{0e0b} is the character
                           # 'THAI CHARACTER SO SO', and that's in
                           # the Thai Unicode class.
 "a"  =~  /\P{Lao}/ # Match, as "a" is not a Lao character.


=head2 Bracketed Character Classes

The third form of character class you can use in Perl regular expressions
is the bracketed form. In its simplest form, it lists the characters
that may be matched inside square brackets, like this: C<[aeiou]>.
This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Just as with the other
character classes, exactly one character will be matched. To match
a longer string consisting of characters mentioned in the character
class, follow the character class with a quantifier. For instance,
C<[aeiou]+> matches a string of one or more lowercase ASCII vowels.

Repeating a character in a character class has no
effect; it's considered to be in the set only once.

Examples:

 "e"  =~  /[aeiou]/        # Match, as "e" is listed in the class.
 "p"  =~  /[aeiou]/        # No match, "p" is not listed in the class.
 "ae" =~  /^[aeiou]$/      # No match, a character class only matches
                           # a single character.
 "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.

=head3 Special Characters Inside a Bracketed Character Class

Most characters that are meta characters in regular expressions (that
is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose
their special meaning and can be used inside a character class without
the need to escape them. For instance, C<[()]> matches either an opening
parenthesis, or a closing parenthesis, and the parens inside the character
class don't group or capture.

Characters that may carry a special meaning inside a character class are:
C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be
escaped with a backslash, although this is sometimes not needed, in which
case the backslash may be omitted.
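
When a class has to be built from arbitrary characters, escaping every
one of them sidesteps the question of which positions are special. The
following is one possible sketch; C<class_from_chars> is a hypothetical
helper, and it assumes it is given at least one character.

```perl
use strict;
use warnings;

# Hypothetical helper: build a bracketed class from a list of
# characters. quotemeta() backslashes every non-word character, so
# ']', '^', '-' and '\' cannot be taken as class syntax.
sub class_from_chars {
    my @chars = @_;
    my $body = join '', map { quotemeta } @chars;
    return qr/[${body}]/;
}

my $class = class_from_chars(']', '^', '-', '\\');

# Only the four listed characters should match.
my @matched = grep { $_ =~ $class } (']', '^', '-', '\\', 'a', '[');
print "matched: @matched\n";
```

Escaping more than strictly necessary is harmless here, since a
backslashed punctuation character inside a class stands for itself.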

The sequence C<\b> is special inside a bracketed character class. While
outside the character class C<\b> is an assertion indicating a point
that does not have either two word characters or two non-word characters
on either side, inside a bracketed character class, C<\b> matches a
backspace character.

The sequences
C<\a>,
C<\c>,
C<\e>,
C<\f>,
C<\n>,
C<\N{I<NAME>}>,
C<\N{U+I<wide hex char>}>,
C<\r>,
C<\t>,
and
C<\x>
are also special and have the same meanings as they do outside a bracketed character
class.

Also, a backslash followed by digits is considered an octal number.

A C<[> is not special inside a character class, unless it's the start
of a POSIX character class (see below). It normally does not need escaping.

A C<]> is either the end of a POSIX character class (see below), or it
signals the end of the bracketed character class. Normally it needs
escaping if you want to include a C<]> in the set of characters.
However, if the C<]> is the I<first> (or the second if the first
character is a caret) character of a bracketed character class, it
does not denote the end of the class (as you cannot have an empty class)
and is considered part of the set of characters that can be matched without
escaping.

Examples:

 "+"   =~ /[+?*]/     # Match, "+" in a character class is not special.
 "\cH" =~ /[\b]/      # Match, \b inside a character class
                      # is equivalent to a backspace.
 "]"   =~ /[][]/      # Match, as the character class contains
                      # both [ and ].
 "[]"  =~ /[[]]/      # Match, the pattern contains a character class
                      # containing just [, and the character class is
                      # followed by a ].

=head3 Character Ranges

It is not uncommon to want to match a range of characters. Luckily, instead
of listing all the characters in the range, one may use the hyphen (C<->).
If inside a bracketed character class you have two characters separated
by a hyphen, it's treated as if all the characters between the two are in
the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]>
matches any lowercase letter from the first half of the ASCII alphabet.

Note that the two characters on either side of the hyphen need not
both be letters or both be digits. Any character is possible,
although not advisable. C<['-?]> contains a range of characters, but
most people will not know which characters that will be. Furthermore,
such ranges may lead to portability problems if the code has to run on
a platform that uses a different character set, such as EBCDIC.

If a hyphen in a character class cannot be part of a range, for instance
because it is the first or the last character of the character class,
or if it immediately follows a range, the hyphen isn't special, and will be
considered a character that may be matched. You have to escape the hyphen
with a backslash if you want to have a hyphen in your set of characters to
be matched, and its position in the class is such that it can be considered
part of a range.

Examples:

 [a-z]       #  Matches a character that is a lower case ASCII letter.
 [a-fz]      #  Matches any letter between 'a' and 'f' (inclusive) or the
             #  letter 'z'.
 [-z]        #  Matches either a hyphen ('-') or the letter 'z'.
 [a-f-m]     #  Matches any letter between 'a' and 'f' (inclusive), the
             #  hyphen ('-'), or the letter 'm'.
 ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
             #  (But not on an EBCDIC platform).
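
If it is unclear what a range covers, the members can simply be
enumerated. The sketch below does so for the printable ASCII
characters; C<class_members> is a hypothetical helper, and the output
assumes an ASCII platform.

```perl
use strict;
use warnings;

# Hypothetical helper: return, in code point order, the printable
# ASCII characters (32 to 126) that the given class matches.
sub class_members {
    my ($class) = @_;
    return join '', grep { $_ =~ $class } map { chr } 32 .. 126;
}

print class_members(qr/['-?]/),   "\n";
print class_members(qr/[a-f-m]/), "\n";
print class_members(qr/[-z]/),    "\n";
```

Because the members are listed in code point order, the hyphen in
C<[a-f-m]> is printed before the letters.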


=head3 Negation

It is also possible to instead list the characters you do not want to
match. You can do so by using a caret (C<^>) as the first character in the
character class. For instance, C<[^a-z]> matches a character that is not a
lowercase ASCII letter.

This syntax makes the caret a special character inside a bracketed character
class, but only if it is the first character of the class. So if you want
to have the caret as one of the characters you want to match, you either
have to escape the caret, or not list it first.

Examples:

 "e"  =~  /[^aeiou]/   #  No match, the 'e' is listed.
 "x"  =~  /[^aeiou]/   #  Match, as 'x' isn't a lowercase vowel.
 "^"  =~  /[^^]/       #  No match, matches anything that isn't a caret.
 "^"  =~  /[x^]/       #  Match, caret is not special here.

=head3 Backslash Sequences

You can put any backslash sequence character class (with one exception listed
in the next paragraph) inside a bracketed character class, and it will act just
as if you put all the characters matched by the backslash sequence inside the
character class. For instance, C<[a-f\d]> will match any digit, or any of the
lowercase letters between 'a' and 'f' inclusive.
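
A quick sketch of C<[a-f\d]> against a few ASCII sample characters; the
hash name C<%in_class> is made up for the example.

```perl
use strict;
use warnings;

my $class = qr/\A[a-f\d]\z/;

# Record, for each sample character, whether the class matches it.
my %in_class = map { $_ => ($_ =~ $class ? 1 : 0) } qw(c 7 g F);

print "$_: $in_class{$_}\n" for sort keys %in_class;
```

The uppercase C<F> is not matched, since only the lowercase range is
listed in the class.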

C<\N> within a bracketed character class must be of the forms C<\N{I<name>}> or
C<\N{U+I<wide hex char>}> for the same reason that a dot C<.> inside a
bracketed character class loses its special meaning: it matches nearly
anything, which generally isn't what you want to happen.

Examples:

 /[\p{Thai}\d]/     # Matches a character that is either a Thai
                    # character, or a digit.
 /[^\p{Arabic}()]/  # Matches a character that is neither an Arabic
                    # character, nor a parenthesis.

Backslash sequence character classes cannot form one of the endpoints
of a range.

=head3 Posix Character Classes

Posix character classes have the form C<[:class:]>, where I<class> is the
name, and C<[:> and C<:]> are the delimiters. Posix character classes appear
I<inside> bracketed character classes, and are a convenient and descriptive
way of listing a group of characters. Be careful about the syntax:

 # Correct:
 $string =~ /[[:alpha:]]/

 # Incorrect (will warn):
 $string =~ /[:alpha:]/

The latter pattern would be a character class consisting of a colon,
and the letters C<a>, C<l>, C<p> and C<h>.
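
The difference between the two forms shows up as soon as they are tried
on a few characters. In this sketch the warning for the mistaken form
is silenced on purpose, assuming it belongs to the C<regexp> warnings
category; the variable names are invented for the illustration.

```perl
use strict;
use warnings;

# Correct form: the POSIX class sits inside a bracketed class.
my $posix = qr/\A[[:alpha:]]\z/;

# Mistaken form: an ordinary class listing ':', 'a', 'l', 'p' and 'h'.
# perl warns about it, so the warning is disabled for this demonstration.
my $mistake = do { no warnings 'regexp'; qr/\A[:alpha:]\z/ };

for my $char ("z", "h", ":") {
    printf "%s  [[:alpha:]]: %-3s  [:alpha:]: %s\n", $char,
        ($char =~ $posix   ? "yes" : "no"),
        ($char =~ $mistake ? "yes" : "no");
}
```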

Perl recognizes the following POSIX character classes:

 alpha   Any alphabetical character.
 alnum   Any alphanumerical character.
 ascii   Any ASCII character.
 blank   A GNU extension, equal to a space or a horizontal tab ("\t").
 cntrl   Any control character.
 digit   Any digit, equivalent to "\d".
 graph   Any printable character, excluding a space.
 lower   Any lowercase character.
 print   Any printable character, including a space.
 punct   Any punctuation character.
 space   Any white space character. "\s" plus the vertical tab ("\cK").
 upper   Any uppercase character.
 word    Any "word" character, equivalent to "\w".
 xdigit  Any hexadecimal digit, '0' - '9', 'a' - 'f', 'A' - 'F'.

The exact set of characters matched depends on whether the source string
is internally in UTF-8 format or not. See L</Locale, Unicode and UTF-8>.

Most POSIX character classes have C<\p> counterparts. The difference
is that the C<\p> classes will always match according to the Unicode
properties, regardless of whether the string is in UTF-8 format or not.

The following table shows the relation between POSIX character classes
and the Unicode properties:

 [[:...:]]   \p{...}      backslash

 alpha       IsAlpha
 alnum       IsAlnum
 ascii       IsASCII
 blank
 cntrl       IsCntrl
 digit       IsDigit      \d
 graph       IsGraph
 lower       IsLower
 print       IsPrint
 punct       IsPunct
 space       IsSpace
             IsSpacePerl  \s
 upper       IsUpper
 word        IsWord
 xdigit      IsXDigit

Some of these names may not be obvious:

=over 4

=item cntrl

Any control character. Usually, control characters don't produce output
as such, but instead control the terminal somehow: for example newline
and backspace are control characters. All characters with C<ord()> less
than 32 are usually classified as control characters (in ASCII, the ISO
Latin character sets, and Unicode), as is the character with the C<ord()>
value of 127 (C<DEL>).

=item graph

Any character that is I<graphical>, that is, visible. This class consists
of all the alphanumerical characters and all punctuation characters.

=item print

All printable characters, which is the set of all the graphical characters
plus the space.

=item punct

Any punctuation (special) character.

=back

=head4 Negation

A Perl extension to the POSIX character class is the ability to
negate it. This is done by prefixing the class name with a caret (C<^>).
Some examples:

 POSIX         Unicode       Backslash
 [[:^digit:]]  \P{IsDigit}   \D
 [[:^space:]]  \P{IsSpace}   \S
 [[:^word:]]   \P{IsWord}    \W
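
The correspondence between the negated POSIX form and the backslash
form can be spot-checked over the ASCII range. This is only a sketch:
C<disagreements> is a hypothetical helper, and the check is limited to
C<digit> and C<word> on non-UTF-8 strings with no locale in effect.

```perl
use strict;
use warnings;

# Hypothetical helper: list the ASCII characters on which the two
# patterns give different answers.
sub disagreements {
    my ($posix, $backslash) = @_;
    return grep { ($_ =~ $posix ? 1 : 0) != ($_ =~ $backslash ? 1 : 0) }
           map  { chr } 0 .. 127;
}

my @digit_diff = disagreements(qr/\A[[:^digit:]]\z/, qr/\A\D\z/);
my @word_diff  = disagreements(qr/\A[[:^word:]]\z/,  qr/\A\W\z/);

printf "digit disagreements: %d\n", scalar @digit_diff;
printf "word disagreements:  %d\n", scalar @word_diff;
```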

=head4 [= =] and [. .]

Perl will recognize the POSIX character classes C<[=class=]> and
C<[.class.]>, but does not (yet?) support these constructs. Use of
such a construct will lead to an error.


=head4 Examples

 /[[:digit:]]/            # Matches a character that is a digit.
 /[01[:lower:]]/          # Matches a character that is either a
                          # lowercase letter, or '0' or '1'.
 /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything,
                          # but the letters 'a' to 'f' in either case.
                          # This is because the character class contains
                          # all digits, and anything that isn't a
                          # hex digit, resulting in a class containing
                          # all characters, but the letters 'a' to 'f'
                          # and 'A' to 'F'.


=head2 Locale, Unicode and UTF-8

Some of the character classes have a somewhat different behaviour depending
on the internal encoding of the source string, and the locale that is
in effect.

C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
including C<\W>, C<\D>, C<\S>) suffer from this behaviour.

The rule is that if the source string is in UTF-8 format, the character
classes match according to the Unicode properties. If the source string
isn't, then the character classes match according to whatever locale is
in effect. If there is no locale, they match the ASCII defaults
(52 letters, 10 digits and underscore for C<\w>, 0 to 9 for C<\d>, etc).

This usually means that if you are matching against characters whose C<ord()>
values are between 128 and 255 inclusive, your character class may match
or not depending on the current locale, and whether the source string is
in UTF-8 format. The string will be in UTF-8 format if it contains
characters whose C<ord()> value exceeds 255. But a string may be in UTF-8
format without it having such characters.

For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
or the POSIX character classes, and use the Unicode properties instead.

=head4 Examples

 $str =  "\xDF";      # $str is not in UTF-8 format.
 $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
 $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
 $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
 chop $str;
 $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.
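
The internal format that these examples depend on can be inspected with
C<utf8::is_utf8>. The sketch below follows the same three steps and
records only the flag, not the match results; the array name C<@flags>
is made up for the example.

```perl
use strict;
use warnings;

my @flags;

my $str = "\xDF";                           # A single character below 256.
push @flags, utf8::is_utf8($str) ? 1 : 0;

$str .= "\x{0e0b}";                         # Appending a wide character
push @flags, utf8::is_utf8($str) ? 1 : 0;   # switches the internal format.

chop $str;                                  # Removing it again does not
push @flags, utf8::is_utf8($str) ? 1 : 0;   # switch the format back.

print "UTF-8 flag after each step: @flags\n";
```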

=cut