Commit | Line | Data |
8a118206 |
1 | =head1 NAME |
ea449505 |
2 | X<character class> |
8a118206 |
3 | |
4 | perlrecharclass - Perl Regular Expression Character Classes |
5 | |
6 | =head1 DESCRIPTION |
7 | |
8 | The top level documentation about Perl regular expressions |
9 | is found in L<perlre>. |
10 | |
11 | This manual page discusses the syntax and use of character |
6b83a163 |
12 | classes in Perl regular expressions. |
8a118206 |
13 | |
6b83a163 |
14 | A character class is a way of denoting a set of characters |
8a118206 |
15 | in such a way that one character of the set is matched. |
6b83a163 |
16 | It's important to remember that matching a character class |
8a118206 |
17 | consumes exactly one character in the source string. (The source |
18 | string is the string the regular expression is matched against.) |
19 | |
20 | There are three types of character classes in Perl regular |
6b83a163 |
21 | expressions: the dot, backslash sequences, and the form enclosed in square |
ea449505 |
22 | brackets. Keep in mind, though, that often the term "character class" is used |
6b83a163 |
23 | to mean just the bracketed form. Certainly, most Perl documentation does that. |
8a118206 |
24 | |
25 | =head2 The dot |
26 | |
27 | The dot (or period), C<.>, is probably the most used, and certainly |
28 | the best-known, character class. By default, a dot matches any |
29 | character except the newline. The default can be changed so that the |
6b83a163 |
30 | dot also matches the newline by using the I<single line> modifier: either |
31 | for the entire regular expression with the C</s> modifier, or |
32 | locally with C<(?s)>. (The experimental C<\N> backslash sequence, described |
33 | below, matches any character except newline without regard to the |
34 | I<single line> modifier.) |
8a118206 |
35 | |
36 | Here are some examples: |
37 | |
38 | "a" =~ /./ # Match |
39 | "." =~ /./ # Match |
40 | "" =~ /./ # No match (dot has to match a character) |
41 | "\n" =~ /./ # No match (dot does not match a newline) |
42 | "\n" =~ /./s # Match (global 'single line' modifier) |
43 | "\n" =~ /(?s:.)/ # Match (local 'single line' modifier) |
44 | "ab" =~ /^.$/ # No match (dot matches one character) |
45 | |
6b83a163 |
46 | =head2 Backslash sequences |
ea449505 |
47 | X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P> |
48 | X<\N> X<\v> X<\V> X<\h> X<\H> |
49 | X<word> X<whitespace> |
8a118206 |
50 | |
6b83a163 |
51 | A backslash sequence is a sequence of characters, the first one of which is a |
52 | backslash. Perl ascribes special meaning to many such sequences, and some of |
53 | these are character classes. That is, they match a single character each, |
54 | provided that the character belongs to the specific set of characters defined |
55 | by the sequence. |
8a118206 |
56 | |
6b83a163 |
57 | Here's a list of the backslash sequences that are character classes. They |
58 | are discussed in more detail below. (For the backslash sequences that aren't |
59 | character classes, see L<perlrebackslash>.) |
8a118206 |
60 | |
6b83a163 |
61 | \d Match a decimal digit character. |
62 | \D Match a non-decimal-digit character. |
8a118206 |
63 | \w Match a "word" character. |
64 | \W Match a non-"word" character. |
ea449505 |
65 | \s Match a whitespace character. |
66 | \S Match a non-whitespace character. |
67 | \h Match a horizontal whitespace character. |
68 | \H Match a character that isn't horizontal whitespace. |
ea449505 |
69 | \v Match a vertical whitespace character. |
70 | \V Match a character that isn't vertical whitespace. |
6b83a163 |
71 | \N Match a character that isn't a newline. Experimental. |
72 | \pP, \p{Prop} Match a character that has the given Unicode property. |
6c5a041f |
73 | \PP, \P{Prop} Match a character that doesn't have the Unicode property |
8a118206 |
74 | |
75 | =head3 Digits |
76 | |
6b83a163 |
77 | C<\d> matches a single character that is considered to be a decimal I<digit>. |
78 | What is considered a decimal digit depends on the internal encoding of the |
79 | source string and the locale that is in effect. If the source string is in |
80 | UTF-8 format, C<\d> not only matches the digits '0' - '9', but also Arabic, |
81 | Devanagari and digits from other languages. Otherwise, if there is a locale in |
82 | effect, it will match whatever characters the locale considers decimal digits. |
83 | Without a locale, C<\d> matches just the digits '0' to '9'. |
84 | See L</Locale, EBCDIC, Unicode and UTF-8>. |
85 | |
86 | Unicode digits may cause some confusion, and some security issues. In UTF-8 |
87 | strings, C<\d> matches the same characters matched by |
88 | C<\p{General_Category=Decimal_Number}>, or synonymously, |
89 | C<\p{General_Category=Digit}>. Starting with Unicode version 4.1, this is the |
90 | same set of characters matched by C<\p{Numeric_Type=Decimal}>. |
91 | |
92 | But Unicode also has a different property with a similar name, |
93 | C<\p{Numeric_Type=Digit}>, which matches a completely different set of |
94 | characters. These characters are things such as subscripts. |
95 | |
96 | The design intent is for C<\d> to match all the digits (and no other characters) |
97 | that can be used with "normal" big-endian positional decimal syntax, whereby a |
98 | sequence of such digits {N0, N1, N2, ...Nn} has the numeric value (...(N0 * 10 |
99 | + N1) * 10 + N2) * 10 ... + Nn). In Unicode 5.2, the Tamil digits (U+0BE6 - |
100 | U+0BEF) can also legally be used in old-style Tamil numbers in which they would |
101 | appear no more than one in a row, separated by characters that mean "times 10", |
102 | "times 100", etc. (See L<http://www.unicode.org/notes/tn21>.) |
103 | |
104 | Some of the non-European digits that C<\d> matches look like European ones, but |
105 | have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks very much |
106 | like an ASCII DIGIT EIGHT (U+0038). |
107 | |
108 | It may be useful for security purposes for an application to require that all |
109 | digits in a row be from the same script. See L<Unicode::UCD/charscript()>. |
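
The same-script check suggested above could be sketched as follows. This is only an illustration: C<digits_share_script> is a hypothetical helper name, not part of Perl or of L<Unicode::UCD>, and it assumes that requiring one script for all digits in the string is the policy the application wants.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper: true if every character matched by \d in $str
# belongs to the same Unicode script.
sub digits_share_script {
    my ($str) = @_;
    my %seen;
    # charscript() takes a code point and returns the script name.
    $seen{ charscript(ord $_) // 'Unknown' } = 1 for $str =~ /\d/g;
    return keys(%seen) <= 1;
}

# ASCII digits only.
my $ascii_only = digits_share_script("2024");

# ASCII DIGIT ONE followed by BENGALI DIGIT FOUR (U+09EA).
my $mixed = digits_share_script("1\x{09EA}");
```

Here C<$ascii_only> is true and C<$mixed> is false, because the two digits in the second string come from different scripts.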
8a118206 |
110 | |
111 | Any character that isn't matched by C<\d> will be matched by C<\D>. |
112 | |
113 | =head3 Word characters |
114 | |
ea449505 |
115 | A C<\w> matches a single alphanumeric character (an alphabetic character, or a |
6b83a163 |
116 | decimal digit) or an underscore (C<_>), not a whole word. To match a whole |
117 | word, use C<\w+>. This isn't the same thing as matching an English word, but |
118 | is the same as a string of Perl-identifier characters. What is considered a |
119 | word character depends on the internal |
ea449505 |
120 | encoding of the string and the locale or EBCDIC code page that is in effect. If |
121 | it's in UTF-8 format, C<\w> matches those characters that are considered word |
122 | characters in the Unicode database. That is, it not only matches ASCII letters, |
123 | but also Thai letters, Greek letters, etc. If the source string isn't in UTF-8 |
124 | format, C<\w> matches those characters that are considered word characters by |
125 | the current locale or EBCDIC code page. Without a locale or EBCDIC code page, |
126 | C<\w> matches the ASCII letters, digits and the underscore. |
127 | See L</Locale, EBCDIC, Unicode and UTF-8>. |
8a118206 |
128 | |
6b83a163 |
129 | There are a number of security issues with the full Unicode list of word |
130 | characters. See L<http://unicode.org/reports/tr36>. |
131 | |
132 | Also, for a somewhat finer-grained set of characters that are in programming |
133 | language identifiers beyond the ASCII range, you may wish to instead use the |
134 | more customized Unicode properties, "ID_Start", "ID_Continue", "XID_Start", and |
135 | "XID_Continue". See L<http://unicode.org/reports/tr31>. |
136 | |
8a118206 |
137 | Any character that isn't matched by C<\w> will be matched by C<\W>. |
138 | |
ea449505 |
139 | =head3 Whitespace |
140 | |
6b83a163 |
141 | C<\s> matches any single character that is considered whitespace. The exact |
142 | set of characters matched by C<\s> depends on whether the source string is in |
143 | UTF-8 format and the locale or EBCDIC code page that is in effect. If it's in |
144 | UTF-8 format, C<\s> matches what is considered whitespace in the Unicode |
145 | database; the complete list is in the table below. Otherwise, if there is a |
146 | locale or EBCDIC code page in effect, C<\s> matches whatever is considered |
147 | whitespace by the current locale or EBCDIC code page. Without a locale or |
148 | EBCDIC code page, C<\s> matches the horizontal tab (C<\t>), the newline |
149 | (C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the space. |
150 | (Note that it doesn't match the vertical tab, C<\cK>.) Perhaps the most notable |
151 | possible surprise is that C<\s> matches a non-breaking space only if the |
152 | non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC code |
153 | page that is in effect has that character. |
ea449505 |
154 | See L</Locale, EBCDIC, Unicode and UTF-8>. |
8a118206 |
155 | |
156 | Any character that isn't matched by C<\s> will be matched by C<\S>. |
157 | |
ea449505 |
158 | C<\h> will match any character that is considered horizontal whitespace; |
6b83a163 |
159 | this includes the space and the tab characters and a number of other characters, |
160 | all of which are listed in the table below. C<\H> will match any character |
ea449505 |
161 | that is not considered horizontal whitespace. |
162 | |
ea449505 |
163 | C<\v> will match any character that is considered vertical whitespace; |
6b83a163 |
164 | this includes the carriage return and line feed characters (newline) plus several |
165 | other characters, all listed in the table below. |
ea449505 |
166 | C<\V> will match any character that is not considered vertical whitespace. |
8a118206 |
167 | |
168 | C<\R> matches anything that can be considered a newline under Unicode |
169 | rules. It's not a character class, as it can match a multi-character |
170 | sequence. Therefore, it cannot be used inside a bracketed character |
ea449505 |
171 | class; use C<\v> instead (vertical whitespace). |
172 | Details are discussed in L<perlrebackslash>. |
8a118206 |
173 | |
174 | Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match |
175 | the same characters, regardless of whether the source string is in UTF-8 |
176 | format or not. The set of characters they match is also not influenced |
c1c4ae3a |
177 | by locale or EBCDIC code page. |
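
A small sketch of that independence, assuming an ASCII platform: the same NO-BREAK SPACE character is tested once as a non-UTF-8 string and once after C<utf8::upgrade>, and C<\h> gives the same answer both times. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $nbsp_bytes = "\xA0";        # NO-BREAK SPACE, not stored as UTF-8
my $nbsp_utf8  = "\xA0";
utf8::upgrade($nbsp_utf8);      # same character, now stored as UTF-8

my $h_bytes = $nbsp_bytes =~ /\h/ ? 1 : 0;   # horizontal whitespace
my $h_utf8  = $nbsp_utf8  =~ /\h/ ? 1 : 0;   # same result after upgrade
my $v_tab   = "\x0b"      =~ /\v/ ? 1 : 0;   # vertical tab is vertical whitespace
my $v_space = " "         =~ /\v/ ? 1 : 0;   # a space is not
```

The first three results are 1 and the last is 0.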
8a118206 |
178 | |
ea449505 |
179 | One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. The |
180 | vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however, considered |
181 | vertical whitespace. Furthermore, if the source string is not in UTF-8 format, |
182 | and any locale or EBCDIC code page that is in effect doesn't include them, the |
6b83a163 |
183 | next line (ASCII-platform C<"\x85">) and the no-break space (ASCII-platform |
184 | C<"\xA0">) characters are not matched by C<\s>, but are by C<\v> and C<\h> |
185 | respectively. If the source string is in UTF-8 format, both the next line and |
186 | the no-break space are matched by C<\s>. |
8a118206 |
187 | |
188 | The following table is a complete listing of characters matched by |
ea449505 |
189 | C<\s>, C<\h> and C<\v> as of Unicode 5.2. |
8a118206 |
190 | |
191 | The first column gives the code point of the character (in hex format), |
192 | the second column gives the (Unicode) name. The third column indicates |
ea449505 |
193 | by which class(es) the character is matched (assuming no locale or EBCDIC code |
194 | page is in effect that changes the C<\s> matching). |
8a118206 |
195 | |
196 | 0x00009 CHARACTER TABULATION h s |
197 | 0x0000a LINE FEED (LF) vs |
198 | 0x0000b LINE TABULATION v |
199 | 0x0000c FORM FEED (FF) vs |
200 | 0x0000d CARRIAGE RETURN (CR) vs |
201 | 0x00020 SPACE h s |
202 | 0x00085 NEXT LINE (NEL) vs [1] |
203 | 0x000a0 NO-BREAK SPACE h s [1] |
204 | 0x01680 OGHAM SPACE MARK h s |
205 | 0x0180e MONGOLIAN VOWEL SEPARATOR h s |
206 | 0x02000 EN QUAD h s |
207 | 0x02001 EM QUAD h s |
208 | 0x02002 EN SPACE h s |
209 | 0x02003 EM SPACE h s |
210 | 0x02004 THREE-PER-EM SPACE h s |
211 | 0x02005 FOUR-PER-EM SPACE h s |
212 | 0x02006 SIX-PER-EM SPACE h s |
213 | 0x02007 FIGURE SPACE h s |
214 | 0x02008 PUNCTUATION SPACE h s |
215 | 0x02009 THIN SPACE h s |
216 | 0x0200a HAIR SPACE h s |
217 | 0x02028 LINE SEPARATOR vs |
218 | 0x02029 PARAGRAPH SEPARATOR vs |
219 | 0x0202f NARROW NO-BREAK SPACE h s |
220 | 0x0205f MEDIUM MATHEMATICAL SPACE h s |
221 | 0x03000 IDEOGRAPHIC SPACE h s |
222 | |
223 | =over 4 |
224 | |
225 | =item [1] |
226 | |
227 | NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in |
ea449505 |
228 | UTF-8 format, or the locale or EBCDIC code page that is in effect includes them. |
8a118206 |
229 | |
230 | =back |
231 | |
232 | It is worth noting that C<\d>, C<\w>, etc, match single characters, not |
233 | complete numbers or words. To match a number (that consists of digits), |
234 | use C<\d+>; to match a word, use C<\w+>. |
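
As a brief illustration of the quantified forms (the sample string and variable names are made up for this sketch):

```perl
use strict;
use warnings;

my $text = "order 66 shipped";

# \d+ captures a whole run of digits rather than a single digit.
my ($number) = $text =~ /(\d+)/;

# \w+ with /g in list context returns every run of word characters.
my @words = $text =~ /(\w+)/g;
```

After this, C<$number> holds C<66> and C<@words> holds the three strings C<order>, C<66> and C<shipped>.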
235 | |
6b83a163 |
236 | =head3 \N |
237 | |
238 | C<\N> is new in 5.12, and is experimental. It, like the dot, will match any |
239 | character that is not a newline. The difference is that C<\N> is not influenced |
240 | by the I<single line> regular expression modifier (see L</The dot> above). Note |
241 | that the form C<\N{...}> may mean something completely different. When the |
242 | C<{...}> is a L<quantifier|perlre/Quantifiers>, it means to match a non-newline |
243 | character that many times. For example, C<\N{3}> means to match 3 |
244 | non-newlines; C<\N{5,}> means to match 5 or more non-newlines. But if C<{...}> |
245 | is not a legal quantifier, it is presumed to be a named character. See |
246 | L<charnames> for those. For example, none of C<\N{COLON}>, C<\N{4F}>, and |
247 | C<\N{F4}> contain legal quantifiers, so Perl will try to find characters whose |
248 | names are, respectively, C<COLON>, C<4F>, and C<F4>. |
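
The following sketch contrasts C<\N> with the dot and shows the two meanings of C<\N{...}>. It assumes Perl 5.12 or later; C<use charnames ':full'> is loaded explicitly so that the named-character form resolves on older releases too.

```perl
use 5.012;
use warnings;
use charnames ':full';

# Under /s the dot matches a newline, but \N still does not.
my $dot_with_s = "a\nb" =~ /a.b/s  ? 1 : 0;
my $N_with_s   = "a\nb" =~ /a\Nb/s ? 1 : 0;

# {3} is a legal quantifier: three non-newline characters.
my $three = "abc" =~ /^\N{3}$/ ? 1 : 0;

# {COLON} is not a quantifier, so this is the named character ":".
my $colon = ":" =~ /^\N{COLON}$/ ? 1 : 0;
```

Here C<$dot_with_s>, C<$three> and C<$colon> are 1, while C<$N_with_s> is 0.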
8a118206 |
249 | |
250 | =head3 Unicode Properties |
251 | |
c1c4ae3a |
252 | C<\pP> and C<\p{Prop}> are character classes to match characters that fit given |
253 | Unicode properties. One-letter property names can be used in the C<\pP> form, |
254 | with the property name following the C<\p>; otherwise, braces are required. |
255 | When using braces, there is a single form, which is just the property name |
256 | enclosed in the braces, and a compound form which looks like C<\p{name=value}>, |
257 | which means to match if the property "name" for the character has the particular |
258 | "value". |
e1b711da |
259 | For instance, a match for a number can be written as C</\pN/> or as |
260 | C</\p{Number}/>, or as C</\p{Number=True}/>. |
261 | Lowercase letters are matched by the property I<Lowercase_Letter> which |
262 | has the short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or |
263 | C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/> |
264 | (the underscores are optional). |
265 | C</\pLl/> is valid, but means something different. |
8a118206 |
266 | It matches a two character string: a letter (Unicode property C<\pL>), |
267 | followed by a lowercase C<l>. |
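
The difference between C<\pLl> and C<\p{Ll}> can be seen in a short sketch (the test strings are arbitrary):

```perl
use strict;
use warnings;

# \pLl is \pL (any letter) followed by a literal "l": two characters.
my $two_chars = "Al" =~ /^\pLl$/ ? 1 : 0;
my $one_char  = "a"  =~ /^\pLl$/ ? 1 : 0;

# \p{Ll} is a single lowercase letter.
my $lower = "a" =~ /^\p{Ll}$/ ? 1 : 0;
my $upper = "A" =~ /^\p{Ll}$/ ? 1 : 0;
```

C<$two_chars> and C<$lower> are 1; C<$one_char> and C<$upper> are 0.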
268 | |
e1b711da |
269 | For more details, see L<perlunicode/Unicode Character Properties>; for a |
270 | complete list of possible properties, see |
271 | L<perluniprops/Properties accessible through \p{} and \P{}>. |
272 | It is also possible to define your own properties. This is discussed in |
8a118206 |
273 | L<perlunicode/User-Defined Character Properties>. |
274 | |
275 | |
276 | =head4 Examples |
277 | |
278 | "a" =~ /\w/ # Match, "a" is a 'word' character. |
279 | "7" =~ /\w/ # Match, "7" is a 'word' character as well. |
280 | "a" =~ /\d/ # No match, "a" isn't a digit. |
281 | "7" =~ /\d/ # Match, "7" is a digit. |
ea449505 |
282 | " " =~ /\s/ # Match, a space is whitespace. |
8a118206 |
283 | "a" =~ /\D/ # Match, "a" is a non-digit. |
284 | "7" =~ /\D/ # No match, "7" is not a non-digit. |
ea449505 |
285 | " " =~ /\S/ # No match, a space is not non-whitespace. |
8a118206 |
286 | |
ea449505 |
287 | " " =~ /\h/ # Match, space is horizontal whitespace. |
288 | " " =~ /\v/ # No match, space is not vertical whitespace. |
289 | "\r" =~ /\v/ # Match, a return is vertical whitespace. |
8a118206 |
290 | |
291 | "a" =~ /\pL/ # Match, "a" is a letter. |
292 | "a" =~ /\p{Lu}/ # No match, /\p{Lu}/ matches upper case letters. |
293 | |
294 | "\x{0e0b}" =~ /\p{Thai}/ # Match, \x{0e0b} is the character |
295 | # 'THAI CHARACTER SO SO', and that's in |
296 | # the Thai Unicode class. |
ea449505 |
297 | "a" =~ /\P{Lao}/ # Match, as "a" is not a Laotian character. |
8a118206 |
298 | |
299 | |
300 | =head2 Bracketed Character Classes |
301 | |
302 | The third form of character class you can use in Perl regular expressions |
6b83a163 |
303 | is the bracketed character class. In its simplest form, it lists the characters |
c1c4ae3a |
304 | that may be matched, surrounded by square brackets, like this: C<[aeiou]>. |
ea449505 |
305 | This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Like the other |
8a118206 |
306 | character classes, exactly one character will be matched. To match |
ea449505 |
307 | a longer string consisting of characters mentioned in the character |
6b83a163 |
308 | class, follow the character class with a L<quantifier|perlre/Quantifiers>. For |
309 | instance, C<[aeiou]+> matches a string of one or more lowercase English vowels. |
8a118206 |
310 | |
311 | Repeating a character in a character class has no |
312 | effect; it's considered to be in the set only once. |
313 | |
314 | Examples: |
315 | |
316 | "e" =~ /[aeiou]/ # Match, as "e" is listed in the class. |
317 | "p" =~ /[aeiou]/ # No match, "p" is not listed in the class. |
318 | "ae" =~ /^[aeiou]$/ # No match, a character class only matches |
319 | # a single character. |
320 | "ae" =~ /^[aeiou]+$/ # Match, due to the quantifier. |
321 | |
322 | =head3 Special Characters Inside a Bracketed Character Class |
323 | |
324 | Most characters that are meta characters in regular expressions (that |
df225385 |
325 | is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose |
8a118206 |
326 | their special meaning and can be used inside a character class without |
327 | the need to escape them. For instance, C<[()]> matches either an opening |
328 | parenthesis, or a closing parenthesis, and the parens inside the character |
329 | class don't group or capture. |
330 | |
331 | Characters that may carry a special meaning inside a character class are: |
332 | C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be |
333 | escaped with a backslash, although this is sometimes not needed, in which |
334 | case the backslash may be omitted. |
335 | |
336 | The sequence C<\b> is special inside a bracketed character class. While |
6b83a163 |
337 | outside the character class, C<\b> is an assertion indicating a point |
8a118206 |
338 | that does not have either two word characters or two non-word characters |
339 | on either side, inside a bracketed character class, C<\b> matches a |
340 | backspace character. |
341 | |
df225385 |
342 | The sequences |
343 | C<\a>, |
344 | C<\c>, |
345 | C<\e>, |
346 | C<\f>, |
347 | C<\n>, |
e526e8bb |
348 | C<\N{I<NAME>}>, |
349 | C<\N{U+I<wide hex char>}>, |
df225385 |
350 | C<\r>, |
351 | C<\t>, |
352 | and |
353 | C<\x> |
354 | are also special and have the same meanings as they do outside a bracketed character |
355 | class. |
356 | |
ea449505 |
357 | Also, a backslash followed by two or three octal digits is considered an octal |
358 | number. |
df225385 |
359 | |
6b83a163 |
360 | A C<[> is not special inside a character class, unless it's the start of a |
361 | POSIX character class (see L</POSIX Character Classes> below). It normally does |
362 | not need escaping. |
8a118206 |
363 | |
6b83a163 |
364 | A C<]> is normally either the end of a POSIX character class (see |
365 | L</POSIX Character Classes> below), or it signals the end of the bracketed |
366 | character class. If you want to include a C<]> in the set of characters, you |
367 | must generally escape it. |
8a118206 |
368 | However, if the C<]> is the I<first> (or the second if the first |
369 | character is a caret) character of a bracketed character class, it |
370 | does not denote the end of the class (as you cannot have an empty class) |
371 | and is considered part of the set of characters that can be matched without |
372 | escaping. |
373 | |
374 | Examples: |
375 | |
376 | "+" =~ /[+?*]/ # Match, "+" in a character class is not special. |
377 | "\cH" =~ /[\b]/ # Match, \b inside a character class |
c1c4ae3a |
378 | # is equivalent to a backspace. |
8a118206 |
379 | "]" =~ /[][]/ # Match, as the character class contains |
380 | # both [ and ]. |
381 | "[]" =~ /[[]]/ # Match, the pattern contains a character class |
382 | # containing just [, and the character class is |
383 | # followed by a ]. |
384 | |
385 | =head3 Character Ranges |
386 | |
387 | It is not uncommon to want to match a range of characters. Luckily, instead |
388 | of listing all the characters in the range, one may use the hyphen (C<->). |
389 | If inside a bracketed character class you have two characters separated |
390 | by a hyphen, it's treated as if all the characters between the two are in |
391 | the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]> |
392 | matches any lowercase letter from the first half of the ASCII alphabet. |
393 | |
394 | Note that the two characters on either side of the hyphen are not |
395 | necessarily both letters or both digits. Any character is possible, |
396 | although not advisable. C<['-?]> contains a range of characters, but |
397 | most people will not know which characters that will be. Furthermore, |
398 | such ranges may lead to portability problems if the code has to run on |
399 | a platform that uses a different character set, such as EBCDIC. |
400 | |
ea449505 |
401 | If a hyphen in a character class cannot syntactically be part of a range, for |
402 | instance because it is the first or the last character of the character class, |
8a118206 |
403 | or if it immediately follows a range, the hyphen isn't special, and will be |
6b83a163 |
404 | considered a character that is to be matched literally. You have to escape the |
c1c4ae3a |
405 | hyphen with a backslash if you want to have a hyphen in your set of characters |
406 | to be matched, and its position in the class is such that it could be |
407 | considered part of a range. |
8a118206 |
408 | |
409 | Examples: |
410 | |
411 | [a-z] # Matches a character that is a lower case ASCII letter. |
c1c4ae3a |
412 | [a-fz] # Matches any letter between 'a' and 'f' (inclusive) or |
413 | # the letter 'z'. |
8a118206 |
414 | [-z] # Matches either a hyphen ('-') or the letter 'z'. |
415 | [a-f-m] # Matches any letter between 'a' and 'f' (inclusive), the |
416 | # hyphen ('-'), or the letter 'm'. |
417 | ['-?] # Matches any of the characters '()*+,-./0123456789:;<=>? |
418 | # (But not on an EBCDIC platform). |
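
To make the effect of escaping the hyphen concrete, here is a small sketch comparing a range with an escaped hyphen in the same position. The variable names are illustrative.

```perl
use strict;
use warnings;

my $range   = qr/^[a-z]$/;    # every lowercase ASCII letter
my $literal = qr/^[a\-z]$/;   # only 'a', '-', or 'z'

my $m_in_range     = "m" =~ $range   ? 1 : 0;
my $m_in_literal   = "m" =~ $literal ? 1 : 0;
my $dash_in_range  = "-" =~ $range   ? 1 : 0;
my $dash_in_literal = "-" =~ $literal ? 1 : 0;
```

C<$m_in_range> and C<$dash_in_literal> are 1; the other two are 0.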
419 | |
420 | |
421 | =head3 Negation |
422 | |
423 | It is also possible to instead list the characters you do not want to |
424 | match. You can do so by using a caret (C<^>) as the first character in the |
425 | character class. For instance, C<[^a-z]> matches a character that is not a |
426 | lowercase ASCII letter. |
427 | |
428 | This syntax makes the caret a special character inside a bracketed character |
429 | class, but only if it is the first character of the class. So if you want |
430 | to have the caret as one of the characters you want to match, you either |
431 | have to escape the caret, or not list it first. |
432 | |
433 | Examples: |
434 | |
435 | "e" =~ /[^aeiou]/ # No match, the 'e' is listed. |
436 | "x" =~ /[^aeiou]/ # Match, as 'x' isn't a lowercase vowel. |
437 | "^" =~ /[^^]/ # No match, matches anything that isn't a caret. |
438 | "^" =~ /[x^]/ # Match, caret is not special here. |
439 | |
440 | =head3 Backslash Sequences |
441 | |
ea449505 |
442 | You can put any backslash sequence character class (with the exception of |
443 | C<\N>) inside a bracketed character class, and it will act just |
df225385 |
444 | as if you put all the characters matched by the backslash sequence inside the |
6b83a163 |
445 | character class. For instance, C<[a-f\d]> will match any decimal digit, or any |
446 | of the lowercase letters between 'a' and 'f' inclusive. |
447 | |
448 | C<\N> within a bracketed character class must be of the forms C<\N{I<name>}> |
449 | or C<\N{U+I<wide hex char>}>, and NOT be the form that matches non-newlines, |
450 | for the same reason that a dot C<.> inside a bracketed character class loses |
451 | its special meaning: it matches nearly anything, which generally isn't what you |
452 | want to happen. |
df225385 |
453 | |
8a118206 |
454 | |
455 | Examples: |
456 | |
457 | /[\p{Thai}\d]/ # Matches a character that is either a Thai |
458 | # character, or a digit. |
459 | /[^\p{Arabic}()]/ # Matches a character that is neither an Arabic |
460 | # character, nor a parenthesis. |
461 | |
462 | Backslash sequence character classes cannot form one of the endpoints |
6b83a163 |
463 | of a range. Thus, you can't say: |
464 | |
465 | /[\p{Thai}-\d]/ # Wrong! |
8a118206 |
466 | |
6b83a163 |
467 | =head3 POSIX Character Classes |
ea449505 |
468 | X<character class> X<\p> X<\p{}> |
ea449505 |
469 | X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph> |
470 | X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit> |
8a118206 |
471 | |
6b83a163 |
472 | POSIX character classes have the form C<[:class:]>, where I<class> is the |
473 | name, and C<[:> and C<:]> are the delimiters. POSIX character classes only appear |
8a118206 |
474 | I<inside> bracketed character classes, and are a convenient and descriptive |
c1c4ae3a |
475 | way of listing a group of characters, though they currently suffer from |
6b83a163 |
476 | portability issues (see below and L</Locale, EBCDIC, Unicode and UTF-8>). |
477 | |
478 | Be careful about the syntax, |
8a118206 |
479 | |
480 | # Correct: |
481 | $string =~ /[[:alpha:]]/ |
482 | |
483 | # Incorrect (will warn): |
484 | $string =~ /[:alpha:]/ |
485 | |
486 | The latter pattern would be a character class consisting of a colon, |
487 | and the letters C<a>, C<l>, C<p> and C<h>. |
6b83a163 |
488 | POSIX character classes can be part of a larger bracketed character class. For |
ea449505 |
489 | example, |
490 | |
491 | [01[:alpha:]%] |
492 | |
493 | is valid and matches '0', '1', any alphabetic character, and the percent sign. |
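
A short sketch of both points: the mixed class from the example above, and what the incorrect single-bracket form actually matches. The C<regexp> warning category is switched off locally only so that the deliberately wrong pattern stays quiet; the strings tested are arbitrary.

```perl
use strict;
use warnings;

# A POSIX class combined with other characters in one bracketed class.
my $mixed = qr/^[01[:alpha:]%]+$/;
my $accepted = "a1%0" =~ $mixed ? 1 : 0;
my $rejected = "a2"   =~ $mixed ? 1 : 0;   # '2' is not in the class

my ($colon, $zed);
{
    no warnings 'regexp';
    # Without the outer brackets this is just the set ':', 'a', 'l', 'p', 'h'.
    $colon = ":" =~ /[:alpha:]/ ? 1 : 0;
    $zed   = "z" =~ /[:alpha:]/ ? 1 : 0;
}
```

C<$accepted> and C<$colon> are 1; C<$rejected> and C<$zed> are 0.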
8a118206 |
494 | |
495 | Perl recognizes the following POSIX character classes: |
496 | |
ea449505 |
497 | alpha Any alphabetical character ("[A-Za-z]"). |
498 | alnum Any alphanumerical character ("[A-Za-z0-9]"). |
499 | ascii Any character in the ASCII character set. |
ea8b8ad2 |
500 | blank A GNU extension, equal to a space or a horizontal tab ("\t"). |
ea449505 |
501 | cntrl Any control character. See Note [2] below. |
502 | digit Any decimal digit ("[0-9]"), equivalent to "\d". |
503 | graph Any printable character, excluding a space. See Note [3] below. |
504 | lower Any lowercase character ("[a-z]"). |
505 | print Any printable character, including a space. See Note [4] below. |
c1c4ae3a |
506 | punct Any graphical character excluding "word" characters. Note [5]. |
ea449505 |
507 | space Any whitespace character. "\s" plus the vertical tab ("\cK"). |
508 | upper Any uppercase character ("[A-Z]"). |
509 | word A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w". |
510 | xdigit Any hexadecimal digit ("[0-9a-fA-F]"). |
511 | |
512 | Most POSIX character classes have two Unicode-style C<\p> property |
513 | counterparts. (They are not official Unicode properties, but Perl extensions |
514 | derived from official Unicode properties.) The table below shows the relation |
515 | between POSIX character classes and these counterparts. |
516 | |
517 | One counterpart, in the column labelled "ASCII-range Unicode" in |
6b83a163 |
518 | the table, will only match characters in the ASCII character set. |
ea449505 |
519 | |
520 | The other counterpart, in the column labelled "Full-range Unicode", matches any |
521 | appropriate characters in the full Unicode character set. For example, |
522 | C<\p{Alpha}> will match not just the ASCII alphabetic characters, but any |
523 | character in the entire Unicode character set that is considered to be |
524 | alphabetic. |
525 | |
526 | (Each of the counterparts has various synonyms as well. |
527 | L<perluniprops/Properties accessible through \p{} and \P{}> lists all the |
528 | synonyms, plus all the characters matched by each of the ASCII-range |
529 | properties. For example C<\p{AHex}> is a synonym for C<\p{ASCII_Hex_Digit}>, |
530 | and any C<\p> property name can be prefixed with "Is" such as C<\p{IsAlpha}>.) |
531 | |
532 | Both the C<\p> forms are unaffected by any locale that is in effect, or whether |
533 | the string is in UTF-8 format or not, or whether the platform is EBCDIC or not. |
534 | In contrast, the POSIX character classes are affected. If the source string is |
535 | in UTF-8 format, the POSIX classes (with the exception of C<[[:punct:]]>, see |
6b83a163 |
536 | Note [5] below) behave like their "Full-range" Unicode counterparts. If the |
537 | source string is not in UTF-8 format, and no locale is in effect, and the |
538 | platform is not EBCDIC, all the POSIX classes behave like their ASCII-range |
539 | counterparts. Otherwise, they behave based on the rules of the locale or |
540 | EBCDIC code page. |
541 | |
ea449505 |
542 | It is proposed to change this behavior in a future release of Perl so that the |
543 | UTF8ness of the source string will be irrelevant to the behavior of the |
544 | POSIX character classes. This means they will always behave in strict |
545 | accordance with the official POSIX standard. That is, if either locale or |
546 | EBCDIC code page is present, they will behave in accordance with those; if |
547 | absent, the classes will match only their ASCII-range counterparts. If you |
548 | disagree with this proposal, send email to C<perl5-porters@perl.org>. |
549 | |
550 | [[:...:]]   ASCII-range          Full-range       backslash  Note |
551 |             Unicode              Unicode          sequence |
552 | ----------------------------------------------------------------- |
553 | alpha       \p{PosixAlpha}       \p{Alpha} |
554 | alnum       \p{PosixAlnum}       \p{Alnum} |
555 | ascii       \p{ASCII} |
c1c4ae3a |
556 | blank       \p{PosixBlank}       \p{Blank} =                 [1] |
ea449505 |
557 |                                  \p{HorizSpace}   \h         [1] |
558 | cntrl       \p{PosixCntrl}       \p{Cntrl}                   [2] |
559 | digit       \p{PosixDigit}       \p{Digit}        \d |
560 | graph       \p{PosixGraph}       \p{Graph}                   [3] |
561 | lower       \p{PosixLower}       \p{Lower} |
562 | print       \p{PosixPrint}       \p{Print}                   [4] |
563 | punct       \p{PosixPunct}       \p{Punct}                   [5] |
564 |             \p{PerlSpace}        \p{SpacePerl}    \s         [6] |
565 | space       \p{PosixSpace}       \p{Space}                   [6] |
566 | upper       \p{PosixUpper}       \p{Upper} |
567 | word        \p{PerlWord}         \p{Word}         \w |
568 | xdigit      \p{ASCII_Hex_Digit}  \p{XDigit} |
8a118206 |
569 | |
570 | =over 4 |
571 | |
ea449505 |
572 | =item [1] |
573 | |
574 | C<\p{Blank}> and C<\p{HorizSpace}> are synonyms. |
575 | |
576 | =item [2] |
8a118206 |
577 | |
ea449505 |
578 | Control characters don't produce output as such, but instead usually control |
579 | the terminal somehow: for example newline and backspace are control characters. |
580 | In the ASCII range, characters whose ordinals are between 0 and 31 inclusive, |
581 | plus 127 (C<DEL>) are control characters. |
8a118206 |
582 | |
c1c4ae3a |
583 | On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]> |
584 | to be the EBCDIC equivalents of the ASCII controls, plus the controls |
6b83a163 |
585 | that in Unicode have ordinals from 128 through 159. |
ea449505 |
586 | |
587 | =item [3] |
8a118206 |
588 | |
589 | Any character that is I<graphical>, that is, visible. This class consists |
590 | of all the alphanumerical characters and all punctuation characters. |
591 | |
ea449505 |
592 | =item [4] |
8a118206 |
593 | |
594 | All printable characters, which is the set of all the graphical characters |
ea449505 |
595 | plus whitespace characters that are not also controls. |
596 | |
6c5a041f |
597 | =item [5] (punct) |
ea449505 |
598 | |
599 | C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all the |
600 | non-controls, non-alphanumeric, non-space characters: |
601 | C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect, |
602 | it could alter the behavior of C<[[:punct:]]>). |
603 | |
6c5a041f |
604 | C<\p{Punct}> matches a somewhat different set in the ASCII range, namely |
605 | C<[-!"#%&'()*,./:;?@[\\\]_{}]>. That is, it is missing C<[$+E<lt>=E<gt>^`|~]>. |
606 | This is because Unicode splits what POSIX considers to be punctuation into two |
607 | categories, Punctuation and Symbols. |
608 | |
609 | When the matching string is in UTF-8 format, C<[[:punct:]]> matches what it |
610 | matches in the ASCII range, plus what C<\p{Punct}> matches. This is different |
611 | from strictly matching according to C<\p{Punct}>. Another way to say it is that |
612 | for a UTF-8 string, C<[[:punct:]]> matches all the characters that Unicode |
613 | considers to be punctuation, plus all the ASCII-range characters that Unicode |
614 | considers to be symbols. |
8a118206 |
615 | |
ea449505 |
616 | =item [6] |
8a118206 |
617 | |
ea449505 |
618 | C<\p{SpacePerl}> and C<\p{Space}> differ only in that C<\p{Space}> additionally |
619 | matches the vertical tab, C<\cK>. The same holds for the two ASCII-range forms. |
8a118206 |
620 | |
621 | =back |
622 | |
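The difference described in note [5] can be checked directly. The following is a minimal sketch, not part of the original text; the helper names C<is_posix_punct> and C<is_unicode_punct> are hypothetical, and it assumes no locale is in effect. It tests the nine ASCII characters that Unicode classifies as symbols against both forms.

```perl
use strict;
use warnings;

# Hypothetical helpers, introduced only for this illustration.
sub is_posix_punct   { return $_[0] =~ /\A[[:punct:]]\z/ ? 1 : 0 }
sub is_unicode_punct { return $_[0] =~ /\A\p{Punct}\z/   ? 1 : 0 }

# The nine ASCII characters that POSIX treats as punctuation but that
# Unicode places in its Symbol categories, followed by '!' for contrast.
my @chars = (split(//, '$+<=>^`|~'), '!');

for my $ch (@chars) {
    printf "%s  [[:punct:]]: %d  \\p{Punct}: %d\n",
        $ch, is_posix_punct($ch), is_unicode_punct($ch);
}
```

Each of the nine symbols should be reported as matched by C<[[:punct:]]> only, while C<!> is matched by both forms.
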
623 | =head4 Negation |
ea449505 |
624 | X<character class, negation> |
8a118206 |
625 | |
626 | A Perl extension to the POSIX character class is the ability to |
627 | negate it. This is done by prefixing the class name with a caret (C<^>). |
628 | Some examples: |
629 | |
ea449505 |
630 |  POSIX          ASCII-range          Full-range      backslash |
631 |                 Unicode              Unicode         sequence |
632 |  ----------------------------------------------------------------- |
c1c4ae3a |
633 |  [[:^digit:]]   \P{PosixDigit}       \P{Digit}       \D |
ea449505 |
634 |  [[:^space:]]   \P{PosixSpace}       \P{Space} |
c1c4ae3a |
635 |                 \P{PerlSpace}        \P{SpacePerl}   \S |
636 |  [[:^word:]]    \P{PerlWord}         \P{Word}        \W |
8a118206 |
637 | |
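As a small illustration of negation (a sketch only; the helper names C<non_digit_posix> and C<non_digit_backslash> are hypothetical), the negated POSIX class and the corresponding backslash sequence can be compared on a few ASCII characters:

```perl
use strict;
use warnings;

# Hypothetical helpers, introduced only for this illustration.
sub non_digit_posix     { return $_[0] =~ /\A[[:^digit:]]\z/ ? 1 : 0 }
sub non_digit_backslash { return $_[0] =~ /\A\D\z/           ? 1 : 0 }

# '7' is a digit; 'x' and ' ' are not.
for my $ch ('7', 'x', ' ') {
    printf "'%s'  [[:^digit:]]: %d  \\D: %d\n",
        $ch, non_digit_posix($ch), non_digit_backslash($ch);
}
```

For these ASCII characters the two forms agree: both reject C<7> and accept C<x> and the space.
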
638 | =head4 [= =] and [. .] |
639 | |
640 | Perl recognizes the POSIX equivalence-class syntax C<[=class=]> and the |
ea449505 |
641 | collating-symbol syntax C<[.class.]>, but does not (yet?) support them. Use of |
740bae87 |
642 | such a construct will lead to an error. |
8a118206 |
643 | |
644 | |
645 | =head4 Examples |
646 | |
647 |  /[[:digit:]]/            # Matches a character that is a digit. |
648 |  /[01[:lower:]]/          # Matches a character that is either a |
649 |                           # lowercase letter, or '0' or '1'. |
c1c4ae3a |
650 |  /[[:digit:][:^xdigit:]]/ # Matches any character except the letters |
651 |                           # 'a' to 'f' and 'A' to 'F'. The main |
652 |                           # character class is the union of two POSIX |
653 |                           # character classes: one that matches any |
654 |                           # digit, and one that matches anything that |
655 |                           # isn't a hex digit. The digits excluded by |
656 |                           # the second class are added back by the |
657 |                           # first, so only the hex letters remain |
658 |                           # unmatched. |
8a118206 |
659 | |
660 | |
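The last example above can be verified with a short script. This is a sketch under the assumption that only ASCII characters are tested and no locale is in effect; the variable name C<$re> is arbitrary.

```perl
use strict;
use warnings;

# Union of "any digit" and "anything that is not a hex digit".
my $re = qr/\A[[:digit:][:^xdigit:]]\z/;

# '5' is a digit; 'a' and 'F' are hex letters; 'g', 'Z' and '!' are
# neither digits nor hex digits.
for my $ch ('5', 'a', 'F', 'g', 'Z', '!') {
    printf "'%s' %s\n", $ch, ($ch =~ $re ? "matches" : "does not match");
}
```

Only C<a> and C<F> should be reported as not matching.
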
ea449505 |
661 | =head2 Locale, EBCDIC, Unicode and UTF-8 |
8a118206 |
662 | |
663 | Some of the character classes behave somewhat differently depending on |
664 | the internal encoding of the source string, the locale that is in effect, |
ea449505 |
665 | and whether the program is running on an EBCDIC platform. |
8a118206 |
666 | |
667 | C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations, |
c1c4ae3a |
668 | including C<\W>, C<\D>, C<\S>) are affected by this behaviour. (Since the backslash |
669 | sequences C<\b> and C<\B> are defined in terms of C<\w> and C<\W>, they also are |
670 | affected.) |
8a118206 |
671 | |
672 | The rule is that if the source string is in UTF-8 format, the character |
673 | classes match according to the Unicode properties. If the source string |
ea449505 |
674 | isn't, then the character classes match according to whatever locale or EBCDIC |
675 | code page is in effect. If neither is in effect, they match the ASCII |
6b83a163 |
676 | defaults (0 to 9 for C<\d>; 52 letters, 10 digits and underscore for C<\w>; |
c1c4ae3a |
677 | etc.). |
8a118206 |
678 | |
679 | This usually means that if you are matching against characters whose C<ord()> |
680 | values are between 128 and 255 inclusive, your character class may match |
ea449505 |
681 | or not depending on the current locale or EBCDIC code page, and whether the |
682 | source string is in UTF-8 format. The string will be in UTF-8 format if it |
683 | contains characters whose C<ord()> value exceeds 255. But a string may be in |
6b83a163 |
684 | UTF-8 format without it having such characters. See L<perlunicode/The |
ea449505 |
685 | "Unicode Bug">. |
8a118206 |
686 | |
687 | For portability reasons, it may be better not to use C<\w>, C<\d>, C<\s> |
688 | or the POSIX character classes, and use the Unicode properties instead. |
689 | |
690 | =head4 Examples |
691 | |
692 |  $str =  "\xDF";      # $str is not in UTF-8 format. |
693 |  $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format. |
694 |  $str .= "\x{0e0b}";  # Now $str is in UTF-8 format. |
695 |  $str =~ /^\w/;       # Match! $str is now in UTF-8 format. |
696 |  chop $str; |
697 |  $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format. |
698 | |
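The internal format of a string can be inspected with C<utf8::is_utf8()> and changed with C<utf8::upgrade()>, both of which are built in. The following is a sketch rather than a definitive test: it assumes a non-EBCDIC platform with no locale in effect, and the result printed for C<\w> on the non-UTF-8 copy may vary with the Perl version and with any pragmas that select Unicode rules. The variable names are arbitrary.

```perl
use strict;
use warnings;

my $native   = "\xDF";      # Not in UTF-8 format.
my $upgraded = $native;
utf8::upgrade($upgraded);   # Same character, now stored in UTF-8 format.

for my $pair ([native => $native], [upgraded => $upgraded]) {
    my ($label, $str) = @$pair;
    printf "%-8s UTF-8 format: %-3s  \\w: %-3s  \\p{Alpha}: %s\n",
        $label,
        (utf8::is_utf8($str)   ? "yes" : "no"),
        ($str =~ /^\w/         ? "yes" : "no"),
        ($str =~ /^\p{Alpha}/  ? "yes" : "no");
}
```

The two strings compare equal with C<eq>, yet C<\w> may treat them differently, whereas the Unicode property C<\p{Alpha}> matches both. This is the reason for the portability advice above.
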
699 | =cut |