=head1 NAME

perlreref - Perl Regular Expressions Reference

=head1 DESCRIPTION

This is a quick reference to Perl's regular expressions.
For full information see L<perlre> and L<perlop>, as well
as the L</"SEE ALSO"> section in this document.

=head2 OPERATORS

C<=~> determines to which variable the regex is applied.
In its absence, C<$_> is used.

    $var =~ /foo/;

C<!~> determines to which variable the regex is applied,
and negates the result of the match; it returns
false if the match succeeds, and true if it fails.

    $var !~ /foo/;

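A minimal sketch of the two binding operators and the C<$_> default. The
variable names and sample strings are illustrative only.

```perl
use strict;
use warnings;

my $var = "a foo in the middle";

# =~ applies the regex to $var; true here because "foo" occurs in it.
my $bound = ($var =~ /foo/) ? 1 : 0;

# !~ applies the regex to $var and negates the result.
my $negated = ($var !~ /foo/) ? 1 : 0;

# With no binding operator the regex is applied to $_.
my $default;
for ("nothing to see") {
    $default = /foo/ ? 1 : 0;
}

print "bound=$bound negated=$negated default=$default\n";
```
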
C<m/pattern/msixpogc> searches a string for a pattern match,
applying the given options.

    m  Multiline mode - ^ and $ match internal lines
    s  match as a Single line - . matches \n
    i  case-Insensitive
    x  eXtended legibility - free whitespace and comments
    p  Preserve a copy of the matched string -
       ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined.
    o  compile pattern Once
    g  Global - all occurrences
    c  don't reset pos on failed matches when using /g

If 'pattern' is an empty string, the last I<successfully> matched
regex is used. Delimiters other than '/' may be used for both this
operator and the following ones. The leading C<m> can be omitted
if the delimiter is '/'.

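The modifiers above might be exercised as follows. This is a sketch with
made-up sample text, not an exhaustive demonstration of every flag.

```perl
use strict;
use warnings;

my $text = "Alpha beta\nGamma delta";

# /i ignores case; /x permits whitespace and comments in the pattern.
my $has_alpha = $text =~ /
    alpha    # matches the capitalised word because of i
/ix ? 1 : 0;

# /m lets ^ match at the start of each internal line;
# /g in list context returns the captures from every match.
my @first_words = $text =~ /^(\w+)/mg;

# Without /s the dot does not match the newline; with /s it does.
my $dot_plain  = $text =~ /beta.Gamma/  ? 1 : 0;
my $dot_single = $text =~ /beta.Gamma/s ? 1 : 0;

# With a delimiter other than '/', the leading m is required.
my $braces = $text =~ m{delta} ? 1 : 0;

print "@first_words\n";
```
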
C<qr/pattern/msixpo> lets you store a regex in a variable,
or pass one around. The modifiers are the same as for C<m//>,
and are stored within the regex.

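A short sketch showing that modifiers travel with a C<qr//> object, and
apply only to it when it is interpolated into a larger pattern. The names
are illustrative.

```perl
use strict;
use warnings;

# /i is stored inside the compiled regex.
my $greeting = qr/hello/i;

# The stored regex can be used on its own...
my $alone = "HELLO there" =~ $greeting ? 1 : 0;

# ...or interpolated; only the qr// part stays case-insensitive.
my $embedded   = "say Hello, world" =~ /$greeting, world/ ? 1 : 0;
my $outer_case = "say Hello, WORLD" =~ /$greeting, world/ ? 1 : 0;

print "alone=$alone embedded=$embedded outer_case=$outer_case\n";
```
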
C<s/pattern/replacement/msixpogcer> substitutes matches of
'pattern' with 'replacement'. The modifiers are the same as for C<m//>,
with two additions:

    e  Evaluate 'replacement' as an expression
    r  Return the result of the substitution and leave the original
       string untouched.

'e' may be specified multiple times. 'replacement' is interpreted
as a double quoted string unless a single-quote (C<'>) is the delimiter.

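The sketch below contrasts C</r>, C</e> and a plain in-place substitution.
It assumes a perl recent enough to support C</r> (5.14 or later); the
sample string is made up.

```perl
use strict;
use warnings;

my $original = "10 apples and 20 pears";

# /r returns the changed copy; $original is not modified.
my $copy = $original =~ s/\d+/N/gr;
my $untouched = $original;

# /e evaluates the replacement as Perl code for each match.
(my $doubled = $original) =~ s/(\d+)/$1 * 2/ge;

# Without /r the string is changed in place and the number of
# substitutions is returned.
my $count = ($original =~ s/\d+/some/g);

print "$copy\n$doubled\n$original ($count)\n";
```
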
C<?pattern?> is like C<m/pattern/> but matches only once. No alternate
delimiters can be used. Must be reset with reset().

=head2 SYNTAX

 \       Escapes the character immediately following it
 .       Matches any single character except a newline (unless /s is
           used)
 ^       Matches at the beginning of the string (or line, if /m is used)
 $       Matches at the end of the string (or line, if /m is used)
 *       Matches the preceding element 0 or more times
 +       Matches the preceding element 1 or more times
 ?       Matches the preceding element 0 or 1 times
 {...}   Specifies a range of occurrences for the element preceding it
 [...]   Matches any one of the characters contained within the brackets
 (...)   Groups subexpressions for capturing to $1, $2...
 (?:...) Groups subexpressions without capturing (cluster)
 |       Matches either the subexpression preceding or following it
 \1, \2, \3 ...           Matches the text from the Nth group
 \g1 or \g{1}, \g2 ...    Matches the text from the Nth group
 \g-1 or \g{-1}, \g-2 ... Matches the text from the Nth previous group
 \g{name}     Named backreference
 \k<name>     Named backreference
 \k'name'     Named backreference
 (?P=name)    Named backreference (python syntax)

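As an illustration of the backreference forms listed above, each of the
following patterns finds a doubled word. The sample text and the group
name C<word> are hypothetical.

```perl
use strict;
use warnings;

my $text = "hello hello world";

# \1 refers to the first capture group.
my $numbered = $text =~ /\b(\w+) \1\b/ ? $1 : undef;

# \g{-1} refers to the group immediately before it.
my $relative = $text =~ /\b(\w+) \g{-1}\b/ ? $1 : undef;

# \k<word> refers to the named group; its text is also in %+.
my $named = $text =~ /\b(?<word>\w+) \k<word>\b/ ? $+{word} : undef;

print "$numbered $relative $named\n";
```
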
=head2 ESCAPE SEQUENCES

These work as in normal strings.

   \a       Alarm (beep)
   \e       Escape
   \f       Formfeed
   \n       Newline
   \r       Carriage return
   \t       Tab
   \037     Any octal ASCII value
   \x7f     Any hexadecimal ASCII value
   \x{263a} A wide hexadecimal value
   \cx      Control-x
   \N{name} A named character
   \N{U+263D} A Unicode character by hex ordinal

   \l  Lowercase next character
   \u  Titlecase next character
   \L  Lowercase until \E
   \U  Uppercase until \E
   \Q  Disable pattern metacharacters until \E
   \E  End modification

For Titlecase, see L</Titlecase>.

This one works differently from normal strings:

   \b  An assertion, not backspace, except in a character class

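A small sketch of the case-modifying and quoting escapes, plus the two
meanings of C<\b>. All sample values are invented for the example.

```perl
use strict;
use warnings;

my $literal = "a.b*c";

# Interpolated as-is, '.' and '*' are metacharacters.
my $as_pattern = "aXbc" =~ /$literal/ ? 1 : 0;

# Between \Q and \E they are matched literally.
my $quoted_miss = "aXbc"    =~ /\Q$literal\E/ ? 1 : 0;
my $quoted_hit  = "xa.b*cx" =~ /\Q$literal\E/ ? 1 : 0;

# \L lowercases until \E; \u titlecases the next character.
my $name  = "pERL";
my $title = "\u\L$name\E";

# Inside a character class \b is a backspace, not a word boundary.
my $backspace = "\x08" =~ /[\b]/ ? 1 : 0;

print "$title\n";
```
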
=head2 CHARACTER CLASSES

   [amy]    Match 'a', 'm' or 'y'
   [f-j]    Dash specifies "range"
   [f-j-]   Dash escaped or at start or end means 'dash'
   [^f-j]   Caret indicates "match any character _except_ these"

The following sequences (except C<\N>) work inside or outside a character
class. The first six are locale aware; all are Unicode aware. See
L<perllocale> and L<perlunicode> for details.

   \d      A digit
   \D      A nondigit
   \w      A word character
   \W      A non-word character
   \s      A whitespace character
   \S      A non-whitespace character
   \h      A horizontal whitespace character
   \H      A non-horizontal whitespace character
   \N      A non-newline (when not followed by '{NAME}'; experimental;
           not valid in a character class; equivalent to [^\n]; it's
           like '.' without /s modifier)
   \v      A vertical whitespace character
   \V      A non-vertical whitespace character
   \R      A generic newline           (?>\v|\x0D\x0A)

   \C      Match a byte (with Unicode, '.' matches a character)
   \pP     Match P-named (Unicode) property
   \p{...} Match Unicode property with name longer than 1 character
   \PP     Match non-P
   \P{...} Match lack of Unicode property with name longer than 1 char
   \X      Match Unicode extended grapheme cluster

POSIX character classes and their Unicode and Perl equivalents:

         ASCII-           Full-
         range            range      backslash
 POSIX   \p{...}          \p{}       sequence  Description
 -----------------------------------------------------------------------
 alnum   PosixAlnum       Alnum                Alpha plus Digit
 alpha   PosixAlpha       Alpha                Alphabetic characters
 ascii   ASCII                                 Any ASCII character
 blank   PosixBlank       Blank      \h        Horizontal whitespace;
                                                 full-range also written
                                                 as \p{HorizSpace} (GNU
                                                 extension)
 cntrl   PosixCntrl       Cntrl                Control characters
 digit   PosixDigit       Digit      \d        Decimal digits
 graph   PosixGraph       Graph                Alnum plus Punct
 lower   PosixLower       Lower                Lowercase characters
 print   PosixPrint       Print                Graph plus Space, but not
                                                 any Cntrls
 punct   PosixPunct       Punct                These aren't precisely
                                                 equivalent.  See NOTE,
                                                 below.
 space   PosixSpace       Space      [\s\cK]   Whitespace
         PerlSpace        SpacePerl  \s        Perl's whitespace
                                                 definition
 upper   PosixUpper       Upper                Uppercase characters
 word    PerlWord         Word       \w        Alnum plus '_' (Perl
                                                 extension)
 xdigit  ASCII_Hex_Digit  XDigit               Hexadecimal digit,
                                                 ASCII-range is
                                                 [0-9A-Fa-f]

NOTE on C<[[:punct:]]>, C<\p{PosixPunct}> and C<\p{Punct}>:
In the ASCII range, C<[[:punct:]]> and C<\p{PosixPunct}> match
C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in
effect, it could alter the behavior of C<[[:punct:]]>); and C<\p{Punct}>
matches C<[-!"#%&'()*,./:;?@[\\\]_{}]>. When matching a UTF-8 string,
C<[[:punct:]]> matches what it does in the ASCII range, plus what
C<\p{Punct}> matches. C<\p{Punct}> matches anything that isn't a
control, an alphanumeric, a space, or a symbol.

Within a character class:

    POSIX      traditional   Unicode
  [:digit:]       \d        \p{Digit}
  [:^digit:]      \D        \P{Digit}

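To illustrate the equivalences above on plain ASCII input, the following
sketch counts matches for several notations. The sample string is
arbitrary, and C<\h> assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my $s = "abc 123";

# Three ways to match a digit; on this input they agree.
my @backslash = $s =~ /\d/g;
my @posix     = $s =~ /[[:digit:]]/g;
my @unicode   = $s =~ /\p{Digit}/g;

# Negated POSIX class: everything that is not a digit.
my @non_digits = $s =~ /[[:^digit:]]/g;

# A Unicode property and the horizontal-whitespace shortcut.
my @alpha      = $s =~ /\p{Alpha}/g;
my @horizontal = $s =~ /\h/g;

print scalar(@backslash), " digits, ", scalar(@alpha), " letters\n";
```
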
=head2 ANCHORS

All are zero-width assertions.

   ^  Match string start (or line, if /m is used)
   $  Match string end (or line, if /m is used) or before newline
   \b Match word boundary (between \w and \W)
   \B Match except at word boundary (between \w and \w or \W and \W)
   \A Match string start (regardless of /m)
   \Z Match string end (before optional newline)
   \z Match absolute string end
   \G Match where previous m//g left off
   \K Keep the stuff left of the \K, don't include it in $&

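The difference between the end-of-string anchors, and the effect of C<\G>
and C<\K>, can be sketched like this. The strings are invented, and C<\K>
assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my $line = "foo\n";

# $ and \Z tolerate one trailing newline; \z does not.
my $dollar = $line =~ /foo$/  ? 1 : 0;
my $big_z  = $line =~ /foo\Z/ ? 1 : 0;
my $z      = $line =~ /foo\z/ ? 1 : 0;

# \G anchors at the position where the previous m//g stopped.
my $input = "aaabbb";
$input =~ /a+/g;                               # leaves pos($input) at 3
my $after = $input =~ /\G(b+)/g ? $1 : undef;

# \K drops what was matched so far from the text being replaced.
(my $path = "dir/file.txt") =~ s/\.\K\w+$/bak/;

print "$after $path\n";
```
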
=head2 QUANTIFIERS

Quantifiers are greedy by default: at the leftmost position where the
pattern can match, each one matches the B<longest> string it can.

   Maximal Minimal Possessive Allowed range
   ------- ------- ---------- -------------
   {n,m}   {n,m}?  {n,m}+     Must occur at least n times
                              but no more than m times
   {n,}    {n,}?   {n,}+      Must occur at least n times
   {n}     {n}?    {n}+       Must occur exactly n times
   *       *?      *+         0 or more times (same as {0,})
   +       +?      ++         1 or more times (same as {1,})
   ?       ??      ?+         0 or 1 time (same as {0,1})

The possessive forms (new in Perl 5.10) prevent backtracking: what gets
matched by a pattern with a possessive quantifier will not be backtracked
into, even if that causes the whole match to fail.

There is no quantifier C<{,n}>. That's interpreted as a literal string.

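A brief sketch contrasting the three quantifier styles. The sample
strings are arbitrary.

```perl
use strict;
use warnings;

my $html = "<b>bold</b>";

# Maximal: .+ takes as much as it can, so the match runs to the last '>'.
my ($greedy) = $html =~ /(<.+>)/;

# Minimal: .+? takes as little as it can, stopping at the first '>'.
my ($minimal) = $html =~ /(<.+?>)/;

# a+ can give back one 'a' so that the final 'a' still matches...
my $backtracks = "aaa" =~ /a+a/ ? 1 : 0;

# ...but possessive a++ keeps all three, so the whole match fails.
my $possessive = "aaa" =~ /a++a/ ? 1 : 0;

print "$greedy $minimal $backtracks $possessive\n";
```
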
=head2 EXTENDED CONSTRUCTS

   (?#text)          A comment
   (?:...)           Groups subexpressions without capturing (cluster)
   (?pimsx-imsx:...) Enable/disable option (as per m// modifiers)
   (?=...)           Zero-width positive lookahead assertion
   (?!...)           Zero-width negative lookahead assertion
   (?<=...)          Zero-width positive lookbehind assertion
   (?<!...)          Zero-width negative lookbehind assertion
   (?>...)           Grab what we can, prohibit backtracking
   (?|...)           Branch reset
   (?<name>...)      Named capture
   (?'name'...)      Named capture
   (?P<name>...)     Named capture (python syntax)
   (?{ code })       Embedded code, return value becomes $^R
   (??{ code })      Dynamic regex, return value used as regex
   (?N)              Recurse into subpattern number N
   (?-N), (?+N)      Recurse into Nth previous/next subpattern
   (?R), (?0)        Recurse at the beginning of the whole pattern
   (?&name)          Recurse into a named subpattern
   (?P>name)         Recurse into a named subpattern (python syntax)
   (?(cond)yes|no)
   (?(cond)yes)      Conditional expression, where "cond" can be:
                     (N)       subpattern N has matched something
                     (<name>)  named subpattern has matched something
                     ('name')  named subpattern has matched something
                     (?{code}) code condition
                     (R)       true if recursing
                     (RN)      true if recursing into Nth subpattern
                     (R&name)  true if recursing into named subpattern
                     (DEFINE)  always false; no "no" pattern allowed

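A few of the constructs above in use. This is a sketch only: the strings
and the group names C<year> and C<month> are hypothetical, and the named
capture and branch reset forms assume Perl 5.10 or later.

```perl
use strict;
use warnings;

# Lookbehind: digits only when preceded by "USD", which is not captured.
my ($amount) = "cost: USD100" =~ /(?<=USD)(\d+)/;

# Negative lookahead: "foo" not followed by a digit.
my $no_digit  = "foobar" =~ /foo(?!\d)/ ? 1 : 0;
my $has_digit = "foo1"   =~ /foo(?!\d)/ ? 1 : 0;

# (?i:...) turns on case-insensitivity for the group only.
my $scoped_hit  = "FOObar" =~ /(?i:foo)bar/ ? 1 : 0;
my $scoped_miss = "FOOBAR" =~ /(?i:foo)bar/ ? 1 : 0;

# Branch reset: both alternatives capture into $1.
my ($value) = "b=2" =~ /(?|a=(\d)|b=(\d))/;

# Named captures are available through %+.
my %date;
if ("2001-02" =~ /(?<year>\d{4})-(?<month>\d\d)/) {
    %date = %+;
}

print "$amount $value $date{year}/$date{month}\n";
```
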
=head2 VARIABLES

   $_    Default variable for operators to use

   $`    Everything prior to matched string
   $&    Entire matched string
   $'    Everything after matched string

   ${^PREMATCH}   Everything prior to matched string
   ${^MATCH}      Entire matched string
   ${^POSTMATCH}  Everything after matched string

The use of C<$`>, C<$&> or C<$'> will slow down B<all> regex use
within your program. Consult L<perlvar> for C<@->
to see equivalent expressions that won't cause a slowdown.
See also L<Devel::SawAmpersand>. Starting with Perl 5.10, you
can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}>
and C<${^POSTMATCH}>, but for them to be defined, you have to
specify the C</p> (preserve) modifier on your regular expression.

   $1, $2 ...  hold the Xth captured expr
   $+    Last parenthesized pattern match
   $^N   Holds the most recently closed capture
   $^R   Holds the result of the last (?{...}) expr
   @-    Offsets of starts of groups. $-[0] holds start of whole match
   @+    Offsets of ends of groups. $+[0] holds end of whole match
   %+    Named capture buffers
   %-    Named capture buffers, as array refs

Captured groups are numbered according to their I<opening> paren.

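The following sketch reads the C</p> variables, the offset arrays and
C<%+> after a single match. The string and the group name C<tail> are
illustrative, and Perl 5.10 or later is assumed.

```perl
use strict;
use warnings;

my $str = "hello world";
my (%seen, @span);

if ($str =~ /(?<tail>wor)ld/p) {
    %seen = (
        pre   => ${^PREMATCH},    # text before the match
        match => ${^MATCH},       # the match itself
        post  => ${^POSTMATCH},   # text after the match
        first => $1,              # first capture group
        named => $+{tail},        # the same group, by name
    );
    @span = ($-[0], $+[0]);       # start and end of the whole match
}

# The offsets recover the match without touching $&.
my $by_offset = substr($str, $span[0], $span[1] - $span[0]);

print "[$seen{pre}] [$seen{match}] [$seen{post}] @span\n";
```
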
=head2 FUNCTIONS

   lc          Lowercase a string
   lcfirst     Lowercase first char of a string
   uc          Uppercase a string
   ucfirst     Titlecase first char of a string

   pos         Return or set current match position
   quotemeta   Quote metacharacters
   reset       Reset ?pattern? status
   study       Analyze string for optimizing matching

   split       Use a regex to split a string into parts

The first four of these are like the escape sequences C<\L>, C<\l>,
C<\U>, and C<\u>. For Titlecase, see L</Titlecase>.

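A quick sketch of several of these functions working alongside a regex.
The inputs are made up for the example.

```perl
use strict;
use warnings;

# lc and ucfirst combine to give "Hello" from mixed-case input.
my $title = ucfirst(lc("hELLO"));

# quotemeta backslashes the metacharacters, like \Q in a pattern.
my $safe = quotemeta("a.b");

# split takes a regex as its separator.
my @fields = split /\s*,\s*/, "a , b,c";

# After a successful m//g, pos reports where the match ended.
my $s = "aXbXc";
$s =~ /X/g;
my $where = pos($s);

print "$title $safe @fields $where\n";
```
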
=head2 TERMINOLOGY

=head3 Titlecase

A Unicode concept that is most often the same as uppercase, but differs
for certain characters such as the German "sharp s".

=head1 AUTHOR

Iain Truskett. Updated by the Perl 5 Porters.

This document may be distributed under the same terms as Perl itself.

=head1 SEE ALSO

=over 4

=item *

L<perlretut> for a tutorial on regular expressions.

=item *

L<perlrequick> for a rapid tutorial.

=item *

L<perlre> for more details.

=item *

L<perlvar> for details on the variables.

=item *

L<perlop> for details on the operators.

=item *

L<perlfunc> for details on the functions.

=item *

L<perlfaq6> for FAQs on regular expressions.

=item *

L<perlrebackslash> for a reference on backslash sequences.

=item *

L<perlrecharclass> for a reference on character classes.

=item *

The L<re> module to alter behaviour and aid
debugging.

=item *

L<perldebug/"Debugging regular expressions">

=item *

L<perluniintro>, L<perlunicode>, L<charnames> and L<perllocale>
for details on regexes and internationalisation.

=item *

I<Mastering Regular Expressions> by Jeffrey Friedl
(F<http://oreilly.com/catalog/9780596528126/>) for a thorough grounding and
reference on the topic.

=back

=head1 THANKS

David P.C. Wollmann,
Richard Soderberg,
Sean M. Burke,
Tom Christiansen,
Jim Cromie,
and
Jeffrey Goff
for useful advice.

=cut