pod/perlreref.pod

   1 =head1 NAME
   2
   3 perlreref - Perl Regular Expressions Reference
   4
   5 =head1 DESCRIPTION
   6
   7 This is a quick reference to Perl's regular expressions.
   8 For full information see L<perlre> and L<perlop>, as well
   9 as the L</"SEE ALSO"> section in this document.
  10
  11 =head2 OPERATORS
  12
  13 C<=~> determines to which variable the regex is applied.
  14 In its absence, $_ is used.
  15
  16     $var =~ /foo/;
  17
  18 C<!~> determines to which variable the regex is applied,
  19 and negates the result of the match; it returns
  20 false if the match succeeds, and true if it fails.
  21
  22     $var !~ /foo/;
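
As a minimal sketch of the binding operators described above (the variable
names and sample strings are made up for illustration):

```perl
use strict;
use warnings;

my $var = 'seafood';

# =~ binds the match to $var; !~ returns the negated result.
my $has_foo   = ($var =~ /foo/) ? 1 : 0;   # 'seafood' contains 'foo'
my $lacks_bar = ($var !~ /bar/) ? 1 : 0;   # 'seafood' does not contain 'bar'

# Without a binding operator the match is applied to $_.
my $default_hit;
for ('foo fighters') {
    $default_hit = /foo/ ? 1 : 0;
}

print "$has_foo $lacks_bar $default_hit\n";   # prints "1 1 1"
```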
  23
  24 C<m/pattern/msixpogc> searches a string for a pattern match,
  25 applying the given options.
  26
  27     m  Multiline mode - ^ and $ match internal lines
  28     s  match as a Single line - . matches \n
  29     i  case-Insensitive
  30     x  eXtended legibility - free whitespace and comments
  31     p  Preserve a copy of the matched string -
  32        ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined.
  33     o  compile pattern Once
  34     g  Global - all occurrences
  35     c  don't reset pos on failed matches when using /g
  36
  37 If 'pattern' is an empty string, the last I<successfully> matched
  38 regex is used. Delimiters other than '/' may be used for both this
  39 operator and the following ones. The leading C<m> can be omitted
  40 if the delimiter is '/'.
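
The modifiers above can be combined freely. The following is an
illustrative sketch only; the sample text and variable names are
assumptions, not part of the original reference.

```perl
use strict;
use warnings;

my $text = "alpha=1\nbeta=22\ngamma=333";

# /g in list context returns every match; /m lets ^ match at each line start.
my @keys = $text =~ /^(\w+)=/mg;              # ('alpha', 'beta', 'gamma')

# /x allows whitespace and comments in the pattern; /i ignores case.
my ($num) = $text =~ / BETA = (\d+)   # digits after "beta="
                     /xi;                     # '22'

# /g in scalar context walks the string; /c keeps pos() after the
# final, failed match instead of resetting it.
my $str  = 'aab';
my $hits = 0;
$hits++ while $str =~ /a/gc;
my $pos = pos($str);                          # 2

print "@keys | $num | $hits | $pos\n";
```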
  41
  42 C<qr/pattern/msixpo> lets you store a regex in a variable,
  43 or pass one around. The modifiers are the same as for C<m//> and are
  44 stored within the regex.
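
A short sketch of how a C<qr//> object might be stored, reused and
interpolated (variable names are hypothetical):

```perl
use strict;
use warnings;

# The /i modifier is stored inside the compiled regex object.
my $lang = qr/perl/i;

# Interpolating it into a pattern without /i keeps its own case-insensitivity,
# while the rest of the outer pattern stays case-sensitive.
my $ci_kept   = ('PERL rocks' =~ /^$lang rocks$/) ? 1 : 0;   # 1
my $outer_cs  = ('PERL ROCKS' =~ /^$lang rocks$/) ? 1 : 0;   # 0

# Regex objects can be composed into larger patterns.
my $word = qr/(\w+)/;
my $pair = qr/$word\s*=\s*$word/;
my ($k, $v) = 'Name = Perl' =~ $pair;                        # ('Name', 'Perl')

print "$ci_kept $outer_cs $k $v\n";
```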
  45
  46 C<s/pattern/replacement/msixpogcer> substitutes matches of
  47 'pattern' with 'replacement'. Modifiers as for C<m//>,
  48 with two additions:
  49
  50     e  Evaluate 'replacement' as an expression
  51     r  Return substitution and leave the original string untouched.
  52
  53 'e' may be specified multiple times. 'replacement' is interpreted
  54 as a double quoted string unless a single-quote (C<'>) is the delimiter.
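
The substitution modifiers might be used as follows. This is a sketch with
made-up data; it assumes a perl recent enough to support C</r>
(5.14 or later).

```perl
use strict;
use warnings;

my $stock = 'apples 3, pears 4';

# /e evaluates the replacement as Perl code; /g applies it to every match.
(my $doubled = $stock) =~ s/(\d+)/$1 * 2/ge;     # 'apples 6, pears 8'

# /r returns the modified copy and leaves $stock untouched.
my $titled = $stock =~ s/(\w+)/\u$1/gr;          # 'Apples 3, Pears 4'

# Without /r, s/// returns the number of substitutions made.
my $fruit = 'banana';
my $count = ($fruit =~ s/a/A/g);                 # 3, $fruit is now 'bAnAnA'

print "$doubled | $titled | $stock | $count $fruit\n";
```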
  55
  56 C<?pattern?> is like C<m/pattern/> but matches only once. No alternate
  57 delimiters can be used.  To match again, it must be reset with reset().
  58
  59 =head2 SYNTAX
  60
  61  \       Escapes the character immediately following it
  62  .       Matches any single character except a newline (unless /s is
  63            used)
  64  ^       Matches at the beginning of the string (or line, if /m is used)
  65  $       Matches at the end of the string (or line, if /m is used)
  66  *       Matches the preceding element 0 or more times
  67  +       Matches the preceding element 1 or more times
  68  ?       Matches the preceding element 0 or 1 times
  69  {...}   Specifies a range of occurrences for the element preceding it
  70  [...]   Matches any one of the characters contained within the brackets
  71  (...)   Groups subexpressions for capturing to $1, $2...
  72  (?:...) Groups subexpressions without capturing (cluster)
  73  |       Matches either the subexpression preceding or following it
  74  \1, \2, \3 ...           Matches the text from the Nth group
  75  \g1 or \g{1}, \g2 ...    Matches the text from the Nth group
  76  \g-1 or \g{-1}, \g-2 ... Matches the text from the Nth previous group
  77  \g{name}     Named backreference
  78  \k<name>     Named backreference
  79  \k'name'     Named backreference
  80  (?P=name)    Named backreference (python syntax)
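
The backreference forms listed above can be sketched like this (sample
strings are illustrative; C<\g{...}> and C<\k<name>> assume Perl 5.10 or
later):

```perl
use strict;
use warnings;

# \1 refers to whatever the first group captured: find a repeated word.
my ($repeated) = 'this is is a test' =~ /\b(\w+)\s+\1\b/;    # 'is'

# \g{-1} is relative: the group immediately before the backreference.
my ($double) = 'bookkeeper' =~ /(\w)\g{-1}/;                 # 'o'

# \k<name> refers to a named group: match a quoted string with
# the same quote character at both ends.
my ($quoted) = q{say "hi" now} =~ /((?<q>['"]).*?\k<q>)/;    # '"hi"'

print "$repeated $double $quoted\n";
```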
  81
  82 =head2 ESCAPE SEQUENCES
  83
  84 These work as in normal strings.
  85
  86    \a       Alarm (beep)
  87    \e       Escape
  88    \f       Formfeed
  89    \n       Newline
  90    \r       Carriage return
  91    \t       Tab
  92    \037     Any octal ASCII value
  93    \x7f     Any hexadecimal ASCII value
  94    \x{263a} A wide hexadecimal value
  95    \cx      Control-x
  96    \N{name} A named character
  97    \N{U+263D} A Unicode character by hex ordinal
  98
  99    \l  Lowercase next character
 100    \u  Titlecase next character
 101    \L  Lowercase until \E
 102    \U  Uppercase until \E
 103    \Q  Disable pattern metacharacters until \E
 104    \E  End modification
 105
 106 For Titlecase, see L</Titlecase>.
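
A brief sketch of the case-modification and quoting escapes, using
hypothetical sample values:

```perl
use strict;
use warnings;

my $literal = 'a.c';

# Interpolated as-is, '.' is a metacharacter and matches any character.
my $plain  = ('abc' =~ /\A$literal\z/)     ? 1 : 0;   # 1
# Between \Q and \E the '.' is matched literally.
my $quoted = ('abc' =~ /\A\Q$literal\E\z/) ? 1 : 0;   # 0
my $exact  = ('a.c' =~ /\A\Q$literal\E\z/) ? 1 : 0;   # 1

# \L lowercases until \E, and \u titlecases the next character.
my $name = 'pERL';
my $nice = "\u\L$name\E";                             # 'Perl'

print "$plain $quoted $exact $nice\n";
```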
 107
 108 This one works differently from normal strings:
 109
 110    \b  An assertion, not backspace, except in a character class
 111
 112 =head2 CHARACTER CLASSES
 113
 114    [amy]    Match 'a', 'm' or 'y'
 115    [f-j]    Dash specifies "range"
 116    [f-j-]   Dash escaped or at start or end means 'dash'
 117    [^f-j]   Caret indicates "match any character _except_ these"
 118
 119 The following sequences (except C<\N>) work both inside and outside a
 120 character class. The first six are locale aware; all are Unicode aware.
 121 See L<perllocale> and L<perlunicode> for details.
 122
 123    \d      A digit
 124    \D      A nondigit
 125    \w      A word character
 126    \W      A non-word character
 127    \s      A whitespace character
 128    \S      A non-whitespace character
 129    \h      A horizontal whitespace
 130    \H      A non-horizontal whitespace
 131    \N      A non newline (when not followed by '{NAME}'; experimental;
 132            not valid in a character class; equivalent to [^\n]; it's
 133            like '.' without /s modifier)
 134    \v      A vertical whitespace
 135    \V      A non vertical whitespace
 136    \R      A generic newline           (?>\v|\x0D\x0A)
 137
 138    \C      Match a byte (with Unicode, '.' matches a character)
 139    \pP     Match P-named (Unicode) property
 140    \p{...} Match Unicode property with name longer than 1 character
 141    \PP     Match non-P
 142    \P{...} Match lack of Unicode property with name longer than 1 char
 143    \X      Match Unicode extended grapheme cluster
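
A few of these classes in use, as a sketch with made-up input
(C<\h> and C<\R> assume Perl 5.10 or later):

```perl
use strict;
use warnings;

my $line = "id:\t42  ok\r\n";

my ($digits) = $line =~ /(\d+)/;          # '42'
my @words    = $line =~ /(\w+)/g;         # ('id', '42', 'ok')

# \h matches the tab and the two spaces, but not \r or \n.
my $h_count = () = $line =~ /\h/g;        # 3

# \R treats "\r\n" as a single generic newline.
(my $chomped = $line) =~ s/\R\z//;        # "id:\t42  ok"

# Unicode properties: U+03A9 (GREEK CAPITAL LETTER OMEGA) is an
# uppercase letter.
my $omega    = "\x{3A9}";
my $is_upper = ($omega =~ /\p{Lu}/) ? 1 : 0;   # 1
my $is_digit = ($omega =~ /\d/)     ? 1 : 0;   # 0

print "$digits | @words | $h_count | $is_upper $is_digit\n";
```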
 144
 145 POSIX character classes and their Unicode and Perl equivalents:
 146
 147            ASCII-         Full-
 148            range          range   backslash
 149  POSIX    \p{...}         \p{}    sequence       Description
 150  -----------------------------------------------------------------------
 151  alnum   PosixAlnum       Alnum               Alpha plus Digit
 152  alpha   PosixAlpha       Alpha               Alphabetic characters
 153  ascii   ASCII                                Any ASCII character
 154  blank   PosixBlank       Blank     \h        Horizontal whitespace;
 155                                                 full-range also written
 156                                                 as \p{HorizSpace} (GNU
 157                                                 extension)
 158  cntrl   PosixCntrl       Cntrl               Control characters
 159  digit   PosixDigit       Digit     \d        Decimal digits
 160  graph   PosixGraph       Graph               Alnum plus Punct
 161  lower   PosixLower       Lower               Lowercase characters
 162  print   PosixPrint       Print               Graph plus Space, but not
 163                                                 any Cntrls
 164  punct   PosixPunct       Punct               These aren't precisely
 165                                                 equivalent.  See NOTE,
 166                                                 below.
 167  space   PosixSpace       Space     [\s\cK]   Whitespace
 168          PerlSpace        SpacePerl \s        Perl's whitespace
 169                                                 definition
 170  upper   PosixUpper       Upper               Uppercase characters
 171  word    PerlWord         Word      \w        Alnum plus '_' (Perl
 172                                                 extension)
 173  xdigit  ASCII_Hex_Digit  XDigit              Hexadecimal digit,
 174                                                 ASCII-range is
 175                                                 [0-9A-Fa-f]
 176
 177 NOTE on C<[[:punct:]]>, C<\p{PosixPunct}> and C<\p{Punct}>:
 178 In the ASCII range, C<[[:punct:]]> and C<\p{PosixPunct}> match
 179 C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in
 180 effect, it could alter the behavior of C<[[:punct:]]>); and C<\p{Punct}>
 181 matches C<[-!"#%&'()*,./:;?@[\\\]_{}]>.  When matching a UTF-8 string,
 182 C<[[:punct:]]> matches what it does in the ASCII range, plus what
 183 C<\p{Punct}> matches.  C<\p{Punct}> matches anything that isn't a
 184 control, an alphanumeric, a space, or a symbol.
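
The difference described in the NOTE can be observed with a few ASCII
characters. This sketch assumes no locale is in effect; the character list
is chosen only for illustration.

```perl
use strict;
use warnings;

my @chars = ('!', '$', '+', 'a');

# [[:punct:]] and \p{PosixPunct} include ASCII symbols such as '$' and '+'.
my @posix   = grep { /[[:punct:]]/ }    @chars;   # ('!', '$', '+')
my @posix_p = grep { /\p{PosixPunct}/ } @chars;   # ('!', '$', '+')

# \p{Punct} excludes symbols, so only '!' remains.
my @unicode = grep { /\p{Punct}/ }      @chars;   # ('!')

print "@posix | @posix_p | @unicode\n";
```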
 185
 186 Within a character class:
 187
 188     POSIX      traditional   Unicode
 189   [:digit:]       \d        \p{Digit}
 190   [:^digit:]      \D        \P{Digit}
 191
 192 =head2 ANCHORS
 193
 194 All are zero-width assertions.
 195
 196    ^  Match string start (or line, if /m is used)
 197    $  Match string end (or line, if /m is used) or before newline
 198    \b Match word boundary (between \w and \W)
 199    \B Match except at word boundary (between \w and \w or \W and \W)
 200    \A Match string start (regardless of /m)
 201    \Z Match string end (before optional newline)
 202    \z Match absolute string end
 203    \G Match where previous m//g left off
 204    \K Keep the stuff left of the \K, don't include it in $&
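
The anchors above might behave as in the following sketch (sample strings
are assumptions; C<\K> assumes Perl 5.10 or later):

```perl
use strict;
use warnings;

my $s = "one\ntwo\n";

# \Z tolerates one trailing newline; \z does not.
my $Z = ($s =~ /two\Z/) ? 1 : 0;               # 1
my $z = ($s =~ /two\z/) ? 1 : 0;               # 0

# With /m, ^ matches at every line start; \A only at the string start.
my @line_starts = $s =~ /^(\w+)/mg;            # ('one', 'two')
my @str_starts  = $s =~ /\A(\w+)/mg;           # ('one')

# \G continues exactly where the previous m//g stopped.
my $input = '12ab!';
my @tokens;
push @tokens, $1 while $input =~ /\G(\d|[a-z])/gc;
my $stopped_at = pos($input);                  # 4, just before '!'

# \K keeps the text to its left out of the replaced portion.
(my $file = 'report.txt') =~ s/report\K\.txt/.bak/;   # 'report.bak'

print "$Z $z | @line_starts | @str_starts | @tokens | $stopped_at | $file\n";
```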
 205
 206 =head2 QUANTIFIERS
 207
 208 Quantifiers are greedy by default and match the B<longest> leftmost.
 209
 210    Maximal Minimal Possessive Allowed range
 211    ------- ------- ---------- -------------
 212    {n,m}   {n,m}?  {n,m}+     Must occur at least n times
 213                               but no more than m times
 214    {n,}    {n,}?   {n,}+      Must occur at least n times
 215    {n}     {n}?    {n}+       Must occur exactly n times
 216    *       *?      *+         0 or more times (same as {0,})
 217    +       +?      ++         1 or more times (same as {1,})
 218    ?       ??      ?+         0 or 1 time (same as {0,1})
 219
 220 The possessive forms (new in Perl 5.10) prevent backtracking: what gets
 221 matched by a pattern with a possessive quantifier will not be backtracked
 222 into, even if that causes the whole match to fail.
 223
 224 There is no quantifier C<{,n}>. That's interpreted as a literal string.
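
The three quantifier flavours can be compared in a small sketch (the sample
strings are made up; possessive forms assume Perl 5.10 or later):

```perl
use strict;
use warnings;

my $html = '<b>bold</b><i>it</i>';

# Maximal (greedy): .+ grabs as much as it can, then backs off to the last '>'.
my ($greedy) = $html =~ /(<.+>)/;     # '<b>bold</b><i>it</i>'

# Minimal: .+? stops at the first '>' that lets the match succeed.
my ($lazy)   = $html =~ /(<.+?>)/;    # '<b>'

# Possessive: a++ takes all three 'a's and never gives one back,
# so the trailing 'a' cannot match.
my $normal     = ('aaa' =~ /a+a/)  ? 1 : 0;   # 1
my $possessive = ('aaa' =~ /a++a/) ? 1 : 0;   # 0

# Counted forms.
my $date_ok = ('2024-01' =~ /\A\d{4}-\d{2}\z/) ? 1 : 0;   # 1

print "$greedy | $lazy | $normal $possessive | $date_ok\n";
```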
 225
 226 =head2 EXTENDED CONSTRUCTS
 227
 228    (?#text)          A comment
 229    (?:...)           Groups subexpressions without capturing (cluster)
 230    (?pimsx-imsx:...) Enable/disable option (as per m// modifiers)
 231    (?=...)           Zero-width positive lookahead assertion
 232    (?!...)           Zero-width negative lookahead assertion
 233    (?<=...)          Zero-width positive lookbehind assertion
 234    (?<!...)          Zero-width negative lookbehind assertion
 235    (?>...)           Grab what we can, prohibit backtracking
 236    (?|...)           Branch reset
 237    (?<name>...)      Named capture
 238    (?'name'...)      Named capture
 239    (?P<name>...)     Named capture (python syntax)
 240    (?{ code })       Embedded code, return value becomes $^R
 241    (??{ code })      Dynamic regex, return value used as regex
 242    (?N)              Recurse into subpattern number N
 243    (?-N), (?+N)      Recurse into Nth previous/next subpattern
 244    (?R), (?0)        Recurse at the beginning of the whole pattern
 245    (?&name)          Recurse into a named subpattern
 246    (?P>name)         Recurse into a named subpattern (python syntax)
 247    (?(cond)yes|no)
 248    (?(cond)yes)      Conditional expression, where "cond" can be:
 249                      (N)       subpattern N has matched something
 250                      (<name>)  named subpattern has matched something
 251                      ('name')  named subpattern has matched something
 252                      (?{code}) code condition
 253                      (R)       true if recursing
 254                      (RN)      true if recursing into Nth subpattern
 255                      (R&name)  true if recursing into named subpattern
 256                      (DEFINE)  always false, no no-pattern allowed
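
Some of the extended constructs in action. This is a sketch under the
assumption of Perl 5.10 or later; the patterns and data are illustrative
only.

```perl
use strict;
use warnings;

# Lookbehind and lookahead: insert thousands separators.
(my $n = '1234567') =~ s/(?<=\d)(?=(?:\d{3})+\z)/,/g;      # '1,234,567'

# Named captures are available through %+.
my %date;
if ('2024-05-17' =~ /(?<y>\d{4})-(?<m>\d\d)-(?<d>\d\d)/) {
    %date = (year => $+{y}, month => $+{m}, day => $+{d});
}

# Recursion into a named subpattern: balanced parentheses.
my $paren = qr/(?<paren>\((?:[^()]++|(?&paren))*\))/;
my ($group)  = 'f(a(b)c)d' =~ $paren;                      # '(a(b)c)'
my $balanced = ('(()' =~ /\A$paren\z/) ? 1 : 0;            # 0

# Conditional: require ')' only if group 1 matched '('.
my $cond = qr/\A(\()?\d+(?(1)\))\z/;
my @ok = grep { $_ =~ $cond } ('42', '(42)', '(42', '42)');   # ('42', '(42)')

print "$n | $date{year} | $group | $balanced | @ok\n";
```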
 257
 258 =head2 VARIABLES
 259
 260    $_    Default variable for operators to use
 261
 262    $`    Everything prior to matched string
 263    $&    Entire matched string
 264    $'    Everything after matched string
 265
 266    ${^PREMATCH}   Everything prior to matched string
 267    ${^MATCH}      Entire matched string
 268    ${^POSTMATCH}  Everything after matched string
 269
 270 The use of C<$`>, C<$&> or C<$'> will slow down B<all> regex use
 271 within your program. Consult L<perlvar> for C<@->
 272 to see equivalent expressions that won't cause a slowdown.
 273 See also L<Devel::SawAmpersand>. Starting with Perl 5.10, you
 274 can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}>
 275 and C<${^POSTMATCH}>, but for them to be defined, you have to
 276 specify the C</p> (preserve) modifier on your regular expression.
 277
 278    $1, $2 ...  Hold the Nth captured expr
 279    $+    Last parenthesized pattern match
 280    $^N   Holds the most recently closed capture
 281    $^R   Holds the result of the last (?{...}) expr
 282    @-    Offsets of starts of groups. $-[0] holds start of whole match
 283    @+    Offsets of ends of groups. $+[0] holds end of whole match
 284    %+    Named capture buffers
 285    %-    Named capture buffers, as array refs
 286
 287 Captured groups are numbered according to their I<opening> paren.
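
A sketch of the match variables after a successful match. The input string
is an assumption; C</p> and C<%+> assume Perl 5.10 or later.

```perl
use strict;
use warnings;

my $s = 'say key=value now';
my %seen;

if ($s =~ /(?<k>\w+)=(?<v>\w+)/p) {
    %seen = (
        first   => $1,                 # 'key'
        second  => $2,                 # 'value'
        named   => $+{k},              # 'key'
        last    => $+,                 # 'value'
        start   => $-[0],              # 4
        end     => $+[0],              # 13
        g2start => $-[2],              # 8
        pre     => ${^PREMATCH},       # 'say '
        post    => ${^POSTMATCH},      # ' now'
        # Equivalent of $& built from the offset arrays:
        whole   => substr($s, $-[0], $+[0] - $-[0]),   # 'key=value'
    );
}

# Groups are numbered by their opening parenthesis.
my @nested = 'ab' =~ /((a)(b))/;       # ('ab', 'a', 'b')

print "$seen{whole} [$seen{start},$seen{end}] @nested\n";
```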
 288
 289 =head2 FUNCTIONS
 290
 291    lc          Lowercase a string
 292    lcfirst     Lowercase first char of a string
 293    uc          Uppercase a string
 294    ucfirst     Titlecase first char of a string
 295
 296    pos         Return or set current match position
 297    quotemeta   Quote metacharacters
 298    reset       Reset ?pattern? status
 299    study       Analyze string for optimizing matching
 300
 301    split       Use a regex to split a string into parts
 302
 303 The first four of these are like the escape sequences C<\L>, C<\l>,
 304 C<\U>, and C<\u>.  For Titlecase, see L</Titlecase>.
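
A minimal sketch of several of these functions, using hypothetical sample
strings:

```perl
use strict;
use warnings;

# Case functions.
my $hello = ucfirst(lc('hELLO'));              # 'Hello'
my $abc   = lcfirst('ABC');                    # 'aBC'

# split with a regex separator.
my @fields = split /\s*,\s*/, 'a , b,c';       # ('a', 'b', 'c')

# quotemeta backslashes every non-word character.
my $safe    = quotemeta('1+1=2');              # '1\+1\=2'
my $literal = ('1+1=2' =~ /\A$safe\z/) ? 1 : 0;   # 1

# pos reports, and can set, where the next m//g starts.
my $str = 'aXbXc';
$str =~ /X/g;
my $after_first = pos($str);                   # 2
pos($str) = 0;                                 # rewind
$str =~ /X/g;
my $after_rewind = pos($str);                  # 2 again

print "$hello $abc | @fields | $safe | $after_first $after_rewind\n";
```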
 305
 306 =head2 TERMINOLOGY
 307
 308 =head3 Titlecase
 309
 310 Unicode concept which most often is equal to uppercase, but for
 311 certain characters like the German "sharp s" there is a difference.
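
As an illustration of the difference, the following sketch uses U+01C6
(LATIN SMALL LETTER DZ WITH CARON), a character chosen here because its
uppercase and titlecase forms are distinct code points; it is not the
example named above.

```perl
use strict;
use warnings;

my $dz = "\x{1C6}";                  # small letter dz with caron

my $upper = uc $dz;                  # U+01C4, both parts uppercase
my $title = ucfirst $dz;             # U+01C5, only the first part uppercase

my $upper_hex = sprintf '%04X', ord $upper;   # '01C4'
my $title_hex = sprintf '%04X', ord $title;   # '01C5'

print "uc: U+$upper_hex, ucfirst: U+$title_hex\n";
```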
 312
 313 =head1 AUTHOR
 314
 315 Iain Truskett. Updated by the Perl 5 Porters.
 316
 317 This document may be distributed under the same terms as Perl itself.
 318
 319 =head1 SEE ALSO
 320
 321 =over 4
 322
 323 =item *
 324
 325 L<perlretut> for a tutorial on regular expressions.
 326
 327 =item *
 328
 329 L<perlrequick> for a rapid tutorial.
 330
 331 =item *
 332
 333 L<perlre> for more details.
 334
 335 =item *
 336
 337 L<perlvar> for details on the variables.
 338
 339 =item *
 340
 341 L<perlop> for details on the operators.
 342
 343 =item *
 344
 345 L<perlfunc> for details on the functions.
 346
 347 =item *
 348
 349 L<perlfaq6> for FAQs on regular expressions.
 350
 351 =item *
 352
 353 L<perlrebackslash> for a reference on backslash sequences.
 354
 355 =item *
 356
 357 L<perlrecharclass> for a reference on character classes.
 358
 359 =item *
 360
 361 The L<re> module to alter behaviour and aid
 362 debugging.
 363
 364 =item *
 365
 366 L<perldebug/"Debugging regular expressions">
 367
 368 =item *
 369
 370 L<perluniintro>, L<perlunicode>, L<charnames> and L<perllocale>
 371 for details on regexes and internationalisation.
 372
 373 =item *
 374
 375 I<Mastering Regular Expressions> by Jeffrey Friedl
 376 (F<http://oreilly.com/catalog/9780596528126/>) for a thorough grounding and
 377 reference on the topic.
 378
 379 =back
 380
 381 =head1 THANKS
 382
 383 David P.C. Wollmann,
 384 Richard Soderberg,
 385 Sean M. Burke,
 386 Tom Christiansen,
 387 Jim Cromie,
 388 and
 389 Jeffrey Goff
 390 for useful advice.
 391
 392 =cut