X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlre.pod;h=fcf3d510e5849dd6e85b6661c67b997521a3121b;hb=cdfeb707a2638190212953e4a52d8460de223429;hp=1324642f715798565d8611431f52c2b7b50e5211;hpb=a0d0e21ea6ea90a22318550944fe6cb09ae10cda;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlre.pod b/pod/perlre.pod index 1324642..fcf3d51 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -1,69 +1,131 @@ =head1 NAME +X X X perlre - Perl regular expressions =head1 DESCRIPTION -For a description of how to use regular expressions in matching -operations, see C and C in L. The matching operations can -have various modifiers, some of which relate to the interpretation of -the regular expression inside. These are: +This page describes the syntax of regular expressions in Perl. - i Do case-insensitive pattern matching. - m Treat string as multiple lines. - s Treat string as single line. - x Use extended regular expressions. +If you haven't used regular expressions before, a quick-start +introduction is available in L, and a longer tutorial +introduction is available in L. + +For reference on how regular expressions are used in matching +operations, plus various examples of the same, see discussions of +C, C, C and C in L. + +Matching operations can have various modifiers. Modifiers +that relate to the interpretation of the regular expression inside +are listed below. Modifiers that alter the way a regular expression +is used by Perl are detailed in L and +L. + +=over 4 + +=item i +X X X +X + +Do case-insensitive pattern matching. + +If C is in effect, the case map is taken from the current +locale. See L. + +=item m +X X X X + +Treat string as multiple lines. That is, change "^" and "$" from matching +the start or end of the string to matching the start or end of any +line anywhere within the string. + +=item s +X X X +X + +Treat string as single line. That is, change "." 
to match any character +whatsoever, even a newline, which normally it would not match. + +Used together, as /ms, they let the "." match any character whatsoever, +while still allowing "^" and "$" to match, respectively, just after +and just before newlines within the string. + +=item x +X + +Extend your pattern's legibility by permitting whitespace and comments. + +=back These are usually written as "the C modifier", even though the delimiter -in question might not actually be a slash. In fact, any of these +in question might not really be a slash. Any of these modifiers may also be embedded within the regular expression itself using -the new C<(?...)> construct. See below. +the C<(?...)> construct. See below. -The C modifier itself needs a little more explanation. It tells the -regular expression parser to ignore whitespace that is not backslashed -or within a character class. You can use this to break up your regular -expression into (slightly) more readable parts. Together with the -capability of embedding comments described later, this goes a long -way towards making Perl 5 a readable language. See the C comment -deletion code in L. +The C modifier itself needs a little more explanation. It tells +the regular expression parser to ignore whitespace that is neither +backslashed nor within a character class. You can use this to break up +your regular expression into (slightly) more readable parts. The C<#> +character is also treated as a metacharacter introducing a comment, +just as in ordinary Perl code. This also means that if you want real +whitespace or C<#> characters in the pattern (outside a character +class, where they are unaffected by C), then you'll either have to +escape them (using backslashes or C<\Q...\E>) or encode them using octal +or hex escapes. Taken together, these features go a long way towards +making Perl's regular expressions more readable. 
Note that you have to +be careful not to include the pattern delimiter in the comment--perl has +no way of knowing you did not intend to close the pattern early. See +the C-comment deletion code in L. Also note that anything inside +a C<\Q...\E> stays unaffected by C. +X =head2 Regular Expressions -The patterns used in pattern matching are regular expressions such as -those supplied in the Version 8 regexp routines. (In fact, the -routines are derived (distantly) from Henry Spencer's freely -redistributable reimplementation of the V8 routines.) -See L for details. +=head3 Metacharacters + +The patterns used in Perl pattern matching derive from supplied in +the Version 8 regex routines. (The routines are derived +(distantly) from Henry Spencer's freely redistributable reimplementation +of the V8 routines.) See L for +details. In particular the following metacharacters have their standard I-ish meanings: +X +X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]> + \ Quote the next metacharacter ^ Match the beginning of the line . Match any character (except newline) - $ Match the end of the line + $ Match the end of the line (or before newline at the end) | Alternation () Grouping [] Character class -By default, the "^" character is guaranteed to match only at the -beginning of the string, the "$" character only at the end (or before the -newline at the end) and Perl does certain optimizations with the +By default, the "^" character is guaranteed to match only the +beginning of the string, the "$" character only the end (or before the +newline at the end), and Perl does certain optimizations with the assumption that the string contains only one line. Embedded newlines will not be matched by "^" or "$". You may, however, wish to treat a string as a multi-line buffer, such that the "^" will match after any -newline within the string, and "$" will match before any newline. 
At the +newline within the string (except if the newline is the last character in +the string), and "$" will match before any newline. At the cost of a little more overhead, you can do this by using the /m modifier on the pattern match operator. (Older programs did this by setting C<$*>, -but this practice is deprecated in Perl 5.) +but this practice has been removed in perl 5.9.) +X<^> X<$> X -To facilitate multi-line substitutions, the "." character never matches a -newline unless you use the C modifier, which tells Perl to pretend -the string is a single line--even if it isn't. The C modifier also -overrides the setting of C<$*>, in case you have some (badly behaved) older -code that sets it in another module. +To simplify multi-line substitutions, the "." character never matches a +newline unless you use the C modifier, which in effect tells Perl to pretend +the string is a single line--even if it isn't. +X<.> X + +=head3 Quantifiers The following standard quantifiers are recognized: +X X X<*> X<+> X X<{n}> X<{n,}> X<{n,m}> * Match 0 or more times + Match 1 or more times @@ -73,17 +135,22 @@ The following standard quantifiers are recognized: {n,m} Match at least n but not more than m times (If a curly bracket occurs in any other context, it is treated -as a regular character.) The "*" modifier is equivalent to C<{0,}>, the "+" -modifier to C<{1,}>, and the "?" modifier to C<{0,1}>. There is no limit to the -size of n or m, but large numbers will chew up more memory. +as a regular character. In particular, the lower bound +is not optional.) The "*" modifier is equivalent to C<{0,}>, the "+" +modifier to C<{1,}>, and the "?" modifier to C<{0,1}>. n and m are limited +to integral values less than a preset limit defined when perl is built. +This is usually 32766 on the most common platforms. The actual limit can +be seen in the error message generated by code such as this: + + $_ **= $_ , / {$_} / for 2 .. 
42; By default, a quantified subpattern is "greedy", that is, it will match as -many times as possible without causing the rest pattern not to match. The -standard quantifiers are all "greedy", in that they match as many -occurrences as possible (given a particular starting location) without -causing the pattern to fail. If you want it to match the minimum number -of times possible, follow the quantifier with a "?" after any of them. -Note that the meanings don't change, just the "gravity": +many times as possible (given a particular starting location) while still +allowing the rest of the pattern to match. If you want it to match the +minimum number of times possible, follow the quantifier with a "?". Note +that the meanings don't change, just the "greediness": +X X X +X X<*?> X<+?> X X<{n}?> X<{n,}?> X<{n,m}?> *? Match 0 or more times +? Match 1 or more times @@ -92,224 +159,1734 @@ Note that the meanings don't change, just the "gravity": {n,}? Match at least n times {n,m}? Match at least n but not more than m times -Since patterns are processed as double quoted strings, the following +By default, when a quantified subpattern does not allow the rest of the +overall pattern to match, Perl will backtrack. However, this behaviour is +sometimes undesirable. Thus Perl provides the "possessive" quantifier form +as well. + + *+ Match 0 or more times and give nothing back + ++ Match 1 or more times and give nothing back + ?+ Match 0 or 1 time and give nothing back + {n}+ Match exactly n times and give nothing back (redundant) + {n,}+ Match at least n times and give nothing back + {n,m}+ Match at least n but not more than m times and give nothing back + +For instance, + + 'aaaa' =~ /a++a/ + +will never match, as the C<a++> will gobble up all the C<a>'s in the +string and won't leave any for the remaining part of the pattern. This +feature can be extremely useful to give perl hints about where it +shouldn't backtrack. 
For instance, the typical "match a double-quoted +string" problem can be most efficiently performed when written as: + + /"(?:[^"\\]++|\\.)*+"/ + +as we know that if the final quote does not match, backtracking will not +help. See the independent subexpression C<< (?>...) >> for more details; +possessive quantifiers are just syntactic sugar for that construct. For +instance the above example could also be written as follows: + + /"(?>(?:(?>[^"\\]+)|\\.)*)"/ + +=head3 Escape sequences + +Because patterns are processed as double quoted strings, the following also work: +X<\t> X<\n> X<\r> X<\f> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q> +X<\0> X<\c> X<\N> X<\x> - \t tab - \n newline - \r return - \f form feed - \v vertical tab, whatever that is - \a alarm (bell) - \e escape - \033 octal char - \x1b hex char + \t tab (HT, TAB) + \n newline (LF, NL) + \r return (CR) + \f form feed (FF) + \a alarm (bell) (BEL) + \e escape (think troff) (ESC) + \033 octal char (think of a PDP-11) + \x1B hex char + \x{263a} wide hex char (Unicode SMILEY) \c[ control char - \l lowercase next char - \u uppercase next char - \L lowercase till \E - \U uppercase till \E - \E end case modification - \Q quote regexp metacharacters till \E + \N{name} named char + \l lowercase next char (think vi) + \u uppercase next char (think vi) + \L lowercase till \E (think vi) + \U uppercase till \E (think vi) + \E end case modification (think vi) + \Q quote (disable) pattern metacharacters till \E + +If C is in effect, the case map used by C<\l>, C<\L>, C<\u> +and C<\U> is taken from the current locale. See L. For +documentation of C<\N{name}>, see L. + +You cannot include a literal C<$> or C<@> within a C<\Q> sequence. +An unescaped C<$> or C<@> interpolates the corresponding variable, +while escaping will cause the literal string C<\$> to be matched. +You'll need to write something like C. 
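The C<\Q> restriction just described can be sketched as follows. This is only an illustration: the variables C<$user> and C<$host> and their values are hypothetical and do not come from the original text. The C<@> is written outside the quoted spans and escaped on its own, while C<\Q...\E> quotes the interpolated values.

```perl
use strict;
use warnings;

# Hypothetical values containing regex metacharacters ("." and "+").
my $user = 'j.doe+test';
my $host = 'example.com';

# \Q...\E quotes each interpolated value; the "@" sits outside \Q as \@.
my $re = qr/^\Q$user\E\@\Q$host\E$/;

print "literal address matches\n"
    if 'j.doe+test@example.com' =~ $re;

# Because "." and "+" were quoted, they no longer act as metacharacters.
print "look-alike address rejected\n"
    unless 'jXdoe+test@exampleXcom' =~ $re;
```

Keeping the C<@> outside the C<\Q> span avoids both unwanted array interpolation and a stray literal backslash in the pattern.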
+ +=head3 Character classes In addition, Perl defines the following: +X +X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C> +X X + + \w Match a "word" character (alphanumeric plus "_") + \W Match a non-"word" character + \s Match a whitespace character + \S Match a non-whitespace character + \d Match a digit character + \D Match a non-digit character + \pP Match P, named property. Use \p{Prop} for longer names. + \PP Match non-P + \X Match eXtended Unicode "combining character sequence", + equivalent to (?:\PM\pM*) + \C Match a single C char (octet) even under Unicode. + NOTE: breaks up characters into their UTF-8 bytes, + so you may end up with malformed pieces of UTF-8. + Unsupported in lookbehind. + \1 Backreference to a specific group. + '1' may actually be any positive integer + \k<name> Named backreference + \N{name} Named unicode character, or unicode escape. + \x12 Hexadecimal escape sequence + \x{1234} Long hexadecimal escape sequence + +A C<\w> matches a single alphanumeric character (an alphabetic +character, or a decimal digit) or C<_>, not a whole word. Use C<\w+> +to match a string of Perl-identifier characters (which isn't the same +as matching an English word). If C is in effect, the list +of alphabetic characters generated by C<\w> is taken from the current +locale. See L. You may use C<\w>, C<\W>, C<\s>, C<\S>, +C<\d>, and C<\D> within character classes, but if you try to use them +as endpoints of a range, that's not a range, the "-" is understood +literally. If Unicode is in effect, C<\s> matches also "\x{85}", +"\x{2028}", and "\x{2029}", see L for more details about +C<\pP>, C<\PP>, and C<\X>, and L about Unicode in general. +You can define your own C<\p> and C<\P> properties, see L. +X<\w> X<\W> X + +The POSIX character class syntax +X + + [:class:] + +is also available. Note that the C<[> and C<]> braces are I<literal>; +they must always be used within a character class expression. 
+ + # this is correct: + $string =~ /[[:alpha:]]/; + + # this is not, and will generate a warning: + $string =~ /[:alpha:]/; + +The available classes and their backslash equivalents (if available) are +as follows: +X +X X X X X X X +X X X X X X X + + alpha + alnum + ascii + blank [1] + cntrl + digit \d + graph + lower + print + punct + space \s [2] + upper + word \w [3] + xdigit + +=over + +=item [1] + +A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace". + +=item [2] + +Not exactly equivalent to C<\s> since the C<[[:space:]]> includes +also the (very rare) "vertical tabulator", "\ck", chr(11). + +=item [3] + +A Perl extension, see above. + +=back + +For example use C<[:upper:]> to match all the uppercase characters. +Note that the C<[]> are part of the C<[::]> construct, not part of the +whole character class. For example: + + [01[:alpha:]%] + +matches zero, one, any alphabetic character, and the percentage sign. - \w Match a "word" character (alphanumeric plus "_") - \W Match a non-word character - \s Match a whitespace character - \S Match a non-whitespace character - \d Match a digit character - \D Match a non-digit character +The following equivalences to Unicode \p{} constructs and equivalent +backslash character classes (if available), will hold: +X X<\p> X<\p{}> -Note that C<\w> matches a single alphanumeric character, not a whole -word. To match a word you'd need to say C<\w+>. You may use C<\w>, C<\W>, C<\s>, -C<\S>, C<\d> and C<\D> within character classes (though not as either end of a -range). + [[:...:]] \p{...} backslash + + alpha IsAlpha + alnum IsAlnum + ascii IsASCII + blank IsSpace + cntrl IsCntrl + digit IsDigit \d + graph IsGraph + lower IsLower + print IsPrint + punct IsPunct + space IsSpace + IsSpacePerl \s + upper IsUpper + word IsWord + xdigit IsXDigit + +For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent. 
+ +If the C pragma is not used but the C pragma is, the +classes correlate with the usual isalpha(3) interface (except for +"word" and "blank"). + +The assumedly non-obviously named classes are: + +=over 4 + +=item cntrl +X + +Any control character. Usually characters that don't produce output as +such but instead control the terminal somehow: for example newline and +backspace are control characters. All characters with ord() less than +32 are most often classified as control characters (assuming ASCII, +the ISO Latin character sets, and Unicode), as is the character with +the ord() value of 127 (C). + +=item graph +X + +Any alphanumeric or punctuation (special) character. + +=item print +X + +Any alphanumeric or punctuation (special) character or the space character. + +=item punct +X + +Any punctuation (special) character. + +=item xdigit +X + +Any hexadecimal digit. Though this may feel silly ([0-9A-Fa-f] would +work just fine) it is included for completeness. + +=back + +You can negate the [::] character classes by prefixing the class name +with a '^'. This is a Perl extension. For example: +X + + POSIX traditional Unicode + + [[:^digit:]] \D \P{IsDigit} + [[:^space:]] \S \P{IsSpace} + [[:^word:]] \W \P{IsWord} + +Perl respects the POSIX standard in that POSIX character classes are +only supported within a character class. The POSIX character classes +[.cc.] and [=cc=] are recognized but B supported and trying to +use them will cause an error. 
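As a rough sketch of the POSIX class rules above, the following shows a class used inside a bracketed character class, its negated form, and a class combined with other members. The sample string is made up for illustration, and the comparison with C<\d> and C<\D> is only claimed for this plain ASCII input.

```perl
use strict;
use warnings;

# Hypothetical ASCII sample text.
my $text = "Room 42b, floor 3";

# [[:digit:]] inside a bracketed class; comparable to \d for this input.
my @digits = $text =~ /([[:digit:]]+)/g;

# [[:^digit:]] is the negated form (a Perl extension), comparable to \D.
my @non_digits = $text =~ /([[:^digit:]]+)/g;

# A POSIX class may be combined with other members of the same class.
my @mixed = $text =~ /([[:alpha:]0-9]+)/g;

print "digits: @digits\n";
print "non-digits: ", join('|', @non_digits), "\n";
print "mixed: @mixed\n";
```

Writing C<[:digit:]> without the outer brackets would instead be read as an ordinary class containing the characters C<:>, C<d>, C<i>, C<g> and C<t>, which is why Perl warns about it.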
+ +=head3 Assertions Perl defines the following zero-width assertions: +X X X +X +X +X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G> \b Match a word boundary \B Match a non-(word boundary) \A Match only at beginning of string - \Z Match only at end of string - \G Match only where previous m//g left off - -A word boundary (C<\b>) is defined as a spot between two characters that -has a C<\w> on one side of it and and a C<\W> on the other side of it (in -either order), counting the imaginary characters off the beginning and -end of the string as matching a C<\W>. (Within character classes C<\b> -represents backspace rather than a word boundary.) The C<\A> and C<\Z> are -just like "^" and "$" except that they won't match multiple times when the -C modifier is used, while "^" and "$" will match at every internal line -boundary. - -When the bracketing construct C<( ... )> is used, \ matches the -digit'th substring. (Outside of the pattern, always use "$" instead of -"\" in front of the digit. The scope of $ (and C<$`>, C<$&>, and C<$')> -extends to the end of the enclosing BLOCK or eval string, or to the -next pattern match with subexpressions. -If you want to -use parentheses to delimit subpattern (e.g. a set of alternatives) without -saving it as a subpattern, follow the ( with a ?. -The \ notation -sometimes works outside the current pattern, but should not be relied -upon.) You may have as many parentheses as you wish. If you have more -than 9 substrings, the variables $10, $11, ... refer to the -corresponding substring. Within the pattern, \10, \11, etc. refer back -to substrings if there have been at least that many left parens before -the backreference. Otherwise (for backward compatibilty) \10 is the -same as \010, a backspace, and \11 the same as \011, a tab. And so -on. (\1 through \9 are always backreferences.) - -C<$+> returns whatever the last bracket match matched. C<$&> returns the -entire matched string. ($0 used to return the same thing, but not any -more.) 
C<$`> returns everything before the matched string. C<$'> returns -everything after the matched string. Examples: + \Z Match only at end of string, or before newline at the end + \z Match only at end of string + \G Match only at pos() (e.g. at the end-of-match position + of prior m//g) + +A word boundary (C<\b>) is a spot between two characters +that has a C<\w> on one side of it and a C<\W> on the other side +of it (in either order), counting the imaginary characters off the +beginning and end of the string as matching a C<\W>. (Within +character classes C<\b> represents backspace rather than a word +boundary, just as it normally does in any double-quoted string.) +The C<\A> and C<\Z> are just like "^" and "$", except that they +won't match multiple times when the C modifier is used, while +"^" and "$" will match at every internal line boundary. To match +the actual end of the string and not ignore an optional trailing +newline, use C<\z>. +X<\b> X<\A> X<\Z> X<\z> X + +The C<\G> assertion can be used to chain global matches (using +C), as described in L. +It is also useful when writing C-like scanners, when you have +several patterns that you want to match against consequent substrings +of your string, see the previous reference. The actual location +where C<\G> will match can also be influenced by using C as +an lvalue: see L. Currently C<\G> is only fully +supported when anchored to the start of the pattern; while it +is permitted to use it elsewhere, as in C, some +such uses (C, for example) currently cause problems, and +it is recommended that you avoid such usage for now. +X<\G> + +=head3 Capture buffers + +The bracketing construct C<( ... )> creates capture buffers. To +refer to the digit'th buffer use \ within the +match. Outside the match use "$" instead of "\". (The +\ notation works in certain circumstances outside +the match. See the warning below about \1 vs $1 for details.) +Referring back to another part of the match is called a +I. 
+X X +X X + +There is no limit to the number of captured substrings that you may +use. However Perl also uses \10, \11, etc. as aliases for \010, +\011, etc. (Recall that 0 means octal, so \011 is the character at +number 9 in your coded character set; which would be the 10th character, +a horizontal tab under ASCII.) Perl resolves this +ambiguity by interpreting \10 as a backreference only if at least 10 +left parentheses have opened before it. Likewise \11 is a +backreference only if at least 11 left parentheses have opened +before it. And so on. \1 through \9 are always interpreted as +backreferences. + +Additionally, as of Perl 5.10 you may use named capture buffers and named +backreferences. The notation is C<< (?...) >> and C<< \k >> +(you may also use single quotes instead of angle brackets to quote the +name). The only difference with named capture buffers and unnamed ones is +that multiple buffers may have the same name and that the contents of +named capture buffers is available via the C<%+> hash. When multiple +groups share the same name C<$+{name}> and C<< \k >> refer to the +leftmost defined group, thus it's possible to do things with named capture +buffers that would otherwise require C<(??{})> code to accomplish. Named +capture buffers are numbered just as normal capture buffers are and may be +referenced via the magic numeric variables or via numeric backreferences +as well as by name. + +Examples: s/^([^ ]*) *([^ ]*)/$2 $1/; # swap first two words - if (/Time: (..):(..):(..)/) { + /(.)\1/ # find first doubled char + and print "'$1' is the first doubled character\n"; + + /(?.)\k/ # ... a different way + and print "'$+{char}' is the first doubled character\n"; + + /(?.)\1/ # ... mix and match + and print "'$1' is the first doubled character\n"; + + if (/Time: (..):(..):(..)/) { # parse out values $hours = $1; $minutes = $2; $seconds = $3; } -You will note that all backslashed metacharacters in Perl are -alphanumeric, such as C<\b>, C<\w>, C<\n>. 
Unlike some other regular expression -languages, there are no backslashed symbols that aren't alphanumeric. -So anything that looks like \\, \(, \), \<, \>, \{, or \} is always -interpreted as a literal character, not a metacharacter. This makes it -simple to quote a string that you want to use for a pattern but that -you are afraid might contain metacharacters. Simply quote all the -non-alphanumeric characters: +Several special variables also refer back to portions of the previous +match. C<$+> returns whatever the last bracket match matched. +C<$&> returns the entire matched string. (At one point C<$0> did +also, but now it returns the name of the program.) C<$`> returns +everything before the matched string. C<$'> returns everything +after the matched string. And C<$^N> contains whatever was matched by +the most-recently closed group (submatch). C<$^N> can be used in +extended patterns (see below), for example to assign a submatch to a +variable. +X<$+> X<$^N> X<$&> X<$`> X<$'> + +The numbered match variables ($1, $2, $3, etc.) and the related punctuation +set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped +until the end of the enclosing block or until the next successful +match, whichever comes first. (See L.) +X<$+> X<$^N> X<$&> X<$`> X<$'> +X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9> + + +B: failed matches in Perl do not reset the match variables, +which makes it easier to write code that tests for a series of more +specific cases and remembers the best match. + +B: Once Perl sees that you need one of C<$&>, C<$`>, or +C<$'> anywhere in the program, it has to provide them for every +pattern match. This may substantially slow your program. Perl +uses the same mechanism to produce $1, $2, etc, so you also pay a +price for each pattern that contains capturing parentheses. (To +avoid this cost while retaining the grouping behaviour, use the +extended regular expression C<(?: ... )> instead.) 
But if you never +use C<$&>, C<$`> or C<$'>, then patterns I capturing +parentheses will not be penalized. So avoid C<$&>, C<$'>, and C<$`> +if you can, but if you can't (and some algorithms really appreciate +them), once you've used them once, use them at will, because you've +already paid the price. As of 5.005, C<$&> is not so costly as the +other two. +X<$&> X<$`> X<$'> + +Backslashed metacharacters in Perl are alphanumeric, such as C<\b>, +C<\w>, C<\n>. Unlike some other regular expression languages, there +are no backslashed symbols that aren't alphanumeric. So anything +that looks like \\, \(, \), \<, \>, \{, or \} is always +interpreted as a literal character, not a metacharacter. This was +once used in a common idiom to disable or quote the special meanings +of regular expression metacharacters in a string that you want to +use for a pattern. Simply quote all non-"word" characters: $pattern =~ s/(\W)/\\$1/g; -You can also use the built-in quotemeta() function to do this. -An even easier way to quote metacharacters right in the match operator -is to say +(If C is set, then this depends on the current locale.) +Today it is more common to use the quotemeta() function or the C<\Q> +metaquoting escape sequence to disable all metacharacters' special +meanings like this: /$unquoted\Q$quoted\E$unquoted/ -Perl 5 defines a consistent extension syntax for regular expressions. -The syntax is a pair of parens with a question mark as the first thing -within the parens (this was a syntax error in Perl 4). The character -after the question mark gives the function of the extension. Several -extensions are already supported: +Beware that if you put literal backslashes (those not inside +interpolated variables) between C<\Q> and C<\E>, double-quotish +backslash interpolation may lead to confusing results. If you +I to use literal backslashes within C<\Q...\E>, +consult L. 
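The three quoting approaches mentioned above can be compared in a short sketch. The sample string is hypothetical, and the claim that the hand-rolled substitution and quotemeta() agree is made only for this ASCII input with no locale in effect.

```perl
use strict;
use warnings;

# Hypothetical text that happens to contain metacharacters.
my $literal = 'cost: $5.00 (approx.)';

# The older idiom: backslash every non-"word" character by hand.
(my $by_hand = $literal) =~ s/(\W)/\\$1/g;

# The built-in function gives the same result for this ASCII input.
my $by_func = quotemeta $literal;

print "same quoting\n" if $by_hand eq $by_func;

# \Q...\E applies the quoting inline, so the text is matched literally.
print "literal match\n"
    if 'total cost: $5.00 (approx.)' =~ /\Q$literal\E/;
print "no wildcard match\n"
    unless 'cost: $5x00 (approx.)' =~ /\Q$literal\E/;
```

The last test shows the point of quoting: without it the C<.> in the price would match any character.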
+ +=head2 Extended Patterns + +Perl also defines a consistent extension syntax for features not +found in standard tools like B and B. The syntax is a +pair of parentheses with a question mark as the first thing within +the parentheses. The character after the question mark indicates +the extension. + +The stability of these extensions varies widely. Some have been +part of the core language for many years. Others are experimental +and may change without warning or be completely removed. Check +the documentation on an individual feature to verify its current +status. + +A question mark was chosen for this and for the minimal-matching +construct because 1) question marks are rare in older regular +expressions, and 2) whenever you see one, you should stop and +"question" exactly what is going on. That's psychology... =over 10 -=item (?#text) +=item C<(?#text)> +X<(?#)> + +A comment. The text is ignored. If the C modifier enables +whitespace formatting, a simple C<#> will suffice. Note that Perl closes +the comment as soon as it sees a C<)>, so there is no way to put a literal +C<)> in the comment. + +=item C<(?imsx-imsx)> +X<(?)> + +One or more embedded pattern-match modifiers, to be turned on (or +turned off, if preceded by C<->) for the remainder of the pattern or +the remainder of the enclosing pattern group (if any). This is +particularly useful for dynamic patterns, such as those read in from a +configuration file, read in as an argument, are specified in a table +somewhere, etc. Consider the case that some of which want to be case +sensitive and some do not. The case insensitive ones need to include +merely C<(?i)> at the front of the pattern. For example: + + $pattern = "foobar"; + if ( /$pattern/i ) { } + + # more flexible: + + $pattern = "(?i)foobar"; + if ( /$pattern/ ) { } + +These modifiers are restored at the end of the enclosing group. For example, + + ( (?i) blah ) \s+ \1 -A comment. The text is ignored. +will match a repeated (I!) 
word C in any +case, assuming C modifier, and no C modifier outside this +group. -=item (?:regexp) +=item C<(?:pattern)> +X<(?:)> -This groups things like "()" but doesn't make backrefences like "()" does. So +=item C<(?imsx-imsx:pattern)> - split(/\b(?:a|b|c)\b/) +This is for clustering, not capturing; it groups subexpressions like +"()", but doesn't make backreferences as "()" does. So + + @fields = split(/\b(?:a|b|c)\b/) is like - split(/\b(a|b|c)\b/) + @fields = split(/\b(a|b|c)\b/) + +but doesn't spit out extra fields. It's also cheaper not to capture +characters if you don't need to. + +Any letters between C and C<:> act as flags modifiers as with +C<(?imsx-imsx)>. For example, + + /(?s-i:more.*than).*million/i -but doesn't spit out extra fields. +is equivalent to the more verbose -=item (?=regexp) + /(?:(?s-i)more.*than).*million/i -A zero-width positive lookahead assertion. For example, C +=item C<(?=pattern)> +X<(?=)> X X + +A zero-width positive look-ahead assertion. For example, C matches a word followed by a tab, without including the tab in C<$&>. -=item (?!regexp) +=item C<(?!pattern)> +X<(?!)> X X -A zero-width negative lookahead assertion. For example C +A zero-width negative look-ahead assertion. For example C matches any occurrence of "foo" that isn't followed by "bar". Note -however that lookahead and lookbehind are NOT the same thing. You cannot -use this for lookbehind: C will not find an occurrence of -"bar" that is preceded by something which is not "foo". That's because -the C<(?!foo)> is just saying that the next thing cannot be "foo"--and -it's not, it's a "bar", so "foobar" will match. You would have to do -something like C for that. We say "like" because there's -the case of your "bar" not having three characters before it. You could -cover that this way: C. Sometimes it's still -easier just to say: +however that look-ahead and look-behind are NOT the same thing. You cannot +use this for look-behind. 
- if (/foo/ && $` =~ /bar$/) +If you are looking for a "bar" that isn't preceded by a "foo", C +will not do what you want. That's because the C<(?!foo)> is just saying that +the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will +match. You would have to do something like C for that. We +say "like" because there's the case of your "bar" not having three characters +before it. You could cover that this way: C. +Sometimes it's still easier just to say: + if (/bar/ && $` !~ /foo$/) -=item (?imsx) +For look-behind see below. -One or more embedded pattern-match modifiers. This is particularly -useful for patterns that are specified in a table somewhere, some of -which want to be case sensitive, and some of which don't. The case -insensitive ones merely need to include C<(?i)> at the front of the -pattern. For example: +=item C<(?<=pattern)> +X<(?<=)> X X - $pattern = "foobar"; - if ( /$pattern/i ) +A zero-width positive look-behind assertion. For example, C +matches a word that follows a tab, without including the tab in C<$&>. +Works only for fixed-width look-behind. - # more flexible: +=item C<(?<!pattern)> +X<(?<!)> X X - $pattern = "(?i)foobar"; - if ( /$pattern/ ) +A zero-width negative look-behind assertion. For example C +matches any occurrence of "foo" that does not follow "bar". Works +only for fixed-width look-behind. + +=item C<(?'NAME'pattern)> + +=item C<< (?<NAME>pattern) >> +X<< (?<NAME>) >> X<(?'NAME')> X X + +A named capture buffer. Identical in every respect to normal capturing +parens C<()> but for the additional fact that C<%+> may be used after +a successful match to refer to a named buffer. See C for more +details on the C<%+> hash. + +If multiple distinct capture buffers have the same name then the +$+{NAME} will refer to the leftmost defined buffer in the match. + +The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent. 
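A minimal sketch of named capture buffers follows, assuming a perl of version 5.10 or later. The input strings and the buffer names C<hours>, C<minutes>, C<seconds> and C<char> are chosen here for illustration only.

```perl
use strict;
use warnings;

# Hypothetical input.
my $stamp = "Time: 12:34:56";

# Both quoting styles for the name are used here; they behave alike.
if ($stamp =~ /Time: (?<hours>..):(?'minutes'..):(?<seconds>..)/) {
    # Named buffers are read through the %+ hash ...
    print "hours=$+{hours} minutes=$+{minutes} seconds=$+{seconds}\n";
    # ... and are still numbered like ordinary capture buffers.
    print "numbered: $1 $2 $3\n";
}

# \k<char> refers back to the named buffer inside the same pattern.
print "first doubled character: $+{char}\n"
    if "bookkeeper" =~ /(?<char>.)\k<char>/;
```

Names tend to make a pattern easier to change later, since adding a group does not renumber the ones that are referred to by name.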
+ +B While the notation of this construct is the same as the similar +function in .NET regexes, the behavior is not, in Perl the buffers are +numbered sequentially regardless of being named or not. Thus in the +pattern + + /(x)(?y)(z)/ + +$+{foo} will be the same as $2, and $3 will contain 'z' instead of +the opposite which is what a .NET regex hacker might expect. + +Currently NAME is restricted to word chars only. In other words, it +must match C. + +=item C<< \k >> + +=item C<< \k'name' >> + +Named backreference. Similar to numeric backreferences, except that +the group is designated by name and not number. If multiple groups +have the same name then it refers to the leftmost defined group in +the current match. + +It is an error to refer to a name not defined by a C<(?)> +earlier in the pattern. + +Both forms are equivalent. + +=item C<(?{ code })> +X<(?{})> X X X + +B: This extended regular expression feature is considered +experimental, and may be changed without notice. Code executed that +has side effects may not perform identically from version to version +due to the effect of future optimisations in the regex engine. + +This zero-width assertion evaluates any embedded Perl code. It +always succeeds, and its C is not interpolated. Currently, +the rules to determine where the C ends are somewhat convoluted. + +This feature can be used together with the special variable C<$^N> to +capture the results of submatches in variables without having to keep +track of the number of nested parentheses. For example: + + $_ = "The brown fox jumps over the lazy dog"; + /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i; + print "color = $color, animal = $animal\n"; + +Inside the C<(?{...})> block, C<$_> refers to the string the regular +expression is matching against. You can also use C to know what is +the current position of matching within this string. 
+ +The C is properly scoped in the following sense: If the assertion +is backtracked (compare L<"Backtracking">), all changes introduced after +Cization are undone, so that + + $_ = 'a' x 8; + m< + (?{ $cnt = 0 }) # Initialize $cnt. + ( + a + (?{ + local $cnt = $cnt + 1; # Update $cnt, backtracking-safe. + }) + )* + aaaa + (?{ $res = $cnt }) # On success copy to non-localized + # location. + >x; + +will set C<$res = 4>. Note that after the match, $cnt returns to the globally +introduced value, because the scopes that restrict C operators +are unwound. + +This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)> +switch. If I used in this way, the result of evaluation of +C is put into the special variable C<$^R>. This happens +immediately, so C<$^R> can be used from other C<(?{ code })> assertions +inside the same regular expression. + +The assignment to C<$^R> above is properly localized, so the old +value of C<$^R> is restored if the assertion is backtracked; compare +L<"Backtracking">. + +Due to an unfortunate implementation issue, the Perl code contained in these +blocks is treated as a compile time closure that can have seemingly bizarre +consequences when used with lexically scoped variables inside of subroutines +or loops. There are various workarounds for this, including simply using +global variables instead. If you are using this construct and strange results +occur then check for the use of lexically scoped variables. + +For reasons of security, this construct is forbidden if the regular +expression involves run-time interpolation of variables, unless the +perilous C pragma has been used (see L), or the +variables contain results of C operator (see +L). + +This restriction is because of the wide-spread and remarkably convenient +custom of using run-time determined strings as patterns. 
For example: + + $re = <>; + chomp $re; + $string =~ /$re/; + +Before Perl knew how to execute interpolated code within a pattern, +this operation was completely safe from a security point of view, +although it could raise an exception from an illegal pattern. If +you turn on the C, though, it is no longer secure, +so you should only do so if you are also using taint checking. +Better yet, use the carefully constrained evaluation within a Safe +compartment. See L for details about both these mechanisms. + +Because perl's regex engine is not currently re-entrant, interpolated +code may not invoke the regex engine either directly with C or C), +or indirectly with functions such as C. + +=item C<(??{ code })> +X<(??{})> +X X X + +B: This extended regular expression feature is considered +experimental, and may be changed without notice. Code executed that +has side effects may not perform identically from version to version +due to the effect of future optimisations in the regex engine. + +This is a "postponed" regular subexpression. The C is evaluated +at run time, at the moment this subexpression may match. The result +of evaluation is considered as a regular expression and matched as +if it were inserted instead of this construct. Note that this means +that the contents of capture buffers defined inside an eval'ed pattern +are not available outside of the pattern, and vice versa, there is no +way for the inner pattern to refer to a capture buffer defined outside. +Thus, + + ('a' x 100)=~/(??{'(.)' x 100})/ + +B match, it will B set $1. + +The C is not interpolated. As before, the rules to determine +where the C ends are currently somewhat convoluted. + +The following pattern matches a parenthesized group: + + $re = qr{ + \( + (?: + (?> [^()]+ ) # Non-parens without backtracking + | + (??{ $re }) # Group with matching parens + )* + \) + }x; + +See also C<(?PARNO)> for a different, more efficient way to accomplish +the same task. 
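A sketch of that alternative, assuming the parenthesized group is the first capture buffer in the final pattern (the names C<$balanced>, C<@samples> and C<@ok> and the sample strings are illustrative only):

```perl
use strict;
use warnings;

my $balanced = qr{
    (                       # paren group 1
        \(
        (?:
            (?> [^()]+ )    # Non-parens without backtracking
          |
            (?1)            # Recurse to start of paren group 1
        )*
        \)
    )
}x;

# Illustrative inputs: two balanced strings and one missing a close paren.
my @samples = ( "(a(b)c)", "(a(b c)", "()" );
my @ok      = grep { /^$balanced$/ } @samples;

print "balanced: @ok\n";
```

Because C<(?1)> refers to a group by its absolute number, this only works as written while C<$balanced> supplies the first capturing group of whatever pattern it is interpolated into.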
+ +Because perl's regex engine is not currently re-entrant, delayed +code may not invoke the regex engine either directly with C or C), +or indirectly with functions such as C. + +Recursing deeper than 50 times without consuming any input string will +result in a fatal error. The maximum depth is compiled into perl, so +changing it requires a custom build. + +=item C<(?PARNO)> C<(?R)> C<(?0)> +X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> +X X X + +Similar to C<(??{ code })> except it does not involve compiling any code, +instead it treats the contents of a capture buffer as an independent +pattern that must match at the current position. Capture buffers +contained by the pattern will have the value as determined by the +outermost recursion. + +PARNO is a sequence of digits (not starting with 0) whose value reflects +the paren-number of the capture buffer to recurse to. C<(?R)> recurses to +the beginning of the whole pattern. C<(?0)> is an alternate syntax for +C<(?R)>. + +The following pattern matches a function foo() which may contain +balanced parentheses as the argument. + + $re = qr{ ( # paren group 1 (full function) + foo + ( # paren group 2 (parens) + \( + ( # paren group 3 (contents of parens) + (?: + (?> [^()]+ ) # Non-parens without backtracking + | + (?2) # Recurse to start of paren group 2 + )* + ) + \) + ) + ) + }x; + +If the pattern was used as follows + + 'foo(bar(baz)+baz(bop))'=~/$re/ + and print "\$1 = $1\n", + "\$2 = $2\n", + "\$3 = $3\n"; + +the output produced should be the following: + + $1 = foo(bar(baz)+baz(bop)) + $2 = (bar(baz)+baz(bop)) + $3 = bar(baz)+baz(bop) + +If there is no corresponding capture buffer defined, then it is a +fatal error. Recursing deeper than 50 times without consuming any input +string will also result in a fatal error. The maximum depth is compiled +into perl, so changing it requires a custom build. + +B that this pattern does not behave the same way as the equivalent +PCRE or Python construct of the same form. 
In perl you can backtrack into +a recursed group, in PCRE and Python the recursed into group is treated +as atomic. Also, constructs like (?i:(?1)) or (?:(?i)(?1)) do not affect +the pattern being recursed into. + +=item C<(?&NAME)> +X<(?&NAME)> + +Recurse to a named subpattern. Identical to (?PARNO) except that the +parenthesis to recurse to is determined by name. If multiple parens have +the same name, then it recurses to the leftmost. + +It is an error to refer to a name that is not declared somewhere in the +pattern. + +=item C<(?(condition)yes-pattern|no-pattern)> +X<(?()> + +=item C<(?(condition)yes-pattern)> + +Conditional expression. C<(condition)> should be either an integer in +parentheses (which is valid if the corresponding pair of parentheses +matched), a look-ahead/look-behind/evaluate zero-width assertion, a +name in angle brackets or single quotes (which is valid if a buffer +with the given name matched), or the special symbol (R) (true when +evaluated inside of recursion or eval). Additionally the R may be +followed by a number, (which will be true when evaluated when recursing +inside of the appropriate group), or by C<&NAME>, in which case it will +be true only when evaluated during recursion in the named group. + +Here's a summary of the possible predicates: + +=over 4 + +=item (1) (2) ... + +Checks if the numbered capturing buffer has matched something. + +=item () ('NAME') + +Checks if a buffer with the given name has matched something. + +=item (?{ CODE }) + +Treats the code block as the condition. + +=item (R) + +Checks if the expression has been evaluated inside of recursion. + +=item (R1) (R2) ... + +Checks if the expression has been evaluated while executing directly +inside of the n-th capture group. This check is the regex equivalent of + + if ((caller(0))[3] eq 'subname') { ... } + +In other words, it does not check the full recursion stack. 
+ +=item (R&NAME) + +Similar to C<(R1)>, this predicate checks to see if we're executing +directly inside of the leftmost group with a given name (this is the same +logic used by C<(?&NAME)> to disambiguate). It does not check the full +stack, but only the name of the innermost active recursion. + +=item (DEFINE) + +In this case, the yes-pattern is never directly executed, and no +no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient. +See below for details. + +=back + +For example: + + m{ ( \( )? + [^()]+ + (?(1) \) ) + }x + +matches a chunk of non-parentheses, possibly included in parentheses +themselves. + +A special form is the C<(DEFINE)> predicate, which never executes directly +its yes-pattern, and does not allow a no-pattern. This allows to define +subpatterns which will be executed only by using the recursion mechanism. +This way, you can define a set of regular expression rules that can be +bundled into any pattern you choose. + +It is recommended that for this usage you put the DEFINE block at the +end of the pattern, and that you name any subpatterns defined within it. + +Also, it's worth noting that patterns defined this way probably will +not be as efficient, as the optimiser is not very clever about +handling them. + +An example of how this might be used is as follows: + + /(?(&NAME_PAT))(?(&ADDRESS_PAT)) + (?(DEFINE) + (....) + (....) + )/x + +Note that capture buffers matched inside of recursion are not accessible +after the recursion returns, so the extra layer of capturing buffers are +necessary. Thus C<$+{NAME_PAT}> would not be defined even though +C<$+{NAME}> would be. + +=item C<< (?>pattern) >> +X X X X + +An "independent" subexpression, one which matches the substring +that a I C would match if anchored at the given +position, and it matches I. This +construct is useful for optimizations of what would otherwise be +"eternal" matches, because it will not backtrack (see L<"Backtracking">). 
+It may also be useful in places where the "grab all you can, and do not +give anything back" semantic is desirable. + +For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >> +(anchored at the beginning of string, as above) will match I +characters C at the beginning of string, leaving no C for +C to match. In contrast, C will match the same as C, +since the match of the subgroup C is influenced by the following +group C (see L<"Backtracking">). In particular, C inside +C will match fewer characters than a standalone C, since +this makes the tail match. + +An effect similar to C<< (?>pattern) >> may be achieved by writing +C<(?=(pattern))\1>. This matches the same substring as a standalone +C, and the following C<\1> eats the matched string; it therefore +makes a zero-length assertion into an analogue of C<< (?>...) >>. +(The difference between these two constructs is that the second one +uses a capturing group, thus shifting ordinals of backreferences +in the rest of a regular expression.) + +Consider this pattern: + + m{ \( + ( + [^()]+ # x+ + | + \( [^()]* \) + )+ + \) + }x + +That will efficiently match a nonempty group with matching parentheses +two levels deep or less. However, if there is no such group, it +will take virtually forever on a long string. That's because there +are so many different ways to split a long string into several +substrings. This is what C<(.+)+> is doing, and C<(.+)+> is similar +to a subpattern of the above pattern. Consider how the pattern +above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several +seconds, but that each extra letter doubles this time. This +exponential performance will make it appear that your program has +hung. However, a tiny change to this pattern + + m{ \( + ( + (?> [^()]+ ) # change x+ above to (?> x+ ) + | + \( [^()]* \) + )+ + \) + }x + +which uses C<< (?>...) 
>> matches exactly when the one above does (verifying +this yourself would be a productive exercise), but finishes in a fourth +the time when used on a similar string with 1000000 Cs. Be aware, +however, that this pattern currently triggers a warning message under +the C pragma or B<-w> switch saying it +C<"matches null string many times in regex">. + +On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable +effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>. +This was only 4 times slower on a string with 1000000 Cs. + +The "grab all you can, and do not give anything back" semantic is desirable +in many situations where on the first sight a simple C<()*> looks like +the correct solution. Suppose we parse text with comments being delimited +by C<#> followed by some optional (horizontal) whitespace. Contrary to +its appearance, C<#[ \t]*> I the correct subexpression to match +the comment delimiter, because it may "give up" some whitespace if +the remainder of the pattern can be made to match that way. The correct +answer is either one of these: + + (?>#[ \t]*) + #[ \t]*(?![ \t]) + +For example, to grab non-empty comments into $1, one should use either +one of these: + + / (?> \# [ \t]* ) ( .+ ) /x; + / \# [ \t]* ( [^ \t] .* ) /x; + +Which one you pick depends on which of these expressions better reflects +the above specification of comments. + +In some literature this construct is called "atomic matching" or +"possessive matching". + +Possessive quantifiers are equivalent to putting the item they are applied +to inside of one of these constructs. The following equivalences apply: + + Quantifier Form Bracketing Form + --------------- --------------- + PAT*+ (?>PAT*) + PAT++ (?>PAT+) + PAT?+ (?>PAT?) + PAT{min,max}+ (?>PAT{min,max}) + +=back + +=head2 Special Backtracking Control Verbs + +B These patterns are experimental and subject to change or +removal in a future version of perl. 
Their usage in production code should +be noted to avoid problems during upgrades. + +These special patterns are generally of the form C<(*VERB:ARG)>. Unless +otherwise stated the ARG argument is optional; in some cases, it is +forbidden. + +Any pattern containing a special backtracking verb that allows an argument +has the special behaviour that when executed it sets the current package's +C<$REGERROR> variable. In this case, the following rules apply: + +On failure, this variable will be set to the ARG value of the verb +pattern, if the verb was involved in the failure of the match. If the ARG +part of the pattern was omitted, then C<$REGERROR> will be set to TRUE. + +On a successful match this variable will be set to FALSE. + +B<NOTE:> C<$REGERROR> is not a magic variable in the same sense as +C<$1> and most other regex related variables. It is not local to a +scope, nor readonly but instead a volatile package variable similar to +C<$AUTOLOAD>. Use C<local> to localize changes to it to a specific scope +if necessary. + +If a pattern does not contain a special backtracking verb that allows an +argument, then C<$REGERROR> is not touched at all. + +=over 4 + +=item Verbs that take an argument + +=over 4 + +=item C<(*NOMATCH)> C<(*NOMATCH:NAME)> +X<(*NOMATCH)> X<(*NOMATCH:NAME)> + +This zero-width pattern commits the match at the current point, preventing +the engine from backtracking on failure to the left of this point. +Consider the pattern C<A (*NOMATCH) B>, where A and B are complex patterns. +Until the C<(*NOMATCH)> is reached, A may backtrack as necessary to match. +Once it is reached, matching continues in B, which may also backtrack as +necessary; however, should B not match, then no further backtracking will +take place, and the pattern will fail outright at that starting position. + +The following example counts all the possible matching strings in a +pattern (without actually matching any of them).
+ + 'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/; + print "Count=$count\n"; + +which produces: + + aaab + aaa + aa + a + aab + aa + a + ab + a + Count=9 + +If we add a C<(*NOMATCH)> before the count like the following + + 'aaab' =~ /a+b?(*NOMATCH)(?{print "$&\n"; $count++})(*FAIL)/; + print "Count=$count\n"; + +we prevent backtracking and find the count of the longest matching +at each matching startpoint like so: + + aaab + aab + ab + Count=3 + +Any number of C<(*NOMATCH)> assertions may be used in a pattern. + +See also C<< (?>pattern) >> and possessive quantifiers for other +ways to control backtracking. + +=item C<(*MARK)> C<(*MARK:NAME)> +X<(*MARK)> + +This zero-width pattern can be used to mark the point in a string +reached when a certain part of the pattern has been successfully +matched. This mark may be given a name. A later C<(*CUT)> pattern +will then cut at that point if backtracked into on failure. Any +number of (*MARK) patterns are allowed, and the NAME portion is +optional and may be duplicated. + +See C<*CUT> for more detail. + +=item C<(*CUT)> C<(*CUT:NAME)> +X<(*CUT)> + +This zero-width pattern is similar to C<(*NOMATCH)>, except that on +failure it also signifies that whatever text that was matched leading up +to the C<(*CUT)> pattern being executed cannot be part of a match, I. This effectively means that the regex +engine moves forward to this position on failure and tries to match +again, (assuming that there is sufficient room to match). + +The name of the C<(*CUT:NAME)> pattern has special significance. If a +C<(*MARK:NAME)> was encountered while matching, then it is the position +where that pattern was executed that is used for the "cut point" in the +string. If no mark of that name was encountered, then the cut is done at +the point where the C<(*CUT)> was. Similarly if no NAME is specified in +the C<(*CUT)>, and if a C<(*MARK)> with any name (or none) is encountered, +then that C<(*MARK)>'s cursor point will be used. 
If the C<(*CUT)> is not +preceded by a C<(*MARK)>, then the cut point is where the string was when +the C<(*CUT)> was encountered. + +Compare the following to the examples in C<(*NOMATCH)>, note the string +is twice as long: + + 'aaabaaab' =~ /a+b?(*CUT)(?{print "$&\n"; $count++})(*FAIL)/; + print "Count=$count\n"; + +outputs + + aaab + aaab + Count=2 + +Once the 'aaab' at the start of the string has matched, and the C<(*CUT)> +executed, the next startpoint will be where the cursor was when the +C<(*CUT)> was executed. + +=item C<(*COMMIT)> +X<(*COMMIT)> + +This zero-width pattern is similar to C<(*CUT)> except that it causes +the match to fail outright. No attempts to match will occur again. + + 'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/; + print "Count=$count\n"; + +outputs + + aaab + Count=1 + +In other words, once the C<(*COMMIT)> has been entered, and if the pattern +does not match, the regex engine will not try any further matching on the +rest of the string. =back -The specific choice of question mark for this and the new minimal -matching construct was because 1) question mark is pretty rare in older -regular expressions, and 2) whenever you see one, you should stop -and "question" exactly what is going on. That's psychology... +=item Verbs without an argument + +=over 4 + +=item C<(*FAIL)> C<(*F)> +X<(*FAIL)> X<(*F)> + +This pattern matches nothing and always fails. It can be used to force the +engine to backtrack. It is equivalent to C<(?!)>, but easier to read. In +fact, C<(?!)> gets optimised into C<(*FAIL)> internally. + +It is probably useful only when combined with C<(?{})> or C<(??{})>. + +=item C<(*ACCEPT)> +X<(*ACCEPT)> + +B This feature is highly experimental. It is not recommended +for production code. + +This pattern matches nothing and causes the end of successful matching at +the point at which the C<(*ACCEPT)> pattern was encountered, regardless of +whether there is actually more to match in the string. 
When inside of a +nested pattern, such as recursion or a dynamically generated subpattern +via C<(??{})>, only the innermost pattern is ended immediately. + +If the C<(*ACCEPT)> is inside of capturing buffers then the buffers are +marked as ended at the point at which the C<(*ACCEPT)> was encountered. +For instance: + + 'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x; + +will match, and C<$1> will be C<AB> and C<$2> will be C<B>, C<$3> will not +be set. If another branch in the inner parens were matched, such as in the +string 'ACDE', then the C<D> and C<E> would have to be matched as well. + +=back + +=back + +=head2 Backtracking +X X + +NOTE: This section presents an abstract approximation of regular +expression behavior. For a more rigorous (and complicated) view of +the rules involved in selecting a match among possible alternatives, +see L. + +A fundamental feature of regular expression matching involves the +notion called I<backtracking>, which is currently used (when needed) +by all regular expression quantifiers, namely C<*>, C<*?>, C<+>, +C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized +internally, but the general principle outlined here is valid. + +For a regular expression to match, the I<entire> regular expression must +match, not just part of it. So if the beginning of a pattern containing a +quantifier succeeds in a way that causes later parts in the pattern to +fail, the matching engine backs up and recalculates the beginning +part--that's why it's called backtracking. + +Here is an example of backtracking: Let's say you want to find the +word following "foo" in the string "Food is on the foo table.": + + $_ = "Food is on the foo table."; + if ( /\b(foo)\s+(\w+)/i ) { + print "$2 follows $1.\n"; + } + +When the match runs, the first part of the regular expression (C<\b(foo)>) +finds a possible match right at the beginning of the string, and loads up +$1 with "Foo".
However, as soon as the matching engine sees that there's +no whitespace following the "Foo" that it had saved in $1, it realizes its +mistake and starts over again one character after where it had the +tentative match. This time it goes all the way until the next occurrence +of "foo". The complete regular expression matches this time, and you get +the expected output of "table follows foo." + +Sometimes minimal matching can help a lot. Imagine you'd like to match +everything between "foo" and "bar". Initially, you write something +like this: + + $_ = "The food is under the bar in the barn."; + if ( /foo(.*)bar/ ) { + print "got <$1>\n"; + } + +Which perhaps unexpectedly yields: + + got <d is under the bar in the > + +That's because C<.*> was greedy, so you get everything between the +I<first> "foo" and the I<last> "bar". Here it's more effective +to use minimal matching to make sure you get the text between a "foo" +and the first "bar" thereafter. + + if ( /foo(.*?)bar/ ) { print "got <$1>\n" } + got <d is under the > + +Here's another example: let's say you'd like to match a number at the end +of a string, and you also want to keep the preceding part of the match. +So you write this: + + $_ = "I have 2 numbers: 53147"; + if ( /(.*)(\d*)/ ) { # Wrong! + print "Beginning is <$1>, number is <$2>.\n"; + } + +That won't work at all, because C<.*> was greedy and gobbled up the +whole string. As C<\d*> can match on an empty string the complete +regular expression matched successfully. + + Beginning is <I have 2 numbers: 53147>, number is <>.
+ +Here are some variants, most of which don't work: + + $_ = "I have 2 numbers: 53147"; + @pats = qw{ + (.*)(\d*) + (.*)(\d+) + (.*?)(\d*) + (.*?)(\d+) + (.*)(\d+)$ + (.*?)(\d+)$ + (.*)\b(\d+)$ + (.*\D)(\d+)$ + }; + + for $pat (@pats) { + printf "%-12s ", $pat; + if ( /$pat/ ) { + print "<$1> <$2>\n"; + } else { + print "FAIL\n"; + } + } + +That will print out: + + (.*)(\d*) <I have 2 numbers: 53147> <> + (.*)(\d+) <I have 2 numbers: 5314> <7> + (.*?)(\d*) <> <> + (.*?)(\d+) <I have > <2> + (.*)(\d+)$ <I have 2 numbers: 5314> <7> + (.*?)(\d+)$ <I have 2 numbers: > <53147> + (.*)\b(\d+)$ <I have 2 numbers: > <53147> + (.*\D)(\d+)$ <I have 2 numbers: > <53147> + +As you see, this can be a bit tricky. It's important to realize that a +regular expression is merely a set of assertions that gives a definition +of success. There may be 0, 1, or several different ways that the +definition might succeed against a particular string. And if there are +multiple ways it might succeed, you need to understand backtracking to +know which variety of success you will achieve. + +When using look-ahead assertions and negations, this can all get even +trickier. Imagine you'd like to find a sequence of non-digits not +followed by "123". You might try to write that as + + $_ = "ABC123"; + if ( /^\D*(?!123)/ ) { # Wrong! + print "Yup, no 123 in $_\n"; + } + +But that isn't going to match; at least, not the way you're hoping. It +claims that there is no 123 in the string. Here's a clearer picture of +why that pattern matches, contrary to popular expectations: + + $x = 'ABC123'; + $y = 'ABC445'; + + print "1: got $1\n" if $x =~ /^(ABC)(?!123)/; + print "2: got $1\n" if $y =~ /^(ABC)(?!123)/; + + print "3: got $1\n" if $x =~ /^(\D*)(?!123)/; + print "4: got $1\n" if $y =~ /^(\D*)(?!123)/; + +This prints + + 2: got ABC + 3: got AB + 4: got ABC + +You might have expected test 3 to fail because it seems to be a more +general purpose version of test 1. The important difference between +them is that test 3 contains a quantifier (C<\D*>) and so can use +backtracking, whereas test 1 will not.
What's happening is +that you've asked "Is it true that at the start of $x, following 0 or more +non-digits, you have something that's not 123?" If the pattern matcher had +let C<\D*> expand to "ABC", this would have caused the whole pattern to +fail. + +The search engine will initially match C<\D*> with "ABC". Then it will +try to match C<(?!123)> with "123", which fails. But because +a quantifier (C<\D*>) has been used in the regular expression, the +search engine can backtrack and retry the match differently +in the hope of matching the complete regular expression. + +The pattern really, I<really> wants to succeed, so it uses the +standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this +time. Now there's indeed something following "AB" that is not +"123". It's "C123", which suffices. + +We can deal with this by using both an assertion and a negation. +We'll say that the first part in $1 must be followed both by a digit +and by something that's not "123". Remember that the look-aheads +are zero-width expressions--they only look, but don't consume any +of the string in their match. So rewriting this way produces what +you'd expect; that is, case 5 will fail, but case 6 succeeds: + + print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/; + print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/; + + 6: got ABC + +In other words, the two zero-width assertions next to each other work as though +they're ANDed together, just as you'd use any built-in assertions: C</^$/> +matches only if you're at the beginning of the line AND the end of the +line simultaneously. The deeper underlying truth is that juxtaposition in +regular expressions always means AND, except when you write an explicit OR +using the vertical bar. C</ab/> means match "a" AND (then) match "b", +although the attempted matches are made at different positions because "a" +is not a zero-width assertion, but a one-width assertion.
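The ANDing of adjacent look-aheads can be sketched as follows. The rule being checked (at least six characters, containing both a digit and a lowercase letter), the name C<$rule>, and the sample words are all hypothetical:

```perl
use strict;
use warnings;

# Each (?=...) is tried at the same position (the start of the string),
# so both must hold before .{6,} is allowed to consume anything.
my $rule = qr/^(?=.*\d)(?=.*[a-z]).{6,}$/;

# Illustrative inputs: only the first satisfies every condition.
my @accepted = grep { /$rule/ } qw(abc123 abcdef 123456 a1);

print "accepted: @accepted\n";
```

Since neither assertion consumes any of the string, their order does not change which strings match, only the order in which the conditions are tested.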
+ +B<WARNING>: particularly complicated regular expressions can take +exponential time to solve because of the immense number of possible +ways they can use backtracking to try to match. For example, without +internal optimizations done by the regular expression engine, this will +take a painfully long time to run: + + 'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/ + +And if you used C<*>'s in the internal groups instead of limiting them +to 0 through 5 matches, then it would take forever--or until you ran +out of stack space. Moreover, these internal optimizations are not +always applicable. For example, if you put C<{0,5}> instead of C<*> +on the external group, no current optimization is applicable, and the +match takes a long time to finish. + +A powerful tool for optimizing such beasts is what is known as an +"independent group", +which does not backtrack (see L<C<< (?>pattern) >>>). Note also that +zero-length look-ahead/look-behind assertions will not backtrack to make +the tail match, since they are in "logical" context: only +whether they match is considered relevant. For an example +where side-effects of look-ahead I<might> have influenced the +following match, see L<C<< (?>pattern) >>>. =head2 Version 8 Regular Expressions +X X X -In case you're not familiar with the "regular" Version 8 regexp +In case you're not familiar with the "regular" Version 8 regex routines, here are the pattern-matching rules not described above. Any single character matches itself, unless it is a I<metacharacter> with a special meaning described here or above. You can cause -characters which normally function as metacharacters to be interpreted -literally by prefixing them with a "\" (e.g. "\." matches a ".", not any +characters that normally function as metacharacters to be interpreted +literally by prefixing them with a "\" (e.g., "\." matches a ".", not any character; "\\" matches a "\"). A series of characters matches that series of characters in the target string, so the pattern C<blurfl> would match "blurfl" in the target string.
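The effect of the backslash can be sketched like this (the sample strings and variable names are illustrative):

```perl
use strict;
use warnings;

my @strings = ( "a.b", "axb" );

# Unescaped, "." is a metacharacter and matches any character except "\n".
my @any = grep { /^a.b$/ } @strings;

# Escaped, "\." matches only a literal ".".
my @literal = grep { /^a\.b$/ } @strings;

print "unescaped: @any\n";
print "escaped:   @literal\n";
```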
You can specify a character class, by enclosing a list of characters -in C<[]>, which will match any one of the characters in the list. If the +in C<[]>, which will match any one character from the list. If the first character after the "[" is "^", the class matches any character not -in the list. Within a list, the "-" character is used to specify a -range, so that C represents all the characters between "a" and "z", -inclusive. +in the list. Within a list, the "-" character specifies a +range, so that C represents all characters between "a" and "z", +inclusive. If you want either "-" or "]" itself to be a member of a +class, put it at the start of the list (possibly after a "^"), or +escape it with a backslash. "-" is also taken literally when it is +at the end of the list, just before the closing "]". (The +following all specify the same class of three characters: C<[-az]>, +C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which +specifies a class containing twenty-six characters, even on EBCDIC +based coded character sets.) Also, if you try to use the character +classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of +a range, that's not a range, the "-" is understood literally. + +Note also that the whole range idea is rather unportable between +character sets--and even within character sets they may cause results +you probably didn't expect. A sound principle is to use only ranges +that begin from and end at either alphabets of equal case ([a-e], +[A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt, +spell out the character sets in full. Characters may be specified using a metacharacter syntax much like that used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return, "\f" a form feed, etc. More generally, \I, where I is a string -of octal digits, matches the character whose ASCII value is I. -Similarly, \xI, where I are hexidecimal digits, matches the -character whose ASCII value is I. 
-The expression \cI<x> matches the
-ASCII character control-I<x>. Finally, the "." metacharacter matches any
-character except "\n" (unless you use C</s>).
+of octal digits, matches the character whose coded character set value
+is I<nnn>. Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
+matches the character whose numeric value is I<nn>. The expression \cI<x>
+matches the character control-I<x>. Finally, the "." metacharacter
+matches any character except "\n" (unless you use C</s>).
 
 You can specify a series of alternatives for a pattern using "|" to
 separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
-or "foe" in the target string (as would C<f(e|i|o)e>). Note that the
+or "foe" in the target string (as would C<f(e|i|o)e>). The
 first alternative includes everything from the last pattern delimiter
 ("(", "[", or the beginning of the pattern) up to the first "|", and
 the last alternative contains everything from the last "|" to the next
-pattern delimiter. For this reason, it's common practice to include
-alternatives in parentheses, to minimize confusion about where they
-start and end. Note also that the pattern C<(fee|fie|foe)> differs
-from the pattern C<[fee|fie|foe]> in that the former matches "fee",
-"fie", or "foe" in the target string, while the latter matches
-anything matched by the classes C<[fee]>, C<[fie]>, or C<[foe]> (i.e.
-the class C<[feio]>).
-
-Within a pattern, you may designate subpatterns for later reference by
-enclosing them in parentheses, and you may refer back to the I<n>th
-subpattern later in the pattern using the metacharacter \I<n>.
-Subpatterns are numbered based on the left to right order of their
-opening parenthesis. Note that a backreference matches whatever
-actually matched the subpattern in the string being examined, not the
-rules for that subpattern. Therefore, C<([0|0x])\d*\s\1\d*> will
-match "0x1234 0x4321",but not "0x1234 01234", since subpattern 1
-actually matched "0x", even though the rule C<[0|0x]> could
-potentially match the leading 0 in the second number.
+pattern delimiter. That's why it's common practice to include
+alternatives in parentheses: to minimize confusion about where they
+start and end.
+
+Alternatives are tried from left to right, so the first
+alternative found for which the entire expression matches, is the one that
+is chosen. This means that alternatives are not necessarily greedy. For
+example: when matching C<foo|foot> against "barefoot", only the "foo"
+part will match, as that is the first alternative tried, and it successfully
+matches the target string. (This might not seem important, but it is
+important when you are capturing matched text using parentheses.)
+
+Also remember that "|" is interpreted as a literal within square brackets,
+so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.
+
+Within a pattern, you may designate subpatterns for later reference
+by enclosing them in parentheses, and you may refer back to the
+I<n>th subpattern later in the pattern using the metacharacter
+\I<n>. Subpatterns are numbered based on the left to right order
+of their opening parenthesis. A backreference matches whatever
+actually matched the subpattern in the string being examined, not
+the rules for that subpattern. Therefore, C<(0|0x)\d*\s\1\d*> will
+match "0x1234 0x4321", but not "0x1234 01234", because subpattern
+1 matched "0x", even though the rule C<0|0x> could potentially match
+the leading 0 in the second number.
+
+=head2 Warning on \1 vs $1
+
+Some people get too used to writing things like:
+
+    $pattern =~ s/(\W)/\\\1/g;
+
+This is grandfathered for the RHS of a substitute to avoid shocking the
+B<sed> addicts, but it's a dirty habit to get into. That's because in
+PerlThink, the righthand side of an C<s///> is a double-quoted string. C<\1> in
+the usual double-quoted string means a control-A. The customary Unix
+meaning of C<\1> is kludged in for C<s///>. However, if you get into the habit
+of doing that, you get yourself into trouble if you then add an C</e>
+modifier.
+
+    s/(\d+)/ \1 + 1 /eg;    # causes warning under -w
+
+Or if you try to do
+
+    s/(\d+)/\1000/;
+
+You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
+C<${1}000>. The operation of interpolation should not be confused
+with the operation of matching a backreference. Certainly they mean two
+different things on the I<left> side of the C<s///>.
+
+=head2 Repeated patterns matching zero-length substring
+
+B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite.
+
+Regular expressions provide a terse and powerful programming language. As
+with most other power tools, power comes together with the ability
+to wreak havoc.
+
+A common abuse of this power stems from the ability to make infinite
+loops using regular expressions, with something as innocuous as:
+
+    'foo' =~ m{ ( o? )* }x;
+
+The C<o?> can match at the beginning of C<'foo'>, and since the position
+in the string is not moved by the match, C<o?> would match again and again
+because of the C<*> modifier. Another common way to create a similar cycle
+is with the looping modifier C<//g>:
+
+    @matches = ( 'foo' =~ m{ o? }xg );
+
+or
+
+    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
+
+or the loop implied by split().
+
+However, long experience has shown that many programming tasks may
+be significantly simplified by using repeated subexpressions that
+may match zero-length substrings. Here's a simple example being:
+
+    @chars = split //, $string;           # // is not magic in split
+    ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
+
+Thus Perl allows such constructs, by I<forcefully breaking
+the infinite loop>. The rules for this are different for lower-level
+loops given by the greedy modifiers C<*+{}>, and for higher-level
+ones like the C</g> modifier or split() operator.
+
+The lower-level loops are I<interrupted> (that is, the loop is
+broken) when Perl detects that a repeated expression matched a
+zero-length substring.
+Thus
+
+   m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
+
+is made equivalent to
+
+   m{   (?: NON_ZERO_LENGTH )*
+      |
+        (?: ZERO_LENGTH )?
+    }x;
+
+The higher level-loops preserve an additional state between iterations:
+whether the last match was zero-length. To break the loop, the following
+match after a zero-length match is prohibited to have a length of zero.
+This prohibition interacts with backtracking (see L<"Backtracking">),
+and so the I<second best> match is chosen if the I<best> match is of
+zero length.
+
+For example:
+
+    $_ = 'bar';
+    s/\w??/<$&>/g;
+
+results in C<< <><b><><a><><r><> >>. At each position of the string the best
+match given by non-greedy C<??> is the zero-length match, and the I<second
+best> match is what is matched by C<\w>. Thus zero-length matches
+alternate with one-character-long matches.
+
+Similarly, for repeated C<m/()/g> the second-best match is the match at the
+position one notch further in the string.
+
+The additional state of being I<matched with zero-length> is associated with
+the matched string, and is reset by each assignment to pos().
+Zero-length matches at the end of the previous match are ignored
+during C<split>.
+
+=head2 Combining pieces together
+
+Each of the elementary pieces of regular expressions which were described
+before (such as C<ab> or C<\Z>) could match at most one substring
+at the given position of the input string. However, in a typical regular
+expression these elementary pieces are combined into more complicated
+patterns using combining operators C<ST>, C<S|T>, C<S*> etc
+(in these examples C<S> and C<T> are regular subexpressions).
+
+Such combinations can include alternatives, leading to a problem of choice:
+if we match a regular expression C<a|ab> against C<"abc">, will it match
+substring C<"a"> or C<"ab">? One way to describe which substring is
+actually matched is the concept of backtracking (see L<"Backtracking">).
+However, this description is too low-level and makes you think
+in terms of a particular implementation.
+
+Another description starts with notions of "better"/"worse".
+All the
+substrings which may be matched by the given regular expression can be
+sorted from the "best" match to the "worst" match, and it is the "best"
+match which is chosen. This substitutes the question of "what is chosen?"
+by the question of "which matches are better, and which are worse?".
+
+Again, for elementary pieces there is no such question, since at most
+one match at a given position is possible. This section describes the
+notion of better/worse for combining operators. In the description
+below C<S> and C<T> are regular subexpressions.
+
+=over 4
+
+=item C<ST>
+
+Consider two possible matches, C<AB> and C<A'B'>, C<A> and C<A'> are
+substrings which can be matched by C<S>, C<B> and C<B'> are substrings
+which can be matched by C<T>.
+
+If C<A> is better match for C<S> than C<A'>, C<AB> is a better
+match than C<A'B'>.
+
+If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
+C<B> is better match for C<T> than C<B'>.
+
+=item C<S|T>
+
+When C<S> can match, it is a better match than when only C<T> can match.
+
+Ordering of two matches for C<S> is the same as for C<S>. Similar for
+two matches for C<T>.
+
+=item C<S{REPEAT_COUNT}>
+
+Matches as C<SSS...S> (repeated as many times as necessary).
+
+=item C<S{min,max}>
+
+Matches as C<S{max}|S{max-1}|...|S{min+1}|S{min}>.
+
+=item C<S{min,max}?>
+
+Matches as C<S{min}|S{min+1}|...|S{max-1}|S{max}>.
+
+=item C<S?>, C<S*>, C<S+>
+
+Same as C<S{0,1}>, C<S{0,BIG_NUMBER}>, C<S{1,BIG_NUMBER}> respectively.
+
+=item C<S??>, C<S*?>, C<S+?>
+
+Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively.
+
+=item C<< (?>S) >>
+
+Matches the best match for C<S> and only that.
+
+=item C<(?=S)>, C<(?<=S)>
+
+Only the best match for C<S> is considered. (This is important only if
+C<S> has capturing parentheses, and backreferences are used somewhere
+else in the whole regular expression.)
+
+=item C<(?!S)>, C<(?<!S)>
+
+For this grouping operator there is no need to describe the ordering, since
+only whether or not C<S> can match is important.
+
+=item C<(??{ EXPR })>, C<(?PARNO)>
+
+The ordering is the same as for the regular expression which is
+the result of EXPR, or the pattern contained by capture buffer PARNO.
+
+=item C<(?(condition)yes-pattern|no-pattern)>
+
+Recall that which of C<yes-pattern> or C<no-pattern> actually matches is
+already determined.
+The ordering of the matches is the same as for the
+chosen subexpression.
+
+=back
+
+The above recipes describe the ordering of matches I<at a given position>.
+One more rule is needed to understand how a match is determined for the
+whole regular expression: a match at an earlier position is always better
+than a match at a later position.
+
+=head2 Creating custom RE engines
+
+Overloaded constants (see L<overload>) provide a simple way to extend
+the functionality of the RE engine.
+
+Suppose that we want to enable a new RE escape-sequence C<\Y|> which
+matches at boundary between whitespace characters and non-whitespace
+characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
+at these positions, so we want to have each C<\Y|> in the place of the
+more complicated version. We can create a module C<customre> to do
+this:
+
+    package customre;
+    use overload;
+
+    sub import {
+      shift;
+      die "No argument to customre::import allowed" if @_;
+      overload::constant 'qr' => \&convert;
+    }
+
+    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}
+
+    # We must also take care of not escaping the legitimate \\Y|
+    # sequence, hence the presence of '\\' in the conversion rules.
+    my %rules = ( '\\' => '\\\\',
+                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
+    sub convert {
+      my $re = shift;
+      $re =~ s{
+                \\ ( \\ | Y . )
+              }
+              { $rules{$1} or invalid($re,$1) }sgex;
+      return $re;
+    }
+
+Now C<use customre> enables the new escape in constant regular
+expressions, i.e., those without any runtime variable interpolations.
+As documented in L<overload>, this conversion will work only over
+literal parts of regular expressions. For C<\Y|$re\Y|> the variable
+part of this regular expression needs to be converted explicitly
+(but only if the special meaning of C<\Y|> should be enabled inside $re):
+
+    use customre;
+    $re = <>;
+    chomp $re;
+    $re = customre::convert $re;
+    /\Y|$re\Y|/;
+
+=head1 BUGS
+
+This document varies from difficult to understand to completely
+and utterly opaque. The wandering prose riddled with jargon is
+hard to fathom in several places.
+
+This document needs a rewrite that separates the tutorial content
+from the reference content.
+
+=head1 SEE ALSO
+
+L.
+
+L.
+
+L.
+
+L.
+
+L.
+
+L.
+
+L.
+
+L.
+
+I<Mastering Regular Expressions> by Jeffrey Friedl, published
+by O'Reilly and Associates.