X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlre.pod;h=529c44adf16f28f8c6d9153975631fd5031906f8;hb=197afce1e759b5f0a1885a151064a83b27a7324e;hp=6c2049628c542d56ab9fbd2444b7ac375d2d5472;hpb=1f1031fe96c14865e4f60fdd3a6a6ce073d190c1;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlre.pod b/pod/perlre.pod index 6c20496..529c44a 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -16,6 +16,9 @@ operations, plus various examples of the same, see discussions of C, C, C and C in L. + +=head2 Modifiers + Matching operations can have various modifiers. Modifiers that relate to the interpretation of the regular expression inside are listed below. Modifiers that alter the way a regular expression @@ -24,15 +27,6 @@ L. =over 4 -=item i -X X X -X - -Do case-insensitive pattern matching. - -If C is in effect, the case map is taken from the current -locale. See L. - =item m X X X X @@ -51,11 +45,35 @@ Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string. +=item i +X X X +X + +Do case-insensitive pattern matching. + +If C is in effect, the case map is taken from the current +locale. See L. + =item x X Extend your pattern's legibility by permitting whitespace and comments. +=item p +X
X X + +Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and +${^POSTMATCH} are available for use after matching. + +=item g and c +X X + +Global matching, and keep the Current position after failed matching. +Unlike i, m, s and x, these two flags affect the way the regex is used +rather than the regex itself. See +L for further explanation +of the g and c modifiers. + =back These are usually written as "the C modifier", even though the delimiter @@ -84,7 +102,7 @@ X =head3 Metacharacters -The patterns used in Perl pattern matching derive from supplied in +The patterns used in Perl pattern matching evolved from those supplied in the Version 8 regex routines. (The routines are derived (distantly) from Henry Spencer's freely redistributable reimplementation of the V8 routines.) See L for @@ -136,8 +154,8 @@ X X X<*> X<+> X X<{n}> X<{n,}> X<{n,m}> (If a curly bracket occurs in any other context, it is treated as a regular character. In particular, the lower bound -is not optional.) The "*" modifier is equivalent to C<{0,}>, the "+" -modifier to C<{1,}>, and the "?" modifier to C<{0,1}>. n and m are limited +is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+" +quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited to integral values less than a preset limit defined when perl is built. This is usually 32766 on the most common platforms. The actual limit can be seen in the error message generated by code such as this: @@ -149,24 +167,24 @@ many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness": -X X X +X X X X X<*?> X<+?> X X<{n}?> X<{n,}?> X<{n,m}?> - *? Match 0 or more times - +? Match 1 or more times - ?? Match 0 or 1 time - {n}? Match exactly n times - {n,}? Match at least n times - {n,m}? 
Match at least n but not more than m times + *? Match 0 or more times, not greedily + +? Match 1 or more times, not greedily + ?? Match 0 or 1 time, not greedily + {n}? Match exactly n times, not greedily + {n,}? Match at least n times, not greedily + {n,m}? Match at least n but not more than m times, not greedily By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl will backtrack. However, this behaviour is -sometimes undesirable. Thus Perl provides the "possesive" quantifier form +sometimes undesirable. Thus Perl provides the "possessive" quantifier form as well. - *+ Match 0 or more times and give nothing back - ++ Match 1 or more times and give nothing back - ?+ Match 0 or 1 time and give nothing back + *+ Match 0 or more times and give nothing back + ++ Match 1 or more times and give nothing back + ?+ Match 0 or 1 time and give nothing back {n}+ Match exactly n times and give nothing back (redundant) {n,}+ Match at least n times and give nothing back {n,m}+ Match at least n but not more than m times and give nothing back @@ -183,7 +201,7 @@ string" problem can be most efficiently performed when written as: /"(?:[^"\\]++|\\.)*+"/ -as we know that if the final quote does not match, bactracking will not +as we know that if the final quote does not match, backtracking will not help. See the independent subexpression C<< (?>...) >> for more details; possessive quantifiers are just syntactic sugar for that construct. 
For instance the above example could also be written as follows: @@ -194,7 +212,7 @@ instance the above example could also be written as follows: Because patterns are processed as double quoted strings, the following also work: -X<\t> X<\n> X<\r> X<\f> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q> +X<\t> X<\n> X<\r> X<\f> X<\e> X<\a> X<\l> X<\u> X<\L> X<\U> X<\E> X<\Q> X<\0> X<\c> X<\N> X<\x> \t tab (HT, TAB) @@ -203,11 +221,11 @@ X<\0> X<\c> X<\N> X<\x> \f form feed (FF) \a alarm (bell) (BEL) \e escape (think troff) (ESC) - \033 octal char (think of a PDP-11) - \x1B hex char - \x{263a} wide hex char (Unicode SMILEY) - \c[ control char - \N{name} named char + \033 octal char (example: ESC) + \x1B hex char (example: ESC) + \x{263a} long hex char (example: Unicode SMILEY) + \cK control char (example: VT) + \N{name} named Unicode character \l lowercase next char (think vi) \u uppercase next char (think vi) \L lowercase till \E (think vi) @@ -224,12 +242,12 @@ An unescaped C<$> or C<@> interpolates the corresponding variable, while escaping will cause the literal string C<\$> to be matched. You'll need to write something like C. -=head3 Character classes +=head3 Character Classes and other Special Escapes In addition, Perl defines the following: -X X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C> -X X +X<\g> X<\k> X<\N> X<\K> X<\v> X<\V> X<\h> X<\H> +X X X X \w Match a "word" character (alphanumeric plus "_") \W Match a non-"word" character @@ -239,8 +257,7 @@ X X \D Match a non-digit character \pP Match P, named property. Use \p{Prop} for longer names. \PP Match non-P - \X Match eXtended Unicode "combining character sequence", - equivalent to (?:\PM\pM*) + \X Match Unicode "eXtended grapheme cluster" \C Match a single C char (octet) even under Unicode. NOTE: breaks up characters into their UTF-8 bytes, so you may end up with malformed pieces of UTF-8. @@ -252,9 +269,13 @@ X X optionally be wrapped in curly brackets for safer parsing. 
\g{name} Named backreference \k Named backreference - \N{name} Named unicode character, or unicode escape - \x12 Hexadecimal escape sequence - \x{1234} Long hexadecimal escape sequence + \K Keep the stuff left of the \K, don't include it in $& + \N Any character but \n + \v Vertical whitespace + \V Not vertical whitespace + \h Horizontal whitespace + \H Not horizontal whitespace + \R Linebreak A C<\w> matches a single alphanumeric character (an alphabetic character, or a decimal digit) or C<_>, not a whole word. Use C<\w+> @@ -262,20 +283,30 @@ to match a string of Perl-identifier characters (which isn't the same as matching an English word). If C is in effect, the list of alphabetic characters generated by C<\w> is taken from the current locale. See L. You may use C<\w>, C<\W>, C<\s>, C<\S>, -C<\d>, and C<\D> within character classes, but if you try to use them -as endpoints of a range, that's not a range, the "-" is understood -literally. If Unicode is in effect, C<\s> matches also "\x{85}", -"\x{2028}, and "\x{2029}", see L for more details about -C<\pP>, C<\PP>, and C<\X>, and L about Unicode in general. -You can define your own C<\p> and C<\P> properties, see L. +C<\d>, and C<\D> within character classes, but they aren't usable +as either end of a range. If any of them precedes or follows a "-", +the "-" is understood literally. If Unicode is in effect, C<\s> matches +also "\x{85}", "\x{2028}", and "\x{2029}". See L for more +details about C<\pP>, C<\PP>, C<\X> and the possibility of defining +your own C<\p> and C<\P> properties, and L about Unicode +in general. X<\w> X<\W> X +C<\R> will atomically match a linebreak, including the network line-ending +"\x0D\x0A". Specifically, X<\R> is exactly equivalent to + + (?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}]) + +B C<\R> has no special meaning inside of a character class; +use C<\v> instead (vertical whitespace). +X<\R> + The POSIX character class syntax X [:class:] -is also available. 
Note that the C<[> and C<]> braces are I; +is also available. Note that the C<[> and C<]> brackets are I; they must always be used within a character class expression. # this is correct: @@ -284,26 +315,34 @@ they must always be used within a character class expression. # this is not, and will generate a warning: $string =~ /[:alpha:]/; -The available classes and their backslash equivalents (if available) are -as follows: -X +The following table shows the mapping of POSIX character class +names, common escapes, literal escape sequences and their equivalent +Unicode style property names. +X X<\p> X<\p{}> X X X X X X X X X X X X X X - alpha - alnum - ascii - blank [1] - cntrl - digit \d - graph - lower - print - punct - space \s [2] - upper - word \w [3] - xdigit +B up to Perl 5.10 the property names used were shared with +standard Unicode properties, this was changed in Perl 5.11, see +L for details. + + POSIX Esc Class Property Note + -------------------------------------------------------- + alnum [0-9A-Za-z] IsPosixAlnum + alpha [A-Za-z] IsPosixAlpha + ascii [\000-\177] IsASCII + blank [\011 ] IsPosixBlank [1] + cntrl [\0-\37\177] IsPosixCntrl + digit \d [0-9] IsPosixDigit + graph [!-~] IsPosixGraph + lower [a-z] IsPosixLower + print [ -~] IsPosixPrint + punct [!-/:-@[-`{-~] IsPosixPunct + space [\11-\15 ] IsPosixSpace [2] + \s [\11\12\14\15 ] IsPerlSpace [2] + upper [A-Z] IsPosixUpper + word \w [0-9A-Z_a-z] IsPerlWord [3] + xdigit [0-9A-Fa-f] IsXDigit =over @@ -313,8 +352,9 @@ A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace". =item [2] -Not exactly equivalent to C<\s> since the C<[[:space:]]> includes -also the (very rare) "vertical tabulator", "\ck", chr(11). +Note that C<\s> and C<[[:space:]]> are B equivalent as C<[[:space:]]> +includes also the (very rare) "vertical tabulator", "\cK" or chr(11) in +ASCII. =item [3] @@ -328,37 +368,26 @@ whole character class. 
For example: [01[:alpha:]%] -matches zero, one, any alphabetic character, and the percentage sign. +matches zero, one, any alphabetic character, and the percent sign. -The following equivalences to Unicode \p{} constructs and equivalent -backslash character classes (if available), will hold: -X X<\p> X<\p{}> +=over 4 + +=item C<$> + +Currency symbol - [[:...:]] \p{...} backslash +=item C<+> C<< < >> C<=> C<< > >> C<|> C<~> - alpha IsAlpha - alnum IsAlnum - ascii IsASCII - blank IsSpace - cntrl IsCntrl - digit IsDigit \d - graph IsGraph - lower IsLower - print IsPrint - punct IsPunct - space IsSpace - IsSpacePerl \s - upper IsUpper - word IsWord - xdigit IsXDigit +Mathematical symbols -For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent. +=item C<^> C<`> -If the C pragma is not used but the C pragma is, the -classes correlate with the usual isalpha(3) interface (except for -"word" and "blank"). +Modifier symbols (accents) -The assumedly non-obviously named classes are: + +=back + +The other named classes are: =over 4 @@ -368,7 +397,7 @@ X Any control character. Usually characters that don't produce output as such but instead control the terminal somehow: for example newline and backspace are control characters. All characters with ord() less than -32 are most often classified as control characters (assuming ASCII, +32 are usually classified as control characters (assuming ASCII, the ISO Latin character sets, and Unicode), as is the character with the ord() value of 127 (C). @@ -401,9 +430,9 @@ X POSIX traditional Unicode - [[:^digit:]] \D \P{IsDigit} - [[:^space:]] \S \P{IsSpace} - [[:^word:]] \W \P{IsWord} + [[:^digit:]] \D \P{IsPosixDigit} + [[:^space:]] \S \P{IsPosixSpace} + [[:^word:]] \W \P{IsPerlWord} Perl respects the POSIX standard in that POSIX character classes are only supported within a character class. 
The POSIX character classes @@ -419,7 +448,7 @@ X X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G> \b Match a word boundary - \B Match a non-(word boundary) + \B Match except at a word boundary \A Match only at beginning of string \Z Match only at end of string, or before newline at the end \z Match only at end of string @@ -466,9 +495,10 @@ loop. Take care when using patterns that include C<\G> in an alternation. =head3 Capture buffers -The bracketing construct C<( ... )> creates capture buffers. To -refer to the digit'th buffer use \ within the -match. Outside the match use "$" instead of "\". (The +The bracketing construct C<( ... )> creates capture buffers. To refer +to the current contents of a buffer later on, within the same pattern, +use \1 for the first, \2 for the second, and so on. +Outside the match use "$" instead of "\". (The \ notation works in certain circumstances outside the match. See the warning below about \1 vs $1 for details.) Referring back to another part of the match is called a @@ -486,17 +516,20 @@ left parentheses have opened before it. Likewise \11 is a backreference only if at least 11 left parentheses have opened before it. And so on. \1 through \9 are always interpreted as backreferences. +If the bracketing group did not match, the associated backreference won't +match either. (This can happen if the bracketing group is optional, or +in a different branch of an alternation.) X<\g{1}> X<\g{-1}> X<\g{name}> X X In order to provide a safer and easier way to construct patterns using -backrefs, in Perl 5.10 the C<\g{N}> notation is provided. The curly -brackets are optional, however omitting them is less safe as the meaning -of the pattern can be changed by text (such as digits) following it. -When N is a positive integer the C<\g{N}> notation is exactly equivalent -to using normal backreferences. When N is a negative integer then it is -a relative backreference referring to the previous N'th capturing group. 
-When the bracket form is used and N is not an integer, it is treated as a -reference to a named buffer. +backreferences, Perl provides the C<\g{N}> notation (starting with perl +5.10.0). The curly brackets are optional, however omitting them is less +safe as the meaning of the pattern can be changed by text (such as digits) +following it. When N is a positive integer the C<\g{N}> notation is +exactly equivalent to using normal backreferences. When N is a negative +integer then it is a relative backreference referring to the previous N'th +capturing group. When the bracket form is used and N is not an integer, it +is treated as a reference to a named buffer. Thus C<\g{-1}> refers to the last buffer, C<\g{-2}> refers to the buffer before that. For example: @@ -512,19 +545,18 @@ buffer before that. For example: and would match the same as C. -Additionally, as of Perl 5.10 you may use named capture buffers and named +Additionally, as of Perl 5.10.0 you may use named capture buffers and named backreferences. The notation is C<< (?...) >> to declare and C<< \k >> -to reference. You may also use single quotes instead of angle brackets to quote the -name; and you may use the bracketed C<< \g{name} >> back reference syntax. -The only difference between named capture buffers and unnamed ones is -that multiple buffers may have the same name and that the contents of -named capture buffers are available via the C<%+> hash. When multiple -groups share the same name C<$+{name}> and C<< \k >> refer to the -leftmost defined group, thus it's possible to do things with named capture -buffers that would otherwise require C<(??{})> code to accomplish. Named -capture buffers are numbered just as normal capture buffers are and may be -referenced via the magic numeric variables or via numeric backreferences -as well as by name. +to reference. You may also use apostrophes instead of angle brackets to delimit the +name; and you may use the bracketed C<< \g{name} >> backreference syntax. 
+It's possible to refer to a named capture buffer by absolute and relative number as well. +Outside the pattern, a named capture buffer is available via the C<%+> hash. +When different buffers within the same pattern have the same name, C<$+{name}> +and C<< \k >> refer to the leftmost defined group. (Thus it's possible +to do things with named capture buffers that would otherwise require C<(??{})> +code to accomplish.) +X X +X<%+> X<$+{name}> X<< \k >> Examples: @@ -536,7 +568,7 @@ Examples: /(?.)\k/ # ... a different way and print "'$+{char}' is the first doubled character\n"; - /(?.)\1/ # ... mix and match + /(?'char'.)\1/ # ... mix and match and print "'$1' is the first doubled character\n"; if (/Time: (..):(..):(..)/) { # parse out values @@ -564,7 +596,7 @@ X<$+> X<$^N> X<$&> X<$`> X<$'> X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9> -B: failed matches in Perl do not reset the match variables, +B: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match. @@ -583,6 +615,15 @@ already paid the price. As of 5.005, C<$&> is not so costly as the other two. X<$&> X<$`> X<$'> +As a workaround for this problem, Perl 5.10.0 introduces C<${^PREMATCH}>, +C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&> +and C<$'>, B that they are only guaranteed to be defined after a +successful match that was executed with the C
(preserve) modifier. +The use of these variables incurs no global performance penalty, unlike +their punctuation char equivalents, however at the trade-off that you +have to tell perl when you want to use them. +X X
+ Backslashed metacharacters in Perl are alphanumeric, such as C<\b>, C<\w>, C<\n>. Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric. So anything @@ -636,17 +677,17 @@ whitespace formatting, a simple C<#> will suffice. Note that Perl closes the comment as soon as it sees a C<)>, so there is no way to put a literal C<)> in the comment. -=item C<(?imsx-imsx)> +=item C<(?pimsx-imsx)> X<(?)> One or more embedded pattern-match modifiers, to be turned on (or turned off, if preceded by C<->) for the remainder of the pattern or the remainder of the enclosing pattern group (if any). This is particularly useful for dynamic patterns, such as those read in from a -configuration file, read in as an argument, are specified in a table -somewhere, etc. Consider the case that some of which want to be case -sensitive and some do not. The case insensitive ones need to include -merely C<(?i)> at the front of the pattern. For example: +configuration file, taken from an argument, or specified in a table +somewhere. Consider the case where some patterns want to be case +sensitive and some do not: The case insensitive ones merely need to +include C<(?i)> at the front of the pattern. For example: $pattern = "foobar"; if ( /$pattern/i ) { } @@ -660,9 +701,18 @@ These modifiers are restored at the end of the enclosing group. For example, ( (?i) blah ) \s+ \1 -will match a repeated (I!) word C in any -case, assuming C modifier, and no C modifier outside this -group. +will match C in any case, some spaces, and an exact (I!) +repetition of the previous word, assuming the C modifier, and no C +modifier outside this group. + +These modifiers do not carry over into named subpatterns called in the +enclosing group. In other words, a pattern such as C<((?i)(&NAME))> does not +change the case-sensitivity of the "NAME" pattern. + +Note that the C
modifier is special in that it can only be enabled, +not disabled, and that its presence anywhere in a pattern has a global +effect. Thus C<(?-p)> and C<(?-p:...)> are meaningless and will warn +when executed under C. =item C<(?:pattern)> X<(?:)> @@ -690,6 +740,62 @@ is equivalent to the more verbose /(?:(?s-i)more.*than).*million/i +=item C<(?|pattern)> +X<(?|)> X + +This is the "branch reset" pattern, which has the special property +that the capture buffers are numbered from the same starting point +in each alternation branch. It is available starting from perl 5.10.0. + +Capture buffers are numbered from left to right, but inside this +construct the numbering is restarted for each branch. + +The numbering within each branch will be as normal, and any buffers +following this construct will be numbered as though the construct +contained only one branch, that being the one with the most capture +buffers in it. + +This construct will be useful when you want to capture one of a +number of alternative matches. + +Consider the following pattern. The numbers underneath show in +which buffer the captured content will be stored. + + + # before ---------------branch-reset----------- after + / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x + # 1 2 2 3 2 3 4 + +Be careful when using the branch reset pattern in combination with +named captures. Named captures are implemented as being aliases to +numbered buffers holding the captures, and that interferes with the +implementation of the branch reset pattern. If you are using named +captures in a branch reset pattern, it's best to use the same names, +in the same order, in each of the alternations: + + /(?| (? x ) (? y ) + | (? z ) (? w )) /x + +Not doing so may lead to surprises: + + "12" =~ /(?| (? \d+ ) | (? \D+))/x; + say $+ {a}; # Prints '12' + say $+ {b}; # *Also* prints '12'. + +The problem here is that both the buffer named C<< a >> and the buffer +named C<< b >> are aliases for the buffer belonging to C<< $1 >>. 
+ +=item Look-Around Assertions +X X X X + +Look-around assertions are zero width patterns which match a specific +pattern without including it in C<$&>. Positive assertions match when +their subpattern matches, negative assertions match when their subpattern +fails. Look-behind matches text up to the current match position, +look-ahead matches text following the current match position. + +=over 4 + =item C<(?=pattern)> X<(?=)> X X @@ -716,13 +822,30 @@ Sometimes it's still easier just to say: For look-behind see below. -=item C<(?<=pattern)> -X<(?<=)> X X +=item C<(?<=pattern)> C<\K> +X<(?<=)> X X X<\K> A zero-width positive look-behind assertion. For example, C matches a word that follows a tab, without including the tab in C<$&>. Works only for fixed-width look-behind. +There is a special form of this construct, called C<\K>, which causes the +regex engine to "keep" everything it had matched prior to the C<\K> and +not include it in C<$&>. This effectively provides variable length +look-behind. The use of C<\K> inside of another look-around assertion +is allowed, but the behaviour is currently not well defined. + +For various reasons C<\K> may be significantly more efficient than the +equivalent C<< (?<=...) >> construct, and it is especially useful in +situations where you want to efficiently remove something following +something else in a string. For instance + + s/(foo)bar/$1/g; + +can be rewritten as the much more efficient + + s/foo\Kbar//g; + =item C<(? X<(? X X @@ -730,23 +853,25 @@ A zero-width negative look-behind assertion. For example C matches any occurrence of "foo" that does not follow "bar". Works only for fixed-width look-behind. +=back + =item C<(?'NAME'pattern)> =item C<< (?pattern) >> X<< (?) >> X<(?'NAME')> X X A named capture buffer. Identical in every respect to normal capturing -parens C<()> but for the additional fact that C<%+> may be used after -a succesful match to refer to a named buffer. See C for more -details on the C<%+> hash. 
+parentheses C<()> but for the additional fact that C<%+> or C<%-> may be +used after a successful match to refer to a named buffer. See C +for more details on the C<%+> and C<%-> hashes. If multiple distinct capture buffers have the same name then the $+{NAME} will refer to the leftmost defined buffer in the match. -The forms C<(?'NAME'pattern)> and C<(?pattern)> are equivalent. +The forms C<(?'NAME'pattern)> and C<< (?pattern) >> are equivalent. B While the notation of this construct is the same as the similar -function in .NET regexes, the behavior is not, in Perl the buffers are +function in .NET regexes, the behavior is not. In Perl the buffers are numbered sequentially regardless of being named or not. Thus in the pattern @@ -761,10 +886,9 @@ its Unicode extension (see L), though it isn't extended by the locale (see L). B In order to make things easier for programmers with experience -with the Python or PCRE regex engines the pattern C<< (?Ppattern) >> -maybe be used instead of C<< (?pattern) >>; however this form does not -support the use of single quotes as a delimiter for the name. This is -only available in Perl 5.10 or later. +with the Python or PCRE regex engines, the pattern C<< (?PENAMEEpattern) >> +may be used instead of C<< (?pattern) >>; however this form does not +support the use of single quotes as a delimiter for the name. =item C<< \k >> @@ -775,14 +899,14 @@ the group is designated by name and not number. If multiple groups have the same name then it refers to the leftmost defined group in the current match. -It is an error to refer to a name not defined by a C<(?)> +It is an error to refer to a name not defined by a C<< (?) >> earlier in the pattern. Both forms are equivalent. B In order to make things easier for programmers with experience -with the Python or PCRE regex engines the pattern C<< (?P=NAME) >> -maybe be used instead of C<< \k >> in Perl 5.10 or later. 
+with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >> +may be used instead of C<< \k >>. =item C<(?{ code })> X<(?{})> X X X @@ -826,7 +950,7 @@ Cization are undone, so that # location. >x; -will set C<$res = 4>. Note that after the match, $cnt returns to the globally +will set C<$res = 4>. Note that after the match, C<$cnt> returns to the globally introduced value, because the scopes that restrict C operators are unwound. @@ -840,20 +964,13 @@ The assignment to C<$^R> above is properly localized, so the old value of C<$^R> is restored if the assertion is backtracked; compare L<"Backtracking">. -Due to an unfortunate implementation issue, the Perl code contained in these -blocks is treated as a compile time closure that can have seemingly bizarre -consequences when used with lexically scoped variables inside of subroutines -or loops. There are various workarounds for this, including simply using -global variables instead. If you are using this construct and strange results -occur then check for the use of lexically scoped variables. - For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous C pragma has been used (see L), or the variables contain results of C operator (see L). -This restriction is because of the wide-spread and remarkably convenient +This restriction is due to the wide-spread and remarkably convenient custom of using run-time determined strings as patterns. For example: $re = <>; @@ -868,9 +985,15 @@ so you should only do so if you are also using taint checking. Better yet, use the carefully constrained evaluation within a Safe compartment. See L for details about both these mechanisms. -Because perl's regex engine is not currently re-entrant, interpolated -code may not invoke the regex engine either directly with C or C), -or indirectly with functions such as C. +B: Use of lexical (C) variables in these blocks is +broken. 
The result is unpredictable and will make perl unstable. The +workaround is to use global (C) variables. + +B: Because Perl's regex engine is currently not re-entrant, +interpolated code may not invoke the regex engine either directly with +C or C), or indirectly with functions such as +C. Invoking the regex engine in these blocks will make perl +unstable. =item C<(??{ code })> X<(??{})> @@ -912,6 +1035,12 @@ The following pattern matches a parenthesized group: See also C<(?PARNO)> for a different, more efficient way to accomplish the same task. +For reasons of security, this construct is forbidden if the regular +expression involves run-time interpolation of variables, unless the +perilous C pragma has been used (see L), or the +variables contain results of C operator (see +L). + Because perl's regex engine is not currently re-entrant, delayed code may not invoke the regex engine either directly with C or C), or indirectly with functions such as C. @@ -989,7 +1118,7 @@ for later use: } B that this pattern does not behave the same way as the equivalent -PCRE or Python construct of the same form. In perl you can backtrack into +PCRE or Python construct of the same form. In Perl you can backtrack into a recursed group, in PCRE and Python the recursed into group is treated as atomic. Also, modifiers are resolved at compile time, so constructs like (?i:(?1)) or (?:(?i)(?1)) do not affect how the sub-pattern will @@ -998,8 +1127,8 @@ be processed. =item C<(?&NAME)> X<(?&NAME)> -Recurse to a named subpattern. Identical to (?PARNO) except that the -parenthesis to recurse to is determined by name. If multiple parens have +Recurse to a named subpattern. Identical to C<(?PARNO)> except that the +parenthesis to recurse to is determined by name. If multiple parentheses have the same name, then it recurses to the leftmost. It is an error to refer to a name that is not declared somewhere in the @@ -1007,7 +1136,7 @@ pattern. 
B In order to make things easier for programmers with experience with the Python or PCRE regex engines the pattern C<< (?P>NAME) >> -maybe be used instead of C<< (?&NAME) >> as of Perl 5.10. +may be used instead of C<< (?&NAME) >>. =item C<(?(condition)yes-pattern|no-pattern)> X<(?()> @@ -1100,7 +1229,7 @@ An example of how this might be used is as follows: )/x Note that capture buffers matched inside of recursion are not accessible -after the recursion returns, so the extra layer of capturing buffers are +after the recursion returns, so the extra layer of capturing buffers is necessary. Thus C<$+{NAME_PAT}> would not be defined even though C<$+{NAME}> would be. @@ -1213,7 +1342,7 @@ to inside of one of these constructs. The following equivalences apply: =head2 Special Backtracking Control Verbs B These patterns are experimental and subject to change or -removal in a future version of perl. Their usage in production code should +removal in a future version of Perl. Their usage in production code should be noted to avoid problems during upgrades. These special patterns are generally of the form C<(*VERB:ARG)>. Unless @@ -1221,7 +1350,7 @@ otherwise stated the ARG argument is optional; in some cases, it is forbidden. Any pattern containing a special backtracking verb that allows an argument -has the special behaviour that when executed it sets the current packages' +has the special behaviour that when executed it sets the current package's C<$REGERROR> and C<$REGMARK> variables. 
When doing so the following rules apply: @@ -1286,7 +1415,7 @@ If we add a C<(*PRUNE)> before the count like the following print "Count=$count\n"; we prevent backtracking and find the count of the longest matching -at each matching startpoint like so: +at each matching starting point like so: aaab aab @@ -1332,7 +1461,7 @@ outputs Count=2 Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)> -executed, the next startpoint will be where the cursor was when the +executed, the next starting point will be where the cursor was when the C<(*SKIP)> was executed. =item C<(*MARK:NAME)> C<(*:NAME)> @@ -1342,8 +1471,7 @@ This zero-width pattern can be used to mark the point reached in a string when a certain part of the pattern has been successfully matched. This mark may be given a name. A later C<(*SKIP)> pattern will then skip forward to that point if backtracked into on failure. Any number of -C<(*MARK)> patterns are allowed, and the NAME portion is optional and may -be duplicated. +C<(*MARK)> patterns are allowed, and the NAME portion may be duplicated. In addition to interacting with the C<(*SKIP)> pattern, C<(*MARK:NAME)> can be used to "label" a pattern branch, so that after matching, the @@ -1355,7 +1483,7 @@ name of the most recently executed C<(*MARK:NAME)> that was involved in the match. This can be used to determine which branch of a pattern was matched -without using a seperate capture buffer for each branch, which in turn +without using a separate capture buffer for each branch, which in turn can result in a performance improvement, as perl cannot optimize C as efficiently as something like C. @@ -1371,7 +1499,7 @@ As a shortcut C<(*MARK:NAME)> can be written C<(*:NAME)>. =item C<(*THEN)> C<(*THEN:NAME)> -This is similar to the "cut group" operator C<::> from Perl6. Like +This is similar to the "cut group" operator C<::> from Perl 6. 
Like C<(*PRUNE)>, this verb always matches, and when backtracked into on failure, it causes the regex engine to try the next alternation in the innermost enclosing group (capturing or otherwise). @@ -1405,7 +1533,7 @@ backtrack and try C; but the C<(*PRUNE)> verb will simply fail. =item C<(*COMMIT)> X<(*COMMIT)> -This is the Perl6 "commit pattern" C<< <commit> >> or C<:::>. It's a +This is the Perl 6 "commit pattern" C<< <commit> >> or C<:::>. It's a zero-width pattern similar to C<(*SKIP)>, except that when backtracked into on failure it causes the match to fail outright. No further attempts to find a valid match by advancing the start pointer will occur again. @@ -1447,7 +1575,7 @@ for production code. This pattern matches nothing and causes the end of successful matching at the point at which the C<(*ACCEPT)> pattern was encountered, regardless of whether there is actually more to match in the string. When inside of a -nested pattern, such as recursion or a dynamically generated subbpattern +nested pattern, such as recursion, or in a subpattern dynamically generated via C<(??{})>, only the innermost pattern is ended immediately. If the C<(*ACCEPT)> is inside of capturing buffers then the buffers are @@ -1457,7 +1585,7 @@ For instance: 'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x; will match, and C<$1> will be C<AB> and C<$2> will be C<B>, C<$3> will not -be set. If another branch in the inner parens were matched, such as in the +be set. If another branch in the inner parentheses were matched, such as in the string 'ACDE', then the C<D> and C<E> would have to be matched as well. =back @@ -1470,11 +1598,11 @@ X X NOTE: This section presents an abstract approximation of regular expression behavior. For a more rigorous (and complicated) view of the rules involved in selecting a match among possible alternatives, -see L. +see L. 
A fundamental feature of regular expression matching involves the notion called I<backtracking>, which is currently used (when needed) -by all regular expression quantifiers, namely C<*>, C<*?>, C<+>, +by all regular non-possessive expression quantifiers, namely C<*>, C<*?>, C<+>, C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized internally, but the general principle outlined here is valid. @@ -1522,7 +1650,7 @@ and the first "bar" thereafter. if ( /foo(.*?)bar/ ) { print "got <$1>\n" } got -Here's another example: let's say you'd like to match a number at the end +Here's another example. Let's say you'd like to match a number at the end of a string, and you also want to keep the preceding part of the match. So you write this: @@ -1647,9 +1775,9 @@ using the vertical bar. C<ab> means match "a" AND (then) match "b", although the attempted matches are made at different positions because "a" is not a zero-width assertion, but a one-width assertion. -B: particularly complicated regular expressions can take +B: Particularly complicated regular expressions can take exponential time to solve because of the immense number of possible -ways they can use backtracking to try match. For example, without +ways they can use backtracking to try for a match. For example, without internal optimizations done by the regular expression engine, this will take a painfully long time to run: @@ -1681,9 +1809,12 @@ Any single character matches itself, unless it is a I<metacharacter> with a special meaning described here or above. You can cause characters that normally function as metacharacters to be interpreted literally by prefixing them with a "\" (e.g., "\." matches a ".", not any -character; "\\" matches a "\"). A series of characters matches that -series of characters in the target string, so the pattern C<blurfl> -would match "blurfl" in the target string. +character; "\\" matches a "\"). This escape mechanism is also required +for the character used as the pattern delimiter. 
+ +A series of characters matches that series of characters in the target +string, so the pattern C<blurfl> would match "blurfl" in the target +string. You can specify a character class, by enclosing a list of characters in C<[]>, which will match any character from the list. If the @@ -1704,7 +1835,7 @@ a range, the "-" is understood literally. Note also that the whole range idea is rather unportable between character sets--and even within character sets they may cause results you probably didn't expect. A sound principle is to use only ranges -that begin from and end at either alphabets of equal case ([a-e], +that begin from and end at either alphabetics of equal case ([a-e], [A-E]), or digits ([0-9]). Anything else is unsafe. If in doubt, spell out the character sets in full. @@ -1749,7 +1880,7 @@ match "0x1234 0x4321", but not "0x1234 01234", because subpattern 1 matched "0x", even though the rule C<0|0x> could potentially match the leading 0 in the second number. -=head2 Warning on \1 vs $1 +=head2 Warning on \1 Instead of $1 Some people get too used to writing things like: @@ -1774,7 +1905,7 @@ C<${1}000>. The operation of interpolation should not be confused with the operation of matching a backreference. Certainly they mean two different things on the I side of the C. -=head2 Repeated patterns matching zero-length substring +=head2 Repeated Patterns Matching a Zero-length Substring B: Difficult material (and prose) ahead. This section needs a rewrite. @@ -1787,9 +1918,9 @@ loops using regular expressions, with something as innocuous as: 'foo' =~ m{ ( o? )* }x; -The C<o?> can match at the beginning of C<'foo'>, and since the position +The C<o?> matches at the beginning of C<'foo'>, and since the position in the string is not moved by the match, C<o?> would match again and again -because of the C<*> modifier. Another common way to create a similar cycle +because of the C<*> quantifier. 
Another common way to create a similar cycle is with the looping modifier C<//g>: @matches = ( 'foo' =~ m{ o? }xg ); @@ -1809,7 +1940,7 @@ may match zero-length substrings. Here's a simple example being: Thus Perl allows such constructs, by I. The rules for this are different for lower-level -loops given by the greedy modifiers C<*+{}>, and for higher-level +loops given by the greedy quantifiers C<*+{}>, and for higher-level ones like the C</g> modifier or split() operator. The lower-level loops are I (that is, the loop is @@ -1850,7 +1981,7 @@ the matched string, and is reset by each assignment to pos(). Zero-length matches at the end of the previous match are ignored during C. -=head2 Combining pieces together +=head2 Combining RE Pieces Each of the elementary pieces of regular expressions which were described before (such as C or C<\Z>) could match at most one substring @@ -1951,13 +2082,13 @@ One more rule is needed to understand how a match is determined for the whole regular expression: a match at an earlier position is always better than a match at a later position. -=head2 Creating custom RE engines +=head2 Creating Custom RE Engines Overloaded constants (see L<overload>) provide a simple way to extend the functionality of the RE engine. Suppose that we want to enable a new RE escape-sequence C<\Y|> which -matches at boundary between whitespace characters and non-whitespace +matches at a boundary between whitespace characters and non-whitespace characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly at these positions, so we want to have each C<\Y|> in the place of the more complicated version. We can create a module C to do @@ -2002,13 +2133,13 @@ part of this regular expression needs to be converted explicitly =head1 PCRE/Python Support -As of Perl 5.10 Perl supports several Python/PCRE specific extensions +As of Perl 5.10.0, Perl supports several Python/PCRE specific extensions to the regex syntax. 
While Perl programmers are encouraged to use the -Perl specific syntax, the following are legal in Perl 5.10: +Perl specific syntax, the following are also accepted: =over 4 -=item C<< (?P<NAME>pattern) >> +=item C<< (?PE<lt>NAMEE<gt>pattern) >> Define a named capture buffer. Equivalent to C<< (?<NAME>pattern) >>. @@ -2020,7 +2151,7 @@ Backreference to a named capture buffer. Equivalent to C<< \g{NAME} >>. Subroutine call to a named capture buffer. Equivalent to C<< (?&NAME) >>. -=back 4 +=back =head1 BUGS