X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlre.pod;h=f0d76453531a025bf62933411f199c66ea71f096;hb=27bcc0a7e6b15b7b0d6f632d5f31918abd005ef4;hp=fef8ce3b6f976cb896d4fe6ee1a6aa2128938564;hpb=54c18d0455d4f9550786bea467f5a04c96e86890;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlre.pod b/pod/perlre.pod
index fef8ce3..f0d7645 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -6,7 +6,7 @@ perlre - Perl regular expressions

 This page describes the syntax of regular expressions in Perl.

-if you haven't used regular expressions before, a quick-start
+If you haven't used regular expressions before, a quick-start
 introduction is available in L, and a longer tutorial
 introduction is available in L.

@@ -41,11 +41,7 @@ line anywhere within the string.
 Treat string as single line. That is, change "." to match any character
 whatsoever, even a newline, which normally it would not match.

-The C and C modifiers both override the C<$*> setting. That
-is, no matter what C<$*> contains, C without C will force
-"^" to match only at the beginning of the string and "$" to match
-only at the end (or just before a newline at the end) of the string.
-Together, as /ms, they let the "." match any character whatsoever,
+Used together, as /ms, they let the "." match any character whatsoever,
 while still allowing "^" and "$" to match, respectively, just after
 and just before newlines within the string.

@@ -103,13 +99,11 @@ string as a multi-line buffer, such that the "^" will match after any
 newline within the string, and "$" will match before any newline. At the
 cost of a little more overhead, you can do this by using the /m modifier
 on the pattern match operator. (Older programs did this by setting C<$*>,
-but this practice is now deprecated.)
+but this practice has been removed in perl 5.9.)

 To simplify multi-line substitutions, the "."
 character never matches a
 newline unless you use the C modifier, which in effect tells Perl to pretend
-the string is a single line--even if it isn't. The C modifier also
-overrides the setting of C<$*>, in case you have some (badly behaved) older
-code that sets it in another module.
+the string is a single line--even if it isn't.

 The following standard quantifiers are recognized:

@@ -121,7 +115,8 @@ The following standard quantifiers are recognized:
     {n,m} Match at least n but not more than m times

 (If a curly bracket occurs in any other context, it is treated
-as a regular character.) The "*" modifier is equivalent to C<{0,}>, the "+"
+as a regular character. In particular, the lower bound
+is not optional.) The "*" modifier is equivalent to C<{0,}>, the "+"
 modifier to C<{1,}>, and the "?" modifier to C<{0,1}>. n and m are limited
 to integral values less than a preset limit defined when perl is built.
 This is usually 32766 on the most common platforms. The actual limit can
@@ -187,6 +182,7 @@ In addition, Perl defines the following:
     \C       Match a single C char (octet) even under Unicode.
              NOTE: breaks up characters into their UTF-8 bytes,
              so you may end up with malformed pieces of UTF-8.
+             Unsupported in lookbehind.

 A C<\w> matches a single alphanumeric character (an alphabetic
 character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
@@ -198,8 +194,8 @@ C<\d>, and C<\D> within character classes, but if you try to use them
 as endpoints of a range, that's not a range, the "-" is understood
 literally. If Unicode is in effect, C<\s> matches also "\x{85}",
 "\x{2028}, and "\x{2029}", see L for more details about
-C<\pP>, C<\PP>, and C<\X>, and L about Unicode in
-general.
+C<\pP>, C<\PP>, and C<\X>, and L about Unicode in general.
+You can define your own C<\p> and C<\P> properties, see L.

 The POSIX character class syntax

@@ -349,8 +345,11 @@ It is also useful when writing C-like scanners, when you have
 several patterns that you want to match against consequent substrings
 of your string, see the previous reference. The actual location
 where C<\G> will match can also be influenced by using C as
-an lvalue. Currently C<\G> only works when used at the
-beginning of the pattern. See L.
+an lvalue: see L. Currently C<\G> is only fully
+supported when anchored to the start of the pattern; while it
+is permitted to use it elsewhere, as in C, some
+such uses (C, for example) currently cause problems, and
+it is recommended that you avoid such usage for now.

 The bracketing construct C<( ... )> creates capture buffers. To
 refer to the digit'th buffer use \ within the
@@ -389,14 +388,21 @@ Several special variables also refer back to portions of the previous
 match. C<$+> returns whatever the last bracket match matched.
 C<$&> returns the entire matched string. (At one point C<$0> did
 also, but now it returns the name of the program.) C<$`> returns
-everything before the matched string. And C<$'> returns everything
-after the matched string.
-
-The numbered variables ($1, $2, $3, etc.) and the related punctuation
-set (C<$+>, C<$&>, C<$`>, and C<$'>) are all dynamically scoped
+everything before the matched string. C<$'> returns everything
+after the matched string. And C<$^N> contains whatever was matched by
+the most-recently closed group (submatch). C<$^N> can be used in
+extended patterns (see below), for example to assign a submatch to a
+variable.
+
+The numbered match variables ($1, $2, $3, etc.) and the related punctuation
+set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped
 until the end of the enclosing block or until the next successful
 match, whichever comes first. (See L.)

+B: failed matches in Perl do not reset the match variables,
+which makes easier to write code that tests for a series of more
+specific cases and remembers the best match.
+
 B: Once Perl sees that you need one of C<$&>, C<$`>, or
 C<$'> anywhere in the program, it has to provide them for every
 pattern match. This may substantially slow your program. Perl
@@ -556,10 +562,22 @@ only for fixed-width look-behind.
 B: This extended regular expression feature is considered
 highly experimental, and may be changed or deleted without notice.

-This zero-width assertion evaluate any embedded Perl code. It
+This zero-width assertion evaluates any embedded Perl code. It
 always succeeds, and its C is not interpolated. Currently,
 the rules to determine where the C ends are somewhat convoluted.

+This feature can be used together with the special variable C<$^N> to
+capture the results of submatches in variables without having to keep
+track of the number of nested parentheses. For example:
+
+  $_ = "The brown fox jumps over the lazy dog";
+  /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
+  print "color = $color, animal = $animal\n";
+
+Inside the C<(?{...})> block, C<$_> refers to the string the regular
+expression is matching against. You can also use C to know what is
+the current position of matching within this string.
+
 The C is properly scoped in the following sense: If the assertion
 is backtracked (compare L<"Backtracking">), all changes introduced after
 Cization are undone, so that
@@ -611,7 +629,7 @@ although it could raise an exception from an illegal pattern. If
 you turn on the C, though, it is no longer secure,
 so you should only do so if you are also using taint checking.
 Better yet, use the carefully constrained evaluation within a Safe
-module. See L for details about both these mechanisms.
+compartment. See L for details about both these mechanisms.

 =item C<(??{ code })>

@@ -870,7 +888,7 @@ multiple ways it might succeed, you need to understand backtracking to
 know which variety of success you will achieve.

 When using look-ahead assertions and negations, this can all get even
-tricker. Imagine you'd like to find a sequence of non-digits not
+trickier. Imagine you'd like to find a sequence of non-digits not
 followed by "123". You might try to write that as

     $_ = "ABC123";
@@ -1264,7 +1282,7 @@ this:

     sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

-    my %rules = ( '\\' => '\\',
+    my %rules = ( '\\' => '\\\\',
                   'Y|' => qr/(?=\S)(?