X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlre.pod;h=529c44adf16f28f8c6d9153975631fd5031906f8;hb=197afce1e759b5f0a1885a151064a83b27a7324e;hp=a076d3ad66a8747e1a83a5c3be0d334ce0ba8411;hpb=fdf0a293a88d8a14c42b43c2f82c991c50f7dc39;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlre.pod b/pod/perlre.pod
index a076d3a..529c44a 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -62,7 +62,7 @@ Extend your pattern's legibility by permitting whitespace and comments.
 =item p
 X X X
 
-Preserve the string matched such that ${^PREMATCH}, {$^MATCH}, and
+Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
 ${^POSTMATCH} are available for use after matching.
 
 =item g and c
@@ -257,8 +257,7 @@ X X X X
     \D        Match a non-digit character
     \pP       Match P, named property.  Use \p{Prop} for longer names.
     \PP       Match non-P
-    \X        Match eXtended Unicode "combining character sequence",
-              equivalent to (?>\PM\pM*)
+    \X        Match Unicode "eXtended grapheme cluster"
     \C        Match a single C char (octet) even under Unicode.
               NOTE: breaks up characters into their UTF-8 bytes,
               so you may end up with malformed pieces of UTF-8.
@@ -271,6 +270,7 @@ X X X X
     \g{name}  Named backreference
     \k<name>  Named backreference
     \K        Keep the stuff left of the \K, don't include it in $&
+    \N        Any character but \n
     \v        Vertical whitespace
     \V        Not vertical whitespace
     \h        Horizontal whitespace
@@ -315,26 +315,34 @@ they must always be used within a character class expression.
     # this is not, and will generate a warning:
     $string =~ /[:alpha:]/;
 
-The available classes and their backslash equivalents (if available) are
-as follows:
-X
+The following table shows the mapping of POSIX character class
+names, common escapes, literal escape sequences and their equivalent
+Unicode style property names.
+X X<\p> X<\p{}> X X X X X X X X X X X X X X
 
-    alpha
-    alnum
-    ascii
-    blank               [1]
-    cntrl
-    digit       \d
-    graph
-    lower
-    print
-    punct
-    space       \s      [2]
-    upper
-    word        \w      [3]
-    xdigit
+B up to Perl 5.10 the property names used were shared with
+standard Unicode properties, this was changed in Perl 5.11, see
+L for details.
+
+    POSIX         Esc  Class               Property           Note
+    --------------------------------------------------------
+    alnum              [0-9A-Za-z]         IsPosixAlnum
+    alpha              [A-Za-z]            IsPosixAlpha
+    ascii              [\000-\177]         IsASCII
+    blank              [\011 ]             IsPosixBlank       [1]
+    cntrl              [\0-\37\177]        IsPosixCntrl
+    digit         \d   [0-9]               IsPosixDigit
+    graph              [!-~]               IsPosixGraph
+    lower              [a-z]               IsPosixLower
+    print              [ -~]               IsPosixPrint
+    punct              [!-/:-@[-`{-~]      IsPosixPunct
+    space              [\11-\15 ]          IsPosixSpace       [2]
+                  \s   [\11\12\14\15 ]     IsPerlSpace        [2]
+    upper              [A-Z]               IsPosixUpper
+    word          \w   [0-9A-Z_a-z]        IsPerlWord         [3]
+    xdigit             [0-9A-Fa-f]         IsXDigit
 
 =over
 
@@ -344,8 +352,9 @@ A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".
 
 =item [2]
 
-Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
-also the (very rare) "vertical tabulator", "\cK" or chr(11) in ASCII.
+Note that C<\s> and C<[[:space:]]> are B<not> equivalent as C<[[:space:]]>
+includes also the (very rare) "vertical tabulator", "\cK" or chr(11) in
+ASCII.
 
 =item [3]
 
@@ -361,56 +370,6 @@ whole character class. For example:
 
 matches zero, one, any alphabetic character, and the percent sign.
 
-The following equivalences to Unicode \p{} constructs and equivalent
-backslash character classes (if available), will hold:
-X X<\p> X<\p{}>
-
-    [[:...:]]   \p{...}         backslash
-
-    alpha       IsAlpha
-    alnum       IsAlnum
-    ascii       IsASCII
-    blank
-    cntrl       IsCntrl
-    digit       IsDigit         \d
-    graph       IsGraph
-    lower       IsLower
-    print       IsPrint         (but see [2] below)
-    punct       IsPunct         (but see [3] below)
-    space       IsSpace
-                IsSpacePerl     \s
-    upper       IsUpper
-    word        IsWord          \w
-    xdigit      IsXDigit
-
-For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
-
-However, the equivalence between C<[[:xxxxx:]]> and C<\p{IsXxxxx}>
-is not exact.
-
-=over 4
-
-=item [1]
-
-If the C pragma is not used but the C pragma is, the
-classes correlate with the usual isalpha(3) interface (except for
-"word" and "blank").
-
-But if the C or C pragmas are not used and
-the string is not C, then C<[[:xxxxx:]]> (and C<\w>, etc.)
-will not match characters 0x80-0xff; whereas C<\p{IsXxxxx}> will
-force the string to C and can match these characters
-(as Unicode).
-
-=item [2]
-
-C<\p{IsPrint}> matches characters 0x09-0x0d but C<[[:print:]]> does not.
-
-=item [3]
-
-C<[[:punct::]]> matches the following but C<\p{IsPunct}> does not,
-because they are classed as symbols (not punctuation) in Unicode.
-
 =over 4
 
 =item C<$>
@@ -425,7 +384,6 @@ Mathematical symbols
 
 Modifier symbols (accents)
 
-=back
 
 =back
 
@@ -472,9 +430,9 @@ X
 
     POSIX             traditional  Unicode
 
-    [[:^digit:]]      \D           \P{IsDigit}
-    [[:^space:]]      \S           \P{IsSpace}
-    [[:^word:]]       \W           \P{IsWord}
+    [[:^digit:]]      \D           \P{IsPosixDigit}
+    [[:^space:]]      \S           \P{IsPosixSpace}
+    [[:^word:]]       \W           \P{IsPerlWord}
 
 Perl respects the POSIX standard in that POSIX character classes are
 only supported within a character class. The POSIX character classes
@@ -558,6 +516,9 @@ left parentheses have opened before it. Likewise \11 is a
 backreference only if at least 11 left parentheses have opened
 before it. And so on. \1 through \9 are always interpreted as
 backreferences.
+If the bracketing group did not match, the associated backreference won't
+match either. (This can happen if the bracketing group is optional, or
+in a different branch of an alternation.)
 X<\g{1}> X<\g{-1}> X<\g{name}> X X
 
 In order to provide a safer and easier way to construct patterns using
@@ -744,6 +705,10 @@ will match C in any case, some spaces, and an exact (I
 repetition of the previous word, assuming the C modifier, and no C
 modifier outside this group.
 
+These modifiers do not carry over into named subpatterns called in the
+enclosing group. In other words, a pattern such as C<((?i)(&NAME))> does not
+change the case-sensitivity of the "NAME" pattern.
+
 Note that the C<p> modifier is special in that it can only be enabled,
 not disabled, and that its presence anywhere in a pattern has a global
 effect. Thus C<(?-p)> and C<(?-p:...)> are meaningless and will warn
@@ -801,8 +766,24 @@ which buffer the captured content will be stored.
     / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
     # 1            2         2  3        2     3     4
 
-Note: as of Perl 5.10.0, branch resets interfere with the contents of
-the C<%+> hash, that holds named captures. Consider using C<%-> instead.
+Be careful when using the branch reset pattern in combination with
+named captures. Named captures are implemented as being aliases to
+numbered buffers holding the captures, and that interferes with the
+implementation of the branch reset pattern. If you are using named
+captures in a branch reset pattern, it's best to use the same names,
+in the same order, in each of the alternations:
+
+   /(?|  (?<a> x ) (?<b> y )
+      |  (?<a> z ) (?<b> w )) /x
+
+Not doing so may lead to surprises:
+
+  "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
+  say $+ {a};   # Prints '12'
+  say $+ {b};   # *Also* prints '12'.
+
+The problem here is that both the buffer named C<< a >> and the buffer
+named C<< b >> are aliases for the buffer belonging to C<< $1 >>.
 
 =item Look-Around Assertions
 X X X X
@@ -983,13 +964,6 @@ The assignment to C<$^R> above is properly localized, so the old
 value of C<$^R> is restored if the assertion is backtracked; compare
 L<"Backtracking">.
 
-Due to an unfortunate implementation issue, the Perl code contained in these
-blocks is treated as a compile time closure that can have seemingly bizarre
-consequences when used with lexically scoped variables inside of subroutines
-or loops. There are various workarounds for this, including simply using
-global variables instead. If you are using this construct and strange results
-occur then check for the use of lexically scoped variables.
-
 For reasons of security, this construct is forbidden if the regular
 expression involves run-time interpolation of variables, unless the
 perilous C pragma has been used (see L), or the
@@ -1011,9 +985,15 @@ so you should only do so if you are also using taint checking.
 Better yet, use the carefully constrained evaluation within a Safe
 compartment. See L for details about both these mechanisms.
 
-Because Perl's regex engine is currently not re-entrant, interpolated
-code may not invoke the regex engine either directly with C or C),
-or indirectly with functions such as C.
+B: Use of lexical (C) variables in these blocks is
+broken. The result is unpredictable and will make perl unstable. The
+workaround is to use global (C) variables.
+
+B: Because Perl's regex engine is currently not re-entrant,
+interpolated code may not invoke the regex engine either directly with
+C or C), or indirectly with functions such as
+C. Invoking the regex engine in these blocks will make perl
+unstable.
 
 =item C<(??{ code })>
 X<(??{})>
@@ -1055,6 +1035,12 @@ The following pattern matches a parenthesized group:
 See also C<(?PARNO)> for a different, more efficient way to accomplish
 the same task.
 
+For reasons of security, this construct is forbidden if the regular
+expression involves run-time interpolation of variables, unless the
+perilous C pragma has been used (see L), or the
+variables contain results of C operator (see
+L).
+
 Because perl's regex engine is not currently re-entrant, delayed
 code may not invoke the regex engine either directly with C or C),
 or indirectly with functions such as C.
@@ -1364,7 +1350,7 @@ otherwise stated the ARG argument is optional; in some cases, it is
 forbidden.
 
 Any pattern containing a special backtracking verb that allows an argument
-has the special behaviour that when executed it sets the current packages'
+has the special behaviour that when executed it sets the current package's
 C<$REGERROR> and C<$REGMARK> variables.
 When doing so the following rules apply:
 
@@ -1485,8 +1471,7 @@ This zero-width pattern can be used to mark the point reached in a string
 when a certain part of the pattern has been successfully matched. This
 mark may be given a name. A later C<(*SKIP)> pattern will then skip
 forward to that point if backtracked into on failure. Any number of
-C<(*MARK)> patterns are allowed, and the NAME portion is optional and may
-be duplicated.
+C<(*MARK)> patterns are allowed, and the NAME portion may be duplicated.
 
 In addition to interacting with the C<(*SKIP)> pattern, C<(*MARK:NAME)>
 can be used to "label" a pattern branch, so that after matching, the