X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlre.pod;h=42017ddf6629661a082bb46ab08273ae43864308;hb=408633379a1452b4e14d7c3b5e80f7dc05ea7986;hp=88023ef7b0316d84c2fba3aa2258d742fdec873a;hpb=241e73895f8f4dc685136e0956ef2d8b06636354;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlre.pod b/pod/perlre.pod
index 88023ef..42017dd 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -62,7 +62,7 @@ Extend your pattern's legibility by permitting whitespace and comments.
 =item p
 X X X

-Preserve the string matched such that ${^PREMATCH}, {$^MATCH}, and
+Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
 ${^POSTMATCH} are available for use after matching.

 =item g and c
@@ -102,7 +102,7 @@ X

 =head3 Metacharacters

-The patterns used in Perl pattern matching evolved from the ones supplied in
+The patterns used in Perl pattern matching evolved from those supplied in
 the Version 8 regex routines. (The routines are derived
 (distantly) from Henry Spencer's freely redistributable reimplementation
 of the V8 routines.) See L for
@@ -223,9 +223,9 @@ X<\0> X<\c> X<\N> X<\x>
     \e          escape (think troff)  (ESC)
     \033        octal char            (example: ESC)
     \x1B        hex char              (example: ESC)
-    \x{263a}    wide hex char         (example: Unicode SMILEY)
+    \x{263a}    long hex char         (example: Unicode SMILEY)
     \cK         control char          (example: VT)
-    \N{name}    named char
+    \N{name}    named Unicode character
     \l          lowercase next char (think vi)
     \u          uppercase next char (think vi)
     \L          lowercase till \E (think vi)
@@ -258,7 +258,7 @@ X X X X
     \pP  Match P, named property. Use \p{Prop} for longer names.
     \PP  Match non-P
     \X   Match eXtended Unicode "combining character sequence",
-         equivalent to (?:\PM\pM*)
+         equivalent to (?>\PM\pM*)
     \C   Match a single C char (octet) even under Unicode.
          NOTE: breaks up characters into their UTF-8 bytes,
          so you may end up with malformed pieces of UTF-8.
@@ -270,10 +270,8 @@ X X X X
                optionally be wrapped in curly brackets for safer parsing.
     \g{name}   Named backreference
     \k         Named backreference
-    \N{name}   Named Unicode character, or Unicode escape
-    \x12       Hexadecimal escape sequence
-    \x{1234}   Long hexadecimal escape sequence
     \K         Keep the stuff left of the \K, don't include it in $&
+    \N         Any character but \n
     \v         Vertical whitespace
     \V         Not vertical whitespace
     \h         Horizontal whitespace
@@ -318,26 +316,34 @@ they must always be used within a character class expression.
     # this is not, and will generate a warning:
     $string =~ /[:alpha:]/;

-The available classes and their backslash equivalents (if available) are
-as follows:
-X
+The following table shows the mapping of POSIX character class
+names, common escapes, literal escape sequences and their equivalent
+Unicode style property names.
+X X<\p> X<\p{}>
 X X X X X X X X X X X X X X

-    alpha
-    alnum
-    ascii
-    blank               [1]
-    cntrl
-    digit       \d
-    graph
-    lower
-    print
-    punct
-    space       \s      [2]
-    upper
-    word        \w      [3]
-    xdigit
+B up to Perl 5.10 the property names used were shared with
+standard Unicode properties, this was changed in Perl 5.11, see
+L for details.
+
+    POSIX         Esc  Class               Property           Note
+    --------------------------------------------------------
+    alnum              [0-9A-Za-z]         IsPosixAlnum
+    alpha              [A-Za-z]            IsPosixAlpha
+    ascii              [\000-\177]         IsASCII
+    blank              [\011 ]             IsPosixBlank       [1]
+    cntrl              [\0-\37\177]        IsPosixCntrl
+    digit         \d   [0-9]               IsPosixDigit
+    graph              [!-~]               IsPosixGraph
+    lower              [a-z]               IsPosixLower
+    print              [ -~]               IsPosixPrint
+    punct              [!-/:-@[-`{-~]      IsPosixPunct
+    space              [\11-\15 ]          IsPosixSpace       [2]
+                  \s   [\11\12\14\15 ]     IsPerlSpace        [2]
+    upper              [A-Z]               IsPosixUpper
+    word          \w   [0-9A-Z_a-z]        IsPerlWord         [3]
+    xdigit             [0-9A-Fa-f]         IsXDigit

 =over

@@ -347,8 +353,9 @@ A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".

 =item [2]

-Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
-also the (very rare) "vertical tabulator", "\cK" or chr(11) in ASCII.
+Note that C<\s> and C<[[:space:]]> are B<not> equivalent as C<[[:space:]]>
+includes also the (very rare) "vertical tabulator", "\cK" or chr(11) in
+ASCII.

 =item [3]

@@ -364,35 +371,24 @@ whole character class. For example:

 matches zero, one, any alphabetic character, and the percent sign.

-The following equivalences to Unicode \p{} constructs and equivalent
-backslash character classes (if available), will hold:
-X X<\p> X<\p{}>
+=over 4
+
+=item C<$>
+
+Currency symbol
+
+=item C<+> C<< < >> C<=> C<< > >> C<|> C<~>
+
+Mathematical symbols

-    [[:...:]]   \p{...}         backslash
+=item C<^> C<`>

-    alpha       IsAlpha
-    alnum       IsAlnum
-    ascii       IsASCII
-    blank
-    cntrl       IsCntrl
-    digit       IsDigit        \d
-    graph       IsGraph
-    lower       IsLower
-    print       IsPrint
-    punct       IsPunct
-    space       IsSpace
-                IsSpacePerl    \s
-    upper       IsUpper
-    word        IsWord
-    xdigit      IsXDigit
+Modifier symbols (accents)

-For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
-If the C pragma is not used but the C pragma is, the
-classes correlate with the usual isalpha(3) interface (except for
-"word" and "blank").
+=back

-The assumedly non-obviously named classes are:
+The other named classes are:

 =over 4

@@ -435,9 +431,9 @@ X

     POSIX         traditional  Unicode

-    [[:^digit:]]      \D      \P{IsDigit}
-    [[:^space:]]      \S      \P{IsSpace}
-    [[:^word:]]       \W      \P{IsWord}
+    [[:^digit:]]      \D      \P{IsPosixDigit}
+    [[:^space:]]      \S      \P{IsPosixSpace}
+    [[:^word:]]       \W      \P{IsPerlWord}

 Perl respects the POSIX standard in that POSIX character classes are
 only supported within a character class. The POSIX character classes
@@ -522,16 +518,20 @@ backreference only if at least 11 left parentheses have opened
 before it. And so on. \1 through \9 are always interpreted as
 backreferences.

+If the bracketing group did not match, the associated backreference won't
+match either. (This can happen if the bracketing group is optional, or
+in a different branch of an alternation.)
+
 X<\g{1}> X<\g{-1}> X<\g{name}> X X
 In order to provide a safer and easier way to construct patterns using
-backreferences, Perl 5.10 provides the C<\g{N}> notation. The curly
-brackets are optional, however omitting them is less safe as the meaning
-of the pattern can be changed by text (such as digits) following it.
-When N is a positive integer the C<\g{N}> notation is exactly equivalent
-to using normal backreferences. When N is a negative integer then it is
-a relative backreference referring to the previous N'th capturing group.
-When the bracket form is used and N is not an integer, it is treated as a
-reference to a named buffer.
+backreferences, Perl provides the C<\g{N}> notation (starting with perl
+5.10.0). The curly brackets are optional, however omitting them is less
+safe as the meaning of the pattern can be changed by text (such as digits)
+following it. When N is a positive integer the C<\g{N}> notation is
+exactly equivalent to using normal backreferences. When N is a negative
+integer then it is a relative backreference referring to the previous N'th
+capturing group. When the bracket form is used and N is not an integer, it
+is treated as a reference to a named buffer.

 Thus C<\g{-1}> refers to the last buffer, C<\g{-2}> refers to the
 buffer before that. For example:
@@ -547,7 +547,7 @@ buffer before that. For example:

 and would match the same as C.

-Additionally, as of Perl 5.10 you may use named capture buffers and named
+Additionally, as of Perl 5.10.0 you may use named capture buffers and named
 backreferences. The notation is C<< (?...) >> to declare and C<< \k >>
 to reference. You may also use apostrophes instead of angle brackets to delimit the
 name; and you may use the bracketed C<< \g{name} >> backreference syntax.
@@ -558,7 +558,7 @@ and C<< \k >> refer to the leftmost defined group. (Thus it's possible
 to do things with named capture buffers that would otherwise require
 C<(??{})> code to accomplish.)
 X X
-X<%+> X<$+{name}> X<\k{name}>
+X<%+> X<$+{name}> X<< \k >>

 Examples:

@@ -617,7 +617,7 @@ already paid the price. As of 5.005, C<$&> is not so costly as the
 other two.
 X<$&> X<$`> X<$'>

-As a workaround for this problem, Perl 5.10 introduces C<${^PREMATCH}>,
+As a workaround for this problem, Perl 5.10.0 introduces C<${^PREMATCH}>,
 C<${^MATCH}> and C<${^POSTMATCH}>, which are equivalent to C<$`>, C<$&>
 and C<$'>, B that they are only guaranteed to be defined after a
 successful match that was executed with the C</p> (preserve) modifier.
@@ -707,6 +707,10 @@ will match C in any case, some spaces, and an exact (I repetition
 of the previous word, assuming the C modifier, and no C modifier
 outside this group.

+These modifiers do not carry over into named subpatterns called in the
+enclosing group. In other words, a pattern such as C<((?i)(&NAME))> does not
+change the case-sensitivity of the "NAME" pattern.
+
 Note that the C</p> modifier is special in that it can only be enabled,
 not disabled, and that its presence anywhere in a pattern has a global
 effect. Thus C<(?-p)> and C<(?-p:...)> are meaningless and will warn
@@ -743,7 +747,7 @@ X<(?|)> X
 This is the "branch reset" pattern, which has the special property
 that the capture buffers are numbered from the same starting point
-in each alternation branch. It is available starting from perl 5.10.
+in each alternation branch. It is available starting from perl 5.10.0.

 Capture buffers are numbered from left to right, but inside this
 construct the numbering is restarted for each branch.

@@ -764,6 +768,9 @@ which buffer the captured content will be stored.

     / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
     # 1            2         2  3        2     3     4

+Note: as of Perl 5.10.0, branch resets interfere with the contents of
+the C<%+> hash, that holds named captures. Consider using C<%-> instead.
+
 =item Look-Around Assertions
 X X X X
@@ -840,9 +847,9 @@ only for fixed-width look-behind.
 X<< (?) >> X<(?'NAME')> X X

 A named capture buffer. Identical in every respect to normal capturing
-parentheses C<()> but for the additional fact that C<%+> may be used after
-a successful match to refer to a named buffer. See C for more
-details on the C<%+> hash.
+parentheses C<()> but for the additional fact that C<%+> or C<%-> may be
+used after a successful match to refer to a named buffer. See C
+for more details on the C<%+> and C<%-> hashes.

 If multiple distinct capture buffers have the same name then the
 $+{NAME} will refer to the leftmost defined buffer in the match.
@@ -867,8 +874,7 @@ though it isn't extended by the locale (see L).

 B In order to make things easier for programmers with experience
 with the Python or PCRE regex engines, the pattern C<< (?PENAMEEpattern) >>
 may be used instead of C<< (?pattern) >>; however this form does not
-support the use of single quotes as a delimiter for the name. This is
-only available in Perl 5.10 or later.
+support the use of single quotes as a delimiter for the name.

 =item C<< \k >>

@@ -886,7 +892,7 @@ Both forms are equivalent.

 B In order to make things easier for programmers with experience
 with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
-may be used instead of C<< \k >> in Perl 5.10 or later.
+may be used instead of C<< \k >>.

 =item C<(?{ code })>
 X<(?{})> X X X
@@ -944,13 +950,6 @@ The assignment to C<$^R> above is properly localized, so the old
 value of C<$^R> is restored if the assertion is backtracked; compare
 L<"Backtracking">.

-Due to an unfortunate implementation issue, the Perl code contained in these
-blocks is treated as a compile time closure that can have seemingly bizarre
-consequences when used with lexically scoped variables inside of subroutines
-or loops. There are various workarounds for this, including simply using
-global variables instead. If you are using this construct and strange results
-occur then check for the use of lexically scoped variables.
-
 For reasons of security, this construct is forbidden if the regular
 expression involves run-time interpolation of variables, unless the
 perilous C pragma has been used (see L), or the
@@ -972,9 +971,15 @@ so you should only do so if you are also using taint checking. Better
 yet, use the carefully constrained evaluation within a Safe compartment.
 See L for details about both these mechanisms.

-Because Perl's regex engine is currently not re-entrant, interpolated
-code may not invoke the regex engine either directly with C or C),
-or indirectly with functions such as C.
+B: Use of lexical (C) variables in these blocks is
+broken. The result is unpredictable and will make perl unstable. The
+workaround is to use global (C) variables.
+
+B: Because Perl's regex engine is currently not re-entrant,
+interpolated code may not invoke the regex engine either directly with
+C or C), or indirectly with functions such as
+C. Invoking the regex engine in these blocks will make perl
+unstable.

 =item C<(??{ code })>
 X<(??{})>
@@ -1111,7 +1116,7 @@ pattern.

 B In order to make things easier for programmers with experience
 with the Python or PCRE regex engines the pattern C<< (?P>NAME) >>
-may be used instead of C<< (?&NAME) >> in Perl 5.10 or later.
+may be used instead of C<< (?&NAME) >>.

 =item C<(?(condition)yes-pattern|no-pattern)>
 X<(?()>
@@ -1390,7 +1395,7 @@ If we add a C<(*PRUNE)> before the count like the following
     print "Count=$count\n";

 we prevent backtracking and find the count of the longest matching
-at each matching startpoint like so:
+at each matching starting point like so:

     aaab
     aab
@@ -1436,7 +1441,7 @@ outputs
     Count=2

 Once the 'aaab' at the start of the string has matched, and the C<(*SKIP)>
-executed, the next startpoint will be where the cursor was when the
+executed, the next starting point will be where the cursor was when the
 C<(*SKIP)> was executed.

 =item C<(*MARK:NAME)> C<(*:NAME)>
@@ -2109,9 +2114,9 @@ part of this regular expression needs to be converted explicitly

 =head1 PCRE/Python Support

-As of Perl 5.10 Perl supports several Python/PCRE specific extensions
+As of Perl 5.10.0, Perl supports several Python/PCRE specific extensions
 to the regex syntax. While Perl programmers are encouraged to use the
-Perl specific syntax, the following are legal in Perl 5.10:
+Perl specific syntax, the following are also accepted:

 =over 4