X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlre.pod;h=bb521139ca5887be64a11ced5c165cf25d2c448a;hb=d360a069d6bdc55d9bfda16507abbff2168bf4f7;hp=5c7e76b5ad6b06a55f2de294cbc3aa0d9cc8ef9f;hpb=72ff290864ea88cc224b5d3af7058f500755f94a;p=p5sagit%2Fp5-mst-13.2.git diff --git a/pod/perlre.pod b/pod/perlre.pod index 5c7e76b..bb52113 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -183,19 +183,23 @@ In addition, Perl defines the following: \pP Match P, named property. Use \p{Prop} for longer names. \PP Match non-P \X Match eXtended Unicode "combining character sequence", - equivalent to C<(?:\PM\pM*)> + equivalent to (?:\PM\pM*) \C Match a single C char (octet) even under Unicode. - B breaks up characters into their UTF-8 bytes, + NOTE: breaks up characters into their UTF-8 bytes, so you may end up with malformed pieces of UTF-8. -A C<\w> matches a single alphanumeric character or C<_>, not a whole word. -Use C<\w+> to match a string of Perl-identifier characters (which isn't -the same as matching an English word). If C is in effect, the -list of alphabetic characters generated by C<\w> is taken from the -current locale. See L. You may use C<\w>, C<\W>, C<\s>, C<\S>, +A C<\w> matches a single alphanumeric character (an alphabetic +character, or a decimal digit) or C<_>, not a whole word. Use C<\w+> +to match a string of Perl-identifier characters (which isn't the same +as matching an English word). If C is in effect, the list +of alphabetic characters generated by C<\w> is taken from the current +locale. See L. You may use C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, and C<\D> within character classes, but if you try to use them -as endpoints of a range, that's not a range, the "-" is understood literally. -See L for details about C<\pP>, C<\PP>, and C<\X>. +as endpoints of a range, that's not a range, the "-" is understood +literally. If Unicode is in effect, C<\s> matches also "\x{85}", +"\x{2028}, and "\x{2029}", see L for more details about +C<\pP>, C<\PP>, and C<\X>, and L about Unicode in general. +You can define your own C<\p> and C<\P> propreties, see L. The POSIX character class syntax @@ -219,10 +223,22 @@ equivalents (if available) are as follows: word \w [3] xdigit - [1] A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'. - [2] Not I to C<\s> since the C<[[:space:]]> includes - also the (very rare) `vertical tabulator', "\ck", chr(11). - [3] A Perl extension. +=over + +=item [1] + +A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'. + +=item [2] + +Not exactly equivalent to C<\s> since the C<[[:space:]]> includes +also the (very rare) `vertical tabulator', "\ck", chr(11). + +=item [3] + +A Perl extension, see above. + +=back For example use C<[:upper:]> to match all the uppercase characters. Note that the C<[]> are part of the C<[::]> construct, not part of the @@ -300,8 +316,10 @@ with a '^'. This is a Perl extension. For example: [:^space:] \S \P{IsSpace} [:^word:] \W \P{IsWord} -The POSIX character classes [.cc.] and [=cc=] are recognized but -B supported and trying to use them will cause an error. +Perl respects the POSIX standard in that POSIX character classes are +only supported within a character class. The POSIX character classes +[.cc.] and [=cc=] are recognized but B supported and trying to +use them will cause an error. Perl defines the following zero-width assertions: @@ -331,7 +349,11 @@ It is also useful when writing C-like scanners, when you have several patterns that you want to match against consequent substrings of your string, see the previous reference. The actual location where C<\G> will match can also be influenced by using C as -an lvalue. See L. +an lvalue: see L. Currently C<\G> is only fully +supported when anchored to the start of the pattern; while it +is permitted to use it elsewhere, as in C, some +such uses (C, for example) currently cause problems, and +it is recommended that you avoid such usage for now. The bracketing construct C<( ... )> creates capture buffers. To refer to the digit'th buffer use \ within the @@ -370,11 +392,14 @@ Several special variables also refer back to portions of the previous match. C<$+> returns whatever the last bracket match matched. C<$&> returns the entire matched string. (At one point C<$0> did also, but now it returns the name of the program.) C<$`> returns -everything before the matched string. And C<$'> returns everything -after the matched string. +everything before the matched string. C<$'> returns everything +after the matched string. And C<$^N> contains whatever was matched by +the most-recently closed group (submatch). C<$^N> can be used in +extended patterns (see below), for example to assign a submatch to a +variable. The numbered variables ($1, $2, $3, etc.) and the related punctuation -set (C<$+>, C<$&>, C<$`>, and C<$'>) are all dynamically scoped +set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See L.) @@ -446,12 +471,14 @@ C<)> in the comment. =item C<(?imsx-imsx)> -One or more embedded pattern-match modifiers. This is particularly -useful for dynamic patterns, such as those read in from a configuration -file, read in as an argument, are specified in a table somewhere, -etc. Consider the case that some of which want to be case sensitive -and some do not. The case insensitive ones need to include merely -C<(?i)> at the front of the pattern. For example: +One or more embedded pattern-match modifiers, to be turned on (or +turned off, if preceded by C<->) for the remainder of the pattern or +the remainder of the enclosing pattern group (if any). This is +particularly useful for dynamic patterns, such as those read in from a +configuration file, read in as an argument, are specified in a table +somewhere, etc. Consider the case that some of which want to be case +sensitive and some do not. The case insensitive ones need to include +merely C<(?i)> at the front of the pattern. For example: $pattern = "foobar"; if ( /$pattern/i ) { } @@ -461,8 +488,7 @@ C<(?i)> at the front of the pattern. For example: $pattern = "(?i)foobar"; if ( /$pattern/ ) { } -Letters after a C<-> turn those modifiers off. These modifiers are -localized inside an enclosing group (if any). For example, +These modifiers are restored at the end of the enclosing group. For example, ( (?i) blah ) \s+ \1 @@ -540,6 +566,14 @@ This zero-width assertion evaluate any embedded Perl code. It always succeeds, and its C is not interpolated. Currently, the rules to determine where the C ends are somewhat convoluted. +This feature can be used together with the special variable C<$^N> to +capture the results of submatches in variables without having to keep +track of the number of nested parentheses. For example: + + $_ = "The brown fox jumps over the lazy dog"; + /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i; + print "color = $color, animal = $animal\n"; + The C is properly scoped in the following sense: If the assertion is backtracked (compare L<"Backtracking">), all changes introduced after Cization are undone, so that @@ -850,7 +884,7 @@ multiple ways it might succeed, you need to understand backtracking to know which variety of success you will achieve. When using look-ahead assertions and negations, this can all get even -tricker. Imagine you'd like to find a sequence of non-digits not +trickier. Imagine you'd like to find a sequence of non-digits not followed by "123". You might try to write that as $_ = "ABC123";