X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlreref.pod;h=5ddacc5046986d0f02459d7a7e94eca390be5124;hb=645252571da2008f8956d78323fb408c931f0665;hp=8aad32719ab3e3207a7500fd8bd5303461400789;hpb=30487ceba2ac5c35e693d7aba544e73d6a7dc3f0;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlreref.pod b/pod/perlreref.pod
index 8aad327..5ddacc5 100644
--- a/pod/perlreref.pod
+++ b/pod/perlreref.pod
@@ -6,60 +6,78 @@ perlreref - Perl Regular Expressions Reference
 
 This is a quick reference to Perl's regular expressions.
 For full information see L<perlre> and L<perlop>, as well
-as the L<references|/"SEE ALSO"> section in this document.
-
-=head1 OPERATORS
-
-  =~ determines to which variable the regex is applied.
-     In its absence, C<$_> is used.
-
-        $var =~ /foo/;
-
-  m/pattern/igmsoxc searches a string for a pattern match,
-     applying the given options.
-
-        i  case-Insensitive
-        g  Global - all occurrences
-        m  Multiline mode - ^ and $ match internal lines
-        s  match as a Single line - . matches \n
-        o  compile pattern Once
-        x  eXtended legibility - free whitespace and comments
-        c  don't reset pos on fails when using /g
-
-     If C<pattern> is an empty string, the last I<successfully> match
-     regex is used. Delimiters other than C</> may be used for both this
-     operator and the following ones.
-
-  qr/pattern/imsox lets you store a regex in a variable,
-     or pass one around. Modifiers as for C<m//> and are stored
-     within the regex.
-
-  s/pattern/replacement/igmsoxe substitutes matches of
-     C<pattern> with C<replacement>. Modifiers as for C<m//>
-     with addition of C<e>:
-
-        e  Evaluate replacement as an expression
-
-     'e' may be specified multiple times. 'replacement' is interpreted
-     as a double quoted string unless a single-quote (') is the delimiter.
-
-  ?pattern? is like C<m/pattern/> but matches only once. No alternate
-     delimiters can be used. Must be reset with 'reset'.
-
-=head1 SYNTAX
-
-   \     Escapes the character(s) immediately following it
-   .     Matches any single character except a newline (unless /s is used)
-   ^     Matches at the beginning of the string (or line, if /m is used)
-   $     Matches at the end of the string (or line, if /m is used)
-   *     Matches the preceding element 0 or more times
-   +     Matches the preceding element 1 or more times
-   ?     Matches the preceding element 0 or 1 times
-   {...} Specifies a range of occurrences for the element preceding it
-   [...] Matches any one of the characters contained within the brackets
-   (...) Groups regular expressions
-   |     Matches either the expression preceding or following it
-   \1, \2 ...  The text from the Nth group
+as the L</"SEE ALSO"> section in this document.
+
+=head2 OPERATORS
+
+C<=~> determines to which variable the regex is applied.
+In its absence, $_ is used.
+
+    $var =~ /foo/;
+
+C<!~> determines to which variable the regex is applied,
+and negates the result of the match; it returns
+false if the match succeeds, and true if it fails.
+
+    $var !~ /foo/;
+
+C<m/pattern/msixpogc> searches a string for a pattern match,
+applying the given options.
+
+    m  Multiline mode - ^ and $ match internal lines
+    s  match as a Single line - . matches \n
+    i  case-Insensitive
+    x  eXtended legibility - free whitespace and comments
+    p  Preserve a copy of the matched string -
+       ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined.
+    o  compile pattern Once
+    g  Global - all occurrences
+    c  don't reset pos on failed matches when using /g
+
+If 'pattern' is an empty string, the last I<successfully> matched
+regex is used. Delimiters other than '/' may be used for both this
+operator and the following ones. The leading C<m> can be omitted
+if the delimiter is '/'.
+
+C<qr/pattern/msixpo> lets you store a regex in a variable,
+or pass one around. Modifiers are as for C<m//>, and are stored
+within the regex.
+
+C<s/pattern/replacement/msixpogcer> substitutes matches of
+'pattern' with 'replacement'. Modifiers as for C<m//>,
+with two additions:
+
+    e  Evaluate 'replacement' as an expression
+    r  Return substitution and leave the original string untouched.
+
+'e' may be specified multiple times. 'replacement' is interpreted
+as a double quoted string unless a single-quote (C<'>) is the delimiter.
+
+C<?pattern?> is like C<m/pattern/> but matches only once. No alternate
+delimiters can be used.  Must be reset with reset().
+
+=head2 SYNTAX
+
+ \       Escapes the character immediately following it
+ .       Matches any single character except a newline (unless /s is
+           used)
+ ^       Matches at the beginning of the string (or line, if /m is used)
+ $       Matches at the end of the string (or line, if /m is used)
+ *       Matches the preceding element 0 or more times
+ +       Matches the preceding element 1 or more times
+ ?       Matches the preceding element 0 or 1 times
+ {...}   Specifies a range of occurrences for the element preceding it
+ [...]   Matches any one of the characters contained within the brackets
+ (...)   Groups subexpressions for capturing to $1, $2...
+ (?:...) Groups subexpressions without capturing (cluster)
+ |       Matches either the subexpression preceding or following it
+ \1, \2, \3 ...           Matches the text from the Nth group
+ \g1 or \g{1}, \g2 ...    Matches the text from the Nth group
+ \g-1 or \g{-1}, \g-2 ... Matches the text from the Nth previous group
+ \g{name}     Named backreference
+ \k<name>     Named backreference
+ \k'name'     Named backreference
+ (?P=name)    Named backreference (Python syntax)
 
 =head2 ESCAPE SEQUENCES
 
@@ -71,18 +89,21 @@ These work as in normal strings.
    \n       Newline
    \r       Carriage return
    \t       Tab
-   \038     Any octal ASCII value
+   \037     Any octal ASCII value
    \x7f     Any hexadecimal ASCII value
    \x{263a} A wide hexadecimal value
    \cx      Control-x
    \N{name} A named character
+   \N{U+263D} A Unicode character by hex ordinal
 
-   \l  Lowercase until next character
-   \u  Uppercase until next character
+   \l  Lowercase next character
+   \u  Titlecase next character
    \L  Lowercase until \E
    \U  Uppercase until \E
    \Q  Disable pattern metacharacters until \E
-   \E  End case modification
+   \E  End modification
+
+For Titlecase, see L</Titlecase>.
 
 This one works differently from normal strings:
 
@@ -93,46 +114,80 @@ This one works differently from normal strings:
    [amy]    Match 'a', 'm' or 'y'
    [f-j]    Dash specifies "range"
    [f-j-]   Dash escaped or at start or end means 'dash'
-   [^f-j]   Caret indicates "match char any _except_ these"
-
-The following work within or without a character class:
-
-   \d      A digit, same as [0-9]
-   \D      A nondigit, same as [^0-9]
-   \w      A word character (alphanumeric), same as [a-zA-Z_0-9]
-   \W      A non-word character, [^a-zA-Z_0-9]
-   \s      A whitespace character, same as [ \t\n\r\f]
-   \S      A non-whitespace character, [^ \t\n\r\f]
-   \C      Match a byte (with Unicode. '.' matches char)
+   [^f-j]   Caret indicates "match any character _except_ these"
+
+The following sequences (except C<\N>) work within or without a character class.
+The first six are locale aware, all are Unicode aware. See L<perllocale>
+and L<perlunicode> for details.
+
+   \d      A digit
+   \D      A nondigit
+   \w      A word character
+   \W      A non-word character
+   \s      A whitespace character
+   \S      A non-whitespace character
+   \h      A horizontal whitespace
+   \H      A non-horizontal whitespace
+   \N      A non newline (when not followed by '{NAME}'; experimental;
+           not valid in a character class; equivalent to [^\n]; it's
+           like '.' without /s modifier)
+   \v      A vertical whitespace
+   \V      A non vertical whitespace
+   \R      A generic newline           (?>\x0D\x0A|\v)
+
+   \C      Match a byte (with Unicode, '.' matches a character)
    \pP     Match P-named (Unicode) property
-   \p{...} Match Unicode property with long name
+   \p{...} Match Unicode property with name longer than 1 character
    \PP     Match non-P
-   \P{...} Match lack of Unicode property with long name
-   \X      Match extended unicode sequence
+   \P{...} Match lack of Unicode property with name longer than 1 char
+   \X      Match Unicode extended grapheme cluster
 
 POSIX character classes and their Unicode and Perl equivalents:
 
-   alnum   IsAlnum             Alphanumeric
-   alpha   IsAlpha             Alphabetic
-   ascii   IsASCII             Any ASCII char
-   blank   IsSpace  [ \t]      Horizontal whitespace (GNU)
-   cntrl   IsCntrl             Control characters
-   digit   IsDigit  \d         Digits
-   graph   IsGraph             Alphanumeric and punctuation
-   lower   IsLower             Lowercase chars (locale aware)
-   print   IsPrint             Alphanumeric, punct, and space
-   punct   IsPunct             Punctuation
-   space   IsSpace  [\s\ck]    Whitespace
-           IsSpacePerl   \s    Perl's whitespace definition
-   upper   IsUpper             Uppercase chars (locale aware)
-   word    IsWord   \w         Alphanumeric plus _ (Perl)
-   xdigit  IsXDigit [\dA-Fa-f] Hexadecimal digit
+           ASCII-         Full-
+           range          range   backslash
+ POSIX    \p{...}         \p{}    sequence       Description
+ -----------------------------------------------------------------------
+ alnum   PosixAlnum       Alnum               Alpha plus Digit
+ alpha   PosixAlpha       Alpha               Alphabetic characters
+ ascii   ASCII                                Any ASCII character
+ blank   PosixBlank       Blank     \h        Horizontal whitespace;
+                                                full-range also written
+                                                as \p{HorizSpace} (GNU
+                                                extension)
+ cntrl   PosixCntrl       Cntrl               Control characters
+ digit   PosixDigit       Digit     \d        Decimal digits
+ graph   PosixGraph       Graph               Alnum plus Punct
+ lower   PosixLower       Lower               Lowercase characters
+ print   PosixPrint       Print               Graph plus Space, but not
+                                                any Cntrls
+ punct   PosixPunct       Punct               These aren't precisely
+                                                equivalent.  See NOTE,
+                                                below.
+ space   PosixSpace       Space     [\s\cK]   Whitespace
+         PerlSpace        SpacePerl \s        Perl's whitespace
+                                                definition
+ upper   PosixUpper       Upper               Uppercase characters
+ word    PerlWord         Word      \w        Alnum plus '_' (Perl
+                                                extension)
+ xdigit  ASCII_Hex_Digit  XDigit              Hexadecimal digit,
+                                                ASCII-range is
+                                                [0-9A-Fa-f]
+
+NOTE on C<[[:punct:]]>, C<\p{PosixPunct}> and C<\p{Punct}>:
+In the ASCII range, C<[[:punct:]]> and C<\p{PosixPunct}> match
+C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in
+effect, it could alter the behavior of C<[[:punct:]]>); and C<\p{Punct}>
+matches C<[-!"#%&'()*,./:;?@[\\\]_{}]>.  When matching a UTF-8 string,
+C<[[:punct:]]> matches what it does in the ASCII range, plus what
+C<\p{Punct}> matches.  C<\p{Punct}> matches anything that isn't a
+control, an alphanumeric, a space, or a symbol.
 
 Within a character class:
 
-    POSIX       traditional   Unicode
-    [:digit:]       \d        \p{IsDigit}
-    [:^digit:]      \D        \P{IsDigit}
+    POSIX      traditional   Unicode
+  [:digit:]       \d        \p{Digit}
+  [:^digit:]      \D        \P{Digit}
 
 =head2 ANCHORS
 
@@ -141,81 +196,123 @@ All are zero-width assertions.
    ^  Match string start (or line, if /m is used)
    $  Match string end (or line, if /m is used) or before newline
    \b Match word boundary (between \w and \W)
-   \B Match except at word boundary
+   \B Match except at word boundary (between \w and \w or \W and \W)
    \A Match string start (regardless of /m)
-   \Z Match string end (preceding optional newline)
+   \Z Match string end (before optional newline)
    \z Match absolute string end
    \G Match where previous m//g left off
-   \c Suppresses resetting of search position when used with /g.
-      Without \c, search pattern is reset to the beginning of the string
+   \K Keep the stuff left of the \K, don't include it in $&
 
 =head2 QUANTIFIERS
 
-Quantifiers are greedy by default --- match the B<longest> leftmost.
+Quantifiers are greedy by default and match the B<longest> leftmost.
+
+   Maximal Minimal Possessive Allowed range
+   ------- ------- ---------- -------------
+   {n,m}   {n,m}?  {n,m}+     Must occur at least n times
+                              but no more than m times
+   {n,}    {n,}?   {n,}+      Must occur at least n times
+   {n}     {n}?    {n}+       Must occur exactly n times
+   *       *?      *+         0 or more times (same as {0,})
+   +       +?      ++         1 or more times (same as {1,})
+   ?       ??      ?+         0 or 1 time (same as {0,1})
 
-   Maximal Minimal Allowed range
-   ------- ------- -------------
-   {n,m}   {n,m}?  Must occur at least n times but no more than m times
-   {n,}    {n,}?   Must occur at least n times
-   {n}     {n}?    Must match exactly n times
-   *       *?      0 or more times (same as {0,})
-   +       +?      1 or more times (same as {1,})
-   ?       ??      0 or 1 time (same as {0,1})
+The possessive forms (new in Perl 5.10) prevent backtracking: what gets
+matched by a pattern with a possessive quantifier will not be backtracked
+into, even if that causes the whole match to fail.
+
+There is no quantifier C<{,n}>. That's interpreted as a literal string.
 
 =head2 EXTENDED CONSTRUCTS
 
-   (?#text)         A comment
-   (?:...)          Cluster without capturing
-   (?imxs-imsx:...) Enable/disable option (as per m//)
-   (?=...)          Zero-width positive lookahead assertion
-   (?!...)          Zero-width negative lookahead assertion
-   (?<...)          Zero-width positive lookbehind assertion
-   (?<!...)         Zero-width negative lookbehind assertion
-   (?>...)          Grab what we can, prohibit backtracking
-   (?{ code })      Embedded code, return value becomes $^R
-   (??{ code })     Dynamic regex, return value used as regex
-   (?(cond)yes|no)  cond being int corresponding to capturing parens
-   (?(cond)yes)        or a lookaround/eval zero-width assertion
-
-=head1 VARIABLES
+   (?#text)          A comment
+   (?:...)           Groups subexpressions without capturing (cluster)
+   (?pimsx-imsx:...) Enable/disable option (as per m// modifiers)
+   (?=...)           Zero-width positive lookahead assertion
+   (?!...)           Zero-width negative lookahead assertion
+   (?<=...)          Zero-width positive lookbehind assertion
+   (?<!...)          Zero-width negative lookbehind assertion
+   (?>...)           Grab what we can, prohibit backtracking
+   (?|...)           Branch reset
+   (?<name>...)      Named capture
+   (?'name'...)      Named capture
+   (?P<name>...)     Named capture (Python syntax)
+   (?{ code })       Embedded code, return value becomes $^R
+   (??{ code })      Dynamic regex, return value used as regex
+   (?N)              Recurse into subpattern number N
+   (?-N), (?+N)      Recurse into Nth previous/next subpattern
+   (?R), (?0)        Recurse at the beginning of the whole pattern
+   (?&name)          Recurse into a named subpattern
+   (?P>name)         Recurse into a named subpattern (Python syntax)
+   (?(cond)yes|no)
+   (?(cond)yes)      Conditional expression, where "cond" can be:
+                     (N)       subpattern N has matched something
+                     (<name>)  named subpattern has matched something
+                     ('name')  named subpattern has matched something
+                     (?{code}) code condition
+                     (R)       true if recursing
+                     (RN)      true if recursing into Nth subpattern
+                     (R&name)  true if recursing into named subpattern
+                     (DEFINE)  always false, no "no" pattern allowed
+
+=head2 VARIABLES
 
    $_    Default variable for operators to use
-   $*    Enable multiline matching (deprecated; not in 5.8.1+)
 
-   $&    Entire matched string
    $`    Everything prior to matched string
+   $&    Entire matched string
    $'    Everything after to matched string
 
-The use of those last three will slow down B<all> regex use
-within your program. Consult L<perlvar> for C<@LAST_MATCH_START>
+   ${^PREMATCH}   Everything prior to matched string
+   ${^MATCH}      Entire matched string
+   ${^POSTMATCH}  Everything after the matched string
+
+The use of C<$`>, C<$&> or C<$'> will slow down B<all> regex use
+within your program. Consult L<perlvar> for C<@->
 to see equivalent expressions that won't cause slow down.
-See also L<Devel::SawAmpersand>.
+See also L<Devel::SawAmpersand>. Starting with Perl 5.10, you
+can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}>
+and C<${^POSTMATCH}>, but for them to be defined, you have to
+specify the C</p> (preserve) modifier on your regular expression.
 
    $1, $2 ...  hold the Xth captured expr
    $+    Last parenthesized pattern match
    $^N   Holds the most recently closed capture
    $^R   Holds the result of the last (?{...}) expr
-   @-    Offsets of starts of groups. [0] holds start of whole match
-   @+    Offsets of ends of groups. [0] holds end of whole match
+   @-    Offsets of starts of groups. $-[0] holds start of whole match
+   @+    Offsets of ends of groups. $+[0] holds end of whole match
+   %+    Named capture buffers
+   %-    Named capture buffers, as array refs
 
-Capture groups are numbered according to their I<opening> paren.
+Captured groups are numbered according to their I<opening> paren.
 
-=head1 FUNCTIONS
+=head2 FUNCTIONS
 
    lc          Lowercase a string
    lcfirst     Lowercase first char of a string
    uc          Uppercase a string
    ucfirst     Titlecase first char of a string
+
    pos         Return or set current match position
    quotemeta   Quote metacharacters
    reset       Reset ?pattern? status
    study       Analyze string for optimizing matching
 
-   split       Use regex to split a string into parts
+   split       Use a regex to split a string into parts
+
+The first four of these are like the escape sequences C<\L>, C<\l>,
+C<\U>, and C<\u>.  For Titlecase, see L</Titlecase>.
+
+=head2 TERMINOLOGY
+
+=head3 Titlecase
+
+A Unicode concept that is most often equal to uppercase, but for
+certain characters, such as the German "sharp s", there is a difference.
 
 =head1 AUTHOR
 
-Iain Truskett.
+Iain Truskett. Updated by the Perl 5 Porters.
 
 This document may be distributed under the same terms as Perl itself.
 
@@ -253,6 +350,14 @@ L<perlfaq6> for FAQs on regular expressions.
 
 =item *
 
+L<perlrebackslash> for a reference on backslash sequences.
+
+=item *
+
+L<perlrecharclass> for a reference on character classes.
+
+=item *
+
 The L<re> module to alter behaviour and aid
 debugging.
 
@@ -262,13 +367,13 @@ L<perldebug/"Debugging regular expressions">
 
 =item *
 
-L<perluniintro>, L<perlunicode>, L<charnames> and L<locale>
+L<perluniintro>, L<perlunicode>, L<charnames> and L<perllocale>
 for details on regexes and internationalisation.
 
 =item *
 
 I<Mastering Regular Expressions> by Jeffrey Friedl
-(F<http://regex.info/>) for a thorough grounding and
+(F<http://oreilly.com/catalog/9780596528126/>) for a thorough grounding and
 reference on the topic.
 
 =back
@@ -279,6 +384,9 @@ David P.C. Wollmann,
 Richard Soderberg,
 Sean M. Burke,
 Tom Christiansen,
+Jim Cromie,
 and
 Jeffrey Goff
 for useful advice.
+
+=cut
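
Several of the constructs this patch documents (named captures and C<%+>, the C</p> modifier, C<\K>, C<s///r>, relative backreferences and possessive quantifiers) can be tried out with a short script. The following is only a sketch and not part of the patch itself: the helper names (C<parse_date>, C<split_around>, C<drop_bar_after_foo>, C<first_doubled_word>, C<matches_greedy>, C<matches_possessive>) are hypothetical, and it assumes Perl 5.14 or later, because C<s///r> is not available in earlier releases.

```perl
use strict;
use warnings;
use 5.014;    # assumption: s///r needs 5.14; the other features need 5.10

# Named captures: (?<name>...) fills %+ on a successful match.
sub parse_date {
    my ($text) = @_;
    return unless $text =~ /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
    my %date = %+;    # copy now; %+ changes with the next successful match
    return \%date;
}

# /p defines ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} for this match,
# without resorting to $`, $& and $'.
sub split_around {
    my ($text, $word) = @_;
    return unless $text =~ /\Q$word\E/p;
    return (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

# \K keeps everything to its left out of the matched text, so only
# "bar" is replaced; /r returns the result and leaves $text untouched.
sub drop_bar_after_foo {
    my ($text) = @_;
    return $text =~ s/foo\Kbar//r;
}

# \g{-1} is a relative backreference to the most recently opened group.
sub first_doubled_word {
    my ($text) = @_;
    return $text =~ /\b(\w+)\s+\g{-1}\b/ ? $1 : undef;
}

# Greedy a+ gives back one "a" so the final "a" can match;
# possessive a++ never gives anything back, so the match fails.
sub matches_greedy     { return $_[0] =~ /^a+a/  ? 1 : 0 }
sub matches_possessive { return $_[0] =~ /^a++a/ ? 1 : 0 }

my $date = parse_date('released 2010-06-22');
say "$date->{year}/$date->{month}/$date->{day}";
say join '|', split_around('hello world again', 'world');
say drop_bar_after_foo('foobar bar');
say first_doubled_word('this is is a test');
say 'greedy: ', matches_greedy('aaa'), ', possessive: ', matches_possessive('aaa');
```

Copying C<%+> into a lexical hash is a deliberate choice here: the capture variables are overwritten by the next successful match, so returning a copy keeps the result stable for the caller.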