X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlreref.pod;h=5ddacc5046986d0f02459d7a7e94eca390be5124;hb=645252571da2008f8956d78323fb408c931f0665;hp=8aad32719ab3e3207a7500fd8bd5303461400789;hpb=30487ceba2ac5c35e693d7aba544e73d6a7dc3f0;p=p5sagit%2Fp5-mst-13.2.git
diff --git a/pod/perlreref.pod b/pod/perlreref.pod
index 8aad327..5ddacc5 100644
--- a/pod/perlreref.pod
+++ b/pod/perlreref.pod
@@ -6,60 +6,78 @@ perlreref - Perl Regular Expressions Reference
This is a quick reference to Perl's regular expressions.
For full information see L<perlre> and L<perlop>, as well
-as the L<"SEE ALSO"> section in this document.
-
-=head1 OPERATORS
-
- =~ determines to which variable the regex is applied.
- In its absence, C<$_> is used.
-
- $var =~ /foo/;
-
- m/pattern/igmsoxc searches a string for a pattern match,
- applying the given options.
-
- i case-Insensitive
- g Global - all occurrences
- m Multiline mode - ^ and $ match internal lines
- s match as a Single line - . matches \n
- o compile pattern Once
- x eXtended legibility - free whitespace and comments
- c don't reset pos on fails when using /g
-
- If C<pattern> is an empty string, the last I<successfully> match
- regex is used. Delimiters other than C</> may be used for both this
- operator and the following ones.
-
- qr/pattern/imsox lets you store a regex in a variable,
- or pass one around. Modifiers as for C<m//> and are stored
- within the regex.
-
- s/pattern/replacement/igmsoxe substitutes matches of
- C<pattern> with C<replacement>. Modifiers as for C<m//>
- with addition of C<e>:
-
- e Evaluate replacement as an expression
-
- 'e' may be specified multiple times. 'replacement' is interpreted
- as a double quoted string unless a single-quote (') is the delimiter.
-
- ?pattern? is like C<m/pattern/> but matches only once. No alternate
- delimiters can be used. Must be reset with 'reset'.
-
-=head1 SYNTAX
-
- \ Escapes the character(s) immediately following it
- . Matches any single character except a newline (unless /s is used)
- ^ Matches at the beginning of the string (or line, if /m is used)
- $ Matches at the end of the string (or line, if /m is used)
- * Matches the preceding element 0 or more times
- + Matches the preceding element 1 or more times
- ? Matches the preceding element 0 or 1 times
- {...} Specifies a range of occurrences for the element preceding it
- [...] Matches any one of the characters contained within the brackets
- (...) Groups regular expressions
- | Matches either the expression preceding or following it
- \1, \2 ... The text from the Nth group
+as the L</"SEE ALSO"> section in this document.
+
+=head2 OPERATORS
+
+C<=~> determines to which variable the regex is applied.
+In its absence, $_ is used.
+
+ $var =~ /foo/;
+
+C<!~> determines to which variable the regex is applied,
+and negates the result of the match; it returns
+false if the match succeeds, and true if it fails.
+
+ $var !~ /foo/;
+
+C<m/pattern/msixpogc> searches a string for a pattern match,
+applying the given options.
+
+ m Multiline mode - ^ and $ match internal lines
+ s match as a Single line - . matches \n
+ i case-Insensitive
+ x eXtended legibility - free whitespace and comments
+ p Preserve a copy of the matched string -
+ ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined.
+ o compile pattern Once
+ g Global - all occurrences
+ c don't reset pos on failed matches when using /g
+
+If 'pattern' is an empty string, the last I<successfully> matched
+regex is used. Delimiters other than '/' may be used for both this
+operator and the following ones. The leading C<m> can be omitted
+if the delimiter is '/'.
+
+C<qr/pattern/msixpo> lets you store a regex in a variable,
+or pass one around. Modifiers as for C<m//>, and are stored
+within the regex.
+
+C<s/pattern/replacement/msixpogcer> substitutes matches of
+'pattern' with 'replacement'. Modifiers as for C<m//>,
+with two additions:
+
+ e Evaluate 'replacement' as an expression
+ r Return substitution and leave the original string untouched.
+
+'e' may be specified multiple times. 'replacement' is interpreted
+as a double quoted string unless a single-quote (C<'>) is the delimiter.
+
+C<?pattern?> is like C<m/pattern/> but matches only once. No alternate
+delimiters can be used. Must be reset with reset().
+
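
The operators described above can be tried out with a short script. This is a minimal sketch, not part of the patch: the sample string and the variable names are assumptions chosen for illustration, and it assumes Perl 5.14 or later (C</r> needs 5.14, C</p> needs 5.10).

```perl
use strict;
use warnings;
use feature 'say';

my $text = "Hello World";   # hypothetical sample input

# !~ negates the match: true here because $text contains no digit.
my $no_digits = ($text !~ /\d/) ? 1 : 0;

# /r returns the changed copy and leaves $text untouched.
my $lower = $text =~ s/World/world/r;

# /p populates ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH}.
my $parts = '';
if ($text =~ /o W/p) {
    $parts = join '|', ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH};
}

say $no_digits;   # 1
say $lower;       # Hello world
say $text;        # Hello World
say $parts;       # Hell|o W|orld
```

Using C</r> avoids the common copy-then-modify idiom when the original string must be kept.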
+=head2 SYNTAX
+
+ \ Escapes the character immediately following it
+ . Matches any single character except a newline (unless /s is
+ used)
+ ^ Matches at the beginning of the string (or line, if /m is used)
+ $ Matches at the end of the string (or line, if /m is used)
+ * Matches the preceding element 0 or more times
+ + Matches the preceding element 1 or more times
+ ? Matches the preceding element 0 or 1 times
+ {...} Specifies a range of occurrences for the element preceding it
+ [...] Matches any one of the characters contained within the brackets
+ (...) Groups subexpressions for capturing to $1, $2...
+ (?:...) Groups subexpressions without capturing (cluster)
+ | Matches either the subexpression preceding or following it
+ \1, \2, \3 ... Matches the text from the Nth group
+ \g1 or \g{1}, \g2 ... Matches the text from the Nth group
+ \g-1 or \g{-1}, \g-2 ... Matches the text from the Nth previous group
+ \g{name} Named backreference
+ \k<name> Named backreference
+ \k'name' Named backreference
+ (?P=name) Named backreference (python syntax)
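
The grouping and backreference forms in the table above can be compared side by side. The following is an illustrative sketch only; the input strings and variable names are invented for the example, and C<\g{-1}> and C<\k<name>> assume Perl 5.10 or later.

```perl
use strict;
use warnings;
use feature 'say';

# \1 and \g{-1} both refer to the first (and most recent) group here.
my $doubled_abs = ("hello hello" =~ /(\w+) \1/)     ? 1 : 0;
my $doubled_rel = ("hello hello" =~ /(\w+) \g{-1}/) ? 1 : 0;

# (?:...) clusters without capturing, so the only capture is (gz|bz2).
my ($ext) = "archive.tar.gz" =~ /(?:\.tar)?\.(gz|bz2)$/;

# A named group can be referenced with \k<name> (or \g{name}).
my $quoted = '';
if (q{say "hi" now} =~ /(?<q>["'])(.*?)\k<q>/) {
    $quoted = $2;   # group 1 is the quote character, group 2 the content
}

say $doubled_abs;   # 1
say $doubled_rel;   # 1
say $ext;           # gz
say $quoted;        # hi
```

Relative references such as C<\g{-1}> keep working when a pattern is embedded in a larger one, where absolute group numbers would shift.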
=head2 ESCAPE SEQUENCES
@@ -71,18 +89,21 @@ These work as in normal strings.
\n Newline
\r Carriage return
\t Tab
- \038 Any octal ASCII value
+ \037 Any octal ASCII value
\x7f Any hexadecimal ASCII value
\x{263a} A wide hexadecimal value
\cx Control-x
\N{name} A named character
+ \N{U+263D} A Unicode character by hex ordinal
- \l Lowercase until next character
- \u Uppercase until next character
+ \l Lowercase next character
+ \u Titlecase next character
\L Lowercase until \E
\U Uppercase until \E
\Q Disable pattern metacharacters until \E
- \E End case modification
+ \E End modification
+
+For Titlecase, see L</Titlecase>.
This one works differently from normal strings:
@@ -93,46 +114,80 @@ This one works differently from normal strings:
[amy] Match 'a', 'm' or 'y'
[f-j] Dash specifies "range"
[f-j-] Dash escaped or at start or end means 'dash'
- [^f-j] Caret indicates "match char any _except_ these"
-
-The following work within or without a character class:
-
- \d A digit, same as [0-9]
- \D A nondigit, same as [^0-9]
- \w A word character (alphanumeric), same as [a-zA-Z_0-9]
- \W A non-word character, [^a-zA-Z_0-9]
- \s A whitespace character, same as [ \t\n\r\f]
- \S A non-whitespace character, [^ \t\n\r\f]
- \C Match a byte (with Unicode. '.' matches char)
+ [^f-j] Caret indicates "match any character _except_ these"
+
+The following sequences (except C<\N>) work within or without a character class.
+The first six are locale aware, all are Unicode aware. See L<perlrecharclass>
+and L<perlunicode> for details.
+
+ \d A digit
+ \D A nondigit
+ \w A word character
+ \W A non-word character
+ \s A whitespace character
+ \S A non-whitespace character
+ \h A horizontal whitespace
+ \H A non horizontal whitespace
+ \N A non newline (when not followed by '{NAME}'; experimental;
+ not valid in a character class; equivalent to [^\n]; it's
+ like '.' without /s modifier)
+ \v A vertical whitespace
+ \V A non vertical whitespace
+ \R A generic newline (?>\v|\x0D\x0A)
+
+ \C Match a byte (with Unicode, '.' matches a character)
\pP Match P-named (Unicode) property
- \p{...} Match Unicode property with long name
+ \p{...} Match Unicode property with name longer than 1 character
\PP Match non-P
- \P{...} Match lack of Unicode property with long name
- \X Match extended unicode sequence
+ \P{...} Match lack of Unicode property with name longer than 1 char
+ \X Match Unicode extended grapheme cluster
POSIX character classes and their Unicode and Perl equivalents:
- alnum IsAlnum Alphanumeric
- alpha IsAlpha Alphabetic
- ascii IsASCII Any ASCII char
- blank IsSpace [ \t] Horizontal whitespace (GNU)
- cntrl IsCntrl Control characters
- digit IsDigit \d Digits
- graph IsGraph Alphanumeric and punctuation
- lower IsLower Lowercase chars (locale aware)
- print IsPrint Alphanumeric, punct, and space
- punct IsPunct Punctuation
- space IsSpace [\s\ck] Whitespace
- IsSpacePerl \s Perl's whitespace definition
- upper IsUpper Uppercase chars (locale aware)
- word IsWord \w Alphanumeric plus _ (Perl)
- xdigit IsXDigit [\dA-Fa-f] Hexadecimal digit
+           ASCII-           Full-
+           range            range      backslash
+  POSIX    \p{...}          \p{}       sequence  Description
+  -----------------------------------------------------------------------
+  alnum    PosixAlnum       Alnum                Alpha plus Digit
+  alpha    PosixAlpha       Alpha                Alphabetic characters
+  ascii    ASCII                                 Any ASCII character
+  blank    PosixBlank       Blank      \h        Horizontal whitespace;
+                                                 full-range also written
+                                                 as \p{HorizSpace} (GNU
+                                                 extension)
+  cntrl    PosixCntrl       Cntrl                Control characters
+  digit    PosixDigit       Digit      \d        Decimal digits
+  graph    PosixGraph       Graph                Alnum plus Punct
+  lower    PosixLower       Lower                Lowercase characters
+  print    PosixPrint       Print                Graph plus Space, but not
+                                                 any Cntrls
+  punct    PosixPunct       Punct                These aren't precisely
+                                                 equivalent. See NOTE,
+                                                 below.
+  space    PosixSpace       Space      [\s\cK]   Whitespace
+           PerlSpace        SpacePerl  \s        Perl's whitespace
+                                                 definition
+  upper    PosixUpper       Upper                Uppercase characters
+  word     PerlWord         Word       \w        Alnum plus '_' (Perl
+                                                 extension)
+  xdigit   ASCII_Hex_Digit  XDigit               Hexadecimal digit,
+                                                 ASCII-range is
+                                                 [0-9A-Fa-f]
+
+NOTE on C<[[:punct:]]>, C<\p{PosixPunct}> and C<\p{Punct}>:
+In the ASCII range, C<[[:punct:]]> and C<\p{PosixPunct}> match
+C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in
+effect, it could alter the behavior of C<[[:punct:]]>); and C<\p{Punct}>
+matches C<[-!"#%&'()*,./:;?@[\\\]_{}]>. When matching a UTF-8 string,
+C<[[:punct:]]> matches what it does in the ASCII range, plus what
+C<\p{Punct}> matches. C<\p{Punct}> matches anything that isn't a
+control, an alphanumeric, a space, or a symbol.
Within a character class:
- POSIX traditional Unicode
- [:digit:] \d \p{IsDigit}
- [:^digit:] \D \P{IsDigit}
+ POSIX traditional Unicode
+ [:digit:] \d \p{Digit}
+ [:^digit:] \D \P{Digit}
=head2 ANCHORS
@@ -141,81 +196,123 @@ All are zero-width assertions.
^ Match string start (or line, if /m is used)
$ Match string end (or line, if /m is used) or before newline
\b Match word boundary (between \w and \W)
- \B Match except at word boundary
+ \B Match except at word boundary (between \w and \w or \W and \W)
\A Match string start (regardless of /m)
- \Z Match string end (preceding optional newline)
+ \Z Match string end (before optional newline)
\z Match absolute string end
\G Match where previous m//g left off
- \c Suppresses resetting of search position when used with /g.
- Without \c, search pattern is reset to the beginning of the string
+ \K Keep the stuff left of the \K, don't include it in $&
=head2 QUANTIFIERS
-Quantifiers are greedy by default --- match the B<longest> leftmost.
+Quantifiers are greedy by default and match the B<longest> leftmost.
+
+  Maximal Minimal Possessive Allowed range
+  ------- ------- ---------- -------------
+  {n,m}   {n,m}?  {n,m}+     Must occur at least n times
+                             but no more than m times
+  {n,}    {n,}?   {n,}+      Must occur at least n times
+  {n}     {n}?    {n}+       Must occur exactly n times
+  *       *?      *+         0 or more times (same as {0,})
+  +       +?      ++         1 or more times (same as {1,})
+  ?       ??      ?+         0 or 1 time (same as {0,1})
- Maximal Minimal Allowed range
- ------- ------- -------------
- {n,m} {n,m}? Must occur at least n times but no more than m times
- {n,} {n,}? Must occur at least n times
- {n} {n}? Must match exactly n times
- * *? 0 or more times (same as {0,})
- + +? 1 or more times (same as {1,})
- ? ?? 0 or 1 time (same as {0,1})
+The possessive forms (new in Perl 5.10) prevent backtracking: what gets
+matched by a pattern with a possessive quantifier will not be backtracked
+into, even if that causes the whole match to fail.
+
+There is no quantifier C<{,n}>. That's interpreted as a literal string.
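
The difference between the maximal, minimal and possessive columns can be seen on a three-character string. This is a small sketch with made-up variable names, assuming Perl 5.10 or later for the possessive form.

```perl
use strict;
use warnings;
use feature 'say';

# Greedy: a+ first takes all three "a"s, then gives one back so the
# trailing "a" in the pattern can match.
my $greedy = ("aaa" =~ /^a+a$/) ? 1 : 0;

# Possessive: a++ takes all three "a"s and never gives any back,
# so the trailing "a" has nothing left to match and the match fails.
my $possessive = ("aaa" =~ /^a++a$/) ? 1 : 0;

# Minimal: a+? stops as soon as the rest of the pattern is satisfied.
my ($minimal) = "aaa" =~ /^(a+?)/;

say $greedy;       # 1
say $possessive;   # 0
say $minimal;      # a
```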
=head2 EXTENDED CONSTRUCTS
- (?#text) A comment
- (?:...) Cluster without capturing
- (?imxs-imsx:...) Enable/disable option (as per m//)
- (?=...) Zero-width positive lookahead assertion
- (?!...) Zero-width negative lookahead assertion
- (?<...) Zero-width positive lookbehind assertion
- (?<!...) Zero-width negative lookbehind assertion
- (?>...) Grab what we can, prohibit backtracking
- (?{ code }) Embedded code, return value becomes $^R
- (??{ code }) Dynamic regex, return value used as regex
- (?(cond)yes|no) cond being int corresponding to capturing parens
- (?(cond)yes) or a lookaround/eval zero-width assertion
-
-=head1 VARIABLES
+   (?#text)          A comment
+   (?:...)           Groups subexpressions without capturing (cluster)
+   (?pimsx-imsx:...) Enable/disable option (as per m// modifiers)
+   (?=...)           Zero-width positive lookahead assertion
+   (?!...)           Zero-width negative lookahead assertion
+   (?<=...)          Zero-width positive lookbehind assertion
+   (?<!...)          Zero-width negative lookbehind assertion
+   (?>...)           Grab what we can, prohibit backtracking
+   (?|...)           Branch reset
+   (?<name>...)      Named capture
+   (?'name'...)      Named capture
+   (?P<name>...)     Named capture (python syntax)
+   (?{ code })       Embedded code, return value becomes $^R
+   (??{ code })      Dynamic regex, return value used as regex
+   (?N)              Recurse into subpattern number N
+   (?-N), (?+N)      Recurse into Nth previous/next subpattern
+   (?R), (?0)        Recurse at the beginning of the whole pattern
+   (?&name)          Recurse into a named subpattern
+   (?P>name)         Recurse into a named subpattern (python syntax)
+   (?(cond)yes|no)
+   (?(cond)yes)      Conditional expression, where "cond" can be:
+                     (N)       subpattern N has matched something
+                     (<name>)  named subpattern has matched something
+                     ('name')  named subpattern has matched something
+                     (?{code}) code condition
+                     (R)       true if recursing
+                     (RN)      true if recursing into Nth subpattern
+                     (R&name)  true if recursing into named subpattern
+                     (DEFINE)  always false, no no-pattern allowed
+
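
A few of the extended constructs listed above are easier to follow in use. The sketch below is illustrative only: the sample strings and variable names are assumptions, and named captures and recursion assume Perl 5.10 or later.

```perl
use strict;
use warnings;
use feature 'say';

# Lookbehind and lookahead are zero-width: only the digits are captured.
my ($price) = 'cost: $42 USD' =~ /(?<=\$)(\d+)(?= USD)/;

# Named capture, read back through the %+ hash.
my $year = '';
if ('2010-06-15' =~ /(?<year>\d{4})-(?<month>\d\d)/) {
    $year = $+{year};
}

# Recursion: (?1) re-enters group 1, so nested parentheses must balance.
my $balanced = qr/^(\((?:[^()]++|(?1))*\))$/;
my $ok  = ('(a(b)c)' =~ $balanced) ? 1 : 0;
my $bad = ('(a(b)c'  =~ $balanced) ? 1 : 0;

say $price;   # 42
say $year;    # 2010
say $ok;      # 1
say $bad;     # 0
```

The possessive C<[^()]++> inside the recursive pattern keeps a failed match from backtracking through every way of splitting the plain text.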
+=head2 VARIABLES
$_ Default variable for operators to use
- $* Enable multiline matching (deprecated; not in 5.8.1+)
- $& Entire matched string
$` Everything prior to matched string
+ $& Entire matched string
 $' Everything after matched string
-The use of those last three will slow down B<all> regex use
-within your program. Consult L<perlvar> for C<@LAST_MATCH_START>
+ ${^PREMATCH} Everything prior to matched string
+ ${^MATCH} Entire matched string
+ ${^POSTMATCH} Everything after matched string
+
+The use of C<$`>, C<$&> or C<$'> will slow down B<all> regex use
+within your program. Consult L<perlvar> for C<@->
to see equivalent expressions that won't cause slow down.
-See also L<Devel::SawAmpersand>.
+See also L<Devel::SawAmpersand>. Starting with Perl 5.10, you
+can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}>
+and C<${^POSTMATCH}>, but for them to be defined, you have to
+specify the C</p> (preserve) modifier on your regular expression.
$1, $2 ... hold the Xth captured expr
$+ Last parenthesized pattern match
$^N Holds the most recently closed capture
$^R Holds the result of the last (?{...}) expr
- @- Offsets of starts of groups. [0] holds start of whole match
- @+ Offsets of ends of groups. [0] holds end of whole match
+ @- Offsets of starts of groups. $-[0] holds start of whole match
+ @+ Offsets of ends of groups. $+[0] holds end of whole match
+ %+ Named capture buffers
+ %- Named capture buffers, as array refs
-Capture groups are numbered according to their I<opening> paren.
+Captured groups are numbered according to their I<opening> paren.
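
The offset arrays and capture variables listed above can be inspected after a single match. This is a minimal sketch with an invented input string and variable names; C<%+> assumes Perl 5.10 or later.

```perl
use strict;
use warnings;
use feature 'say';

my (@span, $key, $last_paren);
if ("foo=bar" =~ /(?<key>\w+)=(\w+)/) {
    # $-[0]/$+[0] delimit the whole match, $-[2]/$+[2] the second group.
    @span       = ($-[0], $+[0], $-[2], $+[2]);
    $key        = $+{key};   # named capture via %+
    $last_paren = $+;        # last parenthesized group that matched
}

say join ',', @span;   # 0,7,4,7
say $key;              # foo
say $last_paren;       # bar
```

The matched text of group N can be recovered as C<substr($string, $-[N], $+[N] - $-[N])>, which is the kind of equivalent expression that avoids C<$&>.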
-=head1 FUNCTIONS
+=head2 FUNCTIONS
lc Lowercase a string
lcfirst Lowercase first char of a string
uc Uppercase a string
ucfirst Titlecase first char of a string
+
pos Return or set current match position
quotemeta Quote metacharacters
reset Reset ?pattern? status
study Analyze string for optimizing matching
- split Use regex to split a string into parts
+ split Use a regex to split a string into parts
+
+The first four of these are like the escape sequences C<\L>, C<\l>,
+C<\U>, and C<\u>. For Titlecase, see L</Titlecase>.
+
+=head2 TERMINOLOGY
+
+=head3 Titlecase
+
+Unicode concept which most often is equal to uppercase, but for
+certain characters like the German "sharp s" there is a difference.
=head1 AUTHOR
-Iain Truskett.
+Iain Truskett. Updated by the Perl 5 Porters.
This document may be distributed under the same terms as Perl itself.
@@ -253,6 +350,14 @@ L<perlfaq6> for FAQs on regular expressions.
=item *
+L<perlrebackslash> for a reference on backslash sequences.
+
+=item *
+
+L<perlrecharclass> for a reference on character classes.
+
+=item *
+
The L<re> module to alter behaviour and aid
debugging.
@@ -262,13 +367,13 @@ L<perldebug/"Debugging Regular Expressions">
=item *
-L<perluniintro>, L<perlunicode>, L<charnames> and L<locale>
+L<perluniintro>, L<perlunicode>, L<charnames> and L<perllocale>
for details on regexes and internationalisation.
=item *
I<Mastering Regular Expressions> by Jeffrey Friedl
-(F) for a thorough grounding and
+(F) for a thorough grounding and
reference on the topic.
=back
@@ -279,6 +384,9 @@ David P.C. Wollmann,
Richard Soderberg,
Sean M. Burke,
Tom Christiansen,
+Jim Cromie,
and
Jeffrey Goff
for useful advice.
+
+=cut