X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlre.pod;h=02dd2cda5d8260ce0421f39218d9070d90372e7c;hb=a0acbdc36d211b2eba42328df555d9ec49fa4cd4;hp=2db4139c30ddb7095ed43d12eeb14628704d3241;hpb=92d29cee5ff815b05b81b877528e4c77e73881c9;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlre.pod b/pod/perlre.pod
index 2db4139..02dd2cd 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -40,7 +40,7 @@ is, no matter what C<$*> contains, C</s> without C</m> will force
 "^" to match only at the beginning of the string and "$" to match
 only at the end (or just before a newline at the end) of the string.
 Together, as /ms, they let the "." match any character whatsoever,
-while yet allowing "^" and "$" to match, respectively, just after
+while still allowing "^" and "$" to match, respectively, just after
 and just before newlines within the string.
 
 =item x
@@ -169,7 +169,7 @@ You'll need to write something like C<m/\Quser\E\@\Qhost/>.
 In addition, Perl defines the following:
 
     \w	Match a "word" character (alphanumeric plus "_")
-    \W	Match a non-word character
+    \W	Match a non-"word" character
     \s	Match a whitespace character
     \S	Match a non-whitespace character
     \d	Match a digit character
@@ -179,8 +179,9 @@ In addition, Perl defines the following:
     \X	Match eXtended Unicode "combining character sequence",
         equivalent to C<(?:\PM\pM*)>
     \C	Match a single C char (octet) even under utf8.
+	(Currently this does not work correctly.)
 
-A C<\w> matches a single alphanumeric character, not a whole word.
+A C<\w> matches a single alphanumeric character or C<_>, not a whole word.
 Use C<\w+> to match a string of Perl-identifier characters (which isn't 
 the same as matching an English word).  If C<use locale> is in effect, the
 list of alphabetic characters generated by C<\w> is taken from the
@@ -199,38 +200,47 @@ equivalents (if available) are as follows:
     alpha
     alnum
     ascii
+    blank		[1]
     cntrl
     digit       \d
     graph
     lower
     print
     punct
-    space       \s
+    space       \s	[2]
     upper
-    word        \w
+    word        \w	[3]
     xdigit
 
+  [1] A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
+  [2] Not I<exactly equivalent> to C<\s> since the C<[[:space:]]> includes
+      also the (very rare) `vertical tabulator', "\ck", chr(11).
+  [3] A Perl extension. 
+
 For example use C<[:upper:]> to match all the uppercase characters.
-Note that the C<[]> are part of the C<[::]> construct, not part of the whole
-character class.  For example:
+Note that the C<[]> are part of the C<[::]> construct, not part of the
+whole character class.  For example:
 
     [01[:alpha:]%]
 
-matches one, zero, any alphabetic character, and the percentage sign.
+matches zero, one, any alphabetic character, and the percentage sign.
 
 If the C<utf8> pragma is used, the following equivalences to Unicode
-\p{} constructs hold:
+\p{} constructs and equivalent backslash character classes (if available),
+will hold:
 
     alpha       IsAlpha
     alnum       IsAlnum
     ascii       IsASCII
+    blank	IsSpace
     cntrl       IsCntrl
-    digit       IsDigit
+    digit       IsDigit        \d
     graph       IsGraph
     lower       IsLower
     print       IsPrint
     punct       IsPunct
     space       IsSpace
+                IsSpacePerl    \s
     upper       IsUpper
     word        IsWord
     xdigit      IsXDigit
@@ -238,8 +248,8 @@ If the C<utf8> pragma is used, the following equivalences to Unicode
 For example C<[:lower:]> and C<\p{IsLower}> are equivalent.
 
 If the C<utf8> pragma is not used but the C<locale> pragma is, the
-classes correlate with the isalpha(3) interface (except for `word',
-which is a Perl extension, mirroring C<\w>).
+classes correlate with the usual isalpha(3) interface (except for
+`word' and `blank').
 
 The assumedly non-obviously named classes are:
 
@@ -250,23 +260,24 @@ The assumedly non-obviously named classes are:
 Any control character.  Usually characters that don't produce output as
 such but instead control the terminal somehow: for example newline and
 backspace are control characters.  All characters with ord() less than
-32 are most often classified as control characters.
+32 are most often classified as control characters (assuming ASCII,
+the ISO Latin character sets, and Unicode).
 
 =item graph
 
-Any alphanumeric or punctuation character.
+Any alphanumeric or punctuation (special) character.
 
 =item print
 
-Any alphanumeric or punctuation character or space.
+Any alphanumeric or punctuation (special) character or space.
 
 =item punct
 
-Any punctuation character.
+Any punctuation (special) character.
 
 =item xdigit
 
-Any hexadecimal digit.  Though this may feel silly (/0-9a-f/i would
+Any hexadecimal digit.  Though this may feel silly ([0-9A-Fa-f] would
 work just fine) it is included for completeness.
 
 =back
@@ -323,12 +334,14 @@ I<backreference>.
 
 There is no limit to the number of captured substrings that you may
 use.  However Perl also uses \10, \11, etc. as aliases for \010,
-\011, etc.  (Recall that 0 means octal, so \011 is the 9'th ASCII
-character, a tab.)  Perl resolves this ambiguity by interpreting
-\10 as a backreference only if at least 10 left parentheses have
-opened before it.  Likewise \11 is a backreference only if at least
-11 left parentheses have opened before it.  And so on.  \1 through
-\9 are always interpreted as backreferences."
+\011, etc.  (Recall that 0 means octal, so \011 is the character at
+number 9 in your coded character set; which would be the 10th character,
+a horizontal tab under ASCII.)  Perl resolves this 
+ambiguity by interpreting \10 as a backreference only if at least 10 
+left parentheses have opened before it.  Likewise \11 is a 
+backreference only if at least 11 left parentheses have opened 
+before it.  And so on.  \1 through \9 are always interpreted as 
+backreferences.
 
 Examples:
 
@@ -352,7 +365,7 @@ everything before the matched string.  And C<$'> returns everything
 after the matched string.
 
 The numbered variables ($1, $2, $3, etc.) and the related punctuation
-set (C<<$+>, C<$&>, C<$`>, and C<$'>) are all dynamically scoped
+set (C<$+>, C<$&>, C<$`>, and C<$'>) are all dynamically scoped
 until the end of the enclosing block or until the next successful
 match, whichever comes first.  (See L<perlsyn/"Compound Statements">.)
 
@@ -377,10 +390,11 @@ that looks like \\, \(, \), \<, \>, \{, or \} is always
 interpreted as a literal character, not a metacharacter.  This was
 once used in a common idiom to disable or quote the special meanings
 of regular expression metacharacters in a string that you want to
-use for a pattern. Simply quote all non-alphanumeric characters:
+use for a pattern. Simply quote all non-"word" characters:
 
     $pattern =~ s/(\W)/\\$1/g;
 
+(If C<use locale> is set, then this depends on the current locale.)
 Today it is more common to use the quotemeta() function or the C<\Q>
 metaquoting escape sequence to disable all metacharacters' special
 meanings like this:
@@ -901,10 +915,14 @@ ways they can use backtracking to try match.  For example, without
 internal optimizations done by the regular expression engine, this will
 take a painfully long time to run:
 
-    'aaaaaaaaaaaa' =~ /((a{0,5}){0,5}){0,5}[c]/
+    'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/
 
-And if you used C<*>'s instead of limiting it to 0 through 5 matches,
-then it would take forever--or until you ran out of stack space.
+And if you used C<*>'s in the internal groups instead of limiting them
+to 0 through 5 matches, then it would take forever--or until you ran
+out of stack space.  Moreover, these internal optimizations are not
+always applicable.  For example, if you put C<{0,5}> instead of C<*>
+on the external group, no current optimization is applicable, and the
+match takes a long time to finish.
 
 A powerful tool for optimizing such beasts is what is known as an
 "independent group",
@@ -939,10 +957,10 @@ escape it with a backslash.  "-" is also taken literally when it is
 at the end of the list, just before the closing "]".  (The
 following all specify the same class of three characters: C<[-az]>,
 C<[az-]>, and C<[a\-z]>.  All are different from C<[a-z]>, which
-specifies a class containing twenty-six characters.)
-Also, if you try to use the character classes C<\w>, C<\W>, C<\s>,
-C<\S>, C<\d>, or C<\D> as endpoints of a range, that's not a range,
-the "-" is understood literally.
+specifies a class containing twenty-six characters, even on EBCDIC
+based coded character sets.)  Also, if you try to use the character 
+classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of 
+a range, that's not a range, the "-" is understood literally.
 
 Note also that the whole range idea is rather unportable between
 character sets--and even within character sets they may cause results
@@ -954,11 +972,11 @@ spell out the character sets in full.
 Characters may be specified using a metacharacter syntax much like that
 used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
 "\f" a form feed, etc.  More generally, \I<nnn>, where I<nnn> is a string
-of octal digits, matches the character whose ASCII value is I<nnn>.
-Similarly, \xI<nn>, where I<nn> are hexadecimal digits, matches the
-character whose ASCII value is I<nn>. The expression \cI<x> matches the
-ASCII character control-I<x>.  Finally, the "." metacharacter matches any
-character except "\n" (unless you use C</s>).
+of octal digits, matches the character whose coded character set value 
+is I<nnn>.  Similarly, \xI<nn>, where I<nn> are hexadecimal digits, 
+matches the character whose numeric value is I<nn>. The expression \cI<x> 
+matches the character control-I<x>.  Finally, the "." metacharacter 
+matches any character except "\n" (unless you use C</s>).
 
 You can specify a series of alternatives for a pattern using "|" to
 separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
@@ -1080,7 +1098,7 @@ For example:
     $_ = 'bar';
     s/\w??/<$&>/g;
 
-results in C<"<><b><><a><><r><>">.  At each position of the string the best
+results in C<< <><b><><a><><r><> >>.  At each position of the string the best
 match given by non-greedy C<??> is the zero-length match, and the I<second 
 best> match is what is matched by C<\w>.  Thus zero-length matches
 alternate with one-character-long matches.
@@ -1120,7 +1138,7 @@ one match at a given position is possible.  This section describes the
 notion of better/worse for combining operators.  In the description
 below C<S> and C<T> are regular subexpressions.
 
-=over
+=over 4
 
 =item C<ST>
 
@@ -1262,5 +1280,7 @@ L<perlfunc/pos>.
 
 L<perllocale>.
 
+L<perlebcdic>.
+
 I<Mastering Regular Expressions> by Jeffrey Friedl, published
 by O'Reilly and Associates.