Tick off Unicode collation and the normalization from

diff --git a/pod/perlre.pod b/pod/perlre.pod
index 2db4139..874fed4 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -4,10 +4,16 @@ perlre - Perl regular expressions
 
 =head1 DESCRIPTION
 
-This page describes the syntax of regular expressions in Perl.  For a
-description of how to I<use> regular expressions in matching
-operations, plus various examples of the same, see discussions
-of C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like Operators">.
+This page describes the syntax of regular expressions in Perl.  
+
+If you haven't used regular expressions before, a quick-start
+introduction is available in L<perlrequick>, and a longer tutorial
+introduction is available in L<perlretut>.
+
+For reference on how regular expressions are used in matching
+operations, plus various examples of the same, see discussions of
+C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
+Operators">.
 
 Matching operations can have various modifiers.  Modifiers
 that relate to the interpretation of the regular expression inside
@@ -40,7 +46,7 @@ is, no matter what C<$*> contains, C</s> without C</m> will force
 "^" to match only at the beginning of the string and "$" to match
 only at the end (or just before a newline at the end) of the string.
 Together, as /ms, they let the "." match any character whatsoever,
-while yet allowing "^" and "$" to match, respectively, just after
+while still allowing "^" and "$" to match, respectively, just after
 and just before newlines within the string.
 
 =item x
@@ -169,7 +175,7 @@ You'll need to write something like C<m/\Quser\E\@\Qhost/>.
 In addition, Perl defines the following:
 
     \w Match a "word" character (alphanumeric plus "_")
-    \W Match a non-word character
+    \W Match a non-"word" character
     \s Match a whitespace character
     \S Match a non-whitespace character
     \d Match a digit character
@@ -180,7 +186,7 @@ In addition, Perl defines the following:
         equivalent to C<(?:\PM\pM*)>
     \C Match a single C char (octet) even under utf8.
 
-A C<\w> matches a single alphanumeric character, not a whole word.
+A C<\w> matches a single alphanumeric character or C<_>, not a whole word.
 Use C<\w+> to match a string of Perl-identifier characters (which isn't 
 the same as matching an English word).  If C<use locale> is in effect, the
 list of alphabetic characters generated by C<\w> is taken from the
@@ -199,38 +205,47 @@ equivalents (if available) are as follows:
     alpha
     alnum
     ascii
+    blank              [1]
     cntrl
     digit       \d
     graph
     lower
     print
     punct
-    space       \s
+    space       \s     [2]
     upper
-    word        \w
+    word        \w     [3]
     xdigit
 
+  [1] A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
+  [2] Not I<exactly equivalent> to C<\s> since C<[[:space:]]> also
+      includes the (very rare) `vertical tabulator', "\cK", chr(11).
+  [3] A Perl extension. 
+
 For example use C<[:upper:]> to match all the uppercase characters.
-Note that the C<[]> are part of the C<[::]> construct, not part of the whole
-character class.  For example:
+Note that the C<[]> are part of the C<[::]> construct, not part of the
+whole character class.  For example:
 
     [01[:alpha:]%]
 
-matches one, zero, any alphabetic character, and the percentage sign.
+matches zero, one, any alphabetic character, and the percentage sign.
 
 If the C<utf8> pragma is used, the following equivalences to Unicode
-\p{} constructs hold:
+\p{} constructs, and to the equivalent backslash character classes (if
+available), hold:
 
     alpha       IsAlpha
     alnum       IsAlnum
     ascii       IsASCII
+    blank       IsSpace
     cntrl       IsCntrl
-    digit       IsDigit
+    digit       IsDigit        \d
     graph       IsGraph
     lower       IsLower
     print       IsPrint
     punct       IsPunct
     space       IsSpace
+                IsSpacePerl    \s
     upper       IsUpper
     word        IsWord
     xdigit      IsXDigit
@@ -238,8 +253,8 @@ If the C<utf8> pragma is used, the following equivalences to Unicode
 For example C<[:lower:]> and C<\p{IsLower}> are equivalent.
 
 If the C<utf8> pragma is not used but the C<locale> pragma is, the
-classes correlate with the isalpha(3) interface (except for `word',
-which is a Perl extension, mirroring C<\w>).
+classes correlate with the usual isalpha(3) interface (except for
+`word' and `blank').
 
 The assumedly non-obviously named classes are:
 
@@ -250,23 +265,25 @@ The assumedly non-obviously named classes are:
 Any control character.  Usually characters that don't produce output as
 such but instead control the terminal somehow: for example newline and
 backspace are control characters.  All characters with ord() less than
-32 are most often classified as control characters.
+32 are most often classified as control characters (assuming ASCII,
+the ISO Latin character sets, and Unicode), as is the character with
+the ord() value of 127 (C<DEL>).
 
 =item graph
 
-Any alphanumeric or punctuation character.
+Any alphanumeric or punctuation (special) character.
 
 =item print
 
-Any alphanumeric or punctuation character or space.
+Any alphanumeric or punctuation (special) character or the space character.
 
 =item punct
 
-Any punctuation character.
+Any punctuation (special) character.
 
 =item xdigit
 
-Any hexadecimal digit.  Though this may feel silly (/0-9a-f/i would
+Any hexadecimal digit.  Though this may feel silly ([0-9A-Fa-f] would
 work just fine) it is included for completeness.
 
 =back
@@ -323,12 +340,14 @@ I<backreference>.
 
 There is no limit to the number of captured substrings that you may
 use.  However Perl also uses \10, \11, etc. as aliases for \010,
-\011, etc.  (Recall that 0 means octal, so \011 is the 9'th ASCII
-character, a tab.)  Perl resolves this ambiguity by interpreting
-\10 as a backreference only if at least 10 left parentheses have
-opened before it.  Likewise \11 is a backreference only if at least
-11 left parentheses have opened before it.  And so on.  \1 through
-\9 are always interpreted as backreferences."
+\011, etc.  (Recall that 0 means octal, so \011 is the character at
+number 9 in your coded character set, which would be the 10th character,
+a horizontal tab under ASCII.)  Perl resolves this 
+ambiguity by interpreting \10 as a backreference only if at least 10 
+left parentheses have opened before it.  Likewise \11 is a 
+backreference only if at least 11 left parentheses have opened 
+before it.  And so on.  \1 through \9 are always interpreted as 
+backreferences.
 
 Examples:
 
@@ -352,7 +371,7 @@ everything before the matched string.  And C<$'> returns everything
 after the matched string.
 
 The numbered variables ($1, $2, $3, etc.) and the related punctuation
-set (C<<$+>, C<$&>, C<$`>, and C<$'>) are all dynamically scoped
+set (C<$+>, C<$&>, C<$`>, and C<$'>) are all dynamically scoped
 until the end of the enclosing block or until the next successful
 match, whichever comes first.  (See L<perlsyn/"Compound Statements">.)
 
@@ -377,10 +396,11 @@ that looks like \\, \(, \), \<, \>, \{, or \} is always
 interpreted as a literal character, not a metacharacter.  This was
 once used in a common idiom to disable or quote the special meanings
 of regular expression metacharacters in a string that you want to
-use for a pattern. Simply quote all non-alphanumeric characters:
+use for a pattern. Simply quote all non-"word" characters:
 
     $pattern =~ s/(\W)/\\$1/g;
 
+(If C<use locale> is set, then this depends on the current locale.)
 Today it is more common to use the quotemeta() function or the C<\Q>
 metaquoting escape sequence to disable all metacharacters' special
 meanings like this:
@@ -663,7 +683,7 @@ this yourself would be a productive exercise), but finishes in a fourth
 the time when used on a similar string with 1000000 C<a>s.  Be aware,
 however, that this pattern currently triggers a warning message under
 the C<use warnings> pragma or B<-w> switch saying it
-C<"matches the null string many times">):
+C<"matches null string many times in regex">.
 
 On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
 effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
@@ -771,7 +791,7 @@ and the first "bar" thereafter.
   got <d is under the >
 
 Here's another example: let's say you'd like to match a number at the end
-of a string, and you also want to keep the preceding part the match.
+of a string, and you also want to keep the preceding part of the match.
 So you write this:
 
     $_ = "I have 2 numbers: 53147";
@@ -837,7 +857,7 @@ followed by "123".  You might try to write that as
 
 But that isn't going to match; at least, not the way you're hoping.  It
 claims that there is no 123 in the string.  Here's a clearer picture of
-why it that pattern matches, contrary to popular expectations:
+why that pattern matches, contrary to popular expectations:
 
     $x = 'ABC123' ;
     $y = 'ABC445' ;
@@ -901,10 +921,14 @@ ways they can use backtracking to try match.  For example, without
 internal optimizations done by the regular expression engine, this will
 take a painfully long time to run:
 
-    'aaaaaaaaaaaa' =~ /((a{0,5}){0,5}){0,5}[c]/
+    'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/
 
-And if you used C<*>'s instead of limiting it to 0 through 5 matches,
-then it would take forever--or until you ran out of stack space.
+And if you used C<*>'s in the internal groups instead of limiting them
+to 0 through 5 matches, then it would take forever--or until you ran
+out of stack space.  Moreover, these internal optimizations are not
+always applicable.  For example, if you put C<{0,5}> instead of C<*>
+on the external group, no current optimization is applicable, and the
+match takes a long time to finish.
 
 A powerful tool for optimizing such beasts is what is known as an
 "independent group",
@@ -939,10 +963,10 @@ escape it with a backslash.  "-" is also taken literally when it is
 at the end of the list, just before the closing "]".  (The
 following all specify the same class of three characters: C<[-az]>,
 C<[az-]>, and C<[a\-z]>.  All are different from C<[a-z]>, which
-specifies a class containing twenty-six characters.)
-Also, if you try to use the character classes C<\w>, C<\W>, C<\s>,
-C<\S>, C<\d>, or C<\D> as endpoints of a range, that's not a range,
-the "-" is understood literally.
+specifies a class containing twenty-six characters, even on EBCDIC
+based coded character sets.)  Also, if you try to use the character 
+classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of 
+a range, that's not a range; the "-" is understood literally.
 
 Note also that the whole range idea is rather unportable between
 character sets--and even within character sets they may cause results
@@ -954,11 +978,11 @@ spell out the character sets in full.
 Characters may be specified using a metacharacter syntax much like that
 used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
 "\f" a form feed, etc.  More generally, \I<nnn>, where I<nnn> is a string
-of octal digits, matches the character whose ASCII value is I<nnn>.
-Similarly, \xI<nn>, where I<nn> are hexadecimal digits, matches the
-character whose ASCII value is I<nn>. The expression \cI<x> matches the
-ASCII character control-I<x>.  Finally, the "." metacharacter matches any
-character except "\n" (unless you use C</s>).
+of octal digits, matches the character whose coded character set value 
+is I<nnn>.  Similarly, \xI<nn>, where I<nn> are hexadecimal digits, 
+matches the character whose numeric value is I<nn>. The expression \cI<x> 
+matches the character control-I<x>.  Finally, the "." metacharacter 
+matches any character except "\n" (unless you use C</s>).
 
 You can specify a series of alternatives for a pattern using "|" to
 separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
@@ -1080,7 +1104,7 @@ For example:
     $_ = 'bar';
     s/\w??/<$&>/g;
 
-results in C<"<><b><><a><><r><>">.  At each position of the string the best
+results in C<< <><b><><a><><r><> >>.  At each position of the string the best
 match given by non-greedy C<??> is the zero-length match, and the I<second 
 best> match is what is matched by C<\w>.  Thus zero-length matches
 alternate with one-character-long matches.
@@ -1120,7 +1144,7 @@ one match at a given position is possible.  This section describes the
 notion of better/worse for combining operators.  In the description
 below C<S> and C<T> are regular subexpressions.
 
-=over
+=over 4
 
 =item C<ST>
 
@@ -1252,6 +1276,10 @@ from the reference content.
 
 =head1 SEE ALSO
 
+L<perlrequick>.
+
+L<perlretut>.
+
 L<perlop/"Regexp Quote-Like Operators">.
 
 L<perlop/"Gory details of parsing quoted constructs">.
@@ -1262,5 +1290,7 @@ L<perlfunc/pos>.
 
 L<perllocale>.
 
+L<perlebcdic>.
+
 I<Mastering Regular Expressions> by Jeffrey Friedl, published
 by O'Reilly and Associates.