Typo squad.

[p5sagit/p5-mst-13.2.git] / pod / perlre.pod
diff --git a/pod/perlre.pod b/pod/perlre.pod

index 02dd2cd..c0d4e89 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -4,10 +4,16 @@ perlre - Perl regular expressions
 
 =head1 DESCRIPTION
 
-This page describes the syntax of regular expressions in Perl.  For a
-description of how to I<use> regular expressions in matching
-operations, plus various examples of the same, see discussions
-of C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like Operators">.
+This page describes the syntax of regular expressions in Perl.  
+
+If you haven't used regular expressions before, a quick-start
+introduction is available in L<perlrequick>, and a longer tutorial
+introduction is available in L<perlretut>.
+
+For reference on how regular expressions are used in matching
+operations, plus various examples of the same, see discussions of
+C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
+Operators">.
 
 Matching operations can have various modifiers.  Modifiers
 that relate to the interpretation of the regular expression inside
@@ -177,18 +183,23 @@ In addition, Perl defines the following:
     \pP        Match P, named property.  Use \p{Prop} for longer names.
     \PP        Match non-P
     \X Match eXtended Unicode "combining character sequence",
-        equivalent to C<(?:\PM\pM*)>
-    \C Match a single C char (octet) even under utf8.
-       (Currently this does not work correctly.)
-
-A C<\w> matches a single alphanumeric character or C<_>, not a whole word.
-Use C<\w+> to match a string of Perl-identifier characters (which isn't 
-the same as matching an English word).  If C<use locale> is in effect, the
-list of alphabetic characters generated by C<\w> is taken from the
-current locale.  See L<perllocale>.  You may use C<\w>, C<\W>, C<\s>, C<\S>,
+        equivalent to (?:\PM\pM*)
+    \C Match a single C char (octet) even under Unicode.
+       NOTE: breaks up characters into their UTF-8 bytes,
+       so you may end up with malformed pieces of UTF-8.
+
+A C<\w> matches a single alphanumeric character (an alphabetic
+character, or a decimal digit) or C<_>, not a whole word.  Use C<\w+>
+to match a string of Perl-identifier characters (which isn't the same
+as matching an English word).  If C<use locale> is in effect, the list
+of alphabetic characters generated by C<\w> is taken from the current
+locale.  See L<perllocale>.  You may use C<\w>, C<\W>, C<\s>, C<\S>,
 C<\d>, and C<\D> within character classes, but if you try to use them
-as endpoints of a range, that's not a range, the "-" is understood literally.
-See L<utf8> for details about C<\pP>, C<\PP>, and C<\X>.
+as endpoints of a range, that's not a range, the "-" is understood
+literally.  If Unicode is in effect, C<\s> also matches "\x{85}",
+"\x{2028}", and "\x{2029}".  See L<perlunicode> for more details about
+C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in general.
+You can define your own C<\p> and C<\P> properties; see L<perlunicode>.
 
 The POSIX character class syntax
 
@@ -212,10 +223,22 @@ equivalents (if available) are as follows:
     word        \w     [3]
     xdigit
 
-  [1] A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
-  [2] Not I<exactly equivalent> to C<\s> since the C<[[:space:]]> includes
-      also the (very rare) `vertical tabulator', "\ck", chr(11).
-  [3] A Perl extension. 
+=over
+
+=item [1]
+
+A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
+
+=item [2]
+
+Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
+also the (very rare) `vertical tabulator', "\ck", chr(11).
+
+=item [3]
+
+A Perl extension, see above.
+
+=back
 
 For example use C<[:upper:]> to match all the uppercase characters.
 Note that the C<[]> are part of the C<[::]> construct, not part of the
@@ -225,9 +248,10 @@ whole character class.  For example:
 
 matches zero, one, any alphabetic character, and the percentage sign.
 
-If the C<utf8> pragma is used, the following equivalences to Unicode
-\p{} constructs and equivalent backslash character classes (if available),
-will hold:
+The following equivalences to Unicode \p{} constructs and equivalent
+backslash character classes (if available), will hold:
+
+    [:...:]    \p{...}         backslash
 
     alpha       IsAlpha
     alnum       IsAlnum
@@ -261,7 +285,8 @@ Any control character.  Usually characters that don't produce output as
 such but instead control the terminal somehow: for example newline and
 backspace are control characters.  All characters with ord() less than
 32 are most often classified as control characters (assuming ASCII,
-the ISO Latin character sets, and Unicode).
+the ISO Latin character sets, and Unicode), as is the character with
+the ord() value of 127 (C<DEL>).
 
 =item graph
 
@@ -269,7 +294,7 @@ Any alphanumeric or punctuation (special) character.
 
 =item print
 
-Any alphanumeric or punctuation (special) character or space.
+Any alphanumeric or punctuation (special) character or the space character.
 
 =item punct
 
@@ -285,14 +310,16 @@ work just fine) it is included for completeness.
 You can negate the [::] character classes by prefixing the class name
 with a '^'. This is a Perl extension.  For example:
 
-    POSIX      trad. Perl  utf8 Perl
+    POSIX      traditional Unicode
 
     [:^digit:]      \D      \P{IsDigit}
     [:^space:]     \S      \P{IsSpace}
     [:^word:]      \W      \P{IsWord}
 
-The POSIX character classes [.cc.] and [=cc=] are recognized but
-B<not> supported and trying to use them will cause an error.
+Perl respects the POSIX standard in that POSIX character classes are
+only supported within a character class.  The POSIX character classes
+[.cc.] and [=cc=] are recognized but B<not> supported and trying to
+use them will cause an error.
 
 Perl defines the following zero-width assertions:
 
@@ -322,7 +349,11 @@ It is also useful when writing C<lex>-like scanners, when you have
 several patterns that you want to match against consequent substrings
 of your string, see the previous reference.  The actual location
 where C<\G> will match can also be influenced by using C<pos()> as
-an lvalue.  See L<perlfunc/pos>.
+an lvalue: see L<perlfunc/pos>. Currently C<\G> is only fully
+supported when anchored to the start of the pattern; while it
+is permitted to use it elsewhere, as in C</(?<=\G..)./g>, some
+such uses (C</.\G/g>, for example) currently cause problems, and
+it is recommended that you avoid such usage for now.
 
 The bracketing construct C<( ... )> creates capture buffers.  To
 refer to the digit'th buffer use \<digit> within the
@@ -437,12 +468,14 @@ C<)> in the comment.
 
 =item C<(?imsx-imsx)>
 
-One or more embedded pattern-match modifiers.  This is particularly
-useful for dynamic patterns, such as those read in from a configuration
-file, read in as an argument, are specified in a table somewhere,
-etc.  Consider the case that some of which want to be case sensitive
-and some do not.  The case insensitive ones need to include merely
-C<(?i)> at the front of the pattern.  For example:
+One or more embedded pattern-match modifiers, to be turned on (or
+turned off, if preceded by C<->) for the remainder of the pattern or
+the remainder of the enclosing pattern group (if any).  This is
+particularly useful for dynamic patterns, such as those read in from a
+configuration file, taken as an argument, or specified in a table
+somewhere.  Consider the case where some patterns need to be case
+sensitive and some do not.  The case insensitive ones merely need to
+include C<(?i)> at the front of the pattern.  For example:
 
     $pattern = "foobar";
     if ( /$pattern/i ) { } 
@@ -452,8 +485,7 @@ C<(?i)> at the front of the pattern.  For example:
     $pattern = "(?i)foobar";
     if ( /$pattern/ ) { } 
 
-Letters after a C<-> turn those modifiers off.  These modifiers are
-localized inside an enclosing group (if any).  For example,
+These modifiers are restored at the end of the enclosing group. For example,
 
     ( (?i) blah ) \s+ \1
 
@@ -677,7 +709,7 @@ this yourself would be a productive exercise), but finishes in a fourth
 the time when used on a similar string with 1000000 C<a>s.  Be aware,
 however, that this pattern currently triggers a warning message under
 the C<use warnings> pragma or B<-w> switch saying it
-C<"matches the null string many times">):
+C<"matches null string many times in regex">.
 
 On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
 effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
@@ -785,7 +817,7 @@ and the first "bar" thereafter.
   got <d is under the >
 
 Here's another example: let's say you'd like to match a number at the end
-of a string, and you also want to keep the preceding part the match.
+of a string, and you also want to keep the preceding part of the match.
 So you write this:
 
     $_ = "I have 2 numbers: 53147";
@@ -851,7 +883,7 @@ followed by "123".  You might try to write that as
 
 But that isn't going to match; at least, not the way you're hoping.  It
 claims that there is no 123 in the string.  Here's a clearer picture of
-why it that pattern matches, contrary to popular expectations:
+why that pattern matches, contrary to popular expectations:
 
     $x = 'ABC123' ;
     $y = 'ABC445' ;
@@ -1018,7 +1050,7 @@ Some people get too used to writing things like:
 
 This is grandfathered for the RHS of a substitute to avoid shocking the
 B<sed> addicts, but it's a dirty habit to get into.  That's because in
-PerlThink, the righthand side of a C<s///> is a double-quoted string.  C<\1> in
+PerlThink, the righthand side of an C<s///> is a double-quoted string.  C<\1> in
 the usual double-quoted string means a control-A.  The customary Unix
 meaning of C<\1> is kludged in for C<s///>.  However, if you get into the habit
 of doing that, you get yourself into trouble if you then add an C</e>
@@ -1270,6 +1302,10 @@ from the reference content.
 
 =head1 SEE ALSO
 
+L<perlrequick>.
+
+L<perlretut>.
+
 L<perlop/"Regexp Quote-Like Operators">.
 
 L<perlop/"Gory details of parsing quoted constructs">.