X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlre.pod;h=58cd6456f5a3b4e3e9c30c5677c2ec8b5b5f8642;hb=c9436a12b1ee8d5e32d19b5870c63a8435afed9d;hp=0c38ac7cba692057b43f475d15f99a3292783cf7;hpb=847a5fae45dac396d0f9e1bb61d5b4ff9d94cdcd;p=p5sagit%2Fp5-mst-13.2.git
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 0c38ac7..58cd645 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -4,10 +4,16 @@ perlre - Perl regular expressions
 
 =head1 DESCRIPTION
 
-This page describes the syntax of regular expressions in Perl. For a
-description of how to I<use> regular expressions in matching
-operations, plus various examples of the same, see discussions
-of C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like Operators">.
+This page describes the syntax of regular expressions in Perl.
+
+if you haven't used regular expressions before, a quick-start
+introduction is available in L<perlrequick>, and a longer tutorial
+introduction is available in L<perlretut>.
+
+For reference on how regular expressions are used in matching
+operations, plus various examples of the same, see discussions of
+C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like
+Operators">.
 
 Matching operations can have various modifiers. Modifiers
 that relate to the interpretation of the regular expression inside
@@ -40,7 +46,7 @@ is, no matter what C<$*> contains, C</s> without C</m> will force
 "^" to match only at the beginning of the string and "$" to match
 only at the end (or just before a newline at the end) of the string.
 Together, as /ms, they let the "." match any character whatsoever,
-while yet allowing "^" and "$" to match, respectively, just after
+while still allowing "^" and "$" to match, respectively, just after
 and just before newlines within the string.
 
 =item x
@@ -177,18 +183,23 @@ In addition, Perl defines the following:
     \pP  Match P, named property. Use \p{Prop} for longer names.
     \PP  Match non-P
     \X   Match eXtended Unicode "combining character sequence",
-         equivalent to C<(?:\PM\pM*)>
-    \C   Match a single C char (octet) even under utf8.
-         (Currently this does not work correctly.)
-
-A C<\w> matches a single alphanumeric character or C<_>, not a whole word.
-Use C<\w+> to match a string of Perl-identifier characters (which isn't
-the same as matching an English word). If C<use locale> is in effect, the
-list of alphabetic characters generated by C<\w> is taken from the
-current locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
+         equivalent to (?:\PM\pM*)
+    \C   Match a single C char (octet) even under Unicode.
+         NOTE: breaks up characters into their UTF-8 bytes,
+         so you may end up with malformed pieces of UTF-8.
+
+A C<\w> matches a single alphanumeric character (an alphabetic
+character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
+to match a string of Perl-identifier characters (which isn't the same
+as matching an English word). If C<use locale> is in effect, the list
+of alphabetic characters generated by C<\w> is taken from the current
+locale. See L<perllocale>. You may use C<\w>, C<\W>, C<\s>, C<\S>,
 C<\d>, and C<\D> within character classes, but if you try to use them
-as endpoints of a range, that's not a range, the "-" is understood literally.
-See L<utf8> for details about C<\pP>, C<\PP>, and C<\X>.
+as endpoints of a range, that's not a range, the "-" is understood
+literally. If Unicode is in effect, C<\s> matches also "\x{85}",
+"\x{2028}, and "\x{2029}", see L<perlunicode> for more details about
+C<\pP>, C<\PP>, and C<\X>, and L<perluniintro> about Unicode in
+general.
 
 The POSIX character class syntax
 
@@ -212,10 +223,22 @@ equivalents (if available) are as follows:
     word        \w  [3]
     xdigit
 
-    [1] A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
-    [2] Not I<exactly equivalent> to C<\s> since the C<[[:space:]]> includes
-        also the (very rare) `vertical tabulator', "\ck", chr(11).
-    [3] A Perl extension.
+=over
+
+=item [1]
+
+A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
+
+=item [2]
+
+Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
+also the (very rare) `vertical tabulator', "\ck", chr(11).
+
+=item [3]
+
+A Perl extension, see above.
+
+=back
 
 For example use C<[:upper:]> to match all the uppercase characters.
 Note that the C<[]> are part of the C<[::]> construct, not part of the
@@ -225,9 +248,10 @@ whole character class. For example:
 
 matches zero, one, any alphabetic character, and the percentage sign.
 
-If the C<use utf8> pragma is used, the following equivalences to Unicode
-\p{} constructs and equivalent backslash character classes (if available),
-will hold:
+The following equivalences to Unicode \p{} constructs and equivalent
+backslash character classes (if available), will hold:
+
+    [:...:]     \p{...}         backslash
 
     alpha       IsAlpha
     alnum       IsAlnum
@@ -261,7 +285,8 @@ Any control character. Usually characters that don't produce output as
 such but instead control the terminal somehow: for example newline and
 backspace are control characters. All characters with ord() less than
 32 are most often classified as control characters (assuming ASCII,
-the ISO Latin character sets, and Unicode).
+the ISO Latin character sets, and Unicode), as is the character with
+the ord() value of 127 (C<DEL>).
 
 =item graph
 
@@ -269,7 +294,7 @@ Any alphanumeric or punctuation (special) character.
 
 =item print
 
-Any alphanumeric or punctuation (special) character or space.
+Any alphanumeric or punctuation (special) character or the space character.
 
 =item punct
 
@@ -285,7 +310,7 @@ work just fine) it is included for completeness.
 You can negate the [::] character classes by prefixing the class name
 with a '^'. This is a Perl extension. For example:
 
-    POSIX         trad. Perl  utf8 Perl
+    POSIX         traditional Unicode
 
     [:^digit:]      \D      \P{IsDigit}
     [:^space:]      \S      \P{IsSpace}
@@ -334,12 +359,14 @@ I<backreference>.
 
 There is no limit to the number of captured substrings that you may
 use. However Perl also uses \10, \11, etc. as aliases for \010,
-\011, etc. (Recall that 0 means octal, so \011 is the 9'th ASCII
-character, a tab.) Perl resolves this ambiguity by interpreting
-\10 as a backreference only if at least 10 left parentheses have
-opened before it. Likewise \11 is a backreference only if at least
-11 left parentheses have opened before it. And so on. \1 through
-\9 are always interpreted as backreferences."
+\011, etc. (Recall that 0 means octal, so \011 is the character at
+number 9 in your coded character set; which would be the 10th character,
+a horizontal tab under ASCII.) Perl resolves this
+ambiguity by interpreting \10 as a backreference only if at least 10
+left parentheses have opened before it. Likewise \11 is a
+backreference only if at least 11 left parentheses have opened
+before it. And so on. \1 through \9 are always interpreted as
+backreferences.
 
 Examples:
 
@@ -435,12 +462,14 @@ C<)> in the comment.
 
 =item C<(?imsx-imsx)>
 
-One or more embedded pattern-match modifiers. This is particularly
-useful for dynamic patterns, such as those read in from a configuration
-file, read in as an argument, are specified in a table somewhere,
-etc. Consider the case that some of which want to be case sensitive
-and some do not. The case insensitive ones need to include merely
-C<(?i)> at the front of the pattern. For example:
+One or more embedded pattern-match modifiers, to be turned on (or
+turned off, if preceded by C<->) for the remainder of the pattern or
+the remainder of the enclosing pattern group (if any). This is
+particularly useful for dynamic patterns, such as those read in from a
+configuration file, read in as an argument, are specified in a table
+somewhere, etc. Consider the case that some of which want to be case
+sensitive and some do not. The case insensitive ones need to include
+merely C<(?i)> at the front of the pattern. For example:
 
     $pattern = "foobar";
     if ( /$pattern/i ) { }
@@ -450,8 +479,7 @@ C<(?i)> at the front of the pattern. For example:
     $pattern = "(?i)foobar";
     if ( /$pattern/ ) { }
 
-Letters after a C<-> turn those modifiers off. These modifiers are
-localized inside an enclosing group (if any). For example,
+These modifiers are restored at the end of the enclosing group. For example,
 
     ( (?i) blah ) \s+ \1
 
@@ -675,7 +703,7 @@ this yourself would be a productive exercise), but finishes in a fourth
 the time when used on a similar string with 1000000 C<a>s. Be aware,
 however, that this pattern currently triggers a warning message under
 the C<use warnings> pragma or B<-w> switch saying it
-C<"matches the null string many times">):
+C<"matches null string many times in regex">.
 
 On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
 effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
@@ -783,7 +811,7 @@ and the first "bar" thereafter.
   got <d is under the >
 
 Here's another example: let's say you'd like to match a number at the end
-of a string, and you also want to keep the preceding part the match.
+of a string, and you also want to keep the preceding part of the match.
 So you write this:
 
     $_ = "I have 2 numbers: 53147";
@@ -849,7 +877,7 @@ followed by "123". You might try to write that as
 
 But that isn't going to match; at least, not the way you're hoping. It
 claims that there is no 123 in the string. Here's a clearer picture of
-why it that pattern matches, contrary to popular expectations:
+why that pattern matches, contrary to popular expectations:
 
     $x = 'ABC123' ;
     $y = 'ABC445' ;
@@ -955,10 +983,10 @@ escape it with a backslash. "-" is also taken literally when it is
 at the end of the list, just before the closing "]". (The
 following all specify the same class of three characters: C<[-az]>,
 C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
-specifies a class containing twenty-six characters.)
-Also, if you try to use the character classes C<\w>, C<\W>, C<\s>,
-C<\S>, C<\d>, or C<\D> as endpoints of a range, that's not a range,
-the "-" is understood literally.
+specifies a class containing twenty-six characters, even on EBCDIC
+based coded character sets.) Also, if you try to use the character
+classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
+a range, that's not a range, the "-" is understood literally.
 
 Note also that the whole range idea is rather unportable between
 character sets--and even within character sets they may cause results
@@ -970,11 +998,11 @@ spell out the character sets in full.
 Characters may be specified using a metacharacter syntax much like that
 used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
 "\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
-of octal digits, matches the character whose ASCII value is I<nnn>.
-Similarly, \xI<nn>, where I<nn> are hexadecimal digits, matches the
-character whose ASCII value is I<nn>. The expression \cI<x> matches the
-ASCII character control-I<x>. Finally, the "." metacharacter matches any
-character except "\n" (unless you use C</s>).
+of octal digits, matches the character whose coded character set value
+is I<nnn>. Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
+matches the character whose numeric value is I<nn>. The expression \cI<x>
+matches the character control-I<x>. Finally, the "." metacharacter
+matches any character except "\n" (unless you use C</s>).
 
 You can specify a series of alternatives for a pattern using "|" to
 separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
@@ -1016,7 +1044,7 @@ Some people get too used to writing things like:
 
 This is grandfathered for the RHS of a substitute to avoid shocking the
 B<sed> addicts, but it's a dirty habit to get into. That's because in
-PerlThink, the righthand side of a C<s///> is a double-quoted string. C<\1> in
+PerlThink, the righthand side of an C<s///> is a double-quoted string. C<\1> in
 the usual double-quoted string means a control-A. The customary Unix
 meaning of C<\1> is kludged in for C<s///>. However, if you get into the habit
 of doing that, you get yourself into trouble if you then add an C</e>
@@ -1096,7 +1124,7 @@ For example:
     $_ = 'bar';
     s/\w??/<$&>/g;
 
-results in C<"<><b><><a><><r><>">. At each position of the string the best
+results in C<< <><b><><a><><r><> >>. At each position of the string the best
 match given by non-greedy C<??> is the zero-length match, and the I<second best> match
 is what is matched by C<\w>. Thus zero-length matches alternate with
 one-character-long matches.
@@ -1268,6 +1296,10 @@ from the reference content.
 
 =head1 SEE ALSO
 
+L<perlrequick>.
+
+L<perlretut>.
+
 L<perlop/"Regexp Quote-Like Operators">.
 
 L<perlop/"Gory details of parsing quoted constructs">.
@@ -1278,5 +1310,7 @@ L<perlfunc/pos>.
 
 L<perllocale>.
 
+L<perlebcdic>.
+
 I<Mastering Regular Expressions> by Jeffrey Friedl, published
 by O'Reilly and Associates.