X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlre.pod;h=58cd6456f5a3b4e3e9c30c5677c2ec8b5b5f8642;hb=c9436a12b1ee8d5e32d19b5870c63a8435afed9d;hp=ce2b9bd952e6679fd62d26ce67edc476c366178f;hpb=945c54fd8d2501611a8e97dae49e901ff9478cad;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlre.pod b/pod/perlre.pod
index ce2b9bd..58cd645 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -4,10 +4,16 @@ perlre - Perl regular expressions
 
 =head1 DESCRIPTION
 
-This page describes the syntax of regular expressions in Perl. For a
-description of how to I regular expressions in matching
-operations, plus various examples of the same, see discussions
-of C, C, C and C in L.
+This page describes the syntax of regular expressions in Perl.
+
+if you haven't used regular expressions before, a quick-start
+introduction is available in L, and a longer tutorial
+introduction is available in L.
+
+For reference on how regular expressions are used in matching
+operations, plus various examples of the same, see discussions of
+C, C, C and C in L.
 
 Matching operations can have various modifiers. Modifiers that
 relate to the interpretation of the regular expression inside
@@ -177,17 +183,23 @@ In addition, Perl defines the following:
     \pP Match P, named property. Use \p{Prop} for longer names.
     \PP Match non-P
     \X Match eXtended Unicode "combining character sequence",
-        equivalent to C<(?:\PM\pM*)>
-    \C Match a single C char (octet) even under utf8.
-
-A C<\w> matches a single alphanumeric character or C<_>, not a whole word.
-Use C<\w+> to match a string of Perl-identifier characters (which isn't
-the same as matching an English word). If C is in effect, the
-list of alphabetic characters generated by C<\w> is taken from the
-current locale. See L. You may use C<\w>, C<\W>, C<\s>, C<\S>,
+        equivalent to (?:\PM\pM*)
+    \C Match a single C char (octet) even under Unicode.
+        NOTE: breaks up characters into their UTF-8 bytes,
+        so you may end up with malformed pieces of UTF-8.
+
+A C<\w> matches a single alphanumeric character (an alphabetic
+character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
+to match a string of Perl-identifier characters (which isn't the same
+as matching an English word). If C is in effect, the list
+of alphabetic characters generated by C<\w> is taken from the current
+locale. See L. You may use C<\w>, C<\W>, C<\s>, C<\S>,
 C<\d>, and C<\D> within character classes, but if you try to use them
-as endpoints of a range, that's not a range, the "-" is understood literally.
-See L for details about C<\pP>, C<\PP>, and C<\X>.
+as endpoints of a range, that's not a range, the "-" is understood
+literally. If Unicode is in effect, C<\s> matches also "\x{85}",
+"\x{2028}, and "\x{2029}", see L for more details about
+C<\pP>, C<\PP>, and C<\X>, and L about Unicode in
+general.
 
 The POSIX character class syntax
 
@@ -211,10 +223,22 @@ equivalents (if available) are as follows:
     word \w [3]
     xdigit
 
-    [1] A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
-    [2] Not I to C<\s> since the C<[[:space:]]> includes
-        also the (very rare) `vertical tabulator', "\ck", chr(11).
-    [3] A Perl extension.
+=over
+
+=item [1]
+
+A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
+
+=item [2]
+
+Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
+also the (very rare) `vertical tabulator', "\ck", chr(11).
+
+=item [3]
+
+A Perl extension, see above.
+
+=back
 
 For example use C<[:upper:]> to match all the uppercase characters.
 Note that the C<[]> are part of the C<[::]> construct, not part of the
@@ -224,9 +248,10 @@ whole character class. For example:
 
 matches zero, one, any alphabetic character, and the percentage sign.
 
-If the C pragma is used, the following equivalences to Unicode
-\p{} constructs and equivalent backslash character classes (if available),
-will hold:
+The following equivalences to Unicode \p{} constructs and equivalent
+backslash character classes (if available), will hold:
+
+    [:...:] \p{...} backslash
 
     alpha IsAlpha
     alnum IsAlnum
@@ -260,7 +285,8 @@ Any control character. Usually characters that don't produce output as
 such but instead control the terminal somehow: for example newline and
 backspace are control characters. All characters with ord() less than
 32 are most often classified as control characters (assuming ASCII,
-the ISO Latin character sets, and Unicode).
+the ISO Latin character sets, and Unicode), as is the character with
+the ord() value of 127 (C).
 
 =item graph
 
@@ -268,7 +294,7 @@ Any alphanumeric or punctuation (special) character.
 
 =item print
 
-Any alphanumeric or punctuation (special) character or space.
+Any alphanumeric or punctuation (special) character or the space character.
 
 =item punct
 
@@ -284,7 +310,7 @@ work just fine) it is included for completeness.
 You can negate the [::] character classes by prefixing the class name
 with a '^'. This is a Perl extension. For example:
 
-    POSIX trad. Perl utf8 Perl
+    POSIX traditional Unicode
 
     [:^digit:] \D \P{IsDigit}
     [:^space:] \S \P{IsSpace}
@@ -436,12 +462,14 @@ C<)> in the comment.
 
 =item C<(?imsx-imsx)>
 
-One or more embedded pattern-match modifiers. This is particularly
-useful for dynamic patterns, such as those read in from a configuration
-file, read in as an argument, are specified in a table somewhere,
-etc. Consider the case that some of which want to be case sensitive
-and some do not. The case insensitive ones need to include merely
-C<(?i)> at the front of the pattern. For example:
+One or more embedded pattern-match modifiers, to be turned on (or
+turned off, if preceded by C<->) for the remainder of the pattern or
+the remainder of the enclosing pattern group (if any). This is
+particularly useful for dynamic patterns, such as those read in from a
+configuration file, read in as an argument, are specified in a table
+somewhere, etc. Consider the case that some of which want to be case
+sensitive and some do not. The case insensitive ones need to include
+merely C<(?i)> at the front of the pattern. For example:
 
     $pattern = "foobar";
     if ( /$pattern/i ) { }
@@ -451,8 +479,7 @@ C<(?i)> at the front of the pattern. For example:
     $pattern = "(?i)foobar";
     if ( /$pattern/ ) { }
 
-Letters after a C<-> turn those modifiers off. These modifiers are
-localized inside an enclosing group (if any). For example,
+These modifiers are restored at the end of the enclosing group. For example,
 
     ( (?i) blah ) \s+ \1
 
@@ -676,7 +703,7 @@ this yourself would be a productive exercise), but finishes in a fourth
 the time when used on a similar string with 1000000 Cs. Be aware,
 however, that this pattern currently triggers a warning message under
 the C pragma or B<-w> switch saying it
-C<"matches the null string many times">):
+C<"matches null string many times in regex">.
 
 On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
 effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
@@ -784,7 +811,7 @@ and the first "bar" thereafter.
     got
 
 Here's another example: let's say you'd like to match a number at the end
-of a string, and you also want to keep the preceding part the match.
+of a string, and you also want to keep the preceding part of the match.
 So you write this:
 
     $_ = "I have 2 numbers: 53147";
@@ -850,7 +877,7 @@ followed by "123". You might try to write that as
 
 But that isn't going to match; at least, not the way you're hoping. It
 claims that there is no 123 in the string. Here's a clearer picture of
-why it that pattern matches, contrary to popular expectations:
+why that pattern matches, contrary to popular expectations:
 
     $x = 'ABC123' ;
     $y = 'ABC445' ;
@@ -1017,7 +1044,7 @@ Some people get too used to writing things like:
 
 This is grandfathered for the RHS of a substitute to avoid shocking the
 B addicts, but it's a dirty habit to get into. That's because in
-PerlThink, the righthand side of a C is a double-quoted string. C<\1> in
+PerlThink, the righthand side of an C is a double-quoted string. C<\1> in
 the usual double-quoted string means a control-A. The customary Unix
 meaning of C<\1> is kludged in for C. However, if you get into the habit
 of doing that, you get yourself into trouble if you then add an C
@@ -1269,6 +1296,10 @@ from the reference content.
 
 =head1 SEE ALSO
 
+L.
+
+L.
+
 L.
 
 L.