X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlretut.pod;h=b738c3b2cbe48c01f3fae8a69a1dbbe0fcc2e52a;hb=85b35914c9f3fc562f8a505e6508276be17f9d70;hp=cc8f5c4c9ba169edfc1a94a9c582655d75acd022;hpb=da75cd15705fec9f427a4cc8a647f3b6919c6e04;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index cc8f5c4..b738c3b 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -158,13 +158,14 @@ that a metacharacter can be matched by putting a backslash before it:
     "2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary +
     "The interval is [0,1)." =~ /[0,1)./ # is a syntax error!
     "The interval is [0,1)." =~ /\[0,1\)\./ # matches
-    "/usr/bin/perl" =~ /\/usr\/local\/bin\/perl/; # matches
+    "/usr/bin/perl" =~ /\/usr\/bin\/perl/; # matches
 
 In the last regexp, the forward slash C<'/'> is also backslashed,
 because it is used to delimit the regexp. This can lead to LTS
 (leaning toothpick syndrome), however, and it is often more readable
 to change delimiters.
 
+    "/usr/bin/perl" =~ m!/usr/bin/perl!; # easier to read
 
 The backslash character C<'\'> is a metacharacter itself and needs to
 be backslashed:
@@ -550,7 +551,7 @@ to give them a chance to match.
 
 The last example points out that character classes are like
 alternations of characters. At a given character position, the first
-alternative that allows the regexp match to succeed wil be the one
+alternative that allows the regexp match to succeed will be the one
 that matches.
 
 =head2 Grouping things and hierarchical matching
@@ -587,7 +588,7 @@ are
 
 Alternations behave the same way in groups as out of them: at a given
 string position, the leftmost alternative that allows the regexp to
-match is taken. So in the last example at tth first string position,
+match is taken. So in the last example at the first string position,
 C<"20"> matches the second alternative, but there is nothing left over
 to match the next two digits C<\d\d>.
 So perl moves on to the next alternative, which is the null alternative and that works, since
@@ -689,10 +690,11 @@ inside goes into the special variables C<$1>, C<$2>, etc. They can be
 used just as ordinary variables:
 
     # extract hours, minutes, seconds
-    $time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format
-    $hours = $1;
-    $minutes = $2;
-    $seconds = $3;
+    if ($time =~ /(\d\d):(\d\d):(\d\d)/) { # match hh:mm:ss format
+        $hours = $1;
+        $minutes = $2;
+        $seconds = $3;
+    }
 
 Now, we know that in scalar context,
 S > returns a true or false
@@ -1323,9 +1325,9 @@ If you change C<$pattern> after the first substitution happens, perl
 will ignore it. If you don't want any substitutions at all, use the
 special delimiter C:
 
-    $pattern = 'Seuss';
+    @pattern = ('Seuss');
     while (<>) {
-        print if m'$pattern'; # matches '$pattern', not 'Seuss'
+        print if m'@pattern'; # matches literal '@pattern', not 'Seuss'
     }
 
 C acts like single quotes on a regexp; all other C delimiters
@@ -1403,6 +1405,8 @@ off. C<\G> allows us to easily do context-sensitive matching:
 
 The combination of C and C<\G> allows us to process the string a
 bit at a time and use arbitrary Perl logic to decide what to do next.
+Currently, the C<\G> anchor is only fully supported when used to anchor
+to the start of the pattern.
 
 C<\G> is also invaluable in processing fixed length records with
 regexps. Suppose we have a snippet of coding region DNA, encoded as
@@ -1415,7 +1419,7 @@ naive regexp
     $dna = "ATCGTTGAATGCAAATGACATGAC";
     $dna =~ /TGA/;
 
-doesn't work; it may match an C, but there is no guarantee that
+doesn't work; it may match a C, but there is no guarantee that
 the match is aligned with codon boundaries, e.g., the substring
 S > gives a match. A better solution is
 
@@ -1653,12 +1657,11 @@ Unicode characters in the range of 128-255 use two hexadecimal digits
 with braces: C<\x{ab}>. Note that this is different than C<\xab>,
 which is just a hexadecimal byte with no Unicode significance.
 
-B: in perl 5.6.0 it used to be that one needed to say C
-to use any Unicode features. This is no more the case: for almost all
-Unicode processing, the explicit C pragma is not needed.
-(The only case where it matters is if your Perl script is in Unicode,
-that is, encoded in UTF-8/UTF-16/UTF-EBCDIC: then an explicit C
-is needed.)
+B: in Perl 5.6.0 it used to be that one needed to say C to use any Unicode features. This is no more the case: for
+almost all Unicode processing, the explicit C pragma is not
+needed. (The only case where it matters is if your Perl script is in
+Unicode and encoded in UTF-8, then an explicit C is needed.)
 
 Figuring out the hexadecimal sequence of a Unicode character you want
 or deciphering someone else's hexadecimal Unicode regexp is about as
@@ -1706,7 +1709,7 @@ it matches I byte 0-255. So
 The last regexp matches, but is dangerous because the string
 I position is no longer synchronized to the string I
 position. This generates the warning 'Malformed UTF-8
-character'. C<\C> is best used for matching the binary data in strings
+character'. The C<\C> is best used for matching the binary data in strings
 with binary data intermixed with Unicode characters.
 
 Let us now discuss the rest of the character classes. Just as with
@@ -1739,7 +1742,7 @@ traditional Unicode classes:
     IsPrint     /^([LMNPS]|Co|Zs)/
     IsPunct     /^P/
     IsSpace     /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/
-    IsSpacePerl /^Z/ || ($code =~ /^(0009|000A|000C|000D)$/
+    IsSpacePerl /^Z/ || ($code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/
     IsUpper     /^L[ut]/
     IsWord      /^[LMN]/ || $code eq "005F"
     IsXDigit    $code =~ /^00(3[0-9]|[46][1-6])$/
@@ -1753,7 +1756,7 @@ For the full list see L.
 
 The Unicode has also been separated into various sets of charaters
 which you can test with C<\p{In...}> (in) and C<\P{In...}> (not in),
-for example C<\p{InLatin}>, C<\p{InGreek}>, or C<\P{InKatakana}>.
+for example C<\p{Latin}>, C<\p{Greek}>, or C<\P{Katakana}>.
 For the full list see L.
 
 C<\X> is an abbreviation for a character class sequence that includes
@@ -1783,10 +1786,11 @@ C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
 character classes. To negate a POSIX class, put a C<^> in front of
 the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
 C, C<\P{IsDigit}>. The Unicode and POSIX character classes can
-be used just like C<\d>, both inside and outside of character classes:
+be used just like C<\d>, with the exception that POSIX character
+classes can only be used inside of a character class:
 
     /\s+[abc[:digit:]xyz]\s*/; # match a,b,c,x,y,z, or a digit
-    /^=item\s[:digit:]/; # match '=item',
+    /^=item\s[[:digit:]]/; # match '=item',
                            # followed by a space and a digit
     use charnames ":full";
     /\s+[abc\p{IsDigit}xyz]\s+/; # match a,b,c,x,y,z, or a digit
@@ -2002,6 +2006,10 @@ They evaluate true if the regexps do I match:
     $x =~ /foo(?!baz)/; # matches, 'baz' doesn't follow 'foo'
     $x =~ /(? is unsupported in lookbehind, because the already
+treacherous definition of C<\C> would become even more so
+when going backwards.
+
 =head2 Using independent subexpressions to prevent backtracking
 
 The last few extended patterns in this tutorial are experimental as of
@@ -2060,7 +2068,7 @@ the first alternative C<[^()]+> matching a substring with no
 parentheses and the second alternative C<\([^()]*\)> matching a
 substring delimited by parentheses. The problem with this regexp is
 that it is pathological: it has nested indeterminate quantifiers
- of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
+of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
 like this could take an exponentially long time to execute if there
 was no match possible. To prevent the exponential blowup, we need to
 prevent useless backtracking at some point. This can be done by