X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=blobdiff_plain;f=pod%2Fperlretut.pod;h=c0a78a43e49db76d011a254091c442ef2a881650;hb=cf2649810f00335bd657355d81bcc9384a620135;hp=8f7c8cdd7260505260009a457a92a77afb878112;hpb=076988851d1cdb6cc455615593d7b9380b21955a;p=p5sagit%2Fp5-mst-13.2.git

diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index 8f7c8cd..c0a78a4 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -158,13 +158,14 @@ that a metacharacter can be matched by putting a backslash before it:
     "2+2=4" =~ /2\+2/;    # matches, \+ is treated like an ordinary +
     "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
     "The interval is [0,1)." =~ /\[0,1\)\./  # matches
-    "/usr/bin/perl" =~ /\/usr\/local\/bin\/perl/;  # matches
+    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches
 
 In the last regexp, the forward slash C<'/'> is also backslashed,
 because it is used to delimit the regexp.  This can lead to LTS
 (leaning toothpick syndrome), however, and it is often more readable
 to change delimiters.
 
+    "/usr/bin/perl" =~ m!/usr/bin/perl!;    # easier to read
 
 The backslash character C<'\'> is a metacharacter itself and needs to
 be backslashed:
@@ -689,10 +690,11 @@ inside goes into the special variables C<$1>, C<$2>, etc.  They can be
 used just as ordinary variables:
 
     # extract hours, minutes, seconds
-    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
-    $hours = $1;
-    $minutes = $2;
-    $seconds = $3;
+    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {  # match hh:mm:ss format
+        $hours = $1;
+        $minutes = $2;
+        $seconds = $3;
+    }
 
 Now, we know that in scalar context,
 S<C<$time =~ /(\d\d):(\d\d):(\d\d)/> > returns a true or false
@@ -1323,9 +1325,9 @@ If you change C<$pattern> after the first substitution happens, perl
 will ignore it.
 If you don't want any substitutions at all, use the
 special delimiter C<m''>:
 
-    $pattern = 'Seuss';
+    @pattern = ('Seuss');
     while (<>) {
-        print if m'$pattern';  # matches '$pattern', not 'Seuss'
+        print if m'@pattern';  # matches literal '@pattern', not 'Seuss'
     }
 
 C<m''> acts like single quotes on a regexp; all other C<m> delimiters
@@ -1403,6 +1405,8 @@ off.  C<\G> allows us to easily do context-sensitive matching:
 
 The combination of C<//g> and C<\G> allows us to process the string a
 bit at a time and use arbitrary Perl logic to decide what to do next.
+Currently, the C<\G> anchor is only fully supported when used to anchor
+to the start of the pattern.
 
 C<\G> is also invaluable in processing fixed length records with
 regexps.  Suppose we have a snippet of coding region DNA, encoded as
@@ -1705,7 +1709,7 @@ it matches I<any> byte 0-255.  So
 The last regexp matches, but is dangerous because the string
 I<character> position is no longer synchronized to the string I<byte>
 position.  This generates the warning 'Malformed UTF-8
-character'.  C<\C> is best used for matching the binary data in strings
+character'.  The C<\C> is best used for matching the binary data in strings
 with binary data intermixed with Unicode characters.
 
 Let us now discuss the rest of the character classes.  Just as with
@@ -1738,7 +1742,7 @@ traditional Unicode classes:
     IsPrint      /^([LMNPS]|Co|Zs)/
     IsPunct      /^P/
     IsSpace      /^Z/ || ($code =~ /^(0009|000A|000B|000C|000D)$/
-    IsSpacePerl  /^Z/ || ($code =~ /^(0009|000A|000C|000D)$/
+    IsSpacePerl  /^Z/ || ($code =~ /^(0009|000A|000C|000D|0085|2028|2029)$/
     IsUpper      /^L[ut]/
     IsWord       /^[LMN]/ || $code eq "005F"
     IsXDigit     $code =~ /^00(3[0-9]|[46][1-6])$/
@@ -1750,9 +1754,9 @@ letter, the braces can be dropped.  For instance, C<\pM> is the
 character class of Unicode 'marks', for example accent marks.
 For the full list see L<perlunicode>.
 
-The Unicode has also been separated into various sets of charaters
+The Unicode has also been separated into various sets of characters
 which you can test with C<\p{In...}> (in) and C<\P{In...}> (not in),
-for example C<\p{InLatin}>, C<\p{InGreek}>, or C<\P{InKatakana}>.
+for example C<\p{Latin}>, C<\p{Greek}>, or C<\P{Katakana}>.
 For the full list see L<perlunicode>.
 
 C<\X> is an abbreviation for a character class sequence that includes
@@ -1782,10 +1786,11 @@ C<[:space:]> correspond to the familiar C<\d>, C<\w>, and C<\s>
 character classes.  To negate a POSIX class, put a C<^> in front of
 the name, so that, e.g., C<[:^digit:]> corresponds to C<\D> and under
 C<utf8>, C<\P{IsDigit}>.  The Unicode and POSIX character classes can
-be used just like C<\d>, both inside and outside of character classes:
+be used just like C<\d>, with the exception that POSIX character
+classes can only be used inside of a character class:
 
     /\s+[abc[:digit:]xyz]\s*/;  # match a,b,c,x,y,z, or a digit
-    /^=item\s[:digit:]/;        # match '=item',
+    /^=item\s[[:digit:]]/;      # match '=item',
                                 # followed by a space and a digit
     use charnames ":full";
     /\s+[abc\p{IsDigit}xyz]\s+/;  # match a,b,c,x,y,z, or a digit
@@ -2001,6 +2006,10 @@ They evaluate true if the regexps do I<not> match:
     $x =~ /foo(?!baz)/;  # matches, 'baz' doesn't follow 'foo'
     $x =~ /(?<!\s)foo/;  # matches, there is no \s before 'foo'
 
+The C<\C> is unsupported in lookbehind, because the already
+treacherous definition of C<\C> would become even more so
+when going backwards.
+
 =head2 Using independent subexpressions to prevent backtracking
 
 The last few extended patterns in this tutorial are experimental as of
@@ -2262,7 +2271,7 @@ may surprise you:
     $pat = qr/(?{ $foo = 1 })/;  # precompile code regexp
     /foo${pat}bar/;      # compiles ok
 
-If a regexp has (1) code expressions and interpolating variables,or
+If a regexp has (1) code expressions and interpolating variables, or
 (2) a variable that interpolates a code expression, perl treats the
 regexp as an error.
 If the code expression is precompiled into a
 variable, however, interpolating is ok.  The question is, why is this