From: Mark-Jason Dominus
Date: Sat, 21 Apr 2001 21:48:51 +0000 (-0400)
Subject: Re: Regex debugger patch
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=1c102323748677709a3cb1ae901516a4e38b750e;p=p5sagit%2Fp5-mst-13.2.git

Re: Regex debugger patch
Message-ID: <20010422014851.27165.qmail@plover.com>

p4raw-id: //depot/perl@9777
---

diff --git a/pod/perldebguts.pod b/pod/perldebguts.pod
index 20cc546..02b5ab1 100644
--- a/pod/perldebguts.pod
+++ b/pod/perldebguts.pod
@@ -364,43 +364,58 @@ compile time and run time. It is not lexically scoped.
 
 The debugging output at compile time looks like this:
 
-    compiling RE `[bc]d(ef*g)+h[ij]k$'
-    size 43 first at 1
-       1: ANYOF(11)
-      11: EXACT <d>(13)
-      13: CURLYX {1,32767}(27)
-      15: OPEN1(17)
-      17: EXACT <e>(19)
-      19: STAR(22)
-      20: EXACT <f>(0)
-      22: EXACT <g>(24)
-      24: CLOSE1(26)
-      26: WHILEM(0)
-      27: NOTHING(28)
-      28: EXACT <h>(30)
-      30: ANYOF(40)
-      40: EXACT <k>(42)
-      42: EOL(43)
-      43: END(0)
-    anchored `de' at 1 floating `gh' at 3..2147483647 (checking floating)
-    stclass `ANYOF' minlen 7
+    Compiling REx `[bc]d(ef*g)+h[ij]k$'
+    size 45 Got 364 bytes for offset annotations.
+    first at 1
+    rarest char g at 0
+    rarest char d at 0
+       1: ANYOF[bc](12)
+      12: EXACT <d>(14)
+      14: CURLYX[0] {1,32767}(28)
+      16: OPEN1(18)
+      18: EXACT <e>(20)
+      20: STAR(23)
+      21: EXACT <f>(0)
+      23: EXACT <g>(25)
+      25: CLOSE1(27)
+      27: WHILEM[1/1](0)
+      28: NOTHING(29)
+      29: EXACT <h>(31)
+      31: ANYOF[ij](42)
+      42: EXACT <k>(44)
+      44: EOL(45)
+      45: END(0)
+    anchored `de' at 1 floating `gh' at 3..2147483647 (checking floating)
+    stclass `ANYOF[bc]' minlen 7
+    Offsets: [45]
+    1[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 5[1]
+    0[0] 12[1] 0[0] 6[1] 0[0] 7[1] 0[0] 9[1] 8[1] 0[0] 10[1] 0[0]
+    11[1] 0[0] 12[0] 12[0] 13[1] 0[0] 14[4] 0[0] 0[0] 0[0] 0[0]
+    0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 18[1] 0[0] 19[1] 20[0]
+    Omitting $` $& $' support.
 
 The first line shows the pre-compiled form of the regex.
 The second shows the size of the compiled form (in arbitrary units, usually
-4-byte words) and the label I<id> of the first node that does a
-match.
+4-byte words) and the total number of bytes allocated for the
+offset/length table, usually 4+C<size>*8. The next line shows the
+label I<id> of the first node that does a match.
 
-The last line (split into two lines above) contains optimizer
+The
+
+    anchored `de' at 1 floating `gh' at 3..2147483647 (checking floating)
+    stclass `ANYOF[bc]' minlen 7
+
+line (split into two lines above) contains optimizer
 information. In the example shown, the optimizer found that the match
 should contain a substring C<de> at offset 1, plus substring C<gh>
 at some offset between 3 and infinity. Moreover, when checking for
 these substrings (to abandon impossible matches quickly), Perl will check
 for the substring C<gh> before checking for the substring C<de>. The
 optimizer may also use the knowledge that the match starts (at the
-C<first> I<id>) with a character class, and the match cannot be
-shorter than 7 chars.
+C<first> I<id>) with a character class, and no string
+shorter than 7 characters can possibly match.
 
-The fields of interest which may appear in the last line are
+The fields of interest which may appear in this line are
 
 =over 4
 
@@ -428,7 +443,7 @@ Don't scan for the found substrings.
 
 =item C<isall>
 
-Means that the optimizer info is all that the regular
+Means that the optimizer information is all that the regular
 expression contains, and thus one does not need to enter the regex
 engine at all.
 
@@ -459,12 +474,12 @@ being C<BOL>, C<MBOL>, or C<GPOS>. See the table below.
 If a substring is known to match at end-of-line only, it may be
 followed by C<$>, as in C<floating `k'$>.
 
-The optimizer-specific info is used to avoid entering (a slow) regex
-engine on strings that will not definitely match. If C<isall> flag
+The optimizer-specific information is used to avoid entering (a slow) regex
+engine on strings that will not definitely match.
If the C<isall> flag
 is set, a call to the regex engine may be avoided even when
 the optimizer found an appropriate place for the match.
 
-The rest of the output contains the list of I<nodes> of the compiled
+Above the optimizer section is the list of I<nodes> of the compiled
 form of the regex. Each line has format
 
 C<   >I<id>: I<TYPE> I<OPTIONAL-INFO> (I<next-id>)
@@ -583,6 +598,36 @@ Here are the possible types, with short descriptions:
  # To simplify debugging output, we mark it as if it were a node
  OPTIMIZED off Placeholder for dump.
 
+=for unprinted-credits
+Next section M-J. Dominus (mjd-perl-patch+@plover.com) 20010421
+
+Following the optimizer information is a dump of the offset/length
+table, here split across several lines:
+
+    Offsets: [45]
+    1[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 5[1]
+    0[0] 12[1] 0[0] 6[1] 0[0] 7[1] 0[0] 9[1] 8[1] 0[0] 10[1] 0[0]
+    11[1] 0[0] 12[0] 12[0] 13[1] 0[0] 14[4] 0[0] 0[0] 0[0] 0[0]
+    0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 18[1] 0[0] 19[1] 20[0]
+
+The first line here indicates that the offset/length table contains 45
+entries. Each entry is a pair of integers, denoted by C<offset[length]>.
+Entries are numbered starting with 1, so entry #1 here is C<1[4]> and
+entry #12 is C<5[1]>. C<1[4]> indicates that the node labeled C<1:>
+(the C<1: ANYOF[bc]>) begins at character position 1 in the
+pre-compiled form of the regex, and has a length of 4 characters.
+C<5[1]> in position 12
+indicates that the node labeled C<12:>
+(the C<< 12: EXACT <d> >>) begins at character position 5 in the
+pre-compiled form of the regex, and has a length of 1 character.
+C<12[1]> in position 14
+indicates that the node labeled C<14:>
+(the C<< 14: CURLYX[0] {1,32767} >>) begins at character position 12 in the
+pre-compiled form of the regex, and has a length of 1 character---that
+is, it corresponds to the C<+> symbol in the precompiled regex.
+
+C<0[0]> items indicate that there is no corresponding node.
+
 =head2 Run-time output
 
 First of all, when doing a match, one may get no run-time output even