From: Yves Orton Date: Thu, 12 Oct 2006 14:45:25 +0000 (+0200) Subject: More regexp documentation X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=9af228c62a22d61074ac942be277a5f0b4bd7aff;p=p5sagit%2Fp5-mst-13.2.git More regexp documentation Message-ID: <9b18b3110610120545m3002e17cqace30f908b0e2277@mail.gmail.com> p4raw-id: //depot/perl@28999 --- diff --git a/pod/perlre.pod b/pod/perlre.pod index f79b8c7..c2da3bd 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -1004,7 +1004,51 @@ with the given name matched), the special symbol (R) (true when evaluated inside of recursion or eval). Additionally the R may be followed by a number, (which will be true when evaluated when recursing inside of the appropriate group), or by C<&NAME> in which case it will -be true only when evaluated during recursion into the named group. +be true only when evaluated during recursion in the named group. + +Here's a summary of the possible predicates: + +=over 4 + +=item (1) (2) ... + +Checks if the numbered capturing buffer has matched something. + +=item (<NAME>) ('NAME') + +Checks if a buffer with the given name has matched something. + +=item (?{ CODE }) + +Treats the code block as the condition. + +=item (R) + +Checks if the expression has been evaluated inside of recursion. + +=item (R1) (R2) ... + +Checks if the expression has been evaluated while executing directly +inside of the n-th capture group. This check is the regex equivalent of + + if ((caller(0))[3] eq 'subname') { .. } + +In other words, it does not check the full recursion stack. + +=item (R&NAME) + +Similar to C<(R1)>, this predicate checks to see if we're executing +directly inside of the leftmost group with a given name (this is the same +logic used by C<(?&NAME)> to disambiguate). It does not check the full +stack, but only the name of the innermost active recursion. + +=item (DEFINE) + +In this case, the yes-pattern is never directly executed, and no +no-pattern is allowed.
Similar in spirit to C<(?{0})> but more efficient. +See below for details. + +=back For example: @@ -1016,9 +1060,31 @@ For example: matches a chunk of non-parentheses, possibly included in parentheses themselves. -An additional special form of this pattern is the DEFINE pattern, which -never executes its yes-pattern except by recursion, and does not allow -a no-pattern. +A special form is the C<(DEFINE)> predicate, which never directly executes +its yes-pattern, and does not allow a no-pattern. This allows you to define +subpatterns which will be executed only by using the recursion mechanism. +This way, you can define a set of regular expression rules that can be +bundled into any pattern you choose. + +It is recommended that for this usage you put the DEFINE block at the +end of the pattern, and that you name any subpatterns defined within it. + +Also, it's worth noting that patterns defined this way probably will +not be as efficient, as the optimiser is not very clever about +handling them. YMMV. + +An example of how this might be used is as follows: + + /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT)) + (?(DEFINE) + (?<NAME_PAT>....) + (?<ADDRESS_PAT>....) + )/x + +Note that capture buffers matched inside of recursion are not accessible +after the recursion returns, so the extra layer of capturing buffers is +necessary. Thus C<$+{NAME_PAT}> would not be defined even though +C<$+{NAME}> would be. =back diff --git a/pod/perlreguts.pod b/pod/perlreguts.pod index fb7669c..4ee2be1 100644 --- a/pod/perlreguts.pod +++ b/pod/perlreguts.pod @@ -679,9 +679,9 @@ subroutines but the bulk are inline code. =head2 Unicode and Localisation Support When dealing with strings containing characters that cannot be represented -using an eight-bit character set, perl uses an internal representation +using an eight-bit character set, perl uses an internal representation that is a permissive version of Unicode's UTF-8 encoding[2].
This uses single -bytes to represent characters from the ASCII character set, and sequences +bytes to represent characters from the ASCII character set, and sequences of two or more bytes for all other characters. (See L for more information about the relationship between UTF-8 and perl's encoding, utf8 -- the difference isn't important for this discussion.) @@ -745,62 +745,227 @@ tricky this can be: F<regexp.h> contains the base structure definition: typedef struct regexp { - I32 *startp; - I32 *endp; - regnode *regstclass; - struct reg_substr_data *substrs; - char *precomp; /* pre-compilation regular expression */ - struct reg_data *data; /* Additional data. */ - char *subbeg; /* saved or original string - so \digit works forever. */ + I32 *startp; + I32 *endp; + regnode *regstclass; + struct reg_substr_data *substrs; + char *precomp; /* pre-compilation regular expression */ + struct reg_data *data; /* Additional data. */ + char *subbeg; /* saved or original string + so \digit works forever. */ #ifdef PERL_OLD_COPY_ON_WRITE - SV *saved_copy; /* If non-NULL, SV which is COW from original */ + SV *saved_copy; /* If non-NULL, SV which is COW from original */ #endif - U32 *offsets; /* offset annotations 20001228 MJD */ - I32 sublen; /* Length of string pointed by subbeg */ - I32 refcnt; - I32 minlen; /* minimum possible length of $& */ - I32 prelen; /* length of precomp */ - U32 nparens; /* number of parentheses */ - U32 lastparen; /* last paren matched */ - U32 lastcloseparen; /* last paren matched */ - U32 reganch; /* Internal use only + - Tainted information used by regexec? */ - regnode program[1]; /* Unwarranted chumminess with compiler.
*/ + U32 *offsets; /* offset annotations 20001228 MJD */ + I32 sublen; /* Length of string pointed by subbeg */ + I32 refcnt; + I32 minlen; /* minimum possible length of $& */ + I32 prelen; /* length of precomp */ + U32 nparens; /* number of parentheses */ + U32 lastparen; /* last paren matched */ + U32 lastcloseparen; /* last paren matched */ + U32 reganch; /* Internal use only + + Tainted information used by regexec? */ + HV *paren_names; /* Paren names */ + const struct regexp_engine* engine; + regnode program[1]; /* Unwarranted chumminess with compiler. */ } regexp; -C<program>, and C<data> are the primary fields of concern in terms of -program structure. C<program> is the actual array of nodes, and C<data> is -an array of "whatever", with each whatever being typed by letter, and -freed or cloned as needed based on this type. regops use the data -array to store reference data that isn't convenient to store in the regop -itself. It also means memory management code doesn't need to traverse the -program to find pointers. So for instance, if a regop needs a pointer, the -normal procedure is use a C store the data index in the C -field and look it up from the data array. - =over 5 -=item - +=item C<program> + +Compiled program. Inlined into the structure so the entire struct can be +treated as a single blob. + +=item C<data> + +This field points at a reg_data structure, which is defined as follows: + + struct reg_data { + U32 count; + U8 *what; + void* data[1]; + }; + +This structure is used for handling data structures that the regex engine +needs to handle specially during a clone or free operation on the compiled +product. Each element in the data array has a corresponding element in the +what array. During compilation, regops that need special structures stored +will add an element to each array using the add_data() routine and then store +the index in the regop.
+ +=item C<nparens>, C<lastparen>, and C<lastcloseparen> + +These fields are used to keep track of how many paren groups could be matched +in the pattern, which was the last open paren to be entered, and which was +the last close paren to be entered. + +=item C<startp>, C<endp> + +These fields store arrays that are used to hold the offsets of the beginning +and end of each capture group that has matched. -1 is used to indicate no match. + +These are the source for @- and @+. + +=item C<subbeg> C<sublen> C<saved_copy> + +These are used during the execution phase for managing search and replace +patterns. -C, C, C, C, and C are used to manage capture -buffers. +=item C<precomp> C<prelen> C<offsets> -=item - +Used for debugging purposes. C<precomp> holds a copy of the pattern +that was compiled, offsets holds a mapping of offset in the C<program> +to offset in the C<precomp> string. This is only used by ActiveState's +visual regex debugger. -C and optional C are used during the execution phase for managing -replacements. +=item C<substrs> -=item - +Holds information on the longest string that must occur at a fixed +offset from the start of the pattern, and the longest string that must +occur at a floating offset from the start of the pattern. Used to do +Fast-Boyer-Moore searches on the string to find out if it's worth using +the regex engine at all, and if so where in the string to search. -C and C are used for debugging purposes. +=item C<regstclass> -=item - +Special regop that is used by C<re_intuit_start()> to check if a pattern +can match at a certain position. For instance, if the regex engine knows +that the pattern must start with a 'Z' then it can scan the string until +it finds one and then launch the regex engine from there. The routine +that handles this is called C<find_by_class()>. Sometimes this field +points at a regop embedded in the program, and sometimes it points at +an independent synthetic regop that has been constructed by the optimiser. -The rest are used for start point optimisations. +=item C<minlen> + +The minimum possible length of the final matching string.
This is used +to prune the search space by not bothering to match any closer to the +end of a string than would allow a match. For instance, there is no point +in even starting the regex engine if the minlen is 10 but the string +is only 5 characters long. There is no way that the pattern can match. + +=item C<reganch> + +This is used to store various flags about the pattern, such as whether it +contains a \G or a ^ or $ symbol. + +=item C<paren_names> + +This is a hash used internally to track named capture buffers and their +offsets. The keys are the names of the buffers; the values are dualvars, +with the IV slot holding the number of buffers with the given name and the +PV being an embedded array of I32. The values may also be contained +independently in the data array in cases where named backreferences are +used. + +=item C<refcnt> + +The number of times the structure is referenced. When this falls to 0 +the regexp is automatically freed by a call to pregfree. + +=item C<engine> + +This field points at a regexp_engine structure which contains pointers +to the subroutines that are to be used for performing a match. It +is the compiling routine's responsibility to populate this field before +returning the regexp object. =back +=head2 Pluggable Interface + +As of Perl 5.9.5 there is a new interface for using other regexp engines +than the default one.
Each engine is supposed to provide access to +a constant structure of the following format: + + typedef struct regexp_engine { + regexp* (*comp) (pTHX_ char* exp, char* xend, PMOP* pm); + I32 (*exec) (pTHX_ regexp* prog, char* stringarg, char* strend, + char* strbeg, I32 minend, SV* screamer, + void* data, U32 flags); + char* (*intuit) (pTHX_ regexp *prog, SV *sv, char *strpos, + char *strend, U32 flags, + struct re_scream_pos_data_s *data); + SV* (*checkstr) (pTHX_ regexp *prog); + void (*free) (pTHX_ struct regexp* r); + #ifdef USE_ITHREADS + regexp* (*dupe) (pTHX_ const regexp *r, CLONE_PARAMS *param); + #endif + } regexp_engine; + +When a regexp is compiled its C<engine> field is then set to point at +the appropriate structure so that when it needs to be used it can find +the right routines to do so. + +In order to install a new regexp handler, C<$^H{regcomp}> is set +to an integer which (when cast appropriately) resolves to one of these +structures. When compiling, the C<comp> method is executed, and the +resulting regexp structure's engine field is expected to point back at +the same structure. + +The pTHX_ symbol in the definition is a macro used by perl under threading +to provide an extra argument to the routine holding a pointer back to +the interpreter that is executing the regexp. So under threading all +routines get an extra argument. + +The routines are as follows: + +=over 4 + +=item comp + + regexp* comp(char *exp, char *xend, PMOP* pm); + +Compile the pattern between exp and xend using the flags contained in +pm and return a pointer to a prepared regexp structure that can perform +the match. + +=item exec + + I32 exec(regexp* prog, + char *stringarg, char* strend, char* strbeg, + I32 minend, SV* screamer, + void* data, U32 flags); + +Execute a regexp.
+ +=item intuit + + char* intuit( regexp *prog, + SV *sv, char *strpos, char *strend, + U32 flags, struct re_scream_pos_data_s *data); + +Find the start position where a regex match should be attempted, +or possibly whether the regex engine should not be run because the +pattern can't match. + +=item checkstr + + SV* checkstr(regexp *prog); + +Return an SV containing a string that must appear in the pattern. Used +for optimising matches. + +=item free + + void free(regexp *prog); + +Release any resources allocated to store this pattern. After this +call prog is an invalid pointer. + +=item dupe + + regexp* dupe(const regexp *r, CLONE_PARAMS *param); + +On threaded builds a regexp may need to be duplicated so that the pattern +can be used by multiple threads. This routine is expected to handle the +duplication. On unthreaded builds this field doesn't exist. + +=back + + =head2 De-allocation and Cloning Any patch that adds data items to the regexp will need to include diff --git a/regcomp.c b/regcomp.c index e64702a..821cb10 100644 --- a/regcomp.c +++ b/regcomp.c @@ -4290,6 +4290,7 @@ reStudy: #undef END_BLOCK #undef RE_ENGINE_PTR +#ifndef PERL_IN_XSUB_RE SV* Perl_reg_named_buff_sv(pTHX_ SV* namesv) { @@ -4323,7 +4324,7 @@ Perl_reg_named_buff_sv(pTHX_ SV* namesv) return GvSVn(gv_paren); } } - +#endif /* Scans the name of a named buffer from the pattern. * If flags is REG_RSN_RETURN_NULL returns null.
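Reviewer note on the "Pluggable Interface" text added to perlreguts.pod above: the contract it describes (one constant table of function pointers per engine, with comp storing a pointer back to that table in the compiled object so later calls dispatch through it) may be easier to follow with a small stand-alone sketch. The C below is only an illustration under stated assumptions, not perl's implementation. Every toy_* and literal_* name is hypothetical, pTHX_, SV and PMOP are left out, the "engine" only does literal substring search, and the slot perl calls free is named release here to avoid clashing with the C library's free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for perl's regexp and regexp_engine.
 * No toy_* or literal_* name below exists in the perl source. */
struct toy_regexp;

typedef struct toy_engine {
    struct toy_regexp *(*comp)(const char *exp, const char *xend);
    int  (*exec)(struct toy_regexp *prog, const char *str, const char *strend);
    void (*release)(struct toy_regexp *prog);   /* perl names this slot "free" */
} toy_engine;

typedef struct toy_regexp {
    const toy_engine *engine;  /* filled in by comp before it returns */
    char  *precomp;            /* private copy of the pattern text */
    size_t prelen;             /* length of precomp */
} toy_regexp;

static toy_regexp *literal_comp(const char *exp, const char *xend);
static int  literal_exec(toy_regexp *prog, const char *str, const char *strend);
static void literal_release(toy_regexp *prog);

/* One constant structure per engine, as the documentation asks for. */
static const toy_engine literal_engine = {
    literal_comp, literal_exec, literal_release
};

/* "Compile" the text between exp and xend; this toy engine only copies it. */
static toy_regexp *literal_comp(const char *exp, const char *xend)
{
    toy_regexp *rx = malloc(sizeof *rx);
    if (!rx) return NULL;
    rx->prelen = (size_t)(xend - exp);
    rx->precomp = malloc(rx->prelen + 1);
    if (!rx->precomp) { free(rx); return NULL; }
    memcpy(rx->precomp, exp, rx->prelen);
    rx->precomp[rx->prelen] = '\0';
    rx->engine = &literal_engine;   /* point back at our own structure */
    return rx;
}

/* Return 1 if the literal occurs anywhere in [str, strend), else 0. */
static int literal_exec(toy_regexp *prog, const char *str, const char *strend)
{
    size_t len = (size_t)(strend - str);
    if (prog->prelen > len) return 0;   /* too short to match, like minlen */
    for (size_t i = 0; i + prog->prelen <= len; i++)
        if (memcmp(str + i, prog->precomp, prog->prelen) == 0) return 1;
    return 0;
}

static void literal_release(toy_regexp *prog)
{
    free(prog->precomp);
    free(prog);                         /* prog is invalid after this call */
}

/* Caller side: compile with a chosen engine, then dispatch via rx->engine. */
static int toy_match(const toy_engine *eng, const char *pat, const char *subject)
{
    toy_regexp *rx = eng->comp(pat, pat + strlen(pat));
    if (!rx) return -1;
    assert(rx->engine == eng);          /* the "point back" contract */
    int ok = rx->engine->exec(rx, subject, subject + strlen(subject));
    rx->engine->release(rx);
    return ok;
}
```

A call such as toy_match(&literal_engine, "ell", "hello") compiles, matches and releases through the table. Swapping in another engine would only mean passing a different constant structure, which is the design choice the new documentation appears to be aiming for.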