perlreapi - perl regular expression plugin interface
As of Perl 5.9.5 there is a new interface for plugging in regexp engines
other than the default one. Each engine is supposed to provide access to a
constant structure of the following format:
11 typedef struct regexp_engine {
12 regexp* (*comp) (pTHX_ char* exp, char* xend, U32 pm_flags);
13 I32 (*exec) (pTHX_ regexp* prog, char* stringarg, char* strend,
14 char* strbeg, I32 minend, SV* screamer,
15 void* data, U32 flags);
16 char* (*intuit) (pTHX_ regexp *prog, SV *sv, char *strpos,
17 char *strend, U32 flags,
18 struct re_scream_pos_data_s *data);
19 SV* (*checkstr) (pTHX_ regexp *prog);
20 void (*free) (pTHX_ struct regexp* r);
21 SV* (*numbered_buff_get) (pTHX_ const REGEXP * const rx, I32 paren, SV* usesv);
22 SV* (*named_buff_get)(pTHX_ const REGEXP * const rx, SV* namesv, U32 flags);
23 SV* (*qr_pkg)(pTHX_ const REGEXP * const rx);
25 void* (*dupe) (pTHX_ const regexp *r, CLONE_PARAMS *param);
29 When a regexp is compiled, its C<engine> field is then set to point at
30 the appropriate structure so that when it needs to be used Perl can find
31 the right routines to do so.
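The dispatch idea can be sketched in plain C. This is only an illustration under stated assumptions: C<toy_engine>, C<toy_regexp>, C<toy_comp>, C<toy_exec> and C<toy_match> are hypothetical stand-ins, not the real definitions from F<regexp.h>, and the "match" is a plain substring search rather than a regex match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in types: NOT the real Perl definitions from regexp.h. */
struct toy_regexp;

typedef struct toy_engine {
    struct toy_regexp *(*comp)(const char *exp, const char *xend);
    int (*exec)(struct toy_regexp *prog, const char *str);
} toy_engine;

typedef struct toy_regexp {
    const toy_engine *engine;  /* which engine compiled this pattern */
    const char *precomp;       /* pattern text (borrowed, not copied) */
    size_t prelen;             /* length of precomp */
} toy_regexp;

/* A single static slot keeps the sketch free of allocation. */
static toy_regexp toy_slot;

static toy_regexp *toy_comp(const char *exp, const char *xend);
static int toy_exec(toy_regexp *prog, const char *str);

/* The constant structure this hypothetical engine exposes. */
static const toy_engine toy_literal_engine = { toy_comp, toy_exec };

/* Compile: record the pattern and point the engine field back at
   the structure that owns this routine. */
static toy_regexp *toy_comp(const char *exp, const char *xend)
{
    assert(exp != NULL && xend >= exp);
    toy_slot.engine  = &toy_literal_engine;
    toy_slot.precomp = exp;
    toy_slot.prelen  = (size_t)(xend - exp);
    return &toy_slot;
}

/* Execute: plain substring search stands in for real matching. */
static int toy_exec(toy_regexp *prog, const char *str)
{
    size_t n = strlen(str);
    size_t i;
    for (i = 0; i + prog->prelen <= n; i++)
        if (memcmp(str + i, prog->precomp, prog->prelen) == 0)
            return 1;
    return 0;
}

/* The caller never names the engine; it follows the stored pointer. */
static int toy_match(toy_regexp *prog, const char *str)
{
    return prog->engine->exec(prog, str);
}
```

The point of the indirection is that C<toy_match> works unchanged for any engine whose C<comp> routine filled in the C<engine> field.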
33 In order to install a new regexp handler, C<$^H{regcomp}> is set
to an integer which (when cast appropriately) resolves to one of these
35 structures. When compiling, the C<comp> method is executed, and the
36 resulting regexp structure's engine field is expected to point back at
39 The pTHX_ symbol in the definition is a macro used by perl under threading
40 to provide an extra argument to the routine holding a pointer back to
41 the interpreter that is executing the regexp. So under threading all
42 routines get an extra argument.
44 The routines are as follows:
48 regexp* comp(char *exp, char *xend, U32 flags);
50 Compile the pattern between exp and xend using the given flags and return a
51 pointer to a prepared regexp structure that can perform the match. See L</The
52 REGEXP structure> below for an explanation of the individual fields in the
The C<flags> parameter is a bitfield which indicates which of the
C<msixk> flags the regex was compiled with. In addition it contains
info about whether C<use locale> is in effect and optimization info
for C<split>. A regex engine might want to use the same split
optimizations with a different syntax; for instance a Perl6 engine
would treat C<split /^^/> equivalently to perl's C<split /^/>. See the
L<split documentation|perlfunc> and the relevant code in C<pp_split>
in F<pp.c> to find out whether your engine should be setting these.
64 The C<eogc> flags are stripped out before being passed to the comp
65 routine. The regex engine does not need to know whether any of these
C<split ' '> or C<split> with no arguments (which really means
C<split(' ', $_)>; see L<split|perlfunc>).
Set if the pattern is C</^/> (C<< r->prelen == 1 && r->precomp[0] ==
'^' >>). Will be used by the C<split> operator to split the given
string on C<\n> (even under C</^/s>, see L<split|perlfunc>).
Set if the pattern is exactly C</\s+/>, and used by C<split>. The
definition of whitespace varies depending on whether RXf_UTF8 or
RXf_PMf_LOCALE is set.
Makes C<split> use the locale dependent definition of whitespace under C<use
locale> when RXf_SKIPWHITE or RXf_WHITE is in effect. Under ASCII, whitespace is
defined as per L<isSPACE|perlapi/ISSPACE>, and by the internal macros
C<is_utf8_space> under UTF-8 and C<isSPACE_LC> under C<use locale>.
=item RXf_PMf_MULTILINE
The C</m> flag. This ends up being passed to C<Perl_fbm_instr> by
C<pp_split> regardless of the engine.
=item RXf_PMf_SINGLELINE
101 The C</s> flag. Guaranteed not to be used outside the regex engine.
105 The C</i> flag. Guaranteed not to be used outside the regex engine.
=item RXf_PMf_EXTENDED
The C</x> flag. Guaranteed not to be used outside the regex
engine. However, if it is present on a regex, C<#> comments will be
stripped by the tokenizer regardless of the engine currently in use.
=item RXf_PMf_KEEPCOPY
119 Set if the pattern is L<SvUTF8()|perlapi/SvUTF8>, set by Perl_pmruntime.
In general these flags should be preserved in regex->extflags after
compilation, although it is possible the regex includes constructs
that change them. The perl engine for instance may upgrade non-utf8
strings to utf8 if the pattern includes constructs such as C<\x{...}>
that can only match unicode values. RXf_SKIPWHITE should always be
preserved verbatim in regex->extflags.
132 I32 exec(regexp* prog,
133 char *stringarg, char* strend, char* strbeg,
134 I32 minend, SV* screamer,
135 void* data, U32 flags);
141 char* intuit( regexp *prog,
142 SV *sv, char *strpos, char *strend,
143 U32 flags, struct re_scream_pos_data_s *data);
145 Find the start position where a regex match should be attempted,
146 or possibly whether the regex engine should not be run because the
147 pattern can't match. This is called as appropriate by the core
148 depending on the values of the extflags member of the regexp
153 SV* checkstr(regexp *prog);
155 Return a SV containing a string that must appear in the pattern. Used
156 by C<split> for optimising matches.
160 void free(regexp *prog);
162 Called by perl when it is freeing a regexp pattern so that the engine
163 can release any resources pointed to by the C<pprivate> member of the
164 regexp structure. This is only responsible for freeing private data;
165 perl will handle releasing anything else contained in the regexp structure.
=head2 numbered_buff_get
169 SV* numbered_buff_get(pTHX_ const REGEXP * const rx, I32 paren, SV* usesv);
=head2 named_buff_get
175 SV* named_buff_get(pTHX_ const REGEXP * const rx, SV* namesv, U32 flags);
181 SV* qr_pkg(pTHX_ const REGEXP * const rx);
183 The package the qr// magic object is blessed into (as seen by C<ref
184 qr//>). It is recommended that engines change this to its package
188 Example_reg_qr_pkg(pTHX_ const REGEXP * const rx)
191 return newSVpvs("re::engine::Example");
Any method calls on an object created with C<qr//> will be dispatched to the
package as with a normal object.
197 use re::engine::Example;
199 $re->meth; # dispatched to re::engine::Example::meth()
201 To retrieve the C<REGEXP> object from the scalar in an XS function use the
212 (sv = (SV*)SvRV(sv)) && /* assignment deliberate */
213 SvTYPE(sv) == SVt_PVMG &&
214 (mg = mg_find(sv, PERL_MAGIC_qr))) /* assignment deliberate */
216 re = (REGEXP *)mg->mg_obj;
Or use the (CURRENTLY UNDOCUMENTED!) C<Perl_get_re_arg> function:
223 const REGEXP * const re = (REGEXP *)Perl_get_re_arg( aTHX_ rv, 0, NULL );
227 void* dupe(const regexp *r, CLONE_PARAMS *param);
On threaded builds a regexp may need to be duplicated so that the pattern
can be used by multiple threads. This routine is expected to handle the
duplication of any private data pointed to by the C<pprivate> member of
the regexp structure. It will be called with the preconstructed new
regexp structure as an argument. The C<pprivate> member will point at
the B<old> private structure, and it is this routine's responsibility to
construct a copy and return a pointer to it (which perl will then use to
overwrite the field as passed to this routine).
238 This allows the engine to dupe its private data but also if necessary
239 modify the final structure if it really must.
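A minimal sketch of that contract, assuming a hypothetical engine whose private data is a single heap buffer. C<toy_private>, C<toy_regexp>, C<toy_dupe> and C<toy_clone_private> are invented for illustration; the real callback also receives C<pTHX_> and a C<CLONE_PARAMS> pointer, which are left out here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical private data for an imaginary engine. */
typedef struct toy_private {
    char  *program;   /* engine-owned compiled form */
    size_t proglen;   /* size of program in bytes */
} toy_private;

/* Stand-in for the regexp structure; only the field used here. */
typedef struct toy_regexp {
    void *pprivate;
} toy_regexp;

/* dupe-style callback: on entry r->pprivate still points at the OLD
   private data. Returns a fresh deep copy, or NULL if allocation fails. */
static void *toy_dupe(const toy_regexp *r)
{
    const toy_private *old = (const toy_private *)r->pprivate;
    toy_private *copy;

    assert(old != NULL);
    copy = (toy_private *)malloc(sizeof *copy);
    if (copy == NULL)
        return NULL;
    /* Allocate at least one byte so an empty program is not mistaken
       for an allocation failure. */
    copy->program = (char *)malloc(old->proglen ? old->proglen : 1);
    if (copy->program == NULL) {
        free(copy);
        return NULL;
    }
    memcpy(copy->program, old->program, old->proglen);
    copy->proglen = old->proglen;
    return copy;
}

/* Caller side: overwrite the field with whatever the callback returned. */
static void toy_clone_private(toy_regexp *fresh)
{
    fresh->pprivate = toy_dupe(fresh);
}
```

The copy is deep on purpose: if both threads shared C<program>, freeing the regexp in one thread would leave the other with a dangling pointer.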
241 On unthreaded builds this field doesn't exist.
=head1 The REGEXP structure
245 The REGEXP struct is defined in F<regexp.h>. All regex engines must be able to
246 correctly build such a structure in their L</comp> routine.
The REGEXP structure contains all the data that perl needs to be aware of
to properly work with the regular expression. It includes data about
optimisations that perl can use to determine if the regex engine should
really be used, and various other control info that is needed to properly
execute patterns in various contexts, such as whether the pattern is
anchored in some way, what flags were used during the compile, or whether
the program contains special constructs that perl needs to be aware of.
In addition it contains two fields that are intended for the private use
of the regex engine that compiled the pattern. These are the C<intflags>
and C<pprivate> members. C<pprivate> is a void pointer to an arbitrary
structure whose use and management is the responsibility of the compiling
engine. perl will never modify either of these values.
262 typedef struct regexp {
263 /* what engine created this regexp? */
264 const struct regexp_engine* engine;
266 /* what re is this a lightweight copy of? */
267 struct regexp* mother_re;
269 /* Information about the match that the perl core uses to manage things */
270 U32 extflags; /* Flags used both externally and internally */
I32 minlen; /* minimum possible length of string to match */
I32 minlenret; /* minimum possible length of $& */
273 U32 gofs; /* chars left of pos that we search from */
275 /* substring data about strings that must appear
276 in the final match, used for optimisations */
277 struct reg_substr_data *substrs;
279 U32 nparens; /* number of capture buffers */
281 /* private engine specific data */
282 U32 intflags; /* Engine Specific Internal flags */
283 void *pprivate; /* Data private to the regex engine which
284 created this object. */
286 /* Data about the last/current match. These are modified during matching*/
287 U32 lastparen; /* last open paren matched */
288 U32 lastcloseparen; /* last close paren matched */
289 regexp_paren_pair *swap; /* Swap copy of *offs */
290 regexp_paren_pair *offs; /* Array of offsets for (@-) and (@+) */
292 char *subbeg; /* saved or original string so \digit works forever. */
293 SV_SAVED_COPY /* If non-NULL, SV which is COW from original */
294 I32 sublen; /* Length of string pointed by subbeg */
296 /* Information about the match that isn't often used */
297 I32 prelen; /* length of precomp */
298 const char *precomp; /* pre-compilation regular expression */
300 /* wrapped can't be const char*, as it is returned by sv_2pv_flags */
301 char *wrapped; /* wrapped version of the pattern */
302 I32 wraplen; /* length of wrapped */
304 I32 seen_evals; /* number of eval groups in the pattern - for security checks */
305 HV *paren_names; /* Optional hash of paren names */
307 /* Refcount of this regexp */
308 I32 refcnt; /* Refcount of this regexp */
311 The fields are discussed in more detail below:
317 This field points at a regexp_engine structure which contains pointers
318 to the subroutines that are to be used for performing a match. It
319 is the compiling routine's responsibility to populate this field before
320 returning the regexp object.
Internally this is set to C<NULL> unless a custom engine is specified in
C<$^H{regcomp}>. Perl's own set of callbacks can be accessed in the struct
pointed to by C<RE_ENGINE_PTR>.
328 TODO, see L<http://www.mail-archive.com/perl5-changes@perl.org/msg17328.html>
This will be used by perl to see what flags the regexp was compiled with. It
will normally be set to the value of the flags parameter passed to L</comp>.
=item C<minlen> C<minlenret>
337 The minimum string length required for the pattern to match. This is used to
338 prune the search space by not bothering to match any closer to the end of a
339 string than would allow a match. For instance there is no point in even
340 starting the regex engine if the minlen is 10 but the string is only 5
341 characters long. There is no way that the pattern can match.
343 C<minlenret> is the minimum length of the string that would be found
346 The difference between C<minlen> and C<minlenret> can be seen in the
351 where the C<minlen> would be 3 but C<minlenret> would only be 2 as the \d is
required to match but is not actually included in the matched content. This
distinction is particularly important as the substitution logic uses the
C<minlenret> to tell whether it can do in-place substitution, which can result
in considerable speedup.
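The pruning that C<minlen> allows can be sketched as follows. This is an illustration only: C<toy_start_positions> is a hypothetical helper, not a function in the perl core, and it counts in bytes, ignoring the character/byte distinction that matters under UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Number of start positions worth trying in [strbeg, strend) when the
   pattern needs at least minlen bytes. A result of 0 means the engine
   need not be started at all. */
static size_t toy_start_positions(const char *strbeg, const char *strend,
                                  size_t minlen)
{
    size_t len;

    assert(strbeg != NULL && strend >= strbeg);
    len = (size_t)(strend - strbeg);
    if (minlen > len)
        return 0;              /* string too short: cannot match */
    return len - minlen + 1;   /* positions too near the end are skipped */
}
```

With a 5 byte string and a C<minlen> of 10 the helper returns 0, which corresponds to the case described above where the engine is never entered.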
359 Left offset from pos() to start match at.
=item C<nparens>, C<lastparen>, and C<lastcloseparen>
367 These fields are used to keep track of how many paren groups could be matched
368 in the pattern, which was the last open paren to be entered, and which was
369 the last close paren to be entered.
373 The engine's private copy of the flags the pattern was compiled with. Usually
374 this is the same as C<extflags> unless the engine chose to modify one of them
378 A void* pointing to an engine-defined data structure. The perl engine uses the
379 C<regexp_internal> structure (see L<perlreguts/Base Structures>) but a custom
380 engine should use something else.
A C<regexp_paren_pair> structure which defines offsets into the string being
matched which correspond to the C<$&> and C<$1>, C<$2> etc. captures. The
C<regexp_paren_pair> struct is defined as follows:
392 typedef struct regexp_paren_pair {
If C<< ->offs[num].start >> or C<< ->offs[num].end >> is C<-1> then that
capture buffer did not match. C<< ->offs[0].start/end >> represents C<$&> (or
C<${^MATCH}> under C<//p>) and C<< ->offs[paren].end >> matches C<$$paren> where
=item C<precomp> C<prelen>
404 Used for debugging purposes. C<precomp> holds a copy of the pattern
405 that was compiled and C<prelen> its length.
409 This is a hash used internally to track named capture buffers and their
offsets. The keys are the names of the buffers; the values are dualvars,
411 with the IV slot holding the number of buffers with the given name and the
412 pv being an embedded array of I32. The values may also be contained
413 independently in the data array in cases where named backreferences are
=item C<reg_substr_data>
418 Holds information on the longest string that must occur at a fixed
419 offset from the start of the pattern, and the longest string that must
420 occur at a floating offset from the start of the pattern. Used to do
Fast-Boyer-Moore searches on the string to find out if it's worth using
422 the regex engine at all, and if so where in the string to search.
=item C<startp>, C<endp>
These fields store arrays that are used to hold the offsets of the beginning
and end of each capture group that has matched. -1 is used to indicate no match.
429 These are the source for @- and @+.
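How the -1 convention is consumed can be sketched like this. The sketch models each capture as a start/end pair in the style of C<regexp_paren_pair>; C<toy_paren_pair> and C<toy_capture> are hypothetical names, and C<long> stands in for Perl's C<I32>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a start/end offset pair; the real type uses I32. */
typedef struct toy_paren_pair {
    long start;
    long end;
} toy_paren_pair;

/* Copy capture number n of subject into buf as a NUL-terminated string.
   Returns the capture length, or -1 if the group did not take part in
   the match or buf is too small to hold it. */
static long toy_capture(const char *subject, const toy_paren_pair *offs,
                        size_t n, char *buf, size_t bufsize)
{
    long len;

    assert(subject != NULL && offs != NULL && buf != NULL);
    if (offs[n].start == -1 || offs[n].end == -1)
        return -1;                          /* group did not match */
    len = offs[n].end - offs[n].start;
    if (len < 0 || (size_t)len >= bufsize)
        return -1;                          /* inconsistent or too long */
    memcpy(buf, subject + offs[n].start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

Index 0 plays the role of the whole match, so a caller would read C<$&> from slot 0 and C<$1>, C<$2> and so on from the following slots.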
=item C<subbeg> C<sublen> C<saved_copy>
433 #define SAVEPVN(p,n) ((p) ? savepvn(p,n) : NULL)
434 if (RX_MATCH_COPIED(ret))
435 ret->subbeg = SAVEPVN(ret->subbeg, ret->sublen);
439 C<PL_sawampersand || rx->extflags & RXf_PMf_KEEPCOPY>
441 These are used during execution phase for managing search and replace
=item C<wrapped> C<wraplen>
446 Stores the string C<qr//> stringifies to, for example C<(?-xism:eek)>
447 in the case of C<qr/eek/>.
When using a custom engine that doesn't support the C<(?:)> construct for
inline modifiers it's best to have C<qr//> stringify to the supplied pattern.
Note that this will create invalid patterns in cases such as:
453 my $x = qr/a|b/; # "a|b"
455 my $z = qr/$x$y/; # "a|bc"
There's no solution for such problems other than making the custom engine
understand some form of inline modifiers.
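The effect is ordinary string concatenation, which the following sketch reproduces outside of perl. C<toy_interpolate> and C<toy_wrap> are hypothetical helpers written for this illustration; they are not how C<Perl_reg_stringify> is implemented.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Join two pattern strings by plain concatenation, which is what
   interpolating them into a larger pattern amounts to.
   Returns 0 on success, -1 if out is too small. */
static int toy_interpolate(char *out, size_t outsize,
                           const char *x, const char *y)
{
    int n;

    assert(out != NULL && x != NULL && y != NULL);
    n = snprintf(out, outsize, "%s%s", x, y);
    return (n >= 0 && (size_t)n < outsize) ? 0 : -1;
}

/* Wrap a pattern in a non-capturing group with explicit modifiers,
   in the style of the "(?-xism:eek)" stringification.
   Returns 0 on success, -1 if out is too small. */
static int toy_wrap(char *out, size_t outsize, const char *pat)
{
    int n;

    assert(out != NULL && pat != NULL);
    n = snprintf(out, outsize, "(?-xism:%s)", pat);
    return (n >= 0 && (size_t)n < outsize) ? 0 : -1;
}
```

Concatenating the bare strings C<a|b> and C<c> yields C<a|bc>, where the alternation now swallows the C<c>. Concatenating the wrapped forms yields C<(?-xism:a|b)(?-xism:c)>, which keeps the two patterns separate.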
460 The C<Perl_reg_stringify> in F<regcomp.c> does the stringification work.
464 This stores the number of eval groups in the pattern. This is used for security
465 purposes when embedding compiled regexes into larger patterns with C<qr//>.
469 The number of times the structure is referenced. When this falls to 0 the
470 regexp is automatically freed by a call to pregfree. This should be set to 1 in
471 each engine's L</comp> routine.
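The lifecycle can be sketched with stand-in types. Everything here is hypothetical (C<toy_regexp>, C<toy_new>, C<toy_ref>, C<toy_unref>, C<toy_engine_free>); the sketch only mirrors the division of labour described in this document, where the engine's free callback releases private data and the caller releases the structure itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the regexp structure; only the fields used here. */
typedef struct toy_regexp {
    int   refcnt;     /* number of references to this structure */
    void *pprivate;   /* engine-owned private data */
} toy_regexp;

/* Counts calls to the engine's free callback, for illustration. */
static int toy_free_calls = 0;

/* Engine side: release private data only, never the structure. */
static void toy_engine_free(toy_regexp *r)
{
    free(r->pprivate);
    r->pprivate = NULL;
    toy_free_calls++;
}

/* comp-style constructor: a new structure starts with refcnt 1. */
static toy_regexp *toy_new(void)
{
    toy_regexp *r = (toy_regexp *)malloc(sizeof *r);
    if (r == NULL)
        return NULL;
    r->refcnt = 1;
    r->pprivate = malloc(8);   /* placeholder private data */
    return r;
}

/* Take an additional reference. */
static toy_regexp *toy_ref(toy_regexp *r)
{
    assert(r != NULL);
    r->refcnt++;
    return r;
}

/* Drop one reference. Returns 1 if the structure was freed. */
static int toy_unref(toy_regexp *r)
{
    assert(r != NULL && r->refcnt > 0);
    if (--r->refcnt > 0)
        return 0;
    toy_engine_free(r);   /* engine releases its private data */
    free(r);              /* caller releases everything else */
    return 1;
}
```

Starting the count at 1 in the constructor means the first holder does not need a separate call to take its reference.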
=head2 De-allocation and Cloning
477 Any patch that adds data items to the REGEXP struct will need to include
478 changes to F<sv.c> (C<Perl_re_dup()>) and F<regcomp.c> (C<pregfree()>). This
479 involves freeing or cloning items in the regexp's data array based on the data
484 Originally part of L<perlreguts>.
488 Originally written by Yves Orton, expanded by E<AElig>var ArnfjE<ouml>rE<eth>
493 Copyright 2006 Yves Orton and 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason.
495 This program is free software; you can redistribute it and/or modify it under
496 the same terms as Perl itself.