Commit | Line | Data |
108003db |
1 | =head1 NAME |
2 | |
3 | perlreapi - perl regular expression plugin interface |
4 | |
5 | =head1 DESCRIPTION |
6 | |
7 | As of Perl 5.9.5 there is a new interface for plugging in regexp engines |
8 | other than the default one. Each engine is supposed to provide access to a |
9 | constant structure of the following format: |
10 | |
11 | typedef struct regexp_engine { |
12 | regexp* (*comp) (pTHX_ char* exp, char* xend, U32 pm_flags); |
13 | I32 (*exec) (pTHX_ regexp* prog, char* stringarg, char* strend, |
14 | char* strbeg, I32 minend, SV* screamer, |
15 | void* data, U32 flags); |
16 | char* (*intuit) (pTHX_ regexp *prog, SV *sv, char *strpos, |
17 | char *strend, U32 flags, |
18 | struct re_scream_pos_data_s *data); |
19 | SV* (*checkstr) (pTHX_ regexp *prog); |
20 | void (*free) (pTHX_ struct regexp* r); |
21 | SV* (*numbered_buff_get) (pTHX_ const REGEXP * const rx, I32 paren, SV* usesv); |
22 | SV* (*named_buff_get)(pTHX_ const REGEXP * const rx, SV* namesv, U32 flags); |
23 | SV* (*qr_pkg)(pTHX_ const REGEXP * const rx); |
24 | #ifdef USE_ITHREADS |
25 | void* (*dupe) (pTHX_ const regexp *r, CLONE_PARAMS *param); |
26 | #endif |
27 | } regexp_engine; |
28 | |
29 | When a regexp is compiled, its C<engine> field is then set to point at |
30 | the appropriate structure so that when it needs to be used Perl can find |
31 | the right routines to do so. |
32 | |
33 | In order to install a new regexp handler, C<$^H{regcomp}> is set |
34 | to an integer which (when cast appropriately) resolves to one of these |
35 | structures. When compiling, the C<comp> method is executed, and the |
36 | resulting regexp structure's engine field is expected to point back at |
37 | the same structure. |
38 | |
39 | The pTHX_ symbol in the definition is a macro used by perl under threading |
40 | to provide an extra argument to the routine holding a pointer back to |
41 | the interpreter that is executing the regexp. So under threading all |
42 | routines get an extra argument. |
43 | |
44 | The routines are as follows: |
45 | |
46 | =head2 comp |
47 | |
48 | regexp* comp(char *exp, char *xend, U32 flags); |
49 | |
50 | Compile the pattern between exp and xend using the given flags and return a |
51 | pointer to a prepared regexp structure that can perform the match. See L</The |
52 | REGEXP structure> below for an explanation of the individual fields in the |
53 | REGEXP struct. |
54 | |
55 | The C<flags> parameter is a bitfield which indicates which of the |
56 | C<msixk> flags the regex was compiled with. In addition it contains |
57 | info about whether C<use locale> is in effect and optimization info |
58 | for C<split>. A regex engine might want to use the same split |
59 | optimizations with a different syntax. For instance, a Perl6 engine |
60 | would treat C<split /^^/> equivalently to perl's C<split /^/>. See the |
61 | L<split documentation|perlfunc> and the relevant code in C<pp_split> |
62 | in F<pp.c> to find out whether your engine should be setting these. |
63 | |
64 | The C<eogc> flags are stripped out before being passed to the comp |
65 | routine. The regex engine does not need to know whether any of these |
66 | are set. |
67 | |
68 | =over 4 |
69 | |
70 | =item RXf_SKIPWHITE |
71 | |
72 | C<split ' '> or C<split> with no arguments (which really means |
73 | C<split(' ', $_)>, see L<split|perlfunc>). |
74 | |
75 | =item RXf_START_ONLY |
76 | |
77 | Set if the pattern is C</^/> (C<< r->prelen == 1 && r->precomp[0] == |
78 | '^' >>). Will be used by the C<split> operator to split the given |
79 | string on C<\n> (even under C</^/s>, see L<split|perlfunc>). |
80 | |
81 | =item RXf_WHITE |
82 | |
83 | Set if the pattern is exactly C</\s+/> and used by C<split>. The |
84 | definition of whitespace varies depending on whether RXf_UTF8 or |
85 | RXf_PMf_LOCALE is set. |
86 | |
87 | =item RXf_PMf_LOCALE |
88 | |
89 | Makes C<split> use the locale-dependent definition of whitespace under C<use |
90 | locale> when RXf_SKIPWHITE or RXf_WHITE is in effect. Under ASCII, whitespace is |
91 | defined as per L<isSPACE|perlapi/ISSPACE>, and by the internal macros |
92 | C<is_utf8_space> under UTF-8 and C<isSPACE_LC> under C<use locale>. |
93 | |
94 | =item RXf_PMf_MULTILINE |
95 | |
96 | The C</m> flag. This ends up being passed to C<Perl_fbm_instr> by |
97 | C<pp_split> regardless of the engine. |
98 | |
99 | =item RXf_PMf_SINGLELINE |
100 | |
101 | The C</s> flag. Guaranteed not to be used outside the regex engine. |
102 | |
103 | =item RXf_PMf_FOLD |
104 | |
105 | The C</i> flag. Guaranteed not to be used outside the regex engine. |
106 | |
107 | =item RXf_PMf_EXTENDED |
108 | |
109 | The C</x> flag. Guaranteed not to be used outside the regex |
110 | engine. However, if present on a regex, C<#> comments will be stripped |
111 | by the tokenizer regardless of the engine currently in use. |
112 | |
113 | =item RXf_PMf_KEEPCOPY |
114 | |
115 | The C</k> flag. |
116 | |
117 | =item RXf_UTF8 |
118 | |
119 | Set if the pattern is L<SvUTF8()|perlapi/SvUTF8>, set by Perl_pmruntime. |
120 | |
121 | =back |
122 | |
123 | In general these flags should be preserved in regex->extflags after |
124 | compilation, although it is possible the regex includes constructs |
125 | that change them. The perl engine for instance may upgrade non-utf8 |
126 | strings to utf8 if the pattern includes constructs such as C<\x{...}> |
127 | that can only match Unicode values. RXf_SKIPWHITE should always be |
128 | preserved verbatim in regex->extflags. |
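
As a rough illustration of the paragraph above, a C<comp> routine could store the incoming flags and OR in anything the engine itself switched on while compiling. This is a minimal, self-contained sketch: the C<ex_regexp> type, the C<EX_RXf_*> bit values and C<ex_store_flags> are hypothetical stand-ins, not the definitions from F<regexp.h>.

```c
#include <stdint.h>

typedef uint32_t U32;   /* stand-in for perl's U32 */

/* Hypothetical bit values for illustration only; the real RXf_*
 * constants live in regexp.h and may differ. */
#define EX_RXf_PMf_FOLD   (1u << 0)
#define EX_RXf_SKIPWHITE  (1u << 1)
#define EX_RXf_UTF8       (1u << 2)

/* Stand-in for the one field of struct regexp that matters here. */
typedef struct ex_regexp {
    U32 extflags;
} ex_regexp;

/* Keep every flag that was passed to comp (so RXf_SKIPWHITE and
 * friends survive verbatim) and add any flags the engine turned on
 * itself, for example after upgrading the pattern to UTF-8. */
static void
ex_store_flags(ex_regexp *rx, U32 pm_flags, U32 engine_added)
{
    rx->extflags = pm_flags | engine_added;
}
```

Because the incoming bits are only ever ORed with additions, nothing the core set (in particular the C<split> hints) is lost.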
129 | |
130 | =head2 exec |
131 | |
132 | I32 exec(regexp* prog, |
133 | char *stringarg, char* strend, char* strbeg, |
134 | I32 minend, SV* screamer, |
135 | void* data, U32 flags); |
136 | |
137 | Execute a regexp. |
138 | |
139 | =head2 intuit |
140 | |
141 | char* intuit( regexp *prog, |
142 | SV *sv, char *strpos, char *strend, |
143 | U32 flags, struct re_scream_pos_data_s *data); |
144 | |
145 | Find the start position where a regex match should be attempted, |
146 | or possibly whether the regex engine should not be run because the |
147 | pattern can't match. This is called as appropriate by the core |
148 | depending on the values of the extflags member of the regexp |
149 | structure. |
150 | |
151 | =head2 checkstr |
152 | |
153 | SV* checkstr(regexp *prog); |
154 | |
155 | Return an SV containing a string that must appear in the pattern. Used |
156 | by C<split> for optimising matches. |
157 | |
158 | =head2 free |
159 | |
160 | void free(regexp *prog); |
161 | |
162 | Called by perl when it is freeing a regexp pattern so that the engine |
163 | can release any resources pointed to by the C<pprivate> member of the |
164 | regexp structure. This is only responsible for freeing private data; |
165 | perl will handle releasing anything else contained in the regexp structure. |
166 | |
167 | =head2 numbered_buff_get |
168 | |
169 | SV* numbered_buff_get(pTHX_ const REGEXP * const rx, I32 paren, SV* usesv); |
170 | |
171 | TODO: document |
172 | |
173 | =head2 named_buff_get |
174 | |
175 | SV* named_buff_get(pTHX_ const REGEXP * const rx, SV* namesv, U32 flags); |
176 | |
177 | TODO: document |
178 | |
179 | =head2 qr_pkg |
180 | |
181 | SV* qr_pkg(pTHX_ const REGEXP * const rx); |
182 | |
183 | The package the qr// magic object is blessed into (as seen by C<ref |
184 | qr//>). It is recommended that engines change this to their own package |
185 | name, for instance: |
186 | |
187 | SV* |
188 | Example_reg_qr_pkg(pTHX_ const REGEXP * const rx) |
189 | { |
190 | PERL_UNUSED_ARG(rx); |
191 | return newSVpvs("re::engine::Example"); |
192 | } |
193 | |
194 | Any method calls on an object created with C<qr//> will be dispatched to the |
195 | package as a normal object. |
196 | |
197 | use re::engine::Example; |
198 | my $re = qr//; |
199 | $re->meth; # dispatched to re::engine::Example::meth() |
200 | |
201 | To retrieve the C<REGEXP> object from the scalar in an XS function use the |
202 | following snippet: |
203 | |
204 | void meth(SV * sv) |
205 | PPCODE: |
206 | MAGIC * mg; |
207 | REGEXP * re; |
208 | |
209 | if (SvMAGICAL(sv)) |
210 | mg_get(sv); |
211 | if (SvROK(sv) && |
212 | (sv = (SV*)SvRV(sv)) && /* assignment deliberate */ |
213 | SvTYPE(sv) == SVt_PVMG && |
214 | (mg = mg_find(sv, PERL_MAGIC_qr))) /* assignment deliberate */ |
215 | { |
216 | re = (REGEXP *)mg->mg_obj; |
217 | } |
218 | |
219 | Or use the (CURRENTLY UNDOCUMENTED!) C<Perl_get_re_arg> function: |
220 | |
221 | void meth(SV * rv) |
222 | PPCODE: |
223 | const REGEXP * const re = (REGEXP *)Perl_get_re_arg( aTHX_ rv, 0, NULL ); |
224 | |
225 | =head2 dupe |
226 | |
227 | void* dupe(const regexp *r, CLONE_PARAMS *param); |
228 | |
229 | On threaded builds a regexp may need to be duplicated so that the pattern |
230 | can be used by multiple threads. This routine is expected to handle the |
231 | duplication of any private data pointed to by the C<pprivate> member of |
232 | the regexp structure. It will be called with the preconstructed new |
233 | regexp structure as an argument. The C<pprivate> member will point at |
234 | the B<old> private structure, and it is this routine's responsibility to |
235 | construct a copy and return a pointer to it (which perl will then use to |
236 | overwrite the field as passed to this routine). |
237 | |
238 | This allows the engine to dupe its private data and, if it really must, |
239 | to modify the final structure as well. |
240 | |
241 | On unthreaded builds this field doesn't exist. |
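
The division of labour between C<free> and C<dupe> described above can be sketched in plain C. Everything here is a hypothetical stand-in: C<ex_private>, C<ex_regexp>, C<ex_dupe> and C<ex_free> are invented for illustration, the C<pTHX_> and C<CLONE_PARAMS> arguments are left out, and a real engine would presumably use perl's own allocation macros rather than C<malloc>.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical private data an engine might hang off pprivate. */
typedef struct ex_private {
    size_t len;       /* number of bytes in program */
    char  *program;   /* engine-specific compiled form */
} ex_private;

/* Stand-in for the one field of struct regexp that matters here. */
typedef struct ex_regexp {
    void *pprivate;
} ex_regexp;

/* dupe-style routine: r->pprivate still points at the OLD private
 * data, so build a deep copy and return it for the caller to store. */
static void *
ex_dupe(const ex_regexp *r)
{
    const ex_private *old = (const ex_private *)r->pprivate;
    ex_private *copy;

    if (old == NULL)
        return NULL;
    copy = malloc(sizeof *copy);
    if (copy == NULL)
        return NULL;
    copy->len = old->len;
    copy->program = malloc(old->len ? old->len : 1);
    if (copy->program == NULL) {
        free(copy);
        return NULL;
    }
    if (old->len)
        memcpy(copy->program, old->program, old->len);
    return copy;
}

/* free-style routine: release only what the engine itself owns. */
static void
ex_free(ex_regexp *r)
{
    ex_private *priv = (ex_private *)r->pprivate;

    if (priv != NULL) {
        free(priv->program);
        free(priv);
        r->pprivate = NULL;
    }
}
```

The copy is deep on purpose: if both structures shared one C<program> buffer, freeing either regexp would leave the other with a dangling pointer.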
242 | |
243 | =head1 The REGEXP structure |
244 | |
245 | The REGEXP struct is defined in F<regexp.h>. All regex engines must be able to |
246 | correctly build such a structure in their L</comp> routine. |
247 | |
248 | The REGEXP structure contains all the data that perl needs to be aware of |
249 | to properly work with the regular expression. It includes data about |
250 | optimisations that perl can use to determine if the regex engine should |
251 | really be used, and various other control info that is needed to properly |
252 | execute patterns in various contexts, such as whether the pattern is anchored |
253 | in some way, what flags were used during the compile, or whether the |
254 | program contains special constructs that perl needs to be aware of. |
255 | |
256 | In addition it contains two fields that are intended for the private use |
257 | of the regex engine that compiled the pattern. These are the C<intflags> |
258 | and C<pprivate> members. C<pprivate> is a void pointer to an arbitrary |
259 | structure whose use and management is the responsibility of the compiling |
260 | engine. perl will never modify either of these values. |
261 | |
262 | typedef struct regexp { |
263 | /* what engine created this regexp? */ |
264 | const struct regexp_engine* engine; |
265 | |
266 | /* what re is this a lightweight copy of? */ |
267 | struct regexp* mother_re; |
268 | |
269 | /* Information about the match that the perl core uses to manage things */ |
270 | U32 extflags; /* Flags used both externally and internally */ |
271 | I32 minlen; /* minimum possible length of string to match */ |
272 | I32 minlenret; /* minimum possible length of $& */ |
273 | U32 gofs; /* chars left of pos that we search from */ |
274 | |
275 | /* substring data about strings that must appear |
276 | in the final match, used for optimisations */ |
277 | struct reg_substr_data *substrs; |
278 | |
279 | U32 nparens; /* number of capture buffers */ |
280 | |
281 | /* private engine specific data */ |
282 | U32 intflags; /* Engine Specific Internal flags */ |
283 | void *pprivate; /* Data private to the regex engine which |
284 | created this object. */ |
285 | |
286 | /* Data about the last/current match. These are modified during matching*/ |
287 | U32 lastparen; /* last open paren matched */ |
288 | U32 lastcloseparen; /* last close paren matched */ |
289 | regexp_paren_pair *swap; /* Swap copy of *offs */ |
290 | regexp_paren_pair *offs; /* Array of offsets for (@-) and (@+) */ |
291 | |
292 | char *subbeg; /* saved or original string so \digit works forever. */ |
293 | SV_SAVED_COPY /* If non-NULL, SV which is COW from original */ |
294 | I32 sublen; /* Length of string pointed by subbeg */ |
295 | |
296 | /* Information about the match that isn't often used */ |
297 | I32 prelen; /* length of precomp */ |
298 | const char *precomp; /* pre-compilation regular expression */ |
299 | |
300 | /* wrapped can't be const char*, as it is returned by sv_2pv_flags */ |
301 | char *wrapped; /* wrapped version of the pattern */ |
302 | I32 wraplen; /* length of wrapped */ |
303 | |
304 | I32 seen_evals; /* number of eval groups in the pattern - for security checks */ |
305 | HV *paren_names; /* Optional hash of paren names */ |
306 | |
307 | /* Refcount of this regexp */ |
308 | I32 refcnt; /* Refcount of this regexp */ |
309 | } regexp; |
310 | |
311 | The fields are discussed in more detail below: |
312 | |
313 | =over 4 |
314 | |
315 | =item C<engine> |
316 | |
317 | This field points at a regexp_engine structure which contains pointers |
318 | to the subroutines that are to be used for performing a match. It |
319 | is the compiling routine's responsibility to populate this field before |
320 | returning the regexp object. |
321 | |
322 | Internally this is set to C<NULL> unless a custom engine is specified in |
323 | C<$^H{regcomp}>. Perl's own set of callbacks can be accessed in the struct |
324 | pointed to by C<RE_ENGINE_PTR>. |
325 | |
326 | =item C<mother_re> |
327 | |
328 | TODO, see L<http://www.mail-archive.com/perl5-changes@perl.org/msg17328.html> |
329 | |
330 | =item C<extflags> |
331 | |
332 | This will be used by perl to see what flags the regexp was compiled with. It |
333 | will normally be set to the value of the flags parameter of L</comp>. |
334 | |
335 | =item C<minlen> C<minlenret> |
336 | |
337 | The minimum string length required for the pattern to match. This is used to |
338 | prune the search space by not bothering to match any closer to the end of a |
339 | string than would allow a match. For instance there is no point in even |
340 | starting the regex engine if the minlen is 10 but the string is only 5 |
341 | characters long. There is no way that the pattern can match. |
342 | |
343 | C<minlenret> is the minimum length of the string that would be found |
344 | in $& after a match. |
345 | |
346 | The difference between C<minlen> and C<minlenret> can be seen in the |
347 | following pattern: |
348 | |
349 | /ns(?=\d)/ |
350 | |
351 | where the C<minlen> would be 3 but C<minlenret> would only be 2 as the \d is |
352 | required to match but is not actually included in the matched content. This |
353 | distinction is particularly important as the substitution logic uses the |
354 | C<minlenret> to tell whether it can do in-place substitution, which can result in |
355 | considerable speedup. |
356 | |
357 | =item C<gofs> |
358 | |
359 | Left offset from pos() to start match at. |
360 | |
361 | =item C<substrs> |
362 | |
363 | TODO: document |
364 | |
365 | =item C<nparens>, C<lastparen>, and C<lastcloseparen> |
366 | |
367 | These fields are used to keep track of how many paren groups could be matched |
368 | in the pattern, which was the last open paren to be entered, and which was |
369 | the last close paren to be entered. |
370 | |
371 | =item C<intflags> |
372 | |
373 | The engine's private copy of the flags the pattern was compiled with. Usually |
374 | this is the same as C<extflags> unless the engine chose to modify one of them. |
375 | |
376 | =item C<pprivate> |
377 | |
378 | A void* pointing to an engine-defined data structure. The perl engine uses the |
379 | C<regexp_internal> structure (see L<perlreguts/Base Structures>) but a custom |
380 | engine should use something else. |
381 | |
382 | =item C<swap> |
383 | |
384 | TODO: document |
385 | |
386 | =item C<offs> |
387 | |
388 | An array of C<regexp_paren_pair> structures which define offsets into the string being |
389 | matched which correspond to the C<$&> and C<$1>, C<$2> etc. captures. The |
390 | C<regexp_paren_pair> struct is defined as follows: |
391 | |
392 | typedef struct regexp_paren_pair { |
393 | I32 start; |
394 | I32 end; |
395 | } regexp_paren_pair; |
396 | |
397 | If C<< ->offs[num].start >> or C<< ->offs[num].end >> is C<-1> then that |
398 | capture buffer did not match. C<< ->offs[0].start/end >> represents C<$&> (or |
399 | C<${^MATCH}> under C<//p>) and C<< ->offs[paren].end >> matches C<$$paren> where |
400 | C<< $paren >= 1 >>. |
401 | |
402 | =item C<precomp> C<prelen> |
403 | |
404 | Used for debugging purposes. C<precomp> holds a copy of the pattern |
405 | that was compiled and C<prelen> its length. |
406 | |
407 | =item C<paren_names> |
408 | |
409 | This is a hash used internally to track named capture buffers and their |
410 | offsets. The keys are the names of the buffers; the values are dualvars, |
411 | with the IV slot holding the number of buffers with the given name and the |
412 | pv being an embedded array of I32. The values may also be contained |
413 | independently in the data array in cases where named backreferences are |
414 | used. |
415 | |
416 | =item C<reg_substr_data> |
417 | |
418 | Holds information on the longest string that must occur at a fixed |
419 | offset from the start of the pattern, and the longest string that must |
420 | occur at a floating offset from the start of the pattern. Used to do |
421 | Fast-Boyer-Moore searches on the string to find out if it's worth using |
422 | the regex engine at all, and if so where in the string to search. |
423 | |
424 | =item C<startp>, C<endp> |
425 | |
426 | These fields store arrays that are used to hold the offsets of the beginning |
427 | and end of each capture group that has matched. -1 is used to indicate no match. |
428 | |
429 | These are the source for @- and @+. |
430 | |
431 | =item C<subbeg> C<sublen> C<saved_copy> |
432 | |
433 | #define SAVEPVN(p,n) ((p) ? savepvn(p,n) : NULL) |
434 | if (RX_MATCH_COPIED(ret)) |
435 | ret->subbeg = SAVEPVN(ret->subbeg, ret->sublen); |
436 | else |
437 | ret->subbeg = NULL; |
438 | |
439 | C<PL_sawampersand || rx->extflags & RXf_PMf_KEEPCOPY> |
440 | |
441 | These are used during execution phase for managing search and replace |
442 | patterns. |
443 | |
444 | =item C<wrapped> C<wraplen> |
445 | |
446 | Stores the string C<qr//> stringifies to, for example C<(?-xism:eek)> |
447 | in the case of C<qr/eek/>. |
448 | |
449 | When using a custom engine that doesn't support the C<(?:)> construct for |
450 | inline modifiers it's best to have C<qr//> stringify to the supplied pattern. |
451 | Note that this will create invalid patterns in cases such as: |
452 | |
453 | my $x = qr/a|b/; # "a|b" |
454 | my $y = qr/c/; # "c" |
455 | my $z = qr/$x$y/; # "a|bc" |
456 | |
457 | There's no solution for such problems other than making the custom engine |
458 | understand some form of inline modifiers. |
459 | |
460 | The C<Perl_reg_stringify> function in F<regcomp.c> does the stringification work. |
461 | |
462 | =item C<seen_evals> |
463 | |
464 | This stores the number of eval groups in the pattern. This is used for security |
465 | purposes when embedding compiled regexes into larger patterns with C<qr//>. |
466 | |
467 | =item C<refcnt> |
468 | |
469 | The number of times the structure is referenced. When this falls to 0 the |
470 | regexp is automatically freed by a call to pregfree. This should be set to 1 in |
471 | each engine's L</comp> routine. |
472 | |
473 | =back |
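
Two of the fields described above lend themselves to a short sketch: C<minlen>, which lets a caller skip the engine when the subject is too short, and C<offs>, where C<-1> marks a capture buffer that did not match. The code below is a simplified, self-contained model. C<ex_regexp>, C<ex_paren_pair>, C<ex_worth_trying> and C<ex_capture_len> are hypothetical names, and the real core performs these checks in its own way.

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t I32;   /* stand-in for perl's I32 */

/* Same shape as regexp_paren_pair, renamed to mark it as a stand-in. */
typedef struct ex_paren_pair {
    I32 start;
    I32 end;
} ex_paren_pair;

/* Stand-in for the fields of struct regexp used in this sketch. */
typedef struct ex_regexp {
    I32            minlen;  /* minimum subject length that can match */
    ex_paren_pair *offs;    /* offs[0] is $&, offs[n] is $n */
} ex_regexp;

/* Return 1 if the subject is long enough for a match to be possible,
 * 0 if the engine need not be started at all. */
static int
ex_worth_trying(const ex_regexp *rx, I32 subject_len)
{
    return subject_len >= rx->minlen;
}

/* Return the length of capture n, or -1 if that buffer did not match.
 * The caller must ensure n is within the bounds of rx->offs. */
static I32
ex_capture_len(const ex_regexp *rx, size_t n)
{
    const ex_paren_pair *p = &rx->offs[n];

    if (p->start == -1 || p->end == -1)
        return -1;
    return p->end - p->start;
}
```

With C<minlen> set to 3, as for C</ns(?=\d)/>, a two-character subject is rejected before any matching work is done, while a successful match against C<"ns1"> would leave C<offs[0]> spanning only the two characters that end up in C<$&>.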
474 | |
475 | =head2 De-allocation and Cloning |
476 | |
477 | Any patch that adds data items to the REGEXP struct will need to include |
478 | changes to F<sv.c> (C<Perl_re_dup()>) and F<regcomp.c> (C<pregfree()>). This |
479 | involves freeing or cloning items in the regexp's data array based on the data |
480 | item's type. |
481 | |
482 | =head1 HISTORY |
483 | |
484 | Originally part of L<perlreguts>. |
485 | |
486 | =head1 AUTHORS |
487 | |
488 | Originally written by Yves Orton, expanded by E<AElig>var ArnfjE<ouml>rE<eth> |
489 | Bjarmason. |
490 | |
491 | =head1 LICENSE |
492 | |
493 | Copyright 2006 Yves Orton and 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason. |
494 | |
495 | This program is free software; you can redistribute it and/or modify it under |
496 | the same terms as Perl itself. |
497 | |
498 | =cut |