Commit | Line | Data |
108003db |
1 | =head1 NAME |
2 | |
3 | perlreapi - perl regular expression plugin interface |
4 | |
5 | =head1 DESCRIPTION |
6 | |
c998b245 |
7 | As of Perl 5.9.5 there is a new interface for plugging in regexp |
8 | engines other than the default one. Each engine is supposed to provide |
9 | access to a constant structure of the following format: |
108003db |
10 | |
11 | typedef struct regexp_engine { |
3ab4a224 |
12 | REGEXP* (*comp) (pTHX_ const SV * const pattern, const U32 flags); |
49d7dfbc |
13 | I32 (*exec) (pTHX_ REGEXP * const rx, char* stringarg, char* strend, |
2fdbfb4d |
14 | char* strbeg, I32 minend, SV* screamer, |
15 | void* data, U32 flags); |
49d7dfbc |
16 | char* (*intuit) (pTHX_ REGEXP * const rx, SV *sv, char *strpos, |
2fdbfb4d |
17 | char *strend, U32 flags, |
18 | struct re_scream_pos_data_s *data); |
49d7dfbc |
19 | SV* (*checkstr) (pTHX_ REGEXP * const rx); |
20 | void (*free) (pTHX_ REGEXP * const rx); |
2fdbfb4d |
21 | void (*numbered_buff_FETCH) (pTHX_ REGEXP * const rx, const I32 paren, |
22 | SV * const sv); |
23 | void (*numbered_buff_STORE) (pTHX_ REGEXP * const rx, const I32 paren, |
24 | SV const * const value); |
25 | I32 (*numbered_buff_LENGTH) (pTHX_ REGEXP * const rx, const SV * const sv, |
26 | const I32 paren); |
192b9cd1 |
27 | SV* (*named_buff) (pTHX_ REGEXP * const rx, SV * const key, |
28 | SV * const value, U32 flags); |
29 | SV* (*named_buff_iter) (pTHX_ REGEXP * const rx, const SV * const lastkey, |
30 | const U32 flags); |
49d7dfbc |
31 | SV* (*qr_package)(pTHX_ REGEXP * const rx); |
108003db |
32 | #ifdef USE_ITHREADS |
49d7dfbc |
33 | void* (*dupe) (pTHX_ REGEXP * const rx, CLONE_PARAMS *param); |
108003db |
34 | #endif |
 | } regexp_engine; |
108003db |
35 | |
36 | When a regexp is compiled, its C<engine> field is then set to point at |
37 | the appropriate structure so that when it needs to be used Perl can find |
38 | the right routines to do so. |
39 | |
40 | In order to install a new regexp handler, C<$^H{regcomp}> is set |
41 | to an integer which (when cast appropriately) resolves to one of these |
42 | structures. When compiling, the C<comp> method is executed, and the |
43 | resulting regexp structure's engine field is expected to point back at |
44 | the same structure. |
45 | |
46 | The pTHX_ symbol in the definition is a macro used by perl under threading |
47 | to provide an extra argument to the routine holding a pointer back to |
48 | the interpreter that is executing the regexp. So under threading all |
49 | routines get an extra argument. |
50 | |
882227b7 |
51 | =head1 Callbacks |
108003db |
52 | |
53 | =head2 comp |
54 | |
3ab4a224 |
55 | REGEXP* comp(pTHX_ const SV * const pattern, const U32 flags); |
108003db |
56 | |
3ab4a224 |
57 | Compile the pattern stored in C<pattern> using the given C<flags> and |
58 | return a pointer to a prepared C<REGEXP> structure that can perform |
59 | the match. See L</The REGEXP structure> below for an explanation of |
60 | the individual fields in the REGEXP struct. |
61 | |
62 | The C<pattern> parameter is the scalar that was used as the |
63 | pattern. Previous versions of perl would pass two C<char*> indicating |
64 | the start and end of the stringified pattern; the following snippet can |
65 | be used to get the old parameters: |
66 | |
67 | STRLEN plen; |
68 | char* exp = SvPV(pattern, plen); |
69 | char* xend = exp + plen; |
70 | |
71 | Since any scalar can be passed as a pattern it's possible to implement |
72 | an engine that does something with an array (C<< "ook" =~ [ qw/ eek |
73 | hlagh / ] >>) or with the non-stringified form of a compiled regular |
74 | expression (C<< "ook" =~ qr/eek/ >>). Perl's own engine will always |
75 | stringify everything using the snippet above but that doesn't mean |
76 | other engines have to. |
108003db |
77 | |
78 | The C<flags> parameter is a bitfield which indicates which of the |
c998b245 |
79 | C<msixp> flags the regex was compiled with. It also contains |
80 | additional info such as whether C<use locale> is in effect. |
108003db |
81 | |
82 | The C<eogc> flags are stripped out before being passed to the comp |
83 | routine. The regex engine does not need to know whether any of these |
3ab4a224 |
84 | are set as those flags should only affect what perl does with the |
c998b245 |
85 | pattern and its match variables, not how it gets compiled and |
86 | executed. |
108003db |
87 | |
c998b245 |
88 | By the time the comp callback is called, some of these flags have |
89 | already had effect (noted below where applicable). However most of |
90 | their effect occurs after the comp callback has run in routines that |
91 | read the C<< rx->extflags >> field which it populates. |
108003db |
92 | |
c998b245 |
93 | In general the flags should be preserved in C<< rx->extflags >> after |
94 | compilation, although the regex engine might want to add or delete |
95 | some of them to invoke or disable some special behavior in perl. The |
96 | flags along with any special behavior they cause are documented below: |
108003db |
97 | |
c998b245 |
98 | The pattern modifiers: |
108003db |
99 | |
c998b245 |
100 | =over 4 |
108003db |
101 | |
c998b245 |
102 | =item C</m> - RXf_PMf_MULTILINE |
108003db |
103 | |
c998b245 |
104 | If this is in C<< rx->extflags >> it will be passed to |
105 | C<Perl_fbm_instr> by C<pp_split> which will treat the subject string |
106 | as a multi-line string. |
108003db |
107 | |
c998b245 |
108 | =item C</s> - RXf_PMf_SINGLELINE |
108003db |
109 | |
c998b245 |
110 | =item C</i> - RXf_PMf_FOLD |
108003db |
111 | |
c998b245 |
112 | =item C</x> - RXf_PMf_EXTENDED |
108003db |
113 | |
c998b245 |
114 | If present on a regex, C<#> comments will be handled differently by the |
115 | tokenizer in some cases. |
108003db |
116 | |
c998b245 |
117 | TODO: Document those cases. |
108003db |
118 | |
c998b245 |
119 | =item C</p> - RXf_PMf_KEEPCOPY |
108003db |
120 | |
c998b245 |
121 | =back |
108003db |
122 | |
c998b245 |
123 | Additional flags: |
108003db |
124 | |
c998b245 |
125 | =over 4 |
108003db |
126 | |
c998b245 |
127 | =item RXf_PMf_LOCALE |
108003db |
128 | |
c998b245 |
129 | Set if C<use locale> is in effect. If present in C<< rx->extflags >> |
130 | C<split> will use the locale-dependent definition of whitespace |
131 | when RXf_SKIPWHITE or RXf_WHITE is in effect. Under ASCII whitespace |
132 | is defined as per L<isSPACE|perlapi/ISSPACE>, and by the internal |
133 | macros C<is_utf8_space> under UTF-8 and C<isSPACE_LC> under C<use |
134 | locale>. |
108003db |
135 | |
136 | =item RXf_UTF8 |
137 | |
138 | Set if the pattern is L<SvUTF8()|perlapi/SvUTF8>, set by Perl_pmruntime. |
139 | |
c998b245 |
140 | A regex engine may want to set or disable this flag during |
141 | compilation. The perl engine for instance may upgrade non-UTF-8 |
142 | strings to UTF-8 if the pattern includes constructs such as C<\x{...}> |
143 | that can only match Unicode values. |
144 | |
0ac6acae |
145 | =item RXf_SPLIT |
146 | |
147 | If C<split> is invoked as C<split ' '> or with no arguments (which |
5137fa37 |
148 | really means C<split(' ', $_)>, see L<split|perlfunc/split>), perl will |
0ac6acae |
149 | set this flag. The regex engine can then check for it and set the |
150 | SKIPWHITE and WHITE extflags. To do this the perl engine does: |
151 | |
152 | if (flags & RXf_SPLIT && r->prelen == 1 && r->precomp[0] == ' ') |
153 | r->extflags |= (RXf_SKIPWHITE|RXf_WHITE); |
154 | |
108003db |
155 | =back |
156 | |
c998b245 |
157 | These flags can be set during compilation to enable optimizations in |
158 | the C<split> operator. |
159 | |
160 | =over 4 |
161 | |
0ac6acae |
162 | =item RXf_SKIPWHITE |
163 | |
164 | If the flag is present in C<< rx->extflags >> C<split> will delete |
165 | whitespace from the start of the subject string before it's operated |
166 | on. What is considered whitespace depends on whether the subject is a |
167 | UTF-8 string and whether the C<RXf_PMf_LOCALE> flag is set. |
168 | |
169 | If RXf_WHITE is set in addition to this flag C<split> will behave like |
170 | C<split " "> under the perl engine. |
171 | |
c998b245 |
172 | =item RXf_START_ONLY |
173 | |
174 | Tells the split operator to split the target string on newlines |
175 | (C<\n>) without invoking the regex engine. |
176 | |
177 | Perl's engine sets this if the pattern is C</^/> (C<plen == 1 && *exp |
178 | == '^'>), even under C</^/s>; see L<split|perlfunc/split>. Of course a |
179 | different regex engine might want to use the same optimizations |
180 | with a different syntax. |
181 | |
182 | =item RXf_WHITE |
183 | |
184 | Tells the split operator to split the target string on whitespace |
185 | without invoking the regex engine. The definition of whitespace varies |
186 | depending on whether the target string is a UTF-8 string and on |
187 | whether RXf_PMf_LOCALE is set. |
188 | |
0ac6acae |
189 | Perl's engine sets this flag if the pattern is C<\s+>. |
c998b245 |
190 | |
640f820d |
191 | =item RXf_NULL |
192 | |
193 | Tells the split operator to split the target string on |
194 | characters. The definition of character varies depending on whether |
195 | the target string is a UTF-8 string. |
196 | |
197 | Perl's engine sets this flag on empty patterns. This optimization |
198 | makes C<split //> much faster than it would otherwise be; it's even |
199 | faster than C<unpack>. |
200 | |
c998b245 |
201 | =back |
108003db |
202 | |
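As a rough illustration of the C<split> optimizations described above, the
sketch below shows how an engine's C<comp> routine might derive the extra
C<extflags> bits from the pattern text. It is self-contained and does not
include perl's headers: the C<EX_> constants (and their values) and the
function C<example_split_flags> are hypothetical stand-ins for the real
C<RXf_> flags, and the decision rules only mirror what the text above says
perl's own engine does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for RXf_SPLIT and friends; the real
 * constants and their values are defined by perl, not here. */
#define EX_SPLIT      0x01u
#define EX_SKIPWHITE  0x02u
#define EX_START_ONLY 0x04u
#define EX_WHITE      0x08u
#define EX_NULL       0x10u

/* Return the flags passed to comp plus any split optimization
 * flags suggested by the pattern text (pat, plen). */
unsigned int
example_split_flags(unsigned int flags, const char *pat, size_t plen)
{
    unsigned int extflags = flags;

    assert(pat != NULL || plen == 0);

    if ((flags & EX_SPLIT) && plen == 1 && pat[0] == ' ')
        extflags |= (EX_SKIPWHITE | EX_WHITE);   /* split ' '   */
    else if (plen == 0)
        extflags |= EX_NULL;                     /* split //    */
    else if (plen == 1 && pat[0] == '^')
        extflags |= EX_START_ONLY;               /* split /^/   */
    else if (plen == 3 && memcmp(pat, "\\s+", 3) == 0)
        extflags |= EX_WHITE;                    /* split /\s+/ */

    return extflags;
}
```

An engine with a different pattern syntax would replace the textual tests
with whatever spelling it uses for the same four cases.
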
203 | =head2 exec |
204 | |
49d7dfbc |
205 | I32 exec(pTHX_ REGEXP * const rx, |
108003db |
206 | char *stringarg, char* strend, char* strbeg, |
207 | I32 minend, SV* screamer, |
208 | void* data, U32 flags); |
209 | |
210 | Execute a regexp. |
211 | |
212 | =head2 intuit |
213 | |
49d7dfbc |
214 | char* intuit(pTHX_ REGEXP * const rx, |
108003db |
215 | SV *sv, char *strpos, char *strend, |
49d7dfbc |
216 | const U32 flags, struct re_scream_pos_data_s *data); |
108003db |
217 | |
218 | Find the start position where a regex match should be attempted, |
219 | or possibly whether the regex engine should not be run because the |
220 | pattern can't match. This is called as appropriate by the core |
221 | depending on the values of the extflags member of the regexp |
222 | structure. |
223 | |
224 | =head2 checkstr |
225 | |
49d7dfbc |
226 | SV* checkstr(pTHX_ REGEXP * const rx); |
108003db |
227 | |
228 | Return a SV containing a string that must appear in the pattern. Used |
229 | by C<split> for optimising matches. |
230 | |
231 | =head2 free |
232 | |
49d7dfbc |
233 | void free(pTHX_ REGEXP * const rx); |
108003db |
234 | |
235 | Called by perl when it is freeing a regexp pattern so that the engine |
236 | can release any resources pointed to by the C<pprivate> member of the |
237 | regexp structure. This is only responsible for freeing private data; |
238 | perl will handle releasing anything else contained in the regexp structure. |
239 | |
192b9cd1 |
240 | =head2 Numbered capture callbacks |
108003db |
241 | |
192b9cd1 |
242 | Called to get/set the value of C<$`>, C<$'>, C<$&> and their named |
243 | equivalents, C<${^PREMATCH}>, C<${^POSTMATCH}> and C<${^MATCH}>, as well as the |
244 | numbered capture buffers (C<$1>, C<$2>, ...). |
49d7dfbc |
245 | |
246 | The C<paren> parameter will be C<-2> for C<$`>, C<-1> for C<$'>, C<0> |
247 | for C<$&>, C<1> for C<$1> and so forth. |
248 | |
192b9cd1 |
249 | The names have been chosen by analogy with L<Tie::Scalar> method |
250 | names, with an additional B<LENGTH> callback for efficiency. However |
251 | named capture variables are currently not tied internally but |
252 | implemented via magic. |
253 | |
254 | =head3 numbered_buff_FETCH |
255 | |
256 | void numbered_buff_FETCH(pTHX_ REGEXP * const rx, const I32 paren, |
257 | SV * const sv); |
258 | |
259 | Fetch a specified numbered capture. C<sv> should be set to the scalar |
260 | to return. The scalar is passed as an argument rather than being |
261 | returned from the function because perl already has a scalar to store |
262 | the value in when the callback is invoked, so creating another one would be |
263 | redundant. The scalar can be set with C<sv_setsv>, C<sv_setpvn> and |
264 | friends; see L<perlapi>. |
49d7dfbc |
265 | |
266 | This callback is where perl untaints its own capture variables under |
c998b245 |
267 | taint mode (see L<perlsec>). See the C<Perl_reg_numbered_buff_fetch> |
49d7dfbc |
268 | function in F<regcomp.c> for how to untaint capture variables if |
269 | that's something you'd like your engine to do as well. |
108003db |
270 | |
192b9cd1 |
271 | =head3 numbered_buff_STORE |
108003db |
272 | |
2fdbfb4d |
273 | void (*numbered_buff_STORE) (pTHX_ REGEXP * const rx, const I32 paren, |
274 | SV const * const value); |
108003db |
275 | |
192b9cd1 |
276 | Set the value of a numbered capture variable. C<value> is the scalar |
277 | that is to be used as the new value. It's up to the engine to make |
278 | sure this is used as the new value (or reject it). |
2fdbfb4d |
279 | |
280 | Example: |
281 | |
282 | if ("ook" =~ /(o*)/) { |
283 | # `paren' will be `1' and `value' will be `ee' |
284 | $1 =~ tr/o/e/; |
285 | } |
286 | |
287 | Perl's own engine will croak on any attempt to modify the capture |
288 | variables. To do this in another engine, use the following callback |
289 | (copied from C<Perl_reg_numbered_buff_store>): |
290 | |
291 | void |
292 | Example_reg_numbered_buff_store(pTHX_ REGEXP * const rx, const I32 paren, |
293 | SV const * const value) |
294 | { |
295 | PERL_UNUSED_ARG(rx); |
296 | PERL_UNUSED_ARG(paren); |
297 | PERL_UNUSED_ARG(value); |
298 | |
299 | if (!PL_localizing) |
300 | Perl_croak(aTHX_ PL_no_modify); |
301 | } |
302 | |
303 | Actually perl 5.10 will not I<always> croak in a statement that looks |
304 | like it would modify a numbered capture variable. This is because the |
305 | STORE callback will not be called if perl can determine that it |
306 | doesn't have to modify the value. This is exactly how tied variables |
307 | behave in the same situation: |
308 | |
309 | package CaptureVar; |
310 | use base 'Tie::Scalar'; |
311 | |
312 | sub TIESCALAR { bless [] } |
313 | sub FETCH { undef } |
314 | sub STORE { die "This doesn't get called" } |
315 | |
316 | package main; |
317 | |
318 | tie my $sv => "CaptureVar"; |
319 | $sv =~ y/a/b/; |
320 | |
321 | Because C<$sv> is C<undef> when the C<y///> operator is applied to it |
322 | the transliteration won't actually execute and the program won't |
192b9cd1 |
323 | C<die>. This is different to how 5.8 and earlier versions behaved |
324 | since the capture variables were READONLY variables then, now they'll |
325 | just die when assigned to in the default engine. |
2fdbfb4d |
326 | |
192b9cd1 |
327 | =head3 numbered_buff_LENGTH |
2fdbfb4d |
328 | |
329 | I32 numbered_buff_LENGTH (pTHX_ REGEXP * const rx, const SV * const sv, |
330 | const I32 paren); |
331 | |
332 | Get the C<length> of a capture variable. There's a special callback |
333 | for this so that perl doesn't have to do a FETCH and run C<length> on |
192b9cd1 |
334 | the result. Since the length is (in perl's case) known from the offsets |
335 | stored in C<< rx->offs >>, this is much more efficient: |
2fdbfb4d |
336 | |
337 | I32 s1 = rx->offs[paren].start; |
338 | I32 s2 = rx->offs[paren].end; |
339 | I32 len = s2 - s1; |
340 | |
341 | This is a little bit more complex in the case of UTF-8, see what |
342 | C<Perl_reg_numbered_buff_length> does with |
343 | L<is_utf8_string_loclen|perlapi/is_utf8_string_loclen>. |
344 | |
192b9cd1 |
345 | =head2 Named capture callbacks |
346 | |
347 | Called to get/set the value of C<%+> and C<%-> as well as by some |
348 | utility functions in L<re>. |
349 | |
350 | There are two callbacks: C<named_buff> is called in all the cases where the |
351 | FETCH, STORE, DELETE, CLEAR, EXISTS and SCALAR L<Tie::Hash> callbacks |
352 | would be called on changes to C<%+> and C<%->, and C<named_buff_iter> in the |
353 | same cases as FIRSTKEY and NEXTKEY. |
354 | |
355 | The C<flags> parameter can be used to determine which of these |
356 | operations the callbacks should respond to. The following flags are |
357 | currently defined: |
358 | |
359 | Which L<Tie::Hash> operation is being performed from the Perl level on |
360 | C<%+> or C<%->, if any: |
361 | |
f1b875a0 |
362 | RXapif_FETCH |
363 | RXapif_STORE |
364 | RXapif_DELETE |
365 | RXapif_CLEAR |
366 | RXapif_EXISTS |
367 | RXapif_SCALAR |
368 | RXapif_FIRSTKEY |
369 | RXapif_NEXTKEY |
192b9cd1 |
370 | |
371 | Whether C<%+> or C<%-> is being operated on, if any. |
2fdbfb4d |
372 | |
f1b875a0 |
373 | RXapif_ONE /* %+ */ |
374 | RXapif_ALL /* %- */ |
2fdbfb4d |
375 | |
192b9cd1 |
376 | Whether this is being called as C<re::regname>, C<re::regnames> or |
c998b245 |
377 | C<re::regnames_count>, if any. The first two will be combined with |
f1b875a0 |
378 | C<RXapif_ONE> or C<RXapif_ALL>. |
192b9cd1 |
379 | |
f1b875a0 |
380 | RXapif_REGNAME |
381 | RXapif_REGNAMES |
382 | RXapif_REGNAMES_COUNT |
192b9cd1 |
383 | |
384 | Internally C<%+> and C<%-> are implemented with a real tied interface |
385 | via L<Tie::Hash::NamedCapture>. The methods in that package will call |
386 | back into these functions. However the usage of |
387 | L<Tie::Hash::NamedCapture> for this purpose might change in future |
388 | releases. For instance this might be implemented by magic instead |
389 | (would need an extension to mgvtbl). |
390 | |
391 | =head3 named_buff |
392 | |
393 | SV* (*named_buff) (pTHX_ REGEXP * const rx, SV * const key, |
394 | SV * const value, U32 flags); |
395 | |
396 | =head3 named_buff_iter |
397 | |
398 | SV* (*named_buff_iter) (pTHX_ REGEXP * const rx, const SV * const lastkey, |
399 | const U32 flags); |
108003db |
400 | |
49d7dfbc |
401 | =head2 qr_package |
108003db |
402 | |
49d7dfbc |
403 | SV* qr_package(pTHX_ REGEXP * const rx); |
108003db |
404 | |
405 | The package the qr// magic object is blessed into (as seen by C<ref |
49d7dfbc |
406 | qr//>). It is recommended that engines change this to their package |
407 | name for identification regardless of whether they implement methods |
408 | on the object. |
409 | |
192b9cd1 |
410 | The package this method returns should also have the internal |
411 | C<Regexp> package in its C<@ISA>. C<< qr//->isa("Regexp") >> should always |
412 | be true regardless of what engine is being used. |
413 | |
414 | An example implementation might be: |
108003db |
415 | |
416 | SV* |
192b9cd1 |
417 | Example_qr_package(pTHX_ REGEXP * const rx) |
108003db |
418 | { |
419 | PERL_UNUSED_ARG(rx); |
420 | return newSVpvs("re::engine::Example"); |
421 | } |
422 | |
423 | Any method calls on an object created with C<qr//> will be dispatched to the |
424 | package as a normal object. |
425 | |
426 | use re::engine::Example; |
427 | my $re = qr//; |
428 | $re->meth; # dispatched to re::engine::Example::meth() |
429 | |
f7e71195 |
430 | To retrieve the C<REGEXP> object from the scalar in an XS function use |
431 | the C<SvRX> macro, see L<"REGEXP Functions" in perlapi|perlapi/REGEXP |
432 | Functions>. |
108003db |
433 | |
434 | void meth(SV * rv) |
435 | PPCODE: |
f7e71195 |
436 | REGEXP * re = SvRX(rv); |
108003db |
437 | |
108003db |
438 | =head2 dupe |
439 | |
49d7dfbc |
440 | void* dupe(pTHX_ REGEXP * const rx, CLONE_PARAMS *param); |
108003db |
441 | |
442 | On threaded builds a regexp may need to be duplicated so that the pattern |
443 | can be used by multiple threads. This routine is expected to handle the |
444 | duplication of any private data pointed to by the C<pprivate> member of |
445 | the regexp structure. It will be called with the preconstructed new |
446 | regexp structure as an argument; the C<pprivate> member will point at |
447 | the B<old> private structure, and it is this routine's responsibility to |
448 | construct a copy and return a pointer to it (which perl will then use to |
449 | overwrite the field as passed to this routine). |
450 | |
451 | This allows the engine to dupe its private data but also if necessary |
452 | modify the final structure if it really must. |
453 | |
454 | On unthreaded builds this field doesn't exist. |
455 | |
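The contract described above (receive a structure whose C<pprivate> still
points at the old private data, return a fresh copy) might look roughly like
the following. This is a minimal sketch under stated assumptions, not perl's
implementation: C<example_private>, C<example_rx> and C<example_dupe> are
hypothetical names, the interpreter and C<CLONE_PARAMS> arguments are omitted,
and plain C<malloc> stands in for whatever allocator a real engine would use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical private data an engine might hang off pprivate */
typedef struct example_private {
    size_t len;
    char  *program;   /* heap-allocated, owned by this struct */
} example_private;

/* Cut-down stand-in for the regexp struct; only pprivate matters here */
typedef struct example_rx {
    void *pprivate;
} example_rx;

/* Deep-copy the old private data still referenced by rx->pprivate and
 * return the copy; the caller stores the result back into pprivate.
 * Returns NULL if allocation fails. */
void *
example_dupe(const example_rx *rx)
{
    assert(rx != NULL && rx->pprivate != NULL);

    const example_private *old = rx->pprivate;
    example_private *copy = malloc(sizeof *copy);
    if (copy == NULL)
        return NULL;

    copy->len = old->len;
    copy->program = malloc(old->len ? old->len : 1);
    if (copy->program == NULL) {
        free(copy);
        return NULL;
    }
    if (old->len)
        memcpy(copy->program, old->program, old->len);
    return copy;
}
```

The copy shares no pointers with the original, so each thread can free its
own private data independently in its C<free> callback.
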
456 | =head1 The REGEXP structure |
457 | |
458 | The REGEXP struct is defined in F<regexp.h>. All regex engines must be able to |
459 | correctly build such a structure in their L</comp> routine. |
460 | |
461 | The REGEXP structure contains all the data that perl needs to be aware of |
462 | to properly work with the regular expression. It includes data about |
463 | optimisations that perl can use to determine if the regex engine should |
464 | really be used, and various other control info that is needed to properly |
465 | execute patterns in various contexts, such as whether the pattern is anchored in |
466 | some way, what flags were used during the compile, or whether the |
467 | program contains special constructs that perl needs to be aware of. |
468 | |
882227b7 |
469 | In addition it contains two fields that are intended for the private |
470 | use of the regex engine that compiled the pattern. These are the |
471 | C<intflags> and C<pprivate> members. C<pprivate> is a void pointer to |
472 | an arbitrary structure whose use and management is the responsibility |
473 | of the compiling engine. perl will never modify either of these |
474 | values. |
108003db |
475 | |
476 | typedef struct regexp { |
477 | /* what engine created this regexp? */ |
478 | const struct regexp_engine* engine; |
479 | |
480 | /* what re is this a lightweight copy of? */ |
481 | struct regexp* mother_re; |
482 | |
483 | /* Information about the match that the perl core uses to manage things */ |
484 | U32 extflags; /* Flags used both externally and internally */ |
485 | I32 minlen; /* minimum possible length of string to match */ |
486 | I32 minlenret; /* minimum possible length of $& */ |
487 | U32 gofs; /* chars left of pos that we search from */ |
488 | |
489 | /* substring data about strings that must appear |
490 | in the final match, used for optimisations */ |
491 | struct reg_substr_data *substrs; |
492 | |
493 | U32 nparens; /* number of capture buffers */ |
494 | |
495 | /* private engine specific data */ |
496 | U32 intflags; /* Engine Specific Internal flags */ |
497 | void *pprivate; /* Data private to the regex engine which |
498 | created this object. */ |
499 | |
500 | /* Data about the last/current match. These are modified during matching*/ |
501 | U32 lastparen; /* last open paren matched */ |
502 | U32 lastcloseparen; /* last close paren matched */ |
503 | regexp_paren_pair *swap; /* Swap copy of *offs */ |
504 | regexp_paren_pair *offs; /* Array of offsets for (@-) and (@+) */ |
505 | |
506 | char *subbeg; /* saved or original string so \digit works forever. */ |
507 | SV_SAVED_COPY /* If non-NULL, SV which is COW from original */ |
508 | I32 sublen; /* Length of string pointed by subbeg */ |
509 | |
510 | /* Information about the match that isn't often used */ |
511 | I32 prelen; /* length of precomp */ |
512 | const char *precomp; /* pre-compilation regular expression */ |
513 | |
108003db |
514 | char *wrapped; /* wrapped version of the pattern */ |
515 | I32 wraplen; /* length of wrapped */ |
516 | |
517 | I32 seen_evals; /* number of eval groups in the pattern - for security checks */ |
518 | HV *paren_names; /* Optional hash of paren names */ |
519 | |
520 | /* Refcount of this regexp */ |
521 | I32 refcnt; /* Refcount of this regexp */ |
522 | } regexp; |
523 | |
524 | The fields are discussed in more detail below: |
525 | |
882227b7 |
526 | =head2 C<engine> |
108003db |
527 | |
528 | This field points at a regexp_engine structure which contains pointers |
529 | to the subroutines that are to be used for performing a match. It |
530 | is the compiling routine's responsibility to populate this field before |
531 | returning the regexp object. |
532 | |
533 | Internally this is set to C<NULL> unless a custom engine is specified in |
534 | C<$^H{regcomp}>; perl's own set of callbacks can be accessed in the struct |
535 | pointed to by C<RE_ENGINE_PTR>. |
536 | |
882227b7 |
537 | =head2 C<mother_re> |
108003db |
538 | |
539 | TODO, see L<http://www.mail-archive.com/perl5-changes@perl.org/msg17328.html> |
540 | |
882227b7 |
541 | =head2 C<extflags> |
108003db |
542 | |
192b9cd1 |
543 | This will be used by perl to see what flags the regexp was compiled |
544 | with. It will normally be set to the value of the flags parameter by |
c998b245 |
545 | the L<comp|/comp> callback. See the L<comp|/comp> documentation for |
546 | valid flags. |
108003db |
547 | |
882227b7 |
548 | =head2 C<minlen> C<minlenret> |
108003db |
549 | |
550 | The minimum string length required for the pattern to match. This is used to |
551 | prune the search space by not bothering to match any closer to the end of a |
552 | string than would allow a match. For instance there is no point in even |
553 | starting the regex engine if the minlen is 10 but the string is only 5 |
554 | characters long. There is no way that the pattern can match. |
555 | |
556 | C<minlenret> is the minimum length of the string that would be found |
557 | in $& after a match. |
558 | |
559 | The difference between C<minlen> and C<minlenret> can be seen in the |
560 | following pattern: |
561 | |
562 | /ns(?=\d)/ |
563 | |
564 | where the C<minlen> would be 3 but C<minlenret> would only be 2 as the \d is |
565 | required to match but is not actually included in the matched content. This |
566 | distinction is particularly important as the substitution logic uses the |
567 | C<minlenret> to tell whether it can do in-place substitution, which can result in |
568 | considerable speedup. |
569 | |
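The pruning that C<minlen> makes possible can be sketched as a simple length
comparison. The function below is a hypothetical helper, not part of perl's
API, and it assumes single-byte characters so that pointer distance equals
character count; with UTF-8 subjects the real calculation is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if a pattern that needs at least `minlen` characters could
 * still match somewhere between `pos` and `strend`.
 * Assumes one byte per character. */
bool
example_worth_trying(const char *pos, const char *strend, ptrdiff_t minlen)
{
    assert(pos != NULL && strend != NULL && pos <= strend && minlen >= 0);
    return (strend - pos) >= minlen;
}
```

With a 5 character string and a C<minlen> of 10 this returns false, which is
the case described above where starting the engine would be pointless.
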
882227b7 |
570 | =head2 C<gofs> |
108003db |
571 | |
572 | Left offset from pos() to start match at. |
573 | |
882227b7 |
574 | =head2 C<substrs> |
108003db |
575 | |
192b9cd1 |
576 | Substring data about strings that must appear in the final match. This |
577 | is currently only used internally by perl's engine, but might be |
c998b245 |
578 | used in the future for all engines for optimisations. |
108003db |
579 | |
882227b7 |
580 | =head2 C<nparens>, C<lastparen>, and C<lastcloseparen> |
108003db |
581 | |
582 | These fields are used to keep track of how many paren groups could be matched |
583 | in the pattern, which was the last open paren to be entered, and which was |
584 | the last close paren to be entered. |
585 | |
882227b7 |
586 | =head2 C<intflags> |
108003db |
587 | |
588 | The engine's private copy of the flags the pattern was compiled with. Usually |
192b9cd1 |
589 | this is the same as C<extflags> unless the engine chose to modify one of them. |
108003db |
590 | |
882227b7 |
591 | =head2 C<pprivate> |
108003db |
592 | |
593 | A void* pointing to an engine-defined data structure. The perl engine uses the |
594 | C<regexp_internal> structure (see L<perlreguts/Base Structures>) but a custom |
595 | engine should use something else. |
596 | |
882227b7 |
597 | =head2 C<swap> |
108003db |
598 | |
599 | TODO: document |
600 | |
882227b7 |
601 | =head2 C<offs> |
108003db |
602 | |
603 | An array of C<regexp_paren_pair> structures which define offsets into the string being |
604 | matched which correspond to the C<$&> and C<$1>, C<$2> etc. captures. The |
605 | C<regexp_paren_pair> struct is defined as follows: |
606 | |
607 | typedef struct regexp_paren_pair { |
608 | I32 start; |
609 | I32 end; |
610 | } regexp_paren_pair; |
611 | |
612 | If C<< ->offs[num].start >> or C<< ->offs[num].end >> is C<-1> then that |
613 | capture buffer did not match. C<< ->offs[0].start/end >> represents C<$&> (or |
614 | C<${^MATCH}> under C<//p>) and C<< ->offs[paren].end >> matches C<$$paren> where |
615 | C<< $paren >= 1 >>. |
616 | |
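To make the C<-1> convention concrete, here is a small self-contained sketch
that turns an entry of an offsets array into a pointer and length within the
subject string. The struct C<example_paren_pair> only mirrors the layout shown
above, and C<example_capture_span> is a hypothetical helper; it works in bytes
and handles only buffer numbers of 0 and above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the layout of regexp_paren_pair shown above */
typedef struct example_paren_pair {
    int32_t start;
    int32_t end;
} example_paren_pair;

/* If capture buffer `num` matched, store where it begins in `subject`
 * and how many bytes it spans, and return true. Return false (leaving
 * *beg and *len untouched) if either offset is -1. */
bool
example_capture_span(const char *subject, const example_paren_pair *offs,
                     int32_t num, const char **beg, size_t *len)
{
    assert(subject != NULL && offs != NULL);
    assert(beg != NULL && len != NULL && num >= 0);

    if (offs[num].start == -1 || offs[num].end == -1)
        return false;

    *beg = subject + offs[num].start;
    *len = (size_t)(offs[num].end - offs[num].start);
    return true;
}
```

For example, with the subject C<"ook"> and hand-written offsets
C<{0, 2}> for buffer 1 and C<{-1, -1}> for buffer 2, the helper reports a
2 byte span starting at the first character for buffer 1 and no match for
buffer 2.
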
882227b7 |
617 | =head2 C<precomp> C<prelen> |
108003db |
618 | |
192b9cd1 |
619 | Used for optimisations. C<precomp> holds a copy of the pattern that |
620 | was compiled and C<prelen> its length. When a new pattern is to be |
621 | compiled (such as inside a loop) the internal C<regcomp> operator |
622 | checks whether the last compiled C<REGEXP>'s C<precomp> and C<prelen> |
623 | are equivalent to the new one, and if so uses the old pattern instead |
624 | of compiling a new one. |
625 | |
626 | The relevant snippet from C<Perl_pp_regcomp>: |
627 | |
628 | if (!re || !re->precomp || re->prelen != (I32)len || |
629 | memNE(re->precomp, t, len)) |
630 | /* Compile a new pattern */ |
108003db |
631 | |
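The same test can be written as a standalone predicate, which may be easier
to read in isolation. This is only a sketch: C<example_compiled> is a
hypothetical two-field stand-in for C<REGEXP>, C<example_needs_compile> is not
a perl function, and C<memcmp> is used in place of perl's C<memNE> macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Cut-down stand-in with only the two fields the check needs */
typedef struct example_compiled {
    const char *precomp;   /* pattern text this was compiled from */
    int         prelen;    /* length of precomp */
} example_compiled;

/* True if the pattern text (t, len) differs from the one `re` was
 * compiled from, meaning a fresh compile is needed. */
bool
example_needs_compile(const example_compiled *re, const char *t, size_t len)
{
    assert(t != NULL);

    return re == NULL
        || re->precomp == NULL
        || re->prelen != (int)len
        || memcmp(re->precomp, t, len) != 0;
}
```

Because the length is compared before the bytes, C<memcmp> never reads past
the end of the shorter buffer.
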
882227b7 |
632 | =head2 C<paren_names> |
108003db |
633 | |
634 | This is a hash used internally to track named capture buffers and their |
635 | offsets. The keys are the names of the buffers; the values are dualvars, |
636 | with the IV slot holding the number of buffers with the given name and the |
637 | pv being an embedded array of I32. The values may also be contained |
638 | independently in the data array in cases where named backreferences are |
639 | used. |
640 | |
c998b245 |
641 | =head2 C<substrs> |
108003db |
642 | |
643 | Holds information on the longest string that must occur at a fixed |
644 | offset from the start of the pattern, and the longest string that must |
645 | occur at a floating offset from the start of the pattern. Used to do |
646 | Fast-Boyer-Moore searches on the string to find out if it's worth using |
647 | the regex engine at all, and if so where in the string to search. |
648 | |
882227b7 |
649 | =head2 C<subbeg> C<sublen> C<saved_copy> |
108003db |
650 | |
c998b245 |
651 | Used during execution phase for managing search and replace patterns. |
108003db |
652 | |
882227b7 |
653 | =head2 C<wrapped> C<wraplen> |
108003db |
654 | |
c998b245 |
655 | Stores the string C<qr//> stringifies to. The perl engine for example |
656 | stores C<(?-xism:eek)> in the case of C<qr/eek/>. |
108003db |
657 | |
c998b245 |
658 | When using a custom engine that doesn't support the C<(?:)> construct |
659 | for inline modifiers, it's probably best to have C<qr//> stringify to |
660 | the supplied pattern. Note that this will create undesired patterns in |
661 | cases such as: |
108003db |
662 | |
663 | my $x = qr/a|b/; # "a|b" |
192b9cd1 |
664 | my $y = qr/c/i; # "c" |
108003db |
665 | my $z = qr/$x$y/; # "a|bc" |
666 | |
192b9cd1 |
667 | There's no solution for this problem other than making the custom |
668 | engine understand a construct like C<(?:)>. |
108003db |
669 | |
882227b7 |
670 | =head2 C<seen_evals> |
108003db |
671 | |
672 | This stores the number of eval groups in the pattern. This is used for security |
673 | purposes when embedding compiled regexes into larger patterns with C<qr//>. |
674 | |
882227b7 |
675 | =head2 C<refcnt> |
108003db |
676 | |
677 | The number of times the structure is referenced. When this falls to 0 the |
678 | regexp is automatically freed by a call to pregfree. This should be set to 1 in |
679 | each engine's L</comp> routine. |
680 | |
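The life cycle implied by C<refcnt> (start at 1, free when it reaches 0) is
ordinary reference counting. The sketch below illustrates that general
technique with hypothetical names (C<example_counted>, C<example_new>,
C<example_ref>, C<example_unref>); it is not how C<pregfree> is implemented,
and it uses C<malloc> and C<free> where perl has its own allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Cut-down stand-in for a reference-counted regexp */
typedef struct example_counted {
    int   refcnt;
    void *pprivate;   /* engine data, released with the struct */
} example_counted;

/* What a comp routine would do: allocate and set refcnt to 1 */
example_counted *
example_new(void)
{
    example_counted *rx = malloc(sizeof *rx);
    if (rx != NULL) {
        rx->refcnt = 1;
        rx->pprivate = NULL;
    }
    return rx;
}

/* Take an additional reference */
example_counted *
example_ref(example_counted *rx)
{
    assert(rx != NULL && rx->refcnt > 0);
    rx->refcnt++;
    return rx;
}

/* Drop a reference; returns 1 if the struct was freed, 0 otherwise */
int
example_unref(example_counted *rx)
{
    assert(rx != NULL && rx->refcnt > 0);
    if (--rx->refcnt > 0)
        return 0;
    free(rx->pprivate);
    free(rx);
    return 1;
}
```
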
108003db |
681 | =head1 HISTORY |
682 | |
683 | Originally part of L<perlreguts>. |
684 | |
685 | =head1 AUTHORS |
686 | |
687 | Originally written by Yves Orton, expanded by E<AElig>var ArnfjE<ouml>rE<eth> |
688 | Bjarmason. |
689 | |
690 | =head1 LICENSE |
691 | |
692 | Copyright 2006 Yves Orton and 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason. |
693 | |
694 | This program is free software; you can redistribute it and/or modify it under |
695 | the same terms as Perl itself. |
696 | |
697 | =cut |