=head1 NAME

perlreapi - perl regular expression plugin interface

=head1 DESCRIPTION

As of Perl 5.9.5 there is a new interface for using regexp engines other than
the default one. Each engine is supposed to provide access to a constant
structure of the following format:

    typedef struct regexp_engine {
        regexp* (*comp) (pTHX_ char* exp, char* xend, U32 pm_flags);
        I32     (*exec) (pTHX_ regexp* prog, char* stringarg, char* strend,
                         char* strbeg, I32 minend, SV* screamer,
                         void* data, U32 flags);
        char*   (*intuit) (pTHX_ regexp *prog, SV *sv, char *strpos,
                           char *strend, U32 flags,
                           struct re_scream_pos_data_s *data);
        SV*     (*checkstr) (pTHX_ regexp *prog);
        void    (*free) (pTHX_ struct regexp* r);
        SV*     (*numbered_buff_get) (pTHX_ const REGEXP * const rx, I32 paren, SV* usesv);
        SV*     (*named_buff_get)(pTHX_ const REGEXP * const rx, SV* namesv, U32 flags);
        SV*     (*qr_pkg)(pTHX_ const REGEXP * const rx);
    #ifdef USE_ITHREADS
        void*   (*dupe) (pTHX_ const regexp *r, CLONE_PARAMS *param);
    #endif
    } regexp_engine;

When a regexp is compiled, its C<engine> field is set to point at
the appropriate structure, so that when the regexp needs to be used Perl
can find the right routines to do so.

In order to install a new regexp handler, C<$^H{regcomp}> is set
to an integer which (when cast appropriately) resolves to one of these
structures. When compiling, the C<comp> method is executed, and the
resulting regexp structure's C<engine> field is expected to point back at
the same structure.

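The back-pointer contract described above can be sketched in plain C. This is
only an illustration under simplifying assumptions: C<toy_regexp>,
C<toy_engine>, C<toy_engine_vtbl> and C<toy_comp> are hypothetical stand-ins
for the real perl types, and the C<pTHX_> argument is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for perl's regexp and regexp_engine types. */
struct toy_engine;

typedef struct toy_regexp {
    const struct toy_engine *engine;   /* which engine compiled this? */
} toy_regexp;

typedef struct toy_engine {
    toy_regexp *(*comp)(const char *exp, const char *xend, unsigned flags);
} toy_engine;

static toy_regexp *toy_comp(const char *exp, const char *xend, unsigned flags);

/* The constant structure an engine exposes. */
static const toy_engine toy_engine_vtbl = { toy_comp };

static toy_regexp *
toy_comp(const char *exp, const char *xend, unsigned flags)
{
    toy_regexp *rx;

    assert(exp <= xend);
    (void)flags;                          /* a real engine would parse here */
    rx = malloc(sizeof *rx);
    if (rx)
        rx->engine = &toy_engine_vtbl;    /* point back at our own table */
    return rx;
}
```

Whoever later holds the C<toy_regexp> can reach every other routine of the
engine through C<< rx->engine >> without knowing which engine compiled it.
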
The C<pTHX_> symbol in the definition is a macro used by perl under threading
to provide an extra argument to the routine holding a pointer back to
the interpreter that is executing the regexp. So under threading all
routines get an extra argument.

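The effect of such a macro can be imitated in a few lines. The sketch below
assumes nothing about F<perl.h>: C<toy_pTHX_>, C<toy_aTHX_>, C<toy_interp>
and the C<TOY_THREADS> switch are hypothetical names chosen for the
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the real macros live in perl.h. */
typedef struct toy_interp { int depth; } toy_interp;

#ifdef TOY_THREADS
#  define toy_pTHX_  toy_interp *my_perl,   /* extra leading parameter */
#  define toy_aTHX_  my_perl,               /* extra leading argument  */
#else
#  define toy_pTHX_                         /* expands to nothing */
#  define toy_aTHX_
#endif

/* One declaration serves both the threaded and the unthreaded build. */
static int
toy_exec(toy_pTHX_ const char *s, const char *strend)
{
    assert(s <= strend);
    return (int)(strend - s);
}

/* Callers forward the interpreter pointer with the matching macro. */
static int
toy_caller(toy_pTHX_ const char *s, const char *strend)
{
    return toy_exec(toy_aTHX_ s, strend);
}
```

Note that the macro carries its own trailing comma, which is why no comma
appears between it and the first real parameter.
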
The routines are as follows:

=head2 comp

    regexp* comp(char *exp, char *xend, U32 flags);

Compile the pattern between C<exp> and C<xend> using the given flags and
return a pointer to a prepared regexp structure that can perform the match.
See L</The REGEXP structure> below for an explanation of the individual
fields in the REGEXP struct.

The C<flags> parameter is a bitfield which indicates which of the
C<msixk> flags the regex was compiled with. In addition it contains
info about whether C<use locale> is in effect and optimization info
for C<split>. A regex engine might want to use the same split
optimizations with a different syntax; for instance, a Perl6 engine
would treat C<split /^^/> equivalently to perl's C<split /^/>. See
L<split documentation|perlfunc> and the relevant code in C<pp_split>
in F<pp.c> to find out whether your engine should be setting these.

The C<eogc> flags are stripped out before being passed to the comp
routine. The regex engine does not need to know whether any of these
are set.

=over 4

=item RXf_SKIPWHITE

C<split ' '> or C<split> with no arguments (which really means
C<split(' ', $_)>, see L<split|perlfunc>).

=item RXf_START_ONLY

Set if the pattern is C</^/> (C<< r->prelen == 1 && r->precomp[0] ==
'^' >>). Will be used by the C<split> operator to split the given
string on C<\n> (even under C</^/s>, see L<split|perlfunc>).

=item RXf_WHITE

Set if the pattern is exactly C</\s+/> and used by C<split>. The
definition of whitespace varies depending on whether RXf_UTF8 or
RXf_PMf_LOCALE is set.

=item RXf_PMf_LOCALE

Makes C<split> use the locale dependent definition of whitespace under C<use
locale> when RXf_SKIPWHITE or RXf_WHITE is in effect. Under ASCII whitespace is
defined as per L<isSPACE|perlapi/ISSPACE>, and by the internal macros
C<is_utf8_space> under UTF-8 and C<isSPACE_LC> under C<use locale>.

=item RXf_PMf_MULTILINE

The C</m> flag. This ends up being passed to C<Perl_fbm_instr> by
C<pp_split> regardless of the engine.

=item RXf_PMf_SINGLELINE

The C</s> flag. Guaranteed not to be used outside the regex engine.

=item RXf_PMf_FOLD

The C</i> flag. Guaranteed not to be used outside the regex engine.

=item RXf_PMf_EXTENDED

The C</x> flag. Guaranteed not to be used outside the regex
engine. However, if present on a regex, C<#> comments will be stripped
by the tokenizer regardless of the engine currently in use.

=item RXf_PMf_KEEPCOPY

The C</k> flag.

=item RXf_UTF8

Set if the pattern is L<SvUTF8()|perlapi/SvUTF8>, set by Perl_pmruntime.

=back

In general these flags should be preserved in C<< regex->extflags >> after
compilation, although it is possible the regex includes constructs
that change them. The perl engine for instance may upgrade non-utf8
strings to utf8 if the pattern includes constructs such as C<\x{...}>
that can only match unicode values. RXf_SKIPWHITE should always be
preserved verbatim in C<< regex->extflags >>.

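A rough sketch of how a C<comp> routine might derive C<extflags> from the
flags it receives is given below. The C<TOY_RXf_*> bit values and the
C<toy_extflags> helper are hypothetical; the real C<RXf_*> constants are
defined by perl and have different values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values; the real RXf_* constants come from perl. */
#define TOY_RXf_PMf_MULTILINE  0x01u
#define TOY_RXf_SKIPWHITE      0x02u
#define TOY_RXf_START_ONLY     0x04u
#define TOY_RXf_UTF8           0x08u

/* Compute extflags from the flags handed to comp and the pattern text. */
static unsigned
toy_extflags(const char *precomp, size_t prelen, unsigned comp_flags)
{
    unsigned extflags = comp_flags;      /* preserve what we were given */

    assert(precomp != NULL);

    /* The pattern is exactly /^/: let split break on newlines. */
    if (prelen == 1 && precomp[0] == '^')
        extflags |= TOY_RXf_START_ONLY;

    return extflags;
}
```

The incoming bits, including the stand-in for RXf_SKIPWHITE, pass through
untouched; the routine only ever adds information.
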
=head2 exec

    I32 exec(regexp* prog,
             char *stringarg, char* strend, char* strbeg,
             I32 minend, SV* screamer,
             void* data, U32 flags);

Execute a regexp.

=head2 intuit

    char* intuit( regexp *prog,
                  SV *sv, char *strpos, char *strend,
                  U32 flags, struct re_scream_pos_data_s *data);

Find the start position where a regex match should be attempted,
or possibly whether the regex engine should not be run because the
pattern can't match. This is called as appropriate by the core
depending on the values of the C<extflags> member of the regexp
structure.

=head2 checkstr

    SV* checkstr(regexp *prog);

Return a SV containing a string that must appear in the pattern. Used
by C<split> for optimising matches.

=head2 free

    void free(regexp *prog);

Called by perl when it is freeing a regexp pattern so that the engine
can release any resources pointed to by the C<pprivate> member of the
regexp structure. This is only responsible for freeing private data;
perl will handle releasing anything else contained in the regexp structure.

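The division of labour can be sketched as follows. The types C<toy_regexp>
and C<toy_private>, and the field C<precomp_copy>, are hypothetical stand-ins
invented for this illustration, and the C<pTHX_> argument is omitted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical private data of an engine. */
typedef struct toy_private {
    char *program;            /* engine-owned compiled form */
} toy_private;

/* Hypothetical stand-in for the regexp structure. */
typedef struct toy_regexp {
    void *pprivate;           /* owned by the engine */
    char *precomp_copy;       /* owned by the core in this sketch */
} toy_regexp;

/* Release only what hangs off pprivate; leave the rest to the caller. */
static void
toy_free(toy_regexp *rx)
{
    toy_private *priv;

    assert(rx != NULL);
    priv = rx->pprivate;
    if (priv) {
        free(priv->program);
        free(priv);
        rx->pprivate = NULL;  /* guard against a double free */
    }
}
```

The routine deliberately leaves C<precomp_copy> alone, since in this sketch
that field belongs to the caller.
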
=head2 numbered_buff_get

    SV* numbered_buff_get(pTHX_ const REGEXP * const rx, I32 paren, SV* usesv);

TODO: document

=head2 named_buff_get

    SV* named_buff_get(pTHX_ const REGEXP * const rx, SV* namesv, U32 flags);

TODO: document

=head2 qr_pkg

    SV* qr_pkg(pTHX_ const REGEXP * const rx);

The package the qr// magic object is blessed into (as seen by C<ref
qr//>). It is recommended that engines change this to their own package
name, for instance:

    SV*
    Example_reg_qr_pkg(pTHX_ const REGEXP * const rx)
    {
        PERL_UNUSED_ARG(rx);
        return newSVpvs("re::engine::Example");
    }

Any method calls on an object created with C<qr//> will be dispatched to the
package as a normal object.

    use re::engine::Example;
    my $re = qr//;
    $re->meth; # dispatched to re::engine::Example::meth()

To retrieve the C<REGEXP> object from the scalar in an XS function use the
following snippet:

    void meth(SV * sv)
    PPCODE:
        MAGIC * mg;
        REGEXP * re;

        if (SvMAGICAL(sv))
            mg_get(sv);
        if (SvROK(sv) &&
            (sv = (SV*)SvRV(sv)) &&     /* assignment deliberate */
            SvTYPE(sv) == SVt_PVMG &&
            (mg = mg_find(sv, PERL_MAGIC_qr))) /* assignment deliberate */
        {
            re = (REGEXP *)mg->mg_obj;
        }

Or use the (CURRENTLY UNDOCUMENTED!) C<Perl_get_re_arg> function:

    void meth(SV * rv)
    PPCODE:
        const REGEXP * const re = (REGEXP *)Perl_get_re_arg( aTHX_ rv, 0, NULL );

=head2 dupe

    void* dupe(const regexp *r, CLONE_PARAMS *param);

On threaded builds a regexp may need to be duplicated so that the pattern
can be used by multiple threads. This routine is expected to handle the
duplication of any private data pointed to by the C<pprivate> member of
the regexp structure. It will be called with the preconstructed new
regexp structure as an argument, the C<pprivate> member will point at
the B<old> private structure, and it is this routine's responsibility to
construct a copy and return a pointer to it (which perl will then use to
overwrite the field as passed to this routine).

This allows the engine to dupe its private data but also, if necessary,
to modify the final structure if it really must.

On unthreaded builds this field doesn't exist.

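A deep copy along these lines might look like the sketch below. It assumes
a private structure holding a single heap buffer; C<toy_private>,
C<toy_regexp> and C<toy_dupe> are hypothetical names, and the
C<CLONE_PARAMS> and C<pTHX_> arguments are omitted for brevity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical private data: one heap buffer plus its length. */
typedef struct toy_private {
    size_t  len;
    char   *program;
} toy_private;

typedef struct toy_regexp {
    void *pprivate;
} toy_regexp;

/* Called with the new regexp, whose pprivate still points at the OLD
 * private data. Returns a deep copy; the caller stores it in pprivate. */
static void *
toy_dupe(const toy_regexp *r)
{
    const toy_private *old;
    toy_private *copy;

    assert(r != NULL && r->pprivate != NULL);
    old = r->pprivate;

    copy = malloc(sizeof *copy);
    if (!copy)
        return NULL;
    copy->len = old->len;
    copy->program = malloc(old->len ? old->len : 1);
    if (!copy->program) {
        free(copy);
        return NULL;
    }
    memcpy(copy->program, old->program, old->len);
    return copy;
}
```

Copying the buffer rather than the pointer matters here: two threads sharing
one C<program> buffer would free it twice.
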
=head1 The REGEXP structure

The REGEXP struct is defined in F<regexp.h>. All regex engines must be able to
correctly build such a structure in their L</comp> routine.

The REGEXP structure contains all the data that perl needs to be aware of
to properly work with the regular expression. It includes data about
optimisations that perl can use to determine if the regex engine should
really be used, and various other control info that is needed to properly
execute patterns in various contexts, such as whether the pattern is
anchored in some way, what flags were used during the compile, or whether
the program contains special constructs that perl needs to be aware of.

In addition it contains two fields that are intended for the private use
of the regex engine that compiled the pattern. These are the C<intflags>
and C<pprivate> members. C<pprivate> is a void pointer to an arbitrary
structure whose use and management is the responsibility of the compiling
engine. perl will never modify either of these values.

    typedef struct regexp {
        /* what engine created this regexp? */
        const struct regexp_engine* engine;

        /* what re is this a lightweight copy of? */
        struct regexp* mother_re;

        /* Information about the match that the perl core uses to manage things */
        U32 extflags;   /* Flags used both externally and internally */
        I32 minlen;     /* minimum possible length of string to match */
        I32 minlenret;  /* minimum possible length of $& */
        U32 gofs;       /* chars left of pos that we search from */

        /* substring data about strings that must appear
           in the final match, used for optimisations */
        struct reg_substr_data *substrs;

        U32 nparens;  /* number of capture buffers */

        /* private engine specific data */
        U32 intflags;   /* Engine Specific Internal flags */
        void *pprivate; /* Data private to the regex engine which
                           created this object. */

        /* Data about the last/current match. These are modified during matching */
        U32 lastparen;            /* last open paren matched */
        U32 lastcloseparen;       /* last close paren matched */
        regexp_paren_pair *swap;  /* Swap copy of *offs */
        regexp_paren_pair *offs;  /* Array of offsets for (@-) and (@+) */

        char *subbeg;  /* saved or original string so \digit works forever. */
        SV_SAVED_COPY  /* If non-NULL, SV which is COW from original */
        I32 sublen;    /* Length of string pointed by subbeg */

        /* Information about the match that isn't often used */
        I32 prelen;           /* length of precomp */
        const char *precomp;  /* pre-compilation regular expression */

        /* wrapped can't be const char*, as it is returned by sv_2pv_flags */
        char *wrapped;  /* wrapped version of the pattern */
        I32 wraplen;    /* length of wrapped */

        I32 seen_evals;   /* number of eval groups in the pattern - for security checks */
        HV *paren_names;  /* Optional hash of paren names */

        /* Refcount of this regexp */
        I32 refcnt;
    } regexp;

The fields are discussed in more detail below:

=over 4

=item C<engine>

This field points at a regexp_engine structure which contains pointers
to the subroutines that are to be used for performing a match. It
is the compiling routine's responsibility to populate this field before
returning the regexp object.

Internally this is set to C<NULL> unless a custom engine is specified in
C<$^H{regcomp}>. Perl's own set of callbacks can be accessed in the struct
pointed to by C<RE_ENGINE_PTR>.

=item C<mother_re>

TODO, see L<http://www.mail-archive.com/perl5-changes@perl.org/msg17328.html>

=item C<extflags>

This will be used by perl to see what flags the regexp was compiled with. It
will normally be set to the value of the flags parameter on L</comp>.

=item C<minlen> C<minlenret>

The minimum string length required for the pattern to match. This is used to
prune the search space by not bothering to match any closer to the end of a
string than would allow a match. For instance there is no point in even
starting the regex engine if the minlen is 10 but the string is only 5
characters long. There is no way that the pattern can match.

C<minlenret> is the minimum length of the string that would be found
in $& after a match.

The difference between C<minlen> and C<minlenret> can be seen in the
following pattern:

    /ns(?=\d)/

where the C<minlen> would be 3 but C<minlenret> would only be 2 as the \d is
required to match but is not actually included in the matched content. This
distinction is particularly important as the substitution logic uses the
C<minlenret> to tell whether it can do in-place substitution which can result in
considerable speedup.

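The pruning that C<minlen> enables amounts to a simple length comparison.
The sketch below shows the idea only, not how the core implements it; the
helper names are hypothetical and lengths are counted in bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Can a match possibly start at s, given strend and the pattern's minlen? */
static int
toy_can_match_at(const char *s, const char *strend, long minlen)
{
    assert(s <= strend);
    return (long)(strend - s) >= minlen;
}

/* Number of start positions worth trying between strbeg and strend. */
static long
toy_start_positions(const char *strbeg, const char *strend, long minlen)
{
    long len;

    assert(strbeg <= strend);
    len = (long)(strend - strbeg);
    return len < minlen ? 0 : len - minlen + 1;
}
```

With a C<minlen> of 10 and a 5 byte string, C<toy_start_positions> reports
zero candidates, so the engine need not be started at all.
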
=item C<gofs>

Left offset from pos() to start match at.

=item C<substrs>

TODO: document

=item C<nparens>, C<lastparen>, and C<lastcloseparen>

These fields are used to keep track of how many paren groups could be matched
in the pattern, which was the last open paren to be entered, and which was
the last close paren to be entered.

=item C<intflags>

The engine's private copy of the flags the pattern was compiled with. Usually
this is the same as C<extflags> unless the engine chose to modify one of them.

=item C<pprivate>

A void* pointing to an engine-defined data structure. The perl engine uses the
C<regexp_internal> structure (see L<perlreguts/Base Structures>) but a custom
engine should use something else.

=item C<swap>

TODO: document

=item C<offs>

A C<regexp_paren_pair> structure which defines offsets into the string being
matched which correspond to the C<$&> and C<$1>, C<$2> etc. captures. The
C<regexp_paren_pair> struct is defined as follows:

    typedef struct regexp_paren_pair {
        I32 start;
        I32 end;
    } regexp_paren_pair;

If C<< ->offs[num].start >> or C<< ->offs[num].end >> is C<-1> then that
capture buffer did not match. C<< ->offs[0].start/end >> represents C<$&> (or
C<${^MATCH}> under C<//p>) and C<< ->offs[paren].end >> matches C<$$paren> where
C<< $paren >= 1 >>.

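Reading a capture out of such an offsets array can be sketched as follows.
C<toy_paren_pair> mirrors the struct above with plain C<long> fields, and
C<toy_capture> is a hypothetical helper, not part of the perl API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for regexp_paren_pair, using long instead of I32. */
typedef struct toy_paren_pair {
    long start;
    long end;
} toy_paren_pair;

/* Copy capture group `paren` of `subject` into buf (NUL-terminated).
 * Returns the capture length, or -1 if the group did not match
 * or buf is too small. */
static long
toy_capture(const char *subject, const toy_paren_pair *offs,
            size_t paren, char *buf, size_t bufsize)
{
    long start, end, len;

    assert(subject != NULL && offs != NULL && buf != NULL);
    start = offs[paren].start;
    end   = offs[paren].end;

    if (start == -1 || end == -1)
        return -1;                       /* group did not participate */
    len = end - start;
    if ((size_t)len >= bufsize)
        return -1;                       /* no room for the terminator */
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

Index 0 yields the equivalent of C<$&>, index 1 the equivalent of C<$1>, and
so on; a pair of C<-1> values is reported as "no capture".
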
=item C<precomp> C<prelen>

Used for debugging purposes. C<precomp> holds a copy of the pattern
that was compiled and C<prelen> its length.

=item C<paren_names>

This is a hash used internally to track named capture buffers and their
offsets. The keys are the names of the buffers; the values are dualvars,
with the IV slot holding the number of buffers with the given name and the
pv being an embedded array of I32. The values may also be contained
independently in the data array in cases where named backreferences are
used.

=item C<reg_substr_data>

Holds information on the longest string that must occur at a fixed
offset from the start of the pattern, and the longest string that must
occur at a floating offset from the start of the pattern. Used to do
Fast-Boyer-Moore searches on the string to find out if it's worth using
the regex engine at all, and if so where in the string to search.

=item C<startp>, C<endp>

These fields store arrays that are used to hold the offsets of the beginning
and end of each capture group that has matched. -1 is used to indicate no match.

These are the source for @- and @+.

=item C<subbeg> C<sublen> C<saved_copy>

    #define SAVEPVN(p,n)	((p) ? savepvn(p,n) : NULL)
    if (RX_MATCH_COPIED(ret))
        ret->subbeg  = SAVEPVN(ret->subbeg, ret->sublen);
    else
        ret->subbeg = NULL;

C<< PL_sawampersand || rx->extflags & RXf_PMf_KEEPCOPY >>

These are used during execution phase for managing search and replace
patterns.

=item C<wrapped> C<wraplen>

Stores the string C<qr//> stringifies to, for example C<(?-xism:eek)>
in the case of C<qr/eek/>.

When using a custom engine that doesn't support the C<(?:)> construct for
inline modifiers it's best to have C<qr//> stringify to the supplied pattern.
Note that this will create invalid patterns in cases such as:

    my $x = qr/a|b/;  # "a|b"
    my $y = qr/c/;    # "c"
    my $z = qr/$x$y/; # "a|bc"

There's no solution for such problems other than making the custom engine
understand some form of inline modifiers.

The C<Perl_reg_stringify> in F<regcomp.c> does the stringification work.

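The interpolation problem is purely a matter of string concatenation, which
the following sketch makes concrete. C<toy_join> and C<toy_wrap> are
hypothetical helpers and do not reflect how C<Perl_reg_stringify> is written.

```c
#include <assert.h>
#include <stdio.h>

/* Interpolate two stringified patterns, as qr/$x$y/ would.
 * Returns the length of the joined string. */
static int
toy_join(char *out, size_t outsize, const char *x, const char *y)
{
    assert(out != NULL && outsize > 0);
    return snprintf(out, outsize, "%s%s", x, y);
}

/* Wrap a pattern in a non-capturing group carrying its modifiers. */
static int
toy_wrap(char *out, size_t outsize, const char *pattern)
{
    assert(out != NULL && outsize > 0);
    return snprintf(out, outsize, "(?-xism:%s)", pattern);
}
```

Joining the bare strings C<a|b> and C<c> gives C<a|bc>, where the alternation
now swallows the C<c>. Joining the wrapped forms gives
C<(?-xism:a|b)(?-xism:c)>, which keeps the two patterns separate.
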
=item C<seen_evals>

This stores the number of eval groups in the pattern. This is used for security
purposes when embedding compiled regexes into larger patterns with C<qr//>.

=item C<refcnt>

The number of times the structure is referenced. When this falls to 0 the
regexp is automatically freed by a call to pregfree. This should be set to 1 in
each engine's L</comp> routine.

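The reference counting discipline can be sketched in isolation. The names
C<toy_new>, C<toy_inc> and C<toy_dec> are hypothetical; in perl itself the
final release goes through pregfree rather than a bare C<free>.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in holding only the reference count. */
typedef struct toy_regexp {
    int refcnt;
} toy_regexp;

/* comp-style constructor: a fresh regexp starts with one reference. */
static toy_regexp *
toy_new(void)
{
    toy_regexp *rx = malloc(sizeof *rx);
    if (rx)
        rx->refcnt = 1;
    return rx;
}

/* Take an additional reference. */
static toy_regexp *
toy_inc(toy_regexp *rx)
{
    assert(rx->refcnt > 0);
    rx->refcnt++;
    return rx;
}

/* Drop one reference; free at zero. Returns 1 if the struct was freed. */
static int
toy_dec(toy_regexp *rx)
{
    assert(rx->refcnt > 0);
    if (--rx->refcnt > 0)
        return 0;
    free(rx);
    return 1;
}
```

Starting the count at 1 means the creator holds the first reference, so the
structure is released exactly when the last holder lets go.
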
=back

=head2 De-allocation and Cloning

Any patch that adds data items to the REGEXP struct will need to include
changes to F<sv.c> (C<Perl_re_dup()>) and F<regcomp.c> (C<pregfree()>). This
involves freeing or cloning items in the regexp's data array based on the data
item's type.

=head1 HISTORY

Originally part of L<perlreguts>.

=head1 AUTHORS

Originally written by Yves Orton, expanded by E<AElig>var ArnfjE<ouml>rE<eth>
Bjarmason.

=head1 LICENSE

Copyright 2006 Yves Orton and 2007 E<AElig>var ArnfjE<ouml>rE<eth> Bjarmason.

This program is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.

=cut