Upgrade to Devel::PPPort 3.08_06
[p5sagit/p5-mst-13.2.git] / pod / perlreguts.pod
CommitLineData
b23a565d 1=head1 NAME
2
3perlreguts - Description of the Perl regular expression engine.
4
5=head1 DESCRIPTION
6
7This document is an attempt to shine some light on the guts of the regex
8engine and how it works. The regex engine represents a signifigant chunk
9of the perl codebase, but is relatively poorly understood. This document
10is a meagre attempt at addressing this situation. It is derived from the
be8e71aa 11author's experience, comments in the source code, other papers on the
12regex engine, feedback on the perl5-porters mail list, and no doubt other
13places as well.
b23a565d 14
15B<WARNING!> It should be clearly understood that this document
16represents the state of the regex engine as the author understands it at
be8e71aa 17the time of writing. It is B<NOT> an API definition; it is purely an
b23a565d 18internals guide for those who want to hack the regex engine, or
19understand how the regex engine works. Readers of this document are
be8e71aa 20expected to understand perl's regex syntax and its usage in detail. If
21you want to learn about the basics of Perl's regular expressions, see
22L<perlre>.
b23a565d 23
24=head1 OVERVIEW
25
26=head2 A quick note on terms
27
be8e71aa 28There is some debate as to whether to say "regexp" or "regex". In this
b23a565d 29document we will use the term "regex" unless there is a special reason
be8e71aa 30not to, in which case we will explain why.
b23a565d 31
32When speaking about regexes we need to distinguish between their source
33code form and their internal form. In this document we will use the term
34"pattern" when we speak of their textual, source code form, the term
35"program" when we speak of their internal representation. These
be8e71aa 36correspond to the terms I<S-regex> and I<B-regex> that Mark Jason
37Dominus employs in his paper on "Rx" ([1] in L</references>).
b23a565d 38
39=head2 What is a regular expression engine?
40
be8e71aa 41A regular expression engine is a program that takes a set of constraints
42specified in a mini-language, and then applies those constraints to a
43target string, and determines whether or not the string satisfies the
44constraints. See L<perlre> for a full definition of the language.
b23a565d 45
be8e71aa 46So in less grandiose terms the first part of the job is to turn a pattern into
b23a565d 47something the computer can efficiently use to find the matching point in
be8e71aa 48the string, and the second part is performing the search itself.
b23a565d 49
50To do this we need to produce a program by parsing the text. We then
51need to execute the program to find the point in the string that
52matches. And we need to do the whole thing efficiently.
53
54=head2 Structure of a Regexp Program
55
56=head3 High Level
57
be8e71aa 58Although it is a bit confusing and some people object to the terminology, it
b23a565d 59is worth taking a look at a comment that has
be8e71aa 60been in F<regexp.h> for years:
b23a565d 61
62I<This is essentially a linear encoding of a nondeterministic
63finite-state machine (aka syntax charts or "railroad normal form" in
64parsing technology).>
65
66The term "railroad normal form" is a bit esoteric, with "syntax
67diagram/charts", or "railroad diagram/charts" being more common terms.
68Nevertheless it provides a useful mental image of a regex program: Each
69node can be thought of as a unit of track, with a single entry and in
70most cases a single exit point (there are pieces of track that fork, but
be8e71aa 71statistically not many), and the whole forms a layout with a
b23a565d 72single entry and single exit point. The matching process can be thought
be8e71aa 73of as a car that moves along the track, with the particular route through
b23a565d 74the system being determined by the character read at each possible
be8e71aa 75connector point. A car can fall off the track at any point but it may
76only proceed as long as it matches the track.
b23a565d 77
78Thus the pattern C</foo(?:\w+|\d+|\s+)bar/> can be thought of as the
79following chart:
80
be8e71aa 81 [start]
82 |
83 <foo>
84 |
85 +-----+-----+
86 | | |
87 <\w+> <\d+> <\s+>
88 | | |
89 +-----+-----+
90 |
91 <bar>
92 |
93 [end]
94
95The truth of the matter is that perl's regular expressions these days are
96much more complex than this kind of structure, but visualizing it this way
97can help when trying to get your bearings, and it pretty closely with the
98current implementation.
99
100To be more precise, we will say that a regex program is an encoding
101of a graph. Each node in the graph corresponds to part of
b23a565d 102the original regex pattern, such as a literal string or a branch,
103and has a pointer to the nodes representing the next component
be8e71aa 104to be matched. Since "node" and "opcode" already have other meanings in the
105perl source, we will call the nodes in a regex program "regops".
b23a565d 106
be8e71aa 107The program is represented by an array of C<regnode> structures, one or
108more of which represent a single regop of the program. Struct
109C<regnode> is the smallest struct needed and has a field structure which is
b23a565d 110shared with all the other larger structures.
111
be8e71aa 112The "next" pointers of all regops except C<BRANCH> implement concatenation;
113a "next" pointer with a C<BRANCH> on both ends of it is connecting two
114alternatives. [Here we have one of the subtle syntax dependencies: an
115individual C<BRANCH> (as opposed to a collection of them) is never
116concatenated with anything because of operator precedence.]
b23a565d 117
118The operand of some types of regop is a literal string; for others,
119it is a regop leading into a sub-program. In particular, the operand
be8e71aa 120of a C<BRANCH> node is the first regop of the branch.
b23a565d 121
122B<NOTE>: As the railroad metaphor suggests this is B<not> a tree
123structure: the tail of the branch connects to the thing following the
be8e71aa 124set of C<BRANCH>es. It is a like a single line of railway track that
b23a565d 125splits as it goes into a station or railway yard and rejoins as it comes
126out the other side.
127
128=head3 Regops
129
be8e71aa 130The base structure of a regop is defined in F<regexp.h> as follows:
b23a565d 131
132 struct regnode {
be8e71aa 133 U8 flags; /* Various purposes, sometimes overridden */
b23a565d 134 U8 type; /* Opcode value as specified by regnodes.h */
135 U16 next_off; /* Offset in size regnode */
136 };
137
be8e71aa 138Other larger C<regnode>-like structures are defined in F<regcomp.h>. They
b23a565d 139are almost like subclasses in that they have the same fields as
140regnode, with possibly additional fields following in
141the structure, and in some cases the specific meaning (and name)
142of some of base fields are overriden. The following is a more
143complete description.
144
145=over 4
146
be8e71aa 147=item C<regnode_1>
b23a565d 148
be8e71aa 149=item C<regnode_2>
b23a565d 150
be8e71aa 151C<regnode_1> structures have the same header, followed by a single
152four-byte argument; C<regnode_2> structures contain two two-byte
b23a565d 153arguments instead:
154
155 regnode_1 U32 arg1;
156 regnode_2 U16 arg1; U16 arg2;
157
be8e71aa 158=item C<regnode_string>
b23a565d 159
be8e71aa 160C<regnode_string> structures, used for literal strings, follow the header
b23a565d 161with a one-byte length and then the string data. Strings are padded on
162the end with zero bytes so that the total length of the node is a
163multiple of four bytes:
164
165 regnode_string char string[1];
be8e71aa 166 U8 str_len; /* overrides flags */
b23a565d 167
be8e71aa 168=item C<regnode_charclass>
b23a565d 169
be8e71aa 170Character classes are represented by C<regnode_charclass> structures,
b23a565d 171which have a four-byte argument and then a 32-byte (256-bit) bitmap
172indicating which characters are included in the class.
173
174 regnode_charclass U32 arg1;
175 char bitmap[ANYOF_BITMAP_SIZE];
176
be8e71aa 177=item C<regnode_charclass_class>
b23a565d 178
179There is also a larger form of a char class structure used to represent
be8e71aa 180POSIX char classes called C<regnode_charclass_class> which has an
181additional 4-byte (32-bit) bitmap indicating which POSIX char class
182have been included.
b23a565d 183
184 regnode_charclass_class U32 arg1;
185 char bitmap[ANYOF_BITMAP_SIZE];
186 char classflags[ANYOF_CLASSBITMAP_SIZE];
187
188=back
189
be8e71aa 190F<regnodes.h> defines an array called C<regarglen[]> which gives the size
191of each opcode in units of C<size regnode> (4-byte). A macro is used
192to calculate the size of an C<EXACT> node based on its C<str_len> field.
b23a565d 193
be8e71aa 194The regops are defined in F<regnodes.h> which is generated from
195F<regcomp.sym> by F<regcomp.pl>. Currently the maximum possible number
196of distinct regops is restricted to 256, with about a quarter already
b23a565d 197used.
198
be8e71aa 199A set of macros makes accessing the fields
200easier and more consistent. These include C<OP()> which is used to determine
201the type of a C<regnode>-like structure, C<NEXT_OFF()> which is the offset to
202the next node (more on this later), C<ARG()>, C<ARG1()>, C<ARG2()>, C<ARG_SET()>,
203and equivelents for reading and setting the arguments, C<STR_LEN()>,
204C<STRING()> and C<OPERAND()> for manipulating strings and regop bearing
b23a565d 205types.
206
be8e71aa 207=head3 What regop is next?
b23a565d 208
209There are three distinct concepts of "next" in the regex engine, and
210it is important to keep them clear.
211
212=over 4
213
214=item *
215
216There is the "next regnode" from a given regnode, a value which is
217rarely useful except that sometimes it matches up in terms of value
218with one of the others, and that sometimes the code assumes this to
219always be so.
220
221=item *
222
be8e71aa 223There is the "next regop" from a given regop/regnode. This is the
224regop physically located after the the current one, as determined by
225the size of the current regop. This is often useful, such as when
b23a565d 226dumping the structure we use this order to traverse. Sometimes the code
be8e71aa 227assumes that the "next regnode" is the same as the "next regop", or in
228other words assumes that the sizeof a given regop type is always going
229to be one regnode large.
b23a565d 230
231=item *
232
be8e71aa 233There is the "regnext" from a given regop. This is the regop which
234is reached by jumping forward by the value of C<NEXT_OFF()>,
235or in a few cases for longer jumps by the C<arg1> field of the C<regnode_1>
236structure. The subroutine C<regnext()> handles this transparently.
b23a565d 237This is the logical successor of the node, which in some cases, like
be8e71aa 238that of the C<BRANCH> regop, has special meaning.
b23a565d 239
240=back
241
be8e71aa 242=head1 Process Overview
b23a565d 243
be8e71aa 244Broadly speaking, performing a match of a string against a pattern
245involves the following steps:
246
247=over 5
248
249=item A. Compilation
250
251=over 5
252
253=item 1. Parsing for size
254
255=item 2. Parsing for construction
256
257=item 3. Peep-hole optimisation and analysis
258
259=back
260
261=item B. Execution
262
263=over 5
264
265=item 4. Start position and no-match optimisations
266
267=item 5. Program execution
268
269=back
270
271=back
b23a565d 272
b23a565d 273
274Where these steps occur in the actual execution of a perl program is
275determined by whether the pattern involves interpolating any string
be8e71aa 276variables. If interpolation occurs, then compilation happens at run time. If it
277does not, then compilation is performed at compile time. (The C</o> modifier changes this,
278as does C<qr//> to a certain extent). The engine doesn't really care that
b23a565d 279much.
280
281=head2 Compilation
282
be8e71aa 283This code resides primarily in F<regcomp.c>, along with the header files
284F<regcomp.h>, F<regexp.h> and F<regnodes.h>.
b23a565d 285
286Compilation starts with C<pregcomp()>, which is mostly an initialization
287wrapper which farms out two other routines for the heavy lifting. The
288first being C<reg()> which is the start point for parsing, and
289C<study_chunk()> which is responsible for optimisation.
290
291Initialization in C<pregcomp()> mostly involves the creation and data
be8e71aa 292filling of a special structure C<RExC_state_t>, (defined in F<regcomp.c>).
293Almost all internally used routines in F<regcomp.h> take a pointer to one
294of these structures as their first argument, with the name C<pRExC_state>.
b23a565d 295This structure is used to store the compilation state and contains many
be8e71aa 296fields. Likewise there are many macros which operate on this
297variable. Anything that looks like C<RExC_xxxx> is a macro that operates on
b23a565d 298this pointer/structure.
299
300=head3 Parsing for size
301
302In this pass the input pattern is parsed in order to calculate how much
be8e71aa 303space is needed for each regop we would need to emit. The size is also
b23a565d 304used to determine whether long jumps will be required in the program.
305
be8e71aa 306This stage is controlled by the macro C<SIZE_ONLY> being set.
b23a565d 307
308The parse procedes pretty much exactly as it does during the
be8e71aa 309construction phase, except that most routines are short-circuited to
310change the size field C<RExC_size> and not do anything else.
b23a565d 311
312=head3 Parsing for construcution
313
be8e71aa 314Once the size of the program has been determined, the pattern is parsed
315again, but this time for real. Now C<SIZE_ONLY> will be false, and the
b23a565d 316actual construction can occur.
317
318C<reg()> is the start of the parse process. It is responsible for
319parsing an arbitrary chunk of pattern up to either the end of the
320string, or the first closing parenthesis it encounters in the pattern.
321This means it can be used to parse the toplevel regex, or any section
322inside of a grouping parenthesis. It also handles the "special parens"
be8e71aa 323that perl's regexes have. For instance when parsing C</x(?:foo)y/> C<reg()>
324will at one point be called to parse from the "?" symbol up to and
325including the ")".
b23a565d 326
be8e71aa 327Additionally, C<reg()> is responsible for parsing the one or more
b23a565d 328branches from the pattern, and for "finishing them off" by correctly
be8e71aa 329setting their next pointers. In order to do the parsing, it repeatedly
330calls out to C<regbranch()>, which is responsible for handling up to the
b23a565d 331first C<|> symbol it sees.
332
be8e71aa 333C<regbranch()> in turn calls C<regpiece()> which
334handles "things" followed by a quantifier. In order to parse the
335"things", C<regatom()> is called. This is the lowest level routine which
336parses out constant strings, character classes, and the
337various special symbols like C<$>. If C<regatom()> encounters a "("
b23a565d 338character it in turn calls C<reg()>.
339
340The routine C<regtail()> is called by both C<reg()>, C<regbranch()>
341in order to "set the tail pointer" correctly. When executing and
be8e71aa 342we get to the end of a branch, we need to go to the node following the
343grouping parens. When parsing, however, we don't know where the end will
b23a565d 344be until we get there, so when we do we must go back and update the
345offsets as appropriate. C<regtail> is used to make this easier.
346
be8e71aa 347A subtlety of the parsing process means that a regex like C</foo/> is
b23a565d 348originally parsed into an alternation with a single branch. It is only
349afterwards that the optimizer converts single branch alternations into the
350simpler form.
351
352=head3 Parse Call Graph and a Grammar
353
354The call graph looks like this:
355
356 reg() # parse a top level regex, or inside of parens
357 regbranch() # parse a single branch of an alternation
358 regpiece() # parse a pattern followed by a quantifier
359 regatom() # parse a simple pattern
360 regclass() # used to handle a class
361 reg() # used to handle a parenthesized subpattern
362 ....
363 ...
364 regtail() # finish off the branch
365 ...
366 regtail() # finish off the branch sequence. Tie each
367 # branches tail to the tail of the sequence
368 # (NEW) In Debug mode this is
369 # regtail_study().
370
371A grammar form might be something like this:
372
373 atom : constant | class
374 quant : '*' | '+' | '?' | '{min,max}'
375 _branch: piece
376 | piece _branch
377 | nothing
378 branch: _branch
379 | _branch '|' branch
380 group : '(' branch ')'
381 _piece: atom | group
382 piece : _piece
383 | _piece quant
384
385=head3 Debug Output
386
387In bleadperl you can C<< use re Debug => 'PARSE'; >> to see some trace
388information about the parse process. We will start with some simple
389patterns and build up to more complex patterns.
390
391So when we parse C</foo/> we see something like the following table. The
392left shows whats being parsed, the number indicates where the next regop
393would go. The stuff on the right is the trace output of the graph. The
394names are chosen to be short to make it less dense on the screen. 'tsdy'
395is a special form of C<regtail()> which does some extra analysis.
396
397 >foo< 1 reg
398 brnc
399 piec
400 atom
401 >< 4 tsdy~ EXACT <foo> (EXACT) (1)
402 ~ attach to END (3) offset to 2
403
404The resulting program then looks like:
405
406 1: EXACT <foo>(3)
407 3: END(0)
408
409As you can see, even though we parsed out a branch and a piece, it was ultimately
be8e71aa 410only an atom. The final program shows us how things work. We have an C<EXACT> regop,
411followed by an C<END> regop. The number in parens indicates where the C<regnext> of
412the node goes. The C<regnext> of an C<END> regop is unused, as C<END> regops mean
b23a565d 413we have successfully matched. The number on the left indicates the position of
414the regop in the regnode array.
415
be8e71aa 416Now let's try a harder pattern. We will add a quantifier, so now we have the pattern
b23a565d 417C</foo+/>. We will see that C<regbranch()> calls C<regpiece()> regpiece twice.
418
419 >foo+< 1 reg
420 brnc
421 piec
422 atom
423 >o+< 3 piec
424 atom
425 >< 6 tail~ EXACT <fo> (1)
426 7 tsdy~ EXACT <fo> (EXACT) (1)
427 ~ PLUS (END) (3)
428 ~ attach to END (6) offset to 3
429
430And we end up with the program:
431
432 1: EXACT <fo>(3)
433 3: PLUS(6)
434 4: EXACT <o>(0)
435 6: END(0)
436
be8e71aa 437Now we have a special case. The C<EXACT> regop has a C<regnext> of 0. This is
b23a565d 438because if it matches it should try to match itself again. The PLUS regop
be8e71aa 439handles the actual failure of the C<EXACT> regop and acts appropriately (going
440to regnode 6 if the C<EXACT> matched at least once, or failing if it didn't.)
b23a565d 441
442Now for something much more complex: C</x(?:foo*|b[a][rR])(foo|bar)$/>
443
444 >x(?:foo*|b... 1 reg
445 brnc
446 piec
447 atom
448 >(?:foo*|b[... 3 piec
449 atom
450 >?:foo*|b[a... reg
451 >foo*|b[a][... brnc
452 piec
453 atom
454 >o*|b[a][rR... 5 piec
455 atom
456 >|b[a][rR])... 8 tail~ EXACT <fo> (3)
457 >b[a][rR])(... 9 brnc
458 10 piec
459 atom
460 >[a][rR])(f... 12 piec
461 atom
462 >a][rR])(fo... clas
463 >[rR])(foo|... 14 tail~ EXACT <b> (10)
464 piec
465 atom
466 >rR])(foo|b... clas
467 >)(foo|bar)... 25 tail~ EXACT <a> (12)
468 tail~ BRANCH (3)
469 26 tsdy~ BRANCH (END) (9)
470 ~ attach to TAIL (25) offset to 16
471 tsdy~ EXACT <fo> (EXACT) (4)
472 ~ STAR (END) (6)
473 ~ attach to TAIL (25) offset to 19
474 tsdy~ EXACT <b> (EXACT) (10)
475 ~ EXACT <a> (EXACT) (12)
476 ~ ANYOF[Rr] (END) (14)
477 ~ attach to TAIL (25) offset to 11
478 >(foo|bar)$< tail~ EXACT <x> (1)
479 piec
480 atom
481 >foo|bar)$< reg
482 28 brnc
483 piec
484 atom
485 >|bar)$< 31 tail~ OPEN1 (26)
486 >bar)$< brnc
487 32 piec
488 atom
489 >)$< 34 tail~ BRANCH (28)
490 36 tsdy~ BRANCH (END) (31)
491 ~ attach to CLOSE1 (34) offset to 3
492 tsdy~ EXACT <foo> (EXACT) (29)
493 ~ attach to CLOSE1 (34) offset to 5
494 tsdy~ EXACT <bar> (EXACT) (32)
495 ~ attach to CLOSE1 (34) offset to 2
496 >$< tail~ BRANCH (3)
497 ~ BRANCH (9)
498 ~ TAIL (25)
499 piec
500 atom
501 >< 37 tail~ OPEN1 (26)
502 ~ BRANCH (28)
503 ~ BRANCH (31)
504 ~ CLOSE1 (34)
505 38 tsdy~ EXACT <x> (EXACT) (1)
506 ~ BRANCH (END) (3)
507 ~ BRANCH (END) (9)
508 ~ TAIL (END) (25)
509 ~ OPEN1 (END) (26)
510 ~ BRANCH (END) (28)
511 ~ BRANCH (END) (31)
512 ~ CLOSE1 (END) (34)
513 ~ EOL (END) (36)
514 ~ attach to END (37) offset to 1<div></div>
515
516Resulting in the program
517
518 1: EXACT <x>(3)
519 3: BRANCH(9)
520 4: EXACT <fo>(6)
521 6: STAR(26)
522 7: EXACT <o>(0)
523 9: BRANCH(25)
524 10: EXACT <ba>(14)
525 12: OPTIMIZED (2 nodes)
526 14: ANYOF[Rr](26)
527 25: TAIL(26)
528 26: OPEN1(28)
529 28: TRIE-EXACT(34)
530 [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf]
531 <foo>
532 <bar>
533 30: OPTIMIZED (4 nodes)
534 34: CLOSE1(36)
535 36: EOL(37)
536 37: END(0)
537
538Here we can see a much more complex program, with various optimisations in
be8e71aa 539play. At regnode 10 we see an example where a character class with only
540one character in it was turned into an C<EXACT> node. We can also see where
541an entire alternation was turned into a C<TRIE-EXACT> node. As a consequence,
b23a565d 542some of the regnodes have been marked as optimised away. We can see that
be8e71aa 543the C<$> symbol has been converted into an C<EOL> regop, a special piece of
544code that looks for C<\n> or the end of the string.
b23a565d 545
be8e71aa 546The next pointer for C<BRANCH>es is interesting in that it points at where
b23a565d 547execution should go if the branch fails. When executing if the engine
be8e71aa 548tries to traverse from a branch to a C<regnext> that isn't a branch then
549the engine will know that the entire set of branches have failed.
b23a565d 550
551=head3 Peep-hole Optimisation and Analysis
552
553The regular expression engine can be a weighty tool to wield. On long
554strings and complex patterns it can end up having to do a lot of work
555to find a match, and even more to decide that no match is possible.
556Consider a situation like the following pattern.
557
558 'ababababababababababab' =~ /(a|b)*z/
559
560The C<(a|b)*> part can match at every char in the string, and then fail
561every time because there is no C<z> in the string. So obviously we can
be8e71aa 562avoid using the regex engine unless there is a 'z' in the string.
b23a565d 563Likewise in a pattern like:
564
565 /foo(\w+)bar/
566
567In this case we know that the string must contain a C<foo> which must be
568followed by C<bar>. We can use Fast Boyer-More matching as implemented
be8e71aa 569in C<fbm_instr()> to find the location of these strings. If they don't exist
570then we don't need to resort to the much more expensive regex engine.
571Even better, if they do exist then we can use their positions to
b23a565d 572reduce the search space that the regex engine needs to cover to determine
be8e71aa 573if the entire pattern matches.
b23a565d 574
575There are various aspects of the pattern that can be used to facilitate
576optimisations along these lines:
577
578 * anchored fixed strings
579 * floating fixed strings
580 * minimum and maximum length requirements
581 * start class
582 * Beginning/End of line positions
583
584Another form of optimisation that can occur is post-parse "peep-hole"
be8e71aa 585optimisations, where inefficient constructs are replaced by
586more efficient constructs. An example of this are C<TAIL> regops which are used
b23a565d 587during parsing to mark the end of branches and the end of groups. These
588regops are used as place holders during construction and "always match"
589so they can be "optimised away" by making the things that point to the
be8e71aa 590TAIL point to thing that the C<TAIL> points to, thus "skipping" the node.
b23a565d 591
be8e71aa 592Another optimisation that can occur is that of "C<EXACT> merging" which is
593where two consecutive C<EXACT> nodes are merged into a single
594regop. An even more agressive form of this is that a branch
595sequence of the form CEXACT BRANCH ... EXACT> can be converted into a
596C<TRIE-EXACT> regop.
b23a565d 597
be8e71aa 598All of this occurs in the routine C<study_chunk()> which uses a special
599structure C<scan_data_t> to store the analysis that it has performed, and
600does the "peep-hole" optimisations as it goes.
b23a565d 601
be8e71aa 602The code involved in C<study_chunk()> is extremely cryptic. Be careful. :-)
b23a565d 603
604=head2 Execution
605
606Execution of a regex generally involves two phases, the first being
607finding the start point in the string where we should match from,
608and the second being running the regop interpreter.
609
be8e71aa 610If we can tell that there is no valid start point then we don't bother running
611interpreter at all. Likewise, if we know from the analysis phase that we
612cannot detect a short-cut to the start position, we go straight to the
b23a565d 613interpreter.
614
be8e71aa 615The two entry points are C<re_intuit_start()> and C<pregexec()>. These routines
b23a565d 616have a somewhat incestuous relationship with overlap between their functions,
be8e71aa 617and C<pregexec()> may even call C<re_intuit_start()> on its own. Nevertheless
b23a565d 618the perl source code may call into either, or both.
619
620Execution of the interpreter itself used to be recursive. Due to the
be8e71aa 621efforts of Dave Mitchell in blead perl, it is now iterative. Now an
b23a565d 622internal stack is maintained on the heap and the routine is fully
623iterative. This can make it tricky as the code is quite conservative
624about what state it stores which means that two consecutive lines in the
625code can actually be running in totally different contexts due to the
626simulated recursion.
627
628=head3 Start position and no-match optimisations
629
be8e71aa 630C<re_intuit_start()> is responsible for handling start points and no match
b23a565d 631optimisations as determined by the results of the analysis done by
be8e71aa 632C<study_chunk()> (and described in L<Peep-hole Optimisation and Analysis>).
b23a565d 633
634The basic structure of this routine is to try to find the start and/or
635end points of where the pattern could match, and to ensure that the string
636is long enough to match the pattern. It tries to use more efficent
637methods over less efficient methods and may involve considerable cross
638checking of constraints to find the place in the string that matches.
639For instance it may try to determine that a given fixed string must be
640not only present but a certain number of chars before the end of the
641string, or whatever.
642
be8e71aa 643It calls several other routines, such as C<fbm_instr()> which does
644"Fast Boyer More" matching and C<find_byclass()> which is responsible for
b23a565d 645finding the start using the first mandatory regop in the program.
646
be8e71aa 647When the optimisation criteria have been satisfied C<reg_try()> is called
b23a565d 648to perform the match.
649
650=head3 Program execution
651
652C<pregexec()> is the main entry point for running a regex. It contains
653support for initializing the regex interpreters state, running
be8e71aa 654C<re_intuit_start()> if needed, and running the intepreter on the string
b23a565d 655from various start positions as needed. When its necessary to use
656the regex interpreter C<pregexec()> calls C<regtry()>.
657
658C<regtry()> is the entry point into the regex interpreter. It expects
be8e71aa 659as arguments a pointer to a C<regmatch_info> structure and a pointer to
b23a565d 660a string. It returns an integer 1 for success and a 0 for failure.
661It is basically a setup wrapper around C<regmatch()>.
662
663C<regmatch> is the main "recursive loop" of the interpreter. It is
664basically a giant switch statement that executes the regops based on
665their type. A few of the regops are implemented as subroutines but
666the bulk are inline code.
667
668=head1 MISCELLANEOUS
669
670=head2 UNICODE and Localization Support
671
be8e71aa 672No matter how you look at it, Unicode support is going to be a pain in a
b23a565d 673regex engine. Tricks that might be fine when you have 256 possible
be8e71aa 674characters often won't scale to handle the size of the UTF-8 character
b23a565d 675set. Things you can take for granted with ASCII may not be true with
be8e71aa 676unicode. For instance, in ASCII, it is safe to assume that
677C<sizeof(char1) == sizeof(char2)>, but in UTF-8 it isn't. Unicode case folding is
678vastly more complex than the simple rules of ASCII, and even when not
679using Unicode but only localized single byte encodings, things can get
680tricky (for example, GERMAN-SHARP-ESS should match 'ss' in localized case
681insensitive matching).
682
683Making things worse is that UTF-8 support was a later addition to the
684regex engine (as it was to perl) and this necessarily made things a lot
685more complicated. Obviously it is easier to design a regex engine with
686Unicode support in mind from the beginning than it is to retrofit it to
687one that wasn't.
688
689Nearly all regops that involves looking at the input string have
690two cases, one for UTF-8, and one not. In fact, it's often more complex
691than that, as the pattern may be UTF-8 as well.
b23a565d 692
693Care must be taken when making changes to make sure that you handle
be8e71aa 694UTF-8 properly, both at compile time and at execution time, including
b23a565d 695when the string and pattern are mismatched.
696
be8e71aa 697The following comment in F<regcomp.h> gives an example of exactly how
b23a565d 698tricky this can be:
699
700 Two problematic code points in Unicode casefolding of EXACT nodes:
701
702 U+0390 - GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
703 U+03B0 - GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
704
705 which casefold to
706
707 Unicode UTF-8
708
709 U+03B9 U+0308 U+0301 0xCE 0xB9 0xCC 0x88 0xCC 0x81
710 U+03C5 U+0308 U+0301 0xCF 0x85 0xCC 0x88 0xCC 0x81
711
712 This means that in case-insensitive matching (or "loose matching",
713 as Unicode calls it), an EXACTF of length six (the UTF-8 encoded
714 byte length of the above casefolded versions) can match a target
715 string of length two (the byte length of UTF-8 encoded U+0390 or
716 U+03B0). This would rather mess up the minimum length computation.
717
718 What we'll do is to look for the tail four bytes, and then peek
719 at the preceding two bytes to see whether we need to decrease
720 the minimum length by four (six minus two).
721
722 Thanks to the design of UTF-8, there cannot be false matches:
723 A sequence of valid UTF-8 bytes cannot be a subsequence of
724 another valid sequence of UTF-8 bytes.
725
be8e71aa 726=head3 Base Struct
727
728regexp.h contains the base structure definition:
729
730 typedef struct regexp {
731 I32 *startp;
732 I32 *endp;
733 regnode *regstclass;
734 struct reg_substr_data *substrs;
735 char *precomp; /* pre-compilation regular expression */
736 struct reg_data *data; /* Additional data. */
737 char *subbeg; /* saved or original string
738 so \digit works forever. */
739 #ifdef PERL_OLD_COPY_ON_WRITE
740 SV *saved_copy; /* If non-NULL, SV which is COW from original */
741 #endif
742 U32 *offsets; /* offset annotations 20001228 MJD */
743 I32 sublen; /* Length of string pointed by subbeg */
744 I32 refcnt;
745 I32 minlen; /* mininum possible length of $& */
746 I32 prelen; /* length of precomp */
747 U32 nparens; /* number of parentheses */
748 U32 lastparen; /* last paren matched */
749 U32 lastcloseparen; /* last paren matched */
750 U32 reganch; /* Internal use only +
751 Tainted information used by regexec? */
752 regnode program[1]; /* Unwarranted chumminess with compiler. */
753 } regexp;
754
755C<program>, and C<data> are the primary fields of concern in terms of
756program structure. program is the actual array of nodes, and data is
757an array of "whatever", with each whatever being typed by letter, and
758freed or cloned as needed based on this type. regops use the data
759array to store reference data that isn't convenient to store in the regop
760itself. It also means memory management code doesnt need to traverse the
761program to find pointers. So for instance if a regop needs a pointer, the
762normal procedure is use a regnode_arg1 store the data index in the ARG
763field and look it up from the data array.
764
765startp,endp,nparens,lasparen,lastcloseparen are used to manage capture
766buffers.
767
768subbeg and optional saved_copy are used during exectuion phase for managing
769replacements.
770
771offsets and precomp are used for debugging purposes.
772
773And the rest are used for start point optimisations.
774
775
776=head2 Deallocation and Cloning
777
778Any patch that adds data items to the regexp will need to include
779changes to sv.c (Perl_re_dup) and regcomp.c (pregfree). This
780involves freeing or cloning items in the regexes data array based
781on the data items type.
782
b23a565d 783=head1 AUTHOR
784
785by Yves Orton, 2006.
786
787With excerpts from Perl, and contributions and suggestions from
788Ronald J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus,
be8e71aa 789Stephen McCamant, and David Landgren.
b23a565d 790
791=head1 LICENSE
792
793Same terms as Perl.
794
795=head1 REFERENCES
796
797[1] http://perl.plover.com/Rx/paper/
798
799=cut