Commit | Line | Data |
b23a565d |
1 | =head1 NAME |
2 | |
3 | perlreguts - Description of the Perl regular expression engine. |
4 | |
5 | =head1 DESCRIPTION |
6 | |
7 | This document is an attempt to shine some light on the guts of the regex |
4ccfbf60 |
8 | engine and how it works. The regex engine represents a significant chunk |
b23a565d |
9 | of the perl codebase, but is relatively poorly understood. This document |
10 | is a meagre attempt at addressing this situation. It is derived from the |
be8e71aa |
11 | author's experience, comments in the source code, other papers on the |
12 | regex engine, feedback on the perl5-porters mail list, and no doubt other |
13 | places as well. |
b23a565d |
14 | |
15 | B<WARNING!> It should be clearly understood that this document |
16 | represents the state of the regex engine as the author understands it at |
be8e71aa |
17 | the time of writing. It is B<NOT> an API definition; it is purely an |
b23a565d |
18 | internals guide for those who want to hack the regex engine, or |
19 | understand how the regex engine works. Readers of this document are |
be8e71aa |
20 | expected to understand perl's regex syntax and its usage in detail. If |
21 | you want to learn about the basics of Perl's regular expressions, see |
22 | L<perlre>. |
b23a565d |
23 | |
24 | =head1 OVERVIEW |
25 | |
26 | =head2 A quick note on terms |
27 | |
be8e71aa |
28 | There is some debate as to whether to say "regexp" or "regex". In this |
b23a565d |
29 | document we will use the term "regex" unless there is a special reason |
be8e71aa |
30 | not to, in which case we will explain why. |
b23a565d |
31 | |
32 | When speaking about regexes we need to distinguish between their source |
33 | code form and their internal form. In this document we will use the term |
34 | "pattern" when we speak of their textual, source code form, the term |
35 | "program" when we speak of their internal representation. These |
be8e71aa |
36 | correspond to the terms I<S-regex> and I<B-regex> that Mark Jason |
e3950ac3 |
37 | Dominus employs in his paper on "Rx" ([1] in L</REFERENCES>). |
b23a565d |
38 | |
39 | =head2 What is a regular expression engine? |
40 | |
be8e71aa |
41 | A regular expression engine is a program that takes a set of constraints |
42 | specified in a mini-language, and then applies those constraints to a |
43 | target string, and determines whether or not the string satisfies the |
44 | constraints. See L<perlre> for a full definition of the language. |
b23a565d |
45 | |
4ccfbf60 |
46 | So in less grandiose terms the first part of the job is to turn a pattern into |
b23a565d |
47 | something the computer can efficiently use to find the matching point in |
be8e71aa |
48 | the string, and the second part is performing the search itself. |
b23a565d |
49 | |
50 | To do this we need to produce a program by parsing the text. We then |
51 | need to execute the program to find the point in the string that |
52 | matches. And we need to do the whole thing efficiently. |
53 | |
54 | =head2 Structure of a Regexp Program |
55 | |
56 | =head3 High Level |
57 | |
be8e71aa |
58 | Although it is a bit confusing and some people object to the terminology, it |
b23a565d |
59 | is worth taking a look at a comment that has |
be8e71aa |
60 | been in F<regexp.h> for years: |
b23a565d |
61 | |
62 | I<This is essentially a linear encoding of a nondeterministic |
63 | finite-state machine (aka syntax charts or "railroad normal form" in |
64 | parsing technology).> |
65 | |
66 | The term "railroad normal form" is a bit esoteric, with "syntax |
67 | diagram/charts", or "railroad diagram/charts" being more common terms. |
4ccfbf60 |
68 | Nevertheless it provides a useful mental image of a regex program: each |
b23a565d |
69 | node can be thought of as a unit of track, with a single entry and in |
70 | most cases a single exit point (there are pieces of track that fork, but |
be8e71aa |
71 | statistically not many), and the whole forms a layout with a |
b23a565d |
72 | single entry and single exit point. The matching process can be thought |
be8e71aa |
73 | of as a car that moves along the track, with the particular route through |
b23a565d |
74 | the system being determined by the character read at each possible |
be8e71aa |
75 | connector point. A car can fall off the track at any point but it may |
76 | only proceed as long as it matches the track. |
b23a565d |
77 | |
78 | Thus the pattern C</foo(?:\w+|\d+|\s+)bar/> can be thought of as the |
79 | following chart: |
80 | |
be8e71aa |
81 | [start] |
82 | | |
83 | <foo> |
84 | | |
85 | +-----+-----+ |
86 | | | | |
87 | <\w+> <\d+> <\s+> |
88 | | | | |
89 | +-----+-----+ |
90 | | |
91 | <bar> |
92 | | |
93 | [end] |
94 | |
95 | The truth of the matter is that perl's regular expressions these days are |
4ccfbf60 |
96 | much more complex than this kind of structure, but visualising it this way |
97 | can help when trying to get your bearings, and it matches the |
98 | current implementation pretty closely. |
be8e71aa |
99 | |
100 | To be more precise, we will say that a regex program is an encoding |
101 | of a graph. Each node in the graph corresponds to part of |
b23a565d |
102 | the original regex pattern, such as a literal string or a branch, |
103 | and has a pointer to the nodes representing the next component |
be8e71aa |
104 | to be matched. Since "node" and "opcode" already have other meanings in the |
105 | perl source, we will call the nodes in a regex program "regops". |
b23a565d |
106 | |
be8e71aa |
107 | The program is represented by an array of C<regnode> structures, one or |
108 | more of which represent a single regop of the program. Struct |
4ccfbf60 |
109 | C<regnode> is the smallest struct needed, and has a field structure which is |
b23a565d |
110 | shared with all the other larger structures. |
111 | |
be8e71aa |
112 | The "next" pointers of all regops except C<BRANCH> implement concatenation; |
113 | a "next" pointer with a C<BRANCH> on both ends of it is connecting two |
114 | alternatives. [Here we have one of the subtle syntax dependencies: an |
115 | individual C<BRANCH> (as opposed to a collection of them) is never |
116 | concatenated with anything because of operator precedence.] |
b23a565d |
117 | |
118 | The operand of some types of regop is a literal string; for others, |
119 | it is a regop leading into a sub-program. In particular, the operand |
be8e71aa |
120 | of a C<BRANCH> node is the first regop of the branch. |
b23a565d |
121 | |
4ccfbf60 |
122 | B<NOTE>: As the railroad metaphor suggests, this is B<not> a tree |
b23a565d |
123 | structure: the tail of the branch connects to the thing following the |
be8e71aa |
124 | set of C<BRANCH>es. It is a like a single line of railway track that |
b23a565d |
125 | splits as it goes into a station or railway yard and rejoins as it comes |
126 | out the other side. |
127 | |
128 | =head3 Regops |
129 | |
be8e71aa |
130 | The base structure of a regop is defined in F<regexp.h> as follows: |
b23a565d |
131 | |
132 | struct regnode { |
be8e71aa |
133 | U8 flags; /* Various purposes, sometimes overridden */ |
b23a565d |
134 | U8 type; /* Opcode value as specified by regnodes.h */ |
135 | U16 next_off; /* Offset in size regnode */ |
136 | }; |
137 | |
be8e71aa |
138 | Other larger C<regnode>-like structures are defined in F<regcomp.h>. They |
b23a565d |
139 | are almost like subclasses in that they have the same fields as |
4ccfbf60 |
140 | C<regnode>, with possibly additional fields following in |
b23a565d |
141 | the structure, and in some cases the specific meaning (and name) |
4ccfbf60 |
142 | of some of base fields are overridden. The following is a more |
b23a565d |
143 | complete description. |
144 | |
145 | =over 4 |
146 | |
be8e71aa |
147 | =item C<regnode_1> |
b23a565d |
148 | |
be8e71aa |
149 | =item C<regnode_2> |
b23a565d |
150 | |
be8e71aa |
151 | C<regnode_1> structures have the same header, followed by a single |
152 | four-byte argument; C<regnode_2> structures contain two two-byte |
b23a565d |
153 | arguments instead: |
154 | |
155 | regnode_1 U32 arg1; |
156 | regnode_2 U16 arg1; U16 arg2; |
157 | |
be8e71aa |
158 | =item C<regnode_string> |
b23a565d |
159 | |
be8e71aa |
160 | C<regnode_string> structures, used for literal strings, follow the header |
b23a565d |
161 | with a one-byte length and then the string data. Strings are padded on |
162 | the end with zero bytes so that the total length of the node is a |
163 | multiple of four bytes: |
164 | |
165 | regnode_string char string[1]; |
be8e71aa |
166 | U8 str_len; /* overrides flags */ |
b23a565d |
167 | |
be8e71aa |
168 | =item C<regnode_charclass> |
b23a565d |
169 | |
be8e71aa |
170 | Character classes are represented by C<regnode_charclass> structures, |
b23a565d |
171 | which have a four-byte argument and then a 32-byte (256-bit) bitmap |
172 | indicating which characters are included in the class. |
173 | |
174 | regnode_charclass U32 arg1; |
175 | char bitmap[ANYOF_BITMAP_SIZE]; |
176 | |
be8e71aa |
177 | =item C<regnode_charclass_class> |
b23a565d |
178 | |
179 | There is also a larger form of a char class structure used to represent |
be8e71aa |
180 | POSIX char classes called C<regnode_charclass_class> which has an |
181 | additional 4-byte (32-bit) bitmap indicating which POSIX char class |
182 | have been included. |
b23a565d |
183 | |
184 | regnode_charclass_class U32 arg1; |
185 | char bitmap[ANYOF_BITMAP_SIZE]; |
186 | char classflags[ANYOF_CLASSBITMAP_SIZE]; |
187 | |
188 | =back |
189 | |
be8e71aa |
190 | F<regnodes.h> defines an array called C<regarglen[]> which gives the size |
191 | of each opcode in units of C<size regnode> (4-byte). A macro is used |
192 | to calculate the size of an C<EXACT> node based on its C<str_len> field. |
b23a565d |
193 | |
be8e71aa |
194 | The regops are defined in F<regnodes.h> which is generated from |
195 | F<regcomp.sym> by F<regcomp.pl>. Currently the maximum possible number |
196 | of distinct regops is restricted to 256, with about a quarter already |
b23a565d |
197 | used. |
198 | |
be8e71aa |
199 | A set of macros makes accessing the fields |
4ccfbf60 |
200 | easier and more consistent. These include C<OP()>, which is used to determine |
201 | the type of a C<regnode>-like structure; C<NEXT_OFF()>, which is the offset to |
202 | the next node (more on this later); C<ARG()>, C<ARG1()>, C<ARG2()>, C<ARG_SET()>, |
203 | and equivalents for reading and setting the arguments; and C<STR_LEN()>, |
be8e71aa |
204 | C<STRING()> and C<OPERAND()> for manipulating strings and regop bearing |
b23a565d |
205 | types. |
206 | |
be8e71aa |
207 | =head3 What regop is next? |
b23a565d |
208 | |
209 | There are three distinct concepts of "next" in the regex engine, and |
210 | it is important to keep them clear. |
211 | |
212 | =over 4 |
213 | |
214 | =item * |
215 | |
216 | There is the "next regnode" from a given regnode, a value which is |
217 | rarely useful except that sometimes it matches up in terms of value |
218 | with one of the others, and that sometimes the code assumes this to |
219 | always be so. |
220 | |
221 | =item * |
222 | |
be8e71aa |
223 | There is the "next regop" from a given regop/regnode. This is the |
224 | regop physically located after the the current one, as determined by |
225 | the size of the current regop. This is often useful, such as when |
b23a565d |
226 | dumping the structure we use this order to traverse. Sometimes the code |
be8e71aa |
227 | assumes that the "next regnode" is the same as the "next regop", or in |
228 | other words assumes that the sizeof a given regop type is always going |
229 | to be one regnode large. |
b23a565d |
230 | |
231 | =item * |
232 | |
be8e71aa |
233 | There is the "regnext" from a given regop. This is the regop which |
234 | is reached by jumping forward by the value of C<NEXT_OFF()>, |
235 | or in a few cases for longer jumps by the C<arg1> field of the C<regnode_1> |
236 | structure. The subroutine C<regnext()> handles this transparently. |
b23a565d |
237 | This is the logical successor of the node, which in some cases, like |
be8e71aa |
238 | that of the C<BRANCH> regop, has special meaning. |
b23a565d |
239 | |
240 | =back |
241 | |
be8e71aa |
242 | =head1 Process Overview |
b23a565d |
243 | |
be8e71aa |
244 | Broadly speaking, performing a match of a string against a pattern |
245 | involves the following steps: |
246 | |
247 | =over 5 |
248 | |
249 | =item A. Compilation |
250 | |
251 | =over 5 |
252 | |
253 | =item 1. Parsing for size |
254 | |
255 | =item 2. Parsing for construction |
256 | |
257 | =item 3. Peep-hole optimisation and analysis |
258 | |
259 | =back |
260 | |
261 | =item B. Execution |
262 | |
263 | =over 5 |
264 | |
265 | =item 4. Start position and no-match optimisations |
266 | |
267 | =item 5. Program execution |
268 | |
269 | =back |
270 | |
271 | =back |
b23a565d |
272 | |
b23a565d |
273 | |
274 | Where these steps occur in the actual execution of a perl program is |
275 | determined by whether the pattern involves interpolating any string |
be8e71aa |
276 | variables. If interpolation occurs, then compilation happens at run time. If it |
277 | does not, then compilation is performed at compile time. (The C</o> modifier changes this, |
4ccfbf60 |
278 | as does C<qr//> to a certain extent.) The engine doesn't really care that |
b23a565d |
279 | much. |
280 | |
281 | =head2 Compilation |
282 | |
be8e71aa |
283 | This code resides primarily in F<regcomp.c>, along with the header files |
284 | F<regcomp.h>, F<regexp.h> and F<regnodes.h>. |
b23a565d |
285 | |
4ccfbf60 |
286 | Compilation starts with C<pregcomp()>, which is mostly an initialisation |
287 | wrapper which farms work out to two other routines for the heavy lifting: the |
288 | first is C<reg()>, which is the start point for parsing; the second, |
289 | C<study_chunk()>, is responsible for optimisation. |
b23a565d |
290 | |
4ccfbf60 |
291 | Initialisation in C<pregcomp()> mostly involves the creation and data-filling |
292 | of a special structure, C<RExC_state_t> (defined in F<regcomp.c>). |
293 | Almost all internally-used routines in F<regcomp.h> take a pointer to one |
be8e71aa |
294 | of these structures as their first argument, with the name C<pRExC_state>. |
b23a565d |
295 | This structure is used to store the compilation state and contains many |
be8e71aa |
296 | fields. Likewise there are many macros which operate on this |
4ccfbf60 |
297 | variable: anything that looks like C<RExC_xxxx> is a macro that operates on |
b23a565d |
298 | this pointer/structure. |
299 | |
300 | =head3 Parsing for size |
301 | |
302 | In this pass the input pattern is parsed in order to calculate how much |
be8e71aa |
303 | space is needed for each regop we would need to emit. The size is also |
b23a565d |
304 | used to determine whether long jumps will be required in the program. |
305 | |
be8e71aa |
306 | This stage is controlled by the macro C<SIZE_ONLY> being set. |
b23a565d |
307 | |
4ccfbf60 |
308 | The parse proceeds pretty much exactly as it does during the |
be8e71aa |
309 | construction phase, except that most routines are short-circuited to |
310 | change the size field C<RExC_size> and not do anything else. |
b23a565d |
311 | |
4ccfbf60 |
312 | =head3 Parsing for construction |
b23a565d |
313 | |
be8e71aa |
314 | Once the size of the program has been determined, the pattern is parsed |
315 | again, but this time for real. Now C<SIZE_ONLY> will be false, and the |
b23a565d |
316 | actual construction can occur. |
317 | |
318 | C<reg()> is the start of the parse process. It is responsible for |
319 | parsing an arbitrary chunk of pattern up to either the end of the |
320 | string, or the first closing parenthesis it encounters in the pattern. |
4ccfbf60 |
321 | This means it can be used to parse the top-level regex, or any section |
b23a565d |
322 | inside of a grouping parenthesis. It also handles the "special parens" |
be8e71aa |
323 | that perl's regexes have. For instance when parsing C</x(?:foo)y/> C<reg()> |
324 | will at one point be called to parse from the "?" symbol up to and |
325 | including the ")". |
b23a565d |
326 | |
be8e71aa |
327 | Additionally, C<reg()> is responsible for parsing the one or more |
b23a565d |
328 | branches from the pattern, and for "finishing them off" by correctly |
be8e71aa |
329 | setting their next pointers. In order to do the parsing, it repeatedly |
330 | calls out to C<regbranch()>, which is responsible for handling up to the |
b23a565d |
331 | first C<|> symbol it sees. |
332 | |
be8e71aa |
333 | C<regbranch()> in turn calls C<regpiece()> which |
334 | handles "things" followed by a quantifier. In order to parse the |
335 | "things", C<regatom()> is called. This is the lowest level routine which |
336 | parses out constant strings, character classes, and the |
337 | various special symbols like C<$>. If C<regatom()> encounters a "(" |
b23a565d |
338 | character it in turn calls C<reg()>. |
339 | |
340 | The routine C<regtail()> is called by both C<reg()>, C<regbranch()> |
341 | in order to "set the tail pointer" correctly. When executing and |
be8e71aa |
342 | we get to the end of a branch, we need to go to the node following the |
343 | grouping parens. When parsing, however, we don't know where the end will |
b23a565d |
344 | be until we get there, so when we do we must go back and update the |
345 | offsets as appropriate. C<regtail> is used to make this easier. |
346 | |
be8e71aa |
347 | A subtlety of the parsing process means that a regex like C</foo/> is |
b23a565d |
348 | originally parsed into an alternation with a single branch. It is only |
4ccfbf60 |
349 | afterwards that the optimiser converts single branch alternations into the |
b23a565d |
350 | simpler form. |
351 | |
352 | =head3 Parse Call Graph and a Grammar |
353 | |
354 | The call graph looks like this: |
355 | |
356 | reg() # parse a top level regex, or inside of parens |
357 | regbranch() # parse a single branch of an alternation |
358 | regpiece() # parse a pattern followed by a quantifier |
359 | regatom() # parse a simple pattern |
360 | regclass() # used to handle a class |
4ccfbf60 |
361 | reg() # used to handle a parenthesised subpattern |
b23a565d |
362 | .... |
363 | ... |
364 | regtail() # finish off the branch |
365 | ... |
366 | regtail() # finish off the branch sequence. Tie each |
4ccfbf60 |
367 | # branch's tail to the tail of the sequence |
b23a565d |
368 | # (NEW) In Debug mode this is |
369 | # regtail_study(). |
370 | |
371 | A grammar form might be something like this: |
372 | |
373 | atom : constant | class |
374 | quant : '*' | '+' | '?' | '{min,max}' |
375 | _branch: piece |
376 | | piece _branch |
377 | | nothing |
378 | branch: _branch |
379 | | _branch '|' branch |
380 | group : '(' branch ')' |
381 | _piece: atom | group |
382 | piece : _piece |
383 | | _piece quant |
384 | |
385 | =head3 Debug Output |
386 | |
4ccfbf60 |
387 | In the 5.9.x development version of perl you can C<< use re Debug => 'PARSE'; >> to see some trace |
b23a565d |
388 | information about the parse process. We will start with some simple |
389 | patterns and build up to more complex patterns. |
390 | |
391 | So when we parse C</foo/> we see something like the following table. The |
4ccfbf60 |
392 | left shows what is being parsed, and the number indicates where the next regop |
b23a565d |
393 | would go. The stuff on the right is the trace output of the graph. The |
394 | names are chosen to be short to make it less dense on the screen. 'tsdy' |
395 | is a special form of C<regtail()> which does some extra analysis. |
396 | |
4ccfbf60 |
397 | >foo< 1 reg |
398 | brnc |
399 | piec |
400 | atom |
401 | >< 4 tsdy~ EXACT <foo> (EXACT) (1) |
402 | ~ attach to END (3) offset to 2 |
b23a565d |
403 | |
404 | The resulting program then looks like: |
405 | |
406 | 1: EXACT <foo>(3) |
407 | 3: END(0) |
408 | |
409 | As you can see, even though we parsed out a branch and a piece, it was ultimately |
be8e71aa |
410 | only an atom. The final program shows us how things work. We have an C<EXACT> regop, |
411 | followed by an C<END> regop. The number in parens indicates where the C<regnext> of |
412 | the node goes. The C<regnext> of an C<END> regop is unused, as C<END> regops mean |
b23a565d |
413 | we have successfully matched. The number on the left indicates the position of |
414 | the regop in the regnode array. |
415 | |
be8e71aa |
416 | Now let's try a harder pattern. We will add a quantifier, so now we have the pattern |
4ccfbf60 |
417 | C</foo+/>. We will see that C<regbranch()> calls C<regpiece()> twice. |
418 | |
419 | >foo+< 1 reg |
420 | brnc |
421 | piec |
422 | atom |
423 | >o+< 3 piec |
424 | atom |
425 | >< 6 tail~ EXACT <fo> (1) |
426 | 7 tsdy~ EXACT <fo> (EXACT) (1) |
427 | ~ PLUS (END) (3) |
428 | ~ attach to END (6) offset to 3 |
b23a565d |
429 | |
430 | And we end up with the program: |
431 | |
432 | 1: EXACT <fo>(3) |
433 | 3: PLUS(6) |
434 | 4: EXACT <o>(0) |
435 | 6: END(0) |
436 | |
be8e71aa |
437 | Now we have a special case. The C<EXACT> regop has a C<regnext> of 0. This is |
4ccfbf60 |
438 | because if it matches it should try to match itself again. The C<PLUS> regop |
be8e71aa |
439 | handles the actual failure of the C<EXACT> regop and acts appropriately (going |
4ccfbf60 |
440 | to regnode 6 if the C<EXACT> matched at least once, or failing if it didn't). |
b23a565d |
441 | |
442 | Now for something much more complex: C</x(?:foo*|b[a][rR])(foo|bar)$/> |
443 | |
4ccfbf60 |
444 | >x(?:foo*|b... 1 reg |
445 | brnc |
446 | piec |
447 | atom |
448 | >(?:foo*|b[... 3 piec |
449 | atom |
450 | >?:foo*|b[a... reg |
451 | >foo*|b[a][... brnc |
b23a565d |
452 | piec |
453 | atom |
4ccfbf60 |
454 | >o*|b[a][rR... 5 piec |
455 | atom |
456 | >|b[a][rR])... 8 tail~ EXACT <fo> (3) |
457 | >b[a][rR])(... 9 brnc |
458 | 10 piec |
459 | atom |
460 | >[a][rR])(f... 12 piec |
b23a565d |
461 | atom |
4ccfbf60 |
462 | >a][rR])(fo... clas |
463 | >[rR])(foo|... 14 tail~ EXACT <b> (10) |
b23a565d |
464 | piec |
465 | atom |
4ccfbf60 |
466 | >rR])(foo|b... clas |
467 | >)(foo|bar)... 25 tail~ EXACT <a> (12) |
468 | tail~ BRANCH (3) |
469 | 26 tsdy~ BRANCH (END) (9) |
470 | ~ attach to TAIL (25) offset to 16 |
471 | tsdy~ EXACT <fo> (EXACT) (4) |
472 | ~ STAR (END) (6) |
473 | ~ attach to TAIL (25) offset to 19 |
474 | tsdy~ EXACT <b> (EXACT) (10) |
475 | ~ EXACT <a> (EXACT) (12) |
476 | ~ ANYOF[Rr] (END) (14) |
477 | ~ attach to TAIL (25) offset to 11 |
478 | >(foo|bar)$< tail~ EXACT <x> (1) |
479 | piec |
480 | atom |
481 | >foo|bar)$< reg |
482 | 28 brnc |
b23a565d |
483 | piec |
484 | atom |
4ccfbf60 |
485 | >|bar)$< 31 tail~ OPEN1 (26) |
486 | >bar)$< brnc |
487 | 32 piec |
488 | atom |
489 | >)$< 34 tail~ BRANCH (28) |
490 | 36 tsdy~ BRANCH (END) (31) |
491 | ~ attach to CLOSE1 (34) offset to 3 |
492 | tsdy~ EXACT <foo> (EXACT) (29) |
493 | ~ attach to CLOSE1 (34) offset to 5 |
494 | tsdy~ EXACT <bar> (EXACT) (32) |
495 | ~ attach to CLOSE1 (34) offset to 2 |
496 | >$< tail~ BRANCH (3) |
497 | ~ BRANCH (9) |
498 | ~ TAIL (25) |
499 | piec |
500 | atom |
501 | >< 37 tail~ OPEN1 (26) |
502 | ~ BRANCH (28) |
503 | ~ BRANCH (31) |
504 | ~ CLOSE1 (34) |
505 | 38 tsdy~ EXACT <x> (EXACT) (1) |
506 | ~ BRANCH (END) (3) |
507 | ~ BRANCH (END) (9) |
508 | ~ TAIL (END) (25) |
509 | ~ OPEN1 (END) (26) |
510 | ~ BRANCH (END) (28) |
511 | ~ BRANCH (END) (31) |
512 | ~ CLOSE1 (END) (34) |
513 | ~ EOL (END) (36) |
514 | ~ attach to END (37) offset to 1 |
b23a565d |
515 | |
516 | Resulting in the program |
517 | |
518 | 1: EXACT <x>(3) |
519 | 3: BRANCH(9) |
520 | 4: EXACT <fo>(6) |
521 | 6: STAR(26) |
522 | 7: EXACT <o>(0) |
523 | 9: BRANCH(25) |
524 | 10: EXACT <ba>(14) |
525 | 12: OPTIMIZED (2 nodes) |
526 | 14: ANYOF[Rr](26) |
527 | 25: TAIL(26) |
528 | 26: OPEN1(28) |
529 | 28: TRIE-EXACT(34) |
530 | [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf] |
531 | <foo> |
532 | <bar> |
533 | 30: OPTIMIZED (4 nodes) |
534 | 34: CLOSE1(36) |
535 | 36: EOL(37) |
536 | 37: END(0) |
537 | |
538 | Here we can see a much more complex program, with various optimisations in |
be8e71aa |
539 | play. At regnode 10 we see an example where a character class with only |
540 | one character in it was turned into an C<EXACT> node. We can also see where |
541 | an entire alternation was turned into a C<TRIE-EXACT> node. As a consequence, |
b23a565d |
542 | some of the regnodes have been marked as optimised away. We can see that |
be8e71aa |
543 | the C<$> symbol has been converted into an C<EOL> regop, a special piece of |
544 | code that looks for C<\n> or the end of the string. |
b23a565d |
545 | |
be8e71aa |
546 | The next pointer for C<BRANCH>es is interesting in that it points at where |
b23a565d |
547 | execution should go if the branch fails. When executing if the engine |
be8e71aa |
548 | tries to traverse from a branch to a C<regnext> that isn't a branch then |
549 | the engine will know that the entire set of branches have failed. |
b23a565d |
550 | |
551 | =head3 Peep-hole Optimisation and Analysis |
552 | |
553 | The regular expression engine can be a weighty tool to wield. On long |
554 | strings and complex patterns it can end up having to do a lot of work |
555 | to find a match, and even more to decide that no match is possible. |
556 | Consider a situation like the following pattern. |
557 | |
558 | 'ababababababababababab' =~ /(a|b)*z/ |
559 | |
560 | The C<(a|b)*> part can match at every char in the string, and then fail |
561 | every time because there is no C<z> in the string. So obviously we can |
4ccfbf60 |
562 | avoid using the regex engine unless there is a C<z> in the string. |
b23a565d |
563 | Likewise in a pattern like: |
564 | |
565 | /foo(\w+)bar/ |
566 | |
567 | In this case we know that the string must contain a C<foo> which must be |
4ccfbf60 |
568 | followed by C<bar>. We can use Fast Boyer-Moore matching as implemented |
be8e71aa |
569 | in C<fbm_instr()> to find the location of these strings. If they don't exist |
570 | then we don't need to resort to the much more expensive regex engine. |
571 | Even better, if they do exist then we can use their positions to |
b23a565d |
572 | reduce the search space that the regex engine needs to cover to determine |
be8e71aa |
573 | if the entire pattern matches. |
b23a565d |
574 | |
575 | There are various aspects of the pattern that can be used to facilitate |
576 | optimisations along these lines: |
577 | |
4ccfbf60 |
578 | =over 5 |
579 | |
580 | =item * anchored fixed strings |
581 | |
582 | =item * floating fixed strings |
583 | |
584 | =item * minimum and maximum length requirements |
585 | |
586 | =item * start class |
587 | |
588 | =item * Beginning/End of line positions |
589 | |
590 | =back |
b23a565d |
591 | |
592 | Another form of optimisation that can occur is post-parse "peep-hole" |
be8e71aa |
593 | optimisations, where inefficient constructs are replaced by |
594 | more efficient constructs. An example of this are C<TAIL> regops which are used |
b23a565d |
595 | during parsing to mark the end of branches and the end of groups. These |
4ccfbf60 |
596 | regops are used as place-holders during construction and "always match" |
b23a565d |
597 | so they can be "optimised away" by making the things that point to the |
4ccfbf60 |
598 | C<TAIL> point to thing that the C<TAIL> points to, thus "skipping" the node. |
b23a565d |
599 | |
be8e71aa |
600 | Another optimisation that can occur is that of "C<EXACT> merging" which is |
601 | where two consecutive C<EXACT> nodes are merged into a single |
4ccfbf60 |
602 | regop. An even more aggressive form of this is that a branch |
603 | sequence of the form C<EXACT BRANCH ... EXACT> can be converted into a |
be8e71aa |
604 | C<TRIE-EXACT> regop. |
b23a565d |
605 | |
be8e71aa |
606 | All of this occurs in the routine C<study_chunk()> which uses a special |
607 | structure C<scan_data_t> to store the analysis that it has performed, and |
608 | does the "peep-hole" optimisations as it goes. |
b23a565d |
609 | |
be8e71aa |
610 | The code involved in C<study_chunk()> is extremely cryptic. Be careful. :-) |
b23a565d |
611 | |
612 | =head2 Execution |
613 | |
614 | Execution of a regex generally involves two phases, the first being |
615 | finding the start point in the string where we should match from, |
616 | and the second being running the regop interpreter. |
617 | |
be8e71aa |
618 | If we can tell that there is no valid start point then we don't bother running |
619 | interpreter at all. Likewise, if we know from the analysis phase that we |
620 | cannot detect a short-cut to the start position, we go straight to the |
b23a565d |
621 | interpreter. |
622 | |
be8e71aa |
623 | The two entry points are C<re_intuit_start()> and C<pregexec()>. These routines |
b23a565d |
624 | have a somewhat incestuous relationship with overlap between their functions, |
be8e71aa |
625 | and C<pregexec()> may even call C<re_intuit_start()> on its own. Nevertheless |
4ccfbf60 |
626 | other parts of the the perl source code may call into either, or both. |
b23a565d |
627 | |
628 | Execution of the interpreter itself used to be recursive. Due to the |
4ccfbf60 |
629 | efforts of Dave Mitchell in the 5.9.x development track, it is now iterative. Now an |
b23a565d |
630 | internal stack is maintained on the heap and the routine is fully |
631 | iterative. This can make it tricky as the code is quite conservative |
4ccfbf60 |
632 | about what state it stores, with the result that that two consecutive lines in the |
b23a565d |
633 | code can actually be running in totally different contexts due to the |
634 | simulated recursion. |
635 | |
636 | =head3 Start position and no-match optimisations |
637 | |
4ccfbf60 |
638 | C<re_intuit_start()> is responsible for handling start points and no-match |
b23a565d |
639 | optimisations as determined by the results of the analysis done by |
be8e71aa |
640 | C<study_chunk()> (and described in L<Peep-hole Optimisation and Analysis>). |
b23a565d |
641 | |
4ccfbf60 |
642 | The basic structure of this routine is to try to find the start- and/or |
643 | end-points of where the pattern could match, and to ensure that the string |
644 | is long enough to match the pattern. It tries to use more efficient |
645 | methods over less efficient methods and may involve considerable |
646 | cross-checking of constraints to find the place in the string that matches. |
b23a565d |
647 | For instance it may try to determine that a given fixed string must be |
648 | not only present but a certain number of chars before the end of the |
649 | string, or whatever. |
650 | |
be8e71aa |
651 | It calls several other routines, such as C<fbm_instr()> which does |
4ccfbf60 |
652 | Fast Boyer Moore matching and C<find_byclass()> which is responsible for |
b23a565d |
653 | finding the start using the first mandatory regop in the program. |
654 | |
4ccfbf60 |
655 | When the optimisation criteria have been satisfied, C<reg_try()> is called |
b23a565d |
656 | to perform the match. |
657 | |
658 | =head3 Program execution |
659 | |
660 | C<pregexec()> is the main entry point for running a regex. It contains |
4ccfbf60 |
661 | support for initialising the regex interpreter's state, running |
662 | C<re_intuit_start()> if needed, and running the interpreter on the string |
663 | from various start positions as needed. When it is necessary to use |
b23a565d |
664 | the regex interpreter C<pregexec()> calls C<regtry()>. |
665 | |
666 | C<regtry()> is the entry point into the regex interpreter. It expects |
be8e71aa |
667 | as arguments a pointer to a C<regmatch_info> structure and a pointer to |
b23a565d |
668 | a string. It returns an integer 1 for success and a 0 for failure. |
4ccfbf60 |
669 | It is basically a set-up wrapper around C<regmatch()>. |
b23a565d |
670 | |
671 | C<regmatch> is the main "recursive loop" of the interpreter. It is |
e3950ac3 |
672 | basically a giant switch statement that implements a state machine, where |
673 | the possible states are the regops themselves, plus a number of additional |
674 | intermediate and failure states. A few of the states are implemented as |
675 | subroutines but the bulk are inline code. |
b23a565d |
676 | |
677 | =head1 MISCELLANEOUS |
678 | |
4ccfbf60 |
679 | =head2 Unicode and Localisation Support |
680 | |
681 | When dealing with strings containing characters that cannot be represented |
9af228c6 |
682 | using an eight-bit character set, perl uses an internal representation |
4ccfbf60 |
683 | that is a permissive version of Unicode's UTF-8 encoding[2]. This uses single |
9af228c6 |
684 | bytes to represent characters from the ASCII character set, and sequences |
4ccfbf60 |
685 | of two or more bytes for all other characters. (See L<perlunitut> |
686 | for more information about the relationship between UTF-8 and perl's |
687 | encoding, utf8 -- the difference isn't important for this discussion.) |
b23a565d |
688 | |
be8e71aa |
689 | No matter how you look at it, Unicode support is going to be a pain in a |
b23a565d |
690 | regex engine. Tricks that might be fine when you have 256 possible |
be8e71aa |
691 | characters often won't scale to handle the size of the UTF-8 character |
b23a565d |
692 | set. Things you can take for granted with ASCII may not be true with |
4ccfbf60 |
693 | Unicode. For instance, in ASCII, it is safe to assume that |
be8e71aa |
694 | C<sizeof(char1) == sizeof(char2)>, but in UTF-8 it isn't. Unicode case folding is |
695 | vastly more complex than the simple rules of ASCII, and even when not |
4ccfbf60 |
696 | using Unicode but only localised single byte encodings, things can get |
697 | tricky (for example, GERMAN-SHARP-ESS should match 'SS' in localised |
698 | case-insensitive matching). |
be8e71aa |
699 | |
700 | Making things worse is that UTF-8 support was a later addition to the |
701 | regex engine (as it was to perl) and this necessarily made things a lot |
702 | more complicated. Obviously it is easier to design a regex engine with |
703 | Unicode support in mind from the beginning than it is to retrofit it to |
704 | one that wasn't. |
705 | |
4ccfbf60 |
706 | Nearly all regops that involve looking at the input string have |
be8e71aa |
707 | two cases, one for UTF-8, and one not. In fact, it's often more complex |
708 | than that, as the pattern may be UTF-8 as well. |
b23a565d |
709 | |
710 | Care must be taken when making changes to make sure that you handle |
be8e71aa |
711 | UTF-8 properly, both at compile time and at execution time, including |
b23a565d |
712 | when the string and pattern are mismatched. |
713 | |
be8e71aa |
714 | The following comment in F<regcomp.h> gives an example of exactly how |
b23a565d |
715 | tricky this can be: |
716 | |
717 | Two problematic code points in Unicode casefolding of EXACT nodes: |
718 | |
719 | U+0390 - GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS |
720 | U+03B0 - GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS |
721 | |
722 | which casefold to |
723 | |
724 | Unicode UTF-8 |
725 | |
726 | U+03B9 U+0308 U+0301 0xCE 0xB9 0xCC 0x88 0xCC 0x81 |
727 | U+03C5 U+0308 U+0301 0xCF 0x85 0xCC 0x88 0xCC 0x81 |
728 | |
729 | This means that in case-insensitive matching (or "loose matching", |
730 | as Unicode calls it), an EXACTF of length six (the UTF-8 encoded |
731 | byte length of the above casefolded versions) can match a target |
732 | string of length two (the byte length of UTF-8 encoded U+0390 or |
733 | U+03B0). This would rather mess up the minimum length computation. |
734 | |
735 | What we'll do is to look for the tail four bytes, and then peek |
736 | at the preceding two bytes to see whether we need to decrease |
737 | the minimum length by four (six minus two). |
738 | |
739 | Thanks to the design of UTF-8, there cannot be false matches: |
740 | A sequence of valid UTF-8 bytes cannot be a subsequence of |
741 | another valid sequence of UTF-8 bytes. |
742 | |
4ccfbf60 |
743 | =head2 Base Struct |
be8e71aa |
744 | |
4ccfbf60 |
745 | F<regexp.h> contains the base structure definition: |
be8e71aa |
746 | |
747 | typedef struct regexp { |
9af228c6 |
748 | I32 *startp; |
749 | I32 *endp; |
750 | regnode *regstclass; |
751 | struct reg_substr_data *substrs; |
752 | char *precomp; /* pre-compilation regular expression */ |
753 | struct reg_data *data; /* Additional data. */ |
754 | char *subbeg; /* saved or original string |
755 | so \digit works forever. */ |
be8e71aa |
756 | #ifdef PERL_OLD_COPY_ON_WRITE |
9af228c6 |
757 | SV *saved_copy; /* If non-NULL, SV which is COW from original */ |
be8e71aa |
758 | #endif |
9af228c6 |
759 | U32 *offsets; /* offset annotations 20001228 MJD */ |
760 | I32 sublen; /* Length of string pointed by subbeg */ |
761 | I32 refcnt; |
762 | I32 minlen; /* mininum possible length of $& */ |
763 | I32 prelen; /* length of precomp */ |
764 | U32 nparens; /* number of parentheses */ |
765 | U32 lastparen; /* last paren matched */ |
766 | U32 lastcloseparen; /* last paren matched */ |
767 | U32 reganch; /* Internal use only + |
768 | Tainted information used by regexec? */ |
769 | HV *paren_names; /* Paren names */ |
770 | const struct regexp_engine* engine; |
771 | regnode program[1]; /* Unwarranted chumminess with compiler. */ |
be8e71aa |
772 | } regexp; |
773 | |
4ccfbf60 |
774 | =over 5 |
775 | |
9af228c6 |
776 | =item C<program> |
777 | |
778 | Compiled program. Inlined into the structure so the entire struct can be |
779 | treated as a single blob. |
780 | |
781 | =item C<data> |
782 | |
783 | This field points at a reg_data structure, which is defined as follows |
784 | |
785 | struct reg_data { |
786 | U32 count; |
787 | U8 *what; |
788 | void* data[1]; |
789 | }; |
790 | |
791 | This structure is used for handling data structures that the regex engine |
792 | needs to handle specially during a clone or free operation on the compiled |
793 | product. Each element in the data array has a corresponding element in the |
794 | what array. During compilation regops that need special structures stored |
795 | will add an element to each array using the add_data() routine and then store |
796 | the index in the regop. |
797 | |
798 | =item C<nparens>, C<lasparen>, and C<lastcloseparen> |
799 | |
800 | These fields are used to keep track of how many paren groups could be matched |
801 | in the pattern, which was the last open paren to be entered, and which was |
802 | the last close paren to be entered. |
803 | |
804 | =item C<startp>, C<endp> |
805 | |
806 | These fields store arrays that are used to hold the offsets of the begining |
807 | and end of each capture group that has matched. -1 is used to indicate no match. |
808 | |
809 | These are the source for @- and @+. |
810 | |
811 | =item C<subbeg> C<sublen> C<saved_copy> |
812 | |
813 | These are used during execution phase for managing search and replace |
814 | patterns. |
4ccfbf60 |
815 | |
9af228c6 |
816 | =item C<precomp> C<prelen> C<offsets> |
be8e71aa |
817 | |
9af228c6 |
818 | Used for debugging purposes. C<precomp> holds a copy of the pattern |
819 | that was compiled, offsets holds a mapping of offset in the C<program> |
820 | to offset in the C<precomp> string. This is only used by ActiveStates |
821 | visual regex debugger. |
4ccfbf60 |
822 | |
9af228c6 |
823 | =item C<reg_substr_data> |
be8e71aa |
824 | |
9af228c6 |
825 | Holds information on the longest string that must occur at a fixed |
826 | offset from the start of the pattern, and the longest string that must |
827 | occur at a floating offset from the start of the pattern. Used to do |
828 | Fast-Boyer-Moore searches on the string to find out if its worth using |
829 | the regex engine at all, and if so where in the string to search. |
be8e71aa |
830 | |
9af228c6 |
831 | =item C<regstclass> |
be8e71aa |
832 | |
9af228c6 |
833 | Special regop that is used by C<re_intuit_start()> to check if a pattern |
834 | can match at a certain position. For instance if the regex engine knows |
835 | that the pattern must start with a 'Z' then it can scan the string until |
836 | it finds one and then launch the regex engine from there. The routine |
837 | that handles this is called C<find_by_class()>. Sometimes this field |
838 | points at a regop embedded in the program, and sometimes it points at |
839 | an independent synthetic regop that has been constructed by the optimiser. |
be8e71aa |
840 | |
9af228c6 |
841 | =item C<minlen> |
842 | |
843 | The minimum possible length of the final matching string. This is used |
844 | to prune the search space by not bothering to match any closer to the |
845 | end of a string than would allow a match. For instance there is no point |
846 | in even starting the regex engine if the minlen is 10 but the string |
847 | is only 5 characters long. There is no way that the pattern can match. |
848 | |
849 | =item C<reganch> |
850 | |
851 | This is used to store various flags about the pattern, such as whether it |
852 | contains a \G or a ^ or $ symbol. |
853 | |
854 | =item C<paren_names> |
855 | |
856 | This is a hash used internally to track named capture buffers and their |
857 | offsets. The keys are the names of the buffers the values are dualvars, |
858 | with the IV slot holding the number of buffers with the given name and the |
859 | pv being an embedded array of I32. The values may also be contained |
860 | independently in the data array in cases where named backreferences are |
861 | used. |
862 | |
863 | =item C<refcnt> |
864 | |
865 | The number of times the structure is referenced. When this falls to 0 |
866 | the regexp is automatically freed by a call to pregfree. |
867 | |
868 | =item C<engine> |
869 | |
870 | This field points at a regexp_engine structure which contains pointers |
871 | to the subroutine that are to be used for performing a match. It |
872 | is the compiling routines responsibility to populate this field before |
873 | returning the regexp object. |
4ccfbf60 |
874 | |
875 | =back |
876 | |
9af228c6 |
877 | =head2 Pluggable Interface |
878 | |
879 | As of Perl 5.9.5 there is a new interface for using other regexp engines |
880 | than the default one. Each engine is supposed to provide access to |
881 | a constant structure of the following format: |
882 | |
883 | typedef struct regexp_engine { |
884 | regexp* (*comp) (pTHX_ char* exp, char* xend, PMOP* pm); |
885 | I32 (*exec) (pTHX_ regexp* prog, char* stringarg, char* strend, |
886 | char* strbeg, I32 minend, SV* screamer, |
887 | void* data, U32 flags); |
888 | char* (*intuit) (pTHX_ regexp *prog, SV *sv, char *strpos, |
889 | char *strend, U32 flags, |
890 | struct re_scream_pos_data_s *data); |
891 | SV* (*checkstr) (pTHX_ regexp *prog); |
892 | void (*free) (pTHX_ struct regexp* r); |
893 | #ifdef USE_ITHREADS |
894 | regexp* (*dupe) (pTHX_ const regexp *r, CLONE_PARAMS *param); |
895 | #endif |
896 | } regexp_engine; |
897 | |
898 | When a regexp is compiled its C<engine> field is then set to point at |
899 | the appropriate structure so that when it needs to be used it can find |
900 | the right routines to do so. |
901 | |
902 | In order to install a new regexp handler, C<$^H{regcomp}> is set |
903 | to an integer which (when casted appropriately) resolves to one of these |
904 | structures. When compiling the C<comp> method is executed, and the |
905 | resulting regexp structures engine field is expected to point back at |
906 | the same structure. |
907 | |
908 | The pTHX_ symbol in the definition is a macro used by perl under threading |
909 | to provide an extra argument to the routine holding a pointer back to |
910 | the interpreter that is executing the regexp. So under threading all |
911 | routines get an extra argument. |
912 | |
913 | The routines are as follows: |
914 | |
915 | =over 4 |
916 | |
917 | =item comp |
918 | |
919 | regexp* comp(char *exp, char *xend, PMOP pm); |
920 | |
921 | Compile the pattern between exp and xend using the flags contained in |
922 | pm and return a pointer to a prepared regexp structure that can perform |
923 | the match. |
924 | |
925 | =item exec |
926 | |
927 | I32 exec(regexp* prog, |
928 | char *stringarg, char* strend, char* strbeg, |
929 | I32 minend, SV* screamer, |
930 | void* data, U32 flags); |
931 | |
932 | Execute a regexp. |
933 | |
934 | =item intuit |
935 | |
936 | char* intuit( regexp *prog, |
937 | SV *sv, char *strpos, char *strend, |
938 | U32 flags, struct re_scream_pos_data_s *data); |
939 | |
940 | Find the start position where a regex match should be attempted, |
941 | or possibly whether the regex engine should not be run because the |
942 | pattern can't match. |
943 | |
944 | =item checkstr |
945 | |
946 | SV* checkstr(regexp *prog); |
947 | |
948 | Return a SV containing a string that must appear in the pattern. Used |
949 | for optimising matches. |
950 | |
951 | =item free |
952 | |
953 | void free(regexp *prog); |
954 | |
955 | Release any resources allocated to store this pattern. After this |
956 | call prog is an invalid pointer. |
957 | |
958 | =item dupe |
959 | |
960 | regexp* dupe(const regexp *r, CLONE_PARAMS *param); |
961 | |
962 | On threaded builds a regexp may need to be duplicated so that the pattern |
963 | can be used by mutiple threads. This routine is expected to handle the |
964 | duplication. On unthreaded builds this field doesnt exist. |
965 | |
966 | =back |
967 | |
968 | |
4ccfbf60 |
969 | =head2 De-allocation and Cloning |
be8e71aa |
970 | |
971 | Any patch that adds data items to the regexp will need to include |
4ccfbf60 |
972 | changes to F<sv.c> (C<Perl_re_dup()>) and F<regcomp.c> (C<pregfree()>). This |
be8e71aa |
973 | involves freeing or cloning items in the regexes data array based |
4ccfbf60 |
974 | on the data item's type. |
975 | |
976 | =head1 SEE ALSO |
977 | |
978 | L<perlre> |
979 | |
980 | L<perlunitut> |
be8e71aa |
981 | |
b23a565d |
982 | =head1 AUTHOR |
983 | |
984 | by Yves Orton, 2006. |
985 | |
986 | With excerpts from Perl, and contributions and suggestions from |
987 | Ronald J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus, |
be8e71aa |
988 | Stephen McCamant, and David Landgren. |
b23a565d |
989 | |
4ccfbf60 |
990 | =head1 LICENCE |
b23a565d |
991 | |
992 | Same terms as Perl. |
993 | |
994 | =head1 REFERENCES |
995 | |
4ccfbf60 |
996 | [1] L<http://perl.plover.com/Rx/paper/> |
997 | |
998 | [2] L<http://www.unicode.org> |
b23a565d |
999 | |
1000 | =cut |