1 | =head1 NAME |
2 | |
3 | Text::Balanced - Extract delimited text sequences from strings. |
4 | |
5 | |
6 | =head1 SYNOPSIS |
7 | |
8 | use Text::Balanced qw ( |
9 | extract_delimited |
10 | extract_bracketed |
11 | extract_quotelike |
12 | extract_codeblock |
13 | extract_variable |
14 | extract_tagged |
15 | extract_multiple |
16 | |
17 | gen_delimited_pat |
18 | gen_extract_tagged |
19 | ); |
20 | |
21 | # Extract the initial substring of $text that is delimited by |
22 | # two (unescaped) instances of the first character in $delim. |
23 | |
24 | ($extracted, $remainder) = extract_delimited($text,$delim); |
25 | |
26 | |
27 | # Extract the initial substring of $text that is bracketed |
28 | # with a delimiter(s) specified by $delim (where the string |
29 | # in $delim contains one or more of '(){}[]<>'). |
30 | |
31 | ($extracted, $remainder) = extract_bracketed($text,$delim); |
32 | |
33 | |
34 | # Extract the initial substring of $text that is bounded by |
35 | # an HTML/XML tag. |
36 | |
37 | ($extracted, $remainder) = extract_tagged($text); |
38 | |
39 | |
40 | # Extract the initial substring of $text that is bounded by |
41 | # a BEGIN...END pair. Don't allow nested BEGIN tags. |
42 | |
43 | ($extracted, $remainder) = |
44 | extract_tagged($text,"BEGIN","END",undef,{reject=>["BEGIN"]}); |
45 | |
46 | |
47 | # Extract the initial substring of $text that represents a |
48 | # Perl "quote or quote-like operation" |
49 | |
50 | ($extracted, $remainder) = extract_quotelike($text); |
51 | |
52 | |
53 | # Extract the initial substring of $text that represents a block |
54 | # of Perl code, bracketed by any of character(s) specified by $delim |
55 | # (where the string $delim contains one or more of '(){}[]<>'). |
56 | |
57 | ($extracted, $remainder) = extract_codeblock($text,$delim); |
58 | |
59 | |
60 | # Extract the initial substrings of $text that would be extracted by |
61 | # one or more sequential applications of the specified functions |
62 | # or regular expressions |
63 | |
64 | @extracted = extract_multiple($text, |
65 | [ \&extract_bracketed, |
66 | \&extract_quotelike, |
67 | \&some_other_extractor_sub, |
68 | qr/[xyz]*/, |
69 | 'literal', |
70 | ]); |
71 | |
72 | # Create a string representing an optimized pattern (a la Friedl) |
73 | # that matches a substring delimited by any of the specified characters |
74 | # (in this case: any type of quote or a slash) |
75 | |
76 | $patstring = gen_delimited_pat(q{'"`/}); |
77 | |
78 | |
79 | # Generate a reference to an anonymous sub that is just like extract_tagged |
80 | # but pre-compiled and optimized for a specific pair of tags, and consequently |
81 | # much faster (i.e. 3 times faster). It uses qr// for better performance on |
82 | # repeated calls, so it only works under Perl 5.005 or later. |
83 | |
84 | $extract_head = gen_extract_tagged('<HEAD>','</HEAD>'); |
85 | |
86 | ($extracted, $remainder) = $extract_head->($text); |
87 | |
88 | |
89 | =head1 DESCRIPTION |
90 | |
91 | The various C<extract_...> subroutines may be used to extract a |
92 | delimited string (possibly after skipping a specified prefix string). |
93 | The search for the string always begins at the current C<pos> |
94 | location of the string's variable (or at index zero, if no C<pos> |
95 | position is defined). |
96 | |
97 | =head2 General behaviour in list contexts |
98 | |
99 | In a list context, all the subroutines return a list, the first three |
100 | elements of which are always: |
101 | |
102 | =over 4 |
103 | |
104 | =item [0] |
105 | |
106 | The extracted string, including the specified delimiters. |
107 | If the extraction fails an empty string is returned. |
108 | |
109 | =item [1] |
110 | |
111 | The remainder of the input string (i.e. the characters after the |
112 | extracted string). On failure, the entire string is returned. |
113 | |
114 | =item [2] |
115 | |
116 | The skipped prefix (i.e. the characters before the extracted string). |
117 | On failure, the empty string is returned. |
118 | |
119 | =back |
120 | |
121 | Note that in a list context, the contents of the original input text (the first |
122 | argument) are not modified in any way. |
123 | |
124 | However, if the input text was passed in a variable, that variable's |
125 | C<pos> value is updated to point at the first character after the |
126 | extracted text. That means that in a list context the various |
127 | subroutines can be used much like regular expressions. For example: |
128 | |
129 | while ( $next = (extract_quotelike($text))[0] ) |
130 | { |
131 | # process next quote-like (in $next) |
132 | } |
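
The list-context behaviour described above can be sketched as follows. This is a minimal illustration only: the sample string and variable names are made up for the example, and C<extract_delimited> stands in for any of the extractors.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical sample input: a quoted string followed by other text.
my $text = q{'abc' def};

# List context: (extracted, remainder, prefix); $text itself is not altered.
my ($extracted, $remainder, $prefix) = extract_delimited($text, q{'});

print "extracted: [$extracted]\n";   # the quoted string, delimiters included
print "remainder: [$remainder]\n";   # everything after the extracted string
print "prefix:    [$prefix]\n";      # skipped whitespace (none here)
print "text:      [$text]\n";        # still the complete original string
print "pos:       ", pos($text), "\n";   # offset just past the extracted text
```

Because C<pos> is advanced, a second list-context call on the same variable would continue from where the first one stopped.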
133 | |
134 | |
135 | =head2 General behaviour in scalar and void contexts |
136 | |
137 | In a scalar context, the extracted string is returned, having first been |
138 | removed from the input text. Thus, the following code also processes |
139 | each quote-like operation, but actually removes them from $text: |
140 | |
141 | while ( $next = extract_quotelike($text) ) |
142 | { |
143 | # process next quote-like (in $next) |
144 | } |
145 | |
146 | Note that if the input text is a read-only string (i.e. a literal), |
147 | no attempt is made to remove the extracted text. |
148 | |
149 | In a void context the behaviour of the extraction subroutines is |
150 | exactly the same as in a scalar context, except (of course) that the |
151 | extracted substring is not returned. |
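
By way of contrast, here is a small sketch of the scalar-context behaviour, again using C<extract_delimited> and an invented sample string. The loop guards against both C<undef> and an empty string, since the failure value is assumed here rather than relied upon.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical sample input: two quoted strings, then unquoted text.
my $text = q{'abc' "def" ghi};
my @seen;

# Scalar context: each successful call removes the match (and any
# skipped prefix) from $text and returns just the extracted string.
while (1) {
    my $next = extract_delimited($text, q{'"});
    last unless defined $next && length $next;
    push @seen, $next;
}

print "extracted: @seen\n";
print "left over: [$text]\n";   # only the unquoted tail remains
```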
152 | |
153 | =head2 A note about prefixes |
154 | |
155 | Prefix patterns are matched without any trailing modifiers (C</gimsox> etc.). |
156 | This can bite you if you're expecting a prefix specification like |
157 | '.*?(?=<H1>)' to skip everything up to the first <H1> tag. Such a prefix |
158 | pattern will only succeed if the <H1> tag is on the current line, since |
159 | C<.> normally doesn't match newlines. |
160 | |
161 | To overcome this limitation, you need to turn on /s matching within |
162 | the prefix pattern, using the C<(?s)> directive: '(?s).*?(?=<H1>)' |
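
The following sketch illustrates the difference, assuming a made-up three-line input and using C<extract_tagged> with explicit C<< <H1> >> tags. The failure value is checked loosely (undefined or empty) because it is not being asserted here.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Hypothetical input: the <H1> tag is two lines below the start.
my $text = "intro\nmore text\n<H1>Title</H1> tail";

# Without (?s), '.' stops at the first newline, so the prefix cannot match.
my @without = extract_tagged($text, '<H1>', '</H1>', '.*?(?=<H1>)');

# With (?s), the prefix may span lines and the tag is reached.
pos($text) = 0;   # start again from the beginning, to be explicit
my @with = extract_tagged($text, '<H1>', '</H1>', '(?s).*?(?=<H1>)');

my $failed = !defined $without[0] || $without[0] eq '';
print "without (?s): ", ($failed ? "no match" : $without[0]), "\n";
print "with (?s):    $with[0]\n";
```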
163 | |
164 | |
165 | =head2 C<extract_delimited> |
166 | |
167 | The C<extract_delimited> function formalizes the common idiom |
168 | of extracting a single-character-delimited substring from the start of |
169 | a string. For example, to extract a single-quote delimited string, the |
170 | following code is typically used: |
171 | |
172 | ($remainder = $text) =~ s/\A('(\\.|[^'])*')//s; |
173 | $extracted = $1; |
174 | |
175 | but with C<extract_delimited> it can be simplified to: |
176 | |
177 | ($extracted,$remainder) = extract_delimited($text, "'"); |
178 | |
179 | C<extract_delimited> takes up to four scalars (the input text, the |
180 | delimiters, a prefix pattern to be skipped, and any escape characters) |
181 | and extracts the initial substring of the text that |
182 | is appropriately delimited. If the delimiter string has multiple |
183 | characters, the first one encountered in the text is taken to delimit |
184 | the substring. |
185 | The third argument specifies a prefix pattern that is to be skipped |
186 | (but must be present!) before the substring is extracted. |
187 | The final argument specifies the escape character to be used for each |
188 | delimiter. |
189 | |
190 | All arguments are optional. If the escape characters are not specified, |
191 | every delimiter is escaped with a backslash (C<\>). |
192 | If the prefix is not specified, the |
193 | pattern C<'\s*'> - optional whitespace - is used. If the delimiter set |
194 | is also not specified, the set C</["'`]/> is used. If the text to be processed |
195 | is not specified either, C<$_> is used. |
196 | |
197 | In list context, C<extract_delimited> returns an array of three |
198 | elements, the extracted substring (I<including the surrounding |
199 | delimiters>), the remainder of the text, and the skipped prefix (if |
200 | any). If a suitable delimited substring is not found, the first |
201 | element of the array is the empty string, the second is the complete |
202 | original text, and the prefix returned in the third element is an |
203 | empty string. |
204 | |
205 | In a scalar context, just the extracted substring is returned. In |
206 | a void context, the extracted substring (and any prefix) are simply |
207 | removed from the beginning of the first argument. |
208 | |
209 | Examples: |
210 | |
211 | # Remove a single-quoted substring from the very beginning of $text: |
212 | |
213 | $substring = extract_delimited($text, "'", ''); |
214 | |
215 | # Remove a single-quoted Pascalish substring (i.e. one in which |
216 | # doubling the quote character escapes it) from the very |
217 | # beginning of $text: |
218 | |
219 | $substring = extract_delimited($text, "'", '', "'"); |
220 | |
221 | # Extract a single- or double- quoted substring from the |
222 | # beginning of $text, optionally after some whitespace |
223 | # (note the list context to protect $text from modification): |
224 | |
225 | ($substring) = extract_delimited $text, q{"'}; |
226 | |
227 | |
228 | # Delete the substring delimited by the first '/' in $text: |
229 | |
230 | $text = join '', (extract_delimited($text,'/','[^/]*'))[2,1]; |
231 | |
232 | Note that this last example is I<not> the same as deleting the first |
233 | quote-like pattern. For instance, if C<$text> contained the string: |
234 | |
235 | "if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }" |
236 | |
237 | then after the deletion it would contain: |
238 | |
239 | "if ('.$UNIXCMD/s) { $cmd = $1; }" |
240 | |
241 | not: |
242 | |
243 | "if ('./cmd' =~ ms) { $cmd = $1; }" |
244 | |
245 | |
246 | See L<"extract_quotelike"> for a (partial) solution to this problem. |
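
The escape argument and the deletion idiom discussed in this section can be tried out with a short script such as the one below. It is only a sketch: the sample strings and variable names are illustrative, and the deletion case reuses the C<$UNIXCMD> text from the example above.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Default escape: a backslash protects an embedded delimiter.
my $c_style = q{'it\'s' rest};
my ($c_quoted) = extract_delimited($c_style, q{'});

# Pascal-style: the quote character escapes itself when doubled.
my $pascal = q{'it''s' rest};
my ($p_quoted) = extract_delimited($pascal, q{'}, '', q{'});

# Delete the first '/'-delimited substring by rejoining the skipped
# prefix (element 2) and the remainder (element 1).
my $code    = q{if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }};
my $trimmed = join '', (extract_delimited($code, '/', '[^/]*'))[2,1];

print "$c_quoted\n$p_quoted\n$trimmed\n";
```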
247 | |
248 | |
249 | =head2 C<extract_bracketed> |
250 | |
251 | Like C<"extract_delimited">, the C<extract_bracketed> function takes |
252 | up to three optional scalar arguments: a string to extract from, a delimiter |
253 | specifier, and a prefix pattern. As before, a missing prefix defaults to |
254 | optional whitespace and a missing text defaults to C<$_>. However, a missing |
255 | delimiter specifier defaults to C<'{}()[]E<lt>E<gt>'> (see below). |
256 | |
257 | C<extract_bracketed> extracts a balanced-bracket-delimited |
258 | substring (using any one (or more) of the user-specified delimiter |
259 | brackets: '(..)', '{..}', '[..]', or '<..>'). Optionally it will also |
260 | respect quoted unbalanced brackets (see below). |
261 | |
262 | A "delimiter bracket" is a bracket in the list of delimiters passed as |
263 | C<extract_bracketed>'s second argument. Delimiter brackets are |
264 | specified by giving either the left or right (or both!) versions |
265 | of the required bracket(s). Note that the order in which |
266 | two or more delimiter brackets are specified is not significant. |
267 | |
268 | A "balanced-bracket-delimited substring" is a substring bounded by |
269 | matched brackets, such that any other (left or right) delimiter |
270 | bracket I<within> the substring is also matched by an opposite |
271 | (right or left) delimiter bracket I<at the same level of nesting>. Any |
272 | type of bracket not in the delimiter list is treated as an ordinary |
273 | character. |
274 | |
275 | In other words, each type of bracket specified as a delimiter must be |
276 | balanced and correctly nested within the substring, and any other kind of |
277 | ("non-delimiter") bracket in the substring is ignored. |
278 | |
279 | For example, given the string: |
280 | |
281 | $text = "{ an '[irregularly :-(] {} parenthesized >:-)' string }"; |
282 | |
283 | then a call to C<extract_bracketed> in a list context: |
284 | |
285 | @result = extract_bracketed( $text, '{}' ); |
286 | |
287 | would return: |
288 | |
289 | ( "{ an '[irregularly :-(] {} parenthesized >:-)' string }" , "" , "" ) |
290 | |
291 | since both sets of C<'{..}'> brackets are properly nested and evenly balanced. |
292 | (In a scalar context just the first element of the array would be returned. In |
293 | a void context, C<$text> would be replaced by an empty string.) |
294 | |
295 | Likewise the call in: |
296 | |
297 | @result = extract_bracketed( $text, '{[' ); |
298 | |
299 | would return the same result, since all sets of both types of specified |
300 | delimiter brackets are correctly nested and balanced. |
301 | |
302 | However, the call in: |
303 | |
304 | @result = extract_bracketed( $text, '{([<' ); |
305 | |
306 | would fail, returning: |
307 | |
308 | ( undef , "{ an '[irregularly :-(] {} parenthesized >:-)' string }" ); |
309 | |
310 | because the embedded pairs of C<'(..)'>s and C<'[..]'>s are "cross-nested" and |
311 | the embedded C<'E<gt>'> is unbalanced. (In a scalar context, this call would |
312 | return an empty string. In a void context, C<$text> would be unchanged.) |
313 | |
314 | Note that the embedded single-quotes in the string don't help in this |
315 | case, since they have not been specified as acceptable delimiters and are |
316 | therefore treated as non-delimiter characters (and ignored). |
317 | |
318 | However, if a particular species of quote character is included in the |
319 | delimiter specification, then that type of quote will be correctly handled. |
320 | For example, if C<$text> is: |
321 | |
322 | $text = '<A HREF=">>>>">link</A>'; |
323 | |
324 | then |
325 | |
326 | @result = extract_bracketed( $text, '<">' ); |
327 | |
328 | returns: |
329 | |
330 | ( '<A HREF=">>>>">', 'link</A>', "" ) |
331 | |
332 | as expected. Without the specification of C<"> as an embedded quoter: |
333 | |
334 | @result = extract_bracketed( $text, '<>' ); |
335 | |
336 | the result would be: |
337 | |
338 | ( '<A HREF=">', '>>>">link</A>', "" ) |
339 | |
340 | In addition to the quote delimiters C<'>, C<">, and C<`>, full Perl quote-like |
341 | quoting (i.e. q{string}, qq{string}, etc) can be specified by including the |
342 | letter 'q' as a delimiter. Hence: |
343 | |
344 | @result = extract_bracketed( $text, '<q>' ); |
345 | |
346 | would correctly match something like this: |
347 | |
348 | $text = '<leftop: conj /and/ conj>'; |
349 | |
350 | See also: C<"extract_quotelike"> and C<"extract_codeblock">. |
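
As a compact illustration of delimiter versus non-delimiter brackets, the sketch below runs C<extract_bracketed> twice over one invented expression in which the round and square brackets are deliberately cross-nested. The failure value is tested loosely (undefined or empty).

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical input with cross-nested '(' ... '[' ... ')' ... ']'.
my $expr = '(a + (b * [c)) - d] tail';

# Only parentheses are delimiter brackets here, so '[' and ']' are
# treated as ordinary characters and the parentheses balance.
my ($parens, $after) = extract_bracketed($expr, '()');

# With square brackets added to the delimiter set, the cross-nesting
# is detected and the extraction fails.
pos($expr) = 0;   # rewind, since the first call advanced pos
my ($strict) = extract_bracketed($expr, '()[]');
my $strict_failed = !defined $strict || $strict eq '';

print "parens only: $parens\n";
print "with []:     ", ($strict_failed ? "no match" : $strict), "\n";
```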
351 | |
352 | |
353 | =head2 C<extract_tagged> |
354 | |
355 | C<extract_tagged> extracts and segments text between (balanced) |
356 | specified tags. |
357 | |
358 | The subroutine takes up to five optional arguments: |
359 | |
360 | =over 4 |
361 | |
362 | =item 1. |
363 | |
364 | A string to be processed (C<$_> if the string is omitted or C<undef>) |
365 | |
366 | =item 2. |
367 | |
368 | A string specifying a pattern to be matched as the opening tag. |
369 | If the pattern string is omitted (or C<undef>) then a pattern |
370 | that matches any standard HTML/XML tag is used. |
371 | |
372 | =item 3. |
373 | |
374 | A string specifying a pattern to be matched at the closing tag. |
375 | If the pattern string is omitted (or C<undef>) then the closing |
376 | tag is constructed by inserting a C</> after any leading bracket |
377 | characters in the actual opening tag that was matched (I<not> the pattern |
378 | that matched the tag). For example, if the opening tag pattern |
379 | is specified as C<'{{\w+}}'> and actually matched the opening tag |
380 | C<"{{DATA}}">, then the constructed closing tag would be C<"{{/DATA}}">. |
381 | |
382 | =item 4. |
383 | |
384 | A string specifying a pattern to be matched as a prefix (which is to be |
385 | skipped). If omitted, optional whitespace is skipped. |
386 | |
387 | =item 5. |
388 | |
389 | A hash reference containing various parsing options (see below) |
390 | |
391 | =back |
392 | |
393 | The various options that can be specified are: |
394 | |
395 | =over 4 |
396 | |
397 | =item C<reject =E<gt> $listref> |
398 | |
399 | The list reference contains one or more strings specifying patterns |
400 | that must I<not> appear within the tagged text. |
401 | |
402 | For example, to extract |
403 | an HTML link (which should not contain nested links) use: |
404 | |
405 | extract_tagged($text, '<A>', '</A>', undef, {reject => ['<A>']} ); |
406 | |
407 | =item C<ignore =E<gt> $listref> |
408 | |
409 | The list reference contains one or more strings specifying patterns |
410 | that are I<not> to be treated as nested tags within the tagged text |
411 | (even if they would match the start tag pattern). |
412 | |
413 | For example, to extract an arbitrary XML tag, but ignore "empty" elements: |
414 | |
415 | extract_tagged($text, undef, undef, undef, {ignore => ['<[^>]*/>']} ); |
416 | |
417 | (also see L<"gen_delimited_pat"> below). |
418 | |
419 | |
420 | =item C<fail =E<gt> $str> |
421 | |
422 | The C<fail> option indicates the action to be taken if a matching end |
423 | tag is not encountered (i.e. before the end of the string or some |
424 | C<reject> pattern matches). By default, a failure to match a closing |
425 | tag causes C<extract_tagged> to immediately fail. |
426 | |
427 | However, if the string value associated with C<fail> is "MAX", then |
428 | C<extract_tagged> returns the complete text up to the point of failure. |
429 | If the string is "PARA", C<extract_tagged> returns only the first paragraph |
430 | after the tag (up to the first line that is either empty or contains |
431 | only whitespace characters). |
432 | If the string is "", the default behaviour (i.e. failure) is reinstated. |
433 | |
434 | For example, suppose the start tag "/para" introduces a paragraph, which then |
435 | continues until the next "/endpara" tag or until another "/para" tag is |
436 | encountered: |
437 | |
438 | $text = "/para line 1\n\nline 3\n/para line 4"; |
439 | |
440 | extract_tagged($text, '/para', '/endpara', undef, |
441 | {reject => '/para', fail => 'MAX'} ); |
442 | |
443 | # EXTRACTED: "/para line 1\n\nline 3\n" |
444 | |
445 | Suppose instead, that if no matching "/endpara" tag is found, the "/para" |
446 | tag refers only to the immediately following paragraph: |
447 | |
448 | $text = "/para line 1\n\nline 3\n/para line 4"; |
449 | |
450 | extract_tagged($text, '/para', '/endpara', undef, |
451 | {reject => '/para', fail => 'PARA'} ); |
452 | |
453 | # EXTRACTED: "/para line 1\n" |
454 | |
455 | Note that the specified C<fail> behaviour applies to nested tags as well. |
456 | |
457 | =back |
458 | |
459 | On success in a list context, an array of 6 elements is returned. The elements are: |
460 | |
461 | =over 4 |
462 | |
463 | =item [0] |
464 | |
465 | the extracted tagged substring (including the outermost tags), |
466 | |
467 | =item [1] |
468 | |
469 | the remainder of the input text, |
470 | |
471 | =item [2] |
472 | |
473 | the prefix substring (if any), |
474 | |
475 | =item [3] |
476 | |
477 | the opening tag |
478 | |
479 | =item [4] |
480 | |
481 | the text between the opening and closing tags |
482 | |
483 | =item [5] |
484 | |
485 | the closing tag (or "" if no closing tag was found) |
486 | |
487 | =back |
488 | |
489 | On failure, all of these values (except the remaining text) are C<undef>. |
490 | |
491 | In a scalar context, C<extract_tagged> returns just the complete |
492 | substring that matched a tagged text (including the start and end |
493 | tags). C<undef> is returned on failure. In addition, the original input |
494 | text has the returned substring (and any prefix) removed from it. |
495 | |
496 | In a void context, the input text just has the matched substring (and |
497 | any specified prefix) removed. |
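
The six-element return list and the C<reject> option can be seen in a small script like this one. It is a sketch with invented sample markup; the failure value of the rejected case is only checked for being undefined or empty.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Hypothetical input: a tag containing a different, nested tag.
my $html = '<b>bold <i>text</i></b> tail';
my ($whole, $rest, $prefix, $open, $inner, $close) = extract_tagged($html);

print "whole: $whole\n";
print "open:  $open\n";
print "inner: $inner\n";
print "close: $close\n";
print "rest:  [$rest]\n";

# A link that (wrongly) contains another link is refused when the
# opening tag is listed under 'reject'.
my $nested = '<A>outer <A>inner</A></A>';
my ($link) = extract_tagged($nested, '<A>', '</A>', undef,
                            { reject => ['<A>'] });
my $rejected = !defined $link || $link eq '';
print "nested link rejected\n" if $rejected;
```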
498 | |
499 | |
500 | =head2 C<gen_extract_tagged> |
501 | |
502 | (Note: This subroutine is only available under Perl 5.005 or later.) |
503 | |
504 | C<gen_extract_tagged> generates a new anonymous subroutine which |
505 | extracts text between (balanced) specified tags. In other words, |
506 | it generates a function that behaves identically to C<extract_tagged>. |
507 | |
508 | The difference between C<extract_tagged> and the anonymous |
509 | subroutines generated by |
510 | C<gen_extract_tagged>, is that those generated subroutines: |
511 | |
512 | =over 4 |
513 | |
514 | =item * |
515 | |
516 | do not have to reparse the tag specification or parsing options every time |
517 | they are called (whereas C<extract_tagged> has to effectively rebuild |
518 | its tag parser on every call); |
519 | |
520 | =item * |
521 | |
522 | make use of the new qr// construct to pre-compile the regexes they use |
523 | (whereas C<extract_tagged> uses standard string variable interpolation |
524 | to create tag-matching patterns). |
525 | |
526 | =back |
527 | |
528 | The subroutine takes up to four optional arguments (the same set as |
529 | C<extract_tagged> except for the string to be processed). It returns |
530 | a reference to a subroutine which in turn takes a single argument (the text to |
531 | be extracted from). |
532 | |
533 | In other words, the implementation of C<extract_tagged> is exactly |
534 | equivalent to: |
535 | |
536 | sub extract_tagged |
537 | { |
538 | my $text = shift; |
539 | my $extractor = gen_extract_tagged(@_); |
540 | return $extractor->($text); |
541 | } |
542 | |
543 | (although C<extract_tagged> is not currently implemented that way, in order |
544 | to preserve pre-5.005 compatibility). |
545 | |
546 | Using C<gen_extract_tagged> to create extraction functions for specific tags |
547 | is a good idea if those functions are going to be called more than once, since |
548 | their performance is typically twice as good as the more general-purpose |
549 | C<extract_tagged>. |
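
A typical use is to build the extractor once and then apply it to many inputs. The sketch below assumes two invented page strings; C<$extract_head> and the other variable names are illustrative only.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_extract_tagged);

# Build the extractor once for this specific pair of tags ...
my $extract_head = gen_extract_tagged('<HEAD>', '</HEAD>');

# ... then reuse it for every input (hypothetical sample pages).
my @pages = (
    '<HEAD><TITLE>One</TITLE></HEAD><BODY>first</BODY>',
    '<HEAD><TITLE>Two</TITLE></HEAD><BODY>second</BODY>',
);

my @heads;
for my $page (@pages) {
    # List context, so $page itself is left unmodified.
    my ($head, $rest) = $extract_head->($page);
    push @heads, $head;
}

print "$_\n" for @heads;
```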
550 | |
551 | |
552 | =head2 C<extract_quotelike> |
553 | |
554 | C<extract_quotelike> attempts to recognize, extract, and segment any |
555 | one of the various Perl quotes and quotelike operators (see |
556 | L<perlop(3)>). Nested backslashed delimiters, embedded balanced bracket |
557 | delimiters (for the quotelike operators), and trailing modifiers are |
558 | all caught. For example, in: |
559 | |
560 | extract_quotelike 'q # an octothorpe: \# (not the end of the q!) #' |
561 | |
562 | extract_quotelike ' "You said, \"Use sed\"." ' |
563 | |
564 | extract_quotelike ' s{([A-Z]{1,8}\.[A-Z]{3})} /\L$1\E/; ' |
565 | |
566 | extract_quotelike ' tr/\\\/\\\\/\\\//ds; ' |
567 | |
568 | the full Perl quotelike operations are all extracted correctly. |
569 | |
570 | Note too that, when using the /x modifier on a regex, any comment |
571 | containing the current pattern delimiter will cause the regex to be |
572 | immediately terminated. In other words: |
573 | |
574 | 'm / |
575 | (?i) # CASE INSENSITIVE |
576 | [a-z_] # LEADING ALPHABETIC/UNDERSCORE |
577 | [a-z0-9]* # FOLLOWED BY ANY NUMBER OF ALPHANUMERICS |
578 | /x' |
579 | |
580 | will be extracted as if it were: |
581 | |
582 | 'm / |
583 | (?i) # CASE INSENSITIVE |
584 | [a-z_] # LEADING ALPHABETIC/' |
585 | |
586 | This behaviour is identical to that of the actual compiler. |
587 | |
588 | C<extract_quotelike> takes two arguments: the text to be processed and |
589 | a prefix to be matched at the very beginning of the text. If no prefix |
590 | is specified, optional whitespace is the default. If no text is given, |
591 | C<$_> is used. |
592 | |
593 | In a list context, an array of 11 elements is returned. The elements are: |
594 | |
595 | =over 4 |
596 | |
597 | =item [0] |
598 | |
599 | the extracted quotelike substring (including trailing modifiers), |
600 | |
601 | =item [1] |
602 | |
603 | the remainder of the input text, |
604 | |
605 | =item [2] |
606 | |
607 | the prefix substring (if any), |
608 | |
609 | =item [3] |
610 | |
611 | the name of the quotelike operator (if any), |
612 | |
613 | =item [4] |
614 | |
615 | the left delimiter of the first block of the operation, |
616 | |
617 | =item [5] |
618 | |
619 | the text of the first block of the operation |
620 | (that is, the contents of |
621 | a quote, the regex of a match or substitution or the target list of a |
622 | translation), |
623 | |
624 | =item [6] |
625 | |
626 | the right delimiter of the first block of the operation, |
627 | |
628 | =item [7] |
629 | |
630 | the left delimiter of the second block of the operation |
631 | (that is, if it is a C<s>, C<tr>, or C<y>), |
632 | |
633 | =item [8] |
634 | |
635 | the text of the second block of the operation |
636 | (that is, the replacement of a substitution or the translation list |
637 | of a translation), |
638 | |
639 | =item [9] |
640 | |
641 | the right delimiter of the second block of the operation (if any), |
642 | |
643 | =item [10] |
644 | |
645 | the trailing modifiers on the operation (if any). |
646 | |
647 | =back |
648 | |
649 | For each of the fields marked "(if any)" the default value on success is |
650 | an empty string. |
651 | On failure, all of these values (except the remaining text) are C<undef>. |
652 | |
653 | |
654 | In a scalar context, C<extract_quotelike> returns just the complete substring |
655 | that matched a quotelike operation (or C<undef> on failure). In a scalar or |
656 | void context, the input text has the same substring (and any specified |
657 | prefix) removed. |
658 | |
659 | Examples: |
660 | |
661 | # Remove the first quotelike literal that appears in text |
662 | |
663 | $quotelike = extract_quotelike($text,'.*?'); |
664 | |
665 | # Replace one or more leading whitespace-separated quotelike |
666 | # literals in $_ with "<QLL>" |
667 | |
668 | do { $_ = join '<QLL>', (extract_quotelike)[2,1] } until $@; |
669 | |
670 | |
671 | # Isolate the search pattern in a quotelike operation from $text |
672 | |
673 | ($op,$pat) = (extract_quotelike $text)[3,5]; |
674 | if ($op =~ /[ms]/) |
675 | { |
676 | print "search pattern: $pat\n"; |
677 | } |
678 | else |
679 | { |
680 | print "$op is not a pattern matching operation\n"; |
681 | } |
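
To see how the list-context fields line up with the numbered elements above, a substitution can be taken apart as in the following sketch. The input string and variable names are invented for the example, and only a few of the eleven fields are picked out.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

# Hypothetical input: a substitution followed by other code.
my $code  = 's/foo/bar/gi; rest';
my @parts = extract_quotelike($code);

# Pick out a few of the eleven fields by index:
# [0] whole operation, [1] remainder, [3] operator name,
# [5] first block, [8] second block, [10] trailing modifiers.
my ($whole, $remainder, $op, $pattern, $replacement, $modifiers)
    = @parts[0, 1, 3, 5, 8, 10];

print "whole:       $whole\n";
print "operator:    $op\n";
print "pattern:     $pattern\n";
print "replacement: $replacement\n";
print "modifiers:   $modifiers\n";
print "remainder:   [$remainder]\n";
```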
682 | |
683 | |
684 | =head2 C<extract_quotelike> and "here documents" |
685 | |
686 | C<extract_quotelike> can successfully extract "here documents" from an input |
687 | string, but with an important caveat in list contexts. |
688 | |
689 | Unlike other types of quote-like literals, a here document is rarely |
690 | a contiguous substring. For example, a typical piece of code using |
691 | a here document might look like this: |
692 | |
693 | <<'EOMSG' || die; |
694 | This is the message. |
695 | EOMSG |
696 | exit; |
697 | |
698 | Given this as an input string in a scalar context, C<extract_quotelike> |
699 | would correctly return the string "<<'EOMSG'\nThis is the message.\nEOMSG", |
700 | leaving the string " || die;\nexit;" in the original variable. In other words, |
701 | the two separate pieces of the here document are successfully extracted and |
702 | concatenated. |
703 | |
704 | In a list context, C<extract_quotelike> would return the list |
705 | |
706 | =over 4 |
707 | |
708 | =item [0] |
709 | |
710 | "<<'EOMSG'\nThis is the message.\nEOMSG\n" (i.e. the full extracted here document, |
711 | including fore and aft delimiters), |
712 | |
713 | =item [1] |
714 | |
715 | " || die;\nexit;" (i.e. the remainder of the input text, concatenated), |
716 | |
717 | =item [2] |
718 | |
719 | "" (i.e. the prefix substring -- trivial in this case), |
720 | |
721 | =item [3] |
722 | |
723 | "<<" (i.e. the "name" of the quotelike operator) |
724 | |
725 | =item [4] |
726 | |
727 | "'EOMSG'" (i.e. the left delimiter of the here document, including any quotes), |
728 | |
729 | =item [5] |
730 | |
731 | "This is the message.\n" (i.e. the text of the here document), |
732 | |
733 | =item [6] |
734 | |
735 | "EOMSG" (i.e. the right delimiter of the here document), |
736 | |
737 | =item [7..10] |
738 | |
739 | "" (a here document has no second left delimiter, second text, second right |
740 | delimiter, or trailing modifiers). |
741 | |
742 | =back |
743 | |
744 | However, the matching position of the input variable would be set to |
745 | "exit;" (i.e. I<after> the closing delimiter of the here document), |
746 | which would cause the earlier " || die;\nexit;" to be skipped in any |
747 | sequence of code fragment extractions. |
748 | |
749 | To avoid this problem, when it encounters a here document whilst |
750 | extracting from a modifiable string, C<extract_quotelike> silently |
751 | rearranges the string to an equivalent piece of Perl: |
752 | |
753 | <<'EOMSG' |
754 | This is the message. |
755 | EOMSG |
756 | || die; |
757 | exit; |
758 | |
759 | in which the here document I<is> contiguous. It still leaves the |
760 | matching position after the here document, but now the rest of the line |
761 | on which the here document starts is not skipped. |
762 | |
763 | To prevent C<extract_quotelike> from mucking about with the input in this way |
764 | (this is the only case where a list-context C<extract_quotelike> does so), |
765 | you can pass the input variable as an interpolated literal: |
766 | |
767 | $quotelike = extract_quotelike("$var"); |
768 | |
769 | |
770 | =head2 C<extract_codeblock> |
771 | |
772 | C<extract_codeblock> attempts to recognize and extract a balanced |
773 | bracket delimited substring that may contain unbalanced brackets |
774 | inside Perl quotes or quotelike operations. That is, C<extract_codeblock> |
775 | is like a combination of C<"extract_bracketed"> and |
776 | C<"extract_quotelike">. |
777 | |
778 | C<extract_codeblock> takes the same initial three parameters as C<extract_bracketed>: |
779 | a text to process, a set of delimiter brackets to look for, and a prefix to |
780 | match first. It also takes an optional fourth parameter, which allows the |
781 | outermost delimiter brackets to be specified separately (see below). |
782 | |
783 | Omitting the first argument (input text) means process C<$_> instead. |
784 | Omitting the second argument (delimiter brackets) indicates that only C<'{'> is to be used. |
785 | Omitting the third argument (prefix argument) implies optional whitespace at the start. |
786 | Omitting the fourth argument (outermost delimiter brackets) indicates that the |
787 | value of the second argument is to be used for the outermost delimiters. |
788 | |
789 | Once the prefix and the outermost opening delimiter bracket have been |
790 | recognized, code blocks are extracted by stepping through the input text and |
791 | trying the following alternatives in sequence: |
792 | |
793 | =over 4 |
794 | |
795 | =item 1. |
796 | |
797 | Try to match a closing delimiter bracket. If the bracket was the same |
798 | species as the last opening bracket, return the substring to that |
799 | point. If the bracket was mismatched, return an error. |
800 | |
801 | =item 2. |
802 | |
803 | Try to match a quote or quotelike operator. If found, call |
804 | C<extract_quotelike> to eat it. If C<extract_quotelike> fails, return |
805 | the error it returned. Otherwise go back to step 1. |
806 | |
807 | =item 3. |
808 | |
809 | Try to match an opening delimiter bracket. If found, call |
810 | C<extract_codeblock> recursively to eat the embedded block. If the |
811 | recursive call fails, return an error. Otherwise, go back to step 1. |
812 | |
813 | =item 4. |
814 | |
815 | Unconditionally match a bareword or any other single character, and |
816 | then go back to step 1. |
817 | |
818 | =back |
819 | |
820 | |
821 | Examples: |
822 | |
823 | # Find a while loop in the text |
824 | |
825 | if ($text =~ s/.*?while\s*\{/{/) |
826 | { |
827 | $loop = "while " . extract_codeblock($text); |
828 | } |
829 | |
830 | # Remove the first round-bracketed list (which may include |
831 | # round- or curly-bracketed code blocks or quotelike operators) |
832 | |
833 | extract_codeblock $text, "(){}", '[^(]*'; |
834 | |
835 | |
836 | The ability to specify a different outermost delimiter bracket is useful |
837 | in some circumstances. For example, in the Parse::RecDescent module, |
838 | parser actions which are to be performed only on a successful parse |
839 | are specified using a C<E<lt>defer:...E<gt>> directive. For example: |
840 | |
841 | sentence: subject verb object |
842 | <defer: {$::theVerb = $item{verb}} > |
843 | |
844 | Parse::RecDescent uses C<extract_codeblock($text, '{}E<lt>E<gt>')> to extract the code |
845 | within the C<E<lt>defer:...E<gt>> directive, but there's a problem. |
846 | |
847 | A deferred action like this: |
848 | |
849 | <defer: {if ($count>10) {$count--}} > |
850 | |
851 | will be incorrectly parsed as: |
852 | |
853 | <defer: {if ($count> |
854 | |
855 | because the "greater than" operator is interpreted as a closing delimiter. |
856 | |
857 | But, by extracting the directive using |
858 | S<C<extract_codeblock($text, '{}', undef, 'E<lt>E<gt>')>> |
859 | the '>' character is only treated as a delimiter at the outermost |
860 | level of the code block, so the directive is parsed correctly. |
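
As a rough, self-contained illustration of that four-argument call, the sketch below runs it on the deferred action shown above. The sample text and the variable names C<$directive> and C<$remainder> are made up for the example, and the behaviour is what is expected from a current release of the module rather than a guarantee for every version.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# Sample input (hypothetical): a directive followed by more grammar text.
my $text = '<defer: {if ($count>10) {$count--}} > rest of rule';

# '{}' delimits nested blocks; '<>' is accepted only as the outermost pair,
# so the '>' in "$count>10" is treated as an ordinary character.
my ($directive, $remainder) = extract_codeblock($text, '{}', undef, '<>');

print defined $directive
    ? "directive: $directive\n"
    : "extraction failed: $@\n";
```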
861 | |
862 | =head2 C<extract_multiple> |
863 | |
864 | The C<extract_multiple> subroutine takes a string to be processed and a |
865 | list of extractors (subroutines or regular expressions) to apply to that string. |
866 | |
867 | In a list context C<extract_multiple> returns a list of substrings |
868 | of the original string, as extracted by the specified extractors. |
869 | In a scalar context, C<extract_multiple> returns the first |
870 | substring successfully extracted from the original string. In both |
871 | scalar and void contexts the original string has the first successfully |
872 | extracted substring removed from it. In all contexts |
873 | C<extract_multiple> starts at the current C<pos> of the string, and |
874 | sets that C<pos> appropriately after it matches. |
875 | |
876 | Hence, the aim of a call to C<extract_multiple> in a list context |
877 | is to split the processed string into as many non-overlapping fields as |
878 | possible, by repeatedly applying each of the specified extractors |
879 | to the remainder of the string. Thus C<extract_multiple> is |
880 | a generalized form of Perl's C<split> subroutine. |
881 | |
882 | The subroutine takes up to four optional arguments: |
883 | |
884 | =over 4 |
885 | |
886 | =item 1. |
887 | |
888 | A string to be processed (C<$_> if the string is omitted or C<undef>) |
889 | |
890 | =item 2. |
891 | |
892 | A reference to a list of subroutine references and/or qr// objects and/or |
893 | literal strings and/or hash references, specifying the extractors |
894 | to be used to split the string. If this argument is omitted (or |
895 | C<undef>) the list: |
896 | |
897 | [ |
898 | sub { extract_variable($_[0], '') }, |
899 | sub { extract_quotelike($_[0],'') }, |
900 | sub { extract_codeblock($_[0],'{}','') }, |
901 | ] |
902 | |
903 | is used. |
904 | |
905 | |
906 | =item 3. |
907 | |
908 | A number specifying the maximum number of fields to return. If this |
909 | argument is omitted (or C<undef>), split continues as long as possible. |
910 | |
911 | If the third argument is I<N>, then extraction continues until I<N> fields |
912 | have been successfully extracted, or until the string has been completely |
913 | processed. |
914 | |
915 | Note that in scalar and void contexts the value of this argument is |
916 | automatically reset to 1 (under C<-w>, a warning is issued if the argument |
917 | has to be reset). |
918 | |
919 | =item 4. |
920 | |
921 | A value indicating whether unmatched substrings (see below) within the |
922 | text should be skipped or returned as fields. If the value is true, |
923 | such substrings are skipped. Otherwise, they are returned. |
924 | |
925 | =back |
926 | |
927 | The extraction process works by applying each extractor in |
928 | sequence to the text string. If the extractor is a subroutine it |
929 | is called in a list |
930 | context and is expected to return a list of a single element, namely |
931 | the extracted text. |
932 | Note that the value returned by an extractor subroutine need not bear any |
933 | relationship to the corresponding substring of the original text (see |
934 | examples below). |
935 | |
936 | If the extractor is a precompiled regular expression or a string, |
937 | it is matched against the text in a scalar context with a leading |
938 | '\G' and the gc modifiers enabled. The extracted value is either |
939 | $1 if that variable is defined after the match, or else the |
940 | complete match (i.e. $&). |
941 | |
942 | If the extractor is a hash reference, it must contain exactly one element. |
943 | The value of that element is one of the |
944 | above extractor types (subroutine reference, regular expression, or string). |
945 | The key of that element is the name of a class into which the successful |
946 | return value of the extractor will be blessed. |
947 | |
948 | If an extractor returns a defined value, that value is immediately |
949 | treated as the next extracted field and pushed onto the list of fields. |
950 | If the extractor was specified in a hash reference, the field is also |
951 | blessed into the appropriate class. |
952 | |
953 | If the extractor fails to match (in the case of a regex extractor), or returns an empty list or an undefined value (in the case of a subroutine extractor), it is |
954 | assumed to have failed to extract. |
955 | If none of the extractor subroutines succeeds, then one |
956 | character is extracted from the start of the text and the extraction |
957 | subroutines reapplied. Characters which are thus removed are accumulated and |
958 | eventually become the next field (unless the fourth argument is true, in which |
959 | case they are discarded). |
960 | |
961 | For example, the following extracts substrings that are valid Perl variables: |
962 | |
963 | @fields = extract_multiple($text, |
964 | [ sub { extract_variable($_[0]) } ], |
965 | undef, 1); |
966 | |
967 | This example separates a text into fields which are quote delimited, |
968 | curly bracketed, and anything else. The delimited and bracketed |
969 | parts are also blessed to identify them (the "anything else" is unblessed): |
970 | |
971 | @fields = extract_multiple($text, |
972 | [ |
973 | { Delim => sub { extract_delimited($_[0],q{'"}) } }, |
974 | { Brack => sub { extract_bracketed($_[0],'{}') } }, |
975 | ]); |
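
A short sketch of how the blessed fields might be told apart afterwards. It assumes, as current releases of the module do, that a blessed field is a reference to a scalar holding the extracted text; the sample string is invented for the example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited extract_bracketed);

my $text = q{print "hello" {block} rest};

my @fields = extract_multiple($text, [
    { Delim => sub { extract_delimited($_[0], q{'"}) } },
    { Brack => sub { extract_bracketed($_[0], '{}') } },
]);

for my $field (@fields) {
    if (ref $field) {
        # Blessed field: ref() gives the class, $$field the extracted text.
        printf "%-5s %s\n", ref($field), $$field;
    }
    else {
        # Unblessed field: text that no extractor claimed.
        print "plain [$field]\n";
    }
}
```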
976 | |
977 | This call extracts the next single substring that is a valid Perl quotelike |
978 | operator (and removes it from $text): |
979 | |
980 | $quotelike = extract_multiple($text, |
981 | [ |
982 | sub { extract_quotelike($_[0]) }, |
983 | ], undef, 1); |
984 | |
985 | Finally, here is yet another way to do comma-separated value parsing: |
986 | |
987 | @fields = extract_multiple($csv_text, |
988 | [ |
989 | sub { extract_delimited($_[0],q{'"}) }, |
990 | qr/([^,]+)/, |
991 | ], |
992 | undef,1); |
993 | |
994 | The list in the second argument means: |
995 | I<"Try and extract a ' or " delimited string, otherwise extract anything up to a comma...">. |
996 | The undef third argument means: |
997 | I<"...as many times as possible...">, |
998 | and the true value in the fourth argument means |
999 | I<"...discarding anything else that appears (i.e. the commas)">. |
1000 | |
1001 | If you wanted the commas preserved as separate fields (i.e. like split |
1002 | does if your split pattern has capturing parentheses), you would |
1003 | just make the last parameter undefined (or remove it). |
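
The two variants can be compared side by side in a small runnable sketch. The helper C<split_csv> and the sample line are hypothetical, and the regex extractor captures only the text up to the next comma. Each call works on a fresh copy of the line because C<extract_multiple> starts from, and then advances, the string's C<pos>.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited);

my @extractors = (
    sub { extract_delimited($_[0], q{'"}) },   # quoted value, quotes kept
    qr/([^,]+)/,                               # otherwise, text up to a comma
);

# Hypothetical helper: $line is a fresh copy on every call, so its pos
# always starts at the beginning of the string.
sub split_csv {
    my ($line, $skip_unmatched) = @_;
    return extract_multiple($line, \@extractors, undef, $skip_unmatched);
}

my $csv_text = q{alpha,'beta,gamma',delta};

my @values      = split_csv($csv_text, 1);   # commas discarded
my @with_commas = split_csv($csv_text);      # commas kept as fields

print join('|', @values), "\n";
print join('|', @with_commas), "\n";
```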
1004 | |
1005 | |
1006 | =head2 C<gen_delimited_pat> |
1007 | |
1008 | The C<gen_delimited_pat> subroutine takes a single (string) argument and |
1009 | builds a Friedl-style optimized regex that matches a string delimited |
1010 | by any one of the characters in the single argument. For example: |
1011 | |
1012 | gen_delimited_pat(q{'"}) |
1013 | |
1014 | returns the regex: |
1015 | |
1016 | (?:\"(?:\\\"|(?!\").)*\"|\'(?:\\\'|(?!\').)*\') |
1017 | |
1018 | Note that the specified delimiters are automatically quotemeta'd. |
1019 | |
1020 | A typical use of C<gen_delimited_pat> would be to build special purpose tags |
1021 | for C<extract_tagged>. For example, to properly ignore "empty" XML elements |
1022 | (which might contain quoted strings): |
1023 | |
1024 | my $empty_tag = '<(' . gen_delimited_pat(q{'"}) . '|.)+/>'; |
1025 | |
1026 | extract_tagged($text, undef, undef, undef, {ignore => [$empty_tag]} ); |
1027 | |
1028 | |
1029 | C<gen_delimited_pat> may also be called with an optional second argument, |
1030 | which specifies the "escape" character(s) to be used for each delimiter. |
1031 | For example, to match a Pascal-style string (where ' is the delimiter |
1032 | and '' is a literal ' within the string): |
1033 | |
1034 | gen_delimited_pat(q{'},q{'}); |
1035 | |
1036 | Different escape characters can be specified for different delimiters. |
1037 | For example, to specify that '/' is the escape for single quotes |
1038 | and '%' is the escape for double quotes: |
1039 | |
1040 | gen_delimited_pat(q{'"},q{/%}); |
1041 | |
1042 | If more delimiters than escape chars are specified, the last escape char |
1043 | is used for the remaining delimiters. |
1044 | If no escape char is specified for a given specified delimiter, '\' is used. |
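
As a sketch of how the escape argument plays out in practice, the generated patterns can be interpolated into ordinary matches. The sample strings and variable names below are invented for the example; the exact text of the generated regex may differ between releases, so only the matching behaviour is relied on.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# Pascal-style: ' delimits, and '' stands for a literal ' inside the string.
my $pascal = gen_delimited_pat(q{'}, q{'});
my ($pascal_str) = q{name := 'it''s here'; next} =~ /($pascal)/;

# Per-delimiter escapes: / escapes inside '...', % escapes inside "...".
my $mixed = gen_delimited_pat(q{'"}, q{/%});
my ($single) = q{x 'a/'b' y} =~ /($mixed)/;
my ($double) = q{x "a%"b" y} =~ /($mixed)/;

print "$pascal_str\n$single\n$double\n";
```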
1045 | |
1046 | Note that |
1047 | C<gen_delimited_pat> was previously called |
1048 | C<delimited_pat>. That name may still be used, but is now deprecated. |
1049 | |
1050 | |
1051 | =head1 DIAGNOSTICS |
1052 | |
1053 | In a list context, all the functions return C<(undef,$original_text)> |
1054 | on failure. In a scalar context, failure is indicated by returning C<undef> |
1055 | (in this case the input text is not modified in any way). |
1056 | |
1057 | In addition, on failure in I<any> context, the C<$@> variable is set. |
1058 | Accessing C<$@-E<gt>{error}> returns one of the error diagnostics listed |
1059 | below. |
1060 | Accessing C<$@-E<gt>{pos}> returns the offset into the original string at |
1061 | which the error was detected (although not necessarily where it occurred!). |
1062 | Printing C<$@> directly produces the error message, with the offset appended. |
1063 | On success, the C<$@> variable is guaranteed to be C<undef>. |
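
A minimal sketch of checking for failure in list context and reading the diagnostic. The input strings and variable names are made up for the example, and the exact wording of the message should be treated as version-dependent.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $bad = 'no brackets here';
my ($extracted, $remainder) = extract_bracketed($bad, '()');

my ($error, $offset);
unless (defined $extracted) {
    # Copy the parts of $@ straight away, before another call resets it.
    ($error, $offset) = ($@->{error}, $@->{pos});
    print "failed at offset $offset: $error\n";
}

my $good = '(a (b) c) rest';
my ($ok) = extract_bracketed($good, '()');
my $error_after_success = $@;    # expected to be undef after a success
print "extracted: $ok\n";
```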
1064 | |
1065 | The available diagnostics are: |
1066 | |
1067 | =over 4 |
1068 | |
1069 | =item C<Did not find a suitable bracket: "%s"> |
1070 | |
1071 | The delimiter provided to C<extract_bracketed> was not one of |
1072 | C<'()[]E<lt>E<gt>{}'>. |
1073 | |
1074 | =item C<Did not find prefix: /%s/> |
1075 | |
1076 | A non-optional prefix was specified but wasn't found at the start of the text. |
1077 | |
1078 | =item C<Did not find opening bracket after prefix: "%s"> |
1079 | |
1080 | C<extract_bracketed> or C<extract_codeblock> was expecting a |
1081 | particular kind of bracket at the start of the text, and didn't find it. |
1082 | |
1083 | =item C<No quotelike operator found after prefix: "%s"> |
1084 | |
1085 | C<extract_quotelike> didn't find one of the quotelike operators C<q>, |
1086 | C<qq>, C<qw>, C<qx>, C<s>, C<tr> or C<y> at the start of the substring |
1087 | it was extracting. |
1088 | |
1089 | =item C<Unmatched closing bracket: "%c"> |
1090 | |
1091 | C<extract_bracketed>, C<extract_quotelike> or C<extract_codeblock> encountered |
1092 | a closing bracket where none was expected. |
1093 | |
1094 | =item C<Unmatched opening bracket(s): "%s"> |
1095 | |
1096 | C<extract_bracketed>, C<extract_quotelike> or C<extract_codeblock> ran |
1097 | out of characters in the text before closing one or more levels of nested |
1098 | brackets. |
1099 | |
1100 | =item C<Unmatched embedded quote (%s)> |
1101 | |
1102 | C<extract_bracketed> attempted to match an embedded quoted substring, but |
1103 | failed to find a closing quote to match it. |
1104 | |
1105 | =item C<Did not find closing delimiter to match '%s'> |
1106 | |
1107 | C<extract_quotelike> was unable to find a closing delimiter to match the |
1108 | one that opened the quote-like operation. |
1109 | |
1110 | =item C<Mismatched closing bracket: expected "%c" but found "%s"> |
1111 | |
1112 | C<extract_bracketed>, C<extract_quotelike> or C<extract_codeblock> found |
1113 | a valid bracket delimiter, but it was the wrong species. This usually |
1114 | indicates a nesting error, but may indicate incorrect quoting or escaping. |
1115 | |
1116 | =item C<No block delimiter found after quotelike "%s"> |
1117 | |
1118 | C<extract_quotelike> or C<extract_codeblock> found one of the |
1119 | quotelike operators C<q>, C<qq>, C<qw>, C<qx>, C<s>, C<tr> or C<y> |
1120 | without a suitable block after it. |
1121 | |
1122 | =item C<Did not find leading dereferencer> |
1123 | |
1124 | C<extract_variable> was expecting one of '$', '@', or '%' at the start of |
1125 | a variable, but didn't find any of them. |
1126 | |
1127 | =item C<Bad identifier after dereferencer> |
1128 | |
1129 | C<extract_variable> found a '$', '@', or '%' indicating a variable, but that |
1130 | character was not followed by a legal Perl identifier. |
1131 | |
1132 | =item C<Did not find expected opening bracket at %s> |
1133 | |
1134 | C<extract_codeblock> failed to find any of the outermost opening brackets |
1135 | that were specified. |
1136 | |
1137 | =item C<Improperly nested codeblock at %s> |
1138 | |
1139 | A nested code block was found that started with a delimiter that was specified |
1140 | for use only as an outermost bracket. |
1141 | |
1142 | =item C<Missing second block for quotelike "%s"> |
1143 | |
1144 | C<extract_codeblock> or C<extract_quotelike> found one of the |
1145 | quotelike operators C<s>, C<tr> or C<y> followed by only one block. |
1146 | |
1147 | =item C<No match found for opening bracket> |
1148 | |
1149 | C<extract_codeblock> failed to find a closing bracket to match the outermost |
1150 | opening bracket. |
1151 | |
1152 | =item C<Did not find opening tag: /%s/> |
1153 | |
1154 | C<extract_tagged> did not find a suitable opening tag (after any specified |
1155 | prefix was removed). |
1156 | |
1157 | =item C<Unable to construct closing tag to match: /%s/> |
1158 | |
1159 | C<extract_tagged> matched the specified opening tag and tried to |
1160 | modify the matched text to produce a matching closing tag (because |
1161 | none was specified). It failed to generate the closing tag, almost |
1162 | certainly because the opening tag did not start with a |
1163 | bracket of some kind. |
1164 | |
1165 | =item C<Found invalid nested tag: %s> |
1166 | |
1167 | C<extract_tagged> found a nested tag that appeared in the "reject" list |
1168 | (and the failure mode was not "MAX" or "PARA"). |
1169 | |
1170 | =item C<Found unbalanced nested tag: %s> |
1171 | |
1172 | C<extract_tagged> found a nested opening tag that was not matched by a |
1173 | corresponding nested closing tag (and the failure mode was not "MAX" or "PARA"). |
1174 | |
1175 | =item C<Did not find closing tag> |
1176 | |
1177 | C<extract_tagged> reached the end of the text without finding a closing tag |
1178 | to match the original opening tag (and the failure mode was not |
1179 | "MAX" or "PARA"). |
1180 | |
1181 | |
1182 | |
1183 | |
1184 | =back |
1185 | |
1186 | |
1187 | =head1 AUTHOR |
1188 | |
1189 | Damian Conway (damian@conway.org) |
1190 | |
1191 | |
1192 | =head1 BUGS AND IRRITATIONS |
1193 | |
1194 | There are undoubtedly serious bugs lurking somewhere in this code, if |
1195 | only because parts of it give the impression of understanding a great deal |
1196 | more about Perl than they really do. |
1197 | |
1198 | Bug reports and other feedback are most welcome. |
1199 | |
1200 | |
1201 | =head1 COPYRIGHT |
1202 | |
1203 | Copyright (c) 1997-2000, Damian Conway. All Rights Reserved. |
1204 | This module is free software; you can redistribute it and/or |
1205 | modify it under the same terms as Perl itself. |