Commit | Line | Data |
85831461 |
1 | NAME |
55a1c97c |
2 | Text::Balanced - Extract delimited text sequences from strings. |
3 | |
85831461 |
4 | SYNOPSIS |
5 | use Text::Balanced qw ( |
6 | extract_delimited |
7 | extract_bracketed |
8 | extract_quotelike |
9 | extract_codeblock |
10 | extract_variable |
11 | extract_tagged |
12 | extract_multiple |
13 | gen_delimited_pat |
14 | gen_extract_tagged |
15 | ); |
16 | |
17 | # Extract the initial substring of $text that is delimited by |
18 | # two (unescaped) instances of the first character in $delim. |
19 | |
20 | ($extracted, $remainder) = extract_delimited($text,$delim); |
21 | |
22 | |
23 | # Extract the initial substring of $text that is bracketed |
24 | # with a delimiter(s) specified by $delim (where the string |
25 | # in $delim contains one or more of '(){}[]<>'). |
26 | |
27 | ($extracted, $remainder) = extract_bracketed($text,$delim); |
28 | |
29 | |
30 | # Extract the initial substring of $text that is bounded by |
31 | # an XML tag. |
32 | |
33 | ($extracted, $remainder) = extract_tagged($text); |
34 | |
35 | |
36 | # Extract the initial substring of $text that is bounded by |
37 | # a BEGIN...END pair. Don't allow nested BEGIN tags |
38 | |
39 | ($extracted, $remainder) = |
40 | extract_tagged($text,"BEGIN","END",undef,{reject=>["BEGIN"]}); |
41 | |
42 | |
43 | # Extract the initial substring of $text that represents a |
44 | # Perl "quote or quote-like operation" |
45 | |
46 | ($extracted, $remainder) = extract_quotelike($text); |
47 | |
48 | |
49 | # Extract the initial substring of $text that represents a block |
50 | # of Perl code, bracketed by any of character(s) specified by $delim |
51 | # (where the string $delim contains one or more of '(){}[]<>'). |
52 | |
53 | ($extracted, $remainder) = extract_codeblock($text,$delim); |
54 | |
55 | |
56 | # Extract the initial substrings of $text that would be extracted by |
57 | # one or more sequential applications of the specified functions |
58 | # or regular expressions |
59 | |
60 | @extracted = extract_multiple($text, |
61 | [ \&extract_bracketed, |
62 | \&extract_quotelike, |
63 | \&some_other_extractor_sub, |
64 | qr/[xyz]*/, |
65 | 'literal', |
66 | ]); |
67 | |
68 | # Create a string representing an optimized pattern (a la Friedl) |
69 | # that matches a substring delimited by any of the specified characters |
70 | # (in this case: any type of quote or a slash) |
71 | |
72 | $patstring = gen_delimited_pat(q{'"`/}); |
73 | |
74 | # Generate a reference to an anonymous sub that is just like |
75 | # extract_tagged but pre-compiled and optimized for a specific pair of |
76 | # tags, and consequently much faster (i.e. 3 times faster). It uses qr// |
77 | # for better performance on repeated calls, so it only works under Perl |
78 | # 5.005 or later. |
79 | |
80 | $extract_head = gen_extract_tagged('<HEAD>','</HEAD>'); |
81 | |
82 | ($extracted, $remainder) = $extract_head->($text); |
83 | |
84 | DESCRIPTION |
85 | The various "extract_..." subroutines may be used to extract a delimited |
86 | substring, possibly after skipping a specified prefix string. By |
87 | default, that prefix is optional whitespace ("/\s*/"), but you can |
88 | change it to whatever you wish (see below). |
89 | |
90 | The substring to be extracted must appear at the current "pos" location |
91 | of the string's variable (or at index zero, if no "pos" position is |
92 | defined). In other words, the "extract_..." subroutines *don't* extract |
93 | the first occurrence of a substring anywhere in a string (like an |
94 | unanchored regex would). Rather, they extract an occurrence of the |
95 | substring appearing immediately at the current matching position in the |
96 | string (like a "\G"-anchored regex would). |
97 | |
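The anchoring rule described above can be seen in a minimal sketch like the one below. The input string and variable names are made up for illustration; only "extract_delimited" and Perl's built-in "pos" are real.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical input: the quoted part does not start at index zero.
my $text = 'say "hello" now';

# No "pos" is set, so matching is anchored at index 0 and fails,
# even though a quoted string appears later in $text.
my ($miss) = extract_delimited($text, '"');
print "nothing at index 0\n" unless defined $miss && length $miss;

# Move the matching position past 'say'; the default prefix (optional
# whitespace) then skips the space before the opening quote.
pos($text) = 3;
my ($hit, $rest) = extract_delimited($text, '"');
print "extracted: $hit\n";
print "remainder: $rest\n";
```

Setting "pos" by hand is only one way to position the match; a preceding "m/.../gc" match on the same variable would serve equally well.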
98 | General behaviour in list contexts |
99 | In a list context, all the subroutines return a list, the first three |
100 | elements of which are always: |
101 | |
102 | [0] The extracted string, including the specified delimiters. If the |
103 | extraction fails "undef" is returned. |
104 | |
105 | [1] The remainder of the input string (i.e. the characters after the |
106 | extracted string). On failure, the entire string is returned. |
107 | |
108 | [2] The skipped prefix (i.e. the characters before the extracted |
109 | string). On failure, "undef" is returned. |
110 | |
111 | Note that in a list context, the contents of the original input text |
112 | (the first argument) are not modified in any way. |
113 | |
114 | However, if the input text was passed in a variable, that variable's |
115 | "pos" value is updated to point at the first character after the |
116 | extracted text. That means that in a list context the various |
117 | subroutines can be used much like regular expressions. For example: |
118 | |
119 | while ( $next = (extract_quotelike($text))[0] ) |
120 | { |
121 | # process next quote-like (in $next) |
122 | } |
123 | |
124 | General behaviour in scalar and void contexts |
125 | In a scalar context, the extracted string is returned, having first been |
126 | removed from the input text. Thus, the following code also processes |
127 | each quote-like operation, but actually removes them from $text: |
128 | |
129 | while ( $next = extract_quotelike($text) ) |
130 | { |
131 | # process next quote-like (in $next) |
132 | } |
133 | |
134 | Note that if the input text is a read-only string (i.e. a literal), no |
135 | attempt is made to remove the extracted text. |
136 | |
137 | In a void context the behaviour of the extraction subroutines is exactly |
138 | the same as in a scalar context, except (of course) that the extracted |
139 | substring is not returned. |
140 | |
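The difference between the two calling contexts might be sketched as follows. This is an illustrative example with a made-up input string, not code from the module's distribution.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

my $text = q{'one' 'two'};

# List context: $text is left intact, only pos($text) advances.
my ($first)    = extract_delimited($text, q{'});
my $after_list = $text;         # copy taken to show $text is unchanged
my $pos_after  = pos($text);    # just past the first quoted string

# Scalar context: extraction resumes at pos($text), and the matched
# substring (plus the skipped whitespace prefix) is cut out of $text.
my $second = extract_delimited($text, q{'});

print "first:  $first\n";
print "second: $second\n";
print "left in \$text: $text\n";
```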
141 | A note about prefixes |
142 | Prefix patterns are matched without any trailing modifiers ("/gimsox" |
143 | etc.). This can bite you if you're expecting a prefix specification like |
144 | '.*?(?=<H1>)' to skip everything up to the first <H1> tag. Such a prefix |
145 | pattern will only succeed if the <H1> tag is on the current line, since |
146 | "." normally doesn't match newlines. |
147 | |
148 | To overcome this limitation, you need to turn on /s matching within the |
149 | prefix pattern, using the "(?s)" directive: '(?s).*?(?=<H1>)' |
150 | |
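As a rough illustration of the point about "(?s)", the sketch below runs the same extraction with and without the directive. The sample text is hypothetical.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Hypothetical input: the <H1> tag sits on the third line.
my $html = "intro line\nmore intro\n<H1>Title</H1> rest";

# Without (?s), "." stops at the first newline, so the prefix cannot
# reach the tag and the extraction fails.
my ($no_s) = extract_tagged($html, '<H1>', '</H1>', '.*?(?=<H1>)');

# With (?s), "." also matches newlines and the prefix reaches the tag.
my ($with_s, $rest, $prefix) =
    extract_tagged($html, '<H1>', '</H1>', '(?s).*?(?=<H1>)');

print "without (?s): ", (defined $no_s && length $no_s ? $no_s : 'no match'), "\n";
print "with (?s):    $with_s\n";
```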
151 | "extract_delimited" |
152 | The "extract_delimited" function formalizes the common idiom of |
153 | extracting a single-character-delimited substring from the start of a |
154 | string. For example, to extract a single-quote delimited string, the |
155 | following code is typically used: |
156 | |
157 | ($remainder = $text) =~ s/\A('(\\.|[^'])*')//s; |
158 | $extracted = $1; |
159 | |
160 | but with "extract_delimited" it can be simplified to: |
161 | |
162 | ($extracted,$remainder) = extract_delimited($text, "'"); |
163 | |
164 | "extract_delimited" takes up to four scalars (the input text, the |
165 | delimiters, a prefix pattern to be skipped, and any escape characters) |
166 | and extracts the initial substring of the text that is appropriately |
167 | delimited. If the delimiter string has multiple characters, the first |
168 | one encountered in the text is taken to delimit the substring. The third |
169 | argument specifies a prefix pattern that is to be skipped (but must be |
170 | present!) before the substring is extracted. The final argument |
171 | specifies the escape character to be used for each delimiter. |
172 | |
173 | All arguments are optional. If the escape characters are not specified, |
174 | every delimiter is escaped with a backslash ("\"). If the prefix is not |
175 | specified, the pattern '\s*' - optional whitespace - is used. If the |
176 | delimiter set is also not specified, the set "/["'`]/" is used. If the |
177 | text to be processed is not specified either, $_ is used. |
178 | |
179 | In list context, "extract_delimited" returns an array of three elements, |
180 | the extracted substring (*including the surrounding delimiters*), the |
181 | remainder of the text, and the skipped prefix (if any). If a suitable |
182 | delimited substring is not found, the first element of the array is the |
183 | empty string, the second is the complete original text, and the prefix |
184 | returned in the third element is an empty string. |
185 | |
186 | In a scalar context, just the extracted substring is returned. In a void |
187 | context, the extracted substring (and any prefix) are simply removed |
188 | from the beginning of the first argument. |
189 | |
190 | Examples: |
191 | |
192 | # Remove a single-quoted substring from the very beginning of $text: |
193 | |
194 | $substring = extract_delimited($text, "'", ''); |
195 | |
196 | # Remove a single-quoted Pascalish substring (i.e. one in which |
197 | # doubling the quote character escapes it) from the very |
198 | # beginning of $text: |
199 | |
200 | $substring = extract_delimited($text, "'", '', "'"); |
201 | |
202 | # Extract a single- or double- quoted substring from the |
203 | # beginning of $text, optionally after some whitespace |
204 | # (note the list context to protect $text from modification): |
205 | |
206 | ($substring) = extract_delimited $text, q{"'}; |
207 | |
208 | # Delete the substring delimited by the first '/' in $text: |
209 | |
210 | $text = join '', (extract_delimited($text,'/','[^/]*'))[2,1]; |
211 | |
212 | Note that this last example is *not* the same as deleting the first |
213 | quote-like pattern. For instance, if $text contained the string: |
214 | |
215 | "if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }" |
216 | |
217 | then after the deletion it would contain: |
218 | |
219 | "if ('.$UNIXCMD/s) { $cmd = $1; }" |
220 | |
221 | not: |
222 | |
223 | "if ('./cmd' =~ ms) { $cmd = $1; }" |
224 | |
225 | See "extract_quotelike" for a (partial) solution to this problem. |
226 | |
227 | "extract_bracketed" |
228 | Like "extract_delimited", the "extract_bracketed" function takes up to |
229 | three optional scalar arguments: a string to extract from, a delimiter |
230 | specifier, and a prefix pattern. As before, a missing prefix defaults to |
231 | optional whitespace and a missing text defaults to $_. However, a |
232 | missing delimiter specifier defaults to '{}()[]<>' (see below). |
233 | |
234 | "extract_bracketed" extracts a balanced-bracket-delimited substring |
235 | (using any one (or more) of the user-specified delimiter brackets: |
236 | '(..)', '{..}', '[..]', or '<..>'). Optionally it will also respect |
237 | quoted unbalanced brackets (see below). |
238 | |
239 | A "delimiter bracket" is a bracket in the list of delimiters passed as |
240 | "extract_bracketed"'s second argument. Delimiter brackets are specified |
241 | by giving either the left or right (or both!) versions of the required |
242 | bracket(s). Note that the order in which two or more delimiter brackets |
243 | are specified is not significant. |
244 | |
245 | A "balanced-bracket-delimited substring" is a substring bounded by |
246 | matched brackets, such that any other (left or right) delimiter bracket |
247 | *within* the substring is also matched by an opposite (right or left) |
248 | delimiter bracket *at the same level of nesting*. Any type of bracket |
249 | not in the delimiter list is treated as an ordinary character. |
250 | |
251 | In other words, each type of bracket specified as a delimiter must be |
252 | balanced and correctly nested within the substring, and any other kind |
253 | of ("non-delimiter") bracket in the substring is ignored. |
254 | |
255 | For example, given the string: |
256 | |
257 | $text = "{ an '[irregularly :-(] {} parenthesized >:-)' string }"; |
258 | |
259 | then a call to "extract_bracketed" in a list context: |
260 | |
261 | @result = extract_bracketed( $text, '{}' ); |
262 | |
263 | would return: |
264 | |
265 | ( "{ an '[irregularly :-(] {} parenthesized >:-)' string }" , "" , "" ) |
266 | |
267 | since both sets of '{..}' brackets are properly nested and evenly |
268 | balanced. (In a scalar context just the first element of the array would |
269 | be returned. In a void context, $text would be replaced by an empty |
270 | string.) |
271 | |
272 | Likewise the call in: |
273 | |
274 | @result = extract_bracketed( $text, '{[' ); |
275 | |
276 | would return the same result, since all sets of both types of specified |
277 | delimiter brackets are correctly nested and balanced. |
278 | |
279 | However, the call in: |
280 | |
281 | @result = extract_bracketed( $text, '{([<' ); |
282 | |
283 | would fail, returning: |
284 | |
285 | ( undef , "{ an '[irregularly :-(] {} parenthesized >:-)' string }" ); |
286 | |
287 | because the embedded pairs of '(..)'s and '[..]'s are "cross-nested" and |
288 | the embedded '>' is unbalanced. (In a scalar context, this call would |
289 | return an empty string. In a void context, $text would be unchanged.) |
290 | |
291 | Note that the embedded single-quotes in the string don't help in this |
292 | case, since they have not been specified as acceptable delimiters and |
293 | are therefore treated as non-delimiter characters (and ignored). |
294 | |
295 | However, if a particular species of quote character is included in the |
296 | delimiter specification, then that type of quote will be correctly |
297 | handled. For example, if $text is: |
298 | |
299 | $text = '<A HREF=">>>>">link</A>'; |
300 | |
301 | then |
302 | |
303 | @result = extract_bracketed( $text, '<">' ); |
304 | |
305 | returns: |
306 | |
307 | ( '<A HREF=">>>>">', 'link</A>', "" ) |
308 | |
309 | as expected. Without the specification of '"' as an embedded quoter: |
310 | |
311 | @result = extract_bracketed( $text, '<>' ); |
312 | |
313 | the result would be: |
314 | |
315 | ( '<A HREF=">', '>>>">link</A>', "" ) |
316 | |
317 | In addition to the quote delimiters "'", '"', and "`", full Perl |
318 | quote-like quoting (i.e. q{string}, qq{string}, etc) can be specified by |
319 | including the letter 'q' as a delimiter. Hence: |
320 | |
321 | @result = extract_bracketed( $text, '<q>' ); |
322 | |
323 | would correctly match something like this: |
324 | |
325 | $text = '<leftop: conj /and/ conj>'; |
326 | |
327 | See also: "extract_quotelike" and "extract_codeblock". |
328 | |
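The distinction between delimiter and non-delimiter brackets can be tried out with a small sketch such as this one, which uses a made-up input string containing an unclosed parenthesis.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical input with a stray, unclosed "(" inside balanced braces.
my $code = '{ outer ( { inner } } tail';

# Only braces are delimiter brackets here, so the stray "(" is an
# ordinary character and the balanced {...} block is extracted.
my ($block, $rest) = extract_bracketed($code, '{}');

# With parentheses added as delimiter brackets, the unclosed "("
# makes the nesting invalid and the extraction fails.
pos($code) = 0;    # start again from the beginning of the string
my ($strict) = extract_bracketed($code, '{}()');

print "braces only:     $block\n";
print "braces + parens: ", (defined $strict && length $strict ? $strict : 'no match'), "\n";
```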
329 | "extract_variable" |
330 | "extract_variable" extracts any valid Perl variable or variable-involved |
331 | expression, including scalars, arrays, hashes, array accesses, hash |
332 | look-ups, method calls through objects, subroutine calls through |
333 | subroutine references, etc. |
334 | |
335 | The subroutine takes up to two optional arguments: |
336 | |
337 | 1. A string to be processed ($_ if the string is omitted or "undef") |
338 | |
339 | 2. A string specifying a pattern to be matched as a prefix (which is to |
340 | be skipped). If omitted, optional whitespace is skipped. |
341 | |
342 | On success in a list context, an array of 3 elements is returned. The |
343 | elements are: |
344 | |
345 | [0] the extracted variable, or variablish expression |
346 | |
347 | [1] the remainder of the input text, |
348 | |
349 | [2] the prefix substring (if any), |
350 | |
351 | On failure, all of these values (except the remaining text) are "undef". |
352 | |
353 | In a scalar context, "extract_variable" returns just the complete |
354 | substring that matched a variablish expression. "undef" is returned on |
355 | failure. In addition, the original input text has the returned substring |
356 | (and any prefix) removed from it. |
357 | |
358 | In a void context, the input text just has the matched substring (and |
359 | any specified prefix) removed. |
360 | |
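A short sketch of "extract_variable" in list context follows. The expression in $src is invented for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_variable);

# Hypothetical source fragment: a hash look-up through a reference,
# followed by an array access, then unrelated code.
my $src = '$obj->{items}[0] + 1';

# List context: $src itself is not modified.
my ($var, $rest, $prefix) = extract_variable($src);

print "variable:  $var\n";
print "remainder: $rest\n";
```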
361 | "extract_tagged" |
362 | "extract_tagged" extracts and segments text between (balanced) specified |
363 | tags. |
364 | |
365 | The subroutine takes up to five optional arguments: |
366 | |
367 | 1. A string to be processed ($_ if the string is omitted or "undef") |
368 | |
369 | 2. A string specifying a pattern to be matched as the opening tag. If |
370 | the pattern string is omitted (or "undef") then a pattern that |
371 | matches any standard XML tag is used. |
372 | |
373 | 3. A string specifying a pattern to be matched at the closing tag. If |
374 | the pattern string is omitted (or "undef") then the closing tag is |
375 | constructed by inserting a "/" after any leading bracket characters |
376 | in the actual opening tag that was matched (*not* the pattern that |
377 | matched the tag). For example, if the opening tag pattern is |
378 | specified as '{{\w+}}' and actually matched the opening tag |
379 | "{{DATA}}", then the constructed closing tag would be "{{/DATA}}". |
380 | |
381 | 4. A string specifying a pattern to be matched as a prefix (which is to |
382 | be skipped). If omitted, optional whitespace is skipped. |
383 | |
384 | 5. A hash reference containing various parsing options (see below) |
385 | |
386 | The various options that can be specified are: |
387 | |
388 | "reject => $listref" |
389 | The list reference contains one or more strings specifying patterns |
390 | that must *not* appear within the tagged text. |
391 | |
392 | For example, to extract an HTML link (which should not contain |
393 | nested links) use: |
394 | |
395 | extract_tagged($text, '<A>', '</A>', undef, {reject => ['<A>']} ); |
396 | |
397 | "ignore => $listref" |
398 | The list reference contains one or more strings specifying patterns |
399 | that are *not* to be treated as nested tags within the tagged text |
400 | (even if they would match the start tag pattern). |
401 | |
402 | For example, to extract an arbitrary XML tag, but ignore "empty" |
403 | elements: |
404 | |
405 | extract_tagged($text, undef, undef, undef, {ignore => ['<[^>]*/>']} ); |
406 | |
407 | (also see "gen_delimited_pat" below). |
408 | |
409 | "fail => $str" |
410 | The "fail" option indicates the action to be taken if a matching end |
411 | tag is not encountered (i.e. before the end of the string or some |
412 | "reject" pattern matches). By default, a failure to match a closing |
413 | tag causes "extract_tagged" to immediately fail. |
414 | |
415 | However, if the string value associated with "fail" is "MAX", then |
416 | "extract_tagged" returns the complete text up to the point of |
417 | failure. If the string is "PARA", "extract_tagged" returns only the |
418 | first paragraph after the tag (up to the first line that is either |
419 | empty or contains only whitespace characters). If the string is "", |
420 | the default behaviour (i.e. failure) is reinstated. |
421 | |
422 | For example, suppose the start tag "/para" introduces a paragraph, |
423 | which then continues until the next "/endpara" tag or until another |
424 | "/para" tag is encountered: |
425 | |
426 | $text = "/para line 1\n\nline 3\n/para line 4"; |
427 | |
428 | extract_tagged($text, '/para', '/endpara', undef, |
429 | {reject => ['/para'], fail => 'MAX'} ); |
430 | |
431 | # EXTRACTED: "/para line 1\n\nline 3\n" |
432 | |
433 | Suppose instead, that if no matching "/endpara" tag is found, the |
434 | "/para" tag refers only to the immediately following paragraph: |
435 | |
436 | $text = "/para line 1\n\nline 3\n/para line 4"; |
437 | |
438 | extract_tagged($text, '/para', '/endpara', undef, |
439 | {reject => ['/para'], fail => 'PARA'} ); |
440 | |
441 | # EXTRACTED: "/para line 1\n" |
442 | |
443 | Note that the specified "fail" behaviour applies to nested tags as |
444 | well. |
445 | |
446 | On success in a list context, an array of 6 elements is returned. The |
447 | elements are: |
448 | |
449 | [0] the extracted tagged substring (including the outermost tags), |
450 | |
451 | [1] the remainder of the input text, |
452 | |
453 | [2] the prefix substring (if any), |
454 | |
455 | [3] the opening tag |
456 | |
457 | [4] the text between the opening and closing tags |
458 | |
459 | [5] the closing tag (or "" if no closing tag was found) |
460 | |
461 | On failure, all of these values (except the remaining text) are "undef". |
462 | |
463 | In a scalar context, "extract_tagged" returns just the complete |
464 | substring that matched a tagged text (including the start and end tags). |
465 | "undef" is returned on failure. In addition, the original input text has |
466 | the returned substring (and any prefix) removed from it. |
467 | |
468 | In a void context, the input text just has the matched substring (and |
469 | any specified prefix) removed. |
470 | |
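The six-element return list and the "fail" option might be exercised as in the sketch below. The first input is a made-up HTML fragment; the second reuses the "/para" text from the examples above.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Default tag patterns: any XML-style opening tag and the closing tag
# derived from whatever opening tag was actually matched.
my $doc   = '<b>bold</b> plain';
my @parts = extract_tagged($doc);
# @parts holds: match, remainder, prefix, opening tag, inner text,
# closing tag.
print join(' | ', map { "[$_]" } @parts), "\n";

# fail => 'PARA': no '/endpara' is found before the rejected '/para',
# so only the first paragraph after the opening tag is returned.
# Scalar context, so the match is also removed from $notes.
my $notes = "/para line 1\n\nline 3\n/para line 4";
my $para  = extract_tagged($notes, '/para', '/endpara', undef,
                           { reject => ['/para'], fail => 'PARA' });
print "paragraph: $para";
```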
471 | "gen_extract_tagged" |
472 | (Note: This subroutine is only available under Perl 5.005 or later) |
473 | |
474 | "gen_extract_tagged" generates a new anonymous subroutine which extracts |
475 | text between (balanced) specified tags. In other words, it generates a |
476 | subroutine identical in behaviour to "extract_tagged". |
477 | |
478 | The difference between "extract_tagged" and the anonymous subroutines |
479 | generated by "gen_extract_tagged", is that those generated subroutines: |
480 | |
481 | * do not have to reparse tag specification or parsing options every |
482 | time they are called (whereas "extract_tagged" has to effectively |
483 | rebuild its tag parser on every call); |
484 | |
485 | * make use of the new qr// construct to pre-compile the regexes they |
486 | use (whereas "extract_tagged" uses standard string variable |
487 | interpolation to create tag-matching patterns). |
488 | |
489 | The subroutine takes up to four optional arguments (the same set as |
490 | "extract_tagged" except for the string to be processed). It returns a |
491 | reference to a subroutine which in turn takes a single argument (the |
492 | text to be extracted from). |
493 | |
494 | In other words, the implementation of "extract_tagged" is exactly |
495 | equivalent to: |
496 | |
497 | sub extract_tagged |
498 | { |
499 | my $text = shift; |
500 | my $extractor = gen_extract_tagged(@_); |
501 | return $extractor->($text); |
502 | } |
503 | |
504 | (although "extract_tagged" is not currently implemented that way, in |
505 | order to preserve pre-5.005 compatibility). |
506 | |
507 | Using "gen_extract_tagged" to create extraction functions for specific |
508 | tags is a good idea if those functions are going to be called more than |
509 | once, since their performance is typically twice as good as the more |
510 | general-purpose "extract_tagged". |
511 | |
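A generated extractor is typically built once and then called repeatedly. The sketch below assumes a hypothetical list of "<li>" items and collects the text between each pair of tags.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_extract_tagged);

# Build the extractor once, then reuse it for every call.
my $extract_item = gen_extract_tagged('<li>', '</li>');

# Hypothetical input.
my $list = '<li>one</li> <li>two</li> done';
my @items;
while (1) {
    # List context: $list is not modified, pos($list) advances instead.
    my @fields = $extract_item->($list);
    last unless defined $fields[0] && length $fields[0];
    push @items, $fields[4];    # element [4] is the text between the tags
}

print "items: @items\n";
```

The loop stops at the first position where no "<li>" tag follows the optional whitespace prefix, which is why the trailing text is never examined.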
512 | "extract_quotelike" |
513 | "extract_quotelike" attempts to recognize, extract, and segment any one |
514 | of the various Perl quotes and quotelike operators (see perlop(3)). |
515 | Nested backslashed delimiters, embedded balanced bracket delimiters (for |
516 | the quotelike operators), and trailing modifiers are all caught. For |
517 | example, in: |
518 | |
519 | extract_quotelike 'q # an octothorpe: \# (not the end of the q!) #' |
520 | |
521 | extract_quotelike ' "You said, \"Use sed\"." ' |
522 | |
523 | extract_quotelike ' s{([A-Z]{1,8}\.[A-Z]{3})} /\L$1\E/; ' |
524 | |
525 | extract_quotelike ' tr/\\\/\\\\/\\\//ds; ' |
526 | |
527 | the full Perl quotelike operations are all extracted correctly. |
528 | |
529 | Note too that, when using the /x modifier on a regex, any comment |
530 | containing the current pattern delimiter will cause the regex to be |
531 | immediately terminated. In other words: |
532 | |
533 | 'm / |
534 | (?i) # CASE INSENSITIVE |
535 | [a-z_] # LEADING ALPHABETIC/UNDERSCORE |
536 | [a-z0-9]* # FOLLOWED BY ANY NUMBER OF ALPHANUMERICS |
537 | /x' |
538 | |
539 | will be extracted as if it were: |
540 | |
541 | 'm / |
542 | (?i) # CASE INSENSITIVE |
543 | [a-z_] # LEADING ALPHABETIC/' |
544 | |
545 | This behaviour is identical to that of the actual compiler. |
546 | |
547 | "extract_quotelike" takes two arguments: the text to be processed and a |
548 | prefix to be matched at the very beginning of the text. If no prefix is |
549 | specified, optional whitespace is the default. If no text is given, $_ |
550 | is used. |
551 | |
552 | In a list context, an array of 11 elements is returned. The elements |
553 | are: |
554 | |
555 | [0] the extracted quotelike substring (including trailing modifiers), |
556 | |
557 | [1] the remainder of the input text, |
558 | |
559 | [2] the prefix substring (if any), |
560 | |
561 | [3] the name of the quotelike operator (if any), |
562 | |
563 | [4] the left delimiter of the first block of the operation, |
564 | |
565 | [5] the text of the first block of the operation (that is, the contents |
566 | of a quote, the regex of a match or substitution or the target list |
567 | of a translation), |
568 | |
569 | [6] the right delimiter of the first block of the operation, |
570 | |
571 | [7] the left delimiter of the second block of the operation (that is, if |
572 | it is a "s", "tr", or "y"), |
573 | |
574 | [8] the text of the second block of the operation (that is, the |
575 | replacement of a substitution or the translation list of a |
576 | translation), |
577 | |
578 | [9] the right delimiter of the second block of the operation (if any), |
579 | |
580 | [10] |
581 | the trailing modifiers on the operation (if any). |
582 | |
583 | For each of the fields marked "(if any)" the default value on success is |
584 | an empty string. On failure, all of these values (except the remaining |
585 | text) are "undef". |
586 | |
587 | In a scalar context, "extract_quotelike" returns just the complete |
588 | substring that matched a quotelike operation (or "undef" on failure). In |
589 | a scalar or void context, the input text has the same substring (and any |
590 | specified prefix) removed. |
591 | |
592 | Examples: |
593 | |
594 | # Remove the first quotelike literal that appears in text |
595 | |
596 | $quotelike = extract_quotelike($text,'.*?'); |
597 | |
598 | # Replace one or more leading whitespace-separated quotelike |
599 | # literals in $_ with "<QLL>" |
600 | |
601 | do { $_ = join '<QLL>', (extract_quotelike)[2,1] } until $@; |
602 | |
603 | |
604 | # Isolate the search pattern in a quotelike operation from $text |
605 | |
606 | ($op,$pat) = (extract_quotelike $text)[3,5]; |
607 | if ($op =~ /[ms]/) |
608 | { |
609 | print "search pattern: $pat\n"; |
610 | } |
611 | else |
612 | { |
613 | print "$op is not a pattern matching operation\n"; |
614 | } |
615 | |
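To see how the fields of the list-context result line up with a real operator, a sketch along these lines may help. The substitution in $stmt is invented for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

# Hypothetical statement beginning with a substitution.
my $stmt   = 's/foo/bar/gi; next';
my @fields = extract_quotelike($stmt);

my ($whole, $rest)        = @fields[0, 1];    # full operation, remainder
my ($op, $pattern)        = @fields[3, 5];    # operator name, first block
my ($replacement, $flags) = @fields[8, 10];   # second block, modifiers

print "operation:   $whole\n";
print "operator:    $op\n";
print "pattern:     $pattern\n";
print "replacement: $replacement\n";
print "modifiers:   $flags\n";
```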
616 | "extract_quotelike" and "here documents" |
617 | "extract_quotelike" can successfully extract "here documents" from an |
618 | input string, but with an important caveat in list contexts. |
619 | |
620 | Unlike other types of quote-like literals, a here document is rarely a |
621 | contiguous substring. For example, a typical piece of code using a here |
622 | document might look like this: |
623 | |
624 | <<'EOMSG' || die; |
625 | This is the message. |
626 | EOMSG |
627 | exit; |
628 | |
629 | Given this as an input string in a scalar context, "extract_quotelike" |
630 | would correctly return the string "<<'EOMSG'\nThis is the |
631 | message.\nEOMSG", leaving the string " || die;\nexit;" in the original |
632 | variable. In other words, the two separate pieces of the here document |
633 | are successfully extracted and concatenated. |
634 | |
635 | In a list context, "extract_quotelike" would return the list |
636 | |
637 | [0] "<<'EOMSG'\nThis is the message.\nEOMSG\n" (i.e. the full extracted |
638 | here document, including fore and aft delimiters), |
639 | |
640 | [1] " || die;\nexit;" (i.e. the remainder of the input text, |
641 | concatenated), |
642 | |
643 | [2] "" (i.e. the prefix substring -- trivial in this case), |
644 | |
645 | [3] "<<" (i.e. the "name" of the quotelike operator) |
646 | |
647 | [4] "'EOMSG'" (i.e. the left delimiter of the here document, including |
648 | any quotes), |
649 | |
650 | [5] "This is the message.\n" (i.e. the text of the here document), |
651 | |
652 | [6] "EOMSG" (i.e. the right delimiter of the here document), |
653 | |
654 | [7..10] |
655 | "" (a here document has no second left delimiter, second text, |
656 | second right delimiter, or trailing modifiers). |
657 | |
658 | However, the matching position of the input variable would be set to |
659 | "exit;" (i.e. *after* the closing delimiter of the here document), which |
660 | would cause the earlier " || die;\nexit;" to be skipped in any sequence |
661 | of code fragment extractions. |
662 | |
663 | To avoid this problem, when it encounters a here document whilst |
664 | extracting from a modifiable string, "extract_quotelike" silently |
665 | rearranges the string to an equivalent piece of Perl: |
666 | |
667 | <<'EOMSG' |
668 | This is the message. |
669 | EOMSG |
670 | || die; |
671 | exit; |
672 | |
673 | in which the here document *is* contiguous. It still leaves the matching |
674 | position after the here document, but now the rest of the line on which |
675 | the here document starts is not skipped. |
676 | |
677 | To prevent "extract_quotelike" from mucking about with the input in this |
678 | way (this is the only case where a list-context "extract_quotelike" does |
679 | so), you can pass the input variable as an interpolated literal: |
680 | |
681 | $quotelike = extract_quotelike("$var"); |
682 | |
683 | "extract_codeblock" |
684 | "extract_codeblock" attempts to recognize and extract a balanced bracket |
685 | delimited substring that may contain unbalanced brackets inside Perl |
686 | quotes or quotelike operations. That is, "extract_codeblock" is like a |
687 | combination of "extract_bracketed" and "extract_quotelike". |
688 | |
689 | "extract_codeblock" takes the same initial three parameters as |
690 | "extract_bracketed": a text to process, a set of delimiter brackets to |
691 | look for, and a prefix to match first. It also takes an optional fourth |
692 | parameter, which allows the outermost delimiter brackets to be specified |
693 | separately (see below). |
694 | |
695 | Omitting the first argument (input text) means process $_ instead. |
696 | Omitting the second argument (delimiter brackets) indicates that only |
697 | '{' is to be used. Omitting the third argument (prefix argument) implies |
698 | optional whitespace at the start. Omitting the fourth argument |
699 | (outermost delimiter brackets) indicates that the value of the second |
700 | argument is to be used for the outermost delimiters. |
701 | |
702 | Once the prefix and the outermost opening delimiter bracket have been |
703 | recognized, code blocks are extracted by stepping through the input text |
704 | and trying the following alternatives in sequence: |
705 | |
706 | 1. Try to match a closing delimiter bracket. If the bracket was the |
707 | same species as the last opening bracket, return the substring to |
708 | that point. If the bracket was mismatched, return an error. |
709 | |
710 | 2. Try to match a quote or quotelike operator. If found, call |
711 | "extract_quotelike" to eat it. If "extract_quotelike" fails, return |
712 | the error it returned. Otherwise go back to step 1. |
713 | |
714 | 3. Try to match an opening delimiter bracket. If found, call |
715 | "extract_codeblock" recursively to eat the embedded block. If the |
716 | recursive call fails, return an error. Otherwise, go back to step 1. |
717 | |
718 | 4. Unconditionally match a bareword or any other single character, and |
719 | then go back to step 1. |
720 | |
721 | Examples: |
722 | |
723 | # Find a while loop in the text |
724 | |
725 | if ($text =~ s/.*?while\s*\{/{/) |
726 | { |
727 | $loop = "while " . extract_codeblock($text); |
728 | } |
729 | |
730 | # Remove the first round-bracketed list (which may include |
731 | # round- or curly-bracketed code blocks or quotelike operators) |
732 | |
733 | extract_codeblock $text, "(){}", '[^(]*'; |
734 | |
735 | The ability to specify a different outermost delimiter bracket is useful |
736 | in some circumstances. For example, in the Parse::RecDescent module, |
737 | parser actions which are to be performed only on a successful parse are |
738 | specified using a "<defer:...>" directive. For example: |
739 | |
740 | sentence: subject verb object |
741 | <defer: {$::theVerb = $item{verb}} > |
742 | |
743 | Parse::RecDescent uses "extract_codeblock($text, '{}<>')" to extract the |
744 | code within the "<defer:...>" directive, but there's a problem. |
745 | |
746 | A deferred action like this: |
747 | |
748 | <defer: {if ($count>10) {$count--}} > |
749 | |
750 | will be incorrectly parsed as: |
751 | |
752 | <defer: {if ($count> |
753 | |
754 | because the "greater than" operator is interpreted as a closing delimiter. |
755 | |
756 | But, by extracting the directive using |
757 | "extract_codeblock($text, '{}', undef, '<>')" the '>' character is only |
758 | treated as a delimiter at the outermost level of the code block, so the |
759 | directive is parsed correctly. |
760 | |
761 | "extract_multiple" |
762 | The "extract_multiple" subroutine takes a string to be processed and a |
763 | list of extractors (subroutines or regular expressions) to apply to that |
764 | string. |
765 | |
766 | In a list context "extract_multiple" returns a list of substrings of |
767 | the original string, as extracted by the specified extractors. In a |
768 | scalar context, "extract_multiple" returns the first substring |
769 | successfully extracted from the original string. In both scalar and void |
770 | contexts the original string has the first successfully extracted |
771 | substring removed from it. In all contexts "extract_multiple" starts at |
772 | the current "pos" of the string, and sets that "pos" appropriately after |
773 | it matches. |
774 | |
775 | Hence, the aim of a call to "extract_multiple" in a list context is |
776 | to split the processed string into as many non-overlapping fields as |
777 | possible, by repeatedly applying each of the specified extractors to the |
778 | remainder of the string. Thus "extract_multiple" is a generalized form |
779 | of Perl's "split" subroutine. |
780 | |
781 | The subroutine takes up to four optional arguments: |
782 | |
783 | 1. A string to be processed ($_ if the string is omitted or "undef") |
784 | |
785 | 2. A reference to a list of subroutine references and/or qr// objects |
786 | and/or literal strings and/or hash references, specifying the |
787 | extractors to be used to split the string. If this argument is |
788 | omitted (or "undef") the list: |
789 | |
790 | [ |
791 | sub { extract_variable($_[0], '') }, |
792 | sub { extract_quotelike($_[0],'') }, |
793 | sub { extract_codeblock($_[0],'{}','') }, |
794 | ] |
795 | |
796 | is used. |
797 | |
798 | 3. A number specifying the maximum number of fields to return. If this |
799 | argument is omitted (or "undef"), split continues as long as |
800 | possible. |
801 | |
802 | If the third argument is *N*, then extraction continues until *N* |
803 | fields have been successfully extracted, or until the string has |
804 | been completely processed. |
805 | |
806 | Note that in scalar and void contexts the value of this argument is |
807 | automatically reset to 1 (under "-w", a warning is issued if the |
808 | argument has to be reset). |
809 | |
810 | 4. A value indicating whether unmatched substrings (see below) within |
811 | the text should be skipped or returned as fields. If the value is |
812 | true, such substrings are skipped. Otherwise, they are returned. |
813 | |
814 | The extraction process works by applying each extractor in sequence to |
815 | the text string. |
816 | |
817 | If the extractor is a subroutine it is called in a list context and is |
818 | expected to return a list of a single element, namely the extracted |
819 | text. It may optionally also return two further arguments: a string |
820 | representing the text left after extraction (like $' for a pattern |
821 | match), and a string representing any prefix skipped before the |
822 | extraction (like $` in a pattern match). Note that this is designed to |
823 | facilitate the use of other Text::Balanced subroutines with |
824 | "extract_multiple". Note too that the value returned by an extractor |
825 | subroutine need not bear any relationship to the corresponding substring |
826 | of the original text (see examples below). |
827 | |
828 | If the extractor is a precompiled regular expression or a string, it is |
829 | matched against the text in a scalar context with a leading '\G' and the |
830 | gc modifiers enabled. The extracted value is either $1 if that variable |
831 | is defined after the match, or else the complete match (i.e. $&). |
832 | |
833 | If the extractor is a hash reference, it must contain exactly one |
834 | element. The value of that element is one of the above extractor types |
835 | (subroutine reference, regular expression, or string). The key of that |
836 | element is the name of a class into which the successful return value of |
837 | the extractor will be blessed. |
838 | |
839 | If an extractor returns a defined value, that value is immediately |
840 | treated as the next extracted field and pushed onto the list of fields. |
841 | If the extractor was specified in a hash reference, the field is also |
842 | blessed into the appropriate class. |
843 | |
844 | If the extractor fails to match (in the case of a regex extractor), or |
845 | returns an empty list or an undefined value (in the case of a subroutine |
846 | extractor), it is assumed to have failed to extract. If none of the |
847 | extractor subroutines succeeds, then one character is extracted from the |
848 | start of the text and the extraction subroutines reapplied. Characters |
849 | which are thus removed are accumulated and eventually become the next |
850 | field (unless the fourth argument is true, in which case they are |
851 | discarded). |
852 | |
853 | For example, the following extracts substrings that are valid Perl |
854 | variables: |
855 | |
856 | @fields = extract_multiple($text, |
857 | [ sub { extract_variable($_[0]) } ], |
858 | undef, 1); |
859 | |
860 | This example separates a text into fields which are quote delimited, |
861 | curly bracketed, and anything else. The delimited and bracketed parts |
862 | are also blessed to identify them (the "anything else" is unblessed): |
863 | |
864 | @fields = extract_multiple($text, |
865 | [ |
866 | { Delim => sub { extract_delimited($_[0],q{'"}) } }, |
867 | { Brack => sub { extract_bracketed($_[0],'{}') } }, |
868 | ]); |
869 | |
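A minimal runnable sketch of the blessed-field example, showing one way to tell the classified fields from the "anything else" fields. The sample text is hypothetical. The sketch also assumes that a blessed field is a reference to a scalar holding the extracted text, so it is dereferenced with $$field; check this against the installed version of the module.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited extract_bracketed);

# Hypothetical input: plain text, a quoted string, and a bracketed chunk.
my $text = q{x 'b' {c}};

my @fields = extract_multiple($text,
    [
        { Delim => sub { extract_delimited($_[0], q{'"}) } },
        { Brack => sub { extract_bracketed($_[0], '{}') } },
    ]);

for my $field (@fields) {
    if (ref $field) {
        # Assumption: blessed fields are scalar references to the text.
        printf "%-5s [%s]\n", ref($field), $$field;
    }
    else {
        # Unblessed fields are the unmatched "anything else" substrings.
        print "plain [$field]\n";
    }
}
```

Because the fourth argument is omitted, unmatched text (including skipped whitespace) is returned as plain, unblessed fields between the classified ones.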
870 | This call extracts the next single substring that is a valid Perl |
871 | quotelike operator (and removes it from $text): |
872 | |
873 | $quotelike = extract_multiple($text, |
874 | [ |
875 | sub { extract_quotelike($_[0]) }, |
876 | ], undef, 1); |
877 | |
878 | Finally, here is yet another way to do comma-separated value parsing: |
879 | |
880 | @fields = extract_multiple($csv_text, |
881 | [ |
882 | sub { extract_delimited($_[0],q{'"}) }, |
883 | qr/([^,]+)/, |
884 | ], |
885 | undef,1); |
886 | |
887 | The list in the second argument means: *"Try and extract a ' or " |
888 | delimited string, otherwise extract anything up to a comma..."*. The |
889 | undef third argument means: *"...as many times as possible..."*, and the |
890 | true value in the fourth argument means *"...discarding anything else |
891 | that appears (i.e. the commas)"*. |
892 | |
893 | If you wanted the commas preserved as separate fields (i.e. like split |
894 | does if your split pattern has capturing parentheses), you would just |
895 | make the last parameter undefined (or remove it). |
896 | |
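The comma-separated parsing described above can be tried with a small hypothetical input. This sketch uses a regex extractor that stops at the next comma, so that later fields remain available to subsequent passes; the sample string is an assumption made for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited);

# Hypothetical CSV-like line with one quoted field.
my $csv_text = q{a,'x b',c};

my @fields = extract_multiple($csv_text,
    [
        # First try a ' or " delimited string...
        sub { extract_delimited($_[0], q{'"}) },
        # ...otherwise take everything up to the next comma.
        qr/([^,]+)/,
    ],
    undef,   # no limit on the number of fields
    1);      # discard unmatched text (the commas)

print join('|', @fields), "\n";
```

With the true fourth argument the commas are dropped, so only the three data fields should remain. Passing undef instead would be expected to return the commas as separate fields.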
897 | "gen_delimited_pat" |
898 | The "gen_delimited_pat" subroutine takes a single (string) argument and |
899 | builds a Friedl-style optimized regex that matches a string delimited |
900 | by any one of the characters in the single argument. For example: |
901 | |
902 | gen_delimited_pat(q{'"}) |
903 | |
904 | returns the regex: |
905 | |
906 | (?:\"(?:\\\"|(?!\").)*\"|\'(?:\\\'|(?!\').)*\') |
907 | |
908 | Note that the specified delimiters are automatically quotemeta'd. |
909 | |
910 | A typical use of "gen_delimited_pat" would be to build special purpose |
911 | tags for "extract_tagged". For example, to properly ignore "empty" XML |
912 | elements (which might contain quoted strings): |
913 | |
914 | my $empty_tag = '<(' . gen_delimited_pat(q{'"}) . '|.)+/>'; |
915 | |
916 | extract_tagged($text, undef, undef, undef, {ignore => [$empty_tag]} ); |
917 | |
918 | "gen_delimited_pat" may also be called with an optional second argument, |
919 | which specifies the "escape" character(s) to be used for each delimiter. |
920 | For example, to match a Pascal-style string (where ' is the delimiter and |
921 | '' is a literal ' within the string): |
922 | |
923 | gen_delimited_pat(q{'},q{'}); |
924 | |
925 | Different escape characters can be specified for different delimiters. |
926 | For example, to specify that '/' is the escape for single quotes and '%' |
927 | is the escape for double quotes: |
928 | |
929 | gen_delimited_pat(q{'"},q{/%}); |
930 | |
931 | If more delimiters than escape chars are specified, the last escape char |
932 | is used for the remaining delimiters. If no escape char is specified for |
933 | a given specified delimiter, '\' is used. |
934 | |
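As a rough illustration, the generated pattern can be interpolated into an ordinary match. The sample strings below are hypothetical, and the exact text of the generated regex may differ between versions of the module, so the sketch relies only on what the pattern matches.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# Pattern for strings delimited by ' or ", with the default '\' escape.
my $quoted = gen_delimited_pat(q{'"});

# Hypothetical input containing an escaped double quote.
my $line = q{say "hi \"there\"" and 'bye' now};
my @strings = $line =~ /($quoted)/g;
print "$_\n" for @strings;

# Pascal-style strings: ' is both the delimiter and its own escape.
my $pascal = gen_delimited_pat(q{'}, q{'});
my ($pascal_string) = q{'it''s' rest} =~ /^($pascal)/;
print "$pascal_string\n";
```

The pattern is wrapped in an extra pair of capturing parentheses here so that a list-context match with /g returns each complete delimited string.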
935 | "delimited_pat" |
936 | Note that "gen_delimited_pat" was previously called "delimited_pat". |
937 | That name may still be used, but is now deprecated. |
938 | |
939 | DIAGNOSTICS |
940 | In a list context, all the functions return "(undef,$original_text)" on |
941 | failure. In a scalar context, failure is indicated by returning "undef" |
942 | (in this case the input text is not modified in any way). |
943 | |
944 | In addition, on failure in *any* context, the $@ variable is set. |
945 | Accessing "$@->{error}" returns one of the error diagnostics listed |
946 | below. Accessing "$@->{pos}" returns the offset into the original string |
947 | at which the error was detected (although not necessarily where it |
948 | occurred!). Printing $@ directly produces the error message, with the |
949 | offset appended. On success, the $@ variable is guaranteed to be |
950 | "undef". |
951 | |
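A short sketch of inspecting a failure, using a hypothetical input with an unclosed bracket. Copying $@ into a lexical straight after the call is a precaution added here, since later code (for example an eval) may overwrite it.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical input: the outer '(' is never closed.
my $text = '(a (b c)';

my ($extracted, $remainder) = extract_bracketed($text, '()');
my $err = $@;    # keep a copy before anything else can change $@

if (!defined $extracted) {
    print "error:  ", $err->{error}, "\n";   # the diagnostic text
    print "offset: ", $err->{pos}, "\n";     # where the failure was detected
    print "full:   $err\n";                  # message with the offset appended
}
```

In a list context the second return value is the unmodified original text, so the caller can fall back to other parsing strategies after a failure.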
952 | The available diagnostics are: |
953 | |
954 | "Did not find a suitable bracket: "%s"" |
955 | The delimiter provided to "extract_bracketed" was not one of |
956 | '()[]<>{}'. |
957 | |
958 | "Did not find prefix: /%s/" |
959 | A non-optional prefix was specified but wasn't found at the start of |
960 | the text. |
961 | |
962 | "Did not find opening bracket after prefix: "%s"" |
963 | "extract_bracketed" or "extract_codeblock" was expecting a |
964 | particular kind of bracket at the start of the text, and didn't find |
965 | it. |
966 | |
967 | "No quotelike operator found after prefix: "%s"" |
968 | "extract_quotelike" didn't find one of the quotelike operators "q", |
969 | "qq", "qw", "qx", "s", "tr" or "y" at the start of the substring it |
970 | was extracting. |
971 | |
972 | "Unmatched closing bracket: "%c"" |
973 | "extract_bracketed", "extract_quotelike" or "extract_codeblock" |
974 | encountered a closing bracket where none was expected. |
975 | |
976 | "Unmatched opening bracket(s): "%s"" |
977 | "extract_bracketed", "extract_quotelike" or "extract_codeblock" ran |
978 | out of characters in the text before closing one or more levels of |
979 | nested brackets. |
980 | |
981 | "Unmatched embedded quote (%s)" |
982 | "extract_bracketed" attempted to match an embedded quoted substring, |
983 | but failed to find a closing quote to match it. |
984 | |
985 | "Did not find closing delimiter to match '%s'" |
986 | "extract_quotelike" was unable to find a closing delimiter to match |
987 | the one that opened the quote-like operation. |
988 | |
989 | "Mismatched closing bracket: expected "%c" but found "%s"" |
990 | "extract_bracketed", "extract_quotelike" or "extract_codeblock" |
991 | found a valid bracket delimiter, but it was the wrong species. This |
992 | usually indicates a nesting error, but may indicate incorrect |
993 | quoting or escaping. |
994 | |
995 | "No block delimiter found after quotelike "%s"" |
996 | "extract_quotelike" or "extract_codeblock" found one of the |
997 | quotelike operators "q", "qq", "qw", "qx", "s", "tr" or "y" without |
998 | a suitable block after it. |
55a1c97c |
999 | |
85831461 |
1000 | "Did not find leading dereferencer" |
1001 | "extract_variable" was expecting one of '$', '@', or '%' at the |
1002 | start of a variable, but didn't find any of them. |
55a1c97c |
1003 | |
85831461 |
1004 | "Bad identifier after dereferencer" |
1005 | "extract_variable" found a '$', '@', or '%' indicating a variable, |
1006 | but that character was not followed by a legal Perl identifier. |
55a1c97c |
1007 | |
85831461 |
1008 | "Did not find expected opening bracket at %s" |
1009 | "extract_codeblock" failed to find any of the outermost opening |
1010 | brackets that were specified. |
55a1c97c |
1011 | |
85831461 |
1012 | "Improperly nested codeblock at %s" |
1013 | A nested code block was found that started with a delimiter that was |
1014 | specified as being only to be used as an outermost bracket. |
55a1c97c |
1015 | |
85831461 |
1016 | "Missing second block for quotelike "%s"" |
1017 | "extract_codeblock" or "extract_quotelike" found one of the |
1018 | quotelike operators "s", "tr" or "y" followed by only one block. |
55a1c97c |
1019 | |
85831461 |
1020 | "No match found for opening bracket" |
1021 | "extract_codeblock" failed to find a closing bracket to match the |
1022 | outermost opening bracket. |
55a1c97c |
1023 | |
85831461 |
1024 | "Did not find opening tag: /%s/" |
1025 | "extract_tagged" did not find a suitable opening tag (after any |
1026 | specified prefix was removed). |
55a1c97c |
1027 | |
85831461 |
1028 | "Unable to construct closing tag to match: /%s/" |
1029 | "extract_tagged" matched the specified opening tag and tried to |
1030 | modify the matched text to produce a matching closing tag (because |
1031 | none was specified). It failed to generate the closing tag, almost |
1032 | certainly because the opening tag did not start with a bracket of |
1033 | some kind. |
55a1c97c |
1034 | |
85831461 |
1035 | "Found invalid nested tag: %s" |
1036 | "extract_tagged" found a nested tag that appeared in the "reject" |
1037 | list (and the failure mode was not "MAX" or "PARA"). |
55a1c97c |
1038 | |
85831461 |
1039 | "Found unbalanced nested tag: %s" |
1040 | "extract_tagged" found a nested opening tag that was not matched by |
1041 | a corresponding nested closing tag (and the failure mode was not |
1042 | "MAX" or "PARA"). |
55a1c97c |
1043 | |
85831461 |
1044 | "Did not find closing tag" |
1045 | "extract_tagged" reached the end of the text without finding a |
1046 | closing tag to match the original opening tag (and the failure mode |
1047 | was not "MAX" or "PARA"). |
55a1c97c |
1048 | |
85831461 |
1049 | AUTHOR |
1050 | Damian Conway (damian@conway.org) |
55a1c97c |
1051 | |
85831461 |
1052 | BUGS AND IRRITATIONS |
1053 | There are undoubtedly serious bugs lurking somewhere in this code, if |
1054 | only because parts of it give the impression of understanding a great |
1055 | deal more about Perl than they really do. |
55a1c97c |
1056 | |
85831461 |
1057 | Bug reports and other feedback are most welcome. |
55a1c97c |
1058 | |
85831461 |
1059 | COPYRIGHT |
1060 | Copyright 1997 - 2001 Damian Conway. All Rights Reserved. |
55a1c97c |
1061 | |
85831461 |
1062 | Some (minor) parts copyright 2009 Adam Kennedy. |
55a1c97c |
1063 | |
85831461 |
1064 | This module is free software. It may be used, redistributed and/or |
1065 | modified under the same terms as Perl itself. |
55a1c97c |
1066 | |