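Text::Balanced - Extract delimited text sequences from strings.
use Text::Balanced qw (
extract_delimited
extract_bracketed
extract_quotelike
extract_codeblock
extract_variable
extract_tagged
extract_multiple
gen_delimited_pat
gen_extract_tagged
);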
17 # Extract the initial substring of $text that is delimited by
18 # two (unescaped) instances of the first character in $delim.
20 ($extracted, $remainder) = extract_delimited($text,$delim);
23 # Extract the initial substring of $text that is bracketed
24 # with a delimiter(s) specified by $delim (where the string
25 # in $delim contains one or more of '(){}[]<>').
27 ($extracted, $remainder) = extract_bracketed($text,$delim);
30 # Extract the initial substring of $text that is bounded by
33 ($extracted, $remainder) = extract_tagged($text);
# Extract the initial substring of $text that is bounded by
# a BEGIN...END pair. Don't allow nested BEGIN tags

($extracted, $remainder) =
extract_tagged($text,"BEGIN","END",undef,{reject=>["BEGIN"]});
43 # Extract the initial substring of $text that represents a
44 # Perl "quote or quote-like operation"
46 ($extracted, $remainder) = extract_quotelike($text);
49 # Extract the initial substring of $text that represents a block
50 # of Perl code, bracketed by any of character(s) specified by $delim
51 # (where the string $delim contains one or more of '(){}[]<>').
53 ($extracted, $remainder) = extract_codeblock($text,$delim);
56 # Extract the initial substrings of $text that would be extracted by
57 # one or more sequential applications of the specified functions
58 # or regular expressions
@extracted = extract_multiple($text,
[ \&extract_bracketed,
\&some_other_extractor_sub,
]);
# Create a string representing an optimized pattern (a la Friedl)
# that matches a substring delimited by any of the specified characters
# (in this case: any type of quote or a slash)

$patstring = gen_delimited_pat(q{'"`/});

# Generate a reference to an anonymous sub that is just like extract_tagged
# but pre-compiled and optimized for a specific pair of tags, and consequently
# much faster (i.e. 3 times faster). It uses qr// for better performance on
# repeated calls, so it only works under Perl 5.005 or later.
80 $extract_head = gen_extract_tagged('<HEAD>','</HEAD>');
82 ($extracted, $remainder) = $extract_head->($text);
85 The various "extract_..." subroutines may be used to extract a delimited
86 substring, possibly after skipping a specified prefix string. By
87 default, that prefix is optional whitespace ("/\s*/"), but you can
88 change it to whatever you wish (see below).
90 The substring to be extracted must appear at the current "pos" location
91 of the string's variable (or at index zero, if no "pos" position is
92 defined). In other words, the "extract_..." subroutines *don't* extract
93 the first occurrence of a substring anywhere in a string (like an
94 unanchored regex would). Rather, they extract an occurrence of the
95 substring appearing immediately at the current matching position in the
96 string (like a "\G"-anchored regex would).
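The anchoring rule can be illustrated with a minimal sketch, assuming Text::Balanced is installed. The sample strings and variable names are illustrative only. The first call uses the default prefix and fails because "foo" sits between the matching position and the bracket; the second call supplies an explicit prefix pattern so the extractor may skip ahead.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Two separate copies, so each call starts with no "pos" set.
my $anchored_text = 'foo (bar) baz';
my $prefixed_text = 'foo (bar) baz';

# Default prefix (optional whitespace): "foo" is in the way, so this fails.
my ($anchored) = extract_bracketed($anchored_text, '()');

# Explicit prefix pattern: skip everything that is not an opening paren.
my ($extracted, $remainder, $prefix)
    = extract_bracketed($prefixed_text, '()', '[^(]*');

print 'default prefix:  ', (defined $anchored ? $anchored : 'no match'), "\n";
print "explicit prefix: $extracted (skipped '$prefix', left '$remainder')\n";
```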
98 General behaviour in list contexts
99 In a list context, all the subroutines return a list, the first three
100 elements of which are always:
102 [0] The extracted string, including the specified delimiters. If the
103 extraction fails "undef" is returned.
105 [1] The remainder of the input string (i.e. the characters after the
106 extracted string). On failure, the entire string is returned.
108 [2] The skipped prefix (i.e. the characters before the extracted
109 string). On failure, "undef" is returned.
111 Note that in a list context, the contents of the original input text
112 (the first argument) are not modified in any way.
114 However, if the input text was passed in a variable, that variable's
115 "pos" value is updated to point at the first character after the
116 extracted text. That means that in a list context the various
117 subroutines can be used much like regular expressions. For example:
while ( $next = (extract_quotelike($text))[0] )
{
# process next quote-like (in $next)
}
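As a runnable sketch of the same list-context idiom (using "extract_delimited" instead of "extract_quotelike" to keep the input simple, and testing with "defined" rather than truth), the loop below walks the string via its "pos" and leaves the text itself untouched. The sample string is an assumption made for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

my $text     = q{'one' "two" 'three'};
my $original = $text;

my @found;
# The list slice forces list context, so $text is not modified;
# only pos($text) advances past each extracted string.
while (defined(my $next = (extract_delimited($text, q{'"}))[0])) {
    push @found, $next;
}

print "found: @found\n";
print "text unchanged\n" if $text eq $original;
```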
124 General behaviour in scalar and void contexts
125 In a scalar context, the extracted string is returned, having first been
126 removed from the input text. Thus, the following code also processes
127 each quote-like operation, but actually removes them from $text:
while ( $next = extract_quotelike($text) )
{
# process next quote-like (in $next)
}
134 Note that if the input text is a read-only string (i.e. a literal), no
135 attempt is made to remove the extracted text.
137 In a void context the behaviour of the extraction subroutines is exactly
138 the same as in a scalar context, except (of course) that the extracted
139 substring is not returned.
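For contrast with the list-context behaviour, here is a small sketch of the scalar-context form, again using "extract_delimited" on an illustrative string. Each successful call removes the extracted text (and its skipped whitespace prefix) from the variable, so the variable shrinks as the loop runs.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

my $text = q{'one' "two" 'three'};

my @found;
# Scalar context: each call returns the extracted string and
# deletes it (plus the skipped prefix) from $text.
while (defined(my $next = extract_delimited($text, q{'"}))) {
    push @found, $next;
}

print "found: @found\n";
print "left over: [$text]\n";
```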
141 A note about prefixes
Prefix patterns are matched without any trailing modifiers ("/gimsox"
etc.). This can bite you if you're expecting a prefix specification like
'.*?(?=<H1>)' to skip everything up to the first <H1> tag. Such a prefix
pattern will only succeed if the <H1> tag is on the current line, since
"." normally doesn't match newlines.
148 To overcome this limitation, you need to turn on /s matching within the
149 prefix pattern, using the "(?s)" directive: '(?s).*?(?=<H1>)'
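The difference can be seen in a short sketch with "extract_tagged", assuming an input whose <H1> tag is not on the first line. The sample text is made up for illustration; separate copies are used so neither call is affected by a "pos" left behind by the other.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $html = "intro line\nsecond line\n<H1>Title</H1> trailing";

# Without (?s) the prefix cannot cross a newline, so nothing is extracted.
my $copy1 = $html;
my ($without_s) = extract_tagged($copy1, '<H1>', '</H1>', '.*?(?=<H1>)');

# With (?s) the "." also matches newlines and the prefix reaches the tag.
my $copy2 = $html;
my ($with_s, $rest, $skipped)
    = extract_tagged($copy2, '<H1>', '</H1>', '(?s).*?(?=<H1>)');

print 'without (?s): ', (defined $without_s ? $without_s : 'no match'), "\n";
print "with (?s):    $with_s\n";
```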
152 The "extract_delimited" function formalizes the common idiom of
153 extracting a single-character-delimited substring from the start of a
154 string. For example, to extract a single-quote delimited string, the
155 following code is typically used:
157 ($remainder = $text) =~ s/\A('(\\.|[^'])*')//s;
160 but with "extract_delimited" it can be simplified to:
162 ($extracted,$remainder) = extract_delimited($text, "'");
164 "extract_delimited" takes up to four scalars (the input text, the
165 delimiters, a prefix pattern to be skipped, and any escape characters)
166 and extracts the initial substring of the text that is appropriately
167 delimited. If the delimiter string has multiple characters, the first
168 one encountered in the text is taken to delimit the substring. The third
169 argument specifies a prefix pattern that is to be skipped (but must be
170 present!) before the substring is extracted. The final argument
171 specifies the escape character to be used for each delimiter.
173 All arguments are optional. If the escape characters are not specified,
174 every delimiter is escaped with a backslash ("\"). If the prefix is not
175 specified, the pattern '\s*' - optional whitespace - is used. If the
176 delimiter set is also not specified, the set "/["'`]/" is used. If the
177 text to be processed is not specified either, $_ is used.
In list context, "extract_delimited" returns an array of three elements:
the extracted substring (*including the surrounding delimiters*), the
remainder of the text, and the skipped prefix (if any). If a suitable
delimited substring is not found, the first element of the array is the
empty string, the second is the complete original text, and the prefix
returned in the third element is an empty string.
186 In a scalar context, just the extracted substring is returned. In a void
187 context, the extracted substring (and any prefix) are simply removed
188 from the beginning of the first argument.
192 # Remove a single-quoted substring from the very beginning of $text:
194 $substring = extract_delimited($text, "'", '');
196 # Remove a single-quoted Pascalish substring (i.e. one in which
197 # doubling the quote character escapes it) from the very
198 # beginning of $text:
200 $substring = extract_delimited($text, "'", '', "'");
202 # Extract a single- or double- quoted substring from the
203 # beginning of $text, optionally after some whitespace
204 # (note the list context to protect $text from modification):
206 ($substring) = extract_delimited $text, q{"'};
208 # Delete the substring delimited by the first '/' in $text:
$text = join '', (extract_delimited($text,'/','[^/]*'))[2,1];
212 Note that this last example is *not* the same as deleting the first
213 quote-like pattern. For instance, if $text contained the string:
215 "if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }"
217 then after the deletion it would contain:
"if ('.$UNIXCMD/s) { $cmd = $1; }"

instead of:

"if ('./cmd' =~ ms) { $cmd = $1; }"
225 See "extract_quotelike" for a (partial) solution to this problem.
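The escape-character argument can be illustrated with a small sketch. The sample strings are illustrative; the first call relies on the default backslash escape, the second passes the quote itself as the escape character for a Pascal-style string.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Default escape character (backslash), default whitespace prefix.
my $perlish = q{  'it\'s' and more};
my ($quoted, $remainder, $prefix) = extract_delimited($perlish, q{'});

# Pascal-style: a doubled quote stands for a literal quote.
# The empty prefix means the string must start at the current position.
my $pascalish = q{'it''s' done};
my ($pascal) = extract_delimited($pascalish, q{'}, '', q{'});

print "backslash escape: $quoted\n";
print "doubled quote:    $pascal\n";
```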
228 Like "extract_delimited", the "extract_bracketed" function takes up to
229 three optional scalar arguments: a string to extract from, a delimiter
230 specifier, and a prefix pattern. As before, a missing prefix defaults to
231 optional whitespace and a missing text defaults to $_. However, a
232 missing delimiter specifier defaults to '{}()[]<>' (see below).
234 "extract_bracketed" extracts a balanced-bracket-delimited substring
235 (using any one (or more) of the user-specified delimiter brackets:
236 '(..)', '{..}', '[..]', or '<..>'). Optionally it will also respect
237 quoted unbalanced brackets (see below).
A "delimiter bracket" is a bracket in the list of delimiters passed as
"extract_bracketed"'s second argument. Delimiter brackets are specified
241 by giving either the left or right (or both!) versions of the required
242 bracket(s). Note that the order in which two or more delimiter brackets
243 are specified is not significant.
245 A "balanced-bracket-delimited substring" is a substring bounded by
246 matched brackets, such that any other (left or right) delimiter bracket
247 *within* the substring is also matched by an opposite (right or left)
248 delimiter bracket *at the same level of nesting*. Any type of bracket
249 not in the delimiter list is treated as an ordinary character.
251 In other words, each type of bracket specified as a delimiter must be
252 balanced and correctly nested within the substring, and any other kind
253 of ("non-delimiter") bracket in the substring is ignored.
255 For example, given the string:
257 $text = "{ an '[irregularly :-(] {} parenthesized >:-)' string }";
259 then a call to "extract_bracketed" in a list context:
261 @result = extract_bracketed( $text, '{}' );
would return:

( "{ an '[irregularly :-(] {} parenthesized >:-)' string }" , "" , "" )

since both sets of '{..}' brackets are properly nested and evenly
balanced. (In a scalar context just the first element of the array would
be returned. In a void context, $text would be replaced by an empty
string.)
272 Likewise the call in:
274 @result = extract_bracketed( $text, '{[' );
276 would return the same result, since all sets of both types of specified
277 delimiter brackets are correctly nested and balanced.
279 However, the call in:
281 @result = extract_bracketed( $text, '{([<' );
283 would fail, returning:
285 ( undef , "{ an '[irregularly :-(] {} parenthesized >:-)' string }" );
287 because the embedded pairs of '(..)'s and '[..]'s are "cross-nested" and
288 the embedded '>' is unbalanced. (In a scalar context, this call would
289 return an empty string. In a void context, $text would be unchanged.)
291 Note that the embedded single-quotes in the string don't help in this
292 case, since they have not been specified as acceptable delimiters and
293 are therefore treated as non-delimiter characters (and ignored).
295 However, if a particular species of quote character is included in the
296 delimiter specification, then that type of quote will be correctly
handled. For example, if $text is:

$text = '<A HREF=">>>>">link</A>';

then

@result = extract_bracketed( $text, '<">' );

returns:

( '<A HREF=">>>>">', 'link</A>', "" )

as expected. Without the specification of '"' as an embedded quoter:

@result = extract_bracketed( $text, '<>' );

the result would be:

( '<A HREF=">', '>>>">link</A>', "" )
317 In addition to the quote delimiters "'", """, and "`", full Perl
318 quote-like quoting (i.e. q{string}, qq{string}, etc) can be specified by
319 including the letter 'q' as a delimiter. Hence:
321 @result = extract_bracketed( $text, '<q>' );
323 would correctly match something like this:
325 $text = '<leftop: conj /and/ conj>';
327 See also: "extract_quotelike" and "extract_codeblock".
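The rules for delimiter and non-delimiter brackets can be tried out with the sketch below. The helper "first_bracketed" is hypothetical (it is not part of Text::Balanced); it simply works on a private copy of the string so that the "pos" left by one call does not affect the next.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: extract from a fresh copy and return only field [0].
sub first_bracketed {
    my ($string, $delims) = @_;
    return (extract_bracketed($string, $delims))[0];
}

my $text = '{ a [b } rest';
my $curly_only = first_bracketed($text, '{}');    # '[' is an ordinary character
my $curly_sq   = first_bracketed($text, '{}[]');  # '[' must now be balanced

my $tag = '<A HREF=">>>>">link</A>';
my $with_quotes    = first_bracketed($tag, '<">'); # '>' inside "..." is protected
my $without_quotes = first_bracketed($tag, '<>');  # stops at the first '>'

print "curly only:     $curly_only\n";
print 'curly + square: ', (defined $curly_sq ? $curly_sq : 'no match'), "\n";
print "with quotes:    $with_quotes\n";
print "without quotes: $without_quotes\n";
```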
330 "extract_variable" extracts any valid Perl variable or variable-involved
331 expression, including scalars, arrays, hashes, array accesses, hash
332 look-ups, method calls through objects, subroutine calls through
333 subroutine references, etc.
335 The subroutine takes up to two optional arguments:
337 1. A string to be processed ($_ if the string is omitted or "undef")
339 2. A string specifying a pattern to be matched as a prefix (which is to
340 be skipped). If omitted, optional whitespace is skipped.
On success in a list context, an array of 3 elements is returned. The
elements are:
345 [0] the extracted variable, or variablish expression
347 [1] the remainder of the input text,
349 [2] the prefix substring (if any),
351 On failure, all of these values (except the remaining text) are "undef".
353 In a scalar context, "extract_variable" returns just the complete
354 substring that matched a variablish expression. "undef" is returned on
355 failure. In addition, the original input text has the returned substring
356 (and any prefix) removed from it.
358 In a void context, the input text just has the matched substring (and
359 any specified prefix) removed.
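A minimal sketch of "extract_variable" in list context follows. The code fragment being parsed is an invented example, held in a single-quoted string so that nothing is interpolated.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_variable);

# The text to parse is itself a fragment of Perl source.
my $code = '$config->{name} = "x";';

my ($variable, $remainder, $prefix) = extract_variable($code);

print "variable:  $variable\n";
print "remainder: $remainder\n";
```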
"extract_tagged" extracts and segments text between (balanced) specified
tags.
365 The subroutine takes up to five optional arguments:
367 1. A string to be processed ($_ if the string is omitted or "undef")
369 2. A string specifying a pattern to be matched as the opening tag. If
370 the pattern string is omitted (or "undef") then a pattern that
371 matches any standard XML tag is used.
373 3. A string specifying a pattern to be matched at the closing tag. If
374 the pattern string is omitted (or "undef") then the closing tag is
375 constructed by inserting a "/" after any leading bracket characters
376 in the actual opening tag that was matched (*not* the pattern that
377 matched the tag). For example, if the opening tag pattern is
378 specified as '{{\w+}}' and actually matched the opening tag
379 "{{DATA}}", then the constructed closing tag would be "{{/DATA}}".
381 4. A string specifying a pattern to be matched as a prefix (which is to
382 be skipped). If omitted, optional whitespace is skipped.
384 5. A hash reference containing various parsing options (see below)
The various options that can be specified are:

"reject => $listref"
The list reference contains one or more strings specifying patterns
that must *not* appear within the tagged text.

For example, to extract an HTML link (which should not contain
nested links), use:

extract_tagged($text, '<A>', '</A>', undef, {reject => ['<A>']} );

"ignore => $listref"
The list reference contains one or more strings specifying patterns
that are *not* to be treated as nested tags within the tagged text
(even if they would match the start tag pattern).

For example, to extract an arbitrary XML tag, but ignore "empty"
elements:

extract_tagged($text, undef, undef, undef, {ignore => ['<[^>]*/>']} );

(also see "gen_delimited_pat" below).
"fail => $str"
The "fail" option indicates the action to be taken if a matching end
tag is not encountered (i.e. before the end of the string or some
"reject" pattern matches). By default, a failure to match a closing
tag causes "extract_tagged" to immediately fail.

However, if the string value associated with "fail" is "MAX", then
"extract_tagged" returns the complete text up to the point of
failure. If the string is "PARA", "extract_tagged" returns only the
first paragraph after the tag (up to the first line that is either
empty or contains only whitespace characters). If the string is "",
the default behaviour (i.e. failure) is reinstated.
For example, suppose the start tag "/para" introduces a paragraph,
which then continues until the next "/endpara" tag or until another
"/para" tag is encountered:

$text = "/para line 1\n\nline 3\n/para line 4";

extract_tagged($text, '/para', '/endpara', undef,
{reject => '/para', fail => 'MAX'} );

# EXTRACTED: "/para line 1\n\nline 3\n"

Suppose instead, that if no matching "/endpara" tag is found, the
"/para" tag refers only to the immediately following paragraph:

$text = "/para line 1\n\nline 3\n/para line 4";

extract_tagged($text, '/para', '/endpara', undef,
{reject => '/para', fail => 'PARA'} );

# EXTRACTED: "/para line 1\n"

Note that the specified "fail" behaviour applies to nested tags as
well.
On success in a list context, an array of 6 elements is returned. The
elements are:

[0] the extracted tagged substring (including the outermost tags),

[1] the remainder of the input text,

[2] the prefix substring (if any),

[3] the opening tag

[4] the text between the opening and closing tags

[5] the closing tag (or "" if no closing tag was found)

On failure, all of these values (except the remaining text) are "undef".
463 In a scalar context, "extract_tagged" returns just the complete
464 substring that matched a tagged text (including the start and end tags).
465 "undef" is returned on failure. In addition, the original input text has
466 the returned substring (and any prefix) removed from it.
468 In a void context, the input text just has the matched substring (and
469 any specified prefix) removed.
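The list-context return values can be inspected with a short sketch that relies on the default (XML-style) opening tag and the automatically constructed closing tag. The input string is an invented example containing one nested tag of the same kind.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $text = '<b>bold <b>nested</b> text</b> tail';

# No tag patterns given: any XML-style tag opens, and the closing tag
# is derived from the opening tag that actually matched.
my ($whole, $remainder, $prefix, $open, $inner, $close)
    = extract_tagged($text);

print "whole:     $whole\n";
print "open:      $open\n";
print "inner:     $inner\n";
print "close:     $close\n";
print "remainder: $remainder\n";
```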
(Note: This subroutine is only available under Perl 5.005 or later)
474 "gen_extract_tagged" generates a new anonymous subroutine which extracts
475 text between (balanced) specified tags. In other words, it generates a
476 function identical in function to "extract_tagged".
478 The difference between "extract_tagged" and the anonymous subroutines
479 generated by "gen_extract_tagged", is that those generated subroutines:
481 * do not have to reparse tag specification or parsing options every
482 time they are called (whereas "extract_tagged" has to effectively
483 rebuild its tag parser on every call);
485 * make use of the new qr// construct to pre-compile the regexes they
486 use (whereas "extract_tagged" uses standard string variable
487 interpolation to create tag-matching patterns).
489 The subroutine takes up to four optional arguments (the same set as
490 "extract_tagged" except for the string to be processed). It returns a
491 reference to a subroutine which in turn takes a single argument (the
492 text to be extracted from).
In other words, the implementation of "extract_tagged" is exactly
equivalent to:

sub extract_tagged
{
my $text = shift;
$extractor = gen_extract_tagged(@_);
return $extractor->($text);
}

(although "extract_tagged" is not currently implemented that way, in
order to preserve pre-5.005 compatibility).
507 Using "gen_extract_tagged" to create extraction functions for specific
508 tags is a good idea if those functions are going to be called more than
509 once, since their performance is typically twice as good as the more
510 general-purpose "extract_tagged".
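A brief sketch of the generate-once, call-many pattern: one extractor is built for a fixed tag pair and then reused on several inputs. The documents in the loop are invented sample data.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_extract_tagged);

# Build the extractor once; the tag patterns are compiled here.
my $extract_head = gen_extract_tagged('<HEAD>', '</HEAD>');

my @heads;
for my $doc ('<HEAD><TITLE>One</TITLE></HEAD><BODY>1</BODY>',
             '  <HEAD>Two</HEAD> trailing text') {
    my $text = $doc;                       # work on a modifiable copy
    my ($head) = $extract_head->($text);   # list context: $text is not altered
    push @heads, $head;
}

print "$_\n" for @heads;
```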
513 "extract_quotelike" attempts to recognize, extract, and segment any one
of the various Perl quotes and quotelike operators (see perlop(3)).
Nested backslashed delimiters, embedded balanced bracket delimiters (for
the quotelike operators), and trailing modifiers are all caught. For
example, in:
519 extract_quotelike 'q # an octothorpe: \# (not the end of the q!) #'
521 extract_quotelike ' "You said, \"Use sed\"." '
523 extract_quotelike ' s{([A-Z]{1,8}\.[A-Z]{3})} /\L$1\E/; '
525 extract_quotelike ' tr/\\\/\\\\/\\\//ds; '
527 the full Perl quotelike operations are all extracted correctly.
Note too that, when using the /x modifier on a regex, any comment
containing the current pattern delimiter will cause the regex to be
immediately terminated. In other words:

'm /
(?i) # CASE INSENSITIVE
[a-z_] # LEADING ALPHABETIC/UNDERSCORE
[a-z0-9]* # FOLLOWED BY ANY NUMBER OF ALPHANUMERICS
/x'

will be extracted as if it were:

'm /
(?i) # CASE INSENSITIVE
[a-z_] # LEADING ALPHABETIC/'

This behaviour is identical to that of the actual compiler.
"extract_quotelike" takes two arguments: the text to be processed and a
prefix to be matched at the very beginning of the text. If no prefix is
specified, optional whitespace is the default. If no text is given, $_
is used.

In a list context, an array of 11 elements is returned. The elements
are:
555 [0] the extracted quotelike substring (including trailing modifiers),
557 [1] the remainder of the input text,
559 [2] the prefix substring (if any),
561 [3] the name of the quotelike operator (if any),
563 [4] the left delimiter of the first block of the operation,
[5] the text of the first block of the operation (that is, the contents
of a quote, the regex of a match or substitution or the target list
of a translation),

[6] the right delimiter of the first block of the operation,

[7] the left delimiter of the second block of the operation (that is, if
it is a "s", "tr", or "y"),

[8] the text of the second block of the operation (that is, the
replacement of a substitution or the translation list of a
translation),

[9] the right delimiter of the second block of the operation (if any),

[10] the trailing modifiers on the operation (if any).

For each of the fields marked "(if any)" the default value on success is
an empty string. On failure, all of these values (except the remaining
text) are "undef".
587 In a scalar context, "extract_quotelike" returns just the complete
588 substring that matched a quotelike operation (or "undef" on failure). In
589 a scalar or void context, the input text has the same substring (and any
590 specified prefix) removed.
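To make the field layout concrete, the following sketch segments a substitution and picks out a few of the eleven fields by index. The input string is an illustrative example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

my $code = 's/foo/bar/gi; next';

my @field = extract_quotelike($code);

print "whole operation: $field[0]\n";
print "remainder:       $field[1]\n";
print "operator:        $field[3]\n";
print "first block:     $field[5]\n";
print "second block:    $field[8]\n";
print "modifiers:       $field[10]\n";
```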
594 # Remove the first quotelike literal that appears in text
596 $quotelike = extract_quotelike($text,'.*?');
598 # Replace one or more leading whitespace-separated quotelike
599 # literals in $_ with "<QLL>"
601 do { $_ = join '<QLL>', (extract_quotelike)[2,1] } until $@;
# Isolate the search pattern in a quotelike operation from $text

($op,$pat) = (extract_quotelike $text)[3,5];
if ($op =~ /[ms]/)
{
print "search pattern: $pat\n";
}
else
{
print "$op is not a pattern matching operation\n";
}
616 "extract_quotelike" and "here documents"
617 "extract_quotelike" can successfully extract "here documents" from an
618 input string, but with an important caveat in list contexts.
Unlike other types of quote-like literals, a here document is rarely a
contiguous substring. For example, a typical piece of code using a here
document might look like this:

<<'EOMSG' || die;
This is the message.
EOMSG
exit;
629 Given this as an input string in a scalar context, "extract_quotelike"
630 would correctly return the string "<<'EOMSG'\nThis is the
631 message.\nEOMSG", leaving the string " || die;\nexit;" in the original
632 variable. In other words, the two separate pieces of the here document
633 are successfully extracted and concatenated.
In a list context, "extract_quotelike" would return the list

[0] "<<'EOMSG'\nThis is the message.\nEOMSG\n" (i.e. the full extracted
here document, including fore and aft delimiters),

[1] " || die;\nexit;" (i.e. the remainder of the input text,
concatenated),

[2] "" (i.e. the prefix substring -- trivial in this case),

[3] "<<" (i.e. the "name" of the quotelike operator)

[4] "'EOMSG'" (i.e. the left delimiter of the here document, including
any quotes),

[5] "This is the message.\n" (i.e. the text of the here document),

[6] "EOMSG" (i.e. the right delimiter of the here document),

[7..10]
"" (a here document has no second left delimiter, second text,
second right delimiter, or trailing modifiers).
658 However, the matching position of the input variable would be set to
659 "exit;" (i.e. *after* the closing delimiter of the here document), which
660 would cause the earlier " || die;\nexit;" to be skipped in any sequence
661 of code fragment extractions.
To avoid this problem, when it encounters a here document whilst
extracting from a modifiable string, "extract_quotelike" silently
rearranges the string to an equivalent piece of Perl:

<<'EOMSG'
This is the message.
EOMSG
|| die;
exit;

in which the here document *is* contiguous. It still leaves the matching
674 position after the here document, but now the rest of the line on which
675 the here document starts is not skipped.
To prevent "extract_quotelike" from mucking about with the input in this
way (this is the only case where a list-context "extract_quotelike" does
so), you can pass the input variable as an interpolated literal:
681 $quotelike = extract_quotelike("$var");
684 "extract_codeblock" attempts to recognize and extract a balanced bracket
685 delimited substring that may contain unbalanced brackets inside Perl
686 quotes or quotelike operations. That is, "extract_codeblock" is like a
687 combination of "extract_bracketed" and "extract_quotelike".
689 "extract_codeblock" takes the same initial three parameters as
690 "extract_bracketed": a text to process, a set of delimiter brackets to
691 look for, and a prefix to match first. It also takes an optional fourth
692 parameter, which allows the outermost delimiter brackets to be specified
693 separately (see below).
695 Omitting the first argument (input text) means process $_ instead.
696 Omitting the second argument (delimiter brackets) indicates that only
697 '{' is to be used. Omitting the third argument (prefix argument) implies
698 optional whitespace at the start. Omitting the fourth argument
699 (outermost delimiter brackets) indicates that the value of the second
700 argument is to be used for the outermost delimiters.
Once the prefix and the outermost opening delimiter bracket have been
703 recognized, code blocks are extracted by stepping through the input text
704 and trying the following alternatives in sequence:
706 1. Try and match a closing delimiter bracket. If the bracket was the
707 same species as the last opening bracket, return the substring to
708 that point. If the bracket was mismatched, return an error.
710 2. Try to match a quote or quotelike operator. If found, call
711 "extract_quotelike" to eat it. If "extract_quotelike" fails, return
712 the error it returned. Otherwise go back to step 1.
714 3. Try to match an opening delimiter bracket. If found, call
715 "extract_codeblock" recursively to eat the embedded block. If the
716 recursive call fails, return an error. Otherwise, go back to step 1.
718 4. Unconditionally match a bareword or any other single character, and
719 then go back to step 1.
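The effect of steps 1 to 4 can be seen by comparing "extract_codeblock" with "extract_bracketed" on a block whose string literal contains an unbalanced closing brace. This is a sketch on an invented input; the two copies keep the calls independent of each other's "pos".

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock extract_bracketed);

my $source = '{ print "unbalanced }"; } more';

# extract_codeblock hands the quoted string to the quotelike matcher,
# so the brace inside the string does not close the block.
my $copy1 = $source;
my ($block, $after_block) = extract_codeblock($copy1, '{}');

# extract_bracketed with only '{}' as delimiters treats the double quote
# as an ordinary character and stops at the first closing brace.
my $copy2 = $source;
my ($naive) = extract_bracketed($copy2, '{}');

print "codeblock: $block\n";
print "bracketed: $naive\n";
```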
# Find a while loop in the text

if ($text =~ s/.*?while\s*\{/{/)
{
$loop = "while " . extract_codeblock($text);
}
730 # Remove the first round-bracketed list (which may include
731 # round- or curly-bracketed code blocks or quotelike operators)
733 extract_codeblock $text, "(){}", '[^(]*';
735 The ability to specify a different outermost delimiter bracket is useful
736 in some circumstances. For example, in the Parse::RecDescent module,
737 parser actions which are to be performed only on a successful parse are
738 specified using a "<defer:...>" directive. For example:
740 sentence: subject verb object
741 <defer: {$::theVerb = $item{verb}} >
743 Parse::RecDescent uses "extract_codeblock($text, '{}<>')" to extract the
744 code within the "<defer:...>" directive, but there's a problem.
A deferred action like this:

<defer: {if ($count>10) {$count--}} >

will be incorrectly parsed as:

<defer: {if ($count>

because the "greater than" operator is interpreted as a closing delimiter.

But, by extracting the directive using
"extract_codeblock($text, '{}', undef, '<>')" the '>' character is only
treated as a delimiter at the outermost level of the code block, so the
directive is parsed correctly.
The "extract_multiple" subroutine takes a string to be processed and a
list of extractors (subroutines or regular expressions) to apply to that
string.

In a list context "extract_multiple" returns an array of substrings of
the original string, as extracted by the specified extractors. In a
scalar context, "extract_multiple" returns the first substring
successfully extracted from the original string. In both scalar and void
contexts the original string has the first successfully extracted
substring removed from it. In all contexts "extract_multiple" starts at
the current "pos" of the string, and sets that "pos" appropriately after
it matches.

Hence, the aim of a call to "extract_multiple" in a list context is
776 to split the processed string into as many non-overlapping fields as
777 possible, by repeatedly applying each of the specified extractors to the
778 remainder of the string. Thus "extract_multiple" is a generalized form
779 of Perl's "split" subroutine.
781 The subroutine takes up to four optional arguments:
783 1. A string to be processed ($_ if the string is omitted or "undef")
785 2. A reference to a list of subroutine references and/or qr// objects
786 and/or literal strings and/or hash references, specifying the
extractors to be used to split the string. If this argument is
omitted (or "undef") the list:

[
sub { extract_variable($_[0], '') },
sub { extract_quotelike($_[0],'') },
sub { extract_codeblock($_[0],'{}','') },
]

is used.

3. A number specifying the maximum number of fields to return. If this
argument is omitted (or "undef"), split continues as long as
possible.
802 If the third argument is *N*, then extraction continues until *N*
803 fields have been successfully extracted, or until the string has
804 been completely processed.
806 Note that in scalar and void contexts the value of this argument is
807 automatically reset to 1 (under "-w", a warning is issued if the
808 argument has to be reset).
810 4. A value indicating whether unmatched substrings (see below) within
811 the text should be skipped or returned as fields. If the value is
812 true, such substrings are skipped. Otherwise, they are returned.
The extraction process works by applying each extractor in sequence to
the text string.
817 If the extractor is a subroutine it is called in a list context and is
818 expected to return a list of a single element, namely the extracted
819 text. It may optionally also return two further arguments: a string
820 representing the text left after extraction (like $' for a pattern
821 match), and a string representing any prefix skipped before the
822 extraction (like $` in a pattern match). Note that this is designed to
823 facilitate the use of other Text::Balanced subroutines with
824 "extract_multiple". Note too that the value returned by an extractor
825 subroutine need not bear any relationship to the corresponding substring
826 of the original text (see examples below).
828 If the extractor is a precompiled regular expression or a string, it is
829 matched against the text in a scalar context with a leading '\G' and the
830 gc modifiers enabled. The extracted value is either $1 if that variable
831 is defined after the match, or else the complete match (i.e. $&).
833 If the extractor is a hash reference, it must contain exactly one
834 element. The value of that element is one of the above extractor types
835 (subroutine reference, regular expression, or string). The key of that
836 element is the name of a class into which the successful return value of
837 the extractor will be blessed.
839 If an extractor returns a defined value, that value is immediately
840 treated as the next extracted field and pushed onto the list of fields.
If the extractor was specified in a hash reference, the field is also
blessed into the appropriate class.
844 If the extractor fails to match (in the case of a regex extractor), or
845 returns an empty list or an undefined value (in the case of a subroutine
846 extractor), it is assumed to have failed to extract. If none of the
847 extractor subroutines succeeds, then one character is extracted from the
848 start of the text and the extraction subroutines reapplied. Characters
849 which are thus removed are accumulated and eventually become the next
field (unless the fourth argument is true, in which case they are
discarded).
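The process described above, including the blessing of fields produced by hash-reference extractors, can be sketched as follows. The class names "Delim" and "Brack" and the input string are illustrative; blessed fields are references to the extracted text, so they are dereferenced with "$$_" before printing.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited extract_bracketed);

my $text = q{say 'hi' {x} end};

my @fields = extract_multiple($text, [
    { Delim => sub { extract_delimited($_[0], q{'"}) } },
    { Brack => sub { extract_bracketed($_[0], '{}') } },
]);

# Fields from hash-reference extractors are blessed scalar references;
# everything else comes back as a plain (unblessed) string.
my @tagged = map { ref($_) . ': ' . $$_ } grep { ref } @fields;

print "$_\n" for @tagged;
```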
For example, the following extracts substrings that are valid Perl
variables:

@fields = extract_multiple($text,
[ sub { extract_variable($_[0]) } ],
undef, 1);

This example separates a text into fields which are quote delimited,
curly bracketed, and anything else. The delimited and bracketed parts
are also blessed to identify them (the "anything else" is unblessed):

    @fields = extract_multiple($text,
               [
                    { Delim => sub { extract_delimited($_[0],q{'"}) } },
                    { Brack => sub { extract_bracketed($_[0],'{}') } },
               ]);

This call extracts the next single substring that is a valid Perl
quotelike operator (and removes it from $text):

    $quotelike = extract_multiple($text,
                                  [
                                    sub { extract_quotelike($_[0]) },
                                  ], undef, 1);

Finally, here is yet another way to do comma-separated value parsing:

    @fields = extract_multiple($csv_text,
                              [
                                    sub { extract_delimited($_[0],q{'"}) },
                                    qr/[^,]+/,
                              ],
                              undef, 1);

The list in the second argument means: *"Try and extract a ' or "
delimited string, otherwise extract anything up to a comma..."*. The
undef third argument means: *"...as many times as possible..."*, and the
true value in the fourth argument means *"...discarding anything else
that appears (i.e. the commas)"*.

If you wanted the commas preserved as separate fields (i.e. like split
does if your split pattern has capturing parentheses), you would just
make the last parameter undefined (or remove it).
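
The effect of the fourth argument can be sketched side by side. The sample data and the "anything up to a comma" pattern are assumptions chosen for illustration. Two copies of the string are used because, in list context, "extract_multiple" leaves pos() of the string it was given at the point where it stopped.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited);

my @extractors = (
    sub { extract_delimited($_[0], q{'"}) },   # a quoted field
    qr/[^,]+/,                                 # or anything up to a comma
);

# Separate copies, so each call starts scanning from the beginning.
my $discarding = q{a,'b,c',d};
my $preserving = q{a,'b,c',d};

# Fourth argument true: the commas are thrown away.
my @values = extract_multiple($discarding, \@extractors, undef, 1);

# Fourth argument omitted: each comma comes back as its own field.
my @tokens = extract_multiple($preserving, \@extractors);

print scalar(@values), " values: @values\n";
print scalar(@tokens), " tokens: @tokens\n";
```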

"gen_delimited_pat"
The "gen_delimited_pat" subroutine takes a single (string) argument and
builds a Friedl-style optimized regex that matches a string delimited
by any one of the characters in the single argument. For example:

    gen_delimited_pat(q{'"})

returns the pattern:

    (?:\"(?:\\\"|(?!\").)*\"|\'(?:\\\'|(?!\').)*\')

Note that the specified delimiters are automatically quotemeta'd.
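
As a rough illustration of using the generated pattern directly, the sketch below interpolates it into a regex and collects every quoted string in a line. The input line is made up, and the exact text of the generated pattern may differ between releases, so the code treats it as opaque.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# Build a pattern for single- or double-quoted strings with the
# default backslash escape, then compile it once.
my $pat = gen_delimited_pat(q{'"});
my $re  = qr/$pat/;

# Made-up input containing an escaped quote inside a quoted string.
my $input = q{x = 'it\'s', y = "two"};

# In list context, /g returns every captured match.
my @strings = $input =~ /($re)/g;

print "$_\n" for @strings;
```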

A typical use of "gen_delimited_pat" would be to build special purpose
tags for "extract_tagged". For example, to properly ignore "empty" XML
elements (which might contain quoted strings):

    my $empty_tag = '<(' . gen_delimited_pat(q{'"}) . '|.)+/>';

    extract_tagged($text, undef, undef, undef, {ignore => [$empty_tag]} );

"gen_delimited_pat" may also be called with an optional second argument,
which specifies the "escape" character(s) to be used for each delimiter.
For example, to match a Pascal-style string (where ' is the delimiter and
'' is a literal ' within the string):

    gen_delimited_pat(q{'},q{'});
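
A brief sketch of the Pascal-style case in use. The source line being matched is invented for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# The delimiter and the escape are both a single quote, so a doubled
# quote inside the string stands for one literal quote.
my $pascal = gen_delimited_pat(q{'}, q{'});

# Made-up Pascal-like source line.
my $line = q{writeln('it''s here'); x := 1};

my ($string) = $line =~ /($pascal)/;

print "$string\n" if defined $string;
```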

Different escape characters can be specified for different delimiters.
For example, to specify that '/' is the escape for single quotes and '%'
is the escape for double quotes:

    gen_delimited_pat(q{'"},q{/%});

If more delimiters than escape chars are specified, the last escape char
is used for the remaining delimiters. If no escape char is specified for
a given specified delimiter, '\' is used.
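
The per-delimiter escapes and the "last escape is reused" rule can be sketched as follows. All three sample strings are invented for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# '/' escapes inside single quotes, '%' escapes inside double quotes.
my $mixed = gen_delimited_pat(q{'"}, q{/%});

my ($single) = q{say 'a/'b' now}     =~ /($mixed)/;
my ($double) = q{say "50%" off" now} =~ /($mixed)/;

# Only one escape given for two delimiters: '/' is reused for the
# double quote as well.
my $shared = gen_delimited_pat(q{'"}, q{/});

my ($reused) = q{say "x/"y" now} =~ /($shared)/;

print "$_\n" for grep { defined } $single, $double, $reused;
```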

"delimited_pat"
Note that "gen_delimited_pat" was previously called "delimited_pat".
That name may still be used, but is now deprecated.

DIAGNOSTICS
In a list context, all the functions return "(undef,$original_text)" on
failure. In a scalar context, failure is indicated by returning "undef"
(in this case the input text is not modified in any way).

In addition, on failure in *any* context, the $@ variable is set.
Accessing "$@->{error}" returns one of the error diagnostics listed
below. Accessing "$@->{pos}" returns the offset into the original string
at which the error was detected (although not necessarily where it
occurred!). Printing $@ directly produces the error message, with the
offset appended. On success, the $@ variable is guaranteed to be
undefined.
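
A small sketch of checking for failure in list context and reading the diagnostic. The input strings are invented. Copying $@ straight away is a defensive habit assumed here, since any later eval or extractor call may overwrite it.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# A text with no opening bracket, so extraction fails.
my $bad = 'no brackets here';
my ($extracted, $remainder) = extract_bracketed($bad, '()');
my $failure = $@;    # copy immediately, before anything else can reset $@

if (!defined $extracted) {
    print "message: $failure->{error}\n";
    print "offset:  $failure->{pos}\n";
    print "full:    $failure\n";      # stringifies to the message plus the offset
}

# A text that does start with a balanced bracketed substring.
my $good = '(a (nested) list) and the rest';
my ($match, $rest) = extract_bracketed($good, '()');
my $error_after_success = $@;

print "matched: $match\n" if defined $match;
```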

The available diagnostics are:

"Did not find a suitable bracket: "%s""
    The delimiter provided to "extract_bracketed" was not one of
    '(){}[]<>'.

"Did not find prefix: /%s/"
    A non-optional prefix was specified but wasn't found at the start of
    the text.

"Did not find opening bracket after prefix: "%s""
    "extract_bracketed" or "extract_codeblock" was expecting a
    particular kind of bracket at the start of the text, and didn't find
    it.

"No quotelike operator found after prefix: "%s""
    "extract_quotelike" didn't find one of the quotelike operators "q",
    "qq", "qw", "qx", "s", "tr" or "y" at the start of the substring it
    was extracting.

"Unmatched closing bracket: "%c""
    "extract_bracketed", "extract_quotelike" or "extract_codeblock"
    encountered a closing bracket where none was expected.

"Unmatched opening bracket(s): "%s""
    "extract_bracketed", "extract_quotelike" or "extract_codeblock" ran
    out of characters in the text before closing one or more levels of
    nesting.

"Unmatched embedded quote (%s)"
    "extract_bracketed" attempted to match an embedded quoted substring,
    but failed to find a closing quote to match it.

"Did not find closing delimiter to match '%s'"
    "extract_quotelike" was unable to find a closing delimiter to match
    the one that opened the quote-like operation.

"Mismatched closing bracket: expected "%c" but found "%s""
    "extract_bracketed", "extract_quotelike" or "extract_codeblock"
    found a valid bracket delimiter, but it was the wrong species. This
    usually indicates a nesting error, but may indicate incorrect
    quoting or escaping.

"No block delimiter found after quotelike "%s""
    "extract_quotelike" or "extract_codeblock" found one of the
    quotelike operators "q", "qq", "qw", "qx", "s", "tr" or "y" without
    a suitable block after it.

"Did not find leading dereferencer"
    "extract_variable" was expecting one of '$', '@', or '%' at the
    start of a variable, but didn't find any of them.

"Bad identifier after dereferencer"
    "extract_variable" found a '$', '@', or '%' indicating a variable,
    but that character was not followed by a legal Perl identifier.

"Did not find expected opening bracket at %s"
    "extract_codeblock" failed to find any of the outermost opening
    brackets that were specified.

"Improperly nested codeblock at %s"
    A nested code block was found that started with a delimiter that was
    specified as being only to be used as an outermost bracket.

"Missing second block for quotelike "%s""
    "extract_codeblock" or "extract_quotelike" found one of the
    quotelike operators "s", "tr" or "y" followed by only one block.

"No match found for opening bracket"
    "extract_codeblock" failed to find a closing bracket to match the
    outermost opening bracket.

"Did not find opening tag: /%s/"
    "extract_tagged" did not find a suitable opening tag (after any
    specified prefix was removed).

"Unable to construct closing tag to match: /%s/"
    "extract_tagged" matched the specified opening tag and tried to
    modify the matched text to produce a matching closing tag (because
    none was specified). It failed to generate the closing tag, almost
    certainly because the opening tag did not start with a bracket of
    some kind.

"Found invalid nested tag: %s"
    "extract_tagged" found a nested tag that appeared in the "reject"
    list (and the failure mode was not "MAX" or "PARA").

"Found unbalanced nested tag: %s"
    "extract_tagged" found a nested opening tag that was not matched by
    a corresponding nested closing tag (and the failure mode was not
    "MAX" or "PARA").

"Did not find closing tag"
    "extract_tagged" reached the end of the text without finding a
    closing tag to match the original opening tag (and the failure mode
    was not "MAX" or "PARA").

AUTHOR
Damian Conway (damian@conway.org)

BUGS AND IRRITATIONS
There are undoubtedly serious bugs lurking somewhere in this code, if
only because parts of it give the impression of understanding a great
deal more about Perl than they really do.

Bug reports and other feedback are most welcome.

COPYRIGHT
Copyright 1997 - 2001 Damian Conway. All Rights Reserved.

Some (minor) parts copyright 2009 Adam Kennedy.

This module is free software. It may be used, redistributed and/or
modified under the same terms as Perl itself.