3 Text::Balanced - Extract delimited text sequences from strings.
8 use Text::Balanced qw (
21 # Extract the initial substring of $text that is delimited by
22 # two (unescaped) instances of the first character in $delim.
24 ($extracted, $remainder) = extract_delimited($text,$delim);
27 # Extract the initial substring of $text that is bracketed
28 # with a delimiter(s) specified by $delim (where the string
29 # in $delim contains one or more of '(){}[]<>').
31 ($extracted, $remainder) = extract_bracketed($text,$delim);
34 # Extract the initial substring of $text that is bounded by
37 ($extracted, $remainder) = extract_tagged($text);
40 # Extract the initial substring of $text that is bounded by
41 # a C<BEGIN>...C<END> pair. Don't allow nested C<BEGIN> tags
43 ($extracted, $remainder) =
extract_tagged($text,"BEGIN","END",undef,{reject=>["BEGIN"]});
47 # Extract the initial substring of $text that represents a
48 # Perl "quote or quote-like operation"
50 ($extracted, $remainder) = extract_quotelike($text);
53 # Extract the initial substring of $text that represents a block
54 # of Perl code, bracketed by any of character(s) specified by $delim
55 # (where the string $delim contains one or more of '(){}[]<>').
57 ($extracted, $remainder) = extract_codeblock($text,$delim);
60 # Extract the initial substrings of $text that would be extracted by
61 # one or more sequential applications of the specified functions
62 # or regular expressions
64 @extracted = extract_multiple($text,
65 [ \&extract_bracketed,
67 \&some_other_extractor_sub,
72 # Create a string representing an optimized pattern (a la Friedl)
73 # that matches a substring delimited by any of the specified characters
74 # (in this case: any type of quote or a slash)
76 $patstring = gen_delimited_pat(q{'"`/});
79 # Generate a reference to an anonymous sub that is just like extract_tagged
80 # but pre-compiled and optimized for a specific pair of tags, and consequently
81 # much faster (i.e. 3 times faster). It uses qr// for better performance on
82 # repeated calls, so it only works under Perl 5.005 or later.
84 $extract_head = gen_extract_tagged('<HEAD>','</HEAD>');
86 ($extracted, $remainder) = $extract_head->($text);
91 The various C<extract_...> subroutines may be used to extract a
92 delimited string (possibly after skipping a specified prefix string).
93 The search for the string always begins at the current C<pos>
94 location of the string's variable (or at index zero, if no C<pos>
97 =head2 General behaviour in list contexts
99 In a list context, all the subroutines return a list, the first three
100 elements of which are always:
106 The extracted string, including the specified delimiters.
107 If the extraction fails an empty string is returned.
111 The remainder of the input string (i.e. the characters after the
112 extracted string). On failure, the entire string is returned.
116 The skipped prefix (i.e. the characters before the extracted string).
117 On failure, the empty string is returned.
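As a rough illustration of the three leading elements described above, the following sketch calls C<extract_delimited> in a list context. The sample string and the variable names are invented for the example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Sample input (an assumption for this sketch): leading whitespace,
# a single-quoted string containing an escaped quote, then trailing text.
my $text = q{  'it\'s here' and the rest};

my ($extracted, $remainder, $prefix) = extract_delimited($text, q{'});

print "extracted: [$extracted]\n";   # the quoted string, delimiters included
print "remainder: [$remainder]\n";   # everything after the closing quote
print "prefix:    [$prefix]\n";      # the skipped leading whitespace
print "original:  [$text]\n";        # list context leaves $text itself alone
```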
121 Note that in a list context, the contents of the original input text (the first
122 argument) are not modified in any way.
124 However, if the input text was passed in a variable, that variable's
125 C<pos> value is updated to point at the first character after the
126 extracted text. That means that in a list context the various
127 subroutines can be used much like regular expressions. For example:
129 while ( $next = (extract_quotelike($text))[0] )
131 # process next quote-like (in $next)
135 =head2 General behaviour in scalar and void contexts
137 In a scalar context, the extracted string is returned, having first been
138 removed from the input text. Thus, the following code also processes
139 each quote-like operation, but actually removes them from $text:
141 while ( $next = extract_quotelike($text) )
143 # process next quote-like (in $next)
146 Note that if the input text is a read-only string (i.e. a literal),
147 no attempt is made to remove the extracted text.
149 In a void context the behaviour of the extraction subroutines is
150 exactly the same as in a scalar context, except (of course) that the
151 extracted substring is not returned.
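The scalar-context behaviour can be sketched with C<extract_delimited> (the same idea applies to the other extractors). The input string here is made up for the example. Each successful call removes the extracted text, and any skipped prefix, from the variable.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Two double-quoted strings followed by unquoted text.
my $text = q{"first" "second" tail};

my @seen;
# Scalar context: each call returns one string and shortens $text.
while ( my $next = extract_delimited($text, q{"}) ) {
    push @seen, $next;
}

print "seen: @seen\n";     # the two quoted strings, in order
print "left: [$text]\n";   # what could not be extracted stays behind
```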
153 =head2 A note about prefixes
Prefix patterns are matched without any trailing modifiers (C</gimsox> etc.).
156 This can bite you if you're expecting a prefix specification like
157 '.*?(?=<H1>)' to skip everything up to the first <H1> tag. Such a prefix
158 pattern will only succeed if the <H1> tag is on the current line, since
159 . normally doesn't match newlines.
161 To overcome this limitation, you need to turn on /s matching within
162 the prefix pattern, using the C<(?s)> directive: '(?s).*?(?=<H1>)'
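A minimal sketch of the difference, using C<extract_bracketed> and an invented two-line input. Separate copies of the input are used so that the C<pos> left by one call cannot affect the other.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $source = "line one\nline two {block} rest";

# Work on separate copies so the two calls are independent.
my $plain  = $source;
my $dotall = $source;

# Without (?s) the prefix cannot cross the newline, so the call fails.
my @without_s = extract_bracketed($plain,  '{}', '.*?(?=\{)');

# With (?s) the prefix skips both lines up to the opening brace.
my @with_s    = extract_bracketed($dotall, '{}', '(?s).*?(?=\{)');

print "without (?s): ", ($without_s[0] ? "matched" : "failed"), "\n";
print "with (?s):    [$with_s[0]]\n";
```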
165 =head2 C<extract_delimited>
167 The C<extract_delimited> function formalizes the common idiom
168 of extracting a single-character-delimited substring from the start of
169 a string. For example, to extract a single-quote delimited string, the
170 following code is typically used:
172 ($remainder = $text) =~ s/\A('(\\.|[^'])*')//s;
175 but with C<extract_delimited> it can be simplified to:
177 ($extracted,$remainder) = extract_delimited($text, "'");
179 C<extract_delimited> takes up to four scalars (the input text, the
180 delimiters, a prefix pattern to be skipped, and any escape characters)
181 and extracts the initial substring of the text that
182 is appropriately delimited. If the delimiter string has multiple
183 characters, the first one encountered in the text is taken to delimit
185 The third argument specifies a prefix pattern that is to be skipped
186 (but must be present!) before the substring is extracted.
187 The final argument specifies the escape character to be used for each
190 All arguments are optional. If the escape characters are not specified,
191 every delimiter is escaped with a backslash (C<\>).
192 If the prefix is not specified, the
193 pattern C<'\s*'> - optional whitespace - is used. If the delimiter set
194 is also not specified, the set C</["'`]/> is used. If the text to be processed
195 is not specified either, C<$_> is used.
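The defaults can be seen by calling C<extract_delimited> with no arguments at all. This is only a sketch. The contents of C<$_> are invented for the example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# With no arguments the text comes from $_, the delimiter set is
# any of ' " `, the prefix is optional whitespace, and the escape
# character is a backslash.
local $_ = q{   "quoted text" tail};

my ($extracted, $remainder, $prefix) = extract_delimited();

print "extracted: [$extracted]\n";
print "remainder: [$remainder]\n";
print "prefix:    [$prefix]\n";
```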
In list context, C<extract_delimited> returns an array of three
198 elements, the extracted substring (I<including the surrounding
199 delimiters>), the remainder of the text, and the skipped prefix (if
200 any). If a suitable delimited substring is not found, the first
201 element of the array is the empty string, the second is the complete
202 original text, and the prefix returned in the third element is an
205 In a scalar context, just the extracted substring is returned. In
206 a void context, the extracted substring (and any prefix) are simply
207 removed from the beginning of the first argument.
211 # Remove a single-quoted substring from the very beginning of $text:
213 $substring = extract_delimited($text, "'", '');
215 # Remove a single-quoted Pascalish substring (i.e. one in which
216 # doubling the quote character escapes it) from the very
217 # beginning of $text:
219 $substring = extract_delimited($text, "'", '', "'");
221 # Extract a single- or double- quoted substring from the
222 # beginning of $text, optionally after some whitespace
223 # (note the list context to protect $text from modification):
225 ($substring) = extract_delimited $text, q{"'};
228 # Delete the substring delimited by the first '/' in $text:
$text = join '', (extract_delimited($text,'/','[^/]*'))[2,1];
232 Note that this last example is I<not> the same as deleting the first
233 quote-like pattern. For instance, if C<$text> contained the string:
235 "if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }"
237 then after the deletion it would contain:
239 "if ('.$UNIXCMD/s) { $cmd = $1; }"
243 "if ('./cmd' =~ ms) { $cmd = $1; }"
246 See L<"extract_quotelike"> for a (partial) solution to this problem.
249 =head2 C<extract_bracketed>
251 Like C<"extract_delimited">, the C<extract_bracketed> function takes
252 up to three optional scalar arguments: a string to extract from, a delimiter
253 specifier, and a prefix pattern. As before, a missing prefix defaults to
254 optional whitespace and a missing text defaults to C<$_>. However, a missing
255 delimiter specifier defaults to C<'{}()[]E<lt>E<gt>'> (see below).
257 C<extract_bracketed> extracts a balanced-bracket-delimited
258 substring (using any one (or more) of the user-specified delimiter
259 brackets: '(..)', '{..}', '[..]', or '<..>'). Optionally it will also
260 respect quoted unbalanced brackets (see below).
A "delimiter bracket" is a bracket in the list of delimiters passed as
263 C<extract_bracketed>'s second argument. Delimiter brackets are
264 specified by giving either the left or right (or both!) versions
265 of the required bracket(s). Note that the order in which
266 two or more delimiter brackets are specified is not significant.
268 A "balanced-bracket-delimited substring" is a substring bounded by
269 matched brackets, such that any other (left or right) delimiter
270 bracket I<within> the substring is also matched by an opposite
271 (right or left) delimiter bracket I<at the same level of nesting>. Any
272 type of bracket not in the delimiter list is treated as an ordinary
275 In other words, each type of bracket specified as a delimiter must be
276 balanced and correctly nested within the substring, and any other kind of
277 ("non-delimiter") bracket in the substring is ignored.
279 For example, given the string:
281 $text = "{ an '[irregularly :-(] {} parenthesized >:-)' string }";
283 then a call to C<extract_bracketed> in a list context:
285 @result = extract_bracketed( $text, '{}' );
289 ( "{ an '[irregularly :-(] {} parenthesized >:-)' string }" , "" , "" )
291 since both sets of C<'{..}'> brackets are properly nested and evenly balanced.
292 (In a scalar context just the first element of the array would be returned. In
293 a void context, C<$text> would be replaced by an empty string.)
295 Likewise the call in:
297 @result = extract_bracketed( $text, '{[' );
299 would return the same result, since all sets of both types of specified
300 delimiter brackets are correctly nested and balanced.
302 However, the call in:
304 @result = extract_bracketed( $text, '{([<' );
306 would fail, returning:
308 ( undef , "{ an '[irregularly :-(] {} parenthesized >:-)' string }" );
310 because the embedded pairs of C<'(..)'>s and C<'[..]'>s are "cross-nested" and
311 the embedded C<'E<gt>'> is unbalanced. (In a scalar context, this call would
312 return an empty string. In a void context, C<$text> would be unchanged.)
314 Note that the embedded single-quotes in the string don't help in this
315 case, since they have not been specified as acceptable delimiters and are
316 therefore treated as non-delimiter characters (and ignored).
318 However, if a particular species of quote character is included in the
319 delimiter specification, then that type of quote will be correctly handled.
For example, if C<$text> is:
322 $text = '<A HREF=">>>>">link</A>';
326 @result = extract_bracketed( $text, '<">' );
330 ( '<A HREF=">>>>">', 'link</A>', "" )
332 as expected. Without the specification of C<"> as an embedded quoter:
334 @result = extract_bracketed( $text, '<>' );
338 ( '<A HREF=">', '>>>">link</A>', "" )
340 In addition to the quote delimiters C<'>, C<">, and C<`>, full Perl quote-like
341 quoting (i.e. q{string}, qq{string}, etc) can be specified by including the
342 letter 'q' as a delimiter. Hence:
344 @result = extract_bracketed( $text, '<q>' );
346 would correctly match something like this:
348 $text = '<leftop: conj /and/ conj>';
350 See also: C<"extract_quotelike"> and C<"extract_codeblock">.
353 =head2 C<extract_tagged>
355 C<extract_tagged> extracts and segments text between (balanced)
358 The subroutine takes up to five optional arguments:
364 A string to be processed (C<$_> if the string is omitted or C<undef>)
368 A string specifying a pattern to be matched as the opening tag.
369 If the pattern string is omitted (or C<undef>) then a pattern
370 that matches any standard HTML/XML tag is used.
374 A string specifying a pattern to be matched at the closing tag.
375 If the pattern string is omitted (or C<undef>) then the closing
376 tag is constructed by inserting a C</> after any leading bracket
377 characters in the actual opening tag that was matched (I<not> the pattern
378 that matched the tag). For example, if the opening tag pattern
379 is specified as C<'{{\w+}}'> and actually matched the opening tag
380 C<"{{DATA}}">, then the constructed closing tag would be C<"{{/DATA}}">.
384 A string specifying a pattern to be matched as a prefix (which is to be
385 skipped). If omitted, optional whitespace is skipped.
389 A hash reference containing various parsing options (see below)
393 The various options that can be specified are:
397 =item C<reject =E<gt> $listref>
399 The list reference contains one or more strings specifying patterns
400 that must I<not> appear within the tagged text.
402 For example, to extract
403 an HTML link (which should not contain nested links) use:
405 extract_tagged($text, '<A>', '</A>', undef, {reject => ['<A>']} );
407 =item C<ignore =E<gt> $listref>
409 The list reference contains one or more strings specifying patterns
that are I<not> to be treated as nested tags within the tagged text
411 (even if they would match the start tag pattern).
413 For example, to extract an arbitrary XML tag, but ignore "empty" elements:
415 extract_tagged($text, undef, undef, undef, {ignore => ['<[^>]*/>']} );
417 (also see L<"gen_delimited_pat"> below).
420 =item C<fail =E<gt> $str>
422 The C<fail> option indicates the action to be taken if a matching end
423 tag is not encountered (i.e. before the end of the string or some
424 C<reject> pattern matches). By default, a failure to match a closing
425 tag causes C<extract_tagged> to immediately fail.
However, if the string value associated with C<fail> is "MAX", then
428 C<extract_tagged> returns the complete text up to the point of failure.
429 If the string is "PARA", C<extract_tagged> returns only the first paragraph
430 after the tag (up to the first line that is either empty or contains
431 only whitespace characters).
If the string is "", the default behaviour (i.e. failure) is reinstated.
434 For example, suppose the start tag "/para" introduces a paragraph, which then
435 continues until the next "/endpara" tag or until another "/para" tag is
438 $text = "/para line 1\n\nline 3\n/para line 4";
440 extract_tagged($text, '/para', '/endpara', undef,
{reject => '/para', fail => 'MAX'} );
443 # EXTRACTED: "/para line 1\n\nline 3\n"
445 Suppose instead, that if no matching "/endpara" tag is found, the "/para"
446 tag refers only to the immediately following paragraph:
448 $text = "/para line 1\n\nline 3\n/para line 4";
450 extract_tagged($text, '/para', '/endpara', undef,
{reject => '/para', fail => 'PARA'} );
453 # EXTRACTED: "/para line 1\n"
455 Note that the specified C<fail> behaviour applies to nested tags as well.
459 On success in a list context, an array of 6 elements is returned. The elements are:
465 the extracted tagged substring (including the outermost tags),
469 the remainder of the input text,
473 the prefix substring (if any),
the opening tag,
481 the text between the opening and closing tags
485 the closing tag (or "" if no closing tag was found)
489 On failure, all of these values (except the remaining text) are C<undef>.
491 In a scalar context, C<extract_tagged> returns just the complete
492 substring that matched a tagged text (including the start and end
493 tags). C<undef> is returned on failure. In addition, the original input
494 text has the returned substring (and any prefix) removed from it.
496 In a void context, the input text just has the matched substring (and
497 any specified prefix) removed.
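The six-element list can be sketched with a small nested example. The input string is invented. No tag patterns are given, so the default opening-tag pattern and the constructed closing tag described above are used.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# A tagged region containing a nested tag of the same kind.
my $text = "<b>bold <b>nested</b> text</b> tail";

my ($whole, $remainder, $prefix, $open, $inner, $close)
    = extract_tagged($text);

print "whole:     [$whole]\n";       # includes the outermost tags
print "remainder: [$remainder]\n";
print "open tag:  [$open]\n";
print "inner:     [$inner]\n";       # the nested tags stay in the inner text
print "close tag: [$close]\n";
```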
500 =head2 C<gen_extract_tagged>
(Note: This subroutine is only available under Perl 5.005 or later)
504 C<gen_extract_tagged> generates a new anonymous subroutine which
505 extracts text between (balanced) specified tags. In other words,
it generates a function identical in behaviour to C<extract_tagged>.
508 The difference between C<extract_tagged> and the anonymous
509 subroutines generated by
510 C<gen_extract_tagged>, is that those generated subroutines:
516 do not have to reparse tag specification or parsing options every time
517 they are called (whereas C<extract_tagged> has to effectively rebuild
518 its tag parser on every call);
522 make use of the new qr// construct to pre-compile the regexes they use
523 (whereas C<extract_tagged> uses standard string variable interpolation
524 to create tag-matching patterns).
528 The subroutine takes up to four optional arguments (the same set as
529 C<extract_tagged> except for the string to be processed). It returns
530 a reference to a subroutine which in turn takes a single argument (the text to
533 In other words, the implementation of C<extract_tagged> is exactly
539 $extractor = gen_extract_tagged(@_);
540 return $extractor->($text);
543 (although C<extract_tagged> is not currently implemented that way, in order
544 to preserve pre-5.005 compatibility).
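A short sketch of building one extractor and reusing it on several inputs. The C<E<lt>itemE<gt>> tags and the input strings are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_extract_tagged);

# Build the extractor once for a fixed pair of tags...
my $extract_item = gen_extract_tagged('<item>', '</item>');

# ...then apply it to as many strings as needed.
my @found;
for my $input ('<item>one</item> rest', '<item>two</item>') {
    my $text = $input;    # work on a copy of the loop value
    my ($whole, $rest, $prefix, $open, $inner, $close)
        = $extract_item->($text);
    push @found, $inner if defined $inner;
}

print "found: @found\n";
```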
546 Using C<gen_extract_tagged> to create extraction functions for specific tags
547 is a good idea if those functions are going to be called more than once, since
548 their performance is typically twice as good as the more general-purpose
552 =head2 C<extract_quotelike>
554 C<extract_quotelike> attempts to recognize, extract, and segment any
555 one of the various Perl quotes and quotelike operators (see
L<perlop(3)>). Nested backslashed delimiters, embedded balanced bracket
557 delimiters (for the quotelike operators), and trailing modifiers are
558 all caught. For example, in:
560 extract_quotelike 'q # an octothorpe: \# (not the end of the q!) #'
562 extract_quotelike ' "You said, \"Use sed\"." '
564 extract_quotelike ' s{([A-Z]{1,8}\.[A-Z]{3})} /\L$1\E/; '
566 extract_quotelike ' tr/\\\/\\\\/\\\//ds; '
568 the full Perl quotelike operations are all extracted correctly.
570 Note too that, when using the /x modifier on a regex, any comment
571 containing the current pattern delimiter will cause the regex to be
572 immediately terminated. In other words:
575 (?i) # CASE INSENSITIVE
576 [a-z_] # LEADING ALPHABETIC/UNDERSCORE
577 [a-z0-9]* # FOLLOWED BY ANY NUMBER OF ALPHANUMERICS
580 will be extracted as if it were:
583 (?i) # CASE INSENSITIVE
584 [a-z_] # LEADING ALPHABETIC/'
586 This behaviour is identical to that of the actual compiler.
588 C<extract_quotelike> takes two arguments: the text to be processed and
589 a prefix to be matched at the very beginning of the text. If no prefix
590 is specified, optional whitespace is the default. If no text is given,
593 In a list context, an array of 11 elements is returned. The elements are:
599 the extracted quotelike substring (including trailing modifiers),
603 the remainder of the input text,
607 the prefix substring (if any),
611 the name of the quotelike operator (if any),
615 the left delimiter of the first block of the operation,
619 the text of the first block of the operation
620 (that is, the contents of
621 a quote, the regex of a match or substitution or the target list of a
626 the right delimiter of the first block of the operation,
630 the left delimiter of the second block of the operation
631 (that is, if it is a C<s>, C<tr>, or C<y>),
635 the text of the second block of the operation
636 (that is, the replacement of a substitution or the translation list
641 the right delimiter of the second block of the operation (if any),
645 the trailing modifiers on the operation (if any).
649 For each of the fields marked "(if any)" the default value on success is
651 On failure, all of these values (except the remaining text) are C<undef>.
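As an illustration of some of these fields, the sketch below picks selected elements out of the returned list for a match and for a bracketed substitution. The two input strings are invented, and only a few of the eleven elements are shown.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

# A match operation followed by unrelated text.
my $match_op = q{m/ab+c/i; rest};
my @m = extract_quotelike($match_op);
print "whole: $m[0], remainder: [$m[1]]\n";
print "operator: $m[3], pattern: $m[5], modifiers: $m[10]\n";

# A substitution written with bracketing delimiters.
my $subst_op = q{s{foo}{bar}gi};
my @s = extract_quotelike($subst_op);
print "operator: $s[3], first block: $s[5], ",
      "second block: $s[8], modifiers: $s[10]\n";
```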
654 In a scalar context, C<extract_quotelike> returns just the complete substring
655 that matched a quotelike operation (or C<undef> on failure). In a scalar or
656 void context, the input text has the same substring (and any specified
661 # Remove the first quotelike literal that appears in text
663 $quotelike = extract_quotelike($text,'.*?');
665 # Replace one or more leading whitespace-separated quotelike
666 # literals in $_ with "<QLL>"
668 do { $_ = join '<QLL>', (extract_quotelike)[2,1] } until $@;
671 # Isolate the search pattern in a quotelike operation from $text
673 ($op,$pat) = (extract_quotelike $text)[3,5];
676 print "search pattern: $pat\n";
680 print "$op is not a pattern matching operation\n";
684 =head2 C<extract_quotelike> and "here documents"
686 C<extract_quotelike> can successfully extract "here documents" from an input
687 string, but with an important caveat in list contexts.
689 Unlike other types of quote-like literals, a here document is rarely
690 a contiguous substring. For example, a typical piece of code using
691 here document might look like this:
698 Given this as an input string in a scalar context, C<extract_quotelike>
699 would correctly return the string "<<'EOMSG'\nThis is the message.\nEOMSG",
700 leaving the string " || die;\nexit;" in the original variable. In other words,
701 the two separate pieces of the here document are successfully extracted and
704 In a list context, C<extract_quotelike> would return the list
710 "<<'EOMSG'\nThis is the message.\nEOMSG\n" (i.e. the full extracted here document,
711 including fore and aft delimiters),
715 " || die;\nexit;" (i.e. the remainder of the input text, concatenated),
719 "" (i.e. the prefix substring -- trivial in this case),
723 "<<" (i.e. the "name" of the quotelike operator)
727 "'EOMSG'" (i.e. the left delimiter of the here document, including any quotes),
731 "This is the message.\n" (i.e. the text of the here document),
735 "EOMSG" (i.e. the right delimiter of the here document),
739 "" (a here document has no second left delimiter, second text, second right
740 delimiter, or trailing modifiers).
744 However, the matching position of the input variable would be set to
745 "exit;" (i.e. I<after> the closing delimiter of the here document),
746 which would cause the earlier " || die;\nexit;" to be skipped in any
747 sequence of code fragment extractions.
749 To avoid this problem, when it encounters a here document whilst
750 extracting from a modifiable string, C<extract_quotelike> silently
751 rearranges the string to an equivalent piece of Perl:
759 in which the here document I<is> contiguous. It still leaves the
760 matching position after the here document, but now the rest of the line
761 on which the here document starts is not skipped.
To prevent C<extract_quotelike> from mucking about with the input in this way
764 (this is the only case where a list-context C<extract_quotelike> does so),
765 you can pass the input variable as an interpolated literal:
767 $quotelike = extract_quotelike("$var");
770 =head2 C<extract_codeblock>
772 C<extract_codeblock> attempts to recognize and extract a balanced
773 bracket delimited substring that may contain unbalanced brackets
774 inside Perl quotes or quotelike operations. That is, C<extract_codeblock>
775 is like a combination of C<"extract_bracketed"> and
776 C<"extract_quotelike">.
778 C<extract_codeblock> takes the same initial three parameters as C<extract_bracketed>:
779 a text to process, a set of delimiter brackets to look for, and a prefix to
780 match first. It also takes an optional fourth parameter, which allows the
781 outermost delimiter brackets to be specified separately (see below).
783 Omitting the first argument (input text) means process C<$_> instead.
784 Omitting the second argument (delimiter brackets) indicates that only C<'{'> is to be used.
785 Omitting the third argument (prefix argument) implies optional whitespace at the start.
786 Omitting the fourth argument (outermost delimiter brackets) indicates that the
787 value of the second argument is to be used for the outermost delimiters.
Once the prefix and the outermost opening delimiter bracket have been
790 recognized, code blocks are extracted by stepping through the input text and
791 trying the following alternatives in sequence:
797 Try and match a closing delimiter bracket. If the bracket was the same
798 species as the last opening bracket, return the substring to that
799 point. If the bracket was mismatched, return an error.
803 Try to match a quote or quotelike operator. If found, call
804 C<extract_quotelike> to eat it. If C<extract_quotelike> fails, return
805 the error it returned. Otherwise go back to step 1.
809 Try to match an opening delimiter bracket. If found, call
810 C<extract_codeblock> recursively to eat the embedded block. If the
811 recursive call fails, return an error. Otherwise, go back to step 1.
815 Unconditionally match a bareword or any other single character, and
816 then go back to step 1.
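The effect of these steps can be sketched by comparing C<extract_codeblock> with C<extract_bracketed> on a block whose quoted string contains an unbalanced bracket. The input is invented for the example, and separate copies are used so that the C<pos> set by one call does not influence the other.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock extract_bracketed);

# A block containing a quoted, unbalanced closing brace.
my $source = q[{ print "}"; { nested } } trailing];

my $for_codeblock = $source;
my $for_bracketed = $source;

# extract_codeblock steps over the quoted "}" as a quotelike.
my ($block, $rest) = extract_codeblock($for_codeblock, '{}');

# extract_bracketed, given only '{}', treats the quoted brace
# as a real closing bracket and stops early.
my ($naive) = extract_bracketed($for_bracketed, '{}');

print "codeblock: [$block]\n";
print "remainder: [$rest]\n";
print "bracketed: [$naive]\n";
```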
823 # Find a while loop in the text
825 if ($text =~ s/.*?while\s*\{/{/)
827 $loop = "while " . extract_codeblock($text);
830 # Remove the first round-bracketed list (which may include
831 # round- or curly-bracketed code blocks or quotelike operators)
833 extract_codeblock $text, "(){}", '[^(]*';
836 The ability to specify a different outermost delimiter bracket is useful
837 in some circumstances. For example, in the Parse::RecDescent module,
838 parser actions which are to be performed only on a successful parse
839 are specified using a C<E<lt>defer:...E<gt>> directive. For example:
841 sentence: subject verb object
842 <defer: {$::theVerb = $item{verb}} >
844 Parse::RecDescent uses C<extract_codeblock($text, '{}E<lt>E<gt>')> to extract the code
845 within the C<E<lt>defer:...E<gt>> directive, but there's a problem.
847 A deferred action like this:
849 <defer: {if ($count>10) {$count--}} >
851 will be incorrectly parsed as:
because the "greater than" operator is interpreted as a closing delimiter.
857 But, by extracting the directive using
858 S<C<extract_codeblock($text, '{}', undef, 'E<lt>E<gt>')>>
the '>' character is only treated as a delimiter at the outermost
860 level of the code block, so the directive is parsed correctly.
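A minimal sketch of that call, applied to the deferred action shown earlier. The directive text is taken from the example above. The variable names are assumptions made for the sketch.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# The deferred action from the text, held as a plain string.
my $directive = q{<defer: {if ($count>10) {$count--}} >};

# Inner delimiters are '{}'; '<>' is recognized only at the outermost level.
my ($extracted, $remainder)
    = extract_codeblock($directive, '{}', undef, '<>');

print "extracted: [$extracted]\n";
print "remainder: [$remainder]\n";
```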
862 =head2 C<extract_multiple>
864 The C<extract_multiple> subroutine takes a string to be processed and a
865 list of extractors (subroutines or regular expressions) to apply to that string.
In a list context C<extract_multiple> returns an array of substrings
868 of the original string, as extracted by the specified extractors.
869 In a scalar context, C<extract_multiple> returns the first
870 substring successfully extracted from the original string. In both
871 scalar and void contexts the original string has the first successfully
872 extracted substring removed from it. In all contexts
873 C<extract_multiple> starts at the current C<pos> of the string, and
874 sets that C<pos> appropriately after it matches.
Hence, the aim of a call to C<extract_multiple> in a list context
877 is to split the processed string into as many non-overlapping fields as
878 possible, by repeatedly applying each of the specified extractors
879 to the remainder of the string. Thus C<extract_multiple> is
880 a generalized form of Perl's C<split> subroutine.
882 The subroutine takes up to four optional arguments:
888 A string to be processed (C<$_> if the string is omitted or C<undef>)
892 A reference to a list of subroutine references and/or qr// objects and/or
893 literal strings and/or hash references, specifying the extractors
894 to be used to split the string. If this argument is omitted (or
898 sub { extract_variable($_[0], '') },
899 sub { extract_quotelike($_[0],'') },
900 sub { extract_codeblock($_[0],'{}','') },
A number specifying the maximum number of fields to return. If this
909 argument is omitted (or C<undef>), split continues as long as possible.
911 If the third argument is I<N>, then extraction continues until I<N> fields
912 have been successfully extracted, or until the string has been completely
915 Note that in scalar and void contexts the value of this argument is
916 automatically reset to 1 (under C<-w>, a warning is issued if the argument
921 A value indicating whether unmatched substrings (see below) within the
922 text should be skipped or returned as fields. If the value is true,
923 such substrings are skipped. Otherwise, they are returned.
927 The extraction process works by applying each extractor in
928 sequence to the text string. If the extractor is a subroutine it
930 context and is expected to return a list of a single element, namely
932 Note that the value returned by an extractor subroutine need not bear any
933 relationship to the corresponding substring of the original text (see
936 If the extractor is a precompiled regular expression or a string,
937 it is matched against the text in a scalar context with a leading
938 '\G' and the gc modifiers enabled. The extracted value is either
939 $1 if that variable is defined after the match, or else the
940 complete match (i.e. $&).
942 If the extractor is a hash reference, it must contain exactly one element.
943 The value of that element is one of the
944 above extractor types (subroutine reference, regular expression, or string).
945 The key of that element is the name of a class into which the successful
946 return value of the extractor will be blessed.
948 If an extractor returns a defined value, that value is immediately
949 treated as the next extracted field and pushed onto the list of fields.
950 If the extractor was specified in a hash reference, the field is also
blessed into the appropriate class.
953 If the extractor fails to match (in the case of a regex extractor), or returns an empty list or an undefined value (in the case of a subroutine extractor), it is
954 assumed to have failed to extract.
955 If none of the extractor subroutines succeeds, then one
956 character is extracted from the start of the text and the extraction
957 subroutines reapplied. Characters which are thus removed are accumulated and
958 eventually become the next field (unless the fourth argument is true, in which
case they are discarded).
961 For example, the following extracts substrings that are valid Perl variables:
963 @fields = extract_multiple($text,
964 [ sub { extract_variable($_[0]) } ],
967 This example separates a text into fields which are quote delimited,
968 curly bracketed, and anything else. The delimited and bracketed
969 parts are also blessed to identify them (the "anything else" is unblessed):
971 @fields = extract_multiple($text,
973 { Delim => sub { extract_delimited($_[0],q{'"}) } },
974 { Brack => sub { extract_bracketed($_[0],'{}') } },
977 This call extracts the next single substring that is a valid Perl quotelike
978 operator (and removes it from $text):
980 $quotelike = extract_multiple($text,
982 sub { extract_quotelike($_[0]) },
985 Finally, here is yet another way to do comma-separated value parsing:
987 @fields = extract_multiple($csv_text,
989 sub { extract_delimited($_[0],q{'"}) },
994 The list in the second argument means:
995 I<"Try and extract a ' or " delimited string, otherwise extract anything up to a comma...">.
996 The undef third argument means:
997 I<"...as many times as possible...">,
998 and the true value in the fourth argument means
999 I<"...discarding anything else that appears (i.e. the commas)">.
1001 If you wanted the commas preserved as separate fields (i.e. like split
1002 does if your split pattern has capturing parentheses), you would
1003 just make the last parameter undefined (or remove it).
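The following self-contained sketch shows hash-reference extractors and the handling of unmatched text together. The input string and the class names C<Delim> and C<Brack> are chosen for illustration. The sketch assumes that a blessed field is a reference to the extracted string, which is how current releases of the module appear to behave.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited extract_bracketed);

my $text = q{say 'hi there' and {do this} now};

# Each hash reference maps a class name to an extractor. The empty
# prefix ('') stops the extractors from skipping leading whitespace.
my @fields = extract_multiple($text, [
    { Delim => sub { extract_delimited($_[0], q{'"}, '') } },
    { Brack => sub { extract_bracketed($_[0], '{}',  '') } },
]);

for my $field (@fields) {
    if (ref $field) {
        # Blessed fields: dereference to get the extracted string.
        print ref($field), ": [$$field]\n";
    }
    else {
        # Unmatched text is returned as a plain (unblessed) string.
        print "other: [$field]\n";
    }
}
```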
1006 =head2 C<gen_delimited_pat>
1008 The C<gen_delimited_pat> subroutine takes a single (string) argument and
1009 builds a Friedl-style optimized regex that matches a string delimited
1010 by any one of the characters in the single argument. For example:
1012 gen_delimited_pat(q{'"})
1016 (?:\"(?:\\\"|(?!\").)*\"|\'(?:\\\'|(?!\').)*\')
1018 Note that the specified delimiters are automatically quotemeta'd.
1020 A typical use of C<gen_delimited_pat> would be to build special purpose tags
1021 for C<extract_tagged>. For example, to properly ignore "empty" XML elements
1022 (which might contain quoted strings):
1024 my $empty_tag = '<(' . gen_delimited_pat(q{'"}) . '|.)+/>';
1026 extract_tagged($text, undef, undef, undef, {ignore => [$empty_tag]} );
1029 C<gen_delimited_pat> may also be called with an optional second argument,
1030 which specifies the "escape" character(s) to be used for each delimiter.
1031 For example, to match a Pascal-style string (where ' is the delimiter
1032 and '' is a literal ' within the string):
1034 gen_delimited_pat(q{'},q{'});
1036 Different escape characters can be specified for different delimiters.
1037 For example, to specify that '/' is the escape for single quotes
1038 and '%' is the escape for double quotes:
1040 gen_delimited_pat(q{'"},q{/%});
1042 If more delimiters than escape chars are specified, the last escape char
1043 is used for the remaining delimiters.
1044 If no escape char is specified for a given delimiter, '\' is used.
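The escape rules above can be checked with a short script. The input strings here are hypothetical and were chosen so that each one contains an escaped delimiter:

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# Pascal-style: ' delimits the string and '' is a literal ' inside it.
my $pascal = gen_delimited_pat(q{'}, q{'});
my ($pascal_str) = q{writeln('it''s here');} =~ /($pascal)/;

# Per-delimiter escapes: '/' escapes single quotes, '%' escapes double quotes.
my $mixed = gen_delimited_pat(q{'"}, q{/%});
my ($single) = q{x = 'a/'b' + 1}     =~ /($mixed)/;
my ($double) = q{y = "50%" off" now} =~ /($mixed)/;

print "$_\n" for $pascal_str, $single, $double;
```

In each case the match runs through the escaped delimiter and stops at the first unescaped one.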
1047 C<gen_delimited_pat> was previously called
1048 C<delimited_pat>. That name may still be used, but is now deprecated.
1053 In a list context, all the functions return C<(undef,$original_text)>
1054 on failure. In a scalar context, failure is indicated by returning C<undef>
1055 (in this case the input text is not modified in any way).
1057 In addition, on failure in I<any> context, the C<$@> variable is set.
1058 Accessing C<$@-E<gt>{error}> returns one of the error diagnostics listed below.
1060 Accessing C<$@-E<gt>{pos}> returns the offset into the original string at
1061 which the error was detected (although not necessarily where it occurred!).
1062 Printing C<$@> directly produces the error message, with the offset appended.
1063 On success, the C<$@> variable is guaranteed to be C<undef>.
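A minimal sketch of checking for failure in list context follows. The input string is an assumption, chosen so that C<extract_bracketed> runs out of text before the opening parenthesis is closed. C<$@> is copied straight away so that later code cannot overwrite it:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical input: the opening parenthesis is never closed.
my $text = '(unclosed [ bracket';

my ($extracted, $remainder) = extract_bracketed($text, '()');
my $err = $@;    # copy immediately, before anything else can reset $@

if (!defined $extracted) {
    print "error: $err->{error}\n";
    print "detected at offset: $err->{pos}\n";
    print "input returned untouched\n" if $remainder eq $text;
}
```

Testing C<defined $extracted> rather than the truth of the result avoids treating a legitimately extracted C<"0"> or empty string as a failure.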
1065 The available diagnostics are:
1069 =item C<Did not find a suitable bracket: "%s">
1071 The delimiter provided to C<extract_bracketed> was not one of
1072 C<'()[]E<lt>E<gt>{}'>.
1074 =item C<Did not find prefix: /%s/>
1076 A non-optional prefix was specified but wasn't found at the start of the text.
1078 =item C<Did not find opening bracket after prefix: "%s">
1080 C<extract_bracketed> or C<extract_codeblock> was expecting a
1081 particular kind of bracket at the start of the text, and didn't find it.
1083 =item C<No quotelike operator found after prefix: "%s">
1085 C<extract_quotelike> didn't find one of the quotelike operators C<q>,
1086 C<qq>, C<qw>, C<qx>, C<s>, C<tr> or C<y> at the start of the substring it was extracting.
1089 =item C<Unmatched closing bracket: "%c">
1091 C<extract_bracketed>, C<extract_quotelike> or C<extract_codeblock> encountered
1092 a closing bracket where none was expected.
1094 =item C<Unmatched opening bracket(s): "%s">
1096 C<extract_bracketed>, C<extract_quotelike> or C<extract_codeblock> ran
1097 out of characters in the text before closing one or more levels of nested brackets.
1100 =item C<Unmatched embedded quote (%s)>
1102 C<extract_bracketed> attempted to match an embedded quoted substring, but
1103 failed to find a closing quote to match it.
1105 =item C<Did not find closing delimiter to match '%s'>
1107 C<extract_quotelike> was unable to find a closing delimiter to match the
1108 one that opened the quote-like operation.
1110 =item C<Mismatched closing bracket: expected "%c" but found "%s">
1112 C<extract_bracketed>, C<extract_quotelike> or C<extract_codeblock> found
1113 a valid bracket delimiter, but it was the wrong species. This usually
1114 indicates a nesting error, but may indicate incorrect quoting or escaping.
1116 =item C<No block delimiter found after quotelike "%s">
1118 C<extract_quotelike> or C<extract_codeblock> found one of the
1119 quotelike operators C<q>, C<qq>, C<qw>, C<qx>, C<s>, C<tr> or C<y>
1120 without a suitable block after it.
1122 =item C<Did not find leading dereferencer>
1124 C<extract_variable> was expecting one of '$', '@', or '%' at the start of
1125 a variable, but didn't find any of them.
1127 =item C<Bad identifier after dereferencer>
1129 C<extract_variable> found a '$', '@', or '%' indicating a variable, but that
1130 character was not followed by a legal Perl identifier.
1132 =item C<Did not find expected opening bracket at %s>
1134 C<extract_codeblock> failed to find any of the outermost opening brackets
1135 that were specified.
1137 =item C<Improperly nested codeblock at %s>
1139 A nested code block was found that started with a delimiter that was specified
1140 as valid only as an outermost bracket.
1142 =item C<Missing second block for quotelike "%s">
1144 C<extract_codeblock> or C<extract_quotelike> found one of the
1145 quotelike operators C<s>, C<tr> or C<y> followed by only one block.
1147 =item C<No match found for opening bracket>
1149 C<extract_codeblock> failed to find a closing bracket to match the outermost opening bracket.
1152 =item C<Did not find opening tag: /%s/>
1154 C<extract_tagged> did not find a suitable opening tag (after any specified
1155 prefix was removed).
1157 =item C<Unable to construct closing tag to match: /%s/>
1159 C<extract_tagged> matched the specified opening tag and tried to
1160 modify the matched text to produce a matching closing tag (because
1161 none was specified). It failed to generate the closing tag, almost
1162 certainly because the opening tag did not start with a
1163 bracket of some kind.
1165 =item C<Found invalid nested tag: %s>
1167 C<extract_tagged> found a nested tag that appeared in the "reject" list
1168 (and the failure mode was not "MAX" or "PARA").
1170 =item C<Found unbalanced nested tag: %s>
1172 C<extract_tagged> found a nested opening tag that was not matched by a
1173 corresponding nested closing tag (and the failure mode was not "MAX" or "PARA").
1175 =item C<Did not find closing tag>
1177 C<extract_tagged> reached the end of the text without finding a closing tag
1178 to match the original opening tag (and the failure mode was not "MAX" or "PARA").
1189 Damian Conway (damian@conway.org)
1192 =head1 BUGS AND IRRITATIONS
1194 There are undoubtedly serious bugs lurking somewhere in this code, if
1195 only because parts of it give the impression of understanding a great deal
1196 more about Perl than they really do.
1198 Bug reports and other feedback are most welcome.
1203 Copyright (c) 1997-2000, Damian Conway. All Rights Reserved.
1204 This module is free software; you can redistribute it and/or
1205 modify it under the same terms as Perl itself.