XML::LibXML::Parser - Parsing XML Data with XML::LibXML

  $parser = XML::LibXML->new();
  $parser = XML::LibXML->new(option=>value, ...);
  $parser = XML::LibXML->new({option=>value, ...});

  $dom = XML::LibXML->load_xml(
      location => $file_or_url
  );
  $dom = XML::LibXML->load_xml(
      string => $xml_string
  );
  $dom = XML::LibXML->load_xml({
      IO => $perl_file_handle
  });
  $dom = $parser->load_xml(...);

  $dom = XML::LibXML->load_html(...);
  $dom = $parser->load_html(...);

  # Parsing well-balanced XML chunks

  $fragment = $parser->parse_balanced_chunk( $wbxmlstring, $encoding );

  $parser->process_xincludes( $doc );
  $parser->processXIncludes( $doc );

  # Old-style parser interfaces

  $doc = $parser->parse_file( $xmlfilename );
  $doc = $parser->parse_fh( $io_fh );
  $doc = $parser->parse_string( $xmlstring );
  $doc = $parser->parse_html_file( $htmlfile, \%opts );
  $doc = $parser->parse_html_fh( $io_fh, \%opts );
  $doc = $parser->parse_html_string( $htmlstring, \%opts );

  $parser->parse_chunk($string, $terminate);
  $doc = $parser->finish_push( $recover );

  # Set/query parser options

  $parser->option_exists($name);
  $parser->get_option($name);
  $parser->set_option($name,$value);
  $parser->set_options({$name=>$value,...});

  $parser->load_catalog( $catalog_file );
An XML document is read into a data structure such as a DOM tree by a piece of
software, called a parser. XML::LibXML currently provides four different parser

A DOM based SAX Parser.
=head2 Creating a Parser Instance

XML::LibXML provides an OO interface to the libxml2 parser functions. Thus you
have to create a parser instance before you can parse any XML data.

  $parser = XML::LibXML->new();
  $parser = XML::LibXML->new(option=>value, ...);
  $parser = XML::LibXML->new({option=>value, ...});

Create a new XML and HTML parser instance. Each parser instance holds default
values for various parser options. Optionally, one can pass a hash reference or
a list of option => value pairs to set a different default set of options.
Unless specified otherwise, the options C<<<<<< load_ext_dtd >>>>>>, C<<<<<< expand_entities >>>>>>, and C<<<<<< huge >>>>>> are set to 1. See L<<<<<< Parser Options >>>>>> for a list of libxml2 parser options.
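As a minimal sketch of the two constructor calling conventions described above, the following creates two parsers with the same non-default options. The variable names and the choice of C<no_blanks> and C<no_network> are only illustrative.

```perl
use strict;
use warnings;
use XML::LibXML;

# List form and hash-reference form set the same per-instance defaults.
my $parser_a = XML::LibXML->new( no_blanks => 1, no_network => 1 );
my $parser_b = XML::LibXML->new( { no_blanks => 1, no_network => 1 } );

# Query the stored defaults back from each instance.
for my $parser ( $parser_a, $parser_b ) {
    printf "no_blanks=%d no_network=%d\n",
        $parser->get_option('no_blanks'),
        $parser->get_option('no_network');
}
```

Options that are not passed to the constructor keep the defaults listed under Parser Options.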
One of the common parser interfaces of XML::LibXML is the DOM parser. This
parser reads XML data into a DOM-like data structure, so each tag can be
accessed and transformed.

XML::LibXML's DOM parser can parse not only XML data but also (strict) HTML
files. There are three ways to parse documents: as a string, as a Perl
filehandle, or as a filename/URL. The return value from each is a L<<<<<< XML::LibXML::Document >>>>>> object, which is a DOM object.

All of the functions listed below will throw an exception if the document is
invalid. To prevent this from terminating your program, wrap the call in an
  $dom = XML::LibXML->load_xml(
      location => $file_or_url
  );
  $dom = XML::LibXML->load_xml(
      string => $xml_string
  );
  $dom = XML::LibXML->load_xml({
      IO => $perl_file_handle
  });
  $dom = $parser->load_xml(...);

This function is available since XML::LibXML 1.70. It provides an easy-to-use
interface to the XML parser that parses a given file (or URL), string, or input
stream into a DOM tree. The arguments can be passed in a HASH reference or as
name => value pairs. The function can be called as a class method or an object
method. In both cases it internally creates a new parser instance passing the
specified parser options; if called as an object method, it clones the original
parser (preserving its settings) and additionally applies the specified options
to the new parser. See the constructor C<<<<<< new >>>>>> and L<<<<<< Parser Options >>>>>> for more information.
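The difference between the class-method and object-method forms can be sketched as follows. The sample document and variable names are made up for illustration; the point is that options passed to C<load_xml> on an existing parser apply only to the internal clone.

```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = "<root>\n  <item>one</item>\n</root>";

# Class method: a fresh parser with default options is created internally.
my $dom  = XML::LibXML->load_xml( string => $xml );
my @with = $dom->documentElement->childNodes;

# Object method: $parser is cloned and no_blanks is applied to the clone.
my $parser  = XML::LibXML->new();
my $compact = $parser->load_xml( string => $xml, no_blanks => 1 );
my @without = $compact->documentElement->childNodes;

printf "child nodes with blanks: %d, without: %d\n",
    scalar @with, scalar @without;
printf "no_blanks on the original parser: %d\n",
    $parser->get_option('no_blanks');
```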
  $dom = XML::LibXML->load_html(...);
  $dom = $parser->load_html(...);

This function is available since XML::LibXML 1.70. It has the same usage as C<<<<<< load_xml >>>>>>, providing an interface to the HTML parser. See C<<<<<< load_xml >>>>>> for more information.

Parsing HTML may cause problems, especially if the ampersand ('&') is used.
This is a common problem if HTML code is parsed that contains links to
CGI scripts. Such links cause the parser to throw errors. In such cases libxml2
still parses the entire document as if there were no error, but the error causes
XML::LibXML to stop the parsing process. However, the document is not lost.
Such HTML documents should be parsed using the I<<<<<< recover >>>>>> flag. By default recovering is deactivated.
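A hedged sketch of parsing such a document with recovery switched on. The markup is hypothetical; C<< recover => 2 >> is used here so that recoverable errors are not reported as warnings.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical markup: the '&' in the CGI link is not escaped as '&amp;'.
my $html = '<html><body>'
         . '<a href="list.cgi?id=1&page=2">next</a>'
         . '</body></html>';

# Ask the HTML parser to recover silently instead of throwing.
my $dom = XML::LibXML->load_html( string => $html, recover => 2 );

my @links = $dom->findnodes('//a');
print scalar(@links), " link(s) found\n";
```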
The functions described above are implemented to parse well-formed documents.
In some cases a program gets well-balanced XML instead of well-formed documents
(e.g. an XML fragment from a database). With XML::LibXML it is not required to
wrap such fragments in the code, because XML::LibXML can also parse
well-balanced XML fragments.

=item parse_balanced_chunk

  $fragment = $parser->parse_balanced_chunk( $wbxmlstring, $encoding );

This function parses a well-balanced XML string into a L<<<<<< XML::LibXML::DocumentFragment >>>>>>. The first argument contains the input string, the optional second argument
can be used to specify character encoding of the input (UTF-8 is assumed by

=item parse_xml_chunk

This is the old name of parse_balanced_chunk(). Because it may cause confusion
with the push parser interface, this function should not be used anymore.
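For illustration, a string with two sibling elements and no single root can be parsed like this (the input string is made up):

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Two top-level elements: well balanced, but not a well-formed document.
my $fragment = $parser->parse_balanced_chunk('<a>1</a><b>2</b>');

# The fragment holds both elements as its children.
my @names = map { $_->nodeName } $fragment->childNodes;
print "fragment children: @names\n";
```

The resulting fragment can then be appended to an element of an existing document like any other node.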
By default XML::LibXML does not process XInclude tags within an XML document
(see options section below). XML::LibXML allows you to post-process a document
to expand XInclude tags.

=item process_xincludes

  $parser->process_xincludes( $doc );

After a document is parsed into a DOM structure, you may want to expand the
document's XInclude tags. This function processes the given document structure
and expands all XInclude tags (or throws an error) by using the flags and
callbacks of the given parser instance.

Note that the resulting tree contains some extra nodes (of type
XML_XINCLUDE_START and XML_XINCLUDE_END) after successfully processing the
document. These nodes indicate where data was included into the original tree.
If the document is serialized, these extra nodes will not show up.

Remember: A document with processed XIncludes differs from the original
document after serialization, because the original XInclude tags will not get

If the parser flag "expand_xinclude" is set to 1, you do not need to post-process

=item processXIncludes

  $parser->processXIncludes( $doc );

This is an alias for process_xincludes with a Java-like function name.
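The post-processing step can be sketched with two temporary files. The file names, their contents, and the C<write_file> helper are assumptions made only for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use XML::LibXML;

my $dir = tempdir( CLEANUP => 1 );

# Hypothetical helper that writes a small file to disk.
sub write_file {
    my ( $path, $content ) = @_;
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} $content;
    close $fh or die "Cannot close $path: $!";
}

write_file( "$dir/part.xml", '<part>included text</part>' );
write_file( "$dir/main.xml",
      '<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
    . '<xi:include href="part.xml"/></doc>' );

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_file("$dir/main.xml");

my $before = $doc->findvalue('/doc/part');    # include not expanded yet
$parser->process_xincludes($doc);
my $after  = $doc->findvalue('/doc/part');    # content of part.xml

print "before: '$before', after: '$after'\n";
```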
  $doc = $parser->parse_file( $xmlfilename );

This function parses an XML document from a file or network; $xmlfilename can
be either a filename or a URL. Note that for parsing files, this function is
the fastest choice, about 6-8 times faster than parse_fh().

  $doc = $parser->parse_fh( $io_fh );

parse_fh() parses an IOREF or a subclass of IO::Handle.

Because the data comes from an open handle, libxml2's parser does not know
about the base URI of the document. To set the base URI one should use
parse_fh() as follows:

  my $doc = $parser->parse_fh( $io_fh, $baseuri );

  $doc = $parser->parse_string( $xmlstring );

This function is similar to parse_fh(), but it parses an XML document that is
available as a single string in memory. Again, you can pass an optional base

  my $doc = $parser->parse_string( $xmlstring, $baseuri );
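A small sketch of the effect of the second argument, assuming a made-up base URI. The C<URI> method of the resulting document is used to read the value back.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Hypothetical location the in-memory string is pretended to come from.
my $base = 'http://example.org/docs/sample.xml';

my $doc = $parser->parse_string( '<doc/>', $base );

# Relative references in the document are resolved against this URI.
my $uri = $doc->URI;
print "document URI: $uri\n";
```

parse_fh() takes the base URI in the same position.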
=item parse_html_file

  $doc = $parser->parse_html_file( $htmlfile, \%opts );

Similar to parse_file() but parses HTML (strict) documents; $htmlfile can be

An optional second argument can be used to pass some options to the HTML parser
as a HASH reference. See options labeled with HTML in L<<<<<< Parser Options >>>>>>.

  $doc = $parser->parse_html_fh( $io_fh, \%opts );

Similar to parse_fh() but parses HTML (strict) streams.

An optional second argument can be used to pass some options to the HTML parser
as a HASH reference. See options labeled with HTML in L<<<<<< Parser Options >>>>>>.

Note: the encoding option may not work correctly with this function in libxml2 <
2.6.27 if the HTML file declares charset using a META tag.

=item parse_html_string

  $doc = $parser->parse_html_string( $htmlstring, \%opts );

Similar to parse_string() but parses HTML (strict) strings.

An optional second argument can be used to pass some options to the HTML parser
as a HASH reference. See options labeled with HTML in L<<<<<< Parser Options >>>>>>.
XML::LibXML provides a push parser interface. Rather than pulling the data from
a given source, the push parser waits for the data to be pushed into it.

This allows one to parse large documents without waiting for the parser to
finish. The interface is especially useful if a program needs to pre-process
the incoming pieces of XML (e.g. to detect document boundaries).

While the XML::LibXML parse_*() functions require the data to be well-formed
XML, the push parser will take any arbitrary string that contains some XML
data. The only requirement is that all the pushed strings together form a
well-formed document. With the push parser interface a program can interrupt
the parsing process as required, where the parse_*() functions do not give
enough flexibility.

Unlike the pull parser implemented in parse_fh() or parse_file(), the
push parser is not able to find out about the document's end itself. Thus the
calling program needs to indicate explicitly when the parsing is done.

In XML::LibXML this is done by a single function:

  $parser->parse_chunk($string, $terminate);

parse_chunk() tries to parse a given chunk of data, which isn't necessarily
well-balanced data. The function takes two parameters: the chunk of data as a
string and optionally a termination flag. If the termination flag is set to a
true value (e.g. 1), the parsing will be stopped and the resulting document
will be returned as the following example describes:

  my $parser = XML::LibXML->new;
  for my $string ( "<", "foo", ' bar="hello world"', "/>") {
      $parser->parse_chunk( $string );
  }
  my $doc = $parser->parse_chunk("", 1); # terminate the parsing
Internally XML::LibXML provides three functions that control the push parser

  $parser->init_push();

Initializes the push parser.

  $parser->push(@data);

This function pushes the data stored inside the array to libxml2's parser. Each
entry in @data must be a normal scalar! This method can be called repeatedly.

  $doc = $parser->finish_push( $recover );

This function returns the result of the parsing process. If this function is
called without a parameter it will complain about non-well-formed documents. If
$recover is 1, the push parser can be used to restore broken or non-well-formed
(XML) documents as the following example shows:

  $parser->push( "<foo>", "bar" );
  $doc = $parser->finish_push(); # will report broken XML

This can be annoying if the closing tag is missed by accident. The following
code will restore the document:

  $parser->push( "<foo>", "bar" );
  $doc = $parser->finish_push(1); # will return the data parsed
                                  # unless an error happened

  print $doc->toString(); # returns "<foo>bar</foo>"

Of course finish_push() will return nothing if there was no data pushed to the
=head2 Pull Parser (Reader)

XML::LibXML also provides a pull-parser interface similar to the XmlReader
interface in .NET. This interface is almost streaming, and is usually faster
and simpler to use than SAX. See L<<<<<< XML::LibXML::Reader >>>>>>.
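To give an impression of the pull interface, here is a minimal sketch that counts element names in a made-up document. See L<<<<<< XML::LibXML::Reader >>>>>> for the authoritative description of the methods used.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

my $xml    = '<list><item>a</item><item>b</item><other/></list>';
my $reader = XML::LibXML::Reader->new( string => $xml );

# Pull one node at a time and count the start of each element.
my %count;
while ( $reader->read ) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    $count{ $reader->name }++;
}

print "$_: $count{$_}\n" for sort keys %count;
```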
=head2 Direct SAX Parser

XML::LibXML provides a direct SAX parser in the L<<<<<< XML::LibXML::SAX >>>>>> module.

=head2 DOM based SAX Parser

XML::LibXML also provides a DOM based SAX parser. The SAX parser is defined in
the module XML::LibXML::SAX::Parser. As it is not a stream based parser, it
parses documents into a DOM and traverses the DOM tree instead.

The API of this parser is exactly the same as any other Perl SAX2 parser. See
XML::SAX::Intro for details.

Aside from the regular parsing methods, you can access the DOM tree traverser
directly, using the generate() method:

  my $doc = build_yourself_a_document();
  my $saxparser = XML::LibXML::SAX::Parser->new( ... );
  $saxparser->generate( $doc );

This is useful for serializing DOM trees, for example trees that you have
already processed or that result from XSLT processing.

I<<<<<< WARNING >>>>>>

This is NOT a streaming SAX parser. As I said above, this parser reads the
entire document into a DOM and serialises it. Some people couldn't read that in
the paragraph above so I've added this warning. If you want a streaming SAX
parser, look at the L<<<<<< XML::LibXML::SAX >>>>>> man page.
XML::LibXML provides some functions to serialize nodes and documents. The
serialization functions are described on the L<<<<<< XML::LibXML::Node >>>>>> manpage or the L<<<<<< XML::LibXML::Document >>>>>> manpage. XML::LibXML checks three global flags that alter the serialization

Of these three flags, only setTagCompression is available for all
serialization functions.

Because XML::LibXML does not define these flags itself, one has to define them
locally as the following example shows:

  local $XML::LibXML::skipXMLDeclaration = 1;
  local $XML::LibXML::skipDTD = 1;
  local $XML::LibXML::setTagCompression = 1;

If skipXMLDeclaration is defined and not '0', the XML declaration is omitted
during serialization.

If skipDTD is defined and not '0', an existing DTD would not be serialized with

If setTagCompression is defined and not '0', empty tags are displayed as open
and closing tags rather than the shortcut. For example the empty tag I<<<<<< foo >>>>>> will be rendered as I<<<<<< <foo></foo> >>>>>> rather than I<<<<<< <foo/> >>>>>>.
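The effect of the flags can be sketched by serializing the same document twice. The one-element document is made up, and the block scope keeps the C<local> settings from leaking into the rest of the program.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string => '<foo/>' );

# Serialization with the flags left alone.
my $default = $doc->toString();

# Serialization with two of the flags set for this block only.
my $tweaked;
{
    local $XML::LibXML::skipXMLDeclaration = 1;
    local $XML::LibXML::setTagCompression  = 1;
    $tweaked = $doc->toString();
}

print "default: $default";
print "tweaked: $tweaked\n";
```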
=head1 PARSER OPTIONS

Handling of libxml2 parser options has been unified and improved in XML::LibXML
1.70. You can now set default options for a particular parser instance by
passing them to the constructor as C<<<<<< XML::LibXML->new({name=>value, ...}) >>>>>> or C<<<<<< XML::LibXML->new(name=>value,...) >>>>>>. The options can be queried and changed using the following methods (pre-1.70
interfaces such as C<<<<<< $parser->load_ext_dtd(0) >>>>>> also exist, see below):

  $parser->option_exists($name);

Returns 1 if the current XML::LibXML version supports the option C<<<<<< $name >>>>>>, otherwise returns 0 (note that this does not necessarily mean that the option
is supported by the underlying libxml2 library).

  $parser->get_option($name);

Returns the current value of the parser option C<<<<<< $name >>>>>>.

  $parser->set_option($name,$value);

Sets option C<<<<<< $name >>>>>> to value C<<<<<< $value >>>>>>.

  $parser->set_options({$name=>$value,...});

Sets multiple parsing options at once.
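The four methods can be combined as in the following sketch. The option name C<no_such_option> is deliberately made up to show the negative case.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Check whether this XML::LibXML version knows an option name.
my $known   = $parser->option_exists('no_blanks');
my $unknown = $parser->option_exists('no_such_option');    # made-up name

# Change one option, then several at once.
$parser->set_option( no_blanks => 1 );
$parser->set_options( { no_network => 1, line_numbers => 1 } );

printf "known=%d unknown=%d no_blanks=%d no_network=%d\n",
    $known, $unknown,
    $parser->get_option('no_blanks'),
    $parser->get_option('no_network');
```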
IMPORTANT NOTE: This documentation reflects the parser flags available in
libxml2 2.7.3. Some options have no effect if an older version of libxml2 is

Each of the flags listed below is labeled

if it can be used with a C<<<<<< XML::LibXML >>>>>> parser object (i.e. passed to C<<<<<< XML::LibXML->new >>>>>>, C<<<<<< XML::LibXML->set_option >>>>>>, etc.)

if it can be passed to the C<<<<<< parse_html_* >>>>>> methods

if it can be used with the C<<<<<< XML::LibXML::Reader >>>>>>.

Unless specified otherwise, the default for boolean valued options is 0

The available options are:

/parser, html, reader/

In case of parsing strings or file handles, XML::LibXML doesn't know about the
base URI of the document. To make relative references such as XIncludes work,
one has to set a base URI, which is then used for the parsed document.

/parser, html, reader/

If this option is activated, libxml2 will store the line number of each element
node in the parsed document. The line number can be obtained using the C<<<<<< line_number() >>>>>> method of the C<<<<<< XML::LibXML::Node >>>>>> class (for non-element nodes this may report the line number of the containing
element). The line numbers are also used for reporting positions of validation

IMPORTANT: Due to limitations in the libxml2 library line numbers greater than
65535 will be returned as 65535. Unfortunately, this is a long and sad story,
please see L<<<<<< http://bugzilla.gnome.org/show_bug.cgi?id=325533 >>>>>> for more details.
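A short sketch of the line number option, using a made-up four-line document:

```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = "<a>\n  <b/>\n  <c/>\n</a>";

# Line numbers are only recorded when the option is switched on.
my $dom = XML::LibXML->load_xml( string => $xml, line_numbers => 1 );

# Map each element name to the line on which the element starts.
my %line = map { $_->nodeName => $_->line_number } $dom->findnodes('//*');

print "$_ starts on line $line{$_}\n" for sort keys %line;
```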
character encoding of the input

/parser, html, reader/

recover from errors; possible values are 0, 1, and 2

A true value turns on recovery mode which allows one to parse broken XML or
HTML data. The recovery mode allows the parser to return the successfully
parsed portion of the input document. This is useful for almost well-formed
documents, where for example a closing tag is missing somewhere. Still,
XML::LibXML will only parse until the first fatal (non-recoverable) error
occurs, reporting recoverable parsing errors as warnings. To suppress even
these warnings, use recover=>2.

Note that validation is switched off automatically in recovery mode.
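The difference between strict parsing and recovery mode can be sketched with a deliberately broken input in which the closing tag is missing:

```perl
use strict;
use warnings;
use XML::LibXML;

my $broken = '<foo>bar';    # closing tag is missing

# Without recovery the parser throws; the eval keeps the program alive.
my $strict = eval { XML::LibXML->load_xml( string => $broken ) };
my $error  = $@;

# With recover => 2 the parsed portion is returned and warnings are muted.
my $doc  = XML::LibXML->load_xml( string => $broken, recover => 2 );
my $root = $doc->documentElement;

print "strict parse failed\n" if $error;
print "recovered root element: ", $root->nodeName, "\n";
```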
=item expand_entities

substitute entities; possible values are 0 and 1; default is 1

Note that although unsetting this flag disables entity substitution, it does not
prevent the parser from loading external entities; when substitution of an
external entity is disabled, the entity will be represented in the document
tree by an XML_ENTITY_REF_NODE node whose subtree will be the content obtained
by parsing the external resource. Although this level of nesting is visible
from the DOM, it is transparent to the XPath data model, so it is possible to
match nodes in an unexpanded entity by the same XPath expression as if the
entity were expanded. See also ext_ent_handler.

=item ext_ent_handler

Provide a custom external entity handler to be used when expand_entities is set
to 1. Possible value is a subroutine reference.

This feature does not work properly in libxml2 < 2.6.27!

The subroutine provided is called whenever the parser needs to retrieve the
content of an external entity. It is called with two arguments: the system ID
(URI) and the public ID. The value returned by the subroutine is parsed as the
content of the entity.

This method can be used to completely disable entity loading, e.g. to prevent
exploits of the type described at (L<<<<<< http://searchsecuritychannel.techtarget.com/generic/0,295582,sid97_gci1304703,00.html >>>>>>), where a service is tricked into exposing its private data by letting it parse a
remote file (RSS feed) that contains an entity reference to a local file (e.g. C<<<<<< /etc/fstab >>>>>>).

A more granular solution to this problem, however, is provided by custom URL

  my $c = XML::LibXML::InputCallback->new();
  sub match {   # accept file:/ URIs except for XML catalogs in /etc/xml/
      my ($uri) = @_;
      return ($uri=~m{^file:/}
              and $uri !~ m{^file:///etc/xml/})
             ? 1 : 0;
  }
  $c->register_callbacks([ \&match, sub{}, sub{}, sub{} ]);
  $parser->input_callbacks($c);
load the external DTD subset while parsing; possible values are 0 and 1. Unless
specified, XML::LibXML sets this option to 1.

This flag is also required for DTD validation, to provide complete attributes,
and to expand entities, regardless of whether the document has an internal
subset. Thus switching off external DTD loading will disable entity expansion,
validation, and complete attributes on internal subsets as well.

=item complete_attributes

create default DTD attributes; possible values are 0 and 1

validate with the DTD; possible values are 0 and 1

=item suppress_errors

/parser, html, reader/

suppress error reports; possible values are 0 and 1

=item suppress_warnings

/parser, html, reader/

suppress warning reports; possible values are 0 and 1

=item pedantic_parser

/parser, html, reader/

pedantic error reporting; possible values are 0 and 1

/parser, html, reader/

remove blank nodes; possible values are 0 and 1

=item expand_xinclude or xinclude

Implement XInclude substitution; possible values are 0 and 1

Expands XInclude tags immediately while parsing the document. Note that the
parser will use the URI resolvers installed via C<<<<<< XML::LibXML::InputCallback >>>>>> to parse the included document (if any).

=item no_xinclude_nodes

do not generate XINCLUDE START/END nodes; possible values are 0 and 1

/parser, html, reader/

Forbid network access; possible values are 0 and 1

If set to true, all attempts to fetch non-local resources (such as DTD or
external entities) will fail (unless custom callbacks are defined).

It may be necessary to use the flag C<<<<<< recover >>>>>> for processing documents requiring such resources while networking is off.

=item clean_namespaces

remove redundant namespace declarations during parsing; possible values are 0

/parser, html, reader/

merge CDATA as text nodes; possible values are 0 and 1

do not fix up XInclude xml:base URIs; possible values are 0 and 1

/parser, html, reader/

relax any hardcoded limit from the parser; possible values are 0 and 1. Unless
specified, XML::LibXML sets this option to 1.
THIS OPTION IS EXPERIMENTAL!

Although quite powerful, XML::LibXML's DOM implementation is incomplete with
respect to the DOM level 2 or level 3 specifications. XML::GDOME is based on
libxml2 as well and provides a rather complete DOM implementation by
wrapping libgdome. This flag allows you to make use of XML::LibXML's full
parser options and XML::GDOME's DOM implementation at the same time.

To make use of this function, one has to install libgdome and configure
XML::LibXML to use this library. For this you need to rebuild XML::LibXML!

Note: this feature was not seriously tested in recent XML::LibXML releases.
For compatibility with XML::LibXML versions prior to 1.70, the following
methods are also supported for querying and setting the corresponding parser
options (if called without arguments, the methods return the current value of
the corresponding parser options; with an argument sets the option to a given

  $parser->validation();
  $parser->pedantic_parser();
  $parser->line_numbers();
  $parser->load_ext_dtd();
  $parser->complete_attributes();
  $parser->expand_xinclude();
  $parser->gdome_dom();
  $parser->clean_namespaces();
  $parser->no_network();

The following obsolete methods trigger parser options in some special way:

=item recover_silently

  $parser->recover_silently(1);

If called without an argument, returns true if the current value of the C<<<<<< recover >>>>>> parser option is 2 and returns false otherwise. With a true argument sets the C<<<<<< recover >>>>>> parser option to 2; with a false argument sets the C<<<<<< recover >>>>>> parser option to 0.

=item expand_entities

  $parser->expand_entities(0);

Get/set the C<<<<<< expand_entities >>>>>> option. If called with a true argument, also turns the C<<<<<< load_ext_dtd >>>>>> option to 1.

  $parser->keep_blanks(0);

This is actually the opposite of the C<<<<<< no_blanks >>>>>> parser option. If used without an argument, retrieves the negated value of C<<<<<< no_blanks >>>>>>. If used with an argument, sets C<<<<<< no_blanks >>>>>> to the opposite value.

  $parser->base_uri( $your_base_uri );

Get/set the C<<<<<< URI >>>>>> option.
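How two of these legacy methods map onto the unified options can be sketched as follows (variable names are illustrative only):

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# keep_blanks(0) is the legacy spelling of no_blanks => 1.
$parser->keep_blanks(0);
my $no_blanks = $parser->get_option('no_blanks');

# recover_silently(1) sets the recover option to 2.
$parser->recover_silently(1);
my $silent = $parser->recover_silently();

print "no_blanks is now $no_blanks\n";
print "silent recovery is ", ( $silent ? "on" : "off" ), "\n";
```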
C<<<<<< libxml2 >>>>>> supports XML catalogs. Catalogs are used to map remote resources to their local
copies. Using catalogs can speed up parsing processes if many external
resources from remote addresses are loaded into the parsed documents (such as

Note that libxml2 has a global pool of loaded catalogs, so if you apply the
method C<<<<<< load_catalog >>>>>> to one parser instance, all parser instances will start using the catalog (in
addition to other previously loaded catalogs).

Note also that catalogs are not used when a custom external entity handler is
specified. At present it is not possible to make use of both types of
resolving systems at the same time.

  $parser->load_catalog( $catalog_file );

Loads the XML catalog file $catalog_file.
=head1 ERROR REPORTING

XML::LibXML throws exceptions during parsing, validation or XPath processing
(and some other occasions). These errors can be caught by using I<<<<<< eval >>>>>> blocks. The error is stored in I<<<<<< $@ >>>>>>. There are two implementations: the old one throws $@ which is just a message
string, in the new one $@ is an object from the class XML::LibXML::Error; this
class overrides the operator "" so that when printed, the object flattens to
the usual error message.

XML::LibXML throws errors as they occur. This is a very common misunderstanding
in the use of XML::LibXML. If the eval is omitted, XML::LibXML will always halt
your script by "croaking" (see Carp man page for details).

Also note that an increasing number of functions throw errors if bad data is
passed as arguments. If you cannot ensure that valid data is passed to
XML::LibXML, you should eval these functions.

Note: since version 1.59, get_last_error() is no longer available in
XML::LibXML for thread-safety reasons.
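A minimal sketch of catching a parse error and handling both kinds of C<$@>. The input is deliberately malformed, and the C<$summary> variable exists only for this example.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);
use XML::LibXML;

# Mismatched tags make the parser throw; eval catches the exception.
my $doc = eval { XML::LibXML->load_xml( string => '<a><b></a>' ) };
my $err = $@;

my $summary = '';
if ( blessed($err) and $err->isa('XML::LibXML::Error') ) {
    # Structured error object: individual fields can be inspected.
    $summary = sprintf 'line %d: %s', $err->line, $err->message;
}
elsif ($err) {
    # Plain message string, as thrown by the older implementation.
    $summary = "$err";
}

print "parse failed: $summary\n" if $err;
```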
2001-2007, AxKit.com Ltd.

2002-2006, Christian Glahn.

2006-2009, Petr Pajas.