=head1 NAME

XML::LibXML::Reader - interface to libxml2 pull parser
=head1 SYNOPSIS

  use XML::LibXML::Reader;

  my $reader = XML::LibXML::Reader->new(location => "file.xml")
      or die "cannot read file.xml\n";
  while ($reader->read) {
      processNode($reader);
  }

  # Print depth, node type, name and empty-element flag of the current node.
  sub processNode {
      my $reader = shift;
      printf "%d %d %s %d\n", ($reader->depth,
                               $reader->nodeType,
                               $reader->name,
                               $reader->isEmptyElement);
  }

or

  $reader = XML::LibXML::Reader->new(location => "file.xml")
      or die "cannot read file.xml\n";
  $reader->preservePattern('//table/tr');
  $reader->finish;
  print $reader->document->toString(1);
=head1 DESCRIPTION

This is a Perl interface to libxml2's pull-parser implementation xmlTextReader I<<<<<< http://xmlsoft.org/html/libxml-xmlreader.html >>>>>>. This feature requires at least libxml2-2.6.21. Pull parsers (such as StAX in Java or
XmlReader in C#) use an iterator approach to parse XML documents. They are easier
to program than event-based parsers (SAX) and much more lightweight than
tree-based parsers (DOM), which load the complete tree into memory.

The Reader acts as a cursor going forward on the document stream and stopping
at each node on the way. At every point, DOM-like methods of the Reader object
allow you to examine the current node (name, namespace, attributes, etc.).

The user's code keeps control of the progress and simply calls the C<<<<<< read() >>>>>> function repeatedly to progress to the next node in the document order. Other
functions provide means for skipping complete sub-trees, or nodes until a
specific element, etc.

At any time, only a very limited portion of the document is kept in
memory, which makes the API more memory-efficient than using DOM. However, it
is also possible to mix Reader with DOM. At every point the user may copy the
current node (optionally expanded into a complete sub-tree) from the processed
document to another DOM tree, or instruct the Reader to collect a sub-document
in the form of a DOM tree consisting of selected nodes.

The Reader API also supports namespaces, xml:base, entity handling, and DTD
validation. RelaxNG and W3C XML Schema validation can be requested with the
RelaxNG and Schema constructor options described in L<<<<<< Reader options >>>>>>.

The naming of methods compared to libxml2 and C# XmlTextReader has been changed
slightly to match the conventions of XML::LibXML. Some functions have been
changed or added with respect to the C interface.
=head1 CONSTRUCTOR

Depending on the XML source, the Reader object can be created with any of:

  my $reader = XML::LibXML::Reader->new( location => "file.xml", ... );
  my $reader = XML::LibXML::Reader->new( string => $xml_string, ... );
  my $reader = XML::LibXML::Reader->new( IO => $file_handle, ... );
  my $reader = XML::LibXML::Reader->new( FD => fileno(STDIN), ... );
  my $reader = XML::LibXML::Reader->new( DOM => $dom, ... );

where ... are (optional) reader options described below in L<<<<<< Reader options >>>>>> or various parser options described in L<<<<<< XML::LibXML::Parser >>>>>>. The constructor recognizes the following XML sources:
=head2 Source specification

=over 4

=item location

Read XML from a local file or URL.

=item string

Read XML from a string.

=item IO

Read XML from a Perl IO filehandle.

=item FD

Read XML from a file descriptor (bypasses the Perl I/O layer, only applicable to
filehandles for regular files or pipes). Possibly faster than IO.

=item DOM

Use the reader API to walk through a pre-parsed L<<<<<< XML::LibXML::Document >>>>>>.

=back
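The following is a minimal sketch of two of the sources listed above, C<string> and C<DOM>. The sample document, the helper C<count_elements> and the variable names are assumptions made for this illustration only; they are not part of the module.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::Reader;

# Hypothetical sample document, used only for this illustration.
my $xml = '<doc><item id="1"/><item id="2"/></doc>';

# Hypothetical helper: count the element start nodes seen by a reader.
sub count_elements {
    my ($reader) = @_;
    my $count = 0;
    while ( $reader->read ) {
        $count++ if $reader->nodeType == XML_READER_TYPE_ELEMENT;
    }
    return $count;
}

# Source 1: an in-memory string.
my $from_string = XML::LibXML::Reader->new( string => $xml )
    or die "cannot create reader from string\n";

# Source 2: a document that has already been parsed into a DOM tree.
my $dom      = XML::LibXML->load_xml( string => $xml );
my $from_dom = XML::LibXML::Reader->new( DOM => $dom )
    or die "cannot create reader from DOM\n";

my $string_count = count_elements($from_string);
my $dom_count    = count_elements($from_dom);
print "string: $string_count, DOM: $dom_count\n";
```

Both readers walk the same content, so the two counts are expected to agree; only the way the input reaches libxml2 differs.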
=head2 Reader options

=over 4

=item encoding => $encoding

Overrides the document encoding.

=item RelaxNG => $rng_schema

Can be used to pass either a L<<<<<< XML::LibXML::RelaxNG >>>>>> object or a filename or URL of a RelaxNG schema to the constructor. The schema
is then used to validate the document as it is processed.

=item Schema => $xsd_schema

Can be used to pass either a L<<<<<< XML::LibXML::Schema >>>>>> object or a filename or URL of a W3C XSD schema to the constructor. The schema
is then used to validate the document as it is processed.

=back

The reader further supports various parser options described in L<<<<<< XML::LibXML::Parser >>>>>> (specifically those labeled by /reader/).
=head1 METHODS CONTROLLING PARSING PROGRESS

=over 4

=item read ()

Moves the position to the next node in the stream, exposing its properties.

Returns 1 if the node was read successfully, 0 if there are no more nodes to
read, or -1 in case of error.

=item readAttributeValue ()

Parses an attribute value into one or more Text and EntityReference nodes.

Returns 1 in case of success, 0 if the reader was not positioned on an
attribute node or all the attribute values have been read, or -1 in case of
error.

=item readState ()

Gets the read state of the reader. Returns the state value, or -1 in case of
error. The module exports constants for the Reader states, see STATES below.

=item depth ()

The depth of the node in the tree, starting at 0 for the root node.

=item next ()

Skips to the node following the current one in the document order while avoiding
the sub-tree if any. Returns 1 if the node was read successfully, 0 if there are
no more nodes to read, or -1 in case of error.

=item nextElement (localname?,nsURI?)

Skips nodes following the current one in the document order until a specific
element is reached. The element's name must be equal to the given localname if
defined, and its namespace must be equal to the given nsURI if defined. Either of
the arguments can be undefined (or omitted, in case of the latter or both).

Returns 1 if the element was found, 0 if there are no more nodes to read, or -1
in case of error.

=item nextPatternMatch (compiled_pattern)

Skips nodes following the current one in the document order until an element
matching a given compiled pattern is reached. See L<<<<<< XML::LibXML::Pattern >>>>>> for information on compiled patterns. See also the C<<<<<< matchesPattern >>>>>> method.

Returns 1 if the element was found, 0 if there are no more nodes to read, or -1
in case of error.

=item skipSiblings ()

Skips all nodes on the same or lower level until the first node on a higher
level is reached. In particular, if the current node occurs in an element, the
reader stops at the end tag of the parent element, otherwise it stops at a node
immediately following the parent node.

Returns 1 if successful, 0 if the end of the document is reached, or -1 in case of
error.

=item nextSibling ()

Skips to the node following the current one in the document order while
avoiding the sub-tree if any.

Returns 1 if the node was read successfully, 0 if there are no more nodes to
read, or -1 in case of error.

=item nextSiblingElement (name?,nsURI?)

Like nextElement but only processes sibling elements of the current node
(moving forward using C<<<<<< nextSibling () >>>>>> rather than C<<<<<< read () >>>>>>, internally).

Returns 1 if the element was found, 0 if there are no more sibling nodes, or -1
in case of error.

=item finish ()

Skips all remaining nodes in the document, reaching the end of the document.

Returns 1 if successful, 0 in case of error.

=item close ()

This method releases any resources allocated by the current instance and closes
any underlying input. It returns 0 on failure and 1 on success. This method is
automatically called by the destructor when the reader is forgotten, therefore
you do not have to call it directly.

=back
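As an illustration of the navigation methods above, the sketch below jumps between elements with C<nextElement()> and skips a sub-tree with C<next()>. The sample document and all variable names are assumptions made for this example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample document, used only for this illustration.
my $xml = '<lib><book><title>A</title><note>n1</note></book>'
        . '<book><title>B</title></book></lib>';

# 1. Jump from one <title> element to the next with nextElement().
my $reader = XML::LibXML::Reader->new( string => $xml );
my @titles;
while ( $reader->nextElement('title') == 1 ) {
    $reader->read;                    # step onto the text node inside <title>
    push @titles, $reader->value;
}
print "titles: @titles\n";

# 2. Skip a whole sub-tree with next().
my $skipper = XML::LibXML::Reader->new( string => $xml );
$skipper->read;                       # <lib>, depth 0
$skipper->read;                       # first <book>, depth 1
$skipper->next;                       # jumps over the first <book> sub-tree
my $after_skip = join ' ', $skipper->name, $skipper->depth;

$skipper->nextElement('title');       # first <title> after the skipped book
$skipper->read;                       # its text node
my $second_title = $skipper->value;
print "after next(): $after_skip, title: $second_title\n";
```

Comparing the return value with 1 explicitly, rather than testing for truth, keeps the loop from treating an error code of -1 as success.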
=head1 METHODS EXTRACTING INFORMATION

=over 4

=item name ()

Returns the qualified name of the current node, equal to (Prefix:)LocalName.

=item nodeType ()

Returns the type of the current node. See NODE TYPES below.

=item localName ()

Returns the local name of the node.

=item prefix ()

Returns the prefix of the namespace associated with the node.

=item namespaceURI ()

Returns the URI defining the namespace associated with the node.

=item isEmptyElement ()

Checks whether the current node is an empty element. Note that only the
self-closing form counts: C<<<<<< <a/> >>>>>> is considered empty, while C<<<<<< <a></a> >>>>>> is not.

=item hasValue ()

Returns true if the node can have a text value.

=item value ()

Provides the text value of the node if present or undef if not available.

=item readInnerXml ()

Reads the contents of the current node, including child nodes and markup.
Returns a string containing the XML of the node's content, or undef if the
current node is neither an element nor attribute, or has no child nodes.

=item readOuterXml ()

Reads the contents of the current node, including child nodes and markup.

Returns a string containing the XML of the node including its content, or undef
if the current node is neither an element nor attribute.

=item nodePath ()

Returns a canonical location path from the root node to
the current node. Namespaced elements are matched by '*', because there is no
way to declare prefixes within XPath patterns. Unlike C<<<<<< XML::LibXML::Node::nodePath() >>>>>>, this function does not provide sibling counts (i.e. instead of e.g. '/a/b[1]'
and '/a/b[2]' you get '/a/b' for both matches).

=item matchesPattern(compiled_pattern)

Returns a true value if the current node matches a compiled pattern. See L<<<<<< XML::LibXML::Pattern >>>>>> for information on compiled patterns. See also the C<<<<<< nextPatternMatch >>>>>> method.

=back
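The sketch below shows how the name-related accessors and C<isEmptyElement()> behave on a small input, including the difference between the self-closing and the open/close form of an empty element. The sample document, the namespace URI C<urn:example> and the C<@rows> layout are assumptions chosen for this example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample document, used only for this illustration.
my $xml = '<doc xmlns:x="urn:example"><x:p>hi</x:p><a/><a></a></doc>';

my $reader = XML::LibXML::Reader->new( string => $xml );
my @rows;
while ( $reader->read ) {
    # Only report element start nodes; skip text and end-element nodes.
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    push @rows, join ' ',
        $reader->name,
        $reader->localName,
        $reader->prefix       // '-',    # undef when the node has no prefix
        $reader->namespaceURI // '-',    # undef when the node has no namespace
        $reader->isEmptyElement ? 'empty' : 'not-empty';
}
print "$_\n" for @rows;
```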
=head1 METHODS EXTRACTING DOM NODES

=over 4

=item document ()

Provides access to the document tree built by the reader. This function can be
used to collect the preserved nodes (see C<<<<<< preserveNode() >>>>>> and C<<<<<< preservePattern() >>>>>>).

CAUTION: Never use this function to modify the tree unless reading of the whole
document is completed!

=item copyCurrentNode (deep)

This function is similar to the DOM function C<<<<<< copyNode() >>>>>>. It returns a copy of the currently processed node as a corresponding DOM
object. Use deep = 1 to obtain the full sub-tree.

=item preserveNode ()

This tells the XML Reader to preserve the current node in the document tree. A
document tree consisting of the preserved nodes and their content can be
obtained using the method C<<<<<< document() >>>>>> once parsing is finished.

Returns the node or undef in case of error.

=item preservePattern (pattern,\%ns_map)

This tells the XML Reader to preserve all nodes matched by the pattern (which
is a streaming XPath subset). A document tree consisting of the preserved nodes
and their content can be obtained using the method C<<<<<< document() >>>>>> once parsing is finished.

An optional second argument can be used to provide a HASH reference mapping
prefixes used by the XPath to namespace URIs.

The XPath subset available with this function is described at

  http://www.w3.org/TR/xmlschema-1/#Selector

and matches the production

  Path ::= ('.//')? ( Step '/' )* ( Step | '@' NameTest )

Returns a positive number in case of success and -1 in case of error.

=back
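To illustrate mixing the Reader with DOM, the sketch below first copies a single node with C<copyCurrentNode(1)> and then collects all matching rows with C<preservePattern()>, C<finish()> and C<document()>. The sample document and variable names are assumptions made for this example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample document, used only for this illustration.
my $xml = '<html><body><table>'
        . '<tr><td>1</td></tr><tr><td>2</td></tr>'
        . '</table></body></html>';

# 1. Copy the current node, with its whole sub-tree, into a DOM object.
my $reader = XML::LibXML::Reader->new( string => $xml );
$reader->nextElement('tr');
my $first_row = $reader->copyCurrentNode(1)->toString;
print "$first_row\n";

# 2. Let the reader collect every node matching a pattern.
my $collector = XML::LibXML::Reader->new( string => $xml );
$collector->preservePattern('//table/tr');
$collector->finish;                        # read to the end before using the tree
my $doc   = $collector->document;
my @cells = map { $_->textContent } $doc->findnodes('//tr/td');
print "cells: @cells\n";
```

Calling C<finish()> before C<document()> follows the caution above: the tree is only inspected once reading of the whole document is complete.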
=head1 METHODS PROCESSING ATTRIBUTES

=over 4

=item attributeCount ()

Provides the number of attributes of the current node.

=item hasAttributes ()

Whether the node has attributes.

=item getAttribute (name)

Provides the value of the attribute with the specified qualified name.

Returns a string containing the value of the specified attribute, or undef in
case of error.

=item getAttributeNs (localName, namespaceURI)

Provides the value of the specified attribute.

Returns a string containing the value of the specified attribute, or undef in
case of error.

=item getAttributeNo (no)

Provides the value of the attribute with the specified index relative to the
containing element.

Returns a string containing the value of the specified attribute, or undef in
case of error.

=item isDefault ()

Returns true if the current attribute node was generated from the default value
defined in the DTD.

=item moveToAttribute (name)

Moves the position to the attribute with the specified qualified name.

Returns 1 in case of success, -1 in case of error, 0 if not found.

=item moveToAttributeNo (no)

Moves the position to the attribute with the specified index relative to the
containing element.

Returns 1 in case of success, -1 in case of error, 0 if not found.

=item moveToAttributeNs (localName,namespaceURI)

Moves the position to the attribute with the specified local name and namespace
URI.

Returns 1 in case of success, -1 in case of error, 0 if not found.

=item moveToFirstAttribute ()

Moves the position to the first attribute associated with the current node.

Returns 1 in case of success, -1 in case of error, 0 if not found.

=item moveToNextAttribute ()

Moves the position to the next attribute associated with the current node.

Returns 1 in case of success, -1 in case of error, 0 if not found.

=item moveToElement ()

Moves the position to the node that contains the current attribute node.

Returns 1 in case of success, -1 in case of error, 0 if not moved.

=item isNamespaceDecl ()

Determines whether the current node is a namespace declaration rather than a
regular attribute.

Returns 1 if the current node is a namespace declaration, 0 if it is a regular
attribute or other type of node, or -1 in case of error.

=back
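A short sketch of the attribute methods follows: it reads one attribute directly and then iterates over all attributes of the current element before returning to the element itself. The sample element and variable names are assumptions made for this example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample document, used only for this illustration.
my $xml = '<img src="a.png" alt="A" width="10"/>';

my $reader = XML::LibXML::Reader->new( string => $xml );
$reader->read;                                  # position on <img>

my $count = $reader->attributeCount;
my $src   = $reader->getAttribute('src');       # direct lookup by name

# Walk over the attributes one by one.
my @pairs;
if ( $reader->moveToFirstAttribute == 1 ) {
    do {
        push @pairs, $reader->name . '=' . $reader->value;
    } while ( $reader->moveToNextAttribute == 1 );
    $reader->moveToElement;                     # return to the owning element
}
my $back_on = $reader->name;

print "$count attributes, src=$src\n";
print "pairs: @pairs\n";
print "back on: $back_on\n";
```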
=head1 OTHER METHODS

=over 4

=item lookupNamespace (prefix)

Resolves a namespace prefix in the scope of the current element.

Returns a string containing the namespace URI to which the prefix maps or undef
in case of error.

=item encoding ()

Returns a string containing the encoding of the document or undef in case of
error.

=item standalone ()

Determines the standalone status of the document being read. Returns 1 if the
document was declared to be standalone, 0 if it was declared to be not
standalone, or -1 if the document did not specify its standalone status or in
case of error.

=item xmlVersion ()

Determines the XML version of the document being read. Returns a string
containing the XML version of the document or undef in case of error.

=item baseURI ()

Returns the base URI of the current node.

=item isValid ()

Retrieves the validity status from the parser.

Returns 1 if valid, 0 if not, and -1 in case of error.

=item xmlLang ()

The xml:lang scope within which the node resides.

=item lineNumber ()

Provides the line number of the current parsing point.

=item columnNumber ()

Provides the column number of the current parsing point.

=item byteConsumed ()

This function provides the current index of the parser relative to the start of
the current entity. The index is counted in bytes, starting at zero and
finishing at the size in bytes of the file if parsing a
file. The function is of constant cost if the input is UTF-8 but can be costly
if run on non-UTF-8 input.

=item setParserProp (prop => value, ...)

Changes the parser processing behaviour by changing some of its internal
properties. The following properties are available with this function:
``load_ext_dtd'', ``complete_attributes'', ``validation'', ``expand_entities''.

Since some of the properties can only be changed before any read has been done,
it is best to set the parsing properties at the constructor.

Returns 0 if the call was successful, or -1 in case of error.

=item getParserProp (prop)

Gets the value of an internal parser property. The following property names can be
used: ``load_ext_dtd'', ``complete_attributes'', ``validation'',
``expand_entities''.

Returns the value, usually 0 or 1, or -1 in case of error.

=back
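As a small sketch of the document-level helpers, the example below resolves a prefix with C<lookupNamespace()> and queries C<xmlVersion()>, C<lineNumber()> and C<columnNumber()>. The sample document, the namespace URI C<urn:example> and the variable names are assumptions made for this example; the reported line and column depend on how far the underlying parser has read, so they are printed without a stated expectation.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample document, used only for this illustration.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<a xmlns:x="urn:example">
  <x:b/>
</a>
XML

my $reader = XML::LibXML::Reader->new( string => $xml );

# Move to the <x:b> element, matching on local name and namespace URI.
$reader->nextElement( 'b', 'urn:example' );

my $uri     = $reader->lookupNamespace('x');   # resolve prefix in current scope
my $version = $reader->xmlVersion;

printf "x => %s, XML %s, line %d, column %d\n",
    $uri, $version, $reader->lineNumber, $reader->columnNumber;
```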
=head1 DESTRUCTION

XML::LibXML takes care of the reader object destruction when the last reference
to the reader object goes out of scope. The document tree is preserved, though,
if either of $reader->document or $reader->preserveNode was used and references
to the document tree exist.
=head1 NODE TYPES

The reader interface provides the following constants for node types (the
constant symbols are exported by default or if tag C<<<<<< :types >>>>>> is used).

  XML_READER_TYPE_NONE                    => 0
  XML_READER_TYPE_ELEMENT                 => 1
  XML_READER_TYPE_ATTRIBUTE               => 2
  XML_READER_TYPE_TEXT                    => 3
  XML_READER_TYPE_CDATA                   => 4
  XML_READER_TYPE_ENTITY_REFERENCE        => 5
  XML_READER_TYPE_ENTITY                  => 6
  XML_READER_TYPE_PROCESSING_INSTRUCTION  => 7
  XML_READER_TYPE_COMMENT                 => 8
  XML_READER_TYPE_DOCUMENT                => 9
  XML_READER_TYPE_DOCUMENT_TYPE           => 10
  XML_READER_TYPE_DOCUMENT_FRAGMENT       => 11
  XML_READER_TYPE_NOTATION                => 12
  XML_READER_TYPE_WHITESPACE              => 13
  XML_READER_TYPE_SIGNIFICANT_WHITESPACE  => 14
  XML_READER_TYPE_END_ELEMENT             => 15
  XML_READER_TYPE_END_ENTITY              => 16
  XML_READER_TYPE_XML_DECLARATION         => 17
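The sketch below shows one way the node type constants might be used to label the nodes a reader visits. The sample document, the C<%label> table and the variable names are assumptions made for this example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample document, used only for this illustration.
my $xml = '<doc><!-- note --><p>text</p><![CDATA[raw]]></doc>';

# Map a few of the exported constants to readable labels.
# The trailing () forces the constants to be evaluated as hash keys.
my %label = (
    XML_READER_TYPE_ELEMENT()     => 'element',
    XML_READER_TYPE_TEXT()        => 'text',
    XML_READER_TYPE_CDATA()       => 'cdata',
    XML_READER_TYPE_COMMENT()     => 'comment',
    XML_READER_TYPE_END_ELEMENT() => 'end-element',
);

my $reader = XML::LibXML::Reader->new( string => $xml );
my @seen;
while ( $reader->read ) {
    my $type = $reader->nodeType;
    push @seen, $label{$type} // "type $type";   # fall back to the raw number
}
print join( ' ', @seen ), "\n";
```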
=head1 STATES

The following constants represent the values returned by C<<<<<< readState() >>>>>>. They are exported by default, or if tag C<<<<<< :states >>>>>> is used:

  XML_READER_NONE      => -1
  XML_READER_START     =>  0
  XML_READER_ELEMENT   =>  1
  XML_READER_END       =>  2
  XML_READER_EMPTY     =>  3
  XML_READER_BACKTRACK =>  4
  XML_READER_DONE      =>  5
  XML_READER_ERROR     =>  6
=head1 SEE ALSO

L<<<<<< XML::LibXML::Pattern >>>>>> for information about compiled patterns.

http://xmlsoft.org/html/libxml-xmlreader.html

http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html
=head1 ORIGINAL IMPLEMENTATION

Heiko Klein, E<lt>H.Klein@gmx.netE<gt> and Petr Pajas
=head1 COPYRIGHT

2001-2007, AxKit.com Ltd.

2002-2006, Christian Glahn.

2006-2009, Petr Pajas.