<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css">
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
  </style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.</p>

<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
    node</a></li>
  <li><a href="#Extracting1">Extracting information for the
    attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entity substitution</a></li>
  <li><a href="#L1142">Relax-NG Validation</a></li>
  <li><a href="#Mixing">Mixing the reader and tree or XPath
    operations</a></li>
</ul>

<p></p>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2's <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation loads the document
completely in memory and exposes it as a tree of nodes all available at the
same time. This is very simple and quite powerful, but has the major
limitation that the size of the document that can be handled is limited by
the size of the memory available. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and SAX is
not formally defined for C. SAX basically works by registering callbacks
which are called directly by the parser as it progresses through the document
stream. The problem is that this programming model is relatively complex,
not well standardized, cannot provide validation directly, and makes entity,
namespace and base processing relatively hard.</p>

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user's code keeps control of the progress and simply calls a
Read() function repeatedly to progress to each node in sequence in document
order. There is direct support for namespaces, xml:base and entity handling, and
adding DTD validation on top of it was relatively simple. This API is really
close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and powerful
API than the existing SAX. Moreover, integrating extension features based on
the tree seems relatively easy.</p>

<p>In a nutshell, the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

void streamFile(const char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        /* 1: a node was read, 0: end of document, negative: error */
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
        }
    } else {
        printf("Unable to open %s\n", filename);
    }
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a return value of 0 means the end of the document was reached,
    while a negative return means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a> with the same method names (but the
properties are currently accessed with methods) and that one doesn't need to
free the reader at the end of the processing. It will get garbage collected
once all references have disappeared.</p>

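<p>The loop contract used in both samples (1 while a node is available, 0 at
the end of the document, anything else on error) can be factored into a small
driver. The following is only a sketch: <code>stream</code> and
<code>FakeReader</code> are hypothetical names that are not part of libxml2,
and <code>FakeReader</code> merely replays return codes so that the snippet
runs without the bindings installed.</p>

```python
def stream(reader, process_node):
    """Drive a reader to the end of its input; return True on a clean end."""
    ret = reader.Read()
    while ret == 1:            # 1: the cursor is stopped on a node
        process_node(reader)
        ret = reader.Read()
    return ret == 0            # 0: end of document, anything else: error


class FakeReader:
    """Hypothetical stand-in that replays a list of Read() return codes."""

    def __init__(self, codes):
        self._codes = iter(codes)

    def Read(self):
        return next(self._codes)


seen = []
ok = stream(FakeReader([1, 1, 1, 0]), seen.append)
failed = stream(FakeReader([1, -1]), seen.append)
print(ok, failed, len(seen))
```

<p>A reader created with <code>libxml2.newTextReaderFilename()</code> should be
usable in place of the stand-in, since the driver only relies on its
<code>Read()</code> method.</p>
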
<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information was extracted
from the reader. It was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function taking a single
xmlTextReaderPtr argument whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method of the reader class that can be called on the
instance. The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: the node type: 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri</em>: the base URI of the node. See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth</em>: the depth of the node in the tree, starting at 0 for the
    root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not supported
    yet</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: whether the current node is empty. This is a
    bit bizarre in the sense that <code>&lt;a/&gt;</code> will be considered
    empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>

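<p>Since NodeType is reported as a bare integer, a lookup table can make
traces easier to read. This is a minimal sketch built only from the codes
listed above; <code>NODE_TYPE_NAMES</code> and <code>node_type_name</code> are
hypothetical helper names, not part of the libxml2 bindings.</p>

```python
# Codes copied from the NodeType entry in the list above.
NODE_TYPE_NAMES = {
    1: "element start",
    2: "attribute",
    3: "text",
    4: "CDATA section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/Doctype",
    11: "document fragment",
    12: "notation",
    15: "end of element",
}


def node_type_name(code):
    """Return a readable label, or a placeholder for codes not listed above."""
    return NODE_TYPE_NAMES.get(code, "unlisted (%d)" % code)


print(node_type_name(1))
print(node_type_name(15))
print(node_type_name(13))
```

<p>With a real reader the call would presumably look like
<code>node_type_name(reader.NodeType())</code>.</p>
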
<p>Let's first put this into practice with a small example, redefining
the processNode() function in the Python example:</p>
<pre>def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))</pre>

<p>and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.</p>

<p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
<pre>0 1 doc 1</pre>

<p>Only one node is found: its depth is 0, type 1 indicates an element start,
its name is "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
<pre>0 1 doc 0
0 15 doc 0</pre>

<p>The document root node is not flagged as empty anymore and both a start
and an end of element are detected. The following document shows how
character data are reported:</p>
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
&lt;c/&gt;&lt;/doc&gt;</pre>

<p>We modify the processNode() function to also report the node Value:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))</pre>

<p>The result of the test is:</p>
<pre>0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0

1 1 c 1 None
0 15 doc 0 None</pre>

<p>There are a few things to note:</p>
<ul>
  <li>the increase of the depth value (first column) as child nodes are
    explored</li>
  <li>the text node child of the b element, of type 3, and its content</li>
  <li>the text node containing the line return between elements b and c</li>
  <li>that elements have the Value None (or NULL in C)</li>
</ul>

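<p>To see how Depth and NodeType together describe the tree, the rows of the
trace above can be replayed and indented by depth. This is an illustrative
sketch only: <code>TRACE</code> is the output shown above typed in by hand, and
<code>outline</code> is a hypothetical helper that handles just the three node
types appearing in it.</p>

```python
# Rows (Depth, NodeType, Name, IsEmptyElement, Value) taken from the trace above.
TRACE = [
    (0, 1, "doc", 0, None),
    (1, 1, "a", 1, None),
    (1, 1, "b", 0, None),
    (2, 3, "#text", 0, "some text"),
    (1, 15, "b", 0, None),
    (1, 3, "#text", 0, "\n"),
    (1, 1, "c", 1, None),
    (0, 15, "doc", 0, None),
]


def outline(rows):
    """Render one indented line per row; the indentation follows Depth."""
    lines = []
    for depth, node_type, name, is_empty, value in rows:
        pad = "  " * depth
        if node_type == 1:       # element start, possibly an empty element
            lines.append(pad + ("empty " if is_empty else "start ") + name)
        elif node_type == 15:    # end of element
            lines.append(pad + "end " + name)
        elif node_type == 3:     # text node: show its content unambiguously
            lines.append(pad + "text " + repr(value))
    return lines


for line in outline(TRACE):
    print(line)
```

<p>Empty elements produce a single row and no matching end row, which is why
<code>a</code> and <code>c</code> never increase the indentation.</p>
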
<p>The equivalent routine for <code>processNode()</code> as used by
<code>xmllint --stream --debug</code> is the following and can be found in
the xmllint.c module in the source distribution:</p>
<pre>static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}</pre>

<h2><a name="Extracting1">Extracting information for the attributes</a></h2>

<p>The previous examples don't indicate how attributes are processed. The
simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
result:</p>
<pre>0 1 doc 1 None</pre>

<p>This shows that attribute nodes are not traversed by default. The
<em>HasAttributes</em> property allows detecting their presence. To check
their content the API has special instructions. Basically two kinds of operations
are possible:</p>
<ol>
  <li>to move the reader to the attribute nodes of the current element, in
    which case the cursor is positioned on the attribute node</li>
  <li>to directly query the element node for the attribute value</li>
</ol>

<p>In both cases the attribute can be designated either by its position in the
list of attributes (<em>MoveToAttributeNo</em> or <em>GetAttributeNo</em>) or
by its name (and namespace):</p>
<ul>
  <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
    the specified index no relative to the containing element.</li>
  <li><em>GetAttribute</em>(name): provides the value of the attribute with
    the specified qualified name.</li>
  <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of the
    attribute with the specified local name and namespace URI.</li>
  <li><em>MoveToAttributeNo</em>(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.</li>
  <li><em>MoveToAttribute</em>(name): moves the position of the current
    instance to the attribute with the specified qualified name.</li>
  <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.</li>
  <li><em>MoveToFirstAttribute</em>: moves the position of the current
    instance to the first attribute associated with the current node.</li>
  <li><em>MoveToNextAttribute</em>: moves the position of the current
    instance to the next attribute associated with the current node.</li>
  <li><em>MoveToElement</em>: moves the position of the current instance to
    the node that contains the current Attribute node.</li>
</ul>

<p>After modifying the processNode() function to show attributes:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))</pre>

<p>The output for the same input document reflects the attribute:</p>
<pre>0 1 doc 1 None
-- 1 2 (a) [b]</pre>

<p>There are a couple of things to note on the attribute processing:</p>
<ul>
  <li>Their depth is that of the carrying element plus one.</li>
  <li>Namespace declarations are seen as attributes, as in DOM.</li>
</ul>

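<p>The cursor-moving calls can be combined to gather all attributes of an
element and then return to the element itself. The sketch below assumes a
reader exposing <code>HasAttributes()</code>,
<code>MoveToNextAttribute()</code>, <code>MoveToElement()</code>,
<code>Name()</code> and <code>Value()</code> as described above;
<code>attributes_of</code> and <code>FakeElementReader</code> are hypothetical
names, and the latter only imitates a reader stopped on a single element so
that the snippet runs without libxml2 installed.</p>

```python
def attributes_of(reader):
    """Collect the attributes of the current element as a name-to-value dict."""
    attrs = {}
    if reader.HasAttributes():
        while reader.MoveToNextAttribute() == 1:
            attrs[reader.Name()] = reader.Value()
        reader.MoveToElement()      # put the cursor back on the element
    return attrs


class FakeElementReader:
    """Hypothetical stand-in for a reader stopped on one element."""

    def __init__(self, name, attrs):
        self._name = name
        self._attrs = list(attrs)   # (name, value) pairs in document order
        self._pos = -1              # -1 means the cursor is on the element

    def HasAttributes(self):
        return 1 if self._attrs else 0

    def MoveToNextAttribute(self):
        if self._pos + 1 >= len(self._attrs):
            return 0
        self._pos += 1
        return 1

    def MoveToElement(self):
        moved = 1 if self._pos != -1 else 0
        self._pos = -1
        return moved

    def Name(self):
        return self._name if self._pos == -1 else self._attrs[self._pos][0]

    def Value(self):
        return None if self._pos == -1 else self._attrs[self._pos][1]


reader = FakeElementReader("doc", [("a", "b")])
print(attributes_of(reader), reader.Name())
```

<p>Calling <code>MoveToElement()</code> at the end means later queries such as
<code>Name()</code> describe the element again rather than its last
attribute.</p>
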
<h2><a name="Validating">Validating a document</a></h2>

<p>The libxml2 implementation adds some extra features on top of the XmlTextReader
API. The main one is the ability to DTD validate the parsed document
progressively. This is simply the activation of the associated feature of the
parser used by the reader structure. There are a few options available,
defined as the enum xmlParserProperties in the libxml/xmlreader.h header
file:</p>
<ul>
  <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)</li>
  <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)</li>
  <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly; entity
    reference nodes are not generated and are replaced by their expanded
    content.</li>
  <li>more settings might be added; those were the ones available at the 2.5.0
    release...</li>
</ul>

<p>The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:</p>
<pre>def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))</pre>

<p>This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:</p>
<pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>

<p>and a return value of 0 indicates success.</p>

<h2><a name="Entities">Entity substitution</a></h2>

<p>By default the xmlReader will report entities as such and not replace them
with their content. This default behaviour can however be overridden using:</p>

<p><code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code></p>

<h2><a name="L1142">Relax-NG Validation</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>Libxml2 can now validate the document being read with the xmlReader against
Relax-NG schemas. While the Relax-NG validator can't always work in a
streamable mode, only subsets which cannot be reduced to regular expressions
need to have their subtree expanded for validation. In practice it means
that, unless the schema for the top level element content is not expressible
as a regexp, only chunks of the document need to be parsed while
validating.</p>

<p>The steps to do so are:</p>
<ul>
  <li>create a reader working on a document as usual</li>
  <li>before any call to read, associate it with a Relax-NG schema, either the
    preparsed schema or the URL of the schema to use</li>
  <li>errors will be reported the usual way, and the validity status can be
    obtained using the IsValid() interface of the reader, as for DTDs.</li>
</ul>

<p>Example, assuming the reader has already been created and that the schema
string contains the Relax-NG schema:</p>
<pre>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")</pre>

<p>See <code>reader6.py</code> in the sources or documentation for a complete
example.</p>

<h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>While the reader is a streaming interface, its underlying implementation
is based on the DOM builder of libxml2. As a result it is relatively simple
to mix operations based on both models under some constraints. To do so the
reader has an Expand() operation which grows the subtree under the
current node. It returns a pointer to a standard node which can be
manipulated in the usual ways. The node will have all its ancestors and the
full subtree available. Usual operations like XPath queries can be used on
that reduced view of the document. Here is an example extracted from
reader5.py in the sources which extracts and prints the bibliography entry for the
"Dragon" compiler book from the XML 1.0 recommendation:</p>
<pre>f = open('../../test/valid/REC-xml-19980210.xml')
input = libxml2.inputBuffer(f)
reader = input.newTextReader("REC")
res=""
while reader.Read() == 1:
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break</pre>

<p>Note, however, that the node instance returned by the Expand() call is only
valid until the next Read() operation. The Expand() operation does not
affect the Read() ones; however, once processed, the full subtree is usually
not useful anymore, and the Next() operation allows skipping it completely and
proceeding to the successor, or returns 0 if the document end is reached.</p>

<p><a href="mailto:[email protected]">Daniel Veillard</a></p>

<p>$Id$</p>

<p></p>
</body>
</html>