<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link rel="SHORTCUT ICON" href="/favicon.ico" /><style type="text/css">
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
10 | </style><title>Encodings support</title></head><body bgcolor="#8b7765" text="#000000" link="#a06060" vlink="#000000"><table border="0" width="100%" cellpadding="5" cellspacing="0" align="center"><tr><td width="120"><a href="http://swpat.ffii.org/"><img src="epatents.png" alt="Action against software patents" /></a></td><td width="180"><a href="http://www.gnome.org/"><img src="gnome2.png" alt="Gnome2 Logo" /></a><a href="http://www.w3.org/Status"><img src="w3c.png" alt="W3C Logo" /></a><a href="http://www.redhat.com/"><img src="redhat.gif" alt="Red Hat Logo" /></a><div align="left"><a href="http://xmlsoft.org/"><img src="Libxml2-Logo-180x168.gif" alt="Made with Libxml2 Logo" /></a></div></td><td><table border="0" width="90%" cellpadding="2" cellspacing="0" align="center" bgcolor="#000000"><tr><td><table width="100%" border="0" cellspacing="1" cellpadding="3" bgcolor="#fffacd"><tr><td align="center"><h1>The XML C parser and toolkit of Gnome</h1><h2>Encodings support</h2></td></tr></table></td></tr></table></td></tr></table><table border="0" cellpadding="4" cellspacing="0" width="100%" align="center"><tr><td bgcolor="#8b7765"><table border="0" cellspacing="0" cellpadding="2" width="100%"><tr><td valign="top" width="200" bgcolor="#8b7765"><table border="0" cellspacing="0" cellpadding="1" width="100%" bgcolor="#000000"><tr><td><table width="100%" border="0" cellspacing="1" cellpadding="3"><tr><td colspan="1" bgcolor="#eecfa1" align="center"><center><b>Main Menu</b></center></td></tr><tr><td bgcolor="#fffacd"><form action="search.php" enctype="application/x-www-form-urlencoded" method="get"><input name="query" type="text" size="20" value="" /><input name="submit" type="submit" value="Search ..." 
/></form><ul><li><a href="index.html">Home</a></li><li><a href="html/index.html">Reference Manual</a></li><li><a href="intro.html">Introduction</a></li><li><a href="FAQ.html">FAQ</a></li><li><a href="docs.html" style="font-weight:bold">Developer Menu</a></li><li><a href="bugs.html">Reporting bugs and getting help</a></li><li><a href="help.html">How to help</a></li><li><a href="downloads.html">Downloads</a></li><li><a href="news.html">Releases</a></li><li><a href="XMLinfo.html">XML</a></li><li><a href="XSLT.html">XSLT</a></li><li><a href="xmldtd.html">Validation & DTDs</a></li><li><a href="encoding.html">Encodings support</a></li><li><a href="catalog.html">Catalog support</a></li><li><a href="namespaces.html">Namespaces</a></li><li><a href="contribs.html">Contributions</a></li><li><a href="examples/index.html" style="font-weight:bold">Code Examples</a></li><li><a href="html/index.html" style="font-weight:bold">API Menu</a></li><li><a href="guidelines.html">XML Guidelines</a></li><li><a href="ChangeLog.html">Recent Changes</a></li></ul></td></tr></table><table width="100%" border="0" cellspacing="1" cellpadding="3"><tr><td colspan="1" bgcolor="#eecfa1" align="center"><center><b>Related links</b></center></td></tr><tr><td bgcolor="#fffacd"><ul><li><a href="http://mail.gnome.org/archives/xml/">Mail archive</a></li><li><a href="http://xmlsoft.org/XSLT/">XSLT libxslt</a></li><li><a href="http://phd.cs.unibo.it/gdome2/">DOM gdome2</a></li><li><a href="http://www.aleksey.com/xmlsec/">XML-DSig xmlsec</a></li><li><a href="ftp://xmlsoft.org/">FTP</a></li><li><a href="http://www.zlatkovic.com/projects/libxml/">Windows binaries</a></li><li><a href="http://opencsw.org/packages/libxml2">Solaris binaries</a></li><li><a href="http://www.explain.com.au/oss/libxml2xslt.html">MacOsX binaries</a></li><li><a href="http://lxml.de/">lxml Python bindings</a></li><li><a href="http://cpan.uwinnipeg.ca/dist/XML-LibXML">Perl bindings</a></li><li><a 
href="http://libxmlplusplus.sourceforge.net/">C++ bindings</a></li><li><a href="http://www.zend.com/php5/articles/php5-xmlphp.php#Heading4">PHP bindings</a></li><li><a href="http://sourceforge.net/projects/libxml2-pas/">Pascal bindings</a></li><li><a href="http://libxml.rubyforge.org/">Ruby bindings</a></li><li><a href="http://tclxml.sourceforge.net/">Tcl bindings</a></li><li><a href="http://bugzilla.gnome.org/buglist.cgi?product=libxml2">Bug Tracker</a></li></ul></td></tr></table></td></tr></table></td><td valign="top" bgcolor="#8b7765"><table border="0" cellspacing="0" cellpadding="1" width="100%"><tr><td><table border="0" cellspacing="0" cellpadding="1" width="100%" bgcolor="#000000"><tr><td><table border="0" cellpadding="3" cellspacing="1" width="100%"><tr><td bgcolor="#fffacd"><p>If you are not really familiar with Internationalization (usual shortcut
is I18N), Unicode, characters and glyphs, I suggest you read a <a href="http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode">presentation</a>
by Tim Bray on Unicode and why you should care about it.</p><p>If you don't understand why <b>it does not make sense to have a string
without knowing what encoding it uses</b>, then as Joel Spolsky said <a href="http://www.joelonsoftware.com/articles/Unicode.html">please do not
write another line of code until you finish reading that article</a>. It is
a prerequisite for understanding this page, and it helps avoid a lot of problems with
libxml2, XML or text processing in general.</p><p>Table of Contents:</p><ol>
<li><a href="encoding.html#What">What does internationalization support
mean?</a></li>
<li><a href="encoding.html#internal">The internal encoding, how and
why</a></li>
<li><a href="encoding.html#implemente">How is it implemented?</a></li>
<li><a href="encoding.html#Default">Default supported encodings</a></li>
<li><a href="encoding.html#extend">How to extend the existing
support</a></li>
</ol><h3><a name="What" id="What">What does internationalization support mean?</a></h3><p>XML was designed from the start to support any character set
by using Unicode. Any conformant XML parser has to support the UTF-8 and
UTF-16 default encodings, which can both express the full Unicode range. UTF-8
is a variable-length encoding whose main strengths are that it reuses the same
encoding as ASCII and saves space for Western languages, but it is a bit
more complex to handle in practice. UTF-16 uses 2 bytes per code unit (and
sometimes combines two units into a pair), which makes implementation easier but is
a bit overkill for Western languages. Moreover the XML specification
allows the document to be encoded in other encodings on the condition that
they are clearly labeled as such. For example the following is a well-formed
XML document encoded in ISO-8859-1 and using accented letters that we
French like for both markup and content:</p><pre><?xml version="1.0" encoding="ISO-8859-1"?>
<très>là </très></pre><p>Having internationalization support in libxml2 means the following:</p><ul>
<li>the document is properly parsed</li>
<li>information about its encoding is saved</li>
<li>it can be modified</li>
<li>it can be saved in its original encoding</li>
<li>it can also be saved in another encoding supported by libxml2 (for
example straight UTF-8 or even an ASCII form)</li>
</ul><p>Another very important point is that the whole libxml2 API, with the
exception of a few routines to read with a specific encoding or save to a
specific encoding, is completely agnostic about the original encoding of the
document.</p><p>It should also be noted that the HTML parser embedded in libxml2 now obeys
the same rules. The following document will be (as of 2.2.2) handled in
an internationalized fashion by libxml2 as well:</p><pre><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html lang="fr">
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
</head>
<body>
<p>W3C crée des standards pour le Web.</body>
</html></pre><h3><a name="internal" id="internal">The internal encoding, how and why</a></h3><p>One of the core decisions was to convert all documents to a
default internal encoding, and to make that encoding UTF-8. Here is the
rationale for those choices:</p><ul>
<li>keeping the native encoding in the internal form would force libxml
users (or the associated code) to be fully aware of the encoding of the
original document. For example, when adding a text node to a document,
the content would have to be provided in the document encoding, i.e. the
client code would have to check it beforehand, make sure it conforms
to the encoding, etc. This is very hard in practice, though in some specific
cases it may make sense.</li>
<li>the second decision was which encoding. From the XML spec only UTF-8 and
UTF-16 really make sense, as they are the only two encodings for which
support is mandatory. UCS-4 (a 32-bit fixed-size encoding) could be
considered an intelligent choice too since it's a direct Unicode mapping.
I selected UTF-8 on the basis of efficiency and compatibility
with surrounding software:
<ul>
<li>UTF-8, while a bit more complex to convert from/to (i.e. slightly
more costly to import and export CPU-wise), is also far more compact
than UTF-16 (and UCS-4) for a majority of the documents I see it used
for right now (RPM RDF catalogs, advogato data, various configuration
file formats, etc.), and the key point for today's computer
architecture is efficient use of caches. If one nearly doubles the
memory requirement to store the same amount of data, this will thrash
caches (main memory/external caches/internal caches), and my take is
that this harms the system far more than the CPU requirements needed
for the conversion to UTF-8</li>
<li>Most libxml2 version 1 users were using it with straight ASCII
most of the time. Doing the conversion with an internal encoding
that required all their code to be rewritten was a serious show-stopper
for using UTF-16 or UCS-4.</li>
<li>UTF-8 is being used as the de facto internal encoding standard for
related code like the <a href="http://www.pango.org/">pango</a>
upcoming Gnome text widget, and a lot of Unix code (yet another place
where the Unix programmer base takes a different approach from Microsoft
- they are using UTF-16)</li>
</ul>
</li>
</ul><p>What does this mean in practice for the libxml2 user:</p><ul>
<li>xmlChar, the libxml2 data type, is a byte. Those bytes must be assembled
as valid UTF-8 strings. The proper way to terminate an xmlChar * string
is simply to append a 0 byte, as usual.</li>
<li>One just needs to make sure that when using chars outside the ASCII set,
the values have been properly converted to UTF-8</li>
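<p>As an illustration of the second point, here is a minimal sketch of converting an ISO-8859-1 buffer to UTF-8 before handing it to the API. The helper name <code>latin1_to_utf8</code> is hypothetical and not part of libxml2; it only shows what "properly converted" means for Latin-1 input, where every byte above 0x7F becomes a two-byte sequence.</p>

```c
#include <stddef.h>

/* Hypothetical helper, not a libxml2 function.
 * Converts `inlen` ISO-8859-1 bytes to a 0-terminated UTF-8 string.
 * Returns the number of bytes written (excluding the terminator),
 * or -1 if `out` is too small. */
long latin1_to_utf8(const unsigned char *in, size_t inlen,
                    unsigned char *out, size_t outsz)
{
    size_t used = 0;

    for (size_t i = 0; i < inlen; i++) {
        unsigned char c = in[i];
        size_t need = (c < 0x80) ? 1 : 2;

        /* Always keep one byte free for the final 0 terminator. */
        if (used + need + 1 > outsz)
            return -1;
        if (c < 0x80) {
            out[used++] = c;                 /* ASCII is unchanged */
        } else {
            /* U+0080..U+00FF: 110000xx 10xxxxxx */
            out[used++] = (unsigned char)(0xC0 | (c >> 6));
            out[used++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    if (used + 1 > outsz)
        return -1;
    out[used] = 0;
    return (long)used;
}
```

<p>The resulting buffer can then be cast to <code>xmlChar *</code>, since it is a 0-terminated sequence of UTF-8 bytes.</p>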
</ul><h3><a name="implemente" id="implemente">How is it implemented?</a></h3><p>Let's describe how all this works within libxml. Basically the I18N
(internationalization) support gets triggered only during I/O operations, i.e.
when reading a document or saving one. Let's look first at the reading
sequence:</p><ol>
<li>when a document is processed, we usually don't know the encoding. A
simple heuristic distinguishes UTF-16 and UCS-4 from encodings where
the ASCII range (0-0x7F) maps to ASCII</li>
<li>the XML declaration, if available, is parsed, including the encoding
declaration. At that point, if the autodetected encoding is different
from the one declared, a call to xmlSwitchEncoding() is issued.</li>
<li>If there is no encoding declaration, then the input has to be in either
UTF-8 or UTF-16. If it is not, then at some point when processing the
input, the converter/checker of UTF-8 form will raise an encoding error.
You may end up with a garbled document, or no document at all! Example:
<pre>~/XML -> ./xmllint err.xml
err.xml:1: error: Input is not proper UTF-8, indicate encoding !
<très>là </très>
^
err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
<très>là </très>
^</pre>
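<p>To see why the bytes 0xE8 0x73 are rejected, here is a minimal sketch of a UTF-8 checker. It is not the libxml2 implementation, and the name <code>utf8_first_invalid</code> is hypothetical. In UTF-8, 0xE8 announces a three-byte sequence, so the next byte must be a continuation byte (10xxxxxx), which 0x73 ("s") is not.</p>

```c
#include <stddef.h>

/* Hypothetical helper, not a libxml2 function.
 * Returns the offset of the first byte that starts an invalid UTF-8
 * sequence, or -1 when the whole buffer is valid UTF-8. */
long utf8_first_invalid(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        unsigned long cp;
        size_t len;

        if (c < 0x80) { i++; continue; }            /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return (long)i;       /* stray continuation or bad lead byte */

        if (i + len > n)
            return (long)i;        /* sequence truncated by end of input */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return (long)i;    /* expected a 10xxxxxx byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates and values past U+10FFFF. */
        if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000) || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return (long)i;
        i += len;
    }
    return -1;
}
```

<p>Run on the ISO-8859-1 bytes of the document above, this sketch stops at the 0xE8 byte, the same byte that heads the <code>Bytes:</code> line in the xmllint output.</p>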
</li>
<li>xmlSwitchEncoding() looks up the encoding name, canonicalizes it, and
then searches the default registered encoding converters for that encoding.
If it's not within the default set and iconv() support has been compiled
in, it will ask iconv for such an encoder. If this fails then the parser
will report an error and stop processing:
<pre>~/XML -> ./xmllint err2.xml
err2.xml:1: error: Unsupported encoding UnsupportedEnc
<?xml version="1.0" encoding="UnsupportedEnc"?>
^</pre>
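<p>The lookup step can be pictured with the following sketch. It is only an illustration under stated assumptions: the table <code>registered</code> and the function <code>find_converter</code> are hypothetical stand-ins, and the real library keeps handler structures rather than plain names.</p>

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table standing in for the converters registered by
 * default; the real set lives in encoding.c. */
static const char *const registered[] = {
    "UTF-8", "UTF-16LE", "UTF-16BE", "ISO-8859-1", "ASCII", "HTML"
};

/* Hypothetical helper: returns the index of the converter registered
 * for `name`, or -1 if the name is not in the default set. */
int find_converter(const char *name)
{
    char upper[64];
    size_t i, len = strlen(name);

    if (len >= sizeof upper)
        return -1;
    /* Canonicalize: encoding names are compared case-insensitively. */
    for (i = 0; i < len; i++)
        upper[i] = (char)toupper((unsigned char)name[i]);
    upper[len] = '\0';

    for (i = 0; i < sizeof registered / sizeof registered[0]; i++)
        if (strcmp(upper, registered[i]) == 0)
            return (int)i;
    return -1;  /* a caller would try iconv next, then report an error */
}
```

<p>With this table, a name such as <code>iso-8859-1</code> is found regardless of case, while <code>UnsupportedEnc</code> falls through to the error path shown above.</p>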
</li>
<li>From that point the encoder progressively processes the input (it is
plugged in as a front end to the I/O module) for that entity. It captures
the document to be parsed and converts it on the fly to UTF-8. The parser
itself just does UTF-8 checking of this input and processes it
transparently. The only difference is that the encoding information has
been added to the parsing context (more precisely to the input
corresponding to this entity).</li>
<li>The result (when using DOM) is an internal form completely in UTF-8,
with just the encoding information recorded on the document node.</li>
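<p>The detection heuristic of the first step can be sketched as follows. This is an assumption-laden illustration based on the autodetection appendix of the XML specification, not the libxml2 code, and <code>sniff_encoding</code> is a hypothetical name. It looks at the first bytes of the input for a byte order mark or for the characters <code>&lt;?</code> spread over 2 or 4 bytes.</p>

```c
#include <stddef.h>

/* Hypothetical helper, not a libxml2 function.
 * Returns a guess for encodings that are not ASCII-compatible, or NULL
 * when the declaration can simply be read byte by byte. */
const char *sniff_encoding(const unsigned char *b, size_t n)
{
    if (n >= 4) {
        /* 4-byte patterns first: byte order marks, then '<' alone. */
        if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
            return "UCS-4BE";
        if (b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return "UCS-4LE";
        if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0x00 && b[3] == 0x3C)
            return "UCS-4BE";
        if (b[0] == 0x3C && b[1] == 0x00 && b[2] == 0x00 && b[3] == 0x00)
            return "UCS-4LE";
        /* "<?" encoded on 2 bytes per character. */
        if (b[0] == 0x00 && b[1] == 0x3C && b[2] == 0x00 && b[3] == 0x3F)
            return "UTF-16BE";
        if (b[0] == 0x3C && b[1] == 0x00 && b[2] == 0x3F && b[3] == 0x00)
            return "UTF-16LE";
    }
    if (n >= 2) {
        if (b[0] == 0xFE && b[1] == 0xFF)
            return "UTF-16BE";
        if (b[0] == 0xFF && b[1] == 0xFE)
            return "UTF-16LE";
    }
    return NULL;  /* ASCII-compatible: go on and parse the declaration */
}
```

<p>When the sketch returns NULL, the ASCII range can be read directly, which is what lets the parser reach the encoding declaration in the second step.</p>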
</ol><p>OK, then what happens when saving the document (assuming you
collected/built an xmlDoc DOM-like structure)? It depends on the function
called: xmlSaveFile() will just try to save in the original encoding, while
xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given
encoding:</p><ol>
<li>if no encoding is given, libxml2 will look for an encoding value
associated with the document and, if it exists, will try to save to that
encoding,
<p>otherwise everything is written in the internal form, i.e. UTF-8</p>
</li>
<li>so if an encoding was specified, either at the API level or on the
document, libxml2 will again canonicalize the encoding name and look up a
converter in the registered set or through iconv. If none is found the
function will return an error code</li>
<li>the converter is placed before the I/O buffer layer, as another kind of
buffer. libxml2 will then simply push the UTF-8 serialization through
that buffer, which will progressively be converted and pushed onto
the I/O layer.</li>
<li>It is possible that the converter code fails on some input. For example,
trying to push a UTF-8 encoded Chinese character through the UTF-8 to
ISO-8859-1 converter won't work. Since the encoders are progressive they
will just report the error and the number of bytes converted. At that
point libxml2 will decode the offending character, remove it from the
buffer, replace it with the associated charRef encoding &#123; and
resume the conversion. This guarantees that any document will be saved
without losses (except for markup names where this is not legal; this is
a problem in the current version, so in practice avoid using non-ASCII
characters for tag or attribute names). A special "ascii" encoding name
saves documents in a pure ASCII form and can be used when
portability is really crucial</li>
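<p>The charRef fallback of the last step can be sketched as follows. This is a simplified, non-progressive illustration, not the libxml2 encoder: the names <code>next_code_point</code> and <code>utf8_to_latin1_charref</code> are hypothetical, and the input is assumed to be valid UTF-8, as the internal form always is.</p>

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: decodes one UTF-8 sequence at *p (assumed valid)
 * and advances *p past it. Never reads beyond the 0 terminator. */
static unsigned long next_code_point(const unsigned char **p)
{
    const unsigned char *s = *p;
    unsigned long cp;
    int extra, i;

    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }
    for (i = 1; i <= extra && s[i] != 0; i++)
        cp = (cp << 6) | (s[i] & 0x3F);
    *p = s + i;
    return cp;
}

/* Hypothetical helper: writes `in` (UTF-8) to `out` as ISO-8859-1,
 * replacing characters above U+00FF with a decimal charRef.
 * Returns the number of bytes written, or -1 if `out` is too small. */
int utf8_to_latin1_charref(const char *in, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t used = 0;

    if (outsz == 0)
        return -1;
    while (*p != 0) {
        unsigned long cp = next_code_point(&p);
        char tmp[16];
        int n;

        if (cp < 0x100) {
            tmp[0] = (char)(unsigned char)cp;   /* fits in one byte */
            n = 1;
        } else {
            n = snprintf(tmp, sizeof tmp, "&#%lu;", cp);  /* fallback */
        }
        if (used + (size_t)n + 1 > outsz)
            return -1;
        memcpy(out + used, tmp, (size_t)n);
        used += (size_t)n;
    }
    out[used] = '\0';
    return (int)used;
}
```

<p>In this sketch a Chinese character such as U+4E2D is written as a numeric reference while accented Latin letters are written as single bytes, so no content is lost in the ISO-8859-1 output.</p>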
</ol><p>Here are a few examples based on the same test document and assuming a
terminal using ISO-8859-1 as the text encoding:</p><pre>~/XML -> ./xmllint isolat1
<?xml version="1.0" encoding="ISO-8859-1"?>
<très>là </très>
~/XML -> ./xmllint --encode UTF-8 isolat1
<?xml version="1.0" encoding="UTF-8"?>
<très>là </très>
~/XML -> </pre><p>The same processing is applied (and reuses most of the code) for HTML I18N
processing. Looking up and modifying the content encoding is a bit more
difficult since it is located in a <meta> tag under the <head>,
so a couple of functions, htmlGetMetaEncoding() and htmlSetMetaEncoding(), have
been provided. The parser also attempts to switch encoding on the fly when
detecting such a tag on input. Except for that the processing is the same
(and again reuses the same code).</p><h3><a name="Default" id="Default">Default supported encodings</a></h3><p>libxml2 has a set of default converters for the following encodings
(located in encoding.c):</p><ol>
<li>UTF-8 is supported by default (null handlers)</li>
<li>UTF-16, both little and big endian</li>
<li>ISO-Latin-1 (ISO-8859-1) covering most western languages</li>
<li>ASCII, useful mostly for saving</li>
<li>HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML
predefined entities like &copy; for the Copyright sign.</li>
</ol><p>Moreover, when compiled on a Unix platform with iconv support, the full
set of encodings supported by iconv can instantly be used by libxml. On a
Linux machine with glibc-2.1 the list of supported encodings and aliases fills
3 full pages, and includes UCS-4, the full set of ISO-Latin encodings, and the
various Japanese ones.</p><p>To convert the UTF-8 values returned from the API to another encoding,
it is possible to use the functions provided by <a href="html/libxml-encoding.html">the encoding module</a> like <a href="html/libxml-encoding.html#UTF8Toisolat1">UTF8Toisolat1</a>, or to use the
POSIX <a href="http://www.opengroup.org/onlinepubs/009695399/functions/iconv.html">iconv()</a>
API directly.</p><h4>Encoding aliases</h4><p>Since 2.2.3, libxml2 supports registering encoding name aliases. The
goal is to be able to parse documents whose encoding is supported but where
the name differs (for example from the default set of names accepted by
iconv). The following functions allow registering and handling new aliases for
existing encodings. Once registered, libxml2 will automatically look up the
aliases when handling a document:</p><ul>
<li>int xmlAddEncodingAlias(const char *name, const char *alias);</li>
<li>int xmlDelEncodingAlias(const char *alias);</li>
<li>const char * xmlGetEncodingAlias(const char *alias);</li>
<li>void xmlCleanupEncodingAliases(void);</li>
</ul><h3><a name="extend" id="extend">How to extend the existing support</a></h3><p>Adding support for a new encoding, or overriding one of the encoders
(assuming it is buggy), should not be hard: just write input and output
conversion routines to/from UTF-8, and register them using
xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx). They will be
called automatically if the parser(s) encounter such an encoding name
(register it in uppercase, this will help). The encoders,
their arguments and expected return values are described in the encoding.h
header.</p><p><a href="bugs.html">Daniel Veillard</a></p></td></tr></table></td></tr></table></td></tr></table></td></tr></table></td></tr></table></body></html>