XML stream writer library

Thibaut Cuvelier tcuvelier at lyx.org
Fri Jan 8 02:00:34 UTC 2021

On Thu, 7 Jan 2021 at 18:23, Thibaut Cuvelier <tcuvelier at lyx.org> wrote:

> On Thu, 7 Jan 2021, 12:52 Lorenzo Bertini, <lorenzobertini97 at gmail.com>
> wrote:
>> I think almost all the options are on the table at this point. For the
>> sake of completeness I think it's worth mentioning DOM library Boost
>> Property Tree, which popped up frequently while searching.
>> I think Thibaut is right when saying that, for the way LyX is structured
>> now, a SAX writer would be more appropriate, because we won't work on
>> xml directly, but convert the LyX file. However most of the libraries
>> have a DOM approach, and also, if someday we'll convert LyX format to
>> something xml-like, we might have to start all of this again.
>> I did a small benchmark with pugixml, both reading and writing an XML
>> document of 2.2 MB (equivalent to ~100-120 pages chock-full of math): it
>> takes negligible time to both read and write on my really modest laptop
>> (A10-9600). Peak memory consumption was 14 MB, but since some MathML was
>> corrupted (it has trouble with the backslash \), it's possible it might be
>> way less once fixed; LyX's consumption when opening the corresponding LyX
>> file was ~120 MB. The benchmark table in
>> <http://rapidxml.sourceforge.net/manual.html#namespacerapidxml_1performance>
>> seems to indicate that pugixml and RapidXML are just one order of
>> magnitude slower than strlen, so I don't think parse time will ever be a
>> problem.
> Thanks for your benchmark. For me, the major difference between the two
> libraries is that pugixml is still maintained, but not really RapidXML. And
> XML parsing is very often a source of security problems (not just XXE).
>> I'm unfamiliar with the concept of "wrapping" libraries and "layers": is
>> it when you write your own classes and methods on top of some common
>> stuff those libraries do, so if for whatever reason you have to switch
>> you can "plug" another easily?
> Yes, exactly.
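
To make the "layer" idea concrete, here is a minimal sketch in plain C++.
Everything in it is an assumption made for illustration: XmlSink, StreamSink,
writeParagraph and paragraphToString are hypothetical names that exist
neither in LyX nor in any of the libraries discussed. The exporter code only
sees the abstract interface, so a backend built on pugixml or libxml2 could
replace the hand-written one without touching the exporters.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical interface: the only XML API the exporters would depend on.
class XmlSink {
public:
    virtual ~XmlSink() = default;
    virtual void startElement(const std::string& name) = 0;
    virtual void text(const std::string& content) = 0;
    virtual void endElement() = 0;
};

// One possible backend: SAX-like output straight to a stream.
// A library-based backend would implement the same three methods.
class StreamSink : public XmlSink {
public:
    explicit StreamSink(std::ostream& os) : os_(os) {}

    void startElement(const std::string& name) override {
        os_ << '<' << name << '>';
        open_.push_back(name);  // remember the name for the closing tag
    }

    void text(const std::string& content) override {
        // Escape the characters that are not allowed in character data.
        for (char c : content) {
            switch (c) {
                case '<': os_ << "&lt;"; break;
                case '>': os_ << "&gt;"; break;
                case '&': os_ << "&amp;"; break;
                default:  os_ << c;
            }
        }
    }

    void endElement() override {
        assert(!open_.empty());  // closing without an open element is a bug
        os_ << "</" << open_.back() << '>';
        open_.pop_back();
    }

private:
    std::ostream& os_;
    std::vector<std::string> open_;  // stack of currently open elements
};

// Exporter code: written against the interface, not against a library.
void writeParagraph(XmlSink& out, const std::string& content) {
    out.startElement("para");
    out.text(content);
    out.endElement();
}

// Convenience helper for trying the sketch out.
std::string paragraphToString(const std::string& content) {
    std::ostringstream os;
    StreamSink sink(os);
    writeParagraph(sink, content);
    return os.str();
}
```

Calling paragraphToString("a < b & c") goes through the whole chain
(exporter, interface, backend) and returns the escaped paragraph as a string.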

Below is my take on
and https://github.com/fffaraz/awesome-cpp#xml

XPath would be very useful if LyX switches to an XML representation (easy
queries on an XML document, think of SQL for XML).
XSLT is a way to describe transformations from XML to anything. If LyX
switches to an XML representation, it might be used to replace C++
exporters (but formula conversion will be a pain!). It might lower the
barrier to entry for new contributors, even though XSLT is not an easy
language.
XQuery is a scripting language for processing XML.
Apart from Java libraries, implementations only support version 1.0 of
these standards, which really limits their use (XPath being the exception).
A state-of-the-art implementation of the current standards is Saxon, which
has a C binding.

To allow for validation of XML files (i.e. checking that they respect some
grammar), DTD is the oldest way (inherited from SGML); XML Schema adds many
features over DTD (like types). The best technology nowadays is RelaxNG
(it's not recent: 2005), which is much more powerful than XML Schema.

XInclude is the XML way of specifying includes of other files (not
necessarily XML). Think \input in LaTeX or LyX child documents with a few
more features.

Namespaces are similar to those of C++, and are especially useful when
mixing several standards (like MathML and DocBook).
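
As an illustration of that mixing, here is a small sketch in plain C++ that
builds the start tag of a DocBook root element declaring MathML under the
prefix "m". The helpers qualifiedName and startTag are hypothetical, written
only for this example; the two namespace URIs are the ones defined by DocBook
5 and MathML. The URIs are written out without escaping, which is only safe
because they are fixed strings here.

```cpp
#include <map>
#include <string>

// Namespace URIs of the two standards being mixed.
const std::string kDocBookNs = "http://docbook.org/ns/docbook";
const std::string kMathMLNs  = "http://www.w3.org/1998/Math/MathML";

// Hypothetical helper: "m" + "math" -> "m:math"; an empty prefix means the
// element lives in the default namespace and keeps its local name.
std::string qualifiedName(const std::string& prefix, const std::string& local) {
    return prefix.empty() ? local : prefix + ":" + local;
}

// Hypothetical helper: builds a start tag carrying namespace declarations.
// The map goes from prefix to URI; the empty prefix declares the default
// namespace. std::map iterates in key order, so the default comes first.
std::string startTag(const std::string& name,
                     const std::map<std::string, std::string>& namespaces) {
    std::string tag = "<" + name;
    for (const auto& ns : namespaces) {
        tag += ns.first.empty() ? std::string(" xmlns")
                                : " xmlns:" + ns.first;
        tag += "=\"" + ns.second + "\"";
    }
    return tag + ">";
}

// DocBook as the default namespace, MathML under the prefix "m".
std::string mixedRootTag() {
    return startTag("article", {{"", kDocBookNs}, {"m", kMathMLNs}});
}
```

Inside such a document, DocBook elements are written without a prefix
(para, section) while formulae use qualifiedName("m", "math") and friends,
so both vocabularies can coexist without name clashes.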

A tour of some C++ libraries for XML:
- RapidXML: mostly unmaintained since 2013, no support for namespaces
(except in forks: https://github.com/dwd/rapidxml)
- Boost Property Tree: no XML parser of its own (it relies on RapidXML, see
above), which limits further use
- libstudxml: C++ library, designed for speed, no DOM
- libxml2: C library, designed for features and not speed (also includes
XPath and XSLT, DTD and XML Schema, namespaces), "mature" and barely
evolving anymore
- libxml++: depends on glibmm2
- Xerces-C++: C++ library, designed for features and not speed (also
includes XPath, DTD and XML Schema, namespaces), "mature" and barely
evolving anymore; no XSLT (Xalan could be used, but it only works with an
ancient version of Xerces; XQilla implemented XPath 2, but has not been
developed since 2016)
- Expat: C library, designed for speed, no DOM by default (provided by
https://github.com/kolotsey/expat-dom), with namespaces
- tinyxml2: C++ library, designed for speed only (XPath is only available
through the unmaintained https://github.com/stanthomas/tinyxml2-ex; no
validation, no namespaces), mature and slowly evolving
- pugixml: C++ library, designed for speed with a few features (XPath, but
no validation and no namespaces), mature and evolving
- libroxml: C library, no clear design goal (includes XPath, namespaces, no
validation), evolving
- Saxon-C: C/C++ wrapper of the state-of-the-art Java library, largest
feature set (XPath and XSLT 3, DTD and XML Schema validation --
extension for RelaxNG: http://www.cfoster.net/saxon-jing/ --, namespaces),
very mature, actively evolving (both performance and features), but it
requires a JVM (Excelsior is built in, even though it has not been
maintained for quite a long time)
- Qt: no, I was joking :). Qt XML is not supported anymore; it's
recommended to switch to QXmlStreamReader and QXmlStreamWriter (which are
only SAX-like). Qt XML Patterns used to have XPath, XSLT, and XML Schema,
but it was deprecated a while ago (Qt 5.13 for the last wake-up call,
but it has basically not been touched since Qt 4)

If LyX gets really serious about XML (i.e. moving as many things as
possible to XML technologies), Saxon is probably the way to go. Otherwise,
it's going to be too heavy to ship Saxon and a JVM along with LyX. Instead,
pugixml seems to me like a good choice: a few features (XPath is the most
relevant for LyX, and it is included in the base library, no need for
add-ons), good performance, still maintained (there is a chance of having
bugs fixed in a newer version, plus security vulnerabilities taken care of).