<div dir="ltr"><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 7 Jan 2021 at 18:23, Thibaut Cuvelier <<a href="mailto:tcuvelier@lyx.org" target="_blank">tcuvelier@lyx.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="auto"><div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 7 Jan 2021, 12:52 Lorenzo Bertini, <<a href="mailto:lorenzobertini97@gmail.com" target="_blank">lorenzobertini97@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I think almost all the options are on the table at this point. For the <br>

sake of completeness I think it's worth mentioning the DOM library Boost <br>

Property Tree, which popped up frequently while searching.<br>

<br>

I think Thibaut is right when saying that, for the way LyX is structured <br>

now, a SAX writer would be more appropriate, because we won't work on <br>

XML directly, but convert the LyX file. However, most of the libraries <br>

have a DOM approach, and if someday we convert the LyX format to <br>

something XML-like, we might have to start all of this again.<br>

<br>

I did a small benchmark with pugixml, reading and writing an XML <br>

document of 2.2 MB (equivalent to ~100-120 pages chock full of math): it <br>

takes negligible time to both read and write on my really modest laptop <br>

(A10-9600). Peak memory consumption was 14 MB, but since some MathML was <br>

corrupted (it has trouble with the backslash \), it might be <br>

much lower once fixed. LyX's memory consumption when opening the corresponding LyX file <br>

was ~120 MB. The benchmark table in <br>

<<a href="http://rapidxml.sourceforge.net/manual.html#namespacerapidxml_1performance" rel="noreferrer noreferrer" target="_blank">http://rapidxml.sourceforge.net/manual.html#namespacerapidxml_1performance</a>> <br>

seems to indicate that pugixml and RapidXML are only about one <br>

order of magnitude slower than strlen, so I don't think parse time will ever be a <br>

problem.</blockquote></div></div><div dir="auto"><br></div><div>Thanks for your benchmark. For me, the major difference between the two libraries is that pugixml is still maintained, whereas RapidXML is not really. That matters because XML parsing is very often a source of security problems (not just XXE).<br></div><div dir="auto"><br></div><div dir="auto"></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

I'm unfamiliar with the concept of "wrapping" libraries and "layers": is <br>

it when you write your own classes and methods on top of some common <br>

stuff those libraries do, so if for whatever reason you have to switch <br>

you can "plug in" another one easily?<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Yes, exactly.</div></div></div></blockquote><div><br></div><div>Below is my take on <a href="https://stackoverflow.com/questions/9387610/what-xml-parser-should-i-use-in-c">https://stackoverflow.com/questions/9387610/what-xml-parser-should-i-use-in-c</a> and <a href="https://github.com/fffaraz/awesome-cpp#xml">https://github.com/fffaraz/awesome-cpp#xml</a> </div><div><br></div><div>XPath would be very useful if LyX switches to an XML representation (easy queries on an XML document, think of SQL for XML). <br></div><div>XSLT is a way to describe transformations from XML to anything. If LyX switches to an XML representation, it might be used to replace C++ exporters (but formula conversion will be a pain!). It might lower the barrier to entry for new contributors, even though XSLT is not an easy language.</div><div>XQuery is a scripting language for processing XML. <br></div><div>Outside Java libraries, only the 1.0 versions of these standards are implemented, which really limits their use (except for XPath). A state-of-the-art implementation of the current standards is Saxon, which has a C binding.<br></div><div><br></div><div>For validating XML files (i.e. checking that they respect some grammar), DTD is the oldest way (inherited from SGML); XML Schema adds many features over DTD (like types). The best technology nowadays is RelaxNG (it's not recent: 2005), which is much more powerful than XML Schema.</div><div><br></div><div>XInclude is the XML way of specifying includes of other files (not necessarily XML). 
Think of \input in LaTeX or LyX child documents, with a few more features.</div><div><br></div><div>Namespaces are similar to those of C++, and are especially useful when mixing several standards (like MathML and DocBook).<br></div><div><br></div><div>A tour of some C and C++ libraries for XML:</div><div>- RapidXML: mostly unmaintained since 2013, no support for namespaces (except in forks: <a href="https://github.com/dwd/rapidxml">https://github.com/dwd/rapidxml</a>)<br></div><div>- Boost Property Tree: no XML parser of its own, which limits further use (it can use RapidXML though, see above)</div><div>- libstudxml: C++ library, designed for speed, no DOM<br></div><div>- libxml2: C library, designed for features and not speed (also includes XPath and XSLT, DTD and XML Schema, namespaces), "mature" and barely evolving anymore<br></div><div>- libxml++: depends on glibmm2</div><div>- Xerces-C++: C++ library, designed for features and not speed

 (also includes XPath, DTD and XML Schema, namespaces), "mature" and barely evolving anymore; no XSLT (Xalan could be used, but it only works with an ancient version of Xerces; XQilla implemented XPath 2, but has not been developed since 2016)</div><div>- Expat: C library, designed for speed, no DOM by default (provided by <a href="https://github.com/kolotsey/expat-dom">https://github.com/kolotsey/expat-dom</a>), with namespaces<br></div><div>- tinyxml2: C++ library, designed for speed only (also includes XPath through the unmaintained <a href="https://github.com/stanthomas/tinyxml2-ex">https://github.com/stanthomas/tinyxml2-ex</a>, no validation, no namespaces), mature and slowly evolving</div><div>- pugixml: C++ library, designed for speed with a few features (like XPath, no validation, no namespaces), mature and evolving <br></div><div>- libroxml: C library, no clear design goal (includes XPath, namespaces, no validation), evolving</div><div>- Saxon-C: C/C++ wrapper of the state-of-the-art Java library, largest set of features (XPath and XSLT 3, DTD and XML Schema validation -- extension for RelaxNG: <a href="http://www.cfoster.net/saxon-jing/">http://www.cfoster.net/saxon-jing/</a> --, namespaces), very mature, really evolving (both performance and features), but it requires a JVM (Excelsior is built-in, even though it has not been maintained for quite a long time)<br></div><div>- Qt: no, I was joking :). Qt XML is not supported anymore; it's recommended to switch to QXmlStreamReader and QXmlStreamWriter (which are only SAX-like). Qt XML Patterns used to have XPath, XSLT, and XML Schema, but it was deprecated a while ago (Qt 5.13 for the last wake-up call, but it hasn't been touched since Qt 4, basically)<br></div><div><br></div><div>If LyX is being really serious about XML (i.e. moving as many things as possible to XML technologies), Saxon is probably the way to go. Otherwise, it's going to be too heavy to ship Saxon and a JVM along with LyX. 
In that case, pugixml seems to me like a good choice: a few features (XPath is the most relevant for LyX and is included in the base library, no need for add-ons), good performance, and still maintained (so there is a chance of getting bugs fixed in a newer version and security vulnerabilities taken care of). <br></div></div></div>
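<div><br></div><div>To make the "wrapping layer" idea from earlier in the thread a bit more concrete, here is a minimal sketch of a SAX-style writer hidden behind an interface. All names (XmlWriter, StreamXmlWriter, writeParagraph, renderSample) are hypothetical: they come neither from LyX nor from any of the libraries above. The backend here writes directly to a stream using only the standard library; the assumption is that a pugixml- or libxml2-based backend could implement the same interface later without touching the export code.</div>

```cpp
#include <ostream>
#include <sstream>
#include <stack>
#include <string>

// Hypothetical interface: export code only ever talks to this class,
// so the backend behind it can be swapped.
class XmlWriter {
public:
    virtual ~XmlWriter() = default;
    virtual void startElement(const std::string& name) = 0;
    virtual void text(const std::string& content) = 0;
    virtual void endElement() = 0;
};

// One possible backend: SAX-style output straight to a stream.
// No document tree is kept in memory, only the stack of open elements.
class StreamXmlWriter : public XmlWriter {
public:
    explicit StreamXmlWriter(std::ostream& out) : out_(out) {}

    void startElement(const std::string& name) override {
        out_ << '<' << name << '>';
        open_.push(name);
    }

    void text(const std::string& content) override {
        // Escape the characters that are special in XML character data.
        for (char c : content) {
            switch (c) {
                case '&': out_ << "&amp;"; break;
                case '<': out_ << "&lt;"; break;
                case '>': out_ << "&gt;"; break;
                default:  out_ << c;
            }
        }
    }

    void endElement() override {
        if (open_.empty()) return;  // nothing left to close
        out_ << "</" << open_.top() << '>';
        open_.pop();
    }

private:
    std::ostream& out_;
    std::stack<std::string> open_;
};

// Stand-in for an exporter: it depends on the interface, not on a library.
void writeParagraph(XmlWriter& writer, const std::string& body) {
    writer.startElement("para");
    writer.text(body);
    writer.endElement();
}

// Small helper that exercises the writer and returns the generated XML.
std::string renderSample() {
    std::ostringstream out;
    StreamXmlWriter writer(out);
    writeParagraph(writer, "a < b & c");
    return out.str();
}
```

<div>Calling renderSample() yields a single escaped para element. Attributes, namespaces, and error handling are deliberately left out; the point is only that switching libraries would mean writing one new subclass of the interface.</div>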