XML stream writer library
tcuvelier at lyx.org
Wed Jan 6 19:55:28 UTC 2021
On Tue, 5 Jan 2021 at 10:37, Joel Kulesza <jkulesza at gmail.com> wrote:
> On Tue, Jan 5, 2021 at 1:19 AM Pavel Sanda <sanda at lyx.org> wrote:
>> On Mon, Jan 04, 2021 at 09:48:42PM +0100, Thibaut Cuvelier wrote:
>> > There are multiple issues here. What is needed to generate HTML and
>> > DocBook is a simple SAX writer, not a parser. I've done plenty of research about
>> > it, there's no XML library that does that. Most of them are using a DOM,
>> > which is a total waste of memory for such an application: it stores a
>> > complete XML tree in memory before serialising it. With SAX, you just need
>> > a string backend, which is much more lightweight (by several factors).
>> After a little more thinking: is using DOM actually that big an issue?
>> I mean, how much does it take? For a document of length n, it's O(n) in space?
>> Sure, it might be cut to constant, but practically speaking, when you have
>> a 100-page document, what is the real time/memory consumption? Timewise,
>> you spend 1s in XML compared to the next 30s converting figures to PDF or
>> whatever format? Spacewise, probably about as much again as what we have
>> already allocated for the document itself.
>> If using a more heavyweight XML lib is not a pain from an API point
>> of view (and I do not know, you are the expert here), then we might
>> actually consider it, given the difficulties in the SAX space?
> I had a similar thought and will note that I've had good success on other
> projects with pugixml.
It's typical to have a DOM tree that is two to five times larger than the
raw text, which is not always negligible (Xerces is close to 2, Java
implementations anywhere between 2 and 5; I haven't checked pugixml or
TinyXML2 for this specific criterion). But that's not the real issue: for
generating HTML and DocBook, for now, DOM is not so useful from a developer's
point of view. DOM is more suitable for handling an existing document or
modifying it, not really for generating one from scratch. A SAX writer is
really the most appropriate choice, given the way LyX is internally structured:
there is very little need to go backward when generating the file (e.g.,
to add something to the header when encountering some LyX inset).
Using DOM will not really simplify the code (I'm speaking for the DocBook
export, which is highly similar to HTML). However, it might make its logic
easier to understand for a newcomer. Nevertheless, DOM comes with more
complex syntax: with SAX, you are only appending content to the file, with
only strings; with DOM, you have to indicate where you want to write
something (with methods like InsertEndChild), and you pass around complete
XML nodes (built from the same strings).
More specifically, in SAX (where stream is mostly a large string object
with helper methods):
With DOM, taking the example of TinyXML2 (where document is the root of the
DOM tree and node the node in the tree that is being filled):
node->InsertEndChild( document->NewElement("tag") );
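To make the SAX side of this contrast concrete, here is a minimal sketch of an append-only writer. The class XmlStream and its methods open, text and close are hypothetical names chosen for illustration; this is not LyX's actual stream class nor the API of any existing library. The only state it keeps is the output string and a stack of the tags still open.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical SAX-style writer, for illustration only.
class XmlStream {
public:
    // Append an opening tag and remember it so it can be closed later.
    XmlStream& open(const std::string& tag) {
        out_ += "<" + tag + ">";
        open_.push_back(tag);
        return *this;
    }

    // Append character data, escaping the characters XML reserves.
    XmlStream& text(const std::string& s) {
        for (char c : s) {
            switch (c) {
            case '&': out_ += "&amp;"; break;
            case '<': out_ += "&lt;"; break;
            case '>': out_ += "&gt;"; break;
            default:  out_ += c;
            }
        }
        return *this;
    }

    // Close the most recently opened tag.
    XmlStream& close() {
        assert(!open_.empty());
        out_ += "</" + open_.back() + ">";
        open_.pop_back();
        return *this;
    }

    const std::string& str() const { return out_; }

private:
    std::string out_;                // everything written so far
    std::vector<std::string> open_;  // tags that still need closing
};

// Usage: content is only ever appended at the end of the output.
std::string sax_example() {
    XmlStream stream;
    stream.open("para").text("a < b").close();
    return stream.str();
}
```

Nothing in this sketch can go back and modify what has already been written, which is exactly the limitation discussed above.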
Both are perfectly good choices, though. If we write a thin layer on top of
a DOM writer (as Riki suggested, this would allow decoupling with the
actual XML library), we might be able to have a syntax close to that of SAX
while having the extra flexibility of DOM. This way, the LyX code would be
clean, and avoid current intricacies to output things at the right place
(in DocBook, especially the <info> tag).
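As a rough sketch of what such a thin layer could look like, the following puts a SAX-like interface on top of a tiny in-memory tree. Everything here is an assumption made for illustration: Node stands in for whatever node type the chosen XML library provides, and DomStream, current and at are hypothetical names. Text escaping is left out to keep the example short.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal DOM node; a real layer would wrap the library's node.
struct Node {
    std::string tag;   // empty for a text node
    std::string text;  // only used by text nodes
    Node* parent = nullptr;
    std::vector<std::unique_ptr<Node>> children;

    Node* append(std::unique_ptr<Node> child) {
        child->parent = this;
        children.push_back(std::move(child));
        return children.back().get();
    }
};

// Hypothetical facade: SAX-like calls that build a tree behind the scenes.
class DomStream {
public:
    explicit DomStream(const std::string& rootTag)
        : root_(std::make_unique<Node>()), cur_(root_.get()) {
        root_->tag = rootTag;
    }

    DomStream& open(const std::string& tag) {
        auto n = std::make_unique<Node>();
        n->tag = tag;
        cur_ = cur_->append(std::move(n));
        return *this;
    }

    DomStream& text(const std::string& s) {
        auto n = std::make_unique<Node>();
        n->text = s;  // no escaping in this sketch
        cur_->append(std::move(n));
        return *this;
    }

    DomStream& close() {
        assert(cur_->parent != nullptr);
        cur_ = cur_->parent;
        return *this;
    }

    // The extra flexibility of DOM: remember a node and return to it later.
    Node* current() const { return cur_; }
    DomStream& at(Node* n) { cur_ = n; return *this; }

    std::string str() const {
        std::string out;
        serialize(*root_, out);
        return out;
    }

private:
    static void serialize(const Node& n, std::string& out) {
        if (n.tag.empty()) { out += n.text; return; }
        out += "<" + n.tag + ">";
        for (const auto& c : n.children) serialize(*c, out);
        out += "</" + n.tag + ">";
    }

    std::unique_ptr<Node> root_;
    Node* cur_;
};

// Usage: write the body first, then go back and add a title to <info>.
std::string dom_example() {
    DomStream s("article");
    s.open("info");
    Node* info = s.current();  // remember where <info> lives
    s.close();
    s.open("para").text("Body").close();
    Node* body = s.current();  // where normal output should resume
    s.at(info).open("title").text("T").close().at(body);
    return s.str();
}
```

The calling code stays close to the append-only style, while the saved Node pointer is what removes the need to output <info> content at exactly the right moment.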
More specifically, @Pavel: for DocBook, you spend 0% of your time dealing
with images, as it's supposed to be done by the DocBook processor
afterwards. Any gain in the XML part of LyX will be noticeable by the user
for large documents (book-sized).
(And I won't say that something being O(n) is negligible in this case: I'm
using daily exponential-time algorithms that work so much faster than