unicodesymbols: have several commands for a single symbol?

Thibaut Cuvelier tcuvelier at lyx.org
Mon Feb 14 02:24:40 UTC 2022


On Sun, 13 Feb 2022 at 09:04, Jürgen Spitzmüller <spitz at lyx.org> wrote:

> On Sunday, 13.02.2022 at 04:19 +0100, Thibaut Cuvelier wrote:
> > You mean, with code like
> >
> https://github.com/cburschka/lyx/blob/d3c335a5d524e2edeb73ae1a891fcc58ba5bfd1a/src/BiblioInfo.cpp#L421-L428
> > for the search? I thought it would be good to have a file to store
> > this information, but I wasn't aware of unicodesymbols. I believe
> > that the file shouldn't even be modified at all, thanks to the
> > presence of the Unicode character number at the beginning of the line
> > (0x00c0 "\\`{A}", with 0xC0 corresponding to 192,
> >
> https://github.com/cburschka/lyx/blob/master/src/insets/InsetERT.cpp#L131
> > ).
> >
> > Based on the contents of unicodesymbols, how could I match " \`{A}",
> > "\`A", and  "\` A" at once? Should I just use tricks like
> >
> https://github.com/cburschka/lyx/blob/d3c335a5d524e2edeb73ae1a891fcc58ba5bfd1a/src/BiblioInfo.cpp#L414-L418
> > (which I'm already doing, in a sense, in
> >
> https://github.com/cburschka/lyx/blob/master/src/insets/InsetERT.cpp#L452-L463
> > )?
>
> I don't know how to do it exactly, but yes, I mean that the information
> you need here should all be in unicodesymbols, or added if not, and
> could be retrieved by the methods defined in Encoding.cpp.
>
> There should be no need to store LaTeX<>Unicode mappings anywhere else.
>

Thanks, I just did that (with a small test file): a460097823.

However, this test showed a limitation in the current unicodesymbols: there
can be only one LaTeX command per symbol. This matters in only a few cases,
like \textexclamdown and !`: both of them map to ¡ (U+00A1), but the
file only allows for one mapping.

I would have no problem saying that this is a corner case that can be
easily ignored, but after all I dived into Unicode mapping within ERTs for
DocBook to handle corner cases… (Albeit not in Spanish.) From a
memory-consumption point of view, supporting several commands for one
symbol would require storing more than one string in CharInfo, potentially
even a vector of strings for all entries (even those that have only one
command): that's a 24-byte overhead per entry (
https://stackoverflow.com/a/34035291/1066843) for roughly 4000 entries;
that's not so large.
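
As a back-of-the-envelope check of that estimate, here is a minimal sketch.
The function name is hypothetical and not part of LyX. The 24-byte figure is
the size of std::vector on common 64-bit standard libraries and is an
assumption, not a guarantee. The heap allocation each non-empty vector makes
for its elements is not counted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of LyX: fixed cost of holding one
// std::vector<std::string> per entry, before any command is stored in it.
// bytesPerEntry defaults to the size of the vector object on this platform.
constexpr std::size_t fixedOverhead(
    std::size_t entries,
    std::size_t bytesPerEntry = sizeof(std::vector<std::string>))
{
    return entries * bytesPerEntry;
}
```

With the numbers from this message, fixedOverhead(4000, 24) is 96000 bytes,
i.e. a bit under 100 kB.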

If we decide to solve this problem, there are several possible solutions (all
modifying Encodings::read). I could think of two:
- either use a separator symbol in the latexcommand part of each
unicodesymbols line, but it would be hard to find a single character that
is never used for latexcommands
- or have multiple lines for a single character, with duplicate information
for the second one or a simpler line format for these entries. For
instance, for the inverted exclamation mark:

0x00a1 "\\textexclamdown"         "" "force=cp862;cp1255;euc-jp;euc-jp-platex;euc-kr;utf8-platex" # INVERTED EXCLAMATION MARK
0x00a1 "!`" # Implicitly, all the other parameters still apply
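
To make the second option more concrete, here is a minimal sketch of a
reader that accepts several lines for the same code point and accumulates
their commands. It is not the actual Encodings::read: the names CommandMap,
firstQuotedField and readSymbols are hypothetical, only the code point and
the first quoted field are parsed, and escaped quotes inside a field are not
handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container: each code point maps to one or more LaTeX commands.
using CommandMap = std::map<std::uint32_t, std::vector<std::string>>;

// Returns the first double-quoted field of a line, turning each escaped
// backslash pair into a single backslash. Returns an empty string if the
// line has no complete quoted field.
std::string firstQuotedField(const std::string& line)
{
    const auto open = line.find('"');
    if (open == std::string::npos)
        return {};
    const auto close = line.find('"', open + 1);
    if (close == std::string::npos)
        return {};
    const std::string raw = line.substr(open + 1, close - open - 1);
    std::string out;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '\\' && i + 1 < raw.size() && raw[i + 1] == '\\')
            ++i; // skip the first backslash of an escaped pair
        out += raw[i];
    }
    return out;
}

// Reads simplified unicodesymbols-style lines. A repeated code point appends
// its command instead of replacing the previous one.
CommandMap readSymbols(std::istream& in)
{
    CommandMap commands;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#')
            continue; // blank line or comment
        std::istringstream fields(line);
        std::string codePoint;
        fields >> codePoint; // e.g. "0x00a1"
        const std::string command = firstQuotedField(line);
        if (command.empty())
            continue;
        // std::stoul with base 16 accepts the leading "0x".
        const auto key = static_cast<std::uint32_t>(std::stoul(codePoint, nullptr, 16));
        commands[key].push_back(command);
    }
    return commands;
}
```

Fed the two example lines above, such a reader would store both
\textexclamdown and !` under 0x00a1, in file order, so the first line can
stay the preferred command for export.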

What do you think of this? Should this be done? What would be the preferred
solution, if so? (Of course, I offer to do this refactoring :).)