Converting TeX to HTML: was Looking for a serious TeX hacker
Ulrike Fischer
news3 at nililand.de
Tue Jun 20 11:43:44 CEST 2023
On Mon, 19 Jun 2023 18:54:44 -0700, William F Hammond via
texhax wrote:
>> Why should the LaTeX team work on XML to pdf conversion?
>
> I intended my question to inquire about a pipeline approach, something
> like this diagram:
>
>
>   LaTeX source
>        |  LaTeXML
>        v
>   LaTeXML's XML
>        |  sgmlspl or XSLT or ...
>        v
>   LaTeX source with hooks for tagging
>        |  tag-capable pdflatex
>        v
>   tagged PDF
>
>
> In the pipeline above the role of LaTeXML's XML could be played
> by any sufficiently structured XML document type modeling LaTeX.
> The problem, whether the target is HTML+MathML or tagged PDF,
> always is getting the user to write something with sufficient
> structure. LaTeXML does a rather impressive job of creating
> formal structure where the user has been informal.
I understand the pipeline, but why do you think that the LaTeX team
is best placed to implement it? I mean, I know quite a lot about
LaTeX coding, TeX engines and tagged PDF. I can tell you what a LaTeX
source that should be tagged should look like (but this is a moving
target). But I know next to nothing about the XML produced by LaTeXML
or about XSLT transformations.
Generally: such a transformation is typically not lossless. You
don't get an identical-looking PDF. For journals and authors that
use LaTeX to get a specific layout, that would be a problem.
So to make it worthwhile it would need a large advantage over the
direct workflow LaTeX source -> tagged PDF, and I don't quite see
that advantage. LaTeXML can only use the existing source, and if it
can guess how to make a LaTeX source that can be tagged and can
transform something informal into something more formal, LaTeX
should be able to do that too, albeit perhaps more slowly.
Also, the main task is not to enhance the source - we actually want
as many documents as possible to be taggable with only minimal
changes to the source - but to adapt the internal code of the kernel
and packages.
As an example: with a current LaTeX you can tag a standard document
with sectioning commands, toc and lists. But if you try that with
the blindtext package it errors:
\DocumentMetadata{testphase={phase-III}}
\documentclass{article}
\usepackage[toc]{blindtext}
\begin{document}
\blinddocument
\end{document}
The problem is that blindtext has a small bug: at the end of a list
it forgets to call an internal command that the tagging code
requires. So we have to find this bug, develop a patch and add it.
This is currently done in a firstaid module, and if you load
it, the document compiles and gives a tagged PDF:
\DocumentMetadata{testphase={phase-III,firstaid}}
\documentclass{article}
\usepackage[toc]{blindtext}
\begin{document}
\blinddocument
\end{document}
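For comparison, the standard case mentioned earlier (sectioning
commands, toc and lists, without blindtext) could look roughly like
the following. This is only a minimal sketch, assuming a current
LaTeX with the same testphase setting as above; the section title and
the list items are made-up placeholders:
% Sketch only: the titles and items are placeholders, not from a real document.
\DocumentMetadata{testphase={phase-III}}
\documentclass{article}
\begin{document}
\tableofcontents
\section{A section}
Some text before a list.
\begin{itemize}
\item first item
\item second item
\end{itemize}
\end{document}
As with any table of contents, it needs a second compilation before
the toc shows up.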
--
Ulrike Fischer
http://www.troubleshooting-tex.de/