TeX to HTML
I want to create a TeX to HTML service so that I can upload notes in TeX and convert them to HTML. I am going to try to start a habit of reading a research paper per day, taking notes in TeX, and uploading the notes on this site.
Introduction
LaTeX to HTML conversion has long been tricky. There are basically three approaches:
- Assume that your LaTeX code is basically just LaTeX-flavored markup, and use a converter that reads such simple LaTeX, like pandoc. If your code is simple, this works great.
- Use a package that compiles your LaTeX using LaTeX itself but provides added information in the resulting file, then turn that file into HTML. That's the approach TeX4ht and lwarp use.
- LaTeXML: This is a reimplementation of the TeX kernel, but it outputs to XML instead of to DVI. It natively supports a large number of popular packages and classes, and packages it does not support can be loaded and "compiled" using the --includestyles flag.
  - This is the solution that Richard Zach recommends.
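The first approach can be sketched in a few lines of shell. This is only an illustration: it assumes pandoc is installed, and notes.tex is a hypothetical file name. The helper names html_name and pandoc_convert are made up for this sketch.

```shell
#!/bin/sh
# Approach 1: treat simple LaTeX as markup and let pandoc convert it.
# notes.tex is a hypothetical input file; the function names are illustrative.

# Derive the output name: notes.tex -> notes.html
html_name() {
    printf '%s.html\n' "${1%.tex}"
}

pandoc_convert() {
    # -s writes a complete page; --mathjax leaves the math for MathJax to render.
    pandoc -f latex -t html5 -s --mathjax -o "$(html_name "$1")" "$1"
}

# Only attempt the conversion when pandoc and the input file are present.
if command -v pandoc >/dev/null 2>&1 && [ -f notes.tex ]; then
    pandoc_convert notes.tex
fi
```

This tends to work only while the notes avoid custom macros and unusual packages, which is the limitation the first bullet describes.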
TeX4HT
TeX4ht is a system for converting documents written in TeX/LaTeX/ConTeXt/etc. to HTML, various XML flavors, braille, etc., optionally using MathML.
Features
- it supports most LaTeX packages and custom commands
- it supports various input formats
- extensive support for modification of the output
Links
- ChangeLog
- source repository
- bug db
- Work-in-progress documentation
- The original documentation
- TeX4ht: HTML production: a tutorial/introductory article on using and customizing TeX4ht.
- TeX4ht: LaTeX to Web publishing: more recent tutorial on customizing TeX4ht.
- tex4ht setup and cheat sheet from Nasser Abbasi
Basic Invocation for Modern Output
TeX4ht can be invoked in several ways. The original way is to use the htlatex command. To convert a LaTeX source file.tex to HTML5 that uses UTF-8, with MathML:
$ htlatex file.tex "xhtml,html5,mathml,charset=utf-8" " -cunihtf -utf8"
An easier way is to use make4ht. The following command produces the same output as the previous one, HTML5 in UTF-8 encoding with MathML:
$ make4ht file.tex "mathml"
# If you want MathJax to render the MathML
$ make4ht file.tex "mathml,mathjax"
# Perhaps the best method of all is to insert the LaTeX into the HTML output and have MathJax render it
$ make4ht file.tex "mathjax"
Documentation
TeX4ht is a system that converts LaTeX to various output formats, including HTML.
Basic Usage
$ make4ht filename.tex
By default, TeX4ht converts to HTML. You can convert to other formats using the -f option:
$ make4ht -f odt filename.tex
Because TeX4ht requires you to edit the .tex file, I think I am going to stick with LaTeXML.
LWarp
The lwarp package converts LaTeX to HTML by using LaTeX to process the user’s document and directly generate HTML tags. External utility programs are only used for the final conversion of text and images. Math may be represented by SVG images or MathJax. More than 500 packages and classes are supported, of which more than 60 also support MathJax.
LaTeX is popular due to the language's visibility, stability, and portability of plain-text markup, regular expression search and replace of both text and formatting commands, easy revision control, the ability to handle large and complex documents, extensive programming capabilities, and the large number of user-supplied packages solving real-world problems. In many cases, it's still faster to type a few arguments than it is to open a dialog box and select and fill in entries, and a powerful programming text editor is more responsive than a word processor.
LaTeX to HTML conversion is needed because of the rise of self-publishing and the need for scientists, professors, and engineers to publish their own papers on their own websites.
Both HTML5 and CSS3 are quite capable, to the point where they can be used to produce technical books. Nevertheless, there are some practical problems to overcome in order to create a good conversion from LaTeX to HTML.
LaTeX to HTML with lwarp
The lwarp package produces an HTML version of your document with accessibly rendered mathematics while allowing you to use macros, theorem environments, tikz pictures, and all the other bells and whistles you are used to. This note tells you how to get started with lwarp.
Long Story Short
Starting with file.tex, add the following snippet right after the \documentclass line:
\usepackage[mathjax]{lwarp}
Run pdflatex on file.tex as usual, enough times to resolve references. Then, in the terminal, invoke this incantation:
lwarpmk html
I am not going to use lwarp, so I am going to stop reading here. LaTeXML seems easier.
LaTeXML: A LaTeX to XML/MathML/HTML Converter
In the process of developing the Digital Library of Mathematical Functions, we needed a means of transforming the LaTeX sources of our material into XML, which would be used for further manipulations, rearrangements and construction of the website. In particular, a true 'Digital Library' should focus on the semantics of the material, so we should convert the mathematical material into both content and presentation MathML. At the time, we found no software suited to our needs, so we began development of LaTeXML in-house.
The approach of this software is to emulate TeX as far as possible (in Perl), converting the TeX/LaTeX document into LaTeXML's XML format. That format is then further transformed into HTML of various flavors, with MathML and SVG.
Usage
In most cases, all that should be needed to convert a TeX file, say mydoc.tex, to XML and then to HTML would be:
$ latexml --dest=mydoc.xml mydoc
$ latexmlpost --dest=somewhere/mydoc.html mydoc.xml
This will carry out the default transformation into HTML5, which represents mathematics using MathML. Different file extensions (or the --format option) imply different output formats, including XHTML, HTML 4 with images for math, JATS and TEI. There are also options to split large documents into several pages, or to combine multiple documents into a single site.
The functionality of latexml and latexmlpost is conveniently combined into the single executable latexmlc, which does not create the intermediate XML file. The above commands are equivalent to:
$ latexmlc --dest=somewhere/mydoc.html mydoc
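For the daily-notes workflow, the single latexmlc call could be wrapped in a small batch script. The following is a minimal sketch, not a tested service: the directory names notes and site and the helper names html_target and convert_all are assumptions made for illustration.

```shell
#!/bin/sh
# Convert every .tex file in a source directory to HTML5 with latexmlc.
# SRC_DIR and OUT_DIR are hypothetical defaults; adjust them to your layout.
SRC_DIR=${SRC_DIR:-notes}
OUT_DIR=${OUT_DIR:-site}

# Map notes/paper.tex to site/paper.html.
html_target() {
    name=$(basename "$1" .tex)
    printf '%s/%s.html\n' "$OUT_DIR" "$name"
}

convert_all() {
    mkdir -p "$OUT_DIR"
    for tex in "$SRC_DIR"/*.tex; do
        # If no .tex files exist, the glob stays literal; skip it.
        [ -e "$tex" ] || continue
        latexmlc --dest="$(html_target "$tex")" "$tex" ||
            echo "conversion failed: $tex" >&2
    done
}

# Run only when latexmlc is actually installed.
if command -v latexmlc >/dev/null 2>&1; then
    convert_all
fi
```

A failed conversion is reported on standard error and the loop continues, so one broken note does not block the rest.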
Download
$ sudo dnf install LaTeXML # RPM-based systems; this installs the prerequisites as well
$ sudo yum install LaTeXML # RPM-based alternative
C:> choco install latexml # Windows; this may require TeX to be installed
Prerequisites
These are installed using the commands above.
- Perl Modules
- Image::Magick or Graphics::Magick
- UUID::Tiny
- perl-doc
Installing Prerequisites:
$ sudo dnf install \
    perl-Archive-Zip perl-DB_File perl-File-Which \
    perl-Getopt-Long perl-Image-Size perl-IO-String perl-JSON-XS \
    perl-libwww-perl perl-Parse-RecDescent perl-Pod-Parser \
    perl-Text-Unidecode perl-Test-Simple perl-Time-HiRes perl-URI \
    perl-XML-LibXML perl-XML-LibXSLT \
    perl-UUID-Tiny texlive ImageMagick ImageMagick-perl # RPM-based systems
This software is in the public domain and is not subject to copyright protection.
The Manual
The design goals of LaTeXML are:
- Faithful emulation of TeX's behavior
- Easily extensible
- Lossless, preserving both semantic and presentation cues
- Use an abstract LaTeX-like, extensible document type
- Infer the semantics of the mathematical content
Using LaTeXML
The main commands provided by the system are:
- latexml: for converting TeX and BibTeX sources to XML
  - Converts a TeX document (or standard input) to XML. It loads any required definition bindings, then reads, tokenizes, expands and digests the document, creating an XML structure. It then performs some document rewriting, parses the mathematical content and writes the result to an XML file.
  - Useful options:
    - --verbose or --quiet: control how much progress and debugging detail is printed during processing. Either can be given multiple times to get more or less detail.
    - --path={directory}: adds a directory to search (in addition to the working directory) for various files.
    - --includestyles: tells LaTeXML to process style files (it does not process these files by default).
- latexmlpost: for various postprocessing tasks, including conversion to HTML, processing images, conversion to MathML and so on
  - The command carries out a set of appropriate transformations in sequence:
    - scanning of labels and ids
      - Collects information about all labels, ids, indexing commands, cross-references and so on, to be used in the following stages.
    - filling in the index and bibliography
      - An index is built from \index markup.
      - When a document contains a request for bibliographies, typically due to the \bibliography{..} command, the postprocessor will look for the named bibliographies.
    - cross-referencing
      - In this stage, the scanned information is used to fill in the text and links of cross-references within the document. The option --urlstyle can control the format of URLs within the document.
    - conversion of math
      - Specific conversions of the mathematics can be requested with these options:
        - --mathimages: converts math to PNG images
        - --presentationmathml: creates Presentation MathML
        - --contentmathml: creates Content MathML
        - --openmath: creates OpenMath
        - --keepXMath: preserves XMath
    - conversion of graphics and picture environments to web format
      - Conversion of graphics (e.g., from the graphic(s|x) packages' \includegraphics) can be enabled or disabled using --graphicsimages or --nographicsimages. Similarly, the conversion of picture environments can be controlled with --pictureimages or --nopictureimages.
    - applying an XSLT stylesheet
      - If you wish to restyle the generated HTML either by adding CSS or by customizing the XSLT, change its functionality by adding JavaScript, or even generate an alternative output format with XSLT, some combination of the following options will be useful:
        - --nodefaultresources: omits the default resources
        - --css=stylesheet.css: adds a new CSS stylesheet
        - --javascript=program.js: adds a JavaScript file
  - The output format is determined by the file extension of the --destination option or by the --format option. The recognized formats are:
    - html or html5
    - html4
    - xhtml
    - xml
- latexmlc: combines both latexml and latexmlpost into a single command, with some extra functionality
$ latexmlc doc.tex --dest=doc.html # Converts doc to simple HTML5 document
$ latexml --dest=doc.xml doc # converts TeX to XML
$ latexmlpost doc.xml --dest=doc.html # converts XML to HTML
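The two-step route is useful when the options listed above need to be set per stage. The sketch below chains latexml and latexmlpost with a few of those options. The file names doc.tex, styles and notes.css are hypothetical, and the run and tex_to_html helpers plus the DRY_RUN switch are conveniences invented for this example, not part of LaTeXML.

```shell
#!/bin/sh
# Two-step conversion: latexml writes XML, latexmlpost turns it into HTML.
# doc.tex, styles/ and notes.css are hypothetical names.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

tex_to_html() {
    base=${1%.tex}
    # --path adds a search directory; --includestyles processes style files.
    run latexml --quiet --path=styles --includestyles \
        --dest="$base.xml" "$1" || return 1
    # --format selects the output flavor; --css adds a stylesheet.
    run latexmlpost --format=html5 --css=notes.css \
        --dest="$base.html" "$base.xml"
}
```

Calling tex_to_html doc.tex with DRY_RUN=1 set prints both command lines, which is a cheap way to check the options before running a long conversion.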
Architecture
The casual user needs only a superficial understanding of the architecture. The processing is broken into the following stages:
- Digestion
- Construction
- Rewriting
- Math Parsing
- Serialization
Customization
The processing of the document, its conversion into XML and ultimately to XHTML or other formats can be customized in various ways, at different stages of processing and at different levels of complexity. By far the easiest way to customize the style of the output is to modify the CSS.
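As a small illustration of the CSS route, the sketch below writes a stylesheet and hands it to latexmlc with --css. The file names notes.css and mydoc.tex are hypothetical, and the ltx_* class names are assumptions about the generated markup; inspect the HTML that LaTeXML produces to see which classes it actually uses.

```shell
#!/bin/sh
# Write a small stylesheet and pass it to latexmlc with --css.
# notes.css and mydoc.tex are hypothetical file names; the ltx_* selectors
# are assumed class names and should be checked against the real output.
cat > notes.css <<'EOF'
.ltx_page_main { max-width: 42em; margin: 0 auto; }
.ltx_title     { font-family: sans-serif; }
EOF

# Convert only when latexmlc and the input file are present.
if command -v latexmlc >/dev/null 2>&1 && [ -f mydoc.tex ]; then
    latexmlc --css=notes.css --dest=mydoc.html mydoc.tex
fi
```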
Commands
- latexml [options] texfile
  - Transforms a TeX/LaTeX file into XML.
  - If texfile is '-', latexml reads the TeX source from standard input. If texfile has an explicit extension of .bib, it is processed as a BibTeX bibliography.
- latexmlpost [options] xmlfile
  - Postprocesses an XML file generated by latexml to perform common tasks, such as converting math to images and processing graphics inclusions for the web.
- latexmlc
  - An omni-executable for LaTeXML, capable of stand-alone, socket-server and web-service conversion. Supports both core processing and post-processing.
  - Can be used to create ePub documents.
- latexmlmath [options] texmath
  - Transforms a TeX/LaTeX math expression into various formats.
  - If texmath is '-', latexmlmath reads the TeX from standard input. If any of the output files are '-', the result is printed on standard output.
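A quick way to try latexmlmath is to send a single expression to standard output. This is a minimal sketch, assuming LaTeXML is installed; the quadratic formula is an arbitrary example and math_to_mathml is a made-up helper name.

```shell
#!/bin/sh
# Convert one TeX math expression to Presentation MathML on standard output.
# --pmml=- names '-' as the output file, so the result goes to stdout.
math_to_mathml() {
    latexmlmath --pmml=- "$1"
}

# Run only when latexmlmath is actually installed.
if command -v latexmlmath >/dev/null 2>&1; then
    math_to_mathml '\frac{-b\pm\sqrt{b^2-4ac}}{2a}'
fi
```

Single quotes keep the shell from interpreting the backslashes in the expression.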