Pandoc Documentation Notes
I am learning about `pandoc` because I might use it to make my Jupyter Notebook to HTML conversion system better. I also think I am going to use it for a project - a universal file converter.
References
If you need to convert files from one markup format into another, pandoc is your swiss-army knife. Pandoc can convert between many formats. [...] Pandoc understands a number of useful markdown syntax extensions, including document metadata, footnotes, tables, definition lists, superscript and subscript, strikeout, enhanced ordered lists, running example lists ...
LaTeX math (and even macros) can be used in markdown documents. Pandoc includes a powerful system for automatic citations and bibliographies. There are many ways to customize pandoc to fit your needs, including a template system and a powerful system for writing filters.
Getting started
Install pandoc
# Windows (using the Chocolatey package manager)
C:\> choco install pandoc
# Linux
$ # Pandoc is packaged in the Debian, Ubuntu, Slackware, Arch, Fedora, NixOS, openSUSE, Gentoo
$ # and Void repositories, but the packaged version may be outdated.
$ # To install a current release instead, download a binary from the GitHub project page:
$ wget https://github.com/jgm/pandoc/releases/download/3.5/pandoc-3.5-linux-amd64.tar.gz # Download the binary release
$ tar -xvf pandoc-3.5-linux-amd64.tar.gz # Extract the archive
$ sudo mv pandoc-3.5/bin/pandoc /usr/local/bin/ # Move the binary to a system directory
$ pandoc --version # Check the version
Open a Terminal
Windows: In the cmd window, type chcp 65001 before using pandoc, to set the encoding to UTF-8.
To verify that pandoc is installed, type pandoc --version. You should see a message telling you which version of pandoc is installed.
Using pandoc as a filter
When pandoc is invoked without specifying any input files, it operates as a "filter", taking input from the terminal and sending its output back to the terminal. You can use this feature to play around with pandoc. By default, input is interpreted as pandoc markdown, and output is HTML. But you can change that:
$ pandoc -f html -t markdown # converting HTML to markdown
Pandoc can often figure out the input and output formats from the filename extensions.
User's Guide
$ pandoc [options] [input-file]...
Description
Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. Pandoc can convert between numerous markup and word processing formats, including, but not limited to, various flavors of Markdown, HTML, LaTeX, and Word docx. Pandoc can also produce a PDF.
Pandoc's enhanced version of Markdown includes syntax for tables, definition lists, metadata blocks, footnotes, citations, math, and much more. Pandoc has a modular design: it consists of a set of readers, which parse text in a given format and produce a native representation of the document (an abstract syntax tree or AST), and a set of writers, which convert this native representation into a target format. Thus, adding an input or output format requires only adding a reader or writer. Users can also run custom pandoc filters to modify the intermediate AST.
One should not expect perfect conversions between every format and every other. Pandoc attempts to preserve the structural elements of a document, but not formatting details such as margin size. Some complex elements, such as complex tables, may not fit into pandoc's simple document model.
While conversions from pandoc’s Markdown to all formats aspire to be perfect, conversions from formats more expressive than pandoc’s Markdown can be expected to be lossy.
Using Pandoc
If no input-files are specified, input is read from stdin. Output goes to stdout by default. For output to a file, use the -o option:
$ pandoc -o output.html input.txt
By default, pandoc produces a document fragment. To produce a standalone document, use the -s flag:
$ pandoc -s -o output.html input.txt
The input and output formats can be specified with the -f and -t options respectively.
$ pandoc -f html -t markdown hello.html
If the input or output format is not specified explicitly, pandoc will attempt to guess it from the extensions of the filenames. If no input file is specified (so that input comes from stdin), or if the input files' extensions are unknown, the input format will be assumed to be Markdown.
Pandoc uses UTF-8 character encoding for input and output. If your local character encoding is not UTF-8, you should pipe input and output through iconv:
$ iconv -t utf-8 input.txt | pandoc | iconv -f utf-8
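When pandoc is driven from Python (for example in a notebook conversion script), the same transcoding can be done without iconv. This is a minimal sketch, assuming the source text is Latin-1; the function name `to_utf8` and the sample bytes are illustrative, not part of pandoc.

```python
def to_utf8(raw: bytes, source_encoding: str = "latin-1") -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical sample: "café" stored in Latin-1 uses one byte for "é".
latin1_bytes = "café".encode("latin-1")
# After transcoding, "é" is the two-byte UTF-8 sequence pandoc expects.
utf8_bytes = to_utf8(latin1_bytes)
```

The resulting bytes can then be passed to pandoc on stdin, and pandoc's UTF-8 output can be re-encoded the same way in the opposite direction if the local encoding requires it.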
To produce a PDF, specify an output file with a .pdf extension. By default, pandoc will use LaTeX to create the PDF, so LaTeX must be installed on your machine. Alternatively, pandoc can use ConTeXt, roff ms, or HTML as an intermediate format. To do this, specify an output file with a .pdf extension, as before, but add the --pdf-engine option. You can control the PDF style using variables. See what's required for LaTeX here.
Instead of an input file, you can read from the web. Pandoc will fetch the content using HTTP:
$ pandoc -f html -t markdown https://www.fsf.org
$ # You can specify custom headers
$ pandoc -f html -t markdown --request-header User-Agent:"Mozilla/5.0" \
https://www.fsf.org
General Options
- List of General Options
- List of Read Options
- List of Write Options
- Options Affecting Specific Writers
- Exit Codes for Pandoc
Pandoc as a Web Server
If you rename (or symlink) the pandoc executable to pandoc-server, or if you call pandoc with server as the first argument, it will start up a web server with a JSON API. This server exposes most of the conversion functionality of pandoc. For full documentation, see the pandoc-server man page.
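A client for the JSON API might look like the sketch below. It only builds the HTTP request; sending it needs a running server. The root endpoint, the default port 3030, and the field names `text`, `from`, and `to` reflect my reading of the pandoc-server man page and should be checked against the installed version. The helper name `build_request` is hypothetical.

```python
import json
import urllib.request


def build_request(text, source="markdown", target="html",
                  url="http://localhost:3030/"):
    """Build a POST request asking the server to convert `text`.

    Assumption: the server listens on port 3030 and accepts a JSON body
    with "text", "from" and "to" fields at the root endpoint.
    """
    payload = {"text": text, "from": source, "to": target}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "text/plain"},
        method="POST",
    )


request = build_request("Hello *world*")

# Sending the request requires a running server:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect before any network traffic happens.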
A Note on Security
- Although pandoc will not create or modify any files other than those you explicitly ask it to create, a filter or custom writer could in principle do anything on your file system.
- Several input formats support include directives that allow the contents of a file to be included in the output. An untrusted attacker could use these to view the contents of files on the file system. The --sandbox option protects against this threat.
- Several output formats will embed encoded or raw images into the output file. An untrusted attacker could exploit this to view the contents of non-image files on the file system.
- Pandoc parsers can exhibit pathological performance on some corner cases. It is wise to put any pandoc operations under a timeout, to avoid DoS attacks that exploit these issues. If you are using the pandoc executable, you can add the command line options +RTS -M512M -RTS (for example) to limit the heap size to 512MB.
- The HTML generated by pandoc is not guaranteed to be safe. If raw_html is enabled for the Markdown input, users can inject any arbitrary HTML. Even if it is disabled, users can include dangerous content in URLs and attributes. To be safe, you should run all HTML generated from untrusted user input through an HTML sanitizer.
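Several of these precautions can be combined when calling pandoc from Python. The sketch below assumes pandoc is on the PATH; the function names and the 10-second timeout are my own choices, while --sandbox and the +RTS -M512M -RTS heap limit come from the notes above.

```python
import subprocess


def build_command(source_format, target_format, heap_limit="512M"):
    """Return a pandoc argument list with sandboxing and a heap limit."""
    return [
        "pandoc",
        "--sandbox",                        # block file access from include directives
        "-f", source_format,
        "-t", target_format,
        "+RTS", f"-M{heap_limit}", "-RTS",  # cap the Haskell runtime heap
    ]


def convert_untrusted(text, source_format="markdown", target_format="html",
                      timeout_seconds=10):
    """Convert untrusted text; raises subprocess.TimeoutExpired on timeout."""
    result = subprocess.run(
        build_command(source_format, target_format),
        input=text,
        capture_output=True,
        encoding="utf-8",    # pandoc reads and writes UTF-8
        timeout=timeout_seconds,
        check=True,          # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


command = build_command("markdown", "html")
```

The returned HTML is still untrusted, so it would need to pass through an HTML sanitizer before being served.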
Pandoc Filters
Pandoc provides an interface for users to write programs (known as filters) which act on pandoc's AST. Pandoc consists of a set of readers and writers. When converting a document from one format to another, text is parsed by a reader into pandoc's intermediate representation of the document - an "abstract syntax tree" or AST - which is then converted by the writer into the target format. A "filter" is a program that modifies the AST, between the reader and the writer.
INPUT --reader--> AST --filter--> AST --writer--> OUTPUT
Pandoc supports two kinds of filters:
- Lua Filters: Use the Lua language to define transformations on the pandoc AST. They are described in a separate document.
- JSON Filters: Described here, these are pipes that read from standard input and write to standard output, consuming and producing a JSON representation of the pandoc AST. Lua filters don't require any extra software and are usually faster than JSON filters, but JSON filters can be written in any programming language.
source format
↓
(pandoc)
↓
JSON-formatted AST
↓
(JSON filter)
↓
JSON-formatted AST
↓
(pandoc)
↓
target format
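The "(JSON filter)" step in this pipeline can be sketched in plain Python with no extra libraries. The example upper-cases every Str element. It assumes the JSON AST encodes each element as an object with a "t" (type) key and an optional "c" (contents) key; the sample document and its API version number are illustrative, and `walk`, `shout`, and `apply_filter` are hypothetical names.

```python
import json


def walk(node, action):
    """Rebuild the tree, replacing any element for which action returns a value."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            replacement = action(node)
            if replacement is not None:
                return replacement
        return {key: walk(value, action) for key, value in node.items()}
    return node  # strings, numbers, booleans and None are left unchanged


def shout(element):
    """Upper-case the text of Str elements; return None for everything else."""
    if element["t"] == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return None


def apply_filter(json_text):
    """Take a JSON-formatted AST as text and return the filtered AST as text."""
    return json.dumps(walk(json.loads(json_text), shout))


# Illustrative document: a single paragraph reading "hello world".
sample = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Str", "c": "hello"},
        {"t": "Space"},
        {"t": "Str", "c": "world"},
    ]}],
}
result = json.loads(apply_filter(json.dumps(sample)))
```

To use this as a real filter, the script would end with `sys.stdout.write(apply_filter(sys.stdin.read()))` (after `import sys`) and be passed to pandoc through the --filter option.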