Pandoc Documentation Notes

I am learning about `pandoc` because I might use it to make my Jupyter Notebook to HTML conversion system better. I also think I am going to use it for a project - a universal file converter.


References

If you need to convert files from one markup format into another, pandoc is your swiss-army knife. Pandoc can convert between many formats. [...] Pandoc understands a number of useful markdown syntax extensions, including document metadata, footnotes, tables, definition lists, superscript and subscript, strikeout, enhanced ordered lists, running example lists ...

LaTeX math (and even macros) can be used in markdown documents. Pandoc includes a powerful system for automatic citations and bibliographies. There are many ways to customize pandoc to fit your needs, including a template system and a powerful system for writing filters.

Getting started

Install pandoc

# Windows
C:\> choco install pandoc
# Linux
$ # First check whether the pandoc version in your package manager is up to date.
$ # Pandoc is in the Debian, Ubuntu, Slackware, Arch, Fedora, NixOS, openSUSE, Gentoo
$ # and Void repositories.
$ # If it is outdated or missing, install a binary release from the GitHub project page:
$ wget https://github.com/jgm/pandoc/releases/download/3.5/pandoc-3.5-linux-amd64.tar.gz # Download the binary release
$ tar -xvf pandoc-3.5-linux-amd64.tar.gz # Extract the archive
$ sudo mv pandoc-3.5/bin/pandoc /usr/local/bin/ # Move the executable onto the PATH
$ pandoc --version # Check the version

Open a Terminal

Windows: In the cmd window, type chcp 65001 before using pandoc, to set the encoding to UTF-8.

To verify that pandoc is installed, type pandoc --version. You should see a message telling you which version of pandoc is installed.

Using pandoc as a filter

When pandoc is invoked without specifying any input files, it operates as a "filter", taking input from the terminal and sending its output back to the terminal. You can use this feature to play around with pandoc. By default, input is interpreted as pandoc markdown, and output is HTML. But you can change that:

$ pandoc -f html -t markdown # converting HTML to markdown

Pandoc can often figure out the input and output formats from the filename extensions.

User's Guide

$ pandoc [options] [input-file]...

Description

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. Pandoc can convert between numerous markup and word processing formats, including, but not limited to, various flavors of Markdown, HTML, LaTeX, and Word docx. Pandoc can also produce a PDF.

Pandoc's enhanced version of Markdown includes syntax for tables, definition lists, metadata blocks, footnotes, citations, math, and much more. Pandoc has a modular design: it consists of a set of readers, which parse text in a given format and produce a native representation of the document (an abstract syntax tree or AST), and a set of writers, which convert this native representation into a target format. Thus, adding an input or output format requires only adding a reader or writer. Users can also run custom pandoc filters to modify the intermediate AST.

One should not expect perfect conversions between every format and every other. Pandoc attempts to preserve the structural elements of a document, but not formatting details such as margin size. Some complex elements, such as complex tables, may not fit into pandoc's simple document model.

While conversions from pandoc’s Markdown to all formats aspire to be perfect, conversions from formats more expressive than pandoc’s Markdown can be expected to be lossy.

Using Pandoc

If no input-files are specified, input is read from stdin. Output goes to stdout by default. For output to a file, use the -o option:

$ pandoc -o output.html input.txt

By default, pandoc produces a document fragment. To produce a standalone document, use the -s flag:

$ pandoc -s -o output.html input.txt

The input and output formats can be specified with the -f and -t options respectively.

$ pandoc -f html -t markdown hello.html

If the input or output format is not specified explicitly, pandoc will attempt to guess it from the extensions of the filenames. If no input file is specified (so that the input comes from stdin), or if the input files' extensions are unknown, the input format will be assumed to be Markdown.

Pandoc uses UTF-8 character encoding for input and output. If your local character encoding is not UTF-8, you should pipe input and output through iconv:

$ iconv -t utf-8 input.txt | pandoc | iconv -f utf-8

To produce a PDF, specify an output file with a .pdf extension. By default, pandoc will use LaTeX to create the PDF, so this route requires LaTeX to be installed on your machine. Alternatively, pandoc can use ConTeXt, roff ms, or HTML as an intermediate format. To do this, specify an output file with a .pdf extension, as before, but add the --pdf-engine option. You can control the PDF style using variables, depending on the intermediate format used. See what's required for LaTeX here.

Instead of an input file, you can read from the web. Pandoc will fetch the content using HTTP:

$ pandoc -f html -t markdown https://www.fsf.org
$ # You can specify custom headers
$ pandoc -f html -t markdown --request-header User-Agent:"Mozilla/5.0" \
  https://www.fsf.org

General Options

Pandoc as a Web Server

If you rename (or symlink) the pandoc executable to pandoc-server, or if you call pandoc with server as the first argument, it will start up a web server with a JSON API. This server exposes most of the conversion functionality of pandoc. For full documentation, see the pandoc-server man page.
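To get a feel for the JSON API, here is a minimal client sketch using only the Python standard library. It assumes a server is already running locally on port 3030 (the default, as far as I can tell from the pandoc-server man page) and that the root endpoint accepts a POST with `text`, `from` and `to` fields. The function names `build_conversion_request` and `convert` are my own, and the details should be checked against the man page before relying on them.

```python
import json
import urllib.request


def build_conversion_request(text, source_format="markdown", target_format="html",
                             server_url="http://localhost:3030/"):
    """Build (but do not send) a POST request for a pandoc-server conversion.

    Assumption: the server listens on localhost:3030 and accepts a JSON body
    with the fields "text", "from" and "to".
    """
    payload = {"text": text, "from": source_format, "to": target_format}
    return urllib.request.Request(
        server_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "text/plain"},
        method="POST",
    )


def convert(text, **kwargs):
    """Send the request and return the response body as a string."""
    request = build_conversion_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

With a server started via `pandoc server`, calling `convert("*hello*")` should return the converted HTML as plain text. Building the request is kept separate from sending it so the payload can be inspected without a running server.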

A Note on Security

  1. Although pandoc will not create or modify any files other than those you explicitly ask it to create, a filter or custom writer could in principle do anything on your file system.
  2. Several input formats support include directives that allow the contents of a file to be included in the output. An untrusted attacker could use these to view the contents of files on the file system. The --sandbox option protects against this threat.
  3. Several output formats will embed encoded or raw images into the output file. An untrusted attacker could exploit this to view the contents of non-image files on the file system.
  4. Pandoc parsers can exhibit pathological performance on some corner cases. It is wise to put any pandoc operations under a timeout, to avoid DoS attacks that exploit these issues. If you are using the pandoc executable, you can add the command line options +RTS -M512M -RTS (for example) to limit the heap size to 512MB.
  5. The HTML generated by pandoc is not guaranteed to be safe. If raw_html is enabled for the Markdown input, users can inject any arbitrary HTML. Even if it is disabled, users can include dangerous content in URLs and attributes. To be safe, you should run all HTML generated from untrusted user input through an HTML sanitizer.
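For my file converter, points 2 and 4 could be combined in a small wrapper like the sketch below. It passes `--sandbox` and the `+RTS -M512M -RTS` heap limit from the notes above, and relies on the `timeout` argument of Python's `subprocess.run` to stop runaway conversions. The function names and the 30-second default are my own choices, not anything pandoc prescribes. It does not address point 5: the HTML would still need a sanitizer.

```python
import subprocess


def build_pandoc_command(input_path, output_path, heap_limit="512M"):
    """Return the argument list for a sandboxed, memory-capped pandoc run."""
    return [
        "pandoc",
        "--sandbox",                       # block include directives from reading other files
        "+RTS", f"-M{heap_limit}", "-RTS",  # cap the heap size of the pandoc process
        "-o", output_path,
        input_path,
    ]


def convert_untrusted(input_path, output_path, timeout_seconds=30):
    """Run pandoc on untrusted input; return True if the conversion succeeded.

    subprocess.run kills the child process and raises TimeoutExpired when the
    timeout is exceeded, which is treated here as a failed conversion.
    """
    command = build_pandoc_command(input_path, output_path)
    try:
        completed = subprocess.run(command, capture_output=True, text=True,
                                   timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return False
    return completed.returncode == 0
```

Passing the arguments as a list (rather than a shell string) also keeps user-supplied file names from being interpreted by a shell.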

Pandoc Filters

Pandoc provides an interface for users to write programs (known as filters) which act on pandoc's AST. Pandoc consists of a set of readers and writers. When converting a document from one format to another, text is parsed by a reader into pandoc's intermediate representation of the document - an "abstract syntax tree" or AST - which is then converted by the writer into the target format. A "filter" is a program that modifies the AST, between the reader and the writer.

INPUT --reader--> AST --filter--> AST --writer--> OUTPUT

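Pandoc supports two kinds of filters: Lua filters, which run in the Lua interpreter built into pandoc, and JSON filters, which are external programs that read a JSON-formatted AST on stdin and write the modified AST to stdout. A JSON filter fits into the conversion like this: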

            
                      source format
                            ↓
                         (pandoc)
                            ↓
                    JSON-formatted AST
                            ↓
                      (JSON filter)
                            ↓
                    JSON-formatted AST
                            ↓
                         (pandoc)
                            ↓
                      target format
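To make the JSON filter step concrete, here is a minimal sketch of such a filter written with only the Python standard library (no helper package). It relies on the fact that elements in the JSON AST are objects with a type tag `"t"` and contents `"c"`, and it turns every `Emph` element into `Strong`. The names `walk`, `emph_to_strong` and `run_filter` are hypothetical, and a real filter would probably use a helper library instead of a hand-rolled walk.

```python
import json
import sys


def walk(node, action):
    """Recursively rebuild a JSON AST, applying `action` to every tagged element."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            node = action(node)  # elements carry their type in the "t" field
        return {key: walk(value, action) for key, value in node.items()}
    return node  # strings, numbers, booleans and null pass through unchanged


def emph_to_strong(node):
    """Replace emphasis with strong emphasis, keeping the contents."""
    if node.get("t") == "Emph":
        return {"t": "Strong", "c": node["c"]}
    return node


def run_filter(source, sink):
    """Read a JSON-formatted AST from `source` and write the modified AST to `sink`."""
    document = json.load(source)
    json.dump(walk(document, emph_to_strong), sink)
```

Saved as a script that ends with `run_filter(sys.stdin, sys.stdout)`, this slots into the pipeline in the diagram: `pandoc -t json input.md | python emph_to_strong.py | pandoc -f json -t html`, where the script name is just an example.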
