Reading Some More About Pandoc Before Improving Markdown Implementation

I wanted to read some more about Pandoc before improving my Markdown-to-HTML system. I currently do this with marked.js, but I want to switch to pandoc to improve speed.

Pandoc Lua Filters

Pandoc has long supported filters, which allow the pandoc abstract syntax tree (AST) to be manipulated between the parsing and writing phase. Traditional Pandoc Filters accept a JSON representation of the pandoc AST and produce an altered JSON representation of the AST. They may be written in any programming language, and invoked from pandoc using the --filter option.

Although traditional filters are very flexible, they have a couple of disadvantages. First, there is some overhead in writing JSON to stdout and reading it from stdin (twice, once on each side of the filter). Second, whether a filter will work depends on the details of the user's environment. A filter may require an interpreter for a certain programming language to be available, as well as a library for manipulating the pandoc AST in JSON form.

Pandoc makes it possible to write filters in Lua without any external dependencies at all. A Lua interpreter and a Lua library for creating pandoc filters are built into the pandoc executable. Pandoc data types are marshaled to Lua directly, avoiding the overhead of writing JSON to stdout and reading it from stdin. Here is an example of a Lua filter that converts strong emphasis to small caps:

return {
  Strong = function (elem)
    return pandoc.SmallCaps(elem.content)
  end,
}

or equivalently,

function Strong(elem)
  return pandoc.SmallCaps(elem.content)
end

This says: walk the AST, and when you find a Strong element, replace it with a SmallCaps element with the same content. To run it, save it in a file (smallcaps.lua) and invoke pandoc with --lua-filter=smallcaps.lua.

Filter Performance Comparison:

| Command | Time |
| pandoc | 1.01s |
| pandoc --filter ./smallcaps | 1.36s |
| pandoc --filter ./smallcaps.py | 1.40s |
| pandoc --lua-filter ./smallcaps.lua | 1.03s |

The Lua filter comes within 0.02s of running pandoc with no filter at all, because it avoids the substantial overhead of marshaling to and from JSON over a pipe.

Lua Filter Structure

Lua filters are tables with element names as keys and values consisting of functions acting on those elements. Filters are expected to be put into separate files and are passed via the --lua-filter command-line argument. For example, if a filter is defined in a file current-date.lua, then it would be applied like this:

$ pandoc --lua-filter=current-date.lua -f markdown MANUAL.txt 

The --lua-filter option may be supplied multiple times. Pandoc applies all filters (including JSON filters specified via --filter and Lua filters specified via --lua-filter) in the order they appear on the command line. Pandoc expects each Lua file to return a list of filters. The filters in that list are called sequentially, each on the result of the previous filter. If the filter script returns no value, pandoc will try to generate a single filter by collecting all top-level functions whose names correspond to those of pandoc elements (e.g. Str, Para, Meta, or Pandoc). That is why the two examples above are equivalent.

For each filter, the document is traversed and each element is subjected to the filter. Elements for which the filter contains an entry (i.e. a function of the same name) are passed to the Lua element filtering function. In other words, filter entries will be called for each corresponding element in the document, getting the respective element as input.

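The return value of a filter function must be one of the following:

  1. nil: the element is left unchanged.
  2. A pandoc element: it must be of the same type as the input and replaces the original element.
  3. A list of pandoc elements: these replace the original element and are spliced into the list it belonged to. Returning an empty list deletes the element.

The function's output must result in an element of the same type as the input. This means a filter function acting on an inline element must return either nil, an inline, or a list of inlines, and a function filtering a block element must return one of nil, a block, or a list of block elements. Pandoc will throw an error if this condition is violated. Elements without matching functions are left untouched.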
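As a rough sketch of the permitted return values for an inline filter function (nil, a single inline, or a list of inlines), the filter below handles Str elements three different ways. The trigger words 'TODO' and 'pandoc' are made up purely for illustration.

```lua
-- Sketch only: the trigger words are hypothetical.
-- Str is an inline element, so this function may return nil,
-- a single inline, or a list of inlines.
function Str(elem)
  if elem.text == 'TODO' then
    -- An empty list removes the element from the document.
    return {}
  elseif elem.text == 'pandoc' then
    -- A single inline replaces the original element.
    return pandoc.Emph({ pandoc.Str('pandoc') })
  end
  -- Falling off the end returns nil, which leaves the element unchanged.
end
```

Saved as, say, words.lua, it would be applied with --lua-filter=words.lua in the same way as the smallcaps example.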

Filters on Element Sequences

For some filtering tasks, it is necessary to know the order in which elements occur in the document. It is not enough then to inspect a single element at a time. There are two special function names, which can be used to define filters on lists of blocks or lists of inlines.

  1. Inlines (inlines): If present in a filter, this function will be called on all lists of inline elements, like the content of a Para (paragraph) block, or the description of an Image. The inlines argument passed to the function will be a List of Inline elements for each call.
  2. Blocks (blocks): If present in a filter, this function will be called on all lists of block elements, like the content of a MetaBlocks meta element block, on each item of a list, and the main content of the Pandoc document. The blocks argument passed to the function will be a List of Block elements for each call.

These filter functions are special in that the result must either be nil, in which case the list is left unchanged, or must be a list of the correct type, i.e., the same type as the input argument. Single elements are not allowed as return values, as a single element in this context usually hints at a bug.
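Here is a minimal sketch of an Inlines filter that needs to look at neighbouring elements. The rule itself is an assumption chosen for illustration: a Space that follows an all-digit Str and precedes another Str is replaced by a non-breaking space, so "5 km" cannot be split across lines.

```lua
-- Sketch only: the "number followed by a word" rule is a made-up example.
function Inlines(inlines)
  -- Start at 2 and stop one short of the end so both neighbours exist.
  for i = 2, #inlines - 1 do
    local prev, cur, nxt = inlines[i - 1], inlines[i], inlines[i + 1]
    if cur.t == 'Space'
        and prev.t == 'Str' and prev.text:match('^%d+$')
        and nxt.t == 'Str' then
      -- Replace the breakable space with a non-breaking space (U+00A0).
      inlines[i] = pandoc.Str('\u{a0}')
    end
  end
  -- Return the whole list, never a single element.
  return inlines
end
```

A filter on individual Space elements could not do this, because a single element carries no information about what comes before or after it.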

Traversal Order

The traversal order of filters can be selected by setting the key traverse to either topdown or typewise; the default is typewise.

local filter = {
  traverse = 'topdown',
  -- ... filter functions ...
}
return filter

Typewise Traversal

Element filter functions within a filter set are called in a fixed order, skipping any which are not present:

  1. functions for Inline elements
  2. the Inlines filter function
  3. functions for Block elements
  4. the Blocks filter function
  5. the Meta filter function, and last
  6. the Pandoc filter function

It is still possible to force a different order by explicitly returning multiple filter sets. For example, if the filter for Meta is to be run before that for Str one can write:

-- ... filter definitions ...

return {
  { Meta = Meta },  -- (1)
  { Str = Str }     -- (2)
}

Filters are applied in the order in which they are returned. All functions in set (1) are thus run before those in (2), causing the filter function for Meta to be run before filtering of Str elements is started.

Topdown Traversal

It is sometimes more natural to traverse the document tree depth-first from the root towards the leaves, and all in a single run. For example, a block list [Plain [Str "a"], Para [Str "b"]] will try the following filter functions, in order: Blocks, Plain, Inlines, Str, Para, Inlines, Str. Topdown traversals can be cut short by returning false as a second value from the filter function. No child-element of the returned element is processed in that case.
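As a sketch of cutting a topdown traversal short, the filter below uppercases every Str except those inside a Div carrying the class verbatim. The class name is a hypothetical convention, not something pandoc defines.

```lua
-- Sketch only: the class name 'verbatim' is a made-up convention.
local filter = {
  traverse = 'topdown',

  Div = function(div)
    if div.classes:includes('verbatim') then
      -- Returning false as the second value stops the traversal here,
      -- so the Str function below never sees the children of this Div.
      return div, false
    end
  end,

  Str = function(str)
    -- pandoc.text.upper is UTF-8 aware, unlike Lua's string.upper.
    return pandoc.Str(pandoc.text.upper(str.text))
  end,
}

return filter
```

With the default typewise traversal this would not work as intended: all Str functions run before any Div function, so the text inside the Div would already be uppercased.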

Global Variables

Pandoc passes additional data to Lua filters by setting global variables:

| FORMAT | The global FORMAT is set to the format of the pandoc writer being used (html5, latex, etc.), so the behavior of a filter can be made conditional on the eventual output format. |
| PANDOC_READER_OPTIONS | Table of the options which were provided to the parser. (ReaderOptions) |
| PANDOC_WRITER_OPTIONS | Table of the options that will be passed to the writer. While the object can be modified, the changes will not be picked up by pandoc. (WriterOptions) Accessing this variable in custom writers is deprecated. Starting with pandoc 3.0, it is set to a placeholder value (the default options) in custom writers. Access to the actual writer options is provided via the Writer or ByteStringWriter function, to which the options are passed as the second function argument. Since: pandoc 2.17 |
| PANDOC_VERSION | Contains the pandoc version as a Version object which behaves like a numerically indexed table, most significant number first. E.g., for pandoc 2.7.3, the value of the variable is equivalent to a table {2, 7, 3}. Use tostring(PANDOC_VERSION) to produce a version string. This variable is also set in custom writers. |
| PANDOC_API_VERSION | Contains the version of the pandoc-types API against which pandoc was compiled. It is given as a numerically indexed table, most significant number first. E.g., if pandoc was compiled against pandoc-types 1.17.3, then the value of the variable will behave like the table {1, 17, 3}. Use tostring(PANDOC_API_VERSION) to produce a version string. This variable is also set in custom writers. |
| PANDOC_SCRIPT_FILE | The name used to invoke the filter. This value can be used to find files relative to the script file. This variable is also set in custom writers. |
| PANDOC_STATE | The state shared by all readers and writers. It is used by pandoc to collect and pass information. The value of this variable is of type CommonState and is read-only. |
| pandoc | The pandoc module, described in the next section, is available through the global pandoc. The other modules described herein are loaded as subfields under their respective name. |
| lpeg | This variable holds the lpeg module, a package based on Parsing Expression Grammars (PEG). It provides excellent parsing utilities and is documented on the official LPeg homepage. Pandoc uses a built-in version of the library, unless it has been configured by the package maintainer to rely on a system-wide installation. Note that the result of require 'lpeg' is not necessarily equal to this value; the require mechanism prefers the system’s lpeg library over the built-in version. |
| re | Contains the LPeg.re module, which is built on top of LPeg and offers an implementation of a regex engine. Pandoc uses a built-in version of the library, unless it has been configured by the package maintainer to rely on a system-wide installation. Note that the result of require 're' is not necessarily equal to this value; the require mechanism prefers the system’s lpeg library over the built-in version. |
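To illustrate how FORMAT can make a filter conditional on the output format, here is a small sketch aimed at the Markdown-to-HTML case. The choice of attribute (loading = 'lazy') is only an assumption for the example.

```lua
-- Sketch only: adding loading="lazy" is an illustrative choice.
function Image(img)
  -- FORMAT holds the writer name; matching 'html' also covers html5.
  if FORMAT:match('html') then
    img.attributes.loading = 'lazy'
    return img
  end
  -- For any other output format, return nil and leave the image alone.
end
```

The same filter file can then be used for every output format, and the attribute is only added when the target is HTML.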

Pandoc Module

The pandoc Lua module is loaded into the filter's Lua environment and provides a set of functions and constants to make creation and manipulation of elements easier. The global variable pandoc is bound to the module and should not be overwritten for this reason. Two major functionalities are provided by the module: element creator functions and access to some of pandoc's main functionalities.

Element creator functions like Str, Para, and Pandoc are designed to allow easy creation of new elements that are simple to use and can be read back from the Lua environment. Internally, pandoc uses these functions to create Lua objects which are passed to element filter functions. This means that elements created via this module will behave exactly as those elements accessible through the filter function parameter.

Pandoc Custom Writers

If you need to render a format not already handled by pandoc, or you want to change how pandoc renders a format, you can create a custom writer using the Lua language. Pandoc has a built-in Lua interpreter, so you needn't install any additional software to do this. A custom writer is a Lua file that defines how to render the document. Writers must define just a single function, named either Writer or ByteStringWriter, which gets passed the document and writer options, and then handles the conversion of the document, rendering it into a string.

Writers

Custom writers using the new style must contain a global function named Writer or ByteStringWriter. Pandoc calls this function with the document and the writer options as arguments, and expects the function to return a UTF-8 encoded string.

function Writer (doc, opts)
  -- ...
end

Writers that do not return text but binary data should define a function with the name ByteStringWriter instead. The function must still return a string, but it does not have to be UTF-8 encoded and can contain arbitrary data. If both Writer and ByteStringWriter functions are defined, then only the Writer function will be used.
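A minimal sketch of a new-style writer, assuming pandoc 3.0 or later: it reuses the smallcaps idea from earlier as a pre-processing step and then hands the document to the built-in HTML writer through pandoc.write. The file name smallcaps-html.lua used below is hypothetical.

```lua
-- Sketch only: a thin wrapper around the built-in HTML writer.
function Writer(doc, opts)
  -- Filter applied to the document before it is rendered.
  local filter = {
    Strong = function(elem)
      return pandoc.SmallCaps(elem.content)
    end,
  }
  -- doc:walk applies the filter; pandoc.write renders the result
  -- with the built-in 'html' writer, passing the options through.
  return pandoc.write(doc:walk(filter), 'html', opts)
end
```

A custom writer is selected by passing the Lua file as the output format, for example pandoc -t smallcaps-html.lua input.md.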
