Reading Some More About Pandoc Before Improving Markdown Implementation
I wanted to read some more about Pandoc before improving my Markdown-to-HTML system. I currently do this with marked.js, but I want to use pandoc to improve speed.
Pandoc Lua Filters
Pandoc has long supported filters, which allow the pandoc abstract syntax tree (AST) to be manipulated between the parsing and writing phases. Traditional pandoc filters accept a JSON representation of the pandoc AST and produce an altered JSON representation of the AST. They may be written in any programming language, and invoked from pandoc using the --filter option.
Although traditional filters are very flexible, they have a couple of disadvantages. First, there is some overhead in writing JSON to stdout and reading it from stdin (twice, once on each side of the filter). Second, whether a filter will work depends on the details of the user's environment. A filter may require an interpreter for a certain programming language to be available, as well as a library for manipulating the pandoc AST in JSON form.
Pandoc makes it possible to write filters in Lua without any external dependencies at all. A Lua interpreter and a Lua library for creating pandoc filters are built into the pandoc executable. Pandoc data types are marshaled to Lua directly, avoiding the overhead of writing JSON to stdout and reading it from stdin. Here is an example of a Lua filter that converts strong emphasis to small caps:
return {
  Strong = function (elem)
    return pandoc.SmallCaps(elem.content)
  end,
}
or equivalently,
function Strong(elem)
  return pandoc.SmallCaps(elem.content)
end
This says: walk the AST, and when you find a Strong element, replace it with a SmallCaps element with the same content. To run it, save it in a file (smallcaps.lua) and invoke pandoc with --lua-filter=smallcaps.lua.
Filter Performance Comparison:
| Command | Time |
|---|---|
| pandoc | 1.01s |
| pandoc --filter ./smallcaps | 1.36s |
| pandoc --filter ./smallcaps.py | 1.40s |
| pandoc --lua-filter ./smallcaps.lua | 1.03s |
The Lua Filter avoids the substantial overhead associated with marshaling to and from JSON over a pipe.
Lua Filter Structure
Lua filters are tables with element names as keys and values consisting of functions acting on those elements. Filters are expected to be put into separate files and are passed via the --lua-filter command-line argument. For example, if a filter is defined in a file current-date.lua, then it would be applied like this:
$ pandoc --lua-filter=current-date.lua -f markdown MANUAL.txt
The --lua-filter option may be supplied multiple times. Pandoc applies all filters (including JSON filters specified via --filter and Lua filters specified via --lua-filter) in the order they appear on the command line. Pandoc expects each Lua file to return a list of filters. The filters in that list are called sequentially, each on the result of the previous filter. If there is no value returned by the filter script, then pandoc will try to generate a single filter by collecting all top-level functions whose names correspond to those of pandoc elements (e.g. Str, Para, Meta, or Pandoc). That is why the two examples above are equivalent.
For each filter, the document is traversed and each element subjected to the filter. Elements for which the filter contains an entry (i.e. a function of the same name) are passed to the Lua element filtering function. In other words, filter entries will be called for each corresponding element in the document, getting the respective element as input.
The return value of a filter function must be one of the following:
- nil: this means that the object should remain unchanged
- a pandoc object: this must be of the same type as the input and will replace the original object
- a list of pandoc objects: these will replace the original object; the list is merged with the neighbors of the original object (spliced into the list the original object belongs to); returning an empty list deletes the object
The function's output must result in an element of the same type as the input. This means a filter function acting on an inline element must return either nil, an inline, or a list of inlines; a function filtering a block element must return one of nil, a block, or a list of block elements. Pandoc will throw an error if this condition is violated. Elements without matching functions are left untouched.
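As a minimal sketch of these return conventions, the filter below treats Str elements in three different ways. The marker strings {{date}} and {{drop}} are hypothetical, invented for this example; they have no special meaning to pandoc.

```lua
-- Sketch of the three return conventions for an element filter function.
-- "{{date}}" and "{{drop}}" are made-up markers used only for illustration.
function Str(elem)
  if elem.text == "{{date}}" then
    -- Return an object of the same type: it replaces the original.
    elem.text = os.date("%Y-%m-%d")
    return elem
  elseif elem.text == "{{drop}}" then
    -- Return an empty list: the element is deleted.
    return {}
  end
  -- Falling through returns nil: the element is left unchanged.
end
```

Saved under a name such as markers.lua (a hypothetical file name), it would be run with pandoc --lua-filter=markers.lua.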
Filters on Element Sequences
For some filtering tasks, it is necessary to know the order in which elements occur in the document. It is then not enough to inspect a single element at a time. There are two special function names that can be used to define filters on lists of blocks or lists of inlines.
- Inlines (inlines)
- If present in a filter, this function will be called on all lists of inline elements, like the content of a Para (paragraph) block, or the description of an Image. The inlines argument passed to the function will be a List of Inline elements for each call.
- Blocks (blocks)
- If present in a filter, this function will be called on all lists of block elements, like the content of a MetaBlocks meta element block, on each item of a list, and the main content of the Pandoc document. The blocks argument passed to the function will be a List of Block elements for each call.
These filter functions are special in that the result must either be nil, in which case the list is left unchanged, or must be a list of the correct type, i.e., the same type as the input argument. Single elements are not allowed as return values, as a single element in this context usually hints at a bug.
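A sequence filter might look roughly like the sketch below, which collapses runs of adjacent Space elements into one. Whether such runs occur depends on the reader and on earlier filters, so the task itself is only an illustration; the point is that each decision depends on the neighboring element, and that the function returns the whole list.

```lua
-- Sketch: collapse runs of adjacent Space elements into a single Space.
-- The decision for each element depends on its left neighbor, so this
-- needs the whole list rather than one element at a time.
function Inlines(inlines)
  -- Walk backwards so removals do not shift the indices still to visit.
  for i = #inlines, 2, -1 do
    if inlines[i].t == "Space" and inlines[i - 1].t == "Space" then
      table.remove(inlines, i)
    end
  end
  -- Return a list of inlines, never a single element.
  return inlines
end
```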
Traversal Order
The traversal order of filters can be selected by setting the key traverse to either topdown or typewise; the default is typewise.
local filter = {
  traverse = 'topdown',
  -- ... filter functions ...
}
return filter
Typewise Traversal
Element filter functions within a filter set are called in a fixed order, skipping any which are not present:
- functions for Inline elements
- the Inlines filter function
- functions for Block elements
- the Blocks filter function
- the Meta filter function, and last
- the Pandoc filter function
It is still possible to force a different order by explicitly returning multiple filter sets. For example, if the filter for Meta is to be run before that for Str one can write:
-- ... filter definitions ...
return {
  { Meta = Meta }, -- (1)
  { Str = Str }    -- (2)
}
Filters are applied in the order in which they are returned. All functions in set (1) are thus run before those in (2), causing the filter function for Meta to be run before filtering of Str elements is started.
Topdown Traversal
It is sometimes more natural to traverse the document tree depth-first from the root towards the leaves, and all in a single run. For example, a block list [Plain [Str "a"], Para [Str "b"]] will try the following filter functions, in order: Blocks, Plain, Inlines, Str, Para, Inlines, Str. Topdown traversals can be cut short by returning false as a second value from the filter function. No child-element of the returned element is processed in that case.
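Cutting a topdown traversal short could be sketched as follows. This assumes pandoc 2.17 or later, where the traverse key and the walk method on the document are available; the task (uppercasing text outside block quotes) is made up for illustration.

```lua
-- Sketch: uppercase all text except text inside block quotes.
function Pandoc(doc)
  return doc:walk({
    traverse = "topdown",
    -- Returning false as the second value stops the descent, so no
    -- Str inside the quote is ever passed to the Str function below.
    BlockQuote = function(quote)
      return quote, false
    end,
    Str = function(str)
      -- string.upper only handles ASCII letters; fine for a sketch.
      str.text = str.text:upper()
      return str
    end,
  })
end
```

With the default typewise order, the Str function would run on every string, including those inside quotes, before BlockQuote is ever consulted.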
Global Variables
Pandoc passes additional data to Lua filters by setting global variables:
| Variable | Description |
|---|---|
| FORMAT | The global FORMAT is set to the format of the pandoc writer being used (html5, latex, etc.), so the behavior of a filter can be made conditional on the eventual output format. |
| PANDOC_READER_OPTIONS | Table of the options which were provided to the parser. (ReaderOptions) |
| PANDOC_WRITER_OPTIONS | Table of the options that will be passed to the writer. While the object can be modified, the changes will not be picked up by pandoc. (WriterOptions) Accessing this variable in custom writers is deprecated. Starting with pandoc 3.0, it is set to a placeholder value (the default options) in custom writers. Access to the actual writer options is provided via the Writer or ByteStringWriter function, to which the options are passed as the second function argument. Since: pandoc 2.17 |
| PANDOC_VERSION | Contains the pandoc version as a Version object which behaves like a numerically indexed table, most significant number first. E.g., for pandoc 2.7.3, the value of the variable is equivalent to a table {2, 7, 3}. Use tostring(PANDOC_VERSION) to produce a version string. This variable is also set in custom writers. |
| PANDOC_API_VERSION | Contains the version of the pandoc-types API against which pandoc was compiled. It is given as a numerically indexed table, most significant number first. E.g., if pandoc was compiled against pandoc-types 1.17.3, then the value of the variable will behave like the table {1, 17, 3}. Use tostring(PANDOC_API_VERSION) to produce a version string. This variable is also set in custom writers. |
| PANDOC_SCRIPT_FILE | The name used to invoke the filter. This value can be used to find files relative to the script file. This variable is also set in custom writers. |
| PANDOC_STATE | The state shared by all readers and writers. It is used by pandoc to collect and pass information. The value of this variable is of type CommonState and is read-only. |
| pandoc | The pandoc module, described in the next section, is available through the global pandoc. The other modules described herein are loaded as subfields under their respective name. |
| lpeg | This variable holds the lpeg module, a package based on Parsing Expression Grammars (PEG). It provides excellent parsing utilities and is documented on the official LPeg homepage. Pandoc uses a built-in version of the library, unless it has been configured by the package maintainer to rely on a system-wide installation. Note that the result of require 'lpeg' is not necessarily equal to this value; the require mechanism prefers the system’s lpeg library over the built-in version. |
| re | Contains the LPeg.re module, which is built on top of LPeg and offers an implementation of a regex engine. Pandoc uses a built-in version of the library, unless it has been configured by the package maintainer to rely on a system-wide installation. Note that the result of require 're' is not necessarily equal to this value; the require mechanism prefers the system’s lpeg library over the built-in version. |
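As a small sketch of how the FORMAT global can make a filter conditional on the output format, the function below replaces horizontal rules with raw HTML only when the target is an HTML variant. The class name fancy-rule is a made-up example.

```lua
-- Sketch: emit raw HTML for horizontal rules, but only for HTML output.
-- The class name "fancy-rule" is a hypothetical example.
function HorizontalRule(elem)
  -- FORMAT is set by pandoc; "html" also matches variants such as html5.
  if FORMAT:match("html") then
    return pandoc.RawBlock("html", '<hr class="fancy-rule">')
  end
  -- For every other output format, return nil and keep the element.
end
```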
Pandoc Module
The pandoc Lua module is loaded into the filter's Lua environment and provides a set of functions and constants to make creation and manipulation of elements easier. The global variable pandoc is bound to the module and should not be overwritten for this reason. Two major functionalities are provided by the module: element creator functions and access to some of pandoc's main functionalities.
Element creator functions like Str, Para, and Pandoc are designed to allow easy creation of new elements that are simple to use and can be read back from the Lua environment. Internally, pandoc uses these functions to create Lua objects which are passed to element filter functions. This means that elements created via this module will behave exactly as those elements accessible through the filter function parameter.
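The element creator functions could be used along the lines of the sketch below, which appends a short paragraph to the end of the document. The wording of the paragraph is arbitrary; the blocks are copied into a fresh list so the example does not depend on how pandoc handles in-place changes to the document.

```lua
-- Sketch: append a generated paragraph to the document.
function Pandoc(doc)
  -- Copy the existing blocks into a new list.
  local blocks = {}
  for _, block in ipairs(doc.blocks) do
    table.insert(blocks, block)
  end
  -- Build a new paragraph from inline elements with creator functions.
  table.insert(blocks, pandoc.Para({
    pandoc.Str("Generated"), pandoc.Space(),
    pandoc.Str("with"), pandoc.Space(),
    pandoc.Emph({ pandoc.Str("pandoc") }),
  }))
  -- Rebuild the document, keeping the original metadata.
  return pandoc.Pandoc(blocks, doc.meta)
end
```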
Pandoc Custom Writers
If you need to render a format not already handled by pandoc, or you want to change how pandoc renders a format, you can create a custom writer using the Lua language. Pandoc has a built-in Lua interpreter, so you needn't install any additional software to do this. A custom writer is a Lua file that defines how to render the document. Writers must define just a single function, named either Writer or ByteStringWriter, which gets passed the document and writer options, and then handles the conversion of the document, rendering it into a string.
Writers
Custom writers using the new style must contain a global function named Writer or ByteStringWriter. Pandoc calls this function with the document and the writer options as arguments, and expects the function to return a UTF-8 encoded string.
function Writer (doc, opts)
  -- ...
end
Writers that do not return text but binary data should define a function with the name ByteStringWriter instead. The function must still return a string, but it does not have to be UTF-8 encoded and can contain arbitrary data. If both Writer and ByteStringWriter functions are defined, then only the Writer function will be used.
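For a Markdown-to-HTML setup, a new-style writer could be sketched as below. It assumes a pandoc version that provides pandoc.write and the walk method (2.17 or later, as far as I can tell), and the header-demoting tweak is just an illustrative choice: the document is adjusted with a small filter and then handed to the built-in HTML writer.

```lua
-- Sketch: a custom writer that demotes headers, then reuses the
-- built-in HTML writer for the actual rendering.
function Writer(doc, opts)
  local filter = {
    -- Demote every header by one level, capping at level 6.
    Header = function(h)
      h.level = math.min(h.level + 1, 6)
      return h
    end,
  }
  -- pandoc.write returns the rendered document as a string.
  return pandoc.write(doc:walk(filter), "html", opts)
end
```

Saved as, say, demote.lua (a hypothetical file name), it would be selected with pandoc -t demote.lua.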