nbconvert, nbformat, and traitlets

Doing some reading to improve my knowledge of how Jupyter Notebook to HTML conversion works.

nbconvert

Python API For nbconvert

Using nbconvert enables:

presentation of information in familiar formats, such as PDF
publishing of research using LaTeX and opens the door for embedding notebooks in papers
collaboration with others who may not use the notebook in their work
sharing contents with many people via the web using HTML

Overall, notebook conversion and the nbconvert tool give scientists and researchers the flexibility to deliver information in a timely way across different formats. Primarily, the nbconvert tool allows you to convert a Jupyter .ipynb notebook document file into another static format including HTML, LaTeX, PDF, Markdown, reStructuredText, and more. nbconvert can also add productivity to your workflow when used to execute notebooks programmatically.

pip install nbconvert

For converting markdown to formats other than HTML, nbconvert uses Pandoc. For converting notebooks to PDF (with --to pdf), nbconvert makes use of LaTeX and XeTeX as the rendering engine. For converting notebooks to PDF with --to webpdf, nbconvert requires the playwright Chromium automation library.

The command-line syntax to run the nbconvert script is:

$ jupyter nbconvert --to FORMAT notebook.ipynb

This will convert the Jupyter notebook file notebook.ipynb into the output format given by the FORMAT string.

Supported Output Formats:

HTML
LaTeX
PDF
WebPDF
Reveal.js HTML Slideshow
Markdown
Ascii
reStructuredText
executable script
notebook

Jupyter also provides a few templates for output formats. These can be specified with an additional --template argument and are listed in the sections below:

--to html
- HTML export.
  - --template lab: A full static HTML render of the notebook. This looks very similar to the JupyterLab interactive view. The lab template supports the extra --theme option, which defaults to light. This option lets you use the default light or dark themes provided by JupyterLab, as well as custom themes.
  - --template classic: Simplified HTML, using the classic Jupyter look and feel
  - --template basic: Base HTML, rendering with minimal structure and style
  - --embed-images: If this option is provided, embed images as base64 URLs in the resulting HTML file

nbconvert has been designed to work in memory so that it works well in a database or web-based environment too. The main principle of nbconvert is to instantiate an Exporter that controls the pipeline through which notebooks are converted.

""" Download a notebook """
from urllib.request import urlopen

url = "https://jakevdp.github.io/downloads/notebooks/XKCD_plots.ipynb"
response = urlopen(url).read().decode()
response[0:60] + " ..."
## '{\n "cells": [\n  {\n   "cell_type": "markdown",\n   "metadata": ...'

""" Read the response using nbformat. Doing this will guarantee that the notebook structure is valid.
"""
import nbformat

jake_notebook = nbformat.reads(response, as_version=4)
jake_notebook.cells[0]

"""
The nbformat API returns a special type of dictionary. 
"""

"""
The nbconvert API exposes some basic exporters for common formats and defaults. You will start by using one of them. First, you will import one of these exporters (specifically, the HTML exporter), then instantiate it using most of the defaults, and then you will use it to process the notebook we downloaded earlier.
"""
from traitlets.config import Config

# 1. Import the exporter
from nbconvert import HTMLExporter

# 2. Instantiate the exporter. We use the `classic` template for now; we'll get into more details
# later about how to customize the exporter further.
html_exporter = HTMLExporter(template_name="classic")

# 3. Process the notebook we loaded earlier
(body, resources) = html_exporter.from_notebook_node(jake_notebook)

"""
The exporter returns a tuple containing the source of the converted notebook, as well as a resources dict. The resources dict contains (among other things) the extracted png, jpg, etc. from the notebook when applicable. The basic HTML exporter leaves the figures as embedded base64, but you can configure it to extract the figures. `Exporter`s are stateless, so you won't be able to extract any useful information beyond their configuration.
"""

"""
When exporting, you may want to extract the base64-encoded figures as files. While the HTML exporter does not do this by default, the RSTExporter does.
"""
# Import the RST exporter
from nbconvert import RSTExporter

# Instantiate it
rst_exporter = RSTExporter()
# Convert the notebook to RST format
(body, resources) = rst_exporter.from_notebook_node(jake_notebook)

print(body[:970] + "...")
print("[.....]")
print(body[800:1200] + "...")

A high-level overview of the process of converting a notebook to another format:

- Retrieve the notebook and its accompanying resources.
- Feed the notebook into the Exporter, which:
  - Sequentially feeds the notebook into an array of Preprocessors. Preprocessors only act on the structure of the notebook, and have unrestricted access to it.
  - Feeds the notebook into the Jinja templating engine, which converts it to a particular format depending on which template is selected.
- The exporter returns the converted notebook and other relevant resources as a tuple.
- You write the data to the disk using the built-in FilesWriter (which writes the notebook and any extracted files to disk), or elsewhere using a custom Writer.

To extract figures when using the HTML exporter, we will want to change which Preprocessors we are using. There are several preprocessors that come with nbconvert, including one called the ExtractOutputPreprocessor. The ExtractOutputPreprocessor is responsible for crawling the notebook, finding all of the figures, and putting them into the resources directory, as well as choosing the key (i.e. filename.extension) that can replace the figure inside the template. To enable the ExtractOutputPreprocessor, we must add it to the exporter's list of preprocessors:

# create a configuration object that changes the preprocessors
from traitlets.config import Config

c = Config()
c.HTMLExporter.preprocessors = ["nbconvert.preprocessors.ExtractOutputPreprocessor"]

# create the new exporter using the custom config
html_exporter_with_figs = HTMLExporter(config=c)
html_exporter_with_figs.preprocessors

There are an endless number of transformations that you may want to apply to a notebook. In particularly complicated cases, you may want to create your own Preprocessor. To do so, subclass nbconvert.preprocessors.Preprocessor and override the preprocess method, the preprocess_cell method, or both.
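
As a rough sketch of what such a subclass can look like, the hypothetical UppercaseMarkdown preprocessor below implements only preprocess_cell and upper-cases markdown cells. The class name and the transformation are invented for illustration; the notebook is built in memory with nbformat.

```python
import nbformat
from nbconvert.preprocessors import Preprocessor


class UppercaseMarkdown(Preprocessor):
    """Hypothetical preprocessor: upper-case the source of markdown cells."""

    def preprocess_cell(self, cell, resources, index):
        # Called once per cell; must return the (cell, resources) pair.
        if cell.cell_type == "markdown":
            cell.source = cell.source.upper()
        return cell, resources


nb = nbformat.v4.new_notebook(
    cells=[
        nbformat.v4.new_markdown_cell("some heading"),
        nbformat.v4.new_code_cell("x = 1"),
    ]
)

# preprocess() walks the cells and returns the (notebook, resources) pair.
processed_nb, resources = UppercaseMarkdown().preprocess(nb, {})
sources = [cell.source for cell in processed_nb.cells]
print(sources)
```

Note that the preprocessor changes the notebook node it is given, so copy the node first if the original is still needed.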

Programmatically Creating Templates

from jinja2 import DictLoader

dl = DictLoader(
    {
        "footer": """
{%- extends 'lab/index.html.j2' -%}

{% block footer %}
FOOOOOOOOTEEEEER
{% endblock footer %}
"""
    }
)


exportHTML = HTMLExporter(extra_loaders=[dl], template_file="footer")
(body, resources) = exportHTML.from_notebook_node(jake_notebook)
for l in body.split("\n")[-4:]:
    print(l)

Removing Cells, Inputs, or Outputs

When converting Notebooks into other formats, it is possible to remove parts of a cell, or entire cells, using preprocessors. The notebook will remain unchanged, but the outputs will have certain pieces removed.

The most straightforward way to control which pieces of cells are removed is to use cell tags. These are single-string snippets of metadata that are stored in each cell's "tags" metadata field. The TagRemovePreprocessor can be used to remove inputs, outputs, or entire cells.

Sometimes you'd rather remove cells based on their content rather than their tags. In this case, you can use the RegexRemovePreprocessor.
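
A small sketch of both preprocessors applied directly to an in-memory notebook, assuming nbformat and nbconvert are installed. The tag name "remove_cell" and the cell contents are arbitrary choices for this example, and the preprocessors are called by hand here rather than through an exporter.

```python
import copy

import nbformat
from nbconvert.preprocessors import RegexRemovePreprocessor, TagRemovePreprocessor

tagged = nbformat.v4.new_code_cell("secret = 42")
tagged.metadata["tags"] = ["remove_cell"]  # tag name chosen for this example

nb = nbformat.v4.new_notebook(
    cells=[
        tagged,
        nbformat.v4.new_code_cell("print('keep me')"),
        nbformat.v4.new_code_cell(""),  # empty cell
    ]
)

# Work on a copy so the original notebook node stays untouched.
work = copy.deepcopy(nb)

# Drop every cell tagged "remove_cell".
tag_remover = TagRemovePreprocessor(remove_cell_tags={"remove_cell"})
work, _ = tag_remover.preprocess(work, {})

# Drop cells whose source is empty or whitespace only.
regex_remover = RegexRemovePreprocessor(patterns=[r"\s*\Z"])
work, _ = regex_remover.preprocess(work, {})

remaining = [cell.source for cell in work.cells]
print(remaining)
```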

Executing Notebooks

Jupyter notebooks are often saved with their output cells cleared. nbconvert provides a convenient way to execute the input cells of an .ipynb notebook file and save the results, both input and output cells, as a .ipynb file. This section shows how to execute a .ipynb notebook document, saving the result in notebook format. Executing notebooks can be very helpful to run all notebooks in a Python library in one step, or as a way to automate the data analysis in projects involving more than one notebook.

Executing Notebooks from the command line

$ jupyter nbconvert --to notebook --execute mynotebook.ipynb

Executing Notebooks Using the Python API Interface

"""
Import nbformat and the ExecutePreprocessor class
"""
import nbformat 
from nbconvert.preprocessors import ExecutePreprocessor

"""
Load the notebook (notebook_filename is the path to the .ipynb file to run, such as the mynotebook.ipynb used above)
"""
with open(notebook_filename) as f:
    nb = nbformat.read(f, as_version=4)

"""
Configure the notebook execution mode
- We specified two arguments, `timeout` and `kernel_name`, which define respectively the cell execution timeout and the execution kernel
"""
ep = ExecutePreprocessor(timeout=600, kernel_name='python3')
"""
Execute/Run: To actually run the notebook we call the method preprocess
"""
ep.preprocess(nb, {'metadata': {'path': 'notebooks/'}})
"""
Finally, save the resulting notebook
"""
with open('executed_notebook.ipynb', 'w', encoding='utf-8') as f:
    nbformat.write(nb, f)

The arguments passed to ExecutePreprocessor are configuration options called traitlets. Traitlets give these options type checking and default values, and let them be set from configuration files or the command line.

Handling Errors and Exceptions

An error during the notebook execution, by default, will stop the execution and raise a CellExecutionError. Conveniently, the source cell causing the error and the original error name and message are also printed. After the error, we can still save the notebook as before:

with open('executed_notebook.ipynb', mode='w', encoding='utf-8') as f:
    nbformat.write(nb, f)

The saved notebook contains the output up until the failing cell, and includes a full stack trace and error.

A useful pattern to execute notebooks while handling errors is the following:

from nbconvert.preprocessors import CellExecutionError

try:
    out = ep.preprocess(nb, {'metadata': {'path': run_path}})
except CellExecutionError:
    out = None
    msg = 'Error executing the notebook "%s".\n\n' % notebook_filename
    msg += 'See notebook "%s" for the traceback.' % notebook_filename_out
    print(msg)
    raise
finally:
    with open(notebook_filename_out, mode='w', encoding='utf-8') as f:
        nbformat.write(nb, f)

If your notebook contains any Jupyter Widgets, the state of all the widgets can be stored in the notebook's metadata. This allows the live widgets to be rendered on nbviewer, for instance, or when converting to HTML.

Configuration Options

Configuration options may be set in a file, ~/.jupyter/jupyter_nbconvert_config.py, or at the command line when starting nbconvert, e.g. jupyter nbconvert --Application.log_level=10. The most specific setting will always be used.

Creating Custom Templates for nbconvert

Most exporters in nbconvert are subclasses of TemplateExporter, and make use of jinja to render notebooks into the destination format. Alternative nbconvert templates can be selected by name from the command line with the --template option.

Nbconvert templates are directories containing resources for nbconvert template exporters such as jinja templates and associated assets. They are installed in the data directory of nbconvert, namely {installation_prefix}/share/jupyter/nbconvert. Nbconvert includes several templates already. In order to add additional paths to be searched, you need to pass TemplateExporter.extra_template_basedirs config options indicating the extra directories to search for templates.

The content of nbconvert templates

Nbconvert templates all include a conf.json file at the root of the directory, which is used to indicate:

the base template that it is inheriting from
the mimetypes of the template
preprocessors classes to register in the exporter when using that template

Nbconvert walks up the inheritance structure determined by conf.json and produces an aggregated configuration, merging the dictionaries of registered preprocessors. The lexical ordering of the preprocessors by name determines the order in which they will be run. Besides the conf.json file, nbconvert templates most typically include jinja template files, although any other resource from the base template can be overridden in the derived template.
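
To make the merging and ordering concrete, here is a toy model in plain Python. It is not nbconvert's actual implementation: the two dictionaries stand in for a base and a derived conf.json, and the "050-custom" entry and the "mypackage.MyPreprocessor" path are invented for the example.

```python
# Toy stand-ins for two conf.json files. The keys mirror the description above;
# the "050-custom" entry and the "mypackage" class path are made up.
base_conf = {
    "mimetypes": {"text/html": True},
    "preprocessors": {
        "100-pygments": {
            "type": "nbconvert.preprocessors.CSSHTMLHeaderPreprocessor",
            "enabled": True,
        },
    },
}
derived_conf = {
    "base_template": "base",
    "preprocessors": {
        "050-custom": {"type": "mypackage.MyPreprocessor", "enabled": True},
    },
}


def merge_preprocessors(*confs):
    """Merge the preprocessor dicts, base first so derived entries win."""
    merged = {}
    for conf in confs:
        merged.update(conf.get("preprocessors", {}))
    return merged


merged = merge_preprocessors(base_conf, derived_conf)
run_order = sorted(merged)  # lexical ordering by name
print(run_order)  # ['050-custom', '100-pygments']
```

Numeric prefixes in the names are a simple way to control where a preprocessor lands in that lexical order.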

In nbconvert, jinja templates can inherit from any other jinja template available in its current directory or base template directory by name. Jinja templates of other directories can be addressed by their relative path from the Jupyter data directory.

Additional exporters may be registered by entry_points.

Under the hood, nbconvert uses pygments to highlight code. pdf, webpdf, and html exporting support changing the highlighting style.

Architecture of nbconvert

Python API for working with nbconvert

nbformat

nbformat Python API

The Notebook File Format

The official Jupyter Notebook format is defined with this JSON schema, which is used by Jupyter to validate notebooks.

Top-Level Structure

At the highest level, a Jupyter Notebook is a dictionary with a few keys:

metadata (dict)
nbformat (int)
nbformat_minor (int)
cells (list)

{
    "metadata": {
        "kernel_info": {
            # if kernel_info is defined, its name field is required.
            "name": "the name of the kernel"
        },
        "language_info": {
            # if language_info is defined, its name field is required.
            "name": "the programming language of the kernel",
            "version": "the version of the language",
            "codemirror_mode": "The name of the codemirror mode to use [optional]",
        },
    },
    "nbformat": 4,
    "nbformat_minor": 0,
    "cells": [
        # list of cell dictionaries, see below
    ],
}
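
One way to see these top-level keys without writing JSON by hand is to build a notebook with the nbformat Python API and serialize it. A minimal sketch, assuming nbformat is installed; the cell contents are placeholders.

```python
import json

import nbformat

nb = nbformat.v4.new_notebook(
    cells=[
        nbformat.v4.new_markdown_cell("# Title"),
        nbformat.v4.new_code_cell("1 + 1"),
    ]
)

# Raises a validation error if the structure does not match the schema.
nbformat.validate(nb)

# Serialize to the on-disk JSON form and inspect the top-level keys.
as_json = json.loads(nbformat.writes(nb))
top_level_keys = sorted(as_json)
print(top_level_keys)
```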

Some fields, such as code input and text output, are characteristically multi-line strings. When these fields are written to disk, they may be written as a list of strings, which should be joined with '' when reading back into memory.
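
A short sketch of that convention using only the standard library. The helper name as_text and the sample cell are made up for the example; libraries such as nbformat handle this for you when reading a notebook.

```python
# A cell as it might appear on disk: "source" is a list of lines.
on_disk = {
    "cell_type": "code",
    "metadata": {},
    "source": ["import os\n", "print(os.getcwd())"],
}


def as_text(field):
    """Return a multi-line field as one string, joining list items with ''."""
    return "".join(field) if isinstance(field, list) else field


source = as_text(on_disk["source"])

# Going the other way, keepends=True keeps the newlines so the round trip
# is lossless.
lines = source.splitlines(keepends=True)
print(lines)
```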

Cell Types

There are a few basic cell types for encapsulating code and text. All cells have the following basic structure:

{
    "cell_type": "type",
    "metadata": {},
    "source": "single string or [list, of, strings]",
}

Markdown Cells

Markdown cells are used for body-text and contain markdown, as defined in GitHub-flavored markdown, and implemented in marked.

Code Cells

Code cells are the primary content of Jupyter Notebooks. They contain source code in the language of the document's associated kernel, and a list of outputs associated with executing that code. They also have an execution_count, which must be an integer or null.

Code cell outputs

A code cell can have a variety of outputs (stream data or rich mime-type output). These correspond to messages produced as a result of executing the cell. All outputs have an output_type field, which is a string defining what type of output it is.

stream output

{
    "output_type": "stream",
    "name": "stdout",  # or stderr
    "text": "[multiline stream text]",
}

display_data

Rich display outputs, as created by display_data messages, contain data keyed by mime-type. This is often called a mime-bundle, and shows up in various locations in the notebook format and message spec. The metadata of these messages may be keyed by mime-type as well.

{
    "output_type": "display_data",
    "data": {
        "text/plain": "[multiline text data]",
        "image/png": "[base64-encoded-multiline-png-data]",
        "application/json": {
            # JSON data is included as-is
            "key1": "data",
            "key2": ["some", "values"],
            "key3": {"more": "data"},
        },
        "application/vnd.exampleorg.type+json": {
            # JSON data, included as-is, when the mime-type key ends in +json
            "key1": "data",
            "key2": ["some", "values"],
            "key3": {"more": "data"},
        },
    },
    "metadata": {
        "image/png": {
            "width": 640,
            "height": 480,
        },
    },
}
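
Since a mime-bundle can carry several representations of the same result, a consumer has to pick one. The sketch below shows one possible approach using only the standard library; the preference order and the function name pick_representation are assumptions for illustration, and the PNG payload is just the file signature rather than a real image.

```python
import base64

# Hypothetical preference order: richest representation first.
PREFERRED = ["image/png", "text/html", "text/plain"]


def pick_representation(output):
    """Return (mime_type, payload) for the first preferred type in the bundle."""
    data = output["data"]
    for mime in PREFERRED:
        if mime in data:
            payload = data[mime]
            if isinstance(payload, list):  # multi-line fields may be lists
                payload = "".join(payload)
            if mime == "image/png":
                payload = base64.b64decode(payload)  # PNG data is base64 text
            return mime, payload
    raise KeyError("no supported mime-type in bundle")


png_bytes = b"\x89PNG\r\n\x1a\n"  # PNG signature only, stands in for an image
output = {
    "output_type": "display_data",
    "data": {
        "text/plain": "<Figure>",
        "image/png": base64.b64encode(png_bytes).decode("ascii"),
    },
    "metadata": {"image/png": {"width": 640, "height": 480}},
}

mime, payload = pick_representation(output)
print(mime, len(payload))
```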

execute_result

Results of executing a cell (as created by displayhook in Python) are stored in execute_result outputs. execute_result outputs are identical to display_data, adding only an execution_count field, which must be an integer.

{
    "output_type": "execute_result",
    "execution_count": 42,
    "data": {
        "text/plain": "[multiline text data]",
        "image/png": "[base64-encoded-multiline-png-data]",
        "application/json": {
            # JSON data is included as-is
            "json": "data",
        },
    },
    "metadata": {
        "image/png": {
            "width": 640,
            "height": 480,
        },
    },
}

error

Failed execution may show an error.

{
    'output_type': 'error',
    'ename' : str,   # Exception name, as a string
    'evalue' : str,  # Exception value, as a string

    # The traceback will contain a list of frames,
    # represented each as a string.
    'traceback' : list,
}
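
To show how the three fields relate to a Python exception, here is a sketch that builds such a dictionary with the standard traceback module. The helper error_output is hypothetical; a real kernel produces its own traceback strings, which typically look different from these.

```python
import traceback


def error_output(exc):
    """Build an error-output dictionary from a caught exception."""
    return {
        "output_type": "error",
        "ename": type(exc).__name__,  # exception name, as a string
        "evalue": str(exc),  # exception value, as a string
        # format_exception returns a list of strings
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
    }


try:
    1 / 0
except ZeroDivisionError as exc:
    err = error_output(exc)

print(err["ename"], err["evalue"])
```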

Raw NBConvert Cells

A raw cell is defined as content that should be included unmodified in nbconvert output. This cell could include raw LaTeX for nbconvert to PDF via LaTeX, or reStructuredText for use in Sphinx documentation. The notebook authoring environment does not render raw cells.

Cell Attachments

Markdown and raw cells can have a number of attachments, typically inline images that can be referenced in the markdown content of a cell. The attachments dictionary of a cell contains a set of mime-bundles keyed by filename that represent the files attached to the cell.

Cell ids

Since the 4.5 schema release, all cells have an id field, which must be a string of 1 to 64 characters, with alphanumeric characters, -, and _ as the only legal characters. These ids must be unique within any given notebook following the nbformat spec.
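
The id rules are easy to check by hand. The sketch below is a standalone approximation using only the standard library, not nbformat's own validator; the function names and the choice of a truncated UUID for new ids are assumptions made for the example.

```python
import re
import uuid

# 1 to 64 characters drawn from letters, digits, "-" and "_".
CELL_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def validate_cell_ids(cells):
    """Raise ValueError if any cell id is malformed or duplicated."""
    seen = set()
    for cell in cells:
        cell_id = cell.get("id")
        if not isinstance(cell_id, str) or not CELL_ID_PATTERN.fullmatch(cell_id):
            raise ValueError(f"invalid cell id: {cell_id!r}")
        if cell_id in seen:
            raise ValueError(f"duplicate cell id: {cell_id!r}")
        seen.add(cell_id)


def new_cell_id():
    """One possible way to generate an id that satisfies the rules."""
    return uuid.uuid4().hex[:8]


cells = [{"id": new_cell_id()}, {"id": "intro-cell_1"}]
validate_cell_ids(cells)  # passes silently
```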

Metadata

Metadata is a place that you can put arbitrary JSONable information about your notebook, cell, or output.

Supported Markup Formats

Most Jupyter Notebook interfaces use the marked.js JavaScript library for rendering markdown. This supports markdown in the following markdown flavors:

Architecture of nbconvert

This is a high-level outline of the basic workflow, structures and objects in nbconvert. This exposition has a two-fold goal:

To alert you to the affordances available for customization or direct contributions
To provide a map of where and when different events occur, which should aid in tracking down bugs

Nbconvert takes in a notebook, which is a JSON object, and operates on that object. This can include operations that take a notebook and return a notebook. Or it could be that we wish to systematically alter the notebook. But often we want to have the notebook's structured content in a different format. The basic unit of structure in a notebook is the cell. Accordingly, since our templating engine is capable of expressing structure, the basic unit in our templates will often be specified at the cell level. Each cell has a certain type, and the three most important cell types for our purposes are code, markdown, and raw NBConvert. Code cells can be split further into their input and their output. The template's structure can be seen as a mechanism for selecting content on which to operate. Because the template operates on individual cells, this has some upsides and some drawbacks.

Note that all that we’ve described is happening in memory. This is crucial in order to ensure that this functionality is available when writing files is more challenging. Nonetheless, the reason for using nbconvert almost always involves producing some kind of output file. We take the in-memory object and write a file appropriate for the output type.

Classes

Exporter

The primary class in nbconvert is the nbconvert.exporters.exporter.Exporter. Exporters encapsulate the operation of turning a notebook into another format. There is one Exporter for each format supported in nbconvert. The first thing an Exporter does is load a notebook, usually from a file via nbformat. Most of what a typical Exporter does is select and configure preprocessors, filters, and templates.

Preprocessors

A nbconvert.preprocessors.Preprocessor is an object that transforms the content of the notebook to be exported. The result of a preprocessor being applied to a notebook is always a notebook. These operations include re-executing cells, stripping output, removing bundled outputs to separate files, etc. If you want to add operations that modify a notebook before exporting, a preprocessor is the place to start. Once a notebook is preprocessed, it's time to convert the notebook into the destination format.

Templates

Most Exporters in nbconvert are a subclass of nbconvert.exporters.templateexporter.TemplateExporter, which makes use of jinja to render a notebook into the destination format. Nbconvert templates can be selected from the command line.

Filters

Filters are Python callables which take something (typically text) as an input and produce a text output. If you want to perform custom transformations of particular outputs, a filter may be the way to go.

{% block stream_stdout -%}
<div class="output_subarea output_stream output_stdout output_text">
<pre>
{{- output.text | ansi2html -}}
</pre>
</div>
{%- endblock stream_stdout %}

The {{- output.text | ansi2html -}} bit will invoke the ansi2html filter to transform the text output. Typically, filters are pure functions. Once it has passed through the template, an Exporter is done with the notebook, and returns the file data.
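
The filter mechanism itself comes from Jinja, so it can be tried in isolation. This sketch uses a plain jinja2 Environment rather than an nbconvert exporter, and the shout filter is a made-up example of a pure text-to-text function.

```python
from jinja2 import Environment


def shout(text):
    """Hypothetical filter: a pure function from text to text."""
    return text.upper() + "!"


env = Environment()
env.filters["shout"] = shout  # register the callable under a filter name

# Same pipe syntax as in the template above, with our filter swapped in.
template = env.from_string("<pre>{{- output.text | shout -}}</pre>")
rendered = template.render(output={"text": "done"})
print(rendered)  # <pre>DONE!</pre>
```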

Writers

A Writer takes care of writing the resulting file(s) where they should end up. There are two basic Writers in nbconvert:

stdout - writes the result to stdout (for pipe-style workflows)
Files (default) - writes the result to the filesystem

Once the output is written, nbconvert has done its job.

Postprocessors

A Postprocessor is something that runs after everything is exported and written to the filesystem. The only postprocessor in nbconvert at this point is the nbconvert.postprocessors.serve.ServePostProcessor, which is used for serving reveal.js HTML slideshows.

Traitlets

Traitlets is a framework that lets Python classes have attributes with type checking, dynamically calculated default values, and 'on change' callbacks. The package also includes a mechanism to use traitlets for configuration, loading values from files or from command line arguments. This is a distinct layer on top of traitlets, so you can use traitlets in your code without using the configuration machinery.
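
A minimal sketch of the configuration layer, assuming traitlets is installed. The Runner class and its two options are hypothetical, loosely modelled on the timeout and kernel_name arguments seen earlier; only traits tagged with config=True pick up values from a Config object.

```python
from traitlets import Int, Unicode
from traitlets.config import Config, Configurable


class Runner(Configurable):
    """Hypothetical configurable class with two configurable traits."""

    timeout = Int(30, help="Cell timeout in seconds").tag(config=True)
    kernel_name = Unicode("python3", help="Kernel to use").tag(config=True)


# Config objects are keyed by class name, then by trait name.
c = Config()
c.Runner.timeout = 600

runner = Runner(config=c)
print(runner.timeout, runner.kernel_name)  # 600 python3
```

The same Config object could equally come from a configuration file or from command line arguments, which is what makes the class configurable without extra code.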

In short, traitlets let the user define classes that have:

Attributes (traits) with type checking and dynamically computed default values
Traits emit change events when attributes are modified
Traitlets perform some validation and allow coercion of new trait values on assignment. They also allow the user to define custom validation logic for attributes based on the value of other attributes

At its most basic, traitlets provides type checking and dynamic default value generation of attributes on traitlets.HasTraits subclasses:

from traitlets import HasTraits, Int, Unicode, default
import getpass


class Identity(HasTraits):
    username = Unicode()

    @default("username")
    def _default_username(self):
        return getpass.getuser()

class Foo(HasTraits):
    bar = Int()


foo = Foo(bar="3")  # raises a TraitError

"""
TraitError: The 'bar' trait of a Foo instance must be an int,
but a value of '3' <class 'str'> was specified
"""

Traitlets implement the observer pattern:

class Foo(HasTraits):
    bar = Int()
    baz = Unicode()


foo = Foo()


def func(change):
    print(change["old"])
    print(change["new"])  # as of traitlets 4.3, one should be able to
    # write print(change.new) instead


foo.observe(func, names=["bar"])
foo.bar = 1  # prints '0\n 1'
foo.baz = "abc"  # prints nothing

Each trait type (Int, Unicode, Dict, etc.) may have its own validation or coercion logic. In addition, we can register custom cross-validators that may depend on the state of other attributes.
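
A cross-validator can be sketched with the validate decorator. The Parity class below is an illustrative example in which value is only accepted when it agrees with the current parity attribute; the class and attribute names are chosen for the example.

```python
from traitlets import HasTraits, Int, TraitError, validate


class Parity(HasTraits):
    value = Int()
    parity = Int()

    @validate("value")
    def _valid_value(self, proposal):
        # proposal["value"] is the candidate value being assigned.
        if proposal["value"] % 2 != self.parity:
            raise TraitError("value and parity should be consistent")
        return proposal["value"]


p = Parity()  # value=0, parity=0
p.value = 2  # accepted: even value, parity 0

try:
    p.value = 3  # rejected: odd value, parity still 0
    rejected = False
except TraitError:
    rejected = True

print(rejected, p.value)  # True 2
```

Because the validator raises before the assignment completes, the trait keeps its previous value.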
