
Robust Jupyter report generation using static analysis

Eduardo Blancas ・13 min read

Jupyter notebooks are a great format for generating data analysis reports since they can contain rich output such as tables and charts in a single file. With the release of papermill, a package that lets you parametrize and execute .ipynb files programmatically, it became easier to use notebooks as templates to generate analytical reports. When developing a Machine Learning model, I use Jupyter notebooks in tandem with papermill to generate a report for each experiment I run. This way, I can always go back and check performance metrics, tables and charts to compare one experiment to another.

After trying out the Jupyter notebook + papermill combination in a few projects, I found some recurring problems:

  1. .ipynb files store each cell's output in the same file. This is good for the final report, but for development purposes, if two or more people edit the same notebook, the cell outputs get in the way, making git merge a big pain
  2. Even if we make sure cell outputs are deleted before pushing to the git repository, comparing versions using git diff yields illegible results (.ipynb files are JSON files with a complex structure)
  3. Notebooks are developed interactively: cells are added and moved around, and this often causes a top-to-bottom execution to fail. Given that papermill executes notebooks cell by cell, something as simple as a syntax error in the very last cell will not be raised until that cell is executed
  4. Papermill doesn't validate input parameters, it just adds a new cell. This might lead to unexpected behavior, such as an "undefined variable" error or inadvertently using a default parameter value. This is especially frustrating for long-running notebooks, where one finds out about errors only after waiting for the notebook to run
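Problem 2 assumes that outputs are stripped before committing. As a rough sketch of what that step involves, the snippet below clears outputs using only the standard library's json module. The helper name strip_outputs is hypothetical, and the field names (cells, outputs, execution_count) are the ones that appear in the notebook JSON shown later in this post.

```python
import json


def strip_outputs(ipynb_text):
    """Return the notebook JSON with outputs and execution counts removed."""
    nb = json.loads(ipynb_text)

    for cell in nb.get('cells', []):
        if cell.get('cell_type') == 'code':
            cell['outputs'] = []
            cell['execution_count'] = None

    return json.dumps(nb, indent=1)


# a tiny hand-written notebook with one executed cell
executed = json.dumps({
    'cells': [{
        'cell_type': 'code',
        'execution_count': 3,
        'metadata': {},
        'outputs': [{'output_type': 'stream', 'name': 'stdout',
                     'text': ['2\n']}],
        'source': ['print(1 + 1)'],
    }],
    'metadata': {},
    'nbformat': 4,
    'nbformat_minor': 4,
})

clean = json.loads(strip_outputs(executed))
print(clean['cells'][0]['outputs'])
```

Even with outputs removed, the remaining JSON structure is still what git has to diff, which is the part the rest of this post addresses.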

In this blog post I'll explain my workflow for robust report generation. The post is divided in two parts: part I discusses the solution to problems 1 and 2, part II covers 3 and 4. Incorporating this workflow will help you better integrate your report's source code with git and save precious time by automatically preventing notebook execution when errors are detected.

Along the way, you'll also learn a few interesting things:

  • How Jupyter notebooks are represented (the .ipynb format)
  • How to read and manipulate notebooks using the nbformat package
  • How to convert a Python script (.py) to a Jupyter notebook using jupytext
  • Basic Jupyter notebook static analysis using pyflakes and parso
  • How to programmatically execute Jupyter notebooks using papermill
  • How to automate report validation and generation using ploomber

Workflow summary

The solution for problems 1 and 2 is to use a different format for development and convert to .ipynb right before execution; jupytext does exactly that. Problems 3 and 4 are approached by doing static analysis before executing the notebook.

Step by step summary:

  1. Work on .py files (instead of .ipynb) to make git integration easier
  2. Declare your "notebook" parameters at the top, tagging the cell as "parameters" (see jupytext reference)
  3. Before executing your notebook, validate the .py file using pyflakes and parso
  4. If validation succeeds, use jupytext to convert your .py file to a .ipynb notebook
  5. Execute your .ipynb notebook using papermill

Alternatively, you can use ploomber to automate the whole process. Sample code is provided at the end of this post.

Part I: To ease git integration replace .ipynb notebooks with .py files

How are notebooks represented on disk?

A notebook.ipynb file is just a JSON file with a certain structure, which is defined in the nbformat package. When we open the Jupyter application (by using the jupyter notebook command), Jupyter uses nbformat under the hood to save our changes in the .ipynb file.

Let's see how we can create a notebook by directly manipulating an object and then serializing it to JSON:

import nbformat

# create a new notebook (nbformat.v4 defines the latest Jupyter notebook format)
nb = nbformat.v4.new_notebook()

# let's add a new code cell
cell = nbformat.v4.new_code_cell(source='# this line was added programatically\n 1 + 1')
nb.cells.append(cell)

# what kind of object is this?
print(f'A notebook is an object of type: {type(nb)}')

Console output: (1/1):

A notebook is an object of type: <class 'nbformat.notebooknode.NotebookNode'>

We can convert the notebook object to its JSON representation:

writer = nbformat.v4.nbjson.JSONWriter()
nb_json = writer.writes(nb)
print('Notebook JSON representation (content of the .ipynb file):\n\n', nb_json)

Console output: (1/1):

Notebook JSON representation (content of the .ipynb file):

 {
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# this line was added programatically\n",
    " 1 + 1"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}

The notebook is a great output format since it supports embedded charts and tables in a single file, which we can easily share or review later, but it's not a good choice as a source code format. Say we edit the previous notebook by changing the first cell and adding a second one:

# edit first cell
nb['cells'][0]['source'] = '# Change cell\n 2 + 2'

# add a new one
cell = nbformat.v4.new_code_cell(source='# This is a new cell\n 3 + 3')
nb.cells.append(cell)

How would our changes look to a reviewer?

import difflib

# generate diff view between the old and the new notebook
nb_json_edited = writer.writes(nb)
diff = difflib.ndiff(nb_json.splitlines(keepends=True),
                     nb_json_edited.splitlines(keepends=True))
print(''.join(diff), end="")

Console output: (1/1):

{
   "cells": [
    {
     "cell_type": "code",
     "execution_count": null,
     "metadata": {},
     "outputs": [],
     "source": [
-     "# this line was added programatically\n",
+     "# Change cell\n",
-     " 1 + 1"
?       ^   ^
+     " 2 + 2"
?       ^   ^
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "# This is a new cell\n",
+     " 3 + 3"
     ]
    }
   ],
   "metadata": {},
   "nbformat": 4,
   "nbformat_minor": 4
  }

It's hard to see what's going on, and this is just a notebook with two cells and no output. In a real notebook with dozens of cells, understanding the difference between the old and new versions by eye is impossible.

To ease git integration, I use plain .py files and only convert them to .ipynb notebooks before execution. We could parse a .py file and convert it to a valid .ipynb file using nbformat, but there are important details such as tags or markdown cells that we would have to take care of. Fortunately, jupytext does that for us.
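For comparison, here is a sketch of the same edit (change the first cell, add a second one) written as plain Python source and diffed with the standard library's difflib. The file names old.py and new.py are made up for the example, and the "# +" cell marker is assumed to follow the jupytext syntax used elsewhere in this post.

```python
import difflib

# the first version of the "notebook" as a plain script
old = """# this line was added programatically
1 + 1
"""

# the edited version: first cell changed, second cell added
new = """# Change cell
2 + 2

# +
# This is a new cell
3 + 3
"""

diff = list(difflib.unified_diff(old.splitlines(keepends=True),
                                 new.splitlines(keepends=True),
                                 fromfile='old.py', tofile='new.py'))
print(''.join(diff), end='')
```

Every line in this diff is either code or a comment the author wrote, with none of the JSON scaffolding that dominated the .ipynb diff.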

Furthermore, once jupytext is installed, opening a .py file in the Jupyter notebook application will treat the file as a notebook, and we will be able to run, add and remove cells as usual.

Let's see how to convert a .py file to .ipynb using jupytext:

import jupytext

# define your "notebook" in a plain .py file
# note that jupytext defines a syntax to support markdown cells and cell tags
py_code = ("""# This is a markdown cell

# + tags=['parameters']
x = 1
y = 2
""")

# use jupytext to convert it to a notebook object
nb = jupytext.reads(py_code, fmt='py')
print(f'Object type:\n{type(nb)}')

Console output: (1/1):

Object type:
<class 'nbformat.notebooknode.NotebookNode'>

Using .py files solves problems 1 and 2. Let's now discuss problems 3 and 4.

Part II: To catch errors before execution, use static analysis

Static analysis is the analysis of source code without executing it. Since our notebooks usually take a long time to run, we want to catch as many errors as we can before running them. Given that we got rid of the complex .ipynb format, we can now use tools that analyze Python source code to spot errors.

How does papermill execute Jupyter notebooks?

It is important to understand how papermill executes notebooks to motivate this section. papermill performs a cell-by-cell execution: it takes the code from the first cell, sends it to the Python kernel, waits for a response, saves the output and repeats this process for all cells. You can see the details in the source code here. You'll notice that PapermillNotebookClient is a subclass of NotebookClient, which is part of nbclient, an official package that also implements a notebook executor.

This cell-by-cell logic has an important implication: an error in cell i will not be raised until that cell is executed. Imagine your notebook looks like this:


import time

# cell 1 - simulate a long-running operation
time.sleep(3600)

# ...
# ...

# cell 100 - there is a syntax error here (missing ":")!
if x > 10
    pass

Something as simple as a syntax error will not make your notebook crash until execution reaches cell 100. To fix this problem, we will run a very simple static analysis on the whole notebook source code before executing it.
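As a first illustration of checking before running, the sketch below uses Python's built-in compile function, which parses source code without executing it. The cells list and the helper find_syntax_error are hypothetical names for this example; the tools used in the rest of this post go further than this.

```python
# each string stands for the source of one notebook cell
cells = [
    "import time\ntime.sleep(3600)",
    "x = 20",
    "if x > 10\n    pass",  # missing ":"
]


def find_syntax_error(cells):
    """Compile every cell without running it.

    Returns (cell number, SyntaxError) for the first bad cell, or None.
    """
    for number, source in enumerate(cells, start=1):
        try:
            compile(source, f'<cell {number}>', 'exec')
        except SyntaxError as e:
            return number, e
    return None


result = find_syntax_error(cells)
if result:
    number, error = result
    print(f'Syntax error in cell {number}, line {error.lineno}: {error.msg}')
```

Because nothing is executed, the one-hour sleep in the first cell never runs and the problem in the last cell is reported immediately.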

Finding errors with pyflakes

To prevent some runtime errors, we will run a few checks in our source code before executing it. pyflakes is a tool that looks for errors in our source code by parsing it. Since pyflakes does not execute code, it is limited in terms of how many errors it can find but it is very useful to find simple errors that would otherwise be detected at runtime. For the full list of errors pyflakes can detect, see this.

Let's see how it works:

from pyflakes.api import check as pyflakes_check

py_code = """
import time

time.sleep(3600)

x = 1
y = 2

# z is never defined!
x + y + z

print('Variables. x: {}. y: {}'.format(x))
"""

_ = pyflakes_check(py_code, filename='my_file.py')

Console output: (1/1):

my_file.py:10:9 undefined name 'z'
my_file.py:12:7 '...'.format(...) is missing argument(s) for placeholder(s): 1

pyflakes found that variable 'z' is used but never defined, and that the call to format is missing an argument. Had we executed this notebook, we'd have found out about the first error only after waiting for one hour.

There are other projects similar to pyflakes, such as pylint. pylint is able to find more errors than pyflakes but it also flags style errors (such as inconsistent indentation). We probably don't want to prevent notebook execution due to style issues, so we'd have to filter out some messages. pyflakes works just fine for our purposes.

Parametrized notebooks with papermill

papermill can parametrize notebooks, which allows you to use them as templates. Say you have a notebook called yearly_template.ipynb that takes a year as a parameter and generates a summary for data generated in that year. You could execute it from the command line using papermill like this:

papermill yearly_template.ipynb report_2019.ipynb -p year 2019
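The same run can be launched from Python. The sketch below assumes papermill's execute_notebook function, which takes an input path, an output path and a parameters dictionary. The helpers report_job and run_report are hypothetical names, and the file names simply mirror the command above.

```python
def report_job(year, template='yearly_template.ipynb'):
    """Build the arguments for one report (hypothetical helper)."""
    return {
        'input_path': template,
        'output_path': f'report_{year}.ipynb',
        'parameters': {'year': year},
    }


def run_report(year):
    """Execute the template for the given year using papermill."""
    # imported here so that building the job does not require papermill
    import papermill as pm

    return pm.execute_notebook(**report_job(year))


# inspect what would be passed to papermill, without running anything
print(report_job(2019))
```

Keeping the argument construction separate from the execution makes it easy to loop over many years and to inspect each job before running it.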

.ipynb files support cell tags. When you execute a notebook, papermill will inject a new cell with your parameters just below the cell tagged with "parameters". Although we are not dealing with .ipynb files anymore, we can still tag cells using jupytext syntax. Let's define a simple notebook:

# + tags=['parameters']
year = None


# +
print('the year is {}'.format(year))

If you convert the code above to .ipynb and then execute it using papermill, papermill will execute the following cells:

# Cell 1: cell tagged with "parameters"
year = None


# Cell 2: injected by papermill
year = 2019


# Cell 3
print('the year is {}'.format(year))

papermill limits itself to injecting the passed parameters and executing the notebook; it does not perform any kind of validation. Adding simple validation logic can help us catch these errors before execution.

Extracting declared parameters with parso

I want parametrized notebooks to behave more like functions: they should refuse to run if any parameter is missing or if anything is passed but not declared. To enable this feature we have to analyze the "parameters" cell and compare it with the parameters passed via papermill. parso is a package that parses Python code and allows us to do exactly that.

import parso

params_cell = """
# + tags=['parameters']
a = 1
b = 2
c = 3
"""

# parse "parameters" cell, find which variables are defined
module = parso.parse(params_cell)
print('\nDefined variables: ', list(module.get_used_names()))

Console output: (1/1):

Defined variables:  ['a', 'b', 'c']

We see that parso detected the three variables. We can use this information to validate input parameters against declared ones.

Note: I recently discovered that finding declared variables can also be done with the ast module, which is part of the standard library.
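A minimal sketch of that ast-based alternative is shown below. The helper name declared_names is made up for this example. It only collects names that are assigned at the top level of the cell, whereas get_used_names, as far as I can tell, also reports names that are merely read.

```python
import ast


def declared_names(source):
    """Return the names assigned at the top level of the source, in order."""
    names = []

    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign):
            targets = [node.target]
        else:
            continue

        for target in targets:
            # ast.walk also covers tuple targets such as "a, b = 1, 2";
            # the Store check skips names that are only read, e.g. "d" in d['k'] = 1
            for child in ast.walk(target):
                if (isinstance(child, ast.Name)
                        and isinstance(child.ctx, ast.Store)
                        and child.id not in names):
                    names.append(child.id)

    return names


params_cell = """
# + tags=['parameters']
a = 1
b = 2
c = a + b
"""

print(declared_names(params_cell))
```

Since ast is part of the standard library, this variant removes one dependency, at the cost of writing the tree traversal by hand.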

Putting it all together

We now implement the logic in a single function. It takes Python source code as input and validates it using pyflakes and parso.

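import warnings
from io import StringIO

import jupytext
import parso
from pyflakes.api import check as pyflakes_check
from pyflakes.reporter import Reporter


def check_notebook_source(nb_source, params, filename='notebook'):
    """
    Perform static analysis on a Jupyter notebook's source code, raises
    an exception if validation fails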

    Parameters
    ----------
    nb_source : str
        Jupyter notebook source code in jupytext's py format,
        must have a cell with the tag "parameters"

    params : dict
        Parameters that will be passed to the notebook

    filename : str
        Filename to identify pyflakes warnings and errors
    """
    # parse the source code (jupytext's py format) and convert it to a notebook object
    nb = jupytext.reads(nb_source, fmt='py')

    # add a new cell just below the cell tagged with "parameters"
    # this emulates the notebook that papermill will run
    nb, params_cell = add_passed_parameters(nb, params)

    # run pyflakes and collect errors
    res = check_source(nb, filename=filename)
    error_message = '\n'

    # pyflakes returns "warnings" and "errors", collect them separately
    if res['warnings']:
        error_message += 'pyflakes warnings:\n' + res['warnings']

    if res['errors']:
        error_message += 'pyflakes errors:\n' + res['errors']

    # compare passed parameters with declared
    # parameters. This will make our notebook behave more
    # like a "function", if any parameter is passed but not
    # declared, this will return an error message, if any parameter
    # is declared but not passed, a warning is shown
    res_params = check_params(params_cell['source'], params)
    error_message += res_params

    # if any errors were returned, raise an exception
    if error_message != '\n':
        raise ValueError(error_message)

    return True

Let's now see the implementation of the functions used above:

def check_params(params_source, params):
    """
    Compare the parameters cell's source with the passed parameters, warn
    on missing parameters and return an error message if an extra
    parameter was passed.
    """
    # params are keys in "params" dictionary
    params = set(params)

    # use parso to parse the "parameters" cell source code and get all variable names declared
    declared = set(parso.parse(params_source).get_used_names().keys())

    # now act depending on missing variables and/or extra variables

    missing = declared - params
    extra = params - declared

    if missing:
        warnings.warn(
            'Missing parameters: {}, will use default value'.format(missing))

    if extra:
        return 'Passed non-declared parameters: {}'.format(extra)
    else:
        return ''


def check_source(nb, filename):
    """
    Run pyflakes on a notebook, will catch errors such as missing passed
    parameters that do not have default values
    """
    # concatenate the source of all code cells in a single string
    # (markdown cells are skipped since they are not Python code)
    source = '\n'.join([c['source'] for c in nb.cells
                        if c['cell_type'] == 'code'])

    # these objects are needed to capture pyflakes output
    warn = StringIO()
    err = StringIO()
    reporter = Reporter(warn, err)

    # run pyflakes.api.check on the source code
    pyflakes_check(source, filename=filename, reporter=reporter)

    warn.seek(0)
    err.seek(0)

    # return any error messages returned by pyflakes
    return {'warnings': '\n'.join(warn.readlines()),
            'errors': '\n'.join(err.readlines())}


def add_passed_parameters(nb, params):
    """
    Insert a cell just below the one tagged with "parameters"

    Notes
    -----
    Insert a code cell with params, to simulate the notebook papermill
    will run. This is a simple implementation, for the actual one see:
    https://github.com/nteract/papermill/blob/master/papermill/parameterize.py
    """
    # find "parameters" cell
    idx, params_cell = _get_parameters_cell(nb)

    # convert the parameters passed to valid python code
    # e.g {'a': 1, 'b': 'hi'} to:
    # a = 1
    # b = 'hi'
    params_as_code = '\n'.join([_parse_token(k, v) for k, v in params.items()])

    # insert the cell with the passed parameters
    nb.cells.insert(idx + 1, {'cell_type': 'code', 'metadata': {},
                              'execution_count': None,
                              'source': params_as_code,
                              'outputs': []})
    return nb, params_cell


def _get_parameters_cell(nb):
    """
    Iterate over cells, return the index and cell content
    for the first cell tagged "parameters", if no cell
    is found, raise a ValueError
    """
    for i, c in enumerate(nb.cells):
        cell_tags = c.metadata.get('tags')
        if cell_tags:
            if 'parameters' in cell_tags:
                return i, c

    raise ValueError('Notebook does not have a cell tagged "parameters"')


def _parse_token(k, v):
    """
    Convert parameters to their Python code representation

    Notes
    -----
    This is a very simple way of doing it, for a more complete implementation,
    check out papermill's source code:
    https://github.com/nteract/papermill/blob/master/papermill/translators.py
    """
    return '{} = {}'.format(k, repr(v))
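To see what the repr-based conversion produces and where it stops being safe, here is a small sketch. The function parse_token repeats the one-liner above so the snippet runs on its own, and is_literal is a hypothetical helper that checks whether the generated value can be read back as a plain Python literal.

```python
import ast
from pathlib import Path


def parse_token(k, v):
    # same one-liner as _parse_token above
    return '{} = {}'.format(k, repr(v))


def is_literal(code):
    """True if the right-hand side can be read back as a Python literal."""
    value_source = code.split(' = ', 1)[1]
    try:
        ast.literal_eval(value_source)
        return True
    except (ValueError, SyntaxError):
        return False


print(parse_token('b', 'hi'))
print(is_literal(parse_token('a', [1, 2.5, None])))
print(is_literal(parse_token('product', Path('out.html'))))
```

Numbers, strings, lists and None survive the conversion, but an object such as a Path turns into code that is not a plain literal. This is worth keeping in mind when passing anything other than simple, JSON-like values as parameters.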

Testing our check_notebook_source function

Here we show some use cases for our validation function.

Raise error if "parameters" cell does not exist:

notebook_no_parameters_tag = """
a + b
"""

try:
    check_notebook_source(notebook_no_parameters_tag, {'a': 1, 'b': 2})
except Exception as e:
    print('Raised exception:', e)

Console output: (1/1):

Raised exception: Notebook does not have a cell tagged "parameters"

Do not raise errors if the "parameters" cell exists and the passed parameters match:

notebook_ab = """
# + tags=['parameters']
a = 1
b = 2

# +
a + b
"""
assert check_notebook_source(notebook_ab, {'a': 1, 'b': 2})

Warn if using a default value:

_ = check_notebook_source(notebook_ab, {'a': 1})

Console output: (1/1):

/Users/Edu/miniconda3/envs/blog/lib/python3.6/site-packages/ipykernel_launcher.py:19: UserWarning: Missing parameters: {'b'}, will use default value

Raise an error if passing an undeclared parameter:

try:
    check_notebook_source(notebook_ab, {'a': 1, 'b': 2, 'c': 3})
except Exception as e:
    print('Raised exception:', e)

Console output: (1/2):

Raised exception:

Console output: (2/2):

Passed non-declared parameters: {'c'}

Raise an error if a variable is used but never declared:

notebook_w_warning = """
# + tags=['parameters']
a = 1
b = 2

# +
# variable "c" is used but never declared!
a + b + c
"""
try:
    check_notebook_source(notebook_w_warning, {'a': 1, 'b': 2})
except Exception as e:
    print('Raised exception:', e)

Console output: (1/1):

Raised exception: 
pyflakes warnings:
notebook:7:9 undefined name 'c'

Catch syntax error:

notebook_w_error = """
# + tags=['parameters']
a = 1
b = 2

# +
if
"""
try:
    check_notebook_source(notebook_w_error, {'a': 1, 'b': 2})
except Exception as e:
    print('Raised exception:', e)

Console output: (1/2):

Raised exception:

Console output: (2/2):

pyflakes errors:
notebook:6:3: invalid syntax

if

  ^

Automating the workflow using ploomber

To implement this workflow effectively, we have to make sure that our validation function always runs, then convert the .py file to .ipynb and, finally, execute it using papermill. ploomber can automate this workflow easily, and it can even convert the final output to several formats such as HTML. We only have to pass the source code and place our check_notebook_source call inside the on_render hook.

Note: we include the notebooks' source code as strings in the following example for simplicity. In a real project, it is better to load them from files.

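from pathlib import Path

from ploomber import DAG
from ploomber.products import File
from ploomber.tasks import NotebookRunner

# source code for report 1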
nb = """
# # My report

# + tags=["parameters"]
product = None
x = 1
y = 2
# -

print('x + y =', x + y)
"""

# source code for report 2
nb_another = """
# # Another report

# + tags=["parameters"]
product = None
x = 1
y = 2
# -

print('x - y =', x - y)
"""

# on render hook: run before executing the notebooks
def on_render(task):
    # task.params are read-only, get a copy
    params = task.params.to_dict()

    # papermill (the library ploomber uses to execute notebooks) only supports
    # parameters that are JSON serializable, to test what papermill will
    # actually run, we have to check on the product in its serializable form
    params['product'] = params['product'].to_json_serializable()
    check_notebook_source(str(task.source), params)


# store all reports under output/
out_dir = Path('output')
out_dir.mkdir(exist_ok=True)

# ploomber gives you parallel execution for free
dag = DAG(executor='parallel')

# ploomber supports exporting ipynb files to several formats
# using the official nbconvert package, we convert our
# reports to HTML here by just adding the .html extension
t1 = NotebookRunner(nb, File(out_dir / 'out.html'), dag,
                    name='t1',
                    ext_in='py',
                    kernelspec_name='python3',
                    params={'x': 1})
t1.on_render = on_render

t2 = NotebookRunner(nb_another, File(out_dir / 'another.html'), dag,
                    name='t2',
                    ext_in='py',
                    kernelspec_name='python3',
                    params={'x': 10, 'y': 50})
t2.on_render = on_render

# run the pipeline. No errors are raised but note that a warning is shown
dag.build()

Console output: (1/1):

/Users/Edu/dev/ploomber/src/ploomber/dag.py:469: UserWarning: Task "NotebookRunner: t1 -> File(output/out.html)" had the following warnings:

Missing parameters: {'y'}, will use default value
  warnings.warn(warning)

Summary

The Jupyter notebook (.ipynb) is a great output format, but keeping it under version control causes a lot of trouble. By using simple .py files and leveraging jupytext, we get the best of both worlds: we edit simple Python source code files, but our generated reports are executed as Jupyter notebooks, which allows them to contain rich output such as tables and charts. To save time, we developed a function that validates our input notebook and catches errors before the notebook is executed. Finally, by using ploomber, we were able to create a clean and efficient workflow: HTML reports are transparently generated from plain .py files.



Originally posted at ploomber.io
