<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eduardo Blancas</title>
    <description>The latest articles on DEV Community by Eduardo Blancas (@edublancas).</description>
    <link>https://dev.to/edublancas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F354645%2F700c32da-5d53-4883-b533-557d460620b9.jpeg</url>
      <title>DEV Community: Eduardo Blancas</title>
      <link>https://dev.to/edublancas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/edublancas"/>
    <language>en</language>
    <item>
      <title>How to Access Google Sheets from a Jupyter Notebook</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Tue, 16 Aug 2022 16:12:00 +0000</pubDate>
      <link>https://dev.to/edublancas/how-to-access-google-sheets-from-a-jupyter-notebook-3omm</link>
      <guid>https://dev.to/edublancas/how-to-access-google-sheets-from-a-jupyter-notebook-3omm</guid>
      <description>&lt;p&gt;Data is the starting point for all the projects and products in data science. The first lines of a script are dedicated to reading the data. This is a fact and does not change based on the project. What changes is the source of data. There are a variety of places where we store the data such as databases, S3 bucket, BigQuery, external files, spreadsheets, and so on.&lt;/p&gt;

&lt;p&gt;Google Sheets is quite common for storing small to medium-sized datasets. One of its nice features is that you can connect to a sheet directly from a Jupyter notebook: you don’t have to download the data to a local directory and read it from there. Another advantage of connecting from within a notebook is that you can update the sheet directly.&lt;/p&gt;

&lt;p&gt;Suppose you have a data manipulation task: you write a script that connects to Google Sheets, reads the data, performs the manipulation, and writes the updated data back to the same sheet. You can schedule this script so that the data in the sheet is always up to date.&lt;/p&gt;
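&lt;p&gt;Here is a minimal sketch of what such a script could look like, using the gspread library we install later in this article (the sheet name, credentials file, and the &lt;code&gt;Total&lt;/code&gt; column logic are hypothetical placeholders for your own task):&lt;/p&gt;

```python
def add_total_column(records):
    """Pure transformation step: add a Total field to each row dict."""
    return [{**row, "Total": row["Price"] * row["Quantity"]} for row in records]


def sync_sheet():
    """Read the sheet, transform it, and write the result back."""
    import gspread  # third-party library; installation covered below

    # "service-account.json" and "sample_sales" are hypothetical names
    sa = gspread.service_account(filename="service-account.json")
    ws = sa.open("sample_sales").worksheet("Sheet1")
    updated = add_total_column(ws.get_all_records())
    header = list(updated[0])
    rows = [[row[col] for col in header] for row in updated]
    ws.update([header] + rows)  # overwrite the sheet with the updated data

# call sync_sheet() from a scheduled job (e.g., cron) to keep the sheet fresh
```

&lt;p&gt;The rest of this article walks through the setup that makes the connection part of this sketch work.&lt;/p&gt;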

&lt;p&gt;In this article, we will learn how to access Google Sheets from a Jupyter notebook.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create a project
&lt;/h3&gt;

&lt;p&gt;The first step is to create a project in the &lt;a href="https://console.cloud.google.com/cloud-resource-manager"&gt;Google Cloud Console&lt;/a&gt;, which can be done by clicking on “Create Project” in the console. You have a quota of 12 free projects with your account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ObkgalcG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nzja53qx9pd7ibdgq6sg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ObkgalcG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nzja53qx9pd7ibdgq6sg.png" alt="Image description" width="624" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Enable Google Drive API and Google Sheets API
&lt;/h3&gt;

&lt;p&gt;The next step is to enable the Google Drive API and Google Sheets API. In the Google Cloud Console menu, select APIs &amp;amp; Services and then Enabled APIs &amp;amp; Services as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9c-uwzc5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9zn639x140csybbc05ad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9c-uwzc5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9zn639x140csybbc05ad.png" alt="Image description" width="475" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have multiple projects, you will need to select the one you want to enable the API for. If you only have one project, it will automatically be selected.&lt;/p&gt;

&lt;p&gt;Click on “ENABLE APIS AND SERVICES”, then search for “Google Drive API” and “Google Sheets API”. In the search results, click on the corresponding entries as shown below, then click ENABLE on the page that opens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1r6WiGnU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nw1dn1js5ns0fg2kwsc0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1r6WiGnU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nw1dn1js5ns0fg2kwsc0.png" alt="Image description" width="624" height="144"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2947m8hr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7zp3kuux6ibc4kk9bibl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2947m8hr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7zp3kuux6ibc4kk9bibl.png" alt="Image description" width="578" height="131"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Create credentials
&lt;/h3&gt;

&lt;p&gt;Now that we have the Google Drive API enabled, we need to create credentials. At the top right corner of the page that opens after enabling the API, click CREATE CREDENTIALS and select the service account option. The page shown below opens up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QpMs6u-7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/das8pvr6ab1agrdwb5u5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QpMs6u-7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/das8pvr6ab1agrdwb5u5.png" alt="Image description" width="579" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Give the service account a name and click CREATE AND CONTINUE. When the grant pages open up, just hit CONTINUE and then DONE. You will then see the following screen; click on the link in the email column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0UC7rE12--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u4e5ry87bo72r5v9hsno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0UC7rE12--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u4e5ry87bo72r5v9hsno.png" alt="Image description" width="624" height="80"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It will take you to the service account details page. Copy the email address shown here; you will share the Google sheet with this address so the service account can access it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xhlv-ztf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yfkxuww24mqqbmlv6igj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xhlv-ztf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yfkxuww24mqqbmlv6igj.png" alt="Image description" width="624" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also need to generate a key. Go back to the service accounts page and click on the three dots under the Actions column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d0p9aQII--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fup8xz2yvtbfpftkgub8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d0p9aQII--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fup8xz2yvtbfpftkgub8.png" alt="Image description" width="624" height="66"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on MANAGE KEYS and then ADD KEY. It will ask you to choose a format; select JSON and hit CREATE. The JSON key file will be downloaded automatically.&lt;/p&gt;
&lt;h3&gt;
  
  
  Connecting to the Google Sheet
&lt;/h3&gt;

&lt;p&gt;You, of course, need a Google sheet to connect to. I created one that contains some sample sales data. You will need to share the sheet with the email address copied in the previous step: click on the Share button at the top right corner of the sheet, paste the address, and hit SEND.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Av1ABdbJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e1s03tfqy66chlo79mdq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Av1ABdbJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e1s03tfqy66chlo79mdq.png" alt="Image description" width="399" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are now ready to connect to this sheet. Open up a Jupyter notebook. We will use a Python library called gspread, which can be installed with pip.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;!&lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;gspread
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is to import the library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;gspread&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We need to give the service account details to gspread, which can be done using the &lt;code&gt;service_account&lt;/code&gt; method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gspread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;service_account&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"project-1-357814-ba841f7c3630.json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The path to the JSON file that contains the service account credentials is passed to the &lt;code&gt;filename&lt;/code&gt; parameter. If the file is in the same directory as the notebook, you can just pass the file name.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sa&lt;/code&gt; is a gspread client, which can be used to connect to sheets via the &lt;code&gt;open&lt;/code&gt; method and the sheet name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sheet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sample_sales"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Google Sheets document might have multiple pages (i.e., worksheets), so we also need to specify the page name before getting the data. Ours has a single page called “Sheet1”.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;work_sheet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sheet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;worksheet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sheet1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now have the data in a gspread worksheet object. We can extract it using the &lt;code&gt;get_all_records&lt;/code&gt; method and create a pandas DataFrame as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_sheet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_all_records&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
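&lt;p&gt;To make this step concrete: &lt;code&gt;get_all_records&lt;/code&gt; pairs the header row with every data row and returns one dictionary per row. Conceptually, it works like this (the sample grid below is made up for illustration, not the actual sheet contents):&lt;/p&gt;

```python
# Conceptual mimic of get_all_records: zip the header row with each data row
grid = [
    ["product", "price", "quantity"],  # header row of the sheet
    ["pencil", 2, 10],
    ["notebook", 5, 3],
]

header, *rows = grid
records = [dict(zip(header, row)) for row in rows]
# records == [{'product': 'pencil', 'price': 2, 'quantity': 10},
#             {'product': 'notebook', 'price': 5, 'quantity': 3}]
```

&lt;p&gt;Passing such a list of dictionaries to &lt;code&gt;pd.DataFrame&lt;/code&gt; produces the same frame as the real call above.&lt;/p&gt;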



&lt;p&gt;Let's take a look at the first 5 rows of the data using the &lt;code&gt;head&lt;/code&gt; method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--azUA3xl1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f7wko6rw53njqat3jejj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--azUA3xl1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f7wko6rw53njqat3jejj.png" alt="Image description" width="624" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have successfully connected to a Google sheet and retrieved the data it contains. Most of the steps we completed need to be done only once. After that, connecting is just a matter of passing the sheet name, which is definitely more practical than downloading the data as a CSV file and then reading it.&lt;/p&gt;
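&lt;p&gt;As mentioned earlier, you can also update the sheet directly. A minimal sketch (the &lt;code&gt;total&lt;/code&gt; column and the sample values are hypothetical): convert the DataFrame back to the &lt;code&gt;[header, *rows]&lt;/code&gt; layout that &lt;code&gt;worksheet.update&lt;/code&gt; expects, then push it:&lt;/p&gt;

```python
import pandas as pd


def frame_to_payload(df):
    """Convert a DataFrame to the [header, *rows] layout worksheet.update expects."""
    return [df.columns.tolist()] + df.values.tolist()


df = pd.DataFrame({"product": ["pencil"], "price": [2], "quantity": [10]})
df["total"] = df["price"] * df["quantity"]  # a hypothetical modification
payload = frame_to_payload(df)
# payload == [['product', 'price', 'quantity', 'total'], ['pencil', 2, 10, 20]]
# work_sheet.update(payload)  # uncomment to write the change back to the sheet
```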

&lt;blockquote&gt;
&lt;p&gt;Organizing your data analysis code can become a challenging task, check out our &lt;a href="https://github.com/ploomber/ploomber"&gt;open-source framework&lt;/a&gt; which allows you to build modular data analysis pipelines so you can extract those insights from the spreadsheets!&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>jupyter</category>
      <category>datascience</category>
      <category>analytics</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Analyze and plot 5.5M records in 20s with BigQuery and Ploomber</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Mon, 08 Aug 2022 21:55:42 +0000</pubDate>
      <link>https://dev.to/edublancas/analyze-and-plot-55m-records-in-20s-with-bigquery-and-ploomber-4837</link>
      <guid>https://dev.to/edublancas/analyze-and-plot-55m-records-in-20s-with-bigquery-and-ploomber-4837</guid>
      <description>&lt;p&gt;This tutorial will show how you can use Google Cloud and Ploomber to develop a scalable and production-ready pipeline.&lt;/p&gt;

&lt;p&gt;We'll use Google BigQuery (a data warehouse) and Cloud Storage to show how we can transform big datasets with ease using SQL, plot the results with Python, and store everything in the cloud. Thanks to BigQuery's scalability (we'll use a dataset with 5.5M records!) and Ploomber's convenience, &lt;strong&gt;the entire process, from importing the data to producing the summary report in the cloud, takes less than 20 seconds!&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Before we begin, I'll quickly go over the two Google Cloud services we use for this project. &lt;a href="https://en.wikipedia.org/wiki/BigQuery"&gt;Google BigQuery&lt;/a&gt; is a serverless data warehouse that allows us to analyze data at scale. In simpler terms, we can store massive datasets and query them using SQL without managing servers. On the other hand, &lt;a href="https://en.wikipedia.org/wiki/Google_Cloud_Storage"&gt;Google Cloud Storage&lt;/a&gt; is a storage service equivalent to Amazon S3.&lt;/p&gt;

&lt;p&gt;Since our analysis comprises SQL and Python, we use &lt;a href="https://github.com/ploomber/ploomber"&gt;Ploomber&lt;/a&gt;, an open-source framework for writing maintainable pipelines. It abstracts away the details, so we can focus on writing the SQL and Python scripts.&lt;/p&gt;

&lt;p&gt;Finally, the data. We'll be using a public dataset that contains statistics of people's names in the US over time. The dataset contains 5.5M records. Here's what it looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a1Sb8q8f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dw4vnjifx7wdnt4mkj6w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a1Sb8q8f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dw4vnjifx7wdnt4mkj6w.png" alt="data" width="560" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's now take a look at the pipeline's architecture!&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8cL0jTUg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eqied2nt7etnlsdno1sp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8cL0jTUg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eqied2nt7etnlsdno1sp.png" alt="architecture" width="880" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first step is the &lt;code&gt;create-table.sql&lt;/code&gt; script, which runs a &lt;code&gt;CREATE TABLE&lt;/code&gt; statement to copy a public dataset. &lt;code&gt;create-view.sql&lt;/code&gt; and &lt;code&gt;create-materialized-view.sql&lt;/code&gt; use the resulting table to generate a view and a materialized view (their purpose is to show how to create other types of SQL relations; we don't use their outputs).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;dump-table.sql&lt;/code&gt; script queries the existing table and dumps the results into a local file. Then, the &lt;code&gt;plot.py&lt;/code&gt; script reads the local data file, generates a plot, and uploads it in HTML format to Cloud Storage. The whole process may seem intimidating, but Ploomber makes it straightforward!&lt;/p&gt;

&lt;p&gt;Let's now configure the cloud services we'll use!&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;We need to create a bucket in Cloud Storage and a dataset in BigQuery; the following sections explain how to do so.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Storage
&lt;/h3&gt;

&lt;p&gt;Go to the &lt;a href="https://console.cloud.google.com/storage"&gt;Cloud Storage&lt;/a&gt; console (select a project or create a new one, if needed) and create a new bucket (you may use an existing one if you prefer so). In our case, we'll create a bucket "ploomber-bucket" under the project "ploomber":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WtPsl-4R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mi1q5sj0iv9kqqbee7vt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WtPsl-4R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mi1q5sj0iv9kqqbee7vt.png" alt="create bucket" width="453" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, enter a name (in our case "ploomber-bucket"), and click on "CREATE":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MVLS99PU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l84is5exgaiec21arfye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MVLS99PU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l84is5exgaiec21arfye.png" alt="storage confirm" width="461" height="714"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's now configure BigQuery.&lt;/p&gt;

&lt;h3&gt;
  
  
  BigQuery
&lt;/h3&gt;

&lt;p&gt;Go to the &lt;a href="https://console.cloud.google.com/bigquery"&gt;BigQuery&lt;/a&gt; console and create a dataset. To do so, click on the three stacked dots next to your project's name and then click on "Create dataset":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t_w9pkdX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qy6rc9p6lsox2hh6rvk9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t_w9pkdX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qy6rc9p6lsox2hh6rvk9.png" alt="bigquery-create" width="522" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, enter "my_dataset" as the Dataset ID and "us" in &lt;em&gt;Data location&lt;/em&gt; (location&lt;br&gt;
is important since we'll be using a public dataset located in such region),&lt;br&gt;
then click on "CREATE DATASET":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1azc8yoh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fi4ymok2i3ed28cwy6w0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1azc8yoh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fi4ymok2i3ed28cwy6w0.png" alt="bigquery confirm" width="567" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google Cloud is ready now! Let's now configure our local environment.&lt;/p&gt;
&lt;h3&gt;
  
  
  Local setup
&lt;/h3&gt;

&lt;p&gt;First, let's authenticate so we can make API calls to Google Cloud. Ensure&lt;br&gt;
you authenticate with an account that has enough permissions in the project&lt;br&gt;
to use BigQuery and Cloud Storage:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;If you have trouble, check &lt;a href="https://cloud.google.com/sdk/gcloud/reference/auth"&gt;the docs.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's install Ploomber to get the code example:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# note: this example requires ploomber 0.19.2 or higher&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;ploomber &lt;span class="nt"&gt;--upgrade&lt;/span&gt;

&lt;span class="c"&gt;# download example&lt;/span&gt;
ploomber examples &lt;span class="nt"&gt;-n&lt;/span&gt; templates/google-cloud &lt;span class="nt"&gt;-o&lt;/span&gt; gcloud

&lt;span class="c"&gt;# move to the example folder&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;gcloud
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Let's now review the structure of the project.&lt;/p&gt;
&lt;h2&gt;
  
  
  Project structure
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pipeline.yaml&lt;/code&gt; Pipeline declaration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;clients.py&lt;/code&gt; Functions to create BigQuery and Cloud Storage clients&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;requirements.txt&lt;/code&gt; Python dependencies&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sql/&lt;/code&gt; SQL scripts (executed in BigQuery)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scripts/&lt;/code&gt; Python scripts (executed locally, outputs uploaded to Cloud Storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can look at the files in detail &lt;a href="https://github.com/ploomber/projects/tree/master/templates/google-cloud"&gt;here.&lt;/a&gt; For this tutorial, I'll quickly mention a few crucial details.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pipeline.yaml&lt;/code&gt; is the central file in this project; Ploomber uses this file&lt;br&gt;
to assemble your pipeline and run it, here's what it looks like:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Content of pipeline.yaml&lt;/span&gt;
&lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# NOTE: ensure all products match the dataset name you created&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sql/create-table.sql&lt;/span&gt;
    &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;my_dataset&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;my_table&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sql/create-view.sql&lt;/span&gt;
    &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;my_dataset&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;my_view&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;view&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sql/create-materialized-view.sql&lt;/span&gt;
    &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;my_dataset&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;my_materialized_view&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;view&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# dump data locally (and upload outputs to Cloud Storage)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sql/dump-table.sql&lt;/span&gt;
    &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;products/dump.parquet&lt;/span&gt;
    &lt;span class="na"&gt;chunksize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;null&lt;/span&gt;

  &lt;span class="c1"&gt;# process data with Python (and upload outputs to Cloud Storage)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts/plot.py&lt;/span&gt;
    &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;products/plot.html&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Each task in the &lt;code&gt;pipeline.yaml&lt;/code&gt; file contains two elements: the source code we want to execute and the product. You can see that we have a few SQL scripts that generate tables and views. However, &lt;code&gt;dump-table.sql&lt;/code&gt; declares a &lt;code&gt;.parquet&lt;/code&gt; file as its product, which tells Ploomber to download the query results instead of storing them in BigQuery. Finally, the &lt;code&gt;plot.py&lt;/code&gt; script declares an &lt;code&gt;.html&lt;/code&gt; product; Ploomber will automatically run the script and store the results in the HTML file.&lt;/p&gt;

&lt;p&gt;You might be wondering how the execution order is determined. Ploomber extracts references from the source code itself; for example, &lt;code&gt;create-view.sql&lt;/code&gt; depends on &lt;code&gt;create-table.sql&lt;/code&gt;. If we look at the code, we'll see the reference:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="k"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;create&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;view&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="p"&gt;}};&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;"create-table"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;There is a placeholder &lt;code&gt;{{ upstream["create-table"] }}&lt;/code&gt;, which indicates that &lt;code&gt;create-table.sql&lt;/code&gt; must run first. At runtime, Ploomber will replace the placeholder with the table name. We also have a second placeholder, &lt;code&gt;{{ product }}&lt;/code&gt;, which will be replaced with the value declared in the &lt;code&gt;pipeline.yaml&lt;/code&gt; file.&lt;/p&gt;
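&lt;p&gt;Conceptually, you can mimic the two steps (extracting upstream references, then rendering the placeholders) with a few lines of Python. This is a simplified illustration, not Ploomber's actual implementation:&lt;/p&gt;

```python
import re

sql = '''
DROP VIEW IF EXISTS {{ product }};
CREATE VIEW {{ product }} AS
SELECT *
FROM {{ upstream["create-table"] }}
'''

# step 1: find which tasks must run first
upstream = re.findall(r'{{\s*upstream\["(.+?)"\]\s*}}', sql)
# upstream == ['create-table']

# step 2: replace the placeholders with concrete relation names at runtime
rendered = (sql
            .replace('{{ product }}', 'my_dataset.my_view')
            .replace('{{ upstream["create-table"] }}', 'my_dataset.my_table'))
```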

&lt;p&gt;That's it for the &lt;code&gt;pipeline.yaml&lt;/code&gt;. Let's review the &lt;code&gt;clients.py&lt;/code&gt; file.&lt;/p&gt;
&lt;h2&gt;
  
  
  Configure &lt;code&gt;clients.py&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;clients.py&lt;/code&gt; contains a function that returns clients to communicate with&lt;br&gt;
BigQuery and Cloud Storage.&lt;/p&gt;

&lt;p&gt;For example, this is how we connect to BigQuery:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Content of clients.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;db&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="s"&gt;"""Client to send queries to BigQuery
    """&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;DBAPIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Note that we're returning a &lt;code&gt;ploomber.clients.DBAPIClient&lt;/code&gt; object. Ploomber&lt;br&gt;
wraps BigQuery's connector behind a generic interface, so the same code works with other databases.&lt;/p&gt;
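`DBAPIClient` takes a DB-API 2.0 `connect` callable plus a dictionary of keyword arguments for it, which is why switching databases only changes the function you pass in. A rough illustration of that interface, using the stdlib `sqlite3` driver instead of BigQuery:

```python
import sqlite3

# DBAPIClient(connect, connect_kwargs) ultimately calls connect(**connect_kwargs);
# every DB-API 2.0 driver exposes the same shape, which makes the client
# database-agnostic. Here we mimic that call with sqlite3:
connect, connect_kwargs = sqlite3.connect, {"database": ":memory:"}
conn = connect(**connect_kwargs)

result = conn.execute("SELECT 1 + 1").fetchone()
print(result)  # (2,)
```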

&lt;p&gt;Next, we configure the Cloud Storage client:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Content of clients.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="s"&gt;"""Client to upload files to Google Cloud Storage
    """&lt;/span&gt;
    &lt;span class="c1"&gt;# ensure your bucket_name matches
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;GCloudStorageClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'ploomber-bucket'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'my-pipeline'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Here, we return a &lt;code&gt;ploomber.clients.GCloudStorageClient&lt;/code&gt; object (ensure&lt;br&gt;
that the &lt;code&gt;bucket_name&lt;/code&gt; matches yours!)&lt;/p&gt;

&lt;p&gt;Great, we're ready to run the pipeline!&lt;/p&gt;
&lt;h2&gt;
  
  
  Running the pipeline
&lt;/h2&gt;

&lt;p&gt;Ensure your terminal is open in the &lt;code&gt;gcloud&lt;/code&gt; folder and execute the following:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# run the pipeline&lt;/span&gt;
ploomber build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;After a few seconds of running the &lt;code&gt;ploomber build&lt;/code&gt; command, you should see&lt;br&gt;
something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name                      Ran?      Elapsed (s)    Percentage
------------------------  ------  -------------  ------------
create-table              True          5.67999      30.1718
create-view               True          1.84277       9.78868
create-materialized-view  True          1.566         8.31852
dump-table                True          5.57417      29.6097
plot                      True          4.16257      22.1113
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get an error, you most likely have a misconfiguration. Please send us a&lt;br&gt;
&lt;a href="https://ploomber.io/community"&gt;message on Slack&lt;/a&gt; so we can help you fix it!&lt;/p&gt;

&lt;p&gt;If you open the &lt;a href="https://console.cloud.google.com/bigquery"&gt;BigQuery&lt;/a&gt; console, you'll see the new tables and views:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YfgINYwN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9dat44pbaw7cw8js667x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YfgINYwN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9dat44pbaw7cw8js667x.png" alt="bigquery" width="308" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://console.cloud.google.com/storage"&gt;Cloud Storage&lt;/a&gt; console, you'll see the HTML report:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ctOXTDm---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tnhcougeayrcgz3mnpts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ctOXTDm---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tnhcougeayrcgz3mnpts.png" alt="storage" width="584" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, if you download and open the HTML file, you'll see the plot!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AarK7ahw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tabljlg7ikoovtvs4j8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AarK7ahw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tabljlg7ikoovtvs4j8n.png" alt="plot" width="437" height="317"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Incremental builds
&lt;/h2&gt;

&lt;p&gt;It may take a few iterations to arrive at the final analysis. This process involves&lt;br&gt;
making small changes to your code and rerunning the workflow. Ploomber can&lt;br&gt;
track source code changes to accelerate iterations, so on the next run it only&lt;br&gt;
executes the scripts that are outdated. Enabling this requires a bit of extra&lt;br&gt;
configuration, since Ploomber needs somewhere to store your pipeline's metadata. We&lt;br&gt;
already pre-configured the same workflow to store the metadata in a SQLite&lt;br&gt;
database; you can run it with the following command:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ploomber build &lt;span class="nt"&gt;--entry-point&lt;/span&gt; pipeline.incremental.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;If you run the command another time, you'll see that it skips all tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name                      Ran?      Elapsed (s)    Percentage
------------------------  ------  -------------  ------------
create-table              False               0             0
create-view               False               0             0
create-materialized-view  False               0             0
dump-table                False               0             0
plot                      False               0             0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now try changing &lt;code&gt;plot.py&lt;/code&gt; and rerun the pipeline; you'll see that it skips&lt;br&gt;
most tasks!&lt;/p&gt;
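Conceptually, incremental builds work by recording a fingerprint of each task's source code and skipping tasks whose fingerprint is unchanged since the last run. The snippet below is only a sketch of that idea, not Ploomber's actual implementation:

```python
import hashlib

def fingerprint(source: str) -> str:
    """Hash a task's source code; a different hash marks the task as outdated."""
    return hashlib.sha256(source.encode()).hexdigest()

# recorded on the previous run vs. recomputed on the current run
stored = fingerprint("SELECT * FROM my_table")
current = fingerprint("SELECT * FROM my_table")

# identical fingerprints mean the task can be skipped
print(current == stored)  # True
```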

&lt;h2&gt;
  
  
  Closing remarks
&lt;/h2&gt;

&lt;p&gt;This tutorial showed how to build maintainable and scalable data analysis pipelines on Google Cloud. Ploomber has many other features to simplify your&lt;br&gt;
workflow, such as parametrization (store outputs in a different location each time you&lt;br&gt;
run the pipeline), task parallelization, and even cloud execution (in case&lt;br&gt;
you need more power to run your Python scripts!).&lt;/p&gt;

&lt;p&gt;Check out our &lt;a href="https://docs.ploomber.io/"&gt;documentation&lt;/a&gt; to learn more, and don't hesitate to &lt;a href="https://ploomber.io/community"&gt;send us any questions!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>opensource</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Tips and Tricks to Use Jupyter Notebooks Effectively</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Mon, 01 Aug 2022 17:27:00 +0000</pubDate>
      <link>https://dev.to/edublancas/tips-and-tricks-to-use-jupyter-notebooks-effectively-589n</link>
      <guid>https://dev.to/edublancas/tips-and-tricks-to-use-jupyter-notebooks-effectively-589n</guid>
      <description>&lt;p&gt;The Jupyter Notebook is a web-based interactive computing platform, and it is usually the first tool we learn about in data science. Most of us start our learning journeys in Jupyter notebooks. They are great for learning, practicing, and experimenting.&lt;/p&gt;

&lt;p&gt;There are several reasons why the Jupyter notebook is a highly popular tool. Here are some of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Being able to see the code and the output together makes it easier to learn and practice.&lt;/li&gt;
&lt;li&gt;It supports Markdown cells which are great for write-ups, preparing reports, and documenting your work.&lt;/li&gt;
&lt;li&gt;In-line outputs including data visualizations are highly useful for exploratory data analysis.&lt;/li&gt;
&lt;li&gt;You can run the code cell-by-cell which expedites the debugging process as well as understanding other people’s code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although we quite often use Jupyter notebooks in our work, we do not always make the most of them and often fail to discover their full potential. In this article, we will go over some tips and tricks to get more out of Jupyter notebooks. Some of these are shortcuts that can increase your efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  New cell
&lt;/h3&gt;

&lt;p&gt;Creating a new cell is one of the most frequently done operations while working in a Jupyter notebook so a quick way of doing this is very helpful.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ESC + A creates a new cell above the current cell&lt;/li&gt;
&lt;li&gt;ESC + B creates a new cell below the current cell&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--An_KX3ID--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jjm1igiownqmzjlfv71c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--An_KX3ID--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jjm1igiownqmzjlfv71c.png" alt="Image description" width="454" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cell output
&lt;/h3&gt;

&lt;p&gt;One of the great features of Jupyter notebooks is that they maintain the state of execution of each cell. In other words, cell outputs are cached. This is very useful because you do not have to execute a cell each time you want to check its output or results. &lt;/p&gt;

&lt;p&gt;However, some outputs take too much space and make the overall content hard to follow. We can hide a cell output with  “ESC + O” and unhide by pressing these keys again.&lt;/p&gt;

&lt;p&gt;If a cell is not needed anymore, you can delete it with “ESC + D + D”.&lt;/p&gt;

&lt;h3&gt;
  
  
  Magic commands
&lt;/h3&gt;

&lt;p&gt;Magic commands are built into the IPython kernel. They are quite useful for performing a variety of tasks. Magic commands start with the “%” character. Here are some examples that will help you become more productive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# prints the current working directory&lt;/span&gt;
%pwd

&lt;span class="c"&gt;# change the current working directory&lt;/span&gt;
%cd

&lt;span class="c"&gt;# list files and folders in the current working directory&lt;/span&gt;
%ls

&lt;span class="c"&gt;# list files and folders in a specific folder&lt;/span&gt;
%ls &lt;span class="o"&gt;[&lt;/span&gt;path to folder]

&lt;span class="c"&gt;# export the current current IPython history to a notebook file&lt;/span&gt;
%notebook &lt;span class="o"&gt;[&lt;/span&gt;filename]

&lt;span class="c"&gt;# lists currently available magics&lt;/span&gt;
%lsmagic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;If you're looking to improve your Jupyter workflow, check out Ploomber's open-source projects: &lt;a href="https://github.com/ploomber/ploomber"&gt;Ploomber&lt;/a&gt; for developing modular data pipelines, &lt;a href="https://github.com/ploomber/soorgeon"&gt;Soorgeon&lt;/a&gt; for refactoring and cleaning notebooks, or &lt;a href="https://github.com/ploomber/nbsnapshot"&gt;nbsnapshot&lt;/a&gt; for notebook testing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Multiple outputs
&lt;/h3&gt;

&lt;p&gt;By default, when you execute a cell that returns multiple outputs, only the last output is shown. Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mylist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"programming"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;myset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"programming"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;mylist&lt;/span&gt;
&lt;span class="n"&gt;myset&lt;/span&gt;

&lt;span class="c1"&gt;# output
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'g'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'i'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'m'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'n'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'o'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'p'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is possible to see all the outputs, but we need to change a setting, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;IPython.core.interactiveshell&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InteractiveShell&lt;/span&gt;
&lt;span class="n"&gt;InteractiveShell&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ast_node_interactivity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"all"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we execute the code block above, we will see the output of both variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mylist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"programming"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;myset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"programming"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;mylist&lt;/span&gt;
&lt;span class="n"&gt;myset&lt;/span&gt;

&lt;span class="c1"&gt;# output
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'p'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'o'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'g'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'m'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'m'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'i'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'n'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'g'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'g'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'i'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'m'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'n'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'o'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'p'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Shortcuts
&lt;/h3&gt;

&lt;p&gt;There are several shortcuts that you can use in Jupyter notebooks. Here are the ones I find quite useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CTRL + D (⌘ + D in Mac) deletes what is written in the current line.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vkhkaa-W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n47d731gmf5v7k6224nn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vkhkaa-W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n47d731gmf5v7k6224nn.png" alt="Image description" width="505" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ESC + M changes a cell to Markdown.&lt;/li&gt;
&lt;li&gt;ESC + UP or ESC + K selects the cell above.&lt;/li&gt;
&lt;li&gt;ESC + DOWN or ESC + J selects the cell below.&lt;/li&gt;
&lt;li&gt;ESC + SHIFT + M merges selected cells. If only one cell is selected, it is merged with the cell below.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Learning about a function
&lt;/h3&gt;

&lt;p&gt;Python has a rich selection of third-party libraries, which simplify tasks and speed up development. These libraries typically provide many functions and methods, and we sometimes can’t remember exactly what a function does or what its syntax is.&lt;/p&gt;

&lt;p&gt;In such cases, we can view a function's signature and docstring inside the Jupyter notebook. We just need to add &lt;code&gt;?&lt;/code&gt; at the end of the function name. Here is how we can learn about the &lt;code&gt;query&lt;/code&gt; function of the pandas library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have learned about some useful tips and tricks for Jupyter notebooks. You do not have to use all of them immediately but you will see they increase your productivity and efficiency once you start using them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Questions? &lt;a href="https://ploomber.io/community"&gt;Join our growing community of Jupyter practitioners!&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Credits
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://ploomber.io"&gt;ploomber.io&lt;/a&gt; by Soner Yildirim, re-shared with permission&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.flaticon.com/free-icons/effective"&gt;Effective icons created by Pixel perfect - Flaticon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.flaticon.com/free-icons/notebook"&gt;Notebook icons created by mikan933 - Flaticon&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>jupyter</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Machine Learning Model Selection with Nested Cross-Validation</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Wed, 27 Jul 2022 16:14:49 +0000</pubDate>
      <link>https://dev.to/edublancas/machine-learning-model-selection-with-nested-cross-validation-3kgg</link>
      <guid>https://dev.to/edublancas/machine-learning-model-selection-with-nested-cross-validation-3kgg</guid>
      <description>&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/37NIM3RSMz4"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ploomber.io/blog/nested-cv/"&gt;Supplemental material.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Three Tools for Executing Jupyter Notebooks</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Tue, 26 Jul 2022 02:35:00 +0000</pubDate>
      <link>https://dev.to/edublancas/three-tools-for-executing-jupyter-notebooks-fob</link>
      <guid>https://dev.to/edublancas/three-tools-for-executing-jupyter-notebooks-fob</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Executing notebooks programmatically can be very helpful in various situations, especially for long-running code (e.g., training a model) or parallelized execution (e.g., training a hundred models at the same time). It is also vital for automating data analysis in projects that run at regular intervals or that involve more than one notebook. This blog post will introduce three commonly used tools for executing notebooks: Ploomber, Papermill, and NBClient.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ploomber
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cpur8tnfy6gdxzlnrv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cpur8tnfy6gdxzlnrv5.png" alt="ploomber logo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ploomber is a complete solution for notebook execution. It builds on top of papermill and extends it to allow writing multi-stage workflows where each task is a notebook. Since it manages orchestration automatically, you can &lt;a href="https://github.com/ploomber/projects/tree/master/cookbook/grid" rel="noopener noreferrer"&gt;run notebooks in parallel&lt;/a&gt; without having to write extra code.&lt;/p&gt;

&lt;p&gt;Another feature of Ploomber is that you can use the &lt;a href="https://jupytext.readthedocs.io/en/latest/formats.html#the-percent-format" rel="noopener noreferrer"&gt;percent format&lt;/a&gt; (supported by &lt;a href="https://code.visualstudio.com/docs/python/jupyter-support-py#_jupyter-code-cells" rel="noopener noreferrer"&gt;VSCode&lt;/a&gt;, PyCharm, etc.) and execute it as a notebook, to automatically capture the outputs like charts or tables in an output file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7x1snkut539595eaunu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7x1snkut539595eaunu.png" alt="python-percent-format"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, you can export the pipelines to airflow, Kubernetes, etc. Please refer to &lt;a href="https://soopervisor.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt; for more information on how to go from a notebook to a production pipeline.&lt;/p&gt;

&lt;p&gt;Ploomber offers two interfaces for notebook execution: YAML and Python. The first one is the easiest to get started, and the second offers more flexibility for building more complex workflows. Furthermore, it provides a free cloud service to execute your notebooks in the cloud and parallelize experiments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Execute Notebooks via Python API
&lt;/h3&gt;

&lt;p&gt;Ploomber offers a Python API for executing notebooks. The following example will run &lt;code&gt;first.ipynb&lt;/code&gt;, then &lt;code&gt;second.ipynb&lt;/code&gt; and store the executed notebooks in &lt;code&gt;out/first.ipynb&lt;/code&gt; and &lt;code&gt;out/second.ipynb&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ploomber&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ploomber.tasks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NotebookRunner&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ploomber.products&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NotebookRunner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;first.ipynb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;out/first.ipynb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;second&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NotebookRunner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;second.ipynb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;out/second.ipynb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;second&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Execute Notebooks via YAML API
&lt;/h3&gt;

&lt;p&gt;Ploomber also offers a YAML API for executing notebooks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;first.ipynb&lt;/span&gt;
    &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;out/first.ipynb&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;second.ipynb&lt;/span&gt;
    &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;out/second.ipynb&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then users call the following code to execute the notebook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ploomber build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Execute Notebooks on Cloud
&lt;/h3&gt;

&lt;p&gt;Ploomber also supports executing notebooks on the cloud by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ploomber cloud build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please refer to &lt;a href="https://docs.ploomber.io/en/latest/cloud/cloud-execution.html" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt; for more information on Cloud Execution with Ploomber.&lt;/p&gt;

&lt;h2&gt;
  
  
  Papermill
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vsmlyi93rcpo97td8ci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vsmlyi93rcpo97td8ci.png" alt="papermill"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Papermill's main feature is injecting parameters into a notebook. This lets you use notebooks as templates (e.g., run the same model-training notebook with different parameters). However, it limits itself to providing a function that executes a single notebook, so there is no built-in way to manage concurrent executions.&lt;/p&gt;
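Parameter injection works through a cell tagged `parameters` in the input notebook, which holds the default values; papermill inserts a new cell right after it containing the overrides you pass in. A sketch of what ends up in the executed notebook (the default values here are hypothetical):

```python
# cell tagged "parameters" in the input notebook: the defaults
alpha = 0.1
ratio = 0.5

# cell injected by papermill right after it, overriding the defaults
# (matching, e.g., parameters=dict(alpha=0.6, ratio=0.1))
alpha = 0.6
ratio = 0.1
```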

&lt;p&gt;There are two ways to execute the notebook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python API&lt;/li&gt;
&lt;li&gt;Command Line Interface&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Execute Notebooks via Python API
&lt;/h3&gt;

&lt;p&gt;Papermill offers a Python API. Users can execute notebooks with Papermill by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;papermill&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;

&lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_notebook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path/to/input.ipynb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path/to/output.ipynb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
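&lt;p&gt;Under the hood, parameter injection amounts to inserting a new code cell right after the cell tagged &lt;code&gt;parameters&lt;/code&gt;. The following is a minimal standard-library sketch of the idea on a notebook represented as a plain dict; it is not Papermill's actual implementation.&lt;/p&gt;

```python
def inject_parameters(nb, params):
    """Insert a code cell assigning `params` after the cell tagged 'parameters'."""
    lines = [f"{name} = {value!r}" for name, value in params.items()]
    new_cell = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": "\n".join(lines),
        "outputs": [],
        "execution_count": None,
    }
    for i, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            # the injected cell overrides the defaults declared above it
            nb["cells"].insert(i + 1, new_cell)
            break
    return nb

nb = {"cells": [
    {"cell_type": "code", "metadata": {"tags": ["parameters"]}, "source": "alpha = 0.1"},
    {"cell_type": "code", "metadata": {}, "source": "print(alpha)"},
]}
nb = inject_parameters(nb, {"alpha": 0.6, "ratio": 0.1})
```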



&lt;h3&gt;
  
  
  Execute Notebooks via CLI
&lt;/h3&gt;

&lt;p&gt;Users can also execute notebooks via the CLI. To run a notebook, enter the following &lt;code&gt;papermill&lt;/code&gt; command in the terminal with the input notebook, the location for the output notebook, and any options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;papermill input.ipynb output.ipynb &lt;span class="nt"&gt;-p&lt;/span&gt; alpha 0.6 &lt;span class="nt"&gt;-p&lt;/span&gt; l1_ratio 0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Via the CLI, users can pass parameters as a parameters file, a YAML string, or raw strings. You can refer to &lt;a href="https://papermill.readthedocs.io/en/latest/usage-execute.html" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt; for more information on executing a notebook with parameters via the CLI.&lt;/p&gt;
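&lt;p&gt;If you want to drive the CLI from a script, one approach is to turn a parameters dict into &lt;code&gt;-p&lt;/code&gt; flags and hand the result to &lt;code&gt;subprocess&lt;/code&gt;. A hedged sketch (the helper below is illustrative, not part of Papermill):&lt;/p&gt;

```python
import subprocess

def papermill_command(input_nb, output_nb, params):
    """Build the argv list for a papermill CLI call using -p flags."""
    cmd = ["papermill", input_nb, output_nb]
    for name, value in params.items():
        cmd += ["-p", name, str(value)]
    return cmd

cmd = papermill_command("input.ipynb", "output.ipynb",
                        {"alpha": 0.6, "l1_ratio": 0.1})
# subprocess.run(cmd, check=True)  # uncomment to actually execute
```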

&lt;h2&gt;
  
  
  NBClient
&lt;/h2&gt;

&lt;p&gt;NBClient provides a convenient way to execute the input cells of a &lt;code&gt;.ipynb&lt;/code&gt; notebook file and save the results, both input and output cells, as a &lt;code&gt;.ipynb&lt;/code&gt; file. If you need to export notebooks to other formats, such as reStructured Text or Markdown (optionally executing them), please refer to &lt;a href="https://nbconvert.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;nbconvert&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It offers a few extra features like notebook-level and cell-level hooks and also supports two ways of executing notebooks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python API&lt;/li&gt;
&lt;li&gt;Command Line Interface&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Execute Notebooks via Python API
&lt;/h3&gt;

&lt;p&gt;The following quick example shows how to import the &lt;code&gt;nbformat&lt;/code&gt; module and the &lt;code&gt;NotebookClient&lt;/code&gt; class, then load and configure the notebook stored at &lt;code&gt;notebook_filename&lt;/code&gt;. We specified two optional arguments, &lt;code&gt;timeout&lt;/code&gt; and &lt;code&gt;kernel_name&lt;/code&gt;, which define the cell execution timeout and the execution kernel. Usually, we don’t need to set these options, but these and others are available to control the execution context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nbclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NotebookClient&lt;/span&gt;

&lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook_filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;as_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NotebookClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernel_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we can execute the notebook by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we can save the resulting notebook in the current folder in the file &lt;code&gt;executed_notebook.ipynb&lt;/code&gt; by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;nbformat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;executed_notebook.ipynb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
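&lt;p&gt;After execution, each code cell carries its results in an &lt;code&gt;outputs&lt;/code&gt; list, as defined by the nbformat v4 schema. Here is a small standard-library sketch of pulling the printed text out of an executed notebook represented as a dict (the helper is illustrative, not part of NBClient):&lt;/p&gt;

```python
def collect_stream_text(nb):
    """Gather the stream (stdout/stderr) text emitted by each executed code cell."""
    chunks = []
    for cell in nb["cells"]:
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") == "stream":
                chunks.append(output.get("text", ""))
    return "".join(chunks)

# a minimal executed-notebook dict following the nbformat v4 layout
executed = {"cells": [
    {"cell_type": "markdown", "source": "# Title"},
    {"cell_type": "code", "outputs": [
        {"output_type": "stream", "name": "stdout", "text": "hello\n"}]},
]}
text = collect_stream_text(executed)
```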



&lt;h3&gt;
  
  
  Execute Notebooks via CLI
&lt;/h3&gt;

&lt;p&gt;NBClient supports running notebooks via the CLI for the most basic use cases. However, for more sophisticated execution options, consider &lt;a href="https://ploomber.io/" rel="noopener noreferrer"&gt;Ploomber&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Running a notebook is this easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jupyter execute notebook.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It expects notebooks as input arguments and accepts optional flags to modify the default behavior. We can also pass more than one notebook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jupyter execute notebook.ipynb notebook2.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In a nutshell, NBClient is the most basic way to execute notebooks, and Papermill builds on top of NBClient. Both support running notebooks via a Python API and a CLI.&lt;/p&gt;

&lt;p&gt;Ploomber is the most complete and most convenient solution. It builds on top of Papermill and extends it to allow writing multi-stage workflows where each task is a notebook. Besides the Python API and CLI, Ploomber also lets users execute notebooks via a YAML API or in the cloud.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Enjoyed this article? Join our growing &lt;a href="https://ploomber.io/community" rel="noopener noreferrer"&gt;community of Jupyter users&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ploomber/ploomber" rel="noopener noreferrer"&gt;Ploomber Source Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.ploomber.io/en/stable/" rel="noopener noreferrer"&gt;Ploomber Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/nteract/papermill" rel="noopener noreferrer"&gt;Papermill Source Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://papermill.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Papermill Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jupyter/nbclient" rel="noopener noreferrer"&gt;NBClient Source Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nbclient.readthedocs.io/en/latest/client.html" rel="noopener noreferrer"&gt;NBClient Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>github</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The 10 Trending Python Repositories on GitHub (May 2022)</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Thu, 23 Jun 2022 15:11:10 +0000</pubDate>
      <link>https://dev.to/edublancas/the-10-trending-python-repositories-on-github-may-2022-l92</link>
      <guid>https://dev.to/edublancas/the-10-trending-python-repositories-on-github-may-2022-l92</guid>
      <description>&lt;p&gt;A few months ago, I discovered that GitHub keeps track of &lt;a href="https://github.com/trending/python?since=monthly"&gt;trending repositories&lt;/a&gt;, and since then, I often take a look at it to see what's up. So this month, I decided to share my thoughts on what I found; let's get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/borisdayma/dalle-mini"&gt;DALL·E Mini&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;AI model that generates images from text.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LwS3uG4q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/26xr68d6kswq7w0y1v86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LwS3uG4q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/26xr68d6kswq7w0y1v86.png" alt="dall e mini" width="880" height="812"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The announcement of OpenAI's &lt;a href="https://openai.com/dall-e-2/"&gt;DALL·E 2&lt;/a&gt; took the community by storm, but given that it's not publicly available, it's no surprise that this project is seeing significant interest.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/PaddlePaddle/PaddleNLP/blob/develop/README_en.md"&gt;PaddleNLP&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;NLP library with pre-trained models.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dWZHkijL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mqp6iq6oh201mdj8caz4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dWZHkijL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mqp6iq6oh201mdj8caz4.gif" alt="paddle nlp" width="880" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PaddleNLP is a library for Natural Language processing. It provides a comprehensive set of Chinese transformer models, and its design is based on Hugging Face's Transformer library.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/hpcaitech/ColossalAI"&gt;ColossalAI&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A framework for large-scale Deep Learning parallel training.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ANELUyG6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h5ha20c3f4y7mydseugu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ANELUyG6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h5ha20c3f4y7mydseugu.png" alt="colossal" width="880" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As transformer architectures become the standard in many CV and NLP tasks, better performance comes with larger model sizes. Colossal AI aims to provide a simple interface to train large models in parallel.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/iperov/DeepFaceLive"&gt;DeepFaceLive&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A library to swap faces from a webcam or video.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OkHBgDXs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ma6ba61280ts7sgwhcfc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OkHBgDXs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ma6ba61280ts7sgwhcfc.png" alt="DeepFaceLive" width="880" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DeepFaceLive allows changing the face in real-time or from a recording. Imagine hopping on a Zoom call and looking like &lt;a href="https://github.com/iperov/DeepFaceLive/blob/master/doc/celebs/Keanu_Reeves/examples.md"&gt;Keanu Reeves&lt;/a&gt;. Crazy!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/heartexlabs/label-studio"&gt;Label Studio&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A data labeling tool for audio, text, images, videos, and time series via a UI.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8hWHc3lO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oa5imkl285yr9lfqso4y.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8hWHc3lO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oa5imkl285yr9lfqso4y.gif" alt="LabelStudio" width="880" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting accurately labeled data is the first task in many ML projects. Label Studio supports many types of data and offers a graphical user interface for labeling it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/ploomber/ploomber"&gt;Intermission: Ploomber&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Ploomber is a framework to develop pipelines interactively (Jupyter, VSCode) and deploy them to the cloud (Kubernetes, Airflow, AWS Batch, SLURM).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xNZpnkot--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o948pqr0g66m1ilrwwc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xNZpnkot--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o948pqr0g66m1ilrwwc9.png" alt="ploomber" width="880" height="689"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interactive tools like Jupyter make it hard to develop maintainable projects; Ploomber allows data scientists to keep the interactive workflow they are used to while embracing best practices from software engineering to ease the transition to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/bregman-arie/devops-exercises"&gt;DevOps Exercise&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A collection of &amp;gt;2.2k DevOps interview questions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JlGQIeWk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6zp70ep28be96qejyjgb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JlGQIeWk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6zp70ep28be96qejyjgb.png" alt="devops exercises" width="500" height="120"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first non-AI repository on the list! This repository hosts more than 2.2k DevOps questions to help you prepare for your interview!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/PaddlePaddle/PaddleOCR"&gt;PaddleOCR&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A library for creating Optical Character Recognition tools.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_5OQHlBA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pl7jxowshx4d1nsokzyr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_5OQHlBA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pl7jxowshx4d1nsokzyr.png" alt="paddle ocr" width="880" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PaddleOCR supports many OCR-related algorithms to help users through data production, model training, compression, inference, and deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/iperov/DeepFaceLab"&gt;DeepFaceLab&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;DeepFaceLab is a library to replace faces in videos.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WshuH7D3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aakka47qodxfm0a2n5nx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WshuH7D3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aakka47qodxfm0a2n5nx.jpg" alt="deepface lab" width="880" height="838"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another deepfakes library! According to the repository, more than 95% of deepfake videos are created with DeepFaceLab.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/unifyai/ivy"&gt;IVY&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Ivy aims to provide a single interface for ML frameworks.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1kh0FDsh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4xcqpl02eo7bvsaw4dff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1kh0FDsh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4xcqpl02eo7bvsaw4dff.png" alt="ivy" width="880" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the explosion of computational frameworks such as JAX, TensorFlow, PyTorch, MXNet, and NumPy, it's hard for practitioners to keep up and master them. Ivy aims to unify them so you can write once and export to any of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/apache/airflow"&gt;Airflow&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Airflow is a platform to author, schedule, and monitor workflows.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jY9OeWUI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/otgs1quiuqwkemijpjmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jY9OeWUI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/otgs1quiuqwkemijpjmy.png" alt="airflow" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Airflow is one of the most widely used platforms for managing workflows. It allows you to define workflows as directed acyclic graphs of tasks and schedule them.&lt;/p&gt;




&lt;p&gt;Originally posted at &lt;a href="https://ploomber.io/posts/github-22-05"&gt;ploomber.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>github</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>From Jupyter to Kubernetes: Refactoring and Deploying Notebooks Using Open-Source Tools</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Thu, 23 Jun 2022 12:14:44 +0000</pubDate>
      <link>https://dev.to/edublancas/from-jupyter-to-kubernetes-refactoring-and-deploying-notebooks-using-open-source-tools-2a2</link>
      <guid>https://dev.to/edublancas/from-jupyter-to-kubernetes-refactoring-and-deploying-notebooks-using-open-source-tools-2a2</guid>
      <description>&lt;p&gt;Notebooks are great for rapid iterations and prototyping but quickly get messy. After working on a notebook, my code becomes difficult to manage and unsuitable for deployment. In production, code organization is essential for maintainability (it's much easier to improve and debug organized code than a long, messy notebook).&lt;/p&gt;

&lt;p&gt;In this post, I'll describe how you can use our open-source tools to cover the entire life cycle of a Data Science project: starting from a messy notebook and ending with that code running in production. Let's get started!&lt;/p&gt;

&lt;p&gt;The first step is to clean up our notebook with automated tools; then, we'll automatically refactor our monolithic notebook into a modular pipeline with &lt;code&gt;soorgeon&lt;/code&gt;; after that, we'll test that our pipeline runs; and, finally, we'll deploy our pipeline to Kubernetes. The main benefit of this workflow is that all steps are fully automated, so we can return to Jupyter, iterate (or fix bugs), and deploy again effortlessly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cleaning up the notebook
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjbugwa11fe0guur493e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjbugwa11fe0guur493e.png" alt="soorgeon-clean"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interactivity of notebooks makes it simple to try out new ideas, but it also yields messy code. While exploring data, we often rush to write code without considering readability. Lucky for us, there are tools like &lt;a href="https://github.com/PyCQA/isort" rel="noopener noreferrer"&gt;isort&lt;/a&gt; and &lt;a href="https://github.com/psf/black" rel="noopener noreferrer"&gt;black&lt;/a&gt; which allow us to easily re-format our code to improve readability. Unfortunately, these tools only work with &lt;code&gt;.py&lt;/code&gt; files; however, &lt;code&gt;soorgeon&lt;/code&gt; enables us to run them on notebook files (&lt;code&gt;.ipynb&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;soorgeon
soorgeon clean path/to/notebook.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: If you need an example notebook to try these commands, here's one:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://raw.githubusercontent.com/ploomber/soorgeon/main/examples/machine-learning/nb.ipynb &lt;span class="nt"&gt;-o&lt;/span&gt; notebook.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check out the image at the beginning of this section: I introduced some extra whitespace in the notebook on the left. After applying &lt;code&gt;soorgeon clean&lt;/code&gt; (notebook on the right), the extra whitespace is gone. So now we can focus on writing code and run &lt;code&gt;soorgeon clean&lt;/code&gt; to auto-format it easily!&lt;/p&gt;
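&lt;p&gt;Since a &lt;code&gt;.ipynb&lt;/code&gt; file is just JSON, the core of the trick is easy to picture with the standard library alone. The toy sketch below only strips trailing whitespace from code cells; &lt;code&gt;soorgeon clean&lt;/code&gt; itself delegates to black and isort:&lt;/p&gt;

```python
import json

def strip_trailing_whitespace(nb):
    """Remove trailing whitespace from every line of each code cell."""
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            src = cell["source"]
            # nbformat allows source as a single string or a list of lines
            lines = src.split("\n") if isinstance(src, str) else src
            cell["source"] = "\n".join(line.rstrip() for line in lines)
    return nb

nb = {"cells": [{"cell_type": "code", "source": "x = 1   \ny = 2  "}]}
cleaned = strip_trailing_whitespace(nb)
# json.dump(cleaned, ...) would write the cleaned notebook back to disk
```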

&lt;h2&gt;
  
  
  Refactoring the notebook
&lt;/h2&gt;

&lt;p&gt;Creating an analysis in a single notebook is convenient: we can move around sections and edit them easily; however, this has many drawbacks: it's hard to collaborate and test. Organizing our analysis in multiple files will allow us to define clear boundaries so multiple people can work on the project without getting in each other's way.&lt;/p&gt;

&lt;p&gt;The process of going from a single notebook to a modular pipeline is time-consuming and error-prone; fortunately, &lt;code&gt;soorgeon&lt;/code&gt; can do the heavy lifting for us:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;soorgeon
soorgeon refactor path/to/notebook.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Upon refactoring, we'll see a bunch of new files:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6za31xxny7gq26mxs6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6za31xxny7gq26mxs6d.png" alt="file tree"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;soorgeon&lt;/code&gt; turns our notebook into a modularized project automatically! It generates a &lt;code&gt;README.md&lt;/code&gt; with basic instructions and a &lt;code&gt;requirements.txt&lt;/code&gt; (extracting package names from &lt;code&gt;import&lt;/code&gt; statements). Furthermore, it creates a &lt;code&gt;tasks/&lt;/code&gt; directory with a few &lt;code&gt;.ipynb&lt;/code&gt; files; these files come from the original notebook's sections, separated by Markdown headings. &lt;code&gt;soorgeon refactor&lt;/code&gt; figures out which sections depend on which ones.&lt;/p&gt;

&lt;p&gt;If you prefer to export &lt;code&gt;.py&lt;/code&gt; files, you can pass the &lt;code&gt;--file-format&lt;/code&gt; option:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;soorgeon refactor nb.ipynb --file-format py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;tasks/&lt;/code&gt; directory will have &lt;code&gt;.py&lt;/code&gt; files this time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── README.md
├── nb.ipynb
├── pipeline.yaml
├── requirements.txt
└── tasks
    ├── clean.py
    ├── linear-regression.py
    ├── load.py
    ├── random-forest-regressor.py
    └── train-test-split.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;soorgeon&lt;/code&gt; uses Markdown headings to determine how many output tasks to generate. In our case, there are five of them. Then, &lt;code&gt;soorgeon&lt;/code&gt; analyzes the code to resolve the dependencies among sections and adds the necessary code to pass outputs to each task.&lt;/p&gt;

&lt;p&gt;For example, our "Train test split" section creates the variables &lt;code&gt;X&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, &lt;code&gt;X_train&lt;/code&gt;, &lt;code&gt;X_test&lt;/code&gt;, &lt;code&gt;y_train&lt;/code&gt;, and &lt;code&gt;y_test&lt;/code&gt;, and the last four variables are used by the "Linear regression" section:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbj7qx56oup60dpsqxrs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbj7qx56oup60dpsqxrs.png" alt="notebook-sections"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By determining input and output variables, &lt;code&gt;soorgeon&lt;/code&gt; determines that the "Linear regression" section depends on the "Train test split" section. Furthermore, the "Random Forest Regressor" section also depends on the "Train test split" since it also uses the variables generated by the "Train test split" section. With this information, &lt;code&gt;soorgeon&lt;/code&gt; builds the dependency graph.&lt;/p&gt;
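&lt;p&gt;The dependency resolution described above can be pictured as set algebra over each section's defined and used variables. The sketch below is an illustration of that idea, not &lt;code&gt;soorgeon&lt;/code&gt;'s actual code; the sections mirror the example in this post:&lt;/p&gt;

```python
def infer_dependencies(sections):
    """sections maps name -> (defined_vars, used_vars); returns name -> upstream names."""
    deps = {}
    for name, (_, used) in sections.items():
        upstream = set()
        for other, (defined, _) in sections.items():
            # a section depends on any other section that defines a variable it uses
            if other != name and used.intersection(defined):
                upstream.add(other)
        deps[name] = upstream
    return deps

sections = {
    "train-test-split": ({"X", "y", "X_train", "X_test", "y_train", "y_test"}, {"df"}),
    "linear-regression": (set(), {"X_train", "X_test", "y_train", "y_test"}),
    "random-forest-regressor": (set(), {"X_train", "X_test", "y_train", "y_test"}),
}
deps = infer_dependencies(sections)
```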

&lt;h2&gt;
  
  
  Testing our pipeline
&lt;/h2&gt;

&lt;p&gt;Now it's time to ensure that our modular pipeline runs correctly. To do so, we'll use the second package in our toolbox: &lt;code&gt;ploomber&lt;/code&gt;. Ploomber allows us to develop and execute our pipelines locally.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# execute pipeline&lt;/span&gt;
ploomber build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name               Ran?      Elapsed (s)    Percentage
-----------------  ------  -------------  ------------
load               True         14.4272       38.6993
clean              True          7.89353      21.1734
train-test-split   True          2.98341       8.00263
linear-regression  True          3.77029      10.1133
random-forest-regressor  True    8.20591       22.0113
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ploomber&lt;/code&gt; offers a lot of tools to manage our pipeline; for example, we can generate a plot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ploomber plot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq31y3tde6drw7z02kbb2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq31y3tde6drw7z02kbb2.png" alt="pipeline-plot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see the dependency graph; there are three serial tasks: &lt;code&gt;load&lt;/code&gt;, &lt;code&gt;clean&lt;/code&gt;, and &lt;code&gt;train-test-split&lt;/code&gt;. After them, we see two independent tasks: &lt;code&gt;linear-regression&lt;/code&gt;, and &lt;code&gt;random-forest-regressor&lt;/code&gt;. The advantage of modularizing our work is that members of our team can work independently, we can &lt;a href="https://docs.ploomber.io/en/latest/user-guide/testing.html" rel="noopener noreferrer"&gt;test tasks&lt;/a&gt; in isolation, and run independent tasks in &lt;a href="https://docs.ploomber.io/en/latest/api/_modules/executors/ploomber.executors.Parallel.html" rel="noopener noreferrer"&gt;parallel&lt;/a&gt;. With &lt;code&gt;ploomber&lt;/code&gt; we can keep developing the pipeline with Jupyter until we're ready to deploy!&lt;/p&gt;
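&lt;p&gt;Which tasks can run in parallel follows directly from the dependency graph: repeatedly schedule every task whose upstream tasks have all finished. A minimal sketch of that scheduling idea (illustrative only, not Ploomber's executor; the graph mirrors the plot above):&lt;/p&gt;

```python
def execution_waves(deps):
    """Group tasks into waves; tasks in the same wave can run in parallel."""
    done, waves = set(), []
    while done != set(deps):
        # a task is ready once every upstream dependency has completed
        ready = sorted(t for t, up in deps.items()
                       if t not in done and up.issubset(done))
        if not ready:
            raise ValueError("cycle detected")
        waves.append(ready)
        done.update(ready)
    return waves

deps = {
    "load": set(),
    "clean": {"load"},
    "train-test-split": {"clean"},
    "linear-regression": {"train-test-split"},
    "random-forest-regressor": {"train-test-split"},
}
waves = execution_waves(deps)
```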

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;p&gt;To keep things simple, you may deploy your Ploomber pipeline with &lt;a href="https://ploomber.io/blog/cron/" rel="noopener noreferrer"&gt;cron&lt;/a&gt;, and run &lt;code&gt;ploomber build&lt;/code&gt; on a schedule. However, in some cases, you may want to leverage existing infrastructure. We got you covered! With &lt;code&gt;soopervisor&lt;/code&gt;, you can export your pipeline to &lt;a href="https://soopervisor.readthedocs.io/en/latest/tutorials/airflow.html" rel="noopener noreferrer"&gt;Airflow&lt;/a&gt;, &lt;a href="https://soopervisor.readthedocs.io/en/latest/tutorials/aws-batch.html" rel="noopener noreferrer"&gt;AWS Batch&lt;/a&gt;, &lt;a href="https://soopervisor.readthedocs.io/en/latest/tutorials/kubernetes.html" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;, &lt;a href="https://soopervisor.readthedocs.io/en/latest/tutorials/slurm.html" rel="noopener noreferrer"&gt;SLURM&lt;/a&gt;, or &lt;a href="https://soopervisor.readthedocs.io/en/latest/tutorials/kubeflow.html" rel="noopener noreferrer"&gt;Kubeflow&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# add a target environment named 'argo'&lt;/span&gt;
soopervisor add argo &lt;span class="nt"&gt;--backend&lt;/span&gt; argo-workflows

&lt;span class="c"&gt;# generate argo yaml spec&lt;/span&gt;
soopervisor &lt;span class="nb"&gt;export &lt;/span&gt;argo &lt;span class="nt"&gt;--skip-tests&lt;/span&gt;  &lt;span class="nt"&gt;--ignore-git&lt;/span&gt;

&lt;span class="c"&gt;# submit workflow&lt;/span&gt;
argo submit &lt;span class="nt"&gt;-n&lt;/span&gt; argo argo/argo.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;soopervisor add&lt;/code&gt; adds some files to our project, like a preconfigured &lt;code&gt;Dockerfile&lt;/code&gt; (which we can modify if we want to). On the other hand, &lt;code&gt;soopervisor export&lt;/code&gt; takes our existing pipeline and exports it to Argo Workflows so we can run it on Kubernetes.&lt;/p&gt;

&lt;p&gt;By changing the &lt;code&gt;--backend&lt;/code&gt; argument in the &lt;code&gt;soopervisor add&lt;/code&gt; command, you can switch to other supported platforms. Alternatively, you may sign up for our &lt;a href="https://docs.ploomber.io/en/latest/cloud/cloud-execution.html" rel="noopener noreferrer"&gt;free cloud service&lt;/a&gt;, which allows you to run your notebooks in the cloud with one command.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final remarks
&lt;/h2&gt;

&lt;p&gt;Notebook cleaning and refactoring are time-consuming and error-prone, and we are developing tools to make this process a breeze. In this blog post, we went from a monolithic notebook to a modular pipeline running in production&amp;mdash;all of it in an automated way using open-source tools. So please let us know what features you'd like to see. &lt;a href="https://ploomber.io/community" rel="noopener noreferrer"&gt;Join our community&lt;/a&gt; and share your thoughts!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>jupyter</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Collaborative Data Science with Ploomber</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Sat, 27 Jun 2020 23:03:50 +0000</pubDate>
      <link>https://dev.to/edublancas/collaborative-data-science-with-ploomber-ad</link>
      <guid>https://dev.to/edublancas/collaborative-data-science-with-ploomber-ad</guid>
      <description>&lt;p&gt;&lt;em&gt;Introducing Ploomber Spec API, a simple way to sync Data Science teamwork using a short YAML file.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The &lt;em&gt;did you upload the latest data version yet?&lt;/em&gt; nightmare
&lt;/h2&gt;

&lt;p&gt;Data pipelines are multi-stage processes. Whether you are doing data visualization or training a Machine Learning model, there is an inherent workflow structure. The following diagram shows a typical Machine Learning pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m_lfANTZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lychkkow4ut8lzlcctlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m_lfANTZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lychkkow4ut8lzlcctlw.png" alt="pipeline" width="880" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine you are working with three colleagues, taking one feature branch each (A to D). To execute your pipeline, you could write a &lt;em&gt;master script&lt;/em&gt; that runs all tasks from left to right.&lt;/p&gt;

&lt;p&gt;During development, end-to-end runs are rare because they take too long. If you want to skip redundant computations (tasks whose source code has not changed), you could open the master script and manually run outdated tasks, but this will soon turn into a mess. Not only do you have to keep track of each task's status along your feature branch, but also of every task that merges with any of your tasks. For example, if you need to use the output from "Join features", you have to ensure the output generated by all four branches is up-to-date.&lt;/p&gt;

&lt;p&gt;Since full runs take too long and keeping track of outdated tasks manually is a laborious process, you might resort to the &lt;em&gt;evil trick of sharing intermediate results&lt;/em&gt;. Everyone uploads the &lt;em&gt;latest version&lt;/em&gt; of all or some selected tasks (most likely the ones with the computed features) to a shared location that you can then copy to your local workspace.&lt;/p&gt;

&lt;p&gt;But to generate the &lt;em&gt;latest&lt;/em&gt; version of each task, you have to ensure it was generated from the &lt;em&gt;latest version&lt;/em&gt; of all its upstream dependencies, which takes us back to the original problem: there are simply no guarantees about data lineage. Using a data file whose origin is unknown as input severely compromises reproducibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing the Ploomber Spec API
&lt;/h2&gt;

&lt;p&gt;The new API offers a simple way to sync Data Science teamwork. All you have to do is list the source code location and products (files or database tables/views) for each task in a &lt;code&gt;pipeline.yaml&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pipeline.yaml&lt;/span&gt;

&lt;span class="c1"&gt;# clean data from the raw table&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clean.sql&lt;/span&gt;
  &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clean_data&lt;/span&gt;
  &lt;span class="c1"&gt;# function that returns a db client&lt;/span&gt;
  &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db.get_client&lt;/span&gt;

&lt;span class="c1"&gt;# aggregate clean data&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aggregate.sql&lt;/span&gt;
  &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agg_data&lt;/span&gt;
  &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db.get_client&lt;/span&gt;

&lt;span class="c1"&gt;# dump data to a csv file&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SQLDump&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dump_agg_data.sql&lt;/span&gt;
  &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output/data.csv&lt;/span&gt;  
  &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db.get_client&lt;/span&gt;

&lt;span class="c1"&gt;# visualize data from csv file&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;plot.py&lt;/span&gt;
  &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# where to save the executed notebook&lt;/span&gt;
    &lt;span class="na"&gt;nb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output/executed-notebook-plot.ipynb&lt;/span&gt;
    &lt;span class="c1"&gt;# tasks can generate other outputs&lt;/span&gt;
    &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output/some_data.csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ploomber will analyze your source code to determine dependencies and skip a task if its source code (and the source code of all its upstream dependencies) has not changed since the last run.&lt;/p&gt;
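&lt;p&gt;The skipping logic can be pictured with a short sketch (hypothetical code for illustration, not Ploomber's actual implementation): a task must run if its source changed since the last run, or if any of its upstream dependencies must run.&lt;/p&gt;

```python
import hashlib

def source_hash(source):
    # hash the source code; a changed hash means the task must run again
    return hashlib.sha256(source.encode()).hexdigest()

def is_outdated(task, sources, upstream, last_hashes):
    # a task is outdated if its own source changed since the last run,
    # or if any of its upstream dependencies is outdated
    changed = last_hashes.get(task) != source_hash(sources[task])
    return changed or any(
        is_outdated(dep, sources, upstream, last_hashes)
        for dep in upstream.get(task, ())
    )

sources = {'load': 'SELECT * FROM raw', 'clean': 'SELECT * FROM load_output'}
upstream = {'clean': ['load']}
last_hashes = {name: source_hash(code) for name, code in sources.items()}

print(is_outdated('clean', sources, upstream, last_hashes))  # False: nothing changed
sources['load'] = 'SELECT * FROM raw_v2'
print(is_outdated('clean', sources, upstream, last_hashes))  # True: upstream changed
```

&lt;p&gt;Because the check recurses through upstream tasks, a change anywhere in the dependency chain propagates to all downstream tasks.&lt;/p&gt;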

&lt;p&gt;Say your colleagues updated a few tasks. To bring the pipeline up-to-date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git pull
ploomber entry pipeline.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you run the command again, nothing will be executed.&lt;/p&gt;

&lt;p&gt;Apart from helping you sync with your team, Ploomber is great for developing pipelines iteratively: modify any part and build again; only modified tasks will be executed. Since the tool is not tied to git, you can experiment without committing changes; if you don't like them, just discard them and build again. If your pipeline fails, fix the issue and build again; execution will resume from the point of failure.&lt;/p&gt;

&lt;p&gt;Ploomber is robust to code style changes. It won't trigger execution if you only added whitespace or formatted your source code.&lt;/p&gt;
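&lt;p&gt;One way to get this behavior (a sketch of the idea, not Ploomber's actual code) is to fingerprint the parsed syntax tree instead of the raw text, so whitespace and formatting changes leave the fingerprint untouched:&lt;/p&gt;

```python
import ast
import hashlib

def code_fingerprint(source):
    # parse the code and dump the AST: whitespace and formatting disappear,
    # so only meaningful changes alter the fingerprint
    tree = ast.dump(ast.parse(source))
    return hashlib.sha256(tree.encode()).hexdigest()

original = "x = 1\ny = x + 2\n"
reformatted = "x  =  1\n\ny = x   +   2\n"
modified = "x = 1\ny = x + 3\n"

print(code_fingerprint(original) == code_fingerprint(reformatted))  # True: no re-run
print(code_fingerprint(original) == code_fingerprint(modified))     # False: re-run
```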

&lt;h2&gt;
  
  
  Try it!
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://mybinder.org/v2/gh/ploomber/projects/master?filepath=spec%2FREADME.md"&gt;&lt;strong&gt;Click here to try out the live demo (no installation required).&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you prefer to run it locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"ploomber[all]"&lt;/span&gt;

&lt;span class="c"&gt;# create a new project with basic structure&lt;/span&gt;
ploomber new

ploomber entry pipeline.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to know how Ploomber works and what other neat features there are, keep on reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inferring dependencies and injecting products
&lt;/h2&gt;

&lt;p&gt;For Jupyter notebooks (and annotated Python scripts), Ploomber looks for a "parameters" cell and extracts dependencies from an "upstream" variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# annotated python file, it will be converted to a notebook during execution
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# + tags=["parameters"]
&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'some_task'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="c1"&gt;# +
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'some_task'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# do data processing...
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'data'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In SQL files, it will look for an "upstream" placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'some_task'&lt;/span&gt;&lt;span class="p"&gt;]}}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once dependencies are figured out, the next step is to inject the products declared in the YAML file into the source code, and the upstream dependencies into downstream consumers.&lt;/p&gt;

&lt;p&gt;In Jupyter notebooks and Python scripts, Ploomber injects a cell with a variable called "product" and an "upstream" dictionary with the location of its upstream dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# + tags=["parameters"]
&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'some_task'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="c1"&gt;# + tags=["injected-parameters"]
# this task uses the output from "some_task" as input
&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'some_task'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'output/from/some_task.csv'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'data'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'output/current/task.csv'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# +
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'some_task'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# do data processing...
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'data'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For SQL files, Ploomber replaces the placeholders with the appropriate table/view names. For example, the SQL script shown above will be resolved as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;clean&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;some_table&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Embracing Jupyter notebooks as an output format
&lt;/h2&gt;

&lt;p&gt;If you look at the &lt;code&gt;plot.py&lt;/code&gt; task above, you'll notice that it has two products. This is because "source" is interpreted as the set of instructions to execute, while the product is the executed notebook with cell outputs. This executed notebook serves as a rich log that can include tables and charts, which is incredibly useful for debugging data processing code.&lt;/p&gt;

&lt;p&gt;Since existing cell outputs from the source file are ignored, there is no strong reason to use &lt;code&gt;.ipynb&lt;/code&gt; files as sources. We highly recommend working with annotated Python scripts (&lt;code&gt;.py&lt;/code&gt;) instead. They will be converted to notebooks at runtime via &lt;a href="https://github.com/mwouts/jupytext"&gt;jupytext&lt;/a&gt; and then executed using &lt;a href="https://github.com/nteract/papermill"&gt;papermill&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Another nice feature of jupytext is that you can develop Python scripts interactively. Once you start &lt;code&gt;jupyter notebook&lt;/code&gt;, your &lt;code&gt;.py&lt;/code&gt; files will render as regular &lt;code&gt;.ipynb&lt;/code&gt; files. You can modify and execute cells at will, but building your pipeline will enforce a top-to-bottom execution. This helps prevent the most common source of errors in Jupyter notebooks: hidden state due to out-of-order cell execution.&lt;/p&gt;

&lt;p&gt;Using annotated Python scripts makes code versioning simpler. Jupyter notebooks (&lt;code&gt;.ipynb&lt;/code&gt;) are JSON files, which makes code reviews and merges harder; by using plain scripts as sources and notebooks as products, you get the best of both worlds: simple code versioning and rich execution logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seamlessly mix Python and (templated) SQL
&lt;/h2&gt;

&lt;p&gt;If your data lives in a database, you could write a Python script that connects to it, sends the query and closes the connection. Ploomber allows you to skip boilerplate code so you focus on writing the SQL part. You could even write entire pipelines using SQL alone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://jinja.palletsprojects.com/en/2.11.x/"&gt;jinja&lt;/a&gt; templating is integrated, which can help you modularize your SQL code by using &lt;a href="https://jinja.palletsprojects.com/en/2.11.x/templates/#macros"&gt;macros&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your database is supported by &lt;a href="https://www.sqlalchemy.org/"&gt;SQLAlchemy&lt;/a&gt; or it has a client that implements the &lt;a href="https://www.python.org/dev/peps/pep-0249/"&gt;DBAPI&lt;/a&gt; interface, it will work with Ploomber. This covers pretty much all databases.&lt;/p&gt;
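&lt;p&gt;As an illustration, here is what the &lt;code&gt;db.get_client&lt;/code&gt; function referenced in the &lt;code&gt;pipeline.yaml&lt;/code&gt; above might look like. This is a hypothetical sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module (a DBAPI-compliant client); a real project would point it at its actual database, and Ploomber may expect the connection wrapped in one of its client classes:&lt;/p&gt;

```python
import sqlite3

# db.py: a hypothetical module exposing the get_client function referenced
# in pipeline.yaml. sqlite3 ships with Python and implements the DBAPI
# interface; swap the URI for your actual database in a real project.
def get_client(uri=':memory:'):
    return sqlite3.connect(uri)

# quick sanity check that the client works
conn = get_client()
conn.execute('CREATE TABLE clean_data (x INTEGER)')
conn.execute('INSERT INTO clean_data VALUES (1), (2)')
rows = conn.execute('SELECT COUNT(*) FROM clean_data').fetchone()[0]
print(rows)  # 2
conn.close()
```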

&lt;h2&gt;
  
  
  Interactive development and debugging
&lt;/h2&gt;

&lt;p&gt;Ploomber gives you structure without sacrificing interactivity. You can load your pipeline and interact with it using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ipython &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; ploomber.entry pipeline.yaml &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nt"&gt;--action&lt;/span&gt; status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start an interactive session with a &lt;code&gt;dag&lt;/code&gt; object.&lt;/p&gt;

&lt;p&gt;Visualize pipeline dependencies (requires &lt;a href="https://graphviz.org/"&gt;graphviz&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can interactively develop Python scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'task'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;develop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will open the Python script as a Jupyter notebook with injected parameters but will remove them before the file is saved.&lt;/p&gt;

&lt;p&gt;Line by line debugging is also supported:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'task'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since SQL code goes through a rendering step to replace placeholders, it is useful to see what the rendered code looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'sql_task'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Closing remarks
&lt;/h2&gt;

&lt;p&gt;There are many more features available through the Python API that are not yet implemented in the spec API. We are currently porting some of the most important features (integration testing, task parallelization).&lt;/p&gt;

&lt;p&gt;We want to keep the spec API short and simple for data scientists looking to get the Ploomber experience without having to learn the Python framework. For many projects, the spec API is more than enough.&lt;/p&gt;

&lt;p&gt;The Python API is recommended for projects that require advanced features such as dynamic pipelines (pipelines whose exact number of tasks is determined by its parameters).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to go from here&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ploomber/ploomber"&gt;Github&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ploomber.readthedocs.io/en/stable/"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Found an error in this post? &lt;a href="https://github.com/ploomber/posts/issues/new?title=Issue%20in%20post%3A%20%22Collaborative%20Data%20Science%20with%20Ploomber%22"&gt;Click here to let us know&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Originally posted at &lt;a href="https://ploomber.io/posts/collaborative-ds"&gt;ploomber.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>jupyter</category>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Rethinking Continuous Integration for Data Science</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Thu, 18 Jun 2020 00:54:56 +0000</pubDate>
      <link>https://dev.to/edublancas/rethinking-continuous-integration-for-data-science-1c0c</link>
      <guid>https://dev.to/edublancas/rethinking-continuous-integration-for-data-science-1c0c</guid>
      <description>&lt;h1&gt;
  
  
  Prelude: Software development practice in Data Science
&lt;/h1&gt;

&lt;p&gt;As Data Science and Machine learning get wider industry adoption, practitioners realize that deploying data products comes with a high (and often unexpected) maintenance cost. As &lt;a href="https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf"&gt;Sculley and co-authors&lt;/a&gt; argue in their well-known paper:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;(ML systems) have all the maintenance problems of traditional code plus an additional set of ML-specific issues.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Paradoxically, even though data-intensive systems have higher maintenance cost than their traditional software counterparts, software engineering best practices are mostly overlooked. Based on my conversations with fellow data scientists, I believe that such practices are ignored primarily because they are perceived as unnecessary extra work due to misaligned incentives.&lt;/p&gt;

&lt;p&gt;A data project's ultimate objective is to impact the business, but this impact is hard to assess during development. How much impact will a dashboard have? What about a predictive model? If the product is not yet in production, it is hard to estimate business impact, so we resort to proxy metrics: for decision-making tools, business stakeholders might subjectively judge how much a new dashboard can help them improve their decisions; for a predictive model, we could come up with a rough estimate based on the model's performance.&lt;/p&gt;

&lt;p&gt;This causes the tool (e.g. a dashboard or model) to be perceived as the only valuable piece in the data pipeline, because it is what the proxy metric acts upon. In consequence, most time and effort is spent trying to improve this final deliverable, while all the previous intermediate steps get less attention.&lt;/p&gt;

&lt;p&gt;If the project is taken to production, depending on the overall code quality, the team might have to refactor a lot of the codebase to be production-ready. This refactoring can range from small improvements to a complete overhaul; the more changes the project goes through, the harder it will be to reproduce the original results. All of this can severely delay the launch or put it at risk.&lt;/p&gt;

&lt;p&gt;A better approach is to keep our code deploy-ready (or almost) at all times. This calls for a workflow that ensures our code is always tested and our results are always reproducible. This concept is called Continuous Integration, and it is a widely adopted practice in software engineering. This blog post introduces an adapted CI procedure that can be effectively applied to data projects with existing open source tools.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Structure your pipeline in several tasks, each one saving intermediate results to disk&lt;/li&gt;
&lt;li&gt;Implement your pipeline in such a way that you can parametrize it&lt;/li&gt;
&lt;li&gt;The first parameter should sample raw data to allow quick end-to-end runs for testing&lt;/li&gt;
&lt;li&gt;A second parameter should change artifacts location to separate testing and production environments&lt;/li&gt;
&lt;li&gt;On every push, the CI service runs unit tests that verify logic inside each task&lt;/li&gt;
&lt;li&gt;The pipeline is then executed with a data sample and integration tests verify integrity of intermediate results&lt;/li&gt;
&lt;/ol&gt;
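&lt;p&gt;Points 1 to 4 can be sketched as follows (hypothetical code, not tied to any particular framework): the pipeline takes a sampling flag and an output location, so the CI service can run it end-to-end on a small data sample in an isolated directory:&lt;/p&gt;

```python
import csv
import pathlib
import tempfile

def load(raw_rows, sample=False):
    # parameter 1: sampling raw data enables quick end-to-end test runs
    return raw_rows[:10] if sample else raw_rows

def clean(rows):
    # drop rows with missing values; intermediate results go to disk below
    return [r for r in rows if all(v is not None for v in r)]

def run_pipeline(raw_rows, sample=False, output_dir='output'):
    # parameter 2: output_dir separates testing and production artifacts
    out = pathlib.Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    rows = clean(load(raw_rows, sample=sample))
    with open(out / 'clean.csv', 'w', newline='') as f:
        csv.writer(f).writerows(rows)
    return out / 'clean.csv'

# a CI run: small sample, isolated directory
raw = [(i, i * 2) for i in range(100)] + [(None, 0)]
path = run_pipeline(raw, sample=True, output_dir=tempfile.mkdtemp())
print(path.exists())  # True
```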

&lt;h1&gt;
  
  
  What is Continuous Integration?
&lt;/h1&gt;

&lt;p&gt;Continuous Integration (CI) is a software development practice where small changes get continuously integrated in the project's codebase. Each change is automatically tested to ensure that the project will work as expected for end-users in a production environment.&lt;/p&gt;

&lt;p&gt;To contrast traditional software with a data project, we compare two use cases: a software engineer working on an e-commerce website and a data scientist developing a data pipeline that outputs a report with daily sales.&lt;/p&gt;

&lt;p&gt;In the e-commerce portal use case, the production environment is the live website and end-users are people who use it; in the data pipeline use case, the production environment is the server that runs the daily pipeline to generate the report and end-users are business analysts that use the report to inform decisions.&lt;/p&gt;

&lt;p&gt;We define a data pipeline as a series of ordered tasks whose inputs are raw datasets, whose intermediate tasks generate transformed datasets (saved to disk), and whose final task produces a data product; in this case, a report with daily sales (but it could be something else, like a Machine Learning model). The following diagram shows our daily report pipeline example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QNMZEiAs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qjjkj5ueppir4diwpzgz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QNMZEiAs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qjjkj5ueppir4diwpzgz.png" alt="Alt Text" width="880" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each blue block represents a pipeline task, and the green block represents a script that generates the final report. Orange blocks contain the schemas for the raw sources. Every task generates one product: blue blocks generate data files (but these could also be tables/views in a database), while the green block generates the report with charts and tables.&lt;/p&gt;

&lt;h1&gt;
  
  
  Continuous Integration for Data Science: Ideal workflow
&lt;/h1&gt;

&lt;p&gt;As I mentioned in the prelude, the last task in the data pipeline is often what gets the most attention (e.g. the trained model in a Machine Learning pipeline). Not surprisingly, existing articles on CI for Data Science/Machine Learning also focus on this; but to effectively apply the CI framework we have to think in terms of the whole computational chain: from getting raw data to delivering a data product. Failing to acknowledge that a data pipeline has a richer structure causes data scientists to focus too much on the very end and ignore code quality in the rest of the tasks.&lt;/p&gt;

&lt;p&gt;In my experience, most bugs are introduced along the way; even worse, in many cases errors won't break the pipeline but will contaminate your data and compromise your results. Each step along the way should be given equal importance.&lt;/p&gt;

&lt;p&gt;Let's make things more concrete with a description of the proposed workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A data scientist pushes code changes (e.g. modifies one of the tasks in the pipeline)&lt;/li&gt;
&lt;li&gt;Pushing triggers the CI service to run the pipeline end-to-end and test each generated artifact (e.g. one test could verify that all rows in the &lt;code&gt;customers&lt;/code&gt; table have a non-empty &lt;code&gt;customer_id&lt;/code&gt; value)&lt;/li&gt;
&lt;li&gt;If tests pass, a code review follows&lt;/li&gt;
&lt;li&gt;If changes are approved by the reviewer, code is merged&lt;/li&gt;
&lt;li&gt;Every morning, the "production" pipeline (latest commit in the main branch) runs end-to-end and sends the report to the business analysts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Such a workflow has two primary advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Early bug detection: Bugs are detected in the development phase, instead of production&lt;/li&gt;
&lt;li&gt;Always production-ready: Since we require code changes to pass all tests before integrating them into the main branch, we ensure we can deploy our latest stable version continuously by just deploying the latest commit in the main branch&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This workflow is what software engineers do in traditional software projects. I call this ideal workflow because it is what we'd do if we could do an end-to-end pipeline run in a reasonable amount of time. This isn't true for a lot of projects due to data scale: if our pipeline takes hours to run end-to-end it is unfeasible to run it every time we make a small change. This is why we cannot simply apply the standard CI workflow (steps 1 to 4) to Data Science. We'll make a few changes to make it feasible for projects where running time is a challenge.&lt;/p&gt;

&lt;h1&gt;
  
  
  Software testing
&lt;/h1&gt;

&lt;p&gt;CI allows developers to continuously integrate code changes by running automated tests: if any of the tests fail, the commit is rejected. This makes sure that we always have a working project in the main branch.&lt;/p&gt;

&lt;p&gt;Traditional software is developed in small, largely independent modules. This separation is natural, as there are clear boundaries among components (e.g. sign up, billing, notifications, etc). Going back to the e-commerce website use case, an engineer's to-do list might look like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;People can create new accounts using email and password&lt;/li&gt;
&lt;li&gt;Passwords can be recovered by sending a message to the registered email&lt;/li&gt;
&lt;li&gt;Users can login using previously saved credentials&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the engineer writes the code to support such functionality (&lt;a href="https://en.wikipedia.org/wiki/Test-driven_development"&gt;or even before!&lt;/a&gt;), they will make sure the code works by writing tests that execute the code being tested and check that it behaves as expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;my_project&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_account&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;users_db&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_create_account&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# simulate creating a new account
&lt;/span&gt;    &lt;span class="n"&gt;create_account&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'someone@ploomber.io'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'somepassword'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# verify the account was created by qerying the users database
&lt;/span&gt;    &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;users_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find_with_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'someone@ploomber.io'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But unit testing is not the only type of testing, as we will see in the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing levels
&lt;/h2&gt;

&lt;p&gt;There are &lt;a href="https://en.wikipedia.org/wiki/Software_testing#Testing_levels"&gt;four levels&lt;/a&gt; of software testing. It is important to understand the differences to develop effective tests for our projects. For this post, we'll focus on the first two.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unit testing
&lt;/h3&gt;

&lt;p&gt;The snippet I showed in the previous section is called a unit test. Unit tests verify that a single &lt;em&gt;unit&lt;/em&gt; works. There isn't a strict definition of &lt;em&gt;unit&lt;/em&gt;, but it's often equivalent to calling a single procedure; in our case, we are testing the &lt;code&gt;create_account&lt;/code&gt; procedure.&lt;/p&gt;

&lt;p&gt;Unit testing is effective in traditional software projects because modules are designed to be largely independent from each other; by unit testing them separately, we can quickly pinpoint errors. Sometimes new changes break tests not because of the changes themselves but because of their side effects; if the module is independent, we have a guarantee that the error lies within the module's scope.&lt;/p&gt;

&lt;p&gt;The utility of having procedures is that we can reuse them by parametrizing their behavior with input parameters. The input space for our &lt;code&gt;create_account&lt;/code&gt; function is the combination of all possible email addresses and all possible passwords. There is an infinite number of combinations, but it is reasonable to say that if we test our code against a representative number of cases, we can conclude the procedure works (and if we find a case where it doesn't, we fix the code and add a new test case). In practice, this boils down to testing the procedure against a set of representative cases and known edge cases.&lt;/p&gt;
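&lt;p&gt;As a minimal sketch of this idea, assume the validation step inside &lt;code&gt;create_account&lt;/code&gt; is a hypothetical &lt;code&gt;is_valid_email&lt;/code&gt; helper; we check it against a handful of representative cases and known edge cases:&lt;/p&gt;

```python
import re

# Hypothetical helper: a minimal email validator standing in for the
# validation logic inside create_account (not shown in this post)
def is_valid_email(email):
    return re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email) is not None

# representative cases: inputs we expect during normal operation
assert is_valid_email("someone@ploomber.io")
assert is_valid_email("first.last@example.com")

# known edge cases: inputs that should be rejected
assert not is_valid_email("")
assert not is_valid_email("no-at-sign")
assert not is_valid_email("two words@example.com")
```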

&lt;p&gt;Given that tests run in an automated way, we need a pass/fail criteria for each. In the software engineering jargon this is called a &lt;a href="https://en.wikipedia.org/wiki/Test_oracle"&gt;test oracle&lt;/a&gt;. Coming up with good test oracles is essential for testing: tests are useful to the extent that they evaluate the right outcome.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration testing
&lt;/h3&gt;

&lt;p&gt;The second testing level is integration testing. Unit tests are a bit simplistic since they test units independently; this simplification is useful for efficiency, as there is no need to start up the whole system to test a small part of it.&lt;/p&gt;

&lt;p&gt;But sometimes errors arise when inputs and outputs cross module boundaries. Even though our modules are largely independent, they still have to interact with each other at some point (e.g. the billing module has to interact with the notifications module to send a receipt). To catch potential errors during this interaction, we use integration testing.&lt;/p&gt;

&lt;p&gt;Writing integration tests is more complex than writing unit tests as there are more elements to be considered. This is why traditional software systems are designed to be &lt;a href="https://en.wikipedia.org/wiki/Loose_coupling"&gt;loosely coupled&lt;/a&gt; by limiting the number of interactions and avoiding cross-module side effects. As we will see in the next section, integration testing is essential for testing data projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Effective testing
&lt;/h2&gt;

&lt;p&gt;Writing tests is an art of its own; the purpose of testing is to catch as many errors as we can during development so they don't show up in production. In a way, tests simulate user actions and check that the system behaves as expected. For that reason, an effective test is one that simulates realistic scenarios and appropriately evaluates whether the system did the right thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An effective test should meet four requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;1. The simulated state of the system must be representative of the system when the user is interacting with it&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The goal of tests is to prevent errors in production, so we have to represent the system state as closely as possible. Even though our e-commerce website might have dozens of modules (user signup, billing, product listing, customer support, etc.), they are designed to be as independent as possible, which makes simulating our system easier. We could argue that a dummy database is enough to simulate the system when a new user signs up; the existence or absence of any other module should have no effect on the module being tested. The more interactions among components, the harder it is to simulate a realistic scenario.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;2. Input data must be representative of real user input&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When testing a procedure, we want to know if, given an input, the procedure does what it's supposed to do. Since we cannot run every possible input, we have to think of enough cases that represent regular operation as well as possible edge cases (e.g. what happens if a user signs up with an invalid e-mail address). To test our &lt;code&gt;create_account&lt;/code&gt; procedure, we should pass a few regular e-mail addresses but also some invalid ones and verify that it either creates the account or shows an appropriate error message.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;3. Appropriate test oracle&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As we mentioned in the previous section, the test oracle is our pass/fail criteria. The simpler and smaller the procedure to test, the easier it is to come up with one. If we are not testing the right outcome, our test won't be useful. Our test for &lt;code&gt;create_account&lt;/code&gt; assumes that checking the users table in the database is an appropriate way of evaluating our function.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;4. Reasonable runtime&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;While tests run, the developer has to wait until results come back. If testing is slow, developers will either wait a long time or ignore the CI system altogether. The latter causes code changes to accumulate, making debugging much harder (it is easier to find an error when we changed 5 lines than when we changed 100).&lt;/p&gt;

&lt;h1&gt;
  
  
  Effective testing for data pipelines
&lt;/h1&gt;

&lt;p&gt;In the previous sections, we described the first two levels of software testing and the four properties of an effective test. This section discusses how to adapt testing techniques from traditional software development to data projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unit testing for data pipelines
&lt;/h2&gt;

&lt;p&gt;Unlike modules in traditional software, our pipeline tasks (blocks in our diagram) are not independent; they have a logical execution order. To accurately represent the state of our system, we have to respect that order. Since the input of one task depends on the output of its upstream dependencies, the root cause of an error can be either in the failing task or in any upstream task. This increases the number of potential places to search for the bug; abstracting logic into smaller procedures and unit testing them in isolation helps reduce this problem.&lt;/p&gt;

&lt;p&gt;Say that our task &lt;code&gt;add_product_information&lt;/code&gt; performs some data cleaning before joining sales with products:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;my_project&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clean&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_product_information&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# load
&lt;/span&gt;    &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'sales'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'products'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# clean
&lt;/span&gt;    &lt;span class="n"&gt;sales_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fix_timestamps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;products_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remove_discontinued&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# join
&lt;/span&gt;    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sales_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products_clean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'product_id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'clean/sales_w_product_info.parquet'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We abstracted the cleaning logic into two sub-procedures, &lt;code&gt;clean.fix_timestamps&lt;/code&gt; and &lt;code&gt;clean.remove_discontinued&lt;/code&gt;; errors in either sub-procedure will propagate to the output and, in consequence, to any downstream tasks. To prevent this, we should add a few unit tests that verify the logic of each sub-procedure in isolation.&lt;/p&gt;
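&lt;p&gt;For instance, a unit test for &lt;code&gt;clean.remove_discontinued&lt;/code&gt; might look like this (the implementation shown is a hypothetical sketch, since the post doesn't define it):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sketch of clean.remove_discontinued: assume it drops rows
# flagged as discontinued (the actual implementation is not shown here)
def remove_discontinued(products):
    return products[~products["discontinued"]].reset_index(drop=True)

def test_remove_discontinued():
    products = pd.DataFrame({
        "product_id": [1, 2, 3],
        "discontinued": [False, True, False],
    })
    result = remove_discontinued(products)
    # only active products remain
    assert set(result["product_id"]) == {1, 3}
    assert not result["discontinued"].any()

test_remove_discontinued()
```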

&lt;p&gt;Often, pipeline tasks that transform data are composed of just a few calls to external packages (e.g. pandas) with little custom logic. In such cases, unit testing won't be very effective. Imagine one of the tasks in your pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# cleaning
# ...
# ...
&lt;/span&gt;
&lt;span class="c1"&gt;# transform
&lt;/span&gt;&lt;span class="n"&gt;series&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;'customer_id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'product_category'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DatFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;'mean_price'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assuming you already unit tested the cleaning logic, there isn't much to unit test about your transformations; writing unit tests for such simple procedures is not a good investment of your time. This is where integration testing comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integration testing for data pipelines
&lt;/h2&gt;

&lt;p&gt;Our pipeline passes inputs and outputs from one task to the next until it generates the final result. This flow can break if a task's input expectations aren't met (e.g. column names); moreover, each data transformation encodes certain assumptions &lt;em&gt;we&lt;/em&gt; make about the data. Integration tests help us verify that outputs flow through the pipeline correctly.&lt;/p&gt;

&lt;p&gt;If we wanted to test the &lt;em&gt;group by&lt;/em&gt; transformation shown above, we could run the pipeline task and evaluate our expectations using the output data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# since we are grouping by these two keys, they should be unique
&lt;/span&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_unique&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_category&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_unique&lt;/span&gt;

&lt;span class="c1"&gt;# price is always positive, mean should be as well
&lt;/span&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean_price&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="c1"&gt;# check there are no NAs (this might happen if we take the mean of
# an array with NAs)
&lt;/span&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean_price&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These assertions are quick to write and clearly encode our output expectations. Let's now see how we can write effective integration tests in detail.&lt;/p&gt;
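&lt;p&gt;Here is a self-contained version of these checks with toy data (the values are made up for illustration); &lt;code&gt;reset_index&lt;/code&gt; turns the grouping keys back into regular columns so the assertions can reference them:&lt;/p&gt;

```python
import pandas as pd

# toy input standing in for the real sales data
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "product_category": ["a", "b", "a", "b"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# the transformation under test
series = df.groupby(["customer_id", "product_category"]).price.mean()
out = pd.DataFrame({"mean_price": series}).reset_index()

# the pair of grouping keys should be unique in the output
assert not out.duplicated(["customer_id", "product_category"]).any()

# price is always positive, so the mean should be as well
assert (out.mean_price > 0).all()

# no NAs introduced by the aggregation
assert not out.mean_price.isna().any()
```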

&lt;p&gt;&lt;strong&gt;State of the system&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As we mentioned in the previous section, pipeline tasks have dependencies. In order to accurately represent the system status in our tests, we have to respect the execution order and run our integration tests after each task is done. Let's modify our original diagram to reflect this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--73AJFb5D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/n1t5elvuro0epgirj4cj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--73AJFb5D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/n1t5elvuro0epgirj4cj.png" alt="Alt Text" width="880" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test oracle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The challenge when testing pipeline tasks is that there is no single right answer. When developing a &lt;code&gt;create_account&lt;/code&gt; procedure, we can argue that inspecting the database for the new user is an appropriate measure of success, but what about a procedure that cleans data?&lt;/p&gt;

&lt;p&gt;There is no unique answer, because the concept of &lt;em&gt;clean data&lt;/em&gt; depends on the specifics of our project. The best we can do is to explicitly code our output expectations as a series of tests. Common scenarios to guard against are invalid observations in the analysis, null values, duplicates, unexpected column names, etc. Such expectations are good candidates for integration tests to prevent dirty data from leaking into our pipeline. Even tasks that pull raw data should be tested to detect data changes: columns get deleted, renamed, etc. Testing raw data properties helps us quickly identify when our source data has changed.&lt;/p&gt;

&lt;p&gt;Some changes, such as column renaming, will break our pipeline even if we don't write a test, but explicit testing has a big advantage: we can fix the error in the right place and avoid redundant fixes. Imagine what would happen if renaming a column broke two downstream tasks, each developed by a different colleague. Once they encounter the error, they will be tempted to rename the column in their own code (the two downstream tasks), when the correct approach is to fix it in the upstream task.&lt;/p&gt;

&lt;p&gt;Furthermore, errors that break our pipeline should be the least of our worries; the most dangerous bugs in data pipelines are sneaky. They won't break your pipeline but will contaminate all downstream tasks in subtle ways that can severely flaw your data analysis and even flip your conclusions, which is the worst possible scenario. Because of this, I cannot stress enough how important it is to code data expectations as part of any data analysis project.&lt;/p&gt;

&lt;p&gt;Pipeline tasks don't have to be Python procedures; they'll often be SQL scripts, and you should test them the same way. For example, you can check that there are no nulls in a certain column with the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;some_table&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;some_column&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
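&lt;p&gt;A runnable sketch of this check using an in-memory SQLite database (table and column names are made up for illustration):&lt;/p&gt;

```python
import sqlite3

# build a toy table in an in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (some_column INTEGER)")
conn.executemany("INSERT INTO some_table VALUES (?)", [(1,), (2,), (3,)])

query = """
SELECT NOT EXISTS(
    SELECT * FROM some_table
    WHERE some_column IS NULL
)
"""

# returns 1 (true) when no nulls are present
no_nulls, = conn.execute(query).fetchone()
assert no_nulls == 1

# inserting a NULL makes the check fail
conn.execute("INSERT INTO some_table VALUES (NULL)")
no_nulls, = conn.execute(query).fetchone()
assert no_nulls == 0
```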



&lt;p&gt;For procedures whose output is not a dataset, coming up with a test oracle gets trickier. A common output in data pipelines is a human-readable document (i.e. a report). While it is technically possible to test graphical outputs such as tables or charts, this requires more setup. A first (and often good enough) approach is to unit test the input that generates the visual output (e.g. test the function that prepares the data for plotting instead of the actual plot). If you're curious about testing plots, &lt;a href="https://github.com/matplotlib/pytest-mpl"&gt;click here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Realistic input data and running time&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We mentioned that realistic input data is important for testing. In data projects, we already have real data we can use in our tests; however, passing the full dataset for testing is infeasible, as data pipelines have computationally expensive tasks that take a long time to finish.&lt;/p&gt;

&lt;p&gt;To reduce running time and keep our input data realistic we pass a data sample. How this sample is obtained depends on the specifics of the project. The objective is to get a representative data sample whose properties are similar to the full dataset. In our example, we could take a random sample of yesterday's sales. Then, if we want to test certain properties (e.g. that our pipeline handles NAs correctly), we could either insert some NAs in the random sample or use another sampling method such as &lt;a href="https://en.wikipedia.org/wiki/Stratified_sampling"&gt;stratified sampling&lt;/a&gt;. Sampling only needs to happen in tasks that pull raw data, downstream tasks will just process whatever output came from their upstream dependencies.&lt;/p&gt;
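&lt;p&gt;A sampling step could be sketched like this; &lt;code&gt;sample_raw&lt;/code&gt; is a hypothetical helper that takes a fraction of the raw data, optionally stratified so every group stays represented:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sampling helper: take a fraction of the raw data, optionally
# stratified by a column so every group stays represented in the sample
def sample_raw(df, frac=0.1, stratify_by=None, seed=0):
    if stratify_by is None:
        return df.sample(frac=frac, random_state=seed)
    return (df.groupby(stratify_by, group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=seed)))

# toy raw data standing in for yesterday's sales
sales = pd.DataFrame({
    "sale_id": range(20),
    "product_category": ["a"] * 10 + ["b"] * 10,
})

small = sample_raw(sales, frac=0.5, stratify_by="product_category")
assert len(small) == 10
assert set(small.product_category) == {"a", "b"}
```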

&lt;p&gt;Sampling should only be enabled during testing. Make sure your pipeline is designed to easily switch this setting off, and keep generated artifacts (test vs. production) clearly labeled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;my_project&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;daily_sales_pipeline&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_with_sample&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# run with sample and stores all artifacts in the testing folder
&lt;/span&gt;    &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daily_sales_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;artifacts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'/path/to/output/testing'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The snippet above assumes that we can represent our pipeline as a "pipeline object" and call it with parameters. This is a very powerful abstraction that makes your pipeline flexible enough to execute under different settings. Upon successful task execution, you should run the corresponding integration test. For example, say we want to test our &lt;code&gt;add_product_information&lt;/code&gt; procedure; our pipeline should call the following function once such a task is done:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_product_information&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we are passing the path to the data as an argument to the function, which allows us to easily switch the path we load the data from. This is important to prevent pipeline runs from interfering with each other. For example, if you have several git branches, you can organize artifacts by branch in a folder called &lt;code&gt;/data/{branch-name}&lt;/code&gt;; if you are sharing a server with a colleague, each one can save their artifacts to &lt;code&gt;/data/{username}&lt;/code&gt;.&lt;/p&gt;
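&lt;p&gt;A small, hypothetical helper for computing such isolated artifact paths might look like this:&lt;/p&gt;

```python
import getpass
from pathlib import Path

# Hypothetical helper: compute an isolated artifact directory per git
# branch (or per user when no branch is given) so concurrent pipeline
# runs don't overwrite each other's outputs
def artifacts_dir(root="/data", branch=None):
    name = branch if branch is not None else getpass.getuser()
    return Path(root) / name

assert artifacts_dir(branch="feature-x") == Path("/data/feature-x")
```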

&lt;p&gt;If you are working with SQL scripts, you can apply the same testing pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_product_information_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Assume client is an object to send queries to the db
&lt;/span&gt;    &lt;span class="c1"&gt;# and relation the table/view to test
&lt;/span&gt;    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
    SELECT EXISTS(
        SELECT * FROM {relation}
        WHERE {column} IS NULL
    )
    """&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'customer_id'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'product_id'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'category'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apart from sampling, we can further speed up testing by running tasks in parallel, although the amount of parallelization is limited by the pipeline structure: we cannot run a task until its upstream dependencies have completed.&lt;/p&gt;
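&lt;p&gt;To sketch the idea, the following hypothetical function computes which tasks can run concurrently: tasks in the same level have all their upstream dependencies satisfied (task names are made up to mirror the sales pipeline):&lt;/p&gt;

```python
# Sketch: given task -> upstream dependencies, group tasks into "levels";
# tasks within a level have no pending dependencies and can run in parallel
def parallel_levels(dependencies):
    levels, done = [], set()
    pending = dict(dependencies)
    while pending:
        # tasks whose upstream dependencies have all finished
        ready = [t for t, deps in pending.items() if set(deps) <= done]
        if not ready:
            raise ValueError("cyclic dependencies")
        levels.append(sorted(ready))
        done.update(ready)
        for t in ready:
            del pending[t]
    return levels

deps = {
    "raw_sales": [],
    "raw_products": [],
    "clean_sales": ["raw_sales"],
    "add_product_information": ["clean_sales", "raw_products"],
}
# the two raw tasks can run in parallel; the join must wait
assert parallel_levels(deps) == [
    ["raw_products", "raw_sales"],
    ["clean_sales"],
    ["add_product_information"],
]
```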

&lt;p&gt;Parametrized pipelines and executing tests upon task completion are supported in our library &lt;a href="https://github.com/ploomber/ploomber"&gt;Ploomber&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  The testing tradeoff in Data Science
&lt;/h1&gt;

&lt;p&gt;Data projects carry much more uncertainty than traditional software. Sometimes we don't even know if the project is technically possible, so we have to invest some time to find out. This uncertainty works against good software practices: because we want to reduce uncertainty and estimate the project's impact by making as much progress as we can toward answering questions about feasibility, good software practices (such as testing) are usually not perceived as &lt;em&gt;actual progress&lt;/em&gt; and are habitually overlooked.&lt;/p&gt;

&lt;p&gt;My recommendation is to incrementally increase testing as you make progress. During the early stages, it is important to focus on integration tests, as they are quick to implement and effective. The most common errors in data transformations are easy to detect with simple assertions: check that IDs are unique, that there are no duplicates or empty values, and that columns fall within expected ranges. You'll be surprised how many bugs you catch with a few lines of code. These errors are obvious once you take a look at the data, but they might not even break your pipeline; they will just produce wrong results. Integration testing prevents this.&lt;/p&gt;

&lt;p&gt;Second, leverage off-the-shelf packages as much as possible, especially for highly complex data transformations or algorithms; but beware of quality, and favor maintained packages even if they don't offer state-of-the-art performance. Third-party packages come with their own tests, which reduces the work left for you.&lt;/p&gt;

&lt;p&gt;There might also be parts that are not critical or are very hard to test. Plotting procedures are a common example: unless you are producing a highly customized plot, there is little benefit in testing a small plotting function that just calls matplotlib and tweaks the axes a bit. Focus on testing the input that goes into the plotting function.&lt;/p&gt;

&lt;p&gt;As your project matures, you can start focusing on increasing your testing coverage and paying some technical debt.&lt;/p&gt;

&lt;h1&gt;
  
  
  Debugging data pipelines
&lt;/h1&gt;

&lt;p&gt;When tests fail, it is time to debug. Our first line of defense is logging: every pipeline run should generate a relevant set of logging records for us to review. I recommend taking a look at the &lt;a href="https://docs.python.org/3/library/logging.html"&gt;&lt;code&gt;logging&lt;/code&gt; module in the Python standard library&lt;/a&gt;, which provides a flexible framework for this (do not use &lt;code&gt;print&lt;/code&gt; for logging). A good practice is to keep a log file from every pipeline run.&lt;/p&gt;
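
&lt;p&gt;For example, a minimal setup that writes one timestamped log file per pipeline run might look like this (the logger name and file layout are arbitrary choices, not a prescribed convention):&lt;/p&gt;

```python
# Sketch: one log file per pipeline run, named with a timestamp.
import logging
from datetime import datetime
from pathlib import Path

log_dir = Path("logs")
log_dir.mkdir(exist_ok=True)
log_file = log_dir / datetime.now().strftime("run-%Y-%m-%d-%H%M%S.log")

logger = logging.getLogger("pipeline")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_file)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)

# each task logs its progress, so failures can be traced after the fact
logger.info("starting task: clean_sales")
```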

&lt;p&gt;While logging can hint you where the problem is, designing your pipeline for easy debugging is critical. Let's recall our definition of data pipeline:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Series of ordered tasks whose inputs are raw datasets, intermediate tasks generate transformed datasets (saved &lt;strong&gt;to disk&lt;/strong&gt;)  and the final task produces a data product.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Keeping all intermediate results in memory is certainly faster, since disk operations are slower than memory. However, saving results to disk makes debugging much easier. If we don't persist intermediate results, debugging means re-executing the pipeline to replicate the error conditions; if we do persist them, we only have to reload the upstream dependencies of the failing task. Let's see how we can debug our &lt;code&gt;add_product_information&lt;/code&gt; procedure using the &lt;a href="https://docs.python.org/3/library/pdb.html"&gt;Python debugger&lt;/a&gt; from the standard library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pdb&lt;/span&gt;

&lt;span class="n"&gt;pdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runcall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add_product_information&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'sales'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path_to_sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'product'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path_to_product&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since our tasks are isolated from each other and only interact via inputs and outputs, we can easily replicate the error conditions. Just make sure that you are passing the right input parameters to your function. You can easily apply this workflow if you use &lt;a href="https://ploomber.readthedocs.io/en/stable/guide/debugging.html"&gt;Ploomber's debugging capabilities.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Debugging SQL scripts is harder since we don't have debuggers as we do in Python. My recommendation is to keep your SQL scripts at a reasonable size: once a script becomes too big, consider breaking it down into two separate tasks. Organizing your SQL code with &lt;code&gt;WITH&lt;/code&gt; improves readability and can help you debug complex statements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;customers_subset&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;..&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;products_subset&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you find an error in a SQL script organized like this, you can replace the final &lt;code&gt;SELECT&lt;/code&gt; statement with something like &lt;code&gt;SELECT * FROM customers_subset&lt;/code&gt; to inspect the intermediate results.&lt;/p&gt;

&lt;h1&gt;
  
  
  Running integration tests in production
&lt;/h1&gt;

&lt;p&gt;In traditional software, tests only run in the development environment; it is assumed that if a piece of code reaches production, it must have been tested and works correctly.&lt;/p&gt;

&lt;p&gt;For data pipelines, integration tests are part of the pipeline itself, and it is up to you to decide whether to execute them. The two variables at play are response time and end-users. If running frequency is low (e.g. a pipeline that executes daily) and end-users are internal (e.g. business analysts), you should consider keeping the tests in production. An ML training pipeline also follows this pattern: it has low running frequency because it executes on demand (whenever you want to train a model), and the end-users are you and anyone else on the team. This matters because we run our tests with a sample of the data; running them with the full dataset might give a different result if our sampling method failed to capture certain properties of the data.&lt;/p&gt;

&lt;p&gt;Another common (and often unforeseeable) scenario is data changes. It is important to keep yourself informed of planned changes in upstream data (e.g. a migration to a different warehouse platform), but there's still a chance you won't find out about a data change until you pass new data through the pipeline. In the best case, your pipeline will raise an exception you can detect; in the worst case, it will execute just fine but the output will contain wrong results. For this reason, it is important to keep your integration tests running in the production environment.&lt;/p&gt;

&lt;p&gt;Bottom line: if you can allow a pipeline to delay its final output (e.g. the daily sales report), keep tests in production and make sure you are properly notified when they fail; the simplest solution is to have your pipeline send you an e-mail.&lt;/p&gt;
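
&lt;p&gt;A sketch of such a notification using the standard library; the addresses and SMTP host are placeholders, and the actual sending is left commented out:&lt;/p&gt;

```python
# Sketch of a failure notification e-mail; addresses are placeholders.
import smtplib  # only needed for the (commented) send step below
from email.message import EmailMessage

def build_failure_email(task_name, error):
    msg = EmailMessage()
    msg["Subject"] = f"Pipeline test failed: {task_name}"
    msg["From"] = "pipeline@example.com"
    msg["To"] = "team@example.com"
    msg.set_content(f"Integration test failed in {task_name}: {error}")
    return msg

msg = build_failure_email("clean_sales", "duplicate order_id")

# In production you would actually send it, e.g.:
# with smtplib.SMTP("smtp.example.com") as server:
#     server.send_message(msg)
```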

&lt;p&gt;For pipelines whose output is expected often and quickly (e.g. an API), you can change your strategy. For non-critical errors, you can log instead of raising exceptions; for critical cases, where you know a failing test will prevent you from returning an appropriate result (e.g. the user entered a negative value for an "age" column), you should return an appropriate error message. Handling errors in production is part of &lt;em&gt;model monitoring&lt;/em&gt;, which we will cover in an upcoming post.&lt;/p&gt;

&lt;h1&gt;
  
  
  Revisited workflow
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U9-WOpE1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/yd0p3q8hkayba2uy0zgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U9-WOpE1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/yd0p3q8hkayba2uy0zgf.png" alt="Alt Text" width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We now revisit the workflow based on observations from the previous sections. On every push, unit tests run first; the pipeline is then executed with a sample of the data, and upon each task execution, integration tests verify its output. If all tests pass, the commit is marked as successful. This is the end of the CI process, and it should only take a few minutes.&lt;/p&gt;

&lt;p&gt;Given that we are continuously testing each code change, we should be able to deploy at any time. This idea of continuously deploying software is called &lt;a href="https://en.wikipedia.org/wiki/Continuous_deployment"&gt;Continuous Deployment&lt;/a&gt;; it deserves a dedicated post, but here's the summary.&lt;/p&gt;

&lt;p&gt;Since we need to generate a daily report, the pipeline runs every morning. The first step is to pull (from the repository or an artifact store) the latest stable version available and install it in the production server. Integration tests run after each successful task to check data expectations; if any of these tests fails, a notification is sent. If everything goes well, the pipeline emails the report to the business analysts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation details
&lt;/h2&gt;

&lt;p&gt;This section provides general guidelines and resources to implement the CI workflow with existing tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unit testing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To unit test the logic inside each pipeline task, we can leverage existing tools. I highly recommend &lt;a href="https://docs.pytest.org/en/stable/"&gt;pytest&lt;/a&gt;. It has a small learning curve for basic usage, and as you get more comfortable with it, I'd advise exploring more of its features (e.g. &lt;a href="https://docs.pytest.org/en/stable/fixture.html#fixture"&gt;fixtures&lt;/a&gt;). Becoming a power user of any testing framework comes with great benefits: you'll spend less time writing tests and maximize their effectiveness at catching bugs. Keep practicing until writing tests becomes the natural first step before writing any actual code. This technique of writing tests first is called &lt;a href="https://en.wikipedia.org/wiki/Test-driven_development"&gt;Test-driven development (TDD)&lt;/a&gt;.&lt;/p&gt;
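
&lt;p&gt;A minimal pytest-style example; the &lt;code&gt;add_discount&lt;/code&gt; function is invented for illustration, not taken from the pipeline above:&lt;/p&gt;

```python
# A pytest-style unit test for a small transformation function.
def add_discount(price, fraction):
    """Return price after applying a discount fraction (e.g. 0.1 for 10%)."""
    return round(price * (1 - fraction), 2)

def test_add_discount():
    assert add_discount(100.0, 0.1) == 90.0
    assert add_discount(0.0, 0.5) == 0.0

# pytest discovers and runs test_* functions automatically, e.g.:
#   $ pytest test_transforms.py
test_add_discount()
```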

&lt;p&gt;&lt;strong&gt;Running integration tests&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Integration tests have more tooling requirements since they need to account for the pipeline structure (run tasks in order), parametrization (for sampling), and test execution (run tests after each task). There has been a recent surge in workflow management tools that can help with this to some extent.&lt;/p&gt;

&lt;p&gt;Our library &lt;a href="https://github.com/ploomber/ploomber"&gt;Ploomber&lt;/a&gt; supports all features required to implement this workflow: representing your pipeline as a &lt;a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph"&gt;DAG&lt;/a&gt;, separating dev/test/production environments, parametrizing pipelines, running test functions upon task execution, integration with the Python debugger, among other features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A lot of simple to moderately complex data applications are developed on a single server: the first pipeline tasks dump raw data from a warehouse, and all downstream tasks write intermediate results as local files (e.g. parquet or CSV files). This architecture makes it easy to contain and execute the pipeline in a different system: to test locally, just run the pipeline and save the artifacts in a folder of your choice; to run it in the CI server, copy the source code and execute the pipeline there. There is no dependency on any external system.&lt;/p&gt;

&lt;p&gt;However, when data scale is a challenge, the pipeline might just serve as an execution coordinator that does little to no actual computation; think, for example, of a purely SQL pipeline that only sends scripts to an analytical database and waits for completion.&lt;/p&gt;

&lt;p&gt;When execution depends on external systems, implementing CI is harder because you depend on another system to execute your pipeline. In traditional software projects, this is solved by creating &lt;a href="https://en.wikipedia.org/wiki/Mock_object"&gt;mocks&lt;/a&gt;, which mimic the behavior of another object. Think about the e-commerce website: the production database is a large server that supports all users. During development and testing, there is no need for such a big system; a smaller one with some data (maybe a sample of real data, or even fake data) is enough, as long as it accurately mimics the behavior of the production database.&lt;/p&gt;

&lt;p&gt;This is often not possible in data projects. If we are using a big external server to speed up computations, we most likely only have that one system (e.g. a company-wide Hadoop cluster), and mocking it is unfeasible. One way to tackle this is to store pipeline artifacts in different "environments". For example, if you are using a large analytical database for your project, store production artifacts in a &lt;code&gt;prod&lt;/code&gt; schema and testing artifacts in a &lt;code&gt;test&lt;/code&gt; schema. If you cannot create schemas, you can instead prefix all your tables and views (e.g. &lt;code&gt;prod_customers&lt;/code&gt; and &lt;code&gt;test_customers&lt;/code&gt;). Parametrizing your pipeline lets you easily switch schemas or prefixes.&lt;/p&gt;
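
&lt;p&gt;A sketch of such parametrization in Python; the query and table names are hypothetical:&lt;/p&gt;

```python
# Sketch: parametrize a SQL pipeline by environment so the same code
# reads/writes "prod" artifacts in production and "test" artifacts in CI.
def render_query(env):
    """Prefix tables with the target schema ("prod" or "test")."""
    template = ("SELECT * FROM {schema}.customers "
                "JOIN {schema}.products USING (product_id)")
    return template.format(schema=env)

prod_query = render_query("prod")
test_query = render_query("test")
```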

&lt;p&gt;&lt;strong&gt;CI server&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To automate test execution you need a CI server: whenever you push to the repository, the CI server runs the tests against the new commit. There are many options available; check whether the company you work for already has a CI service. If there isn't one, you won't get the automated process, but you can still implement it halfway by running your tests locally on each commit.&lt;/p&gt;

&lt;h1&gt;
  
  
  Extension: Machine Learning pipeline
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VodVaQP6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wlwzmebfz6730fb9zwwr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VodVaQP6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wlwzmebfz6730fb9zwwr.png" alt="Alt Text" width="880" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's modify our previous daily report pipeline to cover an important use case: developing a Machine Learning model. Say that we now want to forecast daily sales for the next month. We could do this by getting historical sales (instead of just yesterday's sales), generating features and training a model.&lt;/p&gt;

&lt;p&gt;Our end-to-end process has two stages: first, we process the data to generate a training set; then, we train models and select the best one. If we follow the same rigorous testing approach for each task along the way, we will be able to keep dirty data from getting into our model; remember: &lt;a href="https://en.wikipedia.org/wiki/Garbage_in,_garbage_out"&gt;garbage in, garbage out&lt;/a&gt;. Practitioners sometimes focus too much on the training task, trying out many fancy models or complex hyperparameter tuning schemes. While this approach is certainly valuable, there is usually a lot of low-hanging fruit in the data preparation process that can significantly impact our model's performance. But to maximize this impact, we must ensure that the data preparation stage is reliable and reproducible.&lt;/p&gt;

&lt;p&gt;Bugs in data preparation cause either results that are &lt;em&gt;too good to be true&lt;/em&gt; (i.e. data leakage) or suboptimal models; our tests should address both scenarios. To prevent data leakage, we can test for the existence of problematic columns in the training set (e.g. a column whose value only becomes known after our target variable is observed). To avoid suboptimal performance, integration tests that verify our data assumptions play an important role, but we can include other tests to check the quality of our final dataset, such as verifying that we have data across all years and that data regarded as unsuitable for training does not appear.&lt;/p&gt;
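
&lt;p&gt;A sketch of both kinds of checks; the column names and year range are made up for illustration:&lt;/p&gt;

```python
# Sketch of leakage and coverage checks on a training set.
def check_training_set(columns, years):
    # columns only known after the outcome would leak the target
    leaky = {"churn_date", "refund_issued"}
    assert not leaky.intersection(columns), "leaky column in training set"
    # verify we have data for every expected year (2015 through 2020)
    assert set(years) == set(range(2015, 2021)), "missing years"
    return True

ok = check_training_set(
    columns=["customer_id", "age", "total_spend"],
    years=[2015, 2016, 2017, 2018, 2019, 2020],
)
```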

&lt;p&gt;Getting historical data will increase overall CI running time, but data sampling (as we did in the daily report pipeline) helps. Even better, you can cache a local copy of the data sample to avoid fetching it every time you run your tests.&lt;/p&gt;
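
&lt;p&gt;One possible caching sketch, where &lt;code&gt;fetch_sample&lt;/code&gt; stands in for the actual (expensive) query against the warehouse:&lt;/p&gt;

```python
# Sketch: cache the data sample locally so repeated test runs don't
# re-fetch it from the warehouse.
import json
from pathlib import Path

def fetch_sample():
    # placeholder for an expensive remote query
    return [{"order_id": 1, "amount": 9.99}]

def load_sample(cache=Path("sample.json")):
    if cache.exists():
        return json.loads(cache.read_text())
    data = fetch_sample()
    cache.write_text(json.dumps(data))
    return data

data = load_sample()  # fetches once, then reads from disk
```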

&lt;p&gt;To ensure full model reproducibility, we should only train models using artifacts generated by an automated process. Once tests pass, a process could automatically trigger an end-to-end pipeline execution with the full dataset to generate the training data.&lt;/p&gt;

&lt;p&gt;Keeping historical artifacts also helps with model auditability: given a commit hash, we should be able to locate the training data it generated; moreover, re-executing the pipeline from the same commit should yield identical results.&lt;/p&gt;
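
&lt;p&gt;One simple convention is to key artifact locations by the commit hash that produced them; the directory layout below is an assumption, not a prescribed standard:&lt;/p&gt;

```python
# Sketch: address each generated training set by the commit hash that
# produced it, so any experiment can be traced back to its exact inputs.
from pathlib import Path

def artifact_path(commit_hash, root=Path("artifacts")):
    return root / commit_hash / "training_data.parquet"

path = artifact_path("4f2a9c1")  # hypothetical short commit hash
```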

&lt;h2&gt;
  
  
  Model evaluation as part of the CI workflow
&lt;/h2&gt;

&lt;p&gt;Our current CI workflow tests our pipeline with a data sample to make sure the final output is suitable for training. Wouldn't it be nice if we could also test the training procedure?&lt;/p&gt;

&lt;p&gt;Recall that the purpose of CI is to let developers integrate small changes iteratively; for this to be effective, feedback needs to come back quickly. Training ML models usually comes with a long running time; unless we have a way of finishing our training procedure in a few minutes, we'll have to think about how to test swiftly.&lt;/p&gt;

&lt;p&gt;Let's analyze two subtly different scenarios to understand how we can integrate them in the CI workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing a training algorithm&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are implementing your own training algorithm, you should test your implementation independently of the rest of the pipeline. These tests verify the correctness of your implementation.&lt;/p&gt;

&lt;p&gt;This is something that every ML framework does (scikit-learn, keras, etc.), since they have to ensure that improvements to the current implementations do not break them. In most cases, unless you are working with a very data-hungry algorithm, this won't pose a running-time problem, because you can unit test your implementation with a synthetic or toy dataset. The same logic applies to any training preprocessors (such as data scaling).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing your training pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In practice, training is not a single-stage procedure. The first step is to load your data; then you might do some final cleaning, such as removing IDs or one-hot encoding categorical features. After that, you pass the data to a multi-stage training pipeline that involves splitting, data preprocessing (e.g. standardization, PCA), hyperparameter tuning, and model selection. Things can go wrong in any of these steps, especially if your pipeline has highly customized procedures.&lt;/p&gt;

&lt;p&gt;Testing your training pipeline is hard because there is no obvious test oracle. My advice is to try to make your pipeline as simple as possible by leveraging existing implementations (scikit-learn has amazing tools for &lt;a href="https://scikit-learn.org/stable/modules/compose.html"&gt;this&lt;/a&gt;) to reduce the amount of code to test.&lt;/p&gt;

&lt;p&gt;In practice, I've found it useful to define a test criterion &lt;em&gt;relative&lt;/em&gt; to previous results. If the first time I trained a model I got an accuracy of X, I save this number and use it as a reference. Subsequent experiments should fall within a &lt;em&gt;reasonable range&lt;/em&gt; of X: sudden drops or gains in performance trigger an alert to review results manually. Sometimes this is good news (performance improved because my new features are working); other times it is bad news: sudden gains in performance might come from information leakage, while sudden drops might come from incorrectly processed data or accidentally dropped rows or columns.&lt;/p&gt;
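
&lt;p&gt;A sketch of such a relative criterion using only the standard library; the 0.05 tolerance is an arbitrary choice you would tune to your own metric's variance:&lt;/p&gt;

```python
# Sketch: compare the new accuracy against a stored reference and flag
# sudden drops or gains for manual review.
import math

def within_expected_range(accuracy, reference, tolerance=0.05):
    # True when the absolute difference is at most the tolerance
    return math.isclose(accuracy, reference, abs_tol=tolerance, rel_tol=0.0)

reference = 0.87  # saved from the first accepted experiment

ok = within_expected_range(0.89, reference)          # small change: fine
suspicious = within_expected_range(0.99, reference)  # sudden gain: review
```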

&lt;p&gt;To keep running time feasible, run the training pipeline with the data sample and have your test compare performance against a metric obtained using the same sampling procedure. This is more complex than it sounds: variance in the results increases when you train with less data, which makes defining the &lt;em&gt;reasonable range&lt;/em&gt; more challenging.&lt;/p&gt;

&lt;p&gt;If the above strategy does not work, you can try using a surrogate model in your CI pipeline that is faster to train, and increase your data sample size. For example, if you are training a neural network, you could train with a simpler architecture to make training faster, and increase the data sample used in CI to reduce variance across CI runs.&lt;/p&gt;

&lt;h1&gt;
  
  
  The next frontier: CD for Data Science
&lt;/h1&gt;

&lt;p&gt;CI allows us to integrate code in short cycles, but that's not the end of the story. At some point we have to &lt;em&gt;deploy&lt;/em&gt; our project, this is where &lt;a href="https://en.wikipedia.org/wiki/Continuous_delivery"&gt;Continuous Delivery&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Continuous_deployment"&gt;Continuous Deployment&lt;/a&gt; come in.&lt;/p&gt;

&lt;p&gt;The first step towards deployment is &lt;em&gt;releasing&lt;/em&gt; our project. Releasing is taking all necessary files (i.e. source code, configuration files, etc) and putting them in a format that can be used for installing our project in the production environment. For example, releasing a Python package requires uploading our code to the Python Package Index.&lt;/p&gt;

&lt;p&gt;Continuous Delivery ensures that software &lt;em&gt;can be&lt;/em&gt; released at any time, but deployment remains a manual process (i.e. someone has to execute instructions in the production environment); in other words, it only automates the release process. Continuous Deployment automates both release and deployment. Let's now analyze these concepts in terms of data projects.&lt;/p&gt;

&lt;p&gt;For pipelines that produce human-readable documents (e.g. a report), Continuous Deployment is straightforward. After CI passes, another process grabs all the necessary files and creates an installable artifact; the production environment then uses this artifact to set up and install our project. The next time the pipeline runs, it will be using the latest stable version.&lt;/p&gt;

&lt;p&gt;On the other hand, Continuous Deployment for ML pipelines is much harder. The output of a pipeline is not a single model, but several candidate models that should be compared so the best one can be deployed. Things get even more complicated if we already have a model in production, because not deploying at all might be the best option (e.g. if the new model doesn't improve predictive power significantly but comes with an increase in runtime or more data dependencies).&lt;/p&gt;

&lt;p&gt;An even more important (and more difficult) property to assess than predictive power is &lt;a href="https://en.wikipedia.org/wiki/Fairness_(machine_learning)"&gt;model fairness&lt;/a&gt;. Every new deployment must be evaluated for bias towards sensitive groups. Coming up with an automated way to assess a model on both predictive power and fairness is very difficult, and it deserves its own post. If you want to know more about model fairness, &lt;a href="https://fairmlbook.org/"&gt;this is a great place to start&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But Continuous Delivery for ML is still a manageable process. Once a commit passes all tests with a data sample (CI), another process runs the pipeline with the full dataset and stores the final dataset in object storage (CD stage 1).&lt;/p&gt;

&lt;p&gt;The training procedure then loads the artifacts and finds optimal models by tuning the hyperparameters of each selected algorithm. Finally, it serializes the best model specification (i.e. the algorithm and its best hyperparameters) along with evaluation reports (CD stage 2). When it's all done, we look at the reports and choose a model for deployment.&lt;/p&gt;

&lt;p&gt;In the previous section, we discussed how to include model evaluation in the CI workflow, where the proposed solution is limited by CI running-time requirements. Once the first stage of the CD process is done, we can add a more robust check: train the latest best model specification with the full dataset. This catches bugs that cause performance drops right away, instead of having to wait for the much longer second stage to finish. The CD workflow looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O1ktuFzH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/towp1atdnzyv5x4prznm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O1ktuFzH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/towp1atdnzyv5x4prznm.png" alt="Alt Text" width="800" height="644"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Triggering CD from a successful CI run can be manual; a data scientist might not want to generate datasets for every passing commit. But it should be easy to do given the commit hash (i.e. with a single click or command).&lt;/p&gt;

&lt;p&gt;It is also convenient to allow manual execution of the second stage, because data scientists often use the same dataset to run several &lt;em&gt;experiments&lt;/em&gt; by customizing the training pipeline; thus, a single dataset can potentially trigger many training jobs.&lt;/p&gt;

&lt;p&gt;Experiment reproducibility is critical in ML pipelines. There is a one-to-one relationship between a commit, a CI run, and a data preparation run (CD stage 1); thus, we can uniquely identify a dataset by the commit hash that generated it. We should be able to reproduce any experiment by running the data preparation step again and then running the training stage with the same training parameters.&lt;/p&gt;

&lt;h1&gt;
  
  
  Closing remarks
&lt;/h1&gt;

&lt;p&gt;As CI/CD processes for Data Science mature and standardize, we'll start to see new tools that ease implementation. Currently, many data scientists don't even consider CI/CD as part of their workflow; in some cases they just don't know about it, in others because implementing CI/CD effectively requires repurposing existing tools through a nontrivial setup process. Data scientists should not have to worry about setting up a CI/CD service; they should just focus on writing code and tests, and push.&lt;/p&gt;

&lt;p&gt;Apart from CI/CD tools specifically tailored for data projects, we also need data pipeline management tools to standardize pipeline development. In the last couple of years, I've seen a lot of new projects; unfortunately, most of them focus on aspects such as scheduling or scaling rather than user experience, which is critical if we want software engineering practices such as modularization and testing to be embraced by all data scientists. This is why we built &lt;a href="https://github.com/ploomber/ploomber"&gt;Ploomber&lt;/a&gt;: to help data scientists easily and incrementally adopt better development practices.&lt;/p&gt;

&lt;p&gt;Shortening the developer feedback loop is critical for CI success. While data sampling is an effective approach, we can do better with incremental runs: changing a single task in a pipeline should trigger the least amount of work possible by reusing previously computed artifacts. Ploomber already offers this to some extent, and we are experimenting with ways to improve this feature.&lt;/p&gt;

&lt;p&gt;I believe CI is the most important missing piece in the Data Science software stack: we already have great tools for AutoML, one-click model deployment, and model monitoring. CI will close the gap, allowing teams to confidently and continuously train and deploy models.&lt;/p&gt;

&lt;p&gt;To move our field forward, we must start paying more attention to our development processes.&lt;/p&gt;

&lt;p&gt;Found an error in this post? &lt;a href="https://github.com/ploomber/posts/issues/new?title=Issue%20in%20post%3A%20%22Rethinking%20Continuous%20Integration%20for%20Data%20Science%22"&gt;Click here to let us know&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Originally posted at &lt;a href="https://ploomber.io/posts/ci4ds"&gt;ploomber.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Robust Jupyter report generation using static analysis</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Mon, 20 Apr 2020 23:25:39 +0000</pubDate>
      <link>https://dev.to/edublancas/robust-jupyter-report-generation-using-static-analysis-1ch9</link>
      <guid>https://dev.to/edublancas/robust-jupyter-report-generation-using-static-analysis-1ch9</guid>
      <description>&lt;p&gt;Jupyter notebooks are a great format for generating data analysis reports since they can contain rich output such as tables and charts in a single file. With the release of &lt;a href="https://github.com/nteract/papermill"&gt;papermill&lt;/a&gt;, a package that lets you parametrize and execute &lt;code&gt;.ipynb&lt;/code&gt; files programmatically, it became easier to use notebooks as templates to  generate analytical reports. When developing a Machine Learning model, I use Jupyter notebooks in tandem with papermill to generate a report for each experiment I run, this way, I can always go back and check performance metrics, tables and charts to compare one experiment to another.&lt;/p&gt;

&lt;p&gt;After trying out the Jupyter notebook + papermill combination in a few projects, I found some recurring problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;.ipynb&lt;/code&gt; files store cells' output in the same file. This is good for the final report, but for development purposes, if two or more people edit the same notebook, the cells' output gets in the way, making &lt;code&gt;git merge&lt;/code&gt; a big pain&lt;/li&gt;
&lt;li&gt;Even if we make sure cells' output is deleted before pushing to the git repository, comparing versions with &lt;code&gt;git diff&lt;/code&gt; yields illegible results (&lt;code&gt;.ipynb&lt;/code&gt; files are JSON files with a complex structure)&lt;/li&gt;
&lt;li&gt;Notebooks are developed interactively: cells are added and moved around, and this interactivity often causes a top-to-bottom execution to fail. Given that papermill executes notebooks cell by cell, something as simple as a syntax error in the very last cell won't be raised until that cell is executed&lt;/li&gt;
&lt;li&gt;Papermill doesn't validate input parameters; it just adds a new cell. This might lead to unexpected behavior, such as an "undefined variable" error or inadvertently using a default parameter value. This is especially frustrating for long-running notebooks, where one finds out about errors only after waiting for the notebook to finish execution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this blog post, I'll explain my workflow for robust report generation. The post is divided into two parts: part I discusses the solution to problems 1 and 2, and part II covers problems 3 and 4. Incorporating this workflow will help you better integrate your report's source code with git and save precious time by automatically preventing notebook execution when errors are detected.&lt;/p&gt;

&lt;p&gt;Along the way, you'll also learn a few interesting things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How Jupyter notebooks are represented (the &lt;code&gt;.ipynb&lt;/code&gt; format)&lt;/li&gt;
&lt;li&gt;How to read and manipulate notebooks using the &lt;code&gt;nbformat&lt;/code&gt; package&lt;/li&gt;
&lt;li&gt;How to convert a Python script (&lt;code&gt;.py&lt;/code&gt;) to a Jupyter notebook using &lt;code&gt;jupytext&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Basic Jupyter notebook static analysis using &lt;code&gt;pyflakes&lt;/code&gt; and &lt;code&gt;parso&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;How to programmatically execute Jupyter notebooks using &lt;code&gt;papermill&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;How to automate report validation and generation using &lt;code&gt;ploomber&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Workflow summary
&lt;/h2&gt;

&lt;p&gt;The solution to problems 1 and 2 is to use a different format for development and convert to &lt;code&gt;.ipynb&lt;/code&gt; right before execution; &lt;a href="https://github.com/mwouts/jupytext"&gt;jupytext&lt;/a&gt; does exactly that. Problems 3 and 4 are approached by doing static analysis before executing the notebook.&lt;/p&gt;

&lt;p&gt;Step by step summary:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Work on &lt;code&gt;.py&lt;/code&gt; files (instead of &lt;code&gt;.ipynb&lt;/code&gt;) to make git integration easier&lt;/li&gt;
&lt;li&gt;Declare your "notebook" parameters at the top, tagging the cell as "parameters" (see &lt;a href="https://jupytext.readthedocs.io/en/latest/formats.html#notebooks-as-scripts"&gt;jupytext reference&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Before executing your notebook, validate the &lt;code&gt;.py&lt;/code&gt; file using pyflakes and parso&lt;/li&gt;
&lt;li&gt;If validation succeeds, use jupytext to convert your &lt;code&gt;.py&lt;/code&gt; file to a &lt;code&gt;.ipynb&lt;/code&gt; notebook&lt;/li&gt;
&lt;li&gt;Execute your &lt;code&gt;.ipynb&lt;/code&gt; notebook using papermill&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Alternatively, you can use &lt;a href="https://github.com/ploomber/ploomber"&gt;ploomber&lt;/a&gt; to automate the whole process, sample code is provided at the end of this post.&lt;/p&gt;

&lt;h1&gt;
  
  
  Part I: To ease git integration, replace &lt;code&gt;.ipynb&lt;/code&gt; notebooks with &lt;code&gt;.py&lt;/code&gt; files
&lt;/h1&gt;

&lt;h2&gt;
  
  
  How are notebooks represented on disk?
&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;notebook.ipynb&lt;/code&gt; file is just a JSON file with a certain structure, which is defined in the &lt;a href="https://nbformat.readthedocs.io/en/latest/"&gt;&lt;code&gt;nbformat&lt;/code&gt; package&lt;/a&gt;. When we open the Jupyter application (by using the &lt;code&gt;jupyter notebook&lt;/code&gt; command), Jupyter uses nbformat under the hood to save our changes in the &lt;code&gt;.ipynb&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Let's see how we can create a notebook by directly manipulating an object and then serializing it to JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# create a new notebook (nbformat.v4 defines the lastest jupyter notebook format)
&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_notebook&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# let's add a new code cell
&lt;/span&gt;&lt;span class="n"&gt;cell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_code_cell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'# this line was added programatically&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt; 1 + 1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# what kind of object is this?
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'A notebook is an object of type: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A notebook is an object of type: &amp;lt;class 'nbformat.notebooknode.NotebookNode'&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can convert the notebook object to its JSON representation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nbjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONWriter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;nb_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Notebook JSON representation (content of the .ipynb file):&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nb_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Notebook JSON representation (content of the .ipynb file):

 {
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# this line was added programmatically\n",
    " 1 + 1"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The notebook is a great output format since it supports embedded charts and tables in a single file, which we can easily share or review later, but it's not a good choice as a source code format. Say we edit the previous notebook by changing the first cell and adding a second one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# edit first cell
&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'cells'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'source'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'# Change cell&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt; 2 + 2'&lt;/span&gt;

&lt;span class="c1"&gt;# add a new one
&lt;/span&gt;&lt;span class="n"&gt;cell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_code_cell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'# This is a new cell&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt; 3 + 3'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How would our changes look to a reviewer?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# generate diff view between the old and the new notebook
&lt;/span&gt;&lt;span class="n"&gt;nb_json_edited&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;difflib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndiff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb_json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keepends&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                     &lt;span class="n"&gt;nb_json_edited&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keepends&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
   "cells": [
    {
     "cell_type": "code",
     "execution_count": null,
     "metadata": {},
     "outputs": [],
     "source": [
-     "# this line was added programmatically\n",
+     "# Change cell\n",
-     " 1 + 1"
?       ^   ^
+     " 2 + 2"
?       ^   ^
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "# This is a new cell\n",
+     " 3 + 3"
     ]
    }
   ],
   "metadata": {},
   "nbformat": 4,
   "nbformat_minor": 4
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's hard to see what's going on, and this is just a notebook with two cells and no output. In a real notebook with dozens of cells, understanding the difference between the old and new versions by eye is impossible.&lt;/p&gt;
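&lt;p&gt;For comparison, here's what the same edit looks like when the notebook is stored as plain Python (a sketch using jupytext's percent format; the light format used later in this post diffs just as cleanly):&lt;/p&gt;

```python
import difflib

# the same two-cell notebook stored as plain Python (jupytext percent format)
old = "# %%\n# this line was added programmatically\n1 + 1\n"
new = "# %%\n# Change cell\n2 + 2\n\n# %%\n# This is a new cell\n3 + 3\n"

# generate the same diff view as before, now over plain Python source
result = ''.join(difflib.ndiff(old.splitlines(keepends=True),
                               new.splitlines(keepends=True)))
print(result, end='')
```

&lt;p&gt;The diff now contains only the lines we actually touched, with no JSON noise around them.&lt;/p&gt;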

&lt;p&gt;To ease git integration, I use plain &lt;code&gt;.py&lt;/code&gt; files and only convert them to &lt;code&gt;.ipynb&lt;/code&gt; notebooks before execution. We could parse a &lt;code&gt;.py&lt;/code&gt; file and convert it to a valid &lt;code&gt;.ipynb&lt;/code&gt; file using nbformat, but there are important details such as tags or markdown cells that we have to take care of; fortunately, &lt;a href="https://github.com/mwouts/jupytext"&gt;&lt;code&gt;jupytext&lt;/code&gt;&lt;/a&gt; does that for us.&lt;/p&gt;

&lt;p&gt;Furthermore, once jupytext is installed, the &lt;code&gt;jupyter notebook&lt;/code&gt; application will treat an opened &lt;code&gt;.py&lt;/code&gt; file as a notebook, and we will be able to run, add, and remove cells as usual.&lt;/p&gt;

&lt;p&gt;Let's see how to convert a &lt;code&gt;.py&lt;/code&gt; file to &lt;code&gt;.ipynb&lt;/code&gt; using jupytext:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# define your "notebook" in a plain .py file
# note that jupytext defines a syntax to support markdown cells and cell tags
&lt;/span&gt;&lt;span class="n"&gt;py_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"""# This is a markdown cell

# + tags=['parameters']
x = 1
y = 2
"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# use jupyter to convert it to a notebook object
&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jupytext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;py_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'py'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'Object type:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Object type:
&amp;lt;class 'nbformat.notebooknode.NotebookNode'&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;.py&lt;/code&gt; files solves problems 1 and 2. Let's now discuss problems 3 and 4.&lt;/p&gt;

&lt;h1&gt;
  
  
  Part II: To catch errors before execution, use static analysis
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Static_program_analysis"&gt;Static analysis&lt;/a&gt; is the analysis of source code without execution. Since our notebooks usually take a lot to run, we want to catch as many errors as we can before running them, given that we got rid of the complex &lt;code&gt;.ipynb&lt;/code&gt; format, we can now use tools that analyze Python source code to spot errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does papermill execute Jupyter notebooks?
&lt;/h2&gt;

&lt;p&gt;To motivate this section, it is important to understand how papermill executes notebooks. papermill performs a cell-by-cell execution: it takes the code from the first cell, sends it to the Python kernel, waits for a response, saves the output, and repeats this process for all cells. You can see the details in the source code &lt;a href="https://github.com/nteract/papermill/blob/fdd62827d789cb93e3e5c8debe3b5c5d8d3c58b8/papermill/clientwrap.py#L62"&gt;here&lt;/a&gt;; you'll notice that &lt;code&gt;PapermillNotebookClient&lt;/code&gt; is a subclass of &lt;code&gt;NotebookClient&lt;/code&gt;, which is part of &lt;a href="https://github.com/jupyter/nbclient"&gt;nbclient&lt;/a&gt;, an official package that also implements a notebook executor.&lt;/p&gt;

&lt;p&gt;This cell-by-cell logic has an important implication: an error in cell i won't be raised until that cell is executed. Imagine your notebook looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;

&lt;span class="c1"&gt;# cell 1 - simulate a long-running operation
&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ...
# ...
&lt;/span&gt;
&lt;span class="c1"&gt;# cell 100 - there is a syntax error here (missing ":")!
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Something as simple as a syntax error won't make your notebook crash until execution reaches cell 100, one hour later. To fix this problem, we will do a very simple static analysis on the whole notebook source code before executing it.&lt;/p&gt;
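&lt;p&gt;The cheapest possible check is the built-in &lt;code&gt;compile()&lt;/code&gt; function, which parses the whole source without running it, so a syntax error like the one above surfaces instantly instead of an hour later (a minimal sketch of the idea; the next section uses pyflakes, which catches more than just syntax errors):&lt;/p&gt;

```python
# notebook source with a syntax error in the very last "cell"
source = """
import time

# cell 1 - simulate a long-running operation
time.sleep(3600)

# cell 100 - there is a syntax error here (missing ":")!
if x > 10
    pass
"""

try:
    # compile() parses the code without executing it, so time.sleep
    # never runs and the syntax error is reported immediately
    compile(source, 'notebook.py', 'exec')
    ok = True
except SyntaxError as e:
    ok = False
    print(f'{e.filename}:{e.lineno}: {e.msg}')
```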

&lt;h2&gt;
  
  
  Finding errors with pyflakes
&lt;/h2&gt;

&lt;p&gt;To prevent some runtime errors, we will run a few checks on our source code before executing it. pyflakes is a tool that looks for errors in source code by parsing it. Since pyflakes does not execute the code, it is limited in how many errors it can find, but it is very useful for finding simple errors that would otherwise only be detected at runtime. For the full list of errors pyflakes can detect, &lt;a href="https://github.com/PyCQA/pyflakes/blob/master/pyflakes/messages.py"&gt;see this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's see how it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;py_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
import time

time.sleep(3600)

x = 1
y = 2

# z is never defined!
x + y + z

print('Variables. x: {}. y: {}'.format(x))
"""&lt;/span&gt;

&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pyflakes_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;py_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'my_file.py'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my_file.py:10:9 undefined name 'z'
my_file.py:12:7 '...'.format(...) is missing argument(s) for placeholder(s): 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;pyflakes found that the variable 'z' is used but never defined; had we executed this notebook, we'd have found out about the error only after waiting for one hour.&lt;/p&gt;

&lt;p&gt;There are other projects similar to pyflakes, such as &lt;a href="https://www.pylint.org/"&gt;pylint&lt;/a&gt;. pylint is able to find more errors than pyflakes, but it also flags style issues (such as inconsistent indentation). We probably don't want to prevent notebook execution due to style issues, so we'd have to filter out some messages; pyflakes works just fine for our purposes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parametrized notebooks with papermill
&lt;/h2&gt;

&lt;p&gt;papermill can parametrize notebooks, which allows you to use them as templates. Say you have a notebook called &lt;code&gt;yearly_template.ipynb&lt;/code&gt; that takes a year as a parameter and generates a summary of the data generated in that year; you could execute it from the command line using papermill like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;papermill yearly_template.ipynb report_2019.ipynb &lt;span class="nt"&gt;-p&lt;/span&gt; year 2019
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;.ipynb&lt;/code&gt; files support &lt;a href="https://nbformat.readthedocs.io/en/latest/format_description.html#cell-metadata"&gt;cell tags&lt;/a&gt;. When you execute a notebook, papermill will inject a new cell with your parameters just below the cell tagged with "parameters". Although we are not dealing with &lt;code&gt;.ipynb&lt;/code&gt; files anymore, we can still tag cells using jupytext's syntax. Let's define a simple notebook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# + tags=['parameters']
&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;span class="c1"&gt;# +
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'the year is {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you convert the code above to &lt;code&gt;.ipynb&lt;/code&gt; and then execute it using papermill, the following cells will run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cell 1: cell tagged with "parameters"
&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;span class="c1"&gt;# Cell 2: injected by papermill
&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2019&lt;/span&gt;


&lt;span class="c1"&gt;# Cell 3
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'the year is {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;papermill limits itself to injecting the passed parameters and executing the notebook; it does not perform any kind of validation. Adding simple validation logic can help us prevent runtime errors before execution.&lt;/p&gt;
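&lt;p&gt;The validation we're after boils down to two set comparisons between the declared and the passed parameter names (a minimal sketch with hypothetical names; the full implementation later in this post also runs pyflakes):&lt;/p&gt;

```python
def validate_params(declared, passed):
    """Compare declared parameter names against the passed ones."""
    declared, passed = set(declared), set(passed)

    # passing a parameter that was never declared is an error...
    extra = passed - declared
    if extra:
        raise ValueError(f'Undeclared parameters passed: {sorted(extra)}')

    # ...while a declared-but-missing parameter only deserves a warning,
    # since the declared default value will be used
    return sorted(declared - passed)

missing = validate_params(declared={'year', 'country'}, passed={'year'})
print('declared but not passed:', missing)  # ['country']
```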

&lt;h2&gt;
  
  
  Extracting declared parameters with parso
&lt;/h2&gt;

&lt;p&gt;I want parametrized notebooks to behave more like functions: they should refuse to run if any parameter is missing or if anything is passed but not declared. To enable this feature we have to analyze the "parameters" cell and compare it with the parameters passed via papermill. &lt;a href="https://github.com/davidhalter/parso"&gt;parso&lt;/a&gt; is a package that parses Python code and allows us to do exactly that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;params_cell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
# + tags=['parameters']
a = 1
b = 2
c = 3
"""&lt;/span&gt;

&lt;span class="c1"&gt;# parse "parameters" cell, find which variables are defined
&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parso&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params_cell&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Defined variables: '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_used_names&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Defined variables:  ['a', 'b', 'c']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see that parso detected the three variables; we can use this information to validate input parameters against the declared ones.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: I recently discovered that finding declared variables can also be done with the &lt;a href="https://docs.python.org/3/library/ast.html"&gt;ast module&lt;/a&gt;, which is part of the standard library&lt;/em&gt;.&lt;/p&gt;
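&lt;p&gt;For reference, here's a sketch of the same extraction using the standard library's &lt;code&gt;ast&lt;/code&gt; module, collecting the names bound by plain assignments in the "parameters" cell:&lt;/p&gt;

```python
import ast

params_cell = """
a = 1
b = 2
c = 3
"""

# walk the module body and collect names bound by simple assignments
declared = {
    target.id
    for node in ast.parse(params_cell).body
    if isinstance(node, ast.Assign)
    for target in node.targets
    if isinstance(target, ast.Name)
}
print(sorted(declared))  # ['a', 'b', 'c']
```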

&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;We now implement the logic in a single function that takes Python source code as input and validates it using pyflakes and parso.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb_source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'notebook'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Perform static analysis on Jupyter notebook source code;
    raises an exception if validation fails

    Parameters
    ----------
    nb_source : str
        Jupyter notebook source code in jupytext's py format,
        must have a cell with the tag "parameters"

    params : dict
        Parameters that will be added to the notebook source

    filename : str
        Filename to identify pyflakes warnings and errors
    """&lt;/span&gt;
    &lt;span class="c1"&gt;# parse the JSON string and convert it to a notebook object using jupytext
&lt;/span&gt;    &lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jupytext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb_source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'py'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# add a new cell just below the cell tagged with "parameters"
&lt;/span&gt;    &lt;span class="c1"&gt;# this emulates the notebook that papermill will run
&lt;/span&gt;    &lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params_cell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;add_passed_parameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# run pyflakes and collect errors
&lt;/span&gt;    &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;check_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;error_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# pyflakes returns "warnings" and "errors", collect them separately
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'warnings'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;error_message&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;'pyflakes warnings:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'warnings'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'errors'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;error_message&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;'pyflakes errors:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'errors'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# compare passed parameters with declared
&lt;/span&gt;    &lt;span class="c1"&gt;# parameters. This will make our notebook behave more
&lt;/span&gt;    &lt;span class="c1"&gt;# like a "function", if any parameter is passed but not
&lt;/span&gt;    &lt;span class="c1"&gt;# declared, this will return an error message, if any parameter
&lt;/span&gt;    &lt;span class="c1"&gt;# is declared but not passed, a warning is shown
&lt;/span&gt;    &lt;span class="n"&gt;res_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;check_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params_cell&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'source'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;error_message&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;res_params&lt;/span&gt;

    &lt;span class="c1"&gt;# if any errors were returned, raise an exception
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;error_message&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now see the implementation of the functions used above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params_source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Compare the parameters cell's source with the passed parameters: warn
    on missing parameters, return an error message if an extra one was passed.
    """&lt;/span&gt;
    &lt;span class="c1"&gt;# params are keys in "params" dictionary
&lt;/span&gt;    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# use parso to parse the "parameters" cell source code and get all variable names declared
&lt;/span&gt;    &lt;span class="n"&gt;declared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parso&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params_source&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get_used_names&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# now act depending on missing variables and/or extra variables
&lt;/span&gt;
    &lt;span class="n"&gt;missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;declared&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;
    &lt;span class="n"&gt;extra&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;declared&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;'Missing parameters: {}, will use default value'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;'Passed non-declared parameters: {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
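&lt;p&gt;The comparison inside &lt;code&gt;check_params&lt;/code&gt; is plain set arithmetic. Here is a standalone sketch with hypothetical &lt;code&gt;declared&lt;/code&gt; and &lt;code&gt;passed&lt;/code&gt; values (in the real function, &lt;code&gt;declared&lt;/code&gt; comes from parsing the cell with parso):&lt;/p&gt;

```python
# hypothetical example values for illustration
declared = {'a', 'b'}   # variable names found in the "parameters" cell
passed = {'a', 'c'}     # keys of the params dict sent by the caller

missing = declared - passed  # declared but not passed: the default is used
extra = passed - declared    # passed but never declared: an error

print(missing)  # {'b'}
print(extra)    # {'c'}
```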





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Run pyflakes on the notebook source; this will catch errors such as
    references to parameters that were neither passed nor given default values
    """&lt;/span&gt;
    &lt;span class="c1"&gt;# concatenate all cell's source code in a single string
&lt;/span&gt;    &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'source'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# this objects are needed to capture pyflakes output
&lt;/span&gt;    &lt;span class="n"&gt;warn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;reporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Reporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# run pyflakes.api.check on the source code
&lt;/span&gt;    &lt;span class="n"&gt;pyflakes_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reporter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reporter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# return any error messages returned by pyflakes
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'warnings'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
            &lt;span class="s"&gt;'errors'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;())}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
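&lt;p&gt;If you only need to reject notebooks that will not even parse, a rough standard-library analogue of &lt;code&gt;check_source&lt;/code&gt; can use the built-in &lt;code&gt;compile&lt;/code&gt;. This is a simplification: unlike pyflakes, it catches syntax errors only, not undefined names.&lt;/p&gt;

```python
def check_syntax(source, filename='notebook'):
    """Stdlib-only sketch: catch syntax errors via compile().

    Unlike pyflakes, this cannot detect undefined names; it only
    verifies that the concatenated cell source parses.
    """
    try:
        compile(source, filename, 'exec')
    except SyntaxError as e:
        return {'errors': '{}:{}:{}: {}'.format(filename, e.lineno, e.offset, e.msg)}
    return {'errors': ''}

print(check_syntax('a = 1\nb = 2'))  # no errors reported
print(check_syntax('if'))            # reports a syntax error
```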





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_passed_parameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Insert a cell just below the one tagged with "parameters"

    Notes
    -----
    Insert a code cell with params, to simulate the notebook papermill
    will run. This is a simple implementation, for the actual one see:
    https://github.com/nteract/papermill/blob/master/papermill/parameterize.py
    """&lt;/span&gt;
    &lt;span class="c1"&gt;# find "parameters" cell
&lt;/span&gt;    &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params_cell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_get_parameters_cell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# convert the parameters passed to valid python code
&lt;/span&gt;    &lt;span class="c1"&gt;# e.g {'a': 1, 'b': 'hi'} to:
&lt;/span&gt;    &lt;span class="c1"&gt;# a = 1
&lt;/span&gt;    &lt;span class="c1"&gt;# b = 'hi'
&lt;/span&gt;    &lt;span class="n"&gt;params_as_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;_parse_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;

    &lt;span class="c1"&gt;# insert the cell with the passed parameters
&lt;/span&gt;    &lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'cell_type'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'code'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'metadata'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
                              &lt;span class="s"&gt;'execution_count'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="s"&gt;'source'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params_as_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="s"&gt;'outputs'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params_cell&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_parameters_cell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Iterate over cells and return the index and content of the
    first cell tagged "parameters"; raise a ValueError if no
    such cell is found
    """&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;cell_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'tags'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cell_tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;'parameters'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cell_tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;

    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Notebook does not have a cell tagged "parameters"'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_parse_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Convert parameters to their Python code representation

    Notes
    -----
    This is a very simple way of doing it, for a more complete implementation,
    check out papermill's source code:
    https://github.com/nteract/papermill/blob/master/papermill/translators.py
    """&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;'{} = {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;repr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing our &lt;code&gt;check_notebook_source&lt;/code&gt; function
&lt;/h2&gt;

&lt;p&gt;Here we show some use cases for our validation function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Raise an error if the "parameters" cell does not exist:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;notebook_no_parameters_tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
a + b
"""&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook_no_parameters_tag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Raised exception:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raised exception: Notebook does not have a cell tagged "parameters"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Do not raise errors if the "parameters" cell exists and the passed parameters match:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;notebook_ab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
# + tags=['parameters']
a = 1
b = 2

# +
a + b
"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook_ab&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Warn if using a default value:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook_ab&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/Users/Edu/miniconda3/envs/blog/lib/python3.6/site-packages/ipykernel_launcher.py:19: UserWarning: Missing parameters: {'b'}, will use default value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Raise an error when passing an undeclared parameter:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook_ab&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'c'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Raised exception:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/2):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raised exception:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (2/2):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Passed non-declared parameters: {'c'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Raise an error if a variable is used but never declared:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;notebook_w_warning&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
# + tags=['parameters']
a = 1
b = 2

# +
# variable "c" is used but never declared!
a + b + c
"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook_w_warning&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Raised exception:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raised exception: 
pyflakes warnings:
notebook:7:9 undefined name 'c'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Catch syntax error:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;notebook_w_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
# + tags=['parameters']
a = 1
b = 2

# +
if
"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook_w_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Raised exception:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/2):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raised exception:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (2/2):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pyflakes errors:
notebook:6:3: invalid syntax

if

  ^
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Automating the workflow using ploomber
&lt;/h2&gt;

&lt;p&gt;To implement this workflow effectively, we have to make sure our validation function always runs, then convert the &lt;code&gt;.py&lt;/code&gt; file to &lt;code&gt;.ipynb&lt;/code&gt;, and finally execute it with papermill. ploomber automates this workflow easily; it can even convert the final output to several formats, such as HTML. We only have to pass the source code and call our &lt;code&gt;check_notebook_source&lt;/code&gt; from the &lt;code&gt;on_render&lt;/code&gt; hook.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: for simplicity, we include the notebooks' source code as strings in the following example; in a real project, it is better to load it from a file.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# source code for report 1
&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
# # My report

# + tags=["parameters"]
product = None
x = 1
y = 2
# -

print('x + y =', x + y)
"""&lt;/span&gt;

&lt;span class="c1"&gt;# source code for report 2
&lt;/span&gt;&lt;span class="n"&gt;nb_another&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
# # Another report

# + tags=["parameters"]
product = None
x = 1
y = 2
# -

print('x - y =', x - y)
"""&lt;/span&gt;

&lt;span class="c1"&gt;# on render hook: run before executing the notebooks
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_render&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# task.params are read-only, get a copy
&lt;/span&gt;    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# papermill (the library ploomber uses to execute notebooks) only supports
&lt;/span&gt;    &lt;span class="c1"&gt;# parameters that are JSON serializable, to test what papermill will
&lt;/span&gt;    &lt;span class="c1"&gt;# actually run, we have to check on the product in its serializable form
&lt;/span&gt;    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'product'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'product'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;to_json_serializable&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# store all reports under output/
&lt;/span&gt;&lt;span class="n"&gt;out_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'output'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;out_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ploomber gives you parallel execution for free
&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'parallel'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ploomber supports exporting ipynb files to several formats
# using the official nbconvert package, we convert our
# reports to HTML here by just adding the .html extension
&lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NotebookRunner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="s"&gt;'out.html'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'t1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;ext_in&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'py'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;kernelspec_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'python3'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'x'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;on_render&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;on_render&lt;/span&gt;

&lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NotebookRunner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb_another&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="s"&gt;'another.html'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'t2'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;ext_in&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'py'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;kernelspec_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'python3'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'x'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'y'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;on_render&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;on_render&lt;/span&gt;

&lt;span class="c1"&gt;# run the pipeline. No errors are raised but note that a warning is shown
&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/Users/Edu/dev/ploomber/src/ploomber/dag.py:469: UserWarning: Task "NotebookRunner: t1 -&amp;gt; File(output/out.html)" had the following warnings:

Missing parameters: {'y'}, will use default value
  warnings.warn(warning)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Jupyter notebooks (&lt;code&gt;.ipynb&lt;/code&gt;) are a great &lt;strong&gt;output&lt;/strong&gt; format, but keeping them under version control causes a lot of trouble. By using simple &lt;code&gt;.py&lt;/code&gt; files and leveraging jupytext, we get the best of both worlds: we edit plain Python source code files, but our generated reports are executed as Jupyter notebooks, which allows them to contain rich output such as tables and charts. To save time, we developed a function that validates our input notebook and catches errors before the notebook is executed. Finally, by using ploomber, we created a clean and efficient workflow: HTML reports are transparently generated from plain &lt;code&gt;.py&lt;/code&gt; files.&lt;/p&gt;
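&lt;p&gt;As a minimal illustration of the workflow summarized above (the cell contents and parameter are hypothetical), a jupytext percent-format &lt;code&gt;.py&lt;/code&gt; file that executes as a notebook might look like this:&lt;/p&gt;

```python
# %% [markdown]
# # Sales report

# %% tags=["parameters"]
# default parameter value; it can be overridden at execution time
x = 1

# %%
# a regular code cell; rich output (tables, charts) shows up
# in the executed notebook, while this file stays plain Python
print(f"running with x={x}")
```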

&lt;p&gt;Found an error in this post? &lt;a href="https://github.com/ploomber/posts/issues/new?title=Issue%20in%20post%3A%20%22Robust%20Jupyter%20report%20generation%20using%20static%20analysis%22"&gt;Click here to let us know&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Originally posted at &lt;a href="https://ploomber.io/posts/nb-static-analysis"&gt;ploomber.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>jupyter</category>
      <category>python</category>
    </item>
    <item>
      <title>Setting up reproducible Python environments for Data Science</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Wed, 01 Apr 2020 21:41:03 +0000</pubDate>
      <link>https://dev.to/edublancas/setting-up-reproducible-python-environments-for-data-science-3822</link>
      <guid>https://dev.to/edublancas/setting-up-reproducible-python-environments-for-data-science-3822</guid>
      <description>&lt;p&gt;&lt;strong&gt;Setting up a Python environment for Data Science is hard&lt;/strong&gt;. Throughout my projects, I've experienced a recurring pattern when attempting to set up Python packages and their dependencies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You start with a clean Python environment, usually using &lt;code&gt;conda&lt;/code&gt;, &lt;code&gt;venv&lt;/code&gt; or &lt;code&gt;virtualenv&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;As you make progress, you start adding dependencies via &lt;code&gt;pip install&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Most of the time, it just works, but when it doesn't, a painful trial and error process follows&lt;/li&gt;
&lt;li&gt;And things keep working, until you have to reproduce your environment on a different machine...&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this blog post, you'll learn how to set up reproducible Python environments for Data Science that are robust across operating systems and guidelines for troubleshooting installation errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why does &lt;code&gt;pip install&lt;/code&gt; fail?
&lt;/h2&gt;

&lt;p&gt;Most &lt;code&gt;pip install&lt;/code&gt; failures are due to missing dependencies. For example, some database drivers such as &lt;code&gt;psycopg2&lt;/code&gt; are just bindings to another library (in this case, &lt;a href="https://www.postgresql.org/docs/9.1/libpq.html"&gt;&lt;code&gt;libpq&lt;/code&gt;&lt;/a&gt;); if you try to install &lt;code&gt;psycopg2&lt;/code&gt; without having &lt;code&gt;libpq&lt;/code&gt;, it will fail. The key is knowing which dependencies are missing and how to install them.&lt;/p&gt;

&lt;p&gt;Before diving into more details, let's first give some background on how Python packages are distributed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Source distributions and built distributions
&lt;/h2&gt;

&lt;p&gt;There are two primary ways of distributing Python packages (distribution just means making a Python package available to anyone who wants to use it). The first one is a source distribution (&lt;code&gt;.tar.gz&lt;/code&gt; files), the second one is a built distribution (&lt;code&gt;.whl&lt;/code&gt; files), also known as wheels.&lt;/p&gt;
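&lt;p&gt;A quick way to tell the two apart is the file extension. The following sketch (a hypothetical helper, not part of any packaging tool) classifies a distribution file by name:&lt;/p&gt;

```python
def dist_type(filename):
    """Classify a Python distribution file by its extension."""
    if filename.endswith(".whl"):
        return "built distribution (wheel)"
    if filename.endswith((".tar.gz", ".zip")):
        return "source distribution (sdist)"
    return "unknown"


# a manylinux wheel vs. a source tarball
print(dist_type("numpy-1.18.1-cp36-cp36m-manylinux1_x86_64.whl"))
print(dist_type("bitarray-1.2.1.tar.gz"))
```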

&lt;p&gt;As the name implies, &lt;a href="https://packaging.python.org/glossary/#term-source-distribution-or-sdist"&gt;source distributions&lt;/a&gt; contain all the source code you need to build a package (building is a prerequisite to installing a package). The recipe to build is usually declared in a &lt;code&gt;setup.py&lt;/code&gt; file. This is the equivalent of having all the raw ingredients and instructions for cooking something.&lt;/p&gt;

&lt;p&gt;On the other hand, &lt;a href="https://packaging.python.org/glossary/#term-built-distribution"&gt;built distributions&lt;/a&gt; are generated by having source distributions go through the &lt;em&gt;build process&lt;/em&gt;. They are "already cooked" packages whose files only need to be moved to the correct location for you to use them. Built distributions are OS-specific, which means that you need a version compatible with your current operating system. It is the equivalent of having the dish ready and only taking it to your table.&lt;/p&gt;

&lt;p&gt;There are many nuances to this, but the bottom line is that built distributions are easier and faster to install (you just have to move files). If you want to know more about the differences, &lt;a href="https://packaging.python.org/overview/"&gt;this is a good place to start&lt;/a&gt;. Let's go back to our &lt;code&gt;pip install&lt;/code&gt; discussion.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happens when I run &lt;code&gt;pip install [package]&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;When you execute &lt;code&gt;pip install [package]&lt;/code&gt;, pip will try to find a package with that name in the &lt;a href="https://pypi.org/"&gt;Python Package Index&lt;/a&gt; (or pypi). If it finds the package, it will first try to find a wheel for your OS; if it cannot find one, it will fetch the source distribution. Making wheels available for different OSs is up to the developer; popular packages usually do this, see for example &lt;a href="https://pypi.org/project/numpy/#files"&gt;numpy's available files on pypi&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install&lt;/code&gt; will also install any dependencies required by the package you requested; however, it has some limitations and can only install dependencies that can also be installed via &lt;code&gt;pip&lt;/code&gt;. It is important to emphasize that these limitations are by design: &lt;strong&gt;&lt;code&gt;pip&lt;/code&gt; is a Python package manager, it is not designed to deal with non-Python packages.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since &lt;code&gt;pip&lt;/code&gt; is not designed to handle arbitrary dependencies, it will ask the OS for dependencies it cannot install such as compilers (this happens often with Python packages with parts written in C). This implicit process makes environments managed by &lt;code&gt;pip&lt;/code&gt; harder to reproduce: if you take your &lt;code&gt;requirements.txt&lt;/code&gt; to a different system, it might break if a non-Python dependency that existed in the previous environment does not exist in the new one.&lt;/p&gt;

&lt;p&gt;Since &lt;code&gt;pip install [package]&lt;/code&gt; triggers the installation of &lt;code&gt;[package]&lt;/code&gt; plus all its dependencies, it has to fetch distributions (built or source) for every one of them; the process will vary depending on how many there are and in which format they are obtained (built distributions are easier to install).&lt;/p&gt;
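&lt;p&gt;On the Python side, pinning exact versions in &lt;code&gt;requirements.txt&lt;/code&gt; helps with reproducibility. The sketch below (a hypothetical helper, not a pip feature) flags requirement lines that do not pin a version; note that pinning only covers Python packages, so non-Python dependencies such as compilers remain outside pip's control:&lt;/p&gt;

```python
def unpinned(requirements_text):
    """Return requirement lines that do not pin an exact version."""
    flagged = []
    for line in requirements_text.splitlines():
        line = line.strip()
        # skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        if "==" not in line:
            flagged.append(line)
    return flagged


reqs = """numpy==1.18.1
# a comment
pandas
psycopg2==2.8.4"""
print(unpinned(reqs))
```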

&lt;h2&gt;
  
  
  Build and runtime dependencies
&lt;/h2&gt;

&lt;p&gt;Sometimes Python packages need other non-Python packages &lt;em&gt;to build&lt;/em&gt;. As I mentioned before, packages that have C code need a compiler (such as &lt;code&gt;gcc&lt;/code&gt;) at build time; once the C source code is compiled, &lt;code&gt;gcc&lt;/code&gt; is no longer needed. That's why these are called &lt;em&gt;build dependencies&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Other packages have non-Python dependencies &lt;em&gt;to run&lt;/em&gt;, for example, &lt;code&gt;psycopg2&lt;/code&gt; requires the PostgreSQL library &lt;code&gt;libpq&lt;/code&gt; to submit queries to the database. This is called a &lt;em&gt;runtime dependency&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This difference leads to the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When installing from source (&lt;code&gt;.tar.gz&lt;/code&gt; file) you need build + runtime dependencies&lt;/li&gt;
&lt;li&gt;When installing from a wheel (&lt;code&gt;.whl&lt;/code&gt; file) you only need runtime dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why can't &lt;code&gt;pip install&lt;/code&gt; just install all dependencies for me?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Sqn1BbhH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3bfpn9y42akhkpt1966d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Sqn1BbhH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3bfpn9y42akhkpt1966d.png" alt="pip-diagram" width="800" height="667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip&lt;/code&gt;'s purpose is to handle Python dependencies; installing things such as a compiler is out of its scope, so it will simply request them from the system. These limitations are well-known, which is the reason why &lt;code&gt;conda&lt;/code&gt; exists. &lt;code&gt;conda&lt;/code&gt; is also a package manager, but unlike &lt;code&gt;pip&lt;/code&gt;, its scope is not restricted to Python (for example, it can install the &lt;code&gt;gcc&lt;/code&gt; compiler); this makes &lt;code&gt;conda install&lt;/code&gt; more flexible, since it can handle dependencies that &lt;code&gt;pip&lt;/code&gt; cannot.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; when we refer to &lt;code&gt;conda&lt;/code&gt;, we mean the command-line tool (also known as miniconda), &lt;strong&gt;not&lt;/strong&gt; to the whole Anaconda distribution. This is true for the rest of this article. For an article describing some &lt;code&gt;conda&lt;/code&gt; misconceptions, &lt;a href="https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/"&gt;click here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using &lt;code&gt;conda install [package]&lt;/code&gt; for robust installations
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--swt5D9W---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/2bjlprw8oelhy3rfo3zp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--swt5D9W---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/2bjlprw8oelhy3rfo3zp.png" alt="conda-diagram" width="800" height="667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But &lt;code&gt;conda&lt;/code&gt; is not only a package manager but an environment manager as well&lt;/strong&gt;; this is key to understanding the operational difference between &lt;code&gt;pip install&lt;/code&gt; and &lt;code&gt;conda install&lt;/code&gt;. When using &lt;code&gt;pip&lt;/code&gt;, packages install into the currently active Python environment, whatever that is. This could be a system-wide installation or, more often, a local virtual environment created using tools such as &lt;code&gt;venv&lt;/code&gt; or &lt;code&gt;virtualenv&lt;/code&gt;; but still, any non-Python dependencies will be requested from the system.&lt;/p&gt;

&lt;p&gt;In contrast, &lt;code&gt;conda&lt;/code&gt; is both a package manager and an environment manager: &lt;code&gt;conda install&lt;/code&gt; installs dependencies into the currently active conda environment. At first glance, this is very similar to using &lt;code&gt;pip&lt;/code&gt; + &lt;code&gt;venv&lt;/code&gt;, but &lt;code&gt;conda&lt;/code&gt; can also install non-Python dependencies, which provides a higher level of isolation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Downsides of using &lt;code&gt;conda install&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;There are a few downsides to using &lt;code&gt;conda&lt;/code&gt;, though. For &lt;code&gt;conda install [package]&lt;/code&gt; to work, someone has to write a &lt;em&gt;conda recipe&lt;/em&gt;. Sometimes developers maintain these recipes, but other times the recipe maintainers are third parties; in that case, recipes might become outdated, and &lt;code&gt;conda install&lt;/code&gt; will yield an older version than &lt;code&gt;pip install&lt;/code&gt;. Fortunately, well-known packages such as numpy, tensorflow, or pytorch have high-quality recipes, and installation through conda is reliable.&lt;/p&gt;

&lt;p&gt;The second downside is that many packages are not available in conda, which means we have no option but to use &lt;code&gt;pip install&lt;/code&gt;; fortunately, with a few precautions, we can safely use it inside a conda environment. &lt;strong&gt;The &lt;code&gt;conda&lt;/code&gt; + &lt;code&gt;pip&lt;/code&gt; combination gives a robust way of setting up Python environments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: there is a way to access more packages when using &lt;code&gt;conda install&lt;/code&gt; by adding &lt;a href="https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html"&gt;channels&lt;/a&gt;, which are locations where conda searches for packages. Only add channels from sources you trust. The most popular community-driven channel is &lt;a href="https://github.com/conda-forge"&gt;conda-forge&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using &lt;code&gt;pip install&lt;/code&gt; and &lt;code&gt;conda install&lt;/code&gt; inside a conda environment
&lt;/h2&gt;

&lt;p&gt;At the time of writing, using pip inside a conda environment has a few problems, you can &lt;a href="https://www.anaconda.com/using-pip-in-a-conda-environment/"&gt;read the details here&lt;/a&gt;. Since sometimes we have no other way but to use &lt;code&gt;pip&lt;/code&gt; to install dependencies not available through &lt;code&gt;conda&lt;/code&gt;, here's my recommended workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with a clean conda environment&lt;/li&gt;
&lt;li&gt;Install as many packages as you can using &lt;code&gt;conda install&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Install the rest of your dependencies using &lt;code&gt;pip install&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Manually keep a list of conda dependencies using an &lt;a href="https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-file-manually"&gt;&lt;code&gt;environment.yml&lt;/code&gt;&lt;/a&gt; file and pip dependencies using a &lt;code&gt;requirements.txt&lt;/code&gt; (See note below)&lt;/li&gt;
&lt;li&gt;If you need to install a new package via conda, after you've used pip, re-create the conda environment&lt;/li&gt;
&lt;li&gt;Make environment creation part of your testing procedure. Use tools such as &lt;a href="https://nox.thea.codes/en/stable/"&gt;nox&lt;/a&gt; to run your tests in a &lt;a href="https://nox.thea.codes/en/stable/tutorial.html#testing-with-conda"&gt;clean conda environment&lt;/a&gt;, this way you'll make sure your environment is reproducible&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you follow this procedure, anyone looking to reproduce your results only needs two files: &lt;code&gt;environment.yml&lt;/code&gt; and &lt;code&gt;requirements.txt&lt;/code&gt;.&lt;/p&gt;
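&lt;p&gt;For reference, an &lt;code&gt;environment.yml&lt;/code&gt; following this workflow might look like the sketch below (the project name and package list are placeholders); conda can delegate the pip-only dependencies to &lt;code&gt;requirements.txt&lt;/code&gt; directly:&lt;/p&gt;

```yaml
# environment.yml (hypothetical project; package names are placeholders)
name: my-project
channels:
  - defaults
dependencies:
  - python=3.6
  - numpy
  - psycopg2
  - pip
  # delegate pip-only dependencies to requirements.txt
  - pip:
    - -r requirements.txt
```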

&lt;p&gt;Note: The reason I recommend keeping a manual list is to be conscious about each dependency: if we decide to experiment with some library but end up not using it, it is good practice to remove it from our dependencies. If we rely on auto-generated lists such as the output of &lt;code&gt;pip freeze&lt;/code&gt;, we might end up including dependencies that we don't need.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debugging installation errors
&lt;/h2&gt;

&lt;p&gt;While using &lt;code&gt;conda&lt;/code&gt; is a more reliable way to install packages with complex dependencies, there is no guarantee that things will just work; furthermore, if you need a package only available through &lt;code&gt;pip&lt;/code&gt; via a source distribution, you are more likely to encounter installation issues. Here are some examples of troubleshooting installation errors.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: All the following tests were performed using a clean Ubuntu 18.04.4 image with miniconda3&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 1: &lt;code&gt;impyla&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;When we try to install impyla (an Apache Hive driver) using &lt;code&gt;pip install impyla&lt;/code&gt;, we get a long error output. When fixing installation issues, it is important to skim through the output to spot the missing dependency; these are the important lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for bitarray

...
...
...


unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for thriftpy2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;bitarray&lt;/code&gt; and &lt;code&gt;thriftpy2&lt;/code&gt; are &lt;code&gt;impyla&lt;/code&gt; dependencies. Wheels for them are not available, so pip had to use source distributions; we can confirm this in the first output lines (look at the &lt;code&gt;.tar.gz&lt;/code&gt; extension):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Collecting bitarray
  Downloading bitarray-1.2.1.tar.gz (48 kB)
     |################################| 48 kB 5.6 MB/s
Collecting thrift&amp;gt;=0.9.3
  Downloading thrift-0.13.0.tar.gz (59 kB)
     |################################| 59 kB 7.7 MB/s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But why did these dependencies fail to install? We see in the log that both dependencies tried to use &lt;code&gt;gcc&lt;/code&gt; but they could not find it. Installing it (e.g. &lt;code&gt;apt install gcc&lt;/code&gt;) and trying &lt;code&gt;pip install impyla&lt;/code&gt; again fixes the issue. But you can also do &lt;code&gt;conda install impyla&lt;/code&gt; which has the advantage of not installing &lt;code&gt;gcc&lt;/code&gt; system-wide. &lt;strong&gt;Using conda is often the easiest way to fix installation issues.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 2: &lt;code&gt;psycopg2&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Let's first see what happens with &lt;code&gt;pip install psycopg2&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: pg_config executable not found.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As in the previous case, we are missing one dependency. The tricky part is that &lt;code&gt;pg_config&lt;/code&gt; is not a standalone executable; it is installed by another package, which is what you'll find after some online digging. If using apt, you can get this to work by doing &lt;code&gt;apt install libpq-dev&lt;/code&gt; before using pip. But again, &lt;code&gt;conda install psycopg2&lt;/code&gt; works out of the box. This is the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;krb5               pkgs/main/linux-64::krb5-1.16.4-h173b8e3_0
libpq              pkgs/main/linux-64::libpq-11.2-h20c2e04_0
psycopg2           pkgs/main/linux-64::psycopg2-2.8.4-py36h1ba5d50_0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see that conda will install &lt;code&gt;libpq&lt;/code&gt; along with &lt;code&gt;psycopg2&lt;/code&gt;, but unlike using a system package manager (e.g. &lt;code&gt;apt&lt;/code&gt;), it will do it locally, which is good for isolating our environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 3: &lt;code&gt;numpy&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Numpy is one of the most widely used packages. &lt;code&gt;pip install numpy&lt;/code&gt; works reliably since developers upload wheels for the most popular operating systems. But this doesn't mean using pip is the best we can do.&lt;/p&gt;

&lt;p&gt;Taken from the &lt;a href="https://numpy.org/devdocs/user/building.html"&gt;docs&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NumPy does not require any external linear algebra libraries to be installed. However, if these are available, NumPy’s setup script can detect them and use them for building.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In other words, depending on the availability of external linear algebra libraries, your numpy installation will be different. Let's see what happens when we run &lt;code&gt;conda install numpy&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package                    |            build
---------------------------|-----------------
blas-1.0                   |              mkl           6 KB
intel-openmp-2020.0        |              166         756 KB
libgfortran-ng-7.3.0       |       hdf63c60_0        1006 KB
mkl-2020.0                 |              166       128.9 MB
mkl-service-2.3.0          |   py36he904b0f_0         219 KB
mkl_fft-1.0.15             |   py36ha843d7b_0         155 KB
mkl_random-1.1.0           |   py36hd6b4f25_0         324 KB
numpy-1.18.1               |   py36h4f9e942_0           5 KB
numpy-base-1.18.1          |   py36hde5b4d6_1         4.2 MB
------------------------------------------------------------
                                       Total:       135.5 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Along with numpy, conda will also install &lt;code&gt;mkl&lt;/code&gt;, a library that optimizes math routines on systems with Intel processors. By using &lt;code&gt;conda install&lt;/code&gt;, you get this for free; if using &lt;code&gt;pip&lt;/code&gt;, you'd only get a vanilla numpy installation.&lt;/p&gt;
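&lt;p&gt;You can inspect which accelerated libraries your numpy build is linked against with &lt;code&gt;numpy.show_config()&lt;/code&gt;; in a conda environment like the one above it will typically report MKL, while a pip-installed wheel reports whatever BLAS implementation it was built with:&lt;/p&gt;

```python
import numpy as np

# prints the BLAS/LAPACK libraries this numpy build was compiled against
np.show_config()
```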

&lt;h2&gt;
  
  
  What about containers?
&lt;/h2&gt;

&lt;p&gt;Containerization technologies such as Docker provide a higher level of isolation than a conda environment, but I think they are used in Data Science projects earlier than they should be. Once you have to run your pipeline in a production environment, containers are a natural choice, but for development purposes, a conda environment goes a long way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing remarks
&lt;/h2&gt;

&lt;p&gt;Getting a Python environment up and running is an error-prone process, and nobody likes spending time fixing installation issues. Understanding the basics of how Python packages are built and distributed, plus using the right tool for the job is a huge improvement over trial and error.&lt;/p&gt;

&lt;p&gt;Furthermore, it is not enough to set up our environment once. If we want others to reproduce our work, or want to make the transition to production simple, we have to ensure that there is an automated way to set up our environment from scratch; including environment creation as part of our testing process will let us know when things break.&lt;/p&gt;

&lt;p&gt;Found an error in this post? &lt;a href="https://github.com/ploomber/posts/issues/new?title=Issue%20in%20post%3A%20%22Setting%20up%20reproducible%20Python%20environments%20for%20Data%20Science%22"&gt;Click here to let us know&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Originally posted at &lt;a href="https://ploomber.io/posts/python-envs"&gt;ploomber.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Model selection with scikit-learn and ploomber</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Tue, 24 Mar 2020 21:12:02 +0000</pubDate>
      <link>https://dev.to/edublancas/model-selection-with-scikit-learn-and-ploomber-3ek3</link>
      <guid>https://dev.to/edublancas/model-selection-with-scikit-learn-and-ploomber-3ek3</guid>
      <description>&lt;p&gt;Model selection is an important part of any Machine Learning task. Since each model encodes their own &lt;a href="https://en.wikipedia.org/wiki/Inductive_bias"&gt;inductive bias&lt;/a&gt;, it is important to compare them to understand their subtleties and choose the best one for the problem at hand. While knowing each learning algorithm in detail is important to have an intuition about which ones to try, it is always helpful to visualize actual results in our data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: This blog post assumes you are familiar with the model selection framework via &lt;a href="https://sebastianraschka.com/blog/2018/model-evaluation-selection-part4.html#nested-cross-validation"&gt;nested cross-validation&lt;/a&gt; and with the following scikit-learn modules (click for documentation): &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html"&gt;&lt;code&gt;GridSearchCV&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html"&gt;&lt;code&gt;cross_val_predict&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html"&gt;&lt;code&gt;Pipeline&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The quick and dirty approach for model selection would be to have a long Jupyter notebook, where we train all models and output charts for each one. In this post we will show how to achieve this in a cleaner way by using scikit-learn and &lt;a href="https://github.com/ploomber/ploomber"&gt;ploomber&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project layout
&lt;/h2&gt;

&lt;p&gt;We split the code in three files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;pipelines.py&lt;/code&gt;. Contains functions to instantiate scikit-learn pipelines&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;report.py&lt;/code&gt;. Contains the source code that performs hyperparameter tuning and model evaluation, imports pipelines defined in &lt;code&gt;pipelines.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;main.py&lt;/code&gt;. Contains the loop that executes &lt;code&gt;report.py&lt;/code&gt; for each pipeline using ploomber&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Unless otherwise noted, the snippets shown in this post belong to &lt;code&gt;main.py&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Functions to instantiate pipelines (&lt;code&gt;pipelines.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;We start by declaring each of our &lt;em&gt;model pipelines&lt;/em&gt;, which are just functions that return a scikit-learn &lt;code&gt;Pipeline&lt;/code&gt; instance; we will use them in a nested cross-validation loop to choose the best hyperparameters and estimate generalization performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Content of pipelines.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Ridge&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.svm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NuSVR&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ridge&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="s"&gt;'scaler'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'reg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Ridge&lt;/span&gt;&lt;span class="p"&gt;())])&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nusvr&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="s"&gt;'scaler'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'reg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NuSVR&lt;/span&gt;&lt;span class="p"&gt;())])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have one factory for NuSVR and another one for Ridge regression. Since these two models are sensitive to scaling, we include them in a scikit-learn pipeline that scales all features before feeding the data into the model.&lt;/p&gt;
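&lt;p&gt;These factories plug directly into the nested cross-validation loop. As a sketch (with made-up data; the grid values here are illustrative, not the ones used in the project), hyperparameters of the inner estimator are addressed via the step name followed by a double underscore:&lt;/p&gt;

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# same shape as the ridge() factory above
pipe = Pipeline([('scaler', StandardScaler()), ('reg', Ridge())])

# parameters of a pipeline step are addressed as "stepname__parameter"
grid = GridSearchCV(pipe, param_grid={'reg__alpha': [0.1, 1.0, 10.0]}, cv=3)

rng = np.random.RandomState(0)
X, y = rng.rand(30, 3), rng.rand(30)
grid.fit(X, y)
print(grid.best_params_)
```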

&lt;h2&gt;
  
  
  Hyperparameter tuning and performance estimation (&lt;code&gt;report.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;We will process each model separately, generating three HTML reports in total; the reports will be generated using the following source code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Content of report.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Markdown&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;importlib&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_boston&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cross_val_predict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# + tags=["parameters"]
&lt;/span&gt;&lt;span class="n"&gt;m_init&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="n"&gt;m_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="c1"&gt;# -
&lt;/span&gt;
&lt;span class="n"&gt;Markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'# Report for {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_init&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Params: '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# +
# m_init is module.sub_module.constructor import it from the string
&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m_init&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'.'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mod_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;constructor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'.'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;mod&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;importlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;import_module&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mod_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# instantiate it
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;)()&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# -
&lt;/span&gt;
&lt;span class="c1"&gt;# load data
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_boston&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feature_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;

&lt;span class="c1"&gt;# +
# Perform grid search over the passed parameters
&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# We want to estimate generalization performance *and* tune hyperparameters
# so we are using nested cross-validation
&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# -
&lt;/span&gt;
&lt;span class="c1"&gt;# prev vs actual scatter plot
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_size_inches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Predicted'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Actual'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# residuals
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_size_inches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Residual'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# residuals distribution
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_size_inches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Residual distribution'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# print metrics
&lt;/span&gt;&lt;span class="n"&gt;mae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'MAE: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mae&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'MSE: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mse&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running the execution loop (&lt;code&gt;main.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;We now turn our attention to the main script that takes the model pipelines and the report source code and executes them. First, we define the parameters to try for each model, one dictionary per model: the &lt;code&gt;m_init&lt;/code&gt; key holds the pipeline location (we will dynamically import it using the &lt;a href="https://docs.python.org/3/library/importlib.html#module-importlib"&gt;&lt;code&gt;importlib&lt;/code&gt;&lt;/a&gt; library), and the &lt;code&gt;m_params&lt;/code&gt; key contains the hyperparameters to try. Note that for Ridge Regression and NuSVR, we have to add a &lt;code&gt;reg__&lt;/code&gt; prefix to each parameter; this is because the factories return scikit-learn &lt;code&gt;Pipeline&lt;/code&gt; objects and we need to specify which step the parameters belong to.&lt;/p&gt;
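&lt;p&gt;The dynamic import mentioned above boils down to splitting the dotted string and combining &lt;code&gt;importlib.import_module&lt;/code&gt; with &lt;code&gt;getattr&lt;/code&gt;. A self-contained sketch, using a standard-library class as a stand-in for the pipeline factories:&lt;/p&gt;

```python
import importlib


def import_from_dotted_path(dotted):
    # 'module.sub_module.constructor' -> import the module, grab the attribute
    mod_str, _, constructor = dotted.rpartition('.')
    mod = importlib.import_module(mod_str)
    return getattr(mod, constructor)


# example with a standard-library class instead of 'pipelines.ridge'
cls = import_from_dotted_path('collections.OrderedDict')
print(cls)  # <class 'collections.OrderedDict'>
```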



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ploomber.tasks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NotebookRunner&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ploomber.products&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ploomber&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;

&lt;span class="c1"&gt;# Ridge Regression grid
&lt;/span&gt;&lt;span class="n"&gt;params_ridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;'m_init'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'pipelines.ridge'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'m_params'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;'reg__alpha'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Random Forest Regression grid
&lt;/span&gt;&lt;span class="n"&gt;params_rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;'m_init'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'sklearn.ensemble.RandomForestRegressor'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'m_params'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;'n_estimators'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s"&gt;'min_samples_leaf'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Nu Support Vector Regression grid
&lt;/span&gt;&lt;span class="n"&gt;params_nusvr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;'m_init'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'pipelines.nusvr'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'m_params'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;'reg__nu'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s"&gt;'reg__C'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s"&gt;'reg__kernel'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'rbf'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'sigmoid'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we do not have a pipeline for &lt;code&gt;RandomForestRegressor&lt;/code&gt;: Random Forest is not sensitive to feature scaling, so we use the model directly.&lt;/p&gt;

&lt;p&gt;We now add the execution loop, which we will run using &lt;a href="https://github.com/ploomber/ploomber"&gt;ploomber&lt;/a&gt;. We just have to tell &lt;code&gt;ploomber&lt;/code&gt; where to load the source code from, which parameters to use on each iteration, and where to save the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# load report source code
&lt;/span&gt;&lt;span class="n"&gt;notebook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'report.py'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# we will save all notebooks in the artifacts/ folder
&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'artifacts'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;params_all&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'ridge'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params_ridge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'rf'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params_rf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'nusvr'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params_nusvr&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# loop over params and create one notebook task for each...
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;params_all&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# NotebookRunner is able to execute ipynb files using
&lt;/span&gt;    &lt;span class="c1"&gt;# papermill under the hood, if the input file has a
&lt;/span&gt;    &lt;span class="c1"&gt;# different extension (like in our case), it will first
&lt;/span&gt;    &lt;span class="c1"&gt;# convert it to an ipynb file using jupytext
&lt;/span&gt;    &lt;span class="n"&gt;NotebookRunner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="c1"&gt;# save it in artifacts/{name}.html
&lt;/span&gt;                   &lt;span class="c1"&gt;# NotebookRunner will generate ipynb files by
&lt;/span&gt;                   &lt;span class="c1"&gt;# default, but you can choose other formats,
&lt;/span&gt;                   &lt;span class="c1"&gt;# any format supported by the official nbconvert
&lt;/span&gt;                   &lt;span class="c1"&gt;# package is supported here
&lt;/span&gt;                   &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;'.html'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                   &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="c1"&gt;# pass the parameters
&lt;/span&gt;                   &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;ext_in&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'py'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;kernelspec_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'python3'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build the DAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Output:
name    Ran?      Elapsed (s)    Percentage
------  ------  -------------  ------------
nusvr   True          6.95555       27.8197
rf      True         11.6961        46.78
ridge   True          6.35066       25.4003
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. After building the DAG, each model generates one report; you can see them here: &lt;a href="https://ploomber.github.io/posts/model-selection/artifacts/ridge"&gt;Ridge&lt;/a&gt;, &lt;a href="https://ploomber.github.io/posts/model-selection/artifacts/rf"&gt;Random Forest&lt;/a&gt; and &lt;a href="https://ploomber.github.io/posts/model-selection/artifacts/nusvr"&gt;NuSVR&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Splitting the logic into separate files improves readability and maintainability: if we want to add another model, we only have to add a new dictionary with its parameter grid; if preprocessing is needed, we just add a factory in &lt;code&gt;pipelines.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;ploomber provides a concise and clean framework for generating reports: in just a few lines of code, we generated all of them. However, we made a big simplification in our &lt;code&gt;report.py&lt;/code&gt; file: we load, train, and evaluate in a single source file, so even a small change to our charts would force us to re-train every model. A better approach is to split that logic into several steps, a scenario where ploomber is very effective:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clean raw data (save clean dataset)&lt;/li&gt;
&lt;li&gt;Train model and predict (save predictions)&lt;/li&gt;
&lt;li&gt;Evaluate predictions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If we split each model pipeline into these three steps and run the build, we will obtain the same results. Now let's say you want to add a new chart, so you modify step 3. All you have to do to update your reports is call &lt;code&gt;dag.build()&lt;/code&gt; again: ploomber will figure out that it does not have to re-run steps 1-2 and will overwrite the old reports with the new ones.&lt;/p&gt;
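&lt;p&gt;At its core, this skip-unchanged-steps behavior is a staleness check between a task's source and its product (ploomber's actual implementation is more sophisticated, since it also tracks upstream dependencies and parameter changes). A highly simplified sketch of the idea:&lt;/p&gt;

```python
from pathlib import Path


def is_outdated(source: Path, product: Path) -> bool:
    # a task must run if its product is missing or older than its source
    if not product.exists():
        return True
    return source.stat().st_mtime > product.stat().st_mtime
```

&lt;p&gt;With a check like this, modifying only the evaluation step leaves the cleaning and training products up to date, so only the reports are rebuilt.&lt;/p&gt;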

&lt;h2&gt;
  
  
  Closing remarks
&lt;/h2&gt;

&lt;p&gt;Developing a Machine Learning model is an iterative process. By breaking the entire pipeline logic into small steps and maximizing code reusability, we can develop short, maintainable pipelines. Jupyter is a superb tool (I use it every day, and I'm actually writing this blog post from Jupyter), but do not fall into the habit of coding everything in one big notebook, which inevitably leads to unmaintainable code; prefer many short notebooks (or .py files) over a single big one.&lt;/p&gt;

&lt;p&gt;Source code for this post is available &lt;a href="https://github.com/ploomber/posts/tree/master/model-selection"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Found an error in this post? &lt;a href="https://github.com/ploomber/posts/issues/new?title=Issue%20in%20model-selection"&gt;Click here to let us know&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This blog post was generated using package versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Output:
matplotlib==3.1.3
numpy==1.18.1
pandas==1.0.1
scikit-learn==0.22.2
seaborn==0.10.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
