4 tools to boost data science reproducibility

#datascience #machinelearning #tooling #python

In a data science project, reproducibility and minimal manual work are usually important. Getting there involves several things, from dependency management to configuration and parameterization. Here are four tools that improve that workflow.

Pipenv

Before pipenv, the usual way to set up a project was to activate a virtual environment and list all the needed packages in a requirements.txt. That could also mean manual cleanup, since pip freeze > requirements.txt writes out every package installed in the environment, not just the ones you asked for.

The elegant thing about pipenv, in its own words, is that it "automatically creates and manages a virtualenv for your projects, as well as adds/removes packages from your Pipfile as you install/uninstall packages". It feels like npm for the Python world.

So instead of the usual pip install, one would now use pipenv install, which installs into the project's virtual environment, and there's no need to maintain a requirements.txt by hand anymore.

Note that if you don't specify a package version during installation, the Pipfile won't carry one either. The exact versions that were resolved are recorded in Pipfile.lock, so it is worth committing both files. In the Pipfile itself, an unpinned package shows a wildcard like this:

[packages]
pandas = "*"
scikit-learn = "*"

Note: pipenv is not to be confused with pyenv, a separate tool for switching between different versions of Python.

Makefile

If a project has several repeated steps, such as setup, format, test, lint, and deploy, then a Makefile both saves time for the project owner and makes the project easier for others to use.

A sample Makefile snippet, combined with the aforementioned pipenv, could look like this:

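.PHONY: setup format all
# Create the virtual environment and install the packages from the Pipfile
# (--dev assumes black and pylint are listed under [dev-packages])
setup:
	pip install pipenv
	pipenv install --dev
# Use "pipenv run" here: "pipenv shell" opens an interactive shell and would stall make
format:
	pipenv run black *.py
	pipenv run pylint --disable=R,C sample
all: setup format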

By running the single command make all, one can do everything in the Makefile: set up a virtual environment, install all the dependencies, auto-format the code, and lint the script to flag unused imports, unused variables, and similar issues.

config.yml + argparse

Sometimes one wants to run multiple experiments in a machine learning project or conduct the same analysis on different datasets. In this case, it is better to keep a separate configuration for each set of inputs instead of overwriting a single one. One can achieve that with a configs folder holding several YAML files, each containing one set of inputs. In the main script, argparse selects which config file to use, the file is loaded as a dictionary, and the script then refers to the specific parameters in that dictionary:

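import argparse

import yaml  # provided by the PyYAML package

# Let the caller choose which configuration file to use
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("-f", "--config_file",
    default="configs/config1.yml", help="Configuration file to load.")
ARGS = parser.parse_args()
# safe_load is enough for plain key/value configs and avoids building arbitrary objects
with open(ARGS.config_file, 'r') as f:
    config = yaml.safe_load(f)
print(f"Loaded configuration file {ARGS.config_file}")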

In this way, each of the settings is preserved rather than overwritten, and it becomes easier to track the outcome.
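
One way to make the outcome easy to track is to name each result file after the config that produced it. The sketch below is an assumption about how that could be wired up, not part of the setup above: the --results_dir option and the helpers build_parser, result_path, and save_result are hypothetical names, and the results are written as JSON using only the standard library.

```python
import argparse
import json
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Same --config_file option as above, plus a hypothetical --results_dir."""
    parser = argparse.ArgumentParser(description="Run one experiment per config file.")
    parser.add_argument("-f", "--config_file", default="configs/config1.yml",
                        help="Configuration file to load.")
    parser.add_argument("-o", "--results_dir", default="results",
                        help="Folder for per-config result files.")
    return parser


def result_path(config_file: str, results_dir: str) -> Path:
    """Derive the output name from the config name: config1.yml -> config1.json."""
    return Path(results_dir) / f"{Path(config_file).stem}.json"


def save_result(config_file: str, results_dir: str, config: dict, metrics: dict) -> Path:
    """Store the inputs next to the outputs so each run documents itself."""
    out = result_path(config_file, results_dir)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"config": config, "metrics": metrics}, indent=2))
    return out
```

In the main script, one would call save_result(ARGS.config_file, ARGS.results_dir, config, metrics) after the experiment finishes, where metrics is whatever dictionary of outcomes the experiment produces.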

papermill

If the setup above boosts productivity and reproducibility for Python scripts, papermill does the same for Jupyter notebooks, which are often used in the early stages of a project because they are interactive. By defining all the variables/parameters in a single notebook cell (papermill looks for a cell tagged parameters) and creating a runner.ipynb, one can pass in new values and generate a new notebook for each set of parameters.

One can generate a new notebook like this:

import papermill as pm
pm.execute_notebook(
   'path/to/input.ipynb',
   'path/to/output.ipynb',
   parameters=dict(alpha=0.6, ratio=0.1)
)

This is handy and saves time compared to the alternative of duplicating, editing, and renaming a large number of nearly identical notebooks that differ only in a few variables.
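
When there are many parameter sets, the runner can loop over them and give each output notebook a name that encodes its parameters. This is a rough sketch under a few assumptions: plan_runs and run_all are hypothetical helpers, the output naming scheme is just one option, and the execute argument is expected to be papermill's execute_notebook (it is passed in so the planning part has no dependency on papermill).

```python
from itertools import product
from pathlib import Path


def plan_runs(input_nb, output_dir, grid):
    """Return (output_path, parameters) pairs, one per combination in `grid`.

    `grid` maps each parameter name to the list of values to try.
    """
    names = sorted(grid)
    stem = Path(input_nb).stem
    runs = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        # Encode the parameters in the file name so outputs never overwrite each other
        suffix = "_".join(f"{key}-{value}" for key, value in params.items())
        runs.append((Path(output_dir) / f"{stem}_{suffix}.ipynb", params))
    return runs


def run_all(input_nb, runs, execute):
    """Call `execute` once per planned run, e.g. execute=pm.execute_notebook."""
    for output_path, params in runs:
        execute(input_nb, str(output_path), parameters=params)
```

With papermill imported as pm, a call such as run_all('path/to/input.ipynb', plan_runs('path/to/input.ipynb', 'path/to', dict(alpha=[0.1, 0.6], ratio=[0.1])), pm.execute_notebook) would produce one executed notebook per combination.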

Overall I find these tools useful in maintaining dependencies, tracking inputs, and reducing repetition.

Image credit: unsplash