As an information technology graduate looking to kickstart my career, I spent the first half of 2020 seeking opportunities to expand my skills and gain relevant practical experience.
After a few twists and turns, I started my internship as an engineer at Faethm with one big question:
How can we make code reviews easier for our data science team?
Faethm is an AI platform built on the work of data scientists, and for them this is a very important problem to solve. Today, Faethm’s data science team works primarily with Jupyter notebooks, and these are managed in various internal GitHub repositories.
Jupyter notebooks are a popular productivity tool among data scientists, and for good reason.
You can execute data science workflows in cells, output tables and charts, and keep documentation inline. Despite their popularity, notebooks are cumbersome to manage with version control systems like Git.
Tools like nbdime help a little. nbdime allows notebook users to highlight changes made between notebook versions on the command line or even within a Jupyter instance.
However, GitHub's built-in source code tools are not designed for Jupyter notebooks. A notebook captures code, outputs and metadata as a single JSON document. When a data scientist modifies and re-executes a notebook, the JSON changes at the cell level to reflect the new code, the updated metadata and the new output. A regular line-based diff therefore mixes the code changes a reviewer cares about with output and metadata noise.
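To make the problem concrete, here is a minimal sketch in Python. It assumes the standard nbformat 4 layout (a top-level "cells" list whose code cells carry "source", "outputs" and "execution_count"). The helper names make_notebook and code_sources are made up for this illustration and are not part of any library. Re-running a notebook without touching its code still changes the JSON that Git sees:

```python
import json


def make_notebook(source, execution_count, output_text):
    """Build a minimal nbformat-4-style notebook dict with one code cell.

    Hypothetical helper for illustration only.
    """
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {"name": "stdout", "output_type": "stream", "text": [output_text]}
                ],
                "source": [source],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def code_sources(notebook):
    """Return the source text of every code cell, ignoring outputs and metadata."""
    sources = []
    for cell in notebook["cells"]:
        if cell["cell_type"] == "code":
            src = cell["source"]
            # nbformat allows "source" to be a string or a list of strings.
            sources.append("".join(src) if isinstance(src, list) else src)
    return sources


# Same code, executed twice: only the execution count and the output differ.
before = make_notebook("print(total)", 1, "41\n")
after = make_notebook("print(total)", 7, "42\n")

raw_changed = json.dumps(before, indent=1) != json.dumps(after, indent=1)
code_changed = code_sources(before) != code_sources(after)
print(raw_changed, code_changed)
```

The serialised JSON differs between the two versions even though the code a reviewer would want to look at is identical.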
Solving this problem became the essence of my internship.
Since many of these technologies were new to me, I knew it would be a challenge. With a combination of self-learning, persistence and support from Faethm’s engineers, I’m happy to be able to present this solution.
jupydiff is a GitHub Action that allows data scientists to quickly compare changes made to Jupyter notebooks in GitHub repositories.
It works with regular commits and pull requests. When a change is made, jupydiff computes the code additions and deletions within each notebook, and summarises these as a comment on the associated commit or pull request.
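The idea of summarising code additions and deletions can be sketched with Python's standard library. This is not jupydiff's actual implementation, which this post does not show. It is a simplified illustration that assumes the nbformat 4 layout and uses difflib in place of a notebook-aware diff. The function names code_lines and summarise_changes are hypothetical:

```python
import difflib


def code_lines(notebook):
    """Collect the lines of every code cell, skipping markdown cells and outputs."""
    lines = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        src = cell.get("source", "")
        # nbformat allows "source" to be a string or a list of strings.
        text = "".join(src) if isinstance(src, list) else src
        lines.extend(text.splitlines())
    return lines


def summarise_changes(old_nb, new_nb):
    """Return (additions, deletions) as lists of code lines."""
    additions, deletions = [], []
    for line in difflib.ndiff(code_lines(old_nb), code_lines(new_nb)):
        if line.startswith("+ "):
            additions.append(line[2:])
        elif line.startswith("- "):
            deletions.append(line[2:])
        # Lines starting with "  " (unchanged) or "? " (hints) are ignored.
    return additions, deletions


# Two made-up notebook versions, reduced to the fields the sketch reads.
old_nb = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Load data"]},
        {
            "cell_type": "code",
            "source": ["import pandas as pd\n", "df = pd.read_csv('jobs.csv')\n", "df.head()"],
        },
    ]
}
new_nb = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Load and clean data"]},
        {
            "cell_type": "code",
            "source": ["import pandas as pd\n", "df = pd.read_csv('jobs.csv')\n", "df = df.dropna()"],
        },
    ]
}

additions, deletions = summarise_changes(old_nb, new_nb)
print("Added:", additions)
print("Removed:", deletions)
```

Because only code cells are compared, the edited markdown heading and any outputs never show up in the summary, which is the kind of noise reduction a reviewer is after.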
jupydiff helps you streamline data science code reviews.
Without jupydiff, to compute the exact code difference between two Jupyter notebooks, a reviewer would need to clone the repository, download and install nbdime and then run nbdime diff on the command line. Alternatively, reading the regular diff in a code editor, version control tool or on GitHub itself involves interpreting lines of the underlying JSON notebook structure.
Setting up jupydiff to work with your Jupyter notebook project is simple. Since jupydiff is a GitHub Action, it works with both public and private repositories.
You can read all the details about configuring jupydiff in the jupydiff repository on GitHub, but I’ll cover the essentials here.
You’ll need to create a new GitHub Action workflow in your repository at /.github/workflows/jupydiff.yml with the following contents:
name: jupydiff
on: [ push, pull_request ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 2
      - uses: Faethm-ai/jupydiff@v1
GitHub Actions runs jupydiff on your repository for each commit pushed or pull request opened. jupydiff computes the changes made with the latest commit, and leaves a comment on the commit or pull request highlighting the differences in the code.
With no more JSON mess, data scientists are free to continue with their code review right on GitHub.
You’ve been reading a post from the Faethm AI engineering blog. We’re hiring, too! If you share our passion for the future of work and want to pioneer world-leading data science and engineering projects, we’d love to hear from you. See our current openings: https://faethm.ai/careers