Overview
One of the most time-consuming tasks in my workflows is solving, downloading and installing Anaconda environments. In some cases, just solving the dependencies can take up to 10 minutes, depending on the platform you are building on.
That's why I'm always looking for ways to increase the speed of my workflows. For example, a well-known method is using the blazing-fast mamba package manager instead of conda. mamba is written in C++, downloads files in parallel, and uses libsolv (a state-of-the-art library used in the RPM package manager of Red Hat, Fedora and OpenSUSE) for much faster dependency solving.
But usually this is not fast enough for me. Also, I find it a waste of resources to download the packages every time a collaborator pushes a commit to a pull request. For example, in the open source project I collaborate on, the CI pipeline can be triggered more than a hundred times in a single day.
That's why I always wanted to cache the Anaconda environment, but I didn't have the time to solve the issue until now.
The documentation of the actions/cache action includes examples for many package managers, but not for Anaconda. On the other hand, the documentation of the setup-miniconda action describes a way to cache the downloaded packages, but currently that makes the pipeline even slower.
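For reference, the package-cache approach described in the setup-miniconda documentation looks roughly like the sketch below. The exact options shown are an assumption here, so check the action's README before copying them. Note that this caches only the downloaded package archives, so the environment still has to be solved and installed on every run.

```yaml
# Sketch of package-level caching (assumed from the setup-miniconda docs)
- name: Cache conda packages
  uses: actions/cache@v2
  with:
    path: ~/conda_pkgs_dir   # directory where downloaded packages are stored
    key: ${{ runner.os }}-conda-${{ hashFiles('environment.yml') }}
- uses: conda-incubator/setup-miniconda@v2
  with:
    activate-environment: my-env
    environment-file: environment.yml
    use-only-tar-bz2: true   # per the action's docs, needed for package caching
```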
The cache action
It's important to understand the scope of the cache action. From GitHub's documentation:
A workflow can access and restore a cache created in the current branch, the base branch (including base branches of forked repositories), or the default branch (usually main). For example, a cache created on the default branch would be accessible from any pull request. Also, if the branch feature-b has the base branch feature-a, a workflow triggered on feature-b would have access to caches created in the default branch (main), feature-a, and feature-b.
My workflow
In this post I'm going to show how to write an example CI pipeline with the following features:
- Runs on the three major operating systems (Linux, macOS and Windows)
- Updates cache every 24 hours
- Updates cache when environment.yml is modified
- Cache can be reset manually
Let's get started!
Triggers
We want a pipeline that is triggered when:
- A commit is pushed to any branch of the main repository
- A commit is pushed to a pull request
- Every day at 00:00 UTC
name: ci

on:
  push:
    branches:
      - '*'
  pull_request:
    branches:
      - '*'
  schedule:
    - cron: '0 0 * * *'

env:
  CACHE_NUMBER: 0  # increase to reset cache manually
The CACHE_NUMBER variable is going to be used later.
Prefixes
We need to set up a matrix to handle the different installation paths of Mambaforge*:
jobs:
  build:
    strategy:
      matrix:
        include:
          - os: ubuntu-latest
            label: linux-64
            prefix: /usr/share/miniconda3/envs/my-env
          - os: macos-latest
            label: osx-64
            prefix: /Users/runner/miniconda3/envs/my-env
          - os: windows-latest
            label: win-64
            prefix: C:\Miniconda3\envs\my-env
* Mambaforge is a custom build of Miniconda with the mamba package manager pre-installed and conda-forge as the default channel.
Install Mambaforge
At the step level, we install Mambaforge without specifying a YAML environment file.
    name: ${{ matrix.label }}
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v2
      - name: Setup Mambaforge
        uses: conda-incubator/setup-miniconda@v2
        with:
          miniforge-variant: Mambaforge
          miniforge-version: latest
          activate-environment: my-env
          use-mamba: true
Cache
The cache action works with keys. When it runs, it looks for a saved cache that matches the key and restores the data. If there is no match, the contents of the path are saved under that key at the end of a successful job.
The cache is specific to each OS. I also set up the key so that the cache is refreshed every 24 hours or whenever the environment file changes.
The CACHE_NUMBER variable defined above is meant to reset the cache manually.
      - name: Set cache date
        shell: bash  # the command uses Bash syntax; the default shell on Windows runners is PowerShell
        run: echo "DATE=$(date +'%Y%m%d')" >> $GITHUB_ENV
      - uses: actions/cache@v2
        with:
          path: ${{ matrix.prefix }}
          key: ${{ matrix.label }}-conda-${{ hashFiles('environment.yml') }}-${{ env.DATE }}-${{ env.CACHE_NUMBER }}
        id: cache
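Because the cache step has an id, later steps can read its cache-hit output. As a debugging aid, a step along these lines (hypothetical, not part of the original workflow) prints that value. The output is 'true' only on an exact key match, so it is safest to treat any other value, including an empty string, as a miss.

```yaml
# Hypothetical helper step: place it after the actions/cache step
- name: Report cache status
  shell: bash
  run: echo "cache-hit is '${{ steps.cache.outputs.cache-hit }}'"
```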
Update the environment
Finally, if the cache is not available, update the environment according to the YAML environment file, and run the tests.
      - name: Update environment
        run: mamba env update -n my-env -f environment.yml
        if: steps.cache.outputs.cache-hit != 'true'
      - name: Run tests
        shell: bash -l {0}
        run: pytest ./tests
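The steps above expect an environment.yml file at the repository root. A minimal sketch could look like the following. The package list is hypothetical; the only details taken from the workflow are the my-env name and the need for pytest in the test step.

```yaml
# environment.yml (hypothetical contents, for illustration only)
name: my-env
channels:
  - conda-forge
dependencies:
  - python
  - pytest   # required by the "Run tests" step
```

Any change to this file changes the result of hashFiles('environment.yml'), which produces a new cache key and triggers a fresh environment build.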
Results
Although our environment.yml file is very simple, we saved 5 minutes on average on every run.
Get the code
The code is available here:
epassaro / cache-conda-envs
Speed up your builds by caching Anaconda environments on GitHub Actions