Our ongoing work to run DeepCell on GCP Batch produces a very large container: 5 GB compressed. Most of it is the Python packages and binaries required to run TensorFlow and its associated GPU code. It took ~13 minutes to build on GCP Cloud Build.
By leveraging Docker's cache better, we brought that down to ~4 minutes, a roughly 70% improvement.
| Before | After | Delta |
|--------|-------|-------|
| 13 min | 4 min | -9 min (-70%) |
Docker builds containers by creating a layer for each build command. The layers "stack" onto each other, adding or changing what's in the container so far. Loosely speaking, the layers are like snapshots of the container contents.
Docker can cache layers in the build process. Unless the build instruction changes, like updating the command or copying a different source file, the layer doesn't need to be rebuilt.
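For illustration, here's a hypothetical, deliberately tiny Dockerfile (not ours; the base image and file names are made up) with the layer boundaries spelled out:

```dockerfile
# Hypothetical example: each instruction below produces one layer.
FROM python:3.10-slim

# This layer is reused from cache on the next build unless the command changes.
RUN apt-get update -y && apt-get install -y git

# This layer is reused unless the contents of app.py change.
COPY app.py /app/app.py

# This layer is reused only if every layer above was reused
# and this command itself is unchanged.
RUN python -m compileall /app
```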
Our Dockerfile looked like this: (unabridged version here)
```dockerfile
FROM <base_container>
RUN apt-get update -y && apt-get install -y <packages>
# Add the repo sha to the container as the version.
ADD https://api.github.com/repos/dchaley/deepcell-imaging/git/refs/heads/main version.json
# Clone the deepcell-imaging repo
RUN git clone https://github.com/dchaley/deepcell-imaging.git
# Switch into the repo directory
WORKDIR "/deepcell-imaging"
# Install python requirements
RUN pip install --user --upgrade -r requirements.txt
# Install our own module
RUN pip install .
```
When we first added caching, we saw a smaller speedup of about 30%. We avoided reinstalling the apt-get packages, but we were still reinstalling the Python dependencies … some of which (like TensorFlow) are very hefty, and many of which require compilation.
The full cache invalidation rules are a bit tricky. But the basic idea is simple. Layers are invalidated if the command changes or copied files change. If any layer is invalidated, all subsequent layers must be rebuilt.
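To make that cascade concrete, here's a hypothetical ordering (again, not our actual Dockerfile) where editing any source file forces the dependency install to rerun:

```dockerfile
# Hypothetical example of cache invalidation cascading downward.
FROM python:3.10-slim

# Command unchanged: this layer still comes from the cache.
RUN apt-get update -y && apt-get install -y git

# A source file was edited, so this layer is invalidated...
COPY . /app

# ...and this layer must be rebuilt as well, even though the instruction
# itself didn't change, because it sits below an invalidated layer.
RUN pip install -r /app/requirements.txt
```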
In our case, by adding `version.json`, we were invalidating everything below it, in particular the layer installing the Python dependencies from `requirements.txt`. But those change quite rarely compared to our application code!
Normally it's a GoodThing™️ to force a rebuild when code changes. But we don't want to lose the cached Python dependencies. To stop invalidating that cache, we explicitly pulled in just `requirements.txt`, installed those dependencies, and only then pulled in the overall source code. This means we still rebuild the dependencies if they change, but if they don't … we don't!
Our new Dockerfile looks like this: (unabridged version here)
```dockerfile
FROM <base_container>
RUN apt-get update -y && apt-get install -y <packages>
# Fetch the Python dependencies
ADD https://raw.githubusercontent.com/dchaley/deepcell-imaging/refs/heads/main/requirements.txt requirements.txt
# Install python requirements
RUN pip install --user --upgrade -r requirements.txt
# Add the repo sha to the container as the version.
ADD https://api.github.com/repos/dchaley/deepcell-imaging/git/refs/heads/main version.json
# Clone the deepcell-imaging repo
RUN git clone https://github.com/dchaley/deepcell-imaging.git
# Switch into the repo directory
WORKDIR "/deepcell-imaging"
# Install our own module
RUN pip install .
```
Then we rebuilt the container after a small code change and observed the full benefit of the cache: no more needless rebuilding of the Python dependencies.
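One practical note: a fresh Cloud Build worker starts with no local layer cache, so the cache has to come from somewhere. A common pattern, sketched below with made-up image names (and not necessarily how our build is configured), is to pull the previously pushed image and hand it to `--cache-from`:

```shell
# Pull the last pushed image so its layers are available as a cache source.
# (The image name is illustrative.)
docker pull us-docker.pkg.dev/my-project/deepcell/batch-runner:latest || true

# Build, allowing Docker to reuse matching layers from the pulled image.
# With BuildKit, BUILDKIT_INLINE_CACHE=1 embeds cache metadata in the pushed
# image so the *next* build can do the same thing.
docker build \
  --cache-from us-docker.pkg.dev/my-project/deepcell/batch-runner:latest \
  --build-arg BUILDKIT_INLINE_CACHE=1 \
  -t us-docker.pkg.dev/my-project/deepcell/batch-runner:latest \
  .
```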
It's been really interesting learning the various ways of building containers and their pros and cons. Many containers are built by copying files from the local directory into the container, rather than checking the source out from a repository. That has real advantages: for example, you can build a test/dev container from whatever you currently have locally. I wanted to keep things simple and make sure the container was always built from `main`. The double-edged sword of simplicity.
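For contrast, a COPY-based layout, a sketch of the common pattern rather than anything we use, keeps the requirements-first trick but builds from whatever is in the local working tree instead of cloning `main`:

```dockerfile
# Hypothetical COPY-based variant: build from the local checkout
# instead of cloning the repo inside the container.
FROM <base_container>
RUN apt-get update -y && apt-get install -y <packages>

WORKDIR /deepcell-imaging

# Dependencies first, so editing application code doesn't invalidate them.
COPY requirements.txt .
RUN pip install --user --upgrade -r requirements.txt

# Then the rest of the source tree, whatever is currently checked out.
COPY . .
RUN pip install .
```

Either layout benefits equally from ordering the dependency install above the application code.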