TL;DR: building our DeepCell container from a base TensorFlow image makes it 50% faster to load and 60% smaller than building from the Deep Learning container.
Metric | Deep Learning image | Base TF image | Reduction |
---|---|---|---|
Uncompressed | 19.5 GB | 7.2 GB | 63% |
Compressed | 8.4 GB | 3.2 GB | 62% |
Batch job load time | 6 min | 3 min | 50% |
This post covers how we rebuilt our container on the smaller base image, and why the Deep Learning container is so big to begin with. The long and short of it is that you pay a steep price to have so many development tools available, and you typically don't need those for production tasks.
Optimizing our container
Our DeepCell journey began on Vertex AI. Google provides pre-built TensorFlow images as part of their Deep Learning Container Images.
These containers purport to let you:
Quickly prototype with a portable and consistent environment for developing, testing, and deploying your AI applications with Deep Learning Containers. These Docker images use popular frameworks and are performance optimized, compatibility tested, and ready to deploy.
Cool beans. Our DeepCell version uses TF2.8 so we picked this image from Google's list: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/tf2-gpu.2-8.py37
It runs Python 3.7, which fortunately is still supported by DeepCell. (I've had mixed experiences with Python version support across bioinformatics tools.)
Our initial container build was simple:
FROM us-docker.pkg.dev/deeplearning-platform-release/gcr.io/tf2-gpu.2-8.py37
ADD https://api.github.com/repos/dchaley/deepcell-imaging/git/refs/heads/main version.json
RUN git clone https://github.com/dchaley/deepcell-imaging.git
WORKDIR "/deepcell-imaging"
RUN pip install --user --upgrade --quiet -r requirements.txt
ENTRYPOINT ["python", "benchmarking/deepcell-e2e/benchmark.py"]
Our requirements file is pretty simple. We verified in the build logs that it didn't reinstall TensorFlow; note that the packages to install do not include TF:
Requirement already satisfied: tensorflow~=2.8.0 in /opt/conda/lib/python3.7/site-packages (from deepcell==0.12.9->-r requirements.txt (line 1)) (2.8.4)
...
Installing collected packages: tensorflow-addons, snakeviz, smart_open, qtpy, opencv-python-headless, lxml, jupyter-core, iniconfig, imagecodecs, cython, pytest, google-api-core, deepcell-toolbox, qtconsole, jupyter-console, deepcell-tracking, google-cloud-notebooks, google-cloud-bigquery, spektral, google-cloud-aiplatform, jupyter, deepcell
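The same check can be scripted rather than eyeballed. This is a minimal sketch, assuming the "Installing collected packages" line has been copied out of the build log as a string; the helper name `installed_packages` is hypothetical. It matches exact package names, so `tensorflow-addons` is not mistaken for `tensorflow`.

```python
# The "Installing collected packages" line, copied from the build log.
LOG_LINE = (
    "Installing collected packages: tensorflow-addons, snakeviz, smart_open, "
    "qtpy, opencv-python-headless, lxml, jupyter-core, iniconfig, imagecodecs, "
    "cython, pytest, google-api-core, deepcell-toolbox, qtconsole, "
    "jupyter-console, deepcell-tracking, google-cloud-notebooks, "
    "google-cloud-bigquery, spektral, google-cloud-aiplatform, jupyter, deepcell"
)


def installed_packages(log_line):
    """Return the package names listed after 'Installing collected packages:'."""
    prefix = "Installing collected packages:"
    if not log_line.startswith(prefix):
        raise ValueError("not an 'Installing collected packages' line")
    return [name.strip() for name in log_line[len(prefix):].split(",") if name.strip()]


packages = installed_packages(LOG_LINE)
# Exact-name match: tensorflow-addons is a separate package from tensorflow.
reinstalls_tf = "tensorflow" in packages
print(f"TensorFlow reinstalled: {reinstalls_tf}")
```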
This resulted in a whopping ~20 GB container 😩
The compressed artifact size was ~8.5 GB: this is the amount of data that must be transmitted before unpacking.
The impact of all this? A six-minute start time for Google Batch jobs, measured from starting the container download …
2024-04-30 14:56:20.896 PDT
gce: Pulling from deepcell-on-batch/deepcell-benchmarking-us-central1/benchmarking
… until executing the container:
2024-04-30 15:02:23.233 PDT
Executing runnable container:
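For the record, the exact gap between those two log timestamps can be worked out with a few lines of Python. This sketch assumes both timestamps are in the same time zone (both are marked PDT), so the zone is simply dropped.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the two log lines (both PDT, so the zone is omitted).
pull_started = datetime.strptime("2024-04-30 14:56:20.896", FMT)
container_started = datetime.strptime("2024-04-30 15:02:23.233", FMT)

startup = container_started - pull_started
minutes, seconds = divmod(startup.total_seconds(), 60)
print(f"{int(minutes)} min {seconds:.1f} s")
```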
I wasn't thrilled with a six-minute minimum feedback cycle 😤 We tried image streaming to reduce startup time but alas, the container was so large it couldn't run without provisioning additional boot disk space.
We figured we must be able to build a container from a slimmer TensorFlow base image. We knew the DeepCell team had done some work scaling DeepCell using Kubernetes on GKE. Their Dockerfile confirmed it: just use TF's image.
We switched our base to TF's, grabbed the apt maintenance work they did, and updated our Dockerfile [diff].
The result: 7.2 GB uncompressed and 3.2 GB compressed, and ~3 min from starting to fetch the container to beginning to execute it.
Metric | Deep Learning image | Base TF image | Reduction |
---|---|---|---|
Uncompressed | 19.5 GB | 7.2 GB | 63% |
Compressed | 8.4 GB | 3.2 GB | 62% |
Batch job load time | 6 min | 3 min | 50% |
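The reduction column is just (before − after) / before. A small sketch that reproduces it from the table's own numbers; the function name `reduction_pct` is only for illustration:

```python
def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


# (Deep Learning image, base TF image) values taken from the table.
rows = {
    "Uncompressed (GB)": (19.5, 7.2),
    "Compressed (GB)": (8.4, 3.2),
    "Batch job load time (min)": (6, 3),
}

reductions = {name: reduction_pct(before, after) for name, (before, after) in rows.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.0f}%")
```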
That's better 😎 But I couldn't help but wonder … why?
Container size analysis
Let's dive into what's in the containers. The containers are too large to open in Cloud Shell 🫠 so we'll do it the old-fashioned way, locally.
Let's use ncdu to explore the file system.
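If installing ncdu in a container isn't an option, a rough stand-in can be sketched in Python. This is not what was used for the numbers in this post, and it is cruder than ncdu: it sums apparent file sizes, does not deduplicate hard links, and should not be pointed at pseudo-filesystems like /proc. The function names are hypothetical, and the demo runs against a throwaway temp directory.

```python
import os
import tempfile


def dir_size(path):
    """Sum the apparent sizes of regular files under `path` (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                if not os.path.islink(file_path):
                    total += os.lstat(file_path).st_size
            except OSError:
                pass  # unreadable or vanished file: ignore it
    return total


def largest_children(path, top=5):
    """Return the `top` largest direct children of `path` as (name, bytes) pairs."""
    sizes = {}
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes[entry.name] = dir_size(entry.path)
            elif entry.is_file(follow_symlinks=False):
                sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Demo on a small temporary tree so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    for sub, n_bytes in (("lib", 3000), ("local", 1000)):
        os.makedirs(os.path.join(tmp, sub))
        with open(os.path.join(tmp, sub, "blob.bin"), "wb") as fh:
            fh.write(b"\0" * n_bytes)
    demo = largest_children(tmp)

print(demo)
```

Inside a container, something like `largest_children("/usr")` would give a first-level breakdown comparable to the ncdu listings.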
Deep Learning
This container was built from the Deep Learning base. Let's boot it up & install ncdu.
$ docker run -it --entrypoint bash us-central1-docker.pkg.dev/deepcell-on-batch/deepcell-benchmarking-us-central1/benchmarking@sha256:8cc9b89e5869a4d468d64810b2ae47e242cc106519b2b8d7c4a9daa07856bdde
root@55a486270459:/deepcell-imaging# apt update && apt install ncdu
Begin scanning the root directory:
root@55a486270459:/deepcell-imaging# ncdu /
It scans pretty quickly. Here's the summary:
So far this just tells us we have a lot in usr and opt (common places to install libraries). Let's start with usr.
6.6 GiB [ 53.9%] /lib
4.9 GiB [ 39.6%] /local
363.3 MiB [ 2.9%] /share
276.8 MiB [ 2.2%] /bin
144.1 MiB [ 1.1%] /src
A bit odd to have stuff in both lib and local, but let's see. lib is mostly the CUDA Deep Neural Network library (cuDNN):
--- /usr/lib -------------------------
/..
5.5 GiB [ 83.2%] /x86_64-linux-gnu
938.5 MiB [ 13.8%] /google-cloud-sdk
--- /usr/lib/x86_64-linux-gnu ----------------------------
/..
1.4 GiB [ 24.5%] libcudnn_static.a
956.8 MiB [ 16.9%] libnvinfer_builder_resource.so.8.6.1
839.4 MiB [ 14.8%] libcudnn_cnn_infer_static.a
675.1 MiB [ 11.9%] libcudnn_cnn_infer.so.8.2.0
271.8 MiB [ 4.8%] libcudnn_ops_infer.so.8.2.0
227.3 MiB [ 4.0%] libcudnn_cnn_train_static.a
225.5 MiB [ 4.0%] libnvinfer.so.8.6.1
Static libraries are used to compile from source. We aren't doing that. Maybe we need the dynamic libraries for inference, I'm not sure. But the static libraries here are over 2.5 GB…
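To put a number on the static libraries, here is a small sketch that parses the ncdu lines shown above and sums the `.a` entries. The regular expression is an assumption about ncdu's text layout based only on the lines in this post, and sizes are kept in MiB as ncdu reports them.

```python
import re

# ncdu lines for /usr/lib/x86_64-linux-gnu, copied from the listing.
LISTING = """\
  1.4 GiB [ 24.5%] libcudnn_static.a
956.8 MiB [ 16.9%] libnvinfer_builder_resource.so.8.6.1
839.4 MiB [ 14.8%] libcudnn_cnn_infer_static.a
675.1 MiB [ 11.9%] libcudnn_cnn_infer.so.8.2.0
271.8 MiB [  4.8%] libcudnn_ops_infer.so.8.2.0
227.3 MiB [  4.0%] libcudnn_cnn_train_static.a
225.5 MiB [  4.0%] libnvinfer.so.8.6.1
"""

UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}
LINE_RE = re.compile(r"^\s*([\d.]+)\s+(KiB|MiB|GiB)\s+\[\s*[\d.]+%\]\s+(\S+)\s*$")


def parse_listing(text):
    """Return (name, size_in_MiB) pairs for every line that looks like ncdu output."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            size, unit, name = match.groups()
            entries.append((name, float(size) * UNIT_TO_MIB[unit]))
    return entries


entries = parse_listing(LISTING)
static_mib = sum(size for name, size in entries if name.endswith(".a"))
print(f"static libraries: {static_mib:.0f} MiB")
```

That comes to roughly 2,500 MiB across the three `.a` files, which is a bit over 2.5 GB in decimal units.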
Surprising also to see a gig in the cloud sdk… it looks like the sdk ships its own Python distro and some other stuff.
--- /usr/lib/google-cloud-sdk --
/..
382.3 MiB [ 40.7%] /lib
296.7 MiB [ 31.6%] /platform
169.5 MiB [ 18.1%] /bin
As for /usr/local:
--- /usr/local ----------------
/..
3.4 GiB [ 70.0%] /cuda-11.3
850.0 MiB [ 17.0%] /share
603.9 MiB [ 12.1%] /cuda-12.2
Well… do we actually need 2 versions of CUDA? (Why is 12.2 so much smaller?) About half of the 11.3 version is static libraries again.
So far we're at ~4 GB of CUDA-related static libraries (which we don't need).
How about that /usr/local/share directory…
--- /usr/local/share/.cache --
/..
850.0 MiB [100.0%] /yarn
A gig of yarn package caches 😑 That brings us to ~5 GB of stuff we don't need.
Alright, bouncing back to /opt (the other big directory, with 6 GB):
--- /opt --------------------
/..
4.8 GiB [ 79.1%] /conda
1.3 GiB [ 20.9%] /nvidia
Conda is a Python distribution. Let's check out what's in nvidia first:
--- /opt/nvidia --------------------
/..
1.3 GiB [100.0%] /nsight-compute
--- /opt/nvidia/nsight-compute -----
/..
651.5 MiB [ 50.0%] /2021.1.1
651.3 MiB [ 50.0%] /2021.1.0
So we have half a gig on an old version. What is nsight anyhow?
NVIDIA Nsight™ Systems is a system-wide performance analysis tool
Well, we don't need that … so we're at ~6 GB of stuff we don't need. Let's go back to /opt/conda (~5 GB); as expected, most of the stuff is in packages & libraries:
--- /opt/conda -----------
/..
4.5 GiB [ 94.0%] /pkgs
3.2 GiB [ 67.0%] /lib
Most of the 4.5 GB of pkgs is in something called dlenv-tf-2-8-gpu-1.0.20230926-py37hab20f5e_0, which in turn is ~3 GB of libraries.
--- /opt/conda/pkgs/dlenv-tf-2-8-gpu-1.0.20230926-py37hab20f5e_0 -----
/..
2.9 GiB [ 81.3%] /lib
623.1 MiB [ 17.3%] /share
The libraries are Python 3.7 site-packages: mostly TensorFlow (1 GB), plus a bunch of small Python libraries. We presumably need this stuff!
--- /opt/conda/pkgs/dlenv-tf-2-8...e_0/lib/python3.7/site-packages ---
/..
1.1 GiB [ 39.5%] /tensorflow
282.5 MiB [ 9.7%] /ray
116.9 MiB [ 4.0%] /pyarrow
98.2 MiB [ 3.4%] /llvmlite
84.3 MiB [ 2.9%] /scipy
83.9 MiB [ 2.9%] /sklearn
78.8 MiB [ 2.7%] /plotly
69.5 MiB [ 2.4%] /tensorflow_io
58.6 MiB [ 2.0%] /clang
50.3 MiB [ 1.7%] /apache_beam
46.6 MiB [ 1.6%] /google
How about share?
--- /opt/conda/pkgs/dlenv-tf-2-8...0.20230926-py37hab20f5e_0/share ---
/..
621.3 MiB [ 99.7%] /jupyter
--- /opt/conda/pkgs/dlenv-tf-2-8...f5e_0/share/jupyter/lab/staging ---
/..
480.7 MiB [ 88.5%] /node_modules
57.1 MiB [ 10.5%] /build
Half a gig for Jupyter's JS dependencies & build files. So, ~6.5 GB of unused stuff.
How about the lib sibling to pkgs (3.2 GB)? Almost all of it is … another Python distribution?
--- /opt/conda/lib/python3.7 ------
/..
2.9 GiB [ 98.5%] /site-packages
--- /opt/conda/lib/python3.7/site-packages ---
/..
1.1 GiB [ 38.3%] /tensorflow
282.5 MiB [ 9.4%] /ray
117.0 MiB [ 3.9%] /pyarrow
98.2 MiB [ 3.3%] /llvmlite
84.3 MiB [ 2.8%] /scipy
83.9 MiB [ 2.8%] /sklearn
78.8 MiB [ 2.6%] /plotly
69.5 MiB [ 2.3%] /tensorflow_io
58.6 MiB [ 1.9%] /clang
50.8 MiB [ 1.7%] /google
50.3 MiB [ 1.7%] /apache_beam
These appear to be the same packages as the dlenv-etc folder… ~3 GB of duplication, bringing our unused total to ~9.5 GB.
Since that's nearly all of our ~12 GB difference I stopped here.
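The running total can be tallied explicitly. The per-item figures below are the rounded increments implied by the running totals in this section (~4, ~5, ~6, ~6.5, ~9.5 GB), not precise measurements, so treat the result as a rough estimate.

```python
# Rounded increments implied by the running totals above (GB, approximate).
unneeded_gb = {
    "CUDA static libraries": 4.0,
    "yarn package cache": 1.0,
    "Nsight": 1.0,
    "Jupyter JS dependencies and build files": 0.5,
    "duplicated site-packages": 3.0,
}

total_unneeded = sum(unneeded_gb.values())
size_difference = 19.5 - 7.2  # uncompressed Deep Learning image minus base TF image
share = total_unneeded / size_difference

print(f"~{total_unneeded} GB unneeded, ~{share:.0%} of the {size_difference:.1f} GB difference")
```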
Container size analysis: TensorFlow base
Let's do a quick scan of the container built off the base TensorFlow image.
Let's open up the container. Ooh, fancy...
root@2317ea736b48:/deepcell-imaging# apt update && apt install ncdu
root@2317ea736b48:/deepcell-imaging# ncdu /
This time most of the contents are in usr and root:
--- / --------------------
5.5 GiB [ 79.7%] /usr
1.3 GiB [ 19.0%] /root
Most of root is Python 3.8 libraries, made up of a lot of small packages:
--- /root/.local/lib/python3.8 ----
/..
1.0 GiB [100.0%] /site-packages
--- /root/.local/lib/python3.8/site-packages ----
/..
85.5 MiB [ 8.5%] /scipy
83.6 MiB [ 8.3%] /google
74.5 MiB [ 7.4%] /imagecodecs
72.3 MiB [ 7.2%] /cv2
62.1 MiB [ 6.1%] /opencv_python_headless.libs
61.9 MiB [ 6.1%] /pandas
45.7 MiB [ 4.5%] /sklearn
whereas /usr looks like this:
--- /usr ------------------
/..
3.1 GiB [ 57.2%] /local
2.2 GiB [ 40.1%] /lib
Almost all of lib is CUDA DNN:
--- /usr/lib/x86_64-linux-gnu -------------------
/..
757.3 MiB [ 36.9%] libcudnn_cnn_infer.so.8.1.0
442.8 MiB [ 21.6%] libnvinfer.so.7.2.2
267.4 MiB [ 13.0%] libcudnn_ops_infer.so.8.1.0
whereas local is split across more CUDA + Python files:
--- /usr/local ----------------
/..
1.7 GiB [ 55.4%] /cuda-11.2
1.4 GiB [ 43.7%] /lib
--- /usr/local/cuda-11.2/targets/x86_64-linux/lib ---
/..
382.7 MiB [ 25.4%] libcusolver.so.11.1.0.152
219.6 MiB [ 14.6%] libcusparse.so.11.4.1.1152
186.6 MiB [ 12.4%] libcusolverMg.so.11.1.0.152
181.3 MiB [ 12.0%] libcufft.so.10.4.1.152
176.7 MiB [ 11.7%] libcublasLt.so.11.4.1.1043
--- /usr/local/lib/python3.8/dist-packages ---
/..
1.1 GiB [ 84.0%] /tensorflow
It looks like the CUDA DNN files in /usr/lib are different from the CUDA files in /usr/local.
Conclusions
The Deep Learning container seems better suited for:
- compiling tools from source
- training, not just predicting
- using notebooks for iterative development
- overall development tasks
The TensorFlow base image seems better suited for:
- running the specific thing you want to run once you've figured out how to run it.
Future work?
Google has optimized container images for Vertex AI. We'd use: us-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-8:latest
I get the sense from the docs these only work on Vertex AI & need you to train the model on Vertex AI as well:
The optimization occurs when Vertex AI uploads a model, before it runs.
At some point it may be worth investigating the cost of predicting via Vertex AI online models versus predicting with an open-source container on Batch. But if the container is so large again because of training code, we may lose whatever benefits we gained…