Table of contents:
- Context
- My process
- Summary of method & results
TLDR:
|  | DeepCell + tensorflow-2.8.4 | DeepCell + tensorflow-2.8.4-redux | Delta |
| --- | --- | --- | --- |
| Compressed size | 3.2 GB | 4.0 GB | +0.8 GB (+25%) |
| VULNs | 553 | 125 | -428 (-77%) |
| Critical | 1 | 1 | 0 (0%) |
| High | 80 | 29 | -51 (-63%) |
| Medium | 349 | 53 | -296 (-85%) |
| Low | 123 | 42 | -81 (-66%) |
Read on for the how, the why, and the wherefore.
Context & motivation
Previously we switched from the DeepLearning container to the base TensorFlow container.
Unfortunately, that container has 553 security vulnerabilities according to Google's scanner.
The 553 issues break down this way:
- 1 critical [vuln]
- 80 high
- 349 medium
- 123 low
The official 2.8.4 container was published in November 2022, so it's missing at least 1.5 years of OS updates. I looked up the 2.8.4 source and found that it uses Ubuntu 20.04 as the base OS. Of note, we're on the x86_64 architecture, according to the container image layer ENV NVARCH=x86_64.
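If you want to check that kind of detail yourself, docker inspect will dump the environment baked into an image (assuming the public tensorflow/tensorflow:2.8.4-gpu tag is the one in question):

# Pull the official image and print the ENV values baked into it
docker pull tensorflow/tensorflow:2.8.4-gpu
docker image inspect tensorflow/tensorflow:2.8.4-gpu --format '{{json .Config.Env}}'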
So the obvious thing to do is to switch to the most recent Ubuntu version, 24.04, right? Well, no, that's a short party: NVIDIA doesn't have CUDA packages for 24.04 in their repository. So it's off to 22.04 – still two years more recent, and more importantly, with CUDA packages.
My process, as I did it
I wouldn't do it this way again, but this is how I did it.
Updating the base Ubuntu image + dependencies.
First, I forked the tensorflow repository. I did a master-only clone, so I needed to fetch the tag information after cloning. Then I could reset to the 2.8.4 version.
# Add the upstream remote and fetch it (including tags).
git remote add upstream https://github.com/tensorflow/tensorflow.git
git fetch upstream --tags
# Reset the master branch to 2.8.4.
git checkout master
git reset --hard v2.8.4
git push --force
# Garbage-collect the local clone (drops unreachable history after 2.8.4).
git gc
Then I updated the build steps. Here's what I did, following the instructions in the container build README (in tensorflow/tools/dockerfiles).
1. Build the tf-tools build-tools container:
cd tensorflow/tools/dockerfiles
docker build -t tf-tools -f tools.Dockerfile .
2. Set up aliases:
alias asm_dockerfiles="docker run --rm -u $(id -u):$(id -g) -v $(pwd):/tf tf-tools python3 assembler.py "
alias asm_images="docker run --rm -v $(pwd):/tf -v /var/run/docker.sock:/var/run/docker.sock tf-tools python3 assembler.py "
3. Update the build settings. I started by changing the file partials/ubuntu/version.partial.Dockerfile to use Ubuntu 22.04.
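The partial is tiny; after the edit it looks roughly like this (quoted from memory, so treat the exact ARG name as an assumption and check the file itself):

# partials/ubuntu/version.partial.Dockerfile
ARG UBUNTU_VERSION=22.04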
4. Regenerate the dockerfiles.
asm_dockerfiles --release dockerfiles --construct_dockerfiles
5. Rebuild the desired TF 2.8 image. This builds a container tagged with the 2.8.4-rebuilt version, which causes the build system to tag the GPU-accelerated container 2.8.4-rebuilt-gpu.
asm_images --release versioned --arg _TAG_PREFIX=2.8.4-rebuilt --build_images --only_tags_matching="^2.8.4-rebuilt-gpu$"
6. Done, or not quite: fix any build errors and loop back to step 3.
Dependency updates
Following this process, here's what I fixed first:
- Downgrade the requests & urllib libraries (see the GitHub bug).
- Update the base Ubuntu to 22.04.
- Update CUDA from 11.2.1 to 11.8.0.
- Parameterize the CUDA patch level (to support .0 instead of .1).
- Update cuDNN from 8.1.0.77-1 to 8.6.0.163-1.
- Update libnvinfer from 7.2.2-1 to 8.5.3-1.
  - I didn't love the major version update, but things seem fine.
At this point the container built, and I could run DeepCell; it output a segmentation image that seemed plausible. However, a new error message popped up in the logs…
2024-05-28 19:38:34.423093: W tensorflow/stream_executor/gpu/asm_compiler.cc:80] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
2024-05-28 19:38:34.423903: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-05-28 19:38:34.424006: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] INTERNAL: Failed to launch ptxas
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
Is this an error? Is it a problem to rely on the driver? I don't know, but I wanted to clear out the error.
Finding ptxas
I found a GitHub issue that seemed similar (missing ptxas) and saw a suggestion to install nvidia-cuda-toolkit. Alright: but that exploded the container size from 6.5 GB to 12.13 GB … unacceptable 😤 (Incidentally, this is too large for Cloud Shell to build on its limited persistent disk.)
At this point I struggled for a couple of hours. The nvidia-cuda-toolkit package info says it uses CUDA 11.5, but the prebuilt containers had 11.7 and 11.8, not 11.5 (I'd previously selected 11.8), and the 11.5 packages weren't available in NVIDIA's Ubuntu 22.04 package repo.
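For what it's worth, plain apt can tell you what a package like nvidia-cuda-toolkit would drag in before you commit to the download (generic commands, not specific to this build):

# Which version would be installed, and from which repo?
apt-cache policy nvidia-cuda-toolkit
# What does it depend on (i.e. which CUDA libraries come along)?
apt-cache depends nvidia-cuda-toolkit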
Along the way, I switched the base container from NVIDIA's nvidia/cuda:11.8.0-base-ubuntu22.04 to nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04. Rather than pick the versions myself, I figured going with an official NVIDIA container that already ships the files I was installing anyhow made sense.
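Concretely, that's just a different FROM line in the generated Dockerfile, roughly (which partial file owns the FROM line depends on the dockerfiles layout):

FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04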
I eventually found this "ptxas version issue" linked from a TensorFlow discussion asking whether to worry about a version mismatch warning. Not quite the same as our message, which complains that ptxas is missing entirely, but close enough.
This part caught my eye:
> You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.
Interesting idea. I cherry-picked the binary by launching a container from the rebuilt image and installing the very large nvidia-cuda-toolkit package:
apt-get install nvidia-cuda-toolkit
Need to get 1603 MB of archives.
After this operation, 4505 MB of additional disk space will be used.
Do you want to continue? [Y/n]
Gulp. One very long download later, I had a ptxas binary.
root@33eda96a19a0:/# which ptxas
/usr/bin/ptxas
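And a quick sanity check that the binary actually runs (my own verification step, not part of the original session):

ptxas --version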
Now to copy it back to the host, so I can add it to the redux repo for direct insertion into the container. Back on the host:
docker cp 33eda96a19a0:/usr/bin/ptxas .
Then I installed it to /usr/bin/ptxas in the Dockerfile.
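The Dockerfile addition is a couple of lines; a minimal sketch, assuming the cherry-picked binary is committed alongside the Dockerfile as ptxas:

# Copy the cherry-picked ptxas binary into the image and make sure it's executable
COPY ptxas /usr/bin/ptxas
RUN chmod +x /usr/bin/ptxas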
Lo and behold: no more ptxas error when running DeepCell.
Summary
The container was rebuilt by:
- Forking TensorFlow and building 2.8.4 from source.
- Switching to the nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 base image.
- Cherry-picking ptxas from nvidia-cuda-toolkit (avoiding ~4 GB of unnecessary files).
The container rebuild yielded these changes:
|  | DeepCell + tensorflow-2.8.4 | DeepCell + tensorflow-2.8.4-redux | Delta |
| --- | --- | --- | --- |
| Compressed size | 3.2 GB | 4.0 GB | +0.8 GB (+25%) |
| VULNs | 553 | 125 | -428 (-77%) |
| Critical | 1 | 1 | 0 (0%) |
| High | 80 | 29 | -51 (-63%) |
| Medium | 349 | 53 | -296 (-85%) |
| Low | 123 | 42 | -81 (-66%) |
It's too bad we added 25% to the container size. This may be because I moved away from the TensorFlow container build's selective dependencies to the full runtime package.
Still, a 77% reduction in VULNs (and 63% for the highs) is very good.
The critical VULN is in TensorFlow pre-2.11.1. It allows malicious users running custom TensorFlow Python code to access memory that isn't theirs in some cases. Since we're running our own Python code, and DeepCell's, we're safe as long as nobody sneaks naughty code into those layers. But we're also stuck with 2.8.4 and can't upgrade to 2.11, so the rationalization is rationalized.
If I were to do it again, I'd skip hand-picking library versions & move straight to an official NVIDIA runtime container.
Appendix
Helpful command to get into the TF container shell to poke around for files:
docker run --user $(id -u):$(id -g) -it -v $(pwd):/tf tensorflow:2.8.4-rebuilt-gpu bash
I ran out of disk space on Cloud Shell a few times. Clear out the Docker cache like so:
⚠️ Don't run these as-is if you have other containers/images you want to keep!
docker system prune
# Delete the previously built image
# (to make room for the new one)
docker image rm tensorflow:2.8.4-rebuilt-gpu
In the end, Cloud Shell (which has a limited disk) became a hassle for iterating on builds. I considered a Cloud Workstation, but there's a fixed $0.20/hr cost whether or not a workstation is running … and I really just needed a place to run Docker with some disk space, so I used my local computer (a Mac). The downloads weren't as fast as in the cloud, but hey.
Side note: I'm super impressed with how easy it was to rebuild TF from source. Nice job y'all 🤩
Tools used in the rebuild:
- GCP Cloud Shell
- GCP Artifact Registry container scanner
- Docker (local + cloud shell)
- git & GitHub
- TensorFlow
- apt-file (to look up which package installed a file; see the example below)
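For example, here's the kind of apt-file lookup I mean (run inside the container; which package owns ptxas depends on the NVIDIA repos you have configured):

apt-get update && apt-get install -y apt-file
apt-file update
apt-file search bin/ptxas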