
David Haley


Rebuilding TensorFlow 2.8.4 on Ubuntu 22.04 to patch vulnerabilities

Table of contents:

  1. Context & motivation
  2. My process, as I did it
  3. Summary
  4. Appendix

TLDR:

|   | DeepCell + tensorflow-2.8.4 | DeepCell + tensorflow-2.8.4-redux | Delta |
| --- | --- | --- | --- |
| Compressed size | 3.2 GB | 4.0 GB | +0.8 GB (+25%) |
| VULNs | 553 | 125 | -428 (-77%) |
| Critical | 1 | 1 | 0 (0%) |
| High | 80 | 29 | -51 (-63%) |
| Medium | 349 | 53 | -296 (-85%) |
| Low | 123 | 42 | -81 (-66%) |

Read on for the how, why, wherefore, and finally.

Context & motivation

Previously, we switched from the Deep Learning container to the base TensorFlow container.

Unfortunately, the container has 553 security vulnerabilities according to Google's scanner:

Screenshot of the Google container vulnerability scanner

The 553 issues break down this way:

  • 1 critical
  • 80 high
  • 349 medium
  • 123 low

The official 2.8.4 container was published in Nov 2022, so it's missing at least 1.5 years of OS updates. I looked up the 2.8.4 source and found that it uses Ubuntu 20.04 as the base OS. Of note, we're using the x86_64 architecture according to the container image layer: ENV NVARCH=x86_64.
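
To double-check the architecture yourself, you can inspect the image's environment variables locally (this assumes the image is pulled under the tensorflow/tensorflow:2.8.4-gpu tag):

# Print the image's env vars and look for NVARCH
# (assumes the tensorflow/tensorflow:2.8.4-gpu tag is available locally)
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' tensorflow/tensorflow:2.8.4-gpu | grep NVARCH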

So the obvious thing to do is switch to the most recent Ubuntu version, 24.04, right? Well, no, that's a short party: NVIDIA doesn't have CUDA packages for 24.04 in their repository. So it's off to 22.04 – still two years more recent, and more importantly, with CUDA packages.
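
If you want to check which Ubuntu releases NVIDIA publishes CUDA packages for, one quick probe (not part of the build, and availability changes over time) is to hit the repo index for each release:

# Probe NVIDIA's CUDA apt repos: 200 means packages exist for that release, 404 means they don't
curl -sI https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ | head -n 1
curl -sI https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/ | head -n 1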

My process, as I did it

I wouldn't do it this way again, but this is how I did it.

Updating the base Ubuntu image + dependencies.

First, I forked the tensorflow repository. I did a master-only clone, so I needed to fetch the tag information after cloning. Then I could reset to the 2.8.4 tag.

# Add the upstream remote.
git remote add upstream https://github.com/tensorflow/tensorflow.git
git fetch upstream
# Reset master branch to 2.8.4
git checkout master
git reset --hard v2.8.4
git push --force
# Clean out local objects from after 2.8.4
# (expire reflogs first, or gc will keep them around)
git reflog expire --expire=now --all
git gc --prune=now

Then, I updated the build steps. Here's what I did, following the instructions in the container build README.

1. Build the tf-tools build tools container:

cd tensorflow/tools/dockerfiles
docker build -t tf-tools -f tools.Dockerfile .

2. Set up aliases:

alias asm_dockerfiles="docker run --rm -u $(id -u):$(id -g) -v $(pwd):/tf tf-tools python3 assembler.py "
alias asm_images="docker run --rm -v $(pwd):/tf -v /var/run/docker.sock:/var/run/docker.sock tf-tools python3 assembler.py "

3. Update build settings. I started with changing the file partials/ubuntu/version.partial.Dockerfile to use Ubuntu 22.04.
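
From memory the partial is essentially a one-line ARG, so the change looks something like this (verify against the actual file in the repo):

# partials/ubuntu/version.partial.Dockerfile (sketch)
ARG UBUNTU_VERSION=22.04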

4. Regenerate the dockerfiles.

asm_dockerfiles --release dockerfiles --construct_dockerfiles

5. Rebuild the desired TF-2.8 image. This builds a container tagged with the 2.8.4-rebuilt version, which causes the build system to tag the GPU-accelerated container 2.8.4-rebuilt-gpu.

asm_images --release versioned --arg _TAG_PREFIX=2.8.4-rebuilt --build_images --only_tags_matching="^2.8.4-rebuilt-gpu$"

6. Done, or not: fix any build errors and loop back to step 3.

Dependency updates

Following this process, here's what I fixed at first (a sketch of these pins as Dockerfile build args follows the list):

  • Downgrade the requests & urllib libraries (see GitHub bug)
  • Update base Ubuntu to 22.04.
  • Update CUDA from 11.2.1 to 11.8.0.
  • Parameterize the CUDA patch level (to support .0 instead of .1).
  • Update CUDNN from 8.1.0.77-1 to 8.6.0.163-1.
  • Update libnvinfer from 7.2.2-1 to 8.5.3-1.
    • I didn't love the major version update, but things seem fine.
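
A rough sketch of what those pins look like as Dockerfile build args (the ARG names here are illustrative; the real values are spread across several partials):

# Illustrative only: the ARG names and layout approximate the GPU partials
ARG UBUNTU_VERSION=22.04
ARG CUDA=11.8
ARG CUDA_PATCH=0
ARG CUDNN=8.6.0.163-1
ARG LIBNVINFER=8.5.3-1
ARG LIBNVINFER_MAJOR_VERSION=8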

At this point the container built, and I could run DeepCell. It output a segmentation image that looked plausible.

However, a new error message popped up in the logs…

2024-05-28 19:38:34.423093: W tensorflow/stream_executor/gpu/asm_compiler.cc:80] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
2024-05-28 19:38:34.423903: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-05-28 19:38:34.424006: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] INTERNAL: Failed to launch ptxas
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.

Is this an error? Is it a problem to rely on the driver? I don't know, but I wanted to clear out the error.

Finding ptxas

I found a GitHub issue that seemed similar (missing ptxas) and saw a suggestion to install nvidia-cuda-toolkit. Alright: but that exploded the container size from 6.5 GB to 12.13 GB … unacceptable 😤 (Incidentally, this is too large for Cloud Shell to build on its limited persistent disk.)
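
To see the damage, you can compare local image sizes (the tag names below are whatever you used for your builds):

# Compare local image sizes; filter to the rebuilt tags
docker images --format '{{.Repository}}:{{.Tag}}  {{.Size}}' | grep 2.8.4-rebuilt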

Screenshot of the Docker desktop app showing two versions of the image

At this point I struggled for a couple of hours. The nvidia-cuda-toolkit package info says it uses CUDA 11.5, but the prebuilt containers had 11.7 and 11.8, not 11.5 (I'd previously selected 11.8). And the 11.5 packages weren't available in NVIDIA's Ubuntu 22.04 package repo.
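
Inside a 22.04 container you can check what Ubuntu's own package would give you with apt-cache (on 22.04 the candidate should be an 11.5.x build, but verify for yourself):

# Check which CUDA version Ubuntu's nvidia-cuda-toolkit package would install
apt-get update && apt-cache policy nvidia-cuda-toolkit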

Along the way, I switched the base container from NVIDIA's nvidia/cuda:11.8.0-base-ubuntu22.04 to nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04. Rather than pick the versions myself, I figured it made sense to go with an official NVIDIA container that already ships the files I was installing anyway.
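
Concretely that's just swapping the FROM line (shown standalone here; in the TF build it's assembled from the partials):

# Before: minimal CUDA base image
FROM nvidia/cuda:11.8.0-base-ubuntu22.04
# After: runtime image that already bundles cuDNN 8
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04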

I eventually found this "ptxas version issue" linked from a TensorFlow discussion asking whether to worry about a version mismatch warning. Not quite the same as our message (ours says ptxas is missing entirely), but close enough.

This part caught my eye:

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.

Interesting idea. I cherry-picked the binary by launching a container from the rebuilt image and installing the very large nvidia-cuda-toolkit:

apt-get install nvidia-cuda-toolkit
Need to get 1603 MB of archives.
After this operation, 4505 MB of additional disk space will be used.
Do you want to continue? [Y/n]

Gulp. One very long download later, I had a ptxas binary.

root@33eda96a19a0:/# which ptxas
/usr/bin/ptxas

Now to copy it back to the host, so I can add it to the redux repo for direct insertion into the container. Back on the host:

docker cp 33eda96a19a0:/usr/bin/ptxas .

Then I installed it to /usr/bin/ptxas in the Dockerfile.
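
The Dockerfile addition is tiny; something along these lines, assuming the cherry-picked binary sits next to the Dockerfile:

# Copy the cherry-picked ptxas into the image
# (assumes ./ptxas was docker-cp'd next to this Dockerfile)
COPY ptxas /usr/bin/ptxas
RUN chmod +x /usr/bin/ptxas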

Lo and behold: no more ptxas error when running DeepCell.


Summary

The container was rebuilt by:

  • Forking the TensorFlow repo at the v2.8.4 tag and rebuilding from source.
  • Switching to the nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 base image.
  • Cherry-picking ptxas from nvidia-cuda-toolkit (avoiding ~4 GB of unnecessary files).

The container rebuild yielded these changes:

|   | DeepCell + tensorflow-2.8.4 | DeepCell + tensorflow-2.8.4-redux | Delta |
| --- | --- | --- | --- |
| Compressed size | 3.2 GB | 4.0 GB | +0.8 GB (+25%) |
| VULNs | 553 | 125 | -428 (-77%) |
| Critical | 1 | 1 | 0 (0%) |
| High | 80 | 29 | -51 (-63%) |
| Medium | 349 | 53 | -296 (-85%) |
| Low | 123 | 42 | -81 (-66%) |

It's too bad we added 25% to the container size. This may be because I moved away from the TensorFlow container build's selective dependencies to the full runtime package.

Still, a 77% reduction in VULNs (and 63% for the highs) is very good.

The critical VULN is in TensorFlow versions before 2.11.1. In some cases it allows malicious users running custom TensorFlow Python code to access memory that isn't theirs. Since we're running our own Python code, and DeepCell's, we're safe as long as nobody sticks naughty code into those layers. But we're also stuck on 2.8.4 and can't upgrade to 2.11.1, so the rationalization is rationalized.

If I were to do it again, I'd skip hand-picking library versions & move straight to an official NVIDIA runtime container.


Appendix

A helpful command to get into the TF container shell and poke around for files:

docker run --user $(id -u):$(id -g) -it -v $(pwd):/tf tensorflow:2.8.4-rebuilt-gpu bash

I ran out of disk space on Cloud Shell a few times. Clear out the Docker cache like so:

⚠️ Don't run these as-is if you have other containers/images you want to keep!

docker system prune

# Delete the previously built image
# (make room for the new one)
docker image rm tensorflow:2.8.4-rebuilt-gpu

In the end, Cloud Shell (which has a limited disk) became a hassle for iterating on builds. I considered a Cloud Workstation, but there's a fixed $0.20/hr cost whether or not a workstation is running … and I really just needed a place to run Docker with some disk space, so I used my local computer (a Mac). The downloads weren't as fast as in the cloud, but hey.

Side note: I'm super impressed with how easy it was to rebuild TF from source. Nice job y'all 🤩

Tools used in the rebuild:

  • GCP Cloud Shell
  • GCP Artifact Registry container scanner
  • Docker (local + Cloud Shell)
  • git & GitHub
  • TensorFlow
  • apt-file (to look up which package installed a file; usage example below)
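
For reference, apt-file usage looks like this (handy for figuring out which package would provide a missing binary such as ptxas):

# Install apt-file, fetch its index, then search for the package that ships a file
apt-get install -y apt-file
apt-file update
apt-file search bin/ptxas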
