Burak Yigit Kaya

Originally published at byk.im

Docker Volume Caching on GitHub Actions

I joined Sentry in 2019 to work exclusively on their self-hosted product. Back then, Sentry was using just a few services: Postgres, Memcached, Redis, and Sentry itself. But it was on the cusp of becoming a multi-service application with the introduction of Snuba and, along with it, Kafka, Relay, Symbolicator, and others. Because it was supposed to be simple, self-hosted (or onpremise, as it was called back then) did not have any tests or even any automation: just a bunch of instructions and commands to run in the README. With the rapid increase in the number of engineers working on Sentry and the changes being made, it became clear that we needed to automate the testing and setup of the self-hosted repository.

To summarize about a year’s worth of work: we created an install script written in Bash (as that was the most common denominator across all platforms), and a very cursory test suite which ran the install script, tried to ingest an event, and read it back. The entire test suite took about 5-6 minutes to run, and about half of that time was spent running Django migrations, from scratch, on a fresh database, over, and over, and over. The thing is, we didn’t even add migrations frequently, but we still had to run them all to get the service up and running.

The solution was obviously caching, but caching Docker volumes did not really seem feasible back then. Remember, this was 2019-2020, and GitHub Actions was still in its infancy. I was also barely getting comfortable with all that Bash and Docker stuff. Then I got distracted by other things, changed jobs, and eventually came back to Sentry to see that this was still a problem. So I decided to tackle it head-on. I was going to cache the hell out of those Docker volumes for our databases. We already had actions/cache now, so how hard could it be? Famous last words.

It took me about 2 weeks to completely figure this out. About 50% of that was my ignorance of basic Linux tools such as tar, file/directory permissions, and Docker’s way of storing volumes. About 30% was me not trying things out properly locally and just pushing to CI and waiting for the results. The remaining 20% was the actual hard parts, figured out mostly thanks to StackOverflow (yeah, still not on that “ChatGPT for everything” bandwagon[1]). I’ll summarize some of the findings here so you don’t have to go through the same pain as I did:

  1. Docker volumes are stored under /var/lib/docker/volumes (by default, and please don’t change it)
  2. You cannot stat a directory or anything under it if you don’t have x permission on the directory itself (╯°□°)╯︵ ┻━┻
  3. tar does preserve permissions and ownership by default but only if you are running it as root (or with sudo) (╯°□°)╯︵ ┻━┻ x 2
  4. tar preserves ownership information as names and not as numeric IDs, so if your Docker container uses a user ID like 1000, GLHF[2] (see the sketch after this list) (╯°□°)╯︵ ┻━┻ x 3
  5. Linux (Unix?) fs permissions are not just rwx but there’s also an s bit you can set on executables to allow them to set ownership of other things[3] \(〇_o)/
  6. Not only does GitHub Actions not run tar with sudo, and not only does it refuse to do so, it also doesn’t let you pass --same-owner or --numeric-owner to tar (╯°□°)╯︵ ┻━┻ x 4
  7. Bonus: there are these awesome tools called getfacl and setfacl that let you back up and restore ACLs BUT NOT OWNERSHIP INFORMATION (╯°□°)╯︵ ┻━┻ x 5
  8. Bonus 2: mv would happily overwrite your target without even mentioning it, especially if you use sudo.
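
To make findings 3 and 4 concrete, here is a minimal sketch (the archive name is a made-up example; the volume path is Docker’s default): the archive has to be both created and extracted as root, and --numeric-owner keeps uid 1000 as uid 1000 instead of whatever name the runner happens to map it to.

```bash
# Create the archive as root so ownership survives at all (finding 3),
# and store numeric uid/gid instead of names (finding 4)
sudo tar --numeric-owner -czf volumes.tgz -C /var/lib/docker/volumes .

# Restore as root with the same flag; --same-owner is implied for root,
# spelled out here only for clarity
sudo tar --numeric-owner --same-owner -xzf volumes.tgz -C /var/lib/docker/volumes
```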

So, with all this information, what is needed to cache Docker volumes on GitHub Actions and restore them properly? Let’s see (steps 1-3 are sketched as shell commands right after the list):

  1. Set +x permission on /var/lib/docker
  2. Set +rx permission on /var/lib/docker/volumes
  3. Set u+s permission on tar
  4. Use tar --numeric-owner to create the archive — oh wait, you can’t because actions/cache doesn’t let you (╯°□°)╯︵ ┻━┻(╯°□°)╯︵ ┻━┻(╯°□°)╯︵ ┻━┻(╯°□°)╯︵ ┻━┻
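
Putting steps 1 through 3 into shell, here is a minimal sketch of what to run on the runner before saving or restoring the cache (paths assume a stock Docker install and GNU tar at /usr/bin/tar):

```bash
sudo chmod +x /var/lib/docker            # step 1: allow traversing into the Docker data dir
sudo chmod +rx /var/lib/docker/volumes   # step 2: allow listing and entering the volumes dir
sudo chmod u+s /usr/bin/tar              # step 3: setuid root so tar can restore ownership
```

The setuid bit is where finding 5 pays off: tar invoked by the non-root runner user effectively runs as root, so it can create and restore files with their original owners.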

Side quest: Hacking tar on GitHub Actions

Once I realized that I had to change the options passed to tar, I very reluctantly decided to “wrap” the actual tar executable:

```bash
sudo cp /usr/bin/tar /usr/bin/tar.orig
sudo echo 'exec tar.orig --numeric-owner -p --same-owner "$@"' > /usr/bin/tar
```

Oh, but wait, you cannot sudo-redirect output to a file: sudo just runs the command, while the redirection is done by your shell, which is not running as root. Let’s try that again:

```bash
sudo cp /usr/bin/tar /usr/bin/tar.orig
echo 'exec /usr/bin/tar.orig --numeric-owner -p --same-owner "$@"' | sudo tee /usr/bin/tar > /dev/null
```

Once I added this monstrosity, my GitHub Actions runs… started to hang indefinitely. Can you see the issue? ಠಿ_ಠ Well, I couldn’t. I spent about 2 hours trying to figure out why this was happening. I suspected exec might be the culprit, and when I removed it, the runs at least started crashing with an error: cannot fork. What? Well, see, I was doing this in both my restore and save actions. So, when the restore action ran, it wrapped/replaced tar but never put the original back. Some time later, the save action ran and tried to do the same. Now remember our “Bonus 2” learning from above: when save also backed up tar (which was by then my wrapper script) to /usr/bin/tar.orig, mv didn’t even flinch that tar.orig already existed. Now I had 2 copies of my wrapper script, where the second one just execed itself. Nice fork bomb there, me[4].

A smiling bomb with a fork stuck to it.
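
One way to defuse it (a sketch of the idea rather than the exact code that ended up in the action) is to make the wrapping idempotent and always put the original binary back once the cache step is done:

```bash
# Only back up tar if a backup doesn't already exist,
# so the wrapper is never copied over the real binary
if [ ! -f /usr/bin/tar.orig ]; then
  sudo cp /usr/bin/tar /usr/bin/tar.orig
  echo 'exec /usr/bin/tar.orig --numeric-owner -p --same-owner "$@"' | sudo tee /usr/bin/tar > /dev/null
fi

# ... run the actions/cache restore or save step here ...

# Put the real tar back so later steps (and the next wrap) start from a clean slate
sudo mv /usr/bin/tar.orig /usr/bin/tar
```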

Once the fork bomb was defused, I was able to run actions/cache and voilà! My volumes were cached and restored properly. Space time is saved, Marty!

Final boss

After all this, I was still not very happy, as it doubled every actions/cache call in my workflow, with the same hack repeated in both halves. So I decided to create a GitHub Action that would contain the chaos, the madness, the fork bomb minefield, and all the other ugliness, hiding it both from my sight and from others’. Please enjoy BYK/docker-volume-cache-action and cache responsibly.

A repeated CI run which took about 13 minutes versus 16 minutes without the cache.

Footnotes

  1. That said, all images for this article were generated by DeepAI Image Generator

  2. Looking at you, confluentinc/cp-kafka

  3. Yes, yes, there are even more. Can you believe it? I couldn’t either. But I digress.

  4. Me when I realized this: mother forking shirt balls!
