
Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

Docker Ate 56 GB of Disk in a Day: Building a Cleanup Automation

"no posts for hours" — the message I got

I noticed it in the evening — my hourly content-generate cron hadn't completed a single successful run since morning. The pipeline-health monitor hadn't fired its state-change email yet (the 4-hour threshold hadn't been hit), but the GitHub Actions panel was bright red.

The last successful run finished at 2026-05-04 12:11 UTC. More than 5 hours had passed. Zero new content. The single most common reason this blog goes down is resource starvation — disk or RAM. I quickly figured out which one.

A line that jumped out at me from the run log:

##[error] System.IO.IOException: No space left on device
  : '/home/github-runner/runner-mustafaerbay/_diag/pages/...log'

The runner couldn't write its own log file — no space on disk. At that point it hadn't even reached the validate step; the runner's own _diag layer was dead. Each cron tick retried and blew up at the exact same place.

SSH into the VPS and see

$ ssh vps 'df -h /'
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        72G   72G   11M 100% /

72 GB disk with 11 MB free. You can't even write a single log line.
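This is exactly the kind of check worth scripting before the disk hits 100%. A minimal sketch of a threshold alert; the df line below is the one captured above, pasted in so the parsing is reproducible, and on a live host you would feed it `df -h / | tail -1` instead:

```shell
# Alert when root-filesystem usage crosses a threshold.
# df_line is the output captured above; on a live host: df_line=$(df -h / | tail -1)
df_line='/dev/sda1        72G   72G   11M 100% /'

# Field 5 of df's output is the Use% column; strip the percent sign.
usage=$(echo "$df_line" | awk '{ gsub(/%/, "", $5); print $5 }')

if [ "$usage" -ge 90 ]; then
  echo "disk usage at ${usage}% - time to clean up"
fi
```

Run from a cron or timer, this would have flagged the problem hours before the runner started failing.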

The second query was more interesting:

$ ssh vps 'sudo du -hx --max-depth=2 / 2>/dev/null | sort -hr | head -10'
72G  /
54G  /var
39G  /var/lib
15G  /var/www
7.2G /home
5.3G /home/github-runner
4.0G /usr
2.9G /opt
2.6G /usr/lib
2.3G /opt/mustafaerbay

/var/lib was 39 GB. /var/www was 15 GB. This isn't a personal blog VPS — it has 6 different projects on it. My eye went straight to /var/lib because that's where Docker lives.

$ ssh vps 'sudo du -hx --max-depth=1 /var/lib | sort -hr | head'
39G  /var/lib
38G  /var/lib/docker  <- HERE
169M /var/lib/dkms
164M /var/lib/Acronis
140M /var/lib/apt
6.3M /var/lib/mustafaerbay

Docker on its own: 38 GB.

Crack open Docker's internals

$ ssh vps 'sudo docker system df'
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          33        9         27.5GB    23.27GB (84%)
Containers      13        13        1.192MB   0B (0%)
Local Volumes   8         8         387MB     0B (0%)
Build Cache     388       0         7.695GB   7.695GB

This table answers the question directly:

  • 33 images exist, only 9 are active. 24 are "not in use but not deleted." 23.27 GB reclaimable.
  • 388 build cache layers, 0 active. The whole 7.7 GB is up for deletion.
  • Containers and volumes are normal — I don't want to wipe those (postgres data, etc., lives there).

Total reclaimable: ~31 GB. Just freeing that would open up enough space.
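That ~31 GB figure can be cross-checked straight from the table. A small awk sketch; the table is pasted as a heredoc here so the arithmetic is reproducible, and on a live host you would pipe `sudo docker system df` into the same awk:

```shell
# Sum the GB values in the RECLAIMABLE column of `docker system df`.
reclaimable=$(awk 'NR > 1 {
    # RECLAIMABLE is the last field, unless a "(NN%)" suffix follows it
    v = ($NF ~ /%/) ? $(NF-1) : $NF
    # Only count GB-sized entries; the B/MB ones are noise here
    if (v ~ /GB$/) { sub(/GB/, "", v); sum += v }
} END { printf "%.1f", sum }' <<'EOF'
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          33        9         27.5GB    23.27GB (84%)
Containers      13        13        1.192MB   0B (0%)
Local Volumes   8         8         387MB     0B (0%)
Build Cache     388       0         7.695GB   7.695GB
EOF
)
echo "${reclaimable} GB reclaimable"
```

23.27 GB of images plus 7.7 GB of build cache lands at 31.0 GB, matching the estimate.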

See the running containers first, then cut

Don't rush. docker system prune -a is a blunt instrument: in one shot it removes every stopped container, every unused network, every image no container uses, and the whole build cache, so you have to know exactly where the line between running and not running sits. I checked the docker ps output: half a dozen different projects had containers on this VPS, a few of my own side products plus some client work. 13 healthy containers in total: postgres, redis, Next.js apps, an Astro SSR service, and a few workers. The only safe reclaim targets were unanchored images, older image versions not referenced by any running container.
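To see which images those are, you can diff "all images" against "images a running container is using". A sketch with illustrative sample data standing in for the two docker listings; on a live host the heredocs would be `docker images --format '{{.Repository}}:{{.Tag}}' | sort -u` and `docker ps --format '{{.Image}}' | sort -u` (note that docker ps can report image IDs or digests instead of tags, so treat this as a first pass, not gospel):

```shell
# Images with no running container anchored to them (prune candidates).
# Sample data stands in for the docker output; the names are made up.
all_images=$(sort -u <<'EOF'
myapp:v1
myapp:v2
postgres:16
redis:7
EOF
)
in_use=$(sort -u <<'EOF'
myapp:v2
postgres:16
redis:7
EOF
)
# comm -23: lines only in the first (sorted) input, i.e. images nothing runs from
unused=$(comm -23 <(printf '%s\n' "$all_images") <(printf '%s\n' "$in_use"))
echo "$unused"
```

Here only myapp:v1 falls out: an older build of a service whose v2 is what's actually running, which is exactly the class of image prune -a goes after.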

I lined up two safe commands:

# 1. Build cache (old layers, nothing uses them)
sudo docker builder prune -af

# 2. Unused images (the ones not anchored to a running container)
sudo docker image prune -af

The -a flag also removes tagged-but-not-dangling images. Risky? I don't think so — anything anchored to an active container won't be removed anyway (Docker keeps a reference count). Only the "once built, used, then a newer version came along" old images go.

The result:

=== Docker build cache ===
Total reclaimed space: 33.48GB

=== Docker unused images ===
Total reclaimed space: 22.62GB

=== After ===
/dev/sda1   72G   40G   33G  56% /

The two prunes reported 56 GB reclaimed in total, more than the ~31 GB docker system df had promised, because Docker counts a shared layer once for every image that uses it. On disk the real win was about 32 GB: usage dropped from 100% to 56%, with 33 GB free. All 13 containers kept running.

Now: let's automate this

This wasn't the first time it had happened, or even the second, so the old rule applies:

"If it happens twice, fix it by hand. The third time, automate it."

The disk-cleanup.sh script I wrote is simple but careful. A few principles:

#!/usr/bin/env bash
set -euo pipefail

echo "=== disk-cleanup starting ==="
echo "before: $(df -h / | tail -1)"

# 1) Docker build cache > 72h (newer cache survives)
echo "-- docker builder prune (>72h)"
docker builder prune -af --filter "until=72h"

# 2) Dangling docker images (no -a — tagged-but-unused IS PRESERVED)
echo "-- docker image prune (dangling only)"
docker image prune -f

# 3) journal > 7d
journalctl --vacuum-time=7d

# 4) APT cache (regenerable)
apt-get clean

# 5) mustafaerbay dist-old (deploy backup, regenerated each deploy)
[ -d /opt/mustafaerbay/dist-old ] && rm -rf /opt/mustafaerbay/dist-old

# 6) GitHub runner _diag log files > 14d (files only, LEAVE the directories alone)
find /home/github-runner -path '*/_diag/*' -type f -name '*.log' -mtime +14 -delete

echo "after:  $(df -h / | tail -1)"

Hooked it up to a daily timer running at 03:30 UTC:

[Timer]
OnCalendar=*-*-* 03:30:00
RandomizedDelaySec=10m
Persistent=true

RandomizedDelaySec=10m keeps it from colliding with any other 03:30 jobs that might be on the system. Persistent=true means that if the VPS was down or rebooting at 03:30, the missed run fires as soon as it comes back up.
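The timer on its own does nothing; it triggers a service unit of the same name. A minimal sketch of the matching unit; the unit path and script location here are assumptions for illustration, not copied from my setup:

```ini
# /etc/systemd/system/disk-cleanup.service  (path assumed)
[Unit]
Description=Daily disk and Docker cleanup

[Service]
Type=oneshot
ExecStart=/usr/local/bin/disk-cleanup.sh
```

After dropping both files in, `sudo systemctl daemon-reload && sudo systemctl enable --now disk-cleanup.timer` activates the schedule, and `systemctl list-timers disk-cleanup.timer` shows the next planned run.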

⚠️ Be careful with the -a flag

During the manual recovery I used docker image prune -af (with the -a flag) because it was an emergency and I needed space immediately. But the daily cron uses prune -f (no -a). Don't be that aggressive in an automated job; it will bite you one day. Use -a for manual cleanups, dangling-only for automation.

Conclusion: the one-line why

My disk filled up because Docker never cleans up after itself. Every docker-compose build produces a new image; the old one loses its tag to the new build but is never deleted. A few months in, a year in, your disk explodes and you go "wow, did the AI grow this much?"

Nobody's growing it, actually. Docker is a hoarder. If you're not active about it, the disk fire teaches you that.

A two-hour manual recovery plus a one-hour systemd timer setup equals a reasonable guarantee I won't go through this again. That's the real lesson: turn an incident into the defense against the next one.

Tomorrow disk-cleanup.timer will run for the first time. I'm watching.
