"no posts for hours" — the message I got
I noticed it in the evening — my hourly content-generate cron hadn't completed a single successful run since morning. The pipeline-health monitor hadn't fired its state-change email yet (the 4-hour threshold hadn't been hit), but the GitHub Actions panel was bright red.
The last successful run finished at 2026-05-04 12:11 UTC. More than 5 hours had passed. Zero new content. The single most common reason this blog goes down is resource starvation — disk or RAM. I quickly figured out which one.
A line that jumped out at me from the run log:
##[error] System.IO.IOException: No space left on device : '/home/github-runner/runner-mustafaerbay/_diag/pages/...log'
The runner couldn't write its own log file — no space on disk. At that point it hadn't even reached the validate step; the runner's own _diag layer was dead. Each cron tick retried and blew up at the exact same place.
SSH into the VPS and see
$ ssh vps 'df -h /'
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 72G 72G 11M 100% /
72 GB disk with 11 MB free. You can't even write a single log line.
The second query was more interesting:
$ ssh vps 'sudo du -hx --max-depth=2 / 2>/dev/null | sort -hr | head -10'
72G /
54G /var
39G /var/lib
15G /var/www
7.2G /home
5.3G /home/github-runner
4.0G /usr
2.9G /opt
2.6G /usr/lib
2.3G /opt/mustafaerbay
/var/lib was 39 GB. /var/www was 15 GB. This isn't a personal blog VPS — it has 6 different projects on it. My eye went straight to /var/lib because that's where Docker lives.
$ ssh vps 'sudo du -hx --max-depth=1 /var/lib | sort -hr | head'
39G /var/lib
38G /var/lib/docker <- HERE
169M /var/lib/dkms
164M /var/lib/Acronis
140M /var/lib/apt
6.3M /var/lib/mustafaerbay
Docker on its own: 38 GB.
Crack open Docker's internals
$ ssh vps 'sudo docker system df'
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 33 9 27.5GB 23.27GB (84%)
Containers 13 13 1.192MB 0B (0%)
Local Volumes 8 8 387MB 0B (0%)
Build Cache 388 0 7.695GB 7.695GB
This table answers the question directly:
- 33 images exist, only 9 are active. 24 are "not in use but not deleted." 23.27 GB reclaimable.
- 388 build cache layers, 0 active. The whole 7.7 GB is up for deletion.
- Containers and volumes are normal — I don't want to wipe those (postgres data, etc., lives there).
Total reclaimable: ~31 GB. Just freeing that would open up enough space.
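A handy extra (I didn't need it during the incident, but it's plain Docker CLI): docker system df also takes a Go-template --format flag, so you can pull just the reclaimable column for a quick check or a monitoring one-liner.
# print only the type and the reclaimable amount
$ ssh vps 'sudo docker system df --format "{{.Type}}: {{.Reclaimable}}"'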
See the running containers first, then cut
Don't rush. docker system prune -a wipes everything unused in one pass, so you need to know exactly where the line runs between images that are in use and images that aren't. I checked the docker ps output: half a dozen different projects had containers on this VPS, a few of my own side products and some client work. 13 healthy containers in total: postgres, redis, Next.js apps, an Astro SSR service, and a few workers. The only things that can be reclaimed are the unanchored images, the older image versions not referenced by any running container.
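If you want to be extra sure before cutting, a quick sanity check with plain Docker CLI shows which image IDs the running containers are actually anchored to, and what else is sitting on disk:
# image IDs pinned by a running container
$ ssh vps 'sudo docker ps -q | xargs sudo docker inspect --format "{{.Image}}" | sort -u'
# every image ID on disk, for comparison
$ ssh vps 'sudo docker images --no-trunc -q | sort -u'
# anything in the second list but not the first is what a prune is allowed to take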
I lined up two safe commands:
# 1. Build cache (old layers, nothing uses them)
sudo docker builder prune -af
# 2. Unused images (the ones not anchored to a running container)
sudo docker image prune -af
The -a flag also removes tagged-but-not-dangling images. Risky? I don't think so — anything anchored to an active container won't be removed anyway (Docker keeps a reference count). Only the "once built, used, then a newer version came along" old images go.
The result:
=== Docker build cache ===
Total reclaimed space: 33.48GB
=== Docker unused images ===
Total reclaimed space: 22.62GB
=== After ===
/dev/sda1 72G 40G 33G 56% /
The two prunes together reported about 56 GB reclaimed (Docker's per-command totals overlap, so they overshoot what the filesystem actually gains). On disk, usage went from 100% to 56%, leaving 33 GB free. All 13 containers kept running.
Now: let's automate this
This wasn't the first time it had happened, so let it be the lesson:
"The first two times, fix it by hand. The third time, automate it."
The disk-cleanup.sh script I wrote is simple but careful. A few principles:
#!/usr/bin/env bash
set -euo pipefail
echo "=== disk-cleanup starting ==="
echo "before: $(df -h / | tail -1)"
# 1) Docker build cache > 72h (newer cache survives)
echo "-- docker builder prune (>72h)"
docker builder prune -af --filter "until=72h"
# 2) Dangling docker images (no -a — tagged-but-unused IS PRESERVED)
echo "-- docker image prune (dangling only)"
docker image prune -f
# 3) journal > 7d
journalctl --vacuum-time=7d
# 4) APT cache (regenerable)
apt-get clean
# 5) mustafaerbay dist-old (deploy backup, regenerated each deploy)
[ -d /opt/mustafaerbay/dist-old ] && rm -rf /opt/mustafaerbay/dist-old
# 6) GitHub runner _diag log files > 14d (files only, LEAVE the directories alone)
find /home/github-runner -path '*/_diag/*' -type f -name '*.log' -mtime +14 -delete
echo "after: $(df -h / | tail -1)"
Hooked it up to a daily timer running at 03:30 UTC:
[Timer]
OnCalendar=*-*-* 03:30:00
RandomizedDelaySec=10m
Persistent=true
RandomizedDelaySec=10m keeps it from colliding with any other 03:30 jobs that might be on the system. Persistent=true means that if the VPS was down when the timer was due, the missed run still happens at the next boot.
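A systemd timer needs a matching .service unit; a minimal oneshot wrapper looks like this (the unit name follows the timer by convention, and the script's install path here is a placeholder, not copied from the VPS):
# /etc/systemd/system/disk-cleanup.service
[Unit]
Description=Prune Docker leftovers, old journals and caches

[Service]
Type=oneshot
ExecStart=/usr/local/bin/disk-cleanup.sh
Then reload and switch it on:
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now disk-cleanup.timer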
⚠️ Be careful with the -a flag
During the manual recovery I used docker image prune -af (with the -a flag) because it was an emergency and I needed space immediately. But for the daily cron I use prune -f (no -a). Don't be too aggressive in an automated run; you can get bitten one day. Use -a for manual cleanups, dangling-only for automation.
Conclusion: the one-line why
My disk filled up because Docker never cleans up after itself. Every docker-compose build creates a new image; the old one loses its tag but never gets deleted. A few months in, a year in, your disk explodes and you go "wow, did the AI grow this much?"
Nobody's growing it, actually. Docker is a hoarder. If you're not active about it, the disk fire teaches you that.
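You can watch the hoard pile up yourself: a rebuild that changes anything leaves the previous image behind as a <none>:<none> entry, and this lists them:
$ ssh vps 'sudo docker images --filter "dangling=true"'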
A two-hour manual recovery + a one-hour systemd timer setup = the guarantee I won't go through this again. That's the real lesson: turn an incident into the thing that prevents the next one.
Tomorrow disk-cleanup.timer will run for the first time. I'm watching.