DEV Community

FOLASAYO SAMUEL OLAYEMI

How I Recovered 35GB on a Production Server by Moving Docker Builds Off It

And why your server should never be your build machine

It started with a simple task: deploy a new service on a DigitalOcean droplet already running a dozen containerized apps. But the moment I SSHed in, the MOTD stopped me cold:

Usage of /: 100.0% of 154.88GB

100%. Not 92%. Not "you should look at this soon." A hundred percent. On a production server running over a dozen live services.

This is the story of how I diagnosed it, fixed it surgically without touching a single running service, and more importantly, how I changed the architectural pattern that caused it in the first place.

First, Understand the Terrain

Before running a single cleanup command, I did what every senior engineer does when something is wrong in production: I measured everything first.

docker system df -v

This is the command that tells you the truth. Not df -h alone, not vibes. It breaks down exactly how Docker is consuming your disk: images, containers, volumes, and build cache, with per-item detail.

Here's what came back:

Images space usage:     ~8GB total
Containers space usage: ~50MB
Local Volumes:          ~4.1GB (two anonymous volumes with 0 links)
Build cache usage:      29.22GB

The containers themselves? Barely anything. The images? Manageable. The build cache? 29.22 gigabytes.

That single number told me everything I needed to know about what had been happening on this server.
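To spot the biggest consumer without reading the whole table, the summary rows can be piped through a small filter. This is a minimal sketch, not something from the original investigation: the function name largest_docker_consumer is hypothetical, and it assumes sizes are printed with B, kB, MB, GB, or TB suffixes.

```shell
# Hypothetical helper: reads "Type|Size" lines on stdin and prints the largest entry.
largest_docker_consumer() {
  awk -F'|' '
    # Convert a human-readable size such as "29.22GB" to bytes (decimal units assumed).
    function to_bytes(s,   n, u) {
      n = s + 0                       # numeric prefix, e.g. 29.22
      u = s; gsub(/[0-9. ]/, "", u)   # unit suffix, e.g. GB
      if (u == "kB" || u == "KB") return n * 1e3
      if (u == "MB") return n * 1e6
      if (u == "GB") return n * 1e9
      if (u == "TB") return n * 1e12
      return n
    }
    { b = to_bytes($2); if (b > max) { max = b; name = $1; size = $2 } }
    END { if (name != "") printf "%s %s\n", name, size }
  '
}

# Usage on a Docker host:
#   docker system df --format '{{.Type}}|{{.Size}}' | largest_docker_consumer
```

With the numbers above, the filter would report the build cache row.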

What Is Docker Build Cache and Why Does It Silently Destroy You?

When Docker builds an image, it executes your Dockerfile instruction by instruction. Each instruction (RUN, COPY, FROM) produces a layer. Docker is smart: if a layer hasn't changed since the last build, it reuses the cached version instead of recomputing it.

This is what makes docker build fast on a developer machine or a CI runner. First build is slow. Second build? Docker says "I've seen this before" and skips to the parts that changed.

But here's the part nobody talks about enough: that cache lives on disk. Permanently. Until you explicitly remove it.

Every build you run on a server adds to it. Every time you push a new image and Docker re-resolves your base image layers, it caches them. Rebuild with a new Node.js version? Cached. Pull a new ubuntu:24.04? Cached. Run npm install after updating package.json? Every intermediate layer: cached.

Over weeks of active development across 10+ services, all building on the same server, you accumulate layers on top of layers, most of them orphaned by subsequent builds, none of them cleaned up automatically.

This is exactly what happened here. 29GB of silent accumulation.

The Deeper Problem: Your Server Is Not a Build Machine

When you build Docker images directly on the server that runs them, you are asking one machine to do two fundamentally different jobs simultaneously:

Job 1: Build environment. Compile code, resolve dependencies, execute multi-stage builds, cache intermediate layers, pull base images from registries.

Job 2: Runtime environment. Run containers reliably, serve real traffic, maintain uptime, stay lean and predictable.

These two jobs have opposing requirements.

A build environment benefits from cache: the more it stores, the faster subsequent builds are. A runtime environment suffers from cache: it's dead weight that grows without bound and competes with your running services for the disk space they need to function.

Putting both on the same machine is a design smell. It works, until it doesn't. And when it breaks, it breaks at 100% disk utilisation, which means everything on that machine is now at risk.

The Senior Move: Eliminate the Source, Not Just the Symptom

The junior response to this situation is: "Run docker builder prune and move on."

That would have recovered the 29GB. But three weeks later, the same builds would have filled it back up. You'd be doing this every month. Reactive. Whack-a-mole.

The senior response is: "Why is this cache here in the first place, and how do I make sure it can never accumulate to this scale again?"

The answer: move the build off the server entirely.

For one of our Next.js services, I set up a GitHub Actions pipeline that:

  1. Builds the Docker image in the CI runner (GitHub's infrastructure, not yours)
  2. Pushes the built image to GHCR (GitHub Container Registry)
  3. SSHes into the droplet and runs docker compose pull && docker compose up -d

The server never runs docker build. It only ever runs docker pull and docker run. It is a runtime environment again, which is exactly what it should be.
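The three steps can be sketched as a plain shell script, which is roughly what a CI job would execute. This is an illustration under stated assumptions, not the actual workflow file: the SSH target deploy@your-droplet and the directory /srv/web-app are hypothetical placeholders, and the runner is assumed to be already logged in to ghcr.io.

```shell
set -eu

# Assumed values: replace with your own registry path, SSH target, and directory.
IMAGE="${IMAGE:-ghcr.io/your-org/your-app}"
DEPLOY_HOST="${DEPLOY_HOST:-deploy@your-droplet}"   # hypothetical SSH target
APP_DIR="${APP_DIR:-/srv/web-app}"                  # hypothetical directory holding docker-compose.yml

# Steps 1 and 2: build on the CI runner, then push to the registry.
build_and_push() {
  docker build -t "$IMAGE:latest" .
  docker push "$IMAGE:latest"
}

# Step 3: the server only pulls and restarts; it never builds.
deploy() {
  ssh "$DEPLOY_HOST" "cd '$APP_DIR' && docker compose pull && docker compose up -d"
}

# In a CI job this would run as:
#   build_and_push && deploy
```

The point of splitting it this way is that every docker build invocation lives in build_and_push, which only ever runs on the runner.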

The docker-compose.yml reflects this cleanly:

services:
  web-app:
    image: ghcr.io/your-org/your-app:latest
    container_name: web-app
    ports:
      - "4700:4700"
    restart: unless-stopped
    env_file:
      - .env
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://127.0.0.1:4700"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

No build: key. No context:. No Dockerfile reference. The server simply pulls a pre-built, verified image from the registry and runs it. The build machine (GitHub Actions runner) does the heavy lifting and discards its own cache after each job. The server's disk never sees a single build layer.
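One way to keep it that way is a small guard that fails when a compose file still declares a build section. This is a hypothetical helper, and it is a plain text match rather than a YAML parse, so treat it as a rough check.

```shell
# Hypothetical guard: succeeds only if the compose file has no "build:" key.
compose_has_no_build() {
  if grep -Eq '^[[:space:]]*build:' "$1"; then
    echo "build: key found in $1" >&2
    return 1
  fi
}

# Usage:
#   compose_has_no_build docker-compose.yml
```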

The Cleanup — Surgical, Not Nuclear

Once I had the architectural change in place, I cleaned up what had accumulated. The key here was being deliberate about what I removed and what I left alone. There are over a dozen services running on this server. The wrong command would have meant downtime.

Here's the actual order of operations:

Step 1: Build cache, the biggest win, zero risk

docker builder prune -f

This removes only the build cache. It touches nothing else: no running containers, no images, no volumes. Every service kept running. Freed: 29.22GB.

Worth understanding: the next time each remaining server-built service rebuilds, it will be slightly slower because the cache is gone. That's the only consequence. After the first post-prune build, cache starts accumulating again and speed returns.
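To confirm at the filesystem level what a cleanup command actually freed, it can be wrapped in a before-and-after measurement. A minimal sketch, assuming POSIX df and a single root filesystem; the helper names avail_kb and report_freed are hypothetical.

```shell
# Available space on / in 1K blocks (POSIX df output, second line, fourth column).
avail_kb() { df -Pk / | awk 'NR == 2 { print $4 }'; }

# Runs the given command and prints how much space became available afterwards.
report_freed() {
  before=$(avail_kb)
  "$@"
  after=$(avail_kb)
  echo "freed_kb=$((after - before))"
}

# Usage:
#   report_freed docker builder prune -f
```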

Step 2: Journal logs, the second biggest win

journalctl --vacuum-time=3d

The journald logs had grown to 4.0GB. This command keeps the last 3 days and removes everything older. No service is affected: logs are still being written; you're just trimming history.
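The journal size can be checked before and after with journalctl --disk-usage. The sketch below pulls just the size out of that command's one-line summary. The function name journal_size is hypothetical, and the exact wording of the summary line is assumed to contain "take up <size>", which may differ between systemd versions.

```shell
# Hypothetical helper: prints only the size from the journalctl --disk-usage summary.
journal_size() {
  journalctl --disk-usage | sed -n 's/.*take up \([^ ]*\) .*/\1/p'
}

# Usage (vacuuming usually needs root):
#   journal_size
#   sudo journalctl --vacuum-time=3d
#   journal_size
```

For a standing limit rather than a one-off trim, journald also accepts a SystemMaxUse= setting in journald.conf.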

Step 3: The old local image, the obvious orphan

docker rmi your-app-local:latest

docker system df -v showed a locally-built image consuming 1.554GB with 0 running containers. This was the artifact from the old way, back when the server was still building the image locally. It was superseded the moment the pipeline took over and GHCR became the source of truth. Dead weight.
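Candidates like this can also be found by comparing the list of local images against the images referenced by any container, running or stopped. A minimal sketch, with a hypothetical function name and two caveats: containers may record their image by ID or without a tag, so the output is a list to review, not a list to delete.

```shell
# Hypothetical helper: prints lines of file $1 (all images) that are absent
# from file $2 (images referenced by containers). One image reference per line.
unused_images() {
  grep -Fvx -f "$2" "$1" || true
}

# Usage on a Docker host (-a includes stopped containers):
#   docker images --format '{{.Repository}}:{{.Tag}}' > all.txt
#   docker ps -a --format '{{.Image}}' > used.txt
#   unused_images all.txt used.txt
```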

Step 4: Anonymous volumes, verify before you delete

Two anonymous volumes had 0 LINKS, meaning no container was referencing them. Before removing them, I inspected both:

docker inspect <volume-id>

Both came back with "Containers": {}. Orphaned. Removed. Another ~4GB recovered.
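Because inspect output for a volume does not always list referencing containers, a second check is to ask Docker directly which containers mount each dangling volume. A minimal sketch using the dangling and volume filters; the function name audit_dangling_volumes is hypothetical.

```shell
# Prints each dangling volume with the number of containers (running or
# stopped) that reference it. A count of 0 supports treating it as orphaned.
audit_dangling_volumes() {
  docker volume ls -q --filter dangling=true | while read -r vol; do
    count=$(docker ps -a -q --filter "volume=$vol" </dev/null | wc -l | tr -d ' ')
    echo "$vol containers=$count"
  done
}

# Usage:
#   audit_dangling_volumes
```

A zero count says nothing about whether the data inside still matters, so looking at the contents before removal remains worthwhile.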

What I Did Not Touch

This is equally important.

  • docker system prune -a: not run. It removes every stopped container, and the -a flag then removes ALL images not used by a remaining container. Several services on this server are temporarily stopped but need their images intact for restart. This command would have deleted them.
  • docker volume prune: not run blindly. I inspected volumes individually first. One wrong deletion here and you're restoring a database from backup at 11pm.
  • Any running container: untouched throughout. Before and after, all services remained up.

The discipline of knowing what not to run matters as much as knowing what to run.

The Numbers

Cleanup action                               Space recovered
Build cache (docker builder prune -f)        29.22GB
Journal logs (journalctl --vacuum-time=3d)   ~4.0GB
Old local image (superseded by pipeline)     1.554GB
Orphaned anonymous volumes                   ~4.1GB
Total                                        ~38.9GB

Disk utilisation dropped from 100% to roughly 75%, and more critically, the architectural change means the build-cache portion can never grow back to that scale for the migrated service.
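The arithmetic behind those figures can be reproduced in one awk call. It assumes the disk was exactly full at 154.88GB and that the MOTD and Docker report sizes in the same units, which is an approximation.

```shell
# Sums the four cleanup rows and estimates utilisation after the cleanup.
cleanup_totals() {
  awk 'BEGIN {
    disk  = 154.88                       # GB, from the MOTD line
    freed = 29.22 + 4.0 + 1.554 + 4.1    # GB, the four cleanup rows
    printf "freed=%.2fGB used_after=%.1f%%\n", freed, (disk - freed) / disk * 100
  }'
}

cleanup_totals   # prints: freed=38.87GB used_after=74.9%
```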

Going Forward: Keep It From Happening Again

For services still building on the server (while I migrate them to pipeline builds), a weekly cron prevents silent accumulation:

# crontab -e
# Prune build cache older than 7 days, every Sunday at 3am
0 3 * * 0 docker builder prune --filter "until=168h" -f >> /var/log/docker-prune.log 2>&1

The --filter "until=168h" flag is the important detail here: it removes cache older than 7 days but preserves recent cache, so the next build isn't cold. It's maintenance, not surgery.
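A prune job handles the cache, but it does not tell you when the disk is filling up for some other reason. A small threshold check could run from the same crontab. This is a sketch, not part of the original setup: the function names and the 80% default are assumptions.

```shell
# Current usage of / as an integer percentage (POSIX df, "Capacity" column).
disk_usage_pct() { df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'; }

# Hypothetical check: returns non-zero when usage is at or above the threshold.
check_disk() {
  threshold="${1:-80}"
  pct=$(disk_usage_pct)
  if [ "$pct" -ge "$threshold" ]; then
    echo "WARN: / at ${pct}% (threshold ${threshold}%)"
    return 1
  fi
  echo "OK: / at ${pct}%"
}

# Usage:
#   check_disk 80
```

The non-zero exit status makes it easy to chain into whatever notification mechanism is already in place.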

The real fix, though, is finishing the migration. Every service that moves to pipeline builds is a service whose build artefacts never touch the production server again. That's the direction everything is heading.

The Takeaway

If there's one principle to extract from all of this, it's this:

A production server's only job is to run containers. Any server that is also building containers is doing two jobs, and eventually, those jobs will conflict.

Build cache is invisible until it's not. It accumulates quietly, efficiently, and with the best of intentions, making your builds faster. But on a shared production server, it's borrowing disk space that belongs to your running services, and it will keep borrowing until there's nothing left.

The fix isn't a cleanup script on a cron job. It's a CI/CD pipeline that builds in an ephemeral environment, pushes to a registry, and lets your server do the one thing it's actually there to do.

Diagnose first. Operate surgically. Fix the root cause, not the symptom. And never let your server be its own build machine.

If you found this useful, I create content about DevOps, infrastructure, and backend engineering on YouTube, Hashnode and Dev.to. Follow along if this is the kind of problem-solving you want more of.
