I Locked Up the Server Because of Docker: A Lesson in Trust and

#docker #performance

I've run into countless problems in my career, and more often than not it was a simple oversight, not the complexity of the solution, that got me into the most trouble. Once, while running the backend of a side product on Docker, I completely locked up the server because of one small detail. At that moment I thought, "this can't be happening," but the experience ended up teaching me a lot.

When I made this mistake, I was trying to bring up a critical service on my own VPS. Docker's flexibility and ease of use had won me over, but I realized too late that I had overlooked how it behaves underneath, and that cost me dearly.

How It All Started: My Trust in Docker

For years, I worked on bare-metal servers, in a setup where every service had its own dedicated systemd unit, and everything was manually optimized. Then Docker entered my life, with the allure of containerizing everything. It provided tremendous convenience in the development environment; with a single docker-compose up -d command, all dependencies would come up. This comfort led me to think, "how bad could it be in production?"

I was developing a backend service for my own side product. It was a simple architecture: an API written with FastAPI, a PostgreSQL database, and Redis for caching. Each component ran in its own Docker container. Tests were going great, resource consumption seemed reasonable, and everything was working perfectly. The system ran smoothly for a few months, which gave me an unwarranted sense of confidence.

That Critical Night: Why the Server Didn't Respond

One Friday night, at around 11:47 PM, I was woken by a series of alerts from my monitoring system. At first I thought it was a simple network outage, but when I tried to connect to the server via SSH, I got no response. Checking through the web interface, I saw that the server was completely frozen: CPU usage was at 100%, all memory was consumed, and disk I/O was maxed out. It looked like a DDoS attack, except the load was coming from inside the server.

After much effort, I managed to restart the server. The first thing I did was check journalctl -xe. What I found was system logs filling up at hundreds of lines per second. Docker's logs, in particular, seemed to have taken over the entire system. Out-of-memory (OOM-killed) messages, disk-full warnings, and countless I/O errors... It was utter chaos.

⚠️ Moment of Panic

In that moment of panic, I focused only on getting the system back up. I thought the time I'd spend finding the real root cause would increase the service's downtime. This isn't always a good strategy; sometimes, pausing to think brings faster and more permanent solutions.

Who Was the Culprit? Docker's Silent Theft

After the restart, I ran docker system df and docker volume ls. What I saw was this: the data managed by Docker's overlay2 storage driver had consumed terabytes of space. So many container images and intermediate layers had accumulated that the disk was completely full. On top of that, container logs are stored on disk by default by the json-file driver, which does no rotation unless it is configured, and these logs had grown to enormous sizes.
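One way to see which containers are responsible for oversized logs is to list the largest json-file logs directly. The following is a minimal sketch, not the procedure used during the incident. It assumes Docker's default data root, so the /var/lib/docker/containers path is an assumption (a custom data-root changes it), and reading that directory normally requires root. The function names are hypothetical.

```python
import os
from pathlib import Path


def largest_json_logs(root: str = "/var/lib/docker/containers", top: int = 5):
    """Return the `top` biggest json-file logs under `root` as (size_bytes, path)."""
    # os.path.isdir returns False on any OS error, including missing permissions.
    if not os.path.isdir(root):
        return []
    logs = []
    # The json-file driver writes <container-id>/<container-id>-json.log,
    # plus rotated siblings such as ...-json.log.1 once rotation is enabled.
    for log_file in Path(root).glob("*/*-json.log*"):
        try:
            logs.append((log_file.stat().st_size, str(log_file)))
        except OSError:
            # The file vanished or is unreadable; skip it.
            continue
    return sorted(logs, reverse=True)[:top]


def human(size_bytes: int) -> str:
    """Format a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    size = float(size_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


if __name__ == "__main__":
    # Prints nothing if the directory is missing or not readable.
    for size, path in largest_json_logs():
        print(f"{human(size):>12}  {path}")
```

The directory name in each path is the container ID, which can be matched against the output of docker ps -a to find the noisy container.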

The problem stemmed from my neglecting to run cleanup commands such as docker system prune or docker volume prune on a regular basis. I used these commands frequently in the development environment, but I hadn't touched them in production "because it was working stably." That let Docker's default configuration silently consume resources over time. And because I hadn't set strict cgroup limits, a single container stuck in an endless loop and spewing logs was able to strain the entire system.

# A simplified command output summarizing my situation at the time
$ docker system df
TYPE                TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images              23        1         3.5GB     2.8GB (80%)
Containers          12        1         43GB      43GB (100%)
Local Volumes       27        1         87GB      87GB (100%) # This part exploded
Build Cache         0         0         0B        0B

The Local Volumes value above was much larger in my case and was the main problem filling my disk. As someone who works with production ERPs, I was angry at myself for this simple oversight, despite knowing how critical disk space is.
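A disk filling up this slowly is something a very small check can catch long before the server freezes. Below is a minimal sketch using only the Python standard library. The 80% threshold, the function names, and the choice of "/" as the path to watch are all assumptions for illustration; on a real host the path would be whichever filesystem holds Docker's data directory.

```python
import shutil


def used_percent(total: int, used: int) -> float:
    """Share of a filesystem that is in use, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def check_disk(path: str = "/", warn_at: float = 80.0):
    """Return a warning string if `path` is at least `warn_at` percent full, else None."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    percent = used_percent(usage.total, usage.used)
    if percent >= warn_at:
        return f"WARNING: {path} is {percent:.1f}% full"
    return None


if __name__ == "__main__":
    message = check_disk("/")
    if message:
        print(message)
```

Run from cron or a systemd timer and wired to whatever alerting is already in place, a check like this turns a silent multi-month leak into an early warning. The percentage can differ slightly from what df reports, because df accounts for blocks reserved for root.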

Lessons Learned and Next Steps

After this incident, I seriously reevaluated my Docker usage. The first thing I did was stop relying on the unbounded json-file default. Where possible, I pointed containers at syslog or journald, so that logs flow to a central system with its own rotation instead of piling up in Docker's data directory. For everything that stayed on json-file, I set max-size and max-file limits:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

By adding these settings to the daemon.json file, I capped the log size of each container. I also scheduled docker system prune -a and docker volume prune as regular cron jobs. Most importantly, I assigned CPU and memory (cgroup) limits to each container, so that a single runaway container can no longer consume resources unchecked and lock up the entire system.
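To make the per-container limits concrete, here is a minimal sketch that assembles a docker run command line with memory, CPU, and log caps. The flags themselves (--memory, --cpus, --log-driver, --log-opt) are standard docker run options, but the function name, the default values, and the redis:7 image and container name in the usage lines are placeholders chosen for illustration, not the values from the setup described here. The sketch only builds and prints the command; it does not start anything.

```python
import shlex


def limited_run_args(image: str, name: str, memory: str = "512m", cpus: str = "1.0",
                     log_max_size: str = "10m", log_max_file: int = 3) -> list:
    """Build a `docker run` argument list with resource and log limits."""
    return [
        "docker", "run", "-d",
        "--name", name,
        "--memory", memory,            # hard memory cap enforced through cgroups
        "--cpus", cpus,                # at most this many CPUs' worth of time
        "--log-driver", "json-file",
        "--log-opt", f"max-size={log_max_size}",   # rotate once a log file reaches this size
        "--log-opt", f"max-file={log_max_file}",   # keep at most this many log files
        image,
    ]


if __name__ == "__main__":
    # Placeholder image and container name, for illustration only.
    print(shlex.join(limited_run_args("redis:7", "cache")))
```

With max-size set to 10m and max-file set to 3, each container's logs are bounded at roughly 30 MB, so even the 12 containers in the earlier docker system df output would stay around 360 MB in total. Note that log defaults in daemon.json take effect only after the Docker daemon is restarted and only for containers created afterwards; existing containers keep their old logging configuration until they are recreated.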

ℹ️ An Important Reminder

Although Docker provides convenience, it's important to remember that default settings are not always the best option in a production environment. Resource limits, log management, and regular cleanup are indispensable for a stable system.
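The regular cleanup mentioned above can be sketched as a small script suitable for a cron job. This is one possible shape under stated assumptions, not the exact job used here: the function names are hypothetical, the seven-day "until" filter is an added safety margin that the original prune commands did not have, and volume pruning is opt-in because it deletes the data in unused volumes. The --force flag matters in a scheduled job, since the prune commands otherwise wait for interactive confirmation.

```python
import subprocess


def prune_commands(keep_hours: int = 168, include_volumes: bool = False) -> list:
    """Return the cleanup commands as argument lists, without running them."""
    commands = [
        # Remove stopped containers, unused networks, unused images and build
        # cache, but only items older than `keep_hours`.
        ["docker", "system", "prune", "--all", "--force",
         "--filter", f"until={keep_hours}h"],
    ]
    if include_volumes:
        # Destructive: data in volumes not used by any container is deleted.
        commands.append(["docker", "volume", "prune", "--force"])
    return commands


def run_cleanup(commands: list, dry_run: bool = True) -> int:
    """Print the commands (dry run) or execute them; return how many were handled."""
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return len(commands)


if __name__ == "__main__":
    # Dry run by default, so nothing is deleted until dry_run=False is passed.
    run_cleanup(prune_commands())
```

Keeping the default as a dry run makes it easy to review what a scheduled job would do before letting it delete anything on a production host.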

This experience showed me once again that no matter how advanced automation becomes, system administrators and architects must not overlook fundamental principles. The "ready-made, install and run" mentality can lead to major problems, especially in critical systems.

Do you have a similar experience where you thought, "was it really due to such a simple mistake?" Please share in the comments!