Introduction: The Hidden Danger of Resource-Hog Containers
On my own VPS or in client projects, I've always relied on the flexibility and ease of deployment that Docker containers offer. However, this convenience sometimes comes with an overlooked risk: resource consumption. An uncontrolled container hogging CPU, memory, or disk I/O can destabilize the entire system. When this situation affected my other critical services and led to unexpected outages, I once again understood how vital detection and intervention mechanisms are.
Recently, I noticed a system-wide slowdown on the VPS I use for the backend of one of my side products. Even my SSH sessions were responding with a delay. At first I suspected a network issue, but on closer inspection I realized the real problem was one of the containers. In this post, I'll walk through step by step what I do in scenarios like this: how I detect resource-hog containers and how I apply limits to them.
Recognizing the Symptoms: When Does a Container Hog Resources?
There are several common symptoms indicating that a container is excessively consuming resources. Early detection of these signs is critical to prevent bigger problems. I usually start ringing the alarm bells in the following situations:
- System-Wide Slowdown: The entire server becomes unresponsive, commands execute slowly, and network connections lag.
- Application Errors or Delays: Applications I'm running (e.g., operator screens in a production ERP) respond slower than expected or generate `timeout` errors.
- OOM-Killed Processes: Seeing `Out of Memory` (OOM) killer messages in `journald` logs or `dmesg` output, which are usually triggered by memory exhaustion.
- High Disk I/O: Disk activity significantly exceeds normal levels, and a noticeable increase is visible in `iostat` output. This is particularly common in containers that write a lot of logs or perform intensive database operations.
- Increased Error Rates: The error rates of a service behind an `Nginx` reverse proxy rise because the backend container cannot respond to incoming requests in time.
ℹ️ Error Logs Are Crucial
One of the first places I look when there's an issue is the `journald` logs. Examining detailed logs with the `journalctl -xe` command can provide important clues about which processes the OOM Killer terminated or why the system slowed down.
Methods for Detecting Resource Consumption
When I suspect a container is hogging resources, I first check the overall system status, then delve into the details with Docker-specific tools. This step-by-step approach makes it easier for me to find the root cause of the problem.
System-Level Checks
My first checks upon connecting to the server are:
- `top` or `htop`: These tools show the real-time status of CPU, memory, and running processes. They are excellent for quickly seeing which processes are consuming the most resources. `htop` is more interactive and easier to read thanks to its colorful interface.

```bash
# top command
top

# htop command (you might need to install it: sudo apt install htop)
htop
```

In the `top` output, I pay attention to the `%CPU` and `%MEM` columns. Abnormally high values point to a potential source of problems (mapping such a process back to its container is shown right after this list).

- `free -h`: Displays memory usage in a human-readable format. The `total`, `used`, `free`, `buff/cache`, and `available` columns are the important ones.

```bash
free -h
```

A drop in the `available` value to critical levels is a sign that the system will soon start using swap or that the OOM Killer will intervene.

- `iostat -xz 1`: Shows disk I/O activity. I particularly look at the `await`, `%util`, `r/s`, and `w/s` values. A high `%util` value indicates that the disk is excessively busy.

```bash
iostat -xz 1
```

Unusually high `r/s` (read requests per second) and `w/s` (write requests per second) values can point to an application hammering the disk. I once identified a logging container that was completely saturating disk I/O with this command.

- `vmstat 1`: I use this to monitor virtual memory statistics and overall system activity. The `r` (running processes), `b` (blocked processes), `swpd` (used swap), `free` (free memory), `si` (swap in), and `so` (swap out) columns are the important ones.

```bash
vmstat 1
```

Constantly high `si` and `so` values indicate that the system is running out of memory and swapping heavily to disk, which severely degrades performance.
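When `top` or `ps` points at a hungry process but it isn't obvious which container it belongs to, the process's cgroup path reveals the container ID. A small sketch of that lookup — the PID `12345` and the final `grep` argument are placeholders:

```bash
# Show the most memory-hungry processes on the host
ps aux --sort=-%mem | head -n 5

# The cgroup path of a host PID contains the 64-character container ID
grep -o -E '[0-9a-f]{64}' /proc/12345/cgroup | head -n 1

# Resolve that ID back to a container name
docker ps --no-trunc --format '{{.ID}}\t{{.Names}}' | grep <container_id>
```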
Docker and Container-Level Monitoring
If general system checks indicate a problematic container, I turn directly to Docker tools:
- `docker stats`: Displays real-time CPU, memory, network I/O, and disk I/O usage for a specific container or for all containers. This command is the fastest way to pinpoint the resource hog.

```bash
docker stats
```

In the output, I focus on the `CPU %`, `MEM %`, `MEM USAGE / LIMIT`, and I/O columns. A container whose `MEM USAGE` value is approaching or exceeding its `LIMIT` is a sign that I need to intervene immediately.

- `docker inspect <container_id_or_name>`: Shows detailed configuration information for a container, especially the cgroup settings under `HostConfig`. This is important for understanding what limits are defined for the container.

```bash
docker inspect my-problematic-container | grep -i "memory\|cpu"
```

With this command, I can see settings like `Memory`, `CpuShares`, and `CpuQuota` defined for the container. If these settings have not been made, the container has the potential for unlimited resource consumption.

- `journalctl -u docker.service`: I examine the Docker daemon's own logs. I can find messages here about the OOM Killer terminating containers.

```bash
journalctl -u docker.service --since "1 hour ago" | grep -i "oom"
```

Sometimes in these logs I see `OOM` errors during a `build`, or a container that keeps restarting.

- OOM Events in Kernel Logs: Examining the kernel logs directly can also be useful.

```bash
grep -i "oom" /var/log/kern.log
# or
dmesg | grep -i "oom"
```

These logs show more clearly which memory issues occurred system-wide and which processes were targeted by the OOM Killer.
Cgroup Mechanism and Container Resource Limits
The Linux kernel's cgroup (control group) mechanism is a fundamental structure for managing and monitoring resource usage (CPU, memory, disk I/O, network) of process groups. Docker uses these cgroups to apply resource limits to containers. This means that the CPU or memory limits we set with Docker commands are actually passed to the kernel as cgroup settings in the background.
When we impose a limit on a container, we are essentially defining specific boundaries for the cgroup to which that container's processes belong. This ensures that no matter how aggressive a container is, it cannot exceed the defined limits and affect other system resources. In a production ERP system, I had to meticulously adjust cgroup limits to prevent an AI-driven production planning service from consuming uncontrolled amounts of memory and impacting other critical services.
💡 Cgroup File System
On Linux systems, you can find the `cgroup` virtual file system under `/sys/fs/cgroup`. For example, Docker's memory limits can be seen at `/sys/fs/cgroup/memory/docker/<container_id>/memory.limit_in_bytes` (on cgroup v1). Manually inspecting these files is useful for verifying whether limits are truly being applied.
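On newer distributions that have already switched to cgroup v2, the same information lives under a different path. A minimal verification sketch, assuming a container named `my-problematic-container` and the usual drivers (cgroupfs on v1, systemd on v2) — adjust the paths to whatever exists on your host:

```bash
CONTAINER_ID=$(docker inspect -f '{{.Id}}' my-problematic-container)

# cgroup v1: hard memory limit in bytes
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.limit_in_bytes 2>/dev/null

# cgroup v2 (systemd cgroup driver): "max" means no limit is set
cat /sys/fs/cgroup/system.slice/docker-$CONTAINER_ID.scope/memory.max 2>/dev/null
```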
Applying Limits to a Resource-Hog Container
The next step after identifying a resource-hog container is to apply appropriate limits to it. Docker allows us to define various resource limits using the `docker run` or `docker update` commands.
Memory Limits
Memory limits are one of the most critical settings for controlling how much RAM a container can use.
- `--memory` (or `-m`): Specifies the maximum amount of memory the container can use. This is a hard limit: if the container exceeds it, the OOM Killer terminates it.

```bash
docker run -d --name my-app-limited --memory "512m" my-image
```

This command allows the `my-app-limited` container to use at most 512 MB of RAM.

- `--memory-swap`: Used in conjunction with `--memory`. It determines the total memory (RAM + swap space) the container can use. If `--memory-swap` is greater than `--memory`, the container can use swap equal to the difference. If `--memory-swap` equals `--memory`, the container cannot use any swap. A value of `-1` means unlimited swap.

```bash
# 512MB RAM, 512MB swap (1GB total)
docker run -d --name my-app-swap --memory "512m" --memory-swap "1g" my-image

# 512MB RAM, no swap usage
docker run -d --name my-app-no-swap --memory "512m" --memory-swap "512m" my-image
```

I once saw a `Node.js` application completely fill the system's swap space because of a memory leak. Carefully setting the `--memory-swap` limit prevents such situations.

- `--memory-swappiness`: Controls the Linux kernel's `swappiness` setting at the container level (between 0 and 100). Lower values reduce swap usage, while higher values increase it.

```bash
docker run -d --name my-app-swappiness --memory "512m" --memory-swappiness 10 my-image
```

- `--memory-reservation`: This is a soft limit (the cgroup memory soft limit). The container can exceed this value while there is no memory pressure on the system, but the kernel tries to push it back down to the reservation level when the system needs memory.

```bash
docker run -d --name my-app-soft-limit --memory "1g" --memory-reservation "512m" my-image
```

This setting is very useful for absorbing sudden memory spikes in a container, for example when tuning connection pools for applications like `PostgreSQL`.
⚠️ Incorrect Limits Can Degrade Performance
Setting memory limits too low can lead to the application constantly being terminated by the OOM Killer or becoming excessively slow. To find the right limits, you need to thoroughly analyze how much memory the application uses under load.
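One practical way to do that analysis is to sample the container while it is under realistic load and size the limit from the observed peak. A rough sketch — `my-app-limited` is a placeholder name, and the 5-second/10-minute sampling window is arbitrary:

```bash
# Sample CPU and memory every 5 seconds for ~10 minutes and log it with a timestamp
for i in $(seq 1 120); do
  echo "$(date -Is) $(docker stats --no-stream --format '{{.CPUPerc}} {{.MemUsage}}' my-app-limited)" \
    >> /tmp/my-app-usage.log
  sleep 5
done
```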
Processor (CPU) Limits
CPU limits control how much processing power a container can use.
- `--cpus`: Directly specifies the number of CPU cores the container can use. For example, `1.5` means one and a half cores.

```bash
docker run -d --name my-cpu-app --cpus "0.5" my-image
```

This means the container can use at most half of one CPU core's worth of processing time.

- `--cpu-shares`: A relative weight for the CPU scheduler (default 1024). Higher values let the container receive more CPU time when cores are contended. This is a ratio, not an absolute limit.

```bash
# If one container runs with 1024 shares and another with 512,
# the first gets twice as much CPU time as the second under contention.
docker run -d --name my-cpu-share-high --cpu-shares 1024 my-image
docker run -d --name my-cpu-share-low --cpu-shares 512 my-image
```

- `--cpu-period` and `--cpu-quota`: These two are used together to limit CPU usage as a percentage. `--cpu-period` (default 100000 microseconds) defines a time period, and `--cpu-quota` defines how much CPU time the container may use within that period. (Whether the quota is actually throttling the workload can be checked with the sketch after this list.)

```bash
# The container can use 50ms of CPU in every 100ms (100000 microsecond) period (50% of one core)
docker run -d --name my-cpu-quota --cpu-period 100000 --cpu-quota 50000 my-image
```

This method provides an absolute limit, similar to `--cpus`.

- `--cpuset-cpus`: Pins the container to specific CPU cores. This is useful especially for applications that benefit from CPU cache locality or must run on particular hardware.

```bash
# The container may run only on CPUs 0 and 1
docker run -d --name my-cpuset-app --cpuset-cpus "0,1" my-image
```

At one point, I used this setting when I needed to pin certain real-time workloads to specific cores.
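The kernel also counts how often a container was held back by its quota, which is a quick way to see whether a CPU limit is too tight. A small sketch, assuming the `my-cpu-quota` container from above — use whichever of the two paths exists for your cgroup version:

```bash
CONTAINER_ID=$(docker inspect -f '{{.Id}}' my-cpu-quota)

# cgroup v1: nr_throttled and throttled_time show how often and for how long
# the container was throttled by its CPU quota
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat 2>/dev/null

# cgroup v2 (systemd cgroup driver): same counters, throttled_usec instead of throttled_time
cat /sys/fs/cgroup/system.slice/docker-$CONTAINER_ID.scope/cpu.stat 2>/dev/null
```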
Disk I/O Limits
Disk I/O limits control how intensively a container can use the disk. This is important for reducing wear-and-tear on SSDs or preventing other applications from affecting disk performance.
- `--blkio-weight`: Sets the relative weight the container gets for I/O operations (between 10 and 1000; the default of 0 means no weight is set). A higher weight means more I/O time.

```bash
docker run -d --name my-io-app --blkio-weight 400 my-image
```

- `--device-read-bps` / `--device-write-bps`: Limit the read/write speed for a specific device in bytes per second.

```bash
# Limit read speed from /dev/sda to 1MB/s
docker run -d --name my-read-limit --device-read-bps /dev/sda:1mb my-image

# Limit write speed to /dev/sda to 500KB/s
docker run -d --name my-write-limit --device-write-bps /dev/sda:500kb my-image
```

I used these limits on my own VPS to prevent a backup-script container from hammering the disk. Otherwise, my other services kept slowing down while waiting on disk I/O.
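A quick way to confirm that such a cap actually holds is to generate some I/O inside the limited container and look at the throughput that gets reported. A rough sketch, assuming the `my-write-limit` container from above and an image that ships GNU `dd`:

```bash
# Write 100MB with direct I/O inside the rate-limited container;
# dd's summary line should report roughly 500 kB/s.
docker exec my-write-limit dd if=/dev/zero of=/tmp/io-test bs=1M count=100 oflag=direct

# Clean up the test file afterwards
docker exec my-write-limit rm -f /tmp/io-test
```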
Changing Limits at Runtime (docker update)
You can also change resource limits without stopping and restarting a container. This is very useful for adjusting limits in a production environment without downtime.
```bash
# Update the memory limit of a running container to 1GB
docker update --memory "1g" my-problematic-container

# Update the CPU limit of a running container to 0.75 cores
docker update --cpus "0.75" my-problematic-container
```
This feature has been very helpful when I faced an immediate resource crunch and needed to intervene quickly.
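After an update I usually double-check that the new values actually landed; `docker inspect` shows the configured limits directly (same container name as above):

```bash
# Configured hard memory limit in bytes (0 means unlimited)
docker inspect -f '{{.HostConfig.Memory}}' my-problematic-container

# Configured CPU limit in NanoCPUs (750000000 corresponds to --cpus "0.75")
docker inspect -f '{{.HostConfig.NanoCpus}}' my-problematic-container
```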
Monitoring Limits and Fine-Tuning
Applying limits is only half the battle. The real challenge is monitoring the impact of these limits and finding the right balance.
- Verification with `docker stats`: After applying limits, I run `docker stats` again to check whether the `MEM USAGE / LIMIT` and `CPU %` values stay within the expected boundaries.

```bash
docker stats my-problematic-container
```

- Manual Inspection of the `cgroup` File System: Sometimes Docker's interface isn't enough. I inspect the `cgroup` file system directly to confirm how the limits are applied at the kernel level.

```bash
# Find the exact cgroup path of the container
CONTAINER_ID=$(docker inspect -f '{{.Id}}' my-problematic-container)
echo "/sys/fs/cgroup/memory/docker/$CONTAINER_ID"

# Check the memory limit
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.limit_in_bytes

# Check the CPU quota and period values
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.cfs_quota_us
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.cfs_period_us
```

- Following Logs: Application logs and `journald` logs show how the application behaves under the limits. Seeing OOM Killer messages decrease or disappear entirely is a sign that I'm on the right track (a small event-watcher sketch follows this list).

- Fine-Tuning: It's difficult to set perfect limits in one go. I usually make gradual adjustments by observing the application's behavior under normal and heavy loads. For example, when deploying a new AI model in a production ERP, I closely monitored the model's memory and CPU consumption, and after a while I tightened the limits a bit further. Trial and error and continuous observation are key in this process.
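For the alerting side, the Docker daemon itself emits an event whenever a container hits its memory limit, which can feed a very simple watcher. A rough sketch — pipe the output into whatever notification channel you already use:

```bash
# Stream OOM events from the Docker daemon as JSON
docker events --filter type=container --filter event=oom --format '{{json .}}'
```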
Challenges and Trade-offs Encountered
Setting resource limits is always a balancing act. Incorrect limits can lead to new problems.
- Performance Degradation due to Incorrect Limits: If I allocate too few resources to a container, the application will constantly throttle, slow down, or crash. This directly impacts user experience. For example, when configuring connection pools for `PostgreSQL`, setting the memory soft limit too low caused a noticeable drop in database performance.
- Hard Limit vs. Soft Limit Choices: A hard limit (`--memory`, `--cpus`) provides a guaranteed upper bound but restricts the application's flexibility during sudden spikes in demand. A soft limit (`--memory-reservation`, `--cpu-shares`), on the other hand, offers flexibility but can degrade application performance when there is memory pressure on the system. It's crucial to find the right balance based on the application's criticality and behavior.
- Unexpected Effects of the OOM Killer: When a hard limit is exceeded, the OOM Killer intervenes and terminates the container. This can cause the application to stop suddenly and unexpectedly. Therefore, setting up monitoring and alerting mechanisms for critical applications is essential.
- Cost of Allocating Excessive Resources: Especially on cloud-based VPSs, allocating more resources than necessary directly increases costs. The whole point of a VPS is to use resources efficiently, so allocating only what each container actually needs is critical for both cost and overall system efficiency. In my own side product, I rigorously track these limits to keep the VPS cost down.
- Importance of the cgroup Soft Limit: The memory soft limit (set with `--memory-reservation` in Docker) is very useful. Under memory pressure, the kernel reclaims the container's memory (for example, its page cache) back down toward this value, acting as a "warning" mechanism before the OOM Killer's harsh intervention. This allows the application's memory footprint to shrink proactively and helps the system run more stably.
Conclusion: Continuous Observation for Stability and Efficiency
Detecting and limiting resource-hog containers on a VPS, or in any containerized environment, is an indispensable step toward system stability and efficiency. The steps and commands I've outlined in this guide are practical solutions drawn from years of field experience. Getting complacent about resource management because containers make deployment so easy is an invitation to wake up in the middle of the night to a crashed system.