This post is part of the Ultimate Container Security Series, a structured, multi-part guide covering container security from foundational concepts to runtime protection. For an overview of the series structure, scope, and update schedule, see the series introduction post here.
When dozens or hundreds of applications share the same Linux system, managing their access to hardware resources, like CPU, memory, and disk I/O, becomes an absolute necessity. Without strict boundaries, a single misbehaving or compromised process can easily consume all available resources. This starves other applications, degrades system performance, and can even bring the entire host down.
From a security perspective, an attacker exploiting an unbounded application can intentionally cause this resource exhaustion, resulting in a severe Denial of Service (DoS). Because containers are ultimately just processes running on a shared host kernel, they are equally susceptible to this risk. To keep services stable and secure, we need a way to enforce fairness and strict isolation.
In this chapter, we will explore Linux Control Groups (cgroups), a powerful kernel feature that allows us to limit and isolate the resource usage of processes.
Introduction to cgroups
At its core, cgroups v2 is a Linux kernel mechanism that allows the system to organize processes into hierarchical groups and apply strict resource limits to them. With cgroups, administrators and container runtimes can precisely dictate how much CPU time, memory, and disk I/O throughput a specific set of processes is allowed to consume.
Understanding how cgroups operate is essential because they are the mechanism Linux uses to enforce resource fairness at the kernel level.
Consider a scenario where a process is allowed to consume unlimited memory. It will eventually starve other critical processes on the same host. This might happen inadvertently due to a bug, like a memory leak in a poorly written application. However, from a security perspective, an attacker can deliberately trigger or exploit this leak to perform a resource exhaustion attack. By strictly capping the memory and other resources a containerized process can access, you neutralize the blast radius of this kind of attack, ensuring the rest of the host system continues operating normally.
cgroups v1 vs. cgroups v2
Control groups have been around for a long time, but the ecosystem has fundamentally shifted. Version 2 of cgroups has been in the Linux kernel since 2016, and after Fedora led the charge as the first major distro to default to it in late 2019, it has become the undeniable standard for modern Linux systems and orchestration platforms.
The biggest architectural difference lies in how processes are grouped. In cgroups v1, controllers (the mechanisms that actually govern resources like memory or PIDs) were completely independent. A single process could belong to entirely different groups for different resources. For example, a process could simultaneously join /sys/fs/cgroup/memory/mygroup and /sys/fs/cgroup/pids/yourgroup. This fragmented design led to incredibly complex, confusing hierarchies that were hard to manage and secure.
Cgroups v2 fixes this by introducing a single unified hierarchy. The semantics are much cleaner: a process joins one specific group (e.g., /sys/fs/cgroup/ourgroup) and is automatically subject to all the active controllers configured for that group.
Beyond making resource management much easier to reason about, cgroups v2 brings several massive improvements to stability and security:
- Safer Sub-tree Delegation: It safely allows delegating cgroup management to less-privileged users. This is a crucial feature that makes rootless containers possible, allowing resource limits to be applied without requiring root privileges.
- Unified Memory Accounting: It properly accounts for different types of memory usage that v1 missed or handled poorly, including network memory, kernel memory, and non-immediate resource changes like page cache write-backs.
- Pressure Stall Information (PSI): A newer feature that provides rich, real-time metrics on system resource pressure, allowing systems to proactively detect and respond to resource shortages before a crash occurs.
- Enhanced Isolation: Better cross-resource allocation management prevents edge-case scenarios where high usage of one resource unexpectedly impacts another.
Throughout this guide, we will focus entirely on cgroups v2, as it is the modern implementation used by secure container environments.
Before we dive deeper, you should verify that your host system is actually running cgroups v2. You can easily check this by querying the filesystem type of the cgroup mount point:
user@container-security:~$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
If the output reads cgroup2fs, you are ready to go. If the output is tmpfs or cgroupfs, your system is still using the legacy cgroups v1 hierarchy.
Exploring Cgroups
⚠️ Warning: Always run the commands in this guide on a disposable Virtual Machine (VM) and never on your personal host machine. Playing with kernel resource limits can easily freeze or crash your system! The examples in this course were run on an Ubuntu Server 24.04 VM.
The core idea behind cgroups is elegantly simple: Processes are organized into hierarchical groups, and each group is assigned specific resource limits.
In Linux, "everything is a file," and cgroups are no exception. There is no special CLI tool you must use to interact with them. Instead, cgroups are exposed directly through a virtual filesystem, usually mounted at /sys/fs/cgroup. Inside this directory, groups are represented as folders, and resource limits are represented as plain text files.
Writing values into these text files directly changes the kernel's behavior.
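As a minimal sketch of this file-based interface (paths assume a cgroups v2 host and root privileges; the group name demo is made up for illustration):

```shell
# Peek at the root control group (read-only, safe on a cgroups v2 host)
cat /sys/fs/cgroup/cgroup.controllers    # controllers available on this host

# Creating a directory creates a new cgroup; writing to its files sets limits
mkdir /sys/fs/cgroup/demo                # requires root
echo "100M" > /sys/fs/cgroup/demo/memory.max
cat /sys/fs/cgroup/demo/memory.max       # kernel reports the applied limit in bytes
rmdir /sys/fs/cgroup/demo                # an empty cgroup is removed like a directory
```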
Let's look at the root of the cgroups v2 filesystem by running ls /sys/fs/cgroup/:
This directory is the root control group. Every single process running on your Linux machine belongs to this root group by default.
In modern Linux systems that use systemd, the cgroups v2 filesystem mounted at /sys/fs/cgroup forms a hierarchical tree where processes are organized and managed. The root directory represents the root control group, and systemd automatically creates subgroups such as init.scope (which contains the system’s PID 1 process), system.slice (which holds system services and daemons), and user.slice (which organizes user sessions). Because systemd manages most services on the system, container runtimes like Docker or orchestration platforms like Kubernetes typically run as system services under system.slice. As a result, the containers they start appear as nested cgroup directories beneath those services, for example, under system.slice/docker.service/docker-container.scope. This means containers are still part of the same overall cgroup hierarchy, just placed deeper in the tree according to the service that created them:
/sys/fs/cgroup (root)
│
├── init.scope
│
├── system.slice
│ ├── docker.service
│ │ └── docker-container.scope
│ │
│ └── ssh.service
│
└── user.slice
└── user-1000.slice
Whenever a new subdirectory is created here, it represents a new child cgroup that inherits from its parent.
If you look closely at the files in a cgroup directory, you'll notice a strict naming convention. Files are divided into two main categories: Core files and Controller files.
- Core Files (cgroup.*): Files prefixed with cgroup. manage the mechanics of the cgroup hierarchy itself, rather than specific hardware resources.
  - cgroup.procs: The most important file. It contains the list of Process IDs (PIDs) that belong to this group. To move a process into a cgroup, you simply echo its PID into this file.
  - cgroup.controllers: A read-only file showing which resource controllers (cpu, memory, io) are currently available to this specific group.
  - cgroup.kill: A v2 feature that lets you instantly kill all processes within the cgroup by writing 1 to it.
- Controller Files (cpu.*, memory.*, pids.*, etc.): Controllers are the actual engines that distribute and limit system resources. Files prefixed with a controller name dictate how that specific resource is managed. These files generally fall into two types:
  - Configuration (Read-Write): Files you modify to set limits (e.g., memory.max).
  - Status (Read-Only): Files you read to get live metrics (e.g., memory.stat). For example, watch cat /sys/fs/cgroup/memory.stat will show you real-time memory usage stats for the root cgroup.
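To make the cgroup.procs mechanics concrete, here is a short sketch (requires root; demo is a hypothetical, already-created group):

```shell
# Move the current shell into a hypothetical cgroup named "demo" (root required)
echo $$ > /sys/fs/cgroup/demo/cgroup.procs

# Verify: /proc/<pid>/cgroup shows which cgroup a process belongs to
cat /proc/$$/cgroup     # e.g. 0::/demo
```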
Key Controller Files
While the Linux kernel supports many controllers, a few are absolutely critical for securing containerized workloads against resource exhaustion and Denial of Service (DoS) attacks.
- Memory (memory.*): Regulates RAM usage.
  - memory.max sets an absolute hard limit. If the processes in the cgroup try to use more memory than this, the kernel's Out-Of-Memory (OOM) killer will step in and terminate them.
  - memory.high is a softer throttle limit. If breached, the kernel heavily throttles the processes and forces them to reclaim memory, but avoids outright killing them.
- CPU (cpu.*): Regulates processor time.
  - cpu.max limits the absolute maximum amount of CPU time the group can use (bandwidth).
  - cpu.weight dictates proportional share. If the system is busy, a cgroup with a higher weight gets priority over one with a lower weight.
- PIDs (pids.*): Regulates process creation.
  - pids.max sets a hard limit on how many processes can exist inside the cgroup. From a security standpoint, this is your primary defense against a fork bomb attack, where a malicious script rapidly clones itself to crash the host.
- Block I/O (io.*): Regulates disk read/write bandwidth.
  - io.max can prevent a compromised container from thrashing the host's storage drives and starving other containers of database reads or log writes.
For highly specialized workloads, cgroups v2 offers several other controllers. While you might not interact with these daily, it's good to know they exist:
- Cpuset (cpuset.*): Pins tasks to specific CPU cores and memory nodes. This is crucial for high-performance computing on NUMA architectures where memory access latency matters.
- Devices: Controls which device nodes (like /dev/sda or /dev/random) a cgroup can access. In v2, this is actually implemented using eBPF programs rather than standard text files.
- HugeTLB (hugetlb.*): Limits the usage of Huge Pages (large blocks of memory) to prevent a single group from exhausting them.
- RDMA (rdma.*): Manages Remote Direct Memory Access resources, often used in high-speed clustered networking.
Creating cgroups
Now that we understand how the cgroups filesystem works, let's create a custom cgroup hierarchy.
⚠️ Warning: As mentioned earlier, do not run these commands on your host machine. Use a VM (examples work with Ubuntu Server 24.04).
Most of the commands we are about to run require root privileges. Let's switch to the root user and install cgroup-tools, which provides useful utilities like cgcreate.
sudo su
apt update && apt -y install cgroup-tools
Next, let's export some environment variables to make our commands easier to read. We are going to create:
- a parent cgroup called scripts (a parent cgroup is the higher-level group that can contain one or more subgroups; it usually defines the overall resource limits that apply to everything inside it)
- a child cgroup called production (a child cgroup is a subgroup created inside the parent group; processes can be placed into the child group, and it can have its own additional limits, but it can never exceed the limits set by its parent)
export PARENT_CGROUP="scripts"
export CHILD_CGROUP="production"
If the parent scripts group had a limit of 2 GB of memory, then the child production group could only use up to that 2 GB, even if it tried to set a higher limit. The child can further restrict resources, but it cannot escape the limits of its parent.
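In other words, the effective limit behaves like a min() over the path from the root down to the leaf. A quick shell sketch of that rule (pure arithmetic, no cgroup filesystem needed; the 2 GB / 4 GB figures are just the example above):

```shell
PARENT_MAX=$((2 * 1024 * 1024 * 1024))   # parent "scripts": 2 GB
CHILD_MAX=$((4 * 1024 * 1024 * 1024))    # child "production" asks for 4 GB

# The child can never exceed its parent, so the effective limit is the smaller value
EFFECTIVE=$(( CHILD_MAX < PARENT_MAX ? CHILD_MAX : PARENT_MAX ))
echo "$EFFECTIVE"   # 2147483648 — the parent's 2 GB wins
```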
So the structure will look like this:
/sys/fs/cgroup/
└── scripts (parent cgroup)
└── production (child cgroup)
While you can create cgroups using standard Linux commands (e.g., mkdir /sys/fs/cgroup/scripts), using the cgcreate utility allows us to explicitly request which controllers we want to enable.
Let's create our parent cgroup and request only the memory and cpu controllers:
cgcreate -g memory,cpu:/${PARENT_CGROUP}
If the command returns no output, it was successful. Let's look inside the newly created directory: ls /sys/fs/cgroup/${PARENT_CGROUP}
You will see a large list of files representing the parameters and statistics for this new group. However, if you look closely at the active controllers, you might notice something unexpected:
root@container-security:/home/user# cat /sys/fs/cgroup/${PARENT_CGROUP}/cgroup.controllers
cpu memory pids
The pids controller is active, even though we only requested memory and cpu.
To understand why pids showed up, we need to look at the root cgroup (/sys/fs/cgroup/). Run this command: cat /sys/fs/cgroup/cgroup.subtree_control
root@container-security:~# cat /sys/fs/cgroup/cgroup.subtree_control
cpu memory pids
In cgroups v2, resource controllers are strictly delegated top-down. The cgroup.subtree_control file dictates which controllers are passed down to a group's immediate children. Because the root cgroup is configured to delegate cpu, memory, and pids, our new ${PARENT_CGROUP} automatically inherited all three.
The pids controller in cgroups limits the number of processes (PIDs) that a group can create. A PID is simply a process identifier used by the Linux kernel to track running processes. The controller is usually enabled by default to prevent fork bombs and runaway process creation. Without it, cgroups could limit CPU and memory, but not process count, which is a safety risk.
Before we create our child cgroup, there is a crucial cgroups v2 rule you must know: The No Internal Process Constraint.
In v2, a cgroup can either have processes assigned to it, OR it can delegate controllers to child cgroups, it cannot do both. (The only exception is the root cgroup).
Because our ${PARENT_CGROUP} is going to delegate cpu and memory to its children, the kernel will refuse to let you assign any running processes directly to ${PARENT_CGROUP}. Instead, processes must be assigned to the leaf nodes of the tree (the final child directories).
Let's create the child cgroup where our actual demo processes will live:
cgcreate -g memory,cpu:/${PARENT_CGROUP}/${CHILD_CGROUP}
Although we created the parent and child cgroups in two separate steps, this was mainly for demonstration purposes. In practice, the first cgcreate command is technically redundant: running the second command (cgcreate -g memory,cpu:/${PARENT_CGROUP}/${CHILD_CGROUP}) would automatically create both the parent (scripts) and the child (production) cgroups if the parent did not already exist.
When we ran this, cgcreate automatically updated the cgroup.subtree_control file in the parent directory to delegate the requested controllers down to the child. We can verify this:
root@container-security:/home/user# cat /sys/fs/cgroup/${PARENT_CGROUP}/cgroup.subtree_control
cpu memory
Finally, let's look inside our new child cgroup: ls /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}
If you check the files here, you will see cpu.* and memory.* files, but absolutely no pids.* or io.* files. We now have a perfectly isolated, highly specific leaf cgroup ready to constrain our applications.
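Had we wanted the child to receive the pids controller as well, we could have enabled the delegation by hand instead of relying on cgcreate; a sketch, assuming root privileges and the hierarchy created above:

```shell
# Delegate pids from the parent down to its children ("+" enables, "-" disables)
echo "+pids" > /sys/fs/cgroup/${PARENT_CGROUP}/cgroup.subtree_control

# The child now exposes pids.* files alongside cpu.* and memory.*
cat /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}/cgroup.controllers
```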
Setting Resource Limits
Having created our isolated cgroup hierarchy, it is time to actually enforce some boundaries. This is where the core security value of cgroups shines: by setting strict resource limits, we protect the host system from resource exhaustion attacks and ensure predictable performance.
While you can configure these limits by directly writing to the files with echo (e.g., echo "20000 50000" > /sys/fs/cgroup/my_group/cpu.max), we will use the cgset utility from the cgroup-tools package we installed earlier, as it provides a cleaner syntax for setting multiple limits at once.
Before we apply the limits to our cgroup, let's understand exactly what we are controlling.
- CPU Throttling (cpu.max): In cgroups v2, CPU limits use a simple quota-based model formatted as $MAX $PERIOD. If you set the value to 100000 1000000, you are telling the kernel: for every 1,000,000 microseconds (1 second) of time, this group is allowed to use the CPU for 100,000 microseconds (a tenth of a second). This effectively limits the cgroup to 10% of a single CPU core. Security note: unlike memory limits, CPU limits act as a throttle. If a process hits its CPU limit, the kernel simply pauses it until the next period begins. CPU throttling slows applications down, but it never outright kills them.
- Memory Limits (memory.max & memory.swap.max): Memory limits set an absolute ceiling on RAM usage. If a cgroup exceeds the value in memory.max, the kernel initiates heavy throttling. It will aggressively try to reclaim memory by dropping cached data or swapping memory pages out to disk. However, if the process continues demanding memory and the kernel cannot reclaim enough (or if swap is also exhausted), the kernel triggers the Out-Of-Memory (OOM) killer. It calculates an OOM score and terminates the most offending process within that cgroup to protect the rest of the host system.
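To sanity-check a cpu.max value before applying it, you can compute the percentage of a core it grants with plain shell arithmetic (a small helper sketch, not part of cgroup-tools):

```shell
QUOTA=150000      # microseconds of CPU time allowed per period
PERIOD=1000000    # length of one accounting period in microseconds

# Percentage of a single core this cpu.max value grants
PCT=$((QUOTA * 100 / PERIOD))
echo "${PCT}% of one CPU core"   # 15% of one CPU core
```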
For the tests, we want to intentionally induce an early OOM kill. To guarantee this happens, we need to strictly limit both the physical memory and the swap memory. Otherwise, the kernel might just push our runaway process into swap space, delaying the crash.
Let's apply a 15% CPU limit and a roughly 200MB limit for both RAM and swap to our production child cgroup:
cgset -r memory.max=200000000 ${PARENT_CGROUP}/${CHILD_CGROUP} # (Note: Memory values here are in bytes, but you could also use suffixes like 100M or 1G.)
cgset -r memory.swap.max=200000000 ${PARENT_CGROUP}/${CHILD_CGROUP}
cgset -r cpu.max="150000 1000000" ${PARENT_CGROUP}/${CHILD_CGROUP}
Let’s verify that the kernel accepted our new limits by reading the files directly:
cat /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}/{memory,cpu,memory.swap}.max
You should see an output similar to this:
199999488
150000 1000000
199999488
You might be wondering why the 200000000 bytes we assigned for memory suddenly changed to 199999488.
The kernel manages memory in fixed-size blocks called "pages." On most standard systems, a memory page is exactly 4096 bytes (you can verify your system's page size by running getconf PAGE_SIZE).
When you request a memory limit, the kernel rounds your request down to the nearest whole page. If you divide our requested 200,000,000 bytes by 4096, you get roughly 48,828.125 pages. The kernel drops the decimal, granting you exactly 48,828 pages. Multiply 48,828 by 4096, and you get 199,999,488, the exact byte limit the kernel applied.
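You can reproduce the kernel's rounding with shell arithmetic (this assumes the common 4096-byte page size; check yours with getconf PAGE_SIZE):

```shell
REQUESTED=200000000
PAGE_SIZE=4096                        # typical x86-64 page size

PAGES=$((REQUESTED / PAGE_SIZE))      # integer division drops the fraction: 48828
APPLIED=$((PAGES * PAGE_SIZE))        # 48828 * 4096
echo "$APPLIED"                       # 199999488, matching the memory.max output above
```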
Testing and Managing Cgroup Processes
Now that our resource limits are strictly defined in our production cgroup, it’s time to put them to the test. We will observe how cgroups throttle CPU usage, how they handle memory exhaustion, and how we can use built-in tools to manage these processes.
Stressing the CPU
Let's start by establishing a baseline. We will run a command that is notorious for hogging 100% of a CPU core: copying an infinite stream of zeros into the void. Run this command directly on your host (outside our restricted cgroup):
dd if=/dev/zero of=/dev/null &
sleep 2
ps -p $! -o %cpu
Because this process has no bounds, the output will show it consuming nearly 100% of the CPU:
%CPU
98.0
Run kill $! to stop the process before moving on.
Now, let's run that exact same command, but this time we will use cgexec to launch it directly inside our restricted child cgroup:
cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} dd if=/dev/zero of=/dev/null &
sleep 2
ps -p $! -o %cpu
Check the output now:
%CPU
15.3
Run kill $! to stop the process.
The CPU usage hovers right around the 15% limit we defined earlier! If you watch this process in a live monitor like htop, you will see it consistently stay at or below that threshold. The kernel is aggressively pausing and resuming the process to enforce our quota.
Filling Up the Memory (Triggering an OOM Kill)
Let's see what happens when a process refuses to stay within its memory limits. We are going to launch a bash process inside our cgroup that continuously appends 10MB of data to an array every second until it crashes. This script will quickly breach the roughly 200MB limit we imposed. Because we also limited swap space, the kernel won't be able to page the data to disk.
cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} \
bash -c 'a=(); while true; do a+=("$(head -c 10M /dev/zero | tr "\0" "A")"); sleep 1; done' &
You can watch the memory footprint (RSS - Resident Set Size) grow rapidly in real-time using the watch command:
watch ps -p $! -o rss,sz
Within a few seconds, the cgroup will run completely out of memory, and the kernel's Out-Of-Memory (OOM) killer will intervene to protect the host. You will see an output like this:
[1]+ Killed cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} bash -c 'a=(); while true; do a+=("$(head -c 10M /dev/zero | tr "\0" "A")"); sleep 1; done'
In this setup, memory.max was used, which acts as a hard limit and triggers the OOM killer when exceeded. A softer and safer approach is to use memory.high instead. When a process reaches memory.high, the kernel heavily throttles the process and applies strong memory reclaim pressure. This forces the process to slow down and release memory, acting more like a “speed bump” than a hard stop. This behavior provides monitoring systems and administrators time to react and take action before the application is terminated by the OOM killer.
Monitoring with systemd-cgtop and systemd-cgls
Just as you use top and ls to view standard processes, Linux provides systemd-cgtop and systemd-cgls specifically for monitoring cgroups.
First, let's populate our cgroup with a few sleeping background processes so we have something to look at:
for p in {1..5} ; do cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} sleep 2000 & done
cgexec -g memory,cpu:${PARENT_CGROUP}/${CHILD_CGROUP} dd if=/dev/zero of=/dev/null &
Now, run: systemd-cgtop
You will get a clean, live-updating table showing the resource consumption aggregated by cgroup:
If you want a hierarchical tree view of exactly which PIDs belong to which groups, use systemd-cgls:
root@container-security:/home/user# systemd-cgls /scripts
CGroup /scripts:
└─production
├─2142 sleep 2000
├─2143 sleep 2000
├─2144 sleep 2000
├─2145 sleep 2000
├─2146 sleep 2000
└─2147 dd if=/dev/zero of=/dev/null
Killing All Processes in a Cgroup
One of the best new features in cgroups v2 is the cgroup.kill file. Instead of hunting down individual PIDs, you can instantly terminate everything inside a cgroup by writing a 1 to this file:
echo 1 > /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}/cgroup.kill
If you press enter a couple of times, you will see the terminal report that all the sleep processes we spawned earlier have been instantly killed. Checking systemd-cgls /scripts will now show an empty group.
Moving an Already-Running Process (cgclassify)
So far, we have been launching new processes directly into our cgroup using cgexec. But what if a runaway process is already running on the host, and you want to lock it down on the fly?
We can use the cgclassify command for this. Let's start our CPU hog on the host system without limits:
dd if=/dev/zero of=/dev/null &
It is currently consuming 100% of a core. Time to cage it. We use cgclassify and pass it the PID (using $! for the last background process):
cgclassify -g cpu,memory:${PARENT_CGROUP}/${CHILD_CGROUP} $!
If you run ps -p $! -o %cpu right after classifying the process, you might notice something strange. It might say the CPU usage is 75% or 50%, slowly ticking down, rather than an instant 15%. Why? This is because the ps command does not show instantaneous CPU usage. It calculates the average CPU usage over the entire lifetime of the process. Because the process ran at 100% for a few seconds before we caged it, that lifetime average takes a while to drop! If you look at the process in htop or systemd-cgtop instead, you will see that its actual, real-time usage dropped to 15% the moment you ran the cgclassify command.
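If you want kernel-level numbers instead of ps averages, the cgroup's own cpu.stat file reports cumulative usage and throttling counters (requires the hierarchy created above):

```shell
cat /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}/cpu.stat
# usage_usec      - total CPU time the group has consumed
# nr_throttled    - how many periods the group hit its quota
# throttled_usec  - total time the kernel kept the group paused
```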
Kill the process with: echo 1 > /sys/fs/cgroup/${PARENT_CGROUP}/${CHILD_CGROUP}/cgroup.kill
Viewing Configuration with cgget
If you ever need to audit a cgroup to see exactly how it is configured and what its current stats are, cgget is your go-to command:
cgget ${PARENT_CGROUP}/${CHILD_CGROUP}
This dumps the contents of all the controller files into an easy-to-read list, showing you your max limits, current usage metrics, and even how many times the OOM killer has been triggered (oom_kill).
Cleaning Up
To keep your system clean, you can recursively delete the cgroups we just created:
cgdelete -r -g cpu:/${PARENT_CGROUP}
You might wonder why commands like cgexec and cgdelete require you to specify a controller (like -g cpu:) even though cgroups v2 uses a unified hierarchy. This is simply a quirk for backward compatibility with cgroups v1 syntax. The command requires it to run, but in a v2 environment, the process is applied to the unified group regardless of which specific controller you type here.
Containers and Cgroups
Throughout this chapter, we manually created cgroups, configured resource limits, and assigned processes to them. While this is the best way to learn how the Linux kernel enforces resource distribution, you rarely have to do this by hand in the real world. You don't have to be using containers to take advantage of cgroups, but modern container runtimes provide an incredibly convenient abstraction layer over them.
When you run a containerized application, runtimes like Docker or containerd automatically interact with the cgroups filesystem on your behalf. Behind the scenes, the runtime creates a dedicated cgroup hierarchy specifically for that container (typically using the long container ID as the directory name).
When you pass a flag like --memory 100M to a Docker run command, or define a CPU limit in a Kubernetes Pod specification, the container engine translates those human-readable requests directly into the memory.max and cpu.max files we explored earlier.
From a security standpoint, understanding this underlying mechanism is critical. Constraining resources provides a powerful layer of protection against resource exhaustion.
Whether an attacker deliberately exploits an application to consume excess memory, or a simple bug causes an accidental CPU spike, an unbounded container can easily starve legitimate applications running on the same host. By setting explicit memory and CPU limits on your container deployments, you ensure that the kernel's cgroups will throttle or kill the offending process before it can bring down your entire infrastructure.
This article is one piece of the Ultimate Container Security Series, an ongoing effort to organize and explain container security concepts in a practical way. If you want to explore related topics or see what’s coming next, the series introduction post provides the complete roadmap.



