DEV Community

Henry Osei


How Docker Uses cgroups (a /sys/fs/cgroup Walkthrough)

At some point Docker stops feeling like magic. You run:

docker run --memory=256m --cpus=0.5 nginx

and somehow the container is capped at 256 MiB of memory and half a CPU. Where does that rule actually live? Not in the image, not in the Dockerfile, and not in some private Docker database. It ends up as files under:

/sys/fs/cgroup

That path is one of the places where Docker stops being a product and becomes plain Linux.

/sys/fs/cgroup is not a normal folder

List it and it looks like a regular directory:

ls /sys/fs/cgroup

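On a modern system using cgroup v2, the root directory and the child cgroups beneath it hold entries like these (limit files such as cpu.max, memory.max and pids.max exist only in child cgroups, not in the root itself):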

cgroup.controllers
cgroup.procs
cpu.max
cpu.stat
memory.current
memory.max
pids.current
pids.max
io.stat

None of these sit on disk. /sys/fs/cgroup is a virtual filesystem backed by the kernel: reading a file asks the kernel for cgroup state, and writing one changes how the kernel controls processes.
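
One way to check this on your own machine is to ask which filesystem type is mounted at /sys/fs/cgroup. The sketch below assumes GNU coreutils stat; cgroup_version is a hypothetical helper name, not something Docker ships.

```shell
# Hypothetical helper: map a filesystem type name to a cgroup version label.
cgroup_version() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;   # on v1, /sys/fs/cgroup is a tmpfs holding one mount per controller
    *)         echo "unknown" ;;
  esac
}

# GNU stat: -f reports on the filesystem, %T prints its type name.
if [ -d /sys/fs/cgroup ]; then
  fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo unknown)
  echo "/sys/fs/cgroup is $fstype (cgroup $(cgroup_version "$fstype"))"
fi
```

If the type comes back as cgroup2fs, the rest of this walkthrough should match what you see.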

This is worth pausing on, because it changes how you should picture Docker. Docker isn't watching your container from outside and politely asking it to behave. It writes rules into these files once, and from then on the kernel does the enforcing. Docker is the manager; the kernel is the bouncer.

A container is just a process in a cgroup

A Docker container is not a tiny VM. At runtime it's one or more ordinary Linux processes, and you can prove it yourself:

docker run -d --name web nginx

Ask Docker for the container's main process ID:

docker inspect web --format '{{.State.Pid}}'

Say it returns:

18432

From the host's point of view, that container is a normal Linux process with PID 18432. Now check which cgroup it belongs to:

cat /proc/18432/cgroup

On a cgroup v2 system you'll see something like:

0::/system.slice/docker-<container-id>.scope

That path maps to a directory, usually:

/sys/fs/cgroup/system.slice/docker-<container-id>.scope

and inside it are the control files for this specific container. Anything Docker configures through flags like --memory and --cpus ends up in there as readable files.
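
The mapping from the /proc entry to the directory is mechanical enough to script. This is a minimal sketch, assuming cgroup v2 mounted at /sys/fs/cgroup and a shell running on the host rather than inside a container; cgroup_dir is a hypothetical helper name.

```shell
# Hypothetical helper: turn one line of /proc/<pid>/cgroup into a directory path.
# On cgroup v2 the line has the form "0::<path>".
cgroup_dir() {
  line=$1
  case "$line" in
    0::*) printf '%s\n' "/sys/fs/cgroup${line#0::}" ;;
    *)    return 1 ;;   # not a v2 entry (v1 lines name a controller in the middle field)
  esac
}

# Sample input taken from the placeholder path used in this post.
cgroup_dir "0::/system.slice/docker-abc123.scope"
```

On a real host you would feed it the output of cat /proc/<pid>/cgroup instead of the sample string.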

What happens when you set a memory limit

Run a container with a memory limit:

docker run -d \
  --name limited-nginx \
  --memory=256m \
  nginx

Docker does not inject special memory-limiting code into Nginx. The flow is roughly:

docker CLI
  ↓
dockerd
  ↓
containerd
  ↓
runc
  ↓
create process
  ↓
place process into cgroup
  ↓
write memory limit into /sys/fs/cgroup
  ↓
kernel enforces the limit

On cgroup v2 the memory limit lives in memory.max. If the container's cgroup path is:

/sys/fs/cgroup/system.slice/docker-abc123.scope

then Docker writes 268435456 (256 MiB in bytes) into:

/sys/fs/cgroup/system.slice/docker-abc123.scope/memory.max

You can look at both the limit and the current usage:

cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.max
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.current

Docker wrote memory.max; the kernel keeps memory.current up to date. When the cgroup hits the limit, the kernel first tries to reclaim memory from it, and if that doesn't free enough, the OOM killer kills a process inside the cgroup. If you've ever stared at an OOM-killed container while free -h on the host showed gigabytes of headroom, this is the explanation you were missing: the host had memory, but the cgroup didn't.
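
To make the arithmetic concrete, here is a small sketch of how a value like 256m becomes the number in memory.max, and how the two files combine into a usage percentage. Both function names are hypothetical, and the conversion only handles whole numbers with a single b, k, m or g suffix.

```shell
# Hypothetical helper: convert a Docker-style size (e.g. 256m) to bytes.
# Docker treats the k/m/g suffixes as powers of 1024.
to_bytes() {
  num=${1%[bBkKmMgG]}
  case "$1" in
    *[kK]) echo $((num * 1024)) ;;
    *[mM]) echo $((num * 1024 * 1024)) ;;
    *[gG]) echo $((num * 1024 * 1024 * 1024)) ;;
    *)     echo "$num" ;;
  esac
}

# Hypothetical helper: usage as a whole-number percentage of the limit.
# memory.max holds the literal string "max" when no limit is set.
mem_percent() {
  current=$1 limit=$2
  if [ "$limit" = "max" ]; then echo "unlimited"; else echo $((current * 100 / limit)); fi
}

to_bytes 256m                     # 268435456
mem_percent 67108864 268435456    # 25
```

On a real container you would pass the contents of memory.current and memory.max to mem_percent instead of the sample numbers.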

What happens when you set a CPU limit

Now run:

docker run -d \
  --name cpu-limited-nginx \
  --cpus=0.5 \
  nginx

On cgroup v2, CPU quota is controlled by cpu.max:

cat /sys/fs/cgroup/system.slice/docker-abc123.scope/cpu.max

which might show:

50000 100000

The first number is the quota and the second is the period, both in microseconds. This cgroup may use 50,000 microseconds of CPU time in every 100,000-microsecond window, which works out to half a CPU. So docker run --cpus=0.5 nginx eventually becomes a scheduler rule of cpu.max = 50000 100000, enforced by the kernel on every scheduling decision. Nobody is pausing Nginx by hand.
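
The quota/period arithmetic runs in both directions, and a few lines of awk are enough to sketch it. The function names are hypothetical, and the second function assumes the default 100000-microsecond period shown above.

```shell
# Hypothetical helper: turn the contents of cpu.max into a CPU count.
# Format is "<quota> <period>" in microseconds; quota is "max" when unlimited.
cpus_from_cpu_max() {
  echo "$1" | LC_ALL=C awk '{ if ($1 == "max") print "unlimited"; else printf "%.2f\n", $1 / $2 }'
}

# The reverse direction: a --cpus value becomes a quota over a 100000-microsecond period.
cpu_max_from_cpus() {
  LC_ALL=C awk -v cpus="$1" 'BEGIN { printf "%d 100000\n", int(cpus * 100000 + 0.5) }'
}

cpus_from_cpu_max "50000 100000"   # 0.50
cpu_max_from_cpus 0.5              # 50000 100000
```

A container started without --cpus would show max as the quota, which the first function reports as unlimited.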

What about process limits

The --pids-limit flag caps how many processes and threads the container can create:

docker run -d \
  --name pid-limited \
  --pids-limit=100 \
  nginx

On cgroup v2 it maps to pids.max, which would hold 100 here. Two files tell the story:

cat /sys/fs/cgroup/system.slice/docker-abc123.scope/pids.max
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/pids.current

The first is the maximum allowed, the second is how many exist right now. This is your fork-bomb protection. Without a PID limit, a broken or malicious container could keep spawning processes until the host became unstable; with one, the kernel refuses the fork the moment the cgroup hits its cap. Cgroups don't just count things, they say no.
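
Put together, the two files tell you how close a container is to its cap. A minimal sketch, with pids_headroom as a hypothetical helper that takes the contents of pids.max and pids.current as arguments:

```shell
# Hypothetical helper: how many more tasks the cgroup may create.
# pids.max holds the literal string "max" when no limit is set.
pids_headroom() {
  if [ "$1" = "max" ]; then echo "unlimited"; else echo $(( $1 - $2 )); fi
}

pids_headroom 100 7   # 93
```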

The important file: cgroup.procs

Every cgroup needs to know which processes belong to it, and cgroup.procs records exactly that: the PIDs attached to the cgroup. Docker creates a cgroup for the container, then places the container's process into it. Once the process is inside, every rule in that cgroup applies to it and to all of its children:

container process
  └── child process
      └── child process

The whole tree is charged to the same resource-control group. If Nginx forks a dozen workers, they all draw from the same 256 MB, which is exactly what you want from something calling itself a container.
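
Since cgroup.procs is just a list with one PID per line, membership checks are ordinary text processing. The sketch below runs against a sample file so it works without Docker; pid_in_cgroup is a hypothetical helper, and every PID except 18432 is made up for illustration.

```shell
# Hypothetical helper: succeed if the PID is listed in the given cgroup.procs file.
pid_in_cgroup() {
  grep -qx "$1" "$2"
}

# Sample file standing in for a real cgroup.procs (one PID per line).
procs=$(mktemp)
printf '18432\n18470\n18471\n' > "$procs"
pid_in_cgroup 18432 "$procs" && echo "18432 is in the cgroup"
echo "processes in cgroup: $(wc -l < "$procs" | tr -d ' ')"
rm -f "$procs"
```

Against a live container you would point it at the cgroup.procs file inside the container's cgroup directory.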

Why the path looks different on different machines

On one system a container shows up under:

/sys/fs/cgroup/system.slice/docker-<container-id>.scope

On another it's:

/sys/fs/cgroup/docker/<container-id>

On Kubernetes you'll see paths involving:

kubepods.slice

The difference comes down to a few variables:

cgroup v1 or cgroup v2
systemd cgroup driver or cgroupfs driver
Docker directly or Kubernetes/containerd
rootful or rootless containers

Don't let the varying paths throw you. Whatever the layout, there's a cgroup directory somewhere, the container's processes are attached to it, and the kernel enforces whatever limits were written there. Find the directory and everything in this post still applies.
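
If you want a quick hint about which layout you're looking at, pattern matching on the path goes a long way. This is a rough sketch built only from the three path shapes shown above; classify_cgroup_path is a hypothetical helper, and rootless or custom setups may not match any of its patterns.

```shell
# Hypothetical helper: guess the layout from a cgroup path.
classify_cgroup_path() {
  case "$1" in
    *kubepods*)           echo "kubernetes" ;;
    */docker-*.scope)     echo "docker, systemd driver" ;;
    /docker/*|*/docker/*) echo "docker, cgroupfs driver" ;;
    *)                    echo "unknown" ;;
  esac
}

classify_cgroup_path "/system.slice/docker-abc123.scope"   # docker, systemd driver
classify_cgroup_path "/docker/abc123"                      # docker, cgroupfs driver
```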

Docker does not work alone

"Docker uses cgroups" compresses a whole runtime stack into one sentence. In practice Docker delegates most of the low-level work:

docker CLI
  talks to
dockerd
  talks to
containerd
  talks to
runc
  talks to
Linux kernel

runc deserves special mention. It's the low-level OCI runtime that actually builds the container process out of Linux primitives: it sets up namespaces, configures mounts, applies capabilities, configures seccomp, and places the process into the right cgroups. Docker gives you a pleasant command-line experience on top, but if you want to know where a "container" actually comes into existence, runc is the place to look.

Namespaces isolate what the container sees

Cgroups control what the container can use; namespaces control what it can see. Namespaces answer questions like:

Which processes can this container see?
Which network interfaces can it see?
What hostname does it see?
What filesystem root does it see?

Cgroups answer a different set:

How much memory can this container use?
How much CPU can it consume?
How many processes can it create?
How much I/O can it perform?

Docker needs both. Namespaces create the illusion of a separate machine, and cgroups stop that illusion from eating the whole host. Take away namespaces and the container isn't isolated; take away cgroups and it isn't contained.

A practical inspection flow

To see all of this on your own machine, run:

docker run -d \
  --name cgroup-demo \
  --memory=128m \
  --cpus=0.5 \
  --pids-limit=50 \
  nginx

Get the container PID:

PID=$(docker inspect cgroup-demo --format '{{.State.Pid}}')
echo $PID

Check its cgroup membership:

cat /proc/$PID/cgroup

If it shows:

0::/system.slice/docker-abc123.scope

then the cgroup directory is:

/sys/fs/cgroup/system.slice/docker-abc123.scope

Now inspect the resource files:

cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.max
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.current
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/cpu.max
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/pids.max
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/pids.current

That's the actual kernel interface Docker uses to control the container, sitting right there as plain readable files. Once you've cat-ted memory.max yourself, docker run --memory never feels mysterious again.
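
The five cat commands can be folded into one small function. The sketch below is one possible shape, with show_limits as a hypothetical helper; the demo runs against a throwaway directory filled with the values the cgroup-demo flags would produce, so it works even where Docker isn't installed.

```shell
# Hypothetical helper: print each resource file in a cgroup directory, skipping any that are absent.
show_limits() {
  dir=$1
  for f in memory.max memory.current cpu.max pids.max pids.current; do
    if [ -r "$dir/$f" ]; then
      printf '%-16s %s\n' "$f" "$(cat "$dir/$f")"
    fi
  done
}

# Demo against a throwaway directory with sample values (128 MiB, half a CPU, 50 PIDs).
demo=$(mktemp -d)
echo 134217728      > "$demo/memory.max"
echo "50000 100000" > "$demo/cpu.max"
echo 50             > "$demo/pids.max"
show_limits "$demo"
rm -rf "$demo"
```

On a real cgroup v2 host you would pass the container's actual directory, for example /sys/fs/cgroup/system.slice/docker-abc123.scope from the steps above.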

The mental model

The cleanest way to think about it:

Docker starts Linux processes.

Namespaces make those processes see an isolated world.

Cgroups limit how much of the real machine those processes can consume.

`/sys/fs/cgroup` is the filesystem interface Docker uses to configure those limits.

So when you type:

docker run --memory=512m --cpus=1 nginx

you're really asking Docker to create a process and tell the kernel:

Put this process in its own resource group.
Allow it up to 512 MB of memory.
Allow it up to 1 CPU worth of time.
Track its usage.
Kill or throttle it if it crosses the line.

Containers aren't powerful because Docker invented isolation from scratch. They're powerful because Docker took primitives the kernel already had and packaged them into a workflow developers actually want to use. /sys/fs/cgroup is one of the clearest places to see that with your own eyes, one cat command at a time.
