At some point Docker stops feeling like magic. You run:
docker run --memory=256m --cpus=0.5 nginx
and somehow the container is held to at most 256 MB of memory and half a CPU. Where is that rule actually enforced? Not in the image, not in the Dockerfile, and not inside some private Docker database. It ends up as files under:
/sys/fs/cgroup
That path is one of the places where Docker stops being a product and becomes plain Linux.
/sys/fs/cgroup is not a normal folder
List it and it looks like a regular directory:
ls /sys/fs/cgroup
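On a modern system using cgroup v2, the root directory and the per-group directories beneath it hold entries like these (limit files such as cpu.max, memory.max, and pids.max appear only in the child cgroups, not at the root):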
cgroup.controllers
cgroup.procs
cpu.max
cpu.stat
memory.current
memory.max
pids.current
pids.max
io.stat
None of these sit on disk. /sys/fs/cgroup is a virtual filesystem backed by the kernel: reading a file asks the kernel for cgroup state, and writing one changes how the kernel controls processes.
This is worth pausing on, because it changes how you should picture Docker. Docker isn't watching your container from outside and politely asking it to behave. It writes rules into these files once, and from then on the kernel does the enforcing. Docker is the manager; the kernel is the bouncer.
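If you want to confirm which cgroup version your host is running before poking at these files, the filesystem type mounted at /sys/fs/cgroup is a common tell. The sketch below is only an illustration: cgroup_version is a made-up helper name, and it assumes GNU stat, which reports cgroup2fs for the unified v2 hierarchy and tmpfs for a v1 layout.

```shell
# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup version.
# cgroup_version is a hypothetical helper name, not a Docker or kernel tool.
cgroup_version() {
    case "$1" in
        cgroup2fs) echo "v2" ;;       # unified hierarchy
        tmpfs)     echo "v1" ;;       # v1: a tmpfs holding one mount per controller
        *)         echo "unknown" ;;
    esac
}

# On a real Linux host, feed it the type that GNU stat reports:
#   cgroup_version "$(stat -fc %T /sys/fs/cgroup)"
```

The rest of this post assumes the answer is v2.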
A container is just a process in a cgroup
A Docker container is not a tiny VM. At runtime it's one or more ordinary Linux processes, and you can prove it yourself:
docker run -d --name web nginx
Ask Docker for the container's main process ID:
docker inspect web --format '{{.State.Pid}}'
Say it returns:
18432
From the host's point of view, that container is a normal Linux process with PID 18432. Now check which cgroup it belongs to:
cat /proc/18432/cgroup
On a cgroup v2 system you'll see something like:
0::/system.slice/docker-<container-id>.scope
That path maps to a directory, usually:
/sys/fs/cgroup/system.slice/docker-<container-id>.scope
and inside it are the control files for this specific container. Any limits Docker configured, such as the --memory and --cpus flags from the opening example, sit in there as readable files.
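The mapping from the /proc entry to the directory is mechanical enough to script. Here is a minimal sketch, assuming cgroup v2 mounted at /sys/fs/cgroup and a command run on the host rather than inside the container; cgroup_dir_from_entry is a hypothetical helper name.

```shell
# Turn the cgroup v2 entry from /proc/<pid>/cgroup into a directory path.
# cgroup_dir_from_entry is a hypothetical helper. It assumes the unified
# hierarchy is mounted at /sys/fs/cgroup and that the entry starts with "0::".
cgroup_dir_from_entry() {
    case "$1" in
        0::*) printf '%s\n' "/sys/fs/cgroup${1#0::}" ;;  # drop "0::", add the mount point
        *)    return 1 ;;                                # not a cgroup v2 entry
    esac
}

# Usage on a real host, with the PID from docker inspect:
#   cgroup_dir_from_entry "$(grep '^0::' /proc/18432/cgroup)"
```

It returns a non-zero status for v1-style entries, so a script can fall back to something else instead of printing a wrong path.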
What happens when you set a memory limit
Run a container with a memory limit:
docker run -d \
--name limited-nginx \
--memory=256m \
nginx
Docker does not inject special memory-limiting code into Nginx. The flow is roughly:
docker CLI
↓
dockerd
↓
containerd
↓
runc
↓
create process
↓
place process into cgroup
↓
write memory limit into /sys/fs/cgroup
↓
kernel enforces the limit
On cgroup v2 the memory limit lives in memory.max. If the container's cgroup path is:
/sys/fs/cgroup/system.slice/docker-abc123.scope
then Docker writes 268435456 (256 × 1024 × 1024 bytes, which is what Docker means by 256m) into:
/sys/fs/cgroup/system.slice/docker-abc123.scope/memory.max
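To check the arithmetic yourself, here is a rough sketch of the conversion from a --memory style value to the byte count you should expect in memory.max. It assumes suffixes are powers of 1024 and only handles whole decimal numbers; mem_to_bytes is a hypothetical helper, not something Docker ships.

```shell
# Convert a size such as 256m into bytes, treating suffixes as powers of 1024.
# mem_to_bytes is a hypothetical helper; it accepts whole numbers only.
# The ( ... ) body runs in a subshell so num and unit stay local.
mem_to_bytes() (
    num=${1%[bkmgBKMG]}     # strip one trailing unit letter, if any
    unit=${1#"$num"}         # the letter that was stripped (may be empty)
    case "$num" in
        ''|*[!0-9]*) echo "unsupported size: $1" >&2; exit 1 ;;
    esac
    case "$unit" in
        ''|b|B) echo "$num" ;;
        k|K)    echo $((num * 1024)) ;;
        m|M)    echo $((num * 1024 * 1024)) ;;
        g|G)    echo $((num * 1024 * 1024 * 1024)) ;;
    esac
)

mem_to_bytes 256m   # prints 268435456
```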
You can look at both the limit and the current usage:
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.max
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.current
Docker wrote memory.max; the kernel keeps memory.current up to date. When the cgroup hits the limit, the kernel first tries to reclaim memory from it, and if that isn't enough, the kernel's out-of-memory killer kills a process inside the cgroup. If you've ever stared at an OOM-killed container while free -h on the host showed gigabytes of headroom, this is the explanation you were missing: the host had memory, but the cgroup didn't.
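The kernel also keeps a tally of those kills. On cgroup v2 each non-root cgroup has a memory.events file with an oom_kill counter, and reading it is a quick way to confirm that a container died from its own limit. A small sketch, with oom_kills as a hypothetical helper name:

```shell
# Print the oom_kill counter from a cgroup's memory.events file.
# oom_kills is a hypothetical helper; $1 is the cgroup directory.
oom_kills() {
    awk '$1 == "oom_kill" { print $2 }' "$1/memory.events"
}

# Usage on a real host:
#   oom_kills /sys/fs/cgroup/system.slice/docker-abc123.scope
```

A non-zero number means the kernel killed at least one process in that cgroup for exceeding its memory limit. Note that the cgroup directory disappears once the container is removed, so check while it still exists.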
What happens when you set a CPU limit
Now run:
docker run -d \
--name cpu-limited-nginx \
--cpus=0.5 \
nginx
On cgroup v2, CPU quota is controlled by cpu.max:
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/cpu.max
which might show:
50000 100000
The first number is the quota and the second is the period, both in microseconds. This cgroup may use 50,000 microseconds of CPU time in every 100,000-microsecond window, which works out to half a CPU. So docker run --cpus=0.5 nginx eventually becomes a scheduler rule of cpu.max = 50000 100000, enforced by the kernel on every scheduling decision. Nobody is pausing Nginx by hand.
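The translation in both directions is simple arithmetic. The sketch below assumes the 100000-microsecond period shown above, and both function names are hypothetical helpers for illustration.

```shell
# Translate a --cpus value into the "quota period" pair stored in cpu.max,
# assuming a 100000-microsecond period. Hypothetical helper.
cpus_to_cpu_max() {
    awk -v cpus="$1" 'BEGIN { printf "%d 100000\n", cpus * 100000 + 0.5 }'
}

# Translate the contents of cpu.max back into a CPU count.
# A quota of "max" means no limit is set. Hypothetical helper.
cpu_max_to_cpus() {
    awk -v line="$1" 'BEGIN {
        split(line, f, " ")
        if (f[1] == "max") print "unlimited"
        else printf "%.2f\n", f[1] / f[2]
    }'
}

cpus_to_cpu_max 0.5              # prints 50000 100000
cpu_max_to_cpus "50000 100000"   # prints 0.50
```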
What about process limits
This command controls how many processes the container can create:
docker run -d \
--name pid-limited \
--pids-limit=100 \
nginx
On cgroup v2 it maps to pids.max, which would hold 100 here. Two files tell the story:
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/pids.max
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/pids.current
The first is the maximum allowed, the second is how many exist right now (threads count toward the total, not just processes). This is your fork-bomb protection. Without a PID limit, a broken or malicious container could keep spawning processes until the host became unstable; with one, the kernel refuses the fork the moment the cgroup hits its cap, and the call fails with EAGAIN. Cgroups don't just count things, they say no.
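Subtracting one file from the other tells you how close a container is to that cap. A minimal sketch, where pids_headroom is a hypothetical helper and the value max in pids.max means no limit was set:

```shell
# Report how many more tasks a cgroup may create before hitting pids.max.
# pids_headroom is a hypothetical helper; $1 is the cgroup directory.
# The ( ... ) body runs in a subshell so max and cur stay local.
pids_headroom() (
    max=$(cat "$1/pids.max")
    cur=$(cat "$1/pids.current")
    if [ "$max" = "max" ]; then
        echo "unlimited"          # no PID limit configured
    else
        echo $((max - cur))
    fi
)

# Usage on a real host:
#   pids_headroom /sys/fs/cgroup/system.slice/docker-abc123.scope
```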
The important file: cgroup.procs
Every cgroup needs to know which processes belong to it, and cgroup.procs records exactly that: the PIDs attached to the cgroup. Docker creates a cgroup for the container, then places the container's process into it. Once the process is inside, every rule in that cgroup applies to it and to all of its children:
container process
└── child process
└── child process
The whole tree is charged to the same resource-control group. If Nginx forks a dozen workers, they all draw from the same 256 MB, which is exactly what you want from something calling itself a container.
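You can see that membership list for yourself. The following sketch walks cgroup.procs and pairs each PID with its command name from /proc; cgroup_members is a hypothetical helper, and it prints ? when a process has already exited or its /proc entry is unreadable.

```shell
# List each PID attached to a cgroup along with its command name.
# cgroup_members is a hypothetical helper; $1 is the cgroup directory.
# The ( ... ) body runs in a subshell so pid and name stay local.
cgroup_members() (
    while read -r pid; do
        # The process may have exited between the two reads, so tolerate failure.
        name=$(cat "/proc/$pid/comm" 2>/dev/null) || name="?"
        printf '%s %s\n' "$pid" "$name"
    done < "$1/cgroup.procs"
)

# Usage on a real host:
#   cgroup_members /sys/fs/cgroup/system.slice/docker-abc123.scope
```

For an Nginx container you would expect to see the master process and its workers all listed under the same directory.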
Why the path looks different on different machines
On one system a container shows up under:
/sys/fs/cgroup/system.slice/docker-<container-id>.scope
On another it's:
/sys/fs/cgroup/docker/<container-id>
On Kubernetes you'll see paths involving:
kubepods.slice
The difference comes down to a few variables:
cgroup v1 or cgroup v2
systemd cgroup driver or cgroupfs driver
Docker directly or Kubernetes/containerd
rootful or rootless containers
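Don't let the varying paths throw you. Whatever the layout, there's a cgroup directory somewhere, the container's processes are attached to it, and the kernel enforces whatever limits were written there. Find the directory and everything in this post still applies, with one caveat: on cgroup v1 the controllers live in separate directories and the files have different names (memory.limit_in_bytes rather than memory.max, for example).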
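If you script against these paths, it helps to recognise which layout you are looking at. The sketch below only pattern-matches the three shapes mentioned above; cgroup_layout and its labels are hypothetical, and real systems (rootless setups in particular) may use paths that need extra cases.

```shell
# Guess the cgroup layout from a path as shown in /proc/<pid>/cgroup.
# cgroup_layout and its output labels are hypothetical, for illustration only.
cgroup_layout() {
    case "$1" in
        *kubepods*)        echo "kubernetes" ;;       # Kubernetes-managed pod
        */docker-*.scope)  echo "docker-systemd" ;;   # systemd cgroup driver
        */docker/*)        echo "docker-cgroupfs" ;;  # cgroupfs driver
        *)                 echo "unknown" ;;
    esac
}

cgroup_layout "/system.slice/docker-abc123.scope"   # prints docker-systemd
cgroup_layout "/docker/abc123"                      # prints docker-cgroupfs
```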
Docker does not work alone
"Docker uses cgroups" compresses a whole runtime stack into one sentence. In practice Docker delegates most of the low-level work:
docker CLI
talks to
dockerd
talks to
containerd
talks to
runc
talks to
Linux kernel
runc deserves special mention. It's the low-level OCI runtime that actually builds the container process out of Linux primitives: it sets up namespaces, configures mounts, applies capabilities, configures seccomp, and places the process into the right cgroups. Docker gives you a pleasant command-line experience on top, but if you want to know where a "container" actually comes into existence, runc is the place to look.
Namespaces isolate what the container sees
Cgroups control what the container can use; namespaces control what it can see. Namespaces answer questions like:
Which processes can this container see?
Which network interfaces can it see?
What hostname does it see?
What filesystem root does it see?
Cgroups answer a different set:
How much memory can this container use?
How much CPU can it consume?
How many processes can it create?
How much I/O can it perform?
Docker needs both. Namespaces create the illusion of a separate machine, and cgroups stop that illusion from eating the whole host. Take away namespaces and the container isn't isolated; take away cgroups and it isn't contained.
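Namespaces are just as inspectable as cgroups. Every process has a /proc/<pid>/ns directory of symlinks, and two processes share a namespace exactly when the matching links point at the same identifier. Here is a rough sketch for comparing them; ns_id and same_namespace are hypothetical helpers, and PROC_ROOT is an invented override that exists only to make the functions easy to try against a fake directory. Reading another process's links generally needs root.

```shell
# Print the namespace identifier (for example pid:[4026531836]) for one
# process and one namespace type such as pid, net, mnt, or uts.
# PROC_ROOT is a hypothetical override; it defaults to /proc.
ns_id() {
    readlink "${PROC_ROOT:-/proc}/$1/ns/$2"
}

# Succeed when two PIDs share the given namespace type.
# Returns 2 if either link cannot be read (missing process or no permission).
same_namespace() (
    a=$(ns_id "$1" "$3") || exit 2
    b=$(ns_id "$2" "$3") || exit 2
    [ "$a" = "$b" ]
)

# Usage on a real host, as root, with the PID from docker inspect:
#   same_namespace 1 18432 pid || echo "container has its own PID namespace"
```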
A practical inspection flow
To see all of this on your own machine, run:
docker run -d \
--name cgroup-demo \
--memory=128m \
--cpus=0.5 \
--pids-limit=50 \
nginx
Get the container PID:
PID=$(docker inspect cgroup-demo --format '{{.State.Pid}}')
echo "$PID"
Check its cgroup membership:
cat /proc/$PID/cgroup
If it shows:
0::/system.slice/docker-abc123.scope
then the cgroup directory is:
/sys/fs/cgroup/system.slice/docker-abc123.scope
Now inspect the resource files:
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.max
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/memory.current
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/cpu.max
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/pids.max
cat /sys/fs/cgroup/system.slice/docker-abc123.scope/pids.current
That's the actual kernel interface Docker uses to control the container, sitting right there as plain readable files. Once you've cat-ted memory.max yourself, docker run --memory never feels mysterious again.
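If you find yourself repeating those cat commands, they fold into one small function. This is only a sketch under the same assumptions as the walkthrough (cgroup v2, run on the host); cgroup_report is a hypothetical helper, and it prints a placeholder instead of failing when a file is absent.

```shell
# Print the main resource files of one cgroup directory as a small table.
# cgroup_report is a hypothetical helper; $1 is the cgroup directory.
# The ( ... ) body runs in a subshell so the loop variable stays local.
cgroup_report() (
    for f in memory.max memory.current cpu.max pids.max pids.current; do
        if [ -r "$1/$f" ]; then
            printf '%-15s %s\n' "$f" "$(cat "$1/$f")"
        else
            printf '%-15s %s\n' "$f" "(not present)"
        fi
    done
)

# Usage on a real host:
#   PID=$(docker inspect cgroup-demo --format '{{.State.Pid}}')
#   cgroup_report "/sys/fs/cgroup$(sed -n 's/^0:://p' "/proc/$PID/cgroup")"
```

For the cgroup-demo container above, memory.max should read 134217728, cpu.max should read 50000 100000, and pids.max should read 50.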
The mental model
The cleanest way to think about it:
Docker starts Linux processes.
Namespaces make those processes see an isolated world.
Cgroups limit how much of the real machine those processes can consume.
/sys/fs/cgroup is the filesystem interface Docker uses to configure those limits.
So when you type:
docker run --memory=512m --cpus=1 nginx
you're really asking Docker to create a process and tell the kernel:
Put this process in its own resource group.
Allow it up to 512 MB of memory.
Allow it up to one CPU's worth of time.
Track its usage.
Kill or throttle it if it crosses the line.
Containers aren't powerful because Docker invented isolation from scratch. They're powerful because Docker took primitives the kernel already had and packaged them into a workflow developers actually want to use. /sys/fs/cgroup is one of the clearest places to see that with your own eyes, one cat command at a time.