William Kwabena Akoto

Posted on May 30

cgroups and Namespaces — The Linux Kernel's Building Blocks Behind Containers

#containers #cgroups #namespaces #linux

Every container you have ever run is, at its core, a process with a restricted view of the world and a capped share of the machine's resources. That restriction and that cap come from exactly two kernel features: namespaces and cgroups. Everything else is plumbing.

The Problem They Solved

Long before Docker, system administrators wrestled with a deceptively simple question: how do you run multiple workloads on the same machine without them interfering with each other?

The classical answers were clunky. Separate physical machines were expensive. Full virtual machines were slow to boot and heavy on resources. chroot jails, introduced in Unix back in 1979, changed a process's view of the filesystem but nothing else. A chrooted process still shared the same process table, network stack, and user database as everything else on the host. The walls only went up in one direction.

What the industry needed was something built into the kernel itself: lightweight enough to start in milliseconds, and safe enough for multi-tenant workloads. Linux delivered that in two independent features developed years apart: namespaces (first type in 2002, with user namespaces completed in 2013) and cgroups (merged in 2007).

Namespaces — A Selective Blindfold

A namespace wraps a global system resource and gives each participating process its own isolated instance of it. The process believes it has exclusive ownership. The kernel is quietly managing multiple independent copies underneath.

There are eight namespace types, each isolating a different dimension of the system:

Mount (CLONE_NEWNS): The first and oldest. Isolates the filesystem mount table. A container can have its own /etc/hosts, its own /proc, and its own root filesystem, with none of it visible to other containers.

PID (CLONE_NEWPID): The first process in a new PID namespace gets PID 1, regardless of its real PID on the host. This is why containers need an init process (tini, dumb-init): PID 1 is responsible for reaping orphaned children, and a naive application running as PID 1 will silently accumulate zombie processes.

Network (CLONE_NEWNET): Isolates the entire network stack: interfaces, routing tables, iptables rules, port bindings. Each new network namespace starts with only a loopback interface. Container runtimes wire it to the outside world via veth pairs, that is, virtual Ethernet cables with one end inside the container and one on the host. This is why two containers can both listen on port 8080 without conflict.

User (CLONE_NEWUSER): The most powerful, and the last of the original six types to be completed (Linux 3.8, 2013); the cgroup and time namespaces arrived later. Maps UIDs inside the namespace to different UIDs on the host. A process can be root (UID 0) inside a container but map to an unprivileged UID like 100000 on the host. This is the foundation of rootless containers: root inside the container carries no elevated privileges on the host.

UTS, IPC, Cgroup, Time: Isolate the hostname and domain name, System V IPC objects and POSIX message queues, the cgroup hierarchy view, and the monotonic and boot-time clocks, respectively. Less glamorous, all essential.
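
The UID mapping described for user namespaces is exposed in /proc/<pid>/uid_map as lines of three numbers: the first UID inside the namespace, the first UID outside it, and the length of the range. The sketch below translates a container UID to a host UID from one such line. map_uid is a hypothetical helper, and the mapping "0 100000 65536" is an assumed example rather than output from a real container.

```shell
# map_uid INSIDE_UID "INSIDE_START OUTSIDE_START LENGTH"
# Translates a UID inside a user namespace to the UID seen on the host,
# using one line in the format of /proc/<pid>/uid_map. Hypothetical helper.
map_uid() {
    uid=$1
    # Split the mapping line into its three fields.
    set -- $2
    inside_start=$1 outside_start=$2 length=$3
    if [ "$uid" -ge "$inside_start" ] && [ "$uid" -lt $((inside_start + length)) ]; then
        echo $((outside_start + uid - inside_start))
    else
        echo "unmapped"
        return 1
    fi
}

# Container root (UID 0) under the assumed mapping "0 100000 65536":
map_uid 0 "0 100000 65536"       # prints 100000
map_uid 1000 "0 100000 65536"    # prints 101000
```

A UID outside the mapped range has no host equivalent, which is why the helper reports "unmapped" instead of guessing.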

Three syscalls drive everything: clone() creates a process in new namespaces, unshare() moves the calling process into new ones, and setns() joins an existing namespace by file descriptor. The nsenter tool is a thin wrapper over setns(), and setns() is also how docker exec attaches to a running container's namespaces.
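
Each namespace a process belongs to shows up as a symbolic link under /proc/<pid>/ns/, and two processes share a namespace exactly when the link targets match. Here is a small sketch for inspecting and comparing them. ns_id and same_ns are hypothetical helper names, and the sketch assumes a Linux host with /proc mounted and a kernel recent enough to expose all eight types (older kernels lack the time entry).

```shell
# ns_id PID TYPE: print the namespace identifier, e.g. "pid:[4026531836]".
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns PID_A PID_B TYPE: succeed if both processes share that namespace.
same_ns() {
    [ "$(ns_id "$1" "$3")" = "$(ns_id "$2" "$3")" ]
}

# List every namespace of the current shell.
for type in mnt pid net user uts ipc cgroup time; do
    echo "$type -> $(ns_id $$ "$type")"
done
```

Comparing a containerised process against a host process this way shows which namespaces the runtime actually created. Reading another user's /proc/<pid>/ns/ entries generally requires sufficient privileges.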

cgroups — The Resource Accountant

Namespaces answer "what can a process see?" cgroups answer "what can a process consume?"

The project started at Google in 2006. Engineers Paul Menage and Rohit Seth were working on enormous shared Linux clusters, the kind of infrastructure run by Google's internal scheduler Borg, which later inspired Kubernetes. ulimit, the shell built-in for checking and setting the resources a process may consume, wasn't cutting it, because its limits apply per process. There was no way to say: this entire group of processes, however many they fork, shall not exceed 2 GB of RAM.

The initial name was "process containers", which caused enough confusion with the broader container concept that Paul Menage renamed the feature to control groups (cgroups) before merging.

Version 1: Powerful but Fragmented

cgroups version 1 (Linux 2.6.24, 2008) lets you organise processes into a hierarchy and attach controllers, the subsystems that enforce limits on specific resources. The hierarchy is exposed as a pseudo-filesystem under /sys/fs/cgroup/.

The critical innovation was group-level accounting. When a memory-limited cgroup hits its ceiling, the OOM killer fires within that group, not across the whole machine. Noisy neighbours became a manageable problem.

But version 1 had real issues: each controller could have its own independent hierarchy tree, making it impossible to atomically apply a consistent resource profile. The interfaces were inconsistent across controllers. Fixes accumulated as improvised additions.

Version 2: One Hierarchy to Rule Them All

cgroups version 2 (Linux 4.5, 2016) made one foundational change: a single unified hierarchy with all controllers attached to it. The fragmentation was gone.
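
A common way to tell which version a host is running is to check the filesystem type mounted at /sys/fs/cgroup/. The sketch below assumes GNU coreutils stat and uses a hypothetical helper, cgroup_version. A unified v2 mount reports cgroup2fs, while a v1 (or hybrid) layout reports tmpfs at that path.

```shell
# cgroup_version FSTYPE: map a filesystem type name to a cgroup version.
cgroup_version() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# stat -f reports on the filesystem; %T prints its type in readable form.
cgroup_version "$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null)"
```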

Version 2 later gained PSI (Pressure Stall Information, Linux 4.20), a Facebook contribution that exposes how much time processes spend waiting for CPU, memory, or I/O rather than running. Raw usage numbers tell you what's consumed; PSI tells you when a resource is actually causing slowdowns.
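
PSI data is plain text: system-wide under /proc/pressure/ and per cgroup in files such as memory.pressure. Each file has a "some" line (at least one task stalled) and usually a "full" line (all non-idle tasks stalled), with averages over 10, 60, and 300 seconds. Here is a minimal parsing sketch. psi_avg10 is a hypothetical helper and the sample numbers are made up for illustration.

```shell
# psi_avg10 FILE KIND: print the 10-second average from a PSI file.
# KIND is "some" or "full".
psi_avg10() {
    awk -v kind="$2" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }' "$1"
}

# Run against a sample file so the sketch works without PSI support.
sample=$(mktemp)
printf '%s\n' \
    'some avg10=1.25 avg60=0.40 avg300=0.10 total=123456' \
    'full avg10=0.50 avg60=0.10 avg300=0.02 total=65432' > "$sample"
psi_avg10 "$sample" some    # prints 1.25
psi_avg10 "$sample" full    # prints 0.50
rm -f "$sample"
```

On a real host the first argument would be something like /proc/pressure/memory or a cgroup's memory.pressure file.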

The Controllers That Matter

CPU: Two modes. Weights (cpu.weight) give proportional shares of CPU during contention, which is useful for fairness. Bandwidth (cpu.max) is a hard cap expressed as a quota and a period, both in microseconds:

# Allow 50% of one CPU (50ms out of every 100ms)
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max
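
The quota is the allowed fraction of CPU time multiplied by the period. The sketch below computes the pair for a given percentage of one CPU. cpu_max_value is a hypothetical helper, and the 100000 microsecond default period is an assumption that matches the example above.

```shell
# cpu_max_value PERCENT [PERIOD_US]: print a "quota period" pair for cpu.max.
# PERCENT is relative to one CPU, so 150 means one and a half CPUs.
cpu_max_value() {
    percent=$1
    period=${2:-100000}
    echo "$((percent * period / 100)) $period"
}

cpu_max_value 50          # prints 50000 100000
cpu_max_value 150         # prints 150000 100000
cpu_max_value 25 20000    # prints 5000 20000
```

Writing the word max in place of the quota (for example "max 100000") removes the cap again.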

Memory: The soft limit (memory.high) slows processes down and triggers reclaim when crossed. The hard limit (memory.max) triggers the OOM killer within the cgroup. These two knobs together let you apply back-pressure gracefully before resorting to killing.

PIDs: A simple ceiling (pids.max) on the total number of processes and threads a cgroup can contain. It is the defense against fork bombs in containers.

Block I/O: Per-device read/write bandwidth and IOPS limits (io.max), critical for preventing a log-spewing service from saturating a shared disk.
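
Setting these limits amounts to writing values into files inside the cgroup's directory. The sketch below wraps the memory and pids knobs in a hypothetical helper, apply_limits, which takes the directory as an argument so it can be tried against a scratch directory first. The cgroup name myapp and the values in the commented example are assumptions.

```shell
# apply_limits DIR MEM_HIGH MEM_MAX PIDS_MAX
# Writes the limits into the control files of the cgroup directory DIR.
apply_limits() {
    dir=$1
    mkdir -p "$dir" || return 1
    echo "$2" > "$dir/memory.high" || return 1
    echo "$3" > "$dir/memory.max"  || return 1
    echo "$4" > "$dir/pids.max"    || return 1
}

# On a real host this needs root, a cgroup v2 mount, and the memory and pids
# controllers enabled in the parent's cgroup.subtree_control, for example:
#   apply_limits /sys/fs/cgroup/myapp 400M 512M 100
#   echo $$ > /sys/fs/cgroup/myapp/cgroup.procs   # move this shell into it
```

Keeping memory.high below memory.max gives the kernel room to throttle and reclaim before the OOM killer becomes the only option.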

How They Work Together

Namespaces and cgroups are orthogonal but designed to combine. A container is their intersection:

Concern	Mechanism
What processes are visible	PID namespace
What filesystem is visible	Mount namespace
What network exists	Network namespace
How much CPU can be used	cgroup cpu controller
How much memory can be used	cgroup memory controller
How many processes can exist	cgroup pids controller

The Open Container Initiative Runtime Specification formalises this combination. A container bundle is a root filesystem plus a config.json that declares which namespaces to create and what resource limits to apply. runc, the reference runtime, translates this spec into the actual kernel operations (clone(), setns(), writes to /sys/fs/cgroup/) and then execs the entrypoint.

When you run docker run, the call chain is:

Docker CLI → dockerd → containerd → containerd-shim → runc → kernel syscalls

runc's entire job is configuring namespaces and cgroups, then handing off to your application.

Kubernetes on Top

Kubernetes sits several layers above the kernel, but its abstractions map directly back down to cgroups.

Each node's cgroup tree under /sys/fs/cgroup/kubepods/ (kubepods.slice/ when the kubelet uses the systemd cgroup driver) mirrors the pod QoS hierarchy:

Guaranteed pods (requests == limits): hard cpu.max and memory.max set to the requested values.
Burstable pods (requests < limits): proportional CPU shares, memory limited to the limit value.
BestEffort pods (no requests or limits): lowest CPU shares, placed in a group that's evicted first under memory pressure.
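
The classification above can be sketched as a small function. This is a simplification under stated assumptions: it covers a single-container pod, compares values as plain strings (so 500m and 0.5 would not be treated as equal), and qos_class is a hypothetical helper rather than anything Kubernetes ships. It does follow the Kubernetes rule that an omitted request defaults to the limit.

```shell
# qos_class CPU_REQ CPU_LIM MEM_REQ MEM_LIM
# Empty strings mean "not set". Simplified to a single-container pod.
qos_class() {
    if [ -z "$1$2$3$4" ]; then
        echo "BestEffort"
    elif [ -n "$2" ] && [ -n "$4" ] \
        && [ "${1:-$2}" = "$2" ] && [ "${3:-$4}" = "$4" ]; then
        # Both limits set, and each request equals (or defaults to) its limit.
        echo "Guaranteed"
    else
        echo "Burstable"
    fi
}

qos_class 500m 500m 256Mi 256Mi    # prints Guaranteed
qos_class 250m 500m 256Mi 512Mi    # prints Burstable
qos_class "" "" "" ""              # prints BestEffort
```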

When you write resources.limits.memory: 512Mi in a pod spec, the container runtime, instructed by the kubelet, writes 536870912 to that container's memory.max file. When the container exceeds it, the OOM killer fires, and that is the OOMKilled status you've seen in kubectl describe pod. It's the kernel doing its job, surfaced through the Kubernetes API.
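
The number 536870912 is simply 512 × 1024 × 1024. Here is a minimal sketch of that conversion. to_bytes is a hypothetical helper that handles only the Ki, Mi, and Gi suffixes and plain integers, not the full Kubernetes quantity syntax, and it assumes a shell with 64-bit arithmetic.

```shell
# to_bytes QUANTITY: convert a binary-suffixed quantity to bytes.
to_bytes() {
    case "$1" in
        *Ki) echo $(( ${1%Ki} * 1024 )) ;;
        *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
        *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
        *)   echo "$1" ;;
    esac
}

to_bytes 512Mi    # prints 536870912
to_bytes 2Gi      # prints 2147483648
```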

Conclusion

The entire container ecosystem (Docker, Kubernetes, Fargate, Lambda, GitHub Actions runners) is built on two kernel primitives that are, conceptually, not complicated:

Namespaces change what a process can see. cgroups change what a process can consume.

That's it. Everything above them (the image layers, the registries, the schedulers, the service meshes) is tooling built to make working with those two primitives at scale more ergonomic.

When a pod is OOM-killed, read memory.events. When a job is mysteriously slow, check cpu.stat for throttling. When containers can't talk to each other, trace the network namespace. The kernel tells you everything; you just need to know where to look.
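
Both memory.events and cpu.stat are flat "key value" files, so the first two checks reduce to pulling out one counter. Here is a minimal sketch. event_count is a hypothetical helper, and the sample contents are made up so that it runs without a real cgroup.

```shell
# event_count FILE KEY: print the value for KEY from a flat "key value" file
# such as memory.events or cpu.stat.
event_count() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Sample files standing in for a real cgroup directory.
dir=$(mktemp -d)
printf '%s\n' 'low 0' 'high 12' 'max 3' 'oom 1' 'oom_kill 1' > "$dir/memory.events"
printf '%s\n' 'usage_usec 900000' 'nr_periods 50' 'nr_throttled 7' 'throttled_usec 41000' > "$dir/cpu.stat"

event_count "$dir/memory.events" oom_kill    # prints 1
event_count "$dir/cpu.stat" nr_throttled     # prints 7
rm -rf "$dir"
```

A non-zero oom_kill confirms the kernel killed something in that cgroup, and an nr_throttled that keeps climbing means the cpu.max quota is being hit.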

DEV Community