Matheus

Posted on Feb 21 • Edited on Mar 18 • Originally published at releaserun.com

Container Escape Vulnerabilities: The CVEs That Shaped Docker and Kubernetes Security

#containers #docker #kubernetes #security

Why Container Escapes Matter

Containers are not virtual machines. A virtual machine runs its own kernel on emulated hardware, creating a strong isolation boundary. A container shares the host kernel with every other container on the system -- isolation comes from Linux kernel features (namespaces, cgroups, capabilities, seccomp filters), not from a hardware-enforced boundary.

When an attacker escapes a container, they break through those kernel-level abstractions and gain access to the host. From there, they can reach every other container on that node, access mounted secrets and credentials, and pivot deeper into the cluster. In a Kubernetes or Docker production environment, a single container escape can compromise an entire node and, in the worst case, the entire cluster.

This article covers the most significant container escape CVEs from 2017 through 2024: how each exploit worked, what made it possible, and how the ecosystem responded. The same classes of bugs keep resurfacing, and the defensive patterns developed in response form the foundation of modern container security.

CVE-2017-5123: The waitid Kernel Exploit

What Happened

In October 2017, a vulnerability was discovered in the Linux kernel's waitid() system call. During a refactor of the waitid code in kernel version 4.13, a critical check was accidentally removed: the access_ok() call that validates whether a user-supplied pointer actually points to user-space memory. Without this check, an unprivileged process could pass a pointer to kernel memory, and the kernel would happily write data to that location.
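To make the missing check concrete, here is a toy Python model of the kind of range validation access_ok() performs. It is a sketch only: the real check is kernel C code, the USER_SPACE_TOP constant assumes x86-64 with 4-level paging, and write_siginfo is a hypothetical stand-in for the kernel copying results to the caller's pointer.

```python
# Toy model of a user-pointer range check; this is not kernel code.
# Assumption: x86-64 with 4-level paging, where user space ends just below 2**47.
PAGE_SIZE = 4096
USER_SPACE_TOP = (1 << 47) - PAGE_SIZE


def access_ok(addr: int, size: int) -> bool:
    """Return True only if [addr, addr + size) lies entirely in user space."""
    return addr >= 0 and size >= 0 and addr + size <= USER_SPACE_TOP


def write_siginfo(addr: int, size: int, check_pointer: bool) -> str:
    """Hypothetical stand-in for the kernel writing results to a caller-supplied pointer."""
    if check_pointer and not access_ok(addr, size):
        return "EFAULT"  # patched behaviour: the pointer is rejected
    return "written"     # without the check, any address is accepted


user_buffer = 0x00007FFD_1234_0000     # an address in the user half
kernel_address = 0xFFFF_8880_0000_1000  # an address in the kernel half

print(write_siginfo(user_buffer, 128, check_pointer=True))
print(write_siginfo(kernel_address, 128, check_pointer=True))
print(write_siginfo(kernel_address, 128, check_pointer=False))
```

The third call models the vulnerable code path: with the check skipped, a kernel address is treated like any other destination.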

How the Exploit Worked

The bug allowed an attacker to write partially controlled data to an arbitrary kernel memory address. While the attacker could not fully control the content being written -- the kernel wrote a siginfo_t structure with fields determined by process state -- careful manipulation of which process was being waited on gave enough control to be dangerous.

The container escape leveraged this kernel write primitive to modify the calling process's capability structure in kernel memory. Docker containers run with a restricted set of Linux capabilities, which is one of the primary mechanisms preventing containerized processes from performing privileged operations on the host. By overwriting the capability bitmask, the attacker could grant themselves CAP_SYS_ADMIN and CAP_NET_ADMIN -- effectively breaking out of the container's capability restrictions and gaining host-level privileges.
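The capability bitmask is easier to reason about when decoded. The sketch below maps a mask to capability names using bit numbers from linux/capability.h (only a subset is listed). The DOCKER_DEFAULT value is an assumption: it is the effective mask commonly seen in /proc/self/status (the CapEff field) inside a default Docker container, and it may differ across versions and configurations.

```python
# Capability bit numbers from <linux/capability.h> (subset only).
CAP_BITS = {
    0: "CAP_CHOWN", 1: "CAP_DAC_OVERRIDE", 3: "CAP_FOWNER", 4: "CAP_FSETID",
    5: "CAP_KILL", 6: "CAP_SETGID", 7: "CAP_SETUID", 8: "CAP_SETPCAP",
    10: "CAP_NET_BIND_SERVICE", 12: "CAP_NET_ADMIN", 13: "CAP_NET_RAW",
    18: "CAP_SYS_CHROOT", 21: "CAP_SYS_ADMIN", 27: "CAP_MKNOD",
    29: "CAP_AUDIT_WRITE", 31: "CAP_SETFCAP",
}

# Assumed example value: a typical CapEff for a default Docker container.
DOCKER_DEFAULT = 0x00000000A80425FB


def decode_caps(mask: int) -> set:
    """Return the names of the capabilities whose bits are set in mask."""
    return {name for bit, name in CAP_BITS.items() if mask & (1 << bit)}


# Flipping two bits is all it takes to add the capabilities named above.
tampered = DOCKER_DEFAULT | (1 << 21) | (1 << 12)

print(sorted(decode_caps(DOCKER_DEFAULT)))
print(sorted(decode_caps(tampered) - decode_caps(DOCKER_DEFAULT)))
```

This only illustrates the data layout; it does not modify any process's capabilities.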

Impact and Fix

This vulnerability affected Linux kernel 4.13 through 4.14.0-rc4. The fix was straightforward: re-adding the access_ok() check to validate that the user-provided pointer targets user-space memory. The bug was introduced on May 21, 2017 and patched on October 9, 2017.

CVE-2017-5123 demonstrated something fundamental: containers share the host kernel, and a kernel vulnerability is a container escape vulnerability. No amount of namespace isolation matters if the kernel itself can be tricked into overwriting its own security data structures.

CVE-2019-5736: The runc Overwrite

What Happened

Disclosed on February 11, 2019, CVE-2019-5736 was arguably the most impactful container escape vulnerability ever published. It affected runc, the low-level container runtime used by Docker, containerd, CRI-O, and essentially every OCI-compliant container platform. The vulnerability allowed a malicious process inside a container to overwrite the host's runc binary, gaining root-level code execution on the host.

How the Exploit Worked

The exploit took advantage of how Linux handles /proc/self/exe. This special file is a symbolic link that points to the binary of the currently running process. When runc executes a command inside a container (via docker exec or similar), there is a brief window where the container's process can access the runc binary through /proc/self/exe.

The attack worked in two stages:

Set the trap. The attacker replaces the container's /bin/sh (or another entrypoint binary) with a script containing #!/proc/self/exe. This tells the kernel to execute the binary that /proc/self/exe points to -- which, during a docker exec call, is the host's runc binary.
Overwrite runc. When runc enters the container and the tampered entrypoint executes, the process gets a file handle to the host's runc binary via /proc/self/exe. The attacker then writes a malicious payload to this file handle, overwriting the host's runc binary with attacker-controlled code.

The next time any container operation invokes runc on that host -- starting a container, running exec, or even performing a health check -- the attacker's payload executes with root privileges on the host.

Impact and Fix

The severity was enormous. The exploit required only UID 0 inside the container (which is the default for most container images) and worked with default Docker configurations. No special privileges, no host mounts, no unusual capabilities. It affected Docker, Kubernetes, and any platform using runc versions up to and including 1.0-rc6; Docker shipped the fix in 18.09.2.

The fix changed runc's behavior so that it creates a copy of itself as a sealed, read-only file descriptor (using memfd_create with F_SEAL flags) before entering the container. When the malicious process attempts to write to /proc/self/exe, the kernel blocks the write because the file descriptor is sealed.

CVE-2019-1002101: kubectl cp Directory Traversal

What Happened

While most container escape CVEs involve breaking out of a running container, CVE-2019-1002101 took a different approach: it targeted the operator's workstation. This vulnerability allowed a malicious container to write arbitrary files to the machine of any Kubernetes user who ran kubectl cp to copy files from that container.

How the Exploit Worked

The kubectl cp command works by creating a tar archive inside the target container, streaming it over the network to the user's machine, and extracting it locally. The vulnerability was a classic directory traversal: the tar archive created inside the container could include file paths containing ../ sequences, and kubectl did not sanitize these paths before extraction.

If an attacker controlled the tar binary inside a container, they could craft filenames like ../../../etc/cron.d/backdoor. When the unsuspecting operator ran kubectl cp mypod:/data ./local-dir, the malicious tar entries would be extracted outside the intended destination directory, writing files anywhere the user had permissions.

Impact and Fix

The fix added path validation to reject directory traversal sequences during tar extraction. The initial fix was incomplete -- follow-up CVEs (CVE-2019-11246 and CVE-2019-11249) addressed bypass techniques, highlighting how tricky path sanitization can be.
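The core of the validation can be sketched with Python's tarfile module. This is an illustration of the general technique, not kubectl's implementation (kubectl is written in Go). The helper names are hypothetical, and the check deliberately ignores symlink entries, which a complete defense also has to handle.

```python
import io
import os
import tarfile


def escapes_destination(dest: str, member_name: str) -> bool:
    """Return True if extracting member_name under dest would land outside dest."""
    dest_abs = os.path.abspath(dest)
    target = os.path.abspath(os.path.join(dest_abs, member_name))
    return os.path.commonpath([dest_abs, target]) != dest_abs


def find_traversal_entries(archive: bytes, dest: str) -> list:
    """List archive member names that would escape dest (symlinks not considered)."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        return [m.name for m in tar.getmembers()
                if escapes_destination(dest, m.name)]


def build_demo_archive(names) -> bytes:
    """Build an in-memory tar with the given member names (demo data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            data = b"demo"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


archive = build_demo_archive(["data/ok.txt", "../../etc/cron.d/backdoor"])
print(find_traversal_entries(archive, "./local-dir"))
```

A client that refuses to extract when this list is non-empty closes the plain ../ case; absolute member names are flagged by the same check.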

This vulnerability is a reminder that the attack surface of a Kubernetes environment extends beyond the cluster. Operator tools, CI/CD pipelines, and client-side utilities are all part of the security perimeter.

CVE-2020-15257: containerd Host Network Escape

What Happened

In November 2020, NCC Group disclosed CVE-2020-15257, a vulnerability in containerd that allowed containers running with host network access to escape to the host.

How the Exploit Worked

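containerd uses a component called containerd-shim, which runs as a parent process for each container. The shim exposes an API over an abstract namespace Unix domain socket. Abstract sockets are scoped by network namespace rather than by filesystem path, so the critical flaw was that this socket was reachable by any process in the host's network namespace, and filesystem permissions did not apply to it.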

When a container was configured with --net=host (sharing the host's network namespace), a root process inside that container could connect to the containerd-shim's abstract Unix socket. From there, the attacker could use the shim API to:

Read and write files on the host filesystem.
Execute commands on the host as root.
Spin up new, fully privileged containers.

The attack required two conditions: the container had to be running with host networking, and the process inside had to be running as UID 0.
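The property that made this possible is easy to see in a few lines of Python on Linux. An abstract socket is bound to a name that starts with a NUL byte and never appears on the filesystem, so there is no file whose permissions could restrict who connects. The sketch below is a generic demonstration, not the shim API; the socket name is made up.

```python
import os
import socket


def abstract_socket_roundtrip(name: bytes, payload: bytes) -> bytes:
    """Bind an abstract Unix socket, connect to it by name, and send payload across."""
    addr = b"\0" + name  # the leading NUL byte selects the abstract namespace
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(addr)
        server.listen(1)
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
            # Any process in the same network namespace could make this call.
            client.connect(addr)
            conn, _ = server.accept()
            with conn:
                client.sendall(payload)
                return conn.recv(len(payload))


# Hypothetical name; the PID suffix just avoids collisions between runs.
demo_name = b"demo-shim-%d" % os.getpid()
print(abstract_socket_roundtrip(demo_name, b"ping"))
```

A container with its own network namespace cannot see this socket, but a container sharing the host's network namespace can.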

Impact and Fix

The fix switched the shim API from abstract Unix sockets to file-based Unix sockets under /run/containerd, which respect filesystem permissions and namespace boundaries. Important: containers running before the upgrade retained the old socket connections and had to be restarted.

CVE-2020-15257 reinforced a well-known principle: do not use host networking unless absolutely necessary.

CVE-2024-21626: Leaky Vessels

What Happened

In January 2024, Snyk researchers disclosed a set of vulnerabilities collectively named "Leaky Vessels," with CVE-2024-21626 being the most severe. This was another runc vulnerability -- five years after CVE-2019-5736. It carried a CVSS score of 8.6.

How the Exploit Worked

The vulnerability stemmed from an internal file descriptor leak in runc. When runc set up a new container, it inadvertently leaked file descriptors that pointed to the host filesystem.

Two primary attack vectors:

Malicious container image. A Dockerfile with a WORKDIR directive set to a path like /proc/self/fd/[leaked_fd] could cause the container process to start with its working directory pointing to a host filesystem location.
Crafted exec command. An attacker with the ability to run runc exec could specify a working directory that referenced the leaked file descriptor.

What made this especially concerning was the image-based attack vector: CVE-2024-21626 could be triggered simply by building or running a malicious image pulled from a registry, with no prior foothold inside a running container.
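A simple heuristic for the image-based vector is to flag WORKDIR values that point into a process's file descriptor directory. The sketch below is an assumption-laden illustration, not the upstream fix or a complete detector: the function names are hypothetical, the Dockerfile parsing is naive, and a determined attacker may find other spellings.

```python
import re

# Matches paths such as /proc/self/fd/7 or /proc/1234/fd/8/../..
# The raw string is matched on purpose: normalising ".." first would hide the pattern.
FD_PATH = re.compile(r"^/+proc/+(?:self|thread-self|\d+)/+fd/+\d+(?:/|$)")


def is_suspicious_workdir(path: str) -> bool:
    """Return True if path points into a /proc/<pid>/fd/<n> entry."""
    return bool(FD_PATH.match(path.strip()))


def suspicious_workdirs(dockerfile_text: str) -> list:
    """Return WORKDIR arguments in a Dockerfile that look like leaked-fd paths."""
    hits = []
    for line in dockerfile_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].upper() == "WORKDIR":
            if is_suspicious_workdir(parts[1]):
                hits.append(parts[1].strip())
    return hits


dockerfile = "FROM alpine\nWORKDIR /proc/self/fd/8\nRUN true\nWORKDIR /app\n"
print(suspicious_workdirs(dockerfile))
```

The same check could be applied to the working directory passed to an exec request, though patching runc remains the actual remedy.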

Impact and Fix

The fix in runc 1.1.12 ensured that all internal file descriptors are properly closed before the container process starts. The disclosure also included three other CVEs affecting Docker's BuildKit component, demonstrating that the container build pipeline -- not just runtime -- is a significant attack surface.
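The general idea of a descriptor leaking across exec can be explored from Python on Linux. The sketch below lists descriptors in the current process that are still inheritable, meaning they would stay open in a program started via exec. It illustrates the class of bug only; it is not how runc audits its descriptors, and inheritable_fds is a hypothetical helper.

```python
import os


def inheritable_fds(min_fd: int = 3) -> list:
    """List open descriptors >= min_fd that would survive an exec() in this process."""
    found = []
    for entry in os.listdir("/proc/self/fd"):
        fd = int(entry)
        if fd < min_fd:
            continue  # skip stdin, stdout, stderr
        try:
            if os.get_inheritable(fd):
                found.append(fd)
        except OSError:
            # The descriptor used for the directory listing itself is already closed.
            continue
    return sorted(found)


# Simulate a leak: a handle on a host directory left inheritable.
leaky = os.open("/", os.O_RDONLY)
os.set_inheritable(leaky, True)
print("leaked:", leaky in inheritable_fds())

# Marking it close-on-exec (or closing it) removes it from the list.
os.set_inheritable(leaky, False)
print("leaked:", leaky in inheritable_fds())
os.close(leaky)
```

Python has made new descriptors non-inheritable by default since 3.4, which is the same defensive default in a different runtime.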

Other Notable Container Escape Vulnerabilities

Dirty COW (CVE-2016-5195)

A race condition in the Linux kernel's memory subsystem, present for nine years before discovery in October 2016. The vulnerability allowed an unprivileged process to write to read-only memory mappings. Researchers demonstrated container escape techniques using the vDSO (virtual Dynamic Shared Object) to inject shellcode that would execute in the context of any process on the host.

systemd-journald Exploits (CVE-2018-16865 and CVE-2018-16866)

Vulnerabilities in systemd-journald that, chained together, allowed a local attacker to obtain a root shell. Since journald runs as root and accepts log messages from containers, this created a path from containerized process to host root access through the logging infrastructure.

These bugs highlighted the risk of host services that accept input from containers. Any host daemon that processes container-generated data is a potential escape vector.

Patterns Across Container Escape CVEs

Several recurring patterns emerge:

Shared kernel, shared fate. CVE-2017-5123 and Dirty COW exploited kernel bugs that no amount of namespace isolation can defend against. This is the fundamental architectural limitation of containers versus virtual machines.
File descriptor and /proc leaks. CVE-2019-5736 and CVE-2024-21626 both exploited how runc handles file descriptors and /proc entries during container setup.
Host services extend the attack surface. CVE-2020-15257 and the systemd-journald exploits show that any host service that accepts container input is a potential escape path.
Client tools matter too. CVE-2019-1002101 weaponized kubectl to compromise operator workstations.

Modern Defenses Against Container Escapes

Seccomp Profiles

Seccomp restricts which system calls a containerized process can make. Docker's default profile blocks approximately 44 of the 300+ available system calls. Custom profiles tailored to your application's actual system call usage offer stronger protection.
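As a rough sketch of what a custom profile looks like, the Python below builds a deny-by-default profile in the JSON layout Docker accepts. The observed syscall list is hypothetical and far too small for a real workload (a real container needs many more calls just to start), and production profiles typically also declare architectures.

```python
import json


def build_allowlist_profile(allowed_syscalls) -> dict:
    """Build a deny-by-default seccomp profile that allows only the listed syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",  # anything not listed fails with an error
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Hypothetical list, e.g. gathered by tracing the application under test.
observed = ["read", "write", "openat", "close", "exit_group", "read"]
profile = build_allowlist_profile(observed)
print(json.dumps(profile, indent=2))
```

Saved to a file, a profile like this can be passed to Docker with --security-opt seccomp=profile.json.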

AppArmor and SELinux

Mandatory Access Control (MAC) systems add restrictions beyond standard Linux permissions. SELinux in enforcing mode mitigated CVE-2019-5736 by blocking writes to the host's runc binary. AppArmor provides path-based controls.

Rootless Containers and User Namespaces

Many container escape exploits require UID 0 inside the container. Rootless containers address this by running the entire container runtime as an unprivileged user, using user namespaces to remap UID 0 inside the container to an unprivileged UID on the host.

With rootless mode, even a successful escape lands the attacker on the host as an unprivileged user. Docker supports rootless mode natively (since 20.10), Podman runs rootless by default, and Kubernetes user namespaces for pods reached beta in version 1.30.
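The remapping is visible in /proc/[pid]/uid_map, where each line holds three numbers: the first UID inside the namespace, the first UID outside it, and the length of the range. A small sketch of the lookup follows; the example ranges are made up and map_to_host_uid is a hypothetical helper.

```python
def map_to_host_uid(uid_map: str, container_uid: int):
    """Translate a UID inside a user namespace to the host UID, or None if unmapped."""
    for line in uid_map.strip().splitlines():
        inside, outside, length = (int(field) for field in line.split())
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


remapped = "0 100000 65536"    # example: container UIDs 0-65535 -> host 100000-165535
identity = "0 0 4294967295"    # example: no remapping, container root is host root

print(map_to_host_uid(remapped, 0))
print(map_to_host_uid(identity, 0))
```

With the first mapping, a process that escapes as container UID 0 is an ordinary unprivileged user on the host; with the second, it is host root.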

Read-Only Root Filesystems

Running containers with read-only root filesystems (readOnlyRootFilesystem: true) prevents a compromised container from modifying its own filesystem. That blocks the first step of exploits like CVE-2019-5736, which depended on replacing a binary inside the container.

Runtime Security: Falco and Tetragon

Falco, a CNCF graduated project, monitors system calls and container events against a rule engine. Tetragon, from the Cilium project, uses eBPF to enforce security policies directly in the kernel; its overhead is often cited as under 1%, though the actual cost depends on the policies enabled and the workload.

Pod Security Standards

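Kubernetes Pod Security Standards define three profiles -- Privileged, Baseline, and Restricted. The Restricted profile enforces non-root execution, requires dropping all capabilities (only NET_BIND_SERVICE may be added back), disables privilege escalation, and requires a seccomp profile of RuntimeDefault or Localhost. It does not require a read-only root filesystem, so set readOnlyRootFilesystem separately.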

Image Scanning and Supply Chain Security

Image scanning tools (Trivy, Grype, Snyk Container) detect known vulnerable packages, image signing with Sigstore/cosign provides provenance verification, and admission controllers can enforce that only signed, scanned images are deployed.

Container Escape Prevention Checklist

Runtime Configuration

Run containers as non-root. Set runAsNonRoot: true and specify a runAsUser.
Drop all capabilities, add only what is needed. Use drop: ["ALL"].
Disable privilege escalation. Set allowPrivilegeEscalation: false.
Use read-only root filesystems. Set readOnlyRootFilesystem: true.
Avoid host namespaces. Do not use hostNetwork, hostPID, or hostIPC.
Never run privileged containers in production.
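The runtime checklist above can be turned into a small audit function. The sketch below works on a pod spec already parsed into a plain dict (for example, from YAML) and reports which items are missing. It is a minimal illustration with a hypothetical function name, and it ignores initContainers and ephemeral containers.

```python
def audit_pod_spec(spec: dict) -> list:
    """Return a list of findings for a pod spec given as a plain dict."""
    findings = []
    for flag in ("hostNetwork", "hostPID", "hostIPC"):
        if spec.get(flag):
            findings.append(f"pod: {flag} is enabled")

    pod_sc = spec.get("securityContext") or {}
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext") or {}

        # A container-level runAsNonRoot overrides the pod-level value when set.
        non_root = sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot"))
        if non_root is not True:
            findings.append(f"{name}: runAsNonRoot is not true")
        if sc.get("privileged"):
            findings.append(f"{name}: privileged is set")
        if sc.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation is not false")
        if sc.get("readOnlyRootFilesystem") is not True:
            findings.append(f"{name}: readOnlyRootFilesystem is not true")
        drops = (sc.get("capabilities") or {}).get("drop") or []
        if "ALL" not in drops:
            findings.append(f"{name}: capabilities do not drop ALL")
    return findings


hardened = {
    "securityContext": {"runAsNonRoot": True, "runAsUser": 10001},
    "containers": [{
        "name": "app",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "readOnlyRootFilesystem": True,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}
risky = {
    "hostNetwork": True,
    "containers": [{"name": "app", "securityContext": {"privileged": True}}],
}

print(audit_pod_spec(hardened))
for finding in audit_pod_spec(risky):
    print(finding)
```

In a cluster, the same rules are better enforced by Pod Security Admission or an admission controller than by an ad hoc script.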

Infrastructure and Patching

Keep the host kernel updated. Kernel vulnerabilities bypass all container isolation.
Patch container runtimes promptly. runc, containerd, and CRI-O vulnerabilities are direct escape vectors.
Update client tools. kubectl and other client-side tools are part of the attack surface.
Enable user namespaces. Ensure UID 0 inside containers maps to an unprivileged host UID.

Detection and Monitoring

Deploy runtime security tooling. Use Falco, Tetragon, or similar tools.
Apply seccomp profiles. Start with defaults and customize based on your application.
Enable audit logging. Kubernetes audit logs, container runtime logs, and host-level audit logs provide forensic trails.

Supply Chain

Scan images for known CVEs. Run vulnerability scanners in your CI/CD pipeline.
Use minimal base images. Smaller images have fewer potential vulnerabilities.
Sign and verify images. Use cosign/Sigstore for image signing.
Pin image digests. Reference images by digest rather than mutable tags.
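Checking for digest pinning is a one-line pattern match. The sketch below assumes sha256 digests, which is what registries commonly use; the function name and the example references are hypothetical.

```python
import re

# An image reference pinned by digest ends in @sha256:<64 lowercase hex characters>.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref: str) -> bool:
    """Return True if the image reference is pinned to a sha256 digest."""
    return bool(DIGEST_RE.search(image_ref))


example_digest = "sha256:" + "a" * 64  # placeholder digest for illustration
for ref in ("nginx:1.27", "nginx:latest", f"registry.example.com/app@{example_digest}"):
    print(ref, is_pinned_by_digest(ref))
```

A tag can be repointed to a different image at any time, while a digest identifies exactly one image's content.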

The Future of Container Isolation

Sandbox runtimes like gVisor and Kata Containers add stronger isolation boundaries. eBPF-based security enforcement is maturing rapidly. Confidential computing (AMD SEV, Intel TDX) is bringing hardware-level isolation to container workloads by running them in virtual machines whose memory is encrypted by the CPU.

For most teams today, defense in depth -- rootless containers, seccomp profiles, MAC policies, runtime security tools, and diligent patching -- provides strong protection. No single mechanism is a silver bullet, but the combination makes exploitation significantly harder and detection significantly faster.

Container escapes are not theoretical. They have been discovered repeatedly in the most critical infrastructure components, from the Linux kernel to runc to containerd to kubectl. The organizations that avoid becoming case studies are the ones that treat these vulnerabilities as inevitable, and build their defenses accordingly.
