NTCTech

Posted on • Originally published at rack2cloud.com
containerd in Production: 5 Day-2 Failure Patterns at High Pod Density

This post originally appeared on Rack2Cloud. All patterns were observed across production Kubernetes environments running 400–1,000 containers per node.


*Figure: the containerd runtime stack, showing kubelet → gRPC → containerd → per-container shim accumulation at high pod density.*

Your containerd metrics look healthy.

Pod density is climbing. Node CPU is stable. Memory pressure is low.

Then somewhere around 800–900 containers per node, something quiet happens: containerd-shim processes begin accumulating memory. 4 GB. 6 GB. Eventually the Linux OOM killer steps in and starts terminating containers that Kubernetes never asked it to kill.

Your dashboards still say the node is healthy. Your workloads disagree.

This is the Day-2 reality of containerd in production at scale. Not a configuration error. Not a software bug. A set of predictable failure patterns that appear after your cluster reaches the density thresholds that most documentation never discusses.


The containerd Runtime Stack: What Actually Runs Your Containers

Before diagnosing failures, it's worth being precise about the execution chain. When Kubernetes schedules a pod:

```
Kubelet → gRPC → containerd → containerd-shim (one per container) → runc → container process
```

The shim is the architectural detail most Day-2 problems trace back to. Unlike containerd itself — a single long-running daemon — the shim is a per-container process that stays alive for the entire container lifecycle. It handles PID tracking, stdout/stderr relay, exit code capture, and signal forwarding.

At 10 containers, this is invisible. At 800 containers, you are running 800 shim processes. That changes the failure math entirely.


Failure Pattern #1: The containerd Shim Tax at High Pod Density

Trigger: Node density exceeds ~500 containers per node

Failure signature: OOM kills on containers that appear resource-compliant

Each containerd-shim process consumes approximately 10–15 MB of resident memory — static, regardless of what the container itself is doing:

*Figure: containerd-shim memory accumulation, scaling from ~1.2 GB at 100 containers to over 10 GB at 800 containers per node.*

| Container Count | Shim Overhead (est.) | Node Headroom Lost |
| --- | --- | --- |
| 100 containers | ~1.2 GB | Minor |
| 300 containers | ~3.5 GB | Moderate |
| 500 containers | ~6.0 GB | Significant |
| 800 containers | ~10.0 GB | Critical |
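The table is straight multiplication, so it's easy to sketch an estimator for your own nodes. The 12 MB default below is an assumed midpoint of the 10–15 MB range; measure real nodes with the `ps`/`awk` diagnostic before relying on it:

```shell
#!/usr/bin/env bash
# Back-of-envelope shim overhead: container count times per-shim RSS.
# PER_SHIM_MB defaults to 12, an assumed midpoint of the 10-15 MB range.
shim_overhead_mb() {
  local count="$1" per_shim_mb="${2:-12}"
  echo $(( count * per_shim_mb ))
}

shim_overhead_mb 800   # prints 9600, i.e. ~9.6 GB lost to shims alone
```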

Why dashboards miss it: Kubernetes resource accounting tracks container cgroup limits, not shim process overhead. A node can be technically under its container memory budget while the shim layer consumes enough unreserved memory to trigger OOM events.

Diagnostic:

```bash
# Count shim processes and total memory consumption
# (the [c] in the pattern keeps grep itself out of the match)
ps aux | grep '[c]ontainerd-shim' | awk '{sum += $6} END {print "Total shim RSS: " sum/1024 " MB, Count: " NR}'

# Check OOM kill history
dmesg | grep -i "oom\|killed process" | tail -20
```

Mitigation: Reserve explicit non-container memory headroom using --system-reserved and --kube-reserved kubelet flags. A practical floor for high-density nodes is 2–4 GB above documented container limits. Evaluate crun as a replacement for runc — it reduces per-shim overhead by approximately 30–40% in high-density configurations.
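As a sketch, the same reservations can be set in the kubelet configuration file rather than flags. The values below are illustrative, not a recommendation; size them against your measured shim overhead:

```yaml
# /var/lib/kubelet/config.yaml (kubeadm default path)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: "3Gi"   # example floor: shim overhead + OS daemons on a high-density node
  cpu: "500m"
kubeReserved:
  memory: "1Gi"   # example floor: kubelet + container runtime daemons
  cpu: "500m"
```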


Failure Pattern #2: Mixed cgroup v1/v2 Nodes Corrupt Resource Accounting

Trigger: Cluster nodes running mixed cgroup versions during OS or kernel upgrades

Failure signature: Pods report under CPU limits but exhibit throttling; containers OOM unexpectedly despite memory limits appearing enforced

*Figure: cgroup v1 flat hierarchy versus cgroup v2 unified hierarchy, and the resource-accounting mismatch in mixed-node Kubernetes environments.*

| Dimension | cgroup v1 | cgroup v2 |
| --- | --- | --- |
| Hierarchy | Per-subsystem, flat | Unified single hierarchy |
| CPU accounting | `cpu` and `cpuacct` separate | Combined in unified tree |
| Memory accounting | `memory` subsystem | Different semantics |
| Kubernetes support | Full | Requires kernel ≥ 5.8, containerd ≥ 1.4 |

In a mixed cluster, the Kubernetes scheduler has no visibility into cgroup version per node. Pods scheduled to cgroup v2 nodes running containerd configured for v1 semantics will have their resource limits enforced through the wrong accounting model.

Diagnostic:

```bash
# Check cgroup version on a node (cgroup2fs = v2, tmpfs = v1 mount layout)
stat -fc %T /sys/fs/cgroup/

# Check containerd's cgroup driver configuration
containerd config dump | grep -i cgroup

# Verify the kubelet cgroup driver matches (kubeadm default config path)
grep cgroupDriver /var/lib/kubelet/config.yaml
```

If stat returns cgroup2fs but containerd is configured with SystemdCgroup = false, resource accounting is broken.

Mitigation: Enforce explicit cgroup version consistency across all node pools. Use node taints during migrations to prevent mixed-version scheduling. Add cgroup version verification to your node provisioning pipeline as a gate — not a check.
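A minimal version of that provisioning gate might look like the following sketch. The `EXPECTED_CGROUP` variable is an assumption about how your pipeline passes in the pool's expected version:

```shell
#!/usr/bin/env bash
# Map the filesystem type reported by `stat -fc %T /sys/fs/cgroup/` to a
# cgroup version: cgroup2fs means v2, tmpfs means the v1 mount layout.
cgroup_version() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# In the provisioning pipeline (EXPECTED_CGROUP is an assumed variable):
#   actual="$(cgroup_version "$(stat -fc %T /sys/fs/cgroup/)")"
#   [ "$actual" = "${EXPECTED_CGROUP:-v2}" ] || { echo "GATE FAILED" >&2; exit 1; }

cgroup_version cgroup2fs   # prints v2
```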


Failure Pattern #3: OverlayFS Layer Debt in Long-Running Nodes

Trigger: Node uptime exceeding 60–90 days without image cleanup

Failure signature: Image pull times degrade from seconds to minutes; df -i shows inode exhaustion

*Figure: OverlayFS layer accumulation on a Kubernetes node over 90 days of uptime, leading to image pull degradation and inode exhaustion.*

OverlayFS layer debt accumulates through a predictable sequence: image pulls create new layer directories, image updates add layers without removing old ones, deleted containers leave dangling snapshot references, and node uptime compounds all of the above.

The critical failure mode: on nodes with high container churn running many distinct images, inode exhaustion occurs before disk capacity exhaustion — appearing as "No space left on device" errors with gigabytes of available disk space.

Diagnostic:

```bash
# Check snapshot count and disk usage
du -sh /var/lib/containerd/
find /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/ -maxdepth 1 -type d | wc -l

# Check inode utilization
df -i /var/lib/containerd/

# Snapshot count (Kubernetes-managed resources live in the k8s.io namespace)
ctr -n k8s.io snapshots ls | wc -l
```

Threshold: When snapshot count exceeds ~2,000–3,000, pull performance begins degrading measurably. When inode utilization exceeds 85%, image pulls start failing intermittently.
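Those thresholds are easy to encode as a node health check. A sketch, using the 3,000-snapshot and 85% inode cut-offs quoted above and written as a pure function so it can be fed live values from `ctr -n k8s.io snapshots ls | wc -l` and `df -i /var/lib/containerd/`:

```shell
#!/usr/bin/env bash
# Evaluate the two OverlayFS layer-debt thresholds from this section.
overlayfs_health() {
  local snapshots="$1" inode_pct="$2"
  if [ "$inode_pct" -ge 85 ]; then
    echo "critical: inode utilization ${inode_pct}%"
  elif [ "$snapshots" -ge 3000 ]; then
    echo "warning: ${snapshots} snapshots, expect pull degradation"
  else
    echo "ok"
  fi
}

overlayfs_health 3500 60   # prints the snapshot warning
overlayfs_health 1200 90   # prints the inode critical alert
```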

Mitigation: Implement node-level image cleanup as a DaemonSet running crictl rmi --prune on a scheduled interval. Consider node rotation policies that rebuild nodes every 60–90 days. For high-churn clusters, configure containerd's GC parameters explicitly:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"

# GC scheduler tuning (containerd's io.containerd.gc.v1.scheduler plugin;
# values shown are the defaults — lower pause_threshold for more aggressive GC)
[plugins."io.containerd.gc.v1.scheduler"]
  pause_threshold = 0.02
  deletion_threshold = 0
  mutation_threshold = 100
  schedule_delay = "0ms"
  startup_delay = "100ms"
```

Failure Pattern #4: containerd Snapshot Garbage Collection Drift

Trigger: Long-running clusters with high container churn and no explicit GC tuning

Failure signature: Gradual degradation in image pull latency; storage consumption growing despite containers being deleted

| Mechanism | Description | Detection |
| --- | --- | --- |
| Dangling snapshots | No active container reference; GC skips due to reference-counting errors | `ctr snapshots ls` vs `ctr containers ls` delta |
| Lease accumulation | containerd leases never explicitly released | `ctr leases ls` count growing unbounded |
| Content store orphans | Image layers referenced by snapshots but not by active manifests | `ctr content ls` size vs expected |

Diagnostic:

```bash
# Force a GC run
ctr -n k8s.io content gc --verbose 2>&1 | tail -30

# Check lease accumulation
ctr -n k8s.io leases ls | wc -l

# Snapshot-to-container ratio
echo "Snapshots: $(ctr -n k8s.io snapshots ls | wc -l)"
echo "Containers: $(ctr -n k8s.io containers ls | wc -l)"
```

A snapshot count more than 3–5× the container count warrants investigation.
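That ratio check is scriptable. A sketch using 3× as the conservative end of the range, fed by the two `ctr -n k8s.io ... | wc -l` counts above:

```shell
#!/usr/bin/env bash
# Flag GC drift when snapshots exceed `factor` times the container count.
snapshot_drift() {
  local snapshots="$1" containers="$2" factor="${3:-3}"
  if [ "$containers" -eq 0 ]; then
    echo "no containers; inspect manually"
  elif [ "$snapshots" -gt $(( containers * factor )) ]; then
    echo "drift: ${snapshots} snapshots for ${containers} containers"
  else
    echo "ok"
  fi
}

snapshot_drift 1800 200   # 9x ratio, prints the drift warning
```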

Mitigation: Configure explicit GC thresholds and implement periodic forced GC via a privileged DaemonSet. For persistent drift, evaluate moving to a separate containerd data root on a dedicated volume.


Failure Pattern #5: containerd gRPC Socket Saturation Under Rapid Pod Cycling

Trigger: High pod churn — CI/CD clusters, ephemeral job runners, GitHub Actions nodes

Failure signature: Pod scheduling delays that appear as node pressure with no CPU or memory constraint; kubectl describe node shows KubeletHasTooManyPods without approaching the pod limit

*Figure: containerd gRPC socket saturation under rapid pod cycling load in CI/CD environments, with cascading scheduler delays.*

The containerd gRPC socket (/run/containerd/containerd.sock) processes requests serially per operation type. In environments with rapid pod cycling, it becomes a serialization bottleneck:

  1. CI/CD pipeline triggers 50+ pod creates in rapid succession
  2. Each create requires: PullImage → CreateContainer → StartContainer gRPC calls
  3. containerd processes these serially
  4. Kubelet's CRI client begins queuing requests
  5. Pod startup latency increases from seconds to minutes
  6. Kubelet reports node pressure before any actual resource constraint exists

kubectl top nodes shows the node as underutilized. The scheduler sees pressure that doesn't match resource metrics.

Diagnostic:

```bash
# Measure pod startup latency
kubectl get events --field-selector reason=Started --sort-by='.lastTimestamp' | tail -20

# Check containerd task metrics (if the metrics endpoint is enabled in config.toml)
curl -s http://localhost:1338/v1/metrics | grep containerd_task

# Monitor unix socket utilization
ss -x | grep containerd
```

Mitigation: Isolate batch/CI workloads to dedicated node pools. Tune --max-pods conservatively for high-churn nodes — the default of 110 assumes steady-state workloads, not rapid cycling. For GitHub Actions runners, the actions-runner-controller project includes node pool isolation patterns that address this directly.
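The isolation itself is a taint on the high-churn pool plus a matching toleration on the batch pods. A minimal sketch; the `workload=ci` key and the image are hypothetical placeholders for this example:

```yaml
# Taint the high-churn node pool first, e.g.:
#   kubectl taint nodes <node> workload=ci:NoSchedule

apiVersion: v1
kind: Pod
metadata:
  name: ci-runner
spec:
  nodeSelector:
    workload: ci          # assumes the pool is also labeled workload=ci
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "ci"
      effect: "NoSchedule"
  containers:
    - name: runner
      image: ghcr.io/example/ci-runner:latest   # placeholder image
```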


Diagnostic Summary: When to Expect These Failures

| Failure Pattern | Primary Trigger | Secondary Signal |
| --- | --- | --- |
| Shim Tax | >500 containers per node | OOM kills on compliant containers |
| cgroup v1/v2 Drift | Mixed-version node pool | CPU throttling without limit violation |
| OverlayFS Layer Debt | Node uptime >60–90 days | Image pull time degradation |
| Snapshot GC Drift | High churn + no GC tuning | Storage growth despite container deletion |
| Socket Saturation | >100 pod creates/hour | Scheduling delays without resource pressure |

These patterns are predictable. The diagnostics are repeatable. The mitigations are operational decisions — not emergency responses.

