NTCTech

Posted on • Originally published at rack2cloud.com
containerd in Production: 5 Day-2 Failure Patterns at High Pod Density

This post originally appeared on Rack2Cloud. All patterns were observed across production Kubernetes environments running 400–1,000 containers per node.


*Figure: the containerd runtime stack, showing kubelet → gRPC → containerd → per-container shim accumulation at high pod density.*

Your containerd metrics look healthy.

Pod density is climbing. Node CPU is stable. Memory pressure is low.

Then somewhere around 800–900 containers per node, something quiet happens: containerd-shim processes begin accumulating memory. 4 GB. 6 GB. Eventually the Linux OOM killer steps in and starts terminating containers that Kubernetes never asked it to kill.

Your dashboards still say the node is healthy. Your workloads disagree.

This is the Day-2 reality of containerd in production at scale. Not a configuration error. Not a software bug. A set of predictable failure patterns that appear after your cluster reaches the density thresholds that most documentation never discusses.


The containerd Runtime Stack: What Actually Runs Your Containers

Before diagnosing failures, it's worth being precise about the execution chain. When Kubernetes schedules a pod:

```
Kubelet → gRPC → containerd → containerd-shim (one per container) → runc → container process
```

The shim is the architectural detail most Day-2 problems trace back to. Unlike containerd itself — a single long-running daemon — the shim is a per-container process that stays alive for the entire container lifecycle. It handles PID tracking, stdout/stderr relay, exit code capture, and signal forwarding.

At 10 containers, this is invisible. At 800 containers, you are running 800 shim processes. That changes the failure math entirely.


Failure Pattern #1: The containerd Shim Tax at High Pod Density

Trigger: Node density exceeds ~500 containers per node

Failure signature: OOM kills on containers that appear resource-compliant

Each containerd-shim process consumes approximately 10–15 MB of resident memory — static, regardless of what the container itself is doing:

*Figure: containerd-shim memory accumulation, scaling from ~1.2 GB at 100 containers to over 10 GB at 800 containers per node.*

| Container Count | Shim Overhead (est.) | Node Headroom Lost |
| --- | --- | --- |
| 100 containers | ~1.2 GB | Minor |
| 300 containers | ~3.5 GB | Moderate |
| 500 containers | ~6.0 GB | Significant |
| 800 containers | ~10.0 GB | Critical |
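The table is straight multiplication, so it's easy to sketch an estimator for your own nodes. The 12 MB default below is an assumed midpoint of the 10–15 MB range; measure real nodes with the `ps`/`awk` diagnostic before relying on it:

```shell
#!/usr/bin/env bash
# Back-of-envelope shim overhead: container count times per-shim RSS.
# PER_SHIM_MB defaults to 12, an assumed midpoint of the 10-15 MB range.
shim_overhead_mb() {
  local count="$1" per_shim_mb="${2:-12}"
  echo $(( count * per_shim_mb ))
}

shim_overhead_mb 800   # prints 9600, i.e. ~9.6 GB lost to shims alone
```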

Why dashboards miss it: Kubernetes resource accounting tracks container cgroup limits, not shim process overhead. A node can be technically under its container memory budget while the shim layer consumes enough unreserved memory to trigger OOM events.

Diagnostic:

```bash
# Count shim processes and total memory consumption
# (the [c] in the pattern keeps grep itself out of the match)
ps aux | grep '[c]ontainerd-shim' | awk '{sum += $6} END {print "Total shim RSS: " sum/1024 " MB, Count: " NR}'

# Check OOM kill history
dmesg | grep -i "oom\|killed process" | tail -20
```

Mitigation: Reserve explicit non-container memory headroom using --system-reserved and --kube-reserved kubelet flags. A practical floor for high-density nodes is 2–4 GB above documented container limits. Evaluate crun as a replacement for runc — it reduces per-shim overhead by approximately 30–40% in high-density configurations.
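As a sketch, the same reservations can be set in the kubelet configuration file rather than flags. The values below are illustrative, not a recommendation; size them against your measured shim overhead:

```yaml
# /var/lib/kubelet/config.yaml (kubeadm default path)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: "3Gi"   # example floor: shim overhead + OS daemons on a high-density node
  cpu: "500m"
kubeReserved:
  memory: "1Gi"   # example floor: kubelet + container runtime daemons
  cpu: "500m"
```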


Failure Pattern #2: Mixed cgroup v1/v2 Nodes Corrupt Resource Accounting

Trigger: Cluster nodes running mixed cgroup versions during OS or kernel upgrades

Failure signature: Pods report under CPU limits but exhibit throttling; containers OOM unexpectedly despite memory limits appearing enforced

*Figure: cgroup v1 flat hierarchy versus cgroup v2 unified hierarchy, and the resource-accounting mismatch in mixed-node Kubernetes environments.*

| Dimension | cgroup v1 | cgroup v2 |
| --- | --- | --- |
| Hierarchy | Per-subsystem, flat | Unified single hierarchy |
| CPU accounting | `cpu` and `cpuacct` separate | Combined in unified tree |
| Memory accounting | `memory` subsystem | Different semantics |
| Kubernetes support | Full | Requires kernel ≥ 5.8, containerd ≥ 1.4 |

In a mixed cluster, the Kubernetes scheduler has no visibility into cgroup version per node. Pods scheduled to cgroup v2 nodes running containerd configured for v1 semantics will have their resource limits enforced through the wrong accounting model.

Diagnostic:

```bash
# Check cgroup version on a node (cgroup2fs = v2, tmpfs = v1 mount layout)
stat -fc %T /sys/fs/cgroup/

# Check containerd's cgroup driver configuration
containerd config dump | grep -i cgroup

# Verify the kubelet cgroup driver matches (kubeadm default config path)
grep cgroupDriver /var/lib/kubelet/config.yaml
```

If stat returns cgroup2fs but containerd is configured with SystemdCgroup = false, resource accounting is broken.

Mitigation: Enforce explicit cgroup version consistency across all node pools. Use node taints during migrations to prevent mixed-version scheduling. Add cgroup version verification to your node provisioning pipeline as a gate — not a check.
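A minimal version of that provisioning gate might look like the following sketch. The `EXPECTED_CGROUP` variable is an assumption about how your pipeline passes in the pool's expected version:

```shell
#!/usr/bin/env bash
# Map the filesystem type reported by `stat -fc %T /sys/fs/cgroup/` to a
# cgroup version: cgroup2fs means v2, tmpfs means the v1 mount layout.
cgroup_version() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# In the provisioning pipeline (EXPECTED_CGROUP is an assumed variable):
#   actual="$(cgroup_version "$(stat -fc %T /sys/fs/cgroup/)")"
#   [ "$actual" = "${EXPECTED_CGROUP:-v2}" ] || { echo "GATE FAILED" >&2; exit 1; }

cgroup_version cgroup2fs   # prints v2
```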


Failure Pattern #3: OverlayFS Layer Debt in Long-Running Nodes

Trigger: Node uptime exceeding 60–90 days without image cleanup

Failure signature: Image pull times degrade from seconds to minutes; df -i shows inode exhaustion

*Figure: OverlayFS layer accumulation on a Kubernetes node over 90 days of uptime, leading to image pull degradation and inode exhaustion.*

OverlayFS layer debt accumulates through a predictable sequence: image pulls create new layer directories, image updates add layers without removing old ones, deleted containers leave dangling snapshot references, and node uptime compounds all of the above.

The critical failure mode: on nodes with high container churn running many distinct images, inode exhaustion occurs before disk capacity exhaustion — appearing as "No space left on device" errors with gigabytes of available disk space.

Diagnostic:

```bash
# Check snapshot count and disk usage
du -sh /var/lib/containerd/
find /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/ -maxdepth 1 -type d | wc -l

# Check inode utilization
df -i /var/lib/containerd/

# Snapshot count (Kubernetes-managed resources live in the k8s.io namespace)
ctr -n k8s.io snapshots ls | wc -l
```

Threshold: When snapshot count exceeds ~2,000–3,000, pull performance begins degrading measurably. When inode utilization exceeds 85%, image pulls start failing intermittently.
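Those thresholds are easy to encode as a node health check. A sketch, using the 3,000-snapshot and 85% inode cut-offs quoted above and written as a pure function so it can be fed live values from `ctr -n k8s.io snapshots ls | wc -l` and `df -i /var/lib/containerd/`:

```shell
#!/usr/bin/env bash
# Evaluate the two OverlayFS layer-debt thresholds from this section.
overlayfs_health() {
  local snapshots="$1" inode_pct="$2"
  if [ "$inode_pct" -ge 85 ]; then
    echo "critical: inode utilization ${inode_pct}%"
  elif [ "$snapshots" -ge 3000 ]; then
    echo "warning: ${snapshots} snapshots, expect pull degradation"
  else
    echo "ok"
  fi
}

overlayfs_health 3500 60   # prints the snapshot warning
overlayfs_health 1200 90   # prints the inode critical alert
```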

Mitigation: Implement node-level image cleanup as a DaemonSet running crictl rmi --prune on a scheduled interval. Consider node rotation policies that rebuild nodes every 60–90 days. For high-churn clusters, configure containerd's GC parameters explicitly:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"

# GC scheduler tuning (containerd's io.containerd.gc.v1.scheduler plugin;
# values shown are the defaults — lower pause_threshold for more aggressive GC)
[plugins."io.containerd.gc.v1.scheduler"]
  pause_threshold = 0.02
  deletion_threshold = 0
  mutation_threshold = 100
  schedule_delay = "0ms"
  startup_delay = "100ms"
```

Failure Pattern #4: containerd Snapshot Garbage Collection Drift

Trigger: Long-running clusters with high container churn and no explicit GC tuning

Failure signature: Gradual degradation in image pull latency; storage consumption growing despite containers being deleted

| Mechanism | Description | Detection |
| --- | --- | --- |
| Dangling snapshots | No active container reference; GC skips due to reference-counting errors | `ctr snapshots ls` vs `ctr containers ls` delta |
| Lease accumulation | containerd leases never explicitly released | `ctr leases ls` count growing unbounded |
| Content store orphans | Image layers referenced by snapshots but not by active manifests | `ctr content ls` size vs expected |

Diagnostic:

```bash
# Force a GC run
ctr -n k8s.io content gc --verbose 2>&1 | tail -30

# Check lease accumulation
ctr -n k8s.io leases ls | wc -l

# Snapshot-to-container ratio
echo "Snapshots: $(ctr -n k8s.io snapshots ls | wc -l)"
echo "Containers: $(ctr -n k8s.io containers ls | wc -l)"
```

A snapshot count more than 3–5× the container count warrants investigation.
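That ratio check is scriptable. A sketch using 3× as the conservative end of the range, fed by the two `ctr -n k8s.io ... | wc -l` counts above:

```shell
#!/usr/bin/env bash
# Flag GC drift when snapshots exceed `factor` times the container count.
snapshot_drift() {
  local snapshots="$1" containers="$2" factor="${3:-3}"
  if [ "$containers" -eq 0 ]; then
    echo "no containers; inspect manually"
  elif [ "$snapshots" -gt $(( containers * factor )) ]; then
    echo "drift: ${snapshots} snapshots for ${containers} containers"
  else
    echo "ok"
  fi
}

snapshot_drift 1800 200   # 9x ratio, prints the drift warning
```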

Mitigation: Configure explicit GC thresholds and implement periodic forced GC via a privileged DaemonSet. For persistent drift, evaluate moving to a separate containerd data root on a dedicated volume.


Failure Pattern #5: containerd gRPC Socket Saturation Under Rapid Pod Cycling

Trigger: High pod churn — CI/CD clusters, ephemeral job runners, GitHub Actions nodes

Failure signature: Pod scheduling delays that appear as node pressure with no CPU or memory constraint; kubectl describe node shows KubeletHasTooManyPods without approaching the pod limit

*Figure: containerd gRPC socket saturation under rapid pod cycling load in CI/CD environments, with cascading scheduler delays.*

The containerd gRPC socket (/run/containerd/containerd.sock) processes requests serially per operation type. In environments with rapid pod cycling, it becomes a serialization bottleneck:

  1. CI/CD pipeline triggers 50+ pod creates in rapid succession
  2. Each create requires: PullImage → CreateContainer → StartContainer gRPC calls
  3. containerd processes these serially
  4. Kubelet's CRI client begins queuing requests
  5. Pod startup latency increases from seconds to minutes
  6. Kubelet reports node pressure before any actual resource constraint exists

kubectl top nodes shows the node as underutilized. The scheduler sees pressure that doesn't match resource metrics.

Diagnostic:

```bash
# Measure pod startup latency
kubectl get events --field-selector reason=Started --sort-by='.lastTimestamp' | tail -20

# Check containerd task metrics (if the metrics endpoint is enabled in config.toml)
curl -s http://localhost:1338/v1/metrics | grep containerd_task

# Monitor unix socket utilization
ss -x | grep containerd
```

Mitigation: Isolate batch/CI workloads to dedicated node pools. Tune --max-pods conservatively for high-churn nodes — the default of 110 assumes steady-state workloads, not rapid cycling. For GitHub Actions runners, the actions-runner-controller project includes node pool isolation patterns that address this directly.
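The isolation itself is a taint on the high-churn pool plus a matching toleration on the batch pods. A minimal sketch; the `workload=ci` key and the image are hypothetical placeholders for this example:

```yaml
# Taint the high-churn node pool first, e.g.:
#   kubectl taint nodes <node> workload=ci:NoSchedule

apiVersion: v1
kind: Pod
metadata:
  name: ci-runner
spec:
  nodeSelector:
    workload: ci          # assumes the pool is also labeled workload=ci
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "ci"
      effect: "NoSchedule"
  containers:
    - name: runner
      image: ghcr.io/example/ci-runner:latest   # placeholder image
```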


Diagnostic Summary: When to Expect These Failures

| Failure Pattern | Primary Trigger | Secondary Signal |
| --- | --- | --- |
| Shim Tax | >500 containers per node | OOM kills on compliant containers |
| cgroup v1/v2 Drift | Mixed-version node pool | CPU throttling without limit violation |
| OverlayFS Layer Debt | Node uptime >60–90 days | Image pull time degradation |
| Snapshot GC Drift | High churn + no GC tuning | Storage growth despite container deletion |
| Socket Saturation | >100 pod creates/hour | Scheduling delays without resource pressure |

These patterns are predictable. The diagnostics are repeatable. The mitigations are operational decisions — not emergency responses.

