This post originally appeared on Rack2Cloud. All patterns were observed across production Kubernetes environments running 400–1,000 containers per node.
Your containerd metrics look healthy.
Pod density is climbing. Node CPU is stable. Memory pressure is low.
Then somewhere around 800–900 containers per node, something quiet happens: containerd-shim processes begin accumulating memory. 4 GB. 6 GB. Eventually the Linux OOM killer steps in and starts terminating containers that Kubernetes never asked it to kill.
Your dashboards still say the node is healthy. Your workloads disagree.
This is the Day-2 reality of containerd in production at scale. Not a configuration error. Not a software bug. A set of predictable failure patterns that appear after your cluster reaches the density thresholds that most documentation never discusses.
The containerd Runtime Stack: What Actually Runs Your Containers
Before diagnosing failures, the execution chain needs to be precise. When Kubernetes schedules a pod:
Kubelet → gRPC → containerd → spawns → containerd-shim (per container) → runc → container process
The shim is the architectural detail most Day-2 problems trace back to. Unlike containerd itself — a single long-running daemon — the shim is a per-container process that stays alive for the entire container lifecycle. It handles PID tracking, stdout/stderr relay, exit code capture, and signal forwarding.
At 10 containers, this is invisible. At 800 containers, you are running 800 shim processes. That changes the failure math entirely.
Failure Pattern #1: The containerd Shim Tax at High Pod Density
Trigger: Node density exceeds ~500 containers per node
Failure signature: OOM kills on containers that appear resource-compliant
Each containerd-shim process consumes approximately 10–15 MB of resident memory — static, regardless of what the container itself is doing:
| Container Count | Shim Overhead (est.) | Available Node Headroom Lost |
|---|---|---|
| 100 containers | ~1.2 GB | Minor |
| 300 containers | ~3.5 GB | Moderate |
| 500 containers | ~6.0 GB | Significant |
| 800 containers | ~10.0 GB | Critical |
Why dashboards miss it: Kubernetes resource accounting tracks container cgroup limits, not shim process overhead. A node can be technically under its container memory budget while the shim layer consumes enough unreserved memory to trigger OOM events.
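The headroom figures in the table are simple multiplication, but making the arithmetic explicit is useful when sizing reservations. A minimal sketch, assuming the midpoint (~13 MB) of the 10–15 MB per-shim range above:

```shell
#!/bin/sh
# Estimate shim-layer overhead at a given density.
# Assumes ~13 MB RSS per shim (midpoint of the 10-15 MB range); adjust
# per_shim_mb to your measured value.
count=800
per_shim_mb=13
total_mb=$((count * per_shim_mb))
echo "Estimated shim overhead for $count containers: $total_mb MB (~$((total_mb / 1024)) GB)"
```

None of this memory appears in any pod's cgroup accounting, which is exactly why the table's "headroom lost" column surprises people.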
Diagnostic:
```bash
# Count shim processes and total memory consumption
# (the [c] bracket trick stops grep from matching its own process)
ps aux | grep '[c]ontainerd-shim' | awk '{sum += $6} END {print "Total shim RSS: " sum/1024 " MB, Count: " NR}'

# Check OOM kill history
dmesg | grep -i "oom\|killed process" | tail -20
```
Mitigation: Reserve explicit non-container memory headroom using --system-reserved and --kube-reserved kubelet flags. A practical floor for high-density nodes is 2–4 GB above documented container limits. Evaluate crun as a replacement for runc — it reduces per-shim overhead by approximately 30–40% in high-density configurations.
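As a starting point, those reservations might look like the following kubelet invocation. The sizes are illustrative assumptions for a high-density node, not recommendations; tune them against your measured shim overhead:

```shell
# Illustrative kubelet flags -- the memory/CPU values are assumptions,
# size them from the shim RSS totals you actually observe.
kubelet \
  --system-reserved=memory=4Gi,cpu=500m \
  --kube-reserved=memory=2Gi,cpu=1000m \
  --eviction-hard=memory.available<500Mi
```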
Failure Pattern #2: Mixed cgroup v1/v2 Nodes Corrupt Resource Accounting
Trigger: Cluster nodes running mixed cgroup versions during OS or kernel upgrades
Failure signature: Pods report under CPU limits but exhibit throttling; containers OOM unexpectedly despite memory limits appearing enforced
| Dimension | cgroup v1 | cgroup v2 |
|---|---|---|
| Hierarchy | Per-subsystem flat | Unified single hierarchy |
| CPU accounting | cpu and cpuacct separate | Combined in unified tree |
| Memory accounting | memory subsystem | Different semantics |
| Kubernetes support | Full | Requires kernel ≥ 5.8, containerd ≥ 1.4 |
In a mixed cluster, the Kubernetes scheduler has no visibility into cgroup version per node. Pods scheduled to cgroup v2 nodes running containerd configured for v1 semantics will have their resource limits enforced through the wrong accounting model.
Diagnostic:
```bash
# Check cgroup version on a node
stat -fc %T /sys/fs/cgroup/

# Check containerd's cgroup driver configuration
containerd config dump | grep cgroup

# Verify kubelet cgroup driver matches
systemctl show kubelet | grep cgroup
```
If stat returns cgroup2fs but containerd is configured with SystemdCgroup = false, resource accounting is broken.
Mitigation: Enforce explicit cgroup version consistency across all node pools. Use node taints during migrations to prevent mixed-version scheduling. Add cgroup version verification to your node provisioning pipeline as a gate — not a check.
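A provisioning gate of that kind can be a few lines of shell. The helper below is a sketch, not a real tool: it takes the filesystem type from `stat -fc %T /sys/fs/cgroup/` and the configured driver, and fails bootstrap on a mismatch:

```shell
#!/bin/sh
# Hypothetical provisioning-gate check: fail node bootstrap if the cgroup
# filesystem and the configured cgroup driver disagree.
check_cgroup_consistency() {
  fstype="$1"   # from: stat -fc %T /sys/fs/cgroup/
  driver="$2"   # from containerd/kubelet config: "systemd" or "cgroupfs"
  if [ "$fstype" = "cgroup2fs" ] && [ "$driver" != "systemd" ]; then
    echo "FAIL: cgroup v2 node with $driver driver"
    return 1
  fi
  echo "OK: $fstype with $driver driver"
}

check_cgroup_consistency cgroup2fs systemd    # passes the gate
check_cgroup_consistency cgroup2fs cgroupfs   # fails the gate
```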
Failure Pattern #3: OverlayFS Layer Debt in Long-Running Nodes
Trigger: Node uptime exceeding 60–90 days without image cleanup
Failure signature: Image pull times degrade from seconds to minutes; df -i shows inode exhaustion
OverlayFS layer debt accumulates through a predictable sequence: image pulls create new layer directories, image updates add layers without removing old ones, deleted containers leave dangling snapshot references, and node uptime compounds all of the above.
The critical failure mode: on nodes with high container churn running many distinct images, inode exhaustion occurs before disk capacity exhaustion — appearing as "No space left on device" errors with gigabytes of available disk space.
Diagnostic:
```bash
# Check snapshot count and disk usage
du -sh /var/lib/containerd/
find /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/ -maxdepth 1 -type d | wc -l

# Check inode utilization
df -i /var/lib/containerd/

# Snapshot vs container count (CRI resources live in the k8s.io namespace)
ctr -n k8s.io snapshots ls | wc -l
```
Threshold: When snapshot count exceeds ~2,000–3,000, pull performance begins degrading measurably. When inode utilization exceeds 85%, image pulls start failing intermittently.
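The 85% inode threshold can be expressed as a scriptable check. A sketch, with the utilization percent passed in (for example parsed from `df --output=ipcent /var/lib/containerd/`):

```shell
#!/bin/sh
# Sketch of the 85% inode threshold as a node health check.
check_inode_headroom() {
  pct="$1"   # inode utilization percent, integer
  if [ "$pct" -ge 85 ]; then
    echo "CRITICAL: inode utilization ${pct}% - image pulls may start failing"
    return 1
  fi
  echo "OK: inode utilization ${pct}%"
}

check_inode_headroom 42
check_inode_headroom 91
```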
Mitigation: Implement node-level image cleanup as a DaemonSet running crictl rmi --prune on a scheduled interval. Consider node rotation policies that rebuild nodes every 60–90 days. For high-churn clusters, configure containerd's GC parameters explicitly:
```toml
# /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"

# GC scheduler thresholds -- lower values trigger collection more aggressively
[plugins."io.containerd.gc.v1.scheduler"]
  pause_threshold = 0.02    # max fraction of time GC may pause the daemon
  deletion_threshold = 0    # run GC after this many deletions (0 = no trigger)
  mutation_threshold = 100  # run GC after this many metadata mutations
  startup_delay = "100ms"
```
Failure Pattern #4: containerd Snapshot Garbage Collection Drift
Trigger: Long-running clusters with high container churn and no explicit GC tuning
Failure signature: Gradual degradation in image pull latency; storage consumption growing despite containers being deleted
| Mechanism | Description | Detection |
|---|---|---|
| Dangling snapshots | No active container reference; GC skips due to reference counting errors | ctr snapshots ls vs ctr containers ls delta |
| Lease accumulation | containerd leases never explicitly released | ctr leases ls count growing unbounded |
| Content store orphans | Image layers referenced by snapshots but not by active manifests | ctr content ls size vs expected |
Diagnostic:
```bash
# Force a GC run (CRI resources live in the k8s.io namespace)
ctr -n k8s.io content gc --verbose 2>&1 | tail -30

# Check lease accumulation
ctr -n k8s.io leases ls | wc -l

# Snapshot-to-container ratio
echo "Snapshots: $(ctr -n k8s.io snapshots ls | wc -l)"
echo "Containers: $(ctr -n k8s.io containers ls | wc -l)"
```
A snapshot count more than 3–5× the container count warrants investigation.
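The 3–5× heuristic is easy to turn into an alerting check. A sketch taking the two counts as arguments (feed it the `ctr` output from the diagnostic above):

```shell
#!/bin/sh
# Sketch of the snapshot-to-container ratio heuristic; the 3x cutoff
# follows the rule of thumb above.
check_snapshot_ratio() {
  snapshots="$1"
  containers="$2"
  [ "$containers" -eq 0 ] && containers=1   # avoid division by zero on drained nodes
  ratio=$((snapshots / containers))
  if [ "$ratio" -ge 3 ]; then
    echo "WARN: snapshot/container ratio ${ratio}x - likely GC drift"
    return 1
  fi
  echo "OK: snapshot/container ratio ${ratio}x"
}

check_snapshot_ratio 250 100   # within normal range
check_snapshot_ratio 900 100   # warrants investigation
```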
Mitigation: Configure explicit GC thresholds and implement periodic forced GC via a privileged DaemonSet. For persistent drift, evaluate moving to a separate containerd data root on a dedicated volume.
Failure Pattern #5: containerd gRPC Socket Saturation Under Rapid Pod Cycling
Trigger: High pod churn — CI/CD clusters, ephemeral job runners, GitHub Actions nodes
Failure signature: Pod scheduling delays that appear as node pressure with no CPU or memory constraint; kubectl describe node shows KubeletHasTooManyPods without approaching the pod limit
The containerd gRPC socket (/run/containerd/containerd.sock) processes requests serially per operation type. In environments with rapid pod cycling, it becomes a serialization bottleneck:
- CI/CD pipeline triggers 50+ pod creates in rapid succession
- Each create requires PullImage → CreateContainer → StartContainer gRPC calls
- containerd processes these serially
- Kubelet's CRI client begins queuing requests
- Pod startup latency increases from seconds to minutes
- Kubelet reports node pressure before any actual resource constraint exists
kubectl top nodes shows the node as underutilized. The scheduler sees pressure that doesn't match resource metrics.
Diagnostic:
```bash
# Measure pod startup latency
kubectl get events --field-selector reason=Started --sort-by='.lastTimestamp' | tail -20

# Check containerd task metrics if endpoint enabled
curl -s http://localhost:1338/v1/metrics | grep containerd_task

# Monitor socket utilization
ss -x | grep containerd
```
Mitigation: Isolate batch/CI workloads to dedicated node pools. Tune --max-pods conservatively for high-churn nodes — the default of 110 assumes steady-state workloads, not rapid cycling. For GitHub Actions runners, the actions-runner-controller project includes node pool isolation patterns that address this directly.
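One way to make "seconds to minutes" measurable is to track a percentile over recent pod startup latencies. A sketch with illustrative sample data; the 30-second cutoff is an assumption, tune it to your baseline:

```shell
#!/bin/sh
# Flag when the p95 of recent pod startup latencies (in seconds) crosses a
# threshold. The sample values and the 30s cutoff are illustrative.
printf '%s\n' 2 3 2 4 45 60 3 2 90 5 | sort -n | awk '
  { v[NR] = $1 }
  END {
    idx = int(NR * 0.95); if (idx < 1) idx = 1
    printf "p95 startup latency: %ds\n", v[idx]
    if (v[idx] > 30) print "WARN: latency pattern consistent with CRI serialization"
  }'
```

In practice the latency samples would come from the Started events in the diagnostic above rather than a hardcoded list.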
Diagnostic Summary: When to Expect These Failures
| Failure Pattern | Primary Trigger | Secondary Signal |
|---|---|---|
| Shim Tax | >500 containers per node | OOM kills on compliant containers |
| cgroup v1/v2 Drift | Mixed-version node pool | CPU throttling without limit violation |
| OverlayFS Layer Debt | Node uptime >60–90 days | Image pull time degradation |
| Snapshot GC Drift | High churn + no GC tuning | Storage growth despite container deletion |
| Socket Saturation | >100 pod creates/hour | Scheduling delays without resource pressure |
These patterns are predictable. The diagnostics are repeatable. The mitigations are operational decisions — not emergency responses.
Additional Resources
- Architecting for Density: Docker vs Podman vs containerd — Day-1 runtime selection and the foundational architecture this post builds on
- Kubernetes Day-2 Operations: The Rack2Cloud Method — The complete diagnostic framework these patterns plug into
- Kubernetes Scheduler Stuck: CPU Fragmentation and the Pending Pod Problem
- Kubernetes PVC Stuck: Volume Node Affinity and Storage Loop Failures
- containerd Project Documentation
- Linux cgroup v2 Documentation
- crun Project — High-Performance OCI Runtime
Originally published at rack2cloud.com — field-verified infrastructure architecture for engineers navigating the complexity gap.




