
NTCTech

Posted on • Originally published at rack2cloud.com

Kubernetes Day-2 Incidents: 5 Real-World Failures and the One Metric That Predicts Them


Day 1 is shipping the cluster. Day 2 is living with it.

And Day 2 has patterns. The same five failure modes surface every month, across clusters of different sizes, different clouds, and different teams. They present differently each time — but they all have a tell.

This is the short version. Each incident links to a full diagnostic deep-dive in the Rack2Cloud K8s Day-2 Method.


Quick reference: Kubernetes Day-2 incident-to-metric mapping (each pattern below pairs with one leading metric).

1. CrashLoopBackOff

What it looks like: RESTARTS climbing in kubectl get pods. Logs often empty — container exits too fast to write anything.

What it usually is: IAM or credential failure. The container starts, auth call fails, exit code 1, kubelet restarts, repeat. Not a code bug — an identity bug.

The loop: Container starts → auth fails → exit 1 → restart → repeat.

Watch:

kube_pod_container_status_restarts_total

Combine with the exit code from kubectl describe pod. Exit 1 or 128+ with rapid restarts = identity problem. Exit 0 with a crash loop = the process is finishing cleanly and restartPolicy is restarting it (liveness-probe kills show up as 137 instead — readiness probes never trigger restarts).
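As a sketch, a Prometheus alerting rule on this metric might look like the following (the alert name, thresholds, and `for` duration are illustrative assumptions — tune per cluster):

```yaml
groups:
  - name: day2-crashloop
    rules:
      - alert: PodRestartingRapidly   # hypothetical alert name
        # More than 3 restarts in 15 minutes; threshold is an assumption
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restart-looping; check exit codes for identity failures"
```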

→ Full deep-dive: Kubernetes ImagePullBackOff: It's Not the Registry (It's IAM)


2. Scheduler Stuck — Pending Pods That Never Resolve

What it looks like: Pods sit Pending. FailedScheduling in describe. Nodes show Ready. kubectl top nodes looks fine.

What it usually is: The gap between allocatable and requested. Nodes are full on paper even when utilisation is 30%. The scheduler can't place — and never escalates.

The loop: Pod → no node satisfies constraints → Pending → stays Pending forever.

Watch:

kube_pod_status_phase{phase="Pending"}
kube_node_status_allocatable{resource="cpu"} vs kube_pod_container_resource_requests{resource="cpu"} (sum by node)

95% requested with 25% actual utilisation = stuck scheduler incoming.
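One way to watch that gap is a ratio query. A minimal sketch, assuming current kube-state-metrics metric names and an illustrative 95% threshold:

```yaml
# Fraction of allocatable CPU already claimed by requests, per node.
# Fires well before actual utilisation looks scary.
- alert: NodeCpuRequestsSaturated   # hypothetical alert name
  expr: |
    sum by (node) (kube_pod_container_resource_requests{resource="cpu"})
      / on (node)
    kube_node_status_allocatable{resource="cpu"}
    > 0.95
  for: 10m
  labels:
    severity: warning
```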

→ Full deep-dive: Your Kubernetes Cluster Isn't Out of CPU — The Scheduler Is Stuck


3. Silent Network Failures — MTU Mismatch on Overlay

What it looks like: Services work locally. Cross-node calls time out or return 502s. DNS resolves. TCP connects. Large payloads disappear.

What it usually is: VXLAN/Geneve overlay adds header overhead. Large packets hit the MTU ceiling of the underlying cloud network. Dropped silently — no ICMP, no error, just gone.

The loop: Large payload → overlay encapsulation exceeds MTU → silent drop → TCP retransmit storm → timeout → 502.

Watch:

node_network_receive_drop_total
node_network_transmit_drop_total   per interface, not aggregate

Drops on flannel0, cilium_vxlan, or geneve.1 with clean physical NIC stats = MTU.
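A per-interface drop alert is a short sketch; the device regex covers the overlay interface names mentioned above and is an assumption about which CNI you run:

```yaml
- alert: OverlayInterfacePacketDrops   # hypothetical alert name
  # Drops on the overlay device while the physical NIC stays clean point at MTU
  expr: |
    rate(node_network_transmit_drop_total{device=~"flannel.*|cilium_vxlan|geneve.*"}[5m]) > 0
  for: 10m
  labels:
    severity: warning
```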

→ Full deep-dive: It's Not DNS (It's MTU): Debugging Kubernetes Ingress


4. Storage Gravity — PVC Stuck or Pod on Wrong Node

What it looks like: Pod reschedules after a node failure. New node can't mount the PVC. ContainerCreating indefinitely. Zone A volume, zone B pod.

What it usually is: Block storage is zonal. AWS EBS, Azure Disk, GCP PD — all tied to the AZ where they were provisioned. volumeBindingMode: WaitForFirstConsumer fixes it, but most default storage classes don't use it. The failure only appears under pressure.

The loop: Node fails → pod reschedules cross-AZ → PVC can't attach → ContainerCreating → stuck.

Watch:

kube_persistentvolumeclaim_status_phase{phase="Pending"}

kubectl describe pvc — look for ProvisioningFailed or FailedMount with AZ mismatch detail.
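The prevention lives in the StorageClass. A minimal sketch of a zonal-safe class for AWS EBS (the class name and parameters are illustrative — check your provisioner's documentation):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wait-for-consumer   # hypothetical name
provisioner: ebs.csi.aws.com
# Delay volume provisioning until a pod is scheduled, so the volume
# is created in the same AZ as the node that will mount it.
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
```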


5. etcd / Control-Plane Saturation

What it looks like: kubectl hangs. API server timeouts. Controllers flap. Deployments stop reconciling. HPA erratic. Everything appears fine at the workload layer.

What it usually is: etcd is I/O-bound. Shared cloud storage under burst conditions hits a disk IOPS ceiling. When etcd can't commit writes fast enough, the entire control plane degrades — with no workload-layer signal until it cascades.

The loop: High cluster churn → etcd write amplification → IOPS ceiling → heartbeat latency → API server queue → controller desync → more churn.

Watch:

etcd_disk_wal_fsync_duration_seconds_bucket   p99 > 10ms = warning, > 100ms = problem
apiserver_request_duration_seconds_bucket     p99 spike with no app traffic = control-plane problem

On AKS/EKS/GKE you won't see etcd directly — API server latency is surfaced instead. p99 spike with no application traffic change = control plane, not app.
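The fsync threshold translates directly into a histogram_quantile alert. A sketch, using the 10ms warning level from the text (alert name assumed):

```yaml
- alert: EtcdWalFsyncSlow   # hypothetical alert name
  # p99 WAL fsync above 10ms = warning territory; 100ms = active problem
  expr: |
    histogram_quantile(0.99,
      rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
  for: 5m
  labels:
    severity: warning
```

On managed control planes, swap in the same shape over apiserver_request_duration_seconds_bucket instead.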


The Pattern Underneath All Five

Symptom layer: noisy and misleading. Cause layer: quiet until it isn't.

CrashLoopBackOff looks like code. Pending pods look like resource exhaustion. 502s look like DNS. Storage failures look like provisioner bugs. API timeouts look like network.

The Rack2Cloud Method maps the signal chain from symptom to root cause — across all five loops (identity, compute, network, storage, control plane).

Canonical failure signatures and diagnostic protocols are in the Rack2Cloud GitHub repo — open reference, no login required.

If you're on AKS specifically, Petro Kostiuk's Azure Edition walkthrough applies the method to the managed-plane layer where several of these sequences change.


The full series covers each loop in depth. Start with the method, follow the loop that matches your incident.


Top comments (2)

vandana.platform

This is a great reminder that most Kubernetes Day-2 incidents aren't random; they follow repeatable patterns. What's really valuable here is tying each failure mode to a specific leading metric. When teams monitor the right signals (restarts, Pending pods vs allocatable, network drops, PVC phases, API latency), many of these incidents can be detected before they escalate into outages. Observability at the control-plane and infrastructure layers is often the difference between reactive firefighting and proactive platform operations.

NTCTech

Exactly right, and you've nailed the distinction that most teams miss: it's not just having the metrics, it's knowing which layer they belong to. Restarts and Pending pods live at the workload layer. API latency and etcd heartbeat variance live at the control plane. Teams that conflate the two end up with dashboards that look busy but don't actually predict anything.

The pattern I keep seeing in production is that the control-plane signals degrade first (quietly) and well before the workload layer shows symptoms. By the time restarts spike, you're already in incident response mode. The leading metric advantage only works if you're watching the right layer at the right time.

Curious what your team uses for control-plane observability: are you running anything custom on top of kube-state-metrics, or relying on a managed layer?