
NTCTech

Posted on • Originally published at rack2cloud.com

Kubernetes Day-2 Incidents: 5 Real-World Failures and the One Metric That Predicts Them


Day 1 is shipping the cluster. Day 2 is living with it.

And Day 2 has patterns. The same five failure modes surface every month, across clusters of different sizes, different clouds, and different teams. They present differently each time — but they all have a tell.

This is the short version. Each incident links to a full diagnostic deep-dive in the Rack2Cloud K8s Day-2 Method.


Quick reference: Kubernetes Day-2 incident-to-metric mapping (each pattern below pairs with one leading metric).

1. CrashLoopBackOff

What it looks like: RESTARTS climbing in kubectl get pods. Logs often empty — container exits too fast to write anything.

What it usually is: IAM or credential failure. The container starts, auth call fails, exit code 1, kubelet restarts, repeat. Not a code bug — an identity bug.

The loop: Container starts → auth fails → exit 1 → restart → repeat.

Watch:

kube_pod_container_status_restarts_total

Combine with the exit code from kubectl describe pod. Exit 1 or 128+ with rapid restarts = identity problem. Exit 0 with a crash loop = the process is finishing cleanly and restartPolicy is restarting it (liveness-probe kills show up as 137 instead — readiness probes never trigger restarts).
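As a sketch, a Prometheus alerting rule on this metric might look like the following (the alert name, thresholds, and `for` duration are illustrative assumptions — tune per cluster):

```yaml
groups:
  - name: day2-crashloop
    rules:
      - alert: PodRestartingRapidly   # hypothetical alert name
        # More than 3 restarts in 15 minutes; threshold is an assumption
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restart-looping; check exit codes for identity failures"
```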

→ Full deep-dive: Kubernetes ImagePullBackOff: It's Not the Registry (It's IAM)


2. Scheduler Stuck — Pending Pods That Never Resolve

What it looks like: Pods sit Pending. FailedScheduling in describe. Nodes show Ready. kubectl top nodes looks fine.

What it usually is: The gap between allocatable and requested. Nodes are full on paper even when utilisation is 30%. The scheduler can't place — and never escalates.

The loop: Pod → no node satisfies constraints → Pending → stays Pending forever.

Watch:

kube_pod_status_phase{phase="Pending"}
kube_node_status_allocatable{resource="cpu"} vs kube_pod_container_resource_requests{resource="cpu"} (sum by node)

95% requested with 25% actual utilisation = stuck scheduler incoming.
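One way to watch that gap is a ratio query. A minimal sketch, assuming current kube-state-metrics metric names and an illustrative 95% threshold:

```yaml
# Fraction of allocatable CPU already claimed by requests, per node.
# Fires well before actual utilisation looks scary.
- alert: NodeCpuRequestsSaturated   # hypothetical alert name
  expr: |
    sum by (node) (kube_pod_container_resource_requests{resource="cpu"})
      / on (node)
    kube_node_status_allocatable{resource="cpu"}
    > 0.95
  for: 10m
  labels:
    severity: warning
```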

→ Full deep-dive: Your Kubernetes Cluster Isn't Out of CPU — The Scheduler Is Stuck


3. Silent Network Failures — MTU Mismatch on Overlay

What it looks like: Services work locally. Cross-node calls time out or return 502s. DNS resolves. TCP connects. Large payloads disappear.

What it usually is: VXLAN/Geneve overlay adds header overhead. Large packets hit the MTU ceiling of the underlying cloud network. Dropped silently — no ICMP, no error, just gone.

The loop: Large payload → overlay encapsulation exceeds MTU → silent drop → TCP retransmit storm → timeout → 502.

Watch:

node_network_receive_drop_total
node_network_transmit_drop_total   per interface, not aggregate

Drops on flannel0, cilium_vxlan, or geneve.1 with clean physical NIC stats = MTU.
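A per-interface drop alert is a short sketch; the device regex covers the overlay interface names mentioned above and is an assumption about which CNI you run:

```yaml
- alert: OverlayInterfacePacketDrops   # hypothetical alert name
  # Drops on the overlay device while the physical NIC stays clean point at MTU
  expr: |
    rate(node_network_transmit_drop_total{device=~"flannel.*|cilium_vxlan|geneve.*"}[5m]) > 0
  for: 10m
  labels:
    severity: warning
```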

→ Full deep-dive: It's Not DNS (It's MTU): Debugging Kubernetes Ingress


4. Storage Gravity — PVC Stuck or Pod on Wrong Node

What it looks like: Pod reschedules after a node failure. New node can't mount the PVC. ContainerCreating indefinitely. Zone A volume, zone B pod.

What it usually is: Block storage is zonal. AWS EBS, Azure Disk, GCP PD — all tied to the AZ where they were provisioned. volumeBindingMode: WaitForFirstConsumer fixes it, but most default storage classes don't use it. The failure only appears under pressure.

The loop: Node fails → pod reschedules cross-AZ → PVC can't attach → ContainerCreating → stuck.

Watch:

kube_persistentvolumeclaim_status_phase{phase="Pending"}

kubectl describe pvc — look for ProvisioningFailed or FailedMount with AZ mismatch detail.
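The prevention lives in the StorageClass. A minimal sketch of a zonal-safe class for AWS EBS (the class name and parameters are illustrative — check your provisioner's documentation):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wait-for-consumer   # hypothetical name
provisioner: ebs.csi.aws.com
# Delay volume provisioning until a pod is scheduled, so the volume
# is created in the same AZ as the node that will mount it.
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
```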


5. etcd / Control-Plane Saturation

What it looks like: kubectl hangs. API server timeouts. Controllers flap. Deployments stop reconciling. HPA erratic. Everything appears fine at the workload layer.

What it usually is: etcd is I/O-bound. Shared cloud storage under burst conditions hits a disk IOPS ceiling. When etcd can't commit writes fast enough, the entire control plane degrades — with no workload-layer signal until it cascades.

The loop: High cluster churn → etcd write amplification → IOPS ceiling → heartbeat latency → API server queue → controller desync → more churn.

Watch:

etcd_disk_wal_fsync_duration_seconds_bucket   p99 > 10ms = warning, > 100ms = problem
apiserver_request_duration_seconds_bucket     p99 spike with no app traffic = control-plane problem

On AKS/EKS/GKE you won't see etcd directly — API server latency is surfaced instead. p99 spike with no application traffic change = control plane, not app.
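The fsync threshold translates directly into a histogram_quantile alert. A sketch, using the 10ms warning level from the text (alert name assumed):

```yaml
- alert: EtcdWalFsyncSlow   # hypothetical alert name
  # p99 WAL fsync above 10ms = warning territory; 100ms = active problem
  expr: |
    histogram_quantile(0.99,
      rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
  for: 5m
  labels:
    severity: warning
```

On managed control planes, swap in the same shape over apiserver_request_duration_seconds_bucket instead.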


The Pattern Underneath All Five

Symptom layer: noisy and misleading. Cause layer: quiet until it isn't.

CrashLoopBackOff looks like code. Pending pods look like resource exhaustion. 502s look like DNS. Storage failures look like provisioner bugs. API timeouts look like network.

The Rack2Cloud Method maps the signal chain from symptom to root cause — across all five loops (identity, compute, network, storage, control plane).

Canonical failure signatures and diagnostic protocols are in the Rack2Cloud GitHub repo — open reference, no login required.

If you're on AKS specifically, Petro Kostiuk's Azure Edition walkthrough applies the method to the managed-plane layer where several of these sequences change.


The full series covers each loop in depth. Start with the method, follow the loop that matches your incident.


Top comments (2)

vandana.platform

This is a great reminder that most Kubernetes Day-2 incidents aren't random; they follow repeatable patterns. What's really valuable here is tying each failure mode to a specific leading metric. When teams monitor the right signals (restarts, Pending pods vs allocatable, network drops, PVC phases, API latency), many of these incidents can be detected before they escalate into outages. Observability at the control-plane and infrastructure layers is often the difference between reactive firefighting and proactive platform operations.

NTCTech

Exactly right, and you've nailed the distinction that most teams miss: it's not just having the metrics, it's knowing which layer they belong to. Restarts and Pending pods live at the workload layer. API latency and etcd heartbeat variance live at the control plane. Teams that conflate the two end up with dashboards that look busy but don't actually predict anything.

The pattern I keep seeing in production is that the control-plane signals degrade first (quietly) and well before the workload layer shows symptoms. By the time restarts spike, you're already in incident response mode. The leading metric advantage only works if you're watching the right layer at the right time.

Curious what your team uses for control-plane observability: are you running anything custom on top of kube-state-metrics, or relying on a managed layer?