Why this exists
I've been running K8s troubleshooting workshops for two years. We have a 200-student program at IT Defined where we throw broken clusters at people. Patterns emerged.
Most failures aren't novel. The same 25-30 failure modes account for 90% of real-world K8s incidents. If you can confidently debug these, you'll handle most production incidents.
Here are the 10 most critical scenarios. Full 26 in the linked post.
1. CrashLoopBackOff
Symptom: Pod restart count climbing.
Diagnosis:
kubectl describe pod POD_NAME
kubectl logs POD_NAME --previous
Likely causes: App crashes on startup (config error, missing env var, can't connect to DB), liveness probe too aggressive, command/args misconfigured.
Fix: Read the previous container's logs. Reason is usually right there. If logs are empty, the container died before logging — check the entrypoint, command, and args.
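When the liveness probe is the culprit, relaxing it usually stops the restart loop. A minimal sketch (the endpoint path, port, and timings are assumptions — tune them to your app's real startup time):

```yaml
# Hypothetical liveness probe giving a slow-starting app room to boot.
livenessProbe:
  httpGet:
    path: /healthz         # assumed health endpoint
    port: 8080
  initialDelaySeconds: 30  # don't probe before the app can answer
  periodSeconds: 10
  failureThreshold: 3      # three consecutive misses before a restart
```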
2. ImagePullBackOff or ErrImagePull
Diagnosis: kubectl describe pod, look at events at the bottom.
Likely causes: Image name typo, image doesn't exist, registry credentials missing, wrong region (ECR is regional), node IAM role can't pull from ECR.
Fix: Run docker pull manually from a workstation. If it pulls with the same credentials, the image exists and the name is right — the problem is node-side permissions (node IAM role, ECR policy) or missing pull secrets.
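For private registries, the pod spec needs pull credentials. A sketch with placeholder names (the secret name, registry URL, and image are examples):

```yaml
# Create the secret first (values are placeholders):
#   kubectl create secret docker-registry regcred \
#     --docker-server=REGISTRY_URL --docker-username=USER --docker-password=PASS
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: app
      image: REGISTRY_URL/myteam/app:1.2.3  # double-check name, tag, and region
```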
3. Pod stuck Pending
Diagnosis: kubectl describe pod. Look for "0/3 nodes available: insufficient cpu" or "didn't match node selector."
Likely causes: Insufficient capacity, resource requests too high, taints/tolerations mismatch, PVC not bound.
Fix: Check kubectl describe nodes for available resources. If every node is maxed out, add capacity (cluster autoscaler or a bigger node group).
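The two most common unblockers are trimming an over-sized request and tolerating the taint that's keeping pods off a node pool. Illustrative values (the taint key/value are assumptions — read the real taint from kubectl describe nodes):

```yaml
# Under spec.containers[] -- lower an over-sized request so the pod fits:
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
# Under the pod spec (sibling of containers) -- tolerate a node-pool taint:
tolerations:
  - key: "dedicated"       # assumed taint key; check the node's actual taints
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"
```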
4. OOMKilled
Diagnosis: kubectl describe pod shows "Last State: Terminated, Reason: OOMKilled."
Likely causes: Container exceeded memory limit, JVM not configured for container limits, memory leak.
Fix: Increase limits if the workload genuinely needs more. For Java apps, set -XX:MaxRAMPercentage so the heap is sized from the container's memory limit instead of the node's total memory.
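A sketch of what that looks like for a JVM workload (image name and sizes are illustrative):

```yaml
containers:
  - name: app
    image: myteam/java-app:1.0          # hypothetical image
    resources:
      requests:
        memory: "512Mi"
      limits:
        memory: "1Gi"                   # the OOMKill ceiling
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAMPercentage=75.0"  # heap stays inside the limit
```

Leaving ~25% of the limit for non-heap memory (metaspace, threads, direct buffers) is the usual reason not to push the percentage to 100.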
5. Service unreachable
Diagnosis:
kubectl get endpoints SVC_NAME
Likely causes: No endpoints (selector doesn't match pod labels), pod not listening on expected port, NetworkPolicy blocking traffic.
Fix: 99% of the time it's a label selector mismatch — compare kubectl get pods --show-labels against the Service's spec.selector.
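The relationships that must line up, in one illustrative Service (names and ports are examples):

```yaml
# The Service selector must match the pod template's labels exactly,
# and targetPort must match the port the container listens on.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web           # must equal the pod's labels, not the Deployment's name
  ports:
    - port: 80         # what clients connect to
      targetPort: 8080 # what the container actually listens on
```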
6. DNS resolution failing
Diagnosis: kubectl exec into pod, run nslookup. Check CoreDNS pods.
Likely causes: CoreDNS pods crashed, NetworkPolicy blocking DNS, /etc/resolv.conf misconfigured.
Fix: Restart CoreDNS if it's misbehaving. On EKS, the default CoreDNS replica count and resources are sometimes too low for busy clusters — scaling the coredns Deployment up is a common fix.
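If a default-deny egress NetworkPolicy is in place, pods silently lose DNS. An illustrative policy that re-allows it (assumes CoreDNS runs in kube-system, the usual setup):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
spec:
  podSelector: {}            # all pods in this namespace
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```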
7. Ingress 502 Bad Gateway
Likely causes: Backend pod down, target group health check failing, port mismatch, slow startup so ALB marks unhealthy.
Fix: Check target group health in AWS console. Fix readiness probe if pods unhealthy.
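For the slow-startup case, a readiness probe that tolerates the boot time keeps the target out of rotation until it can actually serve. A sketch (path, port, and timings are assumptions):

```yaml
readinessProbe:
  httpGet:
    path: /ready           # assumed endpoint; must return 200 when serving
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 5
  failureThreshold: 6      # allows roughly 45s of startup before giving up
```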
8. PVC stuck Pending
Likely causes: No StorageClass set, EBS CSI driver not installed, missing IAM permissions for the driver.
Fix on EKS: Install EBS CSI driver as an EKS add-on. Service account needs the right IAM role via IRSA.
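Once the driver is installed, a StorageClass like this (illustrative name and parameters) lets PVCs bind:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com             # the EBS CSI driver
volumeBindingMode: WaitForFirstConsumer  # create the volume in the pod's AZ
parameters:
  type: gp3
```

WaitForFirstConsumer matters on multi-AZ clusters: it delays volume creation until the pod schedules, so the EBS volume lands in the same availability zone.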
9. Node Not Ready
Likely causes: Kubelet crashed, container runtime issue, disk pressure, network plugin failure.
Fix: SSH to node (or SSM Session Manager). Check journalctl -u kubelet. Often it's disk full from log accumulation.
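Disk pressure is common enough that a quick check is worth scripting. A minimal sketch to run on the node (the 85% threshold is an assumption; kubelet's own eviction thresholds may differ):

```shell
# Flag any filesystem over 85% full -- a frequent cause of kubelet
# disk-pressure evictions and NotReady nodes.
THRESHOLD=85
df -P | awk -v t="$THRESHOLD" 'NR>1 { gsub("%","",$5); if ($5+0 > t) print $6, "is", $5"% full" }'
```

Pair it with journalctl -u kubelet -n 100 to see why the kubelet itself is unhappy.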
10. HPA not scaling
Likely causes: Metrics-server not installed, HPA targeting CPU but pod has no CPU requests, max replicas reached.
Fix: kubectl get hpa. If &lt;unknown&gt; appears in the TARGETS column, either metrics-server is broken or the target pods have no CPU requests for the utilization math to work against.
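A working reference point — an autoscaling/v2 HPA on CPU (names and numbers are examples; the target Deployment's containers must set resources.requests.cpu or the metric reads &lt;unknown&gt;):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # example target
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale when avg CPU exceeds 70% of requests
```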
How to use this playbook
When you hit a real incident, search for keywords from the symptom. Most day-to-day stuff is covered.
If you want to actually practice these in a safe environment, our K8s troubleshooting labs at IT Defined are exactly this — broken clusters with planted issues, fix them under time pressure.
Full 26 scenarios — including ConfigMap updates, Secret rotation, NetworkPolicy issues, PDB blocks, autoscaler problems, kube-proxy/CNI issues, Job failures, IRSA problems, webhook admission controllers, liveness probes, PV cleanup, and cluster upgrades — on itdefined.org.