Why this exists
I've been running K8s troubleshooting workshops for two years. We have a 200-student program at IT Defined where we throw broken clusters at people. Patterns emerged.
Most failures aren't novel. The same 25-30 failure modes account for 90% of real-world K8s incidents. If you can confidently debug these, you'll handle most production incidents.
Here are the 10 most critical scenarios. Full 26 in the linked post.
1. CrashLoopBackOff
Symptom: Pod restart count climbing.
Diagnosis:
kubectl describe pod POD_NAME
kubectl logs POD_NAME --previous
Likely causes: App crashes on startup (config error, missing env var, can't connect to DB), liveness probe too aggressive, command/args misconfigured.
Fix: Read the previous container's logs. Reason is usually right there. If logs are empty, the container died before logging — check the entrypoint, command, and args.
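When the liveness probe is the culprit, relaxing it usually stops the restart loop. A minimal sketch (the endpoint path, port, and timings are assumptions — tune them to your app's real startup time):

```yaml
# Hypothetical liveness probe giving a slow-starting app room to boot.
livenessProbe:
  httpGet:
    path: /healthz         # assumed health endpoint
    port: 8080
  initialDelaySeconds: 30  # don't probe before the app can answer
  periodSeconds: 10
  failureThreshold: 3      # three consecutive misses before a restart
```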
2. ImagePullBackOff or ErrImagePull
Diagnosis: kubectl describe pod, look at events at the bottom.
Likely causes: Image name typo, image doesn't exist, registry credentials missing, wrong region (ECR is regional), node IAM role can't pull from ECR.
Fix: Run docker pull manually from a workstation. If it pulls with the same credentials, the image exists and the name is right — the problem is node-side permissions (node IAM role, ECR policy) or missing pull secrets.
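For private registries, the pod spec needs pull credentials. A sketch with placeholder names (the secret name, registry URL, and image are examples):

```yaml
# Create the secret first (values are placeholders):
#   kubectl create secret docker-registry regcred \
#     --docker-server=REGISTRY_URL --docker-username=USER --docker-password=PASS
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: app
      image: REGISTRY_URL/myteam/app:1.2.3  # double-check name, tag, and region
```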
3. Pod stuck Pending
Diagnosis: kubectl describe pod. Look for "0/3 nodes available: insufficient cpu" or "didn't match node selector."
Likely causes: Insufficient capacity, resource requests too high, taints/tolerations mismatch, PVC not bound.
Fix: Check kubectl describe nodes for available resources. If every node is maxed out, add capacity (cluster autoscaler or a bigger node group).
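The two most common unblockers are trimming an over-sized request and tolerating the taint that's keeping pods off a node pool. Illustrative values (the taint key/value are assumptions — read the real taint from kubectl describe nodes):

```yaml
# Under spec.containers[] -- lower an over-sized request so the pod fits:
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
# Under the pod spec (sibling of containers) -- tolerate a node-pool taint:
tolerations:
  - key: "dedicated"       # assumed taint key; check the node's actual taints
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"
```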
4. OOMKilled
Diagnosis: kubectl describe pod shows "Last State: Terminated, Reason: OOMKilled."
Likely causes: Container exceeded memory limit, JVM not configured for container limits, memory leak.
Fix: Increase limits if the workload genuinely needs more. For Java apps, set -XX:MaxRAMPercentage so the heap is sized from the container's memory limit instead of the node's total memory.
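A sketch of what that looks like for a JVM workload (image name and sizes are illustrative):

```yaml
containers:
  - name: app
    image: myteam/java-app:1.0          # hypothetical image
    resources:
      requests:
        memory: "512Mi"
      limits:
        memory: "1Gi"                   # the OOMKill ceiling
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAMPercentage=75.0"  # heap stays inside the limit
```

Leaving ~25% of the limit for non-heap memory (metaspace, threads, direct buffers) is the usual reason not to push the percentage to 100.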
5. Service unreachable
Diagnosis:
kubectl get endpoints SVC_NAME
Likely causes: No endpoints (selector doesn't match pod labels), pod not listening on expected port, NetworkPolicy blocking traffic.
Fix: 99% of the time it's a label selector mismatch — compare kubectl get pods --show-labels against the Service's spec.selector.
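The relationships that must line up, in one illustrative Service (names and ports are examples):

```yaml
# The Service selector must match the pod template's labels exactly,
# and targetPort must match the port the container listens on.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web           # must equal the pod's labels, not the Deployment's name
  ports:
    - port: 80         # what clients connect to
      targetPort: 8080 # what the container actually listens on
```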
6. DNS resolution failing
Diagnosis: kubectl exec into pod, run nslookup. Check CoreDNS pods.
Likely causes: CoreDNS pods crashed, NetworkPolicy blocking DNS, /etc/resolv.conf misconfigured.
Fix: Restart CoreDNS if it's misbehaving. On EKS, the default CoreDNS replica count and resources are sometimes too low for busy clusters — scaling the coredns Deployment up is a common fix.
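If a default-deny egress NetworkPolicy is in place, pods silently lose DNS. An illustrative policy that re-allows it (assumes CoreDNS runs in kube-system, the usual setup):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
spec:
  podSelector: {}            # all pods in this namespace
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```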
7. Ingress 502 Bad Gateway
Likely causes: Backend pod down, target group health check failing, port mismatch, slow startup so ALB marks unhealthy.
Fix: Check target group health in AWS console. Fix readiness probe if pods unhealthy.
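For the slow-startup case, a readiness probe that tolerates the boot time keeps the target out of rotation until it can actually serve. A sketch (path, port, and timings are assumptions):

```yaml
readinessProbe:
  httpGet:
    path: /ready           # assumed endpoint; must return 200 when serving
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 5
  failureThreshold: 6      # allows roughly 45s of startup before giving up
```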
8. PVC stuck Pending
Likely causes: No StorageClass set, EBS CSI driver not installed, missing IAM permissions for the driver.
Fix on EKS: Install EBS CSI driver as an EKS add-on. Service account needs the right IAM role via IRSA.
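Once the driver is installed, a StorageClass like this (illustrative name and parameters) lets PVCs bind:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com             # the EBS CSI driver
volumeBindingMode: WaitForFirstConsumer  # create the volume in the pod's AZ
parameters:
  type: gp3
```

WaitForFirstConsumer matters on multi-AZ clusters: it delays volume creation until the pod schedules, so the EBS volume lands in the same availability zone.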
9. Node Not Ready
Likely causes: Kubelet crashed, container runtime issue, disk pressure, network plugin failure.
Fix: SSH to node (or SSM Session Manager). Check journalctl -u kubelet. Often it's disk full from log accumulation.
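Disk pressure is common enough that a quick check is worth scripting. A minimal sketch to run on the node (the 85% threshold is an assumption; kubelet's own eviction thresholds may differ):

```shell
# Flag any filesystem over 85% full -- a frequent cause of kubelet
# disk-pressure evictions and NotReady nodes.
THRESHOLD=85
df -P | awk -v t="$THRESHOLD" 'NR>1 { gsub("%","",$5); if ($5+0 > t) print $6, "is", $5"% full" }'
```

Pair it with journalctl -u kubelet -n 100 to see why the kubelet itself is unhappy.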
10. HPA not scaling
Likely causes: Metrics-server not installed, HPA targeting CPU but pod has no CPU requests, max replicas reached.
Fix: kubectl get hpa. If &lt;unknown&gt; appears in the TARGETS column, either metrics-server is broken or the target pods have no CPU requests for the utilization math to work against.
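A working reference point — an autoscaling/v2 HPA on CPU (names and numbers are examples; the target Deployment's containers must set resources.requests.cpu or the metric reads &lt;unknown&gt;):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # example target
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale when avg CPU exceeds 70% of requests
```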
How to use this playbook
When you hit a real incident, search for keywords from the symptom. Most day-to-day stuff is covered.
If you want to actually practice these in a safe environment, our K8s troubleshooting labs at IT Defined are exactly this — broken clusters with planted issues, fix them under time pressure.
Full 26 scenarios — including ConfigMap updates, Secret rotation, NetworkPolicy issues, PDB blocks, autoscaler problems, kube-proxy/CNI issues, Job failures, IRSA problems, webhook admission controllers, liveness probes, PV cleanup, and cluster upgrades — on itdefined.org.