Pod Troubleshooting - SRE’s Fast Lane

#sre #devops #monitoring #observability

When a pod fails in Kubernetes, every second counts.
SREs need to determine quickly whether the issue stems from a configuration error, a resource limit, or an application-level failure. The key is a fast, structured troubleshooting flow that reduces mean time to recovery (MTTR).

1. Start with Pod Status
• Run: kubectl get pods -n <namespace>
• Look for states: CrashLoopBackOff, OOMKilled, Pending, Evicted.
• Status gives the first hint: scheduling issue vs runtime failure.
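
The status-to-cause hint can be sketched as a small shell helper. This is only an illustration, not part of kubectl: the function name classify_status and the category labels are hypothetical, and it covers only the four states listed above.

```shell
# Hypothetical helper: map a pod STATUS string to a first-guess failure class.
# The labels are illustrative, not Kubernetes terminology.
classify_status() {
  case "$1" in
    Pending)                    echo "scheduling" ;;    # pod not yet running, often not placed on a node
    Evicted)                    echo "node-pressure" ;; # usually removed by the kubelet under resource pressure
    CrashLoopBackOff|OOMKilled) echo "runtime" ;;       # container started, then failed
    *)                          echo "unknown" ;;
  esac
}

# Usage against a live cluster (not executed here; replace the placeholder):
#   kubectl get pods -n <namespace> --no-headers |
#     while read -r name ready status rest; do
#       printf '%s\t%s\t%s\n' "$name" "$status" "$(classify_status "$status")"
#     done
```

Anything classified as "unknown" still needs the event and log checks in the next steps.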

2. Check Pod Events
• Run: kubectl describe pod <pod-name> -n <namespace>
• Look for: FailedScheduling, image pull failures (ErrImagePull, ImagePullBackOff), and readiness/liveness probe failures (reported as Unhealthy events).
• Events often pinpoint the root cause faster than logs.
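
When the describe output is long, the events for a single pod can also be listed on their own. The wrapper below is a minimal sketch: pod_events is a hypothetical name, and it assumes kubectl is already pointed at the right cluster.

```shell
# Hypothetical wrapper: list events for one pod, oldest first.
# $1 = namespace, $2 = pod name
pod_events() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$2" \
    --sort-by=.lastTimestamp
}

# Usage (not executed here): pod_events my-namespace my-pod
```

Sorting by .lastTimestamp puts the most recent event at the bottom, which is usually the one closest to the failure.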

3. Analyze Logs
• Run: kubectl logs <pod-name> -n <namespace>
• For previous container crashes:
kubectl logs <pod-name> -n <namespace> --previous
• Look for stack traces, memory errors, or connection issues.
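
A quick filter can surface those lines before reading the full log. This is a sketch under assumptions: scan_logs is a hypothetical name, and the pattern list is only a starting point that should be tuned to the application's language and log format.

```shell
# Hypothetical filter: print (with line numbers) log lines that look like
# stack traces, memory errors, or connection problems. Reads from stdin.
scan_logs() {
  grep -n -i -E 'exception|traceback|panic|out ?of ?memory|connection (refused|reset|timed out)'
}

# Usage against a live cluster (not executed here; replace the placeholders):
#   kubectl logs <pod-name> -n <namespace> --previous --tail=500 | scan_logs
```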

4. Correlate with Metrics
• Check Prometheus metrics for the pod:
o CPU usage → container_cpu_usage_seconds_total
o CPU throttling → container_cpu_cfs_throttled_periods_total
o Memory spikes → container_memory_working_set_bytes
• Correlating logs and events with metrics shows whether the failure is purely application-level or is caused by resource starvation.
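
To check for CPU starvation, one common approach is the share of CFS periods in which the container was throttled. The helper below only builds the PromQL string. It is a sketch under assumptions: throttle_ratio_query and PROM_URL are hypothetical names, the metrics are the cAdvisor ones exposed by the kubelet, and the namespace and pod label names depend on how Prometheus scrapes them.

```shell
# Hypothetical helper: build a PromQL query for the fraction of CFS periods
# in which the pod's containers were throttled over the last 5 minutes.
# $1 = namespace, $2 = pod name (label names are an assumption)
throttle_ratio_query() {
  sel="namespace=\"$1\",pod=\"$2\""
  printf 'sum(rate(container_cpu_cfs_throttled_periods_total{%s}[5m])) / sum(rate(container_cpu_cfs_periods_total{%s}[5m]))\n' "$sel" "$sel"
}

# Usage against a Prometheus server (not executed here; PROM_URL is assumed):
#   curl -sG "$PROM_URL/api/v1/query" \
#     --data-urlencode "query=$(throttle_ratio_query my-namespace my-pod)"
```

A ratio that stays well above zero while the application reports timeouts suggests the CPU limit, rather than the code, is the bottleneck.
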
Read More: https://kubeha.com/shift-left-security-in-kubernetes/
Follow KubeHA's LinkedIn page: https://lnkd.in/gV4Q2d4m
KubeHA's introduction: 👉 https://www.youtube.com/watch?v=PyzTQPLGaD0
