DEV Community

kubeha

Pod Troubleshooting - SRE’s Fast Lane

When a pod fails in Kubernetes, every second counts.
SREs need to quickly determine whether the issue stems from configuration errors, resource limits, or application-level failures. The key is to follow a fast, structured troubleshooting flow that reduces mean time to recovery (MTTR).


1. Start with Pod Status
• Run: kubectl get pods -n <namespace>
• Look for states: CrashLoopBackOff, OOMKilled, Pending, Evicted.
• Status gives the first hint: scheduling issue vs runtime failure.
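
A minimal sketch of how the status check above could be narrowed to the pods that need attention. It assumes the default table layout of kubectl get pods (NAME, READY, STATUS as the first three columns); the function name unhealthy_pods and the namespace demo are hypothetical.

```shell
# Keep the header plus any pod that looks unhealthy:
#   - STATUS is something other than Running or Completed, or
#   - STATUS is Running but not all containers are ready (e.g. 0/1).
# Reads `kubectl get pods` table output from stdin.
unhealthy_pods() {
  awk '
    NR == 1 { print; next }            # always keep the header row
    $3 == "Completed" { next }         # finished Jobs are not failures
    {
      split($2, ready, "/")            # READY looks like "1/2"
      if ($3 != "Running" || ready[1] != ready[2]) print
    }
  '
}

# Usage against a live cluster (namespace is a placeholder):
#   kubectl get pods -n demo | unhealthy_pods
```

Filtering on the STATUS and READY columns rather than on the pod phase matters here, because a pod in CrashLoopBackOff can still report the Running phase.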


2. Check Pod Events
• Run: kubectl describe pod <pod-name> -n <namespace>
• Look for: FailedScheduling, image pull failures (ImagePullBackOff), and readiness/liveness probe failures (reported with the reason Unhealthy).
• Events often pinpoint the root cause faster than logs.
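
As a rough illustration of the event check above, the sketch below tallies Warning events by reason so the dominant failure stands out. It assumes the default column layout of kubectl get events (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE); the function name warning_reasons, the namespace demo, and the pod name web-2 are hypothetical.

```shell
# Count Warning events per REASON, most frequent first.
# Reads `kubectl get events` table output from stdin.
warning_reasons() {
  awk '
    NR > 1 && $2 == "Warning" { count[$3]++ }   # skip header, keep warnings only
    END { for (r in count) print count[r], r }  # emit "<count> <reason>"
  ' | sort -rn
}

# Usage against a live cluster (names are placeholders):
#   kubectl get events -n demo \
#     --field-selector involvedObject.name=web-2 | warning_reasons
```

The --field-selector flag limits the listing to events for a single pod, which keeps the tally from being skewed by unrelated workloads in the same namespace.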


3. Analyze Logs
• Run: kubectl logs <pod-name> -n <namespace>
• For previous container crashes:
kubectl logs <pod-name> --previous -n <namespace>
• Look for stack traces, memory errors, or connection issues.
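
One possible way to speed up the log scan described above is to grep for common failure signatures. The pattern list below is only an illustrative assumption and should be tuned to the application; the function name scan_logs, the namespace demo, and the pod name web-2 are hypothetical.

```shell
# Print log lines (with line numbers) that match common failure signatures:
# stack traces, memory errors, and connection problems.
# Reads log text from stdin; matching is case-insensitive.
scan_logs() {
  grep -n -i -E 'exception|traceback|panic|out ?of ?memory|connection (refused|reset|timed out)'
}

# Usage against a live cluster (names are placeholders):
#   kubectl logs web-2 -n demo --previous --tail=200 | scan_logs
```

Using --tail keeps the scan focused on the lines written just before the crash rather than the whole log history.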


4. Correlate with Metrics
• Check Prometheus metrics for the pod:
o CPU usage and throttling → container_cpu_usage_seconds_total, container_cpu_cfs_throttled_periods_total
o Memory spikes → container_memory_working_set_bytes
• Correlating metrics with events and logs helps determine whether the failure is purely application-level or caused by resource starvation.
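
The metric checks above could be expressed as PromQL queries along the following lines. This is a sketch under several assumptions: the cluster exposes cAdvisor metrics with namespace and pod labels (older setups may use different label names), throttling is read from the cAdvisor counters container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total, and the 5m rate window is arbitrary. The function names, the PROM_URL variable, the namespace demo, and the pod name web-2 are hypothetical.

```shell
# Build a PromQL query for the fraction of CPU periods in which the pod
# was throttled. Arguments: <namespace> <pod-name>.
cpu_throttle_query() {
  printf 'sum(rate(container_cpu_cfs_throttled_periods_total{namespace="%s",pod="%s"}[5m])) / sum(rate(container_cpu_cfs_periods_total{namespace="%s",pod="%s"}[5m]))\n' \
    "$1" "$2" "$1" "$2"
}

# Build a PromQL query for the pod's CPU usage in cores.
cpu_usage_query() {
  printf 'sum(rate(container_cpu_usage_seconds_total{namespace="%s",pod="%s"}[5m]))\n' "$1" "$2"
}

# Build a PromQL query for the largest per-container working set in bytes.
memory_query() {
  printf 'max(container_memory_working_set_bytes{namespace="%s",pod="%s",container!=""})\n' "$1" "$2"
}

# Usage against a Prometheus server (PROM_URL is a placeholder):
#   curl -sG "$PROM_URL/api/v1/query" \
#     --data-urlencode "query=$(cpu_throttle_query demo web-2)"
```

A throttle ratio that stays high while CPU usage sits near the container limit points toward resource starvation, whereas a working set climbing toward the memory limit before a restart is consistent with an OOM kill.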
Read More: https://kubeha.com/shift-left-security-in-kubernetes/
Follow KubeHA LinkedIn Page: https://lnkd.in/gV4Q2d4m
KubeHA's introduction: 👉 https://www.youtube.com/watch?v=PyzTQPLGaD0
