Kubernetes Pods Stuck? Debug CrashLoopBackOff and Pending States Like a Pro
You've deployed your application to Kubernetes, checked the status, and... disaster. Your pod is either stuck in Pending or crashing endlessly in CrashLoopBackOff. The logs are cryptic, and you're not sure where to start.
Don't panic. These are the two most common Kubernetes headaches, and both have systematic solutions. In this guide, you'll learn exactly how to diagnose and fix each one, with real commands and scenarios you can use today.
Understanding the Two Problem States
Pending: "I Can't Find a Home"
When a pod is Pending, the Kubernetes scheduler hasn't been able to assign it to any node. The container hasn't started—it hasn't even been placed. The scheduler is saying, "I checked all nodes, and none of them work for this pod."
Common causes:
- Insufficient resources: Not enough CPU, memory, or storage
- Node selection constraints: Labels, affinities, or taints that don't match
- Persistent volume issues: PVCs that can't bind
- Resource quotas: Namespace limits preventing scheduling
CrashLoopBackOff: "I Keep Crashing on Startup"
When a pod is in CrashLoopBackOff, it has been scheduled to a node, but the container keeps crashing. Kubernetes restarts it, it crashes again, and eventually Kubernetes backs off with increasing delays between restarts.
Common causes:
- Application errors: Unhandled exceptions, missing config
- Health check failures: Liveness/readiness probes failing
- Missing dependencies: Database not ready, configmaps missing
- Resource limits: Container OOMKilled or CPU throttled
Let's debug each systematically.
Part 1: Debugging CrashLoopBackOff
Step 1: Check Pod Status
First, confirm the state:
kubectl get pods -n <namespace>
You'll see something like:
NAME                     READY   STATUS             RESTARTS   AGE
my-app-6f4d5c7b8-x9kmn   0/1     CrashLoopBackOff   5          10m
The RESTARTS count tells you how many times Kubernetes has tried to bring it back.
Step 2: Check Current Logs
If the container is running (between crashes), check current logs:
kubectl logs <pod-name> -n <namespace>
Step 3: Check Previous Container Logs
This is the key step. When a container crashes, the logs from the crashed instance are still available. Use the --previous flag:
kubectl logs <pod-name> -n <namespace> --previous
This shows you what happened right before the crash. Look for:
- Stack traces
- "Connection refused" or "timeout" errors
- "Permission denied" messages
- "Out of memory" indicators
Step 4: Check Pod Events
Events give you the Kubernetes-level perspective:
kubectl describe pod <pod-name> -n <namespace>
Scroll to the Events section:
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  5m                default-scheduler  Successfully assigned default/my-app to node-1
  Normal   Pulled     4m (x5 over 5m)   kubelet            Container image "my-app:latest" already present
  Normal   Created    4m (x5 over 5m)   kubelet            Created container app
  Normal   Started    4m (x5 over 5m)   kubelet            Started container app
  Warning  BackOff    1m (x10 over 5m)  kubelet            Back-off restarting failed container
The (x5 over 5m) means the container has been created 5 times in the last 5 minutes.
Step 5: Check Resource Limits
If your app crashes without clear errors, it might be getting OOMKilled:
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"
If you see OOMKilled, your container exceeded its memory limit:
Last State:   Terminated
  Reason:     OOMKilled
  Exit Code:  137
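Exit code 137 isn't arbitrary: when a process dies from a signal, the exit code is 128 plus the signal number. SIGKILL is signal 9, so 137 means the kernel forcibly killed the process, which for containers usually means the OOM killer. A quick shell sanity check:

```shell
# Exit codes above 128 mean "killed by a signal": code = 128 + signal number.
exit_code=137
signal=$((exit_code - 128))
echo "signal number: $signal"   # signal number: 9
kill -l "$signal"               # names the signal (SIGKILL)
```

The same arithmetic decodes other signal exits you'll see in pod states, e.g. 143 = 128 + 15 (SIGTERM, a graceful shutdown).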
Fix: Increase the memory limit in your deployment:
resources:
  limits:
    memory: "512Mi"
    cpu: "500m"
  requests:
    memory: "256Mi"
    cpu: "250m"
Step 6: Check Probes
If your app starts but keeps failing health checks, the liveness probe might be too aggressive:
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 livenessProbe
Common issues:
- initialDelaySeconds too short (app needs more startup time)
- Probe endpoint returning the wrong status code
- timeoutSeconds too short
Fix: Adjust probe settings:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30   # Give the app time to start
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3       # Allow 3 failures before restart
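If your app's startup time is long or unpredictable, inflating initialDelaySeconds gets brittle. A cleaner option (stable since Kubernetes 1.20) is a startupProbe: liveness and readiness checks are suspended until it succeeds. A sketch, assuming the same /health endpoint:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # allows up to 30 x 10s = 300s of startup time
```

Once the startup probe passes, the liveness probe takes over with its normal, tighter settings.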
Part 2: Debugging Pending Pods
Step 1: Check Pod Events
Your first stop is always kubectl describe:
kubectl describe pod <pod-name> -n <namespace>
Look at the Events section:
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  12s   default-scheduler  0/3 nodes are available: 3 Insufficient cpu.
The scheduler tells you exactly why it can't place the pod.
Step 2: Check Node Resources
If events mention insufficient CPU or memory:
kubectl top nodes
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   1800m        90%    14Gi            70%
node-2   1500m        75%    12Gi            60%
node-3   1900m        95%    15Gi            75%
If nodes are maxed out, your options are:
- Scale down unnecessary workloads
- Add nodes to the cluster
- Reduce pod resource requests
Check what your pod is requesting:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.requests}'
Step 3: Check Node Selectors and Affinities
If events mention a message like "node(s) didn't match Pod's node affinity/selector":
# Check your pod's constraints
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 "nodeSelector\|affinity"
# List available node labels
kubectl get nodes --show-labels
Common issue: Pod requires zone=us-east-1a but no node has that label.
Fix: Either add the label or remove the constraint:
# Add label to node
kubectl label node <node-name> zone=us-east-1a
# Or remove constraint from pod spec
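For reference, the constraint you'd be removing typically looks like this in the pod spec (the label key and value here are this example's, not required names):

```yaml
spec:
  nodeSelector:
    zone: us-east-1a   # pod only schedules on nodes carrying this exact label
```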
Step 4: Check PVC Binding
If your pod uses PersistentVolumeClaims:
kubectl get pvc -n <namespace>
NAME       STATUS    VOLUME   CAPACITY   STORAGECLASS   AGE
data-pvc   Pending                       standard       5m
A Pending PVC means no matching PersistentVolume exists.
Fix: Create a matching PV or ensure your StorageClass supports dynamic provisioning:
# Check available storage classes
kubectl get storageclass
# Describe the PVC to see why it's not binding
kubectl describe pvc <pvc-name> -n <namespace>
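With dynamic provisioning, a PVC that names a valid StorageClass gets its volume created on demand. A minimal sketch, assuming a class called standard exists in your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard   # must match a name from `kubectl get storageclass`
  resources:
    requests:
      storage: 10Gi
```

A typo in storageClassName is a common reason a PVC sits in Pending forever: the claim references a class that no provisioner serves.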
Step 5: Check Taints and Tolerations
Nodes with taints repel pods that don't have matching tolerations:
kubectl describe nodes | grep -A 3 Taints
Common taints:
- node.kubernetes.io/not-ready:NoSchedule (node is unhealthy)
- node.kubernetes.io/unschedulable:NoSchedule (cordoned node)
- dedicated=gpu:NoSchedule (reserved for specific workloads)
Fix: Add tolerations to your pod:
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
Or remove the taint:
kubectl taint nodes <node-name> dedicated:NoSchedule-
Real-World Troubleshooting Scenarios
Scenario 1: App Crashes on Startup with No Logs
Symptoms: Container exits immediately, --previous logs show nothing
Diagnosis: The container might be missing environment variables or config files.
# Check if ConfigMaps/Secrets are mounted
kubectl describe pod <pod-name> | grep -A 20 "Volumes"
# Check container command
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].command}'
Fix: Ensure all required ConfigMaps and Secrets exist:
kubectl get configmaps -n <namespace>
kubectl get secrets -n <namespace>
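If the app reads its configuration from environment variables, a missing ConfigMap reference is a classic cause of instant exits. A sketch of what such a reference looks like in the container spec (the names here are illustrative):

```yaml
containers:
  - name: app
    image: my-app:latest
    envFrom:
      - configMapRef:
          name: app-config   # if this ConfigMap is absent, the container can't even start
```

Note that a missing ConfigMap referenced this way typically surfaces as CreateContainerConfigError rather than CrashLoopBackOff, so check the STATUS column closely.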
Scenario 2: Pod Works in Dev, Stuck in Prod
Symptoms: Same manifest, different behavior
Diagnosis: Environment differences. Check:
# Compare node resources
kubectl top nodes
# Compare storage classes
kubectl get storageclass
# Compare resource quotas
kubectl get resourcequota -n <namespace>
Common causes:
- Prod has fewer nodes or different sizing
- StorageClass names differ
- ResourceQuota limits in prod namespace
Scenario 3: Intermittent Crashes
Symptoms: Pod runs for a while, then crashes
Diagnosis: Likely memory leak or resource exhaustion over time.
# Watch pod memory over time
kubectl top pod <pod-name> -n <namespace> --containers
# Check for OOMKilled in events
kubectl describe pod <pod-name> | grep -i oom
Fix: Add a memory limit so the leak is contained to the container (it restarts cleanly with OOMKilled instead of destabilizing the node), and keep monitoring usage with kubectl top pod:
resources:
  limits:
    memory: "1Gi"
Quick Reference: Diagnostic Commands
# Check pod status
kubectl get pods -n <namespace>
# Get detailed pod info
kubectl describe pod <pod-name> -n <namespace>
# Check current logs
kubectl logs <pod-name> -n <namespace>
# Check previous (crashed) container logs
kubectl logs <pod-name> -n <namespace> --previous
# Check all container logs (multi-container pod)
kubectl logs <pod-name> -n <namespace> --all-containers
# Check node resources
kubectl top nodes
# Check storage
kubectl get pvc,pv -n <namespace>
# Check events cluster-wide
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Check resource quotas
kubectl describe resourcequota -n <namespace>
# Check node labels and taints
kubectl describe nodes | grep -E "Labels:|Taints:" -A 1
FAQ
Q: How many restarts before Kubernetes gives up?
Kubernetes never fully gives up. It uses exponential backoff between restarts: 10s, 20s, 40s, 80s, and so on, capped at 5 minutes, after which it keeps retrying every 5 minutes indefinitely. (spec.activeDeadlineSeconds can impose a hard time limit on a pod, but it's mostly used with Jobs.)
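The delay sequence is easy to sketch: it doubles from 10 seconds and caps at 300 (matching the default kubelet behavior; exact timing adds some jitter in practice):

```shell
# Sketch of the CrashLoopBackOff delay sequence: doubles from 10s, caps at 300s.
delay=10
for restart in 1 2 3 4 5 6 7; do
  echo "restart $restart: wait ${delay}s"
  delay=$((delay * 2))
  [ "$delay" -gt 300 ] && delay=300
done
# restart 1: wait 10s ... restart 5: wait 160s, then 300s from restart 6 onward
```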
Q: Can I see what's happening inside the container?
If the container keeps crashing, use a debug container:
kubectl debug <pod-name> -n <namespace> -it --image=busybox --target=<container-name>
This attaches an ephemeral debug container (not a true sidecar) that, thanks to --target, shares the target container's process namespace, so you can inspect its processes from a working shell even while it keeps failing.
Q: How do I prevent pods from crashing repeatedly?
restartPolicy controls this, but only for bare Pods and Jobs:
spec:
  restartPolicy: OnFailure   # or Never
Pods managed by a Deployment must use Always, so there the real fix is resolving the crash itself (or scaling the Deployment to zero while you investigate).
Q: What if the pod is stuck in ImagePullBackOff?
This isn't a crash—it's an image pull failure:
kubectl describe pod <pod-name> | grep -A 5 "Events:"
Common causes:
- Wrong image name or tag
- Private registry credentials missing (imagePullSecrets)
- Network issues reaching the registry
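For private registries, the pod must reference a docker-registry secret. A sketch, assuming a secret named regcred already exists in the namespace (registry and image names here are placeholders):

```yaml
spec:
  imagePullSecrets:
    - name: regcred   # e.g. created with: kubectl create secret docker-registry regcred ...
  containers:
    - name: app
      image: registry.example.com/my-app:1.2.3
```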
Conclusion
Pods stuck in CrashLoopBackOff or Pending are frustrating, but they're always diagnosable with the right commands.
For CrashLoopBackOff:
- Check --previous logs for crash details
- Look for OOMKilled in events
- Verify liveness/readiness probes
- Check ConfigMaps and Secrets exist
For Pending:
- Check kubectl describe for scheduler messages
- Verify node resources are available
- Check PVC binding status
- Verify node selectors, affinities, and tolerations match
The key is systematic debugging: start with kubectl describe, read the events, and follow the breadcrumbs. Kubernetes tells you what's wrong—you just need to know where to look.
Happy debugging! 🚀