DEV Community

AttractivePenguin


Kubernetes Pods Stuck? Debug CrashLoopBackOff and Pending States Like a Pro


You've deployed your application to Kubernetes, checked the status, and... disaster. Your pod is either stuck in Pending or crashing endlessly in CrashLoopBackOff. The logs are cryptic, and you're not sure where to start.

Don't panic. These are the two most common Kubernetes headaches, and both have systematic solutions. In this guide, you'll learn exactly how to diagnose and fix each one, with real commands and scenarios you can use today.


Understanding the Two Problem States

Pending: "I Can't Find a Home"

When a pod is Pending, the Kubernetes scheduler hasn't been able to assign it to any node. The container hasn't started—it hasn't even been placed. The scheduler is saying, "I checked all nodes, and none of them work for this pod."

Common causes:

  • Insufficient resources: Not enough CPU, memory, or storage
  • Node selection constraints: Labels, affinities, or taints that don't match
  • Persistent volume issues: PVCs that can't bind
  • Resource quotas: Namespace limits preventing scheduling

CrashLoopBackOff: "I Keep Crashing on Startup"

When a pod is in CrashLoopBackOff, it has been scheduled to a node, but the container keeps crashing. Kubernetes restarts it, it crashes again, and eventually Kubernetes backs off with increasing delays between restarts.

Common causes:

  • Application errors: Unhandled exceptions, missing config
  • Health check failures: Liveness/readiness probes failing
  • Missing dependencies: Database not ready, configmaps missing
  • Resource limits: Container OOMKilled or CPU throttled

Let's debug each systematically.


Part 1: Debugging CrashLoopBackOff

Step 1: Check Pod Status

First, confirm the state:

kubectl get pods -n <namespace>

You'll see something like:

NAME                      READY   STATUS             RESTARTS   AGE
my-app-6f4d5c7b8-x9kmn    0/1     CrashLoopBackOff   5          10m

The RESTARTS count tells you how many times Kubernetes has tried to bring it back.

Step 2: Check Current Logs

If the container is running (between crashes), check current logs:

kubectl logs <pod-name> -n <namespace>

Step 3: Check Previous Container Logs

This is the key step. When a container crashes, the logs from the crashed instance are still available. Use the --previous flag:

kubectl logs <pod-name> -n <namespace> --previous

This shows you what happened right before the crash. Look for:

  • Stack traces
  • "Connection refused" or "timeout" errors
  • "Permission denied" messages
  • "Out of memory" indicators

Step 4: Check Pod Events

Events give you the Kubernetes-level perspective:

kubectl describe pod <pod-name> -n <namespace>

Scroll to the Events section:

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  5m                 default-scheduler  Successfully assigned default/my-app to node-1
  Normal   Pulled     4m (x5 over 5m)    kubelet            Container image "my-app:latest" already present
  Normal   Created    4m (x5 over 5m)    kubelet            Created container app
  Normal   Started    4m (x5 over 5m)    kubelet            Started container app
  Warning  BackOff    1m (x10 over 5m)   kubelet            Back-off restarting failed container

The (x5 over 5m) means the container has been created 5 times in the last 5 minutes.

Step 5: Check Resource Limits

If your app crashes without clear errors, it might be getting OOMKilled:

kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"

If you see OOMKilled, your container exceeded its memory limit:

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
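Exit code 137 isn't arbitrary: exit codes above 128 mean the process was killed by a signal, where code = 128 + signal number. A quick sanity check in shell:

```shell
# Exit codes above 128 encode "killed by signal": code = 128 + signal number.
# 137 - 128 = 9, and signal 9 is SIGKILL, which the kernel's OOM killer sends.
signal=$(( 137 - 128 ))
echo "exit code 137 = 128 + signal $signal (SIGKILL)"
```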

Fix: Increase the memory limit in your deployment:

resources:
  limits:
    memory: "512Mi"
    cpu: "500m"
  requests:
    memory: "256Mi"
    cpu: "250m"

Step 6: Check Probes

If your app starts but keeps failing health checks, the liveness probe might be too aggressive:

kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 livenessProbe

Common issues:

  • initialDelaySeconds too short (app needs more startup time)
  • Probe endpoint returning wrong status
  • Timeout too short

Fix: Adjust probe settings:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30    # Give app time to start
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3        # Allow 3 failures before restart
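If the app is simply slow to start, a startupProbe is often a cleaner fix than a long initialDelaySeconds: it holds off the liveness probe until the app first responds. A sketch, assuming the same /health endpoint and port 8080 as above (tune the numbers to your app):

```yaml
startupProbe:
  httpGet:
    path: /health        # assumed endpoint, same as the liveness probe above
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # up to 30 x 10s = 300s allowed for first startup
```

While the startupProbe is running, liveness and readiness probes are suspended, so a slow cold start no longer triggers restarts.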

Part 2: Debugging Pending Pods

Step 1: Check Pod Events

Your first stop is always kubectl describe:

kubectl describe pod <pod-name> -n <namespace>

Look at the Events section:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  12s   default-scheduler  0/3 nodes are available: 3 Insufficient cpu.

The scheduler tells you exactly why it can't place the pod.

Step 2: Check Node Resources

If events mention insufficient CPU or memory:

kubectl top nodes
NAME       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1     1800m        90%    14Gi            70%
node-2     1500m        75%    12Gi            60%
node-3     1900m        95%    15Gi            75%

If nodes are maxed out, your options are:

  1. Scale down unnecessary workloads
  2. Add nodes to the cluster
  3. Reduce pod resource requests

Check what your pod is requesting:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.requests}'

Step 3: Check Node Selectors and Affinities

If events mention a selector or affinity mismatch (typically "node(s) didn't match Pod's node affinity/selector"):

# Check your pod's constraints
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 "nodeSelector\|affinity"

# List available node labels
kubectl get nodes --show-labels

Common issue: Pod requires zone=us-east-1a but no node has that label.

Fix: Either add the label or remove the constraint:

# Add label to node
kubectl label node <node-name> zone=us-east-1a

# Or remove constraint from pod spec
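For reference, the constraint usually looks like this in the pod spec (hypothetical label matching the example above); deleting these lines removes the requirement:

```yaml
spec:
  nodeSelector:
    zone: us-east-1a   # pod only schedules onto nodes carrying this exact label
```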

Step 4: Check PVC Binding

If your pod uses PersistentVolumeClaims:

kubectl get pvc -n <namespace>
NAME        STATUS   VOLUME         CAPACITY   STORAGECLASS   AGE
data-pvc    Pending                           standard       5m

A PVC stuck in Pending means no existing PersistentVolume matches it and dynamic provisioning hasn't created one.

Fix: Create a matching PV or ensure your StorageClass supports dynamic provisioning:

# Check available storage classes
kubectl get storageclass

# Describe the PVC to see why it's not binding
kubectl describe pvc <pvc-name> -n <namespace>
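On clusters without dynamic provisioning, you can create a PV by hand. A minimal sketch with hypothetical values (the name, size, and hostPath are placeholders; capacity and storageClassName must satisfy the PVC):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-pv              # hypothetical name
spec:
  capacity:
    storage: 1Gi             # must be >= the PVC's requested size
  accessModes:
    - ReadWriteOnce
  storageClassName: standard # must match the PVC's storage class
  hostPath:
    path: /mnt/data          # hostPath is suitable for single-node test clusters only
```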

Step 5: Check Taints and Tolerations

Nodes with taints repel pods that don't have matching tolerations:

kubectl describe nodes | grep -A 3 Taints

Common taints:

  • node.kubernetes.io/not-ready:NoSchedule — node is unhealthy
  • node.kubernetes.io/unschedulable:NoSchedule — cordoned node
  • dedicated=gpu:NoSchedule — reserved for specific workloads

Fix: Add tolerations to your pod:

tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
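To tolerate the taint regardless of its value, the Exists operator skips the value match entirely (same hypothetical "dedicated" key as above):

```yaml
tolerations:
- key: "dedicated"
  operator: "Exists"   # matches any value of the "dedicated" taint key
  effect: "NoSchedule"
```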

Or remove the taint:

kubectl taint nodes <node-name> dedicated:NoSchedule-

Real-World Troubleshooting Scenarios

Scenario 1: App Crashes on Startup with No Logs

Symptoms: Container exits immediately, --previous logs show nothing

Diagnosis: The container might be missing environment variables or config files.

# Check if ConfigMaps/Secrets are mounted
kubectl describe pod <pod-name> | grep -A 20 "Volumes"

# Check container command
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].command}'

Fix: Ensure all required ConfigMaps and Secrets exist:

kubectl get configmaps -n <namespace>
kubectl get secrets -n <namespace>

Scenario 2: Pod Works in Dev, Stuck in Prod

Symptoms: Same manifest, different behavior

Diagnosis: Environment differences. Check:

# Compare node resources
kubectl top nodes

# Compare storage classes
kubectl get storageclass

# Compare resource quotas
kubectl get resourcequota -n <namespace>

Common causes:

  • Prod has fewer nodes or different sizing
  • StorageClass names differ
  • ResourceQuota limits in prod namespace

Scenario 3: Intermittent Crashes

Symptoms: Pod runs for a while, then crashes

Diagnosis: Likely memory leak or resource exhaustion over time.

# Watch pod memory over time
kubectl top pod <pod-name> -n <namespace> --containers

# Check for OOMKilled in events
kubectl describe pod <pod-name> | grep -i oom

Fix: Set an explicit memory limit so a leaking container is killed and restarted predictably instead of starving the node, and keep watching usage with kubectl top pod:

resources:
  limits:
    memory: "1Gi"

Quick Reference: Diagnostic Commands

# Check pod status
kubectl get pods -n <namespace>

# Get detailed pod info
kubectl describe pod <pod-name> -n <namespace>

# Check current logs
kubectl logs <pod-name> -n <namespace>

# Check previous (crashed) container logs
kubectl logs <pod-name> -n <namespace> --previous

# Check all container logs (multi-container pod)
kubectl logs <pod-name> -n <namespace> --all-containers

# Check node resources
kubectl top nodes

# Check storage
kubectl get pvc,pv -n <namespace>

# Check events cluster-wide
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Check resource quotas
kubectl describe resourcequota -n <namespace>

# Check node labels and taints
kubectl describe nodes | grep -E "Labels:|Taints:" -A 1

FAQ

Q: How many restarts before Kubernetes gives up?

Kubernetes uses exponential backoff: 10s, 20s, 40s, 80s, and so on, capped at 5 minutes. The backoff resets once a container runs for 10 minutes without crashing. Kubernetes keeps trying indefinitely; for bare Pods and Jobs you can bound total runtime with spec.activeDeadlineSeconds.
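The schedule can be sketched in shell. This is illustrative only (the real timing lives in the kubelet), but it shows how quickly the delay hits the 5-minute cap:

```shell
# Illustrative sketch of the kubelet's crash-restart back-off:
# the delay doubles after each crash, capped at 300 seconds.
delay=10
for attempt in 1 2 3 4 5 6; do
  echo "restart $attempt: wait ${delay}s"
  delay=$(( delay * 2 ))
  [ "$delay" -gt 300 ] && delay=300
done
```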

Q: Can I see what's happening inside the container?

If the container keeps crashing, use a debug container:

kubectl debug <pod-name> -n <namespace> -it --image=busybox --target=<container-name>

This injects an ephemeral debug container into the running pod; with --target it shares the target container's process namespace, so you can inspect its processes and environment.

Q: How do I prevent pods from crashing repeatedly?

You can't disable restarts for Deployment-managed pods, but for bare Pods and Jobs you can set a restartPolicy:

spec:
  restartPolicy: OnFailure  # or Never for Jobs

For Deployments, restartPolicy must be Always, so crashing pods restart indefinitely; the real fix is resolving the underlying crash.

Q: What if the pod is stuck in ImagePullBackOff?

This isn't a crash—it's an image pull failure:

kubectl describe pod <pod-name> | grep -A 5 "Events:"

Common causes:

  • Wrong image name or tag
  • Private registry credentials missing (imagePullSecrets)
  • Network issues reaching registry

Conclusion

Pods stuck in CrashLoopBackOff or Pending are frustrating, but they're always diagnosable with the right commands.

For CrashLoopBackOff:

  1. Check --previous logs for crash details
  2. Look for OOMKilled in events
  3. Verify liveness/readiness probes
  4. Check ConfigMaps and Secrets exist

For Pending:

  1. Check kubectl describe for scheduler messages
  2. Verify node resources are available
  3. Check PVC binding status
  4. Verify node selectors, affinities, and tolerations match

The key is systematic debugging: start with kubectl describe, read the events, and follow the breadcrumbs. Kubernetes tells you what's wrong—you just need to know where to look.

Happy debugging! 🚀
