Kubernetes Pods Stuck? Debug CrashLoopBackOff and Pending States Like a Pro
You've deployed your application to Kubernetes, checked the status, and... disaster. Your pod is either stuck in Pending or crashing endlessly in CrashLoopBackOff. The logs are cryptic, and you're not sure where to start.
Don't panic. These are the two most common Kubernetes headaches, and both have systematic solutions. In this guide, you'll learn exactly how to diagnose and fix each one, with real commands and scenarios you can use today.
Understanding the Two Problem States
Pending: "I Can't Find a Home"
When a pod is Pending, the Kubernetes scheduler hasn't been able to assign it to any node. The container hasn't started—it hasn't even been placed. The scheduler is saying, "I checked all nodes, and none of them work for this pod."
Common causes:
- Insufficient resources: Not enough CPU, memory, or storage
- Node selection constraints: Labels, affinities, or taints that don't match
- Persistent volume issues: PVCs that can't bind
- Resource quotas: Namespace limits preventing scheduling
CrashLoopBackOff: "I Keep Crashing on Startup"
When a pod is in CrashLoopBackOff, it has been scheduled to a node, but the container keeps crashing. Kubernetes restarts it, it crashes again, and eventually Kubernetes backs off with increasing delays between restarts.
Common causes:
- Application errors: Unhandled exceptions, missing config
- Health check failures: Liveness/readiness probes failing
- Missing dependencies: Database not ready, configmaps missing
- Resource limits: Container OOMKilled or CPU throttled
Let's debug each systematically.
Part 1: Debugging CrashLoopBackOff
Step 1: Check Pod Status
First, confirm the state:
kubectl get pods -n <namespace>
You'll see something like:
NAME                     READY   STATUS             RESTARTS   AGE
my-app-6f4d5c7b8-x9kmn   0/1     CrashLoopBackOff   5          10m
The RESTARTS count tells you how many times Kubernetes has tried to bring it back.
Step 2: Check Current Logs
If the container is running (between crashes), check current logs:
kubectl logs <pod-name> -n <namespace>
Step 3: Check Previous Container Logs
This is the key step. When a container crashes, the logs from the crashed instance are still available. Use the --previous flag:
kubectl logs <pod-name> -n <namespace> --previous
This shows you what happened right before the crash. Look for:
- Stack traces
- "Connection refused" or "timeout" errors
- "Permission denied" messages
- "Out of memory" indicators
Step 4: Check Pod Events
Events give you the Kubernetes-level perspective:
kubectl describe pod <pod-name> -n <namespace>
Scroll to the Events section:
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  5m                default-scheduler  Successfully assigned default/my-app to node-1
  Normal   Pulled     4m (x5 over 5m)   kubelet            Container image "my-app:latest" already present
  Normal   Created    4m (x5 over 5m)   kubelet            Created container app
  Normal   Started    4m (x5 over 5m)   kubelet            Started container app
  Warning  BackOff    1m (x10 over 5m)  kubelet            Back-off restarting failed container
The (x5 over 5m) means the container has been created 5 times in the last 5 minutes.
Step 5: Check Resource Limits
If your app crashes without clear errors, it might be getting OOMKilled:
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"
If you see OOMKilled, your container exceeded its memory limit:
Last State:   Terminated
  Reason:     OOMKilled
  Exit Code:  137
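Exit code 137 isn't arbitrary: when a process dies from a signal, the exit code is 128 plus the signal number. SIGKILL is signal 9, so 137 means the kernel forcibly killed the process, which for containers usually means the OOM killer. A quick shell sanity check:

```shell
# Exit codes above 128 mean "killed by a signal": code = 128 + signal number.
exit_code=137
signal=$((exit_code - 128))
echo "signal number: $signal"   # signal number: 9
kill -l "$signal"               # names the signal (SIGKILL)
```

The same arithmetic decodes other signal exits you'll see in pod states, e.g. 143 = 128 + 15 (SIGTERM, a graceful shutdown).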
Fix: Increase the memory limit in your deployment:
resources:
  limits:
    memory: "512Mi"
    cpu: "500m"
  requests:
    memory: "256Mi"
    cpu: "250m"
Step 6: Check Probes
If your app starts but keeps failing health checks, the liveness probe might be too aggressive:
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 livenessProbe
Common issues:
- initialDelaySeconds too short (app needs more startup time)
- Probe endpoint returning the wrong status code
- timeoutSeconds too short
Fix: Adjust probe settings:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30   # Give the app time to start
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3       # Allow 3 failures before restart
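If your app's startup time is long or unpredictable, inflating initialDelaySeconds gets brittle. A cleaner option (stable since Kubernetes 1.20) is a startupProbe: liveness and readiness checks are suspended until it succeeds. A sketch, assuming the same /health endpoint:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # allows up to 30 x 10s = 300s of startup time
```

Once the startup probe passes, the liveness probe takes over with its normal, tighter settings.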
Part 2: Debugging Pending Pods
Step 1: Check Pod Events
Your first stop is always kubectl describe:
kubectl describe pod <pod-name> -n <namespace>
Look at the Events section:
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  12s   default-scheduler  0/3 nodes are available: 3 Insufficient cpu.
The scheduler tells you exactly why it can't place the pod.
Step 2: Check Node Resources
If events mention insufficient CPU or memory:
kubectl top nodes
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   1800m        90%    14Gi            70%
node-2   1500m        75%    12Gi            60%
node-3   1900m        95%    15Gi            75%
If nodes are maxed out, your options are:
- Scale down unnecessary workloads
- Add nodes to the cluster
- Reduce pod resource requests
Check what your pod is requesting:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.requests}'
Step 3: Check Node Selectors and Affinities
If events mention a message like "node(s) didn't match Pod's node affinity/selector":
# Check your pod's constraints
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 "nodeSelector\|affinity"
# List available node labels
kubectl get nodes --show-labels
Common issue: Pod requires zone=us-east-1a but no node has that label.
Fix: Either add the label or remove the constraint:
# Add label to node
kubectl label node <node-name> zone=us-east-1a
# Or remove constraint from pod spec
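For reference, the constraint you'd be removing typically looks like this in the pod spec (the label key and value here are this example's, not required names):

```yaml
spec:
  nodeSelector:
    zone: us-east-1a   # pod only schedules on nodes carrying this exact label
```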
Step 4: Check PVC Binding
If your pod uses PersistentVolumeClaims:
kubectl get pvc -n <namespace>
NAME       STATUS    VOLUME   CAPACITY   STORAGECLASS   AGE
data-pvc   Pending                       standard       5m
A Pending PVC means no matching PersistentVolume exists.
Fix: Create a matching PV or ensure your StorageClass supports dynamic provisioning:
# Check available storage classes
kubectl get storageclass
# Describe the PVC to see why it's not binding
kubectl describe pvc <pvc-name> -n <namespace>
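With dynamic provisioning, a PVC that names a valid StorageClass gets its volume created on demand. A minimal sketch, assuming a class called standard exists in your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard   # must match a name from `kubectl get storageclass`
  resources:
    requests:
      storage: 10Gi
```

A typo in storageClassName is a common reason a PVC sits in Pending forever: the claim references a class that no provisioner serves.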
Step 5: Check Taints and Tolerations
Nodes with taints repel pods that don't have matching tolerations:
kubectl describe nodes | grep -A 3 Taints
Common taints:
- node.kubernetes.io/not-ready:NoSchedule (node is unhealthy)
- node.kubernetes.io/unschedulable:NoSchedule (cordoned node)
- dedicated=gpu:NoSchedule (reserved for specific workloads)
Fix: Add tolerations to your pod:
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
Or remove the taint:
kubectl taint nodes <node-name> dedicated:NoSchedule-
Real-World Troubleshooting Scenarios
Scenario 1: App Crashes on Startup with No Logs
Symptoms: Container exits immediately, --previous logs show nothing
Diagnosis: The container might be missing environment variables or config files.
# Check if ConfigMaps/Secrets are mounted
kubectl describe pod <pod-name> | grep -A 20 "Volumes"
# Check container command
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].command}'
Fix: Ensure all required ConfigMaps and Secrets exist:
kubectl get configmaps -n <namespace>
kubectl get secrets -n <namespace>
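If the app reads its configuration from environment variables, a missing ConfigMap reference is a classic cause of instant exits. A sketch of what such a reference looks like in the container spec (the names here are illustrative):

```yaml
containers:
  - name: app
    image: my-app:latest
    envFrom:
      - configMapRef:
          name: app-config   # if this ConfigMap is absent, the container can't even start
```

Note that a missing ConfigMap referenced this way typically surfaces as CreateContainerConfigError rather than CrashLoopBackOff, so check the STATUS column closely.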
Scenario 2: Pod Works in Dev, Stuck in Prod
Symptoms: Same manifest, different behavior
Diagnosis: Environment differences. Check:
# Compare node resources
kubectl top nodes
# Compare storage classes
kubectl get storageclass
# Compare resource quotas
kubectl get resourcequota -n <namespace>
Common causes:
- Prod has fewer nodes or different sizing
- StorageClass names differ
- ResourceQuota limits in prod namespace
Scenario 3: Intermittent Crashes
Symptoms: Pod runs for a while, then crashes
Diagnosis: Likely memory leak or resource exhaustion over time.
# Watch pod memory over time
kubectl top pod <pod-name> -n <namespace> --containers
# Check for OOMKilled in events
kubectl describe pod <pod-name> | grep -i oom
Fix: Add a memory limit so the leak is contained to the container (it restarts cleanly with OOMKilled instead of destabilizing the node), and keep monitoring usage with kubectl top pod:
resources:
  limits:
    memory: "1Gi"
Quick Reference: Diagnostic Commands
# Check pod status
kubectl get pods -n <namespace>
# Get detailed pod info
kubectl describe pod <pod-name> -n <namespace>
# Check current logs
kubectl logs <pod-name> -n <namespace>
# Check previous (crashed) container logs
kubectl logs <pod-name> -n <namespace> --previous
# Check all container logs (multi-container pod)
kubectl logs <pod-name> -n <namespace> --all-containers
# Check node resources
kubectl top nodes
# Check storage
kubectl get pvc,pv -n <namespace>
# Check events cluster-wide
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Check resource quotas
kubectl describe resourcequota -n <namespace>
# Check node labels and taints
kubectl describe nodes | grep -E "Labels:|Taints:" -A 1
FAQ
Q: How many restarts before Kubernetes gives up?
Kubernetes never fully gives up. It uses exponential backoff between restarts: 10s, 20s, 40s, 80s, and so on, capped at 5 minutes, after which it keeps retrying every 5 minutes indefinitely. (spec.activeDeadlineSeconds can impose a hard time limit on a pod, but it's mostly used with Jobs.)
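The delay sequence is easy to sketch: it doubles from 10 seconds and caps at 300 (matching the default kubelet behavior; exact timing adds some jitter in practice):

```shell
# Sketch of the CrashLoopBackOff delay sequence: doubles from 10s, caps at 300s.
delay=10
for restart in 1 2 3 4 5 6 7; do
  echo "restart $restart: wait ${delay}s"
  delay=$((delay * 2))
  [ "$delay" -gt 300 ] && delay=300
done
# restart 1: wait 10s ... restart 5: wait 160s, then 300s from restart 6 onward
```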
Q: Can I see what's happening inside the container?
If the container keeps crashing, use a debug container:
kubectl debug <pod-name> -n <namespace> -it --image=busybox --target=<container-name>
This attaches an ephemeral debug container (not a true sidecar) that, thanks to --target, shares the target container's process namespace, so you can inspect its processes from a working shell even while it keeps failing.
Q: How do I prevent pods from crashing repeatedly?
restartPolicy controls this, but only for bare Pods and Jobs:
spec:
  restartPolicy: OnFailure   # or Never
Pods managed by a Deployment must use Always, so there the real fix is resolving the crash itself (or scaling the Deployment to zero while you investigate).
Q: What if the pod is stuck in ImagePullBackOff?
This isn't a crash—it's an image pull failure:
kubectl describe pod <pod-name> | grep -A 5 "Events:"
Common causes:
- Wrong image name or tag
- Private registry credentials missing (imagePullSecrets)
- Network issues reaching the registry
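For private registries, the pod must reference a docker-registry secret. A sketch, assuming a secret named regcred already exists in the namespace (registry and image names here are placeholders):

```yaml
spec:
  imagePullSecrets:
    - name: regcred   # e.g. created with: kubectl create secret docker-registry regcred ...
  containers:
    - name: app
      image: registry.example.com/my-app:1.2.3
```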
Conclusion
Pods stuck in CrashLoopBackOff or Pending are frustrating, but they're always diagnosable with the right commands.
For CrashLoopBackOff:
- Check --previous logs for crash details
- Look for OOMKilled in events
- Verify liveness/readiness probes
- Check ConfigMaps and Secrets exist
For Pending:
- Check kubectl describe for scheduler messages
- Verify node resources are available
- Check PVC binding status
- Verify node selectors, affinities, and tolerations match
The key is systematic debugging: start with kubectl describe, read the events, and follow the breadcrumbs. Kubernetes tells you what's wrong—you just need to know where to look.
Happy debugging! 🚀