inboryn

How I Debug Kubernetes Pods in Production (Without Breaking Things)

3 AM. Slack notification. Production pod is down. CrashLoopBackOff.

Your hands are shaking. You know one wrong kubectl command could take down the entire service. You need to debug fast—but safely.

This is the exact 5-step process I use. It’s saved me dozens of times when production is burning and I can’t afford to experiment.

The 5-Command Debug Process
These commands are ordered by safety and speed. Always start at #1. Only move to the next step if you haven’t found the problem.

Step 1: Get the Lay of the Land
kubectl get pods -o wide
What you’re looking for:

Pod status: Running, Pending, CrashLoopBackOff, Error?
Restart count: High number = crash loop
Node location: Is the problem node-specific?
Age: Recent creation = deployment issue?
Example output:

NAME                   READY   STATUS             RESTARTS   AGE   NODE
api-7d4f8c9b6f-j8m2k   0/1     CrashLoopBackOff   12         15m   node-1
api-7d4f8c9b6f-p9x4n   1/1     Running            0          15m   node-2
What this tells you:

One pod is crashing (12 restarts in 15 minutes)
The other pod is fine
Problem might be node-specific OR configuration-specific
Red flags:

ImagePullBackOff = Wrong image name or registry auth
CrashLoopBackOff = Application crash on startup
Pending = Resource constraints or scheduling issues
High RESTARTS = Repeated crashes
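A quick way to triage this listing is to filter it down to unhealthy pods. The sketch below runs the filter against the sample output from above so it's self-contained; on a live cluster you'd pipe `kubectl get pods -o wide` straight into the same `awk`:

```shell
# Keep the header row plus any pod whose STATUS column isn't "Running".
# Live usage: kubectl get pods -o wide | awk 'NR==1 || $3 != "Running"'
sample='NAME READY STATUS RESTARTS AGE NODE
api-7d4f8c9b6f-j8m2k 0/1 CrashLoopBackOff 12 15m node-1
api-7d4f8c9b6f-p9x4n 1/1 Running 0 15m node-2'

echo "$sample" | awk 'NR==1 || $3 != "Running"'
```

This prints only the header and the crashing pod, which matters when you have fifty healthy replicas drowning out the one that's broken.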
Step 2: Read the Kubernetes Detective Report
kubectl describe pod <pod-name>
This command is gold. It shows you:

Recent events (last 10 minutes)
Container states
Resource limits
Volume mounts
Node assignments
What to look for in the Events section:

Bad:

Events:
  Type     Reason   Age               Message
  ----     ------   ---               -------
  Warning  BackOff  2m (x10 over 5m)  Back-off restarting failed container
  Warning  Failed   1m (x8 over 4m)   Error: failed to create containerd task
This tells you: Container is failing during startup. You need to check logs (Step 3).

Common problems you’ll spot here:

Liveness probe failing
Message: “Liveness probe failed: HTTP probe failed”
Fix: Check your /health endpoint
OOMKilled
Message: “Container killed: OOMKilled”
Fix: Increase memory limits
Image pull errors
Message: “Failed to pull image: unauthorized”
Fix: Check imagePullSecrets
Volume mount failures
Message: “Unable to attach or mount volumes”
Fix: Check PVC and storage class
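Since `kubectl describe pod` output runs long, it helps to jump straight to the Events section. A minimal sketch, demonstrated on a canned describe excerpt so it runs anywhere; in practice you'd pipe the real `kubectl describe pod <pod-name>` output in:

```shell
# Print everything from the "Events:" line to the end of the output.
# Live usage: kubectl describe pod <pod-name> | sed -n '/^Events:/,$p'
describe='Name: api-7d4f8c9b6f-j8m2k
Namespace: default
Events:
  Warning  BackOff  2m (x10 over 5m)  Back-off restarting failed container'

echo "$describe" | sed -n '/^Events:/,$p'
```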
Step 3: Get the Crash Logs (Before They Disappear)
kubectl logs <pod-name> --previous
The --previous flag is critical. It shows logs from the crashed container BEFORE the restart.

Without --previous, you only see logs from the current (potentially empty) container.

Real example:

Wrong command:

$ kubectl logs api-7d4f8c9b6f-j8m2k
Error from server (BadRequest): container "api" in pod "api-7d4f8c9b6f-j8m2k" is waiting to start: CrashLoopBackOff
Right command:

$ kubectl logs api-7d4f8c9b6f-j8m2k --previous
Traceback (most recent call last):
File "/app/main.py", line 3, in <module>
import psycopg2
ModuleNotFoundError: No module named 'psycopg2'
Boom. Found it. Missing Python dependency.

Pro tip: For multi-container pods:

kubectl logs <pod-name> -c <container-name> --previous
Common log patterns:

Connection refused
Error: connect ECONNREFUSED database:5432
Database service isn’t ready
Check service name and port
Environment variable missing
KeyError: 'DATABASE_URL'
ConfigMap or Secret not mounted
Check env definitions in deployment
Permission denied
PermissionError: [Errno 13] Permission denied: '/data/file.txt'
Volume permissions issue
Check securityContext and fsGroup
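All of these patterns are greppable in a single pass. A sketch that scans a crash log for the fatal signatures listed above; the log text here is illustrative, and on a live cluster you'd feed in `kubectl logs <pod-name> --previous` instead:

```shell
# Flag the usual suspects from a crash log in one pass.
# Live usage: kubectl logs <pod-name> --previous | grep -E "$patterns"
patterns='ECONNREFUSED|KeyError|ModuleNotFoundError|Permission denied|OOM'
log='Starting app...
Error: connect ECONNREFUSED database:5432'

echo "$log" | grep -E "$patterns"
```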
Step 4: Step Inside the Running Container (Use With Caution)
kubectl exec -it <pod-name> -- /bin/sh
When to use this:

Check environment variables
Test network connectivity
Verify file permissions
Debug configuration files
Safety rules:

NEVER run this on the only healthy pod
Use read-only commands
Don’t modify files in production
Exit as soon as you find the issue
Common debugging commands inside the pod:

Check environment variables

env | grep DATABASE

Test database connection

ping database-service
nc -zv database-service 5432

Check file permissions

ls -la /app/config

Test HTTP endpoint

wget -O- http://localhost:8080/health
curl http://localhost:8080/health

Check mounted secrets/configmaps

cat /etc/config/app.conf
For distroless images (no shell):

In 2026, many images don’t include shells. Use kubectl debug instead:

kubectl debug -it <pod-name> --image=busybox --target=<container-name>
This attaches a debug container with a shell to your running pod without disrupting it.

Step 5: Check Cluster Events (The Big Picture)
kubectl get events --sort-by='.lastTimestamp' | tail -20
This shows cluster-wide events. Useful when:

Multiple pods are failing
Node issues are suspected
Scheduling problems
Example output:

5m Warning FailedScheduling pod/api 0/3 nodes available: insufficient memory
3m Warning Evicted pod/api Pod ephemeral local storage usage exceeds limit
Common cluster-level problems:

Node pressure – CPU/Memory/Disk full
Network policies blocking traffic
ResourceQuota exceeded
Storage class issues
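When a lot of events scroll past, a rough tally by reason shows where the pain is concentrated. A sketch over sample event lines; in real use you'd pipe `kubectl get events --sort-by='.lastTimestamp'` into the same pipeline:

```shell
# Tally events by their Reason column (field 3 in this layout).
# Live usage: kubectl get events | awk '{print $3}' | sort | uniq -c | sort -rn
events='5m Warning FailedScheduling pod/api 0/3 nodes available: insufficient memory
3m Warning Evicted pod/api Pod ephemeral local storage usage exceeds limit
3m Warning Evicted pod/worker Pod ephemeral local storage usage exceeds limit'

echo "$events" | awk '{print $3}' | sort | uniq -c | sort -rn
```

If one reason dominates (say, a dozen `Evicted` events), you've found the cluster-level problem before reading a single pod description.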
Real Scenario Walkthrough: Fixing CrashLoopBackOff
Let’s put it all together with a real production scenario.

Problem: API pods are in CrashLoopBackOff

Step 1: Get the status

$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE
api-7d4f8c9b6f-j8m2k 0/1 CrashLoopBackOff 8 12m
Analysis: Pod has restarted 8 times in 12 minutes. Classic crash loop.

Step 2: Describe the pod

$ kubectl describe pod api-7d4f8c9b6f-j8m2k
...
Events:
Warning BackOff 5m (x15 over 10m) kubelet Back-off restarting failed container
Analysis: Container keeps failing. Need to check logs.

Step 3: Get the crash logs

$ kubectl logs api-7d4f8c9b6f-j8m2k --previous
Error: ECONNREFUSED 10.96.0.10:5432
    at TCPConnectWrap.afterConnect [as oncomplete] {
  errno: 'ECONNREFUSED',
  syscall: 'connect',
  address: '10.96.0.10',
  port: 5432
}
Found it! Can’t connect to database.

Step 4: Verify the issue

$ kubectl get services
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)
database   ClusterIP   10.96.0.12   <none>        5432/TCP
Analysis: Service IP is 10.96.0.12, but app is trying to connect to 10.96.0.10.

Root cause: Hardcoded old database IP in deployment.

Fix:
Update deployment to use service name instead:

env:
  - name: DATABASE_HOST
    value: "database"  # Use service name, not IP

Result: Pods start successfully. Crisis averted.

Common Mistakes (And How to Avoid Them)
Mistake #1: Using kubectl delete pod as a Fix
Wrong:

kubectl delete pod api-7d4f8c9b6f-j8m2k # This just creates a new crashing pod
Right:
Find and fix the root cause first. Deleting pods in a crash loop just wastes time.

Mistake #2: Not Using --previous Flag
Without --previous, you can’t see crash logs. Always use it for CrashLoopBackOff.

Mistake #3: Exec Into Production Pods Blindly
Don’t run random commands in production containers. Use read-only commands and kubectl debug when possible.

Mistake #4: Ignoring Resource Limits
Check if your pod is being OOMKilled:

kubectl describe pod <pod-name> | grep -A 5 "Last State"
If you see OOMKilled, increase memory limits.
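The fix lands in the container's resource block. A sketch of the shape it takes; the numbers here are illustrative only, so size them from what `kubectl top pods` actually shows, not guesswork:

```yaml
# Container spec in your deployment -- example values, tune to your workload
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"   # raise this if Last State shows OOMKilled
```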

Mistake #5: Not Checking Node Health
Sometimes the problem isn’t the pod—it’s the node:

kubectl top nodes
kubectl describe node <node-name>
The Production Safety Checklist
Before running ANY debug command in production:

✅ Is this a read-only operation?
✅ Am I certain this command won’t disrupt running services?
✅ Do I have the kubectl context set correctly? (Check with kubectl config current-context)
✅ Am I targeting the right namespace? (kubectl config set-context --current --namespace=<namespace>)
✅ Have I documented what I’m about to do?
✅ Is there a rollback plan if this makes things worse?

If you answered “no” to any of these, stop and think.
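The context check is easy to script as a guard. A minimal sketch; `prod-cluster` is a made-up context name, so substitute your own, and in real use you'd pass `$(kubectl config current-context)` as the first argument:

```shell
# Refuse to proceed when the active kubectl context isn't the expected one.
check_context() {
  actual="$1"; expected="$2"
  if [ "$actual" = "$expected" ]; then
    echo "context ok: $actual"
  else
    echo "WRONG CONTEXT: $actual (expected $expected)" >&2
    return 1
  fi
}

# Hypothetical context names for illustration:
check_context "staging" "prod-cluster" || echo "aborting debug session"
```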

Quick Reference: The Debug Commands Cheat Sheet

1. Get pod status

kubectl get pods -o wide

2. Detailed pod information

kubectl describe pod <pod-name>

3. View crash logs

kubectl logs <pod-name> --previous
kubectl logs <pod-name> -c <container-name> --previous # Multi-container

4. Interactive shell (careful!)

kubectl exec -it <pod-name> -- /bin/sh

5. Cluster events

kubectl get events --sort-by='.lastTimestamp'

Bonus: Modern debug container (2026)

kubectl debug -it <pod-name> --image=busybox --target=<container-name>

Check resource usage

kubectl top pods
kubectl top nodes

Port forward for local testing

kubectl port-forward <pod-name> 8080:80
The Real Work
Debugging production Kubernetes isn’t about memorizing commands. It’s about having a process when you’re panicking at 3 AM.

These five steps have saved me countless times:

1. Get the status (get pods)
2. Read the events (describe pod)
3. Check the crash logs (logs --previous)
4. Verify the environment (exec or debug)
5. Check the bigger picture (get events)
Follow this order. Don’t skip steps. Don’t panic.

The pods will come back. Production will stabilize. You’ll sleep again.

Just don’t forget to use --previous next time.
