Kubernetes is powerful, but with that power comes complexity. In real-world DevOps environments, issues like pod failures, scheduling problems, and resource mismanagement are common. Understanding how to troubleshoot these effectively is what separates a beginner from a skilled DevOps engineer.
- ImagePullBackOff Issue
One of the most common errors in Kubernetes is ImagePullBackOff, which occurs when a container image cannot be pulled.
Causes:
Invalid or non-existent image
Private repository without authentication
Solution:
For private images, create a registry secret and reference it via imagePullSecrets:
kubectl create secret docker-registry demo \
  --docker-server=your-registry-server \
  --docker-username=your-name \
  --docker-password=your-password \
  --docker-email=your-email
Then reference it in your deployment:
spec:
  imagePullSecrets:
  - name: demo
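Putting it together, a minimal Deployment sketch (the app name and image are placeholders) shows where imagePullSecrets sits: under the pod template's spec, not the Deployment's top-level spec:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                 # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      imagePullSecrets:
      - name: demo               # the secret created above
      containers:
      - name: app
        image: your-registry-server/your-image:tag   # placeholder
```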
For AWS ECR:
kubectl create secret docker-registry ecr-secret \
  --docker-server=${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com \
  --docker-username=AWS \
  --docker-password=$(aws ecr get-login-password) \
  --namespace=default
- CrashLoopBackOff
This error indicates that a container is repeatedly crashing and restarting.
Common Reasons:
Misconfigurations (env variables, volumes)
Incorrect commands in Dockerfile
Application bugs
Liveness probe failures
Insufficient CPU or memory
How It Works:
Kubernetes restarts the container with an exponentially increasing delay:
First restart: after ~10 seconds
Each subsequent failure doubles the delay (10s, 20s, 40s, ...), capped at 5 minutes
This is the backoff strategy, which gives the state its name: CrashLoopBackOff.
Fix:
Check logs of the crashed container: kubectl logs <pod-name> --previous
Describe the pod to see events and exit codes: kubectl describe pod <pod-name>
Validate configs, environment variables, and probe settings
- Liveness & Readiness Probes
Kubernetes uses probes to monitor application health.
Types:
Liveness Probe → Restarts the container if the check fails
Readiness Probe → Removes the pod from Service endpoints (no traffic) until the check passes
Misconfigured probes can cause continuous restarts → CrashLoopBackOff.
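A sketch of both probes on a container (the paths, port, and timings are assumptions; tune them to your application's real startup time):

```yaml
containers:
- name: app
  image: your-image:tag           # placeholder
  livenessProbe:
    httpGet:
      path: /healthz              # assumed health endpoint
      port: 8080
    initialDelaySeconds: 15       # give the app time to start
    periodSeconds: 10
    failureThreshold: 3           # restart after 3 consecutive failures
  readinessProbe:
    httpGet:
      path: /ready                # assumed readiness endpoint
      port: 8080
    periodSeconds: 5
```

An initialDelaySeconds shorter than the app's actual startup time is a classic cause of the continuous-restart loop described above.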
- Resource Management (Critical in Real-World Clusters)
In shared clusters, improper resource usage can affect all applications.
Problem:
One application consumes excessive CPU/memory → others fail
Solutions:
1) Resource Quota (Namespace Level)
Limits total resources a namespace can use
2) Resource Limits (Pod Level)
Restricts individual pod usage
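A sketch of both levels (the names, namespace, and all numbers are illustrative, not recommendations):

```yaml
# 1) Namespace-level quota: caps the total the namespace can request
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota              # hypothetical name
  namespace: team-a             # hypothetical namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
---
# 2) Pod-level requests and limits, set per container
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
  namespace: team-a
spec:
  containers:
  - name: app
    image: your-image:tag       # placeholder
    resources:
      requests:                 # what the scheduler reserves
        cpu: 250m
        memory: 256Mi
      limits:                   # hard ceiling enforced at runtime
        cpu: 500m
        memory: 512Mi
```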
Important Rule:
Never blindly increase resources. Identify the root cause first, then allocate based on the application's measured usage.
- Pod Not Schedulable
If a pod is stuck in Pending, it means the scheduler cannot place it on any node.
Debug:
kubectl describe pod <pod-name>
The Events section at the bottom usually states exactly why scheduling failed.
Common Causes & Fixes:
1) Node Selector: Forces the pod onto nodes carrying a specific label
nodeSelector:
  node-name: arm-worker
Here node-name: arm-worker is a node label, not the node's name. If no node carries a matching label → the pod won't schedule.
Fix:
kubectl label nodes <node-name> node-name=arm-worker
(or kubectl edit node <node-name> and add the label)
2) Node Affinity: More flexible than nodeSelector:
Required → Must match
Preferred → Try to match, else fallback
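A sketch of both variants in a pod spec (the label keys and values are assumptions):

```yaml
affinity:
  nodeAffinity:
    # Required: pod stays Pending unless a node matches
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
    # Preferred: scheduler tries to match, falls back otherwise
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: node-name            # assumed custom label
          operator: In
          values: ["arm-worker"]
```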
3) Taints: Prevent pods without a matching toleration from scheduling on (or remaining on) a node.
Types:
NoSchedule
NoExecute
PreferNoSchedule
kubectl taint nodes <node-name> key=value:NoSchedule
4) Tolerations: Allows specific pods to run on tainted nodes.
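A toleration matching the taint above would look like this in the pod spec (the key=value pair mirrors the taint command):

```yaml
spec:
  tolerations:
  - key: "key"                # must match the taint's key
    operator: "Equal"
    value: "value"            # must match the taint's value
    effect: "NoSchedule"
```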
- StatefulSet & Persistent Volume Issues
Stateful applications depend on storage.
Problem:
Pods stuck in Pending due to missing Persistent Volume (PV)
Root Cause:
Incorrect StorageClass
Example issue:
storageClassName: ebs
A class named ebs may exist in an AWS cluster but not in other environments, so the PVC never binds.
Solution:
Use a StorageClass that actually exists in the target cluster, for example:
storageClassName: standard
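A minimal PVC sketch using that class (the name and size are placeholders; substitute whichever class your cluster actually offers):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc              # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: standard  # must exist in your cluster
  resources:
    requests:
      storage: 1Gi            # placeholder size
```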
Debug:
kubectl get storageclass
kubectl describe pod <pod-name>
Note:
Most PVC spec fields (including storageClassName) are immutable, so delete the old PVC before reapplying:
kubectl delete pvc <pvc-name>
- OOMKilled (Out Of Memory)
Occurs when a container exceeds its memory limit and the kernel kills it (exit code 137).
Causes:
Low memory limits
Memory leaks in application
Debug:
Check pod events and last state: kubectl describe pod <pod-name> (look for Reason: OOMKilled)
For Java apps:
Thread dump → jstack <pid> (or kill -3 <pid>)
Heap dump → jmap -dump:live,format=b,file=heap.hprof <pid>
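To capture these dumps from a running pod, something like the following works, assuming the JDK tools (jstack/jmap) are present in the image and the Java process is PID 1 inside the container:

```shell
# Thread dump to a local file (assumes Java is PID 1 in the container)
kubectl exec <pod-name> -- jstack 1 > thread-dump.txt

# Heap dump written inside the container, then copied out
kubectl exec <pod-name> -- jmap -dump:live,format=b,file=/tmp/heap.hprof 1
kubectl cp <pod-name>:/tmp/heap.hprof ./heap.hprof
```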
Example:
If the app needs 2GB but the limit is 200MB → a crash is inevitable
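That mismatch looks like this in a manifest; the fix is to measure actual usage and size the limit accordingly (2Gi here assumes the app genuinely needs it):

```yaml
resources:
  requests:
    memory: 1Gi       # scheduler guarantee; sized from observed usage
  limits:
    memory: 2Gi       # was 200Mi → OOMKilled; raised after measuring
```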
Kubernetes troubleshooting is not about memorizing commands, it’s about understanding system behavior.