DEV Community

Anand k


Kubernetes Troubleshooting Guide: Real-Time Scenarios & Solutions

Kubernetes is powerful, but with that power comes complexity. In real-world DevOps environments, issues like pod failures, scheduling problems, and resource mismanagement are common. Understanding how to troubleshoot these effectively is what separates a beginner from a skilled DevOps engineer.

  1. ImagePullBackOff Issue

One of the most common errors in Kubernetes is ImagePullBackOff, which occurs when a container image cannot be pulled.

Causes:
- Invalid or non-existent image tag
- Private registry without authentication

Solution:

For private images, use imagePullSecrets:

kubectl create secret docker-registry demo \
  --docker-server=your-registry-server \
  --docker-username=your-name \
  --docker-password=your-password \
  --docker-email=your-email

Then reference it in your deployment:
spec:
  imagePullSecrets:
  - name: demo
For AWS ECR:
kubectl create secret docker-registry ecr-secret \
  --docker-server=${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com \
  --docker-username=AWS \
  --docker-password=$(aws ecr get-login-password) \
  --namespace=default
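Putting it together, a minimal Pod that pulls from a private registry might look like this (the image name and registry host are illustrative; the secret name matches the `demo` secret created above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-app
spec:
  imagePullSecrets:
  - name: demo                               # secret created with kubectl create secret docker-registry
  containers:
  - name: app
    image: your-registry-server/myapp:1.0    # illustrative private image
```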

  2. CrashLoopBackOff

This error indicates that a container is repeatedly crashing and restarting.

Common Reasons:
- Misconfigurations (env variables, volumes)
- Incorrect commands in the Dockerfile
- Application bugs
- Liveness probe failures
- Insufficient CPU or memory

How It Works:
Kubernetes restarts the container with an exponentially increasing delay:

First restart: ~10 seconds
Then: 20s, 40s, 80s ... capped at 5 minutes
This is called a backoff strategy.

Fix:
- Check logs of the crashed container: kubectl logs <pod-name> --previous
- Describe the pod: kubectl describe pod <pod-name>
- Validate configs and probes

  3. Liveness & Readiness Probes

Kubernetes uses probes to monitor application health.
Types:
Liveness Probe → Restarts container if unhealthy
Readiness Probe → Controls traffic routing

Misconfigured probes can cause continuous restarts → CrashLoopBackOff.
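As an illustration, here is a container spec with both probes (the endpoint paths, port, and image name are assumptions for the example):

```yaml
containers:
- name: app
  image: myapp:1.0             # illustrative image
  ports:
  - containerPort: 8080
  livenessProbe:               # restart the container if this check fails
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 10    # give the app time to start before probing
    periodSeconds: 5
  readinessProbe:              # remove the pod from Service endpoints if this fails
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
```

A too-low initialDelaySeconds on the liveness probe is a classic cause of probe-induced CrashLoopBackOff: the app is killed before it ever finishes starting.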

  4. Resource Management (Critical in Real-Time)

In shared clusters, improper resource usage can affect all applications.
Problem:
One application consumes excessive CPU/memory → others fail
Solutions:
1) Resource Quota (Namespace Level)
Limits total resources a namespace can use
2) Resource Limits (Pod Level)
Restricts individual pod usage

Important Rule:
Never blindly increase resources. Always identify the root cause and allocate the correct usage.
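A sketch of both levels (names, namespace, and numbers are illustrative, not recommendations):

```yaml
# Namespace level: caps the total usage of everything in the namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: demo
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
---
# Pod level: per-container requests and limits
apiVersion: v1
kind: Pod
metadata:
  name: app
  namespace: demo
spec:
  containers:
  - name: app
    image: myapp:1.0
    resources:
      requests:          # what the scheduler reserves for the pod
        cpu: 250m
        memory: 256Mi
      limits:            # hard ceiling; exceeding the memory limit → OOMKilled
        cpu: 500m
        memory: 512Mi
```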

  5. Pod Not Schedulable

If a pod is stuck in Pending, it means the scheduler cannot place it on any node.

Debug:
kubectl describe pod <pod-name>
Common Causes & Fixes:

1) Node Selector: Forces the pod to run on nodes with a specific label

nodeSelector:
  node-name: arm-worker

If no node carries a matching label → the pod won't schedule
Fix: add the matching label to a node:
kubectl label nodes <node-name> node-name=arm-worker

2) Node Affinity: More flexible than nodeSelector:

Required → Must match
Preferred → Try to match, else fallback
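For example, a required affinity rule equivalent to the selector above, plus a preferred rule (the label keys and values are the same illustrative ones):

```yaml
affinity:
  nodeAffinity:
    # Required → hard rule: the pod stays Pending if no node matches
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-name
          operator: In
          values:
          - arm-worker
    # Preferred → soft rule: the scheduler tries to match, else falls back
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd
```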

3) Taints: Prevent pods from scheduling on nodes.
Types:
- NoSchedule
- NoExecute
- PreferNoSchedule

kubectl taint nodes <node-name> key=value:NoSchedule

4) Tolerations: Allow specific pods to run on tainted nodes.
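A toleration matching the illustrative taint above would go in the pod spec:

```yaml
tolerations:
- key: "key"            # must match the taint's key
  operator: "Equal"
  value: "value"        # must match the taint's value
  effect: "NoSchedule"  # must match the taint's effect
```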

  6. StatefulSet & Persistent Volume Issues

Stateful applications depend on storage.

Problem:
Pods stuck in Pending due to missing Persistent Volume (PV)

Root Cause:
Incorrect StorageClass

Example issue:
storageClassName: ebs

This works in AWS but fails in other environments.

Solution:
storageClassName: standard

Debug:
kubectl get storageclass
kubectl describe pod <pod-name>
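A minimal PVC using a class that actually exists in the cluster (the claim name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: standard   # verify with: kubectl get storageclass
  resources:
    requests:
      storage: 1Gi
```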

Note:

Delete the old PVC before reapplying:
kubectl delete pvc <pvc-name>

  7. OOMKilled (Out Of Memory)

Occurs when a container exceeds memory limits.

Causes:
- Memory limits set too low
- Memory leaks in the application

Debug:
- Check pod events: kubectl describe pod <pod-name> (look for OOMKilled, exit code 137)
- For Java apps:
  Thread dump → kill -3 <pid> or jstack <pid>
  Heap dump → jmap -dump:format=b,file=heap.hprof <pid>

Example:

If app needs 2GB but limit is 200MB → crash is inevitable
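In manifest terms, that mismatch looks like this (values illustrative):

```yaml
resources:
  requests:
    memory: 200Mi
  limits:
    memory: 200Mi   # app actually needs ~2Gi → container is OOMKilled on startup
```

The fix is not to raise the limit blindly, but to measure real usage first and then set requests and limits to match it.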

Kubernetes troubleshooting is not about memorizing commands; it's about understanding system behavior.
