Mohammed Nasser

Posted on Oct 6

Kubernetes Troubleshooting 2025

#devops #softwaredevelopment #kubernetes #containers

Kubernetes Troubleshooting Guide for Application Developers

1. Inspecting Resources 🛠️

General Information 📋

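Get an overview of the most common resources across all namespaces (the "all" category covers workloads and services such as Pods, Deployments, and Services, but not ConfigMaps, Secrets, or Ingresses):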

kubectl get all -A

Checking Deployment Details 🔍

Get Full YAML Configuration:

kubectl get -n uat deployments.apps uat-deployment -o yaml

Check Replica Count:

kubectl get -n uat deployments.apps uat-deployment -o yaml | grep replicas

Search for Specific Deployments:

kubectl get deployments --all-namespaces | grep frontend

View Labels:

kubectl get -n uat deployments.apps uat-deployment -o yaml | grep labels -A5

Get Replica Count with JSONPath:

kubectl get -n uat deployments.apps uat-deployment -o=jsonpath='{.spec.replicas}'

Check Containers:

kubectl get -n uat deployments.apps uat-deployment -o=jsonpath='{.spec.template.spec.containers}'

Get Pods on Specific Node:

kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=node01
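When a node hosts many pods, the listing above can be narrowed to the pods that need attention. The following is a minimal sketch rather than part of the original command set: the function name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods -A, where STATUS is the fourth column.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and keeps
# the header plus every pod whose STATUS is not Running or Completed.
# Assumes the default layout: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk 'NR == 1 || ($4 != "Running" && $4 != "Completed")'
}

# Usage (requires cluster access):
#   kubectl get pods -A -o wide --field-selector spec.nodeName=node01 | unhealthy_pods
```

A pod can report Running while its containers are not ready, so the READY column is still worth a look.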

2. Describing Nodes and Pods 🏗️

Get Node Details:

kubectl describe node node01

Describe a Specific Pod:

kubectl describe -n uat pod/uat-pod

3. Viewing Events 📅

Events provide crucial information about what's happening in your cluster (if your kubectl version lacks the events subcommand, use kubectl get events instead):

kubectl events -n uat

4. Checking Logs 📜

Basic Log Commands

Get Logs for a Deployment (kubectl picks one pod of the deployment):

kubectl logs -n uat deployments/uat-deployment

Logs for All Containers in a Deployment:

kubectl logs -n uat deployments/uat-deployment --all-containers

Append Logs to a File:

kubectl logs -n app deployments/frontend >> logs.txt

Logs for a Specific Container:

kubectl logs -n uat deployments/uat-deployment -c uat-container01

Advanced Log Options

Logs Based on Label (shows only the last 10 lines per pod unless --tail is set):

kubectl logs -n uat -l app=uat-app

Logs with Timestamps:

kubectl logs -n uat uat-pod --timestamps

Append Timestamped Logs to a File:

kubectl logs -n app myapp --timestamps >> timestamps.txt

Time-Based Log Filtering:

kubectl logs nginx --since=10s    # Last 10 seconds
kubectl logs nginx --since=1h     # Last hour

Follow Logs in Real-Time:

kubectl logs nginx -f
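When the log volume is high, any of the log commands above can be piped through a filter. This is a minimal sketch, assuming the application writes its severity as plain text; the function name log_errors and the keyword list are illustrative choices, not something the guide prescribes.

```shell
# Hypothetical helper: keeps only lines that mention an error-like keyword
# (case-insensitive). The keyword list is an assumption; adjust it to the
# application's log format.
log_errors() {
  grep -iE 'error|exception|fatal' || true   # do not fail when nothing matches
}

# Usage (requires cluster access):
#   kubectl logs nginx --since=1h --timestamps | log_errors
```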

5. Executing Commands Inside Containers 🖥️

List Files in Container:

kubectl exec -n uat nginx -- ls

Read a File:

kubectl exec -n uat nginx -- cat /usr/share/nginx/html/index.html

Open Interactive Bash Shell (use /bin/sh if the image does not include bash):

kubectl exec -it -n uat nginx -- /bin/bash

6. Port Forwarding 🔀

Forward local port to service port for testing:

kubectl port-forward -n uat svc/uat-svc 8000:80

This forwards local port 8000 to service port 80.

7. Authentication and Authorization 🔑

Check Current User

kubectl auth whoami

Check Permissions

Check Your Own Permissions:

kubectl auth can-i list pods -n uat
kubectl auth can-i get pods -n uat
kubectl auth can-i update pods -n uat
kubectl auth can-i patch pods -n uat
kubectl auth can-i delete pods -n uat
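The five checks above can also be run in a loop. This is a minimal sketch: the function name check_verbs and the KUBECTL variable are hypothetical conveniences, and setting KUBECTL=echo prints the commands instead of running them.

```shell
# KUBECTL is a hypothetical override; it defaults to the real kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Prints "<verb>: " followed by kubectl's yes/no answer for each verb on pods
# in the given namespace. `kubectl auth can-i` exits non-zero for "no", so
# the exit status is ignored to keep the loop going.
check_verbs() {
  ns="$1"
  shift
  for verb in "$@"; do
    printf '%s: ' "$verb"
    $KUBECTL auth can-i "$verb" pods -n "$ns" || true
  done
}

# Usage (requires cluster access):
#   check_verbs uat list get update patch delete
```

For a full overview in one call, kubectl auth can-i --list -n uat prints every action the current user may perform in the namespace.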

Check Permissions as Another User:

kubectl auth can-i get pods --as=jane --v=10

Check Service Account Permissions:

kubectl auth can-i delete pods --as=system:serviceaccount:default:default

8. Resource Utilization 📊

Node Resources

Get Node Details:

kubectl get nodes -o wide

View Node Resource Usage:

kubectl top nodes

Pod Resources

Get Pods in Namespace:

kubectl get pods -n uat

View Pod Resource Usage:

kubectl top pods -n uat

9. Explaining Kubernetes Objects 📖

The explain command provides documentation about Kubernetes resources:

Explain Pod Resource:

kubectl explain pods

Explain Pod Specifications:

kubectl explain pods.spec

Explain Security Settings (Recursive):

kubectl explain pods.spec.securityContext --recursive

10. Debugging 🛠️

Compare Configuration Changes

kubectl diff -f nginx.yaml

Debug a Running Pod

kubectl debug -it nginx-pod --image=busybox --target=nginx

Copy and Debug a Pod

kubectl debug nginx-pod --image=busybox -it --copy-to=debugging-pod --share-processes

11. Common Issues and Fixes 🚨

ImagePullBackOff Error ❗

Description: Pod cannot pull the container image from the registry.

Diagnosis:

Describe the pod and check the events section to find the reason

Possible Causes:

❌ Incorrect image name: Verify the image name in your deployment YAML
🔑 Missing or invalid imagePullSecrets: The registry typically rejects the pull with a 401 Unauthorized error
🏷️ Incorrect image tag: Check if the specified tag exists
🌐 Cluster cannot resolve registry hostname: Check DNS and network connectivity

Fix:

kubectl describe pod <pod-name> -n <namespace>
# Check Events section for detailed error

CrashLoopBackOff Error 🔄

Description: Container keeps crashing and Kubernetes restarts it repeatedly.

Precondition:

The restart loop requires restartPolicy in the pod YAML to be Always (the default, and the only value a Deployment accepts) or OnFailure

Exit Code Analysis:

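Exit Code 1: Application error (check application logs)
Exit Code 137: Container was killed with SIGKILL (128 + 9), typically an OOM kill or a liveness probe failure where the process did not stop in time
Exit Code 127: Command not found (the container's command or entrypoint points to a non-existent file or binary)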

Other Causes:

📂 Volume mount issues: Check if volumes are properly mounted

Fix:

kubectl logs <pod-name> -n <namespace> --previous
kubectl describe pod <pod-name> -n <namespace>
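The exit code rules above can be turned into a small lookup. This is a minimal sketch: the function name explain_exit_code is hypothetical, and it relies on the shell convention that codes above 128 mean the process was terminated by signal (code - 128).

```shell
# Hypothetical helper: translates a container exit code into a short hint.
explain_exit_code() {
  code="$1"
  case "$code" in
    "")  echo "no terminated state recorded" ;;
    0)   echo "exited successfully" ;;
    1)   echo "application error; check the application logs" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    *)
      # Codes above 128 follow the 128 + signal number convention.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined exit code"
      fi
      ;;
  esac
}

# Usage (requires cluster access; reads the exit code of the previous run
# of the pod's first container):
#   code=$(kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   explain_exit_code "$code"
```

Signal 9 is SIGKILL and signal 15 is SIGTERM, which is why 137 and 143 show up so often in restart loops.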

Pending Pods ⏳

Description: Pods are stuck in the Pending state and not being scheduled.

Common Causes:

⚡ Insufficient resources on nodes: Not enough CPU/memory available
🔍 Node selector mismatch: Pod's nodeSelector doesn't match any node labels
🚫 Taints and tolerations: Nodes are tainted and pod lacks required tolerations

Fix:

kubectl describe pod <pod-name> -n <namespace>
# Check Events section for scheduling failures

# Add label to node if needed
kubectl label nodes <node-name> <label-key>=<label-value>

# Check node capacity
kubectl describe nodes
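If a namespace has many Pending pods, the scheduler's messages can be pulled for all of them at once instead of describing each pod. This is a minimal sketch: the function name scheduling_failures and the KUBECTL override are hypothetical; the reason field selector and the FailedScheduling event reason are standard.

```shell
# KUBECTL is a hypothetical override; it defaults to the real kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Lists FailedScheduling events in a namespace, oldest first.
scheduling_failures() {
  $KUBECTL get events -n "$1" \
    --field-selector reason=FailedScheduling \
    --sort-by=.lastTimestamp
}

# Usage (requires cluster access):
#   scheduling_failures uat
```

The event message usually names the cause directly, for example insufficient CPU, an unmatched node selector, or an untolerated taint.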

Missing Pods ❓

Description: Expected pods are not running.

Possible Causes:

🚧 Pod quota exceeded: Namespace has reached its resource quota
🔑 Service account missing in deployment: Required service account doesn't exist

Fix:

# Check events for quota issues
kubectl get events -n uat

# Create missing service account
kubectl create sa service-account-uat -n uat

Schrodinger's Deployment 🐱

Description: Multiple deployments share common selectors, which causes pod management and traffic routing issues.

Problem: Using common selectors like version=1 across multiple deployments.

Fix:

# Check affected pods
kubectl get pods -l version=1

# Verify endpoints
kubectl get endpoints

# Use unique selectors for each deployment
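To spot deployments that share a selector, their matchLabels can be listed and grouped. This is a minimal sketch: the function name duplicate_selectors is hypothetical, it expects tab-separated "name, selector" lines on stdin, and it only detects identical selectors, not partially overlapping ones.

```shell
# Hypothetical helper: reads "<deployment>\t<selector>" lines and prints each
# selector that is used by more than one deployment, with the deployment names.
duplicate_selectors() {
  awk -F '\t' '
    { names[$2] = (names[$2] == "" ? $1 : names[$2] ", " $1); count[$2]++ }
    END { for (s in count) if (count[s] > 1) print s ": " names[s] }
  '
}

# Usage (requires cluster access):
#   kubectl get deployments -n uat \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.selector.matchLabels}{"\n"}{end}' \
#     | duplicate_selectors
```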

CreateContainerError / CreateContainerConfigError ⚙️

CreateContainerConfigError:

🔍 Missing Secret
🔍 Missing ConfigMap
🔍 Missing key: An environment variable references a key that does not exist in the ConfigMap or Secret

CreateContainerError:

❌ Missing entrypoint or command
❌ Invalid container configuration

Fix:

kubectl describe pod <pod-name> -n <namespace>
# Check Events section for specific error

# Verify ConfigMap exists
kubectl get configmap -n <namespace>

# Verify Secret exists
kubectl get secret -n <namespace>

Config Out of Date 🔄

Description: ConfigMap or Secret changes not reflected in running pods.

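Cause: Values injected as environment variables (or mounted with subPath) are read only when the container starts. ConfigMaps and Secrets mounted as regular volumes are refreshed eventually, but many applications do not re-read the files.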

Fix:

# Option 1: Rollout restart
kubectl rollout restart deployment/<deployment-name> -n <namespace>

# Option 2: Use reloader controller
# Install and configure reloader to automatically restart pods on config changes
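A rollout restart returns immediately, so it can be useful to wait until the new pods are ready before testing the new configuration. This is a minimal sketch: the function name restart_and_wait, the KUBECTL override, and the 120-second timeout are assumptions.

```shell
# KUBECTL is a hypothetical override; it defaults to the real kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Restarts a deployment and blocks until the rollout finishes or times out.
# Arguments: <deployment-name> <namespace>
restart_and_wait() {
  $KUBECTL rollout restart "deployment/$1" -n "$2" &&
    $KUBECTL rollout status "deployment/$1" -n "$2" --timeout=120s
}

# Usage (requires cluster access):
#   restart_and_wait frontend app
```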

Endless Terminating State ♾️

Description: Pod stuck in Terminating state.

Possible Causes:

Finalizer preventing deletion
Node where pod was running is unavailable

Fix:

# Check for finalizers first
kubectl get pod <pod-name> -n <namespace> -o yaml | grep finalizers -A5

# Force delete the pod (last resort: the container may keep running if the node is unreachable)
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
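When a finalizer is what blocks deletion, a force delete alone does not help, because the pod object stays until the finalizer list is empty. This is a minimal sketch of inspecting and clearing finalizers: the function names and the KUBECTL override are hypothetical, and clearing finalizers skips whatever cleanup the owning controller intended, so treat it as a last resort.

```shell
# KUBECTL is a hypothetical override; it defaults to the real kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Prints the finalizers set on a pod (empty output means none).
# Arguments: <pod-name> <namespace>
list_finalizers() {
  $KUBECTL get pod "$1" -n "$2" -o jsonpath='{.metadata.finalizers}'
}

# Clears all finalizers on a pod with a JSON merge patch.
# Arguments: <pod-name> <namespace>
remove_finalizers() {
  $KUBECTL patch pod "$1" -n "$2" --type=merge -p '{"metadata":{"finalizers":null}}'
}

# Usage (requires cluster access):
#   list_finalizers uat-pod uat
#   remove_finalizers uat-pod uat
```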

Field Immutability 🔒

Description: Cannot update certain fields after resource creation.

Problem: Some fields, such as a Deployment's spec.selector (including matchLabels), are immutable once the resource is created.

Fix:

# Delete and re-create the deployment (this causes downtime)
kubectl delete deployment <deployment-name> -n <namespace>
kubectl apply -f <deployment-file>.yaml

EnableServiceLinks Issue 🔄

Description: Pods receive a large number of injected environment variables, one set per service.

Problem: By default, Kubernetes injects environment variables for every service that exists in the pod's namespace when the pod starts.

Fix:

spec:
  template:
    spec:
      enableServiceLinks: false
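To see how many service-link variables a pod actually received, the container's environment can be counted before and after the change. This is a minimal sketch: the function name count_service_links is hypothetical, and it counts variables ending in _SERVICE_HOST as a proxy for the number of linked services.

```shell
# Hypothetical helper: reads `env` output on stdin and prints how many
# <NAME>_SERVICE_HOST variables it contains (one per linked service).
count_service_links() {
  grep -c '^[A-Z0-9_]*_SERVICE_HOST=' || true   # grep -c prints 0 but exits 1 on no match
}

# Usage (requires cluster access):
#   kubectl exec -n uat nginx -- env | count_service_links
```

Expect at least one match even with enableServiceLinks set to false, since the variables for the kubernetes API service (KUBERNETES_SERVICE_HOST and friends) are still injected.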

Network Policy Issues 🌐

Description: Pods cannot communicate due to network policies.

Diagnosis:

# Check network policies
kubectl get netpol -n uat

# Describe network policy
kubectl describe netpol <policy-name> -n uat

Verify:

Ingress rules (incoming traffic)
Egress rules (outgoing traffic)
Pod selectors
Namespace selectors
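After reviewing the rules, a quick connectivity probe from inside the namespace can confirm whether traffic is actually blocked. This is a minimal sketch: the function name probe_service, the pod name netcheck, and the KUBECTL override are hypothetical, and it assumes the cluster can pull the busybox image. The temporary pod carries its own label (run=netcheck), so policies that select on pod labels may treat it differently from the application's pods.

```shell
# KUBECTL is a hypothetical override; it defaults to the real kubectl binary.
KUBECTL="${KUBECTL:-kubectl}"

# Starts a throwaway busybox pod in the namespace, fetches the URL with a
# 2-second timeout, and removes the pod afterwards.
# Arguments: <namespace> <url>
probe_service() {
  ns="$1"
  url="$2"
  $KUBECTL run netcheck --rm -it --restart=Never --image=busybox -n "$ns" \
    -- wget -qO- -T 2 "$url"
}

# Usage (requires cluster access):
#   probe_service uat http://uat-svc:80
```

A timeout here, combined with a working request after the policy is relaxed, points to the network policy rather than the application.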

Multi-Attach Volume Error 💾

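Description: A ReadWriteOnce volume cannot be attached to pods on different nodes at the same time. This typically happens during a rolling update, when the new pod is scheduled onto another node while the old pod still holds the volume.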

Quick Fix:

# Scale down to 0
kubectl scale deployment/<deployment-name> --replicas=0 -n <namespace>

# Scale back to 1
kubectl scale deployment/<deployment-name> --replicas=1 -n <namespace>

Recommended Fix: Use Recreate strategy in deployment:

spec:
  strategy:
    type: Recreate

Persistent Volume Access Modes:

✅ RWO (ReadWriteOnce): Volume can be mounted as read-write by a single node
✅ RWX (ReadWriteMany): Volume can be mounted as read-write by multiple nodes
✅ ROX (ReadOnlyMany): Volume can be mounted as read-only by multiple nodes

Check PV Access Mode:

kubectl get pv
kubectl describe pv <pv-name>

Quick Reference Cheat Sheet

Most Used Commands

# Get resources
kubectl get pods -n <namespace>
kubectl get all -A

# Describe resources
kubectl describe pod <pod-name> -n <namespace>
kubectl describe node <node-name>

# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs -f <pod-name> -n <namespace>

# Execute commands
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash

# Events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Resource usage
kubectl top nodes
kubectl top pods -n <namespace>