DEV Community

Kuldeepkumhar-fs


🛠️ How to Troubleshoot a Kubernetes Cluster: A Step-by-Step Guide 🚀

Kubernetes is powerful, but troubleshooting a K8s cluster can be complex. This guide walks through debugging and fixing common Kubernetes problems: pod failures, networking issues, node problems, control plane failures, and persistent storage issues.

📌 Keywords: Kubernetes troubleshooting, fix Kubernetes issues, Kubernetes debugging, Kubernetes pod errors, Kubernetes service not working, Kubernetes networking issues, Kubernetes health checks failing, Kubernetes CrashLoopBackOff, Kubernetes NotReady node, Kubernetes API server down, kubectl logs, Kubernetes monitoring tools.


🛠️ Common Kubernetes Issues & How to Fix Them

Kubernetes failures usually fall into these categories:
✅ Pod Issues (CrashLoopBackOff, ImagePullBackOff, OOMKilled)
✅ Service & Networking Issues (Pods unreachable, DNS failures)
✅ Node Issues (NotReady nodes, kubelet failures, resource exhaustion)
✅ Control Plane Issues (API Server down, etcd failures)
✅ Persistent Storage Issues (PVC not bound, Disk Pressure)

Let's dive into how to troubleshoot each of these step by step! 🔍


🚀 Step 1: Troubleshooting Pod Issues

🔹 1. Check Pod Status

kubectl get pods -A

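💡 Common Issues:

  • CrashLoopBackOff → The container keeps crashing and Kubernetes is backing off between restarts
  • ImagePullBackOff → The image pull failed (wrong name or tag, or missing registry credentials)
  • Pending → The pod has not been scheduled yet (insufficient resources, unsatisfied node constraints, or an unbound PVC)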
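
On a busy cluster the full listing is long, so it can help to filter it down to pods that are not healthy. The helper below is a minimal sketch: `unhealthy_pods` is a hypothetical name, and it assumes the default column layout of `kubectl get pods -A --no-headers` (namespace, name, ready, status, and so on).

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` on stdin and
# prints "namespace/name status" for every pod whose STATUS column is
# neither Running nor Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage (requires access to a cluster):
#   kubectl get pods -A --no-headers | unhealthy_pods
```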

🔹 2. Inspect Pod Logs

kubectl logs <pod-name> -n <namespace>

💡 Fix:

  • If the error comes from the application itself (e.g., missing dependencies), fix and rebuild the container image.
  • If the logs show "connection refused", continue with Service & Networking (Step 2).
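
For a pod in CrashLoopBackOff, the current container has often just restarted and has little to show, so the logs of the previous (crashed) container are usually more informative. Here is a small sketch using the `--previous` and `--tail` flags of `kubectl logs`; the wrapper name `crash_logs` and the default of 100 lines are illustrative choices, not part of the steps above.

```shell
# Hypothetical wrapper: show the last lines logged by the previous
# (crashed) instance of a pod's container.
# Usage: crash_logs <pod-name> <namespace> [lines]
crash_logs() {
  kubectl logs "$1" -n "$2" --previous --tail="${3:-100}"
}
```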

🔹 3. Check Pod Events & Describe Pod

kubectl describe pod <pod-name> -n <namespace>

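Look for:

  • "FailedMount" (Persistent Volume issue)
  • "FailedScheduling" (no node satisfies the pod's requirements)
  • "OOMKilled" (the container exceeded its memory limit; shown as the reason under Last State)

💡 Fix:

  • FailedMount? Check that the PersistentVolumeClaim is Bound and that the volume can attach to the pod's node.
  • FailedScheduling? Compare the pod's resource requests, node selectors, and tolerations with what the nodes offer.
  • OOMKilled? Increase the memory limit in the pod spec, or reduce the application's memory usage.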
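
The describe output is long, so a quick filter for the reasons listed above can save some scrolling. This is only a sketch: `describe_red_flags` is a hypothetical helper, and it matches plain text, so it will also print any other line that happens to contain one of these words.

```shell
# Hypothetical filter: reads `kubectl describe pod` output on stdin and keeps
# only lines that mention the failure reasons listed above.
describe_red_flags() {
  grep -E 'FailedMount|FailedScheduling|OOMKilled' || echo "no known failure reason found"
}

# Usage (requires access to a cluster):
#   kubectl describe pod <pod-name> -n <namespace> | describe_red_flags
```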

🖧 Step 2: Troubleshooting Service & Networking Issues

🔹 1. Check Service Details

kubectl get svc -A

Verify that:

  • TYPE (ClusterIP, NodePort, or LoadBalancer) is what you intended
  • EXTERNAL-IP is assigned (for LoadBalancer services); a value stuck at <pending> means the load balancer has not been provisioned yet

🔹 2. Check Service Endpoints

kubectl get endpoints -A

💡 If a Service has no endpoints, its selector does not match any ready pods, so traffic sent to the Service has nowhere to go.
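
To spot such Services quickly, the endpoints listing can be filtered for rows with no addresses. This is a minimal sketch, assuming the default column layout of `kubectl get endpoints -A --no-headers` (namespace, name, endpoints, age); `services_without_endpoints` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get endpoints -A --no-headers` on stdin
# and prints "namespace/name" for every Service whose ENDPOINTS column is <none>.
services_without_endpoints() {
  awk '$3 == "<none>" { print $1 "/" $2 }'
}

# Usage (requires access to a cluster):
#   kubectl get endpoints -A --no-headers | services_without_endpoints
# Then compare the Service selector with the pod labels:
#   kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
#   kubectl get pods -n <namespace> --show-labels
```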

🔹 3. Manually Test Service Connectivity

kubectl exec -it <pod-name> -- curl http://<service-ip>:<port>

💡 Fix:

  • If curl fails, check that the Service selector matches the pod labels and that the target port is the one the container listens on.
  • If name resolution fails, verify that CoreDNS is running:
kubectl get pods -n kube-system | grep coredns
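
The curl test above uses the Service IP, which bypasses DNS. Repeating it with the Service's DNS name exercises CoreDNS and the Service in one request. The sketch below builds that in-cluster URL; it assumes the default cluster domain `cluster.local` (clusters can be configured differently), and `svc_url` is a hypothetical helper name.

```shell
# Hypothetical helper: build the in-cluster URL of a Service.
# Assumes the default cluster domain "cluster.local".
# Usage: svc_url <service-name> <namespace> <port>
svc_url() {
  printf 'http://%s.%s.svc.cluster.local:%s\n' "$1" "$2" "$3"
}

# Usage (requires access to a cluster and curl inside the container image):
#   kubectl exec -it <pod-name> -n <namespace> -- curl "$(svc_url my-service default 80)"
```

If the request by IP succeeds but the request by name fails, the problem is more likely DNS than the Service itself.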

🖥️ Step 3: Troubleshooting Node Issues

🔹 1. Check Node Status

kubectl get nodes

If a node is NotReady, check its events:

kubectl describe node <node-name>

💡 Possible Errors & Fixes:
| Issue | Cause | Fix |
|--------|-------|------|
| NotReady | Kubelet crashed or stopped | Restart the kubelet: sudo systemctl restart kubelet |
| DiskPressure | Node is running out of disk space | Find what is using the disk (e.g., sudo du -sh /var/lib/docker on Docker-based nodes), then remove unused images and old logs |
| MemoryPressure | Insufficient memory | Move workloads off the node or use a larger node size in your cloud provider |
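
With many nodes, it can be convenient to pull out only the ones that are not Ready and then look at their conditions to see which row of the table applies. This is a minimal sketch, assuming the default columns of `kubectl get nodes --no-headers` (name, status, roles, age, version); `not_ready_nodes` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` on stdin and
# prints the name of every node whose STATUS does not start with "Ready".
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (requires access to a cluster): show the Conditions block of each such node.
#   for node in $(kubectl get nodes --no-headers | not_ready_nodes); do
#     kubectl describe node "$node" | grep -A 8 'Conditions:'
#   done
```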

🔹 2. Check Kubelet Logs

journalctl -u kubelet -n 50

If Kubelet is not responding, restart it:

sudo systemctl restart kubelet
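
Before restarting, it is worth knowing why the kubelet is unhappy, and the journal is mostly informational lines. The filter below is a sketch that keeps only error and fatal lines. It assumes the kubelet's default text (klog) log format, where the severity letter starts each message, and `kubelet_errors` is a hypothetical helper name.

```shell
# Hypothetical filter: reads journal output on stdin and keeps only kubelet
# lines logged at error (E) or fatal (F) severity in klog format,
# e.g. "... kubelet[1234]: E0102 10:15:00.123456 ...".
kubelet_errors() {
  grep -E 'kubelet\[[0-9]+\]: [EF][0-9]{4} ' || echo "no error-level kubelet lines found"
}

# Usage (on the affected node):
#   journalctl -u kubelet --since "15 min ago" --no-pager | kubelet_errors
```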

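⚙️ Step 4: Troubleshooting Control Plane Issues

If your API server is down, kubectl and the controllers stop working, so nothing in the cluster can be changed or rescheduled; pods that are already running usually keep running.

🔹 1. Check API Server Reachability and Logs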

kubectl cluster-info
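
When the API server answers but something still seems off, its `/readyz` endpoint gives a per-check report in which passing checks start with `[+]` and failing ones with `[-]`. The sketch below filters that report; `failed_checks` is a hypothetical helper name, and the `2>&1` in the usage line is there on the assumption that kubectl may print a failing report as an error message rather than on standard output.

```shell
# Hypothetical filter: reads the verbose readyz report on stdin and prints
# only the lines that contain a failed-check marker ("[-]").
failed_checks() {
  grep -F '[-]' || echo "all reported checks passed"
}

# Usage (requires a reachable API server):
#   kubectl get --raw='/readyz?verbose' 2>&1 | failed_checks
```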

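If the API server is not reachable, check its logs on a control plane node. The command below applies when kube-apiserver runs as a systemd service; where it runs as a static pod (the kubeadm default), read the container logs under /var/log/pods instead: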

sudo journalctl -u kube-apiserver -n 50

💡 Fix:

  • If etcd is failing and runs as a systemd service, restart it:
sudo systemctl restart etcd
  • Check if control plane nodes are under resource constraints.
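
On clusters where the control plane runs as static pods (kubeadm's default layout), there is no kube-apiserver or etcd systemd unit, so the logs have to come from the container runtime. Here is a minimal sketch, assuming `crictl` is installed and configured on the control plane node; `static_pod_logs` is a hypothetical helper name and the default of 50 lines is an arbitrary choice.

```shell
# Hypothetical helper: print the last log lines of the newest container whose
# name matches a control plane component (e.g. kube-apiserver, etcd).
# Usage: static_pod_logs <component> [lines]
static_pod_logs() {
  # -a includes exited containers, --latest keeps only the most recently
  # created match, -q prints only the container ID.
  id=$(sudo crictl ps -a --latest --name "$1" -q)
  if [ -z "$id" ]; then
    echo "no container found for $1" >&2
    return 1
  fi
  sudo crictl logs --tail "${2:-50}" "$id"
}
```

For example, `static_pod_logs etcd` run on a control plane node would show the most recent etcd output even if the container has already exited.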

💾 Step 5: Troubleshooting Persistent Storage Issues

If your pods are stuck in "ContainerCreating" or "Pending" because of volume issues:

🔹 1. Check Persistent Volume (PV) and Claim (PVC)

kubectl get pv,pvc -A

💡 Fix:

  • If the PVC is Pending, check that the storage class it requests exists:
kubectl get storageclass
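  • If the disk is full, expand the volume by raising the PVC's storage request (the storage class must have allowVolumeExpansion: true).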
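
To find the claims that need attention in the first place, the PVC listing can be filtered for anything that is not Bound. This is a minimal sketch, assuming the default columns of `kubectl get pvc -A --no-headers` (namespace, name, status, and so on); `unbound_pvcs` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get pvc -A --no-headers` on stdin and
# prints "namespace/name status" for every claim that is not Bound.
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2, $3 }'
}

# Usage (requires access to a cluster):
#   kubectl get pvc -A --no-headers | unbound_pvcs
# The Events section of the claim usually states why it is not binding:
#   kubectl describe pvc <pvc-name> -n <namespace>
```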

🛠️ Best Practices for Troubleshooting Kubernetes

🔹 Enable Logging & Monitoring (kubectl logs, Prometheus, Loki)
🔹 Use kubectl get events to see recent cluster events
🔹 Keep your cluster nodes updated
🔹 Automate Scaling (Horizontal Pod Autoscaler)
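
Event listings are noisy and unsorted by default, so for the `kubectl get events` tip it may help to keep only warnings and sort them by time. This is a small sketch; `recent_warnings` is a hypothetical wrapper and the default of 20 lines is arbitrary.

```shell
# Hypothetical wrapper: list the most recent Warning events across all
# namespaces, oldest first. Usage: recent_warnings [count]
recent_warnings() {
  kubectl get events -A --field-selector type=Warning \
    --sort-by=.metadata.creationTimestamp | tail -n "${1:-20}"
}
```

Because of the `tail`, the column header is dropped whenever there are more events than the requested count.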


🚀 Conclusion

Troubleshooting Kubernetes requires systematic debugging of pods, services, nodes, and control plane components. Using tools like kubectl logs, kubectl describe, and monitoring solutions like Prometheus can help detect and resolve issues quickly.
