The Problem With Learning Kubernetes
Everyone tells you to "learn Kubernetes."
So you read the docs. You watch YouTube. You follow a tutorial that deploys nginx. You feel great.
Then, after a relaxing weekend, you log in on Monday morning. A pod is stuck in CrashLoopBackOff. You stare at the terminal. You Google the error, ask an AI assistant, paste random commands. Thirty minutes later, you're still stuck.
Sound familiar?
Here's the thing: you don't really understand Kubernetes until something breaks.
And the best way to learn troubleshooting is to break things on purpose, in a safe environment, where you can take your time and actually understand what went wrong.
That's exactly why I built this.
Introducing: Troubleshoot Kubernetes Like a Pro
GitHub: https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro
It's a free, open-source collection of 35 real-world Kubernetes failure scenarios that you can simulate, investigate, and fix on your own cluster.
No custom Docker images. No complex setup. No cloud account required. Just a local Kubernetes cluster (Minikube, Kind, or Docker Desktop) and kubectl.
How It Works
Every scenario follows the same simple pattern:
1. Break it
```bash
kubectl apply -f issue.yaml
```
This creates a deliberately broken Kubernetes resource. A pod that crashes. A service that points to nothing. A container that runs out of memory.
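For illustration, a deliberately broken pod might look something like this (a hypothetical sketch in the spirit of the repo's issue.yaml files, not the actual manifest):

```yaml
# Hypothetical issue.yaml sketch: the container's command exits with a
# non-zero code immediately, so the kubelet restarts it with increasing
# backoff and the pod lands in CrashLoopBackOff.
apiVersion: v1
kind: Pod
metadata:
  name: crashloopbackoff-pod
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "echo 'boom' && exit 1"]
```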
2. Investigate it
```bash
kubectl get pods
kubectl describe pod <pod-name>
kubectl logs <pod-name>
```
Just like you would in production. No hints. No hand-holding. You figure it out.
3. Fix it
```bash
kubectl apply -f fix.yaml
```
The fix resolves the issue. You can compare the two YAML files to see exactly what changed and why.
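As a hypothetical example of what that comparison often reveals, the fix for a crashing container can be a one-line change in the manifest (sketch only, not the repo's actual fix.yaml):

```yaml
# Hypothetical fix.yaml sketch: same pod, but the command now runs a
# long-lived process instead of exiting, so the pod stays Running.
apiVersion: v1
kind: Pod
metadata:
  name: crashloopbackoff-pod
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "sleep infinity"]
```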
4. Understand it
Every scenario includes a description.md that explains:
- What the issue is
- What causes it in the real world
- How to identify it
- How to fix it
The 35 Scenarios
Here's what's inside, organized by category:
Scheduling Failures (Pod stuck in Pending)
| Scenario | What Happens |
|---|---|
| Affinity Rules Violation | Pod requires a node label that doesn't exist |
| Node Affinity Issue | Pod targets a non-existent node |
| Insufficient Resources | Pod requests more CPU/memory than available |
| Taints and Tolerations Mismatch | Pod lacks a toleration for a node's taint |
| Cluster Autoscaler Issues | Too many replicas for the cluster to handle |
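To make this category concrete, here is a hedged sketch of an affinity rule that can never be satisfied (the label key and value are made up for illustration):

```yaml
# Hypothetical sketch: this pod requires a node labeled disktype=ssd.
# If no node carries that label, the scheduler leaves the pod Pending,
# and `kubectl describe` shows a "didn't match Pod's node affinity" event.
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values: ["ssd"]
  containers:
    - name: app
      image: nginx:1.25
```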
Container Crashes
| Scenario | What Happens |
|---|---|
| CrashLoopBackOff | Container exits immediately with error |
| OOM Killed | Container exceeds memory limit, killed by cgroup |
| Wrong Container Command | Invalid command in container spec |
| CGroup Issues | Memory stress exceeds cgroup limits |
| Failed Resource Limits | Workload exceeds restrictive resource limits |
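An OOM kill, for instance, is easy to provoke: allocate more memory than the limit allows. A minimal sketch, assuming the commonly used `polinux/stress` image:

```yaml
# Hypothetical sketch: the workload allocates 128Mi against a 64Mi limit,
# so the cgroup OOM killer terminates it and the pod reports OOMKilled.
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
spec:
  containers:
    - name: stress
      image: polinux/stress
      resources:
        limits:
          memory: "64Mi"
      command: ["stress"]
      args: ["--vm", "1", "--vm-bytes", "128M", "--vm-hang", "1"]
```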
Image Problems
| Scenario | What Happens |
|---|---|
| Image Pull BackOff | Non-existent image |
| Image Pull Error | Private registry image without credentials |
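The first of these is the simplest failure in the whole collection to reproduce; a sketch with a made-up tag:

```yaml
# Hypothetical sketch: the tag doesn't exist, so the kubelet's pull fails
# and the pod cycles through ErrImagePull and ImagePullBackOff.
apiVersion: v1
kind: Pod
metadata:
  name: imagepull-demo
spec:
  containers:
    - name: app
      image: nginx:this-tag-does-not-exist
```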
Probe Failures
| Scenario | What Happens |
|---|---|
| Liveness Probe Failure | Probe hits wrong endpoint, container keeps restarting |
| Readiness Probe Failure | Probe fails, pod shows 0/1 Ready |
| Liveness & Readiness Failure | Both probes misconfigured |
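A liveness probe pointed at a path the app never serves is enough to trigger the restart loop; a hedged sketch (the endpoint is invented for illustration):

```yaml
# Hypothetical sketch: nginx serves / but not /no-such-endpoint, so every
# probe returns 404 and the kubelet keeps restarting the container.
apiVersion: v1
kind: Pod
metadata:
  name: liveness-demo
spec:
  containers:
    - name: web
      image: nginx:1.25
      livenessProbe:
        httpGet:
          path: /no-such-endpoint
          port: 80
        initialDelaySeconds: 5
        periodSeconds: 5
```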
Storage Issues
| Scenario | What Happens |
|---|---|
| Volume Mount Issue | Pod references a volume that doesn't exist |
| Persistent Volume Claim Issues | PVC can't bind to any PV |
| Disk IO Errors | HostPath points to non-existent directory |
| File Permissions on Mounted Volumes | Read-only filesystem blocks writes |
| Crash Due to Insufficient Disk Space | Ephemeral storage limit exceeded |
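An unbindable PVC is a classic example from this category; a sketch assuming a storage class name that no provisioner serves:

```yaml
# Hypothetical sketch: no provisioner handles "no-such-class", so the PVC
# stays Pending and any pod mounting it is stuck in ContainerCreating.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-demo
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: no-such-class
  resources:
    requests:
      storage: 1Gi
```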
Networking Issues
| Scenario | What Happens |
|---|---|
| DNS Resolution Failure | Custom DNS config points to invalid nameserver |
| Firewall Restriction | NetworkPolicy blocks all egress |
| Network Connectivity Issues | NetworkPolicy blocks pod traffic |
| Service Port Mismatch | Service port doesn't match container port |
| Ingress Configuration Issue | Ingress points to wrong host |
| LoadBalancer Misconfiguration | Service selector doesn't match any pods |
| Port Binding Issues | Container port conflict |
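The port-mismatch scenario is a good illustration of how subtle these can be, because everything looks healthy until you actually send traffic. A hedged sketch:

```yaml
# Hypothetical sketch: the Service forwards to targetPort 8080, but nginx
# listens on 80, so connections to the Service hang or are refused even
# though the pods show Running and Ready.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```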
Security & RBAC
| Scenario | What Happens |
|---|---|
| Service Account Permissions | Pod references non-existent ServiceAccount |
| Security Context Issues | Running as root vs non-root (best practice) |
| SELinux/AppArmor Policy Violation | Security policy configuration |
| PID Namespace Collision | Host PID namespace shared (security risk) |
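The PID namespace scenario is one where nothing visibly breaks, which is exactly the point; a sketch of the risky configuration:

```yaml
# Hypothetical sketch: hostPID: true lets the container see (and, with
# enough privileges, signal) every process on the node. The pod runs
# fine; the problem is the security exposure, not a crash.
apiVersion: v1
kind: Pod
metadata:
  name: hostpid-demo
spec:
  hostPID: true
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "infinity"]
```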
Other
| Scenario | What Happens |
|---|---|
| Container Runtime (CRI) Errors | RuntimeClass with non-existent handler |
| Resource Requests & Limits Mismatch | CPU limit less than request (API rejection) |
| Pod Disruption Budget Violations | PDB blocks voluntary disruptions |
| Outdated Kubernetes Version | Educational scenario about version management |
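The requests/limits mismatch is unusual in that it fails before a pod ever exists; a sketch of a spec the API server rejects at admission:

```yaml
# Hypothetical sketch: the CPU limit is lower than the request, which is
# invalid, so `kubectl apply` fails immediately with a validation error.
apiVersion: v1
kind: Pod
metadata:
  name: limits-demo
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          cpu: "500m"
        limits:
          cpu: "250m"
```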
Quick Start (5 Minutes)
Option 1: Use the interactive script
```bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro
./manage-scenarios.sh
```
Pick a scenario number. The script handles everything: it creates the issue, lets you investigate, then applies the fix.
Option 2: Run scenarios manually
```bash
cd scenarios/crashloopbackoff

# Create the problem
kubectl apply -f issue.yaml

# Investigate
kubectl get pods
kubectl describe pod crashloopbackoff-pod
kubectl logs crashloopbackoff-pod

# Fix it
kubectl delete -f issue.yaml
kubectl apply -f fix.yaml

# Verify
kubectl get pods
```
What You'll Actually Learn
After working through these scenarios, you'll be able to:
Read pod status and know exactly what's wrong - Pending means scheduling, CrashLoopBackOff means the app is failing, ImagePullBackOff means the image is wrong.
Use kubectl describe like a pro - The Events section at the bottom tells you everything. Failed scheduling, failed mounts, failed pulls, probe failures - it's all there.
Understand resource management - Requests vs limits, ephemeral storage, cgroup OOM kills, and why your pod got evicted.
Debug networking issues - DNS resolution, NetworkPolicy, service selectors, port mismatches, and ingress configuration.
Handle security configurations - ServiceAccounts, security contexts, PID namespaces, and runtime classes.
Who Is This For?
- Beginners who just finished a Kubernetes tutorial and want real practice
- Developers who deploy to Kubernetes but panic when something breaks
- DevOps engineers preparing for CKA/CKAD certification
- SREs who want to sharpen their troubleshooting instincts
- Teams who want to run Kubernetes troubleshooting workshops
A Note on Two Scenario Types
30 scenarios produce hard failures - you'll see Pending, CrashLoopBackOff, OOMKilled, Error, or ImagePullBackOff. These are obvious and satisfying to fix.
3 scenarios are educational - Security Context, SELinux, and Outdated K8s Version. Both the issue and fix pods run successfully. The learning is in understanding the security implications of the configuration difference.
2 scenarios require a CNI with NetworkPolicy support (like Calico or Cilium) to fully demonstrate blocked traffic.
Try It Today
```bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro
./manage-scenarios.sh
```
Star the repo if it helps you: https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro
Share it with someone who's learning Kubernetes. The best way to learn is to break things safely.
Built and maintained by Koti Vellanki. Contributions welcome!