Koti Vellanki

Why I Broke My Kubernetes Cluster 35 Times: I Did It So You Don't Have To

The Problem With Learning Kubernetes

Everyone tells you to "learn Kubernetes."

So you read the docs. You watch YouTube. You follow a tutorial that deploys nginx. You feel great.

Then, after a cheerful weekend, you log in Monday morning. A pod is stuck in CrashLoopBackOff. You stare at the terminal. You Google it, or ask a GPT model. You paste random commands. Thirty minutes later, you're still stuck.

Sound familiar?

Here's the thing: you don't really understand Kubernetes until something breaks.

And the best way to learn troubleshooting is to break things on purpose, in a safe environment, where you can take your time and actually understand what went wrong.

That's exactly why I built this.

Introducing: Troubleshoot Kubernetes Like a Pro

GitHub: https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro

It's a free, open-source collection of 35 real-world Kubernetes failure scenarios that you can simulate, investigate, and fix on your own cluster.

No custom Docker images. No complex setup. No cloud account required. Just a local Kubernetes cluster (Minikube, Kind, or Docker Desktop) and kubectl.

How It Works

Every scenario follows the same simple pattern:

1. Break it

```bash
kubectl apply -f issue.yaml
```

This creates a deliberately broken Kubernetes resource. A pod that crashes. A service that points to nothing. A container that runs out of memory.
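To make this concrete, here is a minimal sketch of what a deliberately broken manifest can look like. This is a hypothetical example in the spirit of the repo's scenarios, not the actual issue.yaml; the pod name and command are made up:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: crashloop-demo
spec:
  containers:
  - name: app
    image: busybox:1.36
    # The command exits non-zero immediately, so the kubelet
    # restarts the container with exponential backoff:
    # the classic CrashLoopBackOff.
    command: ["sh", "-c", "echo 'config file missing' >&2; exit 1"]
```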

2. Investigate it

```bash
kubectl get pods
kubectl describe pod <pod-name>
kubectl logs <pod-name>
```

Just like you would in production. No hints. No hand-holding. You figure it out.
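For a crashing pod, the Events section at the bottom of `kubectl describe` is usually where the answer lives. The output below is a trimmed, illustrative excerpt; exact messages vary by Kubernetes version:

```
Events:
  Type     Reason     Age                 From     Message
  ----     ------     ----                ----     -------
  Normal   Pulled     2m                  kubelet  Container image "busybox:1.36" already present on machine
  Warning  BackOff    30s (x10 over 2m)   kubelet  Back-off restarting failed container
```

A `Warning BackOff ... restarting failed container` event plus a non-zero exit code in `kubectl logs` points straight at the application, not the cluster.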

3. Fix it

```bash
kubectl apply -f fix.yaml
```

The fix resolves the issue. You can compare the two YAML files to see exactly what changed and why.
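Continuing the hypothetical CrashLoopBackOff example above, the fixed manifest might differ by a single line. Again, this is an illustrative sketch, not the repo's actual fix.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: crashloop-demo
spec:
  containers:
  - name: app
    image: busybox:1.36
    # A long-running command: the container stays up
    # instead of exiting, so the restart loop stops.
    command: ["sh", "-c", "sleep 3600"]
```

Diffing the two files (`diff issue.yaml fix.yaml`) makes the root cause obvious in a way that staring at pod status never does.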

4. Understand it

Every scenario includes a description.md that explains:

  • What the issue is
  • What causes it in the real world
  • How to identify it
  • How to fix it

The 35 Scenarios

Here's what's inside, organized by category:

Scheduling Failures (Pod stuck in Pending)

| Scenario | What Happens |
| --- | --- |
| Affinity Rules Violation | Pod requires a node label that doesn't exist |
| Node Affinity Issue | Pod targets a non-existent node |
| Insufficient Resources | Pod requests more CPU/memory than available |
| Taints and Tolerations Mismatch | Pod lacks a toleration for a tainted node |
| Cluster Autoscaler Issues | Too many replicas for the cluster to handle |
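A scheduling failure of this family can be reproduced with a few lines. The sketch below (a hypothetical example, not one of the repo's manifests; the `disktype: ssd` label is made up) leaves the pod in Pending because no node carries the requested label:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pending-demo
spec:
  # No node has this label, so the scheduler finds
  # zero candidates and the pod stays Pending.
  nodeSelector:
    disktype: ssd
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep", "3600"]
```

`kubectl describe pod pending-demo` will show a `FailedScheduling` event explaining which predicate failed.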

Container Crashes

| Scenario | What Happens |
| --- | --- |
| CrashLoopBackOff | Container exits immediately with error |
| OOM Killed | Container exceeds memory limit, killed by cgroup |
| Wrong Container Command | Invalid command in container spec |
| CGroup Issues | Memory stress exceeds cgroup limits |
| Failed Resource Limits | Workload exceeds restrictive resource limits |
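The OOM-related scenarios all hinge on the `resources` stanza. A minimal sketch of the relevant part of a pod spec (hypothetical values, not copied from the repo):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh", "-c", "sleep 3600"]
    resources:
      requests:
        memory: "32Mi"   # what the scheduler reserves on the node
      limits:
        memory: "64Mi"   # the cgroup hard cap; exceed it and the
                         # kernel OOM-kills the container
```

If the process allocates past the limit, `kubectl get pods` shows the last state as `OOMKilled`, which is your cue to look at limits rather than application logic.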

Image Problems

| Scenario | What Happens |
| --- | --- |
| Image Pull BackOff | Non-existent image |
| Image Pull Error | Private registry image without credentials |

Probe Failures

| Scenario | What Happens |
| --- | --- |
| Liveness Probe Failure | Probe hits wrong endpoint, container keeps restarting |
| Readiness Probe Failure | Probe fails, pod shows 0/1 Ready |
| Liveness & Readiness Failure | Both probes misconfigured |
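A misconfigured liveness probe looks deceptively healthy in the manifest. The sketch below (a hypothetical example; the path is made up) probes an endpoint nginx doesn't serve, so every check returns 404 and the kubelet restarts the container on a loop:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-demo
spec:
  containers:
  - name: web
    image: nginx:1.25
    livenessProbe:
      httpGet:
        path: /does-not-exist   # nginx returns 404 -> probe fails
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 3       # 3 failures -> container restarted
```

The telltale symptom: the app itself is fine, but the restart count keeps climbing and `describe` shows `Liveness probe failed: HTTP probe failed with statuscode: 404`-style warnings.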

Storage Issues

| Scenario | What Happens |
| --- | --- |
| Volume Mount Issue | Pod references a volume that doesn't exist |
| Persistent Volume Claim Issues | PVC can't bind to any PV |
| Disk IO Errors | HostPath points to non-existent directory |
| File Permissions on Mounted Volumes | Read-only filesystem blocks writes |
| Crash Due to Insufficient Disk Space | Ephemeral storage limit exceeded |

Networking Issues

| Scenario | What Happens |
| --- | --- |
| DNS Resolution Failure | Custom DNS config points to invalid nameserver |
| Firewall Restriction | NetworkPolicy blocks all egress |
| Network Connectivity Issues | NetworkPolicy blocks pod traffic |
| Service Port Mismatch | Service port doesn't match container port |
| Ingress Configuration Issue | Ingress points to wrong host |
| LoadBalancer Misconfiguration | Service selector doesn't match any pods |
| Port Binding Issues | Container port conflict |
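The port-mismatch scenario is worth internalizing because nothing in pod status looks wrong; only connections fail. A hypothetical sketch (the names and ports are made up, not the repo's manifest):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web       # must match the pod's labels, or Endpoints is empty
  ports:
  - port: 80       # the port clients connect to on the Service
    targetPort: 8080  # forwarded to the container -- but if the app
                      # actually listens on 80, connections are refused
```

`kubectl get endpoints web` is the fastest check here: an empty endpoints list means a selector problem, while populated endpoints with refused connections point at a `targetPort` mismatch.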

Security & RBAC

| Scenario | What Happens |
| --- | --- |
| Service Account Permissions | Pod references non-existent ServiceAccount |
| Security Context Issues | Running as root vs non-root (best practice) |
| SELinux/AppArmor Policy Violation | Security policy configuration |
| PID Namespace Collision | Host PID namespace shared (security risk) |

Other

| Scenario | What Happens |
| --- | --- |
| Container Runtime (CRI) Errors | RuntimeClass with non-existent handler |
| Resource Requests & Limits Mismatch | CPU limit less than request (API rejection) |
| Pod Disruption Budget Violations | PDB blocks voluntary disruptions |
| Outdated Kubernetes Version | Educational scenario about version management |

Quick Start (5 Minutes)

Option 1: Use the interactive script

```bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro
./manage-scenarios.sh
```

Pick a scenario number. The script handles everything: it creates the issue, lets you investigate, then applies the fix.

Option 2: Run scenarios manually

```bash
cd scenarios/crashloopbackoff

# Create the problem
kubectl apply -f issue.yaml

# Investigate
kubectl get pods
kubectl describe pod crashloopbackoff-pod
kubectl logs crashloopbackoff-pod

# Fix it
kubectl delete -f issue.yaml
kubectl apply -f fix.yaml

# Verify
kubectl get pods
```

What You'll Actually Learn

After working through these scenarios, you'll be able to:

  • Read pod status and know exactly what's wrong - Pending means scheduling, CrashLoopBackOff means the app is failing, ImagePullBackOff means the image is wrong.

  • Use kubectl describe like a pro - The Events section at the bottom tells you everything. Failed scheduling, failed mounts, failed pulls, probe failures - it's all there.

  • Understand resource management - Requests vs limits, ephemeral storage, cgroup OOM kills, and why your pod got evicted.

  • Debug networking issues - DNS resolution, NetworkPolicy, service selectors, port mismatches, and ingress configuration.

  • Handle security configurations - ServiceAccounts, security contexts, PID namespaces, and runtime classes.

Who Is This For?

  • Beginners who just finished a Kubernetes tutorial and want real practice
  • Developers who deploy to Kubernetes but panic when something breaks
  • DevOps engineers preparing for CKA/CKAD certification
  • SREs who want to sharpen their troubleshooting instincts
  • Teams who want to run Kubernetes troubleshooting workshops

A Note on Two Scenario Types

30 scenarios produce hard failures - you'll see Pending, CrashLoopBackOff, OOMKilled, Error, or ImagePullBackOff. These are obvious and satisfying to fix.

3 scenarios are educational - Security Context, SELinux, and Outdated K8s Version. Both the issue and fix pods run successfully. The learning is in understanding the security implications of the configuration difference.

2 scenarios require a CNI with NetworkPolicy support (like Calico or Cilium) to fully demonstrate blocked traffic.

Try It Today

```bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro
./manage-scenarios.sh
```

Star the repo if it helps you: https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro

Share it with someone who's learning Kubernetes. The best way to learn is to break things safely.

Built and maintained by Koti Vellanki. Contributions welcome!
