Koti Vellanki

Why I Broke My Kubernetes Cluster 35 Times: I Did It So You Don't Have To

The Problem With Learning Kubernetes

Everyone tells you to "learn Kubernetes."

So you read the docs. You watch YouTube. You follow a tutorial that deploys nginx. You feel great.

Then, after a cheerful weekend, you log in Monday morning. A pod is stuck in CrashLoopBackOff. You stare at the terminal. You Google it, or ask a GPT model. You paste random commands. Thirty minutes later, you're still stuck.

Sound familiar?

Here's the thing: you don't really understand Kubernetes until something breaks.

And the best way to learn troubleshooting is to break things on purpose, in a safe environment, where you can take your time and actually understand what went wrong.

That's exactly why I built this.

Introducing: Troubleshoot Kubernetes Like a Pro

GitHub: https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro

It's a free, open-source collection of 35 real-world Kubernetes failure scenarios that you can simulate, investigate, and fix on your own cluster.

No custom Docker images. No complex setup. No cloud account required. Just a local Kubernetes cluster (Minikube, Kind, or Docker Desktop) and kubectl.

How It Works

Every scenario follows the same simple pattern:

1. Break it

```bash
kubectl apply -f issue.yaml
```

This creates a deliberately broken Kubernetes resource. A pod that crashes. A service that points to nothing. A container that runs out of memory.
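To make this concrete, here is a minimal sketch of what a deliberately broken manifest can look like. This is a hypothetical example in the spirit of the repo's scenarios, not the actual issue.yaml; the pod name and command are made up:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: crashloop-demo
spec:
  containers:
  - name: app
    image: busybox:1.36
    # The command exits non-zero immediately, so the kubelet
    # restarts the container with exponential backoff:
    # the classic CrashLoopBackOff.
    command: ["sh", "-c", "echo 'config file missing' >&2; exit 1"]
```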

2. Investigate it

```bash
kubectl get pods
kubectl describe pod <pod-name>
kubectl logs <pod-name>
```

Just like you would in production. No hints. No hand-holding. You figure it out.
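For a crashing pod, the Events section at the bottom of `kubectl describe` is usually where the answer lives. The output below is a trimmed, illustrative excerpt; exact messages vary by Kubernetes version:

```
Events:
  Type     Reason     Age                 From     Message
  ----     ------     ----                ----     -------
  Normal   Pulled     2m                  kubelet  Container image "busybox:1.36" already present on machine
  Warning  BackOff    30s (x10 over 2m)   kubelet  Back-off restarting failed container
```

A `Warning BackOff ... restarting failed container` event plus a non-zero exit code in `kubectl logs` points straight at the application, not the cluster.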

3. Fix it

```bash
kubectl apply -f fix.yaml
```

The fix resolves the issue. You can compare the two YAML files to see exactly what changed and why.
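Continuing the hypothetical CrashLoopBackOff example above, the fixed manifest might differ by a single line. Again, this is an illustrative sketch, not the repo's actual fix.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: crashloop-demo
spec:
  containers:
  - name: app
    image: busybox:1.36
    # A long-running command: the container stays up
    # instead of exiting, so the restart loop stops.
    command: ["sh", "-c", "sleep 3600"]
```

Diffing the two files (`diff issue.yaml fix.yaml`) makes the root cause obvious in a way that staring at pod status never does.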

4. Understand it

Every scenario includes a description.md that explains:

  • What the issue is
  • What causes it in the real world
  • How to identify it
  • How to fix it

The 35 Scenarios

Here's what's inside, organized by category:

Scheduling Failures (Pod stuck in Pending)

| Scenario | What Happens |
| --- | --- |
| Affinity Rules Violation | Pod requires a node label that doesn't exist |
| Node Affinity Issue | Pod targets a non-existent node |
| Insufficient Resources | Pod requests more CPU/memory than available |
| Taints and Tolerations Mismatch | Pod lacks a toleration for a tainted node |
| Cluster Autoscaler Issues | Too many replicas for the cluster to handle |
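A scheduling failure of this family can be reproduced with a few lines. The sketch below (a hypothetical example, not one of the repo's manifests; the `disktype: ssd` label is made up) leaves the pod in Pending because no node carries the requested label:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pending-demo
spec:
  # No node has this label, so the scheduler finds
  # zero candidates and the pod stays Pending.
  nodeSelector:
    disktype: ssd
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep", "3600"]
```

`kubectl describe pod pending-demo` will show a `FailedScheduling` event explaining which predicate failed.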

Container Crashes

| Scenario | What Happens |
| --- | --- |
| CrashLoopBackOff | Container exits immediately with error |
| OOM Killed | Container exceeds memory limit, killed by cgroup |
| Wrong Container Command | Invalid command in container spec |
| CGroup Issues | Memory stress exceeds cgroup limits |
| Failed Resource Limits | Workload exceeds restrictive resource limits |
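The OOM-related scenarios all hinge on the `resources` stanza. A minimal sketch of the relevant part of a pod spec (hypothetical values, not copied from the repo):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh", "-c", "sleep 3600"]
    resources:
      requests:
        memory: "32Mi"   # what the scheduler reserves on the node
      limits:
        memory: "64Mi"   # the cgroup hard cap; exceed it and the
                         # kernel OOM-kills the container
```

If the process allocates past the limit, `kubectl get pods` shows the last state as `OOMKilled`, which is your cue to look at limits rather than application logic.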

Image Problems

| Scenario | What Happens |
| --- | --- |
| Image Pull BackOff | Non-existent image |
| Image Pull Error | Private registry image without credentials |

Probe Failures

| Scenario | What Happens |
| --- | --- |
| Liveness Probe Failure | Probe hits wrong endpoint, container keeps restarting |
| Readiness Probe Failure | Probe fails, pod shows 0/1 Ready |
| Liveness & Readiness Failure | Both probes misconfigured |
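A misconfigured liveness probe looks deceptively healthy in the manifest. The sketch below (a hypothetical example; the path is made up) probes an endpoint nginx doesn't serve, so every check returns 404 and the kubelet restarts the container on a loop:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-demo
spec:
  containers:
  - name: web
    image: nginx:1.25
    livenessProbe:
      httpGet:
        path: /does-not-exist   # nginx returns 404 -> probe fails
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 3       # 3 failures -> container restarted
```

The telltale symptom: the app itself is fine, but the restart count keeps climbing and `describe` shows `Liveness probe failed: HTTP probe failed with statuscode: 404`-style warnings.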

Storage Issues

| Scenario | What Happens |
| --- | --- |
| Volume Mount Issue | Pod references a volume that doesn't exist |
| Persistent Volume Claim Issues | PVC can't bind to any PV |
| Disk IO Errors | HostPath points to non-existent directory |
| File Permissions on Mounted Volumes | Read-only filesystem blocks writes |
| Crash Due to Insufficient Disk Space | Ephemeral storage limit exceeded |

Networking Issues

| Scenario | What Happens |
| --- | --- |
| DNS Resolution Failure | Custom DNS config points to invalid nameserver |
| Firewall Restriction | NetworkPolicy blocks all egress |
| Network Connectivity Issues | NetworkPolicy blocks pod traffic |
| Service Port Mismatch | Service port doesn't match container port |
| Ingress Configuration Issue | Ingress points to wrong host |
| LoadBalancer Misconfiguration | Service selector doesn't match any pods |
| Port Binding Issues | Container port conflict |
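The port-mismatch scenario is worth internalizing because nothing in pod status looks wrong; only connections fail. A hypothetical sketch (the names and ports are made up, not the repo's manifest):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web       # must match the pod's labels, or Endpoints is empty
  ports:
  - port: 80       # the port clients connect to on the Service
    targetPort: 8080  # forwarded to the container -- but if the app
                      # actually listens on 80, connections are refused
```

`kubectl get endpoints web` is the fastest check here: an empty endpoints list means a selector problem, while populated endpoints with refused connections point at a `targetPort` mismatch.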

Security & RBAC

| Scenario | What Happens |
| --- | --- |
| Service Account Permissions | Pod references non-existent ServiceAccount |
| Security Context Issues | Running as root vs non-root (best practice) |
| SELinux/AppArmor Policy Violation | Security policy configuration |
| PID Namespace Collision | Host PID namespace shared (security risk) |

Other

| Scenario | What Happens |
| --- | --- |
| Container Runtime (CRI) Errors | RuntimeClass with non-existent handler |
| Resource Requests & Limits Mismatch | CPU limit less than request (API rejection) |
| Pod Disruption Budget Violations | PDB blocks voluntary disruptions |
| Outdated Kubernetes Version | Educational scenario about version management |

Quick Start (5 Minutes)

Option 1: Use the interactive script

```bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro
./manage-scenarios.sh
```

Pick a scenario number. The script handles everything: it creates the issue, lets you investigate, then applies the fix.

Option 2: Run scenarios manually

```bash
cd scenarios/crashloopbackoff

# Create the problem
kubectl apply -f issue.yaml

# Investigate
kubectl get pods
kubectl describe pod crashloopbackoff-pod
kubectl logs crashloopbackoff-pod

# Fix it
kubectl delete -f issue.yaml
kubectl apply -f fix.yaml

# Verify
kubectl get pods
```

What You'll Actually Learn

After working through these scenarios, you'll be able to:

  • Read pod status and know exactly what's wrong - Pending means scheduling, CrashLoopBackOff means the app is failing, ImagePullBackOff means the image is wrong.

  • Use kubectl describe like a pro - The Events section at the bottom tells you everything. Failed scheduling, failed mounts, failed pulls, probe failures - it's all there.

  • Understand resource management - Requests vs limits, ephemeral storage, cgroup OOM kills, and why your pod got evicted.

  • Debug networking issues - DNS resolution, NetworkPolicy, service selectors, port mismatches, and ingress configuration.

  • Handle security configurations - ServiceAccounts, security contexts, PID namespaces, and runtime classes.

Who Is This For?

  • Beginners who just finished a Kubernetes tutorial and want real practice
  • Developers who deploy to Kubernetes but panic when something breaks
  • DevOps engineers preparing for CKA/CKAD certification
  • SREs who want to sharpen their troubleshooting instincts
  • Teams who want to run Kubernetes troubleshooting workshops

A Note on Two Scenario Types

30 scenarios produce hard failures - you'll see Pending, CrashLoopBackOff, OOMKilled, Error, or ImagePullBackOff. These are obvious and satisfying to fix.

3 scenarios are educational - Security Context, SELinux, and Outdated K8s Version. Both the issue and fix pods run successfully. The learning is in understanding the security implications of the configuration difference.

2 scenarios require a CNI with NetworkPolicy support (like Calico or Cilium) to fully demonstrate blocked traffic.

Try It Today

```bash
git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
cd troubleshoot-kubernetes-like-a-pro
./manage-scenarios.sh
```

Star the repo if it helps you: https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro

Share it with someone who's learning Kubernetes. The best way to learn is to break things safely.

Built and maintained by Koti Vellanki. Contributions welcome!
