There's an old SRE adage: "Hope is not a strategy." Yet most engineering teams only discover how their systems fail under pressure when that pressure is real, unplanned, and 2 AM on a Saturday. Production outages are expensive teachers.
The alternative is to make failure boring — to rehearse it so often that when it actually happens, your team moves through the recovery playbook on autopilot. That's the idea behind prod-maintenance-drills: a self-hosted Kubernetes environment where you deliberately break things to learn how to fix them.
Why Drills Matter
Chaos engineering, popularized by Netflix's Chaos Monkey, is the discipline of intentionally introducing failures into a system to build confidence in its ability to withstand turbulent, unexpected conditions. But you don't need a Netflix-scale infrastructure to benefit from it.
Even on a local Kubernetes cluster with a handful of pods, running structured drills teaches you things you can't learn from diagrams or documentation:
- How fast does your deployment actually recover after a pod crash?
- Does your application handle a database restart gracefully, or does it need a manual restart too?
- At what CPU threshold does your HPA kick in — and does it kick in fast enough?
- When disk fills up, do your alerts fire before the application starts failing?

Running drills answers these questions with evidence, not assumptions.
The FastAPI application exposes two endpoints: / for a health check and /db to verify database connectivity. These simple endpoints become the canary in the coal mine — you watch them during drills to confirm the system has recovered.
The monitoring stack runs kube-prometheus-stack via Helm, giving you Prometheus scraping, Grafana dashboards, and Alertmanager rules all preconfigured. You get real-time visibility into pod restarts, CPU usage, and database status without having to wire anything up manually.
The Five Drills
- app_crash.sh: Deletes a running pod to test Kubernetes self-healing and deployment recovery time.
- db_failure.sh: Kills the PostgreSQL pod to validate application reconnection behavior.
- high_load.sh: Schedules a CPU stress job to trigger horizontal pod autoscaling at the 60% threshold.
- backup.sh: Creates a timestamped PostgreSQL dump to practice backup and restore procedures.
- disk_fill.sh: Simulates disk exhaustion on a node and verifies that monitoring alerts fire correctly.
Setting It Up
Prerequisites
You'll need: kind or minikube, kubectl, docker, and helm. That's it: no cloud account required.
1. Spin up the cluster

```bash
# Create a local kind cluster
kind create cluster --name simple-k8s
```
2. Install the monitoring stack

```bash
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
```
> Tip: For kind clusters, you'll also need to install metrics-server and add --kubelet-insecure-tls to its deployment args; otherwise the HPA won't be able to read resource metrics.
3. Build and load the application image

```bash
# Build the FastAPI image
cd app && docker build -t prod-app:latest . && cd ..

# Load it into kind (not needed for cloud clusters)
kind load docker-image prod-app:latest --name simple-k8s
```
4. Deploy everything

```bash
kubectl apply -f k8s/

# Verify everything came up
kubectl get pods -n prod
kubectl get pods -n monitoring
kubectl get hpa -n prod
```
Once the pods are running, grab the node IP and hit your endpoints:
```bash
NODE_IP=$(kubectl get nodes -o jsonpath=\
'{.items[0].status.addresses[?(@.type=="InternalIP")].address}')

curl http://$NODE_IP:30007/
# → {"status":"running"}

curl http://$NODE_IP:30007/db
# → {"db":"connected"}
```
Running Your First Drill
Let's walk through the pod crash drill end to end, because it's the cleanest example of what these drills teach you.
Open two terminal windows. In the first, start watching your pods:
```bash
watch kubectl get pods -n prod
```

In the second, run the drill:

```bash
./app_crash.sh
```
You'll see one pod disappear from the watch window. Within seconds — typically under 30 — Kubernetes will have scheduled a replacement. That's the Deployment controller doing its job.
Now do it again. And again. Notice the restart count climb in Prometheus. Notice the brief dip in the up{namespace="prod"} metric. This is what your monitoring dashboards look like during an incident. Seeing it in a drill is far less stressful than seeing it at 2 AM for the first time.
Prometheus Queries for Drills
Monitor these metrics live while running drills:
```
sum(kube_pod_container_status_restarts_total{namespace="prod"}) by (pod)
sum(rate(container_cpu_usage_seconds_total{namespace="prod"}[1m])) by (pod)
up{namespace="prod"}
```
The HPA Drill — Watching Your System Scale
The high_load.sh drill is especially satisfying because you get to watch autoscaling happen in real time. The HPA is configured with a minimum of 2 replicas, a maximum of 5, and a target CPU utilization of 60%.
```bash
# In one terminal: watch the HPA
watch kubectl get hpa -n prod

# In another: trigger the load
./high_load.sh
```
You'll see the TARGETS column climb past 60%, and within a minute or two the REPLICAS column will tick up from 2. The load job eventually completes, and the HPA scales back down after the cooldown period.
This drill builds intuition for how long scaling takes end-to-end: from metric collection, to HPA decision, to pod scheduling, to readiness. That latency matters when you're sizing your HPA thresholds for real production traffic spikes.
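The replica count you see during the drill follows the standard Kubernetes HPA formula: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A quick back-of-the-envelope helper (illustrative, not part of the repo) makes the drill's numbers predictable:

```python
import math

def desired_replicas(current_replicas: int, current_util: float, target_util: float) -> int:
    # Standard HPA scaling formula: ceil(currentReplicas * currentMetric / targetMetric)
    return math.ceil(current_replicas * current_util / target_util)

# With 2 replicas at 150% average CPU against the 60% target, the HPA asks for 5 pods,
# which is exactly this project's configured maximum
print(desired_replicas(2, 150, 60))  # → 5
```

Running the numbers before a drill tells you what REPLICAS value to expect in the watch window, so any deviation points at metric lag or scheduling delay rather than the formula.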
The Database Failure Drill — Resilience by Default?
This one is the most revealing. When PostgreSQL restarts, does your FastAPI application reconnect automatically, or does it need a restart too?
```bash
./db_failure.sh
```
While the PostgreSQL pod is down, hitting /db should return a connectivity error. When it comes back, your application should reconnect on the next request — if your database connection pool is configured to retry.
If it doesn't reconnect automatically, you've just discovered a resilience gap before it cost you. Fix it: use a connection pool with reconnect logic, add retry wrappers around database calls, or configure proper liveness/readiness probes that cycle the app pod when the DB is unreachable.
Connection Resilience Checklist

After the DB drill, verify:

1. App eventually reconnects without manual intervention.
2. Readiness probe correctly marks the pod as not-ready while the DB is down.
3. Prometheus alert fires within your SLO window.
4. The alert resolves automatically after recovery.
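One way to close a reconnection gap is a retry wrapper with exponential backoff around database calls. A minimal sketch, with the flaky operation injected so the pattern is clear in isolation (function names are illustrative):

```python
import time

def with_retry(op, attempts=5, base_delay=0.5):
    # Retry an operation on connection errors with exponential backoff;
    # re-raise after the final attempt so the caller still sees hard failures
    for i in range(attempts):
        try:
            return op()
        except ConnectionError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

# Simulated flaky database: fails twice (DB restarting), then succeeds
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("db restarting")
    return "ok"

print(with_retry(flaky_query, base_delay=0.01))  # → ok
```

In the real app the retried operation would be the psycopg2 connect-and-query, and the backoff ceiling should stay well under your readiness probe's failure threshold so the pod isn't cycled while a retry would have succeeded.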
Integrating With Your Workflow
Drills are most valuable when they're scheduled, not spontaneous. A few patterns that work well:
- Weekly game days: Block 30 minutes every week for one drill, rotating through the five scenarios. Document observations and improvements in a shared runbook.
- Pre-release validation: Run the full suite before any major deployment. If your new release doesn't survive a pod crash drill, it's not ready.
- Onboarding tool: New engineers run the drills in their first week. There's no better way to learn a system than to watch it fail and recover.
- CI gate: In staging, run app_crash.sh as part of your pipeline and fail the build if recovery takes longer than your SLO allows.
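A CI gate can be as simple as polling until the deployment is ready again and failing on timeout. A sketch with the readiness check injected so the timing logic stands alone — in a real pipeline the predicate would shell out to kubectl (the function names here are illustrative):

```python
import time

def wait_until_recovered(is_ready, timeout=60.0, interval=0.5):
    # Poll a readiness predicate; return elapsed seconds, or raise on SLO breach
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if is_ready():
            return time.monotonic() - start
        time.sleep(interval)
    raise TimeoutError("recovery exceeded SLO")

# Example: a check that flips to ready after roughly 0.2 s
deadline = time.monotonic() + 0.2
elapsed = wait_until_recovered(lambda: time.monotonic() >= deadline,
                               timeout=5, interval=0.02)
print(f"recovered in {elapsed:.2f}s")
```

In staging, is_ready might compare the deployment's ready-replica count from kubectl against the desired count; the drill script runs first, then this gate decides pass or fail against your recovery SLO.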
What's Next
The current drill set covers the most common failure modes. Here are some directions to extend it:
- Network partition drill: Use a network policy to block traffic between the app and the database for a set duration, simulating a network split.
- Memory pressure: A complement to the CPU drill — fill pod memory to trigger OOM kills and test restart behavior.
- Rolling update with failure injection: Trigger a deployment rollout while simultaneously running the crash drill to validate zero-downtime deploys.
- Restore drill: Pair backup.sh with a corresponding restore.sh that brings the database back from a backup and validates data integrity.
- Multi-node scenarios: With a multi-node kind cluster, add a node drain drill to practice pod eviction and rescheduling.