Chaos Engineering Toolkit
Build confidence in your production systems by breaking them on purpose. This toolkit provides ready-to-run chaos experiment designs, Litmus and Gremlin configurations, failure injection scripts, and game day planning templates that let your team practice incident response before real outages happen. Every experiment includes a hypothesis, steady-state definition, rollback procedure, and blast radius controls — because chaos without discipline is just an outage.
Key Features
- 12 pre-built experiments — Network latency, pod kill, CPU stress, disk fill, DNS failure, zone outage, and more
- Litmus ChaosEngine manifests — Drop-in YAML for LitmusChaos with tunable parameters and abort conditions
- Gremlin attack configs — JSON configs for Gremlin's API covering infrastructure and application-layer attacks
- Game day planner — Markdown templates for planning, executing, and debriefing chaos game days
- Blast radius calculator — Python script that estimates impact scope before running experiments
- Steady-state validation — Prometheus queries to verify system health before, during, and after experiments
- Automated rollback hooks — Shell scripts that abort experiments when error budgets are breached
Quick Start
unzip chaos-engineering-toolkit.zip && cd chaos-engineering-toolkit/

# Dry-run a pod-kill experiment (validates without executing)
python3 src/chaos_engineering_toolkit/core.py \
  --experiment pod-kill \
  --target-namespace staging --dry-run

# Run the blast radius calculator
python3 src/chaos_engineering_toolkit/utils.py blast-radius \
  --experiment network-latency --target-service api-gateway
Architecture / How It Works
PLAN → VALIDATE → EXECUTE → ANALYZE
  │                    │
  └── Abort if SLO ────┘
      budget breached
- Plan — Define hypothesis ("API latency stays under 200ms when cache nodes fail"), set blast radius limits
- Validate — Prometheus queries confirm the system is in steady state before injecting failure
- Execute — Apply chaos manifest (Litmus ChaosEngine or Gremlin API call) with abort conditions
- Analyze — Compare during-experiment metrics to baseline, record findings
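The four phases above can be sketched as a small driver loop. This is only an illustration of the control flow, not the toolkit's own code: `run_experiment`, `ExperimentResult`, and the four callables are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ExperimentResult:
    status: str  # "skipped", "aborted", or "completed"
    phases: List[str] = field(default_factory=list)


def run_experiment(
    steady_state_ok: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    slo_breached: Callable[[], bool],
    checks: int = 3,
) -> ExperimentResult:
    """Hypothetical driver for the plan/validate/execute/analyze phases."""
    result = ExperimentResult(status="skipped", phases=["plan"])

    # VALIDATE: refuse to inject failure into an already unhealthy system.
    result.phases.append("validate")
    if not steady_state_ok():
        return result

    # EXECUTE: inject the failure, then poll the abort condition.
    result.phases.append("execute")
    inject()
    try:
        for _ in range(checks):
            if slo_breached():
                result.status = "aborted"
                return result
    finally:
        # Roll back whether the run finished, aborted, or raised.
        rollback()

    # ANALYZE: only reached when no abort condition fired.
    result.phases.append("analyze")
    result.status = "completed"
    return result


events = []
outcome = run_experiment(
    steady_state_ok=lambda: True,
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    slo_breached=lambda: True,  # simulate an immediate SLO breach
)
print(outcome.status, events)
```

Putting the rollback in a `finally` block is a design choice: the injected failure is cleaned up on every exit path, including an unexpected exception during the run.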
Usage Examples
Litmus ChaosEngine — Pod Kill Experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-experiment
  namespace: staging
spec:
  engineState: active              # mandatory field; set to "stop" to halt the run
  appinfo:
    appns: staging
    applabel: app=api-gateway
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # total run time, in seconds
              value: "30"
            - name: CHAOS_INTERVAL         # seconds between deletions
              value: "10"
            - name: PODS_AFFECTED_PERC     # share of matching pods to target
              value: "50"
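To get a feel for what these three values mean for a given deployment, the arithmetic can be sketched in a few lines. This is a rough estimate, not Litmus code: it assumes the percentage is rounded down with at least one pod targeted, and one deletion round per interval. The replica count of 4 is a made-up example, and `pod_delete_plan` is a hypothetical helper.

```python
def pod_delete_plan(replicas: int, affected_pct: int,
                    total_duration_s: int, interval_s: int) -> dict:
    """Estimate the scope of a pod-delete run from its env values.

    Assumptions: the percentage is rounded down, at least one pod is
    always targeted, and one deletion round happens per interval.
    """
    if replicas < 1 or interval_s < 1:
        raise ValueError("replicas and interval_s must be positive")
    pods_per_round = min(replicas, max(1, replicas * affected_pct // 100))
    rounds = max(1, total_duration_s // interval_s)
    return {
        "pods_per_round": pods_per_round,
        "rounds": rounds,
        "pods_left_per_round": replicas - pods_per_round,
    }


# Values from the manifest above, applied to a hypothetical 4-replica deployment.
plan = pod_delete_plan(replicas=4, affected_pct=50,
                       total_duration_s=30, interval_s=10)
print(plan)
```

Under the same assumptions, a single-replica deployment leaves zero pods standing in each round, which is worth knowing before the experiment starts.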
Steady-State Validation with Prometheus
from chaos_engineering_toolkit.core import SteadyStateValidator

validator = SteadyStateValidator(
    prometheus_url="https://prometheus.example.com"
)

# Define steady-state conditions
conditions = [
    {
        "name": "api_latency_p99",
        # Aggregate buckets by `le` so the query returns a single p99 for the service
        "query": 'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])))',
        "operator": "less_than",
        "threshold": 0.200,  # seconds
    },
    {
        "name": "error_rate",
        # Sum both sides: dividing the raw vectors only matches 5xx series against themselves
        "query": 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
        "operator": "less_than",
        "threshold": 0.01,  # 1% of requests
    },
]

result = validator.check(conditions)
if result.all_passed:
    print("Steady state confirmed: safe to proceed with experiment")
else:
    print(f"ABORT: {result.failures}")
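How the toolkit evaluates these conditions internally is not shown here, but the comparison step could look roughly like the sketch below. `evaluate`, `OPERATORS`, and the `"greater_than"` operator are assumptions made for illustration; only `"less_than"` appears in the example above.

```python
import operator

# Maps the "operator" strings used in the conditions to comparisons.
# Only "less_than" appears in the example above; "greater_than" is assumed.
OPERATORS = {
    "less_than": operator.lt,
    "greater_than": operator.gt,
}


def evaluate(conditions, observed):
    """Return the names of conditions that fail for the observed values.

    `observed` maps a condition name to the number its query returned.
    A missing value counts as a failure: no data is not steady state.
    NaN (for example 0/0 when there is no traffic) also fails, because
    every comparison against NaN is False.
    """
    failures = []
    for cond in conditions:
        value = observed.get(cond["name"])
        compare = OPERATORS[cond["operator"]]
        if value is None or not compare(value, cond["threshold"]):
            failures.append(cond["name"])
    return failures


sample_conditions = [
    {"name": "api_latency_p99", "operator": "less_than", "threshold": 0.200},
    {"name": "error_rate", "operator": "less_than", "threshold": 0.01},
]
# Hypothetical readings: latency is over the threshold, error rate is fine.
print(evaluate(sample_conditions, {"api_latency_p99": 0.250, "error_rate": 0.002}))
```

Treating missing or NaN readings as failures keeps the check conservative: an experiment only starts when every condition is positively confirmed.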
Automated Rollback Script
#!/usr/bin/env bash
# rollback.sh: abort the experiment if the error budget is breached

# Remaining error budget as a ratio (0.10 = 10%)
ERROR_BUDGET=$(curl -s "https://prometheus.example.com/api/v1/query" \
  --data-urlencode 'query=slo:error_budget_remaining:ratio{service="api-gateway"}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['data']['result'][0]['value'][1])")

if (( $(echo "$ERROR_BUDGET < 0.10" | bc -l) )); then
  echo "ERROR BUDGET BELOW 10%: aborting experiment"
  kubectl delete chaosengine pod-kill-experiment -n staging
  exit 1
fi

# The value is a ratio, not a percentage, so report it as one
echo "Error budget ratio at ${ERROR_BUDGET} (floor 0.10): experiment continues"
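The rollback script above reads `result[0]` directly, so an empty query result leaves it with nothing to compare. A fail-safe version of the same decision might look like the sketch below. It relies on the standard Prometheus instant-query response shape (`status`, `data.result[].value`); `should_abort` is a hypothetical helper, and the sample responses are made up.

```python
import json


def should_abort(response_text: str, floor: float = 0.10) -> bool:
    """Decide whether to abort from a Prometheus instant-query response.

    Fails safe: an error status, an empty result, or an unparsable value
    all return True, since an unknown budget should not keep chaos running.
    """
    try:
        payload = json.loads(response_text)
        if payload.get("status") != "success":
            return True
        results = payload["data"]["result"]
        if not results:
            return True
        # Each sample is [timestamp, "value-as-string"].
        remaining = float(results[0]["value"][1])
    except (ValueError, KeyError, IndexError, TypeError, AttributeError):
        return True
    # NaN compares False against everything, so test "healthy" and negate.
    return not (remaining >= floor)


healthy = ('{"status":"success","data":{"resultType":"vector",'
           '"result":[{"metric":{},"value":[1700000000,"0.42"]}]}}')
no_data = '{"status":"success","data":{"resultType":"vector","result":[]}}'
print(should_abort(healthy), should_abort(no_data))
```

Whether missing data should abort or merely warn is a policy decision; this sketch picks the conservative option.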
Configuration
# config.example.yaml
chaos:
  default_namespace: staging          # Never run in production without approval
  max_duration: 300                   # Max experiment duration in seconds
  abort_on_slo_breach: true           # Auto-abort if SLO is violated
  error_budget_floor: 0.10            # Abort if remaining budget below 10%
  cooldown_between_experiments: 600   # 10 min between experiments
  blast_radius:
    max_pods_affected_pct: 50         # Never kill more than 50% of pods
    excluded_namespaces:              # Never inject chaos here
      - kube-system
      - monitoring
      - cert-manager
prometheus:
  url: https://prometheus.example.com
  query_timeout: 10s
notifications:
  slack_webhook: https://hooks.slack.com/services/YOUR/WEBHOOK/HERE
  notify_on: [start, abort, complete]
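These limits only help if something checks them before an experiment starts. A pre-flight guardrail check could be sketched as follows. `check_guardrails` is a hypothetical function, not part of the toolkit, and it takes the three limits as a flat dictionary so it does not depend on how the YAML file is nested.

```python
def check_guardrails(limits: dict, namespace: str,
                     duration_s: int, pods_affected_pct: int) -> list:
    """Return a list of guardrail violations; an empty list means go."""
    problems = []
    if namespace in limits["excluded_namespaces"]:
        problems.append(f"namespace '{namespace}' is excluded from chaos")
    if duration_s > limits["max_duration"]:
        problems.append(
            f"duration {duration_s}s exceeds max {limits['max_duration']}s"
        )
    if pods_affected_pct > limits["max_pods_affected_pct"]:
        problems.append(
            f"{pods_affected_pct}% of pods exceeds max "
            f"{limits['max_pods_affected_pct']}%"
        )
    return problems


# Limits copied from the example configuration above.
limits = {
    "excluded_namespaces": ["kube-system", "monitoring", "cert-manager"],
    "max_duration": 300,
    "max_pods_affected_pct": 50,
}
print(check_guardrails(limits, "staging", duration_s=30, pods_affected_pct=50))
print(check_guardrails(limits, "kube-system", duration_s=600, pods_affected_pct=80))
```

Returning every violation at once, rather than stopping at the first, gives the person planning the experiment the full picture in one pass.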
Best Practices
- Start in staging, graduate to production after the team is comfortable with rollback procedures
- Always define abort conditions — an experiment without a kill switch is an outage
- Run during business hours — chaos at 2 AM teaches you nothing
- One variable at a time — single failure mode per experiment for clear signal
- Document every surprise — the value is in what you didn't expect
Troubleshooting
Litmus experiment stuck in "Running" state
Check the chaos-runner pod logs: kubectl logs -n staging -l app=chaos-runner. Common cause: the target application label doesn't match any pods. Verify with kubectl get pods -n staging -l app=api-gateway.
Gremlin agent not reporting
Ensure the Gremlin daemonset is running: kubectl get ds gremlin -n gremlin. Check that GREMLIN_TEAM_ID and GREMLIN_TEAM_SECRET are set correctly.
Rollback script doesn't abort
The script requires bc for floating-point comparison (apt-get install bc). Also verify the Prometheus query returns data.
This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete Chaos Engineering Toolkit with all files, templates, and documentation for $49.
Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.