DEV Community

Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Chaos Engineering Toolkit


Build confidence in your production systems by breaking them on purpose. This toolkit provides ready-to-run chaos experiment designs, Litmus and Gremlin configurations, failure injection scripts, and game day planning templates that let your team practice incident response before real outages happen. Every experiment includes a hypothesis, steady-state definition, rollback procedure, and blast radius controls — because chaos without discipline is just an outage.

Key Features

  • 12 pre-built experiments — Network latency, pod kill, CPU stress, disk fill, DNS failure, zone outage, and more
  • Litmus ChaosEngine manifests — Drop-in YAML for LitmusChaos with tunable parameters and abort conditions
  • Gremlin attack configs — JSON configs for Gremlin's API covering infrastructure and application-layer attacks
  • Game day planner — Markdown templates for planning, executing, and debriefing chaos game days
  • Blast radius calculator — Python script that estimates impact scope before running experiments
  • Steady-state validation — Prometheus queries to verify system health before, during, and after experiments
  • Automated rollback hooks — Shell scripts that abort experiments when error budgets are breached

Quick Start

unzip chaos-engineering-toolkit.zip && cd chaos-engineering-toolkit/

# Dry-run a pod-kill experiment (validates without executing)
python3 src/chaos_engineering_toolkit/core.py \
  --experiment pod-kill \
  --target-namespace staging --dry-run

# Run the blast radius calculator
python3 src/chaos_engineering_toolkit/utils.py blast-radius \
  --experiment network-latency --target-service api-gateway
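The calculator's internals are not shown here, so the following is only a minimal sketch of what a blast radius estimate could compute. The function name, the `dependents` map, and the rounding rule are all assumptions made for illustration, not the toolkit's actual implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class BlastRadius:
    pods_affected: int
    pods_total: int
    downstream_services: list[str]

    @property
    def fraction(self) -> float:
        """Share of the target's pods that the experiment would touch."""
        return self.pods_affected / self.pods_total if self.pods_total else 0.0


def estimate_blast_radius(service, replicas, affected_pct, dependents):
    """Estimate which pods and services an experiment could touch.

    `dependents` is a hypothetical map from a service name to the
    services that call it directly.
    """
    # Round up so the estimate errs on the side of more impact.
    pods_affected = min(replicas, math.ceil(replicas * affected_pct / 100))

    # Walk the dependency graph breadth-first to collect every caller,
    # direct or transitive, that could see errors from the target.
    seen, queue = set(), [service]
    while queue:
        current = queue.pop(0)
        for caller in dependents.get(current, []):
            if caller not in seen and caller != service:
                seen.add(caller)
                queue.append(caller)
    return BlastRadius(pods_affected, replicas, sorted(seen))
```

Calling `estimate_blast_radius("api-gateway", 3, 50, {"api-gateway": ["web"]})` would report 2 of 3 pods affected and `web` as the only downstream service.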

Architecture / How It Works

PLAN → VALIDATE → EXECUTE → ANALYZE
 │                    │
 └── Abort if SLO ────┘
     budget breached
  1. Plan — Define hypothesis ("API latency stays under 200ms when cache nodes fail"), set blast radius limits
  2. Validate — Prometheus queries confirm the system is in steady state before injecting failure
  3. Execute — Apply chaos manifest (Litmus ChaosEngine or Gremlin API call) with abort conditions
  4. Analyze — Compare during-experiment metrics to baseline, record findings
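The four phases above can be sketched as a small driver function. This is an illustration under stated assumptions: `run_experiment` and its callback parameters are hypothetical names, and the real toolkit may structure the loop differently. The sketch shows that rollback runs whether the experiment completes, aborts, or raises.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ExperimentRun:
    hypothesis: str
    phases: list[str] = field(default_factory=list)
    outcome: str = "pending"


def run_experiment(
    hypothesis: str,
    steady_state_ok: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    budget_ok: Callable[[], bool],
    checks: int = 3,
) -> ExperimentRun:
    run = ExperimentRun(hypothesis)
    run.phases.append("PLAN")

    # VALIDATE: never inject failure into a system that is already unhealthy.
    run.phases.append("VALIDATE")
    if not steady_state_ok():
        run.outcome = "skipped: system not in steady state"
        return run

    # EXECUTE: inject the failure, then poll the SLO budget.
    run.phases.append("EXECUTE")
    try:
        inject()
        for _ in range(checks):
            if not budget_ok():
                run.outcome = "aborted: SLO budget breached"
                return run
    finally:
        # Runs on success, on abort, and on exceptions.
        rollback()

    run.phases.append("ANALYZE")
    run.outcome = "completed"
    return run
```

In practice the callbacks would wrap the Prometheus checks and the Litmus or Gremlin calls; passing them in as functions keeps the control flow testable without a cluster.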

Usage Examples

Litmus ChaosEngine — Pod Kill Experiment

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-experiment
  namespace: staging
spec:
  engineState: active              # patch to "stop" to halt the experiment
  appinfo:
    appns: staging
    applabel: app=api-gateway
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: PODS_AFFECTED_PERC
              value: "50"
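Before applying a manifest like the one above, it can help to check it against your blast radius limits. The sketch below assumes the manifest has already been parsed into a Python dict (for example with a YAML library). `preflight_errors` is a hypothetical helper, and its default limits simply mirror the values in the Configuration section.

```python
def preflight_errors(
    engine,
    max_pods_pct=50,
    max_duration=300,
    excluded_namespaces=("kube-system", "monitoring", "cert-manager"),
):
    """Return the reasons a manifest should not be applied (empty if none)."""
    errors = []
    spec = engine.get("spec", {})

    namespace = spec.get("appinfo", {}).get("appns", "")
    if namespace in excluded_namespaces:
        errors.append(f"namespace {namespace!r} is excluded from chaos")

    for experiment in spec.get("experiments", []):
        name = experiment.get("name", "<unnamed>")
        env = experiment.get("spec", {}).get("components", {}).get("env", [])
        values = {item["name"]: item["value"] for item in env}

        # Env values are strings in the manifest, so convert before comparing.
        # A missing value is treated as 0 here, which is an assumption.
        pods_pct = int(values.get("PODS_AFFECTED_PERC", "0"))
        if pods_pct > max_pods_pct:
            errors.append(
                f"{name}: PODS_AFFECTED_PERC {pods_pct} exceeds {max_pods_pct}"
            )

        duration = int(values.get("TOTAL_CHAOS_DURATION", "0"))
        if duration > max_duration:
            errors.append(
                f"{name}: TOTAL_CHAOS_DURATION {duration} exceeds {max_duration}"
            )
    return errors
```

An empty list means the manifest stays within the configured limits; anything else is a reason to stop before `kubectl apply`.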

Steady-State Validation with Prometheus

from chaos_engineering_toolkit.core import SteadyStateValidator

validator = SteadyStateValidator(
    prometheus_url="https://prometheus.example.com"
)

# Define steady-state conditions
conditions = [
    {
        "name": "api_latency_p99",
        "query": 'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service="api-gateway"}[5m])))',
        "operator": "less_than",
        "threshold": 0.200,
    },
    {
        "name": "error_rate",
        "query": 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
        "operator": "less_than",
        "threshold": 0.01,
    },
]

result = validator.check(conditions)
if result.all_passed:
    print("Steady state confirmed — safe to proceed with experiment")
else:
    print(f"ABORT: {result.failures}")
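For readers curious about what a check like this involves, here is a rough sketch built on the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). It is not the toolkit's `SteadyStateValidator`: the function names are hypothetical, and the `greater_than` operator is an assumption added for symmetry. Missing data and NaN count as failures, so an empty query result can never pass as "healthy".

```python
import json
import operator
import urllib.parse
import urllib.request

OPERATORS = {"less_than": operator.lt, "greater_than": operator.gt}


def query_prometheus(base_url: str, query: str, timeout: float = 10.0) -> dict:
    """Run an instant query and return the decoded JSON response."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def first_sample(payload: dict):
    """Return the first sample of an instant-query response, or None."""
    results = payload.get("data", {}).get("result", [])
    if payload.get("status") != "success" or not results:
        return None
    # Each sample is [timestamp, "value"], with the value encoded as a string.
    return float(results[0]["value"][1])


def evaluate(condition: dict, value) -> bool:
    """Check one condition; missing data or NaN never passes."""
    if value is None or value != value:  # NaN is the only value unequal to itself
        return False
    return OPERATORS[condition["operator"]](value, condition["threshold"])
```

A condition would then be checked with `evaluate(cond, first_sample(query_prometheus(url, cond["query"])))`.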

Automated Rollback Script

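#!/usr/bin/env bash
# rollback.sh: abort the experiment if the error budget is breached
set -euo pipefail

# Read the remaining error budget as a ratio between 0 and 1 (empty if no data).
ERROR_BUDGET=$(curl -sf "https://prometheus.example.com/api/v1/query" \
  --data-urlencode 'query=slo:error_budget_remaining:ratio{service="api-gateway"}' \
  | python3 -c "import sys,json; r=json.load(sys.stdin)['data']['result']; print(r[0]['value'][1] if r else '')" \
  || true)

abort() {
  echo "$1: aborting experiment"
  kubectl delete chaosengine pod-kill-experiment -n staging
  exit 1
}

# Fail closed: without a reading we cannot show the experiment is safe.
[[ -n "$ERROR_BUDGET" ]] || abort "NO ERROR BUDGET DATA"

if (( $(echo "$ERROR_BUDGET < 0.10" | bc -l) )); then
  abort "ERROR BUDGET BELOW 10%"
fi
echo "Error budget ratio at ${ERROR_BUDGET} (floor 0.10): experiment continues"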
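The script compares a remaining-budget ratio against 0.10, but the definition of the `slo:error_budget_remaining:ratio` recording rule is not shown. One common way to derive such a ratio is sketched below, assuming a request-based SLO; the function names and the example numbers are illustrative only.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent in the SLO window.

    With a 99.9% target, 0.1% of requests are allowed to fail. The
    result is 1.0 when nothing has failed, 0.0 when the budget is fully
    spent, and negative when it is overspent.
    """
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        return 0.0  # no traffic, or a 100% target, leaves no budget to spend
    return 1.0 - failed / allowed_failures


def should_abort(remaining: float, floor: float = 0.10) -> bool:
    """Mirror the rollback rule: abort once the remaining budget drops below the floor."""
    return remaining < floor
```

For example, with a 99.9% target and 1,000,000 requests, about 1,000 failures are allowed. After 950 failures roughly 5% of the budget remains, which is below the 10% floor, so the experiment would be aborted.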

Configuration

# config.example.yaml
chaos:
  default_namespace: staging       # Never run in production without approval
  max_duration: 300                # Max experiment duration in seconds
  abort_on_slo_breach: true        # Auto-abort if SLO is violated
  error_budget_floor: 0.10         # Abort if remaining budget below 10%
  cooldown_between_experiments: 600  # 10 min between experiments

blast_radius:
  max_pods_affected_pct: 50        # Never kill more than 50% of pods
  excluded_namespaces:             # Never inject chaos here
    - kube-system
    - monitoring
    - cert-manager

prometheus:
  url: https://prometheus.example.com
  query_timeout: 10s

notifications:
  slack_webhook: https://hooks.slack.com/services/YOUR/WEBHOOK/HERE
  notify_on: [start, abort, complete]
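As an illustration of how the `max_duration` and `cooldown_between_experiments` settings might be enforced, here is a small sketch. `ChaosLimits` and the helper functions are hypothetical and only mirror the field names from the config above; loading the YAML itself is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ChaosLimits:
    max_duration: int = 300                  # seconds
    cooldown_between_experiments: int = 600  # seconds


def clamp_duration(requested_seconds: int, limits: ChaosLimits) -> int:
    """Never let a single experiment run longer than max_duration."""
    return min(requested_seconds, limits.max_duration)


def can_start(
    now: datetime, last_finished: Optional[datetime], limits: ChaosLimits
) -> bool:
    """Allow a new experiment only after the cooldown has elapsed."""
    if last_finished is None:
        return True  # nothing has run yet
    cooldown = timedelta(seconds=limits.cooldown_between_experiments)
    return now >= last_finished + cooldown
```

With the defaults, an experiment that finished at 10:00 blocks the next one until 10:10, and a requested duration of 900 seconds is clamped to 300.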

Best Practices

  • Start in staging, graduate to production after the team is comfortable with rollback procedures
  • Always define abort conditions — an experiment without a kill switch is an outage
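  • Run during business hours, when the people who own the affected services are online to observe and respond; an experiment nobody watches at 2 AM teaches you nothing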
  • One variable at a time — single failure mode per experiment for clear signal
  • Document every surprise — the value is in what you didn't expect
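The business-hours rule is easy to enforce in code rather than by convention. A minimal sketch, assuming a Monday-to-Friday, 09:00 to 17:00 window in the team's local time; the function name and the window are illustrative choices, not part of the toolkit.

```python
from datetime import datetime, time


def within_business_hours(
    moment: datetime, start: time = time(9, 0), end: time = time(17, 0)
) -> bool:
    """True on weekdays between start (inclusive) and end (exclusive)."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start <= moment.time() < end
```

A scheduler could call this before the validate phase and refuse to start when it returns `False`.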

Troubleshooting

Litmus experiment stuck in "Running" state
Check the chaos-runner pod logs: kubectl logs -n staging -l app=chaos-runner. Common cause: the target application label doesn't match any pods. Verify with kubectl get pods -n staging -l app=api-gateway.

Gremlin agent not reporting
Ensure the Gremlin daemonset is running: kubectl get ds gremlin -n gremlin. Check that GREMLIN_TEAM_ID and GREMLIN_TEAM_SECRET are set correctly.

Rollback script doesn't abort
The script requires bc for floating-point comparison (apt-get install bc). Also verify the Prometheus query returns data.


This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete Chaos Engineering Toolkit with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.

Get the Complete Bundle →

