# You Don't Need Chaos Monkey
Every chaos engineering talk starts with Netflix and Chaos Monkey. Cool story. You're not Netflix. You probably have 5-50 services, not 500. You don't need a sophisticated chaos platform.
You need a methodology.
## Start With Game Days
Before injecting failures into production, run tabletop exercises:
```markdown
## Game Day: Database Failover

Date: 2024-03-15
Scope: Primary DB goes down
Participants: SRE team + Backend leads

### Scenario

At 10:00 AM, the primary database becomes unreachable.

### Questions to answer

1. How does the application behave?
2. Does the replica promote automatically?
3. What's the expected failover time?
4. What manual steps are needed?
5. How do we verify data consistency after failover?

### Prerequisites

- [ ] Backup verified within the last 24 hours
- [ ] Runbook for DB failover reviewed
- [ ] Rollback plan documented
- [ ] All participants in the incident channel
```
Game days cost nothing and reveal 80% of the gaps.
## Your First Real Chaos Experiment
Start small. Really small.
```shell
# Experiment 1: Kill a single pod
# Hypothesis: Traffic shifts to the remaining pods with zero errors

# Before
kubectl get pods -l app=api-service
# NAME                           READY   STATUS
# api-service-7d8f9b6c4-abc12    1/1     Running
# api-service-7d8f9b6c4-def34    1/1     Running
# api-service-7d8f9b6c4-ghi56    1/1     Running

# The experiment
kubectl delete pod api-service-7d8f9b6c4-abc12

# Observe:
# - Did the error rate spike?
# - Did latency increase?
# - Did K8s reschedule the pod?
# - How long until we're back to 3 replicas?
```
If killing one pod causes errors, you have a serious problem that's better to find now.
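The "how long until we're back to 3 replicas" question is easy to answer with a small watch loop. A sketch, assuming `kubectl` access and the `api-service` deployment name and pod name from the example above:

```shell
#!/bin/sh
# Delete the pod, then time how long the deployment takes to return
# to full strength. Names are carried over from the example above.
kubectl delete pod api-service-7d8f9b6c4-abc12 --wait=false

start=$(date +%s)
while true; do
  # readyReplicas is absent while pods are coming up, hence the default
  ready=$(kubectl get deploy api-service \
    -o jsonpath='{.status.readyReplicas}')
  [ "${ready:-0}" -ge 3 ] && break
  sleep 1
done
echo "Back to 3 ready replicas in $(( $(date +%s) - start ))s"
```

Recovery time here is your real-world MTTR floor for this failure mode; write it down in the experiment record.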
## The Chaos Experiment Template
```yaml
experiment:
  name: "Network latency to payment service"
  date: "2024-03-20"

hypothesis:
  steady_state: "P99 checkout latency < 500ms, error rate < 0.1%"
  expected_behavior: >
    Circuit breaker activates within 5 seconds.
    Checkout falls back to cached payment validation.
    Users see a 'retry' message, not an error.

method:
  tool: "tc (traffic control)"
  injection: "200ms latency on port 443 to payment-service"
  duration: "5 minutes"
  scope: "Single availability zone"

abort_conditions:
  - "Error rate exceeds 5%"
  - "Checkout success rate drops below 90%"
  - "Any data inconsistency detected"

rollback:
  command: "tc qdisc del dev eth0 root"
  verification: "Check latency returns to baseline"

results:
  actual_behavior: "[filled in after experiment]"
  hypothesis_confirmed: "[true/false]"
  action_items: []
```
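For reference, the template's `injection` and `rollback` steps map onto `tc` roughly like this. This is a sketch: the interface name `eth0` and the hostname `payment-service.internal` are assumptions, and delaying only one destination port requires a classful qdisc plus a `u32` filter rather than a bare `netem`:

```shell
# Inject: 200ms delay on outbound traffic to TCP port 443 only.
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 200ms
tc filter add dev eth0 parent 1: protocol ip u32 \
  match ip dport 443 0xffff flowid 1:3

# Rollback (same as the template's rollback.command):
tc qdisc del dev eth0 root

# Verify latency is back to baseline (hypothetical hostname):
ping -c 5 payment-service.internal
```

Rehearse the rollback before the injection; the one command you must never fumble is the one that turns the chaos off.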
## Progressive Chaos Levels

- Level 1: Kill a pod (Week 1)
- Level 2: Kill all pods in one AZ (Week 4)
- Level 3: Inject latency to a dependency (Week 8)
- Level 4: Simulate full dependency outage (Week 12)
- Level 5: Multi-failure scenario (Week 16+)

Don't skip levels. Each one builds confidence and reveals issues.
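Level 2 is mostly a matter of scoping the pod deletion to one zone. A sketch, assuming the standard `topology.kubernetes.io/zone` node label, a hypothetical zone value of `us-east-1a`, and the `api-service` label from earlier:

```shell
#!/bin/sh
# Level 2 sketch: delete every api-service pod running on nodes
# in a single availability zone.
for node in $(kubectl get nodes \
    -l topology.kubernetes.io/zone=us-east-1a \
    -o jsonpath='{.items[*].metadata.name}'); do
  kubectl delete pods -l app=api-service \
    --field-selector "spec.nodeName=${node}"
done
```

Run it only after Level 1 is boring, and watch whether the scheduler actually spreads the replacements across the remaining zones.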
## What We Found
After 6 months of regular chaos experiments:
- 12 missing circuit breakers discovered
- 3 services with no health checks
- 5 services with incorrect timeout configurations
- 2 services with hard-coded dependency URLs (no DNS)
- 1 service that crashed when its cache was unavailable
All of these would have caused outages eventually. We found them on our terms, during business hours, with everyone ready.
## The Business Case

- Prevented outages (estimated): 4 per quarter
- Average outage cost: $15,000
- Chaos engineering cost: ~20 hours of engineering time per quarter (~$4,000)
- ROI: $60,000 saved - $4,000 cost = $56,000/quarter
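The arithmetic is simple enough to sanity-check in a few lines of shell. The $200/hour blended rate is an assumption backed out of the $4,000 figure above; plug in your own numbers:

```shell
#!/bin/sh
# Sanity-check the ROI math above (hourly rate is an assumption).
outages_prevented=4
outage_cost=15000
eng_hours=20
hourly_rate=200

saved=$((outages_prevented * outage_cost))
cost=$((eng_hours * hourly_rate))
echo "saved=\$${saved} cost=\$${cost} roi=\$$((saved - cost))/quarter"
```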
If you want to run chaos experiments with AI-guided blast radius analysis, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com