Chaos Engineering Is Theater Without These Three Things

#sre #devops #chaos #resilience

Chaos engineering has a credibility problem. Half the teams that adopt it are doing it because it's fashionable, not because it makes their systems more reliable. The result is a lot of chaos tools getting installed, a lot of demo videos getting recorded, and not much actually improving.

If you don't have these three things in place, your chaos engineering practice is theater. Skip the experiments and fix the prerequisites first.

1. You actually fix what chaos finds

The biggest chaos engineering failure mode I've seen: a team runs experiments, documents the issues they uncover, and then never fixes them. The findings pile up in a Jira backlog labeled "future-work" and rot.

If your team won't commit engineering time to fixing what chaos finds, don't run chaos experiments. You're just generating discovery debt.

The rule I use: every chaos finding gets a fix-by date within two weeks of discovery, or the finding gets formally accepted as a known limitation with sign-off from leadership. No middle ground.
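One way to keep that rule honest is to check the findings list mechanically. The sketch below is only an illustration: the `Finding` fields and function names are hypothetical, and it assumes "within two weeks" means the fix-by date falls no later than 14 days after discovery.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumption: the fix-by date must land within 14 days of discovery.
FIX_WINDOW = timedelta(days=14)


@dataclass
class Finding:
    title: str
    discovered: date
    fix_by: Optional[date] = None      # committed fix date, if any
    accepted_by: Optional[str] = None  # who signed off on it as a known limitation


def is_triaged(finding: Finding) -> bool:
    """True if the finding has a sign-off or a fix-by date inside the window."""
    if finding.accepted_by:
        return True
    return (
        finding.fix_by is not None
        and finding.fix_by <= finding.discovered + FIX_WINDOW
    )


def untriaged(findings: list[Finding]) -> list[Finding]:
    """Findings sitting in the middle ground the rule forbids."""
    return [f for f in findings if not is_triaged(f)]
```

Running `untriaged()` over the backlog before scheduling the next experiment gives a simple gate: if the list isn't empty, the next experiment waits.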

2. Your monitoring is good enough to see the damage

Chaos works by breaking things on purpose and observing what happens. If your monitoring is bad, you'll break something and not notice until customers complain. That's not chaos engineering; that's a self-inflicted incident.

The bar: when you inject a failure into a non-critical component, you should be able to see, within 60 seconds, every downstream system that's affected. If you can't, your monitoring and dependency mapping aren't good enough yet. Fix that first.
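A rough way to score an experiment against that bar is to compare what your dependency map predicted with what your monitoring actually showed. This is a sketch under assumptions: the function names are made up, and `first_seen_s` stands in for whatever your tooling reports as seconds from injection to the first visible signal per system.

```python
# The 60-second bar from the text.
DETECTION_BUDGET_S = 60.0


def detection_gaps(
    expected_downstream: set[str],
    first_seen_s: dict[str, float],
    budget_s: float = DETECTION_BUDGET_S,
) -> dict[str, str]:
    """Systems the dependency map says are affected but monitoring missed or saw late."""
    gaps = {}
    for system in sorted(expected_downstream):
        seen = first_seen_s.get(system)
        if seen is None:
            gaps[system] = "never observed"
        elif seen > budget_s:
            gaps[system] = f"observed after {seen:.0f}s"
    return gaps


def unmapped_systems(
    expected_downstream: set[str], first_seen_s: dict[str, float]
) -> set[str]:
    """Systems that showed impact but are missing from the dependency map."""
    return set(first_seen_s) - expected_downstream
```

An empty result from both functions is the signal that monitoring and the dependency map agree; anything else is a prerequisite to fix before the next experiment.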

3. You have a blast radius you control

Don't run your first chaos experiment in production. Don't run it on the critical path. Don't run it on customer-affecting infrastructure.

Start with one specific component, in a non-production environment, during business hours, with the engineer who owns it watching. Successful chaos engineering programs build up trust over months by demonstrating they can contain the blast radius before it hurts anyone.

Teams that skip this earn justified skepticism. "Chaos engineer broke prod" becomes the story, and your program loses leadership support before it gets a chance to prove value.
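The starting conditions above are simple enough to encode as a preflight check that runs before any failure is injected. The following is a minimal sketch, not a real tool: `ExperimentPlan` and its fields are hypothetical, and "business hours" is assumed to mean weekdays from 09:00 to 17:00 local time.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Assumption: business hours are Monday to Friday, 09:00 to 17:00 local time.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


@dataclass
class ExperimentPlan:
    target: str             # the one component under test
    environment: str        # e.g. "staging" or "production"
    on_critical_path: bool
    owner_watching: bool    # is the owning engineer present?
    start: datetime


def preflight_blockers(plan: ExperimentPlan) -> list[str]:
    """Return reasons a first experiment should not run; empty means go."""
    blockers = []
    if plan.environment == "production":
        blockers.append("targets production")
    if plan.on_critical_path:
        blockers.append("target is on the critical path")
    if not plan.owner_watching:
        blockers.append("owning engineer is not watching")
    in_hours = (
        plan.start.weekday() < 5
        and BUSINESS_START <= plan.start.time() < BUSINESS_END
    )
    if not in_hours:
        blockers.append("outside business hours")
    return blockers
```

Returning a list of reasons rather than a single boolean makes it obvious to the person running the experiment which condition to fix.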

What this looks like in practice

For a team just starting out:

Pick one service. The most boring one you've got.
In staging, kill a single pod, observe, fix what breaks.
Repeat with network latency injection, then disk pressure, then memory pressure.
After a quarter of staging-only experiments, propose a production experiment on a single, well-understood failure mode.
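The staging progression above can be sketched as a loop that moves to the next failure mode only once the current one runs clean. Everything here is illustrative: the mode names are labels taken from the steps, and `run_experiment` is a hypothetical callable you would back with your own tooling, returning a list of findings for a given mode.

```python
from typing import Callable

# Order taken from the steps above; the names are just labels.
STAGING_SEQUENCE = ["pod-kill", "network-latency", "disk-pressure", "memory-pressure"]


def run_sequence(
    run_experiment: Callable[[str], list[str]],
    sequence: list[str] = STAGING_SEQUENCE,
) -> tuple[list[str], dict[str, list[str]]]:
    """Run failure modes in order, stopping at the first one that produces findings.

    Returns the modes that ran clean and the findings that halted the sequence.
    """
    completed: list[str] = []
    for mode in sequence:
        findings = run_experiment(mode)
        if findings:
            # Observe, then fix what breaks before moving on.
            return completed, {mode: findings}
        completed.append(mode)
    return completed, {}
```

Stopping at the first finding is a deliberate choice: it forces the fix to happen before the next, harsher experiment adds more findings to the pile.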

Boring is the point. Chaos engineering is most valuable when it's least exciting, when the team can predict the outcome and run experiments as routine maintenance.

The teams getting real value from chaos look like that. The teams making YouTube videos about their chaos platform are usually the ones doing it for the wrong reasons.