Nex Tools

Posted on May 26

Claude Code for Chaos Engineering: How I Stopped Hoping My System Was Resilient and Started Proving It

#claudecode #chaosengineering #devops #reliability

For years I told myself my production system was resilient. I had retries. I had circuit breakers. I had timeouts. I had a runbook. I had survived enough incidents that I knew the system mostly held together under stress. What I did not have was any real evidence that the next incident would not be the one that broke it for good. Resilience was a story I was telling myself, not a property I had measured.

Chaos engineering is the practice of replacing that story with measurements. Instead of hoping the system survives a database failover, you trigger a database failover in production during a quiet hour and watch what actually happens. Instead of hoping the timeouts you set are right, you inject latency into a dependency and see whether the symptoms match what your timeouts were supposed to prevent. The practice is uncomfortable the first few times you do it. After a while, it becomes the only honest way to know whether your system actually works the way you think it does.

The hard part of chaos engineering is not the experiments themselves. The experiments are mostly small scripts that introduce a specific failure. The hard part is the workflow around the experiments: choosing which failures to test, defining what success looks like, scheduling the experiments without disrupting the business, capturing the results, and turning the results into changes that make the system more resilient. That workflow is where Claude Code reshaped my practice. Here is what the workflow looks like.

Why Most Teams Skip Chaos Engineering

Most teams I have worked with have heard of chaos engineering, agree it is a good idea, and have never run a single experiment. The reasons are predictable. The experiments feel scary. The setup work is invisible to the rest of the business. The first few experiments often produce surprising results that lead to uncomfortable conversations. The path of least resistance is to keep telling the resilience story and hope no one runs the test.

The cost of skipping the practice does not show up on any single day. It shows up in the incident postmortem six months later, when the failure mode that took the system down turns out to be something the team would have caught with a fifteen-minute experiment. The cost is invisible until the moment it is enormous.

The other reason teams skip the practice is that the work feels like it has no clear owner. Chaos engineering does not fit neatly into any existing role. It is not feature work, not bug fixing, not on-call response, not platform engineering in the traditional sense. The work needs someone to push for it, and most teams do not have that person.

The teams that get chaos engineering right are the teams that make it boring. The first few experiments are dramatic, but the goal is to get to the point where running an experiment is as routine as deploying a small change. Boring chaos engineering is the kind that actually happens.

The workflow I describe below is the workflow that made chaos engineering boring for me. The Claude Code skills handle the parts that would otherwise be tedious, which leaves the interesting parts to humans.

If you want the broader picture of how I think about production safety, the Claude Code for Incident Response workflow covers what happens when something does break, and chaos engineering is the practice that surfaces those failures before customers do.

The Hypothesis Skill

The first skill in the workflow handles experiment design. Given a service and a failure mode, the skill produces a structured hypothesis that the experiment will test.

A hypothesis is not a vague intention to break something. It is a specific claim that the experiment is designed to confirm or refute. A good hypothesis says something like, "When the primary database becomes unreachable for thirty seconds, the API continues to serve cached reads, write requests return a 503 with a Retry-After header, and the p99 latency on read endpoints stays below 800 milliseconds." A bad hypothesis says, "The system should handle database failures."

The skill enforces the structure. Every hypothesis has a failure scenario, an expected behavior across multiple dimensions, a blast radius, and a rollback condition. The structure forces the person designing the experiment to think through what they actually expect to happen, which is often the most valuable part of the entire workflow.
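To make the four-part structure concrete, here is a minimal Python sketch of a hypothesis record. The class names, field names, metric names, and the example blast radius and rollback condition are all hypothetical illustrations, not the skill's actual schema. The point is only that each prediction is a measurable claim, and that a vague hypothesis can be rejected mechanically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One measurable claim: a metric, a comparison, and a threshold."""
    metric: str        # hypothetical metric name, e.g. "read_p99_ms"
    comparison: str    # one of "<=", ">=", "=="
    threshold: float


@dataclass(frozen=True)
class Hypothesis:
    failure_scenario: str
    predictions: tuple[Prediction, ...]
    blast_radius: str
    rollback_condition: str

    def problems(self) -> list[str]:
        """Return the reasons this hypothesis is too vague to run."""
        issues = []
        if not self.failure_scenario.strip():
            issues.append("missing failure scenario")
        if not self.predictions:
            issues.append("no measurable predictions")
        for p in self.predictions:
            if p.comparison not in ("<=", ">=", "=="):
                issues.append(f"unsupported comparison for {p.metric}")
        if not self.blast_radius.strip():
            issues.append("missing blast radius")
        if not self.rollback_condition.strip():
            issues.append("missing rollback condition")
        return issues


# The "good" hypothesis from the text, with assumed blast radius and rollback.
db_outage = Hypothesis(
    failure_scenario="primary database unreachable for 30 seconds",
    predictions=(
        Prediction("read_p99_ms", "<=", 800),
        Prediction("write_status_code", "==", 503),
    ),
    blast_radius="one API replica in staging",
    rollback_condition="read error rate above 5% for 10 seconds",
)

# The "bad" hypothesis from the text: an intention, not a testable claim.
vague = Hypothesis("the system should handle database failures", (), "", "")

print(db_outage.problems())
print(vague.problems())
```

The first call reports no problems. The second reports that the hypothesis has no measurable predictions, no blast radius, and no rollback condition, which is exactly the feedback a reviewer would otherwise have to give by hand.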

The skill also catalogs the failure modes worth testing. The catalog draws from the real incidents the team has seen, the dependencies the system has, and the common failure modes for the service type. The catalog grows over time as new failure modes are discovered, which means the workflow gets more thorough as the team learns more about the system.

The output of the skill is a written experiment plan that goes into a shared document. The plan is reviewable by humans before any experiment runs. Most experiments start as a draft that goes through one or two rounds of review before they are approved for execution.

The Safety Skill

The second skill handles the safety envelope around the experiment. Before any failure is injected, the skill checks a set of conditions that must be true.

The conditions vary by experiment, but the core set is consistent. The system has to be healthy at baseline, which the skill verifies by comparing recent metrics to historical norms. The time has to be appropriate, which the skill checks against the deployment calendar and the on-call schedule. The blast radius has to be bounded, which the skill verifies by checking the configuration of the failure injection. The rollback mechanism has to be ready, which the skill confirms by performing a dry run of the rollback.

If any condition fails, the experiment does not run. The skill produces a structured report explaining why the experiment was blocked, and the person running the experiment decides whether to fix the underlying issue or postpone the experiment.
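The gate described above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the check names, the 1.5x tolerance on the baseline comparison, and the shape of the report. The two properties worth noticing are that every check runs even after one fails, so the report lists all blockers at once, and that a check which crashes counts as a failure rather than a pass.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def run_preflight(checks: dict[str, Callable[[], tuple[bool, str]]]) -> dict:
    """Run every check and return a structured report.

    A check that raises is recorded as failed (fail closed).
    """
    results = []
    for name, check in checks.items():
        try:
            passed, detail = check()
        except Exception as exc:
            passed, detail = False, f"check raised {type(exc).__name__}: {exc}"
        results.append(CheckResult(name, passed, detail))
    blockers = [r for r in results if not r.passed]
    return {
        "allowed": not blockers,
        "blocked_by": [r.name for r in blockers],
        "results": results,
    }


def baseline_healthy(recent: float, historical: float,
                     tolerance: float = 1.5) -> tuple[bool, str]:
    """Healthy if the recent error rate is within `tolerance` x the norm."""
    ok = recent <= historical * tolerance
    return ok, f"recent={recent:.4f} historical={historical:.4f}"


# Hypothetical inputs: in practice each lambda would query a real system.
report = run_preflight({
    "baseline_healthy": lambda: baseline_healthy(0.012, 0.004),
    "no_deploy_in_progress": lambda: (True, "deployment calendar is clear"),
    "blast_radius_bounded": lambda: (True, "targets 1 of 12 replicas"),
    "rollback_dry_run": lambda: (True, "stop action succeeded in dry run"),
})

print(report["allowed"], report["blocked_by"])
```

In this example the recent error rate is three times the historical norm, so the experiment is blocked by the baseline check alone and the other three results are still recorded for the report.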

The skill also handles the human approval requirement. For low-risk experiments in lower environments, no approval is needed. For experiments in production or experiments with a large blast radius, the skill requires explicit sign-off from a named approver before it will allow the experiment to start. The approval is logged with the experiment record.
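A rough sketch of that approval rule, again in Python. The 5 percent blast radius cutoff, the approver names, and the record fields are assumptions I picked for the example; the source only says that production and large-blast-radius experiments need a named approver and that the approval is logged.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Approval:
    approver: str
    approved_at: datetime


def needs_approval(environment: str, blast_radius_pct: float,
                   max_unapproved_pct: float = 5.0) -> bool:
    """Production always needs sign-off; elsewhere only large blast radii do."""
    return environment == "production" or blast_radius_pct > max_unapproved_pct


def authorize(environment: str, blast_radius_pct: float,
              approval: Optional[Approval],
              allowed_approvers: set[str]) -> dict:
    """Return a loggable authorization record for the experiment."""
    if not needs_approval(environment, blast_radius_pct):
        return {"authorized": True, "approver": None,
                "reason": "low risk, no approval required"}
    if approval is None:
        return {"authorized": False, "approver": None,
                "reason": "approval required but missing"}
    if approval.approver not in allowed_approvers:
        return {"authorized": False, "approver": approval.approver,
                "reason": "approver is not on the named approver list"}
    return {"authorized": True, "approver": approval.approver,
            "approved_at": approval.approved_at.isoformat(),
            "reason": "approved"}


# Hypothetical approvers and timestamp.
signed = Approval("alice", datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc))
record = authorize("production", 2.0, signed, {"alice", "bob"})
print(record)
```

Returning a record in every branch, including the refusals, means the experiment log shows not only who approved a run but also which attempts were turned away and why.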

The safety layer is what makes the practice sustainable. Without it, every experiment carries enough perceived risk that people will postpone them indefinitely. With it, the experiments become routine because the safety properties are enforced by code rather than by hope.

If you want to see how this connects to deployment safety more broadly, the workflow I described in Claude Code for Canary Deployments uses a similar approach to keep risk bounded during rollouts. Canary deployments and chaos experiments are the two halves of the same safety practice.

The Injection Skill

The third skill handles the actual failure injection. Given an approved experiment plan, the skill executes the failure in the specified system component for the specified duration.

The injection mechanisms are varied. For network failures, the skill manipulates the routing layer to drop or delay traffic between specific services. For resource exhaustion, the skill consumes a controlled amount of CPU, memory, or disk on a specific host. For dependency failures, the skill substitutes a misbehaving mock for the real dependency on a subset of requests. For data failures, the skill injects malformed or unexpected data into a specific path.
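As an illustration of the dependency-failure case, here is a small Python wrapper that misbehaves on a configurable fraction of calls. This is a sketch of the general technique, not the skill's mechanism: `with_faults`, `InjectedFault`, and `fetch_price` are hypothetical names, and a real setup would more likely inject at a proxy or service mesh than in application code.

```python
import random
import time
from typing import Callable, Optional


class InjectedFault(Exception):
    """Raised in place of a real dependency error during an experiment."""


def with_faults(call: Callable, *, affected_fraction: float,
                added_latency_s: float = 0.0, fail: bool = True,
                rng: Optional[random.Random] = None,
                sleep: Callable[[float], None] = time.sleep) -> Callable:
    """Wrap `call` so a fraction of requests are delayed and optionally fail."""
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < affected_fraction:
            if added_latency_s > 0:
                sleep(added_latency_s)  # simulate a slow dependency
            if fail:
                raise InjectedFault("injected dependency failure")
        return call(*args, **kwargs)    # unaffected or delayed-only request

    return wrapper


def fetch_price(sku: str) -> float:
    """Stand-in for a real dependency call."""
    return 9.99


# Roughly 10% of calls wait 200 ms and then fail; the rest pass through.
flaky_fetch = with_faults(fetch_price, affected_fraction=0.1,
                          added_latency_s=0.2)
```

Passing `fail=False` turns the same wrapper into a pure latency injection, which is the variant to use when the hypothesis is about timeouts rather than error handling. The `rng` and `sleep` parameters exist so the behavior can be made deterministic in tests.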

The injection is reversible. Every mechanism has a corresponding stop action that returns the system to its baseline. The skill verifies that the stop action works before the injection starts, and it has a hard timeout that triggers the stop action regardless of any other state.
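One way to express those three guarantees (verified stop action, guaranteed revert, hard timeout) is a context manager. The sketch below uses Python's standard `threading.Timer` and `contextlib.contextmanager`; the function name and the callables are hypothetical, and it assumes the stop action is idempotent and safe to call when nothing has been injected.

```python
import threading
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def injected_failure(start: Callable[[], None], stop: Callable[[], None],
                     hard_timeout_s: float) -> Iterator[None]:
    """Run a failure injection that is always reverted.

    Assumes `stop` is idempotent and harmless when nothing is injected.
    """
    lock = threading.Lock()
    stopped = False

    def stop_once() -> None:
        nonlocal stopped
        with lock:
            if not stopped:
                stop()
                stopped = True

    stop()  # prove the stop action works before injecting anything
    timer = threading.Timer(hard_timeout_s, stop_once)
    timer.daemon = True
    try:
        start()
        timer.start()  # hard timeout: reverts even if the body hangs
        yield
    finally:
        timer.cancel()
        stop_once()    # reverts on normal exit and on exceptions


events = []
with injected_failure(lambda: events.append("start"),
                      lambda: events.append("stop"),
                      hard_timeout_s=5.0):
    events.append("observe")

print(events)
```

The first "stop" in the event list is the verification call, and the last one is the real revert. If the observation code raises or runs past the timeout, the stop action still fires, and the lock ensures the timer and the `finally` block never both run it.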

The injection also captures evidence. While the failure is active, the skill records metrics, logs, and traces from the affected components and from the components downstream of them. The evidence is timestamped and tagged with the experiment identifier, which makes it easy to find later. The evidence is the basis for the analysis in the next step.

The skill is also conservative by default. If the captured metrics start to look meaningfully worse than the hypothesis predicted, the skill triggers the rollback automatically rather than waiting for the experiment duration to complete. The early rollback prevents an experiment from turning into an incident.
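A minimal sketch of that early-rollback guard in Python. The 1.25x margin standing in for "meaningfully worse", the metric names, and the sample values are assumptions for the example. Each predicted limit is treated as an upper bound, and a metric that is missing from a sample counts as a breach, on the theory that losing visibility during an experiment is itself a reason to stop.

```python
from typing import Callable, Iterable


def breached_metrics(predicted_limits: dict[str, float],
                     observed: dict[str, float],
                     margin: float = 1.25) -> list[str]:
    """Return metrics that exceed their predicted upper limit by the margin.

    A metric missing from `observed` counts as breached (fail closed).
    """
    breaches = []
    for metric, limit in predicted_limits.items():
        value = observed.get(metric)
        if value is None or value > limit * margin:
            breaches.append(metric)
    return breaches


def watch(samples: Iterable[dict[str, float]],
          predicted_limits: dict[str, float],
          stop: Callable[[], None], margin: float = 1.25) -> dict:
    """Check each metrics sample and roll back on the first breach."""
    for index, observed in enumerate(samples):
        breaches = breached_metrics(predicted_limits, observed, margin)
        if breaches:
            stop()
            return {"aborted": True, "at_sample": index, "breaches": breaches}
    return {"aborted": False, "at_sample": None, "breaches": []}


# Hypothetical limits taken from a hypothesis, and three metric samples.
limits = {"read_p99_ms": 800, "error_rate": 0.01}
samples = [
    {"read_p99_ms": 640, "error_rate": 0.004},
    {"read_p99_ms": 910, "error_rate": 0.006},   # worse, but within margin
    {"read_p99_ms": 1400, "error_rate": 0.09},   # well past both limits
]
rollbacks = []
outcome = watch(samples, limits, stop=lambda: rollbacks.append("rolled back"))
print(outcome)
```

The second sample is over the predicted latency but inside the margin, so the experiment continues and the miss is left for the analysis step. The third sample is far outside both limits, so the guard rolls back immediately instead of waiting out the configured duration.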

The Analysis Skill

When the injection is complete, the analysis skill compares the captured evidence to the original hypothesis. The output is a verdict on whether the hypothesis held.

The verdict has more nuance than pass or fail. For each prediction in the hypothesis, the skill reports whether the actual behavior matched the predicted behavior. The verdict can be that all predictions held, that some held and some did not, or that the actual behavior was meaningfully different from what was predicted on every dimension.
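The per-prediction comparison is simple enough to sketch directly. In this Python example the verdict labels (`all_held`, `partial`, `none_held`), the metric names, and the observed values are hypothetical; the three outcomes mirror the three cases described above. A prediction whose metric was never captured is counted as not held.

```python
import operator
from dataclasses import dataclass

OPS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq}


@dataclass(frozen=True)
class Prediction:
    metric: str
    comparison: str   # a key of OPS
    threshold: float


def evaluate(predictions: list[Prediction], observed: dict[str, float]) -> dict:
    """Compare each prediction to the observed value and summarize."""
    if not predictions:
        raise ValueError("a hypothesis needs at least one prediction")
    rows = []
    for p in predictions:
        actual = observed.get(p.metric)
        # No captured evidence for a metric means the prediction did not hold.
        held = actual is not None and OPS[p.comparison](actual, p.threshold)
        rows.append({"metric": p.metric,
                     "predicted": f"{p.comparison} {p.threshold}",
                     "actual": actual, "held": held})
    held_count = sum(1 for r in rows if r["held"])
    if held_count == len(rows):
        verdict = "all_held"
    elif held_count == 0:
        verdict = "none_held"
    else:
        verdict = "partial"
    return {"verdict": verdict, "predictions": rows}


predictions = [
    Prediction("read_p99_ms", "<=", 800),
    Prediction("write_status_code", "==", 503),
    Prediction("cache_hit_ratio", ">=", 0.9),
]
# Hypothetical evidence: writes returned 500 instead of the predicted 503.
observed = {"read_p99_ms": 640, "write_status_code": 500,
            "cache_hit_ratio": 0.97}

result = evaluate(predictions, observed)
print(result["verdict"])
print([r["metric"] for r in result["predictions"] if not r["held"]])
```

Here two predictions hold and one does not, so the verdict is `partial`, and the failing row points straight at the write path as the place where the system differs from the team's mental model.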

The interesting cases are the ones where some predictions held and some did not. Those cases are the highest-signal output of the entire workflow. They surface the gap between how the team thinks the system behaves and how it actually behaves, which is the gap the practice exists to close.

The analysis output is a structured report that goes into the experiment record. The report includes the original hypothesis, the captured evidence, the comparison verdict, and a set of follow-up actions. The follow-up actions are the changes needed either to bring the system's behavior into line with the hypothesis or to update the hypothesis to match the actual behavior.

The report also feeds the catalog of known behaviors. Over time, the catalog becomes a high-fidelity description of how the system actually responds to various failures, which is more valuable than any architecture diagram.

How the Workflow Runs in Practice

A typical experiment cycle takes between thirty minutes and a few hours, depending on how complex the failure is and how long the observation window needs to be. The cycle is the same regardless of the experiment type.

Someone proposes a new experiment, usually because of a recent incident, a recent architectural change, or a gap in the catalog. The hypothesis skill produces a structured draft. The draft goes through review, and the people who own the affected systems sign off on the plan and the safety envelope.

When the experiment is scheduled, the safety skill runs the preflight checks. If everything passes, the injection skill starts the failure. The team watches the metrics in real time, ready to intervene if anything looks worse than expected.

The injection runs for its configured duration, or stops early if the safety boundaries are crossed. The analysis skill produces the verdict. The follow-up actions from the verdict go into a queue, and the team triages them in the same way they would triage any other engineering work.

Over weeks and months, the catalog of known behaviors grows. Each new experiment either confirms an existing entry, refines an existing entry, or adds a new one. The catalog becomes a living description of the system, and the description is grounded in evidence rather than in assumption.

What This Workflow Did to My Practice

The most visible change is in the kind of bugs that reach customers. The class of bug where a dependency failure cascades into a customer-visible outage has nearly disappeared. The reason is that the experiments surface those failure modes before they happen in real conditions, which gives the team time to fix the cascade before it matters.

The second change is in how the team writes new code. When you know that a chaos experiment will eventually test the failure handling on every dependency, you write the failure handling more carefully the first time around. The cost of careless failure handling becomes visible during the experiment instead of during the incident, which is a much cheaper place to find out.

The third change is in how the team thinks about system documentation. The old documentation described how the system was supposed to work. The new documentation, grounded in the experiment catalog, describes how the system actually works. The two are not always the same, and the differences are where most of the operational learning lives.

The fourth change is in confidence. Before the workflow, every on-call rotation carried a low-level background anxiety because no one knew how the system would behave under stress. With the workflow in place, the on-call rotations carry less anxiety because the answer to most of those questions is known.

For the full picture of how I run production systems with Claude Code, the complete series on DEV.to covers every workflow I rely on, from incident response to dependency management to chaos engineering.

FAQ

Do I have to run experiments in production?

Not at first. The early experiments should run in a non-production environment that closely mirrors production. The point of running in production eventually is that some failure modes only appear under real traffic and real data. The path is to start in staging, build confidence in the workflow, and then move the lower-risk experiments to production one at a time.

What if the safety envelope blocks every experiment I want to run?

That is a useful signal. Either the safety envelope is too tight and needs to be relaxed for the kind of experiments you want, or the system is not yet in a state where those experiments would be safe. Both outcomes are worth knowing. Adjust the envelope or fix the underlying issues, but do not bypass the envelope to force an experiment through.

How do I get organizational buy-in for this practice?

Start with the cheapest experiment that produces a surprising result. A surprising result, well documented, is the most persuasive case for the practice. Most teams overestimate how resilient their systems are, and a single concrete demonstration is worth more than any number of theoretical arguments.

What if my system is too small to need this?

The workflow scales down. A small system has fewer failure modes worth testing, but it also has less budget for incidents. Measured as a fraction of the team's capacity, an outage in a small system can easily cost more than an outage in a large one. Run the experiments that match your system's size.

The chaos engineering workflow is one of the few practices I know that has paid back the time invested in it multiple times over. The serious incidents that would have hit production reach me first as experiment verdicts, in the form of a written analysis I can read with coffee instead of a phone call I have to answer at 2 a.m. The trade is one of the best ones I have made in my career as an engineer, and the workflow above is the version of it I would rebuild from scratch if I started over today.