Manideep Sale

Why Your Chaos Experiments Are Probably Wasting Time (and How to Fix It)

You have 20 microservices. You want to run chaos experiments. Where do you start?

If your answer is "the payment service" — why? Because it feels important? Because it failed last week? Because LitmusChaos defaulted to it?

Most teams pick chaos targets the same way they pick where to eat lunch — gut feel, recent memory, or whoever spoke loudest in the meeting. That's fine when you're running 2 services. It breaks down fast when you're running 20.


The actual problem

Chaos engineering has a prioritization gap. The tooling is excellent at the *how* of breaking things — LitmusChaos, Chaos Mesh, and Gremlin all do this well. None of them tells you *what* to break next.

The result: teams either test the same high-visibility services repeatedly, or they run random experiments and hope they hit something real. Both approaches leave systematic gaps.

The framing that fixed this for me came from fault tree analysis:

```
risk = impact × likelihood
```
  • Impact — if this service degrades, how many others are affected?
  • Likelihood — based on history, how probable is a failure?

A payment service with 15 downstream dependents and a clean incident history is a different risk than an auth service with 3 dependents and 5 high-severity incidents this month. Treating them identically isn't principled chaos engineering — it's just noise.
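
To make the asymmetry concrete, here's a toy version of that calculation. The scores are illustrative placeholders, not ChaosRank's actual scaling:

```python
# Toy calculation of risk = impact * likelihood.
# The scores below are illustrative placeholders, not ChaosRank's internals.

services = {
    # name: (impact ~ downstream dependents, likelihood ~ recent incident rate)
    "payment-service": (0.9, 0.1),  # 15 dependents, clean incident history
    "auth-service":    (0.3, 0.8),  # 3 dependents, 5 high-sev incidents this month
}

for name, (impact, likelihood) in services.items():
    print(f"{name}: risk = {impact * likelihood:.2f}")

# payment-service: risk = 0.09
# auth-service:    risk = 0.24
```

Even this crude version pulls the two services apart: the quiet-but-central service and the small-but-flaky one end up in different places on the list.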


What I built

ChaosRank takes your Jaeger trace export and incident history and produces a ranked list of services to target next, with a suggested fault type and confidence level for each.

```
Rank  Service            Risk   Blast  Fragility  Suggested Fault    Confidence
1     payment-service    0.91   0.88   0.96       partial-response   high
2     auth-service       0.84   0.79   0.91       latency-injection  high
3     inventory-service  0.71   0.82   0.54       pod-failure        medium
```

Blast radius is computed from your dependency graph — blended PageRank and in-degree centrality. PageRank captures transitive failure propagation. In-degree captures direct dependents. Neither alone is sufficient: a shallow-wide hub (payment called by 10 services) scores differently from a deep chain (A→B→C→D→E), but both are high-risk.
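
A minimal sketch of that blend using networkx (the caller-to-callee edge direction and the 50/50 weighting are my assumptions for illustration, not necessarily what ChaosRank ships):

```python
# Sketch: blast radius as a blend of PageRank and in-degree centrality.
# Edge direction (caller -> callee) and the 50/50 blend weight are
# illustrative assumptions, not necessarily ChaosRank's exact choices.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    # shallow-wide hub: three callers hit payment directly
    ("frontend", "payment"), ("orders", "payment"), ("checkout", "payment"),
    # deep chain: a failure in E propagates back up through D, C, B, A
    ("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"),
])

pagerank = nx.pagerank(g)            # captures transitive propagation
in_deg = nx.in_degree_centrality(g)  # captures direct dependents

blast = {n: 0.5 * pagerank[n] + 0.5 * in_deg[n] for n in g}
for node, score in sorted(blast.items(), key=lambda kv: -kv[1]):
    print(f"{node:9s} {score:.3f}")
```

On this toy graph the hub (payment) and the chain tail (E) should surface at the top, one via in-degree, the other via accumulated PageRank.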

Fragility is where the interesting engineering lives. The naive approach — count incidents, divide by traffic — produces ranking inversions at high traffic differentials. A service with 5x more incidents can rank below a quieter one if you normalize after aggregation. The fix is per-incident normalization: evaluate each event in its own traffic context before aggregating. Order matters.
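
Here's the trap with made-up numbers; the traffic figures and the sum-based aggregation are illustrative assumptions, not ChaosRank's exact formula:

```python
# Sketch of the normalization-order trap. "traffic" is the request volume
# in the window each incident occurred in; totals are monthly volume.
# All numbers are made up for illustration.
incidents = {
    # service: (total_traffic, [traffic during each incident window])
    "service-x": (1_000_000, [100, 100, 100, 100, 100]),  # 5 incidents
    "service-y": (10_000,    [100]),                       # 1 incident
}

for name, (total, per_incident_traffic) in incidents.items():
    naive = len(per_incident_traffic) / total                 # aggregate, then normalize
    per_incident = sum(1 / t for t in per_incident_traffic)   # normalize, then aggregate
    print(f"{name}: naive={naive:.6f}  per-incident={per_incident:.3f}")

# naive ranks service-y (0.000100) above service-x (0.000005): an inversion.
# per-incident ranks service-x (0.050) above service-y (0.010), as expected.
```

The busy service's five incidents vanish into its traffic denominator under the naive scheme; scoring each incident in its own window keeps them visible.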


Does it actually work?

I evaluated it on the DeathStarBench social-network topology (31 services) from the UIUC/FIRM dataset, a published research dataset with real microservice call graphs and anomaly injection traces.

The result across 20 trials:

| Metric | ChaosRank | Random |
| --- | ---: | ---: |
| Mean experiments to first weakness | 1.0 | 9.8 |
| Mean experiments to all weaknesses | 3.0 | 23.2 |

*Discovery curve — cumulative weaknesses found vs. experiments run. 20 trials, DeathStarBench social-network (31 services).*

9.8x faster to the first real weakness. That's the mean across 20 trials, with seeded weaknesses placed according to structural importance and incident history.
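
The metric itself is simple: how far down an ordering do you go before hitting a seeded weakness? A minimal reconstruction (my sketch, not the actual benchmark harness; seeds and rankings are made up):

```python
# Minimal reconstruction of the "experiments to first weakness" metric.
# Not the benchmark harness; the ordering and seeded weaknesses are made up.
import random
from statistics import mean

def experiments_to_first_hit(ordering, weaknesses):
    """1-indexed position of the first weak service in the test ordering."""
    return next(i for i, svc in enumerate(ordering, start=1) if svc in weaknesses)

services = [f"svc-{i}" for i in range(31)]   # DeathStarBench-sized topology
weaknesses = {"svc-3", "svc-11", "svc-27"}   # seeded weak services (made up)
ranked = sorted(services, key=lambda s: s not in weaknesses)  # stand-in "good" ranking

random.seed(42)
random_mean = mean(
    experiments_to_first_hit(random.sample(services, len(services)), weaknesses)
    for _ in range(20)
)
print("ranked:", experiments_to_first_hit(ranked, weaknesses))
print("random mean over 20 trials:", round(random_mean, 1))
```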


The honest limitations

ChaosRank builds its graph from synchronous Jaeger spans. If your architecture is heavily event-driven — Kafka, SQS, async choreography — your blast radius scores will be wrong. A Kafka producer consumed by 10 downstream services shows zero dependents in trace data. That's a ranking inversion, not just a missing feature. A warning fires at startup when async messaging services are detected.
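
The startup check can be as simple as scanning span tags for producer/consumer kinds. A simplified sketch, assuming Jaeger UI's JSON export shape and the standard `span.kind` tag convention (not necessarily ChaosRank's exact detector):

```python
# Simplified async-messaging detector over a Jaeger UI JSON export.
# Assumes the standard export shape {"data": [{"spans": [...], ...}]}
# and the OpenTracing/OTel convention of a "span.kind" tag set to
# "producer" or "consumer" on messaging spans.
import json

def find_async_spans(path):
    with open(path) as f:
        export = json.load(f)
    hits = set()
    for trace in export.get("data", []):
        for span in trace.get("spans", []):
            tags = {t["key"]: t["value"] for t in span.get("tags", [])}
            if tags.get("span.kind") in ("producer", "consumer"):
                hits.add(span.get("operationName", "<unknown>"))
    return hits

ops = find_async_spans("traces.json")
if ops:
    print(f"warning: async messaging detected ({len(ops)} operations); "
          "blast radius scores along these paths may be unreliable")
```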

It also only reads Jaeger JSON today. OTel OTLP is next.


Try it

```bash
# install the CLI
pip install chaosrank-cli

# the sample traces and incident data ship in the repo
git clone https://github.com/Medinz01/chaosrank
cd chaosrank

chaosrank rank \
  --traces benchmarks/real_traces/social_network.json \
  --incidents benchmarks/real_traces/social_network_incidents.csv
```

Note: Two warnings will fire against the included sample data — a phantom node warning for nginx-compose-post (expected, it's an entry point with no callers) and a signal misalignment warning (expected, the benchmark dataset has uniform incident severity by design). With your own incident history these warnings won't fire.

The output pipes directly to LitmusChaos:

```bash
chaosrank rank --output litmus | kubectl apply -f -
```

If you're running chaos experiments and want to talk through whether this fits your setup — particularly if you have async dependencies — I'm around in the comments.

Top comments (1)

**Klement Gunndu**

The per-incident normalization insight is sharp — order of operations during aggregation is a subtle trap. Curious whether ChaosRank's blast radius changes meaningfully with async service meshes, where failure propagation is not deterministic?