Manideep Sale

Why Your Chaos Experiments Are Probably Wasting Time (and How to Fix It)

You have 20 microservices. You want to run chaos experiments. Where do you start?

If your answer is "the payment service" — why? Because it feels important? Because it failed last week? Because LitmusChaos defaulted to it?

Most teams pick chaos targets the same way they pick where to eat lunch — gut feel, recent memory, or whoever spoke loudest in the meeting. That's fine when you're running 2 services. It breaks down fast when you're running 20.


The actual problem

Chaos engineering has a prioritization gap. The tooling is excellent at the *how* of breaking things — LitmusChaos, Chaos Mesh, and Gremlin all do this well. None of them tells you *what* to break next.

The result: teams either test the same high-visibility services repeatedly, or they run random experiments and hope they hit something real. Both approaches leave systematic gaps.

The framing that fixed this for me came from fault tree analysis:

```
risk = impact × likelihood
```
  • Impact — if this service degrades, how many others are affected?
  • Likelihood — based on history, how probable is a failure?

A payment service with 15 downstream dependents and a clean incident history is a different risk than an auth service with 3 dependents and 5 high-severity incidents this month. Treating them identically isn't principled chaos engineering — it's just noise.
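
To make the asymmetry concrete, here's a toy version of that calculation. The scores are illustrative placeholders, not ChaosRank's actual scaling:

```python
# Toy calculation of risk = impact * likelihood.
# The scores below are illustrative placeholders, not ChaosRank's internals.

services = {
    # name: (impact ~ downstream dependents, likelihood ~ recent incident rate)
    "payment-service": (0.9, 0.1),  # 15 dependents, clean incident history
    "auth-service":    (0.3, 0.8),  # 3 dependents, 5 high-sev incidents this month
}

for name, (impact, likelihood) in services.items():
    print(f"{name}: risk = {impact * likelihood:.2f}")

# payment-service: risk = 0.09
# auth-service:    risk = 0.24
```

Even this crude version pulls the two services apart: the quiet-but-central service and the small-but-flaky one end up in different places on the list.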


What I built

ChaosRank takes your Jaeger trace export and incident history and produces a ranked list of services to target next, with a suggested fault type and confidence level for each.

```
Rank  Service            Risk   Blast  Fragility  Suggested Fault    Confidence
1     payment-service    0.91   0.88   0.96       partial-response   high
2     auth-service       0.84   0.79   0.91       latency-injection  high
3     inventory-service  0.71   0.82   0.54       pod-failure        medium
```

Blast radius is computed from your dependency graph — blended PageRank and in-degree centrality. PageRank captures transitive failure propagation. In-degree captures direct dependents. Neither alone is sufficient: a shallow-wide hub (payment called by 10 services) scores differently from a deep chain (A→B→C→D→E), but both are high-risk.
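
A minimal sketch of that blend using networkx (the caller-to-callee edge direction and the 50/50 weighting are my assumptions for illustration, not necessarily what ChaosRank ships):

```python
# Sketch: blast radius as a blend of PageRank and in-degree centrality.
# Edge direction (caller -> callee) and the 50/50 blend weight are
# illustrative assumptions, not necessarily ChaosRank's exact choices.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    # shallow-wide hub: three callers hit payment directly
    ("frontend", "payment"), ("orders", "payment"), ("checkout", "payment"),
    # deep chain: a failure in E propagates back up through D, C, B, A
    ("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"),
])

pagerank = nx.pagerank(g)            # captures transitive propagation
in_deg = nx.in_degree_centrality(g)  # captures direct dependents

blast = {n: 0.5 * pagerank[n] + 0.5 * in_deg[n] for n in g}
for node, score in sorted(blast.items(), key=lambda kv: -kv[1]):
    print(f"{node:9s} {score:.3f}")
```

On this toy graph the hub (payment) and the chain tail (E) should surface at the top, one via in-degree, the other via accumulated PageRank.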

Fragility is where the interesting engineering lives. The naive approach — count incidents, divide by traffic — produces ranking inversions at high traffic differentials. A service with 5x more incidents can rank below a quieter one if you normalize after aggregation. The fix is per-incident normalization: evaluate each event in its own traffic context before aggregating. Order matters.
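
Here's the trap with made-up numbers; the traffic figures and the sum-based aggregation are illustrative assumptions, not ChaosRank's exact formula:

```python
# Sketch of the normalization-order trap. "traffic" is the request volume
# in the window each incident occurred in; totals are monthly volume.
# All numbers are made up for illustration.
incidents = {
    # service: (total_traffic, [traffic during each incident window])
    "service-x": (1_000_000, [100, 100, 100, 100, 100]),  # 5 incidents
    "service-y": (10_000,    [100]),                       # 1 incident
}

for name, (total, per_incident_traffic) in incidents.items():
    naive = len(per_incident_traffic) / total                 # aggregate, then normalize
    per_incident = sum(1 / t for t in per_incident_traffic)   # normalize, then aggregate
    print(f"{name}: naive={naive:.6f}  per-incident={per_incident:.3f}")

# naive ranks service-y (0.000100) above service-x (0.000005): an inversion.
# per-incident ranks service-x (0.050) above service-y (0.010), as expected.
```

The busy service's five incidents vanish into its traffic denominator under the naive scheme; scoring each incident in its own window keeps them visible.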


Does it actually work?

I evaluated it on the DeathStarBench social-network topology (31 services) from the UIUC/FIRM dataset, a published research dataset with real microservice call graphs and anomaly injection traces.

The result across 20 trials:

| Metric | ChaosRank | Random |
| --- | ---: | ---: |
| Mean experiments to first weakness | 1.0 | 9.8 |
| Mean experiments to all weaknesses | 3.0 | 23.2 |

*Discovery curve — cumulative weaknesses found vs. experiments run. 20 trials, DeathStarBench social-network (31 services).*

9.8x faster to the first real weakness. That's the mean across 20 trials, with seeded weaknesses placed according to structural importance and incident history.
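
The metric itself is simple: how far down an ordering do you go before hitting a seeded weakness? A minimal reconstruction (my sketch, not the actual benchmark harness; seeds and rankings are made up):

```python
# Minimal reconstruction of the "experiments to first weakness" metric.
# Not the benchmark harness; the ordering and seeded weaknesses are made up.
import random
from statistics import mean

def experiments_to_first_hit(ordering, weaknesses):
    """1-indexed position of the first weak service in the test ordering."""
    return next(i for i, svc in enumerate(ordering, start=1) if svc in weaknesses)

services = [f"svc-{i}" for i in range(31)]   # DeathStarBench-sized topology
weaknesses = {"svc-3", "svc-11", "svc-27"}   # seeded weak services (made up)
ranked = sorted(services, key=lambda s: s not in weaknesses)  # stand-in "good" ranking

random.seed(42)
random_mean = mean(
    experiments_to_first_hit(random.sample(services, len(services)), weaknesses)
    for _ in range(20)
)
print("ranked:", experiments_to_first_hit(ranked, weaknesses))
print("random mean over 20 trials:", round(random_mean, 1))
```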


The honest limitations

ChaosRank builds its graph from synchronous Jaeger spans. If your architecture is heavily event-driven — Kafka, SQS, async choreography — your blast radius scores will be wrong. A Kafka producer consumed by 10 downstream services shows zero dependents in trace data. That's a ranking inversion, not just a missing feature. A warning fires at startup when async messaging services are detected.
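
The startup check can be as simple as scanning span tags for producer/consumer kinds. A simplified sketch, assuming Jaeger UI's JSON export shape and the standard `span.kind` tag convention (not necessarily ChaosRank's exact detector):

```python
# Simplified async-messaging detector over a Jaeger UI JSON export.
# Assumes the standard export shape {"data": [{"spans": [...], ...}]}
# and the OpenTracing/OTel convention of a "span.kind" tag set to
# "producer" or "consumer" on messaging spans.
import json

def find_async_spans(path):
    with open(path) as f:
        export = json.load(f)
    hits = set()
    for trace in export.get("data", []):
        for span in trace.get("spans", []):
            tags = {t["key"]: t["value"] for t in span.get("tags", [])}
            if tags.get("span.kind") in ("producer", "consumer"):
                hits.add(span.get("operationName", "<unknown>"))
    return hits

ops = find_async_spans("traces.json")
if ops:
    print(f"warning: async messaging detected ({len(ops)} operations); "
          "blast radius scores along these paths may be unreliable")
```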

It also only reads Jaeger JSON today. OTel OTLP is next.


Try it

```bash
# install the CLI
pip install chaosrank-cli

# the sample traces and incident data ship in the repo
git clone https://github.com/Medinz01/chaosrank
cd chaosrank

chaosrank rank \
  --traces benchmarks/real_traces/social_network.json \
  --incidents benchmarks/real_traces/social_network_incidents.csv
```

Note: Two warnings will fire against the included sample data — a phantom node warning for nginx-compose-post (expected, it's an entry point with no callers) and a signal misalignment warning (expected, the benchmark dataset has uniform incident severity by design). With your own incident history these warnings won't fire.

The output pipes directly to LitmusChaos:

```bash
chaosrank rank --output litmus | kubectl apply -f -
```

If you're running chaos experiments and want to talk through whether this fits your setup — particularly if you have async dependencies — I'm around in the comments.

Top comments (1)

**Klement Gunndu**

The per-incident normalization insight is sharp — order of operations during aggregation is a subtle trap. Curious whether ChaosRank's blast radius changes meaningfully with async service meshes, where failure propagation is not deterministic?