<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manideep Sale</title>
    <description>The latest articles on DEV Community by Manideep Sale (@medinz01).</description>
    <link>https://dev.to/medinz01</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3809817%2Fd2508221-e903-4da3-8b83-f961951a9f65.png</url>
      <title>DEV Community: Manideep Sale</title>
      <link>https://dev.to/medinz01</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/medinz01"/>
    <language>en</language>
    <item>
      <title>Question for teams doing chaos engineering: how do you choose experiment targets?</title>
      <dc:creator>Manideep Sale</dc:creator>
      <pubDate>Sun, 08 Mar 2026 15:34:31 +0000</pubDate>
      <link>https://dev.to/medinz01/question-for-teams-doing-chaos-engineering-how-do-you-choose-experiment-targets-4b4d</link>
      <guid>https://dev.to/medinz01/question-for-teams-doing-chaos-engineering-how-do-you-choose-experiment-targets-4b4d</guid>
      <description>&lt;p&gt;While working on a side project related to service reliability, I ran into a question I’m curious to hear answered by people actually running chaos experiments.&lt;/p&gt;

&lt;p&gt;Most chaos engineering discussions focus on the types of experiments (latency injection, pod failure, network faults, etc.).&lt;/p&gt;

&lt;p&gt;But something less obvious is how teams choose where to run experiments in the first place. In a system with many microservices, there are lots of possible targets.&lt;/p&gt;

&lt;p&gt;Do teams typically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rotate through services over time&lt;/li&gt;
&lt;li&gt;prioritize ones that caused incidents&lt;/li&gt;
&lt;li&gt;focus on critical dependency paths&lt;/li&gt;
&lt;li&gt;rely on platform/SRE intuition&lt;/li&gt;
&lt;li&gt;something else?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interested to hear how this works in real environments.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>discuss</category>
      <category>microservices</category>
      <category>sre</category>
    </item>
    <item>
      <title>Why Your Chaos Experiments Are Probably Wasting Time (and How to Fix It)</title>
      <dc:creator>Manideep Sale</dc:creator>
      <pubDate>Fri, 06 Mar 2026 12:49:51 +0000</pubDate>
      <link>https://dev.to/medinz01/why-your-chaos-experiments-are-probably-wasting-time-and-how-to-fix-it-420b</link>
      <guid>https://dev.to/medinz01/why-your-chaos-experiments-are-probably-wasting-time-and-how-to-fix-it-420b</guid>
      <description>&lt;p&gt;You have 20 microservices. You want to run chaos experiments. Where do you start?&lt;/p&gt;

&lt;p&gt;If your answer is "the payment service" — why? Because it feels important? Because it failed last week? Because LitmusChaos defaulted to it?&lt;/p&gt;

&lt;p&gt;Most teams pick chaos targets the same way they pick where to eat lunch — gut feel, recent memory, or whoever spoke loudest in the meeting. That's fine when you're running 2 services. It breaks down fast when you're running 20.&lt;/p&gt;




&lt;h2&gt;The actual problem&lt;/h2&gt;

&lt;p&gt;Chaos engineering has a prioritization gap. The tooling is excellent at &lt;em&gt;how&lt;/em&gt; to break things — LitmusChaos, Chaos Mesh, Gremlin all do this well. None of them tell you &lt;em&gt;what&lt;/em&gt; to break next.&lt;/p&gt;

&lt;p&gt;The result: teams either test the same high-visibility services repeatedly, or they run random experiments and hope they hit something real. Both approaches leave systematic gaps.&lt;/p&gt;

&lt;p&gt;The framing that fixed this for me came from fault tree analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;risk = impact × likelihood
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Impact&lt;/strong&gt; — if this service degrades, how many others are affected?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Likelihood&lt;/strong&gt; — based on history, how probable is a failure?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A payment service with 15 downstream dependents and a clean incident history is a different risk than an auth service with 3 dependents and 5 high-severity incidents this month. Treating them identically isn't principled chaos engineering — it's just noise.&lt;/p&gt;
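
&lt;p&gt;To make that concrete, here's a toy sketch of the scoring shape in Python. The numbers mirror the example above, and the +1 smoothing is just for illustration; it's not the exact formula ChaosRank uses.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Toy illustration of risk = impact x likelihood, not the exact scoring.
# Numbers mirror the payment vs auth example above.
services = {
    # name: (downstream dependents, recent high-severity incidents)
    "payment-service": (15, 0),
    "auth-service": (3, 5),
}

max_deps = max(deps for deps, _ in services.values())
max_incidents = max(inc for _, inc in services.values())

for name, (deps, incidents) in services.items():
    impact = deps / max_deps                            # how many others feel a failure
    likelihood = (incidents + 1) / (max_incidents + 1)  # +1 smoothing: a clean history isn't zero risk
    risk = impact * likelihood
    print(f"{name}: impact={impact:.2f} likelihood={likelihood:.2f} risk={risk:.2f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Even this crude version separates the incident-heavy auth service from the heavily depended-on but quiet payment service instead of treating them identically.&lt;/p&gt;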




&lt;h2&gt;What I built&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Medinz01/chaosrank" rel="noopener noreferrer"&gt;ChaosRank&lt;/a&gt; takes your Jaeger trace export and incident history and produces a ranked list of services to target next, with a suggested fault type and confidence level for each.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Rank  Service           Risk   Blast  Fragility  Suggested Fault     Confidence
1     payment-service   0.91   0.88   0.96       partial-response    high
2     auth-service      0.84   0.79   0.91       latency-injection   high
3     inventory-service 0.71   0.82   0.54       pod-failure         medium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Blast radius&lt;/strong&gt; is computed from your dependency graph — blended PageRank and in-degree centrality. PageRank captures transitive failure propagation. In-degree captures direct dependents. Neither alone is sufficient: a shallow-wide hub (payment called by 10 services) scores differently from a deep chain (A→B→C→D→E), but both are high-risk.&lt;/p&gt;
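
&lt;p&gt;Here's a rough sketch of that kind of blend with networkx. The toy call graph and the 50/50 weighting are just for illustration, not the exact parameters.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import networkx as nx

# Hypothetical call graph: each edge goes from caller to callee. In practice
# the graph is built from Jaeger spans; these services are made up.
calls = [
    ("frontend", "payment"), ("orders", "payment"), ("billing", "payment"),
    ("frontend", "auth"),
    ("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"),
]
g = nx.DiGraph()
g.add_edges_from(calls)

pr = nx.pagerank(g)          # transitive importance: deep-chain tails score high
indeg = dict(g.in_degree())  # direct dependents: wide hubs score high
max_pr, max_indeg = max(pr.values()), max(indeg.values())

# Illustrative 50/50 blend of the two normalized signals.
blast = {n: 0.5 * pr[n] / max_pr + 0.5 * indeg[n] / max_indeg for n in g}
for n, score in sorted(blast.items(), key=lambda kv: -kv[1]):
    print(f"{n}  {score:.2f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On this toy graph the wide hub (&lt;code&gt;payment&lt;/code&gt;) dominates in-degree while the deep-chain tail (&lt;code&gt;e&lt;/code&gt;) does well on PageRank; the blend keeps both near the top.&lt;/p&gt;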

&lt;p&gt;&lt;strong&gt;Fragility&lt;/strong&gt; is where the interesting engineering lives. The naive approach — count incidents, divide by traffic — produces ranking inversions at high traffic differentials. A service with 5x more incidents can rank &lt;em&gt;below&lt;/em&gt; a quieter one if you normalize after aggregation. The fix is per-incident normalization: evaluate each event in its own traffic context before aggregating. Order matters.&lt;/p&gt;
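
&lt;p&gt;A toy illustration of that ordering effect, with made-up numbers and a deliberately simplified normalization:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Made-up incident records: (service, requests in the window the incident hit).
incidents = [
    ("auth-service", 1_000), ("auth-service", 1_000), ("auth-service", 1_000),
    ("auth-service", 1_000), ("auth-service", 1_000),
    ("inventory-service", 1_000),
]
# Traffic over the whole evaluation period.
total_traffic = {"auth-service": 1_000_000, "inventory-service": 10_000}

# Naive: aggregate first, normalize after. The busy service's off-peak traffic
# dilutes its 5 incidents and it ranks below the quiet one.
naive = {}
for svc in total_traffic:
    count = sum(1 for s, _ in incidents if s == svc)
    naive[svc] = count / total_traffic[svc]

# Per-incident: normalize each event by the traffic it actually coincided with,
# then aggregate. The 5-incident service correctly ranks higher.
per_incident = {svc: 0.0 for svc in total_traffic}
for svc, window_traffic in incidents:
    per_incident[svc] += 1 / window_traffic

print("naive:", naive)                # auth 5e-06 vs inventory 1e-04 (inverted)
print("per-incident:", per_incident)  # auth 5e-03 vs inventory 1e-03
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;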




&lt;h2&gt;Does it actually work?&lt;/h2&gt;

&lt;p&gt;I evaluated it on the DeathStarBench social-network topology (31 services) from the UIUC/FIRM dataset, a published research dataset with real microservice call graphs and anomaly injection traces.&lt;/p&gt;

&lt;p&gt;The result across 20 trials:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;ChaosRank&lt;/th&gt;
&lt;th&gt;Random&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mean experiments to first weakness&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;9.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean experiments to all weaknesses&lt;/td&gt;
&lt;td&gt;3.0&lt;/td&gt;
&lt;td&gt;23.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9uwmy0kavvwe7o7r24e4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9uwmy0kavvwe7o7r24e4.png" alt="Discovery curve — cumulative weaknesses found vs experiments run. 20 trials, DeathStarBench social-network (31 services)." width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;9.8x faster to the first real weakness. That's the mean across 20 trials, with seeded weaknesses placed according to structural importance and incident history.&lt;/p&gt;
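
&lt;p&gt;For intuition about the metric itself, here's a toy simulation of "experiments to first weakness". It's not the benchmark harness; the seeded weak spots and the ranked ordering are made up to show the shape of the measurement.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
import statistics

# 31 services, mirroring the DeathStarBench topology size; everything else is made up.
services = [f"svc-{i}" for i in range(31)]
weaknesses = {"svc-0", "svc-3", "svc-7"}   # pretend these carry the seeded weaknesses
ranked = services                          # pretend the ranker put them near the top

def experiments_to_first(order):
    for i, svc in enumerate(order, start=1):
        if svc in weaknesses:
            return i
    return len(order)

random_trials = []
for _ in range(20):
    shuffled = services[:]
    random.shuffle(shuffled)
    random_trials.append(experiments_to_first(shuffled))

print("ranked:", experiments_to_first(ranked))
print("random mean:", statistics.mean(random_trials))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;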




&lt;h2&gt;The honest limitations&lt;/h2&gt;

&lt;p&gt;ChaosRank builds its graph from synchronous Jaeger spans. If your architecture is heavily event-driven — Kafka, SQS, async choreography — your blast radius scores will be wrong. A Kafka producer consumed by 10 downstream services shows zero dependents in trace data. That's a ranking inversion, not just a missing feature. A warning fires at startup when async messaging services are detected.&lt;/p&gt;
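
&lt;p&gt;One way to detect that kind of setup is a simple heuristic over service names; here's a sketch of the idea (illustrative only, not necessarily how the tool does it):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative heuristic only: if service names mention a message broker,
# blast radius computed from synchronous spans is probably undercounting.
ASYNC_HINTS = ("kafka", "sqs", "rabbitmq", "sns", "pubsub", "nats")

def async_suspects(service_names):
    return [s for s in service_names if any(hint in s.lower() for hint in ASYNC_HINTS)]

suspects = async_suspects(["payment-service", "kafka-order-events", "auth-service"])
if suspects:
    print(f"warning: async messaging detected ({suspects}); blast radius may be undercounted")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;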

&lt;p&gt;It also reads only Jaeger JSON today. OTel OTLP support is next.&lt;/p&gt;




&lt;h2&gt;Try it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;chaosrank-cli
git clone https://github.com/Medinz01/chaosrank
&lt;span class="nb"&gt;cd &lt;/span&gt;chaosrank
chaosrank rank &lt;span class="nt"&gt;--traces&lt;/span&gt; benchmarks/real_traces/social_network.json &lt;span class="nt"&gt;--incidents&lt;/span&gt; benchmarks/real_traces/social_network_incidents.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Two warnings will fire against the included sample data — a phantom node warning for &lt;code&gt;nginx-compose-post&lt;/code&gt; (expected, it's an entry point with no callers) and a signal misalignment warning (expected, the benchmark dataset has uniform incident severity by design). With your own incident history these warnings won't fire.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The output pipes directly to LitmusChaos:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;chaosrank rank &lt;span class="nt"&gt;--output&lt;/span&gt; litmus | kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're running chaos experiments and want to talk through whether this fits your setup — particularly if you have async dependencies — I'm around in the comments.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>chaosengineering</category>
      <category>python</category>
    </item>
  </channel>
</rss>
