The gap that motivated this project
Production fault injection tools — Gremlin, Steadybit, AWS FIS — are powerful, and the chaos engineering discipline they represent has genuinely matured over the past decade. But every tool in that class shares a structural constraint: it operates on running systems.
That constraint is fine for many organizations. It is not fine for regulated industries operating under mandates like the EU Digital Operational Resilience Act (DORA), where touching production with fault injection commands introduces risk that regulators may not accept. And it is not fine for the more fundamental question that fault injection cannot answer: what is the highest availability your architecture is mathematically capable of reaching, given its dependency structure and external SLA commitments?
Classical reliability methods — Fault Tree Analysis and Reliability Block Diagrams — do answer availability ceiling questions analytically. But they operate on static trees under a component independence assumption that does not hold for cloud infrastructure. When a shared underlay network fails, your database, your cache, and your application tier all degrade simultaneously. They are not independent. A classical RBD will overestimate availability in exactly those cases.
FaultRay is a research prototype that tries to address both gaps: no production touch, and an explicit model of correlated failure propagation. This post describes the two core technical contributions and where the work stands today.
Core contribution 1: Cascade propagation as a Labeled Transition System
The cascade engine in FaultRay is formalized as a Cascade Propagation Semantics (CPS), a Labeled Transition System (LTS) over a dependency graph.
The CPS state is a 4-tuple S = (H, L, T, V) where:
- `H: Component → HealthStatus` — health map (each component is UP, DEGRADED, OVERLOADED, or DOWN)
- `L: Component → float` — accumulated latency map in milliseconds
- `T: float` — elapsed simulation time in seconds
- `V: set[Component]` — visited set, monotonically growing
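Sketched in Python (illustrative names only, not the actual `src/faultray/simulator/cascade.py` API), the state tuple looks roughly like this:

```python
from dataclasses import dataclass, field
from enum import IntEnum


class HealthStatus(IntEnum):
    # Ordered so that "worse" health compares greater; monotonicity
    # then means a component's status value never decreases in a run.
    UP = 0
    DEGRADED = 1
    OVERLOADED = 2
    DOWN = 3


@dataclass
class CPSState:
    """Illustrative 4-tuple S = (H, L, T, V) from the CPS definition."""
    H: dict[str, HealthStatus]                  # health map per component
    L: dict[str, float]                         # accumulated latency (ms)
    T: float = 0.0                              # elapsed simulation time (s)
    V: set[str] = field(default_factory=set)    # visited set, grows monotonically


s = CPSState(H={"db": HealthStatus.UP}, L={"db": 0.0})
s.H["db"] = HealthStatus.DOWN
s.V.add("db")
```

Using `IntEnum` gives the health statuses a total order for free, which is what makes the monotonicity property expressible as "the status value never decreases."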
The key properties we prove for this system (see `src/faultray/simulator/cascade.py` for the implementation and `docs/patent/cascade-formal-spec.md` for the derivations):
- Monotonicity — health can only worsen during a simulation run. Once a component is marked DOWN, it cannot recover to UP within the same simulation. This prevents oscillation and makes simulation results stable.
- Causality — a component transitions to a degraded state only if a dependency has already transitioned. There are no spontaneous failures from unaffected upstream nodes.
- Circuit Breaker Correctness — when a circuit breaker is tripped on an edge, cascade propagation halts at that edge. The LTS formulation makes it possible to prove this is actually the case rather than just asserting it.
- Termination — for acyclic dependency graphs, CPS terminates in O(|V| + |E|) time. For graphs with cycles (which do appear in real infrastructure — think mutual health-check dependencies), a depth limit `D_max = 20` guarantees termination.
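The monotonicity property can be enforced structurally rather than checked after the fact. A minimal sketch, assuming a worseness-ordered health enum (illustrative code, not FaultRay's actual implementation):

```python
from enum import IntEnum


class Health(IntEnum):
    UP = 0
    DEGRADED = 1
    OVERLOADED = 2
    DOWN = 3


def apply_transition(health: dict, comp: str, proposed: Health) -> Health:
    # Monotonicity guard: keep the worse (max) of the current and
    # proposed statuses, so a component's health can only worsen and
    # never recovers within a single simulation run.
    health[comp] = max(health.get(comp, Health.UP), proposed)
    return health[comp]
```

Because every write goes through the `max`, oscillation (DOWN back to UP and around again) is impossible by construction, which is what makes the fixed-point argument for termination straightforward.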
The implementation uses BFS traversal with three simulation modes corresponding to different transition subsets:
- `simulate_fault` — Rules 1–5: fault injection followed by recursive propagation
- `simulate_latency_cascade` — Rules 1, 6–7: latency BFS with circuit breaker halts
- `simulate_traffic_spike` — Rule 1 applied per-component for capacity threshold checks
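The shared traversal skeleton can be sketched as BFS over reverse dependencies with a visited set (monotone growth, O(|V| + |E|) on acyclic graphs), the depth bound for cycles, and circuit-breaker edges that halt propagation. Function and parameter names below are hypothetical, not FaultRay's actual API:

```python
from collections import deque

D_MAX = 20  # depth bound guaranteeing termination on cyclic graphs


def propagate(deps: dict, root: str, breakers: frozenset = frozenset()) -> set:
    """Return components affected by a failure at `root`.

    `deps` maps a component to the components that depend on it;
    `breakers` is a set of (upstream, downstream) edges with a
    tripped circuit breaker, at which propagation halts.
    """
    affected = {root}
    queue = deque([(root, 0)])
    while queue:
        comp, depth = queue.popleft()
        if depth >= D_MAX:
            continue  # depth bound: cuts off cycles (mutual health checks)
        for dependent in deps.get(comp, ()):
            if (comp, dependent) in breakers:
                continue  # circuit breaker correctness: edge halts cascade
            if dependent not in affected:  # visited set: monotone growth
                affected.add(dependent)
                queue.append((dependent, depth + 1))
    return affected
```

Note that each component enters the queue at most once via the visited-set check, which is exactly where the O(|V| + |E|) bound comes from.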
Why does formalizing this as an LTS matter in practice? Because it turns "the cascade engine behaves correctly" from an informal claim into something you can reason about systematically. The O(|V| + |E|) complexity bound is not a benchmark result — it follows from the BFS structure and the monotonicity guarantee. The termination proof holds for the cyclic case not because we tested it on enough graphs but because the depth bound is structurally enforced.
Core contribution 2: N-layer min-composition availability model
The second contribution is an availability ceiling model that explicitly decomposes a system's maximum achievable availability across five distinct constraint layers.
The five layers are:
| Layer | What it captures |
|---|---|
| L1 Software | Deployment downtime, human error rate, configuration drift |
| L2 Hardware | MTBF, MTTR, redundancy factor, failover promotion time |
| L3 Theoretical | Irreducible physical noise: packet loss, GC pauses, kernel scheduling jitter |
| L4 Operational | Incident response time, on-call team size, detection latency |
| L5 External SLA | Product of all external dependency SLAs (cloud providers, third-party APIs) |
The composition operator is min:
A_effective = min(A_L1, A_L2, A_L3, A_L4, A_L5)
This departs from the independence assumption of classical Reliability Block Diagrams, where you would multiply availabilities across components. The min operator captures a different claim: the most constrained layer determines the ceiling. If your external SLA chain caps you at 99.9% (three nines), it does not matter that your software and hardware layers could theoretically support 99.99%. The system cannot exceed its external dependency constraint.
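A minimal sketch of the min-composition, returning the binding layer alongside the ceiling so the constraint is visible. The layer values are made-up illustrations:

```python
def effective_availability(layers: dict) -> tuple:
    """Min-composition: report the ceiling together with the binding
    layer, instead of folding everything into a single product."""
    binding = min(layers, key=layers.get)
    return layers[binding], binding


layers = {
    "L1_software": 0.9995,
    "L2_hardware": 0.9999,
    "L3_theoretical": 0.99995,
    "L4_operational": 0.9992,
    "L5_external_sla": 0.999,  # three-nines external SLA chain
}
```

With these numbers the external SLA chain binds at 99.9%, and no amount of hardware redundancy moves the ceiling until that constraint is addressed.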
The L2 hardware layer uses a standard parallel reliability model: for a component with `replicas` instances, the tier availability is A_tier = 1 - (1 - A_single)^replicas, where A_single = MTBF / (MTBF + MTTR). This is classical. What the model adds is the explicit failover penalty — the fraction of uptime lost during replica promotion — and the structural separation of the five layers, so the binding constraint is visible rather than hidden inside a single number.
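The L2 computation can be sketched as follows; the multiplicative placement of the failover penalty is an assumption for illustration, not necessarily FaultRay's exact formula:

```python
def tier_availability(mtbf: float, mttr: float, replicas: int,
                      failover_penalty: float = 0.0) -> float:
    """Parallel reliability model from the text, with the failover
    penalty applied as a multiplicative uptime loss (illustrative
    placement; the actual model may apply it differently)."""
    a_single = mtbf / (mtbf + mttr)          # A_single = MTBF / (MTBF + MTTR)
    a_tier = 1.0 - (1.0 - a_single) ** replicas
    return a_tier * (1.0 - failover_penalty)
```

For example, with MTBF = 1000 h and MTTR = 1 h, a single instance sits just under three nines, while two replicas push the tier toward six nines before the failover penalty is subtracted.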
What the tool looks like in practice
```shell
$ pip install faultray
$ faultray demo
Building demo infrastructure...
╭────────────────────────────────────────────────────╮
│ Metric             │ Value                         │
│ Components         │ 9                             │
│ Dependencies       │ 12                            │
│ Resilience Score   │ 50.0/100                      │
╰────────────────────────────────────────────────────╯
Running chaos simulation...
╭──────────── FaultRay Chaos Simulation Report ──────────╮
│ Resilience Score: 50/100                                │
│ Scenarios tested: 255                                   │
│ Critical: 21   Warning: 84   Passed: 150                │
╰─────────────────────────────────────────────────────────╯
```
Generate an HTML report with `faultray simulate --html report.html`, or DORA evidence with `faultray dora evidence infra.yaml`.
To run the five-layer availability model on a topology:
```shell
faultray availability --model infra.json --layers 5
```
To run the cascade engine directly on a YAML infrastructure model:
```shell
faultray simulate --model infra.yaml --cascade-depth 5 --html report.html
```
The tool accepts infrastructure defined in YAML (manual) or JSON exported from Terraform (`faultray tf-check plan.json`). The dependency graph is a directed acyclic graph by default; the engine handles cyclic cases via the depth bound described above.
Honest assessment of the backtest
We ran the cascade engine against 18 well-documented cloud incidents spanning 2017–2024 (AWS S3 2017, Meta BGP 2021, Cloudflare 2022, CrowdStrike 2024, and others from the public postmortem record). The results show F1 = 1.000, precision = 1.000, recall = 1.000 on affected-component identification across all 18 incidents.
We want to be explicit about what those numbers mean and do not mean.
What they mean: Given a topology that matches the incident's documented architecture and a fault injection at the documented root-cause component, the cascade engine correctly identifies which components the postmortem reported as affected. This validates that the LTS propagation rules are consistent with real-world cascade behavior on these known incidents.
What they do not mean: This is a post-hoc reproduction, not a prospective prediction. We built the topologies knowing what failed. F1 = 1.000 on 18 known incidents does not imply the engine will predict future incidents correctly on topologies it has never seen. Prospective validation — building topologies for incidents that occurred after the paper was written and measuring prediction accuracy without ground-truth fitting — is the work that needs to happen before any predictive claim can be made.
The downtime MAE of ~3,159 minutes across the 18 incidents reflects a known deficiency in the current model: the cascade engine propagates structural failure correctly but does not model recovery dynamics. Actual downtime depends on incident response procedures, team capacity, and external vendor resolution timelines that the simulation does not capture. The calibration recommendations in `docs/backtest-results.md` include a `downtime_bias_correction` factor of 3,159.53 minutes — a correction that large is itself a signal that the downtime estimation module needs a richer operational model.
Severity accuracy averages 0.819. Severity is harder to match than affected-component sets because it depends on load and traffic patterns at the time of the incident, which the static topology model does not capture.
What this is not
FaultRay is not a compliance tool. Its outputs are not certified audit evidence. The DORA research dashboard is a prototype mapping of FaultRay's simulation outputs to DORA's five pillars — it is illustrative, not certifiable. Do not submit FaultRay output as audit evidence without independent legal and technical review.
FaultRay does not predict future incidents. The formal properties of the LTS — termination, monotonicity, causality — are properties of the simulation engine, not of your production system. The simulation shows you what would happen given the assumptions encoded in your topology model. If your model is wrong, the simulation output is wrong.
FaultRay is not a replacement for operational chaos engineering. Gremlin, Steadybit, and AWS FIS test your actual system under actual load with actual failure signals propagating through actual monitoring. FaultRay tests a model of your system. The two approaches answer different questions and are complementary rather than competitive.
Concurrent work
Krasnovsky (arXiv:2506.11176, to appear at ICSE-NIER 2026) presents concurrent complementary work on in-memory graph simulation for chaos engineering, using Monte Carlo fail-stop simulation over service-dependency graphs auto-discovered from Jaeger distributed traces. The core overlap is positioning — both tools simulate in-memory rather than injecting real faults. The approaches diverge technically: Krasnovsky uses Monte Carlo methods without formal proofs or multi-layer decomposition; FaultRay uses an LTS with formal termination and complexity guarantees plus the N-layer min-composition model. We treat this as concurrent independent validation that the in-memory simulation direction is worth pursuing, not as prior art that invalidates either contribution.
Status and roadmap
- PyPI: `pip install faultray` (v11.1.0, Apache 2.0)
- GitHub: mattyopon/faultray
- Zenodo DOI: 10.5281/zenodo.19139911
- USPTO provisional patent: Application No. 64/010,200, filed 2026-03-19 (non-provisional deadline 2027-03-19)
- ISSRE 2026 Fast Abstract: submission planned for the 37th IEEE International Symposium on Software Reliability Engineering (Fast Abstracts track)
The paper rewrite currently in progress (v12) is stripping the AI agent failure taxonomy sections — that contribution was pre-dated by MAST (arXiv:2503.3657, NeurIPS 2025) and multiple concurrent papers — and focusing on strengthening the formal cascade engine proof and the N-layer model justification. The prospective validation experiment (building topologies for post-v11 incidents and measuring unseen-topology precision/recall) is the next concrete empirical step.
If you are working on infrastructure resilience simulation, formal methods for distributed systems, or chaos engineering tooling, the repository is open and pull requests are welcome. Issues with real incident topologies that the cascade engine handles incorrectly are especially useful.
FaultRay is a research prototype. It is NOT validated for DORA, FISC, or any regulatory audit. Do not rely on FaultRay outputs for compliance decisions without independent legal and technical review. Apache License 2.0.