<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: yutaro</title>
    <description>The latest articles on DEV Community by yutaro (@yutaro_41c2deef88001afd50).</description>
    <link>https://dev.to/yutaro_41c2deef88001afd50</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3839939%2Fe7552c84-b002-43a9-bead-08021a4a7378.png</url>
      <title>DEV Community: yutaro</title>
      <link>https://dev.to/yutaro_41c2deef88001afd50</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yutaro_41c2deef88001afd50"/>
    <language>en</language>
    <item>
      <title>FaultRay: Why We Formalized Cascade Failure Propagation as a Labeled Transition System</title>
      <dc:creator>yutaro</dc:creator>
      <pubDate>Fri, 10 Apr 2026 08:51:26 +0000</pubDate>
      <link>https://dev.to/yutaro_41c2deef88001afd50/faultray-why-we-formalized-cascade-failure-propagation-as-a-labeled-transition-system-1mh3</link>
      <guid>https://dev.to/yutaro_41c2deef88001afd50/faultray-why-we-formalized-cascade-failure-propagation-as-a-labeled-transition-system-1mh3</guid>
      <description>&lt;h2&gt;
  
  
  The gap that motivated this project
&lt;/h2&gt;

&lt;p&gt;Production fault injection tools — Gremlin, Steadybit, AWS FIS — are powerful, and the chaos engineering discipline they represent has genuinely matured over the past decade. But every tool in that class shares a structural constraint: it operates on running systems.&lt;/p&gt;

&lt;p&gt;That constraint is fine for many organizations. It is not fine for regulated industries operating under mandates like the EU Digital Operational Resilience Act (DORA), where touching production with fault injection commands introduces risk that regulators may not accept. And it is not fine for the more fundamental question that fault injection &lt;em&gt;cannot&lt;/em&gt; answer: what is the highest availability your architecture is mathematically capable of reaching, given its dependency structure and external SLA commitments?&lt;/p&gt;

&lt;p&gt;Classical reliability methods — Fault Tree Analysis and Reliability Block Diagrams — do answer availability ceiling questions analytically. But they operate on static trees under a component independence assumption that does not hold for cloud infrastructure. When a shared underlay network fails, your database, your cache, and your application tier all degrade simultaneously. They are not independent. A classical RBD will overestimate availability in exactly those cases.&lt;/p&gt;
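&lt;p&gt;A toy calculation makes the overestimation concrete. With invented numbers, two replicas on a shared underlay look far better under the independence assumption than they can actually be:&lt;/p&gt;

```python
# Toy numbers, purely illustrative: two replicas, each 99.5% available,
# sharing one underlay network that is itself 99.9% available.
A_replica = 0.995
A_underlay = 0.999

# Classical RBD parallel model, assuming the replicas fail independently:
A_rbd = 1 - (1 - A_replica) ** 2

# Correlated view: an underlay outage takes both replicas down together,
# so the pair can never be more available than the underlay itself.
A_correlated = A_underlay * A_rbd

print(round(A_rbd, 6), round(A_correlated, 6))
```

&lt;p&gt;The independent model reports roughly four and a half nines; the shared dependency caps the pair just below the underlay's three nines.&lt;/p&gt;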

&lt;p&gt;FaultRay is a research prototype that tries to address both gaps: no production touch, and an explicit model of correlated failure propagation. This post describes the two core technical contributions and where the work stands today.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core contribution 1: Cascade propagation as a Labeled Transition System
&lt;/h2&gt;

&lt;p&gt;The cascade engine in FaultRay is formalized as a &lt;strong&gt;Cascade Propagation Semantics (CPS)&lt;/strong&gt;, a Labeled Transition System (LTS) over a dependency graph.&lt;/p&gt;

&lt;p&gt;The CPS state is a 4-tuple &lt;code&gt;S = (H, L, T, V)&lt;/code&gt; where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;H: Component → HealthStatus&lt;/code&gt; — health map (each component is UP, DEGRADED, OVERLOADED, or DOWN)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;L: Component → float&lt;/code&gt; — accumulated latency map in milliseconds&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T: float&lt;/code&gt; — elapsed simulation time in seconds&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V: set[Component]&lt;/code&gt; — visited set, monotonically growing&lt;/li&gt;
&lt;/ul&gt;
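&lt;p&gt;As a minimal sketch (field names here are illustrative; the actual implementation lives in &lt;code&gt;src/faultray/simulator/cascade.py&lt;/code&gt;), the 4-tuple maps naturally onto a dataclass:&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum

class HealthStatus(Enum):
    UP = "up"
    DEGRADED = "degraded"
    OVERLOADED = "overloaded"
    DOWN = "down"

@dataclass
class CascadeState:
    """Illustrative CPS state S = (H, L, T, V)."""
    health: dict        # H: component id to HealthStatus
    latency_ms: dict    # L: component id to accumulated latency (ms)
    elapsed_s: float    # T: elapsed simulation time (s)
    visited: set        # V: visited set, only ever grows

state = CascadeState(
    health={"db": HealthStatus.UP, "api": HealthStatus.UP},
    latency_ms={"db": 0.0, "api": 0.0},
    elapsed_s=0.0,
    visited=set(),
)
print(state.health["db"].value)
```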

&lt;p&gt;The key properties we prove for this system are (see &lt;code&gt;src/faultray/simulator/cascade.py&lt;/code&gt; for the implementation and &lt;code&gt;docs/patent/cascade-formal-spec.md&lt;/code&gt; for the derivations):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Monotonicity&lt;/strong&gt; — health can only worsen during a simulation run. Once a component is marked DOWN, it cannot recover to UP within the same simulation. This prevents oscillation and makes simulation results stable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Causality&lt;/strong&gt; — a component transitions to a degraded state only if a dependency has already transitioned. There are no spontaneous failures from unaffected upstream nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit Breaker Correctness&lt;/strong&gt; — when a circuit breaker is tripped on an edge, cascade propagation halts at that edge. The LTS formulation makes it possible to prove this is actually the case rather than just asserting it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Termination&lt;/strong&gt; — for acyclic dependency graphs, CPS terminates in O(|V| + |E|) time. For graphs with cycles (which do appear in real infrastructure — think mutual health-check dependencies), a depth limit &lt;code&gt;D_max = 20&lt;/code&gt; guarantees termination.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The implementation uses BFS traversal with three simulation modes corresponding to different transition subsets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;simulate_fault&lt;/code&gt; — Rules 1–5: fault injection followed by recursive propagation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;simulate_latency_cascade&lt;/code&gt; — Rules 1, 6–7: latency BFS with circuit breaker halts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;simulate_traffic_spike&lt;/code&gt; — Rule 1 applied per-component for capacity threshold checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why does formalizing this as an LTS matter in practice? Because it turns "the cascade engine behaves correctly" from an informal claim into something you can reason about systematically. The O(|V| + |E|) complexity bound is not a benchmark result — it follows from the BFS structure and the monotonicity guarantee. The termination proof holds for the cyclic case not because we tested it on enough graphs but because the depth bound is structurally enforced.&lt;/p&gt;
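&lt;p&gt;A heavily simplified sketch, not the FaultRay implementation, shows how BFS plus a monotone health order and a depth bound yield those guarantees structurally:&lt;/p&gt;

```python
from collections import deque

# Severity ordering: larger means worse. Monotonicity is enforced by
# only ever moving a component to a worse state, never back.
SEVERITY = {"UP": 0, "DEGRADED": 1, "DOWN": 2}
D_MAX = 20  # structural depth bound that guarantees termination on cycles

def propagate(dependents, health, root):
    """BFS cascade. dependents[x] lists components that depend on x."""
    health[root] = "DOWN"
    queue = deque([(root, 0)])
    visited = {root}  # V grows monotonically
    while queue:
        node, depth = queue.popleft()
        if depth == D_MAX:
            continue  # depth bound halts cyclic propagation
        for dep in dependents.get(node, []):
            # Causality: dep transitions only because node already did.
            # Monotonicity: UP may worsen to DEGRADED, never the reverse.
            if health[dep] == "UP":
                health[dep] = "DEGRADED"
            if dep not in visited:
                visited.add(dep)
                queue.append((dep, depth + 1))
    return health

graph = {"db": ["api"], "api": ["web"]}
print(propagate(graph, {"db": "UP", "api": "UP", "web": "UP"}, "db"))
```

&lt;p&gt;Each component enters the queue at most once, which is where the O(|V| + |E|) bound for acyclic graphs comes from.&lt;/p&gt;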




&lt;h2&gt;
  
  
  Core contribution 2: N-layer min-composition availability model
&lt;/h2&gt;

&lt;p&gt;The second contribution is an availability ceiling model that explicitly decomposes a system's maximum achievable availability across five distinct constraint layers.&lt;/p&gt;

&lt;p&gt;The five layers are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it captures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;L1 Software&lt;/td&gt;
&lt;td&gt;Deployment downtime, human error rate, configuration drift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L2 Hardware&lt;/td&gt;
&lt;td&gt;MTBF, MTTR, redundancy factor, failover promotion time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L3 Theoretical&lt;/td&gt;
&lt;td&gt;Irreducible physical noise: packet loss, GC pauses, kernel scheduling jitter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L4 Operational&lt;/td&gt;
&lt;td&gt;Incident response time, on-call team size, detection latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L5 External SLA&lt;/td&gt;
&lt;td&gt;Product of all external dependency SLAs (cloud providers, third-party APIs)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The composition operator is &lt;code&gt;min&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;A_effective&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A_L1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;A_L2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;A_L3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;A_L4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;A_L5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This departs from the independence assumption of classical Reliability Block Diagrams, where you would multiply availabilities across components. The &lt;code&gt;min&lt;/code&gt; operator captures a different claim: &lt;em&gt;the most constrained layer determines the ceiling&lt;/em&gt;. If your external SLA chain caps you at 99.9% (three nines), it does not matter that your software and hardware layers could theoretically support 99.99%. The system cannot exceed its external dependency constraint.&lt;/p&gt;

&lt;p&gt;The L2 hardware layer uses a standard parallel reliability model: for a component with &lt;code&gt;replicas&lt;/code&gt; instances, the tier availability is &lt;code&gt;A_tier = 1 - (1 - A_single)^replicas&lt;/code&gt;, where &lt;code&gt;A_single = MTBF / (MTBF + MTTR)&lt;/code&gt;. This is classical. What the model adds is the explicit failover penalty — the fraction of uptime lost during replica promotion — and the structural separation of the five layers so the binding constraint is visible rather than hidden inside a single number.&lt;/p&gt;
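&lt;p&gt;The L2 arithmetic can be written down directly. The sketch below applies a simplified multiplicative failover penalty, which is an assumption for illustration; FaultRay's exact treatment may differ:&lt;/p&gt;

```python
def tier_availability(mtbf_h, mttr_h, replicas, failover_penalty=0.0):
    """Parallel-redundancy availability for one tier.

    failover_penalty is the fraction of uptime lost to replica
    promotion (simplified placeholder, not FaultRay's exact formula).
    """
    a_single = mtbf_h / (mtbf_h + mttr_h)        # A_single = MTBF / (MTBF + MTTR)
    a_tier = 1 - (1 - a_single) ** replicas      # A_tier = 1 - (1 - A_single)^replicas
    return a_tier * (1 - failover_penalty)

# Toy numbers: MTBF 10,000 h, MTTR 4 h.
print(tier_availability(10_000, 4, 1))
print(tier_availability(10_000, 4, 3))
```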




&lt;h2&gt;
  
  
  What the tool looks like in practice
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;faultray
faultray demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Building demo infrastructure...
╭────────────────────────────────────────────────────╮
│ Metric           │ Value                           │
│ Components       │ 9                               │
│ Dependencies     │ 12                              │
│ Resilience Score │ 50.0/100                        │
╰────────────────────────────────────────────────────╯

Running chaos simulation...

╭──────────── FaultRay Chaos Simulation Report ──────────╮
│ Resilience Score: 50/100                                │
│ Scenarios tested: 255                                   │
│ Critical: 21  Warning: 84  Passed: 150                  │
╰─────────────────────────────────────────────────────────╯

  Generate HTML report: faultray simulate --html report.html
  Generate DORA evidence: faultray dora evidence infra.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run the five-layer availability model on a topology:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;faultray availability &lt;span class="nt"&gt;--model&lt;/span&gt; infra.json &lt;span class="nt"&gt;--layers&lt;/span&gt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run the cascade engine directly on a YAML infrastructure model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;faultray simulate &lt;span class="nt"&gt;--model&lt;/span&gt; infra.yaml &lt;span class="nt"&gt;--cascade-depth&lt;/span&gt; 5 &lt;span class="nt"&gt;--html&lt;/span&gt; report.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool accepts infrastructure defined in YAML (manual) or JSON exported from Terraform (&lt;code&gt;faultray tf-check plan.json&lt;/code&gt;). The dependency graph is a directed acyclic graph by default; the engine handles cyclic cases via the depth bound described above.&lt;/p&gt;




&lt;h2&gt;
  
  
  Honest assessment of the backtest
&lt;/h2&gt;

&lt;p&gt;We ran the cascade engine against 18 well-documented cloud incidents spanning 2017–2024 (AWS S3 2017, Meta BGP 2021, Cloudflare 2022, CrowdStrike 2024, and others from the public postmortem record). The results show F1 = 1.000, precision = 1.000, recall = 1.000 on affected-component identification across all 18 incidents.&lt;/p&gt;

&lt;p&gt;We want to be explicit about what those numbers mean and do not mean.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What they mean:&lt;/strong&gt; Given a topology that matches the incident's documented architecture and a fault injection at the documented root-cause component, the cascade engine correctly identifies which components the postmortem reported as affected. This validates that the LTS propagation rules are consistent with real-world cascade behavior on these known incidents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What they do not mean:&lt;/strong&gt; This is a post-hoc reproduction, not a prospective prediction. We built the topologies knowing what failed. F1 = 1.000 on 18 known incidents does not imply the engine will predict future incidents correctly on topologies it has never seen. Prospective validation — building topologies for incidents that occurred &lt;em&gt;after&lt;/em&gt; the paper was written and measuring prediction accuracy without ground-truth fitting — is the work that needs to happen before any predictive claim can be made.&lt;/p&gt;

&lt;p&gt;The downtime MAE of ~3,159 minutes across the 18 incidents reflects a known deficiency in the current model: the cascade engine propagates structural failure correctly but does not model recovery dynamics. Actual downtime depends on incident response procedures, team capacity, and external vendor resolution timelines that the simulation does not capture. The calibration recommendations in &lt;code&gt;docs/backtest-results.md&lt;/code&gt; include a &lt;code&gt;downtime_bias_correction&lt;/code&gt; factor of 3,159.53 minutes, which is a signal that the downtime estimation module needs a richer operational model.&lt;/p&gt;

&lt;p&gt;Severity accuracy averages 0.819. Severity is harder to match than affected-component sets because it depends on load and traffic patterns at the time of the incident, which the static topology model does not capture.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this is not
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;FaultRay is not a compliance tool.&lt;/strong&gt; Its outputs are not certified audit evidence. The DORA research dashboard is a prototype mapping of FaultRay's simulation outputs to DORA's five pillars — it is illustrative, not certifiable. Do not submit FaultRay output as audit evidence without independent legal and technical review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FaultRay does not predict future incidents.&lt;/strong&gt; The formal properties of the LTS — termination, monotonicity, causality — are properties of the simulation engine, not of your production system. The simulation shows you what &lt;em&gt;would&lt;/em&gt; happen given the assumptions encoded in your topology model. If your model is wrong, the simulation output is wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FaultRay is not a replacement for operational chaos engineering.&lt;/strong&gt; Gremlin, Steadybit, and AWS FIS test your actual system under actual load with actual failure signals propagating through actual monitoring. FaultRay tests a model of your system. The two approaches answer different questions and are complementary rather than competitive.&lt;/p&gt;




&lt;h2&gt;
  
  
  Concurrent work
&lt;/h2&gt;

&lt;p&gt;Krasnovsky (arXiv:2506.11176, to appear at ICSE-NIER 2026) presents concurrent complementary work on in-memory graph simulation for chaos engineering, using Monte Carlo fail-stop simulation over service-dependency graphs auto-discovered from Jaeger distributed traces. The core overlap is positioning — both tools simulate in-memory rather than injecting real faults. The approaches diverge technically: Krasnovsky uses Monte Carlo methods without formal proofs or multi-layer decomposition; FaultRay uses an LTS with formal termination and complexity guarantees plus the N-layer min-composition model. We treat this as concurrent independent validation that the in-memory simulation direction is worth pursuing, not as prior art that invalidates either contribution.&lt;/p&gt;




&lt;h2&gt;
  
  
  Status and roadmap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;code&gt;pip install faultray&lt;/code&gt; (v11.1.0, Apache 2.0)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/mattyopon/faultray" rel="noopener noreferrer"&gt;mattyopon/faultray&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zenodo DOI&lt;/strong&gt;: &lt;a href="https://doi.org/10.5281/zenodo.19139911" rel="noopener noreferrer"&gt;10.5281/zenodo.19139911&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;USPTO provisional patent&lt;/strong&gt;: Application No. 64/010,200, filed 2026-03-19 (non-provisional deadline 2027-03-19)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ISSRE 2026 Fast Abstract&lt;/strong&gt;: submission planned for the 37th IEEE International Symposium on Software Reliability Engineering (Fast Abstracts track)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The paper rewrite currently in progress (v12) is stripping the AI agent failure taxonomy sections — that contribution was pre-dated by MAST (arXiv:2503.3657, NeurIPS 2025) and multiple concurrent papers — and focusing on strengthening the formal cascade engine proof and the N-layer model justification. The prospective validation experiment (building topologies for post-v11 incidents and measuring unseen-topology precision/recall) is the next concrete empirical step.&lt;/p&gt;

&lt;p&gt;If you are working on infrastructure resilience simulation, formal methods for distributed systems, or chaos engineering tooling, the repository is open and pull requests are welcome. Issues with real incident topologies that the cascade engine handles incorrectly are especially useful.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;FaultRay is a research prototype. It is NOT validated for DORA, FISC, or any regulatory audit. Do not rely on FaultRay outputs for compliance decisions without independent legal and technical review. Apache License 2.0.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>chaosengineering</category>
      <category>reliability</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How We Simulate 2,000+ Infrastructure Failures Without Touching Production</title>
      <dc:creator>yutaro</dc:creator>
      <pubDate>Mon, 06 Apr 2026 12:50:24 +0000</pubDate>
      <link>https://dev.to/yutaro_41c2deef88001afd50/how-we-simulate-2000-infrastructure-failures-without-touching-production-2kap</link>
      <guid>https://dev.to/yutaro_41c2deef88001afd50/how-we-simulate-2000-infrastructure-failures-without-touching-production-2kap</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2g2k0vl6ahm8mquhcyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2g2k0vl6ahm8mquhcyc.png" alt="FaultRay Dashboard" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is 2am. Your pager fires. A &lt;code&gt;terraform apply&lt;/code&gt; that "just changed a timeout" has taken down the payment service, the order queue, and half the API layer. The plan output looked clean. The PR had two approvals. And yet here you are, staring at a cascade failure that nobody predicted.&lt;/p&gt;

&lt;p&gt;This is the scenario that led me to build FaultRay.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with breaking things to test things
&lt;/h2&gt;

&lt;p&gt;The standard chaos engineering playbook, pioneered by Netflix's Chaos Monkey in 2011 and continued by tools like Gremlin, Steadybit, and AWS FIS, follows a simple premise: inject real faults into real systems, observe what breaks, fix it.&lt;/p&gt;

&lt;p&gt;This works, but it has structural limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It requires a production-like environment.&lt;/strong&gt; Staging is always out of sync. The failure you test in staging may not match what happens in prod.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It tests scenarios you think of.&lt;/strong&gt; You write the experiments. You choose what to break. The failures you did not imagine are the ones that page you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It cannot answer the ceiling question.&lt;/strong&gt; No amount of fault injection will tell you that your architecture physically cannot reach 99.99% uptime, because your external SLA chain caps you at 99.9%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regulated industries cannot use it.&lt;/strong&gt; Banks, healthcare systems, and government agencies are not going to randomly kill production processes to see what happens.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A different approach: simulate, don't break
&lt;/h2&gt;

&lt;p&gt;FaultRay takes a fundamentally different path. Instead of injecting faults into running systems, it builds a dependency graph of your infrastructure and simulates over 2,000 failure scenarios entirely in memory. Nothing is deployed. Nothing is touched. You get a resilience score, a list of single points of failure, and a map of every cascade path — in seconds.&lt;/p&gt;

&lt;p&gt;The most common integration point is the Terraform pipeline. After &lt;code&gt;terraform plan&lt;/code&gt;, you export the plan as JSON and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform plan &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;plan.out
terraform show &lt;span class="nt"&gt;-json&lt;/span&gt; plan.out &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; plan.json
faultray tf-check plan.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;╭──────────── FaultRay Terraform Guard ────────────╮
│                                                   │
│  Score Before: 72/100                             │
│  Score After:  45/100  (-27 points)               │
│                                                   │
│  NEW RISKS:                                       │
│  - Database is now a single point of failure      │
│  - Cache has no replication (data loss risk)      │
│                                                   │
│  Recommendation: HIGH RISK - Review Required      │
│                                                   │
╰───────────────────────────────────────────────────╯
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;FaultRay models what your infrastructure looks like &lt;em&gt;before&lt;/em&gt; and &lt;em&gt;after&lt;/em&gt; the planned change, runs the full simulation against both states, and shows you the delta. Not "this is risky" but "this specific change drops your score by 27 points and introduces a new SPOF."&lt;/p&gt;

&lt;h3&gt;
  
  
  CI/CD integration in 2 lines
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/terraform.yml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check Terraform Plan&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;pip install faultray&lt;/span&gt;
    &lt;span class="s"&gt;faultray tf-check plan.json --fail-on-regression --min-score 60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--fail-on-regression&lt;/code&gt; fails the job if the resilience score drops at all. &lt;code&gt;--min-score 60&lt;/code&gt; fails if the resulting score is below your threshold. The job blocks the merge. The 2am page never happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  The math behind the score
&lt;/h2&gt;

&lt;p&gt;This is the part that might interest you if you have read this far. FaultRay is not a heuristic engine. It is built on formal methods with proven properties.&lt;/p&gt;

&lt;h3&gt;
  
  
  5-Layer Availability Limit Model
&lt;/h3&gt;

&lt;p&gt;Most teams set SLO targets (99.99%, four nines) without knowing whether their architecture can physically reach them. FaultRay computes five independent availability ceilings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1: Software Limit     → Deployment downtime, human error, config drift
Layer 2: Hardware Limit     → Component MTBF, MTTR, redundancy, failover time
Layer 3: Theoretical Limit  → Irreducible physical noise (packet loss, GC, jitter)
Layer 4: Operational Limit  → Incident response time, team size, on-call coverage
Layer 5: External SLA Chain → Product of all third-party dependency SLAs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your system's availability ceiling is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A_system = min(L1, L2, L3, L4, L5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Layer 5 says your external SLA chain caps you at 99.9% (three nines), it does not matter that your hardware can do five nines. The bottleneck is the weakest layer. FaultRay surfaces this before you spend months over-engineering the wrong layer.&lt;/p&gt;
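&lt;p&gt;With invented layer values, the composition and the bottleneck report are a few lines:&lt;/p&gt;

```python
import math

def nines(a):
    """Convert an availability figure to a 'number of nines'."""
    return -math.log10(1 - a)

# Invented layer ceilings, purely for illustration.
layers = {
    "L1_software":    0.9995,
    "L2_hardware":    0.99999,
    "L3_theoretical": 0.99995,
    "L4_operational": 0.9998,
    # L5: product of external dependency SLAs (cloud plus two APIs).
    "L5_external":    0.9995 * 0.999 * 0.9999,
}

A_system = min(layers.values())
bottleneck = min(layers, key=layers.get)
print(bottleneck, round(A_system, 5), round(nines(A_system), 2))
```

&lt;p&gt;Here the external SLA chain is the binding constraint, so redundancy work on L2 would buy nothing until L5 improves.&lt;/p&gt;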

&lt;h3&gt;
  
  
  LTS-based cascade engine
&lt;/h3&gt;

&lt;p&gt;The cascade simulator implements a Labeled Transition System (LTS) formalized as a 4-tuple &lt;code&gt;S = (H, L, T, V)&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;H&lt;/code&gt;: health map (component to status)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;L&lt;/code&gt;: accumulated latency map&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T&lt;/code&gt;: elapsed time&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V&lt;/code&gt;: visited set (monotonically growing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system has four proven properties:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Monotonicity&lt;/strong&gt; — health can only worsen during a simulation run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Causality&lt;/strong&gt; — a component fails only if a dependency has failed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit breaker correctness&lt;/strong&gt; — a tripped circuit breaker stops cascade at that edge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Termination&lt;/strong&gt; — the engine terminates in O(|V| + |E|) for acyclic graphs; a depth limit of 20 guarantees termination for cyclic graphs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These properties mean the simulation is deterministic and complete. It will find every reachable failure state, and it will always halt. The full formal specification is in the &lt;a href="https://doi.org/10.5281/zenodo.19139911" rel="noopener noreferrer"&gt;paper&lt;/a&gt;.&lt;/p&gt;
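&lt;p&gt;Completeness in this sense can be pictured as exhaustive enumeration of fault scenarios over the component set. This sketch is not the actual scenario generator, just the shape of the idea:&lt;/p&gt;

```python
from itertools import combinations

def all_fault_scenarios(components, max_faults=2):
    """Enumerate every subset of components with up to max_faults failures."""
    scenarios = []
    for k in range(1, max_faults + 1):
        scenarios.extend(combinations(components, k))
    return scenarios

comps = ["lb", "api", "db", "cache", "queue"]
scenarios = all_fault_scenarios(comps, max_faults=2)
print(len(scenarios))  # 5 single faults + 10 pairs = 15
```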

&lt;h3&gt;
  
  
  AI agent hallucination model
&lt;/h3&gt;

&lt;p&gt;FaultRay v11 introduced failure modeling for AI agent systems. The core model computes hallucination probability as a function of three variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;H(a, D, I)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;a&lt;/code&gt; is the agent, &lt;code&gt;D&lt;/code&gt; is the set of data sources, and &lt;code&gt;I&lt;/code&gt; is the infrastructure state. When a data source goes DOWN, the agent's hallucination probability increases proportionally to its dependency weight on that source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If source d is HEALTHY:    h_d = h0
If source d is DOWN:       h_d = h0 + (1 - h0) * w(d)
If source d is DEGRADED:   h_d = h0 + (1 - h0) * w(d) * delta
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This captures a failure mode that traditional chaos tools cannot model: your LLM endpoint stays up, your agent keeps responding, but its answers become unreliable because the grounding data it depends on is gone. The agent does not throw an error. It hallucinates. FaultRay quantifies the probability and traces the cascade through multi-agent chains.&lt;/p&gt;
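&lt;p&gt;The update rule transcribes directly into code. The &lt;code&gt;max&lt;/code&gt; aggregation across multiple sources below is an assumption for illustration; the paper's exact combination rule may differ:&lt;/p&gt;

```python
def hallucination_prob(h0, sources, weights, delta=0.5):
    """Per-source hallucination update, combined across sources.

    sources maps a source id to one of HEALTHY, DOWN, DEGRADED.
    weights maps a source id to the agent's dependency weight w(d).
    The max aggregation is an assumed combination rule, not the paper's.
    """
    h = h0
    for d, status in sources.items():
        w = weights[d]
        if status == "DOWN":
            h_d = h0 + (1 - h0) * w
        elif status == "DEGRADED":
            h_d = h0 + (1 - h0) * w * delta
        else:  # HEALTHY: baseline rate only
            h_d = h0
        h = max(h, h_d)
    return h

# Baseline 2% hallucination rate, one knowledge base at weight 0.8, DOWN.
print(hallucination_prob(0.02, {"kb": "DOWN"}, {"kb": 0.8}))
```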

&lt;h2&gt;
  
  
  Validation: 18 real-world incidents
&lt;/h2&gt;

&lt;p&gt;I backtested FaultRay against 18 documented public cloud incidents (AWS, GCP, Azure outages with known root causes and blast radii). The engine was given the pre-incident topology, told which component failed, and asked to predict which downstream services would be affected.&lt;/p&gt;

&lt;p&gt;Results: &lt;strong&gt;F1 = 1.000&lt;/strong&gt; across all 18 incidents.&lt;/p&gt;

&lt;p&gt;I should be honest about what this means and what it does not. The topologies were constructed post-hoc from incident reports. I knew the architecture because the post-mortems described it. This validates that the cascade engine correctly propagates failures through a known graph. It does not validate topology discovery from real Terraform state, which is a harder and less controlled problem. The backtest methodology and all 18 incidents are documented in the paper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;faultray
faultray demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The demo runs a simulation against a sample infrastructure (load balancer, app servers, database, cache, queue) and outputs a full resilience report. Add &lt;code&gt;--web&lt;/code&gt; for an interactive D3.js dependency graph in your browser.&lt;/p&gt;

&lt;p&gt;To analyze your own infrastructure, define it in YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;components&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;load_balancer&lt;/span&gt;
    &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app_server&lt;/span&gt;
    &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;database&lt;/span&gt;
    &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# FaultRay will flag this&lt;/span&gt;

&lt;span class="na"&gt;dependencies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;requires&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;requires&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
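&lt;p&gt;As a rough illustration of one check the simulation performs (illustrative code, not FaultRay internals), a component that others depend on and that has no redundant replica is a single point of failure:&lt;/p&gt;

```python
# Illustrative SPOF check mirroring the YAML above; not FaultRay's engine.
config = {
    "components": [
        {"id": "nginx", "type": "load_balancer", "replicas": 2},
        {"id": "api", "type": "app_server", "replicas": 3},
        {"id": "postgres", "type": "database", "replicas": 1},
    ],
    "dependencies": [
        {"source": "nginx", "target": "api", "type": "requires"},
        {"source": "api", "target": "postgres", "type": "requires"},
    ],
}

# Everything that appears as a "requires" target is load-bearing.
required = {d["target"] for d in config["dependencies"] if d["type"] == "requires"}

# Load-bearing plus a single replica means no redundancy: a SPOF.
spofs = [
    c["id"]
    for c in config["components"]
    if c["replicas"] == 1 and c["id"] in required
]

print(spofs)  # prints ['postgres'], the replicas: 1 component flagged above
```

&lt;p&gt;The real engine does far more than count replicas, but this is the shape of the structural analysis: it falls out of the dependency graph alone, with nothing deployed.&lt;/p&gt;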





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;faultray load infra.yaml
faultray simulate &lt;span class="nt"&gt;--html&lt;/span&gt; report.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or import directly from Terraform state with &lt;code&gt;faultray tf-import&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;This is a solo project, but I did not cut corners on quality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;32,000+ tests&lt;/strong&gt;, all passing&lt;/li&gt;
&lt;li&gt;CI runs lint, type check, unit, E2E, security, performance, and mutation testing on every push&lt;/li&gt;
&lt;li&gt;USPTO provisional patent filed (US 64/010,200)&lt;/li&gt;
&lt;li&gt;Peer-reviewed paper on Zenodo (DOI: 10.5281/zenodo.19139911)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Live demo (browser):&lt;/strong&gt; &lt;a href="https://faultray.com/demo" rel="noopener noreferrer"&gt;faultray.com/demo&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;faultray
faultray demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/mattyopon/faultray" rel="noopener noreferrer"&gt;github.com/mattyopon/faultray&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live Demo:&lt;/strong&gt; &lt;a href="https://faultray.com/demo" rel="noopener noreferrer"&gt;faultray.com/demo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paper (DOI):&lt;/strong&gt; &lt;a href="https://doi.org/10.5281/zenodo.19139911" rel="noopener noreferrer"&gt;doi.org/10.5281/zenodo.19139911&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/faultray/" rel="noopener noreferrer"&gt;pypi.org/project/faultray&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FaultRay is licensed under BSL 1.1, converting to Apache 2.0 in 2030. Contributions and feedback are welcome.&lt;/p&gt;

</description>
      <category>chaosengineering</category>
      <category>devops</category>
      <category>python</category>
      <category>terraform</category>
    </item>
    <item>
      <title>How We Simulate 2,000+ Infrastructure Failures Without Touching Production</title>
      <dc:creator>yutaro</dc:creator>
      <pubDate>Mon, 23 Mar 2026 10:56:23 +0000</pubDate>
      <link>https://dev.to/yutaro_41c2deef88001afd50/how-we-simulate-2000-infrastructure-failures-without-touching-production-21c3</link>
      <guid>https://dev.to/yutaro_41c2deef88001afd50/how-we-simulate-2000-infrastructure-failures-without-touching-production-21c3</guid>
      <description>&lt;p&gt;description: FaultRay scores your infrastructure&lt;br&gt;
     resilience before terraform apply — catching cascade&lt;br&gt;
     risks, SPOFs, and availability ceiling violations in&lt;br&gt;
     seconds.&lt;br&gt;
     tags: chaosengineering, devops, python, terraform&lt;br&gt;
     cover_image:&lt;br&gt;
     ---&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; It is 2am. Your pager fires. A `terraform apply` that
 "just changed a timeout" has taken down the payment
 service, the order queue, and half the API layer. The
 plan output looked clean. The PR had two approvals. And
  yet here you are, staring at a cascade failure that
 nobody predicted.

 This is the scenario that led us to build FaultRay.

 ## The problem with breaking things to test things

 The standard chaos engineering playbook, pioneered by
 Netflix's Chaos Monkey in 2011 and continued by tools
 like Gremlin, Steadybit, and AWS FIS, follows a simple
 premise: inject real faults into real systems, observe
 what breaks, fix it.

 This works, but it has structural limitations:

 - **It requires a production-like environment.**
 Staging is always out of sync. The failure you test in
 staging may not match what happens in prod.
 - **It tests scenarios you think of.** You write the
 experiments. You choose what to break. The failures you
  did not imagine are the ones that page you.
 - **It cannot answer the ceiling question.** No amount
 of fault injection will tell you that your architecture
  physically cannot reach 99.99% uptime, because your
 external SLA chain caps you at 99.9%.
 - **Regulated industries cannot use it.** Banks,
 healthcare systems, and government agencies are not
 going to randomly kill production processes to see what
  happens.

 ## A different approach: simulate, don't break

 FaultRay takes a fundamentally different path. Instead
 of injecting faults into running systems, it builds a
 dependency graph of your infrastructure and simulates
 over 2,000 failure scenarios entirely in memory.
 Nothing is deployed. Nothing is touched. You get a
 resilience score, a list of single points of failure,
 and a map of every cascade path — in seconds.

 The most common integration point is the Terraform
 pipeline. After `terraform plan`, you export the plan
 as JSON and run:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ```bash
 terraform plan -out=plan.out
 terraform show -json plan.out &amp;gt; plan.json
 faultray tf-check plan.json
 ```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ```
 ╭──────────── FaultRay Terraform Guard ────────────╮
 │                                                   │
 │  Score Before: 72/100                             │
 │  Score After:  45/100  (-27 points)               │
 │                                                   │
 │  NEW RISKS:                                       │
 │  - Database is now a single point of failure      │
 │  - Cache has no replication (data loss risk)      │
 │                                                   │
 │  Recommendation: HIGH RISK - Review Required      │
 │                                                   │
 ╰───────────────────────────────────────────────────╯
 ```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; FaultRay models what your infrastructure looks like
 *before* and *after* the planned change, runs the full
 simulation against both states, and shows you the
 delta. Not "this is risky" but "this specific change
 drops your score by 27 points and introduces a new
 SPOF."

 ### CI/CD integration in 2 lines
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ```yaml
 # .github/workflows/terraform.yml
 - name: Check Terraform Plan
   run: |
     pip install faultray
     faultray tf-check plan.json --fail-on-regression
 --min-score 60
 ```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; `--fail-on-regression` fails the job if the resilience
 score drops at all. `--min-score 60` fails if the
 resulting score is below your threshold. The job blocks
  the merge. The 2am page never happens.

 ## The math behind the score

 This is the part that might interest you if you have
 read this far. FaultRay is not a heuristic engine. It
 is built on formal methods with proven properties.

 ### 5-Layer Availability Limit Model

 Most teams set SLO targets (99.99%, four nines) without
  knowing whether their architecture can physically
 reach them. FaultRay computes five independent
 availability ceilings:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ```
 Layer 1: Software Limit     → Deployment downtime,
 human error, config drift
 Layer 2: Hardware Limit     → Component MTBF, MTTR,
 redundancy, failover time
 Layer 3: Theoretical Limit  → Irreducible physical
 noise (packet loss, GC, jitter)
 Layer 4: Operational Limit  → Incident response time,
 team size, on-call coverage
 Layer 5: External SLA Chain → Product of all
 third-party dependency SLAs
 ```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Your system's availability ceiling is:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ```
 A_system = min(L1, L2, L3, L4, L5)
 ```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
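&lt;p&gt;To make the bottleneck concrete, here is a small worked calculation (the layer values are made-up numbers, not FaultRay output). Layer 5 is itself the product of the external SLA chain, since serial dependencies multiply:&lt;/p&gt;

```python
# Illustrative ceiling calculation; all layer values are hypothetical.
import math

def sla_chain(*slas: float) -> float:
    """Layer 5: availabilities of serial external dependencies multiply."""
    return math.prod(slas)

layers = {
    "L1_software":    0.9999,
    "L2_hardware":    0.99999,
    "L3_theoretical": 0.99995,
    "L4_operational": 0.9995,
    "L5_sla_chain":   sla_chain(0.9995, 0.9999, 0.9995),  # three vendors
}

# The ceiling is the weakest layer, nothing more.
a_system = min(layers.values())
bottleneck = min(layers, key=layers.get)
downtime_min = (1 - a_system) * 365 * 24 * 60

print(f"ceiling = {a_system:.5f} ({bottleneck})")
print(f"worst-case downtime ≈ {downtime_min:.0f} min/year")
```

&lt;p&gt;With these numbers the three vendor SLAs alone pull the ceiling below 99.9%, regardless of how good the hardware layer is.&lt;/p&gt;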



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; If Layer 5 says your external SLA chain caps you at
 99.9% (three nines), it does not matter that your
 hardware can do five nines. The bottleneck is the
 weakest layer. FaultRay surfaces this before you spend
 months over-engineering the wrong layer.

 ### LTS-based cascade engine

 The cascade simulator implements a Labeled Transition
 System (LTS) formalized as a 4-tuple `S = (H, L, T,
 V)`:

 - `H`: health map (component to status)
 - `L`: accumulated latency map
 - `T`: elapsed time
 - `V`: visited set (monotonically growing)

 The system has four proven properties:

 1. **Monotonicity** — health can only worsen during a
 simulation run
 2. **Causality** — a component fails only if a
 dependency has failed
 3. **Circuit breaker correctness** — a tripped circuit
 breaker stops cascade at that edge
 4. **Termination** — the engine terminates in O(|V| +
 |E|) for acyclic graphs; a depth limit of 20 guarantees
  termination for cyclic graphs

 These properties mean the simulation is deterministic
 and complete. It will find every reachable failure
 state, and it will always halt. The full formal
 specification is in the
 [paper](https://doi.org/10.5281/zenodo.19139911).

 ### AI agent hallucination model

 FaultRay v11 introduced failure modeling for AI agent
 systems. The core model computes hallucination
 probability as a function of three variables:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ```
 H(a, D, I)
 ```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Where `a` is the agent, `D` is the set of data sources,
  and `I` is the infrastructure state. When a data
 source goes DOWN, the agent's hallucination probability
  increases proportionally to its dependency weight on
 that source:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ```
 If source d is HEALTHY:    h_d = h0
 If source d is DOWN:       h_d = h0 + (1 - h0) * w(d)
 If source d is DEGRADED:   h_d = h0 + (1 - h0) * w(d) *
  delta
 ```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
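&lt;p&gt;Plugging illustrative numbers into that rule shows the shape of the effect (the baseline &lt;code&gt;h0&lt;/code&gt;, the weight, and &lt;code&gt;delta&lt;/code&gt; here are made-up values, not FaultRay defaults):&lt;/p&gt;

```python
# Illustrative evaluation of the update rule above; parameters are hypothetical.
def hallucination_prob(h0: float, w: float, status: str, delta: float = 0.5) -> float:
    """h_d for a single data source d with dependency weight w."""
    if status == "HEALTHY":
        return h0
    if status == "DOWN":
        return h0 + (1 - h0) * w
    if status == "DEGRADED":
        return h0 + (1 - h0) * w * delta
    raise ValueError(f"unknown status: {status}")

h0 = 0.02          # baseline hallucination probability
w_vector_db = 0.8  # the agent leans heavily on this grounding source

print(hallucination_prob(h0, w_vector_db, "HEALTHY"))   # 0.02
print(hallucination_prob(h0, w_vector_db, "DOWN"))      # ≈ 0.804
print(hallucination_prob(h0, w_vector_db, "DEGRADED"))  # ≈ 0.412
```

&lt;p&gt;A 2% baseline jumps past 80% when a heavily weighted grounding source disappears, even though the agent itself never reports an error.&lt;/p&gt;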



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; This captures a failure mode that traditional chaos
 tools cannot model: your LLM endpoint stays up, your
 agent keeps responding, but its answers become
 unreliable because the grounding data it depends on is
 gone. The agent does not throw an error. It
 hallucinates. FaultRay quantifies the probability and
 traces the cascade through multi-agent chains.

 ## Validation: 18 real-world incidents

 We backtested FaultRay against 18 documented public
 cloud incidents (AWS, GCP, Azure outages with known
 root causes and blast radii). The engine was given the
 pre-incident topology, told which component failed, and
  asked to predict which downstream services would be
 affected.

 Results: **F1 = 1.000** across all 18 incidents.

 We should be honest about what this means and what it
 does not. The topologies were constructed post-hoc from
  incident reports. We knew the architecture because the
  post-mortems described it. This validates that the
 cascade engine correctly propagates failures through a
 known graph. It does not validate topology discovery
 from real Terraform state, which is a harder and less
 controlled problem. The backtest methodology and all 18
  incidents are documented in the paper.

 ## Try it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ```bash
 pip install faultray
 faultray demo
 ```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; The demo runs a simulation against a sample
 infrastructure (load balancer, app servers, database,
 cache, queue) and outputs a full resilience report. Add
  `--web` for an interactive D3.js dependency graph in
 your browser.

 To analyze your own infrastructure, define it in YAML:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ```yaml
 components:
   - id: nginx
     type: load_balancer
     replicas: 2
   - id: api
     type: app_server
     replicas: 3
   - id: postgres
     type: database
     replicas: 1  # FaultRay will flag this

 dependencies:
   - source: nginx
     target: api
     type: requires
   - source: api
     target: postgres
     type: requires
 ```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
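&lt;p&gt;To see how a cascade walks this graph, here is a minimal sketch of the propagation idea (illustrative code, not FaultRay's engine): fail &lt;code&gt;postgres&lt;/code&gt; and everything that transitively requires it goes down, health only ever worsens, and each node is visited at most once, so the walk terminates in O(|V| + |E|):&lt;/p&gt;

```python
# Illustrative cascade walk over the YAML above; not FaultRay internals.
from collections import deque

# "source requires target": if the target is DOWN, the source fails too.
requires = {"nginx": ["api"], "api": ["postgres"]}

# Invert to "target -> components that depend on it".
dependents: dict[str, list[str]] = {}
for src, targets in requires.items():
    for tgt in targets:
        dependents.setdefault(tgt, []).append(src)

def cascade(seed: str) -> dict[str, str]:
    """Propagate one failure; health is monotone and each node visits once."""
    health = {c: "UP" for c in ["nginx", "api", "postgres"]}
    health[seed] = "DOWN"
    visited, queue = {seed}, deque([seed])
    while queue:  # bounded by |V| + |E| on this acyclic graph
        failed = queue.popleft()
        for dep in dependents.get(failed, []):
            if dep not in visited:
                health[dep] = "DOWN"  # causality: fails because a dependency failed
                visited.add(dep)
                queue.append(dep)
    return health

print(cascade("postgres"))  # all three components end up DOWN
print(cascade("nginx"))     # only nginx is DOWN; nothing depends on it
```

&lt;p&gt;The real engine additionally tracks latency accumulation, circuit breakers, and a depth limit for cyclic graphs, but the propagation skeleton is the same.&lt;/p&gt;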





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; ```bash
 faultray load infra.yaml
 faultray simulate --html report.html
 ```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Or import directly from Terraform state with `faultray
 tf-import`.

 ## Links

 - **GitHub:** [github.com/mattyopon/faultray](https://g
 ithub.com/mattyopon/faultray)
 - **Paper (DOI):** [doi.org/10.5281/zenodo.19139911](ht
 tps://doi.org/10.5281/zenodo.19139911)
 - **PyPI:** [pypi.org/project/faultray](https://pypi.or
 g/project/faultray/)

 FaultRay is licensed under BSL 1.1, converting to
 Apache 2.0 in 2030. Contributions and feedback are
 welcome.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>terraform</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
