<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: A3E Ecosystem</title>
    <description>The latest articles on DEV Community by A3E Ecosystem (@a3e_ecosystem).</description>
    <link>https://dev.to/a3e_ecosystem</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3880557%2Fd374a82c-9329-4a3b-a15b-45fdff49e27e.png</url>
      <title>DEV Community: A3E Ecosystem</title>
      <link>https://dev.to/a3e_ecosystem</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/a3e_ecosystem"/>
    <language>en</language>
    <item>
      <title>Fail-Open LLM Architecture: Why Your Reviewer Stage Should Never Block a Decision</title>
      <dc:creator>A3E Ecosystem</dc:creator>
      <pubDate>Wed, 15 Apr 2026 16:56:03 +0000</pubDate>
      <link>https://dev.to/a3e_ecosystem/fail-open-llm-architecture-why-your-reviewer-stage-should-never-block-a-decision-1bd9</link>
      <guid>https://dev.to/a3e_ecosystem/fail-open-llm-architecture-why-your-reviewer-stage-should-never-block-a-decision-1bd9</guid>
      <description>&lt;h2&gt;
  
  
  The outage that made me rewrite my pipeline
&lt;/h2&gt;

&lt;p&gt;On December 11, 2024, OpenAI's API went fully down for roughly four hours. The culprit, per their incident report: a newly deployed telemetry service whose configuration caused every node across hundreds of Kubernetes clusters to execute resource-intensive API operations simultaneously. The control plane collapsed. Every OpenAI-dependent product collapsed with it.&lt;/p&gt;

&lt;p&gt;Nine months later, &lt;a href="https://www.implicator.ai/anthropics-postmortem-three-bugs-pushed-claude-degradation-to-16-at-peak/" rel="noopener noreferrer"&gt;Anthropic disclosed three separate infrastructure bugs&lt;/a&gt; that degraded Claude responses for weeks — at peak, 16% of Sonnet 4 requests were affected. A routing error sent short-context requests to 1M-token servers. TPU corruption caused Thai characters to appear in English responses. A compiler bug returned wrong tokens. Detection took weeks because symptoms varied across platforms and Claude often recovered from isolated mistakes.&lt;/p&gt;

&lt;p&gt;If your production pipeline chains two or three LLM calls in series — primary decision, reviewer, formatter — it doesn't take a full outage to break you. A 16% quality degradation on &lt;em&gt;one&lt;/em&gt; stage, unhandled, is enough to push your user-facing output below an acceptable quality bar.&lt;/p&gt;

&lt;p&gt;And yet most production LLM code I still read does this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_primary_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;# fine
&lt;/span&gt;    &lt;span class="n"&gt;review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_reviewer_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# oh no
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;approved&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;                              &lt;span class="c1"&gt;# silence
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the reviewer errors, the whole pipeline returns &lt;code&gt;None&lt;/code&gt;. You lose &lt;em&gt;both&lt;/em&gt; models for the price of the weaker stage's reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern: fail-open with a circuit breaker
&lt;/h2&gt;

&lt;p&gt;Circuit breakers predate LLMs by two decades. Michael Nygard introduced the pattern in &lt;em&gt;Release It!&lt;/em&gt; (Pragmatic Bookshelf, 2007). &lt;a href="https://martinfowler.com/bliki/CircuitBreaker.html" rel="noopener noreferrer"&gt;Martin Fowler canonised it in 2014&lt;/a&gt;. Netflix productionised it in &lt;a href="https://github.com/Netflix/Hystrix/wiki/How-it-Works" rel="noopener noreferrer"&gt;Hystrix&lt;/a&gt;, which protected every inter-service call in their microservice fleet before entering maintenance mode in 2018. Research summarised by &lt;a href="https://www.groundcover.com/learn/performance/circuit-breaker-pattern" rel="noopener noreferrer"&gt;groundcover&lt;/a&gt; puts the reduction in cascading failures at 83.5% for well-instrumented breakers in production distributed systems.&lt;/p&gt;

&lt;p&gt;The state machine is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Closed&lt;/strong&gt; (normal): requests pass through to the downstream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open&lt;/strong&gt;: the breaker has seen enough failures to stop trying. Requests fail immediately with a fallback value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Half-open&lt;/strong&gt;: after a cooldown, one probe request goes through. If it succeeds, the breaker closes and normal traffic resumes. If it fails, the breaker stays open.&lt;/li&gt;
&lt;/ul&gt;
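&lt;p&gt;The state machine above fits in a few dozen lines of plain Python. This is an illustrative sketch, not any particular library's internals; all names here are ours:&lt;/p&gt;

```python
import time

class CircuitBreaker:
    """Minimal closed/open/half-open breaker. Illustrative sketch only."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            return "half_open"  # cooldown elapsed: allow one probe
        return "open"

    def call(self, fn, *args, fallback=None):
        if self.state == "open":
            return fallback  # fail fast, no downstream call at all
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.state == "half_open":
                self.opened_at = time.monotonic()  # open, or re-open after a failed probe
            return fallback
        # success: close the breaker and reset the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

&lt;p&gt;Note that the fallback is a value, not an exception: for a reviewer stage, you would pass the primary decision as &lt;code&gt;fallback&lt;/code&gt;.&lt;/p&gt;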

&lt;p&gt;Applied to an LLM reviewer stage, "fail immediately with a fallback value" means &lt;strong&gt;pass the primary decision through unmodified&lt;/strong&gt;. Not retry-until-exhausted. Not queue-for-human-review. Not &lt;code&gt;None&lt;/code&gt;. Through.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;circuitbreaker&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;circuit&lt;/span&gt;  &lt;span class="c1"&gt;# or pybreaker, or hatch your own
&lt;/span&gt;
&lt;span class="nd"&gt;@circuit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failure_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recovery_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_review_with_breaker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_reviewer_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_primary_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_review_with_breaker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;downgrade_to_safe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reasons&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adjust&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;apply_adjustments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;review&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# "approve" falls through unchanged
&lt;/span&gt;    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviewer unavailable, pass-through: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;except&lt;/code&gt; block is the spec, not the bug. When the breaker is open or the reviewer raises for any reason, &lt;code&gt;primary&lt;/code&gt; is returned as-is with a warning logged. The pipeline ships what it can.&lt;/p&gt;

&lt;h2&gt;
  
  
  But isn't this just try/except?
&lt;/h2&gt;

&lt;p&gt;It's structured &lt;code&gt;try/except&lt;/code&gt; with two properties a naive version lacks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fast failure under sustained outage.&lt;/strong&gt; Without a breaker, every call during an outage still waits for its full timeout — 30, 60, sometimes 120 seconds — before giving up. Multiply by your QPS and you have effectively DoS'd yourself. A breaker fails in microseconds once it is open. During OpenAI's four-hour incident, the difference between 30-second timeouts and microsecond fast-fails was the difference between a pipeline that queued a backlog it would spend a day draining and one that kept shipping on the primary alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic recovery.&lt;/strong&gt; The half-open probe means you do not need a human to notice the provider recovered. Production systems that require manual re-enabling of a degraded component accumulate incident tickets faster than they close them.&lt;/li&gt;
&lt;/ol&gt;
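&lt;p&gt;To make point 1 concrete with hypothetical numbers: at 50 requests per second and a 30-second per-call timeout, Little's law puts the steady-state pile-up at 1,500 blocked calls, and a four-hour outage burns over 21 million call-seconds of pure waiting:&lt;/p&gt;

```python
qps = 50                # hypothetical request rate
timeout_s = 30          # per-call timeout without a breaker
outage_s = 4 * 3600     # a four-hour provider outage

# Little's law: concurrent in-flight requests = arrival rate * wait time
in_flight = qps * timeout_s                 # calls blocked at any moment
wasted_call_seconds = qps * outage_s * timeout_s

print(in_flight)             # 1500
print(wasted_call_seconds)   # 21600000
```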

&lt;h2&gt;
  
  
  What LangChain gives you for free
&lt;/h2&gt;

&lt;p&gt;If you are using LangChain or LangGraph, a meaningful slice of this is already built. LangChain's &lt;a href="https://python.langchain.com/v0.1/docs/guides/productionization/fallbacks/" rel="noopener noreferrer"&gt;&lt;code&gt;.with_fallbacks()&lt;/code&gt;&lt;/a&gt; lets you chain models with automatic failover:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatAnthropic&lt;/span&gt;

&lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;with_fallbacks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;ChatAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This handles provider-level failover but does not solve the pipeline-level question: when the &lt;em&gt;reviewer stage itself&lt;/em&gt; fails, what does the pipeline return? For that you still have to wrap the stage and decide explicitly what pass-through means for your domain.&lt;/p&gt;

&lt;p&gt;LangGraph's &lt;a href="https://machinelearningplus.com/gen-ai/langgraph-error-handling-retries-fallback-strategies/" rel="noopener noreferrer"&gt;state-driven error handling&lt;/a&gt; is the more interesting primitive. You can route failed nodes to dedicated error-handling nodes, categorise errors in the graph state, and make downstream routing depend on whether critical vs. optional stages succeeded. Community production targets: tool error rate under 3%, P95 latency under 5 seconds.&lt;/p&gt;
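&lt;p&gt;The LangGraph API itself is out of scope here, but the routing idea reduces to a few lines of plain Python: nodes record an error category in shared state, and a router function picks the next node from it. A sketch with hypothetical node and key names, not the LangGraph API:&lt;/p&gt;

```python
def reviewer_node(state):
    """Run the reviewer; on failure, record an error category in state
    instead of raising, so the router can decide what happens next."""
    try:
        state["review"] = state["call_reviewer"](state["primary"])
        state["error"] = None
    except Exception as e:
        state["error"] = "critical" if state.get("stage_critical") else "optional"
        state["error_detail"] = str(e)
    return state

def route(state):
    """Downstream routing depends on whether a critical or an
    optional stage failed."""
    if state["error"] is None:
        return "apply_review"
    if state["error"] == "critical":
        return "escalate"      # critical stage failed: stop and escalate
    return "ship_primary"      # optional stage failed: pass primary through
```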

&lt;h2&gt;
  
  
  When fail-open is wrong
&lt;/h2&gt;

&lt;p&gt;Circuit-breaking the wrong stage is worse than no breaker at all. Two cases where you want fail-closed and must NOT pass through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Money movement.&lt;/strong&gt; A reviewer that detects "sending $50K to an unknown wallet" should block, not warn. But this logic belongs in a deterministic rules engine, not an LLM. If an LLM is on the critical path of a financial transaction, your architecture has a problem that no opinion about error handling can fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regulated output.&lt;/strong&gt; GDPR consent flows, medical advice generation, tax-filing assistance. These require human review on errors, not silent LLM bypass. The correct behaviour is &lt;em&gt;queue and escalate&lt;/em&gt;, not pass-through.&lt;/li&gt;
&lt;/ul&gt;
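&lt;p&gt;For these regulated paths the wrapper inverts: a reviewer failure holds the decision and escalates instead of shipping it. A hypothetical sketch (&lt;code&gt;queue_for_human&lt;/code&gt; and the verdict shape are ours, for illustration):&lt;/p&gt;

```python
def decide_fail_closed(primary, review_fn, queue_for_human):
    """Fail-closed variant: a reviewer error blocks the decision
    and escalates, instead of passing the primary through."""
    try:
        review = review_fn(primary)
    except Exception as e:
        # Reviewer unavailable on a regulated path:
        # queue and escalate, never silently bypass.
        queue_for_human(primary, reason=f"reviewer_unavailable: {e}")
        return None  # explicitly no automated decision
    if review["verdict"] == "approve":
        return primary
    queue_for_human(primary, reason=review.get("reasons", "rejected"))
    return None
```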

&lt;p&gt;For everything else — trading signals, content scoring, customer service drafts, product recommendations — the expected cost of shipping a slightly-lower-quality primary output is vastly lower than the expected cost of shipping nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The metrics that actually matter
&lt;/h2&gt;

&lt;p&gt;If you ship fail-open, instrument these four numbers. They are the difference between "we have a circuit breaker" and "we know it is working":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;reviewer_success_rate&lt;/code&gt; — calls where the reviewer produced a valid response.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reviewer_adjustment_rate&lt;/code&gt; — of those, the fraction where the reviewer modified the primary.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reviewer_rejection_rate&lt;/code&gt; — the fraction where the reviewer fully overrode the primary.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reviewer_fallthrough_rate&lt;/code&gt; — the fraction where the breaker opened or the call errored and the primary was passed through.&lt;/li&gt;
&lt;/ul&gt;
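&lt;p&gt;Deriving the four rates from raw counters is a few lines (counter names here are illustrative):&lt;/p&gt;

```python
def reviewer_metrics(counts):
    """counts: dict with total, adjusted, rejected, fallthrough counters."""
    total = counts["total"]
    reviewed = total - counts["fallthrough"]   # calls with a valid reviewer response
    return {
        "reviewer_success_rate": reviewed / total,
        "reviewer_adjustment_rate": counts["adjusted"] / reviewed,
        "reviewer_rejection_rate": counts["rejected"] / reviewed,
        "reviewer_fallthrough_rate": counts["fallthrough"] / total,
    }

m = reviewer_metrics({"total": 1000, "adjusted": 120, "rejected": 70, "fallthrough": 50})
print(m["reviewer_fallthrough_rate"])  # 0.05
```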

&lt;p&gt;&lt;code&gt;fallthrough_rate&lt;/code&gt; is the silent killer. If it creeps above 5%, your reviewer stack is degrading quality without anyone noticing. The Anthropic postmortem is instructive: their degradation took &lt;em&gt;weeks&lt;/em&gt; to detect because Claude often recovered from isolated mistakes and the symptoms varied across platforms. &lt;strong&gt;Silent degradation is always the real enemy; full outages are at least obvious.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't fix a high fallthrough rate by wiring in a blocking gate. Fix it by making the reviewer faster (lower temperature, smaller model for a first-pass triage, local model for the classification step before the expensive one) or more available (multi-provider fallback at the reviewer layer). Research by the &lt;a href="https://atlarge-research.com/pdfs/2025-hotcloudperf-fails.pdf" rel="noopener noreferrer"&gt;AtLarge group at TU Delft on LLM service incidents&lt;/a&gt; shows median MTTR for the major providers ranges from 0.77 to 1.23 hours — meaning your fail-open window is measured in hours per month, not minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing principle
&lt;/h2&gt;

&lt;p&gt;Fault-tolerant systems have one rule: every stage degrades to the simplest correct behaviour when its neighbour fails. For an LLM reviewer stage, that behaviour is pass-through with a warning. For an authorisation check, it is fail-closed with escalation. For content generation, it is queue-and-retry. Choosing the degradation mode explicitly, stage by stage, is what separates production systems from demos that fall over the first time the provider has a bad day — which, as the Anthropic and OpenAI postmortems remind us, happens to everyone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Michael Nygard, &lt;em&gt;Release It! Second Edition&lt;/em&gt; (Pragmatic Bookshelf) — the canonical text on stability patterns for distributed systems.&lt;/li&gt;
&lt;li&gt;Martin Fowler, &lt;a href="https://martinfowler.com/bliki/CircuitBreaker.html" rel="noopener noreferrer"&gt;Circuit Breaker&lt;/a&gt; — the post that popularised the pattern outside Netflix.&lt;/li&gt;
&lt;li&gt;Netflix Hystrix, &lt;a href="https://github.com/Netflix/Hystrix/wiki/How-it-Works" rel="noopener noreferrer"&gt;how it works&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.implicator.ai/anthropics-postmortem-three-bugs-pushed-claude-degradation-to-16-at-peak/" rel="noopener noreferrer"&gt;Anthropic's three-bug degradation postmortem (summary)&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://python.langchain.com/v0.1/docs/guides/productionization/fallbacks/" rel="noopener noreferrer"&gt;LangChain fallbacks documentation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://machinelearningplus.com/gen-ai/langgraph-error-handling-retries-fallback-strategies/" rel="noopener noreferrer"&gt;LangGraph error handling: retries &amp;amp; fallback strategies&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://atlarge-research.com/pdfs/2025-hotcloudperf-fails.pdf" rel="noopener noreferrer"&gt;FAILS: A Framework for Automated Collection and Analysis of LLM Service Incidents&lt;/a&gt; — TU Delft AtLarge group.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.requesty.ai/blog/handling-llm-platform-outages-what-to-do-when-openai-anthropic-deepseek-or-others-go-down" rel="noopener noreferrer"&gt;Handling LLM Platform Outages&lt;/a&gt; — Requesty operational guide.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;A3E Ecosystem builds AI-native trading and content pipelines in production. Every LLM stage in our trading signal engine ships fail-open; every customer service response pipeline, too. The reviewer stage we shipped this week uses exactly the pattern above — primary decision from one model, reviewer from a different model, breaker wrapping the reviewer call, primary passes through if the breaker opens.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover image by &lt;a href="https://www.pexels.com/photo/server-racks-on-data-center-5480781/" rel="noopener noreferrer"&gt;Brett Sayles&lt;/a&gt; via Pexels.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>python</category>
    </item>
    <item>
      <title>Replacing Strategy Gates With Intelligence Amplifiers: A Reviewer Stage for LLM-Driven Trading</title>
      <dc:creator>A3E Ecosystem</dc:creator>
      <pubDate>Wed, 15 Apr 2026 16:40:25 +0000</pubDate>
      <link>https://dev.to/a3e_ecosystem/replacing-strategy-gates-with-intelligence-amplifiers-a-reviewer-stage-for-llm-driven-trading-3l0m</link>
      <guid>https://dev.to/a3e_ecosystem/replacing-strategy-gates-with-intelligence-amplifiers-a-reviewer-stage-for-llm-driven-trading-3l0m</guid>
      <description>&lt;p&gt;Most automated trading stacks look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;signal -&amp;gt; filter_gates -&amp;gt; (pass or drop) -&amp;gt; execution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each gate is a hard-coded rule. Volume above X. RSI below Y. Correlation below Z.&lt;br&gt;
The problem is the gates have no context. A rule that says "skip signals during&lt;br&gt;
low-volume hours" will drop the one asymmetric setup that only fires during&lt;br&gt;
low-volume hours.&lt;/p&gt;

&lt;p&gt;We spent the last two weeks rewriting this. The new shape looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;signal -&amp;gt; context_pack -&amp;gt; reviewer(LLM) -&amp;gt; (accept | modify | veto) -&amp;gt; execution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gates are still there as a cheap pre-filter, but they no longer make the&lt;br&gt;
final decision. A reviewer stage does. This post is about why we changed,&lt;br&gt;
what the reviewer actually does, and how we made it safe to put an LLM on&lt;br&gt;
the hot path without it becoming a single point of failure.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why gates aren't enough
&lt;/h2&gt;

&lt;p&gt;Rule-based gates are great at one thing: keeping the obviously wrong stuff&lt;br&gt;
out. If your signal's 24h volume is below some floor, you don't want to&lt;br&gt;
trade it. No argument.&lt;/p&gt;

&lt;p&gt;They fall apart on the trades that matter. The trades that matter are&lt;br&gt;
usually the ones that don't look like the training set. They're the 2am&lt;br&gt;
breakout on a coin that just got listed on a new venue, or the unusual&lt;br&gt;
order-book shape you've never seen before, or the third identical signal&lt;br&gt;
from the same strategy that just stopped working last week and you&lt;br&gt;
haven't noticed yet.&lt;/p&gt;

&lt;p&gt;A gate can't tell the difference between "this is the exception the rule&lt;br&gt;
was designed to handle" and "this is a new regime we haven't modeled yet".&lt;br&gt;
An LLM with the right context can, at least some of the time.&lt;/p&gt;
&lt;h2&gt;
  
  
  What the reviewer actually sees
&lt;/h2&gt;

&lt;p&gt;The reviewer is not a chat model. It's a scoped prompt that gets a context&lt;br&gt;
pack and returns a structured verdict. The pack is built at signal time&lt;br&gt;
and contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The signal itself (strategy id, direction, size hint, entry/SL/TP)&lt;/li&gt;
&lt;li&gt;Recent trades from the same strategy (last 20, with outcomes)&lt;/li&gt;
&lt;li&gt;A regime tag (trending, mean-reverting, chop) derived from realized vol&lt;/li&gt;
&lt;li&gt;Top-5 correlated open positions across the portfolio&lt;/li&gt;
&lt;li&gt;News flags for the symbol in the last 4 hours (if any)&lt;/li&gt;
&lt;li&gt;A one-line note from the gate stage explaining why it passed&lt;/li&gt;
&lt;/ul&gt;
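&lt;p&gt;One plausible way to make that pack concrete is a small dataclass mirroring the bullets (field names are our illustration, not a published schema):&lt;/p&gt;

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ContextPack:
    signal: dict                    # strategy id, direction, size hint, entry/SL/TP
    recent_trades: list             # last 20 trades from the same strategy, with outcomes
    regime: str                     # "trending" | "mean_reverting" | "chop"
    correlated_positions: list      # top-5 correlated open positions
    news_flags: list = field(default_factory=list)  # last 4 hours, may be empty
    gate_note: str = ""             # one-line reason the gate stage passed it

    def to_prompt_json(self):
        """Compact JSON for the reviewer prompt."""
        return json.dumps(asdict(self), separators=(",", ":"))
```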

&lt;p&gt;The output shape is strict JSON with three possible verdicts and a&lt;br&gt;
confidence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verdict"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"accept"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"modify"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"veto"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rationale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"modified_size_pct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"modified_sl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No free-form reasoning in production. The model can write&lt;br&gt;
whatever rationale it wants, but the executor only reads &lt;code&gt;verdict&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;confidence&lt;/code&gt;, &lt;code&gt;modified_size_pct&lt;/code&gt;, and &lt;code&gt;modified_sl&lt;/code&gt;. Everything else&lt;br&gt;
gets logged for review.&lt;/p&gt;
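&lt;p&gt;Enforcing "only four fields reach the executor" is a parsing rule, not a convention. A sketch of the strict read (our names); a parse failure here simply feeds the fail-open path described below:&lt;/p&gt;

```python
import json

ALLOWED_VERDICTS = {"accept", "modify", "veto"}

def parse_verdict(raw: str):
    """Strict read of the reviewer's JSON: only four fields reach the
    executor; anything else, including rationale, is logged elsewhere."""
    data = json.loads(raw)  # raises on malformed output
    verdict = data["verdict"]
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    confidence = float(data["confidence"])
    if confidence > 1.0 or 0.0 > confidence:
        raise ValueError("confidence out of range")
    return {
        "verdict": verdict,
        "confidence": confidence,
        "modified_size_pct": data.get("modified_size_pct"),
        "modified_sl": data.get("modified_sl"),
    }
```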
&lt;h2&gt;
  
  
  Why modify, not just accept/reject
&lt;/h2&gt;

&lt;p&gt;The third verdict, &lt;code&gt;modify&lt;/code&gt;, is where most of the value shows up. A&lt;br&gt;
classic gate-based system is binary. A signal either gets through at&lt;br&gt;
full size or doesn't get through at all. But most of the hard cases&lt;br&gt;
aren't binary. They're "yes, but smaller" or "yes, but with a tighter&lt;br&gt;
stop because the regime is trending against you."&lt;/p&gt;

&lt;p&gt;The reviewer can return &lt;code&gt;modify&lt;/code&gt; with a &lt;code&gt;modified_size_pct&lt;/code&gt; (bounded&lt;br&gt;
between 10% and 100% of the original) and a &lt;code&gt;modified_sl&lt;/code&gt; (bounded to&lt;br&gt;
be tighter, never looser, than the original). The executor clamps&lt;br&gt;
these on read — we never trust the LLM to size a trade without hard&lt;br&gt;
caps. The model is suggesting, not commanding.&lt;/p&gt;
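&lt;p&gt;The clamp-on-read step might look like this for a long position (the 10%–100% size bound is from the text above; the stop-loss comparison flips for shorts, and all names are illustrative):&lt;/p&gt;

```python
def clamp_modification(original_size, original_sl, entry, mod_size_pct, mod_sl):
    """Clamp reviewer suggestions to hard caps before execution.
    Long-position sketch: the stop may only move up toward entry (tighter)."""
    if mod_size_pct is not None:
        mod_size_pct = min(max(mod_size_pct, 10), 100)  # bounded 10%..100%
        size = original_size * mod_size_pct / 100
    else:
        size = original_size
    if mod_sl is not None:
        # tighter only: never below the original stop, never above entry
        sl = min(max(mod_sl, original_sl), entry)
    else:
        sl = original_sl
    return size, sl
```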

&lt;p&gt;Early data from our paper run: about 60% of signals get &lt;code&gt;accept&lt;/code&gt;, 25%&lt;br&gt;
get &lt;code&gt;modify&lt;/code&gt; (usually smaller size during mixed-regime conditions),&lt;br&gt;
and 15% get &lt;code&gt;veto&lt;/code&gt;. The veto rate is higher than we expected. Most&lt;br&gt;
vetoes come from the "three bad trades in a row on this strategy"&lt;br&gt;
pattern the reviewer sees in the context pack.&lt;/p&gt;
&lt;h2&gt;
  
  
  Fail-open routing: the LLM cannot be a SPOF
&lt;/h2&gt;

&lt;p&gt;The scariest thing about putting a model on the hot path is that&lt;br&gt;
models break. Rate limits, outages, Cloudflare incidents,&lt;br&gt;
token-pricing changes that silently degrade a cheap tier. If the&lt;br&gt;
reviewer becomes a single point of failure, one incident kills the&lt;br&gt;
whole pipeline.&lt;/p&gt;

&lt;p&gt;We solved this with a tiered router that fails open, not fails closed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;review&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tier&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt4_mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phi4_local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_tier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;8.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RateLimitError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UpstreamError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tier_failure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="c1"&gt;# All tiers down. Don't block the signal — hand it back with a
&lt;/span&gt;    &lt;span class="c1"&gt;# fallback verdict and let the old gate system make the call.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Verdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;rationale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviewer_unavailable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fallback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two important choices here.&lt;/p&gt;

&lt;p&gt;First, the fallback returns &lt;code&gt;accept&lt;/code&gt; with confidence 0, not &lt;code&gt;veto&lt;/code&gt;. If&lt;br&gt;
the review layer is down, the trading system should behave exactly&lt;br&gt;
as it did before we added the reviewer. Failing closed would turn&lt;br&gt;
every LLM outage into a trading halt, which is worse than having no&lt;br&gt;
reviewer at all.&lt;/p&gt;

&lt;p&gt;Second, the confidence 0 signal matters. The executor treats any&lt;br&gt;
verdict with confidence below 0.3 as "use the pre-existing gate&lt;br&gt;
decision only." So the reviewer's influence scales with its&lt;br&gt;
confidence, and scales to zero when it's offline.&lt;/p&gt;
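&lt;p&gt;As a sketch of that scaling rule (the &lt;code&gt;Verdict&lt;/code&gt; shape and the executor helper here are illustrative assumptions, not our exact code):&lt;/p&gt;

```python
# Sketch: how an executor might blend the reviewer's verdict with the
# pre-existing gate decision. The 0.3 floor matches the article; the
# Verdict/apply_review names are illustrative assumptions.
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.3  # below this, the reviewer has no influence


@dataclass
class Verdict:
    verdict: str       # "accept", "veto", or "modify"
    confidence: float  # 0.0 (offline fallback) up to 1.0
    rationale: str
    source: str        # tier name, or "fallback"


def apply_review(gate_approved: bool, review: Verdict) -> bool:
    # Fallback or low-confidence verdicts defer entirely to the old gate,
    # so an offline reviewer (confidence 0.0) changes nothing.
    if review.confidence < CONFIDENCE_FLOOR:
        return gate_approved
    # A confident veto can stop a gate-approved trade; the reviewer
    # never force-approves a trade the gate already rejected.
    if review.verdict == "veto":
        return False
    return gate_approved
```

&lt;p&gt;Note the asymmetry: the reviewer can only withhold approval, never grant it, which keeps its failure modes one-sided.&lt;/p&gt;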

&lt;h2&gt;
  
  
  Picking a local fallback on purpose
&lt;/h2&gt;

&lt;p&gt;The last tier in the cascade is &lt;code&gt;phi4_local&lt;/code&gt; — a Phi-4 14B quantized&lt;br&gt;
model running on a local GPU. It's slower than Sonnet and less&lt;br&gt;
sharp, but it has three properties the hosted tiers don't:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It's effectively free. Once the GPU is paid for, each signal
reviewed by the local tier has near-zero marginal cost.&lt;/li&gt;
&lt;li&gt;It can't rate-limit us. Concurrent request cap is set by our
hardware, not by a vendor's billing engine.&lt;/li&gt;
&lt;li&gt;It can't be deprecated out from under us overnight.&lt;/li&gt;
&lt;/ol&gt;
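&lt;p&gt;Concretely, the ordering looks something like this (tier names and fields are illustrative, not our production config):&lt;/p&gt;

```python
# Illustrative tier registry, ordered sharpest-first with the local
# model last. The local tier gets a looser timeout because it is
# slower but can't be rate-limited or deprecated by a vendor.
TIERS = [
    {"name": "sonnet_hosted", "timeout_s": 8.0,  "rate_limited": True},
    {"name": "mini_hosted",   "timeout_s": 8.0,  "rate_limited": True},
    {"name": "phi4_local",    "timeout_s": 20.0, "rate_limited": False},
]


def cascade_order():
    # The review loop walks this list in order; phi4_local is always
    # last because it trades sharpness for availability.
    return [t["name"] for t in TIERS]
```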

&lt;p&gt;The cascade isn't "hosted first, local as a sad fallback." It's&lt;br&gt;
"use the sharpest tier available for each signal, and if nothing is&lt;br&gt;
available, still have a real model in the loop." A lot of days, the&lt;br&gt;
cheap hosted tier is plenty. But on the day Sonnet has an outage&lt;br&gt;
and GPT-4o-mini is queued three minutes deep, phi4_local keeps the&lt;br&gt;
lights on.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we measure
&lt;/h2&gt;

&lt;p&gt;We don't claim this is a better trading strategy. We claim it's a&lt;br&gt;
better decision stage. The questions we measure are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Veto precision: of the trades the reviewer vetoed, what fraction
actually would have lost money? (target: &amp;gt;60%)&lt;/li&gt;
&lt;li&gt;Modify improvement: on trades where we took a modified size,
did the expected-loss reduction justify the expected-return
reduction? (target: yes, on average)&lt;/li&gt;
&lt;li&gt;Fallback rate: what fraction of signals got the &lt;code&gt;fallback&lt;/code&gt; path
because all tiers were down? (target: &amp;lt;1% over a 30-day window)&lt;/li&gt;
&lt;/ul&gt;
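&lt;p&gt;A minimal sketch of how those numbers fall out of a verdict log (the record fields are assumptions about what gets logged):&lt;/p&gt;

```python
# Sketch: computing veto precision and fallback rate from a list of
# per-signal records. "counterfactual_pnl" is the P&L the vetoed trade
# would have had if taken -- an assumed field, not a real schema.
def veto_precision(records):
    vetoes = [r for r in records if r["verdict"] == "veto"]
    if not vetoes:
        return None  # no vetoes yet; precision is undefined
    losers = sum(1 for r in vetoes if r["counterfactual_pnl"] < 0)
    return losers / len(vetoes)


def fallback_rate(records):
    if not records:
        return 0.0
    down = sum(1 for r in records if r["source"] == "fallback")
    return down / len(records)
```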

&lt;p&gt;We publish these numbers on an operations dashboard, not in&lt;br&gt;
marketing copy. If the reviewer's veto precision drops below 50%&lt;br&gt;
for two weeks running, we take it off the hot path until we&lt;br&gt;
understand why. The spec isn't "always use the LLM." The spec is&lt;br&gt;
"use it only when it's measurably earning its place."&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest disclaimer
&lt;/h2&gt;

&lt;p&gt;This setup does not promise better returns. We have no backtested&lt;br&gt;
accuracy number to offer because we believe backtests over-fit and&lt;br&gt;
we don't want to sell one. What we offer is a decision stage that&lt;br&gt;
can see more of the context than a rule-based gate can, with a&lt;br&gt;
fail-open path so it can't take down the rest of the system, and&lt;br&gt;
a measurement plan that will retire it if the veto precision&lt;br&gt;
stops paying rent.&lt;/p&gt;

&lt;p&gt;If you're building something similar and have notes on how the&lt;br&gt;
context pack should be shaped, or what tier-cascade behavior you&lt;br&gt;
found works best when all hosted tiers are degraded at once, we'd&lt;br&gt;
genuinely like to hear them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Posted from the ops side of a live trading platform. No affiliate&lt;br&gt;
links, no course to sell, no promise of returns. Just a design&lt;br&gt;
pattern we wish we'd had two years ago.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>architecture</category>
      <category>trading</category>
    </item>
  </channel>
</rss>
