The outage that made me rewrite my pipeline
On December 11, 2024, OpenAI's API went fully down for roughly four hours. The culprit, per their incident report: a newly deployed telemetry service whose configuration caused every node across hundreds of Kubernetes clusters to execute resource-intensive API operations simultaneously. The control plane collapsed. Every OpenAI-dependent product collapsed with it.
Nine months later, Anthropic disclosed three separate infrastructure bugs that degraded Claude responses for weeks — at peak, 16% of Sonnet 4 requests were affected. A routing error sent short-context requests to 1M-token servers. TPU corruption caused Thai characters to appear in English responses. A compiler bug returned wrong tokens. Detection took weeks because symptoms varied across platforms and Claude often recovered from isolated mistakes.
If your production pipeline chains two or three LLM calls in series — primary decision, reviewer, formatter — it doesn't take a full outage to break you. A 16% quality degradation on one stage, unhandled, is enough to push your user-facing output below an acceptable quality bar.
And yet most production LLM code I still read does this:
```python
def decide(input):
    primary = call_primary_llm(input)    # fine
    review = call_reviewer_llm(primary)  # oh no
    if review.approved:
        return primary
    return None  # silence
```
When the reviewer errors, the whole pipeline returns None. You lose both models for the price of the weaker stage's reliability.
The pattern: fail-open with a circuit breaker
Circuit breakers predate LLMs by two decades. Michael Nygard introduced the pattern in Release It! (Pragmatic Bookshelf, 2007). Martin Fowler canonised it in 2014. Netflix productionised it in Hystrix, which protected every inter-service call in their microservice fleet before entering maintenance mode in 2018. Research summarised by groundcover puts the reduction in cascading failures at 83.5% for well-instrumented breakers in production distributed systems.
The state machine is simple:
- Closed (normal): requests pass through to the downstream.
- Open: the breaker has seen enough failures to stop trying. Requests fail immediately with a fallback value.
- Half-open: after a cooldown, one probe request goes through. If it succeeds, the breaker closes and normal traffic resumes. If it fails, the breaker stays open.
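A hand-rolled version of that state machine fits in a few dozen lines. This is a minimal single-threaded sketch, not any particular library's API — the class name and defaults are illustrative:

```python
import time

class CircuitBreaker:
    """Minimal closed/open/half-open breaker around a callable."""

    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open")  # fast fail, no network call
            # cooldown elapsed: half-open, let this probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                # trip open, or re-open after a failed half-open probe
                self.opened_at = time.monotonic()
            raise
        # success: normal call or successful probe, close the breaker
        self.failures = 0
        self.opened_at = None
        return result
```

The caller wraps it in try/except exactly as with a library breaker; the open-state `RuntimeError` returns in microseconds instead of waiting out a network timeout.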
Applied to an LLM reviewer stage, "fail immediately with a fallback value" means pass the primary decision through unmodified. Not retry-until-exhausted. Not queue-for-human-review. Not None. Through.
```python
from circuitbreaker import circuit  # or pybreaker, or hatch your own

@circuit(failure_threshold=5, recovery_timeout=60)
def _review_with_breaker(decision, context):
    return call_reviewer_llm(decision, context)

def decide(input):
    primary = call_primary_llm(input)
    try:
        review = _review_with_breaker(primary, input)
        if review.verdict == "reject":
            return downgrade_to_safe(primary, review.reasons)
        if review.verdict == "adjust":
            return apply_adjustments(primary, review)
        # "approve" falls through unchanged
    except Exception as e:
        logger.warning("reviewer unavailable, pass-through: %s", e)
    return primary
```
The except block is the spec, not the bug. When the breaker is open or the reviewer raises for any reason, primary is returned as-is with a warning logged. The pipeline ships what it can.
But isn't this just try/except?
It's structured try/except with two properties a naive version lacks:
- Fast failure under sustained outage. Without a breaker, every call during an outage still waits for its full timeout — 30, 60, sometimes 120 seconds — before giving up. Multiply by your QPS and you have effectively DoS'd yourself. A breaker fails in microseconds once it is open. During OpenAI's four-hour incident, the difference between 30-second timeouts and 1-microsecond fast-fails was the difference between a pipeline that queued a backlog it would spend a day draining and one that kept shipping on the primary alone.
- Automatic recovery. The half-open probe means you do not need a human to notice the provider recovered. Production systems that require manual re-enabling of a degraded component accumulate incident tickets faster than they close them.
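The back-of-envelope maths on the first point is worth doing once. A sketch with assumed numbers (10 requests/second, 30-second timeouts, a four-hour outage — none of these are from any specific incident report):

```python
qps = 10            # assumed steady request rate
timeout_s = 30      # per-call timeout without a breaker
outage_s = 4 * 3600 # a four-hour provider outage

# Without a breaker, every in-flight request holds a worker for the full
# timeout, so at any moment roughly qps * timeout_s requests sit blocked.
concurrent_stuck = qps * timeout_s  # 300 requests stalled at once

# With an open breaker each call fails in microseconds, so no backlog
# forms; the stage just fast-fails every call for the outage's duration.
calls_fast_failed = qps * outage_s  # 144,000 clean pass-throughs
```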
What LangChain gives you for free
If you are using LangChain or LangGraph, a meaningful slice of this is already built. LangChain's .with_fallbacks() lets you chain models with automatic failover:
```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

primary = ChatOpenAI(model="gpt-4o").with_fallbacks(
    [ChatAnthropic(model="claude-sonnet-4-5")]
)
```
This handles provider-level failover but does not solve the pipeline-level question: when the reviewer stage itself fails, what does the pipeline return? For that you still have to wrap the stage and decide explicitly what pass-through means for your domain.
LangGraph's state-driven error handling is the more interesting primitive. You can route failed nodes to dedicated error-handling nodes, categorise errors in the graph state, and make downstream routing depend on whether critical vs. optional stages succeeded. Community production targets: tool error rate under 3%, P95 latency under 5 seconds.
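LangGraph's own API aside, the primitive reduces to something you can sketch without any dependency: record failures in the pipeline state instead of raising, then route on whether a critical stage failed. The function and field names here are invented for illustration:

```python
def run_stage(state, name, fn, critical):
    """Run one pipeline stage, recording failures in state instead of raising."""
    state.setdefault("errors", {})
    try:
        state[name] = fn(state)
    except Exception as e:
        state["errors"][name] = {"critical": critical, "message": str(e)}
    return state

def route(state):
    """Downstream routing depends on whether critical vs. optional stages failed."""
    errors = state.get("errors", {})
    if any(err["critical"] for err in errors.values()):
        return "abort"          # critical stage down: stop the graph
    if errors:
        return "degraded_ship"  # optional stage down: ship the primary as-is
    return "ship"
```

In real LangGraph the `route` function would back a conditional edge; the point is that the pass-through decision lives in one inspectable place, not scattered across except blocks.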
When fail-open is wrong
Circuit-breaking the wrong stage is worse than no breaker at all. Two cases where you want fail-closed and must NOT pass through:
- Money movement. A reviewer that detects "sending $50K to an unknown wallet" should block, not warn. But this logic belongs in a deterministic rules engine, not an LLM. If an LLM is on the critical path of a financial transaction, your architecture has a problem that no opinion about error handling can fix.
- Regulated output. GDPR consent flows, medical advice generation, tax-filing assistance. These require human review on errors, not silent LLM bypass. The correct behaviour is queue and escalate, not pass-through.
For everything else — trading signals, content scoring, customer service drafts, product recommendations — the expected cost of shipping a slightly-lower-quality primary output is vastly lower than the expected cost of shipping nothing.
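The per-stage decision deserves to be explicit in code rather than implicit in scattered except blocks. A sketch of a policy table — the stage names and the `FailPolicy` enum are illustrative, not from any framework:

```python
from enum import Enum

class FailPolicy(Enum):
    PASS_THROUGH = "pass_through"  # ship the primary unmodified
    FAIL_CLOSED = "fail_closed"    # block the action entirely
    ESCALATE = "escalate"          # queue for human review

# One reviewable map of what each stage does when it fails.
STAGE_POLICIES = {
    "content_scorer": FailPolicy.PASS_THROUGH,
    "signal_reviewer": FailPolicy.PASS_THROUGH,
    "payment_authoriser": FailPolicy.FAIL_CLOSED,
    "medical_reviewer": FailPolicy.ESCALATE,
}

def on_stage_failure(stage, primary):
    policy = STAGE_POLICIES[stage]
    if policy is FailPolicy.PASS_THROUGH:
        return primary
    if policy is FailPolicy.ESCALATE:
        return {"status": "queued_for_review", "payload": primary}
    return None  # FAIL_CLOSED: nothing ships
```

A new stage cannot go to production without someone writing down, in one line, which failure mode it gets.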
The metrics that actually matter
If you ship fail-open, instrument these four numbers. They are the difference between "we have a circuit breaker" and "we know it is working":
- `reviewer_success_rate` — calls where the reviewer produced a valid response.
- `reviewer_adjustment_rate` — of those, the fraction where the reviewer modified the primary.
- `reviewer_rejection_rate` — the fraction where the reviewer fully overrode the primary.
- `reviewer_fallthrough_rate` — the fraction where the breaker opened or the call errored and the primary was passed through.
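Given plain counters, the four rates fall out directly. A sketch — the counter names are invented; wire them to whatever metrics backend you already run:

```python
def reviewer_rates(counts):
    """counts: dict with total, success, adjusted, rejected, fallthrough."""
    total = counts["total"]
    success = counts["success"]
    return {
        "success_rate": success / total,
        # adjustment and rejection are fractions of *successful* reviews
        "adjustment_rate": counts["adjusted"] / success if success else 0.0,
        "rejection_rate": counts["rejected"] / success if success else 0.0,
        # fallthrough: breaker open or call errored, primary shipped as-is
        "fallthrough_rate": counts["fallthrough"] / total,
    }
```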
`reviewer_fallthrough_rate` is the silent killer. If it creeps above 5%, your reviewer stack is degrading quality without anyone noticing. The Anthropic postmortem is instructive: their degradation took weeks to detect because Claude often recovered from isolated mistakes and the symptoms varied across platforms. Silent degradation is always the real enemy; full outages are at least obvious.
Don't fix a high fallthrough rate by turning the breaker into a hard gate that blocks output. Fix it by making the reviewer faster (lower temperature, a smaller model for first-pass triage, a local model for the classification step before the expensive one) or more available (multi-provider fallback at the reviewer layer). Research by the AtLarge group at TU Delft on LLM service incidents shows median MTTR for the major providers ranges 0.77–1.23 hours — meaning your fail-open window is measured in hours per month, not minutes.
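"More available" at the reviewer layer is just an ordered try-chain across providers, with the breaker wrapping the whole chain rather than a single vendor. A dependency-free sketch — the provider names and callables are stand-ins:

```python
def review_with_fallback(decision, context, providers):
    """Try each reviewer provider in order; raise only if all of them fail."""
    errors = []
    for name, call in providers:
        try:
            return call(decision, context)
        except Exception as e:
            errors.append(f"{name}: {e}")
    # Only now does the circuit breaker wrapping this function see a failure,
    # so fallthrough_rate reflects "every provider down", not one bad vendor.
    raise RuntimeError("all reviewer providers failed: " + "; ".join(errors))
```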
Closing principle
Fault-tolerant systems have one rule: every stage degrades to the simplest correct behaviour when its neighbour fails. For an LLM reviewer stage, that behaviour is pass-through with a warning. For an authorisation check, it is fail-closed with escalation. For content generation, it is queue-and-retry. Engineering for the explicit case per stage is what separates production systems from demos that fall over the first time the provider has a bad day — which, as both the Anthropic and OpenAI postmortems this year remind us, happens to everyone.
Further reading
- Michael Nygard, Release It! Second Edition (Pragmatic Bookshelf) — the canonical text on stability patterns for distributed systems.
- Martin Fowler, Circuit Breaker — the post that popularised the pattern outside Netflix.
- Netflix Hystrix, how it works.
- Anthropic's three-bug degradation postmortem (summary).
- LangChain fallbacks documentation.
- LangGraph error handling: retries & fallback strategies.
- FAILS: A Framework for Automated Collection and Analysis of LLM Service Incidents — TU Delft AtLarge group.
- Handling LLM Platform Outages — Requesty operational guide.
A3E Ecosystem builds AI-native trading and content pipelines in production. Every LLM stage in our trading signal engine ships fail-open; every customer service response pipeline, too. The reviewer stage we shipped this week uses exactly the pattern above — primary decision from one model, reviewer from a different model, breaker wrapping the reviewer call, primary passes through if the breaker opens.