DEV Community

Gabriel Anhaia
Claude Went Down Twice in 48 Hours Last Week. If You Noticed, Your Fallback Failed.


Claude went down twice in 48 hours last week. A major outage on April 7, 2026 took the API and the chat interface offline for users worldwide. Less than 24 hours later, on April 8, Anthropic was investigating "renewed connectivity issues" while the first postmortem was still being drafted.

If you are reading this because your users noticed, your fallback did not work. Here is what it should have done.

This post is not about whether Anthropic is reliable. Every provider has outages. OpenAI had its November 2024 global outage. Gemini had a multi-hour regional failure in February. AWS Bedrock has the usual quarterly blips. The question you actually care about is: when the provider your product depends on takes an unscheduled break, does your product also take one?

For most teams, the honest answer is yes. The fallback config exists. It sits in a YAML file next to the app. It has never been exercised. A config that has never failed over has never failed over correctly.

The availability case versus the quality case

There are two shapes of provider failure. Last week was one of them.

The availability case is what happened on April 7 and 8. The API returns 5xx, or times out, or the connection hangs. Your transport layer sees it. Your retry loop sees it. Your metrics dashboard lights up in red. This is the easy case — there is a signal, and if you wire the signal to a decision, the decision gets made.

The quality case is harder. The API returns 200. Tokens flow. Latency is normal. The output is wrong. The canonical public example is the Anthropic three-bug cascade from August 2025 — a context-window routing error, a TPU misconfiguration, an XLA:TPU miscompile, none of which triggered a server-side error. Users caught it by eye. Anthropic's own evals missed it.

The fallback you need has to handle both. A circuit breaker that watches only HTTP status will happily keep serving a provider that has gone silently stupid. A failover tier that only kicks in on 5xx will sit idle during a brownout where 30% of responses are garbage but all of them are 200.

A real multi-provider failover config

The cleanest pattern is a routing layer that treats providers as interchangeable tiers with declared ordering, budget, and retry policy. LiteLLM's Router does this. So do Portkey's configs. The shape of a working LiteLLM config for a summarization task with three tiers:

# litellm-config.yaml
model_list:
  - model_name: summarize-primary
    litellm_params:
      model: anthropic/claude-sonnet-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
      timeout: 20
  - model_name: summarize-secondary
    litellm_params:
      model: openai/gpt-4.1
      api_key: os.environ/OPENAI_API_KEY
      timeout: 20
  - model_name: summarize-tertiary
    litellm_params:
      model: gemini/gemini-2.5-pro
      api_key: os.environ/GEMINI_API_KEY
      timeout: 25

router_settings:
  fallbacks:
    - summarize-primary:
        - summarize-secondary
        - summarize-tertiary
  context_window_fallbacks:
    - summarize-primary:
        - summarize-tertiary
  allowed_fails: 2
  cooldown_time: 60
  num_retries: 1

Three tiers across three independent providers, so no single vendor outage takes out the whole chain. Short per-call timeouts so the fallback triggers before your user's request times out. A tertiary tier with a larger context window for the oversized-input edge case, wired in through context_window_fallbacks. allowed_fails and cooldown_time build an implicit breaker at the routing layer: after two failures, the primary is benched for a minute.
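The behavior that config declares can be sketched in plain Python to make the semantics explicit. This is a sketch of the routing logic, not LiteLLM's implementation: `call_provider` is a hypothetical stand-in for the SDK call (raising on 5xx or timeout), and per-call retries (`num_retries`) are omitted for brevity.

```python
import time


class TieredRouter:
    """Sketch of the config's semantics: ordered tiers, a failure
    budget per tier, and a cooldown bench. Not LiteLLM's code."""

    def __init__(self, tiers, allowed_fails=2, cooldown_s=60):
        self.tiers = tiers                      # ordered tier names
        self.allowed_fails = allowed_fails
        self.cooldown_s = cooldown_s
        self.fail_count = {t: 0 for t in tiers}
        self.benched_until = {t: 0.0 for t in tiers}

    def complete(self, prompt, call_provider, now=None):
        now = time.monotonic() if now is None else now
        last_err = None
        for tier in self.tiers:
            if now < self.benched_until[tier]:
                continue                        # tier is cooling down
            try:
                return tier, call_provider(tier, prompt)
            except Exception as exc:            # 5xx / timeout from the SDK
                last_err = exc
                self.fail_count[tier] += 1
                if self.fail_count[tier] >= self.allowed_fails:
                    # Bench the tier, like cooldown_time above.
                    self.benched_until[tier] = now + self.cooldown_s
                    self.fail_count[tier] = 0
        raise RuntimeError("all tiers failed") from last_err
```

The useful property is that the failure budget and the bench live in one place, so "primary is down" is a routing decision, not something every call site re-derives.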

This config is not the point. The point is the next paragraph.

Exercise it every month or it does not exist

Pick a Tuesday that is not a launch day. Declare a maintenance window nobody will notice. Flip the primary's API key to an intentionally broken value. Watch your traces.

You are looking for three things:

  1. The first request to the primary fails fast (inside the timeout window, not hanging).
  2. The secondary picks up the traffic within one retry.
  3. Your product's p95 latency bumps but does not break the SLO.

If any of those three things does not happen, fix it before the next real outage uses you as the test harness.
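If your tracing backend can export spans, the three checks can be automated instead of eyeballed. A sketch over hypothetical span records; the `tier`, `ok`, and `duration_s` field names are made up for illustration and are not any particular tracing SDK's schema:

```python
def drill_passed(spans, timeout_s=20.0, p95_slo_s=8.0):
    """Check the three drill properties over exported trace spans."""
    primary = [s for s in spans if s["tier"] == "primary"]
    secondary = [s for s in spans if s["tier"] == "secondary"]
    # 1. Primary fails fast: every failed call ended inside the timeout.
    fails_fast = all(
        s["duration_s"] < timeout_s for s in primary if not s["ok"]
    )
    # 2. The secondary actually picked up traffic.
    failover_worked = any(s["ok"] for s in secondary)
    # 3. p95 latency of successful requests stays inside the SLO.
    durations = sorted(s["duration_s"] for s in spans if s["ok"])
    p95 = (
        durations[int(0.95 * (len(durations) - 1))]
        if durations else float("inf")
    )
    return fails_fast and failover_worked and p95 <= p95_slo_s
```

Run it against the drill window's spans and the pass/fail is a boolean, not a debate.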

The teams that were fine last week had run this drill. The teams that were on Hacker News explaining to their users why a chatbot was dark had not. This is not a clever insight — it is muscle memory. A runbook that has only been read is a document. A runbook that has been executed is a capability.

The three rungs of graceful degradation

Failover between model providers is one lever. It is not the only one, and for some request shapes it is not even the best one. Three rungs, in descending order of user-visible quality:

Rung 1: cached answer. A large fraction of production LLM traffic is repeat traffic. Identical prompts, near-identical prompts, prompts that normalize to the same canonical form. A semantic-cache layer (GPTCache, Portkey's cache, a hand-rolled Redis layer keyed on an embedding hash) serves the duplicate traffic directly. During an outage, the cache is the only thing standing between your users and a hard error for any request that has been seen before. The cache does not know the provider is down. It just serves.
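The hand-rolled end of that spectrum can be as small as a dictionary behind a normalizing hash. A sketch, exact-match after normalization only; a true semantic cache would key on an embedding so that near-identical prompts also hit:

```python
import hashlib


class NormalizingCache:
    """Exact-match cache keyed on a normalized prompt hash.
    Sketch of the hand-rolled variant, not a semantic cache."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(prompt: str) -> str:
        # Collapse case and whitespace so trivial variants hit the
        # same entry, then hash to a fixed-size key.
        canonical = " ".join(prompt.lower().split())
        return hashlib.sha256(canonical.encode()).hexdigest()

    def get(self, prompt: str):
        return self._store.get(self._key(prompt))

    def put(self, prompt: str, answer: str):
        self._store[self._key(prompt)] = answer
```

Swap the dict for Redis and the hash for an embedding lookup and you have the production shape; the interface does not change.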

Rung 2: cheaper model or smaller provider. If the cache misses and the primary is down, the routing layer sends the request to a secondary. This is the rung most teams think of as "the fallback." It is the middle rung. For most traffic, a cheaper or smaller model is a graceful degradation — the answer is slightly worse, the UX survives. Log the downgrade as a span attribute (fallback.tier = "secondary") so you can measure how much of your traffic is running degraded during an incident.

Rung 3: canned human response. When the cache misses and every tier is failing, do not return an error. Return a pre-written response that tells the user the system is under pressure, gives them something to do (come back in 5 minutes, try a simpler question, email support), and does not pretend to be the model. This rung exists for the 1% of incidents where every provider is struggling at once, which is rare but not zero. The canned response is the airbag. You hope it never deploys. When it does, the user walks away.

Most teams implement rung 2. Some implement rung 1. Almost nobody implements rung 3. The first time rung 3 saves a product launch is the last time anyone on the team calls it over-engineering.
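The three rungs, chained into one dispatch path. A sketch with hypothetical `cache` and `call_provider` collaborators; the canned text and support address are placeholders:

```python
CANNED = (
    "We're having trouble reaching our AI service right now. "
    "Please try again in a few minutes, or email support@example.com."
)


def answer(prompt, cache, tiers, call_provider):
    """Return (text, source) by walking the three rungs in order."""
    # Rung 1: serve a previously seen answer. The cache does not
    # know or care whether any provider is up.
    cached = cache.get(prompt)
    if cached is not None:
        return cached, "cache"
    # Rung 2: walk the tiers in order; each step down degrades
    # quality slightly but keeps the UX alive.
    for tier in tiers:
        try:
            result = call_provider(tier, prompt)
            cache.put(prompt, result)
            return result, tier
        except Exception:
            continue
    # Rung 3: the airbag. Never a raw error, never pretending
    # to be the model.
    return CANNED, "canned"
```

The returned `source` tag is what you log, so during an incident you can graph exactly how much traffic each rung is carrying.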

The quality-aware circuit breaker

All of the above handles the availability case. For the quality case, the routing layer needs a second signal: a rolling judge score.

A standard circuit breaker tracks error rate and trips when it crosses a threshold. A quality-aware breaker tracks both error rate and a sampled LLM-judge score on a subset of responses, and trips when either signal goes bad. The pattern, trimmed to the essentials:

import time
from collections import deque


class QualityAwareBreaker:
    def __init__(
        self,
        err_threshold: float = 0.02,
        judge_threshold: float = 0.70,
        cooldown_s: int = 300,
        window: int = 200,
    ):
        self.err_threshold = err_threshold
        self.judge_threshold = judge_threshold
        self.cooldown_s = cooldown_s
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.open_reason: str | None = None
        # Count-based rolling windows; the full time-based
        # windowing is elided here.
        self._results: deque[bool] = deque(maxlen=window)
        self._scores: deque[float] = deque(maxlen=window)

    def record(self, ok: bool, judge_score: float | None = None) -> None:
        if self.state == "HALF_OPEN":
            # The probe decides: clean closes the breaker,
            # anything else reopens it.
            if ok and (judge_score is None
                       or judge_score >= self.judge_threshold):
                self.state = "CLOSED"
                self._results.clear()
                self._scores.clear()
            else:
                self._open(reason="failed_probe")
            return
        self._results.append(ok)
        if judge_score is not None:
            self._scores.append(judge_score)
        if self._rolling_err_rate() > self.err_threshold:
            self._open(reason="error_rate")
            return
        judge_mean = self._rolling_judge_mean()
        if judge_mean is not None and judge_mean < self.judge_threshold:
            self._open(reason="quality")

    def allow(self) -> bool:
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at > self.cooldown_s:
                self.state = "HALF_OPEN"  # let a probe request through
                return True
            return False
        return True

    def _open(self, reason: str) -> None:
        self.state = "OPEN"
        self.opened_at = time.monotonic()
        self.open_reason = reason

    def _rolling_err_rate(self) -> float:
        if not self._results:
            return 0.0
        return self._results.count(False) / len(self._results)

    def _rolling_judge_mean(self) -> float | None:
        if not self._scores:
            return None
        return sum(self._scores) / len(self._scores)

Two signals. One breaker. The judge score has to be computed on a small sample (5% of traffic is usually enough) by a different provider than the one being watched — an OpenAI-served judge watching an Anthropic primary, a Gemini-served judge watching an OpenAI primary. Self-bias is a real effect; a model is a terrible judge of its own output when things go sideways.
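The sampling itself is a few lines. A sketch in which `judge_call` is a hypothetical request to that second provider, returning a score between 0.0 and 1.0; the `rng` parameter exists only so the sampling is testable:

```python
import random


def sampled_judge_score(response_text, judge_call,
                        sample_rate=0.05, rng=random):
    """Judge a small sample of responses.

    Returns None for unsampled traffic, otherwise the score from
    `judge_call`, a hypothetical call to a *different* provider
    than the one being watched."""
    if rng.random() >= sample_rate:
        return None
    return judge_call(response_text)
```

Feed the result straight into `breaker.record(ok, judge_score)`; a `None` simply contributes nothing to the quality signal.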

The breaker trips pessimistically and recovers conservatively. When it opens, requests route to the next tier. When cooldown elapses, it probes with a single request. If that request comes back clean — 200, judge score above threshold — the breaker half-opens and ramps traffic back. If the probe itself looks bad, the breaker stays open and logs it.

The output of this system is a graph you can look at during an incident. Error rate per provider, judge score per provider, breaker state per provider. When Claude went down on April 7, the breaker on summarize-primary should have tripped within the first minute and stayed open. Traffic should have routed to OpenAI. When the April 8 issues started, the breaker should have tripped again, from a lower baseline, without waking anybody up.

That is what "handled the outage" means. Not that nothing happened. That nothing reached the user.

What to check this week

Three things to run before next Tuesday:

  • Look at your routing config. Count the tiers. If there is only one, the rest of this post is a feature list.
  • Check when the last failover drill ran. If the answer is "never" or "I don't know," book the drill for this week. Twenty minutes, a broken API key, a watched dashboard.
  • Check whether your judge-score signal is wired into the same decision that your error-rate signal is wired into. If the judge is a metric on a dashboard but not an input to a breaker, it is a vanity signal.

Outages are not rare events in LLM production. They are quarterly events with a long tail of brownouts in between. The providers know this. Your users do not. The gap between those two facts is where your fallback lives.

If this was useful

This post is the short version of two chapters from the book. Observability for LLM Applications covers the routing layer, the three rungs of degradation, and the incident-response playbook in depth. Chapter 17 has the quality-aware breaker with the full rolling-window math. Chapter 18 is the production-readiness checklist, including the monthly failover drill and the pinned "normal Tuesday" dashboard.

Observability for LLM Applications — the book

Thinking in Go — 2-book series on Go programming and hexagonal architecture
