How we route around a 20-minute Anthropic outage

#ai #api #reliability #failover

The shape of the bad day is always the same. A status page goes red, or doesn't go red but should have, or goes red 40 minutes after the customer's first failed request. The AI provider you depend on is having a moment. Your application's error rate spikes. You spend the next hour explaining to your own customer that "the model API is degraded" — a phrase that means absolutely nothing to them.

The previous Prism release — v1.4 Policy + Governance — made the cost side predictable. Today's release, v1.5, makes the reliability side predictable. The bet is the same: provider outages should be a routing problem, not a customer problem. Whether Anthropic, OpenAI, or Google is having a moment, the request that lands at api.ssimplifi.com should get a response or a structured error — never a silent hang, never a stream that never closes, never a stale "the model API is degraded" conversation.

The three problems the old failover had

The v1.0 failover code did roughly the right thing on paper: try the primary provider, retry once, then walk the fallback chain. It had three problems that only show up when something actually goes wrong.

Problem 1: per-process in-memory state. Health was tracked in a Python dict that lived inside one uvicorn worker. Two workers handling traffic on the same EC2 instance had independent views: each had to see three failures of its own before it marked Anthropic unhealthy. A container restart wiped the dict entirely, so the first three requests after a deploy ate the outage all over again.

Problem 2: binary healthy/unhealthy. A provider that returned 500s on every other request but succeeded on the retry stayed marked healthy forever, because each success reset the consecutive-failures counter before it could reach three. The customer experienced a 50% error rate for as long as the issue persisted.

Problem 3: no latency awareness. A provider that went from 500ms p95 to 8 seconds p95 was identical to a snappy one in the router's eyes. We'd happily send the customer's request to the slow provider while a healthy one sat idle.
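Problem 2 is easy to reproduce in a few lines. The sketch below is not the v1.0 code; it assumes a counter that resets on any success and trips at three failures in a row, and the names trips_breaker and error_rate are hypothetical.

```python
def trips_breaker(outcomes, threshold=3):
    """Return True if `threshold` failures ever occur in a row."""
    streak = 0
    for ok in outcomes:
        # Assumption: any success resets the consecutive-failure streak.
        streak = 0 if ok else streak + 1
        if streak >= threshold:
            return True
    return False


def error_rate(outcomes):
    """Fraction of requests that failed."""
    return sum(1 for ok in outcomes if not ok) / len(outcomes)


# Every other request fails: False is a 500, True is a success.
flapping = [False, True] * 50
```

With this input, trips_breaker(flapping) stays False while error_rate(flapping) is 0.5: the provider never looks unhealthy even though half the requests fail.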

And the streaming path had its own special version of failure: a provider that died after first-token had flowed left the customer's SSE hanging indefinitely. The connection didn't close cleanly. The customer's client had no signal except "I haven't received a chunk in 30 seconds, maybe something's wrong."

v1.5 in four shipped pieces

Redis-backed rolling-window health. Every provider call — success or failure, with its latency — gets recorded in a Redis sorted set keyed provider:health:{provider_key}. Entries older than 5 minutes get pruned on every write. The weight curve is intentionally aggressive on the recovery side:

sample_count < 5         → weight=1.0   (no signal yet, optimistic)
success_rate >= 0.95     → weight=1.0   (healthy — full traffic)
success_rate >= 0.50     → weight=0.1   (degraded — 10% probe traffic)
success_rate <  0.50     → weight=0.0   (skip on next routing decision)

The 10% probe traffic on a degraded provider matters. Without it, a brief blip darkens the provider for the full 5-minute window because no fresh requests get sent to test recovery. With it, the provider gets steady but low traffic until it proves healthy; once the windowed success rate climbs back to 95%, the weight returns to 1.0.
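The weight curve above translates directly into a small function. This is a minimal sketch: the thresholds come from the table, but the names routing_weight and should_route are hypothetical, and applying the weight as a per-request admission probability is an assumption about how the 10% probe share could be enforced.

```python
import random

MIN_SAMPLES = 5  # below this, there is no signal yet


def routing_weight(sample_count, success_rate):
    """Map a provider's rolling-window stats to a routing weight."""
    if sample_count < MIN_SAMPLES:
        return 1.0  # optimistic until there is enough data
    if success_rate >= 0.95:
        return 1.0  # healthy: full traffic
    if success_rate >= 0.50:
        return 0.1  # degraded: probe traffic only
    return 0.0      # skip on the next routing decision


def should_route(weight, rng=random.random):
    """Assumed policy: admit a request with probability `weight`."""
    return rng() < weight
```

Because random.random() returns a value in [0, 1), a weight of 1.0 always admits the request and a weight of 0.0 never does.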

Both workers on the same EC2 instance now share one view of provider health through Redis, hosted on Upstash. A container restart no longer wipes the window, so the first request after a deploy doesn't have to relearn anything.
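One way the sorted-set window could look, as a sketch rather than the shipped code. It assumes a redis-py style client passed in by the caller (only zadd, zremrangebyscore, and zrangebyscore are used), scores that are Unix timestamps, and a made-up member encoding of "success:latency:random-id" to keep members unique. The key format is the one named above; the function names are hypothetical.

```python
import time
import uuid

WINDOW_SECONDS = 5 * 60


def health_key(provider_key):
    return f"provider:health:{provider_key}"


def observe(client, provider_key, success, latency_ms, now=None):
    """Record one call outcome and prune entries older than the window."""
    now = time.time() if now is None else now
    key = health_key(provider_key)
    # Assumed encoding: the random suffix keeps sorted-set members unique.
    member = f"{int(success)}:{latency_ms}:{uuid.uuid4().hex}"
    client.zadd(key, {member: now})
    client.zremrangebyscore(key, "-inf", now - WINDOW_SECONDS)


def window_stats(client, provider_key, now=None):
    """Return (sample_count, success_rate) for the current window."""
    now = time.time() if now is None else now
    raw = client.zrangebyscore(
        health_key(provider_key), now - WINDOW_SECONDS, "+inf"
    )
    # redis-py returns bytes unless decode_responses=True is set.
    members = [m.decode() if isinstance(m, bytes) else m for m in raw]
    if not members:
        return 0, 1.0
    successes = sum(1 for m in members if m.split(":", 1)[0] == "1")
    return len(members), successes / len(members)
```

Because the state lives in Redis rather than in a worker's memory, any process that calls window_stats sees the same samples.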

Streaming mid-stream drop detection. When the upstream provider's connection dies after first-token has started flowing, the wrapped async generator catches the exception, records observe(success=False) against the provider's health, and re-raises. The outer event generator in completions.py was already wired to emit data: {"error": "stream_error"} + data: [DONE] on any exception — that path is now actually traversed, so the SSE closes cleanly and the customer's client gets a definite signal that the stream is over.

The cost is one extra try/except around the chunk loop. The benefit is that a provider that's now failing 50% of streams mid-flight gets a 50% success rate in our health window — not the 100% it would have shown when we only observed at first-token. The next request routes accordingly.
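The wrapping described above can be sketched as two nested async generators. This is illustrative only: observed_stream, sse_events, and flaky_upstream are hypothetical names, the observe callback stands in for the real health recorder, and recording a success at clean end-of-stream is an assumption the post does not spell out.

```python
import asyncio
import json


async def observed_stream(upstream, observe):
    """Relay chunks and report the stream's outcome to the health tracker."""
    try:
        async for chunk in upstream:
            yield chunk
    except Exception:
        observe(success=False)  # a mid-stream drop counts against the provider
        raise
    else:
        observe(success=True)  # assumption: success is recorded at clean end


async def sse_events(chunks):
    """Outer generator: always terminates the SSE stream explicitly."""
    try:
        async for chunk in chunks:
            yield f"data: {json.dumps(chunk)}\n\n"
    except Exception:
        yield 'data: {"error": "stream_error"}\n\n'
    yield "data: [DONE]\n\n"


async def flaky_upstream():
    """Simulated provider that dies after the first token."""
    yield {"text": "hi"}
    raise ConnectionError("upstream reset")


def run_demo():
    outcomes = []

    async def collect():
        stream = observed_stream(
            flaky_upstream(), lambda success: outcomes.append(success)
        )
        return [event async for event in sse_events(stream)]

    return asyncio.run(collect()), outcomes
```

In run_demo, the client receives the first chunk, then the stream_error event, then [DONE], and the health tracker sees a single failed observation.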

Speculative parallel routing. On X-Prism-Mode: sport, Pro and Team accounts get a different routing strategy: the primary and the first healthy fallback fire in parallel via asyncio.create_task, and whichever responds first wins. The loser is cancelled. The response includes X-Prism-Speculative: true so the customer can see when it kicked in.

This trades token cost for latency hedging. The loser keeps generating tokens until our Task.cancel() call propagates through to the underlying HTTP request, typically a few hundred milliseconds and a few dozen tokens of waste. We absorb that cost; the customer is billed only for the winner's response. The benefit shows up when one provider's p99 latency spikes: instead of waiting 8 seconds for the slow one, the customer sees 800ms from the fast one.

The health-observation rule here is subtle: only the winner is recorded. The loser was racing fairly — cancelling it doesn't mean it was unhealthy, and recording it as a failure would pollute the next routing decision with phantom outages. The customer's request was served; both providers behaved correctly; we just didn't wait for the second one.
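A minimal sketch of the race, assuming each provider call is a zero-argument coroutine function and observe(name, ok) is a stand-in for the health recorder. The name speculative_call is hypothetical. One behavior here goes beyond what the post states: a racer that raises before anyone has won is recorded as a failure and the race continues with the remaining task.

```python
import asyncio


async def speculative_call(calls, observe):
    """Race provider calls; return (winner_name, result), cancel the rest."""
    if not calls:
        raise ValueError("need at least one provider call")
    tasks = {asyncio.create_task(fn()): name for name, fn in calls.items()}
    pending = set(tasks)
    last_error = None
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                error = task.exception()
                if error is None:
                    observe(tasks[task], True)  # only the winner is recorded
                    return tasks[task], task.result()
                observe(tasks[task], False)  # assumption: real errors do count
                last_error = error
        raise last_error
    finally:
        # Cancelled losers are never observed: losing a race is not a failure.
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)


def run_demo():
    observed, cancelled = [], []

    async def slow_primary():
        try:
            await asyncio.sleep(1.0)
            return "slow answer"
        except asyncio.CancelledError:
            cancelled.append("primary")
            raise

    async def fast_fallback():
        await asyncio.sleep(0.01)
        return "fast answer"

    async def main():
        return await speculative_call(
            {"primary": slow_primary, "fallback": fast_fallback},
            lambda name, ok: observed.append((name, ok)),
        )

    return asyncio.run(main()), observed, cancelled
```

In run_demo, the fallback wins, the primary is cancelled, and the only health observation is the winner's success.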

Speculative routing only applies to non-streaming sport-mode requests in v1.5. Hedging two SSE streams is messy: the customer sees first-token from whichever arrives first, and then we can't switch streams mid-flight without confusing the SSE client. Eco and balanced modes stay serial — they route to cheap models where the token-cost overhead isn't worth the latency hedge.

Public health badge on the dashboard. The top of /dashboard now shows a small status pill: "All providers healthy" / "Partial degrade" / "Provider outage" with a colored dot per provider. Polls every 30 seconds. The dot color tells you what the router is currently doing — green is full traffic, yellow is probe-only, red is fully skipped. Hover any dot for the rolling success rate, p95 latency, and sample count.
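The dot colors map directly onto routing weights. The sketch below assumes the weights 1.0, 0.1, and 0.0 from the health window. The rule for choosing the pill text (any red dot means "Provider outage", otherwise any yellow dot means "Partial degrade") is a guess at the aggregation, not something the dashboard is confirmed to do, and the function names are hypothetical.

```python
def dot_color(weight):
    """Translate a routing weight into a dashboard dot color."""
    if weight >= 1.0:
        return "green"   # full traffic
    if weight > 0.0:
        return "yellow"  # probe-only
    return "red"         # fully skipped


def status_pill(weights):
    """Return (pill_text, per-provider colors) for a dict of weights."""
    colors = {name: dot_color(w) for name, w in weights.items()}
    # Assumed aggregation rule: the worst dot decides the pill text.
    if "red" in colors.values():
        label = "Provider outage"
    elif "yellow" in colors.values():
        label = "Partial degrade"
    else:
        label = "All providers healthy"
    return label, colors
```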

This is visible to all tiers, intentionally. Health awareness is the kind of thing that should be shared — customers shouldn't be the last to know their proxy is routing around an outage.

The exit criteria

We wrote this down in the v1.4 roadmap before starting v1.5: an Anthropic outage doesn't cause customer-visible failure for any account routing through Prism. As of today's deploy, that's true. The same is true for OpenAI and Google. The only failure mode that still escapes the loop is all three providers being down for your routed model class at the same moment — and that hasn't happened since the v1.0 launch.

We were never going to be able to make AI providers themselves stop having outages. What we can do is make sure the outage stops at the proxy. The customer never sees it. They don't even know which provider is having a moment, because by the time they would have noticed, the response from the other one is already on its way back.

Live today on every tier. Speculative routing kicks in on sport mode for Pro and Team subscribers. The blog post /blog/how-to-stop-your-ai-bill-from-surprising-you covers the cost-control half of the picture; this one is the reliability half. Together they're what we mean when we say production-grade.

The next pillar is multi-region: the one remaining piece where Prism's "Vercel for AI" positioning is still rhetorical rather than literal for US/EU traffic. That's a v1.6 problem.

DEV Community
