**TL;DR:** I built an LLM provider router that tries Ollama first, falls back through 13 cloud providers automatically when any one fails or rate-limits, and keeps your request alive across provider swaps mid-stream. Here's how it works, and why single-provider setups break at scale.
## The problem every AI app hits eventually
You ship an app that calls OpenAI. Users love it. Then:
- OpenAI has a 30-minute outage (happens monthly)
- Your rate limits get hit during a traffic spike
- Your bill balloons because you picked the most expensive model
- A new user in the EU needs data residency Anthropic doesn't offer
- Groq ships a faster model than what you're using
Each of these kills your product for some slice of users.
Single-provider architecture is a single point of failure.
The obvious fix — "use multiple providers" — sounds easy
until you try to implement it. Each SDK has different:
- Auth schemes (bearer tokens, API keys in headers, query params, custom headers)
- Request shapes (OpenAI's `messages` array vs Anthropic's top-level `system` plus `messages` vs Gemini's `contents`)
- Streaming formats (SSE with different event names, JSON chunks, raw deltas)
- Error conventions (429s from some, 503s for the same condition elsewhere, silent truncation in others)
- Rate limits (RPM, TPM, concurrent, daily — varies by provider and tier)
Naively writing a wrapper per provider works for 2-3
providers. At 13, it's unmaintainable.
## The architecture I landed on
Three layers:
### 1. Provider adapters — normalize inputs and outputs
Each provider gets a thin adapter file that converts the
Aiden internal request shape into the provider's native
format, and vice versa. The adapter exposes a single
interface:
```typescript
interface ProviderAdapter {
  id: string;              // "groq-1", "anthropic-1"
  name: string;            // display name
  model: string;           // active model ID
  priority: number;        // lower = tried first
  costTier: 1 | 2 | 3;     // free | cheap | premium
  chat(request: NormalizedRequest): AsyncIterable<string>; // streamed text chunks
  testKey(): Promise<boolean>;
  getRateLimit(): RateLimitStatus;
}
```
Internal request shape is OpenAI-compatible (most common
baseline). The adapter handles the translation.
For Anthropic's Claude:
```typescript
// OpenAI-style request
{ messages: [{ role: 'system', content: '...' }, { role: 'user', content: '...' }] }

// Gets translated to Anthropic format
{ system: '...', messages: [{ role: 'user', content: '...' }] }
```
For Bay of Assets (an OpenAI-compatible proxy), the translation is a pass-through — just a base-URL swap.
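The Claude translation above can be sketched as a pure function. This is a hypothetical illustration, not Aiden's actual adapter code — the type names are mine:

```typescript
// Hypothetical sketch: lift the OpenAI-style system message into
// Anthropic's top-level `system` field.
type OpenAIMessage = { role: 'system' | 'user' | 'assistant'; content: string };
type AnthropicRequest = {
  system?: string;
  messages: { role: 'user' | 'assistant'; content: string }[];
};

function toAnthropic(messages: OpenAIMessage[]): AnthropicRequest {
  // Anthropic takes the system prompt as a top-level field, not a message.
  const system = messages.find((m) => m.role === 'system')?.content;
  const rest = messages
    .filter((m): m is OpenAIMessage & { role: 'user' | 'assistant' } => m.role !== 'system')
    .map((m) => ({ role: m.role, content: m.content }));
  return system !== undefined ? { system, messages: rest } : { messages: rest };
}

const out = toAnthropic([
  { role: 'system', content: 'be brief' },
  { role: 'user', content: 'hi' },
]);
// out.system === 'be brief'; out.messages contains only the user turn
```

Because the function is pure, each adapter's translation can be unit-tested without touching the network.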
### 2. Router — the fallback chain logic
The router maintains an ordered list of healthy providers
and picks the first one matching constraints (cost tier,
model capability, user preference).
When a call fails, the router:
- Classifies the error — rate limit, auth, server error, network, or permanent
- Marks the provider's health status:
  - 429 → rate-limited, skip for N seconds
  - 401/403 → auth broken, skip until manual reset
  - 500/502/503/504 → transient, retry with backoff
  - Network error → mark degraded, try next
- Re-enters the chain with the next healthy provider
- Continues until success or chain exhaustion
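The classification step is easy to express as a small pure function. A minimal sketch, assuming the status-code-to-verdict mapping above (the names and 30-second default cooldown are my own, not Aiden's):

```typescript
// Hypothetical sketch of the error-classification step.
type Verdict =
  | { kind: 'rate-limited'; skipForMs: number } // 429: skip for N seconds
  | { kind: 'auth-broken' }                     // 401/403: skip until manual reset
  | { kind: 'transient' }                       // 5xx: retry with backoff
  | { kind: 'degraded' };                       // no HTTP status: network failure

function classify(status: number | null, retryAfterSec?: number): Verdict {
  if (status === 429) {
    // Honor Retry-After when present; assume 30s cooldown otherwise.
    return { kind: 'rate-limited', skipForMs: (retryAfterSec ?? 30) * 1000 };
  }
  if (status === 401 || status === 403) return { kind: 'auth-broken' };
  if (status !== null && status >= 500 && status <= 504) return { kind: 'transient' };
  return { kind: 'degraded' };
}
```

Keeping classification separate from the retry loop means each provider's quirks (silent truncation, nonstandard error bodies) can be normalized in its adapter before this function ever sees them.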
The critical detail: this happens mid-request, not just
on the next request. If the user is streaming a response
and Groq drops the connection halfway, the router sees the
stream close, switches to Together AI, re-sends the request,
and resumes streaming. The user sees a ~2 second pause and
no error.
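The core of that mid-stream behavior fits in one async generator. A simplified sketch (my own, not the repo's code): each provider is a function returning a token stream, and a broken stream falls through to the next provider:

```typescript
// Hypothetical sketch of mid-stream failover: if one provider's stream
// dies, re-send the request to the next provider and keep yielding.
type Chat = (prompt: string) => AsyncIterable<string>;

async function* chatWithFailover(
  providers: Chat[],
  prompt: string,
): AsyncIterable<string> {
  let lastError: unknown = null;
  for (const chat of providers) {
    try {
      for await (const token of chat(prompt)) yield token;
      return; // stream completed cleanly: done
    } catch (err) {
      lastError = err; // stream broke mid-way: fall through to next provider
    }
  }
  throw lastError ?? new Error('provider chain exhausted');
}
```

Note this sketch naively replays the new provider's response from its start; a production router also has to track what was already emitted so the user sees a short pause rather than repeated text.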
### 3. Slot rotation — multiple keys per provider
Within a single provider (say, Groq), I run 4 rotation slots:
```
groq-1 → API_KEY_1 (free tier, 30 RPM)
groq-2 → API_KEY_2 (free tier, 30 RPM)
groq-3 → API_KEY_3 (free tier, 30 RPM)
groq-4 → API_KEY_4 (free tier, 30 RPM)
```
Four free-tier accounts = 120 RPM effective. When slot 1
hits its rate limit, the router transparently rotates to
slot 2. This gives you paid-tier throughput on free tier,
which matters when you're a solo founder.
Caveat: read the provider's ToS on multi-account usage.
Groq currently permits it. Others may not.
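Slot selection itself is simple: pick the first slot whose rate-limit cooldown has expired. A minimal sketch (hypothetical names, not the repo's code):

```typescript
// Hypothetical sketch of slot rotation: pick the first slot that is not
// currently rate-limited; null means every slot is cooling down and the
// router should move on to the next provider in the chain.
interface Slot {
  id: string;
  rateLimitedUntil: number; // unix ms; 0 = healthy
}

function pickSlot(slots: Slot[], now: number): Slot | null {
  return slots.find((s) => s.rateLimitedUntil <= now) ?? null;
}

const slots: Slot[] = [
  { id: 'groq-1', rateLimitedUntil: 60_000 }, // cooling down
  { id: 'groq-2', rateLimitedUntil: 0 },
  { id: 'groq-3', rateLimitedUntil: 0 },
];
// pickSlot(slots, 1_000)?.id === 'groq-2'
```

Returning `null` rather than blocking is what lets slot exhaustion compose with the provider-level fallback chain from layer 2.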
## Health tracking
Each provider carries a live health score:
```typescript
interface ProviderHealth {
  lastSuccess: number;          // unix timestamp
  lastFailure: number;
  consecutiveFailures: number;
  rateLimitedUntil: number | null;
  totalCalls: number;
  failureRate: number;          // rolling 100-call window
}
```
Providers with >50% failure rate in the last 100 calls get
de-prioritized. Providers with <5% failure rate get boosted.
This creates a self-organizing preference — the chain
gradually learns which providers are actually working for
your region, network, and use case.
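One way to realize that self-organizing ordering is to fold the failure rate into the sort key. A sketch under my own assumptions (the penalty and boost magnitudes here are illustrative, not the repo's actual numbers):

```typescript
// Hypothetical sketch: order the chain by base priority, heavily
// penalizing providers above the 50% failure threshold and slightly
// boosting those below 5%.
interface Ranked {
  id: string;
  priority: number;    // lower = tried first
  failureRate: number; // rolling 100-call window, 0..1
}

function rank(providers: Ranked[]): Ranked[] {
  const score = (p: Ranked) =>
    p.priority +
    (p.failureRate > 0.5 ? 100 : 0) - // de-prioritize flaky providers
    (p.failureRate < 0.05 ? 1 : 0);   // boost consistently healthy ones
  return [...providers].sort((a, b) => score(a) - score(b));
}
```

With a rolling window, a provider that recovers climbs back up the chain automatically as old failures age out.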
## The result

For my Windows-native AI agent (Aiden, open source), this
means:
- Ollama tries first — zero network cost, private, local
- If Ollama is unreachable, Groq takes over (free, fast)
- If Groq is rate-limited, Gemini Flash kicks in
- If Gemini fails, OpenRouter proxies to whichever model is cheapest that minute
- Anthropic Claude reserved for complex reasoning tasks that need it
I can drop any single provider, including the whole free
tier, and the agent keeps working. Users never see "provider
X is down" errors — they just see slightly different
response styles as the chain shifts.
## Things I'd do differently
- I built the health tracking as part of the router. It should be a separate module you can replace. Testing the router's logic without mocking health state is painful.
- Slot rotation needs better observability. When you have 4 Groq slots and 2 are rate-limited, knowing WHICH 2 matters. I didn't expose this well initially.
- Retry-with-different-model is a feature I'm still working on. Some providers have multiple models per account — Groq has 8, OpenRouter has 200+. Failing over to a different model on the same provider should happen before switching providers entirely.
## The code
This is all open source under AGPL-3.0. Router lives in
providers/ in the Aiden repo:
https://github.com/taracodlabs/aiden
Check out providers/index.ts for the routing logic and
core/providerHealth.ts for the health tracking.
If you're building on LLMs and only using one provider, you
will regret it. Start multi-provider from day one. It's
actually not that much harder when you build the router
first.
Feedback welcome. I'm a solo founder, this is v3.7.2, rough
edges definitely exist. If you're building something similar
and want to compare notes, DMs open on Twitter @shivayx9 or
hit me on the Aiden Discord: discord.gg/gMZ3hUnQTm