Shiva

Self-healing LLM routing: 13 providers, one fallback chain

TL;DR: I built an LLM provider router that tries Ollama
first, falls back through 13 cloud providers automatically
when any one fails or rate-limits, and keeps your request
alive across provider swaps mid-stream. Here's how it works
and why single-provider setups break at scale.

The problem every AI app hits eventually

You ship an app that calls OpenAI. Users love it. Then:

  • OpenAI has a 30-minute outage (happens monthly)
  • Your rate limits get hit during a traffic spike
  • Your bill balloons because you picked the most expensive model
  • A new user in the EU needs data residency Anthropic doesn't offer
  • Groq ships a faster model than what you're using

Each of these kills your product for some slice of users.
Single-provider architecture is a single point of failure.

The obvious fix — "use multiple providers" — sounds easy
until you try to implement it. Each SDK has different:

  • Auth schemes (bearer tokens, API keys in headers, query params, custom headers)
  • Request shapes (OpenAI's messages array vs Anthropic's system/messages split vs Gemini's contents)
  • Streaming formats (SSE with different event names, JSON chunks, raw deltas)
  • Error conventions (a 429 from one provider, a 503 for the same condition from another, silent truncation from a third)
  • Rate limits (RPM, TPM, concurrent, daily — varies by provider and tier)

Naively writing a wrapper per provider works for 2-3
providers. At 13, it's unmaintainable.

The architecture I landed on

Three layers:

1. Provider adapters — normalize inputs and outputs

Each provider gets a thin adapter file that converts
Aiden's internal request shape into the provider's native
format, and vice versa. The adapter exposes a single
interface:

interface ProviderAdapter {
  id: string;                    // "groq-1", "anthropic-1"
  name: string;                  // display name
  model: string;                 // active model ID
  priority: number;              // lower = tried first
  costTier: 1 | 2 | 3;           // free | cheap | premium

  chat(request: NormalizedRequest): AsyncIterable<StreamChunk>;  // streaming response
  testKey(): Promise<boolean>;                                   // is the key still valid?
  getRateLimit(): RateLimitStatus;                               // current limit window
}

Internal request shape is OpenAI-compatible (most common
baseline). The adapter handles the translation.

For Anthropic's Claude:

// OpenAI-style request
{ messages: [{role: 'system', content: '...'}, {role: 'user', content: '...'}] }

// Gets translated to Anthropic format
{ system: '...', messages: [{role: 'user', content: '...'}] }
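
A minimal sketch of that translation, with illustrative type
names (not the actual Aiden code):

interface OpenAIMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Anthropic wants system prompts as a top-level field, not in
// the messages array, so split them out and join the rest.
function toAnthropic(messages: OpenAIMessage[]) {
  const system = messages
    .filter((m) => m.role === 'system')
    .map((m) => m.content)
    .join('\n');
  const rest = messages.filter((m) => m.role !== 'system');
  return { system, messages: rest };
}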

For Bay of Assets (an OpenAI-compatible proxy), translation
is a pass-through: just a base URL swap.

2. Router — the fallback chain logic

The router maintains an ordered list of healthy providers
and picks the first one matching constraints (cost tier,
model capability, user preference).

When a call fails, the router:

  1. Classifies the error — rate limit, auth, server error, network, or permanent
  2. Marks the provider's health status:
    • 429 → rate-limited, skip for N seconds
    • 401/403 → auth broken, skip until manual reset
    • 500/502/503/504 → transient, retry with backoff
    • Network error → mark degraded, try next
  3. Re-enters the chain with the next healthy provider
  4. Continues until success or chain exhaustion
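
A minimal sketch of that classification step (the status
codes follow the list above; the cooldown value is
illustrative):

type Verdict =
  | { kind: 'rate-limited'; retryAfterMs: number }
  | { kind: 'auth-broken' }
  | { kind: 'transient' }
  | { kind: 'degraded' };

// Map an HTTP status (or null for a network error) onto the
// health actions described above.
function classify(status: number | null): Verdict {
  if (status === 429) return { kind: 'rate-limited', retryAfterMs: 30_000 };
  if (status === 401 || status === 403) return { kind: 'auth-broken' };
  if (status !== null && status >= 500) return { kind: 'transient' };
  return { kind: 'degraded' }; // no HTTP status: network-level failure
}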

The critical detail: this happens mid-request, not just
on the next request.
If the user is streaming a response
and Groq drops the connection halfway, the router sees the
stream close, switches to Together AI, re-sends the request,
and resumes streaming. The user sees a ~2 second pause and
no error.
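
A minimal sketch of that failover loop, assuming the
ProviderAdapter interface above (StreamChunk and the chain
construction are placeholders):

// Walk the chain; if a provider's stream dies partway through,
// re-send the request to the next provider and keep yielding.
async function* chatWithFailover(
  request: NormalizedRequest,
  chain: ProviderAdapter[],
): AsyncIterable<StreamChunk> {
  let lastError: unknown;
  for (const provider of chain) {
    try {
      for await (const chunk of provider.chat(request)) {
        yield chunk;
      }
      return; // stream finished cleanly
    } catch (err) {
      lastError = err; // real code also updates health state here
    }
  }
  throw lastError ?? new Error('provider chain exhausted');
}

One detail this sketch omits: if chunks were already streamed
before the drop, the resumed response has to be de-duplicated
against what the user has already seen, or the output repeats.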

3. Slot rotation — multiple keys per provider

Within a single provider (say, Groq), I run 4 rotation slots:

groq-1 → API_KEY_1 (free tier, 30 RPM)
groq-2 → API_KEY_2 (free tier, 30 RPM)
groq-3 → API_KEY_3 (free tier, 30 RPM)
groq-4 → API_KEY_4 (free tier, 30 RPM)

Four free-tier accounts = 120 RPM effective. When slot 1
hits its rate limit, the router transparently rotates to
slot 2. This gives you paid-tier throughput on free tier,
which matters when you're a solo founder.

Caveat: read the provider's ToS on multi-account usage.
Groq currently permits it. Others may not.
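
A minimal sketch of the rotation itself (the Slot shape and
env var names are illustrative):

interface Slot {
  id: string;
  key: string;
  rateLimitedUntil: number; // unix ms; 0 = available now
}

const groqSlots: Slot[] = [1, 2, 3, 4].map((n) => ({
  id: `groq-${n}`,
  key: process.env[`GROQ_API_KEY_${n}`] ?? '',
  rateLimitedUntil: 0,
}));

// First slot whose cooldown has expired, or null if all four
// are rate-limited and the router should leave Groq entirely.
function nextSlot(slots: Slot[], now = Date.now()): Slot | null {
  return slots.find((s) => s.rateLimitedUntil <= now) ?? null;
}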

Health tracking

Each provider carries a live health score:

interface ProviderHealth {
  lastSuccess: number;           // unix timestamp
  lastFailure: number;
  consecutiveFailures: number;
  rateLimitedUntil: number | null;
  totalCalls: number;
  failureRate: number;           // rolling 100-call window
}

Providers with >50% failure rate in the last 100 calls get
de-prioritized. Providers with <5% failure rate get boosted.
This creates a self-organizing preference — the chain
gradually learns which providers are actually working for
your region, network, and use case.
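
A minimal sketch of that re-ranking (the +100 penalty, the
providers list, and the health() lookup are illustrative):

// Flaky providers sink to the back of the chain; reliable
// ones float toward the front.
function effectivePriority(p: ProviderAdapter, h: ProviderHealth): number {
  if (h.failureRate > 0.5) return p.priority + 100; // de-prioritize
  if (h.failureRate < 0.05) return p.priority - 1;  // small boost
  return p.priority;
}

const ordered = [...providers].sort(
  (a, b) => effectivePriority(a, health(a)) - effectivePriority(b, health(b)),
);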

The result

For my Windows-native AI agent (Aiden, open source), this
means:

  • Ollama tries first — zero network cost, private, local
  • If Ollama is unreachable, Groq takes over (free, fast)
  • If Groq is rate-limited, Gemini Flash kicks in
  • If Gemini fails, OpenRouter proxies to whichever model is cheapest that minute
  • Anthropic Claude reserved for complex reasoning tasks that need it

I can drop any single provider, including the whole free
tier, and the agent keeps working. Users never see "provider
X is down" errors — they just see slightly different
response styles as the chain shifts.
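
Expressed as a priority-ordered config, that chain looks
roughly like this (IDs and tiers are illustrative, not
Aiden's actual config):

const chain = [
  { id: 'ollama-local', priority: 0, costTier: 1 }, // local, private
  { id: 'groq-1',       priority: 1, costTier: 1 }, // free, fast
  { id: 'gemini-flash', priority: 2, costTier: 2 },
  { id: 'openrouter',   priority: 3, costTier: 2 }, // cheapest available model
  { id: 'anthropic-1',  priority: 4, costTier: 3 }, // complex reasoning only
];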

Things I'd do differently

  • I built the health tracking as part of the router. It should be a separate module you can replace. Testing the router's logic without mocking health state is painful.
  • Slot rotation needs better observability. When you have 4 Groq slots and 2 are rate-limited, knowing WHICH 2 matters. I didn't expose this well initially.
  • Retry-with-different-model is a feature I'm still working on. Some providers have multiple models per account — Groq has 8, OpenRouter has 200+. Failing over to a different model on the same provider should happen before switching providers entirely.

The code

This is all open source under AGPL-3.0. Router lives in
providers/ in the Aiden repo:

https://github.com/taracodlabs/aiden

Check out providers/index.ts for the routing logic and
core/providerHealth.ts for the health tracking.

If you're building on LLMs and only using one provider, you
will regret it. Start multi-provider from day one. It's
actually not that much harder when you build the router
first.


Feedback welcome. I'm a solo founder and this is v3.7.2, so
rough edges definitely exist. If you're building something
similar and want to compare notes, DMs are open on Twitter
@shivayx9, or hit me on the Aiden Discord:
discord.gg/gMZ3hUnQTm
