**TL;DR:** I built an LLM provider router that tries Ollama first, falls back through 13 cloud providers automatically when any one fails or rate-limits, and keeps your request alive across provider swaps mid-stream. Here's how it works, and why single-provider setups break at scale.
## The problem every AI app hits eventually
You ship an app that calls OpenAI. Users love it. Then:
- OpenAI has a 30-minute outage (happens monthly)
- Your rate limits get hit during a traffic spike
- Your bill balloons because you picked the most expensive model
- A new user in the EU needs data residency Anthropic doesn't offer
- Groq ships a faster model than what you're using
Each of these kills your product for some slice of users.
Single-provider architecture is a single point of failure.
The obvious fix — "use multiple providers" — sounds easy
until you try to implement it. Each SDK has different:
- Auth schemes (bearer tokens, API keys in headers, query params, custom headers)
- Request shapes (OpenAI's `messages` array vs Anthropic's top-level `system` plus `messages` vs Gemini's `contents`)
- Streaming formats (SSE with different event names, JSON chunks, raw deltas)
- Error conventions (429s from some, 503s for the same condition elsewhere, silent truncation in others)
- Rate limits (RPM, TPM, concurrent, daily — varies by provider and tier)
Naively writing a wrapper per provider works for 2-3
providers. At 13, it's unmaintainable.
## The architecture I landed on
Three layers:
### 1. Provider adapters — normalize inputs and outputs
Each provider gets a thin adapter file that converts the
Aiden internal request shape into the provider's native
format, and vice versa. The adapter exposes a single
interface:
```typescript
interface ProviderAdapter {
  id: string;              // "groq-1", "anthropic-1"
  name: string;            // display name
  model: string;           // active model ID
  priority: number;        // lower = tried first
  costTier: 1 | 2 | 3;     // free | cheap | premium
  chat(request: NormalizedRequest): AsyncIterable<string>; // streamed text chunks
  testKey(): Promise<boolean>;
  getRateLimit(): RateLimitStatus;
}
```
Internal request shape is OpenAI-compatible (most common
baseline). The adapter handles the translation.
For Anthropic's Claude:
```typescript
// OpenAI-style request
{ messages: [{ role: 'system', content: '...' }, { role: 'user', content: '...' }] }

// Gets translated to Anthropic format
{ system: '...', messages: [{ role: 'user', content: '...' }] }
```
For Bay of Assets (an OpenAI-compatible proxy), the translation is a pass-through — just a base-URL swap.
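The Claude translation above can be sketched as a pure function. This is a hypothetical illustration, not Aiden's actual adapter code — the type names are mine:

```typescript
// Hypothetical sketch: lift the OpenAI-style system message into
// Anthropic's top-level `system` field.
type OpenAIMessage = { role: 'system' | 'user' | 'assistant'; content: string };
type AnthropicRequest = {
  system?: string;
  messages: { role: 'user' | 'assistant'; content: string }[];
};

function toAnthropic(messages: OpenAIMessage[]): AnthropicRequest {
  // Anthropic takes the system prompt as a top-level field, not a message.
  const system = messages.find((m) => m.role === 'system')?.content;
  const rest = messages
    .filter((m): m is OpenAIMessage & { role: 'user' | 'assistant' } => m.role !== 'system')
    .map((m) => ({ role: m.role, content: m.content }));
  return system !== undefined ? { system, messages: rest } : { messages: rest };
}

const out = toAnthropic([
  { role: 'system', content: 'be brief' },
  { role: 'user', content: 'hi' },
]);
// out.system === 'be brief'; out.messages contains only the user turn
```

Because the function is pure, each adapter's translation can be unit-tested without touching the network.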
### 2. Router — the fallback chain logic
The router maintains an ordered list of healthy providers
and picks the first one matching constraints (cost tier,
model capability, user preference).
When a call fails, the router:
- Classifies the error — rate limit, auth, server error, network, or permanent
- Marks the provider's health status:
  - 429 → rate-limited, skip for N seconds
  - 401/403 → auth broken, skip until manual reset
  - 500/502/503/504 → transient, retry with backoff
  - Network error → mark degraded, try next
- Re-enters the chain with the next healthy provider
- Continues until success or chain exhaustion
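The classification step is easy to express as a small pure function. A minimal sketch, assuming the status-code-to-verdict mapping above (the names and 30-second default cooldown are my own, not Aiden's):

```typescript
// Hypothetical sketch of the error-classification step.
type Verdict =
  | { kind: 'rate-limited'; skipForMs: number } // 429: skip for N seconds
  | { kind: 'auth-broken' }                     // 401/403: skip until manual reset
  | { kind: 'transient' }                       // 5xx: retry with backoff
  | { kind: 'degraded' };                       // no HTTP status: network failure

function classify(status: number | null, retryAfterSec?: number): Verdict {
  if (status === 429) {
    // Honor Retry-After when present; assume 30s cooldown otherwise.
    return { kind: 'rate-limited', skipForMs: (retryAfterSec ?? 30) * 1000 };
  }
  if (status === 401 || status === 403) return { kind: 'auth-broken' };
  if (status !== null && status >= 500 && status <= 504) return { kind: 'transient' };
  return { kind: 'degraded' };
}
```

Keeping classification separate from the retry loop means each provider's quirks (silent truncation, nonstandard error bodies) can be normalized in its adapter before this function ever sees them.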
The critical detail: this happens mid-request, not just
on the next request. If the user is streaming a response
and Groq drops the connection halfway, the router sees the
stream close, switches to Together AI, re-sends the request,
and resumes streaming. The user sees a ~2 second pause and
no error.
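The core of that mid-stream behavior fits in one async generator. A simplified sketch (my own, not the repo's code): each provider is a function returning a token stream, and a broken stream falls through to the next provider:

```typescript
// Hypothetical sketch of mid-stream failover: if one provider's stream
// dies, re-send the request to the next provider and keep yielding.
type Chat = (prompt: string) => AsyncIterable<string>;

async function* chatWithFailover(
  providers: Chat[],
  prompt: string,
): AsyncIterable<string> {
  let lastError: unknown = null;
  for (const chat of providers) {
    try {
      for await (const token of chat(prompt)) yield token;
      return; // stream completed cleanly: done
    } catch (err) {
      lastError = err; // stream broke mid-way: fall through to next provider
    }
  }
  throw lastError ?? new Error('provider chain exhausted');
}
```

Note this sketch naively replays the new provider's response from its start; a production router also has to track what was already emitted so the user sees a short pause rather than repeated text.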
### 3. Slot rotation — multiple keys per provider
Within a single provider (say, Groq), I run 4 rotation slots:
```
groq-1 → API_KEY_1 (free tier, 30 RPM)
groq-2 → API_KEY_2 (free tier, 30 RPM)
groq-3 → API_KEY_3 (free tier, 30 RPM)
groq-4 → API_KEY_4 (free tier, 30 RPM)
```
Four free-tier accounts = 120 RPM effective. When slot 1
hits its rate limit, the router transparently rotates to
slot 2. This gives you paid-tier throughput on free tier,
which matters when you're a solo founder.
Caveat: read the provider's ToS on multi-account usage.
Groq currently permits it. Others may not.
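Slot selection itself is simple: pick the first slot whose rate-limit cooldown has expired. A minimal sketch (hypothetical names, not the repo's code):

```typescript
// Hypothetical sketch of slot rotation: pick the first slot that is not
// currently rate-limited; null means every slot is cooling down and the
// router should move on to the next provider in the chain.
interface Slot {
  id: string;
  rateLimitedUntil: number; // unix ms; 0 = healthy
}

function pickSlot(slots: Slot[], now: number): Slot | null {
  return slots.find((s) => s.rateLimitedUntil <= now) ?? null;
}

const slots: Slot[] = [
  { id: 'groq-1', rateLimitedUntil: 60_000 }, // cooling down
  { id: 'groq-2', rateLimitedUntil: 0 },
  { id: 'groq-3', rateLimitedUntil: 0 },
];
// pickSlot(slots, 1_000)?.id === 'groq-2'
```

Returning `null` rather than blocking is what lets slot exhaustion compose with the provider-level fallback chain from layer 2.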
## Health tracking
Each provider carries a live health score:
```typescript
interface ProviderHealth {
  lastSuccess: number;          // unix timestamp
  lastFailure: number;
  consecutiveFailures: number;
  rateLimitedUntil: number | null;
  totalCalls: number;
  failureRate: number;          // rolling 100-call window
}
```
Providers with >50% failure rate in the last 100 calls get
de-prioritized. Providers with <5% failure rate get boosted.
This creates a self-organizing preference — the chain
gradually learns which providers are actually working for
your region, network, and use case.
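One way to realize that self-organizing ordering is to fold the failure rate into the sort key. A sketch under my own assumptions (the penalty and boost magnitudes here are illustrative, not the repo's actual numbers):

```typescript
// Hypothetical sketch: order the chain by base priority, heavily
// penalizing providers above the 50% failure threshold and slightly
// boosting those below 5%.
interface Ranked {
  id: string;
  priority: number;    // lower = tried first
  failureRate: number; // rolling 100-call window, 0..1
}

function rank(providers: Ranked[]): Ranked[] {
  const score = (p: Ranked) =>
    p.priority +
    (p.failureRate > 0.5 ? 100 : 0) - // de-prioritize flaky providers
    (p.failureRate < 0.05 ? 1 : 0);   // boost consistently healthy ones
  return [...providers].sort((a, b) => score(a) - score(b));
}
```

With a rolling window, a provider that recovers climbs back up the chain automatically as old failures age out.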
## The result

For my Windows-native AI agent (Aiden, open source), this
means:
- Ollama tries first — zero network cost, private, local
- If Ollama is unreachable, Groq takes over (free, fast)
- If Groq is rate-limited, Gemini Flash kicks in
- If Gemini fails, OpenRouter proxies to whichever model is cheapest that minute
- Anthropic Claude reserved for complex reasoning tasks that need it
I can drop any single provider, including the whole free
tier, and the agent keeps working. Users never see "provider
X is down" errors — they just see slightly different
response styles as the chain shifts.
## Things I'd do differently
- I built the health tracking as part of the router. It should be a separate module you can replace. Testing the router's logic without mocking health state is painful.
- Slot rotation needs better observability. When you have 4 Groq slots and 2 are rate-limited, knowing WHICH 2 matters. I didn't expose this well initially.
- Retry-with-different-model is a feature I'm still working on. Some providers have multiple models per account — Groq has 8, OpenRouter has 200+. Failing over to a different model on the same provider should happen before switching providers entirely.
## The code
This is all open source under AGPL-3.0. Router lives in
providers/ in the Aiden repo:
https://github.com/taracodlabs/aiden
Check out providers/index.ts for the routing logic and
core/providerHealth.ts for the health tracking.
If you're building on LLMs and only using one provider, you
will regret it. Start multi-provider from day one. It's
actually not that much harder when you build the router
first.
Feedback welcome. I'm a solo founder, this is v3.7.2, rough
edges definitely exist. If you're building something similar
and want to compare notes, DMs open on Twitter @shivayx9 or
hit me on the Aiden Discord: discord.gg/gMZ3hUnQTm