<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shiva</title>
    <description>The latest articles on DEV Community by Shiva (@shivayxa).</description>
    <link>https://dev.to/shivayxa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3892013%2F85baedec-6427-4fe8-b2dd-d7e20e94a619.png</url>
      <title>DEV Community: Shiva</title>
      <link>https://dev.to/shivayxa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shivayxa"/>
    <language>en</language>
    <item>
      <title>Self-healing LLM routing: 13 providers, one fallback chain</title>
      <dc:creator>Shiva</dc:creator>
      <pubDate>Wed, 22 Apr 2026 08:09:03 +0000</pubDate>
      <link>https://dev.to/shivayxa/self-healing-llm-routing-13-providers-one-fallback-chain-3gg3</link>
      <guid>https://dev.to/shivayxa/self-healing-llm-routing-13-providers-one-fallback-chain-3gg3</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I built an LLM provider router that tries Ollama &lt;br&gt;
first, falls through 13 cloud providers automatically when &lt;br&gt;
any one fails or rate-limits, and keeps your request alive &lt;br&gt;
across swaps mid-stream. Here's how it works and why &lt;br&gt;
single-provider setups break at scale.&lt;/p&gt;
&lt;h2&gt;The problem every AI app hits eventually&lt;/h2&gt;

&lt;p&gt;You ship an app that calls OpenAI. Users love it. Then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI has a 30-minute outage (happens monthly)&lt;/li&gt;
&lt;li&gt;Your rate limits get hit during a traffic spike&lt;/li&gt;
&lt;li&gt;Your bill balloons because you picked the most expensive model&lt;/li&gt;
&lt;li&gt;A new user in the EU needs data residency Anthropic doesn't offer&lt;/li&gt;
&lt;li&gt;Groq ships a faster model than what you're using&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these kills your product for some slice of users. Single-provider architecture is a single point of failure.&lt;/p&gt;

&lt;p&gt;The obvious fix — "use multiple providers" — sounds easy until you try to implement it. Each SDK has different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auth schemes (bearer tokens, API keys in headers, query 
params, custom headers)&lt;/li&gt;
&lt;li&gt;Request shapes (OpenAI messages array vs Anthropic 
system/user, vs Gemini's &lt;code&gt;contents&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Streaming formats (SSE with different event names, JSON 
chunks, raw deltas)&lt;/li&gt;
&lt;li&gt;Error conventions (429s sometimes, 503s for the same thing 
elsewhere, silent truncation in others)&lt;/li&gt;
&lt;li&gt;Rate limits (RPM, TPM, concurrent, daily — varies by 
provider and tier)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Naively writing a wrapper per provider works for 2-3 providers. At 13, it's unmaintainable.&lt;/p&gt;
&lt;h2&gt;The architecture I landed on&lt;/h2&gt;

&lt;p&gt;Three layers:&lt;/p&gt;
&lt;h3&gt;1. Provider adapters — normalize inputs and outputs&lt;/h3&gt;

&lt;p&gt;Each provider gets a thin adapter file that converts the Aiden internal request shape into the provider's native format, and vice versa. The adapter exposes a single interface:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ProviderAdapter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                    &lt;span class="c1"&gt;// "groq-1", "anthropic-1"&lt;/span&gt;
  &lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                  &lt;span class="c1"&gt;// display name&lt;/span&gt;
  &lt;span class="nl"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                 &lt;span class="c1"&gt;// active model ID&lt;/span&gt;
  &lt;span class="nl"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;              &lt;span class="c1"&gt;// lower = tried first&lt;/span&gt;
  &lt;span class="nl"&gt;costTier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;           &lt;span class="c1"&gt;// free | cheap | premium&lt;/span&gt;

  &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;NormalizedRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;AsyncIterable&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;testKey&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;getRateLimit&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nx"&gt;RateLimitStatus&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The internal request shape is OpenAI-compatible, since that's the most common baseline; the adapter handles the translation in both directions.&lt;/p&gt;

&lt;p&gt;For Anthropic's Claude:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// OpenAI-style request&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Gets translated to Anthropic format&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Bay of Assets (an OpenAI-compatible proxy), the translation is a pass-through — just a base-URL swap.&lt;/p&gt;
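&lt;p&gt;As a sketch, the Anthropic translation boils down to splitting system messages out of the array (shapes simplified; field names follow the two wire formats, everything else here is illustrative):&lt;/p&gt;

```typescript
// Minimal sketch of the OpenAI-to-Anthropic request translation.
type Msg = { role: string; content: string };

function toAnthropic(messages: Msg[]) {
  // Anthropic takes the system prompt as a top-level field,
  // not as a message with role "system".
  const system = messages
    .filter((m) => m.role === "system")
    .map((m) => m.content)
    .join("\n");
  const rest = messages.filter((m) => m.role !== "system");
  return { system, messages: rest };
}
```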
&lt;h3&gt;2. Router — the fallback chain logic&lt;/h3&gt;

&lt;p&gt;The router maintains an ordered list of healthy providers and picks the first one matching constraints (cost tier, model capability, user preference).&lt;/p&gt;

&lt;p&gt;When a call fails, the router:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Classifies the error — rate limit, auth, server error, 
network, or permanent&lt;/li&gt;
&lt;li&gt;Marks the provider's health status:

&lt;ul&gt;
&lt;li&gt;429 → rate-limited, skip for N seconds&lt;/li&gt;
&lt;li&gt;401/403 → auth broken, skip until manual reset&lt;/li&gt;
&lt;li&gt;500/502/503/504 → transient, retry with backoff&lt;/li&gt;
&lt;li&gt;Network error → mark degraded, try next&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Re-enters the chain with the next healthy provider&lt;/li&gt;
&lt;li&gt;Continues until success or chain exhaustion&lt;/li&gt;
&lt;/ol&gt;
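&lt;p&gt;Step 1 can be sketched as a small pure function over the HTTP status (the mapping mirrors the list above; names are illustrative, not Aiden's actual API):&lt;/p&gt;

```typescript
// Sketch: classify a failed call into a routing action.
type ErrorClass = "rate_limit" | "auth" | "transient" | "network" | "permanent";

function classify(status: number | null): ErrorClass {
  if (status === null) return "network";               // no HTTP response at all
  if (status === 429) return "rate_limit";             // skip for N seconds
  if (status === 401 || status === 403) return "auth"; // skip until manual reset
  if (status >= 500) return "transient";               // retry with backoff
  return "permanent";                                  // e.g. 400: do not retry
}
```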

&lt;p&gt;The critical detail: &lt;strong&gt;this happens mid-request, not just on the next request.&lt;/strong&gt; If the user is streaming a response and Groq drops the connection halfway, the router sees the stream close, switches to Together AI, re-sends the request, and resumes streaming. The user sees a ~2 second pause and no error.&lt;/p&gt;
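&lt;p&gt;Stripped way down, that failover is one loop (an illustrative sketch: it omits the health bookkeeping and the de-duplication of tokens that were already streamed):&lt;/p&gt;

```typescript
// Sketch: stream from the first provider that can finish the request.
// Each provider's chat() is an async generator of text chunks.
async function* chatWithFallback(providers: any[], prompt: string) {
  let lastError: unknown = null;
  for (const p of providers) {
    try {
      for await (const chunk of p.chat(prompt)) {
        yield chunk;              // pass chunks through while the stream is alive
      }
      return;                     // provider finished cleanly
    } catch (err) {
      lastError = err;            // stream dropped mid-way:
    }                             // fall through to the next healthy provider
  }
  throw lastError ?? new Error("provider chain exhausted");
}
```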
&lt;h3&gt;3. Slot rotation — multiple keys per provider&lt;/h3&gt;

&lt;p&gt;Within a single provider (say, Groq), I run 4 rotation slots:&lt;br&gt;
groq-1 → API_KEY_1 (free tier, 30 RPM)&lt;br&gt;
groq-2 → API_KEY_2 (free tier, 30 RPM)&lt;br&gt;
groq-3 → API_KEY_3 (free tier, 30 RPM)&lt;br&gt;
groq-4 → API_KEY_4 (free tier, 30 RPM)&lt;/p&gt;

&lt;p&gt;Four free-tier accounts = 120 RPM effective. When slot 1 hits its rate limit, the router transparently rotates to slot 2. This gives you paid-tier throughput on free tier, which matters when you're a solo founder.&lt;/p&gt;
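&lt;p&gt;The rotation itself is just "first slot whose cooldown has expired" (field names are illustrative):&lt;/p&gt;

```typescript
// Sketch: pick the first slot that is not currently rate-limited.
type Slot = { id: string; rateLimitedUntil: number }; // unix ms; 0 = never limited

function pickSlot(slots: Slot[], now: number) {
  for (const s of slots) {
    if (now >= s.rateLimitedUntil) return s;
  }
  return null; // all slots cooling down: fall through to the next provider
}
```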

&lt;p&gt;Caveat: read the provider's ToS on multi-account usage. Groq currently permits it. Others may not.&lt;/p&gt;
&lt;h2&gt;Health tracking&lt;/h2&gt;

&lt;p&gt;Each provider carries a live health score:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ProviderHealth&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;lastSuccess&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;           &lt;span class="c1"&gt;// unix timestamp&lt;/span&gt;
  &lt;span class="nl"&gt;lastFailure&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;consecutiveFailures&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;rateLimitedUntil&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;totalCalls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;failureRate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;           &lt;span class="c1"&gt;// rolling 100-call window&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
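&lt;p&gt;A sketch of how such a record might gate and re-order the chain (the helper name is mine; the 50%/5% thresholds are the ones the chain uses):&lt;/p&gt;

```typescript
// Sketch: fold health into the provider's sort key.
// Rate-limited providers are skipped outright; a high rolling failure
// rate pushes a provider down the chain, a low one boosts it.
type Health = { failureRate: number; rateLimitedUntil: number | null };

function effectivePriority(base: number, h: Health, now: number): number {
  if (h.rateLimitedUntil !== null) {
    if (h.rateLimitedUntil > now) return Infinity; // still cooling down: skip
  }
  if (h.failureRate > 0.5) return base + 100;      // over 50% failing: demote
  if (h.failureRate >= 0.05) return base;          // middling: leave alone
  return base - 1;                                 // under 5%: boost
}
```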



&lt;p&gt;Providers with &amp;gt;50% failure rate in the last 100 calls get de-prioritized. Providers with &amp;lt;5% failure rate get boosted. This creates a self-organizing preference — the chain gradually learns which providers are actually working for your region, network, and use case.&lt;/p&gt;
&lt;h2&gt;The result&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgudhncxhj7wftjtl3j8p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgudhncxhj7wftjtl3j8p.png" alt=" " width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For my Windows-native AI agent (Aiden, open source), this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ollama tries first — zero network cost, private, local&lt;/li&gt;
&lt;li&gt;If Ollama is unreachable, Groq takes over (free, fast)&lt;/li&gt;
&lt;li&gt;If Groq is rate-limited, Gemini Flash kicks in&lt;/li&gt;
&lt;li&gt;If Gemini fails, OpenRouter proxies to whichever model is 
cheapest that minute&lt;/li&gt;
&lt;li&gt;Anthropic Claude reserved for complex reasoning tasks 
that need it&lt;/li&gt;
&lt;/ul&gt;
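&lt;p&gt;That ordering is just data. A sketch of the chain declaration, using the priorities implied by the list above (IDs and fields are illustrative, not the actual config format):&lt;/p&gt;

```typescript
// Sketch: the fallback chain as a sorted array of adapter stubs.
// costTier as in the adapter interface: 1 free, 2 cheap, 3 premium.
const chain = [
  { id: "ollama-local", priority: 0, costTier: 1 }, // private, zero network cost
  { id: "groq-1",       priority: 1, costTier: 1 }, // free, fast
  { id: "gemini-flash", priority: 2, costTier: 2 },
  { id: "openrouter",   priority: 3, costTier: 2 }, // cheapest model that minute
  { id: "anthropic-1",  priority: 9, costTier: 3 }, // reserved for hard reasoning
];

function orderedChain(list: any[]) {
  return [...list].sort((a, b) => a.priority - b.priority);
}
```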

&lt;p&gt;I can drop any single provider, including the whole free tier, and the agent keeps working. Users never see "provider X is down" errors — they just see slightly different response styles as the chain shifts.&lt;/p&gt;

&lt;h2&gt;Things I'd do differently&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I built the health tracking as part of the router. It 
should be a separate module you can replace. Testing the 
router's logic without mocking health state is painful.&lt;/li&gt;
&lt;li&gt;Slot rotation needs better observability. When you have 4 
Groq slots and 2 are rate-limited, knowing WHICH 2 matters. 
I didn't expose this well initially.&lt;/li&gt;
&lt;li&gt;Retry-with-different-model is a feature I'm still working 
on. Some providers have multiple models per account — 
Groq has 8, OpenRouter has 200+. Failing over to a 
different model on the same provider should happen before 
switching providers entirely.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The code&lt;/h2&gt;

&lt;p&gt;This is all open source under AGPL-3.0. The router lives in &lt;code&gt;providers/&lt;/code&gt; in the Aiden repo:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/taracodlabs/aiden" rel="noopener noreferrer"&gt;https://github.com/taracodlabs/aiden&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out &lt;code&gt;providers/index.ts&lt;/code&gt; for the routing logic and &lt;code&gt;core/providerHealth.ts&lt;/code&gt; for the health tracking.&lt;/p&gt;

&lt;p&gt;If you're building on LLMs and only using one provider, you will regret it. Start multi-provider from day one. It's actually not that much harder when you build the router first.&lt;/p&gt;




&lt;p&gt;Feedback welcome. I'm a solo founder, this is v3.7.2, and rough edges definitely exist. If you're building something similar and want to compare notes, my DMs are open on Twitter @shivayx9, or hit me on the Aiden Discord: discord.gg/gMZ3hUnQTm&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
