<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lin Z.</title>
    <description>The latest articles on DEV Community by Lin Z. (@alltoken).</description>
    <link>https://dev.to/alltoken</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3857565%2F55b7d5e3-b904-43f4-8459-d80cab13fe4f.jpg</url>
      <title>DEV Community: Lin Z.</title>
      <link>https://dev.to/alltoken</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alltoken"/>
    <language>en</language>
    <item>
      <title>A Developer's Checklist for Multi-Model LLM Routing</title>
      <dc:creator>Lin Z.</dc:creator>
      <pubDate>Sat, 02 May 2026 01:41:47 +0000</pubDate>
      <link>https://dev.to/alltoken/a-developers-checklist-for-multi-model-llm-routing-1g7a</link>
      <guid>https://dev.to/alltoken/a-developers-checklist-for-multi-model-llm-routing-1g7a</guid>
      <description>&lt;p&gt;I wrote an intro to AI API gateways on &lt;a href="https://medium.com/@linz-alltoken/what-is-an-ai-api-gateway-architecture-and-examples" rel="noopener noreferrer"&gt;Medium&lt;/a&gt; last day. This is the practical follow-up: the checklist I wish I had before I built &lt;a href="https://alltoken.ai" rel="noopener noreferrer"&gt;AllToken&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I built AllToken around a simple promise: many models, one decision.&lt;br&gt;
But that decision only makes sense if your routing layer doesn't become a nightmare to maintain. After managing five different provider SDKs in production — and watching our internal abstraction layer grow into its own microservice — I realized there's a standard checklist every team should run before they commit to a multi-model stack.&lt;/p&gt;

&lt;p&gt;Here's mine.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. One Schema to Rule Them All
&lt;/h2&gt;

&lt;p&gt;If your application code branches on &lt;code&gt;if provider == "openai"&lt;/code&gt;, you've already lost. Every new provider becomes a refactor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The check:&lt;/strong&gt; Your app should send one request shape regardless of the target model.&lt;/p&gt;

&lt;p&gt;At AllToken, we expose an OpenAI-compatible endpoint, but the principle matters more than the vendor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ALLTOKEN_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://api.alltoken.ai/v1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Same code, any provider underneath&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;minimax-m2.7&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Red flag:&lt;/strong&gt; If adding a new provider requires touching more than one line (the model string), your abstraction is leaking.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Failover That Doesn't Wake Your On-Call
&lt;/h2&gt;

&lt;p&gt;Provider outages are not edge cases. They're Tuesday.&lt;br&gt;
&lt;strong&gt;The check:&lt;/strong&gt; When your primary provider 500s or times out, does your app retry automatically? Or does it bubble the error to the user?&lt;br&gt;
A production gateway should handle this without your application knowing it happened. That means health checks on each provider, some form of circuit-breaking logic when a provider is clearly degraded, and automatic fallback to a secondary option.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This request should succeed even if the primary provider is having issues&lt;/span&gt;
curl https://api.alltoken.ai/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$ALLTOKEN_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "minimax-m2.7",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Red flag:&lt;/strong&gt; Your failover logic lives in a 200-line try/catch block that only you understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Cost Routing, Not Just Cost Tracking
&lt;/h2&gt;

&lt;p&gt;Tracking spend after the fact is accounting. Routing by cost in real time is engineering.&lt;br&gt;
&lt;strong&gt;The check:&lt;/strong&gt; Can you send a cheap query to a cheap model and a complex query to a strong model — without changing application code?&lt;br&gt;
Most teams end up with an informal tiering system whether they plan for it or not:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Request Type&lt;/th&gt;
&lt;th&gt;Latency Budget&lt;/th&gt;
&lt;th&gt;Cost Ceiling&lt;/th&gt;
&lt;th&gt;Typical Route&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple Q&amp;amp;A&lt;/td&gt;
&lt;td&gt;&amp;lt; 2s&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Budget model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;td&gt;&amp;lt; 5s&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Strong model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PII-sensitive&lt;/td&gt;
&lt;td&gt;Flexible&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Your gateway should ideally classify the request and match it against provider capabilities. If you're doing this with if/else in your backend, you're building a gateway whether you call it one or not.&lt;/p&gt;
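&lt;p&gt;As a sketch of what that classification looks like (tier names, model IDs, and the heuristic below are all made up for illustration, not a real catalog), the tiering table can be expressed as a tiny routing map:&lt;/p&gt;

```typescript
// Illustrative tier router; tier names and model IDs are hypothetical.
type Tier = 'budget' | 'strong' | 'private';

interface RouteRule {
  model: string;
  latencyBudgetMs: number;
}

const routes: { [tier: string]: RouteRule } = {
  budget:  { model: 'budget-model',      latencyBudgetMs: 2000 },
  strong:  { model: 'strong-model',      latencyBudgetMs: 5000 },
  private: { model: 'self-hosted-model', latencyBudgetMs: 30000 },
};

// Crude classifier: PII wins outright; long or code-like prompts go strong.
function classify(prompt: string, containsPII: boolean): Tier {
  if (containsPII) return 'private';
  if (prompt.length > 500 || prompt.includes('function ')) return 'strong';
  return 'budget';
}

function route(prompt: string, containsPII: boolean): RouteRule {
  return routes[classify(prompt, containsPII)];
}
```

The point is not this particular heuristic; it's that the mapping lives in one place, as data, instead of scattered if/else branches across your backend.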
&lt;h2&gt;
  
  
  4. Latency Budgets, Not Just Speed
&lt;/h2&gt;

&lt;p&gt;"Fast" is meaningless. "Fast enough for this specific user flow" is a requirement.&lt;br&gt;
&lt;strong&gt;The check:&lt;/strong&gt; Can you set a hard timeout per request and have the gateway respect it?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;minimax-m2.7&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Tell me a story&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// SSE streaming&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
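&lt;p&gt;Client-side, the cheapest version of a hard latency budget is a race against a timer. This is only a sketch (typed loosely on purpose); a real gateway enforces the budget server-side and triggers failover instead of just surfacing an error:&lt;/p&gt;

```typescript
// Minimal latency-budget wrapper: reject if the work outlives its budget.
function withTimeout(work: any, ms: number): any {
  let timer: any;
  const budget = new Promise(function (_resolve, reject) {
    timer = setTimeout(function () {
      reject(new Error('latency budget of ' + ms + 'ms exceeded'));
    }, ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([work, budget]).finally(function () {
    clearTimeout(timer);
  });
}

// Usage (hypothetical): wrap the streaming call with a hard 5s budget.
// const stream = await withTimeout(client.chat.completions.create({ ... }), 5000);
```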



&lt;p&gt;If a provider starts streaming slowly, your gateway should know when to cut bait and fail over — not when your user is already angry.&lt;br&gt;
&lt;strong&gt;Red flag:&lt;/strong&gt; You only find out about latency issues from user complaints.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Observability Per Request
&lt;/h2&gt;

&lt;p&gt;"How much did we spend on OpenAI last month?" is a finance question.&lt;br&gt;
"How much did User 8473 spend on embedding requests in the last hour?" is an engineering question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The check:&lt;/strong&gt; Can you attribute cost, latency, and token usage down to the individual request or user?&lt;br&gt;
At minimum, a production gateway should give you:&lt;br&gt;
 • Request ID propagation across the stack&lt;br&gt;
 • Per-user or per-feature cost attribution&lt;br&gt;
 • Provider-specific error tracking&lt;br&gt;
If your gateway doesn't expose this, you're flying blind at scale.&lt;/p&gt;
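&lt;p&gt;Concretely, per-request attribution means every usage record carries enough context to roll up by user or feature after the fact. A minimal sketch (the field names below are illustrative, not any specific gateway's schema):&lt;/p&gt;

```typescript
// Illustrative per-request usage record; field names are hypothetical.
interface UsageRecord {
  requestId: string;
  userId: string;
  model: string;
  totalTokens: number;
  costUsd: number;
}

// Roll up spend per user from a batch of records.
function costByUser(records: UsageRecord[]): { [userId: string]: number } {
  const totals: { [userId: string]: number } = {};
  for (const r of records) {
    totals[r.userId] = (totals[r.userId] || 0) + r.costUsd;
  }
  return totals;
}
```

Once records like these exist, "how much did User 8473 spend in the last hour" is a filter and a sum, not an archaeology project.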

&lt;h2&gt;
  
  
  6. Rate Limiting at the Gateway, Not the Provider
&lt;/h2&gt;

&lt;p&gt;Managing rate limits across five different dashboards is not a job. It's a punishment.&lt;br&gt;
&lt;strong&gt;The check:&lt;/strong&gt; Do you have one throttle layer that protects your app and your wallet?&lt;br&gt;
A proper gateway should handle:&lt;br&gt;
 • Global rate limits (protect your budget)&lt;br&gt;
 • Per-user rate limits (prevent abuse)&lt;br&gt;
 • Per-provider rate limits (respect upstream quotas)&lt;br&gt;
One API key. One set of rules. Not five different UIs with different semantics.&lt;/p&gt;
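&lt;p&gt;The classic building block here is a token bucket; a gateway stacks one per scope (global, per user, per provider) and a request must pass all of them. A minimal sketch:&lt;/p&gt;

```typescript
// Minimal token bucket: refills continuously, rejects when drained.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  // Returns true if the request fits in the current budget.
  tryAcquire(cost: number = 1): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (cost > this.tokens) return false;
    this.tokens -= cost;
    return true;
  }
}
```

A production version adds persistence and distributed coordination, but the semantics stay this simple; that's exactly what you want one gateway, not five dashboards, to own.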

&lt;h2&gt;
  
  
  7. An Escape Hatch from Vendor Lock-In
&lt;/h2&gt;

&lt;p&gt;This is the one everyone claims to care about and nobody tests.&lt;br&gt;
&lt;strong&gt;The check:&lt;/strong&gt; If you needed to swap your primary provider next week, how many files would you touch?&lt;br&gt;
With a proper gateway: ideally zero. You change a config. Maybe a model string.&lt;br&gt;
Without one: every file that touches an LLM. Which, if you're like us, was most of the backend.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Evaluated
&lt;/h2&gt;

&lt;p&gt;Before we built AllToken, we looked at what was already out there. OpenRouter has an incredible model catalog and is great for experimentation. Other teams roll their own with Nginx and Lua scripts. Some just accept the SDK sprawl.&lt;br&gt;
None of them handled production failover, cost routing, and unified billing the way we needed. So we built it.&lt;br&gt;
But I'm not here to tell you to use AllToken. I'm here to tell you that if you're running more than one model in production, you're going to end up building or buying a gateway eventually. Run this checklist first so you know what you're actually solving for.&lt;br&gt;
What's missing from this checklist? If you've run multi-model LLMs in production, you've probably hit edge cases I haven't. Drop them in the comments; I read every one.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I built &lt;a href="https://alltoken.ai/" rel="noopener noreferrer"&gt;alltoken.ai&lt;/a&gt; because I got tired of writing the same routing logic for every new project. Many models. One decision. Smart routing, transparent pricing, no platform fees.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>api</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
