Lin Z.

A Developer's Checklist for Multi-Model LLM Routing

I wrote an intro to AI API gateways on Medium the other day. This is the practical follow-up: the checklist I wish I had before I built AllToken.

I built AllToken around a simple premise: many models, one decision.
But that decision only makes sense if your routing layer doesn't become a nightmare to maintain. After managing five different provider SDKs in production — and watching our internal abstraction layer grow into its own microservice — I realized there's a standard checklist every team should run before they commit to a multi-model stack.

Here's mine.

1. One Schema to Rule Them All

If your application code branches on if provider == "openai", you've already lost. Every new provider becomes a refactor.

The check: Your app should send one request shape regardless of the target model.

At AllToken, we expose an OpenAI-compatible endpoint, but the principle matters more than the vendor:

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.ALLTOKEN_API_KEY,
  baseURL: 'https://api.alltoken.ai/v1',
});

// Same code, any provider underneath
const completion = await client.chat.completions.create({
  model: 'minimax-m2.7',
  messages: [{ role: 'user', content: 'Hello!' }],
});

Red flag: If adding a new provider requires touching more than one line (the model string), your abstraction is leaking.

2. Failover That Doesn't Wake Your On-Call

Provider outages are not edge cases. They're Tuesday.
The check: When your primary provider 500s or times out, does your app retry automatically? Or does it bubble the error to the user?
A production gateway should handle this without your application knowing it happened. That means health checks on each provider, some form of circuit-breaking logic when a provider is clearly degraded, and automatic fallback to a secondary option.
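
For contrast, here's roughly what hand-rolled failover looks like when it lives in the app layer. This is a sketch, not AllToken's internals; the fallback model IDs are hypothetical placeholders, and it reuses the client from the first snippet:

// Hand-rolled failover: try each model in order until one succeeds.
// Model IDs here are hypothetical placeholders.
const FALLBACK_MODELS = ['primary-model', 'secondary-model'];

async function completeWithFallback(
  messages: OpenAI.Chat.ChatCompletionMessageParam[],
) {
  let lastError: unknown;
  for (const model of FALLBACK_MODELS) {
    try {
      return await client.chat.completions.create(
        { model, messages },
        { timeout: 10_000, maxRetries: 1 }, // per-request options in the OpenAI SDK
      );
    } catch (err) {
      lastError = err; // 500 or timeout: fall through to the next model
    }
  }
  throw lastError;
}

A gateway moves that logic below your application entirely: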

# This request should succeed even if the primary provider is having issues
curl https://api.alltoken.ai/v1/chat/completions \
  -H "Authorization: Bearer $ALLTOKEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2.7",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Red flag: Your failover logic lives in a 200-line try/catch block that only you understand.

3. Cost Routing, Not Just Cost Tracking

Tracking spend after the fact is accounting. Routing by cost in real time is engineering.
The check: Can you send a cheap query to a cheap model and a complex query to a strong model — without changing application code?
Most teams end up with an informal tiering system whether they plan for it or not:

Request Type       Latency Budget   Cost Ceiling   Typical Route
Simple Q&A         < 2s             Low            Budget model
Code generation    < 5s             Medium         Strong model
PII-sensitive      Flexible         N/A            Self-hosted

Your gateway should ideally classify the request and match it against provider capabilities. If you're doing this with if/else in your backend, you're building a gateway whether you call it one or not.
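
To make the idea concrete, here's a toy classifier. It's a sketch with made-up model IDs; a real gateway would use richer signals (token counts, user tier, historical latency) than keyword matching:

type Tier = 'budget' | 'strong' | 'self-hosted';

// Toy request classifier. Real routing uses richer signals than keywords.
function classify(prompt: string, containsPII: boolean): Tier {
  if (containsPII) return 'self-hosted';
  if (/\b(code|function|refactor|implement)\b/i.test(prompt)) return 'strong';
  return 'budget';
}

// Hypothetical model IDs for each tier
const MODEL_FOR_TIER: Record<Tier, string> = {
  'budget': 'cheap-model',
  'strong': 'strong-model',
  'self-hosted': 'internal-model',
};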

4. Latency Budgets, Not Just Speed

"Fast" is meaningless. "Fast enough for this specific user flow" is a requirement.
The check: Can you set a hard timeout per request and have the gateway respect it?

const stream = await client.chat.completions.create(
  {
    model: 'minimax-m2.7',
    messages: [{ role: 'user', content: 'Tell me a story' }],
    stream: true, // SSE streaming
  },
  { timeout: 10_000 }, // hard per-request timeout (ms), supported by the OpenAI SDK
);

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}

If a provider starts streaming slowly, your gateway should know when to cut bait and fail over — not when your user is already angry.
Red flag: You only find out about latency issues from user complaints.

5. Observability Per Request

"How much did we spend on OpenAI last month?" is a finance question.
"How much did User 8473 spend on embedding requests in the last hour?" is an engineering question.

The check: Can you attribute cost, latency, and token usage down to the individual request or user?
At minimum, a production gateway should give you:
• Request ID propagation across the stack
• Per-user or per-feature cost attribution
• Provider-specific error tracking
If your gateway doesn't expose this, you're flying blind at scale.
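
If you have to bolt this on yourself, the minimal version is a wrapper around every call. A sketch, assuming the client from earlier; the console.log is a placeholder for whatever metrics pipeline you actually use:

async function trackedCompletion(
  userId: string,
  feature: string,
  messages: OpenAI.Chat.ChatCompletionMessageParam[],
) {
  const started = Date.now();
  const completion = await client.chat.completions.create({
    model: 'minimax-m2.7',
    messages,
  });
  // Attribute cost and latency down to the individual request and user.
  console.log({
    requestId: completion.id, // propagate this ID across your stack
    userId,
    feature,
    latencyMs: Date.now() - started,
    promptTokens: completion.usage?.prompt_tokens,
    completionTokens: completion.usage?.completion_tokens,
  });
  return completion;
}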

6. Rate Limiting at the Gateway, Not the Provider

Managing rate limits across five different dashboards is not a job. It's a punishment.
The check: Do you have one throttle layer that protects your app and your wallet?
A proper gateway should handle:
• Global rate limits (protect your budget)
• Per-user rate limits (prevent abuse)
• Per-provider rate limits (respect upstream quotas)
One API key. One set of rules. Not five different UIs with different semantics.
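
A single-process token bucket is the toy version of that throttle layer. Sketch only; a real gateway keeps this state somewhere shared, like Redis, so limits hold across instances:

// In-memory token bucket, keyed per user. Sketch only: single process,
// no persistence. ratePerSec and burst are illustrative defaults.
const buckets = new Map<string, { tokens: number; last: number }>();

function allowRequest(userId: string, ratePerSec = 1, burst = 5): boolean {
  const now = Date.now();
  const b = buckets.get(userId) ?? { tokens: burst, last: now };
  // Refill in proportion to elapsed time, capped at the burst size
  b.tokens = Math.min(burst, b.tokens + ((now - b.last) / 1000) * ratePerSec);
  b.last = now;
  const allowed = b.tokens >= 1;
  if (allowed) b.tokens -= 1;
  buckets.set(userId, b);
  return allowed;
}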

7. An Escape Hatch from Vendor Lock-In

This is the one everyone claims to care about and nobody tests.
The check: If you needed to swap your primary provider next week, how many files would you touch?
With a proper gateway: ideally zero. You change a config. Maybe a model string.
Without one: every file that touches an LLM. Which, if you're like us, was most of the backend.
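
The test is easy to encode. If your call sites look like this sketch, the swap is an env-var change (the variable name is just an example):

// Provider choice lives in config, not in source
const MODEL = process.env.PRIMARY_MODEL ?? 'minimax-m2.7';

const completion = await client.chat.completions.create({
  model: MODEL, // swapping providers means changing an env var, not editing files
  messages: [{ role: 'user', content: 'Hello!' }],
});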

What We Evaluated

Before we built AllToken, we looked at what was already out there. OpenRouter has an incredible model catalog and is great for experimentation. Other teams roll their own with Nginx and Lua scripts. Some just accept the SDK sprawl.
None of them handled production failover, cost routing, and unified billing the way we needed. So we built it.
But I'm not here to tell you to use AllToken. I'm here to tell you that if you're running more than one model in production, you're going to end up building or buying a gateway eventually. Run this checklist first so you know what you're actually solving for.
What's missing from this checklist? If you've run multi-model LLMs in production, you've probably hit edge cases I haven't. Drop them in the comments; I read every one.

I built alltoken.ai because I got tired of writing the same routing logic for every new project. Many models. One decision. Smart routing, transparent pricing, no platform fees.
