Most LLM integrations start as a single provider call.
That is usually the right move. You pick one strong model, wire up a chat completions request, ship the feature, and learn from real users.
The problem starts later.
Your support assistant needs better latency. Your document workflow needs a larger context window. Your extraction job is too expensive on the flagship model. A provider returns rate-limit errors during a launch. A new model is cheaper for background tasks but not good enough for customer-facing reasoning.
At that point, model choice is no longer a one-time SDK decision. It becomes application infrastructure.
This post walks through a practical way to build a small multi-model fallback layer so your product can use more than one provider without spreading provider-specific logic through the codebase.
The mistake: provider logic inside product features
A first integration often looks like this:
const response = await client.chat.completions.create({
  model: "gpt-4.1",
  messages,
});
That is fine for a prototype. In production, the feature usually grows around the provider call:
- retries
- rate-limit handling
- usage metering
- customer quotas
- model-specific parameters
- prompt templates
- latency tracking
- fallback behavior
- cost attribution
If each product feature owns those details, every model change becomes a product change. You are no longer just swapping a model name; you are also updating error handling, logging, pricing assumptions, quality tests, and possibly the prompt shape.
The goal is not to hide every model difference. Some differences matter. The goal is to keep provider decisions in one place.
A better boundary: route by task, not by feature
Instead of letting every feature pick a provider directly, define the type of work the request represents.
For example:
type LlmTask =
  | "support_chat"
  | "document_summary"
  | "data_extraction"
  | "title_generation"
  | "long_context_analysis";
Then map tasks to model policies:
type ModelRoute = {
  primary: string;
  fallback?: string[];
  maxLatencyMs?: number;
  maxInputTokens?: number;
  allowFallback: boolean;
};

const routes: Record<LlmTask, ModelRoute> = {
  support_chat: {
    primary: "anthropic/claude-sonnet",
    fallback: ["openai/gpt-4.1", "google/gemini-pro"],
    maxLatencyMs: 5000,
    allowFallback: true,
  },
  data_extraction: {
    primary: "openai/gpt-4.1-mini",
    fallback: ["qwen/qwen-plus"],
    maxLatencyMs: 3000,
    allowFallback: true,
  },
  long_context_analysis: {
    primary: "google/gemini-pro",
    fallback: [],
    maxInputTokens: 1_000_000,
    allowFallback: false,
  },
  document_summary: {
    primary: "openai/gpt-4.1-mini",
    fallback: ["deepseek/deepseek-chat"],
    allowFallback: true,
  },
  title_generation: {
    primary: "qwen/qwen-plus",
    fallback: ["openai/gpt-4.1-mini"],
    allowFallback: true,
  },
};
This gives your application a stable interface:
const result = await llm.generate({
  task: "data_extraction",
  messages,
  customerId,
});
The feature does not need to know whether the request went to OpenAI, Anthropic, Gemini, Qwen, or another provider. It only needs the result and the metadata required for debugging.
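One possible shape for that result is sketched here. `RouteMeta`, `GenerateResult`, `splitModelId`, and `buildMeta` are hypothetical names introduced for illustration, not part of any SDK. The only assumption carried over from the routes table is that model IDs follow a `provider/model` convention.

```typescript
// Hypothetical result shape: features read `text`, while `meta` exists
// for logs and debugging only.
type RouteMeta = {
  provider: string;
  model: string;
  usedFallback: boolean;
  attempts: number;
  latencyMs: number;
};

type GenerateResult = {
  text: string;
  meta: RouteMeta;
};

// Splits an ID such as "openai/gpt-4.1-mini" at the first slash.
// Assumes the "provider/model" convention used in the routes table.
function splitModelId(modelId: string): { provider: string; model: string } {
  const slash = modelId.indexOf("/");
  if (slash <= 0 || slash === modelId.length - 1) {
    throw new Error(`Expected "provider/model", got "${modelId}"`);
  }
  return {
    provider: modelId.slice(0, slash),
    model: modelId.slice(slash + 1),
  };
}

// Builds the debugging metadata for the model that actually served the request.
function buildMeta(
  servedBy: string,
  primary: string,
  attempts: number,
  latencyMs: number,
): RouteMeta {
  const { provider, model } = splitModelId(servedBy);
  return {
    provider,
    model,
    usedFallback: servedBy !== primary,
    attempts,
    latencyMs,
  };
}

// Example: the primary failed once and the first fallback answered.
const exampleResult: GenerateResult = {
  text: "extracted fields go here",
  meta: buildMeta("qwen/qwen-plus", "openai/gpt-4.1-mini", 2, 1240),
};
```

Keeping `meta` separate from `text` makes it harder for feature code to start branching on the provider by accident.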
Keep fallback conservative
Fallback sounds simple: if the primary model fails, try another one.
In practice, fallback rules need to be conservative because not all failures are the same.
You can usually retry or fall back on:
- transient network errors
- provider 5xx responses
- rate-limit errors
- timeouts
- temporary capacity issues
Be careful with fallback in the cases below, where a different model can quietly change the result rather than just restore availability:
- safety refusals
- structured output failures
- tool-calling workflows
- tasks where model behavior affects money, legal decisions, or user trust
- workflows where consistency matters more than availability
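The two lists above can be turned into an explicit classifier. This is a minimal sketch, assuming your provider client has already normalized failures into a hypothetical `ProviderFailure` shape with a coarse `kind` and an optional HTTP `status`. Real SDKs report errors differently, so that mapping layer is yours to write.

```typescript
// Hypothetical normalized failure; not the error type of any real SDK.
type ProviderFailure = {
  kind: "network" | "timeout" | "http" | "refusal" | "invalid_output";
  status?: number;
  message: string;
};

const KNOWN_KINDS: readonly string[] = [
  "network",
  "timeout",
  "http",
  "refusal",
  "invalid_output",
];

function isProviderFailure(value: unknown): value is ProviderFailure {
  if (typeof value !== "object" || value === null) return false;
  const kind = (value as { kind?: unknown }).kind;
  return typeof kind === "string" && KNOWN_KINDS.indexOf(kind) !== -1;
}

function isFallbackSafe(error: unknown): boolean {
  // Unknown errors are treated as unsafe: fail loudly rather than guess.
  if (!isProviderFailure(error)) return false;

  switch (error.kind) {
    case "network":
    case "timeout":
      return true;
    case "http": {
      // 429 is rate limiting; 5xx covers provider-side and capacity failures.
      const status = error.status;
      if (status === undefined) return false;
      return status === 429 || (status >= 500 && status < 600);
    }
    case "refusal":
    case "invalid_output":
      // Another model might answer differently, which hides the real problem.
      return false;
    default:
      return false;
  }
}
```

Defaulting to `false` for anything unrecognized is a deliberate choice in this sketch: an unexpected error surfaces immediately instead of being retried on a second provider.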
Here is a simplified fallback runner:
type GenerateRequest = {
  task: LlmTask;
  messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;
  customerId: string;
};

async function generateWithFallback(request: GenerateRequest) {
  const route = routes[request.task];
  const candidates = [route.primary, ...(route.fallback ?? [])];
  let lastError: unknown;

  for (const model of candidates) {
    const startedAt = Date.now();
    let response;
    try {
      response = await callModelProvider({
        model,
        messages: request.messages,
      });
    } catch (error) {
      lastError = error;
      // Stop on the first error unless both the route and the error allow fallback.
      if (!route.allowFallback || !isFallbackSafe(error)) {
        throw error;
      }
      continue;
    }

    // Log outside the try block so that a logging failure is not mistaken
    // for a provider failure and does not trigger a second model call.
    await logUsage({
      customerId: request.customerId,
      task: request.task,
      model,
      latencyMs: Date.now() - startedAt,
      inputTokens: response.usage.inputTokens,
      outputTokens: response.usage.outputTokens,
      fallback: model !== route.primary,
    });
    return response;
  }

  // Every candidate failed with a fallback-safe error.
  throw lastError;
}
The important part is the policy, not the exact code. You want the fallback decision to be explicit, observable, and different for each workload.
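The routes define `maxLatencyMs`, but the simplified runner never enforces it. One way to do so is to settle with whichever comes first, the provider call or a timer. The `withTimeout` helper below is a hypothetical sketch, not part of any SDK. It only stops waiting; it does not cancel the underlying HTTP request, which would require passing an abort signal to whatever client you use.

```typescript
// Hypothetical helper: rejects with a "TimeoutError" if `work` has not
// settled within `ms` milliseconds. It stops waiting but does not cancel
// the underlying request.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      const error = new Error(`Timed out after ${ms} ms`);
      error.name = "TimeoutError";
      reject(error);
    }, ms);

    // Whichever settles first wins; the timer is cleared either way so it
    // cannot keep the process alive or fire after the call has finished.
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

In the runner, the provider call could then be wrapped as `withTimeout(callModelProvider({ model, messages }), route.maxLatencyMs)` whenever the route sets a limit. For a timeout to trigger a fallback, `isFallbackSafe` would also need to recognize the error, for example by checking `error.name === "TimeoutError"`.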
Log cost before it hurts
LLM cost visibility is easy to postpone when usage is small. That is a trap.
By the time token cost shows up on an invoice, it is usually hard to tell which feature, model, customer, or prompt caused the increase.
At minimum, log:
- customer or workspace ID
- feature or task name
- provider and model
- input tokens
- output tokens
- cached tokens if supported
- latency
- fallback status
- request outcome
This lets you answer practical questions:
- Which feature is most expensive?
- Which customers generate the highest token cost?
- Which background jobs can move to a cheaper model?
- Which model has the worst tail latency?
- How often are fallbacks happening?
You do not need a complicated system to start. A database table or analytics event is enough:
await db.llmUsage.create({
  data: {
    customerId,
    task,
    model,
    inputTokens,
    outputTokens,
    latencyMs,
    fallback,
    createdAt: new Date(),
  },
});
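With rows like these, a question such as "which feature is most expensive?" becomes a small aggregation. The sketch below assumes a hypothetical price table in USD per million tokens. The model names and numbers in `examplePrices` are placeholders, not real provider prices, and `UsageRow` mirrors only the fields needed here.

```typescript
type UsageRow = {
  task: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
};

// Placeholder prices in USD per 1M tokens. These are NOT real provider prices.
type Price = { inputPerMTok: number; outputPerMTok: number };

const examplePrices: Record<string, Price> = {
  "example/large": { inputPerMTok: 2.0, outputPerMTok: 8.0 },
  "example/small": { inputPerMTok: 0.5, outputPerMTok: 1.5 },
};

function rowCostUsd(row: UsageRow, prices: Record<string, Price>): number {
  const price = prices[row.model];
  if (price === undefined) {
    // Fail loudly: an unpriced model would otherwise look free.
    throw new Error(`No price configured for model "${row.model}"`);
  }
  return (
    (row.inputTokens * price.inputPerMTok +
      row.outputTokens * price.outputPerMTok) /
    1_000_000
  );
}

// Sums cost per task so the most expensive feature is easy to spot.
function costByTask(
  rows: UsageRow[],
  prices: Record<string, Price>,
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const row of rows) {
    totals[row.task] = (totals[row.task] ?? 0) + rowCostUsd(row, prices);
  }
  return totals;
}
```

The same grouping works per customer or per model by changing the key. In practice you would probably run it as a query in your database or analytics tool rather than in application code.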
Do not pretend all models are identical
An OpenAI-compatible API can reduce integration work, but compatibility is not the same as interchangeability.
Models can differ in:
- context window size
- tool calling behavior
- structured output reliability
- latency by region
- output style
- refusal behavior
- tokenization
- price
The abstraction should keep common product code clean while still exposing model-specific facts where they matter.
A good rule: hide provider plumbing, not product-relevant behavior.
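One way to apply that rule is a small capability table that routing code consults and feature code never sees. This is a sketch under stated assumptions: the model names and every number in `exampleCapabilities` are invented for illustration and do not describe real models, and `eligibleModels` is a hypothetical helper.

```typescript
type ModelCapabilities = {
  contextWindowTokens: number;
  supportsToolCalls: boolean;
  supportsJsonSchema: boolean;
};

// Invented entries for illustration only; fill in from provider documentation.
const exampleCapabilities: Record<string, ModelCapabilities> = {
  "example/long-context": {
    contextWindowTokens: 1_000_000,
    supportsToolCalls: true,
    supportsJsonSchema: true,
  },
  "example/fast": {
    contextWindowTokens: 32_000,
    supportsToolCalls: false,
    supportsJsonSchema: true,
  },
};

type Requirements = {
  inputTokens: number;
  needsToolCalls?: boolean;
  needsJsonSchema?: boolean;
};

// Keeps only the candidates that can serve the request, preserving route order.
function eligibleModels(
  candidates: string[],
  needs: Requirements,
  capabilities: Record<string, ModelCapabilities>,
): string[] {
  return candidates.filter((model) => {
    const caps = capabilities[model];
    if (caps === undefined) return false; // unknown model: exclude rather than guess
    if (needs.inputTokens > caps.contextWindowTokens) return false;
    if (needs.needsToolCalls && !caps.supportsToolCalls) return false;
    if (needs.needsJsonSchema && !caps.supportsJsonSchema) return false;
    return true;
  });
}
```

Filtering the candidate list before the first call means a request that cannot fit a fallback model's context window never reaches it, instead of failing there with an error that looks like a provider problem.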
Where a gateway fits
You can build this layer yourself if you have specific routing, compliance, or observability requirements.
You can also use an OpenAI-compatible AI gateway if you want the model catalog, routing, pricing, and fallback surface managed outside your app. For example, datallmlab is one implementation option for teams that want access to GPT, Claude, Gemini, Qwen, DeepSeek, and other models through a single API.
The architectural point is the same either way: keep model selection outside feature code.
Checklist
Before adding a second provider, decide:
- Which workloads are allowed to fall back?
- Which workloads need consistent model behavior?
- Where will model routing be configured?
- How will you measure cost per customer and feature?
- How will you test output quality before switching a route?
- What errors are safe to retry?
- What errors should stop immediately?
- Who can change model routes in production?
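Some of these decisions can be checked mechanically whenever the route configuration changes, for example in CI. The sketch below repeats the `ModelRoute` type so the block is self-contained. `validateRoute` is a hypothetical helper, and the specific rules are assumptions about what a team might want to enforce, not requirements.

```typescript
type ModelRoute = {
  primary: string;
  fallback?: string[];
  maxLatencyMs?: number;
  maxInputTokens?: number;
  allowFallback: boolean;
};

// Returns a list of human-readable problems; an empty list means the route
// passed every check in this sketch.
function validateRoute(task: string, route: ModelRoute): string[] {
  const problems: string[] = [];
  const fallback = route.fallback ?? [];

  if (!route.allowFallback && fallback.length > 0) {
    problems.push(`${task}: fallback models are listed but allowFallback is false`);
  }
  if (route.allowFallback && fallback.length === 0) {
    problems.push(`${task}: allowFallback is true but no fallback models are listed`);
  }
  if (fallback.indexOf(route.primary) !== -1) {
    problems.push(`${task}: the primary model also appears in the fallback list`);
  }
  if (route.maxLatencyMs !== undefined && route.maxLatencyMs <= 0) {
    problems.push(`${task}: maxLatencyMs must be positive`);
  }
  if (route.maxInputTokens !== undefined && route.maxInputTokens <= 0) {
    problems.push(`${task}: maxInputTokens must be positive`);
  }
  return problems;
}
```

Returning every problem rather than throwing on the first one tends to make a failed configuration review quicker to fix.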
Final thought
The best model for your product today may not be the best model next quarter.
That does not mean you should rewrite your app every time the model landscape changes. It means the app should treat model choice as a routing decision, not a hard-coded dependency.
Start small: one routing function, one usage log, one conservative fallback policy.
That is enough to keep your AI features flexible without turning your codebase into provider glue.