Fan Chuanyu

Posted on Jun 15 • Originally published at datallmlab.com

How to Build a Multi-Model LLM Fallback Layer Without Rewriting Your App

#architecture

Most LLM integrations start as a single provider call.

That is usually the right move. You pick one strong model, wire up a chat completions request, ship the feature, and learn from real users.

The problem starts later.

Your support assistant needs better latency. Your document workflow needs a larger context window. Your extraction job is too expensive on the flagship model. A provider returns rate-limit errors during a launch. A new model is cheaper for background tasks but not good enough for customer-facing reasoning.

At that point, model choice is no longer a one-time SDK decision. It becomes application infrastructure.

This post walks through a practical way to build a small multi-model fallback layer so your product can use more than one provider without spreading provider-specific logic through the codebase.

The mistake: provider logic inside product features

A first integration often looks like this:

const response = await client.chat.completions.create({
  model: "gpt-4.1",
  messages,
});

That is fine for a prototype. In production, the feature usually grows around the provider call:

retries
rate-limit handling
usage metering
customer quotas
model-specific parameters
prompt templates
latency tracking
fallback behavior
cost attribution

If each product feature owns those details, every model change becomes a product change. You are not just switching a model name. You are also updating error handling, logging, pricing assumptions, quality tests, and maybe even the shape of the prompt.

The goal is not to hide every model difference. Some differences matter. The goal is to keep provider decisions in one place.

A better boundary: route by task, not by feature

Instead of letting every feature pick a provider directly, define the type of work the request represents.

For example:

type LlmTask =
  | "support_chat"
  | "document_summary"
  | "data_extraction"
  | "title_generation"
  | "long_context_analysis";

Then map tasks to model policies:

type ModelRoute = {
  primary: string;
  fallback?: string[];
  maxLatencyMs?: number;
  maxInputTokens?: number;
  allowFallback: boolean;
};

const routes: Record<LlmTask, ModelRoute> = {
  support_chat: {
    primary: "anthropic/claude-sonnet",
    fallback: ["openai/gpt-4.1", "google/gemini-pro"],
    maxLatencyMs: 5000,
    allowFallback: true,
  },
  data_extraction: {
    primary: "openai/gpt-4.1-mini",
    fallback: ["qwen/qwen-plus"],
    maxLatencyMs: 3000,
    allowFallback: true,
  },
  long_context_analysis: {
    primary: "google/gemini-pro",
    fallback: [],
    maxInputTokens: 1_000_000,
    allowFallback: false,
  },
  document_summary: {
    primary: "openai/gpt-4.1-mini",
    fallback: ["deepseek/deepseek-chat"],
    allowFallback: true,
  },
  title_generation: {
    primary: "qwen/qwen-plus",
    fallback: ["openai/gpt-4.1-mini"],
    allowFallback: true,
  },
};

This gives your application a stable interface:

const result = await llm.generate({
  task: "data_extraction",
  messages,
  customerId,
});

The feature does not need to know whether the request went to OpenAI, Anthropic, Gemini, Qwen, or another provider. It only needs the result and the metadata required for debugging.

Keep fallback conservative

Fallback sounds simple: if the primary model fails, try another one.

In practice, fallback rules need to be conservative because not all failures are the same.

You can usually retry or fall back on:

transient network errors
provider 5xx responses
rate-limit errors
timeouts
temporary capacity issues

You should be careful with fallback on:

safety refusals
structured output failures
tool-calling workflows
tasks where model behavior affects money, legal decisions, or user trust
workflows where consistency matters more than availability
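One way to encode the two lists above is an explicit policy table keyed by error kind. This is a minimal sketch, not a definitive implementation: the `ProviderErrorKind` names and the normalized `{ kind, message }` error shape are assumptions made for illustration. Real provider SDKs report errors differently, so an adapter would have to map them into a shape like this first.

```typescript
// Hypothetical normalized error kinds. Each provider adapter would need to
// translate its own error format into one of these.
type ProviderErrorKind =
  | "network"
  | "server_error" // provider 5xx responses
  | "rate_limit"
  | "timeout"
  | "capacity"
  | "safety_refusal"
  | "invalid_structured_output"
  | "tool_call_failure";

// One explicit decision per kind. The Record type makes the compiler
// complain if a new kind is added without a decision.
const FALLBACK_POLICY: Record<ProviderErrorKind, boolean> = {
  network: true,
  server_error: true,
  rate_limit: true,
  timeout: true,
  capacity: true,
  // "Be careful" cases default to no fallback in this sketch.
  safety_refusal: false,
  invalid_structured_output: false,
  tool_call_failure: false,
};

// Returns true only for errors that are known and marked safe.
// Anything unrecognized is treated as unsafe, which is the conservative default.
function isFallbackSafe(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const kind = (error as { kind?: unknown }).kind;
  if (typeof kind !== "string") return false;
  if (!Object.prototype.hasOwnProperty.call(FALLBACK_POLICY, kind)) return false;
  return FALLBACK_POLICY[kind as ProviderErrorKind];
}
```

The "be careful" kinds are set to `false` here as a starting point. A workload that tolerates a different model after a refusal or a malformed output could override them in its own route.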

Here is a simplified fallback runner:

type GenerateRequest = {
  task: LlmTask;
  messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;
  customerId: string;
};

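async function generateWithFallback(request: GenerateRequest) {
  const route = routes[request.task];
  // Try the primary first, then each fallback in the order listed.
  const candidates = [route.primary, ...(route.fallback ?? [])];

  let lastError: unknown;

  for (const model of candidates) {
    const startedAt = Date.now();
    let response;

    try {
      response = await callModelProvider({
        model,
        messages: request.messages,
      });
    } catch (error) {
      lastError = error;

      // Stop immediately unless this route allows fallback
      // and the error is one of the safe kinds.
      if (!route.allowFallback || !isFallbackSafe(error)) {
        throw error;
      }
      continue;
    }

    // Log outside the try block so a logging failure is not
    // mistaken for a provider failure and does not trigger a fallback.
    await logUsage({
      customerId: request.customerId,
      task: request.task,
      model,
      latencyMs: Date.now() - startedAt,
      inputTokens: response.usage.inputTokens,
      outputTokens: response.usage.outputTokens,
      fallback: model !== route.primary,
    });

    return response;
  }

  // Every candidate failed with a fallback-safe error.
  throw lastError;
}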

The important part is the policy, not the exact code. You want the fallback decision to be explicit, observable, and configured per workload.
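The route table declares `maxLatencyMs`, but the simplified runner never enforces it. One possible way to do that is a small timeout wrapper, sketched below. The `withTimeout` helper and the `"TimeoutError"` name are assumptions for illustration, and the sketch relies on the global `AbortController` available in current Node.js and browsers.

```typescript
// Rejects with a "TimeoutError" if `work` does not settle within `ms`.
// The AbortSignal is passed to `work` so the provider call can cancel its
// HTTP request, if the underlying client supports cancellation.
async function withTimeout<T>(
  ms: number,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      const error = new Error(`no response within ${ms}ms`);
      error.name = "TimeoutError";
      reject(error);
    }, ms);
  });

  try {
    // Whichever settles first wins: the provider call or the timer.
    return await Promise.race([work(controller.signal), timeout]);
  } finally {
    // Always clear the timer so it cannot fire after a fast response.
    clearTimeout(timer);
  }
}
```

Inside the runner, the provider call could then be wrapped as `withTimeout(route.maxLatencyMs ?? 30_000, (signal) => callModelProvider({ model, messages, signal }))`. The 30-second default and the `signal` parameter on `callModelProvider` are hypothetical, and `isFallbackSafe` would need to recognize the timeout error for the fallback to proceed.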

Log cost before it hurts

LLM cost visibility is easy to postpone when usage is small. That is a trap.

By the time token cost shows up on your provider or cloud bill, it is usually much harder to work out which feature, model, customer, or prompt caused the increase.

At minimum, log:

customer or workspace ID
feature or task name
provider and model
input tokens
output tokens
cached tokens if supported
latency
fallback status
request outcome

This lets you answer practical questions:

Which feature is most expensive?
Which customers generate the highest token cost?
Which background jobs can move to a cheaper model?
Which model has the worst tail latency?
How often are fallbacks happening?

You do not need a complicated system to start. A database table or analytics event is enough:

await db.llmUsage.create({
  data: {
    customerId,
    task,
    model,
    inputTokens,
    outputTokens,
    latencyMs,
    fallback,
    createdAt: new Date(),
  },
});
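Once rows like these exist, the questions above reduce to a small aggregation. The sketch below is one way to do it in application code, under stated assumptions: the `UsageRow` shape mirrors the fields logged above, the model IDs are made up, and the prices are placeholders, not real provider prices.

```typescript
type UsageRow = {
  customerId: string;
  task: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
};

// USD per million tokens. PLACEHOLDER values for illustration only;
// real prices must come from each provider's current price list.
type Price = { inputPerMillion: number; outputPerMillion: number };

const examplePrices: Record<string, Price> = {
  "example/large": { inputPerMillion: 2, outputPerMillion: 8 },
  "example/small": { inputPerMillion: 0.5, outputPerMillion: 2 },
};

// Cost of a single request in USD.
function rowCostUsd(row: UsageRow, prices: Record<string, Price>): number {
  const price = prices[row.model];
  if (!price) {
    // Fail loudly rather than silently counting an unpriced model as free.
    throw new Error(`no price configured for model "${row.model}"`);
  }
  return (
    (row.inputTokens * price.inputPerMillion +
      row.outputTokens * price.outputPerMillion) /
    1_000_000
  );
}

// Sums cost grouped by task, customer, or model.
function costBy(
  rows: UsageRow[],
  key: "task" | "customerId" | "model",
  prices: Record<string, Price>,
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const row of rows) {
    const group = row[key];
    totals[group] = (totals[group] ?? 0) + rowCostUsd(row, prices);
  }
  return totals;
}
```

Calling `costBy(rows, "task", examplePrices)` answers "which feature is most expensive", and the same call with `"customerId"` gives cost per customer. At larger volumes the same grouping would more likely live in a SQL query or an analytics tool.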

Do not pretend all models are identical

An OpenAI-compatible API can reduce integration work, but compatibility is not the same as interchangeability.

Models can differ in:

context window size
tool calling behavior
structured output reliability
latency by region
output style
refusal behavior
tokenization
price

The abstraction should keep common product code clean while still exposing model-specific facts where they matter.

A good rule: hide provider plumbing, not product-relevant behavior.
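One way to expose those facts without leaking provider plumbing is a small capability table that the router consults before it builds the candidate list. This is a sketch under assumptions: the `ModelCapabilities` fields, the model IDs, and every number below are placeholders for illustration, not real model specifications.

```typescript
// Hypothetical capability record for each model the router knows about.
type ModelCapabilities = {
  contextWindowTokens: number;
  supportsToolCalling: boolean;
  supportsJsonSchema: boolean;
};

// PLACEHOLDER entries. Real values belong in config and should be checked
// against each provider's documentation.
const exampleCapabilities: Record<string, ModelCapabilities> = {
  "example/long-context": {
    contextWindowTokens: 1_000_000,
    supportsToolCalling: true,
    supportsJsonSchema: true,
  },
  "example/small": {
    contextWindowTokens: 32_000,
    supportsToolCalling: false,
    supportsJsonSchema: false,
  },
};

// What a particular request needs from whichever model serves it.
type Requirements = {
  inputTokens: number; // an estimate, since tokenization differs per model
  needsToolCalling?: boolean;
  needsJsonSchema?: boolean;
};

// Keeps only the candidates that can serve the request, preserving route order.
function eligibleModels(
  candidates: string[],
  needs: Requirements,
  known: Record<string, ModelCapabilities>,
): string[] {
  return candidates.filter((model) => {
    const caps = known[model];
    if (!caps) return false; // unknown model: exclude rather than guess
    if (needs.inputTokens > caps.contextWindowTokens) return false;
    if (needs.needsToolCalling && !caps.supportsToolCalling) return false;
    if (needs.needsJsonSchema && !caps.supportsJsonSchema) return false;
    return true;
  });
}
```

With this in place, a fallback that cannot fit the prompt or cannot call tools is skipped up front instead of failing at request time.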

Where a gateway fits

You can build this layer yourself if you have specific routing, compliance, or observability requirements.

You can also use an OpenAI-compatible AI gateway if you want the model catalog, routing, pricing, and fallback surface managed outside your app. For example, datallmlab is one implementation option for teams that want access to GPT, Claude, Gemini, Qwen, DeepSeek, and other models through a single API.

The architectural point is the same either way: keep model selection outside feature code.

Checklist

Before adding a second provider, decide:

Which workloads are allowed to fall back?
Which workloads need consistent model behavior?
Where will model routing be configured?
How will you measure cost per customer and feature?
How will you test output quality before switching a route?
What errors are safe to retry?
What errors should stop immediately?
Who can change model routes in production?
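Several of these decisions can be checked mechanically when a route config changes, for example at startup or in CI. The sketch below is one hypothetical validator. `ModelRoute` repeats the type from earlier in the post so the block is self-contained, while `validateRoutes`, its `knownModels` parameter, and the specific rules are assumptions rather than a required set.

```typescript
type ModelRoute = {
  primary: string;
  fallback?: string[];
  maxLatencyMs?: number;
  maxInputTokens?: number;
  allowFallback: boolean;
};

// Returns human-readable problems. An empty list means the config passed.
function validateRoutes(
  routes: Record<string, ModelRoute>,
  knownModels: string[],
): string[] {
  const problems: string[] = [];

  for (const task of Object.keys(routes)) {
    const route = routes[task];
    const fallback = route.fallback ?? [];

    // Every referenced model must be one the gateway or adapters know about.
    for (const model of [route.primary, ...fallback]) {
      if (knownModels.indexOf(model) === -1) {
        problems.push(`${task}: unknown model "${model}"`);
      }
    }

    // Retrying the same model is a retry policy, not a fallback.
    if (fallback.indexOf(route.primary) !== -1) {
      problems.push(`${task}: primary model is repeated in fallback`);
    }

    // The flag and the list should agree, so the intent is unambiguous.
    if (!route.allowFallback && fallback.length > 0) {
      problems.push(`${task}: fallback models listed but allowFallback is false`);
    }
    if (route.allowFallback && fallback.length === 0) {
      problems.push(`${task}: allowFallback is true but no fallback models are listed`);
    }
  }

  return problems;
}
```

Failing the deploy when `validateRoutes` returns anything makes a route change a reviewed config change rather than a silent production edit.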

Final thought

The best model for your product today may not be the best model next quarter.

That does not mean you should rewrite your app every time the model landscape changes. It means the app should treat model choice as a routing decision, not a hard-coded dependency.

Start small: one routing function, one usage log, one conservative fallback policy.

That is enough to keep your AI features flexible without turning your codebase into provider glue.
