
Ryan Carter

Originally published at stormcloudy.com

Multi-Model LLM Orchestration with OpenRouter

Multi-model LLM orchestration is the practice of routing AI requests to different models based on what each task needs — speed, cost, reasoning depth, or code quality. OpenRouter makes it practical by exposing models from Anthropic, OpenAI, Google, Meta, Mistral, and others through a single OpenAI-compatible API: one key, one bill, one client, and you swap models by changing a string. The implementation is a few dozen lines of code on top of the OpenAI SDK.

This post walks through the pattern: defining named model slots, routing by task or complexity, streaming, fallback handling, and tracking cost across providers.

TL;DR

  • What it is: Routing each AI request to the model best suited for that task instead of using one model for everything.
  • Why it matters: Cheaper at scale (small models for simple tasks), faster perceived latency (fast models for chat), better quality (right model for the job), and resilient (fall back across providers when one is down).
  • How OpenRouter helps: One API key gives you access to 100+ models across providers using the OpenAI SDK. Model strings follow provider/model-name.
  • Two routing strategies: By task type (summarize → fast model, reason → deep model) or by estimated complexity (token count thresholds).
  • Production essentials: Streaming for chat UIs, try/catch fallbacks for provider outages, and per-request cost logging via the usage object OpenRouter returns.

Why bother with multiple models?

A few real reasons:

Cost. Frontier models like GPT-4o or Claude Opus are expensive at scale. For tasks that don't need that level of reasoning — summarization, classification, simple Q&A — a cheaper, faster model does the job at a fraction of the cost.

Speed. Small models respond faster. If a user is waiting for a response, latency matters. Route quick tasks to a fast model and save the slow, expensive one for when it's actually needed.

Quality. Some models are better at specific things. Code generation, structured output, long-context reasoning, multilingual text — the best model for each task isn't always the same model.

Resilience. If one provider has an outage or rate limit, you can fall back to another without rewriting your integration.

Setting up OpenRouter

Install the OpenAI SDK — OpenRouter is compatible with it:

npm install openai

Point it at OpenRouter's base URL with your API key:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

That's it. Everything else is standard OpenAI SDK calls, just with different model strings.
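For example, here is a single completion as a minimal sketch; the model string and prompt are arbitrary, and swapping providers is just a matter of changing that one string:

// A plain chat completion through OpenRouter — only the model string is OpenRouter-specific.
const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4", // swap the string to try another provider's model
  messages: [{ role: "user", content: "Explain idempotency in one paragraph." }],
});

console.log(response.choices[0].message.content);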

Defining your model roster

The key to orchestration is deciding upfront which models you'll use and what each one is for. A simple approach is to define a set of "personas" — named roles that map to specific models:

const models = {
  fast: "google/gemini-flash-1.5",      // quick tasks, low latency
  balanced: "openai/gpt-4o-mini",        // everyday reasoning
  deep: "anthropic/claude-opus-4",       // complex reasoning, long context
  code: "anthropic/claude-sonnet-4",     // code generation and review
};

Model strings in OpenRouter follow the pattern provider/model-name. You can find the full list and pricing at openrouter.ai/models.

By mapping names to models rather than hardcoding model strings throughout your codebase, you can swap the underlying model without touching anything else. If a better cheap model comes out next month, you change one line.

Routing by task type

The simplest orchestration strategy is routing based on task type — you decide which model to use before making the call:

async function chat(task, messages) {
  const modelMap = {
    summarize: models.fast,
    classify: models.fast,
    draft: models.balanced,
    reason: models.deep,
    code: models.code,
  };

  const model = modelMap[task] ?? models.balanced;

  const response = await client.chat.completions.create({
    model,
    messages,
  });

  return response.choices[0].message.content;
}

// Usage
const summary = await chat("summarize", [
  { role: "user", content: "Summarize this document: ..." }
]);

const plan = await chat("reason", [
  { role: "user", content: "Help me think through this architecture decision..." }
]);

This is explicit and predictable. You know exactly which model runs for each task type, which makes debugging straightforward and costs easy to reason about.

Routing by estimated complexity

A more dynamic approach is routing based on the size or complexity of the request itself:

function selectModel(prompt) {
  const tokenEstimate = prompt.length / 4; // rough chars-to-tokens estimate

  if (tokenEstimate < 500) return models.fast;
  if (tokenEstimate < 2000) return models.balanced;
  return models.deep;
}

async function chat(prompt) {
  const model = selectModel(prompt);
  const response = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  return response.choices[0].message.content;
}

You can combine both approaches — route by task type first, then apply complexity thresholds within each category.
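A minimal sketch of that combination might look like the following; the chooseModel name and the thresholds are illustrative, not prescriptive:

// Hypothetical combined router: task type picks the category,
// then a rough token estimate decides whether to bump up within it.
function chooseModel(task, prompt) {
  const tokenEstimate = prompt.length / 4; // same rough chars-to-tokens estimate as above

  if (task === "code") return models.code;
  if (task === "reason") return models.deep;

  // Cheap tasks stay on the fast model unless the input is large.
  if (task === "summarize" || task === "classify") {
    return tokenEstimate < 2000 ? models.fast : models.balanced;
  }

  // Everything else falls back to the complexity thresholds alone.
  if (tokenEstimate < 500) return models.fast;
  if (tokenEstimate < 2000) return models.balanced;
  return models.deep;
}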

Streaming responses

For any user-facing interface, streaming makes responses feel faster even when they aren't:

async function streamChat(model, messages, onChunk) {
  const stream = await client.chat.completions.create({
    model,
    messages,
    stream: true,
  });

  for await (const chunk of stream) {
    const text = chunk.choices[0]?.delta?.content ?? "";
    if (text) onChunk(text);
  }
}

// Usage
await streamChat(models.balanced, messages, (chunk) => {
  process.stdout.write(chunk); // or push to your UI
});

Fallback handling

Models go down. Rate limits happen. Add a fallback layer so a failure from one provider doesn't take your whole app down:

async function chatWithFallback(preferredModel, fallbackModel, messages) {
  try {
    const response = await client.chat.completions.create({
      model: preferredModel,
      messages,
    });
    return response.choices[0].message.content;
  } catch (err) {
    console.warn(`Model ${preferredModel} failed, falling back to ${fallbackModel}`, err.message);
    const response = await client.chat.completions.create({
      model: fallbackModel,
      messages,
    });
    return response.choices[0].message.content;
  }
}

Tracking cost across models

One of the underrated benefits of OpenRouter is that it returns token usage and cost metadata with each response. Log it and you'll know exactly what you're spending per task type:

async function chatWithCostTracking(task, messages) {
  const model = selectModelForTask(task); // any task-to-model mapping, e.g. the modelMap from earlier

  const response = await client.chat.completions.create({
    model,
    messages,
  });

  const usage = response.usage;
  console.log({
    task,
    model,
    inputTokens: usage.prompt_tokens,
    outputTokens: usage.completion_tokens,
    // OpenRouter includes cost in the response
    cost: response.usage?.cost,
  });

  return response.choices[0].message.content;
}

Once you have that data you can see which task types are eating your budget and tune your routing accordingly.
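As a sketch of what that tuning looks like, suppose each logged object is pushed into a store (an in-memory array here purely for illustration); a quick reduce then totals cost per task:

// Illustrative only: accumulate the logged entries and total cost per task.
const usageLog = []; // push the object logged in chatWithCostTracking here instead of console.log

function costByTask(entries) {
  return entries.reduce((totals, { task, cost }) => {
    totals[task] = (totals[task] ?? 0) + (cost ?? 0);
    return totals;
  }, {});
}

console.log(costByTask(usageLog)); // e.g. { summarize: 0.02, reason: 1.14, code: 0.37 }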

Putting it together

The pattern here is straightforward; a combined sketch follows the list:

  1. Define named model slots tied to task roles, not specific model strings
  2. Route requests to the right slot based on task type, complexity, or both
  3. Stream responses for user-facing interfaces
  4. Add fallbacks so individual provider failures don't cascade
  5. Log usage so you can optimize over time
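As a minimal end-to-end sketch, assuming the modelMap from the task-routing example is lifted to module scope and logUsage is a stand-in for whatever logger you use, one entry point can cover routing, fallback, and usage logging:

// Sketch: one entry point that routes, falls back, and logs usage.
async function complete(task, messages) {
  const preferred = modelMap[task] ?? models.balanced; // steps 1–2: named slots + routing
  const fallback = models.balanced;

  try {
    const response = await client.chat.completions.create({ model: preferred, messages });
    logUsage(task, preferred, response.usage); // step 5: logUsage is your own logger
    return response.choices[0].message.content;
  } catch (err) {
    // step 4: fall back to another model rather than failing the request
    console.warn(`${preferred} failed, retrying with ${fallback}:`, err.message);
    const response = await client.chat.completions.create({ model: fallback, messages });
    logUsage(task, fallback, response.usage);
    return response.choices[0].message.content;
  }
}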

OpenRouter removes the vendor lock-in that makes this feel risky. You're not betting on one provider — you're building a routing layer that can point at any model, from any provider, updated as the landscape changes. Given how fast the model landscape moves, that flexibility is worth more than it might seem today.

FAQ

Is OpenRouter more expensive than calling providers directly?

OpenRouter passes through provider pricing with a small markup baked in (typically a few percent), and in exchange you get a single account, single bill, automatic failover, and the ability to swap models without touching keys or SDKs. For most teams the convenience is worth it; for very high-volume workloads on a single model, going direct can be cheaper.

Does OpenRouter support streaming and tool/function calling?

Yes. Streaming works exactly like the OpenAI SDK — set stream: true. Tool/function calling is supported per-model: most modern models from Anthropic, OpenAI, and Google handle it; smaller open models vary. Check the model card on openrouter.ai/models for capability flags.
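As a rough sketch, a tool call through OpenRouter uses the standard OpenAI SDK shape; the get_weather tool below is invented for illustration and would need to be wired to your own implementation:

// Standard OpenAI-style tool definition, sent to a model whose card lists tool support.
const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4",
  messages: [{ role: "user", content: "What's the weather in Lisbon?" }],
  tools: [{
    type: "function",
    function: {
      name: "get_weather", // hypothetical tool
      description: "Get current weather for a city",
      parameters: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"],
      },
    },
  }],
});

const toolCalls = response.choices[0].message.tool_calls; // present when the model decides to call the tool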

How does this compare to LangChain or LiteLLM?

LangChain is a much heavier framework with chains, agents, retrievers, and abstractions on top of providers. LiteLLM is the closest comparison — it's a unified provider proxy you self-host. OpenRouter is a hosted version of that idea: less control but zero ops, plus access to models you don't have direct accounts for.

What happens if a model gets deprecated or removed?

OpenRouter announces deprecations in advance and usually keeps a redirect to a sensible successor. Because your code references a model string in one place (the named-slot map), updating to a new model is a one-line change. This is the main argument for the named-slot pattern over hardcoding model names throughout the codebase.

Can I route by user, by feature, or by A/B test?

Yes. The routing function is just code, so you can include any signal in the decision: user tier, feature flag, A/B bucket, time of day. A common pattern is routing premium users to the deeper model and free users to the fast one. Another is shadow-routing — sending a copy of each request to a candidate model and comparing outputs offline.
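A sketch of tier-based routing, where the tier names and the premium/free split are made up for illustration:

// Illustrative: include the user's plan in the routing decision.
function modelForUser(user, task) {
  if (user.tier === "premium") return modelMap[task] ?? models.deep; // premium users get the routed model
  return models.fast; // free users always get the fast model
}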

How do I track which model performed best for a task?

Log the model, task, latency, token usage, and a quality signal (user thumbs-up, downstream success, eval score) for every request. Once you have a few weeks of data, group by task and model and compare. This is how you justify routing decisions empirically instead of guessing.
