DEV Community

David C Cavalcante

ModelChain: Measurable LLM Router with Adaptive Model Selection, Real-Time Scoring, Budget Guards and Failover for Node.js, Edge and Browser


As a solo LLMOps engineer with over 25 years building production AI systems, I kept hitting the same limitation: when you have access to multiple LLM providers and models, choosing the right one for each request becomes fragile and outdated quickly.

Static if/else rules or fixed fallbacks do not survive real-world changes in pricing, latency, or model quality. Manual benchmarking is time-consuming and error-prone.

ModelChain (@takk/modelchain) was built to solve this.

The Problem

Developers and companies with keys for OpenAI, Anthropic, Gemini, Groq, and others waste time and money because they cannot dynamically route each prompt to the best available model based on current cost, observed latency, and actual response quality. Hard-coded choices quickly become suboptimal.

The Solution

ModelChain is a measurable, adaptive LLM router for Node.js, Edge runtimes, and the browser. It selects the best model per request using one of seven routing strategies, scores every response in real time, feeds those scores back into future decisions, enforces hard budget guards, and includes per-model circuit breakers with automatic failover.

It normalises responses, tool calling, and streaming across providers, has zero runtime dependencies, and is fully tree-shakeable.

Core Features

  • Seven declarative routing strategies (cost-then-quality, cost-first, quality-first, etc.)
  • Six pluggable scorers (latency, token-budget, length-bound, regex-match, exact-match, schema-valid)
  • Native streaming over Web Streams with a unified CompletionChunk type
  • Normalised tool calling across OpenAI, Anthropic, and Gemini
  • Hard budget guard (per-request, per-task, daily ceilings) that throws before any network call
  • Per-model circuit breaker + full-jitter exponential backoff + automatic failover
  • EWMA health scoring that decays on failure and recovers on success
  • Thirteen in-process telemetry events (no external OpenTelemetry required)
  • Vercel AI SDK adapter (toVercelAILanguageModel)
  • CLI proxy, inspect, and bench modes
  • Six tree-shakeable entry points (core, providers, web, edge, ai-sdk, cli)
  • SLSA provenance on every release
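
Two of the reliability features in the list above rest on well-known formulas, so a minimal sketch may help. The block below assumes a smoothing factor `alpha` of 0.3, a 100 ms base delay, and a 10 s cap; those constants and the function names `updateHealth` and `fullJitterDelayMs` are illustrative and are not taken from ModelChain's API.

```typescript
// Hypothetical helpers; names and constants are illustrative, not ModelChain's API.

/**
 * EWMA update: blend the previous score with the newest outcome
 * (1 for success, 0 for failure). Failures pull the score toward 0,
 * successes pull it back toward 1.
 */
function updateHealth(previous: number, success: boolean, alpha = 0.3): number {
  const sample = success ? 1 : 0;
  return alpha * sample + (1 - alpha) * previous;
}

/**
 * Full-jitter exponential backoff: pick a uniformly random delay between 0
 * and min(capMs, baseMs * 2^attempt). The random source is injectable so the
 * function can be tested deterministically.
 */
function fullJitterDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 10_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return random() * ceiling;
}
```

With these assumed constants, a healthy model (score 1) drops to 0.7 after one failure and climbs back to 0.79 after the next success, so a single hiccup lowers its priority without removing it from rotation.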

Quickstart Examples

1. Basic Router Setup

import { createModelchain } from '@takk/modelchain';
import { openaiModel, anthropicModel, geminiModel } from '@takk/modelchain/providers';

const router = createModelchain({
  models: [
    openaiModel('gpt-4o-mini', {
      cost: { costPer1kInput: 0.00015, costPer1kOutput: 0.00060 },
      keys: process.env.OPENAI_API_KEY ?? '',
    }),
    anthropicModel('claude-3-5-haiku-latest', {
      cost: { costPer1kInput: 0.00080, costPer1kOutput: 0.00400 },
      keys: process.env.ANTHROPIC_API_KEY ?? '',
    }),
    geminiModel('gemini-2.0-flash', {
      cost: { costPer1kInput: 0.00010, costPer1kOutput: 0.00040 },
      keys: process.env.GEMINI_API_KEY ?? '',
    }),
  ],
  strategy: 'cost-then-quality',
  scoring: { built: ['latency', 'token-budget'] },
  budget: { perRequestUsd: 0.02, dailyUsd: 5 },
  telemetry: { enabled: true },
});

const response = await router.complete({
  prompt: 'Summarise X in 3 bullets.',
  maxTokens: 200,
});
console.log(response.text, response.finishReason, response.usage);
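
To see how the per-1k prices interact with `perRequestUsd`, here is a rough sketch of a pre-flight estimate. The field names `costPer1kInput`, `costPer1kOutput`, and `perRequestUsd` come from the config above; the helper names, the worst-case assumption (every allowed output token is used), and the 500-token input figure are mine, and ModelChain's actual guard may compute this differently.

```typescript
// Field names mirror the quickstart config; the helpers themselves are hypothetical.
interface ModelCost {
  costPer1kInput: number;  // USD per 1,000 input tokens
  costPer1kOutput: number; // USD per 1,000 output tokens
}

/** Worst-case cost in USD, assuming the model uses every allowed output token. */
function estimateMaxCostUsd(
  cost: ModelCost,
  inputTokens: number,
  maxOutputTokens: number,
): number {
  return (
    (inputTokens / 1000) * cost.costPer1kInput +
    (maxOutputTokens / 1000) * cost.costPer1kOutput
  );
}

/** Throws when the estimate exceeds the ceiling, before any request is sent. */
function assertWithinBudget(estimatedUsd: number, perRequestUsd: number): void {
  if (estimatedUsd > perRequestUsd) {
    throw new Error(
      `Estimated cost $${estimatedUsd.toFixed(6)} exceeds per-request budget $${perRequestUsd}`,
    );
  }
}

// gpt-4o-mini prices from the quickstart, an assumed 500-token prompt, maxTokens: 200.
const gpt4oMini: ModelCost = { costPer1kInput: 0.00015, costPer1kOutput: 0.0006 };
const estimate = estimateMaxCostUsd(gpt4oMini, 500, 200);
assertWithinBudget(estimate, 0.02); // passes: about $0.000195 against a $0.02 ceiling
```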

2. Streaming

for await (const chunk of router.stream({ prompt: 'Tell me a story.' })) {
  if (chunk.type === 'text-delta') process.stdout.write(chunk.delta);
  if (chunk.type === 'finish') console.log('\nDone:', chunk.finishReason, chunk.usage);
}
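
The loop above implies a discriminated union keyed on `type`. The sketch below guesses at that shape from the two branches shown (`text-delta` with `delta`, `finish` with `finishReason` and `usage`) and adds a small helper that gathers a stream into a single string. `SketchChunk`, `collectText`, and `fakeStream` are hypothetical names; the real `CompletionChunk` type may carry more variants and fields.

```typescript
// Hypothetical shape inferred from the streaming example; the real
// CompletionChunk exported by the library may differ.
type SketchChunk =
  | { type: 'text-delta'; delta: string }
  | { type: 'finish'; finishReason: string; usage?: unknown };

/** Concatenates all text deltas and returns them with the final finish reason. */
async function collectText(
  chunks: AsyncIterable<SketchChunk>,
): Promise<{ text: string; finishReason: string | undefined }> {
  let text = '';
  let finishReason: string | undefined;
  for await (const chunk of chunks) {
    if (chunk.type === 'text-delta') {
      text += chunk.delta;
    } else {
      finishReason = chunk.finishReason;
    }
  }
  return { text, finishReason };
}

// Stand-in stream so the helper can be exercised without a provider key.
async function* fakeStream(): AsyncGenerator<SketchChunk> {
  yield { type: 'text-delta', delta: 'Once ' };
  yield { type: 'text-delta', delta: 'upon a time.' };
  yield { type: 'finish', finishReason: 'stop' };
}
```

Narrowing on `chunk.type` lets the compiler check that `delta` is only read on text chunks and `finishReason` only on the final one.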

3. Vercel AI SDK Integration

import { generateText } from 'ai';
import { toVercelAILanguageModel } from '@takk/modelchain/ai-sdk';

const { text } = await generateText({
  model: toVercelAILanguageModel(router),
  prompt: 'Hello.',
});

4. Tool Calling (normalised)

const result = await router.complete({
  prompt: 'What is the weather in Tokyo?',
  tools: [ /* ToolDefinition shape */ ],
});

How It Works (Request Flow)

  1. Select best model using chosen strategy and current health/scores
  2. Pre-flight budget guard check
  3. Dispatch request through normalised provider adapter
  4. Classify response or error
  5. Update EWMA health score and circuit breaker state
  6. Score response quality and record for future routing
  7. Emit telemetry events

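Selection, budget checks, scoring, health tracking, and telemetry all run in-process with zero runtime dependencies; only the provider request itself goes over the network.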
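
The seven steps can be compressed into a single loop. The sketch below is a simplified illustration and not ModelChain's internal code: the `Candidate` type, the `routeOnce` function, the "healthiest first, cheaper wins ties" ordering, and the 0.3 smoothing factor are all assumptions, and the circuit breaker, backoff, and quality scorers are left out to keep it short.

```typescript
// Hypothetical types and names throughout; a simplified illustration only.
interface Candidate {
  id: string;
  estimatedCostUsd: number;
  health: number; // 0..1, higher is healthier
  call: (prompt: string) => Promise<string>;
}

interface FlowResult {
  modelId: string;
  text: string;
  events: string[];
}

async function routeOnce(
  candidates: Candidate[],
  prompt: string,
  perRequestUsd: number,
  alpha = 0.3,
): Promise<FlowResult> {
  const events: string[] = [];

  // Step 1: order candidates (healthiest first, cheaper model wins ties).
  const ordered = [...candidates].sort(
    (a, b) => b.health - a.health || a.estimatedCostUsd - b.estimatedCostUsd,
  );

  for (const model of ordered) {
    // Step 2: the budget guard throws before any call is dispatched.
    if (model.estimatedCostUsd > perRequestUsd) {
      throw new Error(`budget exceeded for ${model.id}`);
    }
    try {
      // Step 3: dispatch through the (stand-in) provider adapter.
      const text = await model.call(prompt);
      // Steps 4-5: classify as success and move health toward 1.
      model.health = alpha + (1 - alpha) * model.health;
      // Step 6 would score `text` here; step 7 records an event.
      events.push(`success:${model.id}`);
      return { modelId: model.id, text, events };
    } catch {
      // Steps 4-5 on error: move health toward 0, then fail over.
      model.health = (1 - alpha) * model.health;
      events.push(`failure:${model.id}`);
    }
  }
  throw new Error('all candidate models failed');
}
```

Because each failure lowers a model's health before the loop moves on, the next request starts with an ordering that already reflects the outage.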

Installation

pnpm add @takk/modelchain
# or
npm install @takk/modelchain
# or
yarn add @takk/modelchain
# or
bun add @takk/modelchain

Optional peer dependencies are needed only if you use the richer typed adapters.

Why ModelChain Exists

ModelChain is the second building block (after KeyMesh) of a long-term family of high-reliability, open-source-first npm libraries for AI-native infrastructure that I plan to maintain through 2026–2030.

I built it because dynamic, measurable routing is the missing layer between raw LLM providers and production applications that care about cost, latency, quality, and reliability.

Links

If you run multi-provider LLM applications in Node.js, Edge, or with the Vercel AI SDK, I would love your feedback, real-world usage reports, and contributions.

Try ModelChain today and let me know which routing strategy and scorers work best for your workload.

Top comments (1)

Harjot Singh

This is exactly the layer I think most AI apps are missing. Adaptive model selection, real-time scoring, and budget guards are the trifecta: route to the cheapest model that clears the bar, measure whether it actually did, and hard-stop before a runaway loop bills you. The failover piece is underrated too; a provider hiccup shouldn't take your whole feature down. I built essentially this routing-plus-budget-guard logic into Moonshift so a full run stays a few dollars without quality cliffs. Curious how you score quality in real time: a judge model, heuristics on the output, or downstream signal? That scoring is the hard part.