DEV Community

Nicholas Blanchard
I Built an LLM Gateway That Learns Which Model to Use — Here's How the Routing Works

Last month I was building an app that used three different LLM providers: OpenAI for coding tasks, Anthropic for long-form writing, and a local Ollama model for quick throwaway queries. The routing logic looked something like this:

let model;
if (task === "coding") {
  model = "gpt-4o";
} else if (task === "writing" && length > 2000) {
  model = "claude-sonnet-4-6";
} else {
  model = "llama3";
}

It worked, but it was fragile. Every time a provider shipped a new model or changed pricing, I had to update the routing logic. I had no visibility into which model was actually performing better. And my API keys were scattered across three different .env files.

So I built Provara — an open-source LLM gateway that handles all of this automatically.

What It Does

Provara sits between your app and your LLM providers. You point your existing OpenAI SDK at it, and it handles the rest:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://your-provara-instance/v1",
  apiKey: "pvra_your_token_here",
});

// That's it. Provara routes this to the best available model.
const response = await client.chat.completions.create({
  model: "",  // empty = let the router decide
  messages: [{ role: "user", content: "Write a quicksort in Python" }],
});

Two lines changed. Same SDK. Same interface. But now your requests are being intelligently routed.

How the Routing Actually Works

This is the part I'm most proud of. When a request arrives without a specific model:

Step 1: Classification. A fast heuristic classifier analyzes the prompt and assigns a task type (coding, creative, summarization, Q&A, general) and complexity level (simple, medium, complex). It looks for code blocks, technical keywords, instruction patterns, and message length. When the heuristic is ambiguous, it falls back to an LLM-based classifier using the cheapest available model.
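To make the classification step concrete, here's a minimal sketch of what a heuristic classifier like this could look like. The type names, keyword lists, and length thresholds are all illustrative assumptions, not Provara's actual implementation:

```typescript
// Hypothetical heuristic classifier: maps a prompt to a task type and
// complexity level using simple pattern checks. Thresholds are made up.
type TaskType = "coding" | "creative" | "summarization" | "qa" | "general";
type Complexity = "simple" | "medium" | "complex";

function classify(prompt: string): { taskType: TaskType; complexity: Complexity } {
  const hasCodeBlock = /```|\bfunction\b|\bdef\b|\bclass\b/.test(prompt);
  const codingKeywords = /\b(bug|compile|refactor|algorithm|regex|python|javascript|typescript|sql)\b/i.test(prompt);
  const summaryCue = /\b(summarize|tl;dr|key points)\b/i.test(prompt);
  const questionCue = /\?\s*$/.test(prompt.trim());

  const taskType: TaskType =
    hasCodeBlock || codingKeywords ? "coding"
    : summaryCue ? "summarization"
    : questionCue ? "qa"
    : "general";

  // Longer prompts tend to carry more instructions, so length bumps complexity.
  const complexity: Complexity =
    prompt.length > 2000 ? "complex" : prompt.length > 400 ? "medium" : "simple";

  return { taskType, complexity };
}
```

When none of these patterns fire confidently, that's where the LLM-based fallback classifier comes in.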

Step 2: Cell lookup. The task type + complexity combination forms a "routing cell" — like coding/complex or qa/simple. Each cell independently tracks which model performs best.

Step 3: Adaptive routing. If the cell has enough quality data (5+ feedback scores), the router picks the highest-scoring model using a weighted composite of quality, cost, and latency. The weights are configurable per API token — you can optimize for cost, quality, or a custom balance.

Step 4: Fallback. If there's not enough data yet, it falls back to the cheapest available model. This is intentional — you don't want to burn money on premium models until you know they're actually better for that task type.
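Steps 2 through 4 can be sketched as a single selection function. The 5-sample threshold and the quality/cost/latency weighting mirror the description above, but the data shapes and normalization are my own assumptions, not Provara's code:

```typescript
// Illustrative cell-based routing decision: pick the best-scoring model if
// enough feedback exists, otherwise fall back to the cheapest.
interface ModelStats {
  model: string;
  quality: number;      // average feedback score, normalized to 0..1
  costPer1kUsd: number; // cost per 1k tokens
  latencyMs: number;    // average latency
  samples: number;      // number of feedback scores collected
}

interface Weights { quality: number; cost: number; latency: number }

function pickModel(cellStats: ModelStats[], w: Weights): string {
  const rated = cellStats.filter((s) => s.samples >= 5);
  if (rated.length === 0) {
    // Cold start: not enough quality data, route to the cheapest model.
    return cellStats.reduce((a, b) => (a.costPer1kUsd <= b.costPer1kUsd ? a : b)).model;
  }
  // Weighted composite: higher quality is better; lower cost and latency are better.
  const maxCost = Math.max(...rated.map((s) => s.costPer1kUsd));
  const maxLat = Math.max(...rated.map((s) => s.latencyMs));
  const score = (s: ModelStats) =>
    w.quality * s.quality +
    w.cost * (1 - s.costPer1kUsd / maxCost) +
    w.latency * (1 - s.latencyMs / maxLat);
  return rated.reduce((a, b) => (score(a) >= score(b) ? a : b)).model;
}
```

A token configured with `{ quality: 1, cost: 0, latency: 0 }` would always chase the best-rated model in each cell; a cost-weighted token would tolerate a quality dip for cheaper routes.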

The key insight: the system starts dumb and gets smarter over time. No configuration needed. Quality data accumulates from two sources:

  1. An LLM-as-judge that automatically scores a sample of responses (configurable sample rate)
  2. Manual ratings from the dashboard where you can score any response 1-5

After a few hundred requests, the routing matrix starts filling in. Simple Q&A goes to the cheapest model. Complex coding goes to the premium one. And it's all based on real measured quality, not assumptions.

A/B Testing Built In

Before the adaptive router has enough data, you can run explicit A/B tests. Create a test, assign two models as variants, scope it to a routing cell, and Provara splits traffic by weight:

curl -X POST http://localhost:4000/v1/ab-tests \
  -H "Content-Type: application/json" \
  -d '{
    "name": "GPT-4o vs Claude for coding",
    "taskType": "coding",
    "complexity": "complex",
    "variants": [
      { "provider": "openai", "model": "gpt-4o", "weight": 1 },
      { "provider": "anthropic", "model": "claude-sonnet-4-6", "weight": 1 }
    ]
  }'

The dashboard shows per-variant stats and a winner recommendation once you have enough feedback. And the A/B test results feed directly into the adaptive routing engine — so the learning is continuous.
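For illustration, splitting traffic by weight comes down to a weighted random pick over the variants. The variant shape matches the API payload above; the selection logic itself is a guess at one reasonable implementation, not necessarily Provara's:

```typescript
// Hypothetical weight-proportional variant selection for an A/B test.
interface Variant { provider: string; model: string; weight: number }

function pickVariant(variants: Variant[], rand: () => number = Math.random): Variant {
  if (variants.length === 0) throw new Error("no variants configured");
  const total = variants.reduce((sum, v) => sum + v.weight, 0);
  let r = rand() * total; // a point in [0, total)
  let chosen: Variant = variants[0]!;
  for (const v of variants) {
    chosen = v;
    r -= v.weight;
    if (r <= 0) break; // landed inside this variant's slice
  }
  return chosen;
}
```

With two variants at weight 1 each, this yields a 50/50 split; bumping one weight to 3 shifts it to 75/25 without touching the other variant.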

The Dashboard

This is where it gets fun. Every request through the gateway is logged with full metadata, and the dashboard gives you visibility into everything:

Request Logs — Browse every request with prompt, response, tokens, latency, cost, and routing decision. Click any request to see the full detail. Hit "Replay" to send the same prompt to a different model and see a side-by-side comparison — even with a word-level diff view.

Analytics — Time-series charts for request volume, cost breakdown by provider, and latency percentiles (p50/p95/p99). See exactly where your money is going and which models are slow.

Quality Monitoring — Track quality scores over time. See the adaptive routing matrix filling in. Configure the LLM judge. Rate responses manually.

Alerting — Set thresholds for spend, latency, or request volume. Get webhook notifications when something spikes.

Prompt Management — Version your prompt templates, use {{variables}} for dynamic content, publish specific versions, and resolve them by name via the API.
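The `{{variables}}` substitution is conceptually a one-liner. This sketch assumes unknown variables are left in place rather than erased; the real resolver's behavior may differ:

```typescript
// Minimal sketch of {{variable}} substitution in a prompt template.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    name in vars ? vars[name] : match // leave unknown variables untouched
  );
}
```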

Guardrails — Built-in PII detection (SSN, credit card, email, phone) with block/redact/flag actions. Input is redacted before it reaches the provider. The playground shows a warning when a guardrail fires.
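The redact action can be pictured as a pattern-matching pass over the input before it leaves your infrastructure. These regexes are deliberately simplified assumptions; production detectors also validate formats and checksums (e.g. Luhn for card numbers):

```typescript
// Illustrative input-redaction pass for the PII types listed above.
// Patterns are simplified for the sketch, not Provara's actual detectors.
const piiPatterns: Record<string, RegExp> = {
  ssn: /\b\d{3}-\d{2}-\d{4}\b/g,
  creditCard: /\b(?:\d[ -]?){13,16}\b/g,
  email: /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g,
  phone: /\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/g,
};

function redact(input: string): string {
  let out = input;
  for (const [label, pattern] of Object.entries(piiPatterns)) {
    out = out.replace(pattern, `[REDACTED_${label.toUpperCase()}]`);
  }
  return out;
}
```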


The Stack

For the technically curious:

  • Gateway: Hono (lightweight, fast, runs on Node)
  • Dashboard: Next.js + Tailwind CSS
  • Database: SQLite via Drizzle ORM (yes, SQLite — it's plenty fast for this)
  • Monorepo: Turborepo with npm workspaces
  • Auth: Google + GitHub OAuth, session-based, role-based access
  • Encryption: AES-256-GCM for API keys at rest
  • Providers: OpenAI, Anthropic, Google, Mistral, xAI, Z.ai, Ollama, plus any OpenAI-compatible endpoint

The whole thing deploys with docker compose up -d. Gateway on port 4000, dashboard on port 3000. Five minutes from clone to routing requests.

Self-Hosted by Default

This was a non-negotiable design decision. Your prompts, responses, and API keys never leave your infrastructure. The managed version at provara.xyz exists for convenience, but the self-hosted experience is first-class.

No telemetry. No phone-home. No "please create an account to use the open-source version." Clone it, run it, own it.

What I Learned Building It

A few things surprised me:

  1. The cold start problem is real. When the adaptive router has no data, everything routes to the cheapest model — including complex prompts that deserve a premium model. Documenting this clearly was as important as building the feature. Users need to understand that the system gets smarter over time, not that it's broken on day one.

  2. Streaming + guardrails don't mix. You can't un-send a streamed response. Input guardrails work perfectly (redact before sending to the provider), but output guardrails on streaming responses are architecturally impossible without buffering the entire response first. I chose to document the limitation rather than degrade the streaming experience.

  3. SQLite is underrated. For a single-instance gateway handling thousands of requests, SQLite with WAL mode is more than enough. No Postgres to manage, no connection pooling to configure. The simplicity pays for itself in deployment and maintenance.

Try It

git clone https://github.com/syndicalt/provara.git
cd provara && cp .env.example .env
docker compose up -d

I'm actively building this and would love feedback. What features would make this useful for your workflow? What's missing? Drop a comment or open an issue on GitHub.


Provara is MIT licensed. Star the repo if this is useful to you — it helps more than you'd think.
