
Ad Man

Originally published at github.com

A3M Router: 99.5% LLM Routing Accuracy Without ML — How We Built It

TL;DR: We built an LLM router that supports 36 providers and routes 99.5% of benchmark queries to within one tier of the correct one, with zero ML models, zero GPU, and a 19.5 KB package. It saves 61.6% on LLM costs compared to sending everything to GPT-4o.

npm install adaptive-memory-multi-model-router   # TypeScript / Node
pip install a3m-router                            # Python
npx a3m-router serve                              # OpenAI proxy at localhost:8787

GitHub: Das-rebel/adaptive-memory-multi-model-router


The Problem With LLM Routing

Every production LLM app faces the same dilemma: which model should handle this query?

  • Send everything to GPT-4o? You're burning money on "What is 2+2?"
  • Send everything to a cheap model? Legal contract review gets garbage output
  • Use RouteLLM? Now you need a GPU and a 1.5 GB+ install that includes BERT weights

The status quo is: pay too much, get too little, or add ML overhead.

Our Approach: Multi-Signal Heuristic Routing

Instead of ML, A3M Router uses 5 orthogonal signals to score query complexity from 0.0 to 1.0:

  1. Domain Detection — Regex + keyword matching for legal, medical, finance, security, architecture, and ML research domains. These queries almost always need premium models.

  2. Task Indicators — Detects coding, math, creative writing, and multilingual tasks. Each bumps complexity predictably.

  3. Query Structure — Length, clause count, and qualifier density. A 50-word question with "considering", "furthermore", "in light of" is more complex than a 5-word question.

  4. Action Verb Intensity — Expert verbs like "design", "architect", "diagnose" add +0.20. Mid verbs like "explain", "compare" add +0.10. Simple verbs like "what is" subtract 0.10.

  5. Multi-Step Detection — Queries with "and then", "also", "additionally", or numbered lists indicate compound tasks.
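
The five signals above can be combined into a single score along these lines. This is a minimal sketch, not the package's actual implementation: the keyword lists, the baseline, and every weight except the verb adjustments (+0.20, +0.10, -0.10) quoted above are assumptions chosen for illustration.

```python
import re

# Hypothetical keyword lists; the real router's patterns are not shown in this post.
DOMAIN_PATTERN = re.compile(
    r"\b(contract|liabilit\w*|diagnos\w*|clinical|portfolio|vulnerabilit\w*)\b"
)
TASK_PATTERN = re.compile(r"\b(function|code|prove|integral|poem|translate)\b")
QUALIFIERS = ("considering", "furthermore", "in light of")
EXPERT_VERBS = ("design", "architect", "diagnose")
MID_VERBS = ("explain", "compare")
SIMPLE_PHRASES = ("what is",)
MULTI_STEP = re.compile(
    r"\b(and then|also|additionally)\b|^\s*\d+[.)]\s", re.MULTILINE
)


def has_word(text: str, stems) -> bool:
    """True if any stem appears at the start of a word in text."""
    return any(re.search(rf"\b{re.escape(stem)}", text) for stem in stems)


def score_complexity(query: str) -> float:
    """Combine the five signals into a score clamped to [0.0, 1.0]."""
    text = query.lower()
    score = 0.15  # assumed baseline

    # 1. Domain detection (assumed weight)
    if DOMAIN_PATTERN.search(text):
        score += 0.40
    # 2. Task indicators (assumed weight)
    if TASK_PATTERN.search(text):
        score += 0.15
    # 3. Query structure: length plus qualifier density (assumed weights)
    score += min(len(text.split()) / 100, 0.15)
    score += 0.05 * sum(q in text for q in QUALIFIERS)
    # 4. Action verb intensity (adjustments quoted in the post)
    if has_word(text, EXPERT_VERBS):
        score += 0.20
    elif has_word(text, MID_VERBS):
        score += 0.10
    elif any(p in text for p in SIMPLE_PHRASES):
        score -= 0.10
    # 5. Multi-step detection (assumed weight)
    if MULTI_STEP.search(text):
        score += 0.10

    return max(0.0, min(1.0, score))
```

With these assumed weights, `score_complexity("What is 2+2?")` falls in the lowest range, while `score_complexity("Design a clinical trial for oncology")` lands well above 0.65.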

The Tier System

Complexity 0.0–0.19 → Free tier   (CommandCode, Ollama, local models)
Complexity 0.20–0.44 → Cheap tier  (Groq Llama-70B, DeepSeek, Cerebras)
Complexity 0.45–0.64 → Mid tier    (GPT-4o-mini, Claude Haiku, Mistral)
Complexity 0.65–1.0  → Premium tier (GPT-4o, Claude Sonnet, Grok)

The router picks the cheapest available model in the matched tier, with 2 fallback models ready.
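
A sketch of that selection step, assuming the fallbacks are simply the next-cheapest available models in the same tier (the post does not say where fallbacks come from). The `Model` type, the prices, and the function names are hypothetical.

```python
from dataclasses import dataclass

# Upper bounds (exclusive) for each tier, taken from the ranges above.
TIERS = [(0.20, "free"), (0.45, "cheap"), (0.65, "mid"), (float("inf"), "premium")]


@dataclass(frozen=True)
class Model:
    name: str
    tier: str
    cost_per_1k_tokens: float  # hypothetical price field
    available: bool = True


def tier_for(score: float) -> str:
    """Map a complexity score to its tier name."""
    for upper, tier in TIERS:
        if score < upper:
            return tier
    return "premium"


def pick_models(score: float, catalog, n_fallbacks: int = 2):
    """Return (primary, fallbacks): the cheapest available model in the
    matched tier, then the next-cheapest ones as fallbacks."""
    tier = tier_for(score)
    candidates = sorted(
        (m for m in catalog if m.tier == tier and m.available),
        key=lambda m: m.cost_per_1k_tokens,
    )
    if not candidates:
        return None, []
    return candidates[0], candidates[1 : 1 + n_fallbacks]
```

A real router would also need a policy for an empty tier, such as escalating to the next tier up; this sketch just returns `None`.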

But How Accurate Is It, Really?

We benchmarked on 200 queries across 4 complexity tiers, using the same methodology as the RouteLLM paper (arXiv:2406.18665):

Metric             A3M Router   RouteLLM (BERT)
±1 tier accuracy   99.5%        ~85%
Exact tier match   64.5%        Not published
GPU required       No           Yes
Model weights      0 KB         500 MB+
Package size       19.5 KB      1.5 GB+
Startup time       <100 ms      ~2 s

Only 1 in 200 queries missed by more than one tier.

               routed →    free    cheap    mid    premium
actual free (50)             46       4       0       0
actual medium (60)           11      47       2       0
actual complex (50)           0      24      18       8
actual expert (40)            0       1      21      18
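
Both headline numbers can be recomputed from this matrix. The sketch below assumes the row labels (free, medium, complex, expert) correspond in order to the routed tiers (free, cheap, mid, premium).

```python
# Rows: actual tier; columns: routed tier (free, cheap, mid, premium).
CONFUSION = [
    [46, 4, 0, 0],    # actual free (50)
    [11, 47, 2, 0],   # actual medium (60)
    [0, 24, 18, 8],   # actual complex (50)
    [0, 1, 21, 18],   # actual expert (40)
]


def accuracy_within(matrix, tolerance: int) -> float:
    """Share of queries routed at most `tolerance` tiers from the actual tier."""
    total = sum(sum(row) for row in matrix)
    hits = sum(
        count
        for actual, row in enumerate(matrix)
        for routed, count in enumerate(row)
        if abs(actual - routed) <= tolerance
    )
    return hits / total


exact = accuracy_within(CONFUSION, 0)       # 129 / 200
within_one = accuracy_within(CONFUSION, 1)  # 199 / 200
print(f"exact: {exact:.1%}, within one tier: {within_one:.1%}")
```

The single miss beyond one tier is the expert query routed to the cheap tier.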

What Else Does It Do?

Adaptive Memory

The router learns from your usage. Every real LLM call updates model quality scores using exponential moving average (α=0.2). If Groq consistently gives better results for your coding queries, the router learns to prefer it — no retraining needed.
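
The update rule can be sketched as follows. The α=0.2 value comes from the text above; the 0-to-1 quality scale, the neutral prior of 0.5, and the per-(model, task) keying are assumptions, since the post does not say how an outcome is measured or stored.

```python
ALPHA = 0.2  # smoothing factor quoted above


def update_quality(current: float, observed: float, alpha: float = ALPHA) -> float:
    """Exponential moving average: blend the newest observation into the score."""
    return alpha * observed + (1 - alpha) * current


def record_outcome(scores: dict, model: str, task: str, observed: float,
                   prior: float = 0.5) -> float:
    """Update the stored score for (model, task) and return the new value.

    `observed` is assumed to be a quality signal in [0, 1]; `prior` is an
    assumed starting score for pairs with no history.
    """
    key = (model, task)
    scores[key] = update_quality(scores.get(key, prior), observed)
    return scores[key]
```

With α=0.2, each new call contributes 20% of the score and older history decays geometrically, so a run of good results shifts preference within a handful of calls without any retraining step.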

Semantic Cache (No Vector DB!)

Skips duplicate LLM calls using character trigram Jaccard similarity. No embeddings model, no vector database, no GPU.

import { SemanticCache } from 'adaptive-memory-multi-model-router/cache';

const cache = new SemanticCache({
  similarityThreshold: 0.92,  // 92% similar = cache hit
  ttl: 3600000,               // 1 hour
});
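
For readers unfamiliar with the technique, character trigram Jaccard similarity can be sketched in a few lines of Python. The normalization (lowercasing, collapsing whitespace) and the function names are assumptions; only the 0.92 threshold comes from the example above.

```python
def trigrams(text: str) -> set:
    """Set of 3-character substrings after simple normalization."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < 3:
        return {normalized} if normalized else set()
    return {normalized[i : i + 3] for i in range(len(normalized) - 2)}


def jaccard(a: str, b: str) -> float:
    """Size of the trigram intersection divided by the size of the union."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_cache_hit(query: str, cached_query: str, threshold: float = 0.92) -> bool:
    """Treat the cached answer as reusable when similarity meets the threshold."""
    return jaccard(query, cached_query) >= threshold
```

Because the comparison is purely lexical, it catches near-duplicates such as casing or spacing changes, but not paraphrases that share few characters.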

Guardrails Engine

  • 17-pattern prompt injection detection — weighted regex scoring, blocks at score ≥80
  • PII detection & redaction — emails, phones, SSNs, credit cards, API keys (sk-*, AKIA*)
  • Content filtering — hate, violence, self-harm categories
  • Hallucination heuristics — empty output, suspiciously short, repetitive, echo detection
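
The first two guardrails could look roughly like this. The three injection patterns, their weights, and the PII regexes are hypothetical stand-ins, since the post does not list the router's 17 patterns; only the blocking threshold of 80 and the `sk-*` / `AKIA*` key prefixes come from the list above.

```python
import re

# Hypothetical (pattern, weight) pairs; the real engine uses 17 patterns.
INJECTION_PATTERNS = [
    (re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I), 60),
    (re.compile(r"reveal (your )?(system|hidden) prompt", re.I), 50),
    (re.compile(r"you are now\b", re.I), 30),
]
BLOCK_THRESHOLD = 80  # blocking score quoted above

# Simplified, illustrative PII patterns.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "API_KEY": re.compile(r"\b(?:sk-[A-Za-z0-9_-]{8,}|AKIA[0-9A-Z]{16})\b"),
}


def injection_score(text: str) -> int:
    """Sum the weights of every pattern that matches the text."""
    return sum(weight for pattern, weight in INJECTION_PATTERNS if pattern.search(text))


def is_blocked(text: str) -> bool:
    """Block when the combined score reaches the threshold."""
    return injection_score(text) >= BLOCK_THRESHOLD


def redact(text: str) -> str:
    """Replace each detected PII span with a bracketed label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Weighted scoring means a single weak signal (here, "you are now") does not block a request on its own, while two stronger signals together do.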

Cost Analytics

Per-provider spend tracking, daily/monthly budget alerts, and savings projections:

Monthly Queries   GPT-4o Only   A3M Router   You Save   Annualized
10K               $34           $12          $22        $261
100K              $341          $124         $218       $2,610
1M                $3,411        $1,236       $2,175     $26,100
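
The smaller rows appear to be the 1M-query row scaled linearly and rounded to whole dollars. A short sketch under that assumption (the function and field names are illustrative):

```python
# Monthly cost per 1M queries, taken from the last row of the table (USD).
GPT4O_ONLY_PER_1M = 3411
ROUTED_PER_1M = 1236


def project(monthly_queries: int) -> dict:
    """Scale the 1M-query figures linearly to another monthly volume."""
    scale = monthly_queries / 1_000_000
    monthly_saving = (GPT4O_ONLY_PER_1M - ROUTED_PER_1M) * scale
    return {
        "gpt4o_only": GPT4O_ONLY_PER_1M * scale,
        "routed": ROUTED_PER_1M * scale,
        "monthly_saving": monthly_saving,
        "annualized": monthly_saving * 12,
    }


for volume in (10_000, 100_000, 1_000_000):
    row = project(volume)
    print(f"{volume:>9,}: save ${row['monthly_saving']:,.2f}/month, "
          f"${row['annualized']:,.0f}/year")
```

This also explains why the 10K row shows $261 annualized rather than 12 × $22: the annual figure is computed from the unrounded monthly saving of $21.75.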

36 Providers, 5 Interfaces

Providers: OpenAI, Anthropic, Groq, Cerebras, DeepSeek, Mistral, Google Gemini, Ollama, Together, Fireworks, Novita, SambaNova, Replicate, OpenRouter, xAI, Cohere, AI21, Qwen, and 18 more.

Interfaces:

  • TypeScript SDK: import { A3MRouter } from 'adaptive-memory-multi-model-router/sdk'
  • Python SDK: pip install a3m-router, then from a3m import A3MRouter
  • CLI: npx a3m-router route "your query"
  • REST API: GET /v1/route and POST /v1/chat/completions
  • OpenAI-Compatible Proxy: point any OpenAI SDK at http://localhost:8787/v1

# Works with ANY OpenAI SDK — zero code changes
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8787/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="auto",  # ← intelligent routing
    messages=[{"role": "user", "content": "Hello!"}]
)

Why Not ML?

We're not anti-ML. We're anti-overhead. Here's the thing:

  1. Complexity scoring doesn't need neural networks. The features that distinguish "What is 2+2?" from "Design a clinical trial for oncology" are captured by domain keywords, query length, and action verbs. BERT doesn't add meaningful signal here.

  2. Memory > Training. A model trained on general benchmarks can't know that your users ask more coding questions than average. Adaptive memory that learns from your actual usage patterns beats a static trained model.

  3. 19.5 KB vs 1.5 GB. If you're running this in a Lambda function, an edge worker, or an embedded device, size matters. The entire router is small enough to ship in a single HTTP response.

Get Started

# Install
npm install adaptive-memory-multi-model-router

# CLI
npx a3m-router route "Explain quantum computing"
npx a3m-router serve --port 8787

# TypeScript
import { A3MRouter } from 'adaptive-memory-multi-model-router/sdk';
const router = new A3MRouter();
const decision = router.route("Review this contract");

# Python
pip install a3m-router
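import asyncio
from a3m import A3MRouter

async def main():
    # `async with` and `await` are only valid inside a coroutine
    async with A3MRouter() as router:
        decision = await router.route("Write a sort function")

asyncio.run(main())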

GitHub: Das-rebel/adaptive-memory-multi-model-router

npm: adaptive-memory-multi-model-router

PyPI: a3m-router

License: MIT


If you found this interesting, please star the repo and share it. We're a small team building practical AI infrastructure — no hype, just working code.
