A3M Router: 99.5% LLM Routing Accuracy Without ML
TL;DR: We built an LLM router that lands within one tier of the labeled tier on 99.5% of benchmark queries (±1 tier accuracy) and supports 36 providers, with zero ML models, zero GPU, and a 19.5 KB package. It saves 61.6% on LLM costs compared to sending everything to GPT-4o.
npm install adaptive-memory-multi-model-router # TypeScript / Node
pip install a3m-router # Python
npx a3m-router serve # OpenAI proxy at localhost:8787
GitHub: Das-rebel/adaptive-memory-multi-model-router
The Problem With LLM Routing
Every production LLM app faces the same dilemma: which model should handle this query?
- Send everything to GPT-4o? You're burning money on "What is 2+2?"
- Send everything to a cheap model? Legal contract review gets garbage output
- Use RouteLLM? Now you need a GPU and 1.5 GB of BERT weights
The status quo is: pay too much, get too little, or add ML overhead.
Our Approach: Multi-Signal Heuristic Routing
Instead of ML, A3M Router uses 5 orthogonal signals to score query complexity from 0.0 to 1.0:
1. Domain Detection: Regex and keyword matching for legal, medical, finance, security, architecture, and ML research domains. These queries almost always need premium models.
2. Task Indicators: Detects coding, math, creative writing, and multilingual tasks. Each raises the score by a predictable amount.
3. Query Structure: Length, clause count, and qualifier density. A 50-word question containing "considering", "furthermore", or "in light of" scores higher than a 5-word question.
4. Action Verb Intensity: Expert verbs like "design", "architect", and "diagnose" add 0.20. Mid-level verbs like "explain" and "compare" add 0.10. Simple openers like "what is" subtract 0.10.
5. Multi-Step Detection: Queries with "and then", "also", "additionally", or numbered lists indicate compound tasks.
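To make the scoring concrete, here is a minimal sketch of how five such signals could be combined. Only the verb adjustments (+0.20, +0.10, -0.10) come from the description above. The keyword lists, the baseline, and every other weight are illustrative assumptions, not the values the package ships with.

```python
import re

# Hypothetical keyword lists and weights, chosen for illustration only.
DOMAIN_PATTERN = re.compile(
    r"\b(contract|liability|diagnos\w*|clinical|portfolio|vulnerabilit\w*|architecture)\b",
    re.I,
)
TASK_PATTERN = re.compile(r"\b(code|function|implement|prove|integral|poem|translate)\b", re.I)
QUALIFIERS = ("considering", "furthermore", "in light of")
EXPERT_VERBS = ("design", "architect", "diagnose")
MID_VERBS = ("explain", "compare")
SIMPLE_OPENERS = ("what is",)
MULTI_STEP = re.compile(r"\b(and then|also|additionally)\b|^\s*\d+[.)]\s", re.I | re.M)


def score_complexity(query: str) -> float:
    """Combine five heuristic signals into a complexity score in [0.0, 1.0]."""
    text = query.lower()
    words = text.split()
    score = 0.10  # assumed baseline

    # 1. Domain detection (assumed weight)
    if DOMAIN_PATTERN.search(text):
        score += 0.40
    # 2. Task indicators (assumed weight)
    if TASK_PATTERN.search(text):
        score += 0.15
    # 3. Query structure: length and qualifier density (assumed weights)
    score += min(len(words) / 100, 0.20)
    score += 0.05 * sum(q in text for q in QUALIFIERS)
    # 4. Action verb intensity (adjustments taken from the text)
    if any(re.search(rf"\b{verb}", text) for verb in EXPERT_VERBS):
        score += 0.20
    elif any(re.search(rf"\b{verb}", text) for verb in MID_VERBS):
        score += 0.10
    elif any(text.startswith(opener) for opener in SIMPLE_OPENERS):
        score -= 0.10
    # 5. Multi-step detection (assumed weight)
    if MULTI_STEP.search(text):
        score += 0.10

    return max(0.0, min(1.0, score))
```

With these assumed weights, "What is 2+2?" scores below 0.20 and "Design a clinical trial for oncology" scores above 0.65, which matches the intuition behind the two examples used in this post.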
The Tier System
Complexity 0.0–0.19 → Free tier (CommandCode, Ollama, local models)
Complexity 0.20–0.44 → Cheap tier (Groq Llama-70B, DeepSeek, Cerebras)
Complexity 0.45–0.64 → Mid tier (GPT-4o-mini, Claude Haiku, Mistral)
Complexity 0.65–1.0 → Premium tier (GPT-4o, Claude Sonnet, Grok)
The router picks the cheapest available model in the matched tier, with 2 fallback models ready.
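A rough sketch of the tier lookup and model selection follows. The thresholds are the ones listed above. The `Model` class, the price field, and the choice to take fallbacks from the same tier are assumptions made for illustration; the source does not say where the fallbacks come from.

```python
from dataclasses import dataclass

# Lower bound of each tier, checked from highest to lowest (values from the tier list above).
TIER_FLOORS = [(0.65, "premium"), (0.45, "mid"), (0.20, "cheap"), (0.0, "free")]


@dataclass
class Model:
    name: str
    tier: str
    cost_per_1k_tokens: float  # hypothetical price, used only for ordering
    available: bool = True


def tier_for(score: float) -> str:
    """Map a complexity score to its tier name."""
    for floor, tier in TIER_FLOORS:
        if score >= floor:
            return tier
    return "free"


def pick_models(score: float, catalog: list[Model]) -> tuple[Model, list[Model]]:
    """Return the cheapest available model in the matched tier plus up to 2 fallbacks."""
    tier = tier_for(score)
    candidates = sorted(
        (m for m in catalog if m.tier == tier and m.available),
        key=lambda m: m.cost_per_1k_tokens,
    )
    if not candidates:
        raise LookupError(f"no available model in tier {tier!r}")
    return candidates[0], candidates[1:3]
```

Calling `pick_models(0.30, catalog)` returns the cheapest available cheap-tier entry first, with the next two cheapest as fallbacks.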
But How Accurate Is It, Really?
We benchmarked on 200 queries across 4 complexity tiers, same methodology as the RouteLLM paper (arXiv:2406.18665):
| Metric | A3M Router | RouteLLM (BERT) |
|---|---|---|
| ±1 tier accuracy | 99.5% | ~85% |
| Exact tier match | 64.5% | Not published |
| GPU required | No | Yes |
| Model weights | 0 KB | 500 MB+ |
| Package size | 19.5 KB | 1.5 GB+ |
| Startup time | <100 ms | ~2 s |
Only 1 in 200 queries missed by more than one tier.
| Actual ↓ / Routed → | Free | Cheap | Mid | Premium |
|---|---|---|---|---|
| Free (50) | 46 | 4 | 0 | 0 |
| Medium (60) | 11 | 47 | 2 | 0 |
| Complex (50) | 0 | 24 | 18 | 8 |
| Expert (40) | 0 | 1 | 21 | 18 |
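The headline numbers can be recomputed directly from the confusion matrix. The short script below does that arithmetic, assuming the four labeled tiers (free, medium, complex, expert) line up in order with the four routed tiers (free, cheap, mid, premium).

```python
# Rows: labeled tier (free, medium, complex, expert).
# Columns: routed tier (free, cheap, mid, premium).
# Counts copied from the confusion matrix above.
CONFUSION = [
    [46, 4, 0, 0],
    [11, 47, 2, 0],
    [0, 24, 18, 8],
    [0, 1, 21, 18],
]

size = len(CONFUSION)
total = sum(sum(row) for row in CONFUSION)
exact = sum(CONFUSION[i][i] for i in range(size))
within_one = sum(
    CONFUSION[i][j] for i in range(size) for j in range(size) if abs(i - j) <= 1
)
# Misses split by direction: routed to a cheaper tier vs. a pricier tier than labeled.
routed_cheaper = sum(CONFUSION[i][j] for i in range(size) for j in range(i))
routed_pricier = sum(CONFUSION[i][j] for i in range(size) for j in range(i + 1, size))

print(f"exact match:     {exact}/{total} = {exact / total:.1%}")
print(f"within one tier: {within_one}/{total} = {within_one / total:.1%}")
print(f"routed cheaper than labeled: {routed_cheaper}, pricier: {routed_pricier}")
```

This reproduces 129/200 (64.5%) exact matches and 199/200 (99.5%) within one tier. The single larger miss is the expert query routed to the cheap tier.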
What Else Does It Do?
Adaptive Memory
The router learns from your usage. Every real LLM call updates model quality scores using exponential moving average (α=0.2). If Groq consistently gives better results for your coding queries, the router learns to prefer it — no retraining needed.
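The update rule itself is a one-liner. Below is a minimal sketch with α=0.2. The score range, the `(model, task)` key layout, and the way an observed outcome is turned into a number are assumptions, since the post does not describe them.

```python
ALPHA = 0.2  # smoothing factor from the text


def update_quality(previous: float, observed: float, alpha: float = ALPHA) -> float:
    """Exponential moving average: new = alpha * observed + (1 - alpha) * previous."""
    return alpha * observed + (1 - alpha) * previous


# Hypothetical per-(model, task) quality scores in [0, 1].
quality = {("groq-llama-70b", "coding"): 0.50}

key = ("groq-llama-70b", "coding")
for outcome in (1.0, 1.0, 1.0):  # three good results in a row
    quality[key] = update_quality(quality[key], outcome)

print(round(quality[key], 3))  # 0.5 -> 0.6 -> 0.68 -> 0.744
```

With α=0.2, each new observation carries 20% of the weight, so a score moves steadily toward recent results without swinging on a single outlier.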
Semantic Cache (No Vector DB!)
Skips duplicate LLM calls using character trigram Jaccard similarity. No embeddings model, no vector database, no GPU.
import { SemanticCache } from 'adaptive-memory-multi-model-router/cache';
const cache = new SemanticCache({
  similarityThreshold: 0.92, // similarity of 0.92 or higher counts as a cache hit
  ttl: 3600000, // 1 hour, in milliseconds
});
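For readers unfamiliar with the technique, here is a small sketch of character-trigram Jaccard similarity and a threshold lookup. It illustrates the general idea, not the package's internals: the normalization step and the `lookup` helper are assumptions.

```python
from typing import Optional


def trigrams(text: str) -> set[str]:
    """Character trigrams of the lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def jaccard(a: str, b: str) -> float:
    """Size of the trigram intersection divided by the size of the union."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0  # two empty inputs are treated as identical
    return len(ta & tb) / len(ta | tb)


def lookup(query: str, cache: dict[str, str], threshold: float = 0.92) -> Optional[str]:
    """Return the cached response for the most similar stored query, if similar enough."""
    best = max(cache, key=lambda stored: jaccard(query, stored), default=None)
    if best is not None and jaccard(query, best) >= threshold:
        return cache[best]
    return None
```

Because the measure works on characters, it catches near-duplicate wording (case, spacing, trailing punctuation) rather than paraphrases, which is the trade-off for skipping an embeddings model.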
Guardrails Engine
- 17-pattern prompt injection detection: weighted regex scoring, blocks at score ≥80
- PII detection & redaction: emails, phones, SSNs, credit cards, API keys (`sk-*`, `AKIA*`)
- Content filtering: hate, violence, self-harm categories
- Hallucination heuristics: empty output, suspiciously short, repetitive, echo detection
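The first two guardrails can be sketched in a few lines. The block threshold of 80 is from the list above. The three injection patterns, their weights, and the simplified PII regexes are made up for illustration; the real engine uses 17 patterns that are not listed in this post.

```python
import re

BLOCK_THRESHOLD = 80  # from the text: block at score >= 80

# Illustrative patterns with hypothetical weights.
INJECTION_PATTERNS = [
    (re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I), 60),
    (re.compile(r"reveal (your |the )?system prompt", re.I), 50),
    (re.compile(r"\byou are now\b", re.I), 30),
]

# Simplified PII patterns; production-grade detection needs stricter rules.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "API_KEY": re.compile(r"\b(?:sk-[A-Za-z0-9]{16,}|AKIA[A-Z0-9]{16})\b"),
}


def injection_score(prompt: str) -> int:
    """Sum the weights of every pattern that matches the prompt."""
    return sum(weight for pattern, weight in INJECTION_PATTERNS if pattern.search(prompt))


def should_block(prompt: str) -> bool:
    return injection_score(prompt) >= BLOCK_THRESHOLD


def redact(text: str) -> str:
    """Replace each PII match with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Weighted scoring means a single weak signal (here, "you are now") does not block a prompt on its own, while two stronger signals together cross the threshold.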
Cost Analytics
Per-provider spend tracking, daily/monthly budget alerts, and savings projections:
| Monthly Queries | GPT-4o Only | A3M Router | You Save | Annualized |
|---|---|---|---|---|
| 10K | $34 | $12 | $22 | $261 |
| 100K | $341 | $124 | $218 | $2,610 |
| 1M | $3,411 | $1,236 | $2,175 | $26,100 |
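The rows above scale linearly with volume, so the projection can be reproduced from the per-query costs implied by the 1M row. The sketch below assumes flat per-query pricing (no volume discounts) and uses hypothetical function and field names.

```python
# Per-query costs in USD implied by the 1M-query row of the table above.
GPT4O_PER_QUERY = 3411 / 1_000_000
A3M_PER_QUERY = 1236 / 1_000_000


def project(monthly_queries: int) -> dict[str, float]:
    """Scale the per-query costs linearly to a monthly volume."""
    baseline = monthly_queries * GPT4O_PER_QUERY
    routed = monthly_queries * A3M_PER_QUERY
    saved = baseline - routed
    return {
        "gpt4o_only": baseline,
        "a3m_router": routed,
        "saved": saved,
        "annualized": saved * 12,
    }


for volume in (10_000, 100_000, 1_000_000):
    row = project(volume)
    print(volume, {name: round(value, 2) for name, value in row.items()})
```

At 10K queries this gives $21.75 saved per month and $261 per year, which is where the rounded $22 and $261 in the first row come from.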
36 Providers, 5 Interfaces
Providers: OpenAI, Anthropic, Groq, Cerebras, DeepSeek, Mistral, Google Gemini, Ollama, Together, Fireworks, Novita, SambaNova, Replicate, OpenRouter, xAI, Cohere, AI21, Qwen, and 18 more.
Interfaces:
- TypeScript SDK: `import { A3MRouter } from 'adaptive-memory-multi-model-router/sdk'`
- Python SDK: `pip install a3m-router`, then `from a3m import A3MRouter`
- CLI: `npx a3m-router route "your query"`
- REST API: `GET /v1/route` and `POST /v1/chat/completions`
- OpenAI-Compatible Proxy: point any OpenAI SDK at `http://localhost:8787/v1`
# Works with any OpenAI SDK: only the base URL changes
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8787/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="auto",  # let the router pick the model
    messages=[{"role": "user", "content": "Hello!"}],
)
Why Not ML?
We're not anti-ML. We're anti-overhead. Here's the thing:
Complexity scoring doesn't need neural networks. The features that distinguish "What is 2+2?" from "Design a clinical trial for oncology" are captured by domain keywords, query length, and action verbs. BERT doesn't add meaningful signal here.
Memory > Training. A model trained on general benchmarks can't know that your users ask more coding questions than average. Adaptive memory that learns from your actual usage patterns beats a static trained model.
19.5 KB vs 1.5 GB. If you're running this in a Lambda function, an edge worker, or an embedded device, size matters. Our entire router fits in a single HTTP request.
Get Started
# Install
npm install adaptive-memory-multi-model-router
# CLI
npx a3m-router route "Explain quantum computing"
npx a3m-router serve --port 8787
# TypeScript
import { A3MRouter } from 'adaptive-memory-multi-model-router/sdk';
const router = new A3MRouter();
const decision = router.route("Review this contract");
# Python
pip install a3m-router

import asyncio
from a3m import A3MRouter

async def main():
    async with A3MRouter() as router:
        decision = await router.route("Write a sort function")

asyncio.run(main())
GitHub: Das-rebel/adaptive-memory-multi-model-router
npm: adaptive-memory-multi-model-router
PyPI: a3m-router
License: MIT
If you found this interesting, please star the repo and share it. We're a small team building practical AI infrastructure — no hype, just working code.