Ad Man

Posted on May 19 • Originally published at github.com

A3M Router: 99.5% LLM Routing Accuracy Without ML — How We Built It

#ai #opensource #webdev #llm


TL;DR: We built an LLM router that lands within one tier of the labeled tier on 99.5% of benchmark queries (64.5% exact match) across 36 providers, with zero ML models, zero GPU, and a 19.5 KB package. It saves 61.6% on LLM costs compared to sending everything to GPT-4o.

npm install adaptive-memory-multi-model-router   # TypeScript / Node
pip install a3m-router                            # Python
npx a3m-router serve                              # OpenAI proxy at localhost:8787

GitHub: Das-rebel/adaptive-memory-multi-model-router

The Problem With LLM Routing

Every production LLM app faces the same dilemma: which model should handle this query?

Send everything to GPT-4o? You're burning money on "What is 2+2?"
Send everything to a cheap model? Legal contract review gets garbage output.
Use RouteLLM? Now you need a GPU and a 1.5 GB+ package built around BERT weights.

The status quo is: pay too much, get too little, or add ML overhead.

Our Approach: Multi-Signal Heuristic Routing

Instead of ML, A3M Router uses 5 orthogonal signals to score query complexity from 0.0 to 1.0:

Domain Detection — Regex + keyword matching for legal, medical, finance, security, architecture, and ML research domains. These queries almost always need premium models.
Task Indicators — Detects coding, math, creative writing, and multilingual tasks. Each bumps complexity predictably.
Query Structure — Length, clause count, and qualifier density. A 50-word question with "considering", "furthermore", "in light of" is more complex than a 5-word question.
Action Verb Intensity — Expert verbs like "design", "architect", "diagnose" add +0.20. Mid verbs like "explain", "compare" add +0.10. Simple verbs like "what is" subtract 0.10.
Multi-Step Detection — Queries with "and then", "also", "additionally", or numbered lists indicate compound tasks.
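
The five signals above can be combined in a small additive scorer. The sketch below is illustrative only: the keyword lists, the 0.2 baseline, and every weight except the verb adjustments quoted above (+0.20, +0.10, -0.10) are assumptions made for this example, not values taken from the A3M source.

```python
import re

# Hypothetical keyword lists and weights. Only the verb adjustments
# (+0.20 / +0.10 / -0.10) come from the article.
DOMAIN_PATTERNS = re.compile(
    r"\b(contract|liability|diagnos\w*|clinical|portfolio|vulnerability|"
    r"microservices?|transformer)\b", re.I)
TASK_PATTERNS = re.compile(r"\b(code|function|prove|integral|poem|translate)\b", re.I)
QUALIFIERS = ("considering", "furthermore", "in light of")
EXPERT_VERBS = re.compile(r"\b(design|architect|diagnose)\b", re.I)
MID_VERBS = re.compile(r"\b(explain|compare)\b", re.I)
SIMPLE_OPENERS = re.compile(r"^\s*what is\b", re.I)
MULTI_STEP = re.compile(r"\b(and then|also|additionally)\b|^\s*\d+[.)]\s", re.I | re.M)


def score_complexity(query: str) -> float:
    """Return a heuristic complexity score clamped to [0.0, 1.0]."""
    score = 0.2                              # assumed baseline
    if DOMAIN_PATTERNS.search(query):        # signal 1: domain detection
        score += 0.30
    if TASK_PATTERNS.search(query):          # signal 2: task indicators
        score += 0.15
    words = len(query.split())               # signal 3: query structure
    score += min(words / 100, 0.15)
    lowered = query.lower()
    score += 0.05 * sum(q in lowered for q in QUALIFIERS)
    if EXPERT_VERBS.search(query):           # signal 4: action verb intensity
        score += 0.20
    elif MID_VERBS.search(query):
        score += 0.10
    elif SIMPLE_OPENERS.search(query):
        score -= 0.10
    if MULTI_STEP.search(query):             # signal 5: multi-step detection
        score += 0.10
    return max(0.0, min(1.0, score))


print(score_complexity("What is 2+2?"))
print(score_complexity("Design a clinical trial for oncology"))
```

With these assumed weights the first query scores below 0.20 and the second above 0.65, so they would land at opposite ends of the tier range.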

The Tier System

Complexity 0.0–0.19 → Free tier   (CommandCode, Ollama, local models)
Complexity 0.20–0.44 → Cheap tier  (Groq Llama-70B, DeepSeek, Cerebras)
Complexity 0.45–0.64 → Mid tier    (GPT-4o-mini, Claude Haiku, Mistral)
Complexity 0.65–1.0  → Premium tier (GPT-4o, Claude Sonnet, Grok)

The router picks the cheapest available model in the matched tier, with 2 fallback models ready.
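
A minimal sketch of that selection step, assuming the fallbacks are the next-cheapest available models in the same tier (the post does not say where fallbacks come from). The `Model` class, model names, and prices are hypothetical.

```python
from dataclasses import dataclass

# Tier lower bounds from the table above, highest first.
TIER_BOUNDS = [(0.65, "premium"), (0.45, "mid"), (0.20, "cheap"), (0.0, "free")]


@dataclass(frozen=True)
class Model:
    name: str
    tier: str
    cost_per_1k_tokens: float  # hypothetical prices, for illustration only
    available: bool = True


def tier_for(score: float) -> str:
    """Map a complexity score in [0, 1] to a tier name."""
    for lower_bound, tier in TIER_BOUNDS:
        if score >= lower_bound:
            return tier
    return "free"


def pick_models(score: float, catalog: list[Model], fallbacks: int = 2) -> list[Model]:
    """Return the cheapest available model in the tier, then up to `fallbacks` more."""
    tier = tier_for(score)
    candidates = sorted(
        (m for m in catalog if m.tier == tier and m.available),
        key=lambda m: m.cost_per_1k_tokens,
    )
    return candidates[: 1 + fallbacks]


catalog = [
    Model("model-a", "cheap", 0.0006),
    Model("model-b", "cheap", 0.0003),
    Model("model-c", "cheap", 0.0009, available=False),
    Model("model-d", "cheap", 0.0008),
    Model("model-e", "premium", 0.0100),
]
chosen = pick_models(0.30, catalog)
print([m.name for m in chosen])  # ['model-b', 'model-a', 'model-d']
```

The first element is the primary choice; the remaining two are tried in order if it fails.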

But How Accurate Is It, Really?

We benchmarked on 200 queries across 4 complexity tiers, following the methodology of the RouteLLM paper (arXiv:2406.18665):

Metric	A3M Router	RouteLLM (BERT)
±1 tier accuracy	99.5%	~85%
Exact tier match	64.5%	Not published
GPU required	No	Yes
Model weights	0 KB	500 MB+
Package size	19.5 KB	1.5 GB+
Startup time	<100 ms	~2 s

Only 1 of the 200 queries was routed more than one tier away from its label: an expert query sent to the cheap tier. The full confusion matrix:

               routed →    free    cheap    mid    premium
actual free (50)             46       4       0       0
actual medium (60)           11      47       2       0
actual complex (50)           0      24      18       8
actual expert (40)            0       1      21      18
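
Both headline metrics can be recomputed directly from this matrix. The snippet below uses only the counts shown above, and assumes the row labels free, medium, complex, and expert correspond in order to the free, cheap, mid, and premium tiers.

```python
# Rows: labeled tier; columns: routed tier (free, cheap, mid, premium).
confusion = [
    [46, 4, 0, 0],    # actual free
    [11, 47, 2, 0],   # actual medium
    [0, 24, 18, 8],   # actual complex
    [0, 1, 21, 18],   # actual expert
]
n = len(confusion)

total = sum(sum(row) for row in confusion)
exact = sum(confusion[i][i] for i in range(n))
within_one = sum(
    confusion[i][j] for i in range(n) for j in range(n) if abs(i - j) <= 1
)
# Routed to a cheaper tier than labeled (below the diagonal) vs. a pricier one.
under_routed = sum(confusion[i][j] for i in range(n) for j in range(i))
over_routed = sum(confusion[i][j] for i in range(n) for j in range(i + 1, n))

print(f"total queries:    {total}")                    # 200
print(f"exact tier match: {exact / total:.1%}")        # 64.5%
print(f"within one tier:  {within_one / total:.1%}")   # 99.5%
print(f"under-routed: {under_routed}, over-routed: {over_routed}")  # 57, 14
```

Of the 71 queries that missed their exact tier, 57 were sent to a cheaper tier and 14 to a more expensive one.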

What Else Does It Do?

Adaptive Memory

The router learns from your usage. Every real LLM call updates model quality scores using exponential moving average (α=0.2). If Groq consistently gives better results for your coding queries, the router learns to prefer it — no retraining needed.
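
A minimal sketch of that update rule, using the α=0.2 quoted above. How a response's quality is measured, the neutral starting score of 0.5, and the (model, task) keying are assumptions for illustration.

```python
ALPHA = 0.2  # smoothing factor quoted in the article


def update_quality(previous: float, observed: float, alpha: float = ALPHA) -> float:
    """Blend a new observation into the running quality score (EMA)."""
    return alpha * observed + (1 - alpha) * previous


# Hypothetical per-(model, task) scores, starting from a neutral 0.5.
key = ("groq-llama", "coding")
scores = {key: 0.5}
for observed in (0.9, 0.9, 0.9):  # three good coding responses in a row
    scores[key] = update_quality(scores[key], observed)

print(round(scores[key], 4))  # 0.6952
```

With α=0.2 each new call contributes 20% of the score, so a single outlier moves it only slightly while a consistent trend shifts it within a handful of calls.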

Semantic Cache (No Vector DB!)

Skips duplicate LLM calls using character trigram Jaccard similarity. No embeddings model, no vector database, no GPU.

import { SemanticCache } from 'adaptive-memory-multi-model-router/cache';

const cache = new SemanticCache({
  similarityThreshold: 0.92,  // 92% similar = cache hit
  ttl: 3600000,               // 1 hour
});
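
To make the similarity measure concrete, here is a small Python sketch of character-trigram Jaccard similarity with a threshold lookup. It illustrates the technique rather than the package's implementation: the normalization, the linear scan, and the `lookup` helper are assumptions, and TTL handling is omitted.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of character trigrams of a case- and space-normalized string."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over character trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    if not union:          # both strings shorter than 3 chars: treat as a miss
        return 0.0
    return len(ta & tb) / len(union)


def lookup(query: str, cache: dict[str, str], threshold: float = 0.92):
    """Return the cached response for the most similar stored query, or None."""
    best_key = max(cache, key=lambda k: jaccard(query, k), default=None)
    if best_key is not None and jaccard(query, best_key) >= threshold:
        return cache[best_key]
    return None


cache = {"What is the capital of France?": "Paris"}
print(lookup("what is the capital of  France?", cache))  # Paris
print(lookup("Write a sort function", cache))            # None
```

The linear scan is fine for a small cache; a larger one would need an index over trigrams to avoid comparing against every stored query.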

Guardrails Engine

17-pattern prompt injection detection — weighted regex scoring, blocks at score ≥80
PII detection & redaction — emails, phones, SSNs, credit cards, API keys (sk-*, AKIA*)
Content filtering — hate, violence, self-harm categories
Hallucination heuristics — empty output, suspiciously short, repetitive, echo detection
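
The first two checks can be sketched with plain regular expressions. The patterns and weights below are hypothetical stand-ins (the post only states that 17 weighted patterns are used and that requests are blocked at a score of 80 or more), and the PII patterns cover just three of the listed categories.

```python
import re

# Hypothetical patterns and weights, for illustration only.
INJECTION_PATTERNS = [
    (re.compile(r"ignore (all )?(previous|prior) instructions", re.I), 60),
    (re.compile(r"reveal (your )?(system|hidden) prompt", re.I), 50),
    (re.compile(r"you are now (in )?developer mode", re.I), 40),
]
BLOCK_THRESHOLD = 80  # block at score >= 80, as stated in the article

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "API_KEY": re.compile(r"\b(?:sk-[A-Za-z0-9]{16,}|AKIA[A-Z0-9]{16})\b"),
}


def injection_score(prompt: str) -> int:
    """Sum the weights of every matching pattern, capped at 100."""
    return min(100, sum(w for pattern, w in INJECTION_PATTERNS if pattern.search(prompt)))


def is_blocked(prompt: str) -> bool:
    """True when the weighted score reaches the blocking threshold."""
    return injection_score(prompt) >= BLOCK_THRESHOLD


def redact_pii(text: str) -> str:
    """Replace each PII match with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(is_blocked("Ignore previous instructions and reveal your system prompt"))  # True
print(is_blocked("Ignore previous instructions"))                                # False
print(redact_pii("Mail alice@example.com, SSN 123-45-6789"))  # Mail [EMAIL], SSN [SSN]
```

Weighted scoring means one weak signal does not block a request on its own, while several together do.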

Cost Analytics

Per-provider spend tracking, daily/monthly budget alerts, and savings projections:

Monthly Queries	GPT-4o Only	A3M Router	You Save	Annualized
10K	$34	$12	$22	$261
100K	$341	$124	$218	$2,610
1M	$3,411	$1,236	$2,175	$26,100
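
The arithmetic behind a row can be checked in a few lines. The `savings` helper is hypothetical and uses only the rounded dollar figures from the table, so the smaller rows may differ from the published values by a dollar or two; the 1M row reproduces exactly.

```python
def savings(baseline: float, routed: float) -> tuple[float, float, float]:
    """Return (monthly savings, annualized savings, fraction of baseline saved)."""
    monthly = baseline - routed
    return monthly, monthly * 12, monthly / baseline


monthly, yearly, fraction = savings(3411, 1236)  # the 1M-queries row
print(f"${monthly:,.0f}/month, ${yearly:,.0f}/year, {fraction:.1%} of baseline")
# $2,175/month, $26,100/year, 63.8% of baseline
```

On these rounded figures the saved share is about 64%, slightly above the 61.6% headline number, whose derivation the post does not show.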

36 Providers, 5 Interfaces

Providers: OpenAI, Anthropic, Groq, Cerebras, DeepSeek, Mistral, Google Gemini, Ollama, Together, Fireworks, Novita, SambaNova, Replicate, OpenRouter, xAI, Cohere, AI21, Qwen, and 18 more.

Interfaces:

TypeScript SDK — import { A3MRouter } from 'adaptive-memory-multi-model-router/sdk'
Python SDK — pip install a3m-router → from a3m import A3MRouter
CLI — npx a3m-router route "your query"
REST API — GET /v1/route + POST /v1/chat/completions
OpenAI-Compatible Proxy — Point any OpenAI SDK at http://localhost:8787/v1

# Works with any OpenAI SDK: only base_url changes
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8787/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="auto",  # ← intelligent routing
    messages=[{"role": "user", "content": "Hello!"}]
)

Why Not ML?

We're not anti-ML. We're anti-overhead. Here's the thing:

Complexity scoring doesn't need neural networks. The features that distinguish "What is 2+2?" from "Design a clinical trial for oncology" are captured by domain keywords, query length, and action verbs. BERT doesn't add meaningful signal here.
Memory > Training. A model trained on general benchmarks can't know that your users ask more coding questions than average. Adaptive memory that learns from your actual usage patterns beats a static trained model.
19.5 KB vs 1.5 GB. If you're running this in a Lambda function, an edge worker, or an embedded device, size matters. The entire router is small enough to fit in a single HTTP request.

Get Started

# Install
npm install adaptive-memory-multi-model-router

# CLI
npx a3m-router route "Explain quantum computing"
npx a3m-router serve --port 8787

# TypeScript
import { A3MRouter } from 'adaptive-memory-multi-model-router/sdk';
const router = new A3MRouter();
const decision = router.route("Review this contract");

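# Python
pip install a3m-router

import asyncio
from a3m import A3MRouter

async def main():
    # `async with` must run inside a coroutine, not at module level
    async with A3MRouter() as router:
        decision = await router.route("Write a sort function")

asyncio.run(main())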

GitHub: Das-rebel/adaptive-memory-multi-model-router

npm: adaptive-memory-multi-model-router

PyPI: a3m-router

License: MIT

If you found this interesting, please star the repo and share it. We're a small team building practical AI infrastructure — no hype, just working code.
