# I Stopped Paying GPT-4 for Simple Queries
You know that feeling when you send "What's 2+2?" to GPT-4 and watch $0.03 vanish?
I was burning through my OpenAI budget like it was 2021 crypto. So I built something that fixed it.
## The Problem Nobody Talks About
Most AI apps pick one model and use it for everything. Simple lookups, code reviews, translations: all through GPT-4 (or Claude, or Gemini, whichever the team standardized on).
That's like using a Ferrari to deliver pizza.
I tracked my LLM spending for 2 weeks:
| Query Type | % of Queries | Model Used | Cost | Optimal Model |
|---|---|---|---|---|
| Simple Q&A | 45% | GPT-4 | $0.03/req | Groq (free) |
| Code review | 20% | GPT-4 | $0.03/req | Claude Sonnet ($0.015) |
| Creative writing | 15% | GPT-4 | $0.03/req | GPT-4o-mini ($0.00015) |
| Complex analysis | 20% | GPT-4 | $0.03/req | GPT-4 ($0.03) |
80% of my queries didn't need GPT-4. But I was paying for it anyway.
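A quick back-of-envelope check on that table: if every query type had gone to its optimal model, my blended cost per request would have dropped from a flat $0.03 to roughly $0.009.

```typescript
// Blended cost per request if each query type went to its optimal model,
// using the percentages and prices from the table above.
const mix = [
  { share: 0.45, cost: 0.0 },     // Simple Q&A → Groq (free)
  { share: 0.20, cost: 0.015 },   // Code review → Claude Sonnet
  { share: 0.15, cost: 0.00015 }, // Creative writing → GPT-4o-mini
  { share: 0.20, cost: 0.03 },    // Complex analysis → GPT-4
];

const blended = mix.reduce((sum, m) => sum + m.share * m.cost, 0);
console.log(blended.toFixed(4)); // ≈ 0.0090 vs a flat 0.0300, roughly 70% cheaper on paper
```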
## The Solution: A3M Router
I built Adaptive Memory Multi-Model Router — an open-source LLM router that automatically picks the right model for each query.
```typescript
import { createA3MRouter } from 'adaptive-memory-multi-model-router';

const router = createA3MRouter({
  memory: true,    // Learns from your patterns
  costBudget: 0.05 // Max cost per request
});

// Simple query → fast + cheap model
const cheap = await router.route({
  prompt: 'What is 2+2?',
  context: { type: 'qa' }
});
// → Provider: groq, Cost: $0.00000, Latency: 89ms

// Complex query → best model
const complex = await router.route({
  prompt: 'Debug this 10k line Python codebase',
  context: { type: 'coding', language: 'python' }
});
// → Provider: openai, Cost: $0.003, Latency: 1200ms
```
The router learns. After a few hundred requests, it knows:
- Simple Q&A → Groq/Cerebras (free, fast)
- Code review → Claude/GPT-4 (best quality)
- Summarization → GPT-4o-mini (cheap, good enough)
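I won't walk through A3M's internals here, but the core idea is simple enough to sketch: keep running per-category stats on each provider's cost, latency, and success rate, and update them after every request. The types and names below are my illustration, not A3M's actual API.

```typescript
// Illustrative only: the kind of per-category stats a routing memory could keep.
type ProviderStats = {
  requests: number;
  avgLatencyMs: number; // running average
  avgCost: number;      // running average, USD per request
  successRate: number;  // 0..1
};

// queryType ("qa", "coding", ...) → provider → observed stats
const memory = new Map<string, Map<string, ProviderStats>>();

function record(queryType: string, provider: string, latencyMs: number, cost: number, ok: boolean) {
  const byProvider = memory.get(queryType) ?? new Map<string, ProviderStats>();
  const s = byProvider.get(provider) ?? { requests: 0, avgLatencyMs: 0, avgCost: 0, successRate: 1 };
  const n = s.requests + 1;
  byProvider.set(provider, {
    requests: n,
    // incremental running averages, so nothing needs to be stored per request
    avgLatencyMs: s.avgLatencyMs + (latencyMs - s.avgLatencyMs) / n,
    avgCost: s.avgCost + (cost - s.avgCost) / n,
    successRate: s.successRate + ((ok ? 1 : 0) - s.successRate) / n,
  });
  memory.set(queryType, byProvider);
}
```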
## The Architecture
```
Query → Memory Tree → RouteLLM Scoring → Provider Selection → Response
            ↑                                    ↓
      Learns from                          Records result
      past queries                         in memory
```
Three research-backed techniques make it work:
- RouteLLM (arXiv:2406.18665) — Cost-quality routing that balances price vs performance
- RadixAttention (arXiv:2312.07104) — Prefix caching for repeated prompt patterns (5-10x speedup)
- Token Compression (arXiv:2403.12968) — Compresses context before sending (20-40% fewer tokens)
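To make the first of those concrete: cost-quality routing boils down to picking the cheapest candidate whose estimated quality clears a bar for the query type. Here's a minimal sketch in that spirit (my own simplification, not RouteLLM's or A3M's actual implementation; the candidates and quality numbers are made up).

```typescript
// Cost-quality routing, minimal version: cheapest provider whose estimated
// quality clears the threshold and whose cost fits the budget.
type Candidate = { provider: string; costPerReq: number; quality: number }; // quality: 0..1 estimate

function pickProvider(candidates: Candidate[], minQuality: number, costBudget: number): Candidate | undefined {
  return candidates
    .filter(c => c.quality >= minQuality && c.costPerReq <= costBudget)
    .sort((a, b) => a.costPerReq - b.costPerReq)[0];
}

// Illustrative numbers, not real benchmarks
const candidates: Candidate[] = [
  { provider: 'groq', costPerReq: 0, quality: 0.7 },
  { provider: 'gpt-4o-mini', costPerReq: 0.00015, quality: 0.8 },
  { provider: 'claude-sonnet', costPerReq: 0.015, quality: 0.92 },
  { provider: 'gpt-4', costPerReq: 0.03, quality: 0.95 },
];

pickProvider(candidates, 0.65, 0.05); // simple Q&A → groq
pickProvider(candidates, 0.9, 0.05);  // complex analysis → claude-sonnet
```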
## Real Numbers
I ran A3M Router for 2 weeks on my production workload:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Daily cost | $12.40 | $7.20 | 42% reduction |
| Avg latency | 1800ms | 650ms | 64% faster |
| Queries/day | ~400 | ~400 | Same |
| Failed requests | 23/day | 0.4/day | 98% reduction |
The latency improvement surprised me most. When you route simple queries to Groq (which runs Llama at 800 tok/s), the average drops dramatically.
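The arithmetic works out if you assume rough per-path latencies. The split below is illustrative, not measured: roughly 80% of my traffic on a fast path and 20% still on GPT-4.

```typescript
// Illustrative weighted average; per-path latencies are assumptions, not measurements.
const fastPathMs = 300;  // Groq/Cerebras/4o-mini class
const slowPathMs = 1800; // GPT-4 class
const avg = 0.8 * fastPathMs + 0.2 * slowPathMs;
console.log(avg); // 600, in the ballpark of the 650ms average above
```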
## 14 Providers, 116 Integrations
```typescript
// All providers available out of the box
const router = createA3MRouter({
  providers: [
    'openai',    // GPT-4o, GPT-4o-mini
    'anthropic', // Claude 3.5 Sonnet
    'groq',      // Llama-3.3-70B (fastest)
    'cerebras',  // Llama-3.3-70B (ultra-fast)
    'google',    // Gemini Pro/Flash
    'deepseek',  // Coding/Math specialist
    'ollama',    // Local models (free)
    // ... 7 more
  ]
});
```
Plus 116 integrations for GitHub, Slack, Stripe, Pinecone, Notion, and more.
## Quick Start
```bash
npm install adaptive-memory-multi-model-router
```

```typescript
import { createA3MRouter } from 'adaptive-memory-multi-model-router';

const router = createA3MRouter({ memory: true });

const result = await router.route({
  prompt: 'Your prompt here'
});

console.log(result.output);   // The response
console.log(result.provider); // Which model was chosen
console.log(result.cost);     // How much it cost
```
Or use the CLI:
```bash
npx a3m-router route "Explain quantum computing"
npx a3m-router parallel "task1" "task2" "task3"
npx a3m-router cost # Show cost tracking
```
## The Honest Limitations
A3M Router isn't perfect:
- Memory is local — no distributed memory sharing between instances yet
- First requests are slower — it needs ~50 queries to learn your patterns
- No streaming support — working on it for v2.0
- Compression is lossy — ~80% reduction but some nuance gets lost
I'm actively working on all of these. PRs welcome.
## What's Next
- v2.0: Streaming support, distributed memory, WebSocket provider updates
- Benchmark dashboard: Real-time cost/latency/quality tracking
- LangChain integration: Drop-in replacement for their router
Links:
- 📦 npm: adaptive-memory-multi-model-router
- 💻 GitHub: Das-rebel/adaptive-memory-multi-model-router
- ⭐ Star the repo if this is useful!
What's your LLM routing strategy? Do you manually pick models, or do you use something automated? I'd love to hear what others are doing — drop a comment below. 👇