# I Stopped Paying GPT-4 for Simple Queries
You know that feeling when you send "What's 2+2?" to GPT-4 and watch $0.03 vanish?
I was burning through my OpenAI budget like it was 2021 crypto. So I built something that fixed it.
## The Problem Nobody Talks About
Most AI apps pick one model and use it for everything. Simple lookups, code reviews, translations: all through GPT-4 (or Claude, or Gemini, whichever the team standardized on).
That's like using a Ferrari to deliver pizza.
I tracked my LLM spending for 2 weeks:
| Query Type | % of Queries | Model Used | Cost | Optimal Model |
|---|---|---|---|---|
| Simple Q&A | 45% | GPT-4 | $0.03/req | Groq (free) |
| Code review | 20% | GPT-4 | $0.03/req | Claude Sonnet ($0.015) |
| Creative writing | 15% | GPT-4 | $0.03/req | GPT-4o-mini ($0.00015) |
| Complex analysis | 20% | GPT-4 | $0.03/req | GPT-4 ($0.03) |
80% of my queries didn't need GPT-4. But I was paying for it anyway.
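A quick back-of-envelope check on that table: if every query type had gone to its optimal model, my blended cost per request would have dropped from a flat $0.03 to roughly $0.009.

```typescript
// Blended cost per request if each query type went to its optimal model,
// using the percentages and prices from the table above.
const mix = [
  { share: 0.45, cost: 0.0 },     // Simple Q&A → Groq (free)
  { share: 0.20, cost: 0.015 },   // Code review → Claude Sonnet
  { share: 0.15, cost: 0.00015 }, // Creative writing → GPT-4o-mini
  { share: 0.20, cost: 0.03 },    // Complex analysis → GPT-4
];

const blended = mix.reduce((sum, m) => sum + m.share * m.cost, 0);
console.log(blended.toFixed(4)); // ≈ 0.0090 vs a flat 0.0300, roughly 70% cheaper on paper
```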
## The Solution: A3M Router
I built Adaptive Memory Multi-Model Router — an open-source LLM router that automatically picks the right model for each query.
```typescript
import { createA3MRouter } from 'adaptive-memory-multi-model-router';

const router = createA3MRouter({
  memory: true,    // Learns from your patterns
  costBudget: 0.05 // Max cost per request
});

// Simple query → fast + cheap model
const cheap = await router.route({
  prompt: 'What is 2+2?',
  context: { type: 'qa' }
});
// → Provider: groq, Cost: $0.00000, Latency: 89ms

// Complex query → best model
const complex = await router.route({
  prompt: 'Debug this 10k line Python codebase',
  context: { type: 'coding', language: 'python' }
});
// → Provider: openai, Cost: $0.003, Latency: 1200ms
```
The router learns. After a few hundred requests, it knows:
- Simple Q&A → Groq/Cerebras (free, fast)
- Code review → Claude/GPT-4 (best quality)
- Summarization → GPT-4o-mini (cheap, good enough)
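I won't walk through A3M's internals here, but the core idea is simple enough to sketch: keep running per-category stats on each provider's cost, latency, and success rate, and update them after every request. The types and names below are my illustration, not A3M's actual API.

```typescript
// Illustrative only: the kind of per-category stats a routing memory could keep.
type ProviderStats = {
  requests: number;
  avgLatencyMs: number; // running average
  avgCost: number;      // running average, USD per request
  successRate: number;  // 0..1
};

// queryType ("qa", "coding", ...) → provider → observed stats
const memory = new Map<string, Map<string, ProviderStats>>();

function record(queryType: string, provider: string, latencyMs: number, cost: number, ok: boolean) {
  const byProvider = memory.get(queryType) ?? new Map<string, ProviderStats>();
  const s = byProvider.get(provider) ?? { requests: 0, avgLatencyMs: 0, avgCost: 0, successRate: 1 };
  const n = s.requests + 1;
  byProvider.set(provider, {
    requests: n,
    // incremental running averages, so nothing needs to be stored per request
    avgLatencyMs: s.avgLatencyMs + (latencyMs - s.avgLatencyMs) / n,
    avgCost: s.avgCost + (cost - s.avgCost) / n,
    successRate: s.successRate + ((ok ? 1 : 0) - s.successRate) / n,
  });
  memory.set(queryType, byProvider);
}
```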
## The Architecture
```
Query → Memory Tree → RouteLLM Scoring → Provider Selection → Response
            ↑                                    ↓
      Learns from                          Records result
      past queries                         in memory
```
Three research-backed techniques make it work:
- RouteLLM (arXiv:2406.18665) — Cost-quality routing that balances price vs performance
- RadixAttention (arXiv:2312.07104) — Prefix caching for repeated prompt patterns (5-10x speedup)
- Token Compression (arXiv:2403.12968) — Compresses context before sending (20-40% fewer tokens)
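To make the first of those concrete: cost-quality routing boils down to picking the cheapest candidate whose estimated quality clears a bar for the query type. Here's a minimal sketch in that spirit (my own simplification, not RouteLLM's or A3M's actual implementation; the candidates and quality numbers are made up).

```typescript
// Cost-quality routing, minimal version: cheapest provider whose estimated
// quality clears the threshold and whose cost fits the budget.
type Candidate = { provider: string; costPerReq: number; quality: number }; // quality: 0..1 estimate

function pickProvider(candidates: Candidate[], minQuality: number, costBudget: number): Candidate | undefined {
  return candidates
    .filter(c => c.quality >= minQuality && c.costPerReq <= costBudget)
    .sort((a, b) => a.costPerReq - b.costPerReq)[0];
}

// Illustrative numbers, not real benchmarks
const candidates: Candidate[] = [
  { provider: 'groq', costPerReq: 0, quality: 0.7 },
  { provider: 'gpt-4o-mini', costPerReq: 0.00015, quality: 0.8 },
  { provider: 'claude-sonnet', costPerReq: 0.015, quality: 0.92 },
  { provider: 'gpt-4', costPerReq: 0.03, quality: 0.95 },
];

pickProvider(candidates, 0.65, 0.05); // simple Q&A → groq
pickProvider(candidates, 0.9, 0.05);  // complex analysis → claude-sonnet
```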
## Real Numbers
I ran A3M Router for 2 weeks on my production workload:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Daily cost | $12.40 | $7.20 | 42% reduction |
| Avg latency | 1800ms | 650ms | 64% faster |
| Queries/day | ~400 | ~400 | Same |
| Failed requests | 23/day | 0.4/day | 98% reduction |
The latency improvement surprised me most. When you route simple queries to Groq (which runs Llama at 800 tok/s), the average drops dramatically.
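The arithmetic works out if you assume rough per-path latencies. The split below is illustrative, not measured: roughly 80% of my traffic on a fast path and 20% still on GPT-4.

```typescript
// Illustrative weighted average; per-path latencies are assumptions, not measurements.
const fastPathMs = 300;  // Groq/Cerebras/4o-mini class
const slowPathMs = 1800; // GPT-4 class
const avg = 0.8 * fastPathMs + 0.2 * slowPathMs;
console.log(avg); // 600, in the ballpark of the 650ms average above
```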
## 14 Providers, 116 Integrations
```typescript
// All providers available out of the box
const router = createA3MRouter({
  providers: [
    'openai',    // GPT-4o, GPT-4o-mini
    'anthropic', // Claude 3.5 Sonnet
    'groq',      // Llama-3.3-70B (fastest)
    'cerebras',  // Llama-3.3-70B (ultra-fast)
    'google',    // Gemini Pro/Flash
    'deepseek',  // Coding/Math specialist
    'ollama',    // Local models (free)
    // ... 7 more
  ]
});
```
Plus 116 integrations for GitHub, Slack, Stripe, Pinecone, Notion, and more.
## Quick Start
```bash
npm install adaptive-memory-multi-model-router
```

```typescript
import { createA3MRouter } from 'adaptive-memory-multi-model-router';

const router = createA3MRouter({ memory: true });

const result = await router.route({
  prompt: 'Your prompt here'
});

console.log(result.output);   // The response
console.log(result.provider); // Which model was chosen
console.log(result.cost);     // How much it cost
```
Or use the CLI:
```bash
npx a3m-router route "Explain quantum computing"
npx a3m-router parallel "task1" "task2" "task3"
npx a3m-router cost # Show cost tracking
```
## The Honest Limitations
A3M Router isn't perfect:
- Memory is local — no distributed memory sharing between instances yet
- First requests are slower — it needs ~50 queries to learn your patterns
- No streaming support — working on it for v2.0
- Compression is lossy — ~80% reduction but some nuance gets lost
I'm actively working on all of these. PRs welcome.
## What's Next
- v2.0: Streaming support, distributed memory, WebSocket provider updates
- Benchmark dashboard: Real-time cost/latency/quality tracking
- LangChain integration: Drop-in replacement for their router
Links:
- 📦 npm: adaptive-memory-multi-model-router
- 💻 GitHub: Das-rebel/adaptive-memory-multi-model-router
- ⭐ Star the repo if this is useful!
What's your LLM routing strategy? Do you manually pick models, or do you use something automated? I'd love to hear what others are doing — drop a comment below. 👇