I Stopped Paying GPT-4 for Simple Queries — Here's the Router I Built

You know that feeling when you send "What's 2+2?" to GPT-4 and watch $0.03 vanish?

I was burning through my OpenAI budget like it was 2021 crypto. So I built something that fixed it.

The Problem Nobody Talks About

Most AI apps use one model for everything. GPT-4 for simple lookups. Claude for code reviews. Gemini for translations.

That's like using a Ferrari to deliver pizza.

I tracked my LLM spending for 2 weeks:

| Query Type | % of Queries | Model Used | Cost | Optimal Model |
| --- | --- | --- | --- | --- |
| Simple Q&A | 45% | GPT-4 | $0.03/req | Groq (free) |
| Code review | 20% | GPT-4 | $0.03/req | Claude Sonnet ($0.015/req) |
| Creative writing | 15% | GPT-4 | $0.03/req | GPT-4o-mini ($0.00015/req) |
| Complex analysis | 20% | GPT-4 | $0.03/req | GPT-4 ($0.03/req) |

80% of my queries didn't need GPT-4. But I was paying for it anyway.
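
A quick sanity check on that table: if every query type ran on its optimal model, the weighted per-request cost would be about $0.009 instead of $0.03, roughly a 70% theoretical ceiling on savings. The arithmetic, using only the table's numbers:

// Weighted average cost per request, straight from the table above
const allGpt4 = 0.03; // everything on GPT-4

const optimalMix =
  0.45 * 0 +       // Simple Q&A on Groq (free)
  0.20 * 0.015 +   // Code review on Claude Sonnet
  0.15 * 0.00015 + // Creative writing on GPT-4o-mini
  0.20 * 0.03;     // Complex analysis stays on GPT-4

console.log(optimalMix.toFixed(5));                        // 0.00902
console.log(Math.round((1 - optimalMix / allGpt4) * 100)); // 70 (% ceiling)

Real routing won't hit that ceiling (misclassification, retries, learning time), which is why the measured savings further down are lower.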

The Solution: A3M Router

I built the Adaptive Memory Multi-Model Router — an open-source LLM router that automatically picks the right model for each query.

import { createA3MRouter } from 'adaptive-memory-multi-model-router';

const router = createA3MRouter({
  memory: true,        // Learns from your patterns
  costBudget: 0.05     // Max cost per request
});

// Simple query → fast + cheap model
const cheap = await router.route({
  prompt: 'What is 2+2?',
  context: { type: 'qa' }
});
// → Provider: groq, Cost: $0.00000, Latency: 89ms

// Complex query → best model
const complex = await router.route({
  prompt: 'Debug this 10k line Python codebase',
  context: { type: 'coding', language: 'python' }
});
// → Provider: openai, Cost: $0.003, Latency: 1200ms

The router learns. After a few hundred requests, it knows:

  • Simple Q&A → Groq/Cerebras (free, fast)
  • Code review → Claude/GPT-4 (best quality)
  • Summarization → GPT-4o-mini (cheap, good enough)
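
Conceptually, the memory boils down to running stats per (query type, provider) pair. Here's a simplified sketch of that idea; it is not the library's actual internals, and recordOutcome / bestProvider are names I made up for illustration:

type MemoryStats = { requests: number; failures: number; totalCost: number };

const memory = new Map<string, MemoryStats>(); // key: "queryType:provider"

function recordOutcome(queryType: string, provider: string, cost: number, failed: boolean): void {
  const key = `${queryType}:${provider}`;
  const s = memory.get(key) ?? { requests: 0, failures: 0, totalCost: 0 };
  s.requests += 1;
  if (failed) s.failures += 1;
  s.totalCost += cost;
  memory.set(key, s);
}

// Cheapest provider with an observed failure rate under 5%; falls back to the
// first candidate when there's no history yet (the cold-start caveat below)
function bestProvider(queryType: string, candidates: string[]): string {
  const usable = candidates
    .map((p) => ({ p, s: memory.get(`${queryType}:${p}`) }))
    .filter((x): x is { p: string; s: MemoryStats } => !!x.s && x.s.failures / x.s.requests < 0.05)
    .sort((a, b) => a.s.totalCost / a.s.requests - b.s.totalCost / b.s.requests);
  return usable[0]?.p ?? candidates[0];
}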

The Architecture

Query → Memory Tree → RouteLLM Scoring → Provider Selection → Response
              ↑                                         ↓
         Learns from                            Records result
         past queries                           in memory
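
Spelled out as types, the four stages look roughly like this. These interfaces are my own naming for illustration, not the package's exported API:

// Hypothetical interfaces mirroring the diagram above
interface ProviderHint {
  provider: string;
  expectedCost: number;    // $/request, estimated from history
  expectedQuality: number; // 0..1, estimated from history
}

interface MemoryTree {
  lookup(prompt: string): ProviderHint[];                      // learns from past queries
  record(prompt: string, provider: string, ok: boolean): void; // records the result
}

interface Scorer {
  rank(prompt: string, hints: ProviderHint[]): ProviderHint[]; // RouteLLM-style scoring
}

interface Selector {
  select(ranked: ProviderHint[], costBudget: number): string;  // final provider choice
}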

Three research-backed techniques make it work:

  1. RouteLLM (arXiv:2404.06035) — Cost-quality routing that balances price vs performance (see the sketch after this list)
  2. RadixAttention (arXiv:2312.07104) — Prefix caching for repeated prompt patterns (5-10x speedup)
  3. Token Compression (arXiv:2403.12968) — Compresses context before sending (20-40% fewer tokens)
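
To make the first technique concrete, here's a toy version of cost-quality scoring. The actual RouteLLM router is learned from preference data; the quality scores and the lambda cost penalty below are invented purely for illustration:

type Candidate = { provider: string; costPerReq: number; quality: number }; // quality in 0..1

const candidates: Candidate[] = [
  { provider: 'groq',        costPerReq: 0,       quality: 0.70 },
  { provider: 'gpt-4o-mini', costPerReq: 0.00015, quality: 0.75 },
  { provider: 'claude',      costPerReq: 0.015,   quality: 0.92 },
  { provider: 'gpt-4',       costPerReq: 0.03,    quality: 0.95 },
];

// Score = quality minus a cost penalty; only candidates within budget are eligible
function pick(budget: number, lambda = 10): Candidate {
  return candidates
    .filter((c) => c.costPerReq <= budget)
    .reduce((best, c) =>
      c.quality - lambda * c.costPerReq > best.quality - lambda * best.costPerReq ? c : best
    );
}

console.log(pick(0.05).provider);  // 'claude': best quality-per-dollar tradeoff
console.log(pick(0.001).provider); // 'gpt-4o-mini': top score once budget rules out the big models

The point: with a single knob (lambda), the same scorer picks premium models only when the quality gap justifies the price.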

Real Numbers

I ran A3M Router for 2 weeks on my production workload:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Daily cost | $12.40 | $7.20 | 42% reduction |
| Avg latency | 1800ms | 650ms | 64% faster |
| Queries/day | ~400 | ~400 | Same |
| Failed requests | 23/day | 0.4/day | 98% reduction |

The latency improvement surprised me most. When you route simple queries to Groq (which runs Llama at 800 tok/s), the average drops dramatically.
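
The drop is just a weighted average. A quick illustration (assumed latencies, not measurements): if the fast path averages ~360ms and the 20% of hard queries still take ~1800ms on GPT-4:

// Why the mean collapses even though the slow tail is unchanged
const before = 1.0 * 1800;             // 1800ms: everything on GPT-4
const after = 0.8 * 360 + 0.2 * 1800;  // 648ms: 80% fast path, 20% GPT-4

That back-of-the-envelope 648ms lands right around the 650ms average I measured.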

14 Providers, 116 Integrations

// All providers available out of the box
const router = createA3MRouter({
  providers: [
    'openai',      // GPT-4o, GPT-4o-mini
    'anthropic',   // Claude 3.5 Sonnet
    'groq',        // Llama-3.3-70B (fastest)
    'cerebras',    // Llama-3.3-70B (ultra-fast)
    'google',      // Gemini Pro/Flash
    'deepseek',    // Coding/Math specialist
    'ollama',      // Local models (free)
    // ... 7 more
  ]
});

Plus 116 integrations for GitHub, Slack, Stripe, Pinecone, Notion, and more.

Quick Start

npm install adaptive-memory-multi-model-router

import { createA3MRouter } from 'adaptive-memory-multi-model-router';

const router = createA3MRouter({ memory: true });

const result = await router.route({
  prompt: 'Your prompt here'
});

console.log(result.output);      // The response
console.log(result.provider);    // Which model was chosen
console.log(result.cost);        // How much it cost

Or use the CLI:

npx a3m-router route "Explain quantum computing"
npx a3m-router parallel "task1" "task2" "task3"
npx a3m-router cost  # Show cost tracking

The Honest Limitations

A3M Router isn't perfect:

  • Memory is local — no distributed memory sharing between instances yet
  • First requests are slower — it needs ~50 queries to learn your patterns
  • No streaming support — working on it for v2.0
  • Compression is lossy — ~80% reduction but some nuance gets lost

I'm actively working on all of these. PRs welcome.

What's Next

  • v2.0: Streaming support, distributed memory, WebSocket provider updates
  • Benchmark dashboard: Real-time cost/latency/quality tracking
  • LangChain integration: Drop-in replacement for their router

What's your LLM routing strategy? Do you manually pick models, or do you use something automated? I'd love to hear what others are doing — drop a comment below. 👇
