I love Claude Code. It's the best AI coding assistant I've used. But after a few weeks of heavy usage, I checked my API costs and nearly fell out of my chair.
$247 in a month.
Most of that was from simple prompts that didn't need Claude Sonnet. Things like "what does this function do?" or "read the file at src/main.py" or "add a test for this function." Basic stuff that any decent LLM can handle.
So I built a router. It sits between Claude Code and the API, classifies each prompt in about 10ms, and sends simple stuff to cheaper models while keeping the complex work on Claude.
After a month of using it, my bill dropped to $98. Same usage pattern. Same quality. 60% less money.
Here's how it works and how you can do the same.
## The Problem: Every Prompt Costs the Same
When you use Claude Code (or most AI coding tools), every single request hits the same expensive model. It doesn't matter if you're asking it to refactor a complex async system or just read a file. You pay full price.
In my case, about 65% of my prompts were simple enough that they didn't need Claude Sonnet. But there's no built-in way to route them differently.
I needed a classifier that could decide in real-time: does this prompt need the expensive model, or can it go to something cheaper?
## The Solution: A Tiny Embedding Classifier
I didn't want to train a big ML model or add a bunch of latency. The classifier needed to be fast (under 20ms), lightweight (no GPU), and accurate enough to not mess up complex prompts.
Here's what I built:
- Pre-compute two centroid vectors (one for simple prompts, one for complex) using a sentence embedding model
- For each incoming prompt, compute its embedding and measure cosine similarity to both centroids
- Route based on which centroid is closer
The entire classifier is about 200 lines of Python. It uses sentence-transformers with the all-MiniLM-L6-v2 model (80 MB, runs on CPU).
## Code: The Classifier
```python
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer


class PromptClassifier:
    def __init__(self, threshold=0.06):
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        pkg_dir = Path(__file__).parent
        self.simple_centroid = np.load(pkg_dir / "simple_centroid.npy")
        self.complex_centroid = np.load(pkg_dir / "complex_centroid.npy")
        self.threshold = threshold

    def classify(self, prompt):
        # With normalized embeddings, the dot product IS the cosine similarity
        embedding = self.encoder.encode([prompt], normalize_embeddings=True)[0]
        simple_sim = np.dot(embedding, self.simple_centroid)
        complex_sim = np.dot(embedding, self.complex_centroid)
        # Map the similarity gap from [-2, 2] onto [0, 1];
        # score > 0.5 means the complex centroid is closer
        score = (complex_sim - simple_sim + 2) / 4
        confidence = abs(complex_sim - simple_sim)
        if confidence < self.threshold:
            # Too close to call: fail safe toward the expensive model
            tier = "complex"
        else:
            tier = "complex" if score > 0.5 else "simple"
        return {"tier": tier, "score": score, "confidence": confidence}
```
That's the core logic. The magic is in the centroids.
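To see the decision rule in isolation, here's a toy sketch of the same routing logic on made-up 3-dimensional unit vectors (the real classifier works on 384-dimensional MiniLM embeddings). The `route` and `unit` helpers and the example vectors are mine, not part of NadirClaw:

```python
import numpy as np

def route(embedding, simple_centroid, complex_centroid, threshold=0.06):
    """Same decision rule as PromptClassifier.classify, on plain vectors."""
    simple_sim = float(np.dot(embedding, simple_centroid))
    complex_sim = float(np.dot(embedding, complex_centroid))
    score = (complex_sim - simple_sim + 2) / 4
    confidence = abs(complex_sim - simple_sim)
    if confidence < threshold:
        return "complex"  # too close to call: fail safe
    return "complex" if score > 0.5 else "simple"

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Made-up unit vectors standing in for the real centroids
simple_c = unit([1.0, 0.2, 0.0])
complex_c = unit([0.1, 1.0, 0.4])

print(route(unit([0.9, 0.3, 0.1]), simple_c, complex_c))  # simple
print(route(unit([0.2, 0.9, 0.5]), simple_c, complex_c))  # complex
```

A vector sitting exactly between the two centroids produces a near-zero confidence and falls into the fail-safe branch, which is the behavior you want for ambiguous prompts.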
## How I Built the Centroids
I collected about 170 real prompts from my own Claude Code sessions and manually labeled them as simple or complex. Then I computed embeddings for all of them and took the mean of each group:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer('all-MiniLM-L6-v2')

# SIMPLE_PROMPTS and COMPLEX_PROMPTS are the hand-labeled prompt lists
simple_embeddings = encoder.encode(SIMPLE_PROMPTS, normalize_embeddings=True)
complex_embeddings = encoder.encode(COMPLEX_PROMPTS, normalize_embeddings=True)

simple_centroid = np.mean(simple_embeddings, axis=0)
complex_centroid = np.mean(complex_embeddings, axis=0)

np.save("simple_centroid.npy", simple_centroid)
np.save("complex_centroid.npy", complex_centroid)
```
Those two .npy files are about 1.5 KB each. I ship them with the package. No training step needed when you install it.
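That file size is no accident: all-MiniLM-L6-v2 emits 384-dimensional vectors, and at float32 each centroid carries exactly 1,536 bytes of payload:

```python
import numpy as np

# 384 dims (all-MiniLM-L6-v2's output size) x 4 bytes per float32
centroid = np.zeros(384, dtype=np.float32)
print(centroid.nbytes)  # 1536 bytes, ~1.5 KB before the small .npy header
```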
## Wrapping It in a Proxy Server
The classifier alone doesn't help unless you can actually route requests to different models. I built a FastAPI server that exposes an OpenAI-compatible API and routes requests based on the classification:
```python
from fastapi import FastAPI, Request
import httpx

from nadirclaw.classifier import PromptClassifier

app = FastAPI()
classifier = PromptClassifier()

SIMPLE_MODEL = "gemini-2.5-flash"
COMPLEX_MODEL = "claude-sonnet-4-5"


@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    messages = body.get("messages", [])
    # Classify on the most recent user message
    last_user = next((m["content"] for m in reversed(messages) if m["role"] == "user"), "")
    result = classifier.classify(last_user)
    target_model = COMPLEX_MODEL if result["tier"] == "complex" else SIMPLE_MODEL
    response = await dispatch_to_provider(target_model, body)
    return response
```
Point Claude Code at http://localhost:8856/v1 instead of the Anthropic API, and every request flows through the router.
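`dispatch_to_provider` is left undefined above. Here's a minimal sketch of the idea, assuming each model maps to an OpenAI-compatible upstream endpoint; the provider table, URLs, and env-var names are my assumptions, not NadirClaw's actual config. Building the request as pure data keeps the HTTP send a one-liner:

```python
import os

# Hypothetical provider table: model name -> OpenAI-compatible endpoint
# and the env var holding its API key. Adjust for your providers.
PROVIDERS = {
    "gemini-2.5-flash": {
        "base_url": "https://generativelanguage.googleapis.com/v1beta/openai",
        "key_env": "GEMINI_API_KEY",
    },
    "claude-sonnet-4-5": {
        "base_url": "https://api.anthropic.com/v1",
        "key_env": "ANTHROPIC_API_KEY",
    },
}

def build_upstream_request(model: str, body: dict):
    """Return (url, headers, payload) for the routed upstream call."""
    try:
        provider = PROVIDERS[model]
    except KeyError:
        raise ValueError(f"no provider configured for model {model!r}")
    url = f"{provider['base_url']}/chat/completions"
    headers = {"Authorization": f"Bearer {os.environ.get(provider['key_env'], '')}"}
    payload = {**body, "model": model}  # force the model the router chose
    return url, headers, payload
```

The actual send is then just `await client.post(url, json=payload, headers=headers)` on an `httpx.AsyncClient`.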
## What Gets Routed Where?
After running this for a month on my real Claude Code usage, here's what the distribution looks like:
Simple tier (65% of requests):
- "What does this function do?"
- "Read the file at src/main.py"
- "Add a docstring to this class"
- "Show me the git log for this file"
Complex tier (35% of requests):
- "Refactor this module to use dependency injection"
- "Design a caching layer for this API"
- "Explain why this async operation deadlocks"
- Multi-file changes
- Architecture discussions
Accuracy: I spot-checked about 200 routed requests. 94% were routed correctly. The 6% that were wrong were borderline cases that worked fine on the cheaper model anyway.
## Beyond Basic Classification: Smart Overrides
A pure embedding classifier isn't enough. I added a few rules on top:
1. **Agentic task detection.** If the request includes tool definitions, it always goes to the complex model.
2. **Reasoning detection.** If the prompt has multiple reasoning markers ("step by step", "prove that"), it goes to complex.
3. **Session persistence.** Once a conversation is routed to a model, follow-up messages stick to that model.
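The post doesn't show the override code, so here's a sketch of how the three rules could stack on top of the classifier's verdict. The marker list and the `session_tiers` dict (a per-conversation cache keyed by session id) are my assumptions:

```python
REASONING_MARKERS = (
    "step by step", "prove that", "think through", "chain of thought",
)

def apply_overrides(body: dict, classified_tier: str,
                    session_tiers: dict, session_id: str) -> str:
    """Layer the three override rules on top of the embedding classifier."""
    # 1. Agentic task detection: tool definitions always mean the complex model
    if body.get("tools"):
        tier = "complex"
    else:
        tier = classified_tier
        # 2. Reasoning detection: two or more markers push to complex
        prompt = " ".join(
            m["content"] for m in body.get("messages", [])
            if m["role"] == "user" and isinstance(m["content"], str)
        ).lower()
        hits = sum(1 for marker in REASONING_MARKERS if marker in prompt)
        if hits >= 2:
            tier = "complex"
    # 3. Session persistence: the first routing decision wins for the session
    tier = session_tiers.setdefault(session_id, tier)
    return tier
```

Note the ordering: persistence is applied last, so whatever tier a session starts on is the tier its follow-ups keep, regardless of how they classify individually.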
## Results: 60% Cost Reduction
Before NadirClaw:
- Total requests: 1,847
- All to Claude Sonnet 4.5
- Total cost: $247.13
After NadirClaw:
- Simple tier (65%): 1,201 requests to Gemini 2.5 Flash = $14.82
- Complex tier (35%): 646 requests to Claude Sonnet 4.5 = $83.19
- Total cost: $98.01
- Savings: $149.12 (60% reduction)
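The numbers above are easy to sanity-check:

```python
before = 247.13
simple_cost = 14.82   # 1,201 requests on Gemini 2.5 Flash
complex_cost = 83.19  # 646 requests on Claude Sonnet 4.5

after = simple_cost + complex_cost
savings = before - after
print(f"${after:.2f} total, ${savings:.2f} saved ({savings / before:.0%})")
# $98.01 total, $149.12 saved (60%)
```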
No quality loss. Same conversations. Just smarter routing.
## How to Use It Yourself
I open-sourced the whole thing: NadirClaw on GitHub
```shell
pip install nadirclaw
nadirclaw setup
nadirclaw serve

export ANTHROPIC_BASE_URL=http://localhost:8856/v1
claude
```
Or use it with any tool that speaks the OpenAI API format (Cursor, Continue, OpenClaw, etc.).
## What I Learned
- Most LLM usage doesn't need the premium model. 65% of prompts were simple enough for a much cheaper model.
- A tiny classifier is enough. Sentence embeddings + cosine similarity gets you 94% accuracy in under 20ms.
- Smart overrides matter. You need rules for agentic tasks, reasoning prompts, and session persistence.
- Local control beats platform lock-in. Your API keys stay on your machine.
The whole classifier is about 200 lines. The cost savings are real. And it works with any LLM tool that speaks the OpenAI API.
If you're spending serious money on Claude (or any other premium LLM), try routing. It's the easiest 60% cost cut I've ever made.
Follow-up questions? Issues? Want to contribute?
GitHub: doramirdor/NadirClaw