chnby

Posted on Jun 22 • Originally published at apicalculators.com

How I Cut My LLM API Bill by 80% With a Simple Router

#llm #python #webdev #ai

No fancy infrastructure. Just a 50-line Python function that picks the right model for the right query.

Last month my LLM API bill hit $340. This month: $67.

Same traffic. Same product. The only change was adding a simple router that stops sending every request to Claude Sonnet when GPT-4o mini can handle it just as well.

Here's exactly how it works.

The Problem
When you prototype, you pick one model and hardcode it everywhere. Usually something capable like GPT-4o or Claude Sonnet, because you want good results fast.

Then you ship, traffic grows, and you get a bill that makes you question your life choices.

The thing is, not all queries need a flagship model. In a typical RAG app:

"What is the return policy?" → GPT-4o mini handles this fine
"Summarize these 5 conflicting documents and identify the key disagreement" → needs Sonnet

You're paying Sonnet prices for return policy questions. That's the bug.
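To put a rough number on that gap, here is a per-query estimate. The per-million-token prices are the ones quoted later in this post; the token counts (1,000 input, 200 output) are an assumption for illustration, not measurements, and `query_cost` is a hypothetical helper.

```python
# Prices in USD per million tokens, as quoted in this post.
PRICES_PER_M = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call at the given token counts."""
    price = PRICES_PER_M[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Assumed token counts for a short RAG question (illustrative only).
mini = query_cost("gpt-4o-mini", 1_000, 200)
sonnet = query_cost("claude-sonnet", 1_000, 200)
print(f"mini: ${mini:.5f}  sonnet: ${sonnet:.5f}  ratio: {sonnet / mini:.1f}x")
```

At those assumed token counts the same question costs roughly 22x more on the flagship model. Plug in your own token counts before drawing conclusions.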

The Fix: A Complexity Router

import anthropic
from openai import OpenAI

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def classify_complexity(query: str) -> str:
    """Returns 'simple' or 'complex'."""
    # Three cheap signals; a query is "simple" if at least two of them hold.
    simple_indicators = [
        len(query.split()) < 15,                        # short query
        query.endswith("?") and query.count("?") == 1,  # exactly one question
        not any(w in query.lower() for w in [           # no heavy-reasoning keywords
            "compare", "analyze", "summarize", "explain why",
            "difference between", "pros and cons", "evaluate"
        ])
    ]
    return "simple" if sum(simple_indicators) >= 2 else "complex"

def route(query: str, context: str = "") -> str:
    complexity = classify_complexity(query)

    if complexity == "simple":
        # GPT-4o mini: $0.15/M input tokens
        response = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": context},
                {"role": "user", "content": query}
            ]
        )
        return response.choices[0].message.content
    else:
        # Claude Sonnet: $3.00/M input tokens (only when needed)
        response = anthropic_client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=context,
            messages=[{"role": "user", "content": query}]
        )
        return response.content[0].text
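
One quirk of the heuristic worth knowing: because it is a 2-of-3 vote, a short single question that contains a complexity keyword still scores as simple. "Compare plan A and plan B?" passes both the length check and the question-mark check, so the keyword hit is outvoted. If that is not what you want, a stricter variant can treat any keyword hit as an automatic escalation. This is only a sketch; `classify_complexity_strict` and `COMPLEX_KEYWORDS` are hypothetical names, not part of the router above.

```python
COMPLEX_KEYWORDS = (
    "compare", "analyze", "summarize", "explain why",
    "difference between", "pros and cons", "evaluate",
)

def classify_complexity_strict(query: str) -> str:
    """Returns 'simple' or 'complex'; any keyword hit forces 'complex'."""
    text = query.strip().lower()
    if any(keyword in text for keyword in COMPLEX_KEYWORDS):
        return "complex"
    is_short = len(text.split()) < 15
    is_single_question = text.endswith("?") and text.count("?") == 1
    # No keyword matched, so one more "simple" signal is enough,
    # which matches the 2-of-3 vote of the original for these cases.
    return "simple" if (is_short or is_single_question) else "complex"

for q in ["What is the return policy?", "Compare plan A and plan B?"]:
    print(q, "->", classify_complexity_strict(q))
```

Queries without a keyword are classified the same way as before; only keyword-bearing queries change.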

Adding a Cache Layer
The router alone saved me ~50%. The cache pushed it to 80%.

import hashlib
import json

# In production: use Redis. For prototyping: this works fine.
_cache: dict[str, str] = {}

def get_cache_key(query: str, context: str) -> str:
    # Hash query and context together so the same question asked
    # against a different context gets its own cache entry.
    payload = json.dumps({"q": query, "c": context}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def route_cached(query: str, context: str = "") -> str:
    key = get_cache_key(query, context)

    if key in _cache:
        return _cache[key]  # free

    result = route(query, context)
    _cache[key] = result
    return result

Turns out ~30% of queries in my app were near-identical. "What are your hours?" gets asked constantly. Paying for the same LLM call 200 times/day is just burning money.
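
Note that the cache above only hits on byte-identical strings, so "What are your hours?" and "what are your hours" get different keys. A light normalization step before hashing lets near-identical queries share an entry. This is a sketch rather than what the code above does: `normalize_query` and `get_normalized_cache_key` are hypothetical helpers, and dropping case and punctuation is an assumption that may be too aggressive for some domains (code snippets, case-sensitive IDs).

```python
import hashlib
import json
import re
import string

_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def normalize_query(query: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = query.lower().translate(_PUNCTUATION_TABLE)
    return re.sub(r"\s+", " ", text).strip()

def get_normalized_cache_key(query: str, context: str) -> str:
    # Only the query is normalized; the context is hashed as-is.
    payload = json.dumps({"q": normalize_query(query), "c": context}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

a = get_normalized_cache_key("What are your hours?", "")
b = get_normalized_cache_key("  what are  your hours ", "")
print(a == b)
```

The trade-off is that two queries differing only in punctuation or case now return the same cached answer, so check that this is safe for your query mix.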

Logging Costs in Real Time
You can't optimize what you don't measure. I added cost tracking so I know exactly what each call costs:

# USD per 1K tokens ($0.15/M input for GPT-4o mini, $3.00/M input for Claude Sonnet).
COST_PER_1K_TOKENS = {
    "gpt-4o-mini": {"input": 0.000150, "output": 0.000600},
    "claude-sonnet-4-6": {"input": 0.003000, "output": 0.015000},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Unknown models fall back to zero cost rather than raising.
    rates = COST_PER_1K_TOKENS.get(model, {"input": 0, "output": 0})
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1000

def route_with_logging(query: str, context: str = "") -> dict:
    complexity = classify_complexity(query)
    model = "gpt-4o-mini" if complexity == "simple" else "claude-sonnet-4-6"

    if complexity == "simple":
        response = openai_client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": context},
                {"role": "user", "content": query}
            ]
        )
        content = response.choices[0].message.content
        # OpenAI's chat completions usage object reports prompt/completion tokens.
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
    else:
        response = anthropic_client.messages.create(
            model=model,
            max_tokens=1024,
            system=context,
            messages=[{"role": "user", "content": query}]
        )
        content = response.content[0].text
        # Anthropic's usage object reports input/output tokens.
        input_tokens = response.usage.input_tokens
        output_tokens = response.usage.output_tokens

    cost = calculate_cost(model, input_tokens, output_tokens)

    print(f"[{model}] {complexity} | ${cost:.5f} | {query[:50]}...")

    return {"content": content, "cost": cost, "model": model}

Sample output:

[gpt-4o-mini] simple | $0.00008 | What are your business hours?...
[claude-sonnet-4-6] complex | $0.00340 | Compare the refund policies across...
[gpt-4o-mini] simple | $0.00006 | How do I reset my password?...
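
Printing each call is a start; to see where the money goes you also want totals per model. A minimal in-memory tally might look like the sketch below. `CostTracker` is a hypothetical helper, and it assumes each result is a dict with `model` and `cost` keys, like the one `route_with_logging` returns. The three records reuse the figures from the sample output above.

```python
from collections import defaultdict

class CostTracker:
    """Accumulates call counts and spend per model (in-memory only)."""

    def __init__(self) -> None:
        self.calls = defaultdict(int)
        self.spend = defaultdict(float)

    def record(self, result: dict) -> None:
        self.calls[result["model"]] += 1
        self.spend[result["model"]] += result["cost"]

    def summary(self) -> dict:
        total = sum(self.spend.values())
        return {
            model: {
                "calls": self.calls[model],
                "spend": self.spend[model],
                # Fraction of total spend attributed to this model.
                "share": self.spend[model] / total if total else 0.0,
            }
            for model in self.spend
        }

tracker = CostTracker()
# Stand-in results shaped like route_with_logging() output.
tracker.record({"content": "...", "cost": 0.00008, "model": "gpt-4o-mini"})
tracker.record({"content": "...", "cost": 0.00340, "model": "claude-sonnet-4-6"})
tracker.record({"content": "...", "cost": 0.00006, "model": "gpt-4o-mini"})

for model, stats in tracker.summary().items():
    print(f"{model}: {stats['calls']} calls, ${stats['spend']:.5f}, {stats['share']:.0%} of spend")
```

Even with two cheap calls for every expensive one, nearly all of the spend in this toy sample sits with the capable model, which is the pattern the router is meant to exploit.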

Results After 30 Days
| Metric | Before | After |
| --- | --- | --- |
| Avg cost per query | $0.0034 | $0.0007 |
| % queries → mini model | 0% | 73% |
| Cache hit rate | 0% | 31% |
| Monthly bill | $340 | $67 |
| Answer quality complaints | 2 | 3 |

The quality delta was negligible. Three users in a month said an answer felt shallow; all three were simple factual queries that I probably should have cached anyway.
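
As a sanity check, the average cost per query can be roughly reproduced with a simple blended-cost model. The assumptions are for illustration only: a Sonnet call costs the old average ($0.0034), a mini call costs about $0.00007 (the range in the sample log), cache hits are free, and the 73% mini share applies to the calls that miss the cache. `blended_cost` is a hypothetical helper.

```python
def blended_cost(cache_hit_rate: float, mini_share: float,
                 mini_cost: float, capable_cost: float) -> float:
    """Expected cost per query when cache hits are free and misses are routed."""
    miss_rate = 1.0 - cache_hit_rate
    routed_cost = mini_share * mini_cost + (1.0 - mini_share) * capable_cost
    return miss_rate * routed_cost

estimate = blended_cost(cache_hit_rate=0.31, mini_share=0.73,
                        mini_cost=0.00007, capable_cost=0.0034)
print(f"${estimate:.4f} per query")
```

Under those assumptions the estimate lands at about $0.0007 per query, in line with the table. Swap in your own hit rate and routing share to see which lever matters more for your traffic.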

When This Doesn't Work
Be honest about the limits:

Creative writing / long-form content: mini models struggle here, so don't route these down
Multi-document synthesis: always route to the capable model
Anything with high stakes (medical, legal, financial): don't optimize cost here, use the best model

The classify_complexity function above is naive on purpose. You know your query patterns better than I do. Tune the keywords list to your domain.
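
One way to encode the high-stakes rule is a guard that runs before the complexity check and forces the capable model whenever a sensitive term appears. The term list below is a made-up placeholder and `requires_capable_model` is a hypothetical name; substitute whatever vocabulary matters in your domain.

```python
import re

# Placeholder terms; replace with vocabulary from your own domain.
HIGH_STAKES_TERMS = ("diagnosis", "dosage", "lawsuit", "contract", "tax", "loan")

# Word-boundary match so "tax" does not fire on "syntax".
_HIGH_STAKES_RE = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in HIGH_STAKES_TERMS) + r")\b",
    re.IGNORECASE,
)

def requires_capable_model(query: str) -> bool:
    """True if the query mentions a high-stakes term and should skip the cheap model."""
    return _HIGH_STAKES_RE.search(query) is not None

for q in ["What dosage is safe for a child?", "How do I fix this syntax error?"]:
    print(q, "->", requires_capable_model(q))
```

A keyword guard like this is blunt: it misses inflections such as "taxes" and paraphrases entirely, so treat it as a floor rather than a complete safety check.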

Next Step
Before you do any of this, model your current costs so you know where the money is actually going. I used the APICalculators LLM cost calculator (free, no signup), which shows cost per model at your actual token volumes. Knowing the delta between models makes it obvious which optimization to prioritize.

Questions or a different routing approach that worked for you? Drop it in the comments.

Top comments (5)

Alex Shev • Jun 22

Routing gets much more useful when it is tied to task risk, not just price. Cheap models are great for classification and formatting, but anything that changes state or advises a user needs a clearer confidence and escalation path.

chnby • Jun 22

Completely agree — risk-based routing is the more principled approach.
Cost complexity is a decent proxy for simple cases, but you're right
that "does this query mutate state or advise a user?" is a much better
escalation signal. I've been thinking about adding a risk_level flag
alongside complexity in the router. Might write a follow-up on that.

Alex Shev • Jun 23

A risk_level flag would be a strong next step. Even a simple split between read-only, advisory, state-changing, and user-facing output would make routing decisions easier to defend.

chnby • Jun 23

Exactly — and that four-way split (read-only / advisory / state-changing / user-facing) maps almost perfectly onto the real failure modes. Read-only gets it wrong, you re-run. State-changing gets it wrong, you're debugging a corrupted record at 2am. User-facing gets it wrong, a person made a decision based on it.

The routing logic almost writes itself once those categories are explicit. Might make risk_level the centerpiece of the follow-up rather than a footnote.

Alex Shev • Jun 23

Yes, I would make risk_level the center of the follow-up. It turns routing from a cost trick into an operational policy: cheap model for low-risk reads, stronger model or human gate when state, money, or user-facing output is involved.