Sam Chen

How I Cut My AI API Bill by 90% With a Multi-Model Routing System

Last month my Claude API bill was $847. This month it's $73. Same output quality. Here's the system I built.

The Problem

I run multiple AI-powered services — content generation, email classification, SEO optimization, data extraction. Every call was going to Claude Sonnet because "it works." But most of those calls didn't need Sonnet-level intelligence.

Classifying an email as spam? That's a Haiku job. Generating embeddings? Ollama handles that for free. Writing a full article? OK, that's Sonnet. But only 15% of my calls actually needed the expensive model.

The Architecture: Empire Router

I built a routing layer that sits between my application code and the LLM providers. Every request gets classified by complexity, then routed to the cheapest model that can handle it.

from empire_router import router

# Auto-routed based on task complexity
response = router.complete(
    prompt="Classify this email: ...",
    task="classify"  # Routes to Haiku ($0.80/M tokens)
)

response = router.complete(
    prompt="Write a 1500-word article about...",
    task="generate"  # Routes to Sonnet ($3/M tokens)
)

embedding = router.embed("text to embed")  # Routes to Ollama (FREE)

The Routing Decision Tree

Task classification:
├── Binary/classification → Haiku ($0.80/$4 per M tokens)
├── Embeddings → Ollama on VPS (FREE)
├── Simple extraction → DeepSeek ($0.27/M) or Groq (FREE)
├── Content generation → Sonnet ($3/$15 per M tokens)
└── Complex reasoning → Opus ($15/$75) — used <2% of calls
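In code, that tree is just a lookup table. Here's a minimal sketch of how it could be encoded; TASK_ROUTES and pick_model are illustrative names, not the router's actual internals:

# Illustrative encoding of the decision tree above. Prices in the
# comments are per million tokens (input/output where both apply).
TASK_ROUTES = {
    "classify": "haiku",                    # $0.80 / $4
    "embed":    "ollama/nomic-embed-text",  # free, self-hosted
    "extract":  "deepseek-chat",            # $0.27 (or Groq's free tier)
    "generate": "sonnet",                   # $3 / $15
    "reason":   "opus",                     # $15 / $75, <2% of calls
}

def pick_model(task: str) -> str:
    """Resolve a task type to the cheapest model that can handle it."""
    return TASK_ROUTES[task]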

Key Design Decisions

1. Task-type routing, not content-length routing

My first attempt routed by prompt length. Terrible idea. "Is this email spam?" and "Design the architecture for a distributed cache" are both short prompts, but the first is a job for the cheapest model and the second needs the most capable one.

The task type is what determines model selection, not the token count.
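Side by side, the difference looks something like this (a sketch; the 200-character cutoff and the two-entry table are made up for illustration):

# Length-based routing, the first attempt. Both example prompts above
# are short, so both land on the cheap model -- wrong for the second.
def route_by_length(prompt: str) -> str:
    return "haiku" if len(prompt) < 200 else "sonnet"

# Task-based routing: the caller declares intent, and prompt length
# never enters the decision.
def route_by_task(task: str) -> str:
    return {"classify": "haiku", "generate": "sonnet"}[task]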

2. Fallback chains, not single-model assignments

ROUTING_CHAINS = {
    "classify": ["ollama/llama3.1:8b", "groq/llama3", "haiku", "sonnet"],
    "generate": ["sonnet", "opus"],
    "embed": ["ollama/nomic-embed-text", "voyage-3"],
}

If the primary model is down or rate-limited, it cascades to the next option. No failed requests, just slightly higher cost on fallback.
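A minimal sketch of the cascade, assuming a call_model(model, prompt) helper that wraps your actual provider clients:

# Walk the chain until a model answers; collect errors for debugging.
def complete_with_fallback(task: str, prompt: str, call_model) -> str:
    errors = []
    for model in ROUTING_CHAINS[task]:
        try:
            return call_model(model, prompt)
        except Exception as exc:  # rate limit, timeout, provider outage
            errors.append((model, exc))
    raise RuntimeError(f"all models failed for {task!r}: {errors}")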

3. Quality gates on cheap models

The router doesn't blindly trust cheap model output. For tasks where accuracy matters, it runs a quality check:

  • Send to cheap model first
  • Score the response (confidence, format validity, coherence)
  • If score < threshold → retry on next model in chain
  • Log the escalation for future routing optimization

In practice, Haiku handles 94% of classification tasks without escalation.
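Sketched out, the escalation loop looks like this; call_model() and score_response() are stand-ins for your provider wrapper and whatever mix of confidence, format-validity, and coherence checks fits the task:

# Try cheap models first; escalate only when the score check fails.
def complete_with_quality_gate(task, prompt, call_model, score_response,
                               threshold=0.8):
    response = None
    for model in ROUTING_CHAINS[task]:
        response = call_model(model, prompt)
        if score_response(response) >= threshold:
            return response
        # Log the escalation so routing thresholds can be tuned later.
        print(f"escalated past {model} for task {task!r}")
    return response  # best effort: the last (most capable) model's answer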

4. Prompt caching for repeated patterns

System prompts that exceed 500 characters get cached. For classification tasks that run the same system prompt thousands of times, this cuts input costs by 90% after the first call.
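One way to implement this is Anthropic's built-in prompt caching: mark the long system prompt with cache_control, and subsequent calls read it from cache at a steep discount. A sketch using the official anthropic Python SDK (LONG_CLASSIFICATION_PROMPT is a placeholder; note that Anthropic also enforces a minimum prompt length before caching kicks in):

import anthropic

client = anthropic.Anthropic()

# The big, reused system prompt is marked cacheable; the short,
# varying email body stays in the regular user message.
response = client.messages.create(
    model="claude-3-5-haiku-latest",
    max_tokens=16,
    system=[{
        "type": "text",
        "text": LONG_CLASSIFICATION_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Classify this email: ..."}],
)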

Results After 30 Days

Metric                  Before    After      Change
Monthly cost            $847      $73        -91%
Avg latency             2.1s      0.8s       -62%
Failed requests         12/day    0.3/day    -97%
Quality (human eval)    4.2/5     4.1/5      -2%

The quality dip is within noise. The latency improvement comes from Haiku being faster than Sonnet, plus Ollama embeddings skipping the external API round-trip entirely.

Self-Hosted vs. API: Where the Line Is

I run Ollama on a Contabo VPS (CPU-only, $15/mo). It handles:

  • All embeddings (nomic-embed-text)
  • Simple classification fallback (llama3.1:8b)
  • Data extraction on non-sensitive content

Everything that needs quality or handles sensitive data goes to API providers. The VPS pays for itself in 2 days of avoided API calls.
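Talking to that box is a single HTTP call. A sketch against Ollama's standard /api/embeddings endpoint (swap in your VPS address; requests is the only dependency):

import requests

OLLAMA_URL = "http://localhost:11434"  # or http://<your-vps>:11434

def embed(text: str) -> list[float]:
    """Free embedding from a self-hosted Ollama instance."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]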

Try It

The routing pattern works with any LLM provider combination. The key insight: treat model selection as a runtime decision, not a deployment decision.

I write about practical AI cost optimization and infrastructure at wealthfromai.com. The full router is open for anyone building similar multi-model systems — DM me if you want the architecture details.
