And why we turned the fix into a product anyone can use.
The Problem Started With Our Own Product
My team and I were building Saasio — a no-code SaaS platform that lets users launch products without writing a single line of code. Like most modern platforms, we added AI capabilities so users could get things done faster: generate UI components, write copy, answer questions about their workspace, and more.
At first, everything worked. Then our AI bill arrived.
We were using Claude Sonnet and GPT for every request — from complex, multi-step code generation all the way down to a user asking: "What does the 'Publish' button do?"
That's like hiring a neurosurgeon to put on a band-aid. Technically it works. Financially, it's absurd.
But there was a second, subtler problem we almost missed entirely.
The Mistake Nobody Talks About: Context Blindness
Imagine one of our users building a car rental landing page with our AI assistant. The conversation starts strong:
"Build me a car rental business landing page with a hero section, pricing cards, and a booking form."
This is a complex coding task. High reasoning required. Claude Opus or GPT-4o is the right call here.
Then, ten messages later, in the same chat window:
"Rewrite my hero headline and make the CTA button more compelling."
At first glance: same conversation, same context, same model — right?
Wrong. Look closer.
The first message is a coding task — architecture, component structure, logic. The second is copywriting — creativity, persuasion, tone. They're completely different cognitive tasks. And here's the kicker: that second message doesn't need any of the previous conversation history to answer well. The user is asking a standalone creative question.
Yet we were sending the full conversation — thousands of tokens of code, HTML, planning messages — to an expensive reasoning model, just to rewrite two sentences.
We were making this mistake hundreds of times a day, across every user on the platform.
That's when we decided to build a solution internally. It worked so well — cutting our AI costs dramatically while improving response quality — that we realized: every team building with AI is facing this exact problem. So we turned it into a product. That product is LLM Router.
Before We Go Further: What Do Benchmarks Actually Tell Us?
To understand why smart routing matters, you first need to understand what AI model benchmarks are really saying — and what they're not saying.
Take Claude Opus 4.6. On MMLU (Massive Multitask Language Understanding), one of the most respected AI benchmarks, it scores around 86.8%. Impressive. Now look at a mid-tier model like DeepSeek V3 — it scores around 74.2% on the same benchmark.
Most people see that gap and think: "Obviously use Opus."
But here's what that 74.2% actually means: for roughly 3 out of 4 questions, the affordable model gives a perfectly correct answer. The expensive model is only meaningfully better for the hardest, most nuanced questions — the ones that make up a fraction of your actual traffic.
Put another way: if a user asks your app "Summarize this paragraph" or "What's the plural of 'analysis'?" — you do not need Opus. You're paying a Ferrari price to drive to the corner store.
Every Model Has a Specialty
What benchmarks also reveal — when you look closely — is that AI models don't all excel at the same things:
| Task type | Where models diverge most |
|---|---|
| Complex reasoning & code | Opus-class models dominate |
| UI / frontend generation | Gemini Pro performs surprisingly well |
| Creative copywriting | Grok and GPT show strong results |
| Fast, simple Q&A | DeepSeek, Mistral — cheap and accurate |
| Legal / compliance text | Requires specific instruction + retrieval |
This isn't speculation — it shows up consistently in task-specific evaluations like HumanEval (coding), MT-Bench (conversation), and design-focused evals.
The implication is clear: routing the right task to the right model isn't just about cost — it's about getting better answers by using each model where it actually shines.
Introducing LLM Router
LLM Router is an intelligent AI gateway. One API key. Access to 400+ models. And — critically — the infrastructure to stop wasting money and start getting the best answer for every request.
Here's what it does.
1. Tag-Based Smart Routing
Instead of hardcoding `model: "claude-opus-4-6"` into every request and hoping for the best, LLM Router lets you label your requests by task type — and automatically routes each one to the model best suited for that job.
You define tags in the LLM Router dashboard. Something like:
```
coding      → anthropic/claude-opus-4-6
ui-design   → google/gemini-3.1-pro
copywriting → x-ai/grok-4
fast-cheap  → deepseek/deepseek-v3-lite
legal       → anthropic/claude-opus-4-6 + Legal Skill
```
Then in your code, you simply pass the tag with the request. LLM Router does the rest.
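A minimal sketch of what "pass the tag with the request" could look like. The field name `tag` and the request shape are assumptions for illustration (check the LLM Router docs for the real schema); the point is that no model name is hardcoded anywhere in your application code.

```python
import json

# Hypothetical request builder. "tag" is an assumed field name -- the gateway,
# not your code, decides which model serves the request.
def build_request(tag: str, user_message: str) -> dict:
    return {
        "tag": tag,  # routing hint, e.g. "coding" or "copywriting"
        "messages": [{"role": "user", "content": user_message}],
    }

# The same app code handles both scenarios from the car rental example:
payload = json.dumps(build_request("copywriting", "Rewrite my hero headline."))
```

Swapping which model backs the `copywriting` tag is then a dashboard change, not a code deploy.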
Real-world example — a development team:
- `coding` tag → Claude Opus 4.6 (best reasoning and code quality)
- `ui-design` tag → Gemini 3.1 Pro (exceptional at Tailwind and component design)
- `testing` tag → Claude Sonnet 4.6 (fast, accurate, and cheaper)
- `fast-cheap` tag → DeepSeek Lite (for anything simple)
Back to the car rental scenario: with smart routing, the "build me a landing page" request gets tagged coding and routes to Opus. The "rewrite my CTA" request gets tagged copywriting and routes to Grok. Same user experience. A fraction of the cost.
2. Chat Optimization
Here's a stat that might surprise you: in a typical Claude Code or Cursor session, over 60% of the tokens you're paying for are redundant.
Old messages that are no longer relevant. Repeated code blocks. Tool definitions that never get used. All of it gets sent to the model on every request — and you pay for every single token.
LLM Router's Chat Optimization automatically cleans your conversation history before it reaches the model. It:
- Removes off-topic history when the user has clearly changed subjects
- Deduplicates repeated code blocks or instructions
- Truncates long conversations intelligently, keeping the most critical context
- Filters unused tool definitions so you're not paying to describe tools the model won't need
- Applies middle-out compression for very long contexts, preserving the beginning and end while compressing the middle
This happens automatically, with zero changes to your code. You just pay less.
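Of the techniques above, middle-out compression is the easiest to illustrate. This sketch works on whole messages for simplicity; the real optimizer operates on tokens and does much more than truncate.

```python
def middle_out(messages: list[str], max_messages: int) -> list[str]:
    """Keep the start and end of a conversation, dropping the middle.

    Illustrative only: early messages carry setup/instructions, recent
    messages carry the live task, so the middle is the safest to compress.
    """
    if len(messages) <= max_messages:
        return messages  # nothing to do
    head = max_messages // 2
    tail = max_messages - head - 1  # reserve one slot for the marker
    return messages[:head] + ["[... earlier context compressed ...]"] + messages[-tail:]
```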
3. Tool Optimization
Modern AI applications often define 20, 30, even 50+ tools (function calls) for a model to use. The problem? Every tool definition costs tokens — and they all get sent with every request, even when the user's query only needs 2 or 3 of them.
LLM Router analyzes the incoming prompt and dynamically filters the tool list before forwarding the request. If a user asks a billing question, your code-execution tools stay home. If they ask a coding question, your CRM tools stay home.
The result: dramatically fewer tokens per request, with no loss in capability.
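A naive keyword version of this filtering looks something like the sketch below. The `keywords` field is our own illustrative addition, not part of the standard tool schema, and the real gateway presumably matches on meaning rather than exact words.

```python
def filter_tools(prompt: str, tools: list[dict]) -> list[dict]:
    """Keep only tools whose declared keywords appear in the prompt."""
    words = set(prompt.lower().split())
    return [t for t in tools if words & set(t.get("keywords", []))]

# Two of the "20, 30, even 50+" tools a real app might define:
tools = [
    {"name": "run_code", "keywords": ["code", "run", "debug"]},
    {"name": "get_invoice", "keywords": ["billing", "invoice", "refund"]},
]

billing_tools = filter_tools("I have a billing question", tools)
```

The billing question reaches the model with one tool definition instead of fifty, and you pay for one definition's worth of tokens.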
4. Automatic PII Redaction
Sending user data to AI providers is one of the biggest compliance risks teams face today — and most teams handle it poorly, if at all.
LLM Router automatically detects and masks sensitive information before the prompt ever leaves your infrastructure:
- API keys and passwords
- Credit card numbers
- Social Security / National ID numbers
- Email addresses and phone numbers
- Private keys and credentials
- Personal names in sensitive contexts
This isn't opt-in. It's on by default. Your users' data doesn't reach OpenAI, Anthropic, or anyone else — only clean, redacted prompts do.
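To make the idea concrete, here is a toy regex-based masker covering two of the categories above. These patterns are illustrative assumptions: production redaction also needs validation (e.g. Luhn checks for card numbers) and NER-style detection for names.

```python
import re

# Illustrative patterns only -- real detectors are far stricter.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # 13-16 digit runs
}

def redact(text: str) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```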
For teams building in regulated industries (fintech, healthtech, legal), this alone is worth the switch.
5. Skills — Modular Knowledge Injection
This is our favorite feature, and arguably the most unique thing LLM Router offers.
The usual approach to customizing AI behavior is a giant system prompt — one massive block of instructions, company rules, brand guidelines, coding standards, and domain knowledge stuffed into every request. It's expensive, it's messy, and it means the model is processing your entire knowledge base even when it's answering a simple question.
Skills solve this elegantly. A Skill is a small, focused knowledge package — a SKILL.md file with optional supporting documents — that LLM Router injects only when relevant to the current request.
Install Skills from the public catalog at skills.sh or connect your own GitHub repository. LLM Router acts like an intelligent librarian: it reads the user's prompt, decides which Skills are actually needed, and injects only those — nothing more.
Example:
- User asks a coding question → LLM Router injects your `coding-standards` Skill
- User asks about pricing → LLM Router injects your `pricing-rules` Skill
- User asks a legal question → LLM Router injects your `compliance` Skill and routes to Claude Opus
The result is a model that feels deeply knowledgeable about your product — without paying to send everything on every request.
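The "librarian" step above can be sketched as a relevance match over installed Skills. This toy version uses keyword triggers; the skill names mirror the examples above, but the trigger mechanism and bodies are assumptions for illustration.

```python
# Hypothetical installed Skills: trigger words plus the SKILL.md body text.
SKILLS = {
    "coding-standards": {"triggers": {"code", "function", "refactor"},
                         "body": "Follow the team style guide."},
    "pricing-rules": {"triggers": {"price", "pricing", "plan"},
                      "body": "Apply the team pricing rules."},
}

def select_skills(prompt: str) -> list[str]:
    """Return only the skill bodies relevant to this prompt."""
    words = set(prompt.lower().split())
    return [s["body"] for s in SKILLS.values() if words & s["triggers"]]
```

Only the selected bodies get injected into the request, so a pricing question never pays for your coding standards.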
Pricing — Transparent by Design
Most AI gateways mark up token prices silently. You pay $X per million tokens thinking that's the provider rate — it isn't.
LLM Router doesn't do that. We pass through tokens at the exact list prices from our upstream providers, with zero markup added by us. Claude Opus costs what Anthropic charges for Claude Opus. GPT-4o costs what OpenAI charges for GPT-4o. Nothing hidden.
The only fees you'll see are Stripe's standard payment processing fee and any applicable routing costs — both of which are clearly disclosed. That's it.
This matters more than it sounds: when you're routing hundreds of thousands of requests across multiple providers, even a 10–15% silent markup compounds into a significant monthly cost. We'd rather you spend that money on more requests.
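The arithmetic behind that compounding is worth seeing once. All figures below are illustrative, not real provider prices:

```python
# A mid-sized workload at an assumed blended provider rate.
monthly_tokens_m = 500   # 500M tokens per month
provider_rate = 3.00     # $ per 1M tokens (illustrative)
markup = 0.12            # a 12% silent gateway markup

base_cost = monthly_tokens_m * provider_rate  # what the providers charge
hidden_cost = base_cost * markup              # what the markup quietly adds
```

At these assumed numbers, the markup adds roughly $180 every month, over $2,000 a year, that buys you nothing.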
The Bottom Line
Every team building AI features is paying more than they should and getting less than they could. Not because they're doing something wrong — but because the tools they're using weren't designed to be smart about where and how AI is deployed.
LLM Router was built by a team that felt this problem firsthand, measured it carefully, and solved it systematically.
One API key. 400+ models. Zero wasted tokens.