Most apps I’ve worked on treat LLMs in a very simple way:
You pick a model → send every request to it → hope for the best.
At first, that works.
But over time I kept running into the same problems:
- simple queries hitting expensive models
- provider outages breaking entire flows
- no control over cost vs. quality tradeoffs
So I started building a small LLM routing layer that sits in front of model calls and decides which model should handle each request.
This turned out to be way more interesting (and harder) than I expected.
## The core idea
Instead of this:

`app → single LLM → response`

I wanted:

`app → router → (cheap model / reasoning model / fallback) → response`
The router decides based on the prompt:
- simple → cheaper / faster model
- complex → reasoning model
- failure → fallback provider
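In code, that decision boils down to a small lookup. This is only an illustrative sketch — the tier names and `pickModel()` are hypothetical, not my actual implementation:

```javascript
// Map each classified intent to a model tier. Names are placeholders.
const MODEL_TIERS = {
  simple: "cheap-fast-model",
  complex: "reasoning-model",
  fallback: "fallback-provider-model",
};

function pickModel(intent, providerHealthy = true) {
  // A provider outage overrides everything: go straight to the fallback.
  if (!providerHealthy) return MODEL_TIERS.fallback;
  // Unknown intents default to the cheap tier; escalate later if needed.
  return MODEL_TIERS[intent] ?? MODEL_TIERS.simple;
}
```

The interesting part isn't this lookup — it's producing the `intent` and `providerHealthy` inputs reliably.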
## What I built
The system is a self-hostable gateway with:
- multi-provider support (Groq, with Gemini as fallback)
- intent-based routing (embedding similarity)
- semantic caching to avoid repeated calls
- health-aware failover across providers
- multi-tenant API keys + quotas
For embeddings, I experimented with running a local BGE model via Transformers.js instead of using external APIs.
## The hardest problem: routing decisions
This is where things get tricky.
At first I used embedding similarity to classify prompts into categories like:
- simple question
- summarization
- code / reasoning
It works well for clear cases.
But ambiguous prompts break everything.
Example:
> “Explain this system design in simple terms”
Is that:
- summarization?
- reasoning?
- both?
This is where simple heuristics start to fall apart.
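The failure mode is easy to see in miniature. Here's a toy version of embedding-similarity classification — the vectors are made up (real ones come from an embedding model like BGE), but the margin check is the part that matters:

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Toy 3-dimensional "centroids" for each intent. Real centroids would be
// averages of example-prompt embeddings.
const INTENT_CENTROIDS = {
  simple: [1, 0, 0],
  summarization: [0, 1, 0],
  reasoning: [0, 0, 1],
};

// Return the best-matching intent, or "ambiguous" when the top two
// scores are too close together to trust.
function classify(embedding, margin = 0.1) {
  const scored = Object.entries(INTENT_CENTROIDS)
    .map(([intent, c]) => [intent, cosine(embedding, c)])
    .sort((x, y) => y[1] - x[1]);
  const [best, second] = scored;
  return best[1] - second[1] < margin ? "ambiguous" : best[0];
}
```

A prompt that sits between two centroids — like the summarization-plus-reasoning example above — lands in the "ambiguous" bucket, and then you still have to decide what to do with it.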
## Local embeddings: great idea, annoying reality
Running embeddings locally felt like a big win:
- no external API
- no rate limits
- more control
But in practice:
- cold start takes ~2–5 seconds (ONNX init)
- memory overhead of ~30–50 MB, even for small models
- scaling becomes tricky
Once the model is warm, performance is fine.
But that first request penalty is very real, especially for user-facing systems.
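One mitigation I'd suggest (a sketch, not what the gateway necessarily does): initialize the model eagerly at process start behind a shared promise, so no user request ever pays the init cost. `loadEmbedder()` here is a stand-in for the real Transformers.js / ONNX setup:

```javascript
// Stand-in for the expensive ~2-5s model load (real code would call
// the Transformers.js pipeline here).
async function loadEmbedder() {
  return { embed: (text) => new Array(384).fill(0) };
}

let embedderPromise = null;

function getEmbedder() {
  // Every caller shares one promise, so init runs at most once,
  // and concurrent first requests don't each trigger a load.
  if (!embedderPromise) embedderPromise = loadEmbedder();
  return embedderPromise;
}

// Warm at process start instead of on the first user request.
getEmbedder();
```

This doesn't make the cold start cheaper — it just moves it to deploy time, where nobody is waiting on it.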
## What actually worked
A few things that made a noticeable difference:
- semantic caching → avoids recomputing embeddings and responses
- fallback logic → makes the system much more reliable
- cheap-first routing → try fast/cheap models, escalate if needed
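The semantic cache is simpler than it sounds: instead of requiring an exact prompt match, reuse a cached response when the new prompt's embedding is close enough to one you've seen. A minimal sketch (the class, the linear scan, and the 0.95 threshold are all illustrative — a real system would use a vector index):

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
  constructor(threshold = 0.95) {
    this.threshold = threshold;
    this.entries = []; // { embedding, response }
  }

  // Linear scan for a near-duplicate prompt; fine for a sketch,
  // replace with an ANN index at scale.
  get(embedding) {
    for (const e of this.entries) {
      if (cosine(embedding, e.embedding) >= this.threshold) return e.response;
    }
    return null;
  }

  set(embedding, response) {
    this.entries.push({ embedding, response });
  }
}
```

The threshold is the whole game: too low and users get someone else's answer, too high and the cache never hits.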
## What didn’t work (yet)

- purely heuristic routing (not reliable enough)
- static thresholds for classification
- assuming “simple vs. complex” is easy to define
## Where I think this goes next
The obvious direction is moving toward learning-based routing:
- track which responses get escalated
- use retries / failures as signals
- gradually learn which model performs best per prompt type
Instead of hardcoding rules, let the system adapt over time.
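One way that could look in practice (purely a sketch — the class, thresholds, and stats shape are all hypothetical): record an escalation every time a cheap model's answer wasn't good enough, and stop routing an intent to a model whose escalation rate climbs too high.

```javascript
class AdaptiveRouter {
  constructor(models, maxEscalationRate = 0.3, minSamples = 10) {
    this.models = models; // ordered cheapest-first
    this.maxRate = maxEscalationRate;
    this.minSamples = minSamples;
    this.stats = new Map(); // "intent|model" -> { tries, escalations }
  }

  key(intent, model) { return `${intent}|${model}`; }

  // Call after each request: did this model's answer need escalating?
  record(intent, model, escalated) {
    const k = this.key(intent, model);
    const s = this.stats.get(k) ?? { tries: 0, escalations: 0 };
    s.tries++;
    if (escalated) s.escalations++;
    this.stats.set(k, s);
  }

  // Cheapest model whose observed escalation rate is acceptable.
  pick(intent) {
    for (const model of this.models) {
      const s = this.stats.get(this.key(intent, model));
      if (!s || s.tries < this.minSamples) return model; // not enough data yet
      if (s.escalations / s.tries <= this.maxRate) return model;
    }
    // Everything escalates for this intent: send it straight to the strongest model.
    return this.models[this.models.length - 1];
  }
}
```

The appeal is that the routing table writes itself from production traffic, instead of from my guesses about what "complex" means.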
## Biggest takeaway
Building around LLMs isn’t just about prompts.
It’s about:
- cost control
- reliability
- system design
The model is just one part of the system.
## Curious to hear from others
If you’ve worked on something similar:
- How are you deciding which model to use?
- Are you running embeddings locally or using APIs?
- Have you tried any learning-based routing approaches?
Would love to hear how others are tackling this.