I caught myself doing something stupid the other day. I used a 30B parameter model to summarize a two-paragraph email. The GPU spun up, the fans kicked in, and 8 seconds later I had my summary. The same summary my 4B model would've given me in 0.3 seconds.
That's when I realized: I'd been treating every AI request the same way. Big model for everything. Small model for nothing. No intelligence in the routing at all.
So I built a model router. Not a fancy orchestrator with Kubernetes and service meshes — just a simple function that looks at what you're asking and sends it to the right model on the right machine. It's the single most impactful thing I've done for my local AI setup this year.
The Problem
I run three machines with Ollama (Mac Mini M4, Windows PC with RTX 3060, Ubuntu box). Between them, I have maybe 8 models installed. Before the router, my workflow was:
- Need something → open terminal
- Think about which model is good enough
- Think about which machine has it
- Type the full URL:
  curl http://192.168.1.106:11434/api/generate -d '{"model": "qwen3-coder:30b", ...}'
- Wait
Steps 2-4 happened every single time. I was spending more mental energy on routing than on the actual task. And half the time I'd default to the biggest model just because "it's probably better."
Here's the thing: it's usually not better. A 4B model summarizing text is 95% as good as a 30B model summarizing text. But it's 25x faster and uses 1/10th the resources. The big model should earn its keep on tasks where size actually matters.
The Router
I started with a Python function. Nothing fancy:
# Model registry: what's available where
MODELS = {
    # Quick chat, summaries, simple Q&A → Mac Mini (fast response)
    "quick": {
        "model": "qwen3:4b",
        "endpoint": "http://localhost:11434",
    },
    # Vision / image analysis → Windows GPU (needs VRAM)
    "vision": {
        "model": "granite3.2-vision:2b",
        "endpoint": "http://192.168.1.106:11434",
    },
    # Code generation, complex reasoning → Windows GPU
    "code": {
        "model": "qwen3-coder:30b",
        "endpoint": "http://192.168.1.106:11434",
    },
    # Deep reasoning, math, logic puzzles → Windows GPU
    "reasoning": {
        "model": "deepseek-r1:8b",
        "endpoint": "http://192.168.1.106:11434",
    },
    # Fallback for when Windows is rented out → Ubuntu CPU
    "fallback": {
        "model": "minicpm-v",
        "endpoint": "http://192.168.1.100:11434",
    },
}

def route(prompt: str) -> dict:
    """Decide which model should handle this prompt."""
    p = prompt.lower()

    # Vision tasks
    if any(w in p for w in ["image", "screenshot", "picture", "see this", "what's on"]):
        return MODELS["vision"]

    # Code tasks
    if any(w in p for w in ["function", "class", "debug", "refactor", "implement",
                            "write code", "python", "javascript", "typescript",
                            "api endpoint", "sql", "regex"]):
        return MODELS["code"]

    # Reasoning tasks
    if any(w in p for w in ["prove", "solve", "calculate", "math", "logic",
                            "reason", "analyze", "compare", "evaluate"]):
        return MODELS["reasoning"]

    # Default: quick model for everything else
    return MODELS["quick"]
That's it. That's the whole router. It classifies prompts by keywords and sends them to the appropriate model. Is it perfect? No. Does it need to be? Also no.
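For context, here's roughly what the calling side looks like. The ask() helper is my own naming for this post, not part of the router itself — it picks a target with route(), posts the prompt to that machine's Ollama /api/generate endpoint with streaming off, and returns the reply:

import requests

def ask(prompt: str) -> str:
    """Route a prompt to the right model on the right machine and return the reply."""
    target = route(prompt)
    resp = requests.post(
        f"{target['endpoint']}/api/generate",
        json={"model": target["model"], "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# "summarize this email: ..."        → qwen3:4b on the Mac Mini
# "write a python function that ..." → qwen3-coder:30b on the Windows GPU

Once that helper exists, the model and machine choice disappears from your day entirely — you just ask.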
Why This Matters More Than You Think
Before the router, my typical day looked like this:
- 30 quick questions (summarize this, what does this error mean, rephrase this)
- 5 code tasks (write a function, debug this, add tests)
- 2 vision tasks (what's in this screenshot)
- 1 deep reasoning task (complex analysis)
Without routing, I'd use the big model for everything. 38 requests to the 30B model. Each taking 5-15 seconds. Total wait time: ~4-5 minutes of just... waiting. While my GPU is maxed out and I can't run anything else.
With routing:
- 30 quick → 4B model on Mac Mini, 0.3s each → 9 seconds total
- 5 code → 30B on Windows GPU, 8-12s each → ~50 seconds
- 2 vision → vision model on GPU, 4-13s each → ~15 seconds
- 1 reasoning → 8B model on GPU, 5-8s → ~6 seconds
Total: ~80 seconds instead of ~5 minutes. And my GPU is free for most of that time, so I could be generating images or renting it out simultaneously.
The router doesn't just save time. It frees capacity.
The Fallback System
The real value of routing hit me when the Windows PC went down for a Windows Update mid-session. Before, this was catastrophic — all my "important" models were on that machine.
Now the router has fallback logic:
import requests

def route_with_fallback(prompt: str) -> dict:
    """Route with health checks and fallbacks."""
    primary = route(prompt)

    # Check if the primary endpoint is alive
    try:
        requests.get(f"{primary['endpoint']}/api/tags", timeout=2)
        return primary
    except requests.RequestException:
        pass  # Primary is down, find a fallback

    # Fallback chain based on task type
    if primary == MODELS["code"]:
        # Can't run 30B on the Mac, but 4B is better than nothing
        return MODELS["quick"]
    elif primary == MODELS["vision"]:
        # Try Ubuntu's slow vision model
        try:
            requests.get(f"{MODELS['fallback']['endpoint']}/api/tags", timeout=2)
            return MODELS["fallback"]
        except requests.RequestException:
            return MODELS["quick"]  # No vision available, fall back to text
    elif primary == MODELS["reasoning"]:
        return MODELS["quick"]  # Quick model can reason a bit

    return MODELS["quick"]
When Windows reboots, my workflow doesn't stop. The router sends code tasks to the small model (worse but functional), vision tasks to Ubuntu (slow but works), and quick tasks keep humming along on the Mac Mini. Something always answers.
The Token Cost Angle
I also track tokens per model per day. Not because I'm cheap (these are all free — they run locally), but because it tells me where I'm spending compute:
import json
import time
from collections import defaultdict
from datetime import date

# Simple token counter per route
token_log = defaultdict(int)

def track_tokens(route_name: str, tokens: int):
    token_log[route_name] += tokens
    # Append each request to the day's log file
    with open(f"token-usage-{date.today()}.jsonl", "a") as f:
        f.write(json.dumps({"route": route_name, "tokens": tokens, "ts": time.time()}) + "\n")
After a week of tracking, I found: 72% of my requests were going to the quick model. Only 15% actually needed the big code model. 8% were vision. 5% reasoning.
I was using a sledgehammer for 72% of my nails. No wonder the GPU always felt busy.
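If you want the same breakdown from your own logs, a minimal sketch that tallies the token-usage-*.jsonl files written by track_tokens above (counting requests per route, which is how I got the percentages):

import glob
import json
from collections import Counter

# Count requests per route across all daily log files
counts = Counter()
for path in glob.glob("token-usage-*.jsonl"):
    with open(path) as f:
        for line in f:
            counts[json.loads(line)["route"]] += 1

total = sum(counts.values())
for route_name, n in counts.most_common():
    print(f"{route_name:10s} {n:5d}  {100 * n / total:.0f}%")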
What I'd Do Differently If I Started Over
1. Build the router on day one, not month three.
I wasted months manually routing. The router took an afternoon to write. If you're running multiple models, build the routing first. You can always make it smarter later.
2. Start with keyword routing. Don't over-engineer it.
I considered using embeddings to classify prompts. I considered training a tiny classifier. I considered using an LLM to decide which LLM to use (yes, really). Keyword matching works for 90% of cases. Ship the simple thing.
3. Make the fallback automatic.
My first version just errored when the GPU machine was down. The fallback logic was an afterthought. It should've been the first thing I built — because machines go down, and a degraded response is infinitely better than no response.
4. Log everything.
You can't optimize what you don't measure. Once I started logging which routes were used, the 72% quick-model stat jumped out immediately. Without logs, I would've kept thinking I needed the big model "most of the time."
Beyond Keywords: What I'm Adding Next
The keyword router works, but it has blind spots. A prompt like "help me understand why this function is slow" could be code OR reasoning. So I'm adding:
Confidence scoring: If the prompt matches multiple categories, route to the cheaper model first. If the response quality seems low (response is very short, model seems confused), auto-retry with the bigger model.
Context-aware routing: If the last 3 prompts were all about the same codebase, keep using the code model even if the current prompt doesn't have code keywords. Context continuity matters.
Cost-aware routing for cloud fallback: When local models can't handle something, I fall back to cloud APIs. The router should know: "this task isn't worth a $0.03 GPT-4 call, use the local model." Or: "this is worth it, spend the API credits."
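None of this exists yet, but the confidence-scoring piece is concrete enough to sketch. The generate() helper and the length thresholds below are placeholders I'm playing with, not a finished design:

import requests

def generate(target: dict, prompt: str) -> str:
    """Call one Ollama endpoint directly (hypothetical helper)."""
    resp = requests.post(
        f"{target['endpoint']}/api/generate",
        json={"model": target["model"], "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def ask_with_escalation(prompt: str) -> str:
    """Start on the router's pick; escalate to the big model if the answer looks weak."""
    first_choice = route(prompt)
    reply = generate(first_choice, prompt)

    # Crude quality check: a suspiciously short answer to a meaty prompt
    # is treated as "the small model probably missed it" — retry on the 30B.
    if len(reply) < 80 and len(prompt) > 200 and first_choice is not MODELS["code"]:
        reply = generate(MODELS["code"], prompt)
    return reply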
Do You Need This?
If you run one model on one machine: no. You're fine. The router solves a multi-model, multi-machine problem.
If you're starting to accumulate models (a small one for chat, a big one for code, a vision model): yes. The moment you have more than 2 models, you need routing. Otherwise you'll default to the biggest one every time, and that's a waste.
If you have multiple machines: absolutely. The router isn't just about picking the right model — it's about picking the right machine. Code tasks go where the GPU is. Quick tasks stay local. Background tasks go to the always-on box.
The beauty is that the router grows with you. Start with 2 models and a simple if/else. Add more models, add more rules. Add more machines, add more endpoints. The architecture stays the same.
I write about running AI locally, home lab setups, and turning hardware into income. If that's your jam, I post every few days.
My AI lab setup: 3 machines, 1 desk, zero cloud dependency