Sam Hartley

I Built a Model Router That Picks the Right AI for Every Task — Here's Why You Should Too

I caught myself doing something stupid the other day. I used a 30B parameter model to summarize a two-paragraph email. The GPU spun up, the fans kicked in, and 8 seconds later I had my summary. The same summary my 4B model would've given me in 0.3 seconds.

That's when I realized: I'd been treating every AI request the same way. Big model for everything. Small model for nothing. No intelligence in the routing at all.

So I built a model router. Not a fancy orchestrator with Kubernetes and service meshes — just a simple function that looks at what you're asking and sends it to the right model on the right machine. It's the single most impactful thing I've done for my local AI setup this year.

The Problem

I run three machines with Ollama (Mac Mini M4, Windows PC with RTX 3060, Ubuntu box). Between them, I have maybe 8 models installed. Before the router, my workflow was:

  1. Need something → open terminal
  2. Think about which model is good enough
  3. Think about which machine has it
  4. Type the full URL: curl http://192.168.1.106:11434/api/generate -d '{"model": "qwen3-coder:30b", ...}'
  5. Wait

Steps 2-4 happened every single time. I was spending more mental energy on routing than on the actual task. And half the time I'd default to the biggest model just because "it's probably better."

Here's the thing: it's usually not better. A 4B model summarizing text is 95% as good as a 30B model summarizing text. But it's 25x faster and uses 1/10th the resources. The big model should earn its keep on tasks where size actually matters.

The Router

I started with a Python function. Nothing fancy:

# Model registry: what's available where
MODELS = {
    "quick": {
        "model": "qwen3:4b",
        "endpoint": "http://localhost:11434",
    },
    "vision": {
        "model": "granite3.2-vision:2b",
        "endpoint": "http://192.168.1.106:11434",
    },
    "code": {
        "model": "qwen3-coder:30b",
        "endpoint": "http://192.168.1.106:11434",
    },
    "reasoning": {
        "model": "deepseek-r1:8b",
        "endpoint": "http://192.168.1.106:11434",
    },
    "fallback": {
        "model": "minicpm-v",
        "endpoint": "http://192.168.1.100:11434",
    },
}

def route(prompt: str) -> dict:
    p = prompt.lower()
    if any(w in p for w in ["image", "screenshot", "picture"]):
        return MODELS["vision"]
    if any(w in p for w in ["function", "class", "debug", "refactor", "implement"]):
        return MODELS["code"]
    if any(w in p for w in ["prove", "solve", "calculate", "math", "logic"]):
        return MODELS["reasoning"]
    return MODELS["quick"]

That's it. That's the whole router. It classifies prompts by keywords and sends them to the appropriate model. Is it perfect? No. Does it need to be? Also no.
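
Wiring it up is one more small function. Here's roughly how I call it: a minimal sketch using the requests library against Ollama's /api/generate endpoint (the generate/ask helper names are just for illustration, not anything special):

import requests

def generate(target: dict, prompt: str) -> str:
    """Send the prompt to whichever MODELS entry the router picked."""
    resp = requests.post(
        f"{target['endpoint']}/api/generate",
        json={"model": target["model"], "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def ask(prompt: str) -> str:
    """Route, then generate."""
    return generate(route(prompt), prompt)

print(ask("Summarize this email: ..."))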

Why This Matters More Than You Think

Before the router, my typical day looked like this:

  • 30 quick questions (summarize this, what does this error mean, rephrase this)
  • 5 code tasks (write a function, debug this, add tests)
  • 2 vision tasks (what's in this screenshot)
  • 1 deep reasoning task (complex analysis)

Without routing, I'd use the big model for everything. 38 requests to the 30B model. Each taking 5-15 seconds. Total wait time: ~4-5 minutes of just... waiting.

With routing:

  • 30 quick → 4B model on Mac Mini, 0.3s each → 9 seconds total
  • 5 code → 30B on Windows GPU, 8-12s each → ~50 seconds
  • 2 vision → vision model on GPU, 4-13s each → ~15 seconds
  • 1 reasoning → 8B model on GPU, 5-8s → ~6 seconds

Total: ~80 seconds instead of ~5 minutes. And my GPU is free for most of that time.

The Fallback System

The real value hit me when the Windows PC went down for a Windows Update mid-session. Before, this was catastrophic.

Now the router has fallback logic — when a machine is down, it finds the next best option. Code tasks go to the small model (worse but functional). Vision tasks try Ubuntu. Quick tasks keep humming on the Mac Mini. Something always answers.
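
The mechanics are simple. Roughly, it looks like this (a sketch: the fallback chains and helper names are illustrative, and in this version the router hands back a category name like "code" that gets resolved to a live machine; Ollama's /api/tags endpoint works as a cheap liveness probe):

import requests

# Illustrative fallback chains: try each entry left to right until a machine answers.
FALLBACKS = {
    "quick": ["quick", "fallback"],
    "vision": ["vision", "fallback"],   # the Ubuntu box's minicpm-v can see images too
    "code": ["code", "quick"],          # worse but functional
    "reasoning": ["reasoning", "quick"],
}

def is_up(endpoint: str) -> bool:
    """Cheap liveness probe: a running Ollama answers GET /api/tags."""
    try:
        return requests.get(f"{endpoint}/api/tags", timeout=2).ok
    except requests.RequestException:
        return False

def pick(category: str) -> dict:
    """Return the first healthy MODELS entry in the category's fallback chain."""
    for name in FALLBACKS[category]:
        target = MODELS[name]
        if is_up(target["endpoint"]):
            return target
    raise RuntimeError("No machine is answering")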

The Token Cost Angle

I track tokens per model per day. Not because I'm cheap (these are all free — local), but because it tells me where I'm spending compute:

After a week: 72% of my requests went to the quick model. Only 15% needed the big code model.

I was using a sledgehammer for 72% of my nails. No wonder the GPU always felt busy.
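
The tracking itself is nothing clever. Here's a sketch of the kind of counter I keep, assuming the non-streaming /api/generate response (which reports prompt_eval_count and eval_count) and a throwaway JSON file for the tally:

import datetime
import json

TALLY_FILE = "token_tally.json"  # illustrative path

def record_usage(model: str, response_json: dict) -> None:
    """Add one request's token counts to a per-day, per-model tally."""
    day = datetime.date.today().isoformat()
    try:
        with open(TALLY_FILE) as f:
            tally = json.load(f)
    except FileNotFoundError:
        tally = {}
    tokens = response_json.get("prompt_eval_count", 0) + response_json.get("eval_count", 0)
    tally.setdefault(day, {}).setdefault(model, {"requests": 0, "tokens": 0})
    tally[day][model]["requests"] += 1
    tally[day][model]["tokens"] += tokens
    with open(TALLY_FILE, "w") as f:
        json.dump(tally, f, indent=2)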

What I'd Do Differently

1. Build the router on day one. It took an afternoon. I wasted months manually routing.

2. Start with keyword routing. I considered embeddings, classifiers, even using an LLM to pick the LLM. Keywords work for 90% of cases. Ship the simple thing.

3. Make the fallback automatic. My first version just errored when the GPU machine was down. A degraded response is infinitely better than no response.

4. Log everything. You can't optimize what you don't measure. The 72% stat jumped out immediately once I started tracking.

Beyond Keywords

The keyword router works but has blind spots. So I'm adding:

  • Confidence scoring: If a prompt matches multiple categories, try the cheaper model first. Auto-retry with a bigger model if quality seems low (sketched after this list).
  • Context-aware routing: If the last 3 prompts were about the same codebase, keep using the code model even without explicit keywords.
  • Cost-aware fallback: When local models can't handle something, the router should know whether a cloud API call is "worth it" or not.
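
Here's a rough sketch of the confidence-scoring idea, reusing the generate() helper from earlier; the quality check is a placeholder heuristic, not a real scorer:

def looks_low_quality(answer: str) -> bool:
    """Placeholder heuristic; real confidence scoring would be smarter than a length check."""
    return len(answer.strip()) < 40 or "i'm not sure" in answer.lower()

def ask_with_escalation(prompt: str) -> str:
    """Try the cheap model first; retry on the 30B code model if the answer looks weak."""
    answer = generate(MODELS["quick"], prompt)
    if looks_low_quality(answer):
        answer = generate(MODELS["code"], prompt)
    return answer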

Do You Need This?

If you run one model on one machine: no. You're fine.

If you have more than 2 models: yes. Otherwise you'll default to the biggest one every time, and that's a waste.

If you have multiple machines: absolutely. The router isn't just about picking the right model — it's about picking the right machine. Code tasks go where the GPU is. Quick tasks stay local. Background tasks go to the always-on box.

Start with 2 models and a simple if/else. Add more as you grow. The architecture stays the same.


I write about running AI locally, home lab setups, and turning hardware into income. If that's your jam, I post every few days.
