Pontus Lindh
# Why I stopped routing AI agents by keyword — and what I built instead

I spent weeks building a local AI assistant before I admitted my routing system was broken.
The setup was simple: if the user's message contained "code" or "debug", route to the dev agent. If it contained "write" or "edit", route to the editor. If it contained "plan" or "decide", route to the strategist.
It worked fine in demos. It fell apart in real use.
## The problem with keyword routing
Natural language doesn't map cleanly to intent. "Can you help me write a function?" triggers the editor instead of the dev agent because "write" appears before "function." "Should I design the API this way?" goes to the designer when it should go to the strategist. "Am I thinking about this wrong?" matches nothing at all.
After enough of these failures I measured it: keyword routing was hitting around 60% accuracy on real conversation data. Tolerable on paper. Maddening to live with.
## The fix: LLM classification with a confidence threshold
The approach that actually works is using a small, fast model as a classifier. You give it a system prompt listing your available agents, it reads the message, and it outputs a routing decision with a confidence score.

```python
import json
import ollama

CLASSIFIER_SYSTEM = """You are a routing engine for a multi-agent AI assistant.
Your only job: read a user message and output a JSON routing decision.

Available agents: architect, strategist, editor, researcher, devil, scheduler

Respond ONLY with valid JSON:
{"agent_id": "<id>", "confidence": <0.0-1.0>, "reason": "<one sentence>"}"""

# Async calls go through a client instance, not the module-level ollama.chat
client = ollama.AsyncClient()

async def classify(message: str) -> tuple[str, float]:
    resp = await client.chat(
        model="llama3.1:8b",  # fast model used for routing only
        messages=[
            {"role": "system", "content": CLASSIFIER_SYSTEM},
            {"role": "user", "content": message},
        ],
        options={"temperature": 0.1, "num_predict": 80},
    )
    parsed = json.loads(resp["message"]["content"])
    return parsed["agent_id"], float(parsed["confidence"])
```

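One practical caveat: small models don't always emit clean JSON — sometimes the object arrives wrapped in prose or a code fence. A defensive parse keeps a bad response from crashing the router. This is a sketch with my own names (`parse_routing`, `fallback_agent`), not part of the original code:

```python
import json

def parse_routing(raw: str, fallback_agent: str) -> tuple[str, float]:
    """Extract a routing decision from possibly-messy model output."""
    try:
        # Grab the first {...} span so surrounding prose or fences don't break parsing
        start, end = raw.index("{"), raw.rindex("}") + 1
        parsed = json.loads(raw[start:end])
        return parsed["agent_id"], float(parsed["confidence"])
    except (ValueError, KeyError, TypeError):
        # On any parse failure, stay with the current agent at zero confidence
        return fallback_agent, 0.0
```

With this in place, a garbled classifier response degrades to "don't switch" instead of an exception mid-conversation.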
But the classifier alone isn't enough. The key insight is: only switch agents when confidence clears a threshold.

```python
CONFIDENCE_THRESHOLD = 0.72

async def route(message: str, session) -> str:
    agent_id, confidence = await classify(message)

    if confidence >= CONFIDENCE_THRESHOLD:
        session.set_agent(agent_id)  # confident: switch agents
        return agent_id
    # Below the threshold: stay put, optionally suggest a switch
    return session.current_agent
```

Below 0.72, you keep the current agent. The user might get a suggestion ("Looks like this might be a Strategist question — type /switch strategist to change"), but you don't force a jarring mid-conversation switch.
## Why 0.72 specifically?
Through iteration. Below 0.65 you get too many false switches; above 0.80 you get too much stickiness, where the agent won't change even when it clearly should. With more agents, consider a slightly lower threshold; with fewer, slightly higher.
One more wrinkle: after 3+ consecutive messages to the same agent, raise the threshold by 0.15. A conversation that's been in "coding mode" for a while needs a stronger signal to switch than a fresh message.
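That stickiness rule is easy to sketch as a dynamic threshold. The helper name and the assumption that the session counts consecutive messages are mine:

```python
STICKY_AFTER = 3        # consecutive messages before the bonus kicks in
STICKINESS_BONUS = 0.15

def effective_threshold(base: float, consecutive_messages: int) -> float:
    """Raise the switching bar once a conversation has settled into one agent."""
    if consecutive_messages >= STICKY_AFTER:
        return base + STICKINESS_BONUS
    return base
```

So a fresh conversation switches at 0.72, but one that has been in "coding mode" for three messages needs 0.87 — a genuinely unambiguous signal — before the router moves it.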
## The result
Routing accuracy went from ~60% to consistently above 90% in real use. More importantly, it stopped feeling broken. The system now handles "what's wrong with my code AND should I restructure this whole thing?" correctly — the classifier sees the architectural question as the dominant intent and routes to the strategist, not the dev agent.
The full pattern — including multi-agent pipelines, agentic tool loops, and the 3-layer memory architecture built on top of this router — is documented in a guide I put together at nullfeather.gumroad.com.
