The Problem
We had four specialist AI agents — math, verbal, data insights, and strategy — each with a different system prompt, RAG namespace, and reasoning style. Every user message needed to land on the right one.
The naive solution: run every message through GPT-4o, ask it to decide, then call the specialist. That added 800–1,200ms of latency before the user saw a single token. On a tutoring app where response feel matters, that was a full second of dead air, every message.
We needed routing to be invisible — no perceived delay, no visible seam between agents.
What We Were Building
SamiWISE is a GMAT prep tutor with four specialist agents: quantitative reasoning, verbal, data insights, and strategy. Each agent has its own system prompt tuned to its domain, a dedicated Pinecone namespace, and different behavior — the math agent scaffolds step-by-step, the verbal agent uses Socratic questioning, the strategy agent answers directly.
Routing wrong has real costs: the verbal agent confidently giving arithmetic advice, or the strategy agent running a full Socratic debrief when a student just needs a direct answer. Getting the right agent matters. But routing itself shouldn't cost a second of latency.
The First Approach (And Why It Failed)
We started with a single GPT-4o call as a router:
```typescript
// First attempt — routing via GPT-4o
const routingResponse = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "system",
      content: `You are a routing agent. Given a user message, return ONLY a JSON object:
{"agent": "quant" | "verbal" | "data_insights" | "strategy"}
No explanation. No other text.`,
    },
    { role: "user", content: userMessage },
  ],
  response_format: { type: "json_object" },
});

const { agent } = JSON.parse(routingResponse.choices[0].message.content!);
// then call the specialist...
```
Two problems:
- Latency: GPT-4o takes 400–1,200ms for even a tiny JSON response. The user stares at a spinner while we decide who should answer them.
- Cost: Every message pays for two LLM calls — the router and the specialist. At scale, routing adds ~35% to our per-message AI cost for a task that returns 12 tokens.
The routing call is fundamentally over-engineered for what it needs to do. It picks one of four labels and returns ~12 tokens of JSON. It doesn't need frontier reasoning ability.
What We Actually Did
We replaced GPT-4o routing with Groq running llama-3.3-70b-versatile. Same prompt, same JSON output format. Median routing latency dropped from ~850ms to ~55ms.
```typescript
// lib/openai/client.ts
import Groq from "groq-sdk";

export const groq = new Groq({
  apiKey: process.env.GROQ_API_KEY,
});
```

```typescript
// agents/gmat/orchestrator.ts — routing call
async function routeToAgent(
  userMessage: string,
  conversationContext: string
): Promise<AgentType> {
  const response = await groq.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages: [
      {
        role: "system",
        content: `Route the user message to one specialist agent.
Return ONLY valid JSON: {"agent": "quant" | "verbal" | "data_insights" | "strategy"}

Routing rules:
- quant: arithmetic, algebra, geometry, word problems, number properties
- verbal: reading comprehension, critical reasoning, sentence correction
- data_insights: table analysis, multi-source reasoning, two-part analysis
- strategy: timing, test-taking approach, score targets, study plan questions

Context (last 2 messages):
${conversationContext}`,
      },
      { role: "user", content: userMessage },
    ],
    response_format: { type: "json_object" },
    temperature: 0, // key: deterministic routing
    max_tokens: 20, // key: we only need 12 tokens, don't let it ramble
  });

  const result = JSON.parse(response.choices[0].message.content!);

  // validate — if Groq returns something unexpected, fall back to quant
  const valid = ["quant", "verbal", "data_insights", "strategy"] as const;
  return valid.includes(result.agent) ? result.agent : "quant";
}
```
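One gap worth closing: JSON.parse can still throw if the model emits malformed output, even with response_format set, which would crash the whole request over a routing detail. A minimal hardening sketch — the parseRoute helper and the inlined AgentType alias are illustrative, not lifted from our codebase:

```typescript
type AgentType = "quant" | "verbal" | "data_insights" | "strategy";

const VALID_AGENTS: readonly AgentType[] = [
  "quant",
  "verbal",
  "data_insights",
  "strategy",
];

// Hypothetical helper: parse the raw routing response, falling back to
// "quant" on missing content, malformed JSON, or an unexpected agent value.
export function parseRoute(raw: string | null | undefined): AgentType {
  if (!raw) return "quant";
  try {
    const parsed = JSON.parse(raw);
    return VALID_AGENTS.includes(parsed?.agent) ? parsed.agent : "quant";
  } catch {
    return "quant"; // malformed JSON: never fail the request over routing
  }
}
```

The same fallback the original code applies to unexpected agent names now also covers unparseable output.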
The specialist agents still use GPT-4o with full streaming. Routing now resolves in ~55ms, a delay small enough that the user never perceives a seam before the specialist's first streamed token arrives.
The full orchestration flow:
```typescript
// agents/gmat/orchestrator.ts — simplified main flow
export async function handleMessage(
  userMessage: string,
  userId: string,
  stream: ReadableStreamDefaultController
) {
  // 1. Build routing context from last 2 messages (~5ms, local)
  const context = await getRecentContext(userId);

  // 2. Route via Groq — fast, cheap, deterministic (~55ms)
  const agentType = await routeToAgent(userMessage, context);

  // 3. Load specialist config and RAG context in parallel
  const [agentConfig, ragContext] = await Promise.all([
    getAgentConfig(agentType),
    fetchRAGContext(userMessage, agentType), // hits the right Pinecone namespace
  ]);

  // 4. Stream response from GPT-4o specialist
  await streamSpecialistResponse(
    userMessage,
    agentConfig,
    ragContext,
    userId,
    stream
  );
}
```
Steps 3 and 4 overlap with the routing call's processing time in practice — by the time routing returns, the DB read for agent config has already started. Real first-token latency from user submit to first visible character: ~900ms.
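One way to make that overlap deliberate rather than incidental is to start any route-independent work (for example, computing the query embedding) at the same time as the routing call, and await both together. A sketch with stubbed async steps; the stub names and timings are illustrative, not our real functions:

```typescript
// Stubs standing in for the real calls; delays are illustrative only.
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function routeStub(): Promise<string> {
  await sleep(55); // stands in for the ~55ms Groq routing call
  return "quant";
}

async function embedStub(msg: string): Promise<number[]> {
  await sleep(40); // embedding the query doesn't depend on the chosen agent
  return [msg.length]; // fake embedding
}

export async function routeWithOverlap(msg: string) {
  // Kick off both immediately: total wall time ≈ max(55, 40), not 55 + 40.
  const [agent, embedding] = await Promise.all([routeStub(), embedStub(msg)]);
  return { agent, embedding };
}
```

The route-dependent parts (which Pinecone namespace to query, which system prompt to load) still have to wait for the route, but anything keyed only on the user message can run in the routing call's shadow.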
What We Learned
Routing is a classification task, not a reasoning task. It needs speed and determinism, not nuance. A 70B model at Groq's inference speed is overkill in the right direction: more than accurate enough for a four-way choice, without paying frontier-model latency.
- `temperature: 0` on routing is non-negotiable. We tested with temperature 0.2 and got routing drift on ambiguous messages over time. Determinism matters when the wrong call sends a student to the wrong specialist.
- `max_tokens: 20` is a real safeguard. Without it, llama occasionally adds a sentence after the JSON. With it, the response is always parseable. Never let a routing call return free text.
- Groq's error rate on routing edge cases was 3%, vs 8% for GPT-4o-mini. We expected GPT-4o-mini to win on accuracy since it's trained by OpenAI to follow instructions precisely. The llama model on Groq was actually better at following the strict JSON-only constraint.
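Error rates like those are cheap to measure once you have a labeled set of edge-case messages. A minimal eval harness sketch, assuming you maintain such a set and can inject any router function; the harness and the example labels are hypothetical, not our production eval:

```typescript
type Agent = "quant" | "verbal" | "data_insights" | "strategy";

interface LabeledMessage {
  text: string;
  expected: Agent;
}

// Hypothetical harness: run any router over a labeled set and report the
// fraction of messages it sends to the wrong specialist.
export async function routingErrorRate(
  route: (msg: string) => Promise<Agent>,
  cases: LabeledMessage[]
): Promise<number> {
  let wrong = 0;
  for (const c of cases) {
    if ((await route(c.text)) !== c.expected) wrong++;
  }
  return wrong / cases.length;
}
```

Running the same labeled set against both candidate routers is how a 3% vs 8% comparison becomes a number you can trust rather than an impression.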
The routing/reasoning split is a pattern, not a hack. We now apply it anywhere we need a fast structural decision before an expensive generative response. Categorization, intent detection, form field extraction — all good candidates for a fast model.
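Applying the pattern elsewhere mostly means regenerating the same JSON-only prompt shape with different labels and rules. A hedged sketch of what that generalization could look like — `buildClassifierPrompt` is a hypothetical helper, not something from our codebase:

```typescript
// Hypothetical generalization of the routing prompt: given a field name and
// one-line rules per label, produce the same JSON-only system prompt shape,
// reusable for intent detection, categorization, and similar fast decisions.
export function buildClassifierPrompt(
  field: string,
  rules: Record<string, string>
): string {
  const labels = Object.keys(rules);
  const union = labels.map((l) => `"${l}"`).join(" | ");
  const ruleLines = labels.map((l) => `- ${l}: ${rules[l]}`).join("\n");
  return [
    `Classify the user message into exactly one category.`,
    `Return ONLY valid JSON: {"${field}": ${union}}`,
    ``,
    `Rules:`,
    ruleLines,
  ].join("\n");
}
```

The fast model, `temperature: 0`, tight `max_tokens`, and a validated fallback all carry over unchanged; only the labels differ per use case.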
What's Next
- [ ] Confidence scoring on routes — right now it's hard-coded 4 categories with a fallback. A better version would return a confidence score and escalate ambiguous messages to a clarifying question instead of guessing.
- [ ] Context-aware routing — we pass 2 messages of context. A multi-turn conversation about one topic should weight recent topic over current message. Not implemented yet.
- [ ] Routing analytics — we log which agent handles each message but don't track routing corrections (when a user re-asks in a way that implies they got the wrong specialist). That signal would improve routing prompt quality over time.
Over to You
- How do you handle routing in multi-agent systems? Do you use a separate model or rely on the primary LLM to route via function calling?
- Has anyone benchmarked other fast inference providers (Cerebras, Together, Fireworks) against Groq for this kind of structural routing task?
- When routing confidence is low, do you ask the user to clarify or just make a best guess and let them redirect if wrong?
