The Problem
We had four specialist AI agents — math, verbal, data insights, and strategy — each with a different system prompt, RAG namespace, and reasoning style. Every user message needed to land on the right one.
The naive solution: run every message through GPT-4o, ask it to decide, then call the specialist. That added 800–1,200ms of latency before the user saw a single token. On a tutoring app where response feel matters, that was a full second of dead air, every message.
We needed routing to be invisible — no perceived delay, no visible seam between agents.
What We Were Building
SamiWISE is a GMAT prep tutor with four specialist agents: quantitative reasoning, verbal, data insights, and strategy. Each agent has its own system prompt tuned to its domain, a dedicated Pinecone namespace, and different behavior — the math agent scaffolds step-by-step, the verbal agent uses Socratic questioning, the strategy agent answers directly.
Routing wrong has real costs: the verbal agent confidently giving arithmetic advice, or the strategy agent running a full Socratic debrief when a student just needs a direct answer. Getting the right agent matters. But routing itself shouldn't cost a second of latency.
The First Approach (And Why It Failed)
We started with a single GPT-4o call as a router:
```typescript
// First attempt — routing via GPT-4o
const routingResponse = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "system",
      content: `You are a routing agent. Given a user message, return ONLY a JSON object:
{"agent": "quant" | "verbal" | "data_insights" | "strategy"}
No explanation. No other text.`,
    },
    { role: "user", content: userMessage },
  ],
  response_format: { type: "json_object" },
});

const { agent } = JSON.parse(routingResponse.choices[0].message.content!);
// then call the specialist...
```
Two problems:
- Latency: GPT-4o takes 400–1,200ms for even a tiny JSON response. The user stares at a spinner while we decide who should answer them.
- Cost: Every message pays for two LLM calls — the router and the specialist. At scale, routing adds ~35% to our per-message AI cost for a task that returns 12 tokens.
The routing call is fundamentally over-engineered for what it needs to do. It picks one of four labels and returns ~12 tokens of JSON. It doesn't need frontier reasoning ability.
What We Actually Did
We replaced GPT-4o routing with Groq running llama-3.3-70b-versatile. Same prompt, same JSON output format. Median routing latency dropped from ~850ms to ~55ms.
```typescript
// lib/openai/client.ts
import Groq from "groq-sdk";

export const groq = new Groq({
  apiKey: process.env.GROQ_API_KEY,
});
```

```typescript
// agents/gmat/orchestrator.ts — routing call
async function routeToAgent(
  userMessage: string,
  conversationContext: string
): Promise<AgentType> {
  const response = await groq.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages: [
      {
        role: "system",
        content: `Route the user message to one specialist agent.
Return ONLY valid JSON: {"agent": "quant" | "verbal" | "data_insights" | "strategy"}

Routing rules:
- quant: arithmetic, algebra, geometry, word problems, number properties
- verbal: reading comprehension, critical reasoning, sentence correction
- data_insights: table analysis, multi-source reasoning, two-part analysis
- strategy: timing, test-taking approach, score targets, study plan questions

Context (last 2 messages):
${conversationContext}`,
      },
      { role: "user", content: userMessage },
    ],
    response_format: { type: "json_object" },
    temperature: 0, // key: deterministic routing
    max_tokens: 20, // key: we only need 12 tokens, don't let it ramble
  });

  const result = JSON.parse(response.choices[0].message.content!);

  // validate — if Groq returns something unexpected, fall back to quant
  const valid = ["quant", "verbal", "data_insights", "strategy"] as const;
  return valid.includes(result.agent) ? result.agent : "quant";
}
```
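One gap worth closing: JSON.parse can still throw if the model emits malformed output, even with response_format set, which would crash the whole request over a routing detail. A minimal hardening sketch — the parseRoute helper and the inlined AgentType alias are illustrative, not lifted from our codebase:

```typescript
type AgentType = "quant" | "verbal" | "data_insights" | "strategy";

const VALID_AGENTS: readonly AgentType[] = [
  "quant",
  "verbal",
  "data_insights",
  "strategy",
];

// Hypothetical helper: parse the raw routing response, falling back to
// "quant" on missing content, malformed JSON, or an unexpected agent value.
export function parseRoute(raw: string | null | undefined): AgentType {
  if (!raw) return "quant";
  try {
    const parsed = JSON.parse(raw);
    return VALID_AGENTS.includes(parsed?.agent) ? parsed.agent : "quant";
  } catch {
    return "quant"; // malformed JSON: never fail the request over routing
  }
}
```

The same fallback the original code applies to unexpected agent names now also covers unparseable output.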
The specialist agents still use GPT-4o with full streaming. Routing now resolves in ~55ms, a delay small enough that the user never perceives a seam before the specialist's first streamed token arrives.
The full orchestration flow:
```typescript
// agents/gmat/orchestrator.ts — simplified main flow
export async function handleMessage(
  userMessage: string,
  userId: string,
  stream: ReadableStreamDefaultController
) {
  // 1. Build routing context from last 2 messages (~5ms, local)
  const context = await getRecentContext(userId);

  // 2. Route via Groq — fast, cheap, deterministic (~55ms)
  const agentType = await routeToAgent(userMessage, context);

  // 3. Load specialist config and RAG context in parallel
  const [agentConfig, ragContext] = await Promise.all([
    getAgentConfig(agentType),
    fetchRAGContext(userMessage, agentType), // hits the right Pinecone namespace
  ]);

  // 4. Stream response from GPT-4o specialist
  await streamSpecialistResponse(
    userMessage,
    agentConfig,
    ragContext,
    userId,
    stream
  );
}
```
Steps 3 and 4 overlap with the routing call's processing time in practice — by the time routing returns, the DB read for agent config has already started. Real first-token latency from user submit to first visible character: ~900ms.
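One way to make that overlap deliberate rather than incidental is to start any route-independent work (for example, computing the query embedding) at the same time as the routing call, and await both together. A sketch with stubbed async steps; the stub names and timings are illustrative, not our real functions:

```typescript
// Stubs standing in for the real calls; delays are illustrative only.
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function routeStub(): Promise<string> {
  await sleep(55); // stands in for the ~55ms Groq routing call
  return "quant";
}

async function embedStub(msg: string): Promise<number[]> {
  await sleep(40); // embedding the query doesn't depend on the chosen agent
  return [msg.length]; // fake embedding
}

export async function routeWithOverlap(msg: string) {
  // Kick off both immediately: total wall time ≈ max(55, 40), not 55 + 40.
  const [agent, embedding] = await Promise.all([routeStub(), embedStub(msg)]);
  return { agent, embedding };
}
```

The route-dependent parts (which Pinecone namespace to query, which system prompt to load) still have to wait for the route, but anything keyed only on the user message can run in the routing call's shadow.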
What We Learned
Routing is a classification task, not a reasoning task. It needs speed and determinism, not nuance. A 70B model at Groq's inference speed is overkill in the right direction: more than accurate enough for a four-way choice, without paying frontier-model latency.
- `temperature: 0` on routing is non-negotiable. We tested with temperature 0.2 and got routing drift on ambiguous messages over time. Determinism matters when the wrong call sends a student to the wrong specialist.
- `max_tokens: 20` is a real safeguard. Without it, llama occasionally adds a sentence after the JSON. With it, the response is always parseable. Never let a routing call return free text.
- Groq's error rate on routing edge cases was 3%, vs 8% for GPT-4o-mini. We expected GPT-4o-mini to win on accuracy since it's trained by OpenAI to follow instructions precisely. The llama model on Groq was actually better at following the strict JSON-only constraint.
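Error rates like those are cheap to measure once you have a labeled set of edge-case messages. A minimal eval harness sketch, assuming you maintain such a set and can inject any router function; the harness and the example labels are hypothetical, not our production eval:

```typescript
type Agent = "quant" | "verbal" | "data_insights" | "strategy";

interface LabeledMessage {
  text: string;
  expected: Agent;
}

// Hypothetical harness: run any router over a labeled set and report the
// fraction of messages it sends to the wrong specialist.
export async function routingErrorRate(
  route: (msg: string) => Promise<Agent>,
  cases: LabeledMessage[]
): Promise<number> {
  let wrong = 0;
  for (const c of cases) {
    if ((await route(c.text)) !== c.expected) wrong++;
  }
  return wrong / cases.length;
}
```

Running the same labeled set against both candidate routers is how a 3% vs 8% comparison becomes a number you can trust rather than an impression.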
The routing/reasoning split is a pattern, not a hack. We now apply it anywhere we need a fast structural decision before an expensive generative response. Categorization, intent detection, form field extraction — all good candidates for a fast model.
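Applying the pattern elsewhere mostly means regenerating the same JSON-only prompt shape with different labels and rules. A hedged sketch of what that generalization could look like — `buildClassifierPrompt` is a hypothetical helper, not something from our codebase:

```typescript
// Hypothetical generalization of the routing prompt: given a field name and
// one-line rules per label, produce the same JSON-only system prompt shape,
// reusable for intent detection, categorization, and similar fast decisions.
export function buildClassifierPrompt(
  field: string,
  rules: Record<string, string>
): string {
  const labels = Object.keys(rules);
  const union = labels.map((l) => `"${l}"`).join(" | ");
  const ruleLines = labels.map((l) => `- ${l}: ${rules[l]}`).join("\n");
  return [
    `Classify the user message into exactly one category.`,
    `Return ONLY valid JSON: {"${field}": ${union}}`,
    ``,
    `Rules:`,
    ruleLines,
  ].join("\n");
}
```

The fast model, `temperature: 0`, tight `max_tokens`, and a validated fallback all carry over unchanged; only the labels differ per use case.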
What's Next
- [ ] Confidence scoring on routes — right now it's hard-coded 4 categories with a fallback. A better version would return a confidence score and escalate ambiguous messages to a clarifying question instead of guessing.
- [ ] Context-aware routing — we pass 2 messages of context. A multi-turn conversation about one topic should weight recent topic over current message. Not implemented yet.
- [ ] Routing analytics — we log which agent handles each message but don't track routing corrections (when a user re-asks in a way that implies they got the wrong specialist). That signal would improve routing prompt quality over time.
Over to You
- How do you handle routing in multi-agent systems? Do you use a separate model or rely on the primary LLM to route via function calling?
- Has anyone benchmarked other fast inference providers (Cerebras, Together, Fireworks) against Groq for this kind of structural routing task?
- When routing confidence is low, do you ask the user to clarify or just make a best guess and let them redirect if wrong?
