Abhijeet Verma

How I Built an Intent Classifier to Route Messages Across Multiple LLMs

Most AI chat apps make a quiet assumption that costs them a lot: one model is good enough for everything. It isn't.

When I started building Chymera, I wanted to fix that. The idea was simple — instead of locking the user into a single LLM, the system should figure out what kind of question is being asked and send it to the model best suited to answer it.

This is the story of how I built that routing layer, what I got wrong the first time, and what the working version actually looks like.


The Problem With Single-Model Architectures

Every major AI chat product — ChatGPT, Claude, Gemini — lets you switch models manually. But users don't think in terms of models. They just ask questions. The mental overhead of "hmm, should I use GPT-4o or o1 for this?" is friction that shouldn't exist.

Beyond UX, there's a real capability argument. Llama 3.3 70B via Groq is exceptional at code generation, while Qwen QwQ 32B has unusually strong multi-step reasoning. Gemini 2.5 Flash is fast and has native tool use that pairs well with live web search.

No single model wins every category. So why force users to pick?


Designing the Classifier

The first instinct was to use an LLM to classify intent. Send the query to a lightweight model, get back a category, then route accordingly. I tried this briefly and dropped it immediately. The latency was unacceptable — you're adding a full round trip before the actual answer even starts.

The solution was a rule-based classifier written in plain JavaScript. No API call, no model inference, effectively zero added latency. It runs synchronously before any request goes out.

The classifier returns one of seven categories: chitchat, coding, reasoning, creative, search, factual, or general.

How It Works

// chitchatPhrases, realtimeSignals, reasoningSignals, codingSignals,
// creativeSignals, and factualSignals are plain keyword arrays defined
// alongside this function.
function classifyQuery(query) {
  const q = query.toLowerCase().trim();
  const words = q.split(/\s+/);
  const len = words.length;

  // Chitchat: short messages matching known conversational phrases
  const cleanQ = q.replace(/[.,!?]/g, '');
  const isChitchat = len <= 5 && chitchatPhrases.some(p =>
    new RegExp(`\\b${p}\\b`, 'i').test(cleanQ) || cleanQ === p
  );
  if (isChitchat) return 'chitchat';

  // Search: time-sensitive signals win before anything else
  if (any(q, realtimeSignals)) return 'search';

  // Recent year + question word = almost always needs live data
  const hasRecentYear = /\b(2024|2025|2026)\b/.test(q);
  const hasQuestionWord = any(q, ['what', 'who', 'when', 'latest', 'current', 'now']);
  if (hasRecentYear && hasQuestionWord && len >= 4) return 'search';

  // Reasoning, coding, creative, factual — keyword matching in priority order
  if (any(q, reasoningSignals)) return 'reasoning';
  if (any(q, codingSignals)) return 'coding';
  if (any(q, creativeSignals)) return 'creative';
  if (len >= 3 && any(q, factualSignals)) return 'factual';

  return 'general';
}

// Simple substring check. Fast, but note it can over-match
// (e.g. 'prove' also matches inside 'improve').
function any(text, keywords) {
  return keywords.some(k => text.includes(k));
}
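The keyword arrays themselves aren't shown in the post, so here is a condensed, runnable version of the same control flow with illustrative signal lists. Every entry in these arrays is my guess for demonstration purposes, not Chymera's actual lists:

```javascript
// Illustrative signal lists -- Chymera's real arrays aren't published,
// so these entries are assumptions for demonstration only.
const chitchatPhrases  = ['hi', 'hello', 'hey', 'thanks', 'how are you'];
const realtimeSignals  = ['today', 'right now', 'latest', 'current', 'price of'];
const reasoningSignals = ['step by step', 'prove', 'why does', 'trade-off'];
const codingSignals    = ['function', 'stack trace', 'refactor', 'regex'];
const creativeSignals  = ['write a story', 'poem', 'slogan', 'brainstorm'];
const factualSignals   = ['what is', 'who is', 'define', 'capital of'];

const any = (text, keywords) => keywords.some(k => text.includes(k));

// Same priority order as the classifier above, condensed for a demo.
function classifyQuery(query) {
  const q = query.toLowerCase().trim();
  const len = q.split(/\s+/).length;
  const cleanQ = q.replace(/[.,!?]/g, '');
  if (len <= 5 && chitchatPhrases.some(p =>
    new RegExp(`\\b${p}\\b`, 'i').test(cleanQ) || cleanQ === p)) return 'chitchat';
  if (any(q, realtimeSignals)) return 'search';
  if (/\b(2024|2025|2026)\b/.test(q) &&
      any(q, ['what', 'who', 'when', 'latest', 'current', 'now']) &&
      len >= 4) return 'search';
  if (any(q, reasoningSignals)) return 'reasoning';
  if (any(q, codingSignals)) return 'coding';
  if (any(q, creativeSignals)) return 'creative';
  if (len >= 3 && any(q, factualSignals)) return 'factual';
  return 'general';
}

console.log(classifyQuery('hey there!'));                       // chitchat
console.log(classifyQuery('what is the current price of ETH')); // search, not factual
```

The second call is the priority-order case from below: it contains a factual pattern (`what is`), but the realtime check fires first.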

The ordering matters more than the keyword lists. Search detection runs before factual detection — because "what is the current price of ETH" needs live data even though it matches factual patterns (what is). Getting this priority order wrong was my first significant bug.


Mapping Categories to Models

Once the classifier returns a category, model selection is a simple switch:

function getModel(queryType) {
  switch (queryType) {
    case 'coding':    return 'llama-3.3-70b-versatile';
    case 'reasoning': return 'qwen-qwq-32b-preview';
    case 'creative':  return 'gemini-2.5-flash';
    case 'search':    return 'llama-3.3-70b-versatile';
    case 'factual':   return 'llama-3.3-70b-versatile';
    case 'chitchat':  return 'llama-3.3-70b-versatile';
    default:          return 'gemini-2.5-flash';
  }
}

Groq handles Llama and Qwen. Gemini 2.5 Flash runs through Google's Generative AI SDK. The two providers need different streaming implementations, so I extracted a shared aiCore.js module that normalises both into a single SSE pipeline. The chat handler doesn't need to know which provider it's talking to.
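The post doesn't show aiCore.js, but one way to normalise two providers into one SSE pipeline is an async-generator adapter per provider. This is a sketch under assumptions: the function names are mine, and the chunk shapes reflect the Groq (OpenAI-style deltas) and Google Generative AI (`chunk.text()`) SDKs as I understand them:

```javascript
// Hypothetical sketch of a provider-agnostic streaming core.
// Each adapter turns a provider-specific stream into plain text chunks,
// so the chat handler can pipe any provider to the response identically.

async function* groqChunks(stream) {
  // Groq's SDK yields OpenAI-style chat completion chunks.
  for await (const chunk of stream) {
    const text = chunk.choices?.[0]?.delta?.content;
    if (text) yield text;
  }
}

async function* geminiChunks(stream) {
  // Google's Generative AI SDK exposes each chunk's text via an accessor.
  for await (const chunk of stream) {
    const text = chunk.text();
    if (text) yield text;
  }
}

// One SSE pipeline for every provider.
async function pipeToSSE(chunks, res) {
  for await (const text of chunks) {
    res.write(`data: ${JSON.stringify({ token: text })}\n\n`);
  }
  res.write('data: [DONE]\n\n');
  res.end();
}
```

With this shape, the handler's only provider-specific line is choosing which adapter wraps the raw stream.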


The Rate Limit Problem

Running on multiple free-tier API keys creates an obvious problem: 429 errors. The fix was a key rotation pool:

function getKeyPool(prefix) {
  const keys = [];
  for (let i = 1; i <= 10; i++) {
    const k = process.env[`${prefix}_${i}`];
    if (k) keys.push(k);
  }
  if (process.env[prefix]) keys.push(process.env[prefix]);
  return [...new Set(keys)].filter(Boolean);
}

// groqKeys comes from the pool builder, e.g.
// const groqKeys = getKeyPool('GROQ_API_KEY');  // prefix name illustrative
let groqIdx = 0;

function nextGroqKey() {
  const key = groqKeys[groqIdx % groqKeys.length];
  groqIdx++;
  return key;
}

Each request picks the next key via round-robin. If a key returns 429, the request retries with the next one. With 10 keys in the pool, the effective rate-limit headroom is roughly ten times that of a single key.
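The retry step isn't shown above, so here is a minimal sketch of how rotation and retry could compose. The wrapper name and the assumption that the provider error exposes an HTTP `status` field are mine, not Chymera's actual code:

```javascript
let idx = 0;

// Round-robin with retry: on a 429, advance to the next key and try again.
// `fn(key)` is whatever makes the actual provider call.
async function withKeyRotation(keys, fn) {
  let lastErr;
  for (let attempt = 0; attempt < keys.length; attempt++) {
    const key = keys[idx % keys.length];
    idx++;
    try {
      return await fn(key);
    } catch (err) {
      if (err?.status !== 429) throw err; // only rate limits trigger rotation
      lastErr = err;
    }
  }
  throw lastErr; // every key in the pool is currently rate-limited
}
```

Capping attempts at the pool size means a fully exhausted pool fails fast instead of spinning.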


What Runs Where

The architecture has three independently deployed tiers:

  • Frontend — React 18 SPA on Netlify. Handles auth state, renders the chat UI, consumes the SSE stream token by token.
  • API — Express on Railway. Runs the classifier, selects the model, manages the key pool, and pipes the LLM stream back via Server-Sent Events.
  • Data layer — Supabase for PostgreSQL and auth. Row-level security means each user can only read their own chat history — one policy handles it without any application-level filtering.

Mem0 sits alongside the API and handles semantic memory. Before each non-trivial request, the system searches for relevant memories from past sessions and injects them into the system prompt. The user never needs to re-explain who they are or what they're working on.
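The injection step can be sketched like this. The `searchMemories` function is hypothetical (in Chymera it would wrap Mem0's search API); the only assumption encoded here is "search, then prepend hits to the system prompt":

```javascript
// Hypothetical sketch of memory injection before a non-trivial request.
// `searchMemories(query, userId)` is a stand-in for a Mem0 search call
// returning an array of memory strings.
async function buildSystemPrompt(basePrompt, userId, query, searchMemories) {
  const memories = await searchMemories(query, userId);
  if (memories.length === 0) return basePrompt; // nothing relevant, prompt unchanged
  const context = memories.map(m => `- ${m}`).join('\n');
  return `${basePrompt}\n\nRelevant things you know about this user:\n${context}`;
}
```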


What I'd Do Differently

1. Handle ambiguity with confidence scores
Queries like "is Next.js better than Remix?" match reasoning signals but are also partly factual. The current system picks one and commits. A confidence score with fallback logic would handle edge cases better.
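One hedged sketch of what that could look like: count signal hits per category instead of returning on the first match, and route thin margins to the reasoning model as a safe default. The function, thresholds, and sample lists below are all illustrative, not a planned implementation:

```javascript
// Hypothetical scored variant of the classifier. `signalSets` maps
// category name -> keyword array; ties between the top two categories
// fall back to 'reasoning' rather than committing to either.
function classifyWithScores(q, signalSets) {
  const scores = {};
  for (const [category, keywords] of Object.entries(signalSets)) {
    scores[category] = keywords.filter(k => q.includes(k)).length;
  }
  const ranked = Object.entries(scores).sort((a, b) => b[1] - a[1]);
  const [top, second] = ranked;
  if (top[1] === 0) return { category: 'general', confidence: 0 };
  if (second && top[1] - second[1] < 1) {
    return { category: 'reasoning', confidence: 0.5 }; // ambiguous: up-level
  }
  return { category: top[0], confidence: top[1] / (top[1] + (second?.[1] ?? 0)) };
}

const demoSets = {
  reasoning: ['better than', 'compare'],
  factual: ['what is', 'who is'],
  coding: ['function', 'bug'],
};
console.log(classifyWithScores('is next.js better than remix?', demoSets));
// -> { category: 'reasoning', confidence: 1 }
```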

2. Improve chitchat detection on longer messages
The len <= 5 check prevents false positives, but longer casual messages fall through to general. A separate tone detector would help here.

3. Stateful key rotation
The round-robin index resets on every serverless cold start. A distributed counter in Redis would spread the load more evenly across concurrent requests.


Try It

Chymera is live here:

Ask it something technical, then ask it something that needs a live answer — you'll see the model badge in the UI change between responses. The classifier runs in under a millisecond and the routing is completely invisible to the user.

If you've built something similar or approached the routing problem differently, I'd like to hear how.

Top comments (1)

PEACEBINFLOW

This is a brilliant solve for a problem most people try to throw more "compute" at. We’re so used to the "AI for everything" hammer that we forget how powerful a well-ordered synchronous function can be.

The Zero-Latency Handshake
I’m especially struck by your decision to ditch the LLM-based classifier. In a data flow, the first millisecond is the most expensive. By using plain JavaScript for the "intent handshake," you’ve preserved the user's flow state. You’re essentially treating the query like a packet that needs routing, rather than a puzzle that needs solving—which is exactly how a high-performance system should behave.

Priority as Architecture
Your point about ordering mattering more than keyword lists is a deep systemic insight. It’s not just a list of signals; it’s a hierarchy of needs. "Search" winning over "Factual" is a perfect example of a Causal Decision Graph. If the data is stale, the factual accuracy doesn't matter. By putting time-sensitive signals at the top, you’re ensuring the system’s "eyes" are always open to the present moment before it reaches into its "memory" (the model's training data).

The Key Pool as a Resource Stream
The key rotation logic is a clever bit of "infrastructure-as-code" for the free-tier world. It reminds me that we don't always need a bigger budget; sometimes we just need a smarter distribution protocol. You’ve basically built a load balancer for intelligence.

A Reflective Thought
I wonder if the "Ambiguity" problem you mentioned could be solved by looking at Query Density? If a query matches multiple signal groups with high frequency, maybe that's the trigger to "up-level" to a more expensive reasoning model automatically.

It makes me realize that as we move toward "Agentic" systems, the most important code won't be the prompts themselves, but the invisible routing layers that decide which brain gets to do the thinking.