
Samarth Bhamare

Posted on • Originally published at clskills.hashnode.dev

I picked a 5ms keyword router over an LLM meta-router for my AI app. Here's the math.

short version: i was building a desktop AI sales coach where the user types a question and the system picks one of 10 "founder voices" to answer in. i prototyped two routers — a deterministic keyword one and a meta-LLM one. the deterministic one was 600x faster, free, and 85% accurate. i shipped the deterministic one. here's why and what the code looks like.

if you're building an AI app where you have to pick between multiple specialized prompts/personas/agents per request, this might save you a few weeks.

the setup

i shipped a product called the Sales Agent Pack last night (clskills.in/sales-agent-saas). it's a desktop electron app + claude code skill that has 10 "council voices" — each one built from the public writings of a SaaS founder (Collison, Benioff, Lütke, Chesky, Huang, Altman, Amodei, Levie, Butterfield, Lemkin).

the user types a sales question. the system picks ONE voice to answer in. that "pick" is the routing decision.

example questions:

  • "should i raise prices to $79?" → should route to Lemkin (saastr operator, pricing experiments)
  • "we're losing to hubspot, what's the angle?" → should route to Levie (challenger positioning)
  • "the deck feels generic" → should route to Chesky (identity-driven sales, design)

the "voices" aren't roleplay — each voice is a 3000-word markdown file built from the founder's actual public writing, loaded into the system prompt at chat time. so the routing decision matters: pick the wrong voice and the answer is technically correct but the style and frame is wrong.

option A: meta-LLM router

the obvious approach. before answering, ask claude (or any LLM) "which of these 10 voices should answer this question?"

async function pickVoiceLLM(message) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5',
    max_tokens: 50,
    messages: [{
      role: 'user',
      content: `Question: "${message}"\n\nWhich voice should answer? Reply with just one word from: Collison, Benioff, Lutke, Chesky, Huang, Altman, Amodei, Levie, Butterfield, Lemkin.`
    }],
  });
  return response.content[0].text.trim();
}

measured cost (over 100 sample questions):

  • latency p50: 1,800 ms
  • latency p95: 3,100 ms
  • cost per call: ~$0.005 (50 tokens out, ~120 tokens in)
  • accuracy (vs. my hand-labeled "correct" answer): 89%

option B: deterministic keyword router

less sexy. just a function with a bunch of ifs and a buyer-archetype heuristic.

function pickVoice(message, conversationType) {
  const m = message.toLowerCase();

  // Hard overrides — explicit conversation types win
  if (conversationType === 'post_mortem') {
    return { primary: 'Lemkin', voiceFile: 'council/10-lemkin.md' };
  }
  if (conversationType === 'competitive_positioning') {
    return { primary: 'Levie', voiceFile: 'council/08-levie.md' };
  }

  // Pricing keywords → Lemkin (the saastr operator)
  if (/price|pricing|raise.*price|tier|discount|annual.*contract/.test(m)) {
    return { primary: 'Lemkin', voiceFile: 'council/10-lemkin.md' };
  }

  // Developer-buyer keywords → Collison (Stripe playbook)
  if (/api|developer|sdk|docs|integration|technical.*buyer/.test(m)) {
    return { primary: 'Collison', voiceFile: 'council/01-collison.md' };
  }

  // Enterprise / trust → Benioff
  if (/enterprise|procurement|security review|legal|compliance|fortune/.test(m)) {
    return { primary: 'Benioff', voiceFile: 'council/02-benioff.md' };
  }

  // Design / identity / story → Chesky
  if (/deck|story|narrative|design|identity|brand|generic/.test(m)) {
    return { primary: 'Chesky', voiceFile: 'council/04-chesky.md' };
  }

  // Underdog / fairness / anti-incumbent → Lütke
  if (/underdog|anti|hubspot|salesforce.*alternative|david.*goliath/.test(m)) {
    return { primary: 'Lutke', voiceFile: 'council/03-lutke.md' };
  }

  // Default — Lemkin handles "general sales question"
  return { primary: 'Lemkin', voiceFile: 'council/10-lemkin.md' };
}

measured (over the same 100 questions):

  • latency p50: 3 ms
  • latency p95: 5 ms
  • cost per call: $0
  • accuracy (vs. my hand-labeled "correct"): 85%

the math

over a year of moderate use (let's say 50 questions per buyer per month, 1000 buyers = 600,000 questions/year):

|  | Meta-LLM router | Deterministic router |
| --- | --- | --- |
| Total latency added | 1,080,000 sec (~12.5 days of waiting) | 1,800 sec (~30 minutes) |
| Total cost added | $3,000 | $0 |
| Accuracy | 89% | 85% |
| Failure mode | API outage = no routing | Code bug = obvious + fixable |
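the totals come straight from the per-call measurements, nothing fancy:

```javascript
// 50 questions/buyer/month × 1000 buyers × 12 months
const calls = 50 * 1000 * 12;            // 600,000 questions/year

const llmSecAdded = calls * 1800 / 1000; // 1,800 ms p50 per call → 1,080,000 s (~12.5 days)
const llmDollars  = calls * 5 / 1000;    // ~$0.005 per call → $3,000
const detSecAdded = calls * 3 / 1000;    // 3 ms p50 per call → 1,800 s (~30 minutes)
```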

the 4% accuracy delta costs $3,000 and 12.5 buyer-days of waiting. that's not worth it. especially because the 15% miss rate on the deterministic version isn't catastrophic — it picks the wrong council voice, but the answer is still useful, just framed by Lemkin when it should have been framed by Chesky.

plus there's a manual escape hatch. if the user wants a specific voice, they say "answer like Chesky would" in their question. the keyword chesky triggers an explicit override. zero ML required. infinite override-ability.
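a sketch of that override check. it runs before the keyword rules so a named voice always wins. the six voices shown use the `council/NN-name.md` paths from the router above; the rest follow the same pattern, and this is illustrative rather than the shipped code:

```javascript
// Partial voice list for brevity; paths follow the council/NN-name.md pattern.
const VOICE_FILES = {
  Collison: 'council/01-collison.md',
  Benioff:  'council/02-benioff.md',
  Lutke:    'council/03-lutke.md',
  Chesky:   'council/04-chesky.md',
  Levie:    'council/08-levie.md',
  Lemkin:   'council/10-lemkin.md',
};

// If the user names a voice ("answer like chesky would"), that voice wins.
function explicitOverride(message) {
  const m = message.toLowerCase();
  for (const name of Object.keys(VOICE_FILES)) {
    if (m.includes(name.toLowerCase())) {
      return { primary: name, voiceFile: VOICE_FILES[name] };
    }
  }
  return null; // no voice named, fall through to the keyword rules
}
```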

when meta-LLM routing IS worth it

i'm not saying "always use deterministic." here's when i'd flip the decision:

  1. when the routing space is large and fluid. if i had 100 voices instead of 10, hand-coding keyword rules becomes unmaintainable. an LLM router's cost scales with call volume; the keyword router's cost scales with my maintenance time.

  2. when the cost of wrong is high. if mis-routing meant "user gets a totally irrelevant answer" instead of "user gets a reasonable answer in a slightly off voice," the 4% accuracy delta is worth $3,000.

  3. when you have reliable structured outputs. with JSON mode + a constrained enum, an LLM router becomes much more reliable than free-form generation.

  4. when latency budget is generous. for an async batch system, +2 seconds doesn't matter. for an interactive chat, it's perceptible and annoying.
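for point 3, the shape i'd reach for with the Anthropic SDK is forced tool use with an enum, so the model literally cannot emit anything outside the ten names. a sketch, not tested against the shipped app; the tool name and schema are my choices:

```javascript
// Assumes an initialized Anthropic client in scope, as in option A.
const VOICE_ENUM = ['Collison', 'Benioff', 'Lutke', 'Chesky', 'Huang',
                    'Altman', 'Amodei', 'Levie', 'Butterfield', 'Lemkin'];

async function pickVoiceStructured(message) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5',
    max_tokens: 50,
    tools: [{
      name: 'route',
      description: 'Pick the council voice best suited to answer.',
      input_schema: {
        type: 'object',
        properties: { voice: { type: 'string', enum: VOICE_ENUM } },
        required: ['voice'],
      },
    }],
    // Force the model to call the routing tool instead of free-form text.
    tool_choice: { type: 'tool', name: 'route' },
    messages: [{ role: 'user', content: `Question: "${message}"` }],
  });
  // With tool_choice forced, the first content block is the tool call.
  return response.content[0].input.voice;
}
```

this kills the "model replies with a sentence instead of one word" failure mode, but you still pay the latency and the per-call cost.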

the v0.3.0 plan

i'm not hard-committed to deterministic forever. the actual plan is:

  1. v0.1.0 — deterministic router (shipped)
  2. v0.1.x → v0.2.x — collect routing data. for every chat, log (question, deterministic_pick, user_override_if_any). let it run for ~3 months.
  3. v0.3.0 — train a tiny classifier on the logged data. probably 100 lines of scikit-learn. inference cost: also ~5ms. accuracy estimate: ~92%.
  4. only switch to a meta-LLM router if the classifier plateaus below ~90% AND the 8% miss rate is causing real user complaints.

the "premature optimization is the root of all evil" version of this is: don't reach for an LLM call when an if statement does the job. especially when you're paying for the LLM call out of pocket and the if statement runs in single-digit milliseconds.

try it

if you want to see the deterministic router in action — the product is at clskills.in/sales-agent-saas. it's a desktop AI sales coach for SaaS founders, $299 pre-order, ships as both an Electron app and a Claude Code skill. 7-day refund.

i wrote a longer technical post about the rest of the architecture (why no BrowserWindow, the auto-update endpoint, the ELECTRON_RUN_AS_NODE trap that almost killed me) on my hashnode at clskills.hashnode.dev — go read that if this one was useful.

questions / objections / "you should have done X" — drop them in the comments. i read everything.

— samarth

Top comments (3)

Kuro

The framing as A-or-B is the part I'd push on. In my own perception/routing pipeline I run keyword as the fast path, then escalate to LLM only when the keyword router's confidence is low (no rule fires, or two rules tie). That keeps p50 at your 5ms for the easy 80% and pays the LLM tax only on the genuinely ambiguous 20%.

The other thing I'd think hard about: 85% vs 89% sounds like a 4-point gap, but if routing is load-bearing for the frame (not just the answer), that gap compounds across a session — three wrong-voice answers in a row and the user starts mistrusting the whole council. Worth measuring not just per-call accuracy but session-level "did we ever route wrong on a high-stakes question."

Cool product though — the founder-voice-as-system-prompt idea is the actually interesting bit, not the router.

mote

The cost math here is undeniable — and I think you've identified a broader principle that the AI community tends to underweight: latency compounds through agent loops in a way that single-call systems don't.

In an interactive chat, 1.8s P50 is annoying but survivable. But if you're building a pipeline where the router is called on every user turn, and that turn kicks off 3-5 more LLM calls downstream, you've now turned a 1.8s overhead into a 10-15% latency tax on every single conversation. At 1000 DAUs that's real money and real UX degradation.

Your 3-step evolution plan is smart — especially the v0.3.0 step of training a small classifier on your own routing logs. This is a genuinely underused pattern: use the LLM to generate labeled training data for a cheaper model, then replace the LLM. The scikit-learn classifier you're targeting should hit sub-millisecond inference once trained, which blows even the keyword router out of the water on latency.

One question: how are you handling the edge cases where user query mixes multiple intent categories? E.g. "We need to re-price our enterprise tier — the procurement team is pushing back on the security review." Your router would likely pick Benioff (enterprise) but Lemkin (pricing) might be equally valid. Do you just pick one and move on, or is there a blend/multi-voice mode?

Rhumb

This makes sense to me because your miss cost is "slightly wrong framing," not a bad external action. In that regime the deterministic router is doing exactly what it should.

The extra pattern I’ve found useful is to separate workflow fit from authority fit.

For small stable routing sets:

  1. hard rules for obvious intent or user override
  2. cheap classifier or keyword layer for the common path
  3. only escalate to an LLM when the remaining candidates are genuinely ambiguous

And if the router is ever choosing tools or actions instead of just voices, add one more gate before semantic ranking: filter by trust or authority class first. Otherwise the model is not just picking the best match, it is implicitly picking blast radius too.

The v0.3 plan feels right. User overrides are basically giving you the labeled set for the next router for free.