Wavebro

Posted on Jun 3

Phase 2 Shipped: 5 Things I Got Wrong About Embedding-Based Routing

#ai #llm #machinelearning #python

A follow-up to "Teaching an AI to Pick Its Own Brain"

In the last post, I ended with a plan: replace the Groq LLM categorizer with local multilingual-e5-large embeddings. Find similar past messages, vote on the category, skip the API call. Simple.

It took a Groq outage to actually make me ship it.

On 2026-05-22, Groq went down for two hours. 503 requests fell back to medium tier silently — no errors surfaced to users, but nobody got the model they should have. That's the kind of "resilience" that feels fine until it isn't.

So I shipped Phase 2. Here's what I got wrong.

Wrong #1: I thought the accuracy metric was about correctness

I measured "tier accuracy" using leave-one-out cross-validation on the embedding pool. The number came back: 83.2%. Decent. But I kept asking myself: 83.2% accuracy against what ground truth?

The answer: against Groq's own past decisions.

The pool is labeled by Groq. The k-NN learns Groq's category boundaries from those labels. When I measure accuracy, I'm measuring "how often does k-NN agree with Groq?" — not "how often is the routing objectively correct."

This is actually the right thing to measure. The goal of Phase 2 is to replace Groq with something local and fast — the quality bar is "indistinguishable from Groq," not "better than Groq." But I spent a week confused about why 83% felt both good and meaningless at the same time, before I understood what I was actually measuring.
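To make "agreement with Groq" concrete, here is a minimal sketch of leave-one-out validation over a labeled pool. It is not the production code: the function names, the toy 2-D vectors, and `k=3` are all illustrative assumptions, and real multilingual-e5-large embeddings have far more dimensions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_label(query, pool, k=5):
    """Most common label among the k pool entries most similar to the query."""
    nearest = sorted(pool, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_agreement(pool, k=5):
    """Fraction of entries whose held-out prediction matches the stored label."""
    hits = 0
    for i, (vec, label) in enumerate(pool):
        rest = pool[:i] + pool[i + 1:]  # hold out entry i
        hits += knn_label(vec, rest, k) == label
    return hits / len(pool)

# Toy pool: (embedding, label assigned by the labeling model).
toy_pool = [
    ([1.0, 0.0], "casual"), ([0.9, 0.1], "casual"), ([0.8, 0.2], "casual"),
    ([0.0, 1.0], "coding"), ([0.1, 0.9], "coding"), ([0.2, 0.8], "coding"),
]
print(leave_one_out_agreement(toy_pool, k=3))  # 1.0
```

Because the stored labels come from the labeling model, the returned fraction is an agreement rate with that model, not a measure of objective correctness.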

Wrong #2: I thought analysis vs research_lookup confusion was a problem

Category accuracy for analysis came back at 59%, a terrible-looking number. The embeddings kept predicting research_lookup for analysis prompts and vice versa.

I spent two days trying to fix this. Generated more synthetic data, tweaked the pool, re-ran validation. The number barely moved.

Then I looked at the tier map:

CATEGORY_TIER_MAP = {
    "analysis":         "medium",
    "research_lookup":  "medium",   # same destination
    # ...
}

Both categories route to medium tier. The embedding can't distinguish them — and it doesn't need to. It's like being unable to tell two roads apart when both lead to the same city.

The confusion that actually costs something is when coding gets sent to medium instead of strong. That happens in 3% of requests. The analysis/research_lookup confusion? Zero routing impact.

Lesson: measure tier accuracy, not category accuracy. They're different things and only one of them matters for the system's actual job.
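A small sketch of the difference between the two metrics. The full map below is an assumption pieced together from the tiers mentioned in this post (the real map is only partly shown above), and the function names are hypothetical.

```python
# Assumed full map, inferred from the tiers named in this post.
CATEGORY_TIER_MAP = {
    "casual": "cheap",
    "simple_lookup": "cheap",
    "analysis": "medium",
    "research_lookup": "medium",
    "creative": "medium",
    "coding": "strong",
    "reasoning": "strong",
}

def category_accuracy(pairs):
    """pairs: (predicted_category, reference_category)."""
    return sum(pred == ref for pred, ref in pairs) / len(pairs)

def tier_accuracy(pairs, tier_map=CATEGORY_TIER_MAP):
    """Same pairs, but a prediction counts as a hit if it lands on the same tier."""
    return sum(tier_map[pred] == tier_map[ref] for pred, ref in pairs) / len(pairs)

pairs = [
    ("research_lookup", "analysis"),  # category miss, same tier
    ("analysis", "research_lookup"),  # category miss, same tier
    ("analysis", "coding"),           # category miss and tier miss
    ("casual", "casual"),             # hit on both
]
print(category_accuracy(pairs))  # 0.25
print(tier_accuracy(pairs))      # 0.75
```

Only the third pair changes which model answers, so it is the only one the router should be penalized for.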

Wrong #3: I thought synthetic data was good enough

The pool needs labeled examples to do k-NN. My first instinct: generate 60 synthetic prompts per category using templates, fill the pool fast.

I did this. It looked fine until I checked the actual embedding space. Sixty templates with minor variation produce maybe 15 distinct semantic clusters. The rest are near-duplicates — the same phrasing with a different noun. A k-NN pool full of near-duplicates memorizes instead of generalizing.

What actually worked: real user messages. I filtered 342 prompts from actual chat session transcripts — things real users had genuinely asked, in multiple languages, at varying lengths, covering real tasks. That data has diversity that synthetic templates can't fake.

After mixing in LLM-generated prompts (using claude-haiku with explicit variety constraints: different languages, different lengths, different domains) for the thinner categories, the pool hit 1,309 entries and the tier accuracy became meaningful.

Near-duplicate embeddings are the real enemy of pool quality. Not wrong labels.
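One way to spot near-duplicates is a greedy cosine-similarity filter, sketched below. This is an illustration rather than what the pool builder actually does: the 0.95 threshold, the function names, and the toy 3-D vectors are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def drop_near_duplicates(pool, threshold=0.95):
    """Keep an entry only if it is below the threshold against everything kept so far."""
    kept = []
    for vec, text in pool:
        if all(cosine(vec, kept_vec) < threshold for kept_vec, _ in kept):
            kept.append((vec, text))
    return kept

pool = [
    ([1.0, 0.0, 0.0], "write a function to sort a list"),
    ([0.99, 0.05, 0.0], "write a function to sort an array"),  # same template, new noun
    ([0.0, 1.0, 0.0], "why is my recursive call overflowing the stack"),
]
kept = drop_near_duplicates(pool)
print(len(kept))  # 2
```

The comparison is quadratic in pool size, which is tolerable for a pool of roughly a thousand entries but would need an index for much larger ones.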

Wrong #4: I thought 30% "mislabeled" synthetic prompts were noise

When I generated coding prompts and ran them through Groq for labeling, 30% came back as analysis. My first reaction: Groq is wrong, these are clearly coding prompts, I should override the labels.

I didn't. And that was correct.

Look at what those "mislabeled" prompts actually were: "explain the time complexity of this algorithm", "what's the difference between recursion and iteration", "review this approach for a binary search". These sit right on the boundary between explaining something (analysis) and working with code (coding).

Groq consistently calls them analysis. So the embedding pool correctly learns Groq's boundary — which is the boundary the live system actually uses. The labels aren't wrong. My intuition about where the boundary should be was off.

If your label source has a consistent opinion, trust it over your instinct.

Wrong #5: I thought the disagreement would be symmetric

The embedding k-NN disagrees with Groq on tier in about 17% of requests. As a share of all requests, those disagreements split like this:

Upgrade   (k-NN -> stronger model): 10.0%
Downgrade (k-NN -> weaker model):    6.8%

I expected roughly 50/50. Instead, the system naturally leans toward stronger models when it's uncertain. I didn't engineer this. It emerges from the data — the embedding space for casual and simple_lookup prompts is very dense and clean, so cheap-tier predictions are confident. The boundaries around strong tier are fuzzier, so when the k-NN is uncertain there, it tends to pull toward stronger neighbors.

For a routing system, this asymmetry is desirable. Getting a stronger-than-needed model is expensive but silent. Getting a weaker-than-needed model is cheap but potentially visible to the user.
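The upgrade/downgrade split can be computed from paired tier decisions. A minimal sketch, assuming a simple cheap < medium < strong ordering; `TIER_RANK` and `disagreement_split` are hypothetical names and the sample pairs are made up.

```python
TIER_RANK = {"cheap": 0, "medium": 1, "strong": 2}

def disagreement_split(pairs):
    """pairs: (knn_tier, groq_tier). Returns (upgrade_rate, downgrade_rate) over all requests."""
    up = sum(TIER_RANK[knn] > TIER_RANK[groq] for knn, groq in pairs)
    down = sum(TIER_RANK[knn] < TIER_RANK[groq] for knn, groq in pairs)
    return up / len(pairs), down / len(pairs)

pairs = [
    ("strong", "medium"),  # upgrade
    ("medium", "cheap"),   # upgrade
    ("cheap", "medium"),   # downgrade
    ("cheap", "cheap"),    # agreement
    ("strong", "strong"),  # agreement
]
print(disagreement_split(pairs))  # (0.4, 0.2)
```

Tracking the two rates separately matters because they have different costs: upgrades show up on the bill, downgrades show up in answer quality.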

What the Numbers Look Like After 1 Month

Real traffic distribution (messaging bot):
  cheap tier  ████████████████████████  84.9%  (casual conversation)
  strong tier ███                         8.9%  (coding, reasoning)
  medium tier ██                          6.3%  (analysis, creative)

One important caveat before reading into these numbers: crab-bot runs as a messaging bot — the primary use case is casual conversation, quick lookups, and occasional technical questions. The 84.9% cheap-tier traffic is a direct reflection of that usage pattern. If you're routing for a developer tool, a customer support bot, or a research assistant, your distribution will look very different. A coding-heavy workload might flip cheap and strong — and your cost savings curve will shift accordingly.

Here is a rough cost estimate based on this distribution. The formula is straightforward:

routing_cost = sum(tier_pct x cost_per_request_for_tier)
savings      = (always_medium_cost - routing_cost) / always_medium_cost

Using a typical pricing ratio where cheap ~= 1/15 of medium, and strong ~= 3x medium:

routing_cost = (84.9% x 1/15) + (6.3% x 1) + (8.9% x 3)
             = 0.057 + 0.063 + 0.267
             = 0.387  ->  about 39% of always-medium cost

That's roughly 61% cheaper than always using medium — in this specific traffic pattern.
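The same arithmetic as a small function, in case you want to plug in your own numbers. The relative costs are the rough ratios used above, not real prices, and the function names are hypothetical.

```python
# Relative per-request cost with medium = 1 (rough ratios, not real prices).
RELATIVE_COST = {"cheap": 1 / 15, "medium": 1.0, "strong": 3.0}

def routing_cost(share, cost=RELATIVE_COST):
    """Blended cost per request, relative to sending everything to medium."""
    return sum(share[tier] * cost[tier] for tier in share)

def saving_vs_always_medium(share, cost=RELATIVE_COST):
    """Positive means cheaper than always-medium, negative means more expensive."""
    return 1.0 - routing_cost(share, cost) / cost["medium"]

# Observed traffic shares from this post (they sum to 100.1% due to rounding).
chat_bot = {"cheap": 0.849, "medium": 0.063, "strong": 0.089}
print(round(routing_cost(chat_bot), 3))             # 0.387
print(round(saving_vs_always_medium(chat_bot), 3))  # 0.613
```

A negative result means routing costs more than always-medium, which can happen when strong-tier traffic dominates.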

To estimate your own savings, plug in your tier distribution and your models' actual per-token prices:

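Scenario	cheap%	medium%	strong%	Est. saving vs always-medium
Chat bot (ours)	85%	6%	9%	~61%
Developer tool	30%	20%	50%	about -72% (costs more)
Customer support	60%	35%	5%	~46%
Research assistant	20%	60%	20%	about -21% (costs more)

These rows reuse the price ratios above (cheap ~= 1/15 of medium, strong ~= 3x medium). The savings are real, but they're almost entirely driven by how much of your traffic is genuinely cheap-tier. Once strong-tier traffic dominates, routing costs more than always-medium, and what you are buying is answer quality rather than lower spend.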

	Phase 1 (Groq every request)	Phase 2 (k-NN local)
Categorization latency	~380ms	<20ms
External dependency	Groq API	None
Outage impact	503 failures (May 22)	0
Cost vs always-medium	-61%*	-61%*

*Based on this traffic distribution. Your mileage will vary.

What's Next

The analysis/research_lookup finding has a natural conclusion: merge them into a single category. Both go to medium tier, the embedding space can't separate them, and the 7-category taxonomy has an artificial seam that causes confusion without benefit.

Simulating the merge on the current pool: category accuracy goes from 78.6% to 82.1%, and medium-tier routing accuracy from 79.9% to 82.4%. The taxonomy should match the model's geometry, not the other way around.
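One plausible reason the merge moves tier accuracy and not just category accuracy: with separate labels, neighbor votes for analysis and research_lookup can split and let a third category win the plurality. A toy sketch of that effect, where the merged label name and the neighbor list are made up for illustration:

```python
from collections import Counter

# Hypothetical merged label; the real name is not decided in this post.
MERGE = {"analysis": "analysis_research", "research_lookup": "analysis_research"}

def vote(neighbor_labels, merge=None):
    """Plurality vote over neighbor labels, optionally after merging categories."""
    merge = merge or {}
    counts = Counter(merge.get(label, label) for label in neighbor_labels)
    return counts.most_common(1)[0][0]

neighbors = [
    "coding", "coding", "coding",
    "analysis", "analysis",
    "research_lookup", "research_lookup",
]
print(vote(neighbors))         # coding
print(vote(neighbors, MERGE))  # analysis_research
```

Four of the seven neighbors point at medium tier, but the split vote sends the request to strong until the two labels are counted together.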

That's Phase 3. I'll write it up when it ships.

Happy to share implementation details in the comments if any of this is useful for what you're building.
