Taxonomy Surgery, Cosine = 1.0000, and Making Routing Disappear into Infrastructure

#ai #llm #machinelearning #python

This is part 3 of the Adaptive Model Routing series. Part 1 built an LLM categorizer with Groq — 8 categories, 3 tiers. Part 2 added k-NN embedding lookup in shadow mode, discovered 83% tier accuracy, and found 61% cost savings on paper. This post covers what happened next.

When Phase 2 ended, I had a working embedding pool in shadow mode inside crab-bot. The category accuracy was sitting at 78.6%. Not bad — but the breakdown hid something worth looking at.

Phase 3: When Validation Tells You a Category Doesn't Need to Exist

The leave-one-out accuracy by category told the real story:

Category	Accuracy	Tier
casual	94%	cheap
simple_lookup	91%	cheap
creative	88%	medium
coding	92%	strong
reasoning	89%	strong
analysis	59%	medium
research_lookup	61%	medium

Two categories were basically a coin flip. And they were confusing each other — almost all of analysis's misses landed on research_lookup and vice versa.

The obvious move would be to try fixing the categorizer prompt, tuning the LLM, or gathering more labeled data. I was about to go down that road when I noticed the column next to the accuracy: both categories mapped to the same tier. Medium.

That changed everything. The question stopped being "why can't the model tell these apart?" and became: "what routing decision are we actually getting wrong?"

The answer was zero. A misclassification between analysis and research_lookup produces no routing error. The routing outcome is identical either way.

The confusion wasn't a model failure — it was a signal from the embedding space that the boundary between these two categories was artificial. If k-NN can't draw a line between them in 384 dimensions with 1,300 examples, maybe the line doesn't belong there.

Decision: merge research_lookup into analysis.

-- Re-label 243 rows where category was 'research_lookup'
UPDATE routing_log
SET category = 'analysis'
WHERE category = 'research_lookup';

The embeddings didn't change. The vectors were already correct — only the label stored alongside them was wrong. I bumped tier_mapping_version from v1 to v2 in the config so any future audit query can filter by mapping era.

Result: overall category accuracy jumped from 78.6% to 82.0% (+3.4%). Medium-tier accuracy specifically went from 79.9% to 82.1%. Seven categories became six. Zero downtime — just a bot restart.

The principle I walked away with: the taxonomy should match the model's geometry, not the other way around. When your validation metric tells you two categories are indistinguishable AND they share the same destination, the boundary is wrong. Delete it.

Phase 4: Moving the Router into Infrastructure

At this point the routing logic lived inside crab-bot — a specific application. That meant any other client that wanted smart model selection would have to build their own categorizer, maintain their own embedding pool, and manage their own session cache. That's a lot of work to replicate.

thrift-flow is an OpenAI-compatible LLM proxy that already sits in front of all my model calls. It was the natural home for routing.

I added EmbeddingRouter and ModelRouter into thrift-flow's proxy/router.py — same intfloat/multilingual-e5-small model, same query: / passage: prefix convention the e5 family requires. Before I touched the pool migration, though, I needed to answer one question: are the embeddings from crab-bot's instance of the model compatible with the ones thrift-flow will produce?

The five-minute check:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("intfloat/multilingual-e5-small")

# Embed with passage prefix — same as what crab-bot stored
live_emb = model.encode(
    ["passage: debug this Python TypeError"],
    normalize_embeddings=True
)[0].astype(np.float32)

# Load the same prompt's embedding from crab-bot's routing.db
stored_emb = load_from_db(...)  # float32 bytes -> numpy

cosine = np.dot(stored_emb, live_emb)
print(f"cosine: {cosine:.4f}")
# cosine: 1.0000

Cosine similarity of 1.0000. Same model weights, same prefix convention — identical vector space. The pool was fully portable.

I migrated the 1,311 entries from crab-bot's routing.db. After deduplication (same prompt hash appearing multiple times), thrift-flow landed at 876 unique pool entries, well above the 20-entry minimum to enable k-NN lookups. Switched it to shadow mode and deployed.

The server-side wiring is straightforward — when a request comes in with model="auto" and routing is enabled, the ModelRouter intercepts:

if model_requested == "auto" and _model_router is not None:
    _last_user_msg = next(
        (m.get("content") for m in reversed(messages)
         if m.get("role") == "user"),
        None,
    )
    _, model_resolved = await _model_router.route(
        _last_user_msg,
        messages,
        session_key=session_key,
    )
else:
    model_resolved = config.resolve_model(model_requested)

Any client connecting to thrift-flow can now get adaptive routing by setting model="auto". The client doesn't need to know anything about tiers, embeddings, or categorizers.

Phase 5: crab-bot Becomes a Pure Chat Bot

With thrift-flow handling routing, crab-bot's own ModelRouter was now dead weight. Worse, running two routing layers in parallel would mean double the Groq API calls for categorization and potentially conflicting decisions.

The migration was three config changes:

# Before
OPENAI_API_BASE = "https://api.openai.com/v1"
AI_MODEL = "gpt-5.5"

# After
OPENAI_API_BASE = "http://localhost:8888/v1"
AI_MODEL = "auto"

And in crab-bot's routing config:

llm_categorizer_enabled: false
embedding_lookup_enabled: false

That's it. crab-bot stopped being "a chat bot that also does model routing" and became "a chat bot." All the routing logic — categorization, embedding lookup, session caching, logging — now runs in thrift-flow and is invisible to the application layer.

thrift-flow is deployed at port 8888 with model aliases configured:

models:
  aliases:
    cheap:  "openai/gpt-5.4-mini"
    medium: "openai/gpt-5.4"
    strong: "openai/gpt-5.5"

When crab-bot sends a request with model="auto", thrift-flow categorizes it, picks the tier, logs the decision, and forwards to the actual model. The bot's code never touches a tier name again.

What This Series Actually Taught Me

Validation metrics can tell you when a category doesn't need to exist. I spent time worrying about 59% accuracy on analysis. The right thing to worry about was whether that confusion translated into bad routing decisions. It didn't. The taxonomy was wrong, not the model.

Embeddings are portable if you control the model and prefix. The cosine check took five minutes and completely de-risked moving 1,300 training examples across systems. If you're using a model from the same checkpoint with the same input format, you'll get the same vector space. Trust the math.

Re-labeling production data safely is mostly a schema problem. Having tier_mapping_version in the routing log meant I could run the UPDATE with confidence — any future query can filter to only rows under the current mapping. The re-label was a single SQL statement, not a data pipeline.

Routing belongs in infrastructure, not in the application. Before Phase 5, adding smart routing to a new client meant copying a bunch of code. After Phase 5, it means setting model="auto" and pointing at the right base URL. The application layer should be ignorant of routing mechanics.

The pool is now at 876 entries and growing. Next up: flipping thrift-flow's embedding router from shadow to live mode and measuring whether k-NN agreement with the LLM categorizer justifies removing the Groq call entirely for high-confidence pool hits — that's where the real latency savings show up.