For some time, I've been experimenting with the idea that, with an MCP server, we can delegate bounded tasks from Claude Code to cheaper local or cloud models (in my case, models I run on a local server in LM Studio). It makes sense: why chew through long, repetitive regression testing tasks in the flagship model when the work could be directed by Claude but executed by a simpler model that's arguably more efficient for the task?
The other worry I have: what if Anthropic added a few zeros to their subscription price and half of us had to rethink how we use the flagship models? This is my ongoing experiment. There's no "this is how you have to work from now on" pressure of the kind I feel every time I read about a new release; I'm just curious to see if we can get to a point where Claude orchestrates and delegates to whatever local model(s) you have available, for the sake of token efficiency. It might matter one day!
My v1 was simple: one model, one endpoint, and instructions telling Claude to consider handing over specific tasks. Before long I'd rewritten most of it - as it turns out, Claude doesn't want to share much work.
Today, I'm going to give you a tour of my work so far (this is an experimental project; I welcome honest feedback, forks and pull requests): a post-mortem on what broke, what wasn't cutting the mustard, and how that influenced where I've got to - model routing, think-block stripping, SQLite model caching, and per-model prompt tuning. All TypeScript, all open source.
I wrote about using SQLite as a context saver in MCP servers a couple of weeks ago, and the core argument there was: don't cram raw API data into your LLM's context window. It doesn't scale. As the dataset grows, the token cost balloons, the signal-to-noise ratio collapses, and the model starts forgetting because it's working from incomplete, compacted data. "Memory" - just stuffing everything into context, or bolting on a SQLite dependency and hoping the model sorts it out - is not architecture. It's a different flavour of context bloat, delivered through tool descriptions and database entries of what the model did.
This is the same problem showing up in a completely different place. When your MCP server needs to know what twelve different local models are good at - their strengths, weaknesses, best task types, context lengths, quantisation levels - you can either dump all of that into every conversation, or you can cache it locally and query what you need. One approach costs hundreds of tokens per call and gets worse as you add models. The other costs a fraction of that.
The MCP server is houtini-lm. It sits between Claude Code and whatever OpenAI-compatible endpoint you've got running - LM Studio, Ollama, vLLM, cloud APIs, whatever speaks /v1/chat/completions. Claude keeps on top of the reasoning. The cheap(er) model handles the outputs.
The hurdles... some of which I haven't quite overcome.
The routing problem
v1 assumed you had one model loaded. You'd set LM_STUDIO_URL, maybe override LM_STUDIO_MODEL, and every delegation call went to the same place. Fine if you're running Qwen Coder and only delegating code tasks.
Then I loaded GLM-4 alongside Qwen Coder because I wanted a general-purpose model for chat-style delegation - code explanations, content rewrites, commit messages. And immediately hit the problem: houtini-lm had no concept of "this is a code task, use the coder model" versus "this is a chat task, use the general model." Everything went to whatever model ID was in the config.
So I wrote a router. Here's the core of routeToModel:
type TaskType = 'code' | 'chat' | 'analysis' | 'embedding';

async function routeToModel(taskType: TaskType): Promise<RoutingDecision> {
  const models = await listModelsRaw();
  const loaded = models.filter((m) => m.state === 'loaded' || !m.state);
  if (loaded.length === 0) throw new Error('No models loaded');

  let bestModel = loaded[0];
  let bestScore = -1;

  for (const model of loaded) {
    const hints = getPromptHints(model.id, model.arch);
    let score = (hints.bestTaskTypes ?? []).includes(taskType) ? 10 : 0;

    // Bonus: code-specialised models for code tasks
    const profile = getModelProfile(model);
    if (taskType === 'code' && profile?.family.toLowerCase().includes('coder'))
      score += 5;

    // Bonus: larger context for analysis tasks
    if (taskType === 'analysis') {
      const ctx = getContextLength(model);
      if (ctx && ctx > 100000) score += 2;
    }

    if (score > bestScore) {
      bestScore = score;
      bestModel = model;
    }
  }

  return { modelId: bestModel.id, hints: getPromptHints(bestModel.id) };
}
Three things worth noting about this:
1. It queries the LM Studio /v1/models endpoint at routing time. This sounds expensive, but the endpoint returns in under 5ms locally, and it means model hot-swaps in LM Studio are picked up immediately (even if the models themselves can take their sweet time to load...). I tried caching this and it caused more problems than it solved - we don't want stale model lists when you unload something.
2. It can't JIT-load models. The MCP SDK has a hard ~60-second timeout on tool calls. Loading a model in LM Studio takes minutes. So if the best model for a task isn't loaded, the router uses the best available one and returns a suggestion string: "💡 qwen3-coder-next is downloaded and better suited for code tasks - ask the user to load it in LM Studio." Claude surfaces this to the user. Not ideal, but the alternative (silent timeout) is worse.
3. The scoring is deliberately simple. Current version: does the model's bestTaskTypes include this task type? 10 points. Is it a coder model and this is a code task? 5 bonus. Large context and analysis task? 2 bonus. The highest score wins.
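Pulled out of the router, that scoring rule is a pure function you can unit-test in isolation. A sketch (the ScoreInput shape and names here are illustrative, not the actual houtini-lm internals):

```typescript
type TaskType = 'code' | 'chat' | 'analysis' | 'embedding';

// Illustrative shape: the subset of model metadata the scorer needs
interface ScoreInput {
  bestTaskTypes: TaskType[];
  isCoderFamily: boolean;
  contextLength?: number;
}

// The scoring rule from the post: 10 for a task-type match,
// +5 for coder models on code tasks, +2 for >100k context on analysis
function scoreModel(taskType: TaskType, m: ScoreInput): number {
  let score = m.bestTaskTypes.includes(taskType) ? 10 : 0;
  if (taskType === 'code' && m.isCoderFamily) score += 5;
  if (taskType === 'analysis' && (m.contextLength ?? 0) > 100000) score += 2;
  return score;
}
```

Because the rule is pure, you can debug a routing decision by scoring each loaded model by hand - no black box.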
Think-block stripping
GLM-4, Qwen3, Nemotron - these models always emit internal chain-of-thought reasoning wrapped in <think> tags before producing their actual response. When I first loaded GLM-4, every delegation call came back with 400+ tokens of the model arguing with itself before the 50-token answer I actually needed. Watching GLM have a discussion with itself tells you a lot about the model - it doesn't seem very confident and genuinely seems to second-guess itself.
The fix is simple:
// Strip <think>...</think> reasoning blocks
let cleanContent = content.replace(/<think>[\s\S]*?<\/think>\s*/g, ''); // closed blocks
cleanContent = cleanContent.replace(/^<think>\s*/, ''); // orphaned opening tag
cleanContent = cleanContent.trim();
It's two lines of regex. But that second line took me a while to pin down. Sometimes the model runs out of generation tokens mid-think-block: you get <think>The user wants test stubs for... and then the actual output, with no closing </think>. The first regex doesn't match because there's no closing tag, so I was getting leaked reasoning mixed into the response, which caused obvious problems.
My orphaned-tag regex catches that - it's not elegant, but it works, and it was a hard-won breakthrough.
The emitsThinkBlocks flag in the prompt hints system means this only runs for models that produce think blocks. There's no unnecessary processing for LLaMA or other instruct models that don't use this pattern.
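Putting the flag and the two regexes together, the whole step can be sketched as one gated function (stripThinkBlocks is an illustrative name; the real code lives in the response pipeline):

```typescript
// Strip <think> reasoning only for models flagged as emitting it;
// LLaMA-style instruct models pass through untouched.
function stripThinkBlocks(content: string, emitsThinkBlocks: boolean): string {
  if (!emitsThinkBlocks) return content; // no-op for non-reasoning models
  let clean = content.replace(/<think>[\s\S]*?<\/think>\s*/g, ''); // closed blocks
  clean = clean.replace(/^<think>\s*/, ''); // orphaned opening tag
  return clean.trim();
}
```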
SQLite model cache (sql.js WASM, again)
This is where the argument from my previous SQLite post loads back into reader context... The router needs to know what each model is good at. You could stuff model profiles into the system prompt - strengths, weaknesses, best task types for every loaded model. With two models that's maybe 300 tokens. With twelve it's 2,000. And it's the same 2,000 tokens on every single tool call, burning context on metadata the model has already seen. That's the "memory as architecture" trap again: it works at small scale and falls apart the moment your data grows.
So I did what I did with the Search Console data example: cache it locally, query what you need, return only what's relevant to this specific routing decision.
For Qwen Coder or GLM-4, I've got hand-written (well, copy-and-pasted) profiles - curated descriptions, strengths, weaknesses, and which task types suit the model best. But what about when someone loads a random GGUF they downloaded from HuggingFace? We query for that and store it in the db.
The cache works in two tiers:
Tier 1: Static profiles - regex-matched against the model ID or architecture field. I maintain these by hand for model families I've used recently:
const MODEL_PROFILES: { pattern: RegExp; profile: ModelProfile }[] = [
  {
    pattern: /qwen3-coder|qwen3.*coder/i,
    profile: {
      family: 'Qwen3 Coder',
      description: 'Code-specialised model with agentic capabilities.',
      strengths: ['code generation', 'code review', 'debugging', 'test writing'],
      weaknesses: ['non-code creative tasks'],
      bestFor: ['code generation', 'code review', 'test stubs', 'refactoring'],
    },
  },
  // ... 12 more families
];
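A minimal sketch of the tier-1 lookup - first match wins, checking the model ID and then the architecture field (matchStaticProfile is an illustrative name, not the actual houtini-lm API):

```typescript
interface ModelProfile {
  family: string;
  description: string;
  strengths: string[];
  weaknesses: string[];
  bestFor: string[];
}

// First-match lookup over the static profile table; returns undefined
// so the caller can fall through to the SQLite/HuggingFace tier.
function matchStaticProfile(
  profiles: { pattern: RegExp; profile: ModelProfile }[],
  modelId: string,
  arch?: string
): ModelProfile | undefined {
  for (const entry of profiles) {
    if (entry.pattern.test(modelId) || (arch !== undefined && entry.pattern.test(arch))) {
      return entry.profile;
    }
  }
  return undefined;
}
```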
Tier 2: SQLite cache with HuggingFace auto-profiling - if no static profile matches, the server queries HuggingFace's free API, parses the model card, and generates a profile. This gets cached in a SQLite database with a 7-day TTL.
I chose sql.js (pure WASM) instead of better-sqlite3. For Better Search Console I used better-sqlite3 because it was the data layer - hundreds of thousands of rows, complex queries, WAL mode, the lot. For houtini-lm, the cache holds maybe 20 rows, and the priority is zero native dependencies. sql.js compiles to WASM, which means npx -y @houtini/lm works on any machine without needing a C++ toolchain: no node-gyp, no build failures on Windows. I wince at whether this would work without a fix on Mac, because I don't own one and perhaps never will. Still, I've had better-sqlite3 fail on three separate machines because of node-gyp version mismatches - none of that grief is worth it for a 20-100 row resource.
sql.js is slower for heavy workloads. For a 20-row lookup table, though, the speed difference is not noticeable.
// Schema
const SCHEMA = `
  CREATE TABLE IF NOT EXISTS model_profiles (
    model_id TEXT PRIMARY KEY,
    hf_id TEXT,
    pipeline_tag TEXT,
    architectures TEXT,
    license TEXT,
    downloads INTEGER,
    likes INTEGER,
    library_name TEXT,
    family TEXT,
    description TEXT,
    strengths TEXT,
    weaknesses TEXT,
    best_for TEXT,
    fetched_at INTEGER,
    source TEXT DEFAULT 'huggingface'
  )
`;
The fetched_at timestamp drives the 7-day TTL. After a week, the cache re-fetches from HuggingFace in case the model card has been updated. In practice this almost never matters, but I've had at least one case where a model's pipeline tag changed after a major update and the stale cache was routing it incorrectly.
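The staleness check itself is tiny. A sketch, assuming fetched_at is stored as epoch milliseconds (isProfileStale is an illustrative name):

```typescript
const TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7-day TTL

// A stale row triggers a re-fetch from HuggingFace; a fresh row
// is served straight from the SQLite cache.
function isProfileStale(fetchedAt: number, now: number = Date.now()): boolean {
  return now - fetchedAt > TTL_MS;
}
```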
Per-model prompt hints
This is the bit that makes the difference to output quality.
Each model family has its own set of annoying quirks. GLM-4 writes a paragraph of introduction before every response unless you tell it not to - and even if you do, there's a chance it'll ignore you. Llama is incredibly hard to constrain to only the output you want. Qwen Coder is best at temperature 0.1 for code, but that's far too low for chat tasks. Nemotron handles structured output well but needs explicit "no preamble" instructions.
The PromptHints interface:
interface PromptHints {
  codeTemp: number;          // temperature for code generation
  chatTemp: number;          // temperature for chat/analysis
  outputConstraint: string;  // injected into system prompt
  emitsThinkBlocks: boolean; // flag for think-block stripping
  bestTaskTypes: ('code' | 'chat' | 'analysis' | 'embedding')[];
}
And the per-model configuration:
// GLM-4: great general model, but verbose without constraints
{
  pattern: /glm[- ]?4/i,
  hints: {
    codeTemp: 0.1,
    chatTemp: 0.3,
    outputConstraint: 'Respond with ONLY the requested output. No step-by-step reasoning. No numbered analysis. No preamble. Go straight to the answer.',
    emitsThinkBlocks: true,
    bestTaskTypes: ['chat', 'analysis'],
  },
},
// Qwen Coder: focused, needs minimal constraint
{
  pattern: /qwen3.*coder|qwen.*coder/i,
  hints: {
    codeTemp: 0.1,
    chatTemp: 0.3,
    outputConstraint: 'Be direct. Output only what was asked for.',
    emitsThinkBlocks: true,
    bestTaskTypes: ['code'],
  },
},
The outputConstraint string gets injected into the system prompt before every delegation call. Without it, GLM-4 would generate something like:
Let me analyze this code step by step.
Step 1: First, I'll examine the function signature...
Step 2: Next, I'll consider the edge cases...
Step 3: Now I'll write the tests...
Here are the tests:
// actual tests
With the constraint, you get just the tests. That's 200+ tokens of preamble saved on every single call. Over a day of heavy delegation, that adds up.
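A sketch of how the hints could feed a delegation call - constraint first in the system prompt, temperature picked by task type (buildRequest and the trimmed Hints shape are illustrative, not the actual houtini-lm internals):

```typescript
interface Hints {
  codeTemp: number;
  chatTemp: number;
  outputConstraint: string;
}

// Assemble per-call request parameters from the model's hints:
// the output constraint leads the system prompt, and the
// temperature depends on whether this is a code or chat task.
function buildRequest(task: 'code' | 'chat', system: string, hints: Hints) {
  return {
    temperature: task === 'code' ? hints.codeTemp : hints.chatTemp,
    system: hints.outputConstraint
      ? `${hints.outputConstraint}\n\n${system}`
      : system,
  };
}
```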
I tested this properly - ran the same twenty delegation calls with generic settings versus model-specific hints. The tuned version produced usable output on eighteen of them first try, although hallucination seemed to be a massive problem with GLM 4.7. The generic setup managed twelve. Eight out of twenty calls needing a retry doesn't sound terrible, but each retry is another round-trip to the model and another chunk of context in the Claude conversation.
Performance measurement: TTFT and tok/s
Every response now includes timing data in the footer:
Model: qwen/qwen3-coder-next | 145→248 tokens (38 tok/s, 340ms TTFT) | Session: 12,450 offloaded across 23 calls
TTFT (time to first token) and tokens per second, measured from the SSE stream. A slower model was running at 12 tok/s on certain prompts - painfully slow for something that normally hits 100-150. The TTFT metric made the problem obvious: the model was spending 8 seconds thinking before it started generating, which meant it was doing extended reasoning (the think blocks again) on prompts that shouldn't have needed it.
Turned out it was temperature. At 0.3, certain prompts triggered the model's reasoning mode, so dropping to 0.1 for code tasks fixed it. Without the TTFT data, I'd probably still be wondering why some calls take 15 seconds and others take 2.
The session totals (tokens offloaded, call count) serve a different purpose. Watching the counter climb to 40,000 offloaded tokens in a heavy coding day made the savings seem tangible rather than theoretical.
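The footer numbers fall out of three timestamps captured on the SSE stream: request start, first token, and completion (computeMetrics is an illustrative name; houtini-lm's actual bookkeeping may differ):

```typescript
// TTFT = delay before the first streamed token; tok/s = output
// tokens over the generation window that follows it.
function computeMetrics(
  startAt: number,      // request dispatched
  firstTokenAt: number, // first SSE chunk received
  doneAt: number,       // stream closed
  outputTokens: number
): { ttftMs: number; tokPerSec: number } {
  const ttftMs = firstTokenAt - startAt;
  const genSeconds = (doneAt - firstTokenAt) / 1000;
  const tokPerSec = genSeconds > 0 ? Math.round(outputTokens / genSeconds) : outputTokens;
  return { ttftMs, tokPerSec };
}
```

Measuring from the first token, rather than from request start, is what separates "the model is slow" from "the model is thinking before it speaks" - the distinction that exposed the reasoning-mode problem above.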
The stack
TypeScript, distributed via npm. Uses sql.js (WASM) for the model cache, SSE streaming for the 55-second soft timeout, and the MCP SDK for the protocol layer. Apache-2.0 licence.
- npm: @houtini/lm
- GitHub: houtini-ai/lm
If you're building MCP servers and running into similar problems with model determinism, I'd genuinely like to hear what patterns you've landed on. The routing and prompt hints system works for my setup but I'm under no illusion it's the only approach. PRs welcome - especially model profiles for families I haven't tested.
PS: in case I didn't mention it, you can use this to add any external cloud model with an OpenAI-compatible API endpoint. Comments welcome - it'd be nice to see this through to being fully useful and a genuine sidekick for Claude.



Top comments (2)
The think-block stripping problem is one of those things that sounds trivial until you actually hit it in production. I run Llama 3 locally for generating stock analysis content across thousands of pages, and the orphaned tag issue you describe — model running out of tokens mid-reasoning — bit me hard when I was batching content generation. You'd get 50 clean outputs, then one with leaked internal reasoning mixed into the financial analysis, and suddenly you have a stock page telling investors about "Step 3: Now I need to consider the P/E ratio..."
Your routing architecture is solving a problem I think a lot of people are going to hit as local model usage grows. The scoring approach (10 for task match, 5 for coder bonus, 2 for context length) is smart because it's transparent — you can actually debug why a model was chosen. I've seen people try ML-based routing and it always turns into a black box that's harder to fix than the original problem.
The sql.js vs better-sqlite3 tradeoff is an underrated decision. Zero native deps for an MCP tool is huge for adoption. The moment you need node-gyp, you lose half your potential users on the install step. Curious whether you've considered exposing the token offload metrics as an MCP resource that Claude could reference to make smarter delegation decisions over time — essentially letting the routing improve based on actual session performance data.
Actually that last point is really good. I absolutely should be logging the token offload metrics over time - thank you