For some time, I've been experimenting with the idea that, via an MCP server, we can delegate bounded tasks from Claude Code to cheaper local or cloud models (in my case, models I run on a local server in LM Studio). It makes sense: why chew through long, repetitive regression-testing tasks in the flagship model when Claude could direct the work and a simpler, arguably more task-efficient model could execute it?
The other worry I have: what if Anthropic added a few zeros to their subscription price and half of us had to rethink how we use the flagship models? This is my ongoing experiment. There's none of the "this is how you have to work from now on" pressure I feel every time I read about a new release; I'm just curious to see whether we can get to a point where Claude orchestrates and delegates to whatever local model(s) you have available, for the sake of token efficiency. It might matter one day!
My v1 was simple: one model, one endpoint, and instructions telling Claude to consider handing over specific tasks. Before long I'd rewritten most of it - as it turns out, Claude doesn't want to share much work.
Today, I'm going to give you a tour of my work so far (this is an experimental project; I welcome honest feedback, forks and pull requests): the post-mortem on what broke, what wasn't cutting the mustard, and how that influenced where I've ended up - model routing, think-block stripping, SQLite model caching, and per-model prompt tuning. All TypeScript, all open source.
I wrote about using SQLite as a context saver in MCP servers a couple of weeks ago, and the core argument there was: don't cram raw API data into your LLM's context window. It doesn't scale. As the dataset grows, the token cost balloons, the signal-to-noise ratio collapses, and the model starts forgetting because it's working from incomplete, compacted data. "Memory" - just stuffing everything into context or a SQLite dependency and hoping the model sorts it out - is not architecture. It's a different flavour of context bloat, delivered through tool descriptions and database entries of what the model did.
This is the same problem showing up in a completely different place. When your MCP server needs to know what twelve different local models are good at - their strengths, weaknesses, best task types, context lengths, quantisation levels - you can either dump all of that into every conversation, or you can cache it locally and query what you need. One approach costs hundreds of tokens per call and gets worse as you add models. The other costs a small fraction of that.
The MCP server is houtini-lm. It sits between Claude Code and whatever OpenAI-compatible endpoint you've got running - LM Studio, Ollama, vLLM, cloud APIs, whatever speaks /v1/chat/completions. Claude keeps on top of the reasoning. The cheap(er) model handles the outputs.
The hurdles - some of which I haven't quite overcome.
The routing problem
v1 assumed you had one model loaded. You'd set LM_STUDIO_URL, maybe override LM_STUDIO_MODEL, and every delegation call went to the same place. Fine if you're running Qwen Coder and only delegating code tasks.
Then I loaded GLM-4 alongside Qwen Coder because I wanted a general-purpose model for chat-style delegation - code explanations, content rewrites, commit messages. And immediately hit the problem: houtini-lm had no concept of "this is a code task, use the coder model" versus "this is a chat task, use the general model." Everything went to whatever model ID was in the config.
So I wrote a router. Here's the core of routeToModel:
type TaskType = 'code' | 'chat' | 'analysis' | 'embedding';

async function routeToModel(taskType: TaskType): Promise<RoutingDecision> {
  const models = await listModelsRaw();
  const loaded = models.filter((m) => m.state === 'loaded' || !m.state);
  if (loaded.length === 0) throw new Error('No models available at the endpoint');

  let bestModel = loaded[0];
  let bestScore = -1;

  for (const model of loaded) {
    const hints = getPromptHints(model.id, model.arch);
    let score = (hints.bestTaskTypes ?? []).includes(taskType) ? 10 : 0;

    // Bonus: code-specialised models for code tasks
    const profile = getModelProfile(model);
    if (taskType === 'code' && profile?.family.toLowerCase().includes('coder')) {
      score += 5;
    }

    // Bonus: larger context for analysis tasks
    if (taskType === 'analysis') {
      const ctx = getContextLength(model);
      if (ctx && ctx > 100000) score += 2;
    }

    if (score > bestScore) {
      bestScore = score;
      bestModel = model;
    }
  }

  return { modelId: bestModel.id, hints: getPromptHints(bestModel.id) };
}
Three things worth noting about this:
1. It queries the LM Studio /v1/models endpoint at routing time. This sounds expensive but the endpoint returns in under 5ms locally and it means model hot-swaps in LM Studio are picked up immediately (even if they can take their good time to load...). I tried caching this and it caused more problems than it solved - we don't want stale model lists when you unload something.
2. It can't JIT-load models. The MCP SDK has a hard ~60-second timeout on tool calls. Loading a model in LM Studio takes minutes. So if the best model for a task isn't loaded, the router uses the best available one and returns a suggestion string: "💡 qwen3-coder-next is downloaded and better suited for code tasks - ask the user to load it in LM Studio." Claude surfaces this to the user. Not ideal, but the alternative (silent timeout) is worse.
3. The scoring is deliberately simple. Current version: does the model's bestTaskTypes include this task type? 10 points. Is it a coder model and this is a code task? 5 bonus. Large context and analysis task? 2 bonus. The highest score wins.
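The scoring rules reduce to a pure function, which makes the worked cases easy to check. This is a simplified sketch - the real router scores live model objects, and the parameter names here are my own stand-ins for the fields it reads:

```typescript
// Restatement of the scoring rules: task-type match is worth 10, a coder
// family on a code task earns 5 more, and a >100k context window earns 2
// more on analysis tasks. Highest score wins.
type TaskType = 'code' | 'chat' | 'analysis' | 'embedding';

function scoreModel(
  taskType: TaskType,
  bestTaskTypes: TaskType[],
  isCoderFamily: boolean,
  contextLength: number,
): number {
  let score = bestTaskTypes.includes(taskType) ? 10 : 0;
  if (taskType === 'code' && isCoderFamily) score += 5;
  if (taskType === 'analysis' && contextLength > 100000) score += 2;
  return score;
}
```

A coder model on a code task scores 15, while a long-context generalist on an analysis task scores 12 - so each wins the tasks it should, and you can see exactly why.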
Think-block stripping
GLM-4, Qwen3, Nemotron - these models emit internal chain-of-thought reasoning wrapped in <think> tags before producing their actual response. When I first loaded GLM-4, every delegation call came back with 400+ tokens of the model arguing with itself before the 50-token answer I actually needed. Watching GLM have a discussion with itself tells you a lot about the model - it doesn't seem very confident and really does question itself.
The fix is simple:
// Strip <think>...</think> reasoning blocks
let cleanContent = content.replace(/<think>[\s\S]*?<\/think>\s*/g, ''); // closed blocks
cleanContent = cleanContent.replace(/^<think>\s*/, ''); // orphaned opening tag
cleanContent = cleanContent.trim();
It's two lines of regex. But that second line took me a while to pin down. Sometimes the model runs out of generation tokens mid-think-block: you get <think>The user wants test stubs for... and then the actual output, with no closing </think>. The first regex doesn't match because there's no closing tag, so I was getting leaked reasoning mixed into the response, which caused obvious problems.
My orphaned-tag regex catches that - it's not elegant, but it works, and it was a hard-won breakthrough.
The emitsThinkBlocks flag in the prompt hints system means this only runs for models that produce think blocks. There's no unnecessary processing for LLaMA or other instruct models that don't use this pattern.
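Put together with the gating flag, the whole strip fits in a small pure function. A sketch of the logic described above - the function name is my own, but the two regexes are the ones shown:

```typescript
// Strip <think>...</think> reasoning, but only for models flagged as emitting it.
function stripThinkBlocks(content: string, emitsThinkBlocks: boolean): string {
  if (!emitsThinkBlocks) return content; // no-op for plain instruct models

  let clean = content.replace(/<think>[\s\S]*?<\/think>\s*/g, ''); // closed blocks
  clean = clean.replace(/^<think>\s*/, ''); // orphaned opening tag
  return clean.trim();
}
```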
SQLite model cache (sql.js WASM, again)
This is where the argument from my previous SQLite post loads back into reader context... The router needs to know what each model is good at. You could stuff model profiles into the system prompt - strengths, weaknesses, best task types for every loaded model. With two models that's maybe 300 tokens. With twelve it's 2,000. And it's the same 2,000 tokens on every single tool call, burning context on metadata the model has already seen. That's the "memory as architecture" trap again: it works at small scale and falls apart the moment your data grows.
So I did what I did with the Search Console data example: cache it locally, query what you need, return only what's relevant to this specific routing decision.
For Qwen Coder or GLM-4, I've got hand-written (well, copied-and-pasted) profiles - curated descriptions, strengths, weaknesses, and the task types that suit the model best. But what about when someone loads a random GGUF they downloaded from HuggingFace? Let's query that and store it in the db.
The cache works in two tiers:
Tier 1: Static profiles - regex-matched against the model ID or architecture field. I maintain these by hand for model families I've used recently:
const MODEL_PROFILES: { pattern: RegExp; profile: ModelProfile }[] = [
  {
    pattern: /qwen3-coder|qwen3.*coder/i,
    profile: {
      family: 'Qwen3 Coder',
      description: 'Code-specialised model with agentic capabilities.',
      strengths: ['code generation', 'code review', 'debugging', 'test writing'],
      weaknesses: ['non-code creative tasks'],
      bestFor: ['code generation', 'code review', 'test stubs', 'refactoring'],
    },
  },
  // ... 12 more families
];
Tier 2: SQLite cache with HuggingFace auto-profiling - if no static profile matches, the server queries HuggingFace's free API, parses the model card, and generates a profile. This gets cached in a SQLite database with a 7-day TTL.
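The profile-derivation step can be sketched as a pure function over the model-card metadata. pipeline_tag and tags are real fields on the HuggingFace /api/models response, but the mapping rules below are my own illustrative guesses, not houtini-lm's actual heuristics:

```typescript
// Illustrative mapping from HuggingFace model-card metadata to task types.
interface HfCard {
  pipeline_tag?: string;
  tags?: string[];
}

function deriveBestTaskTypes(card: HfCard): string[] {
  const tags = card.tags ?? [];
  if (tags.some((t) => /code/i.test(t))) return ['code'];
  if (card.pipeline_tag === 'feature-extraction') return ['embedding'];
  return ['chat', 'analysis']; // safe default for general text-generation models
}
```

Keeping this derivation separate from the fetch means the cached row can be regenerated whenever the heuristics improve, without re-hitting the API.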
I chose sql.js (pure WASM) instead of better-sqlite3. For Better Search Console I used better-sqlite3 because it was the data layer - hundreds of thousands of rows, complex queries, WAL mode, the lot. For houtini-lm, the cache holds maybe 20 rows, and the priority is zero native dependencies. sql.js compiles to WASM, which means npx -y @houtini/lm works on any machine without needing a C++ toolchain: no node-gyp and no build failures on Windows. I do wince at whether this would work without a fix on a Mac, because I don't own one and perhaps never will. Still, I've had better-sqlite3 fail on three separate machines because of node-gyp version mismatches - none of that friction is worth it for a 20-100 row resource.
sql.js is slower for heavy workloads. For a 20-row lookup table, though, the speed difference is not noticeable.
// Schema
const SCHEMA = `
  CREATE TABLE IF NOT EXISTS model_profiles (
    model_id TEXT PRIMARY KEY,
    hf_id TEXT,
    pipeline_tag TEXT,
    architectures TEXT,
    license TEXT,
    downloads INTEGER,
    likes INTEGER,
    library_name TEXT,
    family TEXT,
    description TEXT,
    strengths TEXT,
    weaknesses TEXT,
    best_for TEXT,
    fetched_at INTEGER,
    source TEXT DEFAULT 'huggingface'
  )
`;
The fetched_at timestamp drives the 7-day TTL. After a week, the cache re-fetches from HuggingFace in case the model card has been updated. In practice this almost never matters, but I've had at least one case where a model's pipeline tag changed after a major update and the stale cache was routing it incorrectly.
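The freshness check itself is trivial. A minimal sketch, with the constant matching the 7-day TTL described above and the function name my own:

```typescript
// 7-day TTL driven by the fetched_at column (stored in milliseconds).
const TTL_MS = 7 * 24 * 60 * 60 * 1000;

function isStale(fetchedAtMs: number, nowMs: number = Date.now()): boolean {
  return nowMs - fetchedAtMs > TTL_MS;
}
```

A stale row triggers a re-fetch from HuggingFace; a fresh one is served straight from SQLite.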
Per-model prompt hints
This is the bit that makes the difference to output quality.
Each model family has its own set of annoying quirks. GLM-4 writes a paragraph of introduction before every response unless you tell it not to - and even then, there's a chance it'll ignore you. Llama is incredibly hard to constrain to only the output you want. Qwen Coder is at its best at temperature 0.1 for code, but that's far too low for chat tasks. Nemotron handles structured output well but needs explicit "no preamble" instructions.
The PromptHints interface:
interface PromptHints {
  codeTemp: number;          // temperature for code generation
  chatTemp: number;          // temperature for chat/analysis
  outputConstraint: string;  // injected into system prompt
  emitsThinkBlocks: boolean; // flag for think-block stripping
  bestTaskTypes: ('code' | 'chat' | 'analysis' | 'embedding')[];
}
And the per-model configuration:
// GLM-4: great general model, but verbose without constraints
{
  pattern: /glm[- ]?4/i,
  hints: {
    codeTemp: 0.1,
    chatTemp: 0.3,
    outputConstraint: 'Respond with ONLY the requested output. No step-by-step reasoning. No numbered analysis. No preamble. Go straight to the answer.',
    emitsThinkBlocks: true,
    bestTaskTypes: ['chat', 'analysis'],
  },
},
// Qwen Coder: focused, needs minimal constraint
{
  pattern: /qwen3.*coder|qwen.*coder/i,
  hints: {
    codeTemp: 0.1,
    chatTemp: 0.3,
    outputConstraint: 'Be direct. Output only what was asked for.',
    emitsThinkBlocks: true,
    bestTaskTypes: ['code'],
  },
},
The outputConstraint string gets injected into the system prompt before every delegation call. Without it, GLM-4 would generate something like:
Let me analyze this code step by step.
Step 1: First, I'll examine the function signature...
Step 2: Next, I'll consider the edge cases...
Step 3: Now I'll write the tests...
Here are the tests:
// actual tests
With the constraint, you get just the tests. That's 200+ tokens of preamble saved on every single call. Over a day of heavy delegation, that adds up.
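Resolving the hints and injecting the constraint can be sketched in a few lines. The lookup table here is trimmed to one entry, and buildSystemPrompt is a hypothetical name rather than houtini-lm's actual API:

```typescript
// Hypothetical lookup against the pattern table, with a conservative default.
interface Hints {
  chatTemp: number;
  outputConstraint: string;
}

const HINT_TABLE: { pattern: RegExp; hints: Hints }[] = [
  {
    pattern: /glm[- ]?4/i,
    hints: { chatTemp: 0.3, outputConstraint: 'Respond with ONLY the requested output.' },
  },
];
const DEFAULT_HINTS: Hints = { chatTemp: 0.3, outputConstraint: 'Be direct.' };

function buildSystemPrompt(modelId: string, basePrompt: string): string {
  const hints = HINT_TABLE.find((e) => e.pattern.test(modelId))?.hints ?? DEFAULT_HINTS;
  // The constraint rides along in the system prompt on every delegation call.
  return `${basePrompt}\n\n${hints.outputConstraint}`;
}
```

Unknown models fall through to the default hints, so a freshly loaded GGUF still gets a sane constraint rather than none at all.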
I tested this properly - ran the same twenty delegation calls with generic settings versus model-specific hints. The tuned version produced usable output on eighteen of them first try, although hallucination seemed to be a massive problem with GLM 4.7. The generic setup managed twelve. Eight out of twenty calls needing a retry doesn't sound terrible, but each retry is another round-trip to the model and another chunk of context in the Claude conversation.
Performance measurement: TTFT and tok/s
Every response now includes timing data in the footer:
Model: qwen/qwen3-coder-next | 145→248 tokens (38 tok/s, 340ms TTFT) | Session: 12,450 offloaded across 23 calls
TTFT (time to first token) and tokens per second, measured from the SSE stream. A slower model was running at 12 tok/s on certain prompts - painfully slow for something that normally hits 100-150. The TTFT metric made the problem obvious: the model was spending 8 seconds thinking before it started generating, which meant it was doing extended reasoning (the think blocks again) on prompts that shouldn't have needed it.
Turned out it was temperature. At 0.3, certain prompts triggered the model's reasoning mode, so dropping to 0.1 for code tasks fixed it. Without the TTFT data, I'd probably still be wondering why some calls take 15 seconds and others take 2.
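Deriving the two numbers from stream timestamps is straightforward. A sketch with hypothetical names - the real server reads these points off the SSE events:

```typescript
// TTFT = time from request start to the first streamed token;
// tok/s = completion tokens over the generation window after that first token.
interface StreamTimings {
  startMs: number;       // request sent
  firstTokenMs: number;  // first SSE chunk with content
  endMs: number;         // stream closed
  completionTokens: number;
}

function summariseTimings(t: StreamTimings): { ttftMs: number; tokPerSec: number } {
  const ttftMs = t.firstTokenMs - t.startMs;
  const genSeconds = (t.endMs - t.firstTokenMs) / 1000;
  return { ttftMs, tokPerSec: t.completionTokens / genSeconds };
}
```

Splitting TTFT out from raw tok/s is what made the hidden-reasoning problem visible: a long TTFT with normal generation speed points at thinking before output, not a slow model.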
The session totals (tokens offloaded, call count) serve a different purpose. Watching the counter climb to 40,000 offloaded tokens in a heavy coding day made the savings seem tangible rather than theoretical.
The stack
TypeScript, distributed via npm. Uses sql.js (WASM) for the model cache, SSE streaming for the 55-second soft timeout, and the MCP SDK for the protocol layer. Apache-2.0 licence.
- npm: @houtini/lm
- GitHub: houtini-ai/lm
If you're building MCP servers and running into similar problems with model determinism, I'd genuinely like to hear what patterns you've landed on. The routing and prompt hints system works for my setup but I'm under no illusion it's the only approach. PRs welcome - especially model profiles for families I haven't tested.
PS: in case I didn't mention it, you can use this to add any external cloud model with an OpenAI-compatible API endpoint. Comments welcome - it'd be nice to see this through to being fully useful and a genuine sidekick for Claude.



Top comments (5)
The think-block stripping problem scales in unexpected ways depending on
your output volume. The pattern Apex Stack describes — clean outputs
followed by one with leaked reasoning — is especially dangerous in
financial or legal content where a single corrupted output can sit
undetected for a long time before someone notices.
The transparent scoring approach for routing (fixed weights you can
actually inspect) is underrated. The moment routing becomes a learned
model, you trade debuggability for marginal accuracy gains that rarely
justify the complexity — especially in production where you need to
explain why a model was chosen, not just that it was.
Richard's point about sql.js vs better-sqlite3 resonates. Zero native
deps should be a design constraint for any MCP tool aiming for broad
adoption, not an afterthought. node-gyp friction is a real adoption
killer.
The suggestion to expose token offload metrics as an MCP resource is
the most interesting thread here — essentially turning passive routing
logs into an active feedback signal. Has anyone actually implemented
something like this in production, or is it still mostly theoretical?
Hi Jhonatan - I'm going to add the items the comments have raised, plus a consensus score so that the host LLM can evaluate the external model's output.
Thanks for the comments and messages. Here's an update: v2.8.0 shipped this morning.
Thanks to everyone who commented - I've implemented several of the suggestions raised here and wanted to share the results.
What shipped in v2.8.0:
Request semaphore - inference calls are now serialised. I was having problems with stacked timeouts when parallel requests hit my server. Each call gets the full timeout budget.
Quality metadata - every response now includes structured quality signals: truncation flags, think-block detection, token estimation accuracy, and finish reason. This directly addresses @apexstack's point about leaked reasoning in production - the orchestrator now knows if think-blocks were stripped and whether the output was truncated.
Session metrics as an MCP resource - houtini://metrics/session exposes cumulative offload stats and per-model performance as JSON. This is exactly the "active feedback signal" @apexstack suggested. Claude can now proactively read delegation efficiency before deciding what to offload.
HuggingFace-driven thinking detection - instead of hardcoding which models emit think blocks, we now detect this from the HF model card's chat_template at startup and store it in the SQLite cache. If the template supports enable_thinking, we suppress it automatically to reclaim generation budget for actual output. Your expensive LLM does the thinking; I'm not convinced that context bloat from thinking in a smaller model is the way to go for this application.
Unflushed SSE buffer fix - the final streaming chunk (often containing usage data) could get stranded in the buffer. This caused the "0 tokens offloaded" display on truncated responses that a few people reported.
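The request semaphore from the first item can be sketched as a promise-chain mutex. This is my own minimal version, not houtini-lm's actual code:

```typescript
// Serialise async inference calls: each task starts only after the previous
// one settles, so every call gets the full timeout budget to itself.
class RequestSemaphore {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain alive even if a task rejects.
    this.tail = result.catch(() => undefined);
    return result;
  }
}
```

Every inference call goes through something like sem.run(() => callModel(...)), so parallel tool calls queue instead of stacking timeouts against each other.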
Token savings benchmark (real files, not toy examples):
I built a benchmark using actual source files from my repo (581-2,022 lines of TypeScript per file) across realistic delegation patterns: code review, architecture review, cross-repo review, and code explanation.
93.3% net token savings across that session. The key insight: savings come from context avoidance - when Claude delegates, it never reads the source file into its context window. A 1352-line file is ~14,000 tokens that never enter the conversation.
Claude still reads the local LLM's result (the review summary, not the source file). That's the ~750 token delegation cost in the table. So the full picture is:
Without delegation: Claude reads 14,000 tokens of source code + generates 500 tokens of review = ~14,500 tokens
With delegation: Claude sends a ~250 token tool call + reads back a ~500 token summary from Qwen = ~750 tokens
The source file never enters Claude's context. Claude only sees the compressed output. That's where the savings come from - you're trading 14,000 tokens of raw code for a 500-token summary of it.
The QA step (Claude reviewing Qwen's output for correctness) is real cost, but it's reviewing a short summary, not a 1352-line file.
Small tasks (quick factual answers, commit messages) don't save tokens - the ~250 token MCP overhead dominates. But for anything involving reading and analysing files, which is the majority of real coding sessions, delegation pays for itself immediately.
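The arithmetic above fits in one line. The token counts are the article's approximations, not fresh measurements:

```typescript
// Net saving = 1 - (delegated cost / direct cost). With the numbers above:
// ~750 tokens delegated vs ~14,500 tokens read directly, roughly 95% per file.
function netSaving(directTokens: number, delegatedTokens: number): number {
  return 1 - delegatedTokens / directTokens;
}
```

Across a whole mixed session, including the small tasks where the ~250-token MCP overhead dominates, the average came out at 93.3%.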
The benchmark script ships with the package if anyone wants to run it against their own setup: LM_STUDIO_URL=your-server:1234 node benchmark.mjs
Repo: github.com/houtini-ai/lm | npm: @houtini/lm
The think-block stripping problem is one of those things that sounds trivial until you actually hit it in production. I run Llama 3 locally for generating stock analysis content across thousands of pages, and the orphaned tag issue you describe — model running out of tokens mid-reasoning — bit me hard when I was batching content generation. You'd get 50 clean outputs, then one with leaked internal reasoning mixed into the financial analysis, and suddenly you have a stock page telling investors about "Step 3: Now I need to consider the P/E ratio..."
Your routing architecture is solving a problem I think a lot of people are going to hit as local model usage grows. The scoring approach (10 for task match, 5 for coder bonus, 2 for context length) is smart because it's transparent — you can actually debug why a model was chosen. I've seen people try ML-based routing and it always turns into a black box that's harder to fix than the original problem.
The sql.js vs better-sqlite3 tradeoff is an underrated decision. Zero native deps for an MCP tool is huge for adoption. The moment you need node-gyp, you lose half your potential users on the install step. Curious whether you've considered exposing the token offload metrics as an MCP resource that Claude could reference to make smarter delegation decisions over time — essentially letting the routing improve based on actual session performance data.
Actually that last point is really good. I absolutely should be logging the token offload metrics over time - thank you