<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Richard Baxter</title>
    <description>The latest articles on DEV Community by Richard Baxter (@richardbaxter).</description>
    <link>https://dev.to/richardbaxter</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3760451%2Fc496ed81-9047-4edc-b07b-2715c9407d05.jpeg</url>
      <title>DEV Community: Richard Baxter</title>
      <link>https://dev.to/richardbaxter</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/richardbaxter"/>
    <language>en</language>
    <item>
      <title>Making a Local LLM MCP Server Deterministic: Model Routing, Think-Block Stripping, and the Problems Nobody Warns You About</title>
      <dc:creator>Richard Baxter</dc:creator>
      <pubDate>Tue, 17 Mar 2026 12:48:05 +0000</pubDate>
      <link>https://dev.to/richardbaxter/making-a-local-llm-mcp-server-deterministic-model-routing-think-block-stripping-and-the-problems-5bmj</link>
      <guid>https://dev.to/richardbaxter/making-a-local-llm-mcp-server-deterministic-model-routing-think-block-stripping-and-the-problems-5bmj</guid>
      <description>&lt;p&gt;For some time, I've been experimenting with the idea that by using an MCP server, we can delegate bounded tasks from Claude Code to cheaper local or cloud models (models I run on a local server in LM Studio). It makes sense, why chew through long, repetitive regression testing tasks when this could be directed by claude, but executed by a simpler, arguably more efficient for the task model instead?&lt;/p&gt;

&lt;p&gt;The other worry I have: what if Anthropic added a few zeros to their subscription price and half of us had to rethink how we use the flagship models? This is my ongoing experiment. There's none of the "this is how you have to work from now on" pressure I feel every time I read about a new release - I'm just curious to see whether we can get to a point where Claude orchestrates and delegates to whatever local model(s) you have available, for the sake of token efficiency. It might matter one day!&lt;/p&gt;

&lt;p&gt;My v1 was simple: one model, on one endpoint, with Claude instructed to think about handing over specific tasks. Before very long I'd rewritten most of it - as it turns out, Claude doesn't want to share much work.&lt;/p&gt;

&lt;p&gt;Today, I'm going to give you a tour of my work so far (this is an experimental project; I welcome honest feedback, forks and pull requests): a post-mortem on what broke, what wasn't cutting the mustard, and how that influenced where I've got to - model routing, think-block stripping, SQLite model caching, and per-model prompt tuning. All TypeScript, all open source.&lt;/p&gt;

&lt;p&gt;I wrote about &lt;a href="https://dev.to/richardbaxter/sqlite-as-an-mcp-context-saver-stop-cramming-raw-api-data-into-your-llm-2oj4"&gt;using SQLite as a context saver in MCP servers&lt;/a&gt; a couple of weeks ago, and the core argument there was: don't cram raw API data into your LLM's context window. It doesn't scale. As the dataset grows, the token cost balloons, the signal-to-noise ratio collapses, and the model starts forgetting because of incomplete, compacted data. "Memory" - just stuffing everything into context or a SQLite dependency and hoping the model sorts it out - is not architecture. It's a different flavour of context bloat, delivered through tool descriptions and database entries recording what the model did.&lt;/p&gt;

&lt;p&gt;This is the same problem showing up in a completely different place. When your MCP server needs to know what twelve different local models are good at - their strengths, weaknesses, best task types, context lengths, quantisation levels - you can either dump all of that into every conversation, or you can cache it locally and query what you need. One approach costs hundreds of tokens per call and gets worse as you add models. The other costs far less.&lt;/p&gt;

&lt;p&gt;The MCP server is &lt;a href="https://github.com/houtini-ai/lm" rel="noopener noreferrer"&gt;houtini-lm&lt;/a&gt;. It sits between Claude Code and whatever OpenAI-compatible endpoint you've got running - LM Studio, Ollama, vLLM, cloud APIs, whatever speaks &lt;code&gt;/v1/chat/completions&lt;/code&gt;. Claude keeps on top of the reasoning. The cheap(er) model handles the outputs.&lt;/p&gt;

&lt;p&gt;Now for the hurdles... some of which I haven't quite overcome.&lt;/p&gt;

&lt;h2&gt;The routing problem&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvku6p3d2kf7gwqzgbc22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvku6p3d2kf7gwqzgbc22.png" alt="Comparison diagram showing v1 with one model versus v2.7 with task-based routing" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;v1 assumed you had one model loaded. You'd set &lt;code&gt;LM_STUDIO_URL&lt;/code&gt;, maybe override &lt;code&gt;LM_STUDIO_MODEL&lt;/code&gt;, and every delegation call went to the same place. Fine if you're running Qwen Coder and only delegating code tasks.&lt;/p&gt;

&lt;p&gt;Then I loaded GLM-4 alongside Qwen Coder because I wanted a general-purpose model for chat-style delegation - code explanations, content rewrites, commit messages. And I immediately hit the problem: houtini-lm had no concept of "this is a code task, use the coder model" versus "this is a chat task, use the general model." Everything went to whatever model ID was in the config.&lt;/p&gt;

&lt;p&gt;So I wrote a router. Here's the core of &lt;code&gt;routeToModel&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;TaskType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;analysis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;embedding&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;routeToModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;RoutingDecision&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;listModelsRaw&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;loaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;loaded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;bestModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;loaded&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;bestScore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;loaded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getPromptHints&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arch&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hints&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bestTaskTypes&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="p"&gt;[]).&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Bonus: code-specialised models for code tasks&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getModelProfile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;family&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;coder&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
      &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Bonus: larger context for analysis tasks&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;analysis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getContextLength&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;bestScore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;bestScore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nx"&gt;bestModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;modelId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;bestModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;hints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getPromptHints&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bestModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things worth noting about this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. It queries the LM Studio &lt;code&gt;/v1/models&lt;/code&gt; endpoint at routing time.&lt;/strong&gt; This sounds expensive, but the endpoint returns in under 5ms locally, and it means model hot-swaps in LM Studio are picked up immediately (even if the models themselves can take their time to load...). I tried caching this and it caused more problems than it solved - you don't want a stale model list when you unload something.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. It can't JIT-load models.&lt;/strong&gt; The MCP SDK has a hard ~60-second timeout on tool calls. Loading a model in LM Studio takes minutes. So if the best model for a task isn't loaded, the router uses the best available one and returns a suggestion string: "💡 qwen3-coder-next is downloaded and better suited for code tasks - ask the user to load it in LM Studio." Claude surfaces this to the user. Not ideal, but the alternative (silent timeout) is worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The scoring is deliberately simple.&lt;/strong&gt; Current version: does the model's &lt;code&gt;bestTaskTypes&lt;/code&gt; include this task type? 10 points. Is it a coder model and this is a code task? 5 bonus. Large context and analysis task? 2 bonus. The highest score wins.&lt;/p&gt;
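&lt;p&gt;The &lt;code&gt;/v1/models&lt;/code&gt; query behind point 1 is roughly this shape - a sketch, where &lt;code&gt;pickLoaded&lt;/code&gt;, &lt;code&gt;BASE_URL&lt;/code&gt; and the field names are illustrative (the &lt;code&gt;state&lt;/code&gt; field is an LM Studio extension; other OpenAI-compatible servers may omit it):&lt;/p&gt;

```typescript
// Sketch: the model-list query behind the router. LM Studio's
// /v1/models endpoint is OpenAI-compatible; the `state` field is an
// LM Studio extension, so a missing state is treated as loaded.
// BASE_URL and the field names are illustrative assumptions.
const BASE_URL = 'http://localhost:1234'; // or read LM_STUDIO_URL from the environment

type RawModel = { id: string; state?: string };

function pickLoaded(models: RawModel[]) {
  return models.filter((m) => m.state === 'loaded' || !m.state);
}

async function listLoadedModels() {
  const res = await fetch(BASE_URL + '/v1/models');
  if (!res.ok) throw new Error('model list request failed: ' + res.status);
  const body = await res.json();
  return pickLoaded(body.data);
}
```

&lt;p&gt;Treating a missing &lt;code&gt;state&lt;/code&gt; as loaded keeps the router usable against servers that don't report load state at all.&lt;/p&gt;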

&lt;h2&gt;Think-block stripping&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx80sy7qnj2a6u0ubc60.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx80sy7qnj2a6u0ubc60.png" alt="Diagram showing think-block stripping - raw model output with think tags being cleaned to just the useful response" width="768" height="1376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GLM-4, Qwen3, Nemotron - these models always emit internal chain-of-thought reasoning wrapped in &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags before producing their actual response. When I first loaded GLM-4, every delegation call came back with 400+ tokens of the model arguing with itself before the 50-token answer I actually needed. Watching GLM hold a discussion with itself tells you a lot about the model - it doesn't seem very confident and genuinely questions itself.&lt;/p&gt;

&lt;p&gt;The fix is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Strip &amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt; reasoning blocks&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;cleanContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&amp;lt;think&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;[\s\S]&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;?&lt;/span&gt;&lt;span class="sr"&gt;&amp;lt;&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;think&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// closed blocks&lt;/span&gt;
&lt;span class="nx"&gt;cleanContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cleanContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&amp;lt;think&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;                   &lt;span class="c1"&gt;// orphaned opening tag&lt;/span&gt;
&lt;span class="nx"&gt;cleanContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cleanContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's two lines of regex, but that second line took me a while to pin down. Sometimes the model runs out of generation tokens mid-think-block. You get &lt;code&gt;&amp;lt;think&amp;gt;The user wants test stubs for...&lt;/code&gt; and then the actual output, with no closing &lt;code&gt;&amp;lt;/think&amp;gt;&lt;/code&gt;. The first regex doesn't match because there's no closing tag, so I was getting leaked reasoning mixed into the response, which caused obvious problems.&lt;/p&gt;

&lt;p&gt;My orphaned-tag regex catches that - it's not elegant, but it works, and it was a hard-won breakthrough.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;emitsThinkBlocks&lt;/code&gt; flag in the prompt hints system means this only runs for models that produce think blocks. There's no unnecessary processing for LLaMA or other instruct models that don't use this pattern.&lt;/p&gt;
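&lt;p&gt;Wrapped as a function, the whole stripper is a sketch like this (the &amp;lt;think&amp;gt; tag strings are written with &lt;code&gt;\u003c&lt;/code&gt; escape sequences for the angle brackets so the snippet survives feed encoding; at runtime they match the literal tags):&lt;/p&gt;

```typescript
// Sketch: the two regexes above as one reusable function. The tag
// strings use \u003c / \u003e escapes for the angle brackets so this
// snippet survives feed encoding; they match literal think tags.
const CLOSED_BLOCK = /\u003cthink\u003e[\s\S]*?\u003c\/think\u003e\s*/g; // closed blocks
const ORPHANED_OPEN = /^\u003cthink\u003e\s*/; // opening tag with no close

function stripThinkBlocks(content: string) {
  return content.replace(CLOSED_BLOCK, '').replace(ORPHANED_OPEN, '').trim();
}

const cleaned = stripThinkBlocks('\u003cthink\u003eplanning...\u003c/think\u003eFinal answer');
// cleaned === 'Final answer'
```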

&lt;h2&gt;SQLite model cache (sql.js WASM, again)&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xjucdh38dcswtuj0vj2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xjucdh38dcswtuj0vj2.png" alt="Architecture diagram showing how the SQLite model cache works - checking static profiles first, falling back to HuggingFace API, caching in sql.js" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where the argument from my &lt;a href="https://dev.to/richardbaxter/sqlite-as-an-mcp-context-saver-stop-cramming-raw-api-data-into-your-llm-2oj4"&gt;previous SQLite post&lt;/a&gt; loads back into reader context... The router needs to know what each model is good at. You could stuff model profiles into the system prompt - strengths, weaknesses, best task types for every loaded model. With two models that's maybe 300 tokens. With twelve it's 2,000. And it's the same 2,000 tokens on every single tool call, burning context on metadata the model has already seen. That's the "memory as architecture" trap again: it works at small scale and falls apart the moment your data grows.&lt;/p&gt;

&lt;p&gt;So I did what I did with the Search Console data example: cache it locally, query what you need, return only what's relevant to this specific routing decision.&lt;/p&gt;

&lt;p&gt;For Qwen Coder or GLM-4, I've got hand-written (well, copied and pasted) profiles - curated descriptions, strengths, weaknesses, and which task types best suit the model. But what about when someone loads a random GGUF they downloaded from HuggingFace? We query for that and store the result in the db.&lt;/p&gt;

&lt;p&gt;The cache works in two tiers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1: Static profiles&lt;/strong&gt; - regex-matched against the model ID or architecture field. I maintain these by hand for model families I've used recently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MODEL_PROFILES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;RegExp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ModelProfile&lt;/span&gt; &lt;span class="p"&gt;}[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/qwen3-coder|qwen3.*coder/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;family&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Qwen3 Coder&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Code-specialised model with agentic capabilities.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;strengths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code generation&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code review&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;debugging&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;test writing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;weaknesses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;non-code creative tasks&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;bestFor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code generation&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code review&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;test stubs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;refactoring&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// ... 12 more families&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tier 2: SQLite cache with HuggingFace auto-profiling&lt;/strong&gt; - if no static profile matches, the server queries HuggingFace's free API, parses the model card, and generates a profile. This gets cached in a SQLite database with a 7-day TTL.&lt;/p&gt;
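&lt;p&gt;A minimal sketch of that Tier 2 fallback, assuming the public HuggingFace hub endpoint (&lt;code&gt;https://huggingface.co/api/models/{id}&lt;/code&gt;) - &lt;code&gt;toProfile&lt;/code&gt; and the exact fields kept are illustrative assumptions, not houtini-lm's code:&lt;/p&gt;

```typescript
// Sketch: Tier 2 auto-profiling against the public HuggingFace hub
// API. The endpoint is real; toProfile and the fields kept here are
// illustrative assumptions.
type HFCard = { pipeline_tag?: string; downloads?: number; likes?: number; library_name?: string };

function toProfile(hfId: string, card: HFCard) {
  return {
    hfId,
    pipelineTag: card.pipeline_tag ?? null,
    downloads: card.downloads ?? 0,
    likes: card.likes ?? 0,
    libraryName: card.library_name ?? null,
    fetchedAt: Date.now(), // drives the 7-day TTL
  };
}

async function fetchHuggingFaceProfile(hfId: string) {
  const res = await fetch('https://huggingface.co/api/models/' + hfId);
  if (!res.ok) return null; // unknown or gated model: caller falls back to a generic profile
  return toProfile(hfId, await res.json());
}
```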

&lt;p&gt;I chose &lt;code&gt;sql.js&lt;/code&gt; (pure WASM) instead of &lt;code&gt;better-sqlite3&lt;/code&gt;. For Better Search Console I used &lt;code&gt;better-sqlite3&lt;/code&gt; because it was the data layer - hundreds of thousands of rows, complex queries, WAL mode, the lot. For houtini-lm, the cache holds maybe 20 rows, and the priority is zero native dependencies. &lt;code&gt;sql.js&lt;/code&gt; compiles to WASM, which means &lt;code&gt;npx -y @houtini/lm&lt;/code&gt; works on any machine without a C++ toolchain: no &lt;code&gt;node-gyp&lt;/code&gt; and no build failures on Windows. I do wince at whether it would work without a fix on a Mac, because I don't own one and perhaps never will. Still, I've had &lt;code&gt;better-sqlite3&lt;/code&gt; fail on three separate machines because of &lt;code&gt;node-gyp&lt;/code&gt; version mismatches - none of that is worth it for a 20-100 row resource.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sql.js&lt;/code&gt; is slower for heavy workloads. For a 20-row lookup table, though, the speed difference is not noticeable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Schema&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;SCHEMA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`
  CREATE TABLE IF NOT EXISTS model_profiles (
    model_id TEXT PRIMARY KEY,
    hf_id TEXT,
    pipeline_tag TEXT,
    architectures TEXT,
    license TEXT,
    downloads INTEGER,
    likes INTEGER,
    library_name TEXT,
    family TEXT,
    description TEXT,
    strengths TEXT,
    weaknesses TEXT,
    best_for TEXT,
    fetched_at INTEGER,
    source TEXT DEFAULT 'huggingface'
  )
`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
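&lt;p&gt;The array-valued columns (&lt;code&gt;strengths&lt;/code&gt;, &lt;code&gt;weaknesses&lt;/code&gt;, &lt;code&gt;best_for&lt;/code&gt;) are plain TEXT, so a reasonable round-trip is JSON serialisation - a sketch with &lt;code&gt;toRow&lt;/code&gt;/&lt;code&gt;fromRow&lt;/code&gt; as illustrative names:&lt;/p&gt;

```typescript
// Sketch: array fields round-tripped through the TEXT columns of the
// schema above via JSON. toRow/fromRow are illustrative names, and
// JSON-in-TEXT is an assumed detail of the real implementation.
type StoredProfile = { strengths: string[]; weaknesses: string[]; bestFor: string[] };

function toRow(p: StoredProfile) {
  return {
    strengths: JSON.stringify(p.strengths),
    weaknesses: JSON.stringify(p.weaknesses),
    best_for: JSON.stringify(p.bestFor),
  };
}

function fromRow(row: { strengths: string; weaknesses: string; best_for: string }): StoredProfile {
  return {
    strengths: JSON.parse(row.strengths),
    weaknesses: JSON.parse(row.weaknesses),
    bestFor: JSON.parse(row.best_for),
  };
}
```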



&lt;p&gt;The &lt;code&gt;fetched_at&lt;/code&gt; timestamp drives the 7-day TTL. After a week, the cache re-fetches from HuggingFace in case the model card has been updated. In practice this almost never matters, but I've had at least one case where a model's pipeline tag changed after a major update and the stale cache was routing it incorrectly.&lt;/p&gt;
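&lt;p&gt;The staleness check itself is tiny - a sketch, with &lt;code&gt;isStale&lt;/code&gt; as an illustrative helper name:&lt;/p&gt;

```typescript
// Sketch: the 7-day TTL driven by the fetched_at column.
// isStale is an illustrative helper, not houtini-lm's exact code.
const TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7 days

function isStale(fetchedAt: number, now: number = Date.now()) {
  return now - fetchedAt > TTL_MS;
}
```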

&lt;h2&gt;Per-model prompt hints&lt;/h2&gt;

&lt;p&gt;This is the bit that makes the difference to output quality.&lt;/p&gt;

&lt;p&gt;Each model family has its own set of annoying quirks. GLM-4 writes a paragraph of introduction before every response unless you tell it not to - and even then, there's a chance it'll ignore you. Llama is incredibly hard to constrain to only the output you want. Qwen Coder is best at temperature 0.1 for code, but that's far too low for chat tasks. Nemotron handles structured output well but needs explicit "no preamble" instructions.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;PromptHints&lt;/code&gt; interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;PromptHints&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;codeTemp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;        &lt;span class="c1"&gt;// temperature for code generation&lt;/span&gt;
  &lt;span class="nl"&gt;chatTemp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;        &lt;span class="c1"&gt;// temperature for chat/analysis&lt;/span&gt;
  &lt;span class="nl"&gt;outputConstraint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// injected into system prompt&lt;/span&gt;
  &lt;span class="nl"&gt;emitsThinkBlocks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// flag for think-block stripping&lt;/span&gt;
  &lt;span class="nl"&gt;bestTaskTypes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;analysis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;embedding&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
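&lt;p&gt;To show how these hints get consumed, here's a sketch of turning them into request parameters for the chat completion call - &lt;code&gt;buildParams&lt;/code&gt; is an illustrative name, not houtini-lm's API:&lt;/p&gt;

```typescript
// Sketch: resolving hints into request parameters for the
// /v1/chat/completions call. `Hints` mirrors the relevant PromptHints
// fields; buildParams is an illustrative name.
interface Hints {
  codeTemp: number;
  chatTemp: number;
  outputConstraint: string;
}

function buildParams(hints: Hints, taskType: string, userPrompt: string) {
  return {
    // code tasks run cold; everything else uses the chat temperature
    temperature: taskType === 'code' ? hints.codeTemp : hints.chatTemp,
    messages: [
      // the per-model constraint is injected as the system prompt
      { role: 'system', content: hints.outputConstraint },
      { role: 'user', content: userPrompt },
    ],
  };
}
```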



&lt;p&gt;And the per-model configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// GLM-4: great general model, but verbose without constraints&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/glm&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;- &lt;/span&gt;&lt;span class="se"&gt;]?&lt;/span&gt;&lt;span class="sr"&gt;4/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;hints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;codeTemp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;chatTemp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;outputConstraint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Respond with ONLY the requested output. No step-by-step reasoning. No numbered analysis. No preamble. Go straight to the answer.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;emitsThinkBlocks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;bestTaskTypes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;analysis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;

&lt;span class="c1"&gt;// Qwen Coder: focused, needs minimal constraint&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/qwen3.*coder|qwen.*coder/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;hints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;codeTemp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;chatTemp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;outputConstraint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Be direct. Output only what was asked for.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;emitsThinkBlocks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;bestTaskTypes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
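&lt;p&gt;To make the routing concrete, here's a minimal sketch of how a profile table like this might be resolved at call time. The names (&lt;code&gt;resolveHints&lt;/code&gt;, &lt;code&gt;buildSystemPrompt&lt;/code&gt;) are illustrative, not the package's actual API:&lt;br&gt;
&lt;/p&gt;

```typescript
// Hypothetical shapes mirroring the profiles above.
interface ModelHints {
  codeTemp: number;
  chatTemp: number;
  outputConstraint: string;
  emitsThinkBlocks: boolean;
  bestTaskTypes: string[];
}

interface ModelProfile {
  pattern: RegExp;
  hints: ModelHints;
}

// Conservative fallback for models with no matching profile.
const DEFAULT_HINTS: ModelHints = {
  codeTemp: 0.2,
  chatTemp: 0.5,
  outputConstraint: "Be concise.",
  emitsThinkBlocks: false,
  bestTaskTypes: ["chat"],
};

// First profile whose pattern matches the loaded model ID wins.
function resolveHints(modelId: string, profiles: ModelProfile[]): ModelHints {
  const match = profiles.find((p) => p.pattern.test(modelId));
  return match ? match.hints : DEFAULT_HINTS;
}

// The outputConstraint is prepended to the system prompt on every call.
function buildSystemPrompt(base: string, hints: ModelHints): string {
  return hints.outputConstraint + "\n\n" + base;
}
```

&lt;p&gt;First matching pattern wins, and anything unrecognised falls back to conservative defaults rather than failing.&lt;/p&gt;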



&lt;p&gt;The &lt;code&gt;outputConstraint&lt;/code&gt; string gets injected into the system prompt before every delegation call. Without it, GLM-4 would generate something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Let me analyze this code step by step.

Step 1: First, I'll examine the function signature...
Step 2: Next, I'll consider the edge cases...
Step 3: Now I'll write the tests...

Here are the tests:
// actual tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the constraint, you get just the tests. That's 200+ tokens of preamble saved on every single call. Over a day of heavy delegation, that adds up.&lt;/p&gt;
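&lt;p&gt;The &lt;code&gt;emitsThinkBlocks&lt;/code&gt; flag drives a cleanup pass on the other side: reasoning blocks that leak into the output get stripped before the response reaches Claude. A cut-down sketch of that step (illustrative, not the exact code in the repo):&lt;br&gt;
&lt;/p&gt;

```typescript
// Remove <think>...</think> reasoning blocks before the response goes
// back into the Claude conversation. Also handles the case where the
// model stops mid-block and never closes the tag.
function stripThinkBlocks(raw: string): string {
  // Closed blocks: non-greedy so multiple blocks are each removed.
  let out = raw.replace(/<think>[\s\S]*?<\/think>/g, "");
  // Unclosed trailing block: drop everything from the last open tag.
  const dangling = out.lastIndexOf("<think>");
  if (dangling !== -1) out = out.slice(0, dangling);
  return out.trim();
}
```

&lt;p&gt;The dangling-tag case matters in practice: a model that hits its token budget mid-reasoning never emits the closing tag.&lt;/p&gt;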

&lt;p&gt;I tested this properly - I ran the same twenty delegation calls with generic settings versus model-specific hints. The tuned version produced usable output on eighteen of them first try, although hallucination seemed to be a massive problem with GLM 4.7. The generic setup managed twelve. Eight out of twenty calls needing a retry doesn't sound terrible, but each retry is another round-trip to the model and another chunk of context in the Claude conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance measurement: TTFT and tok/s
&lt;/h2&gt;

&lt;p&gt;Every response now includes timing data in the footer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model: qwen/qwen3-coder-next | 145→248 tokens (38 tok/s, 340ms TTFT) | Session: 12,450 offloaded across 23 calls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TTFT (time to first token) and tokens per second, measured from the SSE stream. A slower model was running at 12 tok/s on certain prompts - painfully slow for something that normally hits 100-150. The TTFT metric made the problem obvious: the model was spending 8 seconds thinking before it started generating, which meant it was doing extended reasoning (the think blocks again) on prompts that shouldn't have needed it.&lt;/p&gt;

&lt;p&gt;Turned out it was temperature. At 0.3, certain prompts triggered the model's reasoning mode, so dropping to 0.1 for code tasks fixed it. Without the TTFT data, I'd probably still be wondering why some calls take 15 seconds and others take 2.&lt;/p&gt;
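&lt;p&gt;Capturing those two numbers from a streaming response is straightforward; a sketch of the idea (an assumed shape, not the server's actual code):&lt;br&gt;
&lt;/p&gt;

```typescript
interface StreamStats {
  ttftMs: number;       // time from request to first token
  totalMs: number;      // total wall time
  tokensPerSec: number; // generation rate after the first token
}

// Consume a token stream (e.g. parsed SSE deltas), recording when the
// first token arrives versus when the stream finishes.
async function measureStream(
  tokens: AsyncIterable<string>
): Promise<{ text: string; stats: StreamStats }> {
  const start = Date.now();
  let firstAt = -1;
  let count = 0;
  let text = "";
  for await (const tok of tokens) {
    if (firstAt === -1) firstAt = Date.now(); // TTFT point
    count += 1;
    text += tok;
  }
  const end = Date.now();
  const genStart = firstAt === -1 ? end : firstAt;
  const genMs = Math.max(end - genStart, 1); // avoid divide-by-zero
  return {
    text,
    stats: {
      ttftMs: genStart - start,
      totalMs: end - start,
      tokensPerSec: Math.round((count / genMs) * 1000),
    },
  };
}
```

&lt;p&gt;Measuring generation rate from the first token onwards, rather than from the request, is what makes an 8-second thinking stall show up as TTFT instead of quietly dragging down the tok/s figure.&lt;/p&gt;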

&lt;p&gt;The session totals (tokens offloaded, call count) serve a different purpose. Watching the counter climb to 40,000 offloaded tokens in a heavy coding day made the savings seem tangible rather than theoretical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;p&gt;TypeScript, distributed via npm. Uses &lt;code&gt;sql.js&lt;/code&gt; (WASM) for the model cache, SSE streaming for the 55-second soft timeout, and the MCP SDK for the protocol layer. Apache-2.0 licence.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;npm:&lt;/strong&gt; &lt;a href="https://www.npmjs.com/package/@houtini/lm" rel="noopener noreferrer"&gt;@houtini/lm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/houtini-ai/lm" rel="noopener noreferrer"&gt;houtini-ai/lm&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building MCP servers and running into similar problems with model determinism, I'd genuinely like to hear what patterns you've landed on. The routing and prompt hints system works for my setup but I'm under no illusion it's the only approach. PRs welcome - especially model profiles for families I haven't tested.&lt;/p&gt;

&lt;p&gt;PS: in case I didn't mention it, you can also use this to add any external cloud model with an OpenAI-compatible API endpoint. Comments welcome - it'd be nice to see this through to being fully useful and a genuine sidekick for Claude.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>discuss</category>
      <category>llm</category>
    </item>
    <item>
      <title>SQLite as an MCP context saver: stop cramming raw API data into your LLM</title>
      <dc:creator>Richard Baxter</dc:creator>
      <pubDate>Wed, 04 Mar 2026 14:40:33 +0000</pubDate>
      <link>https://dev.to/richardbaxter/sqlite-as-an-mcp-context-saver-stop-cramming-raw-api-data-into-your-llm-2oj4</link>
      <guid>https://dev.to/richardbaxter/sqlite-as-an-mcp-context-saver-stop-cramming-raw-api-data-into-your-llm-2oj4</guid>
      <description>&lt;p&gt;&lt;strong&gt;Most MCP servers dump raw API responses into the conversation. I've been using SQLite as a dependency in my MCP to sync data from my Google Search Console account locally and query it with SQL - here's the pattern and a working implementation for Google Search Console.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm fascinated by the utility and insight an LLM can provide, provided the data is presented correctly. You're not going to get great results from consumer-grade AI with 100,000 lines of JSON. A SQL query to a populated local database, however, is a different matter. Today's post covers why and how I use SQLite to give Claude a fighting chance of doing analysis, accurately, without the context bloat or hallucination.&lt;/p&gt;

&lt;p&gt;I've been shipping &lt;a href="https://github.com/houtini-ai" rel="noopener noreferrer"&gt;MCP servers since late 2025&lt;/a&gt;. Every time I hook one up to an API with any volume of data, the same problem shows up unless you manage paging and chunking. You call the API, and if you're not limiting the request you get &lt;em&gt;all the rows&lt;/em&gt; back. Claude, or your LLM of choice, dutifully analyses the data as if it's the whole story. But a decent site generates tens of thousands of query/page combinations per month - so that 1,000-row response is actually a small sample, and even that might be too much to deal with efficiently.&lt;/p&gt;

&lt;p&gt;MCP gives AI models access to APIs - we know this. But most APIs spit back volumes of data that don't fit in a conversation, or much of the data is irrelevant to answering the question. I found that after a couple of GSC API calls, about 80% of the token use was wasted on raw data, leaving maybe 20% for the model to actually think with. The analysis that comes back is built on incomplete data, but it reads with full confidence - and that's a tricky combination to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  My approach
&lt;/h2&gt;

&lt;p&gt;The fix is less about prompting and more about architecture.&lt;/p&gt;

&lt;p&gt;Two phases:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntnb4xu86akw6ttc917u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntnb4xu86akw6ttc917u.jpg" alt="Comparison of the usual API-to-LLM approach versus the SQLite pattern" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sync&lt;/strong&gt; - pull your complete dataset from the API into a local SQLite database, bundled as a dependency. No pagination caps; rate limits, retries, and deduplication are handled in the background. Create your local data warehouse. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query&lt;/strong&gt; - your MCP tools translate prompts into SQL queries, guided by the tool description. The MCP server returns aggregated results, not raw dumps.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So in my example, when you ask "which pages are losing traffic?", Claude runs a targeted query that returns maybe 20 rows instead of trying to work through 50,000 raw data points crammed into the conversation. It's like handing someone the analyst's summary rather than the raw spreadsheet.&lt;/p&gt;

&lt;p&gt;You get accuracy because your SQL hits the complete dataset, not a 1,000-row sample. And you get efficiency because you're sending 20 aggregated rows into the context window instead of 50,000 raw ones.&lt;/p&gt;

&lt;p&gt;I reckon this approach has legs beyond just Search Console. Sync locally, query with SQL, return aggregated results. It addresses the context window problem at an architectural level rather than hoping your API response fits.&lt;/p&gt;

&lt;p&gt;The thing that clicked for me building these is that your tools don't have to be thin API wrappers. You can build something that syncs an entire dataset in the background, builds indexes, prunes stale data, and then exposes simple query endpoints. Your model couldn't care less about pagination or rate limits or auth tokens - it asks a question and gets something useful back. Once you reach that point, it's not a party trick - it's just infrastructure.&lt;/p&gt;

&lt;p&gt;MCP is a lot more powerful than some of us might realise! &lt;/p&gt;

&lt;h2&gt;
  
  
  Under the hood: Better Search Console
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://github.com/houtini-ai/better-search-console" rel="noopener noreferrer"&gt;Better Search Console&lt;/a&gt; as a working implementation of this pattern, specifically for Google Search Console data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uii1h5hmhuyylrsy9cc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uii1h5hmhuyylrsy9cc.jpg" alt="Architecture diagram showing how Better Search Console syncs GSC data through SQLite" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The sync engine
&lt;/h3&gt;

&lt;p&gt;The Google Search Console API caps you at 1,000 rows per request and 25,000 rows per date range. For anything beyond a hobby blog, that's nowhere near enough. So the sync engine breaks your date range into 90-day chunks and fetches them in parallel - three concurrent streams, each paginating through its chunk until the API returns fewer rows than requested.&lt;/p&gt;

&lt;p&gt;You can also trigger this as a CLI command if it's a particularly large job.&lt;/p&gt;

&lt;p&gt;Writes are serialised though. SQLite doesn't love concurrent writes, so all three streams funnel into a single write queue. Every row hits the database via &lt;code&gt;INSERT OR REPLACE&lt;/code&gt; keyed on date + query + page + device + country.&lt;/p&gt;

&lt;p&gt;I threw 800K rows at it (six months from a client's property) and the initial sync took about four minutes. After that, syncs are incremental - it picks up from where it left off.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQLite config
&lt;/h3&gt;

&lt;p&gt;Default SQLite settings are fairly conservative, so for this workload I've adjusted a few things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WAL mode&lt;/strong&gt; - write-ahead logging so reads don't block writes during sync&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;64MB cache&lt;/strong&gt; - keeps hot pages in memory instead of hitting disk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4GB memory-mapped I/O&lt;/strong&gt; - lets SQLite memory-map the database file directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10 covering indexes&lt;/strong&gt; - one for each common query pattern (query lookups, page lookups, date ranges, country breakdowns, the lot)&lt;/li&gt;
&lt;/ul&gt;
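&lt;p&gt;Those settings boil down to three pragmas. The values here follow SQLite's units - a negative &lt;code&gt;cache_size&lt;/code&gt; is in KiB, &lt;code&gt;mmap_size&lt;/code&gt; is in bytes - but the &lt;code&gt;applyPragmas&lt;/code&gt; wrapper is an assumption, not the project's actual code:&lt;br&gt;
&lt;/p&gt;

```typescript
// The pragmas behind the bullets above. A negative cache_size is
// interpreted by SQLite as KiB (-65536 = 64MB); mmap_size is bytes
// (4294967296 = 4GB).
const PRAGMAS: string[] = [
  "PRAGMA journal_mode = WAL",
  "PRAGMA cache_size = -65536",
  "PRAGMA mmap_size = 4294967296",
];

// Apply them through whatever exec function your SQLite binding exposes.
function applyPragmas(exec: (sql: string) => void): void {
  for (const p of PRAGMAS) exec(p);
}
```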

&lt;p&gt;The result is that queries against 800K rows come back in under a second. SQLite with proper indexes is surprisingly fast for this kind of analytical work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data retention
&lt;/h3&gt;

&lt;p&gt;Left unchecked, six months of GSC data for a busy site can balloon to millions of rows. The server auto-prunes after every sync: keeps the last 90 days in full, then drops zero-click rows from older data. The thinking is - if a query/page combination got zero clicks three months ago, it's probably noise, but anything with actual clicks stays around indefinitely.&lt;/p&gt;
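&lt;p&gt;The retention rule is simple enough to express as a predicate. This is a sketch, with the equivalent SQL in a comment - column names assumed from the &lt;code&gt;search_analytics&lt;/code&gt; schema:&lt;br&gt;
&lt;/p&gt;

```typescript
interface RetentionRow {
  date: string;  // ISO yyyy-mm-dd
  clicks: number;
}

// Keep the last 90 days in full; beyond that, drop zero-click rows.
// Roughly: DELETE FROM search_analytics
//          WHERE date < date('now', '-90 days') AND clicks = 0
function survivesPrune(row: RetentionRow, today: Date): boolean {
  const cutoff = new Date(today);
  cutoff.setDate(cutoff.getDate() - 90);
  const rowDate = new Date(row.date + "T00:00:00Z");
  // Recent rows always stay; older rows stay only if they earned clicks.
  return rowDate >= cutoff || row.clicks > 0;
}
```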

&lt;h3&gt;
  
  
  Query sandboxing
&lt;/h3&gt;

&lt;p&gt;I spent longer on this than anything else. The &lt;code&gt;query_gsc_data&lt;/code&gt; tool lets Claude write freeform SQL against your data, which is powerful - but you obviously don't want it running &lt;code&gt;DROP TABLE&lt;/code&gt; or &lt;code&gt;DELETE FROM&lt;/code&gt;. So every query gets regex-checked before execution, and anything that matches INSERT, UPDATE, DELETE, DROP, ALTER, CREATE, or ATTACH gets blocked. There's also a 10,000-row automatic LIMIT on queries that don't specify one, so a careless &lt;code&gt;SELECT *&lt;/code&gt; can't eat your entire context window.&lt;/p&gt;
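&lt;p&gt;A cut-down version of that check might look like this (the real implementation lives in the repo; this is just the shape of it):&lt;br&gt;
&lt;/p&gt;

```typescript
// Reject write/DDL statements outright; clamp unbounded SELECTs.
const FORBIDDEN = /\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|ATTACH)\b/i;
const MAX_ROWS = 10_000;

function sandboxQuery(sql: string): string {
  if (FORBIDDEN.test(sql)) {
    throw new Error("Read-only queries only");
  }
  // Add an automatic LIMIT when the caller didn't specify one,
  // so a careless SELECT * can't flood the context window.
  if (!/\bLIMIT\s+\d+/i.test(sql)) {
    return sql.trimEnd().replace(/;?\s*$/, "") + ` LIMIT ${MAX_ROWS}`;
  }
  return sql;
}
```

&lt;p&gt;It's deliberately blunt - a SELECT that merely mentions the word 'create' in a string literal gets rejected too - but for a read-only analytics tool, false positives are the safer failure mode.&lt;/p&gt;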

&lt;h3&gt;
  
  
  The tools
&lt;/h3&gt;

&lt;p&gt;Twelve tools in total; four get about 90% of my use. The first two will trigger an MCP app to render charts and visualisations with Vite (really cool if you're using Claude Desktop). &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;get_overview&lt;/code&gt;&lt;/strong&gt; - shows all your properties at a glance with clicks, impressions, CTR, average position, and sparkline trends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;get_dashboard&lt;/code&gt;&lt;/strong&gt; - deep dive on one property. Hero metrics, comparison periods, top queries, country breakdown, ranking distribution, new/lost queries, branded vs non-branded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;get_insights&lt;/code&gt;&lt;/strong&gt; - sixteen pre-built analytical queries. The one I keep reaching for is &lt;code&gt;opportunities&lt;/code&gt;: it surfaces queries where you're ranking 5-20 with decent impression counts. You're close enough to page one that a title tweak or a content refresh could push you over. These tend to be the lowest-effort wins in SEO.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;query_gsc_data&lt;/code&gt;&lt;/strong&gt; - freeform SQL against the &lt;code&gt;search_analytics&lt;/code&gt; table with the sandboxing I described above.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one kicked the whole project off for me - being able to ask anything of the data and get a real answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL queries I've been running for SEO analysis
&lt;/h2&gt;

&lt;p&gt;The schema is nice and simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;search_analytics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ran every one of these on a client's ecommerce site last Tuesday - 340K rows in the database - so they're well tested.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find long-tail referring keywords (5+ words):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is probably my favourite query in the set. Longer phrases tend to be conversational - they read like something you'd type into ChatGPT or ask Google's AI Overview. Finding which of these already send you traffic tells you exactly what questions your content is answering, and more importantly, which questions it isn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_position&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;search_analytics&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'-90 days'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;LENGTH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;LENGTH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;REPLACE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;impressions&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;LENGTH&lt;/code&gt; trick counts spaces to filter for five-or-more-word phrases. It's not pretty SQL, but it works in SQLite, which doesn't have a native &lt;code&gt;WORD_COUNT&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spot keyword cannibalisation&lt;/strong&gt; (multiple pages ranking for the same query):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;search_analytics&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2025-01-01'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I find cannibalisation on almost every client audit I do. Two pages scrapping over the same query, neither ranking properly, and nobody on the team has noticed. This surfaces the worst offenders so you can decide whether to consolidate or differentiate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content decay detection&lt;/strong&gt; (pages losing traffic month-over-month):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'-28 days'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'-56 days'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'-29 days'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;prior&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;search_analytics&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;prior&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;prior&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Questions people are asking about your content:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_position&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;search_analytics&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'-90 days'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'how %'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'what %'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'why %'&lt;/span&gt;
       &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'can %'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'does %'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'is %'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;impressions&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've used this one on sites when building FAQ sections. If you're getting impressions for "how to configure X" but no clicks, your existing content probably doesn't answer that question clearly enough - or at all. Once you can see those gaps, they're straightforward fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Country breakdown for international targeting:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_position&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;search_analytics&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'-90 days'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The nice thing about all of these is that you're querying the complete dataset. Everything sits in SQLite and every query hits every row - no 1,000-row cap, no pagination limits.&lt;/p&gt;
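
&lt;p&gt;As a quick example of what that makes trivial, a week-by-week trend across the entire table is a single query - SQLite's &lt;code&gt;strftime&lt;/code&gt; handles the bucketing (column names as per the queries above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Weekly clicks and impressions across the full synced dataset
SELECT strftime('%Y-%W', date) as week,
  SUM(clicks) as clicks, SUM(impressions) as impressions
FROM search_analytics
GROUP BY week
ORDER BY week
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;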

&lt;h2&gt;
  
  
  Interactive dashboards with ext-apps
&lt;/h2&gt;

&lt;p&gt;Something else I got to explore on this project: MCP's &lt;a href="https://modelcontextprotocol.io/docs/extensions/apps" rel="noopener noreferrer"&gt;ext-apps framework&lt;/a&gt;, which lets you render interactive HTML directly inside Claude Desktop. Actual clickable dashboards with real data, living right there in the conversation as embedded iframes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17z4ooc9lpr58pwevsrd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17z4ooc9lpr58pwevsrd.jpg" alt=" " width="800" height="881"&gt;&lt;/a&gt;&lt;br&gt;
The overview, the property drilldowns, the sparkline trends - they're all interactive. It feels like the beginning of something interesting to me: MCP-native SaaS, where you build the backend as an MCP server, the frontend as ext-apps, and distribute via &lt;code&gt;npx&lt;/code&gt;. No hosting to maintain, no auth flows to build, no deployment pipeline to manage. And your data stays on your machine, which for SEO data is a nice bonus.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting it up
&lt;/h2&gt;

&lt;p&gt;Three steps. The Google credentials bit takes the longest but you only do it once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Google service account&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Head to &lt;a href="https://console.cloud.google.com/" rel="noopener noreferrer"&gt;Google Cloud Console&lt;/a&gt;, create a new project and enable the Search Console API&lt;/li&gt;
&lt;li&gt;Credentials → Create service account → Keys tab → Add Key → JSON&lt;/li&gt;
&lt;li&gt;Grab the &lt;code&gt;client_email&lt;/code&gt; value from that JSON file&lt;/li&gt;
&lt;li&gt;In &lt;a href="https://search.google.com/search-console/" rel="noopener noreferrer"&gt;Search Console&lt;/a&gt; → Settings → Users and permissions → Add the email with &lt;strong&gt;Full&lt;/strong&gt; permission&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last step is the one everyone misses (I did too, first time). The service account is basically a separate Google identity - it can't see your properties until you explicitly share them. Skip it and you'll be staring at "No properties found" errors, wondering what went wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Claude Desktop config&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Add this to the &lt;code&gt;mcpServers&lt;/code&gt; section of your &lt;code&gt;claude_desktop_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"better-search-console"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@houtini/better-search-console"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"GOOGLE_APPLICATION_CREDENTIALS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/your-service-account.json"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Restart and sync&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Restart Claude Desktop and ask it to set up Better Search Console. The &lt;code&gt;setup&lt;/code&gt; tool discovers your properties, syncs them in the background, and shows an overview. Initial sync depends on data volume - 30 seconds for a small site, a few minutes for one with millions of rows. After that, syncs are incremental.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;p&gt;Built in TypeScript, distributed via npm. Uses &lt;code&gt;better-sqlite3&lt;/code&gt; for storage, full pagination against the Google Search Console API, and MCP ext-apps for the interactive dashboard bits. Apache-2.0 licence - fork it, break it, whatever works for you.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;npm:&lt;/strong&gt; &lt;a href="https://www.npmjs.com/package/@houtini/better-search-console" rel="noopener noreferrer"&gt;@houtini/better-search-console&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/houtini-ai/better-search-console" rel="noopener noreferrer"&gt;houtini-ai/better-search-console&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I reckon the sync-locally-query-with-SQL approach has legs well beyond Search Console. Better answers from less context - that's genuinely all there is to it.&lt;/p&gt;

&lt;p&gt;Pull requests welcome - I'm especially interested in SQL queries you've found useful for SEO analysis, and I'll add the good ones to the pre-built insights.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
