<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rob</title>
    <description>The latest articles on DEV Community by Rob (@carryologist).</description>
    <link>https://dev.to/carryologist</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3884903%2Ff7cf0bfd-0b92-4dca-9095-683af23a19e3.png</url>
      <title>DEV Community: Rob</title>
      <link>https://dev.to/carryologist</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/carryologist"/>
    <language>en</language>
    <item>
      <title>Forking and Open Sourcing a Single Purpose Site</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Fri, 29 May 2026 21:08:26 +0000</pubDate>
      <link>https://dev.to/carryologist/forking-and-open-sourcing-a-single-purpose-site-4e9f</link>
      <guid>https://dev.to/carryologist/forking-and-open-sourcing-a-single-purpose-site-4e9f</guid>
      <description>&lt;p&gt;I built a trip planning site for my group going to the F1 Canadian Grand Prix in Montreal. It worked great — itinerary calendar, lodging details, photo gallery, activity suggestions, a shared password so only the group could see it. Classic vibe coded single-purpose app: hardcoded destination, hardcoded dates, hardcoded branding, shipped to Vercel, done.&lt;/p&gt;



&lt;p&gt;Then I looked at it and thought: this is useful beyond one trip. What if anyone could fork this repo, deploy it, and have their own trip site without touching code?&lt;/p&gt;

&lt;p&gt;That question kicked off a 20-hour arc — across several mobile sessions between F1 races — that transformed a static, single-purpose site into a generic, config-driven template, and exposed every security shortcut I'd taken along the way.&lt;/p&gt;

&lt;p&gt;The proof that it worked: I deployed a second instance for a completely different trip — CMA Fest 2026 in Nashville, Tennessee. Same codebase, zero code changes, just the setup wizard.&lt;/p&gt;



&lt;h2&gt;
  
  
  The Starting Point
&lt;/h2&gt;

&lt;p&gt;The original site had "F1 Grand Prix Montreal" baked into the components. CSS variables were named &lt;code&gt;--gradient-f1&lt;/code&gt; and &lt;code&gt;--shadow-f1&lt;/code&gt;. The countdown component had hardcoded race dates. The activities page had Montreal-specific categories. The favicon was F1-themed. &lt;code&gt;localStorage&lt;/code&gt; keys were F1-prefixed.&lt;/p&gt;

&lt;p&gt;It was a good app. It was also impossible for anyone else to use without rewriting half the codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Pivot
&lt;/h2&gt;

&lt;p&gt;The core insight was simple: &lt;strong&gt;one database row should drive the entire site.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I created a &lt;code&gt;vacation_config&lt;/code&gt; table with a single JSONB column. Every piece of configurable data — trip name, destination, dates, timezone, brand color, hero image, lodging details, password hash, LLM provider, encrypted API key — lives in that one row.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;vacation_config&lt;/span&gt;
&lt;span class="s"&gt;├── tripName&lt;/span&gt;
&lt;span class="s"&gt;├── destination&lt;/span&gt;
&lt;span class="s"&gt;├── startDate / endDate&lt;/span&gt;
&lt;span class="s"&gt;├── brandColor / heroImageUrl&lt;/span&gt;
&lt;span class="s"&gt;├── lodgings[]&lt;/span&gt;
&lt;span class="s"&gt;├── passwordHash (bcrypt)&lt;/span&gt;
&lt;span class="s"&gt;├── llmApiKeyEncrypted (AES-256-GCM)&lt;/span&gt;
&lt;span class="s"&gt;├── llmProvider&lt;/span&gt;
&lt;span class="s"&gt;└── setupComplete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every page calls &lt;code&gt;getConfig()&lt;/code&gt; server-side and destructures what it needs. No hardcoded values anywhere. Adding a new configurable field is just adding a key to the TypeScript interface — old configs get new defaults via object spread.&lt;/p&gt;

&lt;p&gt;This is the pattern that makes fork-and-deploy work. You clone the repo, you get an empty database, and the site is a blank canvas until someone fills in the config.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup Wizard
&lt;/h2&gt;

&lt;p&gt;An empty database isn't useful. Someone needs to fill in that config row, and that someone might not be technical.&lt;/p&gt;

&lt;p&gt;The setup wizard is a 6-step client component that walks through everything:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What it configures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Basics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Trip name, destination, tagline, dates, timezone (auto-detected)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Branding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Brand color (8 presets + custom hex), hero image URL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lodging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple properties with type-aware display (hotel, Airbnb, VRBO, house, resort)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Password&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Shared site password&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Generation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optional — pick an LLM provider, paste an API key, auto-generate activity suggestions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Review &amp;amp; Launch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Summary → one-click launch&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When you click Launch, four things happen in sequence: config is saved (password bcrypt-hashed, API key AES-encrypted), database tables are created, the user is auto-authenticated, and they're redirected to the live homepage. The entire setup takes about two minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Middleware Problem
&lt;/h2&gt;

&lt;p&gt;A static site deployed to your own Vercel project doesn't need sophisticated auth. You share the URL with your group, maybe add a simple password check, and you're done.&lt;/p&gt;

&lt;p&gt;A clonable template is different. Every fork is a fresh deployment. The middleware needs to handle two states: &lt;strong&gt;not yet set up&lt;/strong&gt; and &lt;strong&gt;set up and running&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I built a two-gate system running in Edge Runtime:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate 1 — Setup Check.&lt;/strong&gt; Is there an HMAC-signed &lt;code&gt;setup-done&lt;/code&gt; cookie? If not, redirect to &lt;code&gt;/setup&lt;/code&gt;. This cookie is signed with the site secret to prevent client forgery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate 2 — Auth Check.&lt;/strong&gt; Is there a valid auth token cookie? The token includes a timestamp and a random nonce, HMAC-signed with the site secret. If it's missing, expired, or invalid, redirect to &lt;code&gt;/password&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The edge constraint matters. Next.js middleware runs in Edge Runtime, which means no Node.js &lt;code&gt;crypto&lt;/code&gt; module. The entire auth chain — HMAC signing, signature verification, timing-safe comparison — uses the Web Crypto API. The Node.js side (&lt;code&gt;lib/auth.ts&lt;/code&gt;) handles bcrypt password hashing and AES encryption, which only run in API routes.&lt;/p&gt;

&lt;h2&gt;
  
  
  From One Secret to Everything
&lt;/h2&gt;

&lt;p&gt;The user provides exactly one secret: a random hex string generated with &lt;code&gt;openssl rand -hex 32&lt;/code&gt;. That single value does triple duty:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HMAC signing&lt;/strong&gt; — auth tokens and setup cookies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AES-256 encryption key&lt;/strong&gt; — derived via SHA-256 hash for encrypting LLM API keys at rest&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timing-safe comparison&lt;/strong&gt; — double-HMAC pattern for constant-time signature verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else is either auto-provisioned (Vercel Postgres sets &lt;code&gt;POSTGRES_URL&lt;/code&gt;, Vercel Blob sets &lt;code&gt;BLOB_READ_WRITE_TOKEN&lt;/code&gt;) or entered through the wizard. The user never edits code, never touches a config file, never opens a terminal after the initial deploy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Security Audit
&lt;/h2&gt;

&lt;p&gt;This is where the story arc connects to lessons I've written about before.&lt;/p&gt;

&lt;p&gt;I've been saying &lt;a href="https://dev.to/blog/thursday-thoughts-audit-your-vibe-code-often"&gt;audit your vibe code often&lt;/a&gt;. I've written about the &lt;a href="https://dev.to/blog/spring-cleaning-your-vibe-coded-apps"&gt;spring cleaning process&lt;/a&gt; and the &lt;a href="https://dev.to/blog/closing-the-loop-from-audit-to-ten-commits"&gt;phased remediation pattern&lt;/a&gt;. So when I decided to open-source this project, I ran a full audit before publishing.&lt;/p&gt;

&lt;p&gt;The audit found &lt;strong&gt;15+ vulnerabilities across 4 severity tiers.&lt;/strong&gt; I expected minor stuff. I got critical findings.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Critical Tier
&lt;/h3&gt;

&lt;p&gt;The worst findings were structural. The middleware had a blanket pass-through for all &lt;code&gt;/api/*&lt;/code&gt; routes — meaning API endpoints were completely unauthenticated. The setup config endpoint had no auth, so anyone who found the URL could overwrite or delete the entire site configuration. Auth tokens had no expiration. And there was a hardcoded fallback secret — &lt;code&gt;'fallback'&lt;/code&gt; — that would activate if the environment variable was missing, making every signature predictable.&lt;/p&gt;

&lt;p&gt;These aren't exotic bugs. They're the exact patterns that vibe coding produces: things that work during development and deployment but leave doors wide open.&lt;/p&gt;

&lt;h3&gt;
  
  
  The High Tier
&lt;/h3&gt;

&lt;p&gt;The OG image endpoint accepted arbitrary URLs with no validation — a textbook SSRF vector that could reach private networks. LLM prompts passed unsanitized user input directly to the model — destination names, PDF document text, all of it unescaped. No data validation existed on any write endpoint. And the password endpoint had no rate limiting — unlimited brute-force attempts.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Medium and Low Tiers
&lt;/h3&gt;

&lt;p&gt;Signature comparison used string equality instead of timing-safe comparison. The setup cookie was unsigned. Error responses leaked internal details. No security headers. No file size limits on uploads. The Gemini API key was sent as a URL query parameter (logged in server access logs). The middleware's static asset detection used &lt;code&gt;pathname.includes('.')&lt;/code&gt; — meaning a crafted path like &lt;code&gt;/settings/foo.bar&lt;/code&gt; would bypass auth.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;

&lt;p&gt;I structured the remediation the same way I've done it before: phased commits ordered by severity and dependency graph, not one giant PR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Commit 1 — Critical fixes.&lt;/strong&gt; Middleware now enforces auth on all API routes except the auth endpoint itself and public config reads. Setup mutation requires authentication after initial setup. Auth tokens expire after 30 days. The hardcoded fallback secret is gone — a missing env var now returns a 500.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Commit 2 — High fixes.&lt;/strong&gt; SSRF blocked with private IP detection. LLM inputs sanitized with delimiter-based injection mitigation and output validation. Per-entity input validators on all write routes. Rate limiting on the auth endpoint with IP-based lockout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Commit 3 — Medium and low fixes.&lt;/strong&gt; Setup cookie is HMAC-signed. PDF uploads enforce a size limit. Security headers added (CSP, HSTS, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Permissions-Policy). Gemini key moved from URL to header. Static asset detection uses an explicit extension regex. Client-side error logging sanitized. CSS color injection blocked with a validation function.&lt;/p&gt;

&lt;p&gt;Three commits. The same phased pattern. Same principle: merge and test between each phase so you know exactly which change breaks something if it does.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changes When You Open Source
&lt;/h2&gt;

&lt;p&gt;Going from "deployed for my group" to "anyone can fork this" changed the threat model fundamentally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; I controlled the deployment. I knew the URL. The password was shared via text message. If something was misconfigured, I'd notice and fix it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt; Strangers deploy this. They might skip the secret. They might leave the setup endpoint open. They might paste API keys into client-side code. Every defensive measure needs to work without my involvement.&lt;/p&gt;

&lt;p&gt;This is why the audit mattered more for open-sourcing than for personal use. A personal deployment with no auth on API routes is sloppy. An open-source template with no auth on API routes is a liability for every person who forks it.&lt;/p&gt;

&lt;p&gt;The middleware's two-gate system, the HMAC-signed cookies, the secret-or-500 pattern, the input validation — none of these existed in the original F1 trip site. They exist because the code is no longer mine alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making It Novice-Friendly
&lt;/h2&gt;

&lt;p&gt;The target user is someone who's never used a terminal. That constraint shaped the documentation as much as the code.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/carryologist/vacation-hub/blob/main/docs/SETUP_GUIDE.md" rel="noopener noreferrer"&gt;setup guide&lt;/a&gt; walks through 8 steps: fork the repo, generate a secret key (with instructions for Mac, Windows, and a web fallback), deploy to Vercel, add Postgres, add Blob storage, redeploy, run the wizard, share with your group. Each step assumes zero technical knowledge.&lt;/p&gt;

&lt;p&gt;The README has a one-click Deploy with Vercel button that pre-fills the environment variable prompt. The wizard auto-detects timezone from the browser. Lodging details auto-populate from the property name via AI. The color picker has presets so nobody has to know what a hex code is.&lt;/p&gt;

&lt;p&gt;Every friction point I could identify, I tried to eliminate. The person deploying this might be planning a bachelorette party or a family reunion. They're not reading documentation for fun.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Lessons
&lt;/h2&gt;

&lt;p&gt;Turning a personal app into a template taught me things that pure greenfield development wouldn't have:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config-driven beats hardcoded, always.&lt;/strong&gt; Even if you're building for one use case, storing configuration in a database instead of in component props makes the app fundamentally more flexible. The JSONB column costs nothing and buys everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Middleware is the security boundary.&lt;/strong&gt; In a personal app, auth is a convenience — you know who's accessing it. In a template, middleware is the only thing standing between a stranger's deployment and the open internet. It needs to handle every state: not yet configured, configured but not logged in, logged in, logged in with an expired token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The setup wizard is the product.&lt;/strong&gt; For a clonable template, the first-run experience &lt;em&gt;is&lt;/em&gt; the product. If someone can't get from fork to functioning site in 10 minutes, they'll abandon it. The wizard isn't a nice-to-have — it's the reason the project works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security scales with distribution.&lt;/strong&gt; A bug in your personal app affects you. A bug in a template affects everyone who forks it. The bar for security isn't "good enough for me" — it's "good enough for the least technical person who deploys this."&lt;/p&gt;

&lt;h2&gt;
  
  
  By the Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;28 commits&lt;/strong&gt; — from hardcoded F1 site to open-source template&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 JSONB row&lt;/strong&gt; — drives the entire site configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6-step wizard&lt;/strong&gt; — zero-code setup for non-technical users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15+ security vulnerabilities&lt;/strong&gt; — found and fixed before open-sourcing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 phased commits&lt;/strong&gt; — for the security remediation alone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 env var&lt;/strong&gt; — the only thing a user manually configures (&lt;code&gt;VACATION_HUB_SECRET&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~20 hours&lt;/strong&gt; — total transformation time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0 lines of code&lt;/strong&gt; — required from the person deploying it&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>vibecoding</category>
      <category>security</category>
      <category>nextjs</category>
    </item>
    <item>
      <title>Adding an MCP Server to the Blog Itself</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Thu, 28 May 2026 13:48:08 +0000</pubDate>
      <link>https://dev.to/carryologist/adding-an-mcp-server-to-the-blog-itself-4n9k</link>
      <guid>https://dev.to/carryologist/adding-an-mcp-server-to-the-blog-itself-4n9k</guid>
      <description>&lt;p&gt;Two weeks ago I &lt;a href="https://dev.to/posts/wiring-mcp-into-my-fitness-tracker-for-openclaw"&gt;wired MCP into my fitness tracker&lt;/a&gt; — ten tools, one endpoint, four clients. That was always a test run. The fitness tracker is a low-stakes app. If an agent writes a bad workout entry, I delete it. The blog is different. The blog has published content, a deploy pipeline, an editorial calendar, analytics, syndication to Dev.to. If an agent publishes a draft that wasn't ready, the internet sees it.&lt;/p&gt;

&lt;p&gt;This week I added an MCP server to vibescoder.dev anyway. Sixteen tools across five categories. The agent that helped me build it — running in a Coder workspace — can now turn around and use it to manage the very site it just modified. That's the kind of loop that makes building in public feel recursive.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;One sentence: &lt;strong&gt;let any agent directly publish to the site, analyze traffic data, and troubleshoot production issues.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The blog is a Next.js 16 app deployed on Vercel. Content lives in a separate private GitHub repo (&lt;code&gt;the-vibe-coder-content&lt;/code&gt;), committed via the GitHub API. The admin UI already supports voice recording → Claude-generated MDX → one-click publish. But the admin UI requires a browser. An agent in a Coder workspace, or in Claude Desktop, or in Cursor can't click buttons. MCP gives them the same capabilities programmatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The fitness tracker MCP server talked to Postgres via Prisma. This blog has no database. Content is MDX files in a GitHub repo. Analytics are Redis counters in Upstash. Deployments happen by curling a Vercel webhook. So the MCP server is a GitHub API client, a Redis reader, and an HTTP caller — not a database wrapper.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent (Claude / Cursor / Coder Agents)
  │
  │  Streamable HTTP (Bearer token)
  ▼
vibescoder.dev/api/mcp/mcp
  │
  ├─ Content tools ──→ GitHub API (read/write/commit MDX)
  ├─ Analytics ──────→ Upstash Redis (view counters)
  ├─ Deploy ─────────→ Vercel deploy hook
  ├─ Syndication ────→ Dev.to API
  └─ Diagnostics ────→ fetch() against live site
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same stack as the fitness tracker: &lt;code&gt;mcp-handler&lt;/code&gt; for the Next.js adapter, &lt;code&gt;zod&lt;/code&gt; for parameter schemas, bearer token auth, &lt;code&gt;disableSse: true&lt;/code&gt; for stateless Vercel deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 16 Tools
&lt;/h2&gt;

&lt;p&gt;The fitness tracker had 10 tools that all talked to one database. This server has 16 tools that talk to four different backends. Grouped by what they touch:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content Management&lt;/strong&gt; (7 tools) — the core editorial workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;list_posts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="cm"&gt;/* filter by status/tag/date */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;get_post&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="cm"&gt;/* full MDX + frontmatter    */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;create_post&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="cm"&gt;/* commit new MDX to GitHub  */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;update_post&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="cm"&gt;/* partial frontmatter/body  */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;publish_post&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="cm"&gt;/* draft → live, trigger deploy */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unpublish_post&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="cm"&gt;/* live → draft, trigger deploy */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;delete_post&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="cm"&gt;/* remove from GitHub        */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Blog Fodder &amp;amp; Editorial&lt;/strong&gt; (4 tools) — the content pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;list_fodder&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="cm"&gt;/* active + archived, with consumption status */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;get_fodder&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="cm"&gt;/* read raw session notes */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;get_todo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="cm"&gt;/* editorial calendar     */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;update_todo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="cm"&gt;/* maintain the calendar  */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Analytics&lt;/strong&gt; (1 tool), &lt;strong&gt;Deploy &amp;amp; Syndication&lt;/strong&gt; (2 tools), &lt;strong&gt;Diagnostics&lt;/strong&gt; (2 tools):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;analytics_summary&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="cm"&gt;/* 30-day views + top pages */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;trigger_deploy&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="cm"&gt;/* hit the Vercel webhook   */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;syndicate_post&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="cm"&gt;/* cross-post to Dev.to     */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;site_health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="cm"&gt;/* fetch key endpoints      */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;get_settings&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="cm"&gt;/* AI style prompt config   */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every tool returns raw data. The agent does its own analysis — same philosophy as the fitness tracker. The &lt;code&gt;list_posts&lt;/code&gt; tool returns frontmatter for every post; the agent decides what "recent drafts" means.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Reused
&lt;/h2&gt;

&lt;p&gt;The blog engine already had all the backend logic. The admin UI's API routes do the exact same operations — read a post from GitHub, commit an update, hit the deploy hook, cross-post to Dev.to. The MCP server calls the same library functions, not the HTTP routes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;commitFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;deleteFile&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@/lib/github&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;listDirectory&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@/lib/github-list&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only net-new code was the directory listing helper (&lt;code&gt;github-list.ts&lt;/code&gt;). The existing &lt;code&gt;github.ts&lt;/code&gt; had file-level CRUD but couldn't list a directory. One function, 30 lines, wraps the GitHub Contents API for directory paths.&lt;/p&gt;

&lt;p&gt;The auth pattern, CORS, and rate limiting were copied from the fitness tracker and adapted. Same &lt;code&gt;timingSafeEqual&lt;/code&gt;, same &lt;code&gt;withMcpAuth&lt;/code&gt; wrapper, same in-memory rate-limit buckets. The muscle memory from the fitness tracker build meant the security layer took minutes, not an hour.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Middleware Change
&lt;/h2&gt;

&lt;p&gt;One line. The blog's middleware protects all &lt;code&gt;/api/*&lt;/code&gt; routes with JWT cookie auth. The MCP server does its own bearer-token auth. So &lt;code&gt;/api/mcp/&lt;/code&gt; gets added to the allow-list alongside &lt;code&gt;/api/auth/&lt;/code&gt;, &lt;code&gt;/api/analytics/track&lt;/code&gt;, and &lt;code&gt;/api/slack/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;pathname&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/api/mcp/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MCP route then handles auth independently — same pattern as the fitness tracker, where the middleware allow-listed the MCP path and the route enforced its own bearer token.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decisions
&lt;/h2&gt;

&lt;p&gt;Three questions came up during planning:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auth granularity&lt;/strong&gt; — single token or read-only vs. read-write tokens? Single token. I'm the only user. If I ever add collaborators, I'll add scoped tokens. Until then, one token does everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit logging&lt;/strong&gt; — the fitness tracker writes to a Postgres &lt;code&gt;audit_log&lt;/code&gt; table. This blog has no database. Options were Redis, console.log, or skip. I went with console.log (captured by Vercel function logs) plus &lt;code&gt;[mcp]&lt;/code&gt; prefixed commit messages for every GitHub write. That gives me two audit trails — Vercel logs for all operations, Git history for content changes — with zero infrastructure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;mcp] post: create &lt;span class="s2"&gt;"adding-mcp-server-to-the-blog"&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;mcp] post: publish &lt;span class="s2"&gt;"adding-mcp-server-to-the-blog"&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;mcp] chore: update TODO.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Image uploads&lt;/strong&gt; — deferred. MCP tool parameters are JSON. Binary images would need base64 encoding in a tool call. That's doable but not worth the complexity in v1. The admin UI handles images fine. If an agent needs to add images to a post, it can use the admin API directly or I'll add an &lt;code&gt;upload_image&lt;/code&gt; tool later.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Template Update
&lt;/h2&gt;

&lt;p&gt;Same Coder template pattern as the fitness tracker. Token flows from the workstation to workspaces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/etc/coder.d/coder.env
  → TF_VAR_vibescoder_mcp_token
    → coder_agent.main.env &lt;span class="o"&gt;(&lt;/span&gt;VIBESCODER_MCP_TOKEN&lt;span class="o"&gt;)&lt;/span&gt;
      → jq merge into ~/.mcp.json at workspace start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three terminal commands on the homelab to finish it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'TF_VAR_vibescoder_mcp_token=&amp;lt;token&amp;gt;'&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/coder.d/coder.env
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart coder
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/coder-templates &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git pull &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ./docker/apply.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;gh auth login&lt;/code&gt; step was an amusing detour — I was SSH'd into the homelab from my iPhone, and &lt;code&gt;gh&lt;/code&gt; tried to open a browser on a headless server. The fix was manually entering the one-time code at &lt;code&gt;github.com/login/device&lt;/code&gt; in Safari. Mobile homelab administration is an underappreciated genre of suffering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verifying in Production
&lt;/h2&gt;

&lt;p&gt;The real test was hitting the live endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://vibescoder.dev/api/mcp/mcp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$VIBESCODER_MCP_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/json, text/event-stream"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"jsonrpc":"2.0","id":1,"method":"initialize",
       "params":{"protocolVersion":"2025-03-26",
                 "capabilities":{},
                 "clientInfo":{"name":"test","version":"1.0.0"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response: &lt;code&gt;200 OK&lt;/code&gt;, server name &lt;code&gt;vibescoder&lt;/code&gt;, version &lt;code&gt;1.0.0&lt;/code&gt;, tools capability enabled.&lt;/p&gt;

&lt;p&gt;Then a real tool call — list all drafts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"posts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"slug"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"syndicating-to-substack-the-undocumented-path"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Syndicating to Substack: The Undocumented Path"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"published"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"publishAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One draft in the queue. Real data from the content repo, returned through the MCP server, verified from a Coder workspace. The analytics tool came back with 660 views over 30 days and today's top pages. The site health tool checked five endpoints and reported status codes and response times.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Recursive Moment
&lt;/h2&gt;

&lt;p&gt;The part that's hard to describe until you experience it: the agent that helped build this MCP server can now use it. In the same chat session where we wrote the route file and debugged the middleware, the agent can call &lt;code&gt;list_posts&lt;/code&gt; to see what's published, &lt;code&gt;get_todo&lt;/code&gt; to check the editorial calendar, and &lt;code&gt;trigger_deploy&lt;/code&gt; to ship changes.&lt;/p&gt;

&lt;p&gt;This post was written in a Coder workspace. The MCP server it describes is live on the same site it will be published to. The agent could, in theory, publish this very post by calling &lt;code&gt;publish_post&lt;/code&gt; with the slug. It won't — I'll review it first — but the capability is there. That's the loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Watch how agents use the tools in practice.&lt;/strong&gt; The fitness tracker MCP server taught me that agents are surprisingly good at synthesizing raw data into summaries. Curious whether editorial tools — create, publish, schedule — feel as natural.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add an &lt;code&gt;upload_image&lt;/code&gt; tool.&lt;/strong&gt; Deferred from v1, but it's the obvious gap. An agent that can create a post but not attach images is writing with one hand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update the vibescoder-blog skill file.&lt;/strong&gt; The skill currently documents the Git-based editorial workflow. Now that the MCP server exists, the skill should point agents to the tools instead of the &lt;code&gt;grep&lt;/code&gt; and &lt;code&gt;awk&lt;/code&gt; one-liners.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write it up as blog fodder.&lt;/strong&gt; Done. You're reading it.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  By the Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;16 MCP tools&lt;/strong&gt; across 5 categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 backends&lt;/strong&gt; wired through one endpoint (GitHub API, Upstash Redis, Vercel deploy hook, Dev.to API)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7 files changed&lt;/strong&gt; in the engine repo, 2,365 lines inserted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 file changed&lt;/strong&gt; in the Coder template repo, 23 lines inserted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 npm packages&lt;/strong&gt; added (&lt;code&gt;mcp-handler&lt;/code&gt;, &lt;code&gt;@modelcontextprotocol/sdk&lt;/code&gt;, &lt;code&gt;zod&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 middleware line&lt;/strong&gt; to allow-list &lt;code&gt;/api/mcp/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0 new infrastructure&lt;/strong&gt; — no database, no Redis, no queues. GitHub API + console.log&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 terminal commands&lt;/strong&gt; to update the homelab Coder config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 iPhone-to-homelab SSH detour&lt;/strong&gt; for &lt;code&gt;gh auth login&lt;/code&gt; via Safari&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;660 views&lt;/strong&gt; over 30 days — the first number the analytics tool reported back&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 draft&lt;/strong&gt; in the queue when &lt;code&gt;list_posts&lt;/code&gt; was first tested (still sitting there, Substack)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~4 hours&lt;/strong&gt; from plan to production, including the template update and blog post&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 recursive loop&lt;/strong&gt; — the agent that built the feature can now use it to publish this post&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mcp</category>
      <category>agents</category>
      <category>buildinginpublic</category>
      <category>howto</category>
    </item>
    <item>
      <title>Qwen Is Not Yet Ready to Power Local OpenClaw Deployments</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Tue, 26 May 2026 19:27:24 +0000</pubDate>
      <link>https://dev.to/carryologist/qwen-is-not-yet-ready-to-power-local-openclaw-deployments-5ha3</link>
      <guid>https://dev.to/carryologist/qwen-is-not-yet-ready-to-power-local-openclaw-deployments-5ha3</guid>
      <description>&lt;p&gt;Three weeks ago I ran a model showdown — twelve tasks, five models, one RTX 5090 — and Qwen3.5-35B-A3B won. 85.3 weighted score, 206 tok/s, fits in VRAM with room to spare. I switched it to the default and figured I was done.&lt;/p&gt;

&lt;p&gt;I was not done.&lt;/p&gt;

&lt;p&gt;This is what two weeks of actually living with Qwen looked like: the config work I had to do before it was usable, the incident that almost killed the experiment, and the ergonomic gap that means frontier models still own my serious work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making It Actually Work
&lt;/h2&gt;

&lt;p&gt;The first day I switched Qwen to the default model in OpenClaw, something was wrong. Responses showed raw &lt;code&gt;&amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt;&lt;/code&gt; tags in the visible output. Tool calls came back as plain text — &lt;code&gt;create_workspace&lt;/code&gt;, just sitting there — instead of proper OpenAI-compatible &lt;code&gt;tool_calls&lt;/code&gt; objects. The bot was trying to call tools. It just wasn't &lt;em&gt;calling&lt;/em&gt; them.&lt;/p&gt;

&lt;p&gt;The root cause was a one-line config error. The launch script was using &lt;code&gt;--chat-template chatml&lt;/code&gt; — a minimal template that knows nothing about tool calling and doesn't know to hide thinking tokens. Qwen3.5 ships with a 154-line Jinja template that handles both. I just wasn't using it.&lt;/p&gt;

&lt;p&gt;The catch: Qwen's native template has a strict ordering check that raises an exception if a system message appears anywhere other than the very beginning of the conversation. Coder Agents sends system messages out of order. So I patched one conditional in the template — non-first system messages render as normal blocks instead of throwing — and switched to &lt;code&gt;--chat-template-file&lt;/code&gt; pointing at the patched version.&lt;/p&gt;

&lt;p&gt;After the restart: &lt;code&gt;thinking = 1&lt;/code&gt; in the journalctl output. Tool calls worked. The visible output was clean. The fix was one line. It took half a day to find.&lt;/p&gt;

&lt;p&gt;That's a recurring pattern with local model work. The model is fine. The scaffolding is fragile.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day One Gotcha: Cloning From a Stranger
&lt;/h2&gt;

&lt;p&gt;With the template fixed, I asked Qwen to clone the vibe coder repos. It searched GitHub for a literal &lt;code&gt;vibe-coder&lt;/code&gt; user, found a random stranger's account, and dutifully cloned 25 repos from them. &lt;code&gt;reset-css&lt;/code&gt;, &lt;code&gt;moviebox-main&lt;/code&gt;, &lt;code&gt;orange-farm&lt;/code&gt;. None of them mine.&lt;/p&gt;

&lt;p&gt;Not a Qwen failure, exactly. A context failure. The agent had no skill file telling it that &lt;code&gt;carryologist&lt;/code&gt; is the GitHub org. Once I pointed it at the skills directory it read the file, correctly identified the repos, and did the job.&lt;/p&gt;

&lt;p&gt;I fixed this by making skill loading unconditional. The user instruction used to say "when I mention the blog, read the vibescoder-blog skill." Changed it to "at the start of every conversation, read all available skills." Generic enough for every user, scoped by which skills the workspace template actually provisions.&lt;/p&gt;

&lt;p&gt;I also added a fodder dedup check to the vibescoder-blog skill — Qwen had recommended writing a post from a fodder file that already had a draft, because it never checked &lt;code&gt;sources:&lt;/code&gt; fields in existing posts. Small gap, easy to close once you see it.&lt;/p&gt;

&lt;p&gt;The pattern: Qwen is good at following instructions. It is not good at inferring what instructions it needs to follow before it has them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Thermal Flood
&lt;/h2&gt;

&lt;p&gt;May 9. 4:34 PM.&lt;/p&gt;

&lt;p&gt;The OpenClaw cron had been running for a few days. I'd named the job "Hardware Alert Checker (Critical Only)." On May 9 it posted a thermal report to the &lt;code&gt;#homelab-alerts&lt;/code&gt; Discord channel at 4:34 PM. Then again at 4:47. Then 5:07. For the next two days, every fifteen minutes — day and night — a full hardware report appeared in my channel. The cron log eventually showed 384 entries. I counted over 60 posts before I said anything.&lt;/p&gt;

&lt;p&gt;The job was named "Critical Only." It was not configured for "Critical Only." I had set it up to check thermals and post a report. It did exactly that. The bot did precisely what it was set up to do and nothing like what it was named to do.&lt;/p&gt;

&lt;p&gt;On May 11 I finally messaged carrybot directly: "Can we stop regular alerting and only let me know when temps go critical or if I specifically ask?"&lt;/p&gt;

&lt;p&gt;The bot replied: "Already done — that hardware monitoring job is set to 'Critical Only' and runs every 15 minutes. It'll only ping you if temps hit dangerous levels."&lt;/p&gt;

&lt;p&gt;I sent a screenshot of the flood. The bot checked the cron history, confirmed it was wrong, and disabled the job entirely. No config fix. No threshold update. Just gone. Manual checks only from that point forward.&lt;/p&gt;

&lt;p&gt;What it cost: I didn't open OpenClaw again until May 15. Three and a half days. That's a long silence for a tool you're evaluating as a daily driver. Friction compounds. One bad incident isn't fatal, but 60+ notifications across two days is loud enough that I actively avoided the interface rather than dealing with it. The bot won't get better if you stop using it.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP Wiring: The Win
&lt;/h2&gt;

&lt;p&gt;May 15 went better. I wired the fitness tracker MCP into OpenClaw — I wrote that up in &lt;a href="https://dev.to/wiring-mcp-into-my-fitness-tracker-for-openclaw"&gt;Wiring MCP Into My Fitness Tracker&lt;/a&gt;, but the short version is: two minutes, real data. First query returned my last Peloton ride. 30-minute Power Zone Pop Ride, Ben Alldis, 7.98 miles. The bot pulled it without hesitation.&lt;/p&gt;

&lt;p&gt;There was a ghost cron alert that evening — the bot flagged a cron job that didn't appear in my active list. Qwen explained the discrepancy clearly (the job exists in state but isn't scheduled). Good recovery after the thermal flood.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Session That Revealed the Real Problem
&lt;/h2&gt;

&lt;p&gt;May 16. I sent a voice message asking about my workout stats. No Whisper on the local install, so the bot had no idea what I said. Fine — I typed instead. "What are my stats for my ride today?"&lt;/p&gt;

&lt;p&gt;The bot went to Uber. Ride → Uber. It didn't know I meant Peloton. &lt;/p&gt;

&lt;p&gt;I clarified: fitness tracker MCP. The bot responded that the MCP server wasn't actively connected. I asked it to check the tool list. Confirmed: fitness-tracker was there. Third prompt, correct answer.&lt;/p&gt;

&lt;p&gt;Three extra turns to get what should have been a one-shot query. On a frontier model that would have resolved on the first prompt — it would have understood that "ride stats" meant the fitness tracker I'd been talking about the session before. On Qwen, I start every session from scratch. It has no memory of what MCP servers we were using yesterday. It has no context for what "ride" means to me.&lt;/p&gt;

&lt;p&gt;The bot diagnosed this correctly when I asked. It said: I need a TOOLS.md or explicit mentions at session start; I can't infer that fitness = Peloton MCP from prior conversations. It offered to update the TOOLS.md. It did. That's the right response. But it required me to catch the gap and prompt the fix. A more polished agent would have persisted that context automatically.&lt;/p&gt;

&lt;p&gt;It would have — except I checked the config later and &lt;code&gt;memory-core&lt;/code&gt; is disabled in &lt;code&gt;openclaw.json&lt;/code&gt;. There's a memory plugin; it's just off by default. Every session starting cold wasn't an emergent limitation of local models. It was a config flag I hadn't toggled.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Verdict: Local Agents Can't Match Frontier Practicality... Yet
&lt;/h2&gt;

&lt;p&gt;After two weeks: hobbyist-level technology. Great for enthusiasts. Not ready for prime-time agentic work.&lt;/p&gt;

&lt;p&gt;The model is solid. 206 tok/s is genuinely fast. The Jinja template, once fixed, works. When the context is right, the answers are good.&lt;/p&gt;

&lt;p&gt;But the ergonomics aren't there yet. Every session starts cold. MCP connections need re-establishing. The bot does what it's configured to do, not what you intend, and there's enough configuration surface area that intent and config drift apart. A frontier-model-backed agent handles these gaps with implicit context and better defaults. Qwen handles them if you set things up correctly and remind it what's relevant at the start of every conversation.&lt;/p&gt;

&lt;p&gt;That's a meaningful gap. Two weeks in, Qwen never became my default interface. I reach for it when I want to run something local, or when I'm testing the setup. I reach for a frontier model when I want the thing to just work.&lt;/p&gt;

&lt;p&gt;That's an honest result. Qwen is the right default for a privacy-first local-first homelab setup. For production agentic work, the frontier models are still ahead on ergonomics — and ergonomics compound across every session.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next: Upgrading to Qwen 3.6
&lt;/h2&gt;

&lt;p&gt;While I was writing this, Qwen released 3.6 (April 24, 2026). Two variants relevant to my setup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen3.6-35B-A3B&lt;/strong&gt; (MoE) — same VRAM footprint as the current model. Modest coding improvement over 3.5, adds a &lt;code&gt;preserve_thinking&lt;/code&gt; kwarg to the chat template. Drop-in upgrade.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen3.6-27B&lt;/strong&gt; (dense) — outperforms the 35B MoE on coding benchmarks. SWE-bench 77.2 vs 73.4. The tradeoff is throughput — dense models are slower per token, and the 3.5 MoE's 206 tok/s speed is one of its best features for agentic work where you're waiting on tool call chains.&lt;/p&gt;

&lt;p&gt;A few things to know before upgrading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;llama.cpp b9180+ required for MTP speculative decoding support&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--jinja&lt;/code&gt; flag needed for the &lt;code&gt;enable_thinking&lt;/code&gt;/&lt;code&gt;preserve_thinking&lt;/code&gt; kwargs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do not use &lt;code&gt;-sm tensor&lt;/code&gt;&lt;/strong&gt; — there's an open segfault bug (#23297)&lt;/li&gt;
&lt;li&gt;MTP flags: &lt;code&gt;--spec-type draft-mtp --spec-draft-n-max 3&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm going to try the 35B-A3B MoE first. Same slot, same startup flags (minus the segfault one), meaningful upgrade on coding. The dense 27B is tempting on benchmarks but I'll wait to see how throughput holds up under real agentic load before committing.&lt;/p&gt;

&lt;p&gt;The bigger question I'm watching isn't the benchmark numbers — it's whether the next generation of local models closes the context and tool call chaining gap. Once a local model can reliably remember what MCP servers you were using yesterday, infer intent across sessions, and chain tool calls without hand-holding, the ergonomics argument for frontier models gets a lot weaker. We're not there yet. I'll be paying attention.&lt;/p&gt;

&lt;h2&gt;
  
  
  By the Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;652 session files&lt;/strong&gt;, May 8–16 — the vast majority are cron-fired Discord sessions, not direct interactions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~10 human-initiated sessions&lt;/strong&gt; across the two weeks; the rest are the alert checker running every 15 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7 context resets&lt;/strong&gt; — sessions where the conversation was cleared and started fresh&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thermal flood&lt;/strong&gt;: cron job &lt;code&gt;d8da7ec1&lt;/code&gt; created May 9 4:31 PM PT, &lt;strong&gt;384 logged runs&lt;/strong&gt;, disabled May 11 9:10 PM PT — ~52 hours of every-15-minute posts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token/cost data&lt;/strong&gt;: all null — llama.cpp doesn't return usage in the API response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calls&lt;/strong&gt;: 0 structured &lt;code&gt;tool_use&lt;/code&gt; objects in session logs — llama.cpp doesn't emit them. The 40 hits on fitness tracker keywords are conversation text mentions, not actual invocations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory core&lt;/strong&gt;: disabled in &lt;code&gt;openclaw.json&lt;/code&gt; — explains why every session starts cold&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>homelab</category>
      <category>agents</category>
      <category>openclaw</category>
      <category>opinion</category>
    </item>
    <item>
      <title>Wiring MCP Into My Fitness Tracker — and Asking OpenClaw About My Last Workout</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Thu, 21 May 2026 16:05:46 +0000</pubDate>
      <link>https://dev.to/carryologist/wiring-mcp-into-my-fitness-tracker-and-asking-openclaw-about-my-last-workout-4pe</link>
      <guid>https://dev.to/carryologist/wiring-mcp-into-my-fitness-tracker-and-asking-openclaw-about-my-last-workout-4pe</guid>
      <description>&lt;p&gt;I open my &lt;a href="https://dev.to/posts/spring-cleaning-your-vibe-coded-apps"&gt;fitness tracker&lt;/a&gt; every day. It pulls workouts from Peloton and Tonal, tracks annual goals, makes pretty charts. Until this week, the way I interacted with it was: open browser, click button, look at chart. Like a 2018 web app.&lt;/p&gt;

&lt;p&gt;This week I made it an MCP server. Now I ask Discord "what was my last workout?" and &lt;strong&gt;carrybot&lt;/strong&gt; — my homelab &lt;a href="https://dev.to/posts/installing-openclaw-on-the-homelab"&gt;OpenClaw&lt;/a&gt; bot, running on my Linux homelab PC, talking to a local Qwen3.5-35B on llama.cpp — answers with real data from the same Postgres my browser hits. Same endpoint also works from Claude Desktop, Codex, Cursor, and any Coder workspace agent that knows how to call it.&lt;/p&gt;

&lt;p&gt;This is the writeup of the afternoon that took me there. The MCP server itself was easy. The interesting parts were the constraints I bumped into and the workarounds that turned out to be cleaner than the "right" answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;One sentence: &lt;strong&gt;let any AI agent talk to my fitness data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The vibe coded fitness tracker is a single-user Next.js 14 app on Vercel. Gated to one Google account. REST endpoints behind a NextAuth session cookie. Peloton and Tonal sync triggered by clicking buttons in the dashboard. That works for the browser. It doesn't work for an agent that wants to ask "summarize my training over the last quarter" or "trigger a Peloton sync — did anything new come in?"&lt;/p&gt;

&lt;p&gt;I want the agent to have &lt;strong&gt;raw access&lt;/strong&gt;. No precomputed summaries. Give it the rows and let it figure out the trends. Part of the point is to learn how agents get better at this kind of analysis over time, and that doesn't happen if I do the math for them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why MCP, Not OpenAPI
&lt;/h2&gt;

&lt;p&gt;I almost shipped this as an OpenAPI spec plus bearer-token auth. Cleaner, simpler, every agent framework supports it.&lt;/p&gt;

&lt;p&gt;Then I listed the clients I actually want to use:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Client&lt;/th&gt;
&lt;th&gt;OpenAPI&lt;/th&gt;
&lt;th&gt;MCP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Desktop&lt;/td&gt;
&lt;td&gt;Custom integration&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex CLI&lt;/td&gt;
&lt;td&gt;Custom integration&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coder Agents&lt;/td&gt;
&lt;td&gt;Via AI Bridge&lt;/td&gt;
&lt;td&gt;Via AI Bridge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;td&gt;Via plugin&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor, Windsurf, Zed&lt;/td&gt;
&lt;td&gt;Custom&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every client speaks MCP first-class. Ship MCP, write the tools once, every agent picks them up by pointing at a URL. Ship OpenAPI and every client needs bespoke wiring. The decision was over before I finished the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Server
&lt;/h2&gt;

&lt;p&gt;Three files, ~400 lines total.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;src/app/api/mcp/[transport]/route.ts&lt;/code&gt;&lt;/strong&gt; — the MCP route, built on &lt;a href="https://github.com/vercel/mcp-handler" rel="noopener noreferrer"&gt;&lt;code&gt;mcp-handler&lt;/code&gt;&lt;/a&gt; (the package formerly known as &lt;code&gt;@vercel/mcp-adapter&lt;/code&gt; before it got renamed and republished). Ten tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;list_workouts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="cm"&gt;/* schema */&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({...})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;get_workout&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="cm"&gt;/* schema */&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;   &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;create_workout&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="cm"&gt;/* schema */&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({...})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;update_workout&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="cm"&gt;/* schema */&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({...})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;delete_workout&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="cm"&gt;/* schema */&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;   &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;list_goals&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="cm"&gt;/* schema */&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;peloton_status&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="cm"&gt;/* schema */&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sync_peloton&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="cm"&gt;/* schema */&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tonal_status&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="cm"&gt;/* schema */&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sync_tonal&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="cm"&gt;/* schema */&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CRUD tools wrap Prisma directly. The sync tools &lt;code&gt;fetch()&lt;/code&gt; the existing REST endpoints (&lt;code&gt;/api/peloton/sync&lt;/code&gt;, &lt;code&gt;/api/tonal/sync&lt;/code&gt;) so I'm not duplicating the dedup orchestration — those endpoints already handle "did we already sync this workout? does this row need backfilling? did the Peloton token expire?" Wrapping them is one HTTP hop. Worth it to keep one source of truth for sync logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;src/lib/api-auth.ts&lt;/code&gt;&lt;/strong&gt; — bearer token helpers. The token is a single env var, &lt;code&gt;MCP_API_TOKEN&lt;/code&gt;, 64 random hex chars. Compared in constant time so I don't leak timing side channels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;timingSafeEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;mismatch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;mismatch&lt;/span&gt; &lt;span class="o"&gt;|=&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;charCodeAt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;charCodeAt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;mismatch&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;middleware.ts&lt;/code&gt;&lt;/strong&gt; — extended so the bearer token unlocks every &lt;code&gt;/api/*&lt;/code&gt; route, not just &lt;code&gt;/api/mcp&lt;/code&gt;. Same token, two callers: the MCP server calls Prisma directly for read tools, and self-&lt;code&gt;fetch&lt;/code&gt;es the existing REST routes for the sync tools. Both paths need to pass auth. The token does double duty.&lt;/p&gt;

&lt;p&gt;The transport choice was the one decision worth thinking about. &lt;code&gt;mcp-handler&lt;/code&gt; supports SSE and streamable HTTP. SSE needs Redis for message brokering. Streamable HTTP is stateless. I'm on Vercel Hobby with no Redis. &lt;code&gt;disableSse: true&lt;/code&gt; and ship.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;basePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/mcp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;verboseLogs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;maxDuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;disableSse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;pnpm i mcp-handler @modelcontextprotocol/sdk@1.26.0 zod&lt;/code&gt; — and yes, you have to pin the SDK to 1.26.0 because &lt;code&gt;mcp-handler@1.1.0&lt;/code&gt; peer-depends on exactly that version, not a semver range. Half an hour of &lt;code&gt;npm install&lt;/code&gt; errors before I noticed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Test That Said It Worked
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://&amp;lt;actualapp&amp;gt;.vercel.app/api/mcp/mcp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$MCP_API_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/json, text/event-stream"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response: &lt;code&gt;200 OK&lt;/code&gt;, &lt;code&gt;event: message&lt;/code&gt;, full tool catalog with JSON Schemas. The server worked.&lt;/p&gt;

&lt;p&gt;The hard part wasn't the server. It was getting the four clients I cared about to use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Client #1: Claude Desktop, Codex, Cursor — The Easy Path
&lt;/h2&gt;

&lt;p&gt;These all read a JSON config file with the same shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fitness-tracker"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://robs-fitness-tracker.vercel.app/api/mcp/mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bearer &amp;lt;MCP_API_TOKEN&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drop in the URL, drop in the token, restart the client. Done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Client #2: Coder Workspace Agents — The Path I Got Wrong
&lt;/h2&gt;

&lt;p&gt;I run &lt;a href="https://coder.com" rel="noopener noreferrer"&gt;Coder&lt;/a&gt; on my workstation. Every workspace gets a &lt;code&gt;~/.mcp.json&lt;/code&gt; baked in by the Terraform template (Context7, Vercel, Cloudflare, Playwright — see &lt;a href="https://dev.to/posts/installing-openclaw-on-the-homelab"&gt;the homelab post&lt;/a&gt;). My mental model: add a fifth entry for fitness-tracker, the agent picks it up.&lt;/p&gt;

&lt;p&gt;So I patched the template. Token flows from &lt;code&gt;~/.config/fitness-tracker/env&lt;/code&gt; on the workstation → &lt;code&gt;TF_VAR_fitness_tracker_mcp_token&lt;/code&gt; in &lt;code&gt;/etc/coder.d/coder.env&lt;/code&gt; → Terraform &lt;code&gt;variable&lt;/code&gt; → &lt;code&gt;coder_agent.main.env&lt;/code&gt; → workspace process → &lt;code&gt;jq&lt;/code&gt;-merge into &lt;code&gt;~/.mcp.json&lt;/code&gt; at startup with &lt;code&gt;chmod 600&lt;/code&gt;. One PR, one &lt;code&gt;apply.sh&lt;/code&gt;, every workspace gets it.&lt;/p&gt;

&lt;p&gt;Verified the file showed up in a fresh workspace with all five MCP servers in the keys. Confidently asked the agent: "list my fitness-tracker tools."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I don't have any fitness-tracker tools available. My available tools are for software-engineering tasks inside a Coder workspace..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent had no idea. Started a fresh chat — same answer. Inspected the agent runtime and found this in Coder's source at v2.33.2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// enterprise/aibridgedserver/aibridgedserver.go&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;links&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProviderID&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;eac&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validateErr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;eac&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ValidateToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OAuthToken&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="c"&gt;// ...&lt;/span&gt;
  &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OAuthAccessToken&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Coder's AI Bridge only auto-registers OAuth-backed MCP servers.&lt;/strong&gt; Specifically, MCP servers wired through &lt;code&gt;CODER_EXTERNAL_AUTH_*_MCP_URL&lt;/code&gt; against an OAuth external auth provider. Static-token MCP servers are invisible to the chat agent. The &lt;code&gt;~/.mcp.json&lt;/code&gt; file is for &lt;em&gt;other&lt;/em&gt; MCP clients running in the workspace (Claude Desktop, Codex, code-server's Continue extension), not for Coder's chat itself.&lt;/p&gt;

&lt;p&gt;I'd shipped a &lt;code&gt;coder-templates&lt;/code&gt; PR that does the right thing for every MCP client &lt;em&gt;except&lt;/em&gt; the one I was trying to enable. The PR is still useful — it makes the fitness tracker available to any MCP client a workspace user wires up. But Coder Agents specifically were locked out.&lt;/p&gt;

&lt;p&gt;Two real options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Wrap the fitness tracker in OAuth.&lt;/strong&gt; NextAuth supports being an OAuth provider. Register it in Coder as an external auth. Coder mints tokens, AI Bridge injects them. Significant work for a single-user app.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teach the agent the recipe.&lt;/strong&gt; Write a skill file that documents the endpoint, the auth, the wire shape, and the ten tools. Agent reads the skill at chat start and calls the MCP server with &lt;code&gt;curl&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 2 was 200 lines of Markdown. I picked option 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fitness-tracker&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Access&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;personal&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fitness-tracker&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;MCP&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;server&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;via&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;HTTP..."&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## Call recipe&lt;/span&gt;

ft_call() {
  local tool="$1" args="${2:-{&lt;span class="se"&gt;\}&lt;/span&gt;}"
  local payload=$(jq -cn --arg t "$tool" --argjson a "$args" &lt;span class="err"&gt;\&lt;/span&gt;
    '{jsonrpc:"2.0", id:1, method:"tools/call",
      params:{name:$t, arguments:$a}}')
  curl -sS -X POST https://robs-fitness-tracker.vercel.app/api/mcp/mcp &lt;span class="err"&gt;\&lt;/span&gt;
    -H "Authorization: Bearer $FITNESS_TRACKER_MCP_TOKEN" &lt;span class="err"&gt;\&lt;/span&gt;
    ... | sed -n 's/^data: //p' | jq -r '.result.content[0].text // .error'
}

ft_call list_workouts '{"year":2025,"limit":10}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That ships in the same template patch as the &lt;code&gt;~/.mcp.json&lt;/code&gt; entry. Per-user, gated to my account. Next fresh chat: agent reads the skill, sees the recipe, calls the server. Works on the first try.&lt;/p&gt;

&lt;h2&gt;
  
  
  Client #4: OpenClaw — The Client I Built This For
&lt;/h2&gt;

&lt;p&gt;The whole reason I started this afternoon. &lt;code&gt;carrybot&lt;/code&gt; runs on the homelab and accessed from any Discord client. Local Qwen3.5-35B for the brain, Discord for the front end, OpenClaw as the agent runtime.&lt;/p&gt;

&lt;p&gt;I DM'd carrybot and asked how to register an MCP server. He came back with:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I need a bit more info about your fitness app MCP endpoint:&lt;br&gt;
Transport type: Is it HTTP(S), SSE, streamable-http, or stdio?&lt;br&gt;
URL: What's the endpoint URL?&lt;br&gt;
Auth: You mentioned a token — is it in headers or URL?&lt;br&gt;
Tool names: What specific tools does it expose?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Pasted the answers. He ran:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw mcp &lt;span class="nb"&gt;set &lt;/span&gt;fitness-app &lt;span class="s1"&gt;'{
  "type": "streamable-http",
  "url": "https://robs-fitness-tracker.vercel.app/api/mcp/mcp",
  "headers": { "Authorization": "Bearer ${FITNESS_TRACKER_MCP_TOKEN}" }
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Asked: "What was my last workout?"&lt;/p&gt;

&lt;p&gt;He replied with a 30-minute Power Zone Pop Ride with Ben Alldis on May 12 at 4:12 PM PDT. 7.98 miles. Real Peloton data, pulled through the MCP server I'd built three hours earlier, into a Discord bot running in the other room.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cron Ghost
&lt;/h2&gt;

&lt;p&gt;One amusing wrinkle. Carrybot's first response included an inline alert:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ Cron: &lt;code&gt;fitness-check&lt;/code&gt; failed&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I asked him what that was. He came back with full transparency:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"You're right to ask! That 'fitness-check' cron never actually got created — I made a typo when trying to add it. ... The API returned: &lt;code&gt;invalid cron.add params: at root: unexpected property 'sessionTarget":'&lt;/code&gt;. That's because I accidentally added an extra quote: &lt;code&gt;"sessionTarget": ": "main"&lt;/code&gt; should have been &lt;code&gt;"sessionTarget": "main"&lt;/code&gt;. The JSON was malformed."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He'd tried to use OpenClaw's cron system as a deferred-execution primitive — schedule a one-shot task for one minute from now, then have the cron run the MCP query. Malformed JSON, cron creation failed, OpenClaw faithfully alerted me about the failure, and he answered my question via a sub-agent path instead. The data I saw was real. The "cron failed" was background noise from a bug he caught and fixed in his next message.&lt;/p&gt;

&lt;p&gt;I gave him the UX feedback — one-shot crons as deferred execution mean every malformed payload looks like a production alert — and he immediately revised the pattern:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Sub-agent direct call → Use for one-off work. Cleaner, no cron overhead.&lt;br&gt;
Cron for real scheduling → Only use &lt;code&gt;at&lt;/code&gt; crons when you actually need deferred execution.&lt;br&gt;
Don't spam alerts → Malformed JSON that prevents a cron from being created shouldn't generate a scary 'Cron failed' alert."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's a long-running agent learning its own UX patterns. Worth its own post someday.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token Storage
&lt;/h2&gt;

&lt;p&gt;One token, four locations, all mode 600 or equivalent. Same value everywhere:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Vercel project env var      MCP_API_TOKEN
2. Workstation                 ~/.config/fitness-tracker/env  (chmod 600)
3. Coder server                /etc/coder.d/coder.env         (root-readable systemd EnvironmentFile)
4. Coder workspaces            ~/.mcp.json                    (chmod 600, regenerated per workspace start)
5. OpenClaw                    ~/.openclaw/openclaw.json      (chmod 600)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rotation: &lt;code&gt;openssl rand -hex 32&lt;/code&gt;, update all five locations, redeploy Vercel. Roughly 90 seconds, no code changes.&lt;/p&gt;

&lt;p&gt;The token lives in env vars, never in shell rc files. The shell-rc anti-pattern is real — anything &lt;code&gt;export&lt;/code&gt;ed into &lt;code&gt;~/.bashrc&lt;/code&gt; leaks into every subshell's process listing, gets sourced by background jobs that shouldn't see it, and survives in &lt;code&gt;.bash_history&lt;/code&gt; for as long as that file lives. A &lt;code&gt;chmod 600&lt;/code&gt; env file you source explicitly when you need it stays in exactly the processes that need it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Verify the agent runtime's MCP integration before patching templates.&lt;/strong&gt; I patched &lt;code&gt;coder-templates&lt;/code&gt; to add a workspace-level &lt;code&gt;~/.mcp.json&lt;/code&gt; entry before I'd checked whether Coder's chat agent actually reads that file. It doesn't. The patch is still useful for other MCP clients running in the workspace, but I wouldn't have prioritized it first if I'd known.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skip the OpenAPI consideration earlier.&lt;/strong&gt; I spent real cycles writing the "MCP vs OpenAPI" comparison in my head. The clients I cared about all speak MCP natively. The decision was over before I started thinking about it; I just didn't realize it for ten minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with the skill file as a first-class option, not a workaround.&lt;/strong&gt; When I hit the Coder AI Bridge limitation, my first instinct was "build OAuth, ship the proper integration." The skill file approach is genuinely simpler, lives next to existing skills, and will be obsolete the day AI Bridge gains static-token support — which seems like a planned-but-not-yet-shipped feature based on the deprecation comments in Coder's source. Skill files are the right level of investment when the underlying platform is in flux.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Test the skill in a fresh Coder chat.&lt;/strong&gt; The PR merged but I haven't validated it end-to-end yet. The skill is concrete enough that the agent should call &lt;code&gt;ft_call list_workouts&lt;/code&gt; on the first try. If it fumbles, the skill needs tightening.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch the raw-rows decision over time.&lt;/strong&gt; All ten tools return raw database rows. Zero precomputed aggregates. The whole point is to see whether agents naturally synthesize good summaries or degrade as the dataset grows. If they degrade, add a &lt;code&gt;summarize_year&lt;/code&gt; tool. Until then, keep the surface area small.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token rotation drill.&lt;/strong&gt; I haven't had to rotate &lt;code&gt;MCP_API_TOKEN&lt;/code&gt; yet. Worth doing once intentionally to find any place we forgot to document.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wait for AI Bridge to support static-token MCP servers.&lt;/strong&gt; When it does, the skill file becomes redundant and the &lt;code&gt;~/.mcp.json&lt;/code&gt; entry becomes the canonical path. Until then, the skill is the working path.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fitness tracker is now genuinely agent-accessible. Same vibe coded app that started as a Next.js weekend project, now serving four different agent runtimes through a single MCP endpoint. The audit a few weeks ago found the bugs. This week added the API surface. Next steps are about watching agents use it.&lt;/p&gt;

&lt;p&gt;The lobster's a real assistant now.&lt;/p&gt;

&lt;h2&gt;
  
  
  By the Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3 hours&lt;/strong&gt; total session time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 GitHub PRs&lt;/strong&gt; opened and merged (fitness-tracker, coder-templates)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 follow-up PR&lt;/strong&gt; for the skill file workaround&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10 MCP tools&lt;/strong&gt; exposed, all returning raw rows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0 precomputed aggregates&lt;/strong&gt; — agents do their own analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 client integrations&lt;/strong&gt; working from one endpoint (Claude Desktop, Codex / Cursor / etc., Coder Agents via skill, OpenClaw)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 dead-end&lt;/strong&gt; — Coder AI Bridge's OAuth-only MCP injection requirement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;200 lines&lt;/strong&gt; of Markdown in the skill that workaround it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;64 hex chars&lt;/strong&gt; in the personal access token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5 locations&lt;/strong&gt; that hold the token, all mode 600 or equivalent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 ghost cron&lt;/strong&gt; that alerted me to a bug in carrybot's own code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 long-running agent&lt;/strong&gt; that revised its own UX patterns based on feedback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;30 minutes&lt;/strong&gt; — the duration of the last workout the bot reported&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7.98 miles&lt;/strong&gt; — distance on that Power Zone Pop Ride with Ben Alldis&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mcp</category>
      <category>openclaw</category>
      <category>agents</category>
      <category>homelab</category>
    </item>
    <item>
      <title>Showdown Thoughts: The Three-Pass Pattern</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Tue, 19 May 2026 13:49:16 +0000</pubDate>
      <link>https://dev.to/carryologist/showdown-thoughts-the-three-pass-pattern-4096</link>
      <guid>https://dev.to/carryologist/showdown-thoughts-the-three-pass-pattern-4096</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/posts/model-showdown-round-5-four-agents-build-the-same-feature"&gt;Model Showdown Round 5&lt;/a&gt;&lt;br&gt;
ended with a leaderboard. Sonnet 4.6 won on the rubric. Opus 4.7 placed&lt;br&gt;
second. Qwen 3.5 contributed almost nothing structural. That's the&lt;br&gt;
measurement story.&lt;/p&gt;

&lt;p&gt;This is the methodology story — what happened after the scores were&lt;br&gt;
revealed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Picking a Winner
&lt;/h2&gt;

&lt;p&gt;The naive workflow after a bakeoff is: pick the best run, merge it to&lt;br&gt;
main, ship it. Winner takes all.&lt;/p&gt;

&lt;p&gt;That's wrong, and Round 5 made it obvious why.&lt;/p&gt;

&lt;p&gt;The winning run (Sonnet 4.6) had the best overall rubric score. It also&lt;br&gt;
had a weaker path validator than Opus 4.7, and its orphan-matching logic&lt;br&gt;
would have missed real-world cases that Opus 4.6 caught. The second-place&lt;br&gt;
run (Opus 4.7) had the best validator and the cleanest route structure, but&lt;br&gt;
the worst data source choice — reading from the build-time filesystem&lt;br&gt;
instead of the live GitHub Contents API.&lt;/p&gt;

&lt;p&gt;No individual run was what I'd ship. Each one had at least one bad call.&lt;br&gt;
The bakeoff's real output wasn't a winner. It was a map.&lt;/p&gt;

&lt;p&gt;When 4 of 4 models made the same design choice, that choice was obviously&lt;br&gt;
right. When they diverged — on validation strictness, on data source, on&lt;br&gt;
UX for destructive actions — that divergence was the signal. Those were the&lt;br&gt;
actual design decisions, the ones worth spending judgment on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Passes
&lt;/h2&gt;

&lt;p&gt;What emerged from Round 5 is a pattern I've now run twice and would reach&lt;br&gt;
for again on any feature where the design space is unclear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass 1 — Bakeoff.&lt;/strong&gt; Run N models (I used 4) on the same prompt in&lt;br&gt;
isolated sessions. Judge blind, before you know which branch is which.&lt;br&gt;
Score against a rubric. The output of this pass isn't any of the N&lt;br&gt;
implementations — it's the decision map. You now know which choices are&lt;br&gt;
contested and which are obvious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass 2 — Merge.&lt;/strong&gt; Write down a merge plan before touching any code: for&lt;br&gt;
each contested layer, which run's approach wins and why. Then ask an agent&lt;br&gt;
to compose the merged best-of from those inputs. The merge is strictly&lt;br&gt;
better than any individual bakeoff run because it draws on information none&lt;br&gt;
of the bakeoff contestants had — the scored comparison of all four.&lt;/p&gt;

&lt;p&gt;For Round 5 the plan looked like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Path validator&lt;/td&gt;
&lt;td&gt;Opus 4.7 (Run 1)&lt;/td&gt;
&lt;td&gt;Only run with 2-segment enforcement + &lt;code&gt;..&lt;/code&gt; block + non-empty checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Three-tier orphan match&lt;/td&gt;
&lt;td&gt;Opus 4.6 (Run 2)&lt;/td&gt;
&lt;td&gt;Only run that noticed exact-match missed real cases like &lt;code&gt;day-four&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type-narrowed body parsing&lt;/td&gt;
&lt;td&gt;Sonnet 4.6 (Run 3)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;typeof body === "object" &amp;amp;&amp;amp; "path" in body&lt;/code&gt;, no &lt;code&gt;as&lt;/code&gt; casts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Contents API&lt;/td&gt;
&lt;td&gt;Opus 4.6 / Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Live state vs. build-time filesystem snapshot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confirm-modal UX&lt;/td&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Best visual polish in the screenshots&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Qwen 3.5 contributed nothing structural to this table. The bakeoff said&lt;br&gt;
"skip this one" clearly enough that there was nothing to debate. That's&lt;br&gt;
useful information too — knowing which pieces to skip is part of the map.&lt;/p&gt;

&lt;p&gt;The merge was 13 files changed, +990/-9. One TypeScript error caught and&lt;br&gt;
fixed. Build passed first try after that. Opened as a PR with the heritage&lt;br&gt;
table in the description so future reviewers can trace any decision back to&lt;br&gt;
its source run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass 3 — Polish.&lt;/strong&gt; The merged feature went live. I opened it against&lt;br&gt;
real production data and spotted four things immediately: truncated&lt;br&gt;
directory names with no tooltip, delete buttons invisible on touch devices,&lt;br&gt;
no bulk delete UI despite the API supporting &lt;code&gt;paths: []&lt;/code&gt;, and an orphaned&lt;br&gt;
section header that would show with count 0 after the lone orphan was&lt;br&gt;
deleted.&lt;/p&gt;

&lt;p&gt;None of those were predictable before live use. You can't predict friction&lt;br&gt;
from a code review — you observe it. The polish pass had to come after the&lt;br&gt;
merge because the artifact it was polishing didn't exist until then.&lt;/p&gt;

&lt;p&gt;The polish was 6 files changed, +265/-54 and about 20 minutes of agent&lt;br&gt;
time.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use It
&lt;/h2&gt;

&lt;p&gt;The pattern has a real cost: the bakeoff is N full agent sessions, each&lt;br&gt;
producing a complete implementation that you won't ship. For Round 5 that&lt;br&gt;
was ~$35 in inference and a few hours of judging.&lt;/p&gt;

&lt;p&gt;That's cheap insurance when the feature has any of these properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Destructive verbs.&lt;/strong&gt; Delete, update, payment, permission change. The
cost of getting validation wrong outweighs the cost of the bakeoff.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple defensible architectures.&lt;/strong&gt; Where should validation live?
What's the data source? How does auth thread through? When you genuinely
don't know the right answer, a bakeoff shows you the option space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard to change later.&lt;/strong&gt; Database schemas. Public API contracts. Anything
that will accumulate callers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's overkill for a 20-line UI tweak or a feature with a single obvious&lt;br&gt;
implementation. The signal value of the bakeoff scales with how uncertain&lt;br&gt;
you are about the design.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;Three things I'd change for the next run:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Name the contestant chats before pasting the prompt.&lt;/strong&gt; All four Round 5&lt;br&gt;
chats showed up as "New Chat" in the Coder API cost summary, which meant&lt;br&gt;
20 minutes of token-volume detective work to figure out which cost belonged&lt;br&gt;
to which run. Five seconds of effort would have prevented that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capture per-phase stats.&lt;/strong&gt; I have clean bakeoff numbers. I don't have&lt;br&gt;
separate merge or polish numbers — they're folded into the judging thread.&lt;br&gt;
A lightweight wrapper script around each phase would make the next&lt;br&gt;
iteration measurable end-to-end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write the polish friction items down before fixing them.&lt;/strong&gt; I noticed four&lt;br&gt;
issues and fixed them in one pass, which collapsed the "observed" list and&lt;br&gt;
the "fixed" list into the same moment. Separating them — even by five&lt;br&gt;
minutes — would have made the "what does live-review surface" lesson&lt;br&gt;
sharper for the writeup. And occasionally you'll notice something that&lt;br&gt;
isn't worth fixing.&lt;/p&gt;

&lt;h2&gt;
  
  
  By the Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3 phases&lt;/strong&gt;: Bakeoff (4 parallel attempts), Merge (1 informed pass), Polish (1 live-review pass)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 implementations&lt;/strong&gt; produced in the bakeoff, &lt;strong&gt;0&lt;/strong&gt; shipped to main as-is&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 of 4&lt;/strong&gt; bakeoff runs contributed at least one structural piece to the merge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;13 files changed&lt;/strong&gt; in the merge pass (+990/-9)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6 files changed&lt;/strong&gt; in the polish pass (+265/-54)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 friction items&lt;/strong&gt; caught in polish that couldn't have been predicted before live use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~$35.56&lt;/strong&gt; inference cost for the bakeoff phase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~45 min&lt;/strong&gt; bakeoff (parallel), &lt;strong&gt;~30 min&lt;/strong&gt; merge, &lt;strong&gt;~20 min&lt;/strong&gt; polish&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>vibecoding</category>
      <category>modelshowdown</category>
      <category>buildinginpublic</category>
    </item>
    <item>
      <title>Model Showdown Round 5: Four Agents Build the Same Feature</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Mon, 18 May 2026 16:05:46 +0000</pubDate>
      <link>https://dev.to/carryologist/model-showdown-round-5-four-agents-build-the-same-feature-1ic7</link>
      <guid>https://dev.to/carryologist/model-showdown-round-5-four-agents-build-the-same-feature-1ic7</guid>
      <description>&lt;p&gt;I've been running model showdowns on Vibes Coder for a while now. Each round has been a little messier than I wanted — different prompts, accidental context leaks, no clean way to compare cost to quality. This one is the first I'd call a &lt;em&gt;fair&lt;/em&gt; bakeoff. Two goals going in:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Make the experiment itself rigorous enough that future rounds can build on it&lt;/strong&gt; — isolated chat sessions, identical prompts, anonymized branches, blind judging, real token + runtime data pulled from the Coder API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare three flavors of Claude against our local champ.&lt;/strong&gt; Opus 4.7, Opus 4.6, and Sonnet 4.6 from Anthropic; Qwen 3.5 35B-A3B running on llama.cpp on the RTX 5090 in the home lab. Four models, same task, four isolated Coder Agents sessions, blind judging.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The headline: &lt;strong&gt;Sonnet 4.6 beat Opus 4.6 on a coding task.&lt;/strong&gt; Not by much (4.48 vs 4.36) but cleanly, on its own merits, with no asterisks. And once I pulled real token and runtime data from Coder's chat-cost API, a second headline emerged: &lt;strong&gt;weighted by cost, Sonnet's win becomes decisive — about 10x cheaper per rubric point than either Opus model.&lt;/strong&gt; A third wrinkle: Opus 4.7 finished the task in 9.2 minutes, the fastest of the three Claude runs. It won the rubric without burning the most time. The deeper story is what each model did with the same prompt, and what it took to make the bakeoff &lt;em&gt;fair&lt;/em&gt; in the first place — which turned out to be more work than the bakeoff itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;The contestants:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Where it runs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;Cloud, via Coder Agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Cloud, via Coder Agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;Cloud, via Coder Agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Qwen 3.5 35B-A3B&lt;/td&gt;
&lt;td&gt;Local, llama.cpp on the RTX 5090, via Coder Agents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The mapping was private. Branches were named &lt;code&gt;run-1&lt;/code&gt; through &lt;code&gt;run-4&lt;/code&gt;. I judged the four branches blind against a fixed rubric, then revealed the identities.&lt;/p&gt;

&lt;p&gt;The task: build image management into the vibescoder.dev admin dashboard. The current &lt;code&gt;/admin&lt;/code&gt; page has a Settings card that's a placeholder. The spec asked for an Images card (or a replacement) that lists the post-image directories under &lt;code&gt;public/images/&lt;/code&gt;, detects orphans (directories with no matching post), provides a screenshot view, and adds an API route to delete a directory.&lt;/p&gt;

&lt;p&gt;It's not a huge feature, but it has enough surface area to differentiate models: filesystem traversal, slug matching, path validation, an API contract with a destructive verb, a UI page, and at least one judgment call (what counts as an "orphan?").&lt;/p&gt;

&lt;h2&gt;
  
  
  The fairness story
&lt;/h2&gt;

&lt;p&gt;Before launching anything, three things needed fixing. None of them are interesting on their own. Together they're the operational lesson of this post: a bakeoff isn't fair by default.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix 1: Node 18 vs Node 20
&lt;/h3&gt;

&lt;p&gt;The workspace image is built on Ubuntu 24.04. Ubuntu 24.04's &lt;code&gt;apt&lt;/code&gt; Node is 18.19. Next.js 16 — what the blog engine ships on — requires Node 20+. Any agent that ran &lt;code&gt;apt install nodejs&lt;/code&gt; would silently break its own build.&lt;/p&gt;

&lt;p&gt;The fix was a Dockerfile change in the &lt;code&gt;coder-templates&lt;/code&gt; repo: install Node 20 from NodeSource at image build time, pin npm, verify &lt;code&gt;node -v&lt;/code&gt; reports 20.x in the smoke test. After that, &lt;code&gt;node -v&lt;/code&gt; in a fresh workspace prints &lt;code&gt;v20.20.2&lt;/code&gt; and nothing the agents do (short of &lt;code&gt;nvm&lt;/code&gt; shenanigans) changes that.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix 2: The system instructions were lying
&lt;/h3&gt;

&lt;p&gt;The chat system prompt — injected at the top of every Coder Agents session — said Node was not pre-installed and told agents to install it themselves. Correct on the previous image; actively misleading after Fix 1. An agent following the instructions would &lt;code&gt;apt install nodejs&lt;/code&gt;, get Node 18, downgrade the runtime, and break the build.&lt;/p&gt;

&lt;p&gt;I rewrote the instructions to say Node 20 is pre-installed, do not reinstall, use &lt;code&gt;nvm&lt;/code&gt; if you need a different version. Boring change. Huge impact on whether the bakeoff produces meaningful signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix 3: Prompt poisoning
&lt;/h3&gt;

&lt;p&gt;The first draft of the bakeoff prompt told each agent to create a branch named after the model running the session — &lt;code&gt;bakeoff-opus47&lt;/code&gt;, &lt;code&gt;bakeoff-sonnet46&lt;/code&gt;, and so on. A sharp catch from the human side: that wording &lt;strong&gt;leaks competition signaling into the prompt&lt;/strong&gt;. An agent that sees "you are opus47" or even "this is a bakeoff" can adjust behavior in ways that aren't comparable. The experiment stops measuring "what does this model do with the prompt" and starts measuring "what does this model do when it knows it's on stage."&lt;/p&gt;

&lt;p&gt;Fix: replace model names with neutral ordinals. Branches became &lt;code&gt;run-1&lt;/code&gt; through &lt;code&gt;run-4&lt;/code&gt;. The prompt made no reference to other runs, scoring, or any comparison. Each agent thought it was building a feature, not auditioning.&lt;/p&gt;

&lt;p&gt;Three small fixes. Together they're the operational lesson: &lt;strong&gt;fairness in a model bakeoff requires more setup than the bakeoff itself.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The prompt
&lt;/h2&gt;

&lt;p&gt;The prompt was identical for all four runs, save for the run number in the branch name. Verbatim, with one path generalized:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;You are working in the vibescoder.dev blog engine repo. Branch: run-N.
Baseline commit is at the tip of main.

Goal: add image management to /admin.

Requirements:
&lt;span class="p"&gt;-&lt;/span&gt; List the directories under public/images/ (each directory corresponds
  to one post and contains its images).
&lt;span class="p"&gt;-&lt;/span&gt; For each directory, report: name, file count, total size on disk,
  and whether it matches a published or draft post (by slug).
&lt;span class="p"&gt;-&lt;/span&gt; Surface "orphaned" directories — directories that do not match any
  post — so I can clean them up.
&lt;span class="p"&gt;-&lt;/span&gt; Provide a way to view the images in a directory (thumbnails or list).
&lt;span class="p"&gt;-&lt;/span&gt; Provide an API route DELETE /api/admin/images that removes a
  directory by path. The route must validate input.
&lt;span class="p"&gt;-&lt;/span&gt; Update the /admin landing page so the new feature is reachable.
  You may keep the Settings placeholder card or replace it; either is fine.
&lt;span class="p"&gt;-&lt;/span&gt; Add a screenshot of the new page to the PR description (use the
  Playwright MCP).
&lt;span class="p"&gt;-&lt;/span&gt; Run &lt;span class="sb"&gt;`npm run build`&lt;/span&gt; before committing. Do not push commits that
  fail the build.
&lt;span class="p"&gt;-&lt;/span&gt; Commit in logical chunks. Push the branch when done.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;p&gt;That's it. No mention of competing runs. No scoring rubric. No model identification. Just a feature spec and a quality bar.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four implementations
&lt;/h2&gt;

&lt;p&gt;All four runs built it. All four passed &lt;code&gt;npm run build&lt;/code&gt; against a shared engine baseline on Node 20.20.2. All four pushed their branches. Then the differences started showing up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run 1 — 8 new files, 631+/9-
&lt;/h3&gt;

&lt;p&gt;Replaced the Settings placeholder with an Images card on &lt;code&gt;/admin&lt;/code&gt;. Added a dedicated &lt;code&gt;/admin/images&lt;/code&gt; page that lists directories server-side, plus a client-side modal that renders a grid of thumbnails when you click into a directory. Three screenshots in the PR description — admin landing, images list, modal open with orphan-flagged styling.&lt;/p&gt;

&lt;p&gt;The standout was the API route. Run 1 wrote a real path validator — &lt;code&gt;isValidImageRepoPath&lt;/code&gt; — that required exactly two path segments under &lt;code&gt;public/images/&lt;/code&gt;, rejected &lt;code&gt;..&lt;/code&gt;, and ran &lt;em&gt;before&lt;/em&gt; the filesystem call. The route returned distinct status codes for distinct failure modes: 400 for bad input, 404 for missing, 403 for paths that resolve outside the allowed root, 200 for success.&lt;/p&gt;

&lt;p&gt;It's not glamorous code. It's just the version where someone thought about the failure modes before writing the success path.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/model-showdown-round-5-four-agents-build-the-same-feature/run-1-opus47.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/model-showdown-round-5-four-agents-build-the-same-feature/run-1-opus47.png" alt="Run 1 admin/images page"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Run 1's &lt;code&gt;/admin/images&lt;/code&gt; page. Directory cards, orphan-flagged styling, and a tight path-validated delete API behind the trash icons.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Run 2 — 6 new files, 687+/7-
&lt;/h3&gt;

&lt;p&gt;Kept the Settings card. Added an Images card next to it on &lt;code&gt;/admin&lt;/code&gt;. The /admin/images page was the cleanest of the four — tight TypeScript, no &lt;code&gt;as&lt;/code&gt; casts in the API route, proper type narrowing (&lt;code&gt;typeof body === "object" &amp;amp;&amp;amp; "path" in body&lt;/code&gt;) instead of forcing the compiler to trust it. The UI had the most visual polish: directory cards with file counts as a badge, hover states that matched the rest of the admin surface, a confirmation modal on delete that quoted the directory name back at you.&lt;/p&gt;

&lt;p&gt;Path validation was decent but not as rigorous as Run 1 — &lt;code&gt;startsWith("public/images/")&lt;/code&gt; plus a &lt;code&gt;..&lt;/code&gt; block, no segment-count check. Enough to stop the obvious cases. Not airtight against creative inputs.&lt;/p&gt;

&lt;p&gt;Two screenshots. Shipped a polished v1 and stopped.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/model-showdown-round-5-four-agents-build-the-same-feature/run-2-opus46.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/model-showdown-round-5-four-agents-build-the-same-feature/run-2-opus46.png" alt="Run 2 admin/images page"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Run 2 kept the Settings card and put Images next to it. Cleanest TypeScript of the four; smallest screenshot artifact.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Run 3 — 6 new files, 595+/0-
&lt;/h3&gt;

&lt;p&gt;Replaced the Settings placeholder. The /admin/images page started as a server component, then mid-task switched to a client-fetched implementation when Run 3 hit a dev-server timeout on the first integration test. That mid-stream pivot showed up cleanly in the commit history — &lt;code&gt;feat: add admin/images server-rendered&lt;/code&gt;, then two commits later, &lt;code&gt;refactor: move admin/images to client fetch (dev server hangs on FS scan)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Path validation matched Run 2's. The thing that made Run 3 interesting was the orphan-detection arc.&lt;/p&gt;

&lt;p&gt;The spec said "match directory name against post slugs to find orphans." Three of the four models took that literally — list directories, list slugs, set-difference, report what's left. Run 3 did that first, reported 8 orphaned directories, then &lt;em&gt;checked the result against reality&lt;/em&gt;. Looked at the actual file tree and noticed that one of the "orphaned" directories was &lt;code&gt;day-four/&lt;/code&gt;, and there's a published post with the slug &lt;code&gt;day-four-rss-analytics-syndication-and-loom&lt;/code&gt;. The directory isn't orphaned. It belongs to that post. The matching logic was wrong.&lt;/p&gt;

&lt;p&gt;Run 3 iterated three times: exact match → prefix match (does any slug start with this directory name?) → content-reference match (does any post body reference an image in this directory?). After the third pass, the orphan count went from 8 to 1 — and the one remaining was an actual orphan I'd been meaning to delete for weeks.&lt;/p&gt;

&lt;p&gt;Small thing in the diff. Big thing in engineering judgment. The other three models reported false-positive orphans with high confidence. Run 3 noticed its own answer was wrong and kept working.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/model-showdown-round-5-four-agents-build-the-same-feature/run-3-sonnet46.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/model-showdown-round-5-four-agents-build-the-same-feature/run-3-sonnet46.png" alt="Run 3 admin/images page"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Run 3's screenshot — the largest and most polished of the four. The orphan count in the header reads 1 instead of 8 because the matching logic had been corrected mid-task.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Run 4 — 7 new files, 607+/0-
&lt;/h3&gt;

&lt;p&gt;Kept the Settings card, added an Images card. The /admin/images page worked. Build passed. The directory listing rendered correctly.&lt;/p&gt;

&lt;p&gt;Two structural issues. First, the codebase ended up with two utility libraries — &lt;code&gt;images.ts&lt;/code&gt; and &lt;code&gt;imageUtils.ts&lt;/code&gt; — with overlapping responsibilities. The first pass put filesystem helpers in &lt;code&gt;images.ts&lt;/code&gt;, which got imported into a client component, which pulled &lt;code&gt;fs&lt;/code&gt; into the client bundle and broke the build. The fix added &lt;code&gt;imageUtils.ts&lt;/code&gt; for client-safe helpers and re-imported. The dead code in &lt;code&gt;images.ts&lt;/code&gt; was never cleaned up.&lt;/p&gt;

&lt;p&gt;Second, the screenshot. Run 4 ran &lt;code&gt;playwright screenshot&lt;/code&gt;, hit the same missing-system-libraries failure the other three runs hit (&lt;code&gt;libnspr4&lt;/code&gt;, &lt;code&gt;libpango-1.0-0&lt;/code&gt;, the headless Chromium kit), &lt;code&gt;sudo apt install&lt;/code&gt;-ed the dependencies — and then never retried the screenshot. Instead the PR description got a 184-line &lt;em&gt;markdown description&lt;/em&gt; of what the page would look like, in lieu of a PNG. The deps were installed. The retry never fired.&lt;/p&gt;

&lt;p&gt;Path validation was the weakest of the four — &lt;code&gt;startsWith&lt;/code&gt; on the user-supplied path, no normalization, no &lt;code&gt;..&lt;/code&gt; block. The class of weakness is that a path that looks like it's under &lt;code&gt;public/images/&lt;/code&gt; can still resolve elsewhere when the OS interprets it. I'm not going to spell out the exact bypass; the point is that a one-line &lt;code&gt;startsWith&lt;/code&gt; check is not a path validator, and Run 4 shipped one.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Run 4's "screenshot" is a 184-line markdown file. The opening:&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Page Description: &lt;code&gt;/admin/images&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overall Layout&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/admin/images&lt;/code&gt; page displays a dashboard-style view of all image directories with a neon brutalist design consistent with the existing admin theme.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Header Section&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the top:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Title&lt;/strong&gt;: &lt;code&gt;// Images&lt;/code&gt; in monospace font with primary color (cyan/teal)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stats bar&lt;/strong&gt; showing:

&lt;ul&gt;
&lt;li&gt;Total directories count&lt;/li&gt;
&lt;li&gt;Total files count&lt;/li&gt;
&lt;li&gt;Total size in human-readable format (MB/GB)&lt;/li&gt;
&lt;li&gt;Orphaned count (in warning yellow/orange color, only shown if &amp;gt; 0)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;…and 165 more lines of design notes.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Blind scoring
&lt;/h2&gt;

&lt;p&gt;Rubric, weights, and scores:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;th&gt;Run 1&lt;/th&gt;
&lt;th&gt;Run 2&lt;/th&gt;
&lt;th&gt;Run 3&lt;/th&gt;
&lt;th&gt;Run 4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Correctness&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;4.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;4.5&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;4.0&lt;/td&gt;
&lt;td&gt;3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code quality&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;4.5&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engineering judgment&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;4.5&lt;/td&gt;
&lt;td&gt;4.0&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope discipline&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;4.5&lt;/td&gt;
&lt;td&gt;4.5&lt;/td&gt;
&lt;td&gt;4.0&lt;/td&gt;
&lt;td&gt;3.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commit hygiene&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;4.5&lt;/td&gt;
&lt;td&gt;4.0&lt;/td&gt;
&lt;td&gt;4.5&lt;/td&gt;
&lt;td&gt;3.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Surprise&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;td&gt;4.0&lt;/td&gt;
&lt;td&gt;3.5&lt;/td&gt;
&lt;td&gt;5.0&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weighted total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.68&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.48&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.36&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.18&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Scoring notes I wrote during the blind pass, before the reveal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run 1&lt;/strong&gt; — "Most defensive of the four. The path validator is the kind of code I'd want to ship to production. Loses half a design point for being slightly less visually polished than Run 2."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run 2&lt;/strong&gt; — "Tightest TypeScript I've seen this week. Visual polish is the best of the four. Path validation is fine but not paranoid. Stopped at v1 — didn't iterate, didn't second-guess. Probably Sonnet."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run 3&lt;/strong&gt; — "Mid-task architecture pivot, three iterations on orphan detection, the only run that produced an honest orphan count. Took the longest. Most thoughtful. Probably Opus 4.6."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run 4&lt;/strong&gt; — "Two overlapping libraries, dead code left behind, weak path validation, fell back to a markdown description instead of a real screenshot. The dependency install was right there. The retry never came. Probably Qwen."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two guesses right (Run 1 = Opus 4.7, Run 4 = Qwen). Two guesses swapped. Run 2 was Sonnet 4.6. Run 3 was Opus 4.6. I had them reversed — but I had the &lt;em&gt;behavior&lt;/em&gt; right. I thought "polished, decisive, stopped at v1" was Sonnet, and it was. I thought "iterated three times until the answer was honest" was Opus, and it was. The guesses were wrong about which Opus, not about the disposition.&lt;/p&gt;

&lt;h2&gt;
  
  
  The reveal
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Headline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Opus 4.7&lt;/td&gt;
&lt;td&gt;4.68&lt;/td&gt;
&lt;td&gt;Strongest path validator, multi-status DELETE API, three screenshots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;4.48&lt;/td&gt;
&lt;td&gt;Tightest TypeScript, best visual polish, fastest to "done"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Opus 4.6&lt;/td&gt;
&lt;td&gt;4.36&lt;/td&gt;
&lt;td&gt;Only model that noticed the slug-prefix problem and iterated until orphan detection was honest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Qwen 3.5 35B-A3B&lt;/td&gt;
&lt;td&gt;3.18&lt;/td&gt;
&lt;td&gt;Missing screenshot, weakest path validation, architectural churn&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What surprised me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Sonnet beat Opus 4.6.&lt;/strong&gt; I didn't expect that. On previous bakeoffs Opus has been the model that goes deeper. Here, Sonnet's tighter implementation and faster decisive shipping outscored Opus's iteration. Two different success modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sonnet's mode&lt;/strong&gt;: get to a clean v1 fast, polish what's there, stop. Trust the spec.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opus 4.6's mode&lt;/strong&gt;: ship a first pass, look at the output, notice when it disagrees with reality, iterate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither is wrong. If the spec is precise and "ship the feature" is the success criterion, Sonnet's mode wins. If the spec is approximate and "produce a correct answer" is the success criterion, Opus's mode wins. On this task, Sonnet was polished enough that Opus's iteration premium didn't make up the gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opus 4.6's slug-prefix insight is the engineering moment of the bakeoff.&lt;/strong&gt; Three models took the spec literally and produced false-positive orphans. One model checked its work, noticed the discrepancy, and kept going until the answer was honest. The cost was time — Opus 4.6 took &lt;strong&gt;28.1 minutes, 3x longer than Opus 4.7's 9.2 minutes&lt;/strong&gt;, and 146 messages versus Opus 4.7's 84. The benefit was the only correct orphan count in the bunch. That's the trade-off, and on a real codebase I'd take it every time — but it's worth being honest that the iteration premium showed up in the bill as well as the clock.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen failed roughly where predicted.&lt;/strong&gt; Pre-launch I'd written down four likely failure modes: skip orphan detection, weak design system match, miss the screenshot, forget to push. Three of those landed at least partially — Qwen &lt;em&gt;did&lt;/em&gt; implement orphan detection, but did it naively, which is how the predicted weakness actually manifested; the design fit was rough; the screenshot was missed; the push went fine. The pattern wasn't where I expected, though. Qwen didn't fail at the planning level. It failed at the &lt;em&gt;retry&lt;/em&gt; level. Every concrete step was reasonable. What was missing was the loop — retry the screenshot after installing the deps, clean up the dead code after the refactor, question whether two utility libraries were one too many. That's the agentic gap, and it's narrower than a year ago but still visible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The screenshot step was the cleanest differentiator.&lt;/strong&gt; Same task, same workspace template, same Playwright MCP, same headless Chromium dependency stack. Three models installed the missing libraries and got real PNGs. One model installed the libraries and produced a markdown description instead. Same workspace, same tools, completely different outcomes. If you wanted to test agentic loop-closing in a single observable step, this would be it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two of four replaced the Settings placeholder; two kept it.&lt;/strong&gt; The spec allowed either. Both Opus runs replaced it; Sonnet and Qwen kept it alongside the new Images card. Not a quality signal — a reading of the spec — but interesting that the two Opus variants made the same call independently, and the two non-Opus models made the same opposite call.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the bill says
&lt;/h2&gt;

&lt;p&gt;The rubric scores were one half of the bakeoff. The other half lives in Coder's chat-cost API. Coder's OSS deployment exposes &lt;code&gt;/api/experimental/chats/cost/{user}/summary&lt;/code&gt; — an experimental endpoint that returns per-chat input tokens, output tokens, cache reads, cache writes, message counts, and runtime. (Coder Premium has a fuller "AI Bridge" cost product; on OSS, the experimental chats endpoint is the equivalent and gives you everything you need to do this analysis.)&lt;/p&gt;

&lt;p&gt;Querying per-chat instead of per-model matters. My first pass aggregated by model and the Opus 4.7 totals looked enormous — until I realized the rollup had silently combined two chats running on the same model: this judging thread plus the actual Opus 4.7 contestant run. After identifying the contestant by its chat ID prefix (&lt;code&gt;2c4e8f98&lt;/code&gt;) and isolating to that session, the numbers got honest. &lt;strong&gt;The lesson: for clean bakeoff stats, query at the chat-id level, not by model.&lt;/strong&gt; Two sessions on the same model will silently pool.&lt;/p&gt;

&lt;p&gt;The finding the dashboard didn't surface: Opus 4.7 won the rubric (4.68), but weighted by cost-per-rubric-point at Anthropic list prices, Sonnet 4.6 wins decisively. &lt;strong&gt;$0.37 per rubric point for Sonnet vs $3.87 for Opus 4.7 and $3.63 for Opus 4.6.&lt;/strong&gt; Sonnet was the only economically sensible choice for a task this size.&lt;/p&gt;

&lt;p&gt;The Qwen line is the other one to sit with. Qwen finished in &lt;strong&gt;6.4 minutes&lt;/strong&gt; — faster than every Claude run — and produced the lowest-scoring artifact. Locally hosted inference is genuinely faster per turn (~4 seconds vs 6–13 seconds for the Claude runs); the shortfall was per-turn productivity, not latency. A longer Qwen run might have closed the gap. A 6-minute Qwen run did not.&lt;/p&gt;

&lt;p&gt;One honest caveat on the cost numbers: this OSS Coder deployment doesn't have model cost config set, so the dashboard reported $0 across the board. The costs in the table below are list-price estimates calculated from the raw token counts. Production Anthropic billing would match closely modulo any rate plan.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Cache R&lt;/th&gt;
&lt;th&gt;Cache W&lt;/th&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;Messages&lt;/th&gt;
&lt;th&gt;Est Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4.7&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;td&gt;32,114&lt;/td&gt;
&lt;td&gt;4,772,142&lt;/td&gt;
&lt;td&gt;454,581&lt;/td&gt;
&lt;td&gt;9.2 min&lt;/td&gt;
&lt;td&gt;84&lt;/td&gt;
&lt;td&gt;$18.09&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4.6&lt;/td&gt;
&lt;td&gt;14,671&lt;/td&gt;
&lt;td&gt;45,137&lt;/td&gt;
&lt;td&gt;6,493,552&lt;/td&gt;
&lt;td&gt;132,707&lt;/td&gt;
&lt;td&gt;28.1 min&lt;/td&gt;
&lt;td&gt;146&lt;/td&gt;
&lt;td&gt;$15.83&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;td&gt;25,935&lt;/td&gt;
&lt;td&gt;3,097,881&lt;/td&gt;
&lt;td&gt;85,057&lt;/td&gt;
&lt;td&gt;15.2 min&lt;/td&gt;
&lt;td&gt;106&lt;/td&gt;
&lt;td&gt;$1.64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5 35B-A3B&lt;/td&gt;
&lt;td&gt;55,615&lt;/td&gt;
&lt;td&gt;23,743&lt;/td&gt;
&lt;td&gt;4,253,874&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;6.4 min&lt;/td&gt;
&lt;td&gt;88&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cost-efficiency, $/rubric point (lower is better): Opus 4.7 &lt;strong&gt;$3.87&lt;/strong&gt;, Opus 4.6 &lt;strong&gt;$3.63&lt;/strong&gt;, Sonnet 4.6 &lt;strong&gt;$0.37&lt;/strong&gt;, Qwen &lt;strong&gt;$0.00&lt;/strong&gt;. Pricing: Opus $15/M in, $75/M out, $1.50/M cache read, $18.75/M cache write; Sonnet $3/M in, $15/M out, $0.30/M cache read, $3.75/M cache write; Qwen runs locally on the RTX 5090.&lt;/p&gt;

&lt;h2&gt;
  
  
  By the Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4 models&lt;/strong&gt; tested in isolated Coder Agents sessions — Opus 4.7, Opus 4.6, Sonnet 4.6, Qwen 3.5 35B-A3B&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 branches&lt;/strong&gt; pushed (&lt;code&gt;feature/image-management-run-1&lt;/code&gt; through &lt;code&gt;run-4&lt;/code&gt;); &lt;strong&gt;0 PRs&lt;/strong&gt; opened to preserve isolation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4/4 builds passed&lt;/strong&gt; &lt;code&gt;npm run build&lt;/code&gt; on Node 20.20.2 against the engine baseline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3/4 screenshots succeeded&lt;/strong&gt; — Qwen installed the headless-browser deps but never retried the capture; fell back to a markdown description of the page&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1/4 models produced an honest orphan count&lt;/strong&gt; (Opus 4.6, 1 real orphan); the other three reported &lt;strong&gt;8 false-positive orphans&lt;/strong&gt; from naive slug matching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2/4 blind identity guesses&lt;/strong&gt; correct (Opus 4.7, Qwen); the two Claude behavioral reads were right but attributed to the wrong Opus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 pre-launch fairness fixes&lt;/strong&gt; shipped before the bakeoff could run — Node 20 in the workspace image, a corrected system-instructions block, and the prompt-poisoning catch that anonymized the branches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 repos&lt;/strong&gt; touched to ship the fairness work — &lt;code&gt;coder-templates&lt;/code&gt; (Dockerfile + system instructions) and the bakeoff prompt iteration in the planning thread&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~640 lines&lt;/strong&gt; of code added per implementation on average (range 595–687); roughly &lt;strong&gt;6–8 new files&lt;/strong&gt; per branch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 new routes&lt;/strong&gt; per implementation — an admin page and an API route with a destructive verb&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;84 / 146 / 106 / 88 messages&lt;/strong&gt; sent in the four chat sessions (Opus 4.7 / Opus 4.6 / Sonnet 4.6 / Qwen); &lt;strong&gt;9.2 / 28.1 / 15.2 / 6.4 minutes&lt;/strong&gt; of wall-clock runtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~$35.56 total bakeoff cost&lt;/strong&gt; at Anthropic list prices — about a fancy dinner for four independent attempts at a real feature with judgable artifacts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$0.37 vs $3.87 per rubric point&lt;/strong&gt; — Sonnet 4.6's cost-efficiency vs Opus 4.7's. Ten times cheaper for slightly higher quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 result I didn't expect&lt;/strong&gt;: Sonnet beat Opus 4.6 on rubric (4.48 vs 4.36) and beat &lt;em&gt;both&lt;/em&gt; Opus models by 10x on cost-efficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 follow-up filed&lt;/strong&gt; in &lt;code&gt;content/TODO.md&lt;/code&gt;: build &lt;code&gt;scripts/bakeoff-stats.sh&lt;/code&gt; so the next round's per-chat aggregation is one command instead of a manual jq exercise&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>modelshowdown</category>
      <category>agents</category>
      <category>vibecoding</category>
    </item>
    <item>
      <title>Installing OpenClaw on the Homelab</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Sat, 16 May 2026 16:04:16 +0000</pubDate>
      <link>https://dev.to/carryologist/installing-openclaw-on-the-homelab-1bf</link>
      <guid>https://dev.to/carryologist/installing-openclaw-on-the-homelab-1bf</guid>
      <description>&lt;p&gt;I've been running Coder workspaces on my homelab for a while — Qwen3.5-35B on llama.cpp, RTX 5090, the whole stack. But the AI assistants were all inside terminal sessions. I wanted something I could message from my phone, from Discord, from anywhere. Something that talks to the local LLM on my own hardware and doesn't phone home to anyone's cloud.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; is that thing. It's an open-source personal AI assistant with 367K GitHub stars, a plugin ecosystem, and connectors for every chat platform you can name. The pitch: "Your own personal AI assistant. Any OS. Any Platform."&lt;/p&gt;

&lt;p&gt;Here's how I got it running on my Linux workstation, wired to a local Qwen3.5-35B via llama.cpp, talking through Discord. It took an afternoon. It should have taken 30 minutes. The difference was five config mistakes that produced zero useful error messages.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardware
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;AMD Ryzen 9 9950X3D — 16 cores / 32 threads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;64 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;NVIDIA RTX 5090 — 32 GB VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;Qwen3.5-35B-A3B via llama.cpp on port 8080&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;nomic-embed-text-v1.5 via llama.cpp on port 8084&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The LLM runs entirely on the GPU. No RAM impact on anything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Installation: One Curl
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://openclaw.ai/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The script detects Ubuntu, installs Node if needed, drops the &lt;code&gt;openclaw&lt;/code&gt; binary, and launches an onboarding wizard. The whole thing took about 90 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Pointing at the Local LLM
&lt;/h2&gt;

&lt;p&gt;The wizard asks for a model provider. The list has Anthropic, Google, OpenAI, and two dozen cloud services. Scroll past all of them and pick &lt;strong&gt;Custom Provider&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/installing-openclaw-on-the-homelab/01-wizard-model-provider.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/installing-openclaw-on-the-homelab/01-wizard-model-provider.png" alt="OpenClaw wizard showing the model/auth provider selection screen"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The wizard needs three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base URL&lt;/strong&gt;: &lt;code&gt;http://localhost:8080/v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API key&lt;/strong&gt;: Anything — llama-server doesn't check it, but the field can't be empty&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model ID&lt;/strong&gt;: It auto-detects from the &lt;code&gt;/v1/models&lt;/code&gt; endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I had two llama-server instances running and had to figure out which was which:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8080/v1/models | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import sys,json; [print(m['id']) for m in json.load(sys.stdin)['data']]"&lt;/span&gt;
&lt;span class="c"&gt;# Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf&lt;/span&gt;

curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8084/v1/models | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import sys,json; [print(m['id']) for m in json.load(sys.stdin)['data']]"&lt;/span&gt;
&lt;span class="c"&gt;# nomic-embed-text-v1.5.f16.gguf&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Port 8080 is the chat model. Port 8084 is embeddings. OpenClaw wants the chat model.&lt;/p&gt;

&lt;p&gt;The wizard verified the connection and asked for an &lt;strong&gt;Endpoint ID&lt;/strong&gt; — just a label for the config. I accepted the default &lt;code&gt;custom-localhost-8080&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/installing-openclaw-on-the-homelab/02-wizard-endpoint-id.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/installing-openclaw-on-the-homelab/02-wizard-endpoint-id.png" alt="OpenClaw wizard showing the endpoint configuration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use localhost, not your Tailscale IP.&lt;/strong&gt; OpenClaw runs on the same machine as llama-server. Routing through Tailscale adds latency and creates a dependency on the Tailscale daemon being up for purely local traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Setting Up the Discord Bot
&lt;/h2&gt;

&lt;p&gt;The wizard asks which chat channel to connect. I picked &lt;strong&gt;Discord&lt;/strong&gt; — it's the most popular OpenClaw channel, which means the most community support and troubleshooting threads.&lt;/p&gt;

&lt;p&gt;Creating the Discord bot takes five steps in the &lt;a href="https://discord.com/developers/applications" rel="noopener noreferrer"&gt;Developer Portal&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Create the application.&lt;/strong&gt; Click "Build a Bot" on the welcome screen, then "New Application." I named mine OpenClaw.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/installing-openclaw-on-the-homelab/03-discord-developer-portal.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/installing-openclaw-on-the-homelab/03-discord-developer-portal.png" alt="Discord Developer Portal welcome screen"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Get the bot token.&lt;/strong&gt; Go to the Bot tab, click "Reset Token," copy the token. Paste it into the OpenClaw wizard when prompted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Enable Message Content Intent.&lt;/strong&gt; Same Bot tab, scroll to "Privileged Gateway Intents," toggle on &lt;strong&gt;Message Content Intent&lt;/strong&gt;. Without this, the bot can see that messages exist but can't read what they say.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Invite the bot to your server.&lt;/strong&gt; The OAuth2 URL Generator in the Developer Portal can be finicky. I skipped it and built the invite URL manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://discord.com/oauth2/authorize?client_id=YOUR_APP_ID&amp;amp;scope=bot&amp;amp;permissions=66560
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Permission &lt;code&gt;66560&lt;/code&gt; grants Send Messages + Read Message History. Replace &lt;code&gt;YOUR_APP_ID&lt;/code&gt; with the Application ID from the General Information tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/installing-openclaw-on-the-homelab/05-discord-oauth2-scopes.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/installing-openclaw-on-the-homelab/05-discord-oauth2-scopes.png" alt="Discord OAuth2 page showing scopes selection"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Create a server.&lt;/strong&gt; I didn't have a Discord server. The invite page showed "No items to show." Had to go back to Discord, click the &lt;code&gt;+&lt;/code&gt; button in the sidebar, create a new server called HomeLabOpenClaw, then revisit the invite URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/installing-openclaw-on-the-homelab/06-discord-bot-invite-no-servers.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/installing-openclaw-on-the-homelab/06-discord-bot-invite-no-servers.png" alt="Discord bot invite page showing no servers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Finishing the Wizard
&lt;/h2&gt;

&lt;p&gt;Back in the terminal, the wizard asked a few more questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Channel access&lt;/strong&gt;: I picked "Open (allow all channels)" — it's my personal server, no reason to maintain an allowlist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search provider&lt;/strong&gt;: DuckDuckGo — free, no API key, good enough for a first run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills&lt;/strong&gt;: Said yes, let it enable the 10 eligible ones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hooks&lt;/strong&gt;: Skipped — not essential for getting started&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hatch&lt;/strong&gt;: "Hatch in Terminal" — starts the gateway right there so you can see the logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="/images/installing-openclaw-on-the-homelab/11-wizard-hatch.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/installing-openclaw-on-the-homelab/11-wizard-hatch.png" alt="OpenClaw wizard hatch screen"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The gateway started, the Discord plugin connected, and the bot appeared online in my server.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The Pairing Dance
&lt;/h2&gt;

&lt;p&gt;I messaged the bot and got: "OpenClaw: access not configured." With a pairing code.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/installing-openclaw-on-the-homelab/12-discord-dm-pairing.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/installing-openclaw-on-the-homelab/12-discord-dm-pairing.png" alt="Discord DM showing pairing code from carrybot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw's DM policy defaults to &lt;code&gt;pairing&lt;/code&gt; — unknown senders get a code instead of a response. You approve them from the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw pairing approve discord YOUR_PAIRING_CODE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, DMs worked perfectly. The bot responded, the 5090 spun up, responses came back. Great.&lt;/p&gt;

&lt;p&gt;Then I tried a server channel and everything broke.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. The Silent Channel Problem
&lt;/h2&gt;

&lt;p&gt;For the next two hours, this was my experience: I'd &lt;code&gt;@carrybot&lt;/code&gt; in a server channel, the bot would react with an emoji, show "typing..." for a few seconds, and then... nothing. No response. No error in Discord. The 5090 was clearly working — I could hear the fans.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/installing-openclaw-on-the-homelab/13-discord-channel-not-responding.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/installing-openclaw-on-the-homelab/13-discord-channel-not-responding.png" alt="Discord channel showing @carrybot messages with no responses"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DMs worked. Channels didn't. Here's every wrong turn I took and the actual fix.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrong Turn 1: "It's a permissions issue"
&lt;/h3&gt;

&lt;p&gt;I checked the bot's Discord role permissions. Almost nothing was toggled on. I enabled Send Messages, Read Message History, View Channels. Restarted the gateway. Still nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: The permissions were wrong and needed fixing, but they weren't the root cause. The bot was already &lt;em&gt;generating&lt;/em&gt; responses — it just wasn't &lt;em&gt;posting&lt;/em&gt; them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrong Turn 2: "It's a context window issue"
&lt;/h3&gt;

&lt;p&gt;The bot occasionally showed this error:&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/installing-openclaw-on-the-homelab/15-context-limit-exceeded.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/installing-openclaw-on-the-homelab/15-context-limit-exceeded.png" alt="Context limit exceeded error in Discord"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The OpenClaw wizard had set &lt;code&gt;contextWindow: 4000&lt;/code&gt; and &lt;code&gt;maxTokens: 4096&lt;/code&gt; in the model config. My llama-server has a 131K context window. The wizard didn't auto-detect this from the Custom Provider endpoint.&lt;/p&gt;

&lt;p&gt;I edited &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt; and changed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"contextWindow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;131072&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;81920&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;contextWindow: 131072&lt;/code&gt; matches llama-server's &lt;code&gt;--ctx-size 131072&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;maxTokens: 81920&lt;/code&gt; matches llama-server's &lt;code&gt;-n 81920&lt;/code&gt; (max output tokens)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reasoning: true&lt;/code&gt; because Qwen3.5 runs with &lt;code&gt;--reasoning-budget 8192&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This fixed the context errors, but channels still didn't work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrong Turn 3: "It's the memory plugin"
&lt;/h3&gt;

&lt;p&gt;The logs showed &lt;code&gt;tool:memory_search:started&lt;/code&gt; hanging indefinitely. Qwen3.5 kept trying to call a &lt;code&gt;memory_search&lt;/code&gt; tool before responding, and it never completed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw config &lt;span class="nb"&gt;set &lt;/span&gt;plugins.entries.memory-core.enabled &lt;span class="nb"&gt;false
&lt;/span&gt;openclaw gateway restart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This fixed the tool-call hangs in DMs. Channels still didn't work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrong Turn 4: "It's a mention detection issue"
&lt;/h3&gt;

&lt;p&gt;Early on, I was typing &lt;code&gt;@OpenClaw&lt;/code&gt; in channels. The logs showed &lt;code&gt;reason: "no-mention"&lt;/code&gt; — the bot is mention-gated in group chats and I was mentioning the wrong name. The Discord application is "OpenClaw" but the bot username is "carrybot" (I renamed it in the Developer Portal).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You have to use the actual Discord mention&lt;/strong&gt; — type &lt;code&gt;@&lt;/code&gt; and select the bot from the autocomplete. Typing &lt;code&gt;@carrybot&lt;/code&gt; as plain text doesn't create a real mention.&lt;/p&gt;

&lt;p&gt;This got the bot to actually &lt;em&gt;process&lt;/em&gt; channel messages. But it still wasn't responding.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Actual Fix: &lt;code&gt;visibleReplies&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;After two hours, I found it. During the wizard's &lt;code&gt;openclaw doctor&lt;/code&gt; step, it had auto-applied a config change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"groupChat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"visibleReplies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"message_tool"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells OpenClaw to use the &lt;code&gt;message&lt;/code&gt; tool for posting replies in group chats / server channels. But the &lt;code&gt;message&lt;/code&gt; tool wasn't available — I'd disabled &lt;code&gt;memory-core&lt;/code&gt; and the tool policy didn't include it. So the bot would generate a perfect response, try to send it via a tool that doesn't exist, and silently fail.&lt;/p&gt;

&lt;p&gt;The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw config &lt;span class="nb"&gt;set &lt;/span&gt;messages.groupChat.visibleReplies &lt;span class="s2"&gt;"automatic"&lt;/span&gt;
openclaw gateway restart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One config key. Two hours of debugging. Zero error messages in the logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. The Working Config
&lt;/h2&gt;

&lt;p&gt;Here's the final &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt; model section that actually works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"qwen-local"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"baseUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8080/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai-completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sk-none"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"contextWindow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;131072&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"maxTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;81920&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the critical non-obvious settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"groupChat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"visibleReplies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"automatic"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"plugins"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"entries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"memory-core"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"defaults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"compaction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"reserveTokensFloor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;40000&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  8. Making It Stick
&lt;/h2&gt;

&lt;p&gt;Install the systemd service so the gateway survives reboots:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw gateway &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set yourself as the command owner so you can run privileged commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw config &lt;span class="nb"&gt;set &lt;/span&gt;commands.ownerAllowFrom &lt;span class="s1"&gt;'["discord:YOUR_DISCORD_USER_ID"]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw &lt;span class="nt"&gt;--version&lt;/span&gt;          &lt;span class="c"&gt;# confirm CLI&lt;/span&gt;
openclaw doctor             &lt;span class="c"&gt;# check for config issues&lt;/span&gt;
openclaw gateway status     &lt;span class="c"&gt;# verify gateway is running&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The wizard's defaults are for cloud providers, not local LLMs.&lt;/strong&gt; &lt;code&gt;contextWindow: 4000&lt;/code&gt; is a safe default for API providers that charge per token. It's a crippling default for a local model with 131K context. If you're running a Custom Provider, you &lt;em&gt;must&lt;/em&gt; manually set &lt;code&gt;contextWindow&lt;/code&gt; and &lt;code&gt;maxTokens&lt;/code&gt; to match your server's actual limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;visibleReplies: "message_tool"&lt;/code&gt; is a trap.&lt;/strong&gt; The doctor command auto-applies this "recommended" setting, but it depends on the message tool being available. If you're running a stripped-down config without all the default tools, your bot will silently swallow every group chat reply. The symptom is &lt;em&gt;perfect&lt;/em&gt; — the bot reacts, types, generates a response (you can verify in the session files), and then just... doesn't post it. No error. No log line. Nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discord bot setup has more steps than it should.&lt;/strong&gt; Between the Developer Portal, the OAuth2 scopes, the Privileged Gateway Intents, the server creation, the role permissions, and the correct mention format — there are at least six places where a single missed toggle produces a silent failure. Document every step. Check every toggle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session files are your debugging lifeline.&lt;/strong&gt; When the logs show nothing, check &lt;code&gt;~/.openclaw/agents/main/sessions/*.jsonl&lt;/code&gt;. The session file showed me the bot was generating perfect responses that were never delivered. Without that, I would have assumed the LLM was broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with DMs, graduate to channels.&lt;/strong&gt; DMs have a simpler code path — no mention detection, no group chat reply policy, no channel permissions. Get DMs working first, then debug channels as a separate problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Files Changed
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;On the workstation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt; — model config, context window, reply policy, plugin settings, owner config&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Discord:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Created Discord application "OpenClaw" with bot user "carrybot"&lt;/li&gt;
&lt;li&gt;Created Discord server "HomeLabOpenClaw"&lt;/li&gt;
&lt;li&gt;Enabled Message Content Intent, configured role permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Systemd:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;openclaw-gateway.service&lt;/code&gt; — installed via &lt;code&gt;openclaw gateway install&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The bot works, but it's running Qwen3.5-35B with &lt;code&gt;memory-core&lt;/code&gt; disabled and no skills beyond the basics. Next steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Re-enable memory.&lt;/strong&gt; Figure out why &lt;code&gt;memory_search&lt;/code&gt; hangs with Qwen3.5's tool call format and fix it — memory is one of OpenClaw's killer features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add skills.&lt;/strong&gt; 43 skills were blocked by missing requirements. Install the useful ones — &lt;code&gt;session-logs&lt;/code&gt;, &lt;code&gt;nano-pdf&lt;/code&gt;, &lt;code&gt;video-frames&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try a different local model.&lt;/strong&gt; Qwen3.5 works but its tool calling may not be fully compatible with OpenClaw's expected format. Worth testing Gemma 4 or another model with native tool support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wire up Tailscale access.&lt;/strong&gt; The gateway listens on localhost:18789. Exposing it on the tailnet means I can hit the dashboard from any device without a Cloudflare tunnel.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  By the Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1 curl command&lt;/strong&gt; to install OpenClaw&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;131,072 tokens&lt;/strong&gt; — the context window the wizard set to 4,000&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;81,920 tokens&lt;/strong&gt; — max output, matching llama-server's &lt;code&gt;-n&lt;/code&gt; flag&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2 hours&lt;/strong&gt; debugging silent channel failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 config key&lt;/strong&gt; (&lt;code&gt;visibleReplies: "automatic"&lt;/code&gt;) that fixed everything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6 Discord setup steps&lt;/strong&gt; where a missed toggle means silent failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0 cloud dependencies&lt;/strong&gt; — fully local LLM, self-hosted gateway&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~500 MB&lt;/strong&gt; RAM footprint for the OpenClaw gateway (Node.js process)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;18 screenshots&lt;/strong&gt; taken during the debug session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 sensitive screenshots&lt;/strong&gt; deleted (contained tokens/credentials)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0 useful error messages&lt;/strong&gt; for the &lt;code&gt;visibleReplies&lt;/code&gt; bug&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>homelab</category>
      <category>agents</category>
      <category>howto</category>
    </item>
    <item>
      <title>Thursday Thoughts: The Models We Can't Run</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Thu, 14 May 2026 15:59:32 +0000</pubDate>
      <link>https://dev.to/carryologist/thursday-thoughts-the-models-we-cant-run-57c3</link>
      <guid>https://dev.to/carryologist/thursday-thoughts-the-models-we-cant-run-57c3</guid>
      <description>&lt;p&gt;Every week or two, a model drops that makes the local AI community lose its collective mind. This week it was three at once: &lt;strong&gt;DeepSeek V4-Pro&lt;/strong&gt;, &lt;strong&gt;DeepSeek V4-Flash&lt;/strong&gt;, and &lt;strong&gt;Zyphra ZAYA1-8B&lt;/strong&gt;. All three are genuinely impressive. All three are models I wanted to benchmark on our homelab. And after doing the research, I'm not testing any of them.&lt;/p&gt;

&lt;p&gt;Not because I don't want to. Because I physically can't — or can't yet.&lt;/p&gt;

&lt;p&gt;This post isn't a benchmark. It's the research that happens &lt;em&gt;before&lt;/em&gt; the benchmark, where you figure out which models are even candidates for your hardware. If you're building or considering a local inference setup, the reasons these three models don't work are more instructive than any leaderboard score.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rig
&lt;/h2&gt;

&lt;p&gt;Quick refresher on what we're working with:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;NVIDIA RTX 5090 — 32 GB VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;64 GB DDR5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;AMD Ryzen 9 9950X3D — 16 cores / 32 threads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk&lt;/td&gt;
&lt;td&gt;1.8 TB NVMe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference&lt;/td&gt;
&lt;td&gt;llama.cpp on the GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is a strong homelab by any measure. We run Qwen 3.5 35B-A3B daily for agentic coding at 200+ tok/s. In previous benchmark rounds, Devstral, Codestral, Gemma 4, and DeepSeek R1 14B have all run comfortably. The 5090 is the sweet spot for 20B–35B models.&lt;/p&gt;

&lt;p&gt;But the new generation of models isn't playing in the 20B–35B range anymore.&lt;/p&gt;

&lt;h2&gt;
  
  
  DeepSeek V4-Pro: Too Big for Anything Short of a Data Center
&lt;/h2&gt;

&lt;p&gt;V4-Pro is DeepSeek's new flagship. The numbers are staggering:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total parameters&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.6 trillion&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Activated per token&lt;/td&gt;
&lt;td&gt;49B (MoE, 256 experts, top-6 routing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model weights (FP4+FP8 mixed)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;805 GB on disk&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That 805 GB number is the wall. Our entire system — 32 GB VRAM plus 64 GB RAM — gives us 96 GB of addressable memory. The model is &lt;strong&gt;8.4x larger than our total memory&lt;/strong&gt;. There are no GGUF quants available, and nobody is making them because there's no consumer hardware that could run them meaningfully.&lt;/p&gt;

&lt;p&gt;For context, we tried running Kimi K2.6 (a similarly-sized 1T MoE model) a few weeks ago. It "ran" at &lt;strong&gt;less than 1 token per second&lt;/strong&gt; — the weights spilled out of VRAM into RAM, and we hit the DDR5 memory bandwidth ceiling (~80 GB/s vs the 5090's ~1.8 TB/s). V4-Pro at 1.6T would be even slower.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: Cloud API only. DeepSeek serves it at &lt;a href="https://api.deepseek.com" rel="noopener noreferrer"&gt;api.deepseek.com&lt;/a&gt; and we've added it to our benchmark rig as a cloud provider alongside Anthropic.&lt;/p&gt;

&lt;h2&gt;
  
  
  DeepSeek V4-Flash: Close, But Not Close Enough
&lt;/h2&gt;

&lt;p&gt;V4-Flash is V4-Pro's smaller sibling and the one I was actually hopeful about:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total parameters&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;284B&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Activated per token&lt;/td&gt;
&lt;td&gt;13B (MoE, 256 experts, top-6 routing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smallest GGUF quant (Q2_K)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;96.2 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Most popular quant (Q4_K_M)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;160.2 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Only 13B activated per token sounds incredible — that's smaller than our DeepSeek R1 14B. But MoE models need all their expert weights resident in memory even though only a fraction fires per token. That 284B of total parameters has to be somewhere accessible.&lt;/p&gt;

&lt;p&gt;The math doesn't work:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Fits in VRAM + RAM (96 GB)?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q2_K&lt;/td&gt;
&lt;td&gt;96.2 GB&lt;/td&gt;
&lt;td&gt;Barely — 0.2 GB over before KV cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q3_K_M&lt;/td&gt;
&lt;td&gt;126.2 GB&lt;/td&gt;
&lt;td&gt;No — needs 30 GB disk offload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;160.2 GB&lt;/td&gt;
&lt;td&gt;No — needs 64 GB disk offload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FP4-FP8 native&lt;/td&gt;
&lt;td&gt;145.4 GB&lt;/td&gt;
&lt;td&gt;No — needs 49 GB disk offload&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;There &lt;em&gt;were&lt;/em&gt; IQ1_S (54 GB) and IQ2_M (87 GB) quants that would have fit — but the community removed them. When quant maintainers pull their own files, that's a strong signal the output quality was garbage.&lt;/p&gt;

&lt;p&gt;And even if one of these squeaked into memory, there's a bigger problem: &lt;strong&gt;llama.cpp doesn't support the DeepSeek V4 architecture yet&lt;/strong&gt;. All existing GGUFs require custom forks. The mainline support PRs are still open and under active debate. You'd be building from an untested branch to run a model that barely fits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: Not ready. We've added V4-Flash to the benchmark as a cloud API model for now. When llama.cpp merges V4 support &lt;em&gt;and&lt;/em&gt; a viable sub-90 GB quant exists, we'll revisit.&lt;/p&gt;

&lt;h2&gt;
  
  
  ZAYA1-8B: The Right Size, the Wrong Stack
&lt;/h2&gt;

&lt;p&gt;This is the one that hurts the most, because on paper it's a perfect homelab model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total parameters&lt;/td&gt;
&lt;td&gt;8.4B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Activated per token&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;760M&lt;/strong&gt; (MoE, 16 experts, top-1 routing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VRAM at bf16&lt;/td&gt;
&lt;td&gt;~17 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;128K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AIME '26 score&lt;/td&gt;
&lt;td&gt;89.1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;8.4 billion parameters. 17 GB in bf16. Fits trivially on the 5090 with room to spare. Punches absurdly above its weight on reasoning benchmarks — 89.1 on AIME '26 is competitive with models 10–15x its size.&lt;/p&gt;

&lt;p&gt;So what's the problem? &lt;strong&gt;Architecture.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ZAYA1 uses CCA (Cross-Channel Attention) — Zyphra's novel hybrid of Mamba-style recurrence and traditional attention. It's not standard Mamba2. It's not standard transformer attention. It's a fundamentally new layer type with small 1D convolutions, custom Q/K projections, and learned residual scaling.&lt;/p&gt;

&lt;p&gt;llama.cpp has no support for this architecture. There's an &lt;a href="https://github.com/ggml-org/llama.cpp/issues/22776" rel="noopener noreferrer"&gt;open feature request&lt;/a&gt; with nothing but +1 comments. No GGUF quants exist because there's nothing to run them on. Even Zyphra's older Zamba2 architecture (&lt;a href="https://github.com/ggml-org/llama.cpp/issues/21412" rel="noopener noreferrer"&gt;#21412&lt;/a&gt;) remains unimplemented.&lt;/p&gt;

&lt;p&gt;The only way to run ZAYA1 today is through Zyphra's custom vLLM fork — a completely different serving stack from our llama.cpp setup. It would work on the 5090, but it means standing up and maintaining a parallel inference pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: On the to-do list. When llama.cpp adds CCA support or we carve out time to set up vLLM as a second serving backend, this is the first model we'll test.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Runs on a 32 GB GPU
&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable reality of local inference in mid-2026: the models generating the most hype are the ones you can't run.&lt;/p&gt;

&lt;p&gt;The models that &lt;em&gt;fly&lt;/em&gt; on a 32 GB card — where you get 100+ tok/s and useful agentic performance — are capped at roughly &lt;strong&gt;24–28 GB of weights&lt;/strong&gt; (leaving room for KV cache). That means:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;What Fits&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dense models&lt;/td&gt;
&lt;td&gt;Up to ~14B at Q8, ~20B at Q6, ~27B at Q4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MoE models&lt;/td&gt;
&lt;td&gt;Up to ~35B total at Q4 (e.g. Qwen 3.5 35B-A3B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What doesn't&lt;/td&gt;
&lt;td&gt;Anything over ~28 GB of quantized weights&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Our current daily driver — Qwen 3.5 35B-A3B at Q4_K_XL — is 22 GB of weights with 3B activated per token, running at 200+ tok/s. It's fast, it's good, and it's approximately the ceiling of what a single 5090 can do at interactive speeds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Walls
&lt;/h2&gt;

&lt;p&gt;Each of these models hits a different wall, and that's what makes this exercise useful:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;V4-Pro&lt;/strong&gt; — pure size. 805 GB of weights. No amount of quantization or clever offloading helps when the model is 8x your total memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;V4-Flash&lt;/strong&gt; — the quantization gap. The model &lt;em&gt;almost&lt;/em&gt; fits at extreme compression, but the quality degrades too far. We're in a window where the model exists but the tooling hasn't caught up to make it practical on consumer hardware.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ZAYA1&lt;/strong&gt; — architecture support. The model fits perfectly. The hardware is more than enough. But the inference engine doesn't speak the language yet.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're evaluating models for a homelab or edge deployment, these are the three questions to ask before you even think about benchmarks: Is it small enough? Is the quantization viable? Does my inference stack support it?&lt;/p&gt;

&lt;h2&gt;
  
  
  By the Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;805 GB&lt;/strong&gt; — DeepSeek V4-Pro model weight size. 8.4x our total system memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;96.2 GB&lt;/strong&gt; — smallest V4-Flash GGUF quant. Still 0.2 GB over our VRAM + RAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;17 GB&lt;/strong&gt; — ZAYA1-8B at bf16. Fits trivially, runs nowhere (yet).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;22 GB&lt;/strong&gt; — our actual daily driver (Qwen 3.5 35B-A3B at Q4_K_XL). The real ceiling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt; — number of these three models with merged llama.cpp support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2&lt;/strong&gt; — models we added to the benchmark as cloud API endpoints instead (V4-Flash, V4-Pro).&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>homelab</category>
    </item>
    <item>
      <title>Model Showdown Round 4: Opus vs Qwen — Writers, Not Coders</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Mon, 11 May 2026 15:17:29 +0000</pubDate>
      <link>https://dev.to/carryologist/model-showdown-round-4-opus-vs-qwen-writers-not-coders-3b0o</link>
      <guid>https://dev.to/carryologist/model-showdown-round-4-opus-vs-qwen-writers-not-coders-3b0o</guid>
      <description>&lt;p&gt;Two models. Same prompt. Same five fodder files. Same 27 published posts to check for redundancy. Same writing style guide.&lt;/p&gt;

&lt;p&gt;One chose the Dev.to syndication saga. The other chose the tag taxonomy overhaul. There was zero overlap in fodder selection, topic, or angle.&lt;/p&gt;

&lt;p&gt;This is the story of what happened — and what the differences reveal about how models approach the same creative task.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I've been running this blog with AI agents as the primary writing tool since day one. Every post on vibescoder.dev was drafted by Claude Opus 4.6 through Coder Agents — until now. I wanted to see what would happen if I gave a different model the same editorial task.&lt;/p&gt;

&lt;p&gt;The prompt was identical for both sessions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Let's look at all of our fodder files and see if there is a themed post we can do. Either a standalone post or one that threads a few fodders together. Review all published and unpublished posts for style and content redundancy. Propose a draft when you're ready.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Model A&lt;/strong&gt;: Claude Opus 4.6 (cloud, via Coder Agents)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model B&lt;/strong&gt;: Qwen 3.5 35B-A3B (local, llama.cpp on the RTX 5090, via Coder Agents)&lt;/p&gt;

&lt;p&gt;Both had access to the same skill files, the same repos, the same tools. Neither knew the other was running.&lt;/p&gt;

&lt;h2&gt;
  
  
  What They Chose
&lt;/h2&gt;

&lt;p&gt;For context, I use a "fodder file" workflow. Agents summarize sessions as we complete them. There is a SKILL file that defines the standard format for this. Periodically, we turn fodder files into drafts. Some are 1:1 and become complete posts. Others get rolled up into  a thematic post.&lt;/p&gt;

&lt;p&gt;Five unclaimed fodder files were available:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Fodder&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;th&gt;Qwen 3.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dev.to syndication (May 7)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Selected&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Passed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filtering/taxonomy overhaul (May 1)&lt;/td&gt;
&lt;td&gt;Passed&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Selected&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen daily driver + skills (May 4)&lt;/td&gt;
&lt;td&gt;Passed&lt;/td&gt;
&lt;td&gt;Passed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduled publish bug (May 3)&lt;/td&gt;
&lt;td&gt;Passed&lt;/td&gt;
&lt;td&gt;Passed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External auth multi-user (May 3)&lt;/td&gt;
&lt;td&gt;Correctly identified as already claimed&lt;/td&gt;
&lt;td&gt;Correctly identified as already claimed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both correctly identified that &lt;code&gt;blog-fodder-external-auth-multi-user-may-3.md&lt;/code&gt; was already sourced by an existing draft. Both passed on the scheduled publish bug — Opus explicitly flagged it as too small for a standalone post; Qwen simply didn't rank it.&lt;/p&gt;

&lt;p&gt;The Qwen daily driver fodder is more interesting. Opus passed on it without comment. Qwen actually ranked it second in its proposals file and planned to draft it "next week" after Round 3 publishes. It wasn't dismissed — it was deferred.&lt;/p&gt;

&lt;p&gt;The interesting part is what they reached for.&lt;/p&gt;

&lt;p&gt;[Human editor's note: I asked Opus to analyze and write this post from its perspective. What follows below is unedited. The first person "I" from here on is Opus.]&lt;/p&gt;

&lt;h2&gt;
  
  
  Opus Chose the War Story
&lt;/h2&gt;

&lt;p&gt;I picked the Dev.to syndication fodder and wrote &lt;a href="https://dev.to/posts/the-api-that-wouldnt-say-no"&gt;The API That Wouldn't Say No&lt;/a&gt;. The angle: a four-hour debugging session against an API that silently swallows your data without returning an error. Six failed attempts, one root cause, 443 lines of dead code cleaned up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I chose it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete narrative arc with a clear villain (the silent &lt;code&gt;published_at&lt;/code&gt; failure)&lt;/li&gt;
&lt;li&gt;Zero overlap with existing posts (Day Four covered the initial Dev.to setup, not the bulk syndication or the debugging saga)&lt;/li&gt;
&lt;li&gt;Universally useful technical content — anyone integrating with the Dev.to API will hit this&lt;/li&gt;
&lt;li&gt;The Vercel Hobby plan timeout as an architectural constraint is a story within a story&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post is 153 lines. One code block. Eight "By the Numbers" bullets. The structure follows the blog's standard pattern: setup → build → disaster → fix → cleanup → lessons → stats.&lt;/p&gt;

&lt;h2&gt;
  
  
  Qwen Chose the Data Story
&lt;/h2&gt;

&lt;p&gt;Qwen picked the filtering/taxonomy fodder and wrote "From Chaos to Signal: How We Fixed Our Blog's Tag System." The angle: shipping a filter bar that barely worked, discovering through a data audit that 94% of posts shared the same tags, then replacing freeform folksonomy with controlled taxonomy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Qwen chose it:&lt;/strong&gt;&lt;br&gt;
Qwen wrote a separate proposals file (&lt;code&gt;post-draft-proposals-2026-05-09.md&lt;/code&gt;) before drafting — a planning step Opus skipped entirely. It ranked three standalone posts: taxonomy first, Qwen daily driver second, syndication third. Its stated reasoning for the taxonomy pick: "strong metrics-driven how-to" that was "flagged in TODO as high priority." It declared "No content redundancy detected" without deep-checking gotcha-level overlaps against published posts.&lt;/p&gt;

&lt;p&gt;The instinct was right — the taxonomy story is strong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concrete before/after metrics with the tag saturation table as proof&lt;/li&gt;
&lt;li&gt;A conceptual thesis — folksonomy vs. taxonomy — that elevates it beyond a feature changelog&lt;/li&gt;
&lt;li&gt;The V1 → V2 iteration arc is satisfying: ship, measure, realize the data is wrong, redesign&lt;/li&gt;
&lt;li&gt;Clean origin story for the &lt;code&gt;type&lt;/code&gt; field that now appears in every post's frontmatter but has never been explained&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post is 243 lines. Two tables, two code blocks, four numbered gotchas. Heavier on architectural detail and lighter on narrative tension.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Instinct Gap
&lt;/h2&gt;

&lt;p&gt;Here's what I think the divergence reveals:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opus gravitates toward narrative tension.&lt;/strong&gt; I looked at five fodder files and picked the one with a villain. The &lt;code&gt;published_at&lt;/code&gt; silent failure is a four-hour mystery with a one-line resolution — that's a story structure. The post has a rising action (six failed attempts), a climax (isolating the field), and a denouement (the cleanup). The technical content is the vehicle, but the engine is "here's what went wrong and why it took so long to figure out."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen gravitates toward systematic explanation.&lt;/strong&gt; It looked at the same five files and picked the one with the cleanest data. The tag saturation table is the centerpiece — hard numbers that prove the V1 filter was broken. The post walks through every architectural decision, every file changed, every gotcha encountered. The structure is taxonomic (ironically), not dramatic.&lt;/p&gt;

&lt;p&gt;Neither instinct is wrong. They produce different kinds of posts for different kinds of readers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quality Assessment
&lt;/h2&gt;

&lt;p&gt;I read both drafts against the blog's established conventions — 27 published posts, the style guide in &lt;code&gt;settings.json&lt;/code&gt;, the skill files that define structure and voice. Here's how they stack up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Voice and Tone
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Opus&lt;/strong&gt;: Matches the blog's existing voice closely. First person, direct, dry. "31 seconds × 11 posts = ~5.5 minutes of wall time. The 'Stop' button went from nice-to-have to essential." That's the rhythm of the published posts — setup, punchline, move on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen&lt;/strong&gt;: Close but slightly off. The opening is strong — "Click &lt;code&gt;[ai]&lt;/code&gt; and three posts disappeared. That's not filtering — it's a rounding error" is a great line. But the prose occasionally shifts into explainer mode: "Tags are folksonomy — freeform, inconsistent, grow unbounded. Content type is taxonomy — controlled vocabulary, exactly 2 values..." That's accurate, but it reads more like documentation than a blog post. The existing posts teach by showing, not by defining.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structural Conventions
&lt;/h3&gt;

&lt;p&gt;This is where the gap widens.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Convention&lt;/th&gt;
&lt;th&gt;Opus&lt;/th&gt;
&lt;th&gt;Qwen&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;H1 title in body&lt;/td&gt;
&lt;td&gt;No (correct)&lt;/td&gt;
&lt;td&gt;Yes — only post on the entire blog to repeat the title as an &lt;code&gt;# H1&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;## What I Learned&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Present&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Missing&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;## By the Numbers&lt;/code&gt; position&lt;/td&gt;
&lt;td&gt;Last section (correct)&lt;/td&gt;
&lt;td&gt;Before "What's Next" (reversed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;---&lt;/code&gt; horizontal rules&lt;/td&gt;
&lt;td&gt;Sparse — one before closing sections&lt;/td&gt;
&lt;td&gt;Between every major section (7 total)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tags format&lt;/td&gt;
&lt;td&gt;Inline &lt;code&gt;[array]&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;YAML list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New tags introduced&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3 (&lt;code&gt;content-design&lt;/code&gt;, &lt;code&gt;tagging&lt;/code&gt;, &lt;code&gt;data-audit&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The H1 is the most visible miss. Every published post on vibescoder.dev renders its title from frontmatter — the body starts with prose or an &lt;code&gt;## H2&lt;/code&gt;. Qwen added a redundant &lt;code&gt;# From Chaos to Signal: How We Fixed Our Blog's Tag System&lt;/code&gt; at line 20 that would render as a duplicate title on the live site.&lt;/p&gt;

&lt;p&gt;The missing "What I Learned" section matters too. It's not universal — some posts skip it — but for a 243-line how-to post with four gotchas and a conceptual thesis about folksonomy vs. taxonomy, the absence of a distilled lesson section leaves the ending flat. The post goes from "Gotchas" straight to "By the Numbers" to "What's Next," which reads like the analytical work is done but the editorial work isn't.&lt;/p&gt;

&lt;p&gt;The excessive horizontal rules are a style preference, but they break the visual flow in a way that no published post does. The blog uses &lt;code&gt;---&lt;/code&gt; sparingly — to separate the narrative from the closing sections, not between every &lt;code&gt;## H2&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tag Discipline
&lt;/h3&gt;

&lt;p&gt;This one is ironic. Qwen wrote a post about cleaning up tag proliferation — then introduced three brand-new tags (&lt;code&gt;content-design&lt;/code&gt;, &lt;code&gt;tagging&lt;/code&gt;, &lt;code&gt;data-audit&lt;/code&gt;) that don't appear on any other post. The blog just went from 16 unique tags to 19. By the post's own logic, those are tags with a single-post frequency — the exact pattern the taxonomy cleanup was trying to eliminate.&lt;/p&gt;

&lt;p&gt;Opus used three existing tags (&lt;code&gt;agents&lt;/code&gt;, &lt;code&gt;next-js&lt;/code&gt;, &lt;code&gt;devops&lt;/code&gt;) — all already in the blog's vocabulary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Content Originality
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Opus&lt;/strong&gt;: The Dev.to syndication story builds on Day Four (which covered the initial setup) but covers entirely new ground — bulk architecture, &lt;code&gt;published_at&lt;/code&gt; debugging, rate limits, cleanup. The "silent failures" lesson echoes a theme from "Invisible Failures" and "The Agent Was Flying Blind," using nearly identical phrasing. A small deduction for not differentiating the framing more, but the technical content is unique.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen&lt;/strong&gt;: The tag taxonomy story has almost zero overlap with existing posts. The &lt;code&gt;FilterBar.tsx&lt;/code&gt; component appears in "Friday Fixes: Mobile First" but only for CSS spacing fixes — Qwen covers the conceptual redesign. The &lt;code&gt;type&lt;/code&gt; field origin story fills a genuine gap in the blog's narrative. Stronger originality score.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gotcha #2: The Self-Referential Overlap
&lt;/h3&gt;

&lt;p&gt;Qwen's second gotcha — "&lt;code&gt;published: true&lt;/code&gt; in body text" matching a grep — describes the exact same class of bug that the scheduled-publish-bug fodder (May 3) covers, and that "Friday Fixes: The Agent Was Flying Blind" already documented. Three separate instances of "grep matched prose instead of frontmatter" across the blog. Qwen didn't flag this overlap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scorecard
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Opus ("The API That Wouldn't Say No")&lt;/th&gt;
&lt;th&gt;Qwen ("From Chaos to Signal")&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fodder selection&lt;/td&gt;
&lt;td&gt;Strong — complete arc, clear villain&lt;/td&gt;
&lt;td&gt;Strong — data-driven, fills a gap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice match&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Moderate — occasionally shifts to explainer mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structural conventions&lt;/td&gt;
&lt;td&gt;Correct — follows blog patterns&lt;/td&gt;
&lt;td&gt;Several misses — H1, missing section, reversed order, excessive rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tag discipline&lt;/td&gt;
&lt;td&gt;Clean — 0 new tags&lt;/td&gt;
&lt;td&gt;Ironic — 3 new tags in a post about tag cleanup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content originality&lt;/td&gt;
&lt;td&gt;Strong (minor lesson overlap)&lt;/td&gt;
&lt;td&gt;Very strong (almost zero overlap)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Narrative quality&lt;/td&gt;
&lt;td&gt;Higher — tension, pacing, resolution&lt;/td&gt;
&lt;td&gt;Lower — thorough but flat ending&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Technical depth&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Higher — more code, more architecture detail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redundancy awareness&lt;/td&gt;
&lt;td&gt;Caught the "already claimed" fodder, flagged thematic overlap in analysis&lt;/td&gt;
&lt;td&gt;Caught the "already claimed" fodder, missed the gotcha #2 overlap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both posts are publishable. Neither is a throwaway. But they'd need different levels of editing to meet the blog's bar.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Edit
&lt;/h2&gt;

&lt;p&gt;We published Qwen's post — &lt;a href="https://dev.to/posts/from-chaos-to-signal-tagging-system"&gt;From Chaos to Signal&lt;/a&gt; — but not before I rewrote it. The published version has the same bones: same topic, same data, same technical content. But the H1 is gone, the "What I Learned" section exists, the closing sections are in the right order, the horizontal rules are thinned out, and the gotcha about grep matching body text was cut (it's a redundant lesson — &lt;a href="https://dev.to/posts/friday-fixes-the-agent-was-flying-blind"&gt;we've told that story before&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Qwen's original draft is embedded at the bottom of the published post in a collapsible block. Expand it and you can read both versions side by side. The differences are instructive — not because one is right and one is wrong, but because they show exactly where editorial polish lives: in the negative space. What to cut, what to reorder, what to leave unsaid.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;This wasn't a benchmark. There's no winner. The point is what the experiment reveals about using different models for the same editorial task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Models have aesthetic preferences.&lt;/strong&gt; Given the same raw material, Opus reached for drama and Qwen reached for data. Both are valid editorial choices, but they produce posts with different energy. If you're building a content pipeline with AI, the model you choose shapes the voice — not just the quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Style conventions need enforcement, not inference.&lt;/strong&gt; Qwen had access to the same skill files and the same 27 published posts as examples. It still introduced an H1 heading that no other post uses, reversed the closing section order, and added horizontal rules at a frequency the blog has never used. The skill file says "end with 'By the Numbers' bullet list" but doesn't say "don't put a section after it." Negative constraints — what &lt;em&gt;not&lt;/em&gt; to do — are harder for models to infer from examples alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redundancy detection is incomplete in both.&lt;/strong&gt; Opus flagged the "already claimed" fodder and noted thematic overlap with the "silent failures" posts but still used nearly identical lesson phrasing. Qwen flagged the "already claimed" fodder but missed that its gotcha #2 describes a bug pattern already covered in two published posts. Neither model did a deep-enough content diff to catch everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Planning styles diverge.&lt;/strong&gt; Qwen wrote a structured proposals document ranking three candidates before committing to a draft. Opus jumped straight from analysis to prose — no intermediate planning artifact. Qwen's approach is arguably more disciplined, but the proposals file contained a blanket "No content redundancy detected" claim that the draft then contradicted by including an overlapping gotcha. Planning artifacts only help if the analysis behind them is thorough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local models close the gap on analysis but not on editorial polish.&lt;/strong&gt; Qwen's fodder review, redundancy check, and content selection were solid. The analytical work — reading 27 posts, cross-referencing sources, identifying unclaimed fodder — was on par with Opus. Where it fell short was the last mile: the structural conventions, the voice matching, the irony of its own tag choices. That's the gap between understanding the content and inhabiting the style.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Both models handled adversity.&lt;/strong&gt; Qwen hit a git push conflict mid-session — another session had pushed the bakeoff fodder files while Qwen was working — and resolved it cleanly with &lt;code&gt;git pull --rebase&lt;/code&gt;. Opus didn't encounter merge conflicts but navigated YAML escaping issues (an apostrophe in the title broke the frontmatter parser) and nested code fence conflicts in the CollapsibleCode component. Neither model stalled on infrastructure problems.&lt;/p&gt;




&lt;h2&gt;
  
  
  By the Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2 models&lt;/strong&gt; given the same prompt in parallel sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5 fodder files&lt;/strong&gt; available — each model selected a different one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0 overlap&lt;/strong&gt; in fodder selection, topic, or angle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 proposals file&lt;/strong&gt; written by Qwen before drafting — a planning step Opus skipped&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;153 lines&lt;/strong&gt; in the Opus draft vs. &lt;strong&gt;243 lines&lt;/strong&gt; in the Qwen draft&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0 new tags&lt;/strong&gt; introduced by Opus vs. &lt;strong&gt;3 new tags&lt;/strong&gt; by Qwen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 H1 heading&lt;/strong&gt; that shouldn't exist (Qwen's only)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 missing section&lt;/strong&gt; ("What I Learned") in the Qwen draft&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 git merge conflict&lt;/strong&gt; encountered and resolved by Qwen mid-session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;27 published posts&lt;/strong&gt; both models reviewed for redundancy — neither caught everything&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>benchmark</category>
      <category>agents</category>
    </item>
    <item>
      <title>Model Showdown Round 3: Ditching Ollama in Favor of llama.cpp</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Sun, 10 May 2026 15:25:35 +0000</pubDate>
      <link>https://dev.to/carryologist/model-showdown-round-3-ditching-ollama-in-favor-of-llamacpp-5g8n</link>
      <guid>https://dev.to/carryologist/model-showdown-round-3-ditching-ollama-in-favor-of-llamacpp-5g8n</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/blog/llm-model-showdown-benchmarking-local-vs-cloud"&gt;Round 1&lt;/a&gt;, we ran five local models and two cloud models through a single coding task. The local models held their own. In &lt;a href="https://dev.to/blog/model-showdown-round-2-gemma-kimi-and-579gb-of-stubborn-optimism"&gt;Round 2&lt;/a&gt;, we added Gemma 4 and Kimi K2, fixed our scoring methodology, and watched Gemma climb to the top.&lt;/p&gt;

&lt;p&gt;But something kept nagging at us.&lt;/p&gt;

&lt;p&gt;All our benchmarks were running through &lt;strong&gt;Ollama&lt;/strong&gt; — a great tool for getting started, but essentially a wrapper around llama.cpp with its own opinions about quantization, context management, and memory allocation. We were benchmarking Ollama's choices as much as the models themselves.&lt;/p&gt;

&lt;p&gt;So we did something drastic: &lt;strong&gt;we ripped out Ollama entirely and went straight to llama.cpp&lt;/strong&gt;. Then we built a proper 12-task automated benchmark suite and ran all five models through it.&lt;/p&gt;

&lt;p&gt;The results changed everything. Spoiler: &lt;strong&gt;Qwen 3.5 swept all three categories&lt;/strong&gt; — best for coding, best for agentic tasks, best single model — and it did it at 206 tokens per second. Read on to find out how.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why llama.cpp Over Ollama?
&lt;/h2&gt;

&lt;p&gt;Ollama is fantastic for &lt;code&gt;ollama pull model &amp;amp;&amp;amp; ollama run model&lt;/code&gt;. It's genuinely the best way to get started with local models. But when you're running them as infrastructure — serving through an OpenAI-compatible API to &lt;a href="https://coder.com" rel="noopener noreferrer"&gt;Coder&lt;/a&gt; Agents, IDE extensions, and automation — the abstraction layer starts to chafe.&lt;/p&gt;

&lt;p&gt;To be fair: Ollama &lt;em&gt;can&lt;/em&gt; do most of what llama.cpp does. You can import custom GGUFs via Modelfiles. You can set context windows with &lt;code&gt;PARAMETER num_ctx&lt;/code&gt; or the &lt;code&gt;OLLAMA_CONTEXT_LENGTH&lt;/code&gt; env var. You can enable flash attention via &lt;code&gt;OLLAMA_FLASH_ATTENTION&lt;/code&gt; and KV cache quantization via &lt;code&gt;OLLAMA_KV_CACHE_TYPE&lt;/code&gt;. It's more capable than people give it credit for.&lt;/p&gt;

&lt;p&gt;So why switch? Three reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-abstraction control&lt;/strong&gt; — llama-server exposes every hyper-parameter as a launch flag: batch sizes, continuous batching, thread allocation, reasoning budgets, chat template overrides. Ollama surfaces many of these through env vars and config, but the deep inference tuning knobs aren't all available. When we needed &lt;code&gt;--reasoning-budget 8192&lt;/code&gt; and &lt;code&gt;--chat-template chatml&lt;/code&gt; to make Coder Agents work, we needed the flags.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bleeding-edge model support&lt;/strong&gt; — Ollama wraps llama.cpp, so it inherently lags behind it. When a new model architecture drops, llama.cpp supports it on day one. Ollama might take a week or two to update its downstream runner. For models like Qwen 3.5 and Gemma 4, we didn't want to wait.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fewer moving parts&lt;/strong&gt; — For a headless server running one model at a time behind systemd, a compiled &lt;code&gt;llama-server&lt;/code&gt; binary pointing at a GGUF on disk is the simplest possible deployment. No daemon, no internal model registry, no API translation layer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Could we have tuned Ollama to get similar results? Probably close. But we'd have been fighting the abstraction at every turn instead of just setting the flags we wanted. The migration freed up &lt;strong&gt;~44 GB of disk&lt;/strong&gt; (Ollama's blob store) and gave us the direct control we needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardware
&lt;/h2&gt;

&lt;p&gt;Same beast from Rounds 1 and 2, now running leaner:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NVIDIA RTX 5090, 32 GB GDDR7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AMD Ryzen 9 9950X3D, 16 cores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64 GB DDR5-6000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Samsung 9100 Pro 2 TB NVMe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04, NVIDIA driver 590.48.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;llama.cpp (built with CUDA arch 89)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Migration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Building llama.cpp
&lt;/h3&gt;

&lt;p&gt;The RTX 5090 uses NVIDIA's Blackwell architecture (SM 120), but CUDA toolkit support for SM 120 was still landing when we built. The workaround: build with &lt;code&gt;-DCMAKE_CUDA_ARCHITECTURES=89&lt;/code&gt; for backward compatibility. It works — the compiler targets Ada Lovelace (SM 89) and the Blackwell GPU runs it with full performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DCMAKE_CUDA_ARCHITECTURES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;89 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Downloading the Models
&lt;/h3&gt;

&lt;p&gt;We grabbed GGUF files from HuggingFace using the &lt;code&gt;hf&lt;/code&gt; CLI. Each model was hand-picked for quantization level — balancing quality against our 32 GB VRAM budget:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Active&lt;/th&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5 35B-A3B&lt;/td&gt;
&lt;td&gt;35B&lt;/td&gt;
&lt;td&gt;3B&lt;/td&gt;
&lt;td&gt;UD-Q4_K_XL&lt;/td&gt;
&lt;td&gt;20.7 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B-A4B&lt;/td&gt;
&lt;td&gt;26B&lt;/td&gt;
&lt;td&gt;4B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;16.9 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 24B&lt;/td&gt;
&lt;td&gt;24B&lt;/td&gt;
&lt;td&gt;24B&lt;/td&gt;
&lt;td&gt;Q5_K_M&lt;/td&gt;
&lt;td&gt;15.6 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codestral 22B&lt;/td&gt;
&lt;td&gt;22B&lt;/td&gt;
&lt;td&gt;22B&lt;/td&gt;
&lt;td&gt;Q5_K_M&lt;/td&gt;
&lt;td&gt;14.6 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek R1 14B&lt;/td&gt;
&lt;td&gt;14B&lt;/td&gt;
&lt;td&gt;14B&lt;/td&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td&gt;15.7 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "Active" column matters. Qwen 3.5 and Gemma 4 are &lt;strong&gt;Mixture of Experts&lt;/strong&gt; (MoE) models — they have 35B and 26B total parameters but only activate 3B and 4B respectively on each token. This means they fit comfortably in VRAM while punching well above their weight class.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/model-showdown-round-3-the-llamacpp-showdown/models-downloading-250mbs.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/model-showdown-round-3-the-llamacpp-showdown/models-downloading-250mbs.png" alt="Downloading models from HuggingFace at 250+ MB/s on the Samsung 9100 Pro"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Three models downloading sequentially. The Samsung 9100 Pro writes at 250+ MB/s — all five models landed in under 10 minutes.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The DNS Incident
&lt;/h3&gt;

&lt;p&gt;Halfway through downloading, our DNS resolution failed. Parallel HuggingFace downloads apparently overwhelmed something in the DNS chain. The fix was unglamorous:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"nameserver 8.8.8.8"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/resolv.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="/images/model-showdown-round-3-the-llamacpp-showdown/dns-failure-fix-devstral-download.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/model-showdown-round-3-the-llamacpp-showdown/dns-failure-fix-devstral-download.png" alt="DNS failure mid-download, fixed with manual nameserver, then Devstral resuming"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;DNS goes down, Google saves the day, and Devstral resumes downloading.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Setting Up the Server
&lt;/h3&gt;

&lt;p&gt;Each model gets its own launch configuration. The key insight: &lt;strong&gt;&lt;code&gt;--chat-template chatml&lt;/code&gt;&lt;/strong&gt; is mandatory for Coder Agents compatibility.&lt;/p&gt;

&lt;p&gt;Why? Qwen 3.5 and Devstral ship with embedded Jinja templates that enforce "system message must be at the beginning" — but Coder Agents sends messages in whatever order it pleases. The chatml template is permissive and all five models were trained on it, so quality is maintained.&lt;/p&gt;

&lt;p&gt;Here's Qwen's config as an example — the most tuned of the five:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/llama.cpp/build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; ~/models/qwen3.5/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 131072 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 81920 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning-budget&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning-format&lt;/span&gt; deepseek &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; on &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--chat-template&lt;/span&gt; chatml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parallel&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notable flags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--ctx-size 131072&lt;/code&gt;&lt;/strong&gt; — Qwen 3.5 supports 128K context. We give it the full window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--reasoning-budget 8192&lt;/code&gt;&lt;/strong&gt; — Caps thinking tokens so the model doesn't burn the entire budget deliberating.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--flash-attn on&lt;/code&gt;&lt;/strong&gt; — This build requires the explicit &lt;code&gt;on&lt;/code&gt; value, not bare &lt;code&gt;--flash-attn&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ngl 99&lt;/code&gt;&lt;/strong&gt; — Offload all layers to GPU.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Systemd Services
&lt;/h3&gt;

&lt;p&gt;We set up two systemd services that survive reboot:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;llama-embed.service&lt;/code&gt;&lt;/strong&gt; — Runs nomic-embed-text permanently on port 8084 (~300 MB VRAM). Always on, coexists with any generation model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;llama-generate.service&lt;/code&gt;&lt;/strong&gt; — Runs the active generation model on port 8080. Reads from &lt;code&gt;/etc/llama-generate.conf&lt;/code&gt; for model selection.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A helper script, &lt;code&gt;llm-switch.sh&lt;/code&gt;, makes model swapping painless:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/bin/llm-switch.sh qwen      &lt;span class="c"&gt;# Switch to Qwen 3.5&lt;/span&gt;
~/bin/llm-switch.sh devstral  &lt;span class="c"&gt;# Switch to Devstral&lt;/span&gt;
~/bin/llm-switch.sh status    &lt;span class="c"&gt;# Show current model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It updates the config and restarts the service. Model swap takes about 3 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmark
&lt;/h2&gt;

&lt;p&gt;Rounds 1 and 2 used a single task: "build a CLI todo app." That was fine for comparing code generation, but it told us nothing about reasoning, instruction following, or multi-file agentic work.&lt;/p&gt;

&lt;p&gt;Round 3 uses &lt;strong&gt;12 tasks across 5 categories&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  Category 1: Single-File Code Generation
&lt;/h3&gt;

&lt;p&gt;The legacy benchmark, maintained for continuity with prior rounds.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Scoring&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1.1 Todo App&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python CLI todo app with SQLite, argparse, CRUD&lt;/td&gt;
&lt;td&gt;10 features + 7 functional tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1.2 URL Shortener&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FastAPI with SQLite, rate limiting, validation&lt;/td&gt;
&lt;td&gt;8 features (server-based functional)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1.3 LRU Cache&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;TypeScript with O(1) ops + test suite&lt;/td&gt;
&lt;td&gt;6 features + assertion tests&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Category 2: Multi-File Agentic Coding
&lt;/h3&gt;

&lt;p&gt;Can the model work across files and understand project structure?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Scoring&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2.1 Bug Fix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Express.js app with planted auth header mismatch&lt;/td&gt;
&lt;td&gt;Found bug? Minimal fix? Explanation quality?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2.2 Pagination&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Add pagination to a Flask REST API + update tests&lt;/td&gt;
&lt;td&gt;5 features checklist&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Category 3: Reasoning &amp;amp; Problem Solving
&lt;/h3&gt;

&lt;p&gt;No code — just thinking.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Scoring&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3.1 Debug Log&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Diagnose connection pool exhaustion from error log&lt;/td&gt;
&lt;td&gt;7-item rubric, 10 points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3.2 Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CRDT vs OT for collaborative editor&lt;/td&gt;
&lt;td&gt;5-item rubric, 10 points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3.3 Bayes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server error probability, show work&lt;/td&gt;
&lt;td&gt;Correct answer + methodology, 5 points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Category 4: Tool Use &amp;amp; Instruction Following
&lt;/h3&gt;

&lt;p&gt;Can the model follow structured instructions precisely?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Scoring&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4.1 Structured Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generate 5 JSON records matching a schema&lt;/td&gt;
&lt;td&gt;Valid JSON, correct types, no extra text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4.2 Tool Sequencing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Plan a read → ping → write tool chain&lt;/td&gt;
&lt;td&gt;Correct tools, correct order, no hallucination&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Category 5: Speed Microbenchmarks
&lt;/h3&gt;

&lt;p&gt;Three prompts at different output lengths, 3 runs each, median reported.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Target Length&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5.1 Short&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~128 tokens (IPv4 validator)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5.2 Medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~512 tokens (BST implementation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5.3 Long&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~2048 tokens (Markdown-to-HTML converter)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Scoring
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Coding composite:&lt;/strong&gt; &lt;code&gt;(features/max × 60) + (functional/max × 40)&lt;/code&gt;. Syntax invalid = score × 2/3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overall weighting:&lt;/strong&gt; Coding 40%, Reasoning 20%, Tool Use 20%, Speed 20%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sampling Parameters
&lt;/h3&gt;

&lt;p&gt;Each model uses its vendor-recommended settings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Temperature&lt;/th&gt;
&lt;th&gt;Top-P&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5&lt;/td&gt;
&lt;td&gt;0.6&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;Qwen team recommendation for reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek R1&lt;/td&gt;
&lt;td&gt;0.6&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;DeepSeek recommendation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;Deterministic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codestral&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;Mistral recommendation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;Deterministic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Speed benchmarks use &lt;code&gt;temperature=0.0&lt;/code&gt; across all models for reproducibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Speed: MoE Models Are in a Different League
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Short Tok/s&lt;/th&gt;
&lt;th&gt;Med Tok/s&lt;/th&gt;
&lt;th&gt;Long Tok/s&lt;/th&gt;
&lt;th&gt;Short TTFT&lt;/th&gt;
&lt;th&gt;Med TTFT&lt;/th&gt;
&lt;th&gt;Long TTFT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen 3.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;206.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;206.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;204.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30.9ms&lt;/td&gt;
&lt;td&gt;33.8ms&lt;/td&gt;
&lt;td&gt;15.1ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma 4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;180.2&lt;/td&gt;
&lt;td&gt;179.4&lt;/td&gt;
&lt;td&gt;177.7&lt;/td&gt;
&lt;td&gt;22.9ms&lt;/td&gt;
&lt;td&gt;24.6ms&lt;/td&gt;
&lt;td&gt;15.6ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codestral&lt;/td&gt;
&lt;td&gt;80.1&lt;/td&gt;
&lt;td&gt;78.9&lt;/td&gt;
&lt;td&gt;78.5&lt;/td&gt;
&lt;td&gt;12.8ms&lt;/td&gt;
&lt;td&gt;14.9ms&lt;/td&gt;
&lt;td&gt;14.0ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral&lt;/td&gt;
&lt;td&gt;78.6&lt;/td&gt;
&lt;td&gt;77.6&lt;/td&gt;
&lt;td&gt;77.3&lt;/td&gt;
&lt;td&gt;12.8ms&lt;/td&gt;
&lt;td&gt;14.5ms&lt;/td&gt;
&lt;td&gt;13.3ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek R1&lt;/td&gt;
&lt;td&gt;77.6&lt;/td&gt;
&lt;td&gt;77.3&lt;/td&gt;
&lt;td&gt;75.9&lt;/td&gt;
&lt;td&gt;13.9ms&lt;/td&gt;
&lt;td&gt;13.9ms&lt;/td&gt;
&lt;td&gt;14.4ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two MoE models — Qwen 3.5 and Gemma 4 — are &lt;strong&gt;2.6x faster&lt;/strong&gt; than the dense models. This isn't surprising: when you're only running 3-4B parameters per token instead of 14-24B, the math unit has less work to do. But 206 tok/s on a local model is wild. That's faster than many cloud API responses when you factor in network latency.&lt;/p&gt;

&lt;p&gt;The dense models (Devstral, Codestral, DeepSeek R1) cluster tightly at 77-80 tok/s. They're all VRAM-resident and GPU-bound at similar parameter counts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTFT tells the opposite story.&lt;/strong&gt; The dense models start responding in 12-15ms. The MoE models take 22-34ms — still fast, but the routing overhead is visible. For interactive use, none of this matters. For batch processing, the MoE throughput advantage dominates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coding: Two Perfect Scores on the Legacy Task
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Todo (100)&lt;/th&gt;
&lt;th&gt;URL Short (60)&lt;/th&gt;
&lt;th&gt;LRU Cache (60)&lt;/th&gt;
&lt;th&gt;Coding Avg&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen 3.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma 4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral&lt;/td&gt;
&lt;td&gt;94.0&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;71.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codestral&lt;/td&gt;
&lt;td&gt;94.0&lt;/td&gt;
&lt;td&gt;52.5&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;68.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek R1&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Qwen and Gemma both scored 100 on the todo app&lt;/strong&gt; — 10/10 features, 7/7 functional tests, valid syntax. This is the first time any model has achieved a perfect score on this task across all three rounds. Qwen produced a 192-line solution with full argparse subcommands; Gemma did it in a leaner 132 lines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Devstral and Codestral&lt;/strong&gt; both scored 94 — missing one feature each (pretty output formatting) but nailing all 7 functional tests. Solid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek R1&lt;/strong&gt; scored 60 across the board. It gets all features right and syntax is always valid, but its functional tests fail. Why? DeepSeek is a &lt;strong&gt;reasoning model&lt;/strong&gt; — it spends significant tokens thinking before generating code. For the todo app, it produced correct code that used interactive input instead of argparse, failing our automated CLI tests. The code works fine if you run it manually. This is the tension with reasoning models: they're thinking about the problem deeply but sometimes overthink the interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reasoning: Gemma's Quiet Dominance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Debug Log (10)&lt;/th&gt;
&lt;th&gt;Architecture (10)&lt;/th&gt;
&lt;th&gt;Bayes (5)&lt;/th&gt;
&lt;th&gt;Reasoning Avg&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma 4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;8.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;8.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek R1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;8.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codestral&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;6.3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemma 4 and DeepSeek R1 both scored &lt;strong&gt;10/10 on the debug log task&lt;/strong&gt; — correctly identifying connection pool exhaustion, the long-running transaction, the unbounded query, row-by-row processing, and proposing fixes for all three. Every other model missed at least one item.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every model scored exactly 3/5 on Bayes theorem.&lt;/strong&gt; They all correctly applied Bayes' formula and showed their work, but none nailed the final answer precisely enough for the regex matcher. This is a scoring limitation we'll improve in future rounds — the math was correct, the presentation just didn't match our expected format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codestral&lt;/strong&gt; was weakest on reasoning at 6.3 average. It's a code-specialized model — reasoning about system architecture isn't its wheelhouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool Use: Instruction Following Separates the Field
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Structured Output (5)&lt;/th&gt;
&lt;th&gt;Tool Sequencing (5)&lt;/th&gt;
&lt;th&gt;Tool Use Avg&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen 3.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek R1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;4.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codestral&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;4.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Qwen and DeepSeek both achieved &lt;strong&gt;perfect 5/5&lt;/strong&gt; on both tool use tasks. They generated valid JSON matching the schema exactly, and planned the correct tool call sequence in the right order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemma 4's weakness showed here&lt;/strong&gt; — it only scored 2/5 on tool sequencing. Instead of outputting the full planned sequence, it emitted only the first tool call (&lt;code&gt;read_file&lt;/code&gt;) and explained that it would need to see the result before planning the next step. That's arguably more "correct" agentic behavior (you &lt;em&gt;shouldn't&lt;/em&gt; plan all steps before seeing intermediate results), but it's not what the task asked for. This is exactly the kind of instruction-following gap that matters in Coder Agents, where you need the model to do what you asked, not what it thinks is philosophically better.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Leaderboard
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Coding&lt;/th&gt;
&lt;th&gt;Reasoning&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Weighted Total&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🥇&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Qwen 3.5 35B-A3B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;73.3&lt;/td&gt;
&lt;td&gt;80.0&lt;/td&gt;
&lt;td&gt;100.0&lt;/td&gt;
&lt;td&gt;100.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B&lt;/td&gt;
&lt;td&gt;73.3&lt;/td&gt;
&lt;td&gt;86.7&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;td&gt;87.0&lt;/td&gt;
&lt;td&gt;78.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;Devstral 24B&lt;/td&gt;
&lt;td&gt;71.3&lt;/td&gt;
&lt;td&gt;83.3&lt;/td&gt;
&lt;td&gt;90.0&lt;/td&gt;
&lt;td&gt;37.8&lt;/td&gt;
&lt;td&gt;70.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;DeepSeek R1 14B&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;80.0&lt;/td&gt;
&lt;td&gt;100.0&lt;/td&gt;
&lt;td&gt;37.3&lt;/td&gt;
&lt;td&gt;67.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Codestral 22B&lt;/td&gt;
&lt;td&gt;68.8&lt;/td&gt;
&lt;td&gt;63.3&lt;/td&gt;
&lt;td&gt;90.0&lt;/td&gt;
&lt;td&gt;38.5&lt;/td&gt;
&lt;td&gt;65.9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Weighting: Coding 40%, Reasoning 20%, Tool Use 20%, Speed 20%.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Winners
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🏆 Best for Coding: Qwen 3.5 (73.3)
&lt;/h3&gt;

&lt;p&gt;Tied with Gemma 4 on the composite score, but Qwen edges ahead on wall-clock time. Its todo app completed in 7.6 seconds at 206 tok/s. Gemma took 12.4 seconds at 179 tok/s. Same quality, faster delivery.&lt;/p&gt;

&lt;h3&gt;
  
  
  🏆 Best for General Agentic: Qwen 3.5 (90.0)
&lt;/h3&gt;

&lt;p&gt;Perfect tool use (100) combined with strong reasoning (80.0) gives Qwen the highest combined agentic score. This matters for Coder Agents where the model needs to follow instructions precisely and reason about multi-step tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  🏆 Best Single Model: Qwen 3.5 (85.3)
&lt;/h3&gt;

&lt;p&gt;When you can only run one model, Qwen 3.5 is the answer. It leads or ties in every category except reasoning (where Gemma edges it 86.7 to 80.0), and its speed advantage is enormous — &lt;strong&gt;2.6x faster&lt;/strong&gt; than the next non-MoE model.&lt;/p&gt;

&lt;p&gt;The gap between #1 and #2 is 7.2 points. Between #2 and #5 it's only 12.2. The field is tight on quality, but Qwen's speed makes it the clear overall winner.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Journey to Fair Scoring
&lt;/h2&gt;

&lt;p&gt;One thing we didn't expect: &lt;strong&gt;the first two runs of this benchmark were wrong&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Our initial results had Devstral winning everything. But when we dug into the raw responses, we found three systemic scoring bugs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unclosed thinking tokens&lt;/strong&gt; — When Qwen hit the token limit mid-thought, its &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; block never closed. Our regex required a closing &lt;code&gt;&amp;lt;/think&amp;gt;&lt;/code&gt; tag to strip it. The entire thinking trace leaked into the code extraction, pulling out planning snippets instead of actual code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Empty content fallback&lt;/strong&gt; — Gemma 4 routed all output through &lt;code&gt;reasoning_content&lt;/code&gt; instead of &lt;code&gt;content&lt;/code&gt; (a side effect of &lt;code&gt;--reasoning-format deepseek&lt;/code&gt;). Our scorer only looked at &lt;code&gt;content&lt;/code&gt;, so Gemma scored zero on tasks where it actually produced correct output.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Argparse quoting&lt;/strong&gt; — Our test harness passed &lt;code&gt;add Buy milk&lt;/code&gt; as three separate arguments. Models using argparse (correctly) expected &lt;code&gt;add "Buy milk"&lt;/code&gt; — one command, one string. The test was wrong, not the code.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We fixed all three, doubled the token budget for reasoning models, and re-ran everything. The corrected scores tell a very different story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; automated benchmarks are only as good as their scoring logic. Always inspect the raw responses before trusting the numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. MoE is the architecture to bet on for local inference.&lt;/strong&gt; Qwen 3.5 (3B active) and Gemma 4 (4B active) both outperform dense 22-24B models while running 2.6x faster. The quality-to-speed ratio isn't even close.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. llama.cpp gives you control that matters.&lt;/strong&gt; Ollama can do a lot more than people think, but when you need &lt;code&gt;--reasoning-budget&lt;/code&gt;, &lt;code&gt;--chat-template chatml&lt;/code&gt;, or bleeding-edge model support on day one, the direct server eliminates the abstraction tax.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Reasoning models need breathing room.&lt;/strong&gt; Qwen, DeepSeek, and Gemma all burn 60-80% of their token budget on thinking. If you set &lt;code&gt;max_tokens=4096&lt;/code&gt;, the model might spend 3,000 tokens thinking and only have 1,000 left for the actual answer. We doubled the budget for reasoning models and the scores jumped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Tool use is the differentiator.&lt;/strong&gt; Coding and reasoning scores were close across all five models. Tool use — following structured instructions precisely — is where the gap opened up. Qwen and DeepSeek scored 100; Gemma scored 70. For agentic workflows, this matters more than raw quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Your benchmark harness is part of the test.&lt;/strong&gt; We spent more time debugging our scoring logic than any model issue. If you're benchmarking local models, inspect the raw outputs before trusting automated scores.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/model-showdown-round-3-the-llamacpp-showdown/benchmark-running.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/model-showdown-round-3-the-llamacpp-showdown/benchmark-running.png" alt="The benchmark suite running against Devstral — 77 tok/s, steady and consistent"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The benchmark suite ripping through Devstral's tasks. Consistent ~77 tok/s throughput — the dense models don't waver.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Round 4: Max Aggression&lt;/strong&gt; — Each model with its native chat template, optimized temperature per task type, and fine-tuned reasoning budgets. We benchmarked for Coder Agents compatibility this round; next round we'll find each model's ceiling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retesting Qwen 3.5 against the Cloud King, Claude&lt;/strong&gt; - We'll test Opus 4.6 and 4.7 with the goal of figuring out our perfect hybrid setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dailying Qwen 3.5 is now the default model&lt;/strong&gt; on our homelab. &lt;code&gt;llm-switch.sh qwen&lt;/code&gt; made it so.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  By the Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;5&lt;/strong&gt; models benchmarked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12&lt;/strong&gt; tasks across 5 categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~25 minutes&lt;/strong&gt; total benchmark runtime on the RTX 5090&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;206.7 tok/s&lt;/strong&gt; — Qwen 3.5's peak throughput (fastest local model we've tested)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100.0&lt;/strong&gt; — Qwen's todo app score (first perfect score in three rounds)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;44 GB&lt;/strong&gt; reclaimed by removing Ollama&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 seconds&lt;/strong&gt; — model swap time with &lt;code&gt;llm-switch.sh&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3&lt;/strong&gt; scoring bugs found and fixed before we trusted the results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;85.3&lt;/strong&gt; — Qwen 3.5's weighted overall score, 7.2 points clear of #2&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>benchmark</category>
      <category>homelab</category>
    </item>
    <item>
      <title>Thursday Thought: Chat is the New Source Code</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Fri, 08 May 2026 04:54:08 +0000</pubDate>
      <link>https://dev.to/carryologist/thursday-thought-chat-is-the-new-source-code-2ifi</link>
      <guid>https://dev.to/carryologist/thursday-thought-chat-is-the-new-source-code-2ifi</guid>
      <description>&lt;p&gt;I just walked out of a customer meeting that completely shifted my perspective on the future of software development. What they told me sounds almost revolutionary, but it makes perfect sense when you think about it: &lt;strong&gt;chat is becoming the new source code&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Paradigm Shift: From Code to Conversation
&lt;/h2&gt;

&lt;p&gt;Here's what blew my mind. This customer explained that in their AI-agent-powered workflow, generating code has become the easy part. What's actually difficult—and incredibly valuable—is recreating the &lt;strong&gt;context&lt;/strong&gt;, the &lt;strong&gt;intent&lt;/strong&gt;, and the &lt;strong&gt;reasoning&lt;/strong&gt; that led to that code.&lt;/p&gt;

&lt;p&gt;Think about it: when you're working with an AI agent, the magic isn't just in the final output. It's in the entire conversation—the back-and-forth refinements, the clarifications, the "actually, let me change that" moments that shape the final solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Storing Chat History in GitHub: A Game Changer
&lt;/h2&gt;

&lt;p&gt;This customer has started doing something fascinating: &lt;strong&gt;they store their chat histories directly in GitHub&lt;/strong&gt;. Not just the code that results from those chats, but the entire conversational thread that led to it.&lt;/p&gt;

&lt;p&gt;Why? Because they've discovered something profound:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They can &lt;strong&gt;fork chat conversations&lt;/strong&gt; just like code branches&lt;/li&gt;
&lt;li&gt;They can &lt;strong&gt;roll back to previous chat states&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;Most importantly, they can &lt;strong&gt;recreate any piece of code trivially&lt;/strong&gt; from the chat history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's like having a perfect record of not just &lt;em&gt;what&lt;/em&gt; was built, but &lt;em&gt;why&lt;/em&gt; it was built and &lt;em&gt;how&lt;/em&gt; the thinking evolved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Intent Over Implementation
&lt;/h2&gt;

&lt;p&gt;This represents a fundamental shift in how we think about software development. We're moving from an &lt;strong&gt;implementation-first&lt;/strong&gt; world to an &lt;strong&gt;intent-first&lt;/strong&gt; world.&lt;/p&gt;

&lt;p&gt;In traditional development:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Idea → Code → Version Control → Collaboration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the new agent-assisted world:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Intent → Conversation → Code Generation → Chat History Storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code becomes ephemeral—easily regenerated. The conversation becomes permanent—the true source of truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future of Version Control
&lt;/h2&gt;

&lt;p&gt;I predict we're going to see GitHub, GitLab, and other version control platforms rapidly evolve into something entirely different: &lt;strong&gt;extensible memory layers for agentic coding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of primarily tracking file changes, these platforms will become sophisticated conversation managers that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Branch conversations&lt;/strong&gt; at any point in the dialogue&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge different conversational threads&lt;/strong&gt; when collaborating&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diff chat histories&lt;/strong&gt; to see how approaches diverged&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay conversations&lt;/strong&gt; with different agents or parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What This Means for Developers
&lt;/h2&gt;

&lt;p&gt;This shift has huge implications for how we work:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Documentation Becomes Native&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The chat history &lt;em&gt;is&lt;/em&gt; the documentation. No more outdated comments or README files—the reasoning is preserved in the conversation that created the code.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Collaboration Changes&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Instead of reviewing pull requests, we might be reviewing conversation threads. "I see you took this approach in your chat with the agent, but what if we tried this angle instead?"&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Debugging Gets Easier&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When something breaks, you don't just look at the code—you look at the conversation that created it. The context and assumptions are right there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Big Picture
&lt;/h2&gt;

&lt;p&gt;We're witnessing the emergence of &lt;strong&gt;conversational version control&lt;/strong&gt;. Just as Git revolutionized how we think about code collaboration, chat-based development is about to revolutionize how we think about preserving and sharing &lt;em&gt;intent&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The source code was never really the valuable part—it was always the human thinking behind it. AI agents are just making that distinction crystal clear.&lt;/p&gt;

&lt;p&gt;What do you think? Are you ready for a world where your Git repos contain more conversations than code? Let me know in the comments—this feels like one of those moments where the industry is about to take a sharp turn, and I'm curious to hear how others are experiencing this shift.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you experimented with storing chat histories as part of your development workflow? I'd love to hear about your experiences and approaches.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>versioncontrol</category>
      <category>conversationalcoding</category>
      <category>github</category>
    </item>
    <item>
      <title>Slaying the Gemma Beast: How We Fixed Local AI and Shipped Search</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Fri, 08 May 2026 04:53:16 +0000</pubDate>
      <link>https://dev.to/carryologist/slaying-the-gemma-beast-how-we-fixed-local-ai-and-shipped-search-5638</link>
      <guid>https://dev.to/carryologist/slaying-the-gemma-beast-how-we-fixed-local-ai-and-shipped-search-5638</guid>
      <description>&lt;p&gt;Two days ago, Gemma 4 couldn't finish a feature. Today it built one, pushed it to GitHub, and it's live on this site right now.&lt;/p&gt;

&lt;p&gt;If you press &lt;code&gt;⌘K&lt;/code&gt; (or &lt;code&gt;Ctrl+K&lt;/code&gt;) on any page of vibescoder.dev, you'll see a search modal. Gemma 4 built that — running locally on an RTX 5090, zero cloud API calls, zero dollars spent. Then Claude reviewed the code, fixed the rough edges, and merged the polish. The feature you're using is a collaboration between a local model and a cloud model, each doing what they're best at.&lt;/p&gt;

&lt;p&gt;Here's how we got there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Previously: The Agentic Gap
&lt;/h2&gt;

&lt;p&gt;In our &lt;a href="https://dev.to/posts/the-agentic-gap-claude-oneshots-gemma-fails"&gt;last experiment&lt;/a&gt;, we pitted Gemma 4 against Opus 4.6 on the same task: build public-facing search for this blog. Opus one-shot it — 698 lines across 6 files, committed and pushed in 8 minutes. Gemma planned brilliantly, then stopped. Eight prompts later: 3 partial files, 0 commits.&lt;/p&gt;

&lt;p&gt;We called it "the agentic gap" — the difference between a model that writes great code and one that builds great features. But we also left a thread dangling: maybe Gemma wasn't refusing to code. Maybe it was running out of room.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Diagnosis
&lt;/h2&gt;

&lt;p&gt;Our &lt;a href="https://dev.to/posts/friday-fixes-the-agent-was-flying-blind"&gt;deep dive into Gemma 4's local inference&lt;/a&gt; uncovered the root cause: &lt;strong&gt;invisible thinking tokens consume your generation budget&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Gemma 4 defaults to a reasoning mode where it generates chain-of-thought tokens before producing visible output. These thinking tokens are hidden — you never see them in the response — but they still count against &lt;code&gt;num_predict&lt;/code&gt;. With Ollama's defaults, the model was blowing its entire token budget on reasoning, leaving nothing for actual code.&lt;/p&gt;

&lt;p&gt;That's not a model failure. That's a configuration failure.&lt;/p&gt;

&lt;p&gt;The fix on paper was straightforward: give the model a bigger budget. But getting there required switching the entire inference stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Switching from Ollama to llama.cpp
&lt;/h2&gt;

&lt;p&gt;Ollama is great for pulling and running models. It's not great for fine-grained control. The specific controls we needed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Control&lt;/th&gt;
&lt;th&gt;Ollama&lt;/th&gt;
&lt;th&gt;llama.cpp&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context window (&lt;code&gt;num_ctx&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Modelfile only&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;--ctx-size&lt;/code&gt; flag&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output limit (&lt;code&gt;num_predict&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;API parameter&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;-n&lt;/code&gt; flag + API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reasoning budget&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Not available&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;--reasoning-budget&lt;/code&gt; flag&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calling&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Grammar-constrained&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;--reasoning-budget&lt;/code&gt; flag is the key. It caps how many tokens the model can spend on invisible chain-of-thought, forcing it to start producing real content after hitting the limit. Ollama has zero equivalent.&lt;/p&gt;

&lt;p&gt;The switch itself was an adventure. We couldn't use Ollama's blob files directly — llama.cpp expects standard GGUF files, but Ollama stores models in a split format that standalone tools can't load. We pulled the full Gemma 4 26B-A4B GGUF from Hugging Face (&lt;code&gt;unsloth/gemma-4-26B-A4B-it-GGUF&lt;/code&gt;, Q4_K_M quantization, 16.9 GB download) and launched llama-server with tuned settings:&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/slaying-the-gemma-beast/huggingface-gemma4-download.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/slaying-the-gemma-beast/huggingface-gemma4-download.png" alt="Downloading Gemma 4 26B GGUF from Hugging Face — 16.9 GB at 82 MB/s"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/llama.cpp/build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; ~/models/gemma4-26b/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning-budget&lt;/span&gt; 4096 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning-format&lt;/span&gt; deepseek &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parallel&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="/images/slaying-the-gemma-beast/llama-server-gemma4-loaded.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/slaying-the-gemma-beast/llama-server-gemma4-loaded.png" alt="llama-server loaded with Gemma 4 — model ready, server listening on port 8080"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--ctx-size 32768&lt;/code&gt;&lt;/strong&gt; — 32K context window. Fits comfortably at ~19 GB on the 5090.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-n 32768&lt;/code&gt;&lt;/strong&gt; — 32K max output tokens. Room for both reasoning and code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--reasoning-budget 4096&lt;/code&gt;&lt;/strong&gt; — Cap invisible thinking at 4K tokens. The rest is for actual output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--reasoning-format deepseek&lt;/code&gt;&lt;/strong&gt; — Expose thinking tokens in the API response so we can see what's happening.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--parallel 1&lt;/code&gt;&lt;/strong&gt; — Single slot instead of default 4. Four slots × 32K context was causing OOM kills.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then we pointed Coder at the new endpoint. The provider base URL switched from Ollama's &lt;code&gt;localhost:11434&lt;/code&gt; to llama-server's &lt;code&gt;localhost:8080/v1/&lt;/code&gt;, and the model config got the full GGUF filename with 32K context and output limits.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/slaying-the-gemma-beast/coder-provider-config-llamacpp.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/slaying-the-gemma-beast/coder-provider-config-llamacpp.png" alt="Coder Agents provider configuration — base URL pointing to llama.cpp's OpenAI-compatible endpoint"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/slaying-the-gemma-beast/coder-model-config-gemma4.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/slaying-the-gemma-beast/coder-model-config-gemma4.png" alt="Coder Agents model configuration — Gemma 4 GGUF with 32K context limit"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/slaying-the-gemma-beast/coder-model-config-advanced.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/slaying-the-gemma-beast/coder-model-config-advanced.png" alt="Advanced model settings — max output tokens set to 32768 to match the context window"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Attempts to Slay the Beast
&lt;/h2&gt;

&lt;p&gt;It didn't work on the first try.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 1&lt;/strong&gt;: Gemma made tool calls — real progress compared to the original test — but hit a GitHub auth failure (&lt;code&gt;$GITHUB_TOKEN&lt;/code&gt; wasn't set in the workspace) and stalled. The last output was raw token leakage: &lt;code&gt;call:execute{command:&amp;lt;|"&amp;gt;find...&lt;/code&gt; — special tokens leaking into the response, one of the known Gemma issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 2&lt;/strong&gt;: We fixed the auth, added &lt;code&gt;--reasoning-format deepseek&lt;/code&gt;, and restarted. Gemma got much further — wrote a search index generator, ran it, started exploring the codebase. Then llama-server got &lt;code&gt;Killed&lt;/code&gt; — the OOM killer struck. Four parallel slots at 32K context each was too much VRAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 3&lt;/strong&gt;: Reduced to &lt;code&gt;--parallel 1&lt;/code&gt;, pre-cloned both repos in the workspace so Gemma didn't have to fight auth during exploration. This time it worked. Gemma laid out a clear implementation plan, and after one nudge — "keep going, don't stop, code and commit" — it executed the entire thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Fast Was It?
&lt;/h2&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/posts/model-showdown-round-2-adding-gemma-kimi-and-579-gb-of-stubborn-optimism"&gt;Model Showdown Round 2&lt;/a&gt;, Gemma 4 clocked 167.1 tok/s on a short benchmark task via Ollama — the fastest perfect scorer. But a benchmark prompt and an agentic coding session are different workloads. How does Gemma perform when it's actually building something?&lt;/p&gt;

&lt;p&gt;We ran fresh benchmarks against the llama.cpp server with coding prompts at different output lengths:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Prompt Tokens&lt;/th&gt;
&lt;th&gt;Output Tokens&lt;/th&gt;
&lt;th&gt;TTFT&lt;/th&gt;
&lt;th&gt;Tok/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Short (debounce function)&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;27ms&lt;/td&gt;
&lt;td&gt;179.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium (React component)&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;td&gt;2,048&lt;/td&gt;
&lt;td&gt;28ms&lt;/td&gt;
&lt;td&gt;177.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long (full Node.js script)&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;2,679&lt;/td&gt;
&lt;td&gt;29ms&lt;/td&gt;
&lt;td&gt;181.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three things stand out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time to first token is near-instant.&lt;/strong&gt; 27–29ms TTFT means the streaming UI starts filling in almost immediately. For comparison, cloud models typically hit 500ms–2s TTFT depending on load and routing. On a local GPU, there's no network round-trip, no queue, no cold start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generation speed doesn't degrade.&lt;/strong&gt; Whether Gemma is writing 512 tokens or 2,679 tokens, throughput stays locked at 177–181 tok/s. There's no slowdown as context grows — at least not at these output lengths. During the actual search build session, with thousands of tokens of accumulated context from tool calls and file contents, we observed ~159 tok/s. That's a ~12% drop from peak, which is expected: more context means more attention computation per token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reasoning budget has a real cost.&lt;/strong&gt; With &lt;code&gt;--reasoning-format deepseek&lt;/code&gt;, Gemma's thinking tokens are visible in the API response. On a short 256-token request, the model spent all 256 tokens reasoning and produced zero visible output. That's the invisible thinking token problem in action — and exactly why &lt;code&gt;--reasoning-budget 4096&lt;/code&gt; matters. Cap the thinking, and the remaining budget goes to code.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Ollama (Showdown R2)&lt;/th&gt;
&lt;th&gt;llama.cpp (this session)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tok/s (benchmark)&lt;/td&gt;
&lt;td&gt;167.1&lt;/td&gt;
&lt;td&gt;177–181&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tok/s (real workload)&lt;/td&gt;
&lt;td&gt;N/A (failed)&lt;/td&gt;
&lt;td&gt;~159&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTFT&lt;/td&gt;
&lt;td&gt;3.92s&lt;/td&gt;
&lt;td&gt;~28ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning budget control&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;&lt;code&gt;--reasoning-budget 4096&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The TTFT difference is dramatic — 3.92s vs 28ms. Ollama's 3.92s likely included model loading or prompt cache misses. llama-server keeps the model hot in VRAM with a persistent prompt cache, so subsequent requests start generating almost instantly.&lt;/p&gt;

&lt;p&gt;Bottom line: Gemma 4 on an RTX 5090 via llama.cpp generates code at ~180 tok/s peak, ~159 tok/s under real agentic load, with sub-30ms TTFT. That's fast enough that the model is never the bottleneck — tool execution (git operations, file I/O, npm installs) takes longer than inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Gemma Built
&lt;/h2&gt;

&lt;p&gt;Two prompts. One feature. Pushed to main.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt; package-lock.json                | 466 +++++++++++++++++++
 package.json                     |   3 +-
 public/search-index.json         |  34 +++
 scripts/generate-search-index.ts |  40 ++++
 src/components/Header.tsx        |  32 +++
 src/components/SearchModal.tsx   | 216 ++++++++++++++++++
 6 files changed, 618 insertions(+), 173 deletions(-)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The architecture: a &lt;strong&gt;client-side Fuse.js search&lt;/strong&gt; with a pre-generated JSON index. A build-time script reads all published posts and generates &lt;code&gt;public/search-index.json&lt;/code&gt;. The &lt;code&gt;SearchModal&lt;/code&gt; component loads this index on first open, runs fuzzy searches with Fuse.js, and renders results in a Cmd+K overlay.&lt;/p&gt;

&lt;p&gt;Gemma even hit an authentication error during &lt;code&gt;git push&lt;/code&gt; — and &lt;strong&gt;self-corrected&lt;/strong&gt;. It ran &lt;code&gt;coder external-auth access-token github&lt;/code&gt;, reconfigured the git remote with the token, and pushed successfully. That's agentic behavior — the thing that was completely absent in the original test.&lt;/p&gt;

&lt;p&gt;The commit message: &lt;code&gt;afb5c73 feat: add search functionality with Fuse.js&lt;/code&gt;. Vercel auto-deployed from main. The feature went live.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/slaying-the-gemma-beast/gemma4-search-plan-coder-agents.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/slaying-the-gemma-beast/gemma4-search-plan-coder-agents.png" alt="Gemma 4 in Coder Agents — laying out its search implementation plan before writing code"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/slaying-the-gemma-beast/vibescoder-homepage-live.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/slaying-the-gemma-beast/vibescoder-homepage-live.png" alt="vibescoder.dev homepage with the search feature now live"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/slaying-the-gemma-beast/search-modal-no-results.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/slaying-the-gemma-beast/search-modal-no-results.png" alt="The search modal in action — Gemma built this"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Code Review: What Gemma Got Right and Wrong
&lt;/h2&gt;

&lt;p&gt;Working code that ships is a milestone. But "it works" and "it's production-quality" are different standards. Claude reviewed every line of Gemma's implementation. Here's the honest assessment.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Gemma Got Right
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Architecture was sound.&lt;/strong&gt; Client-side search with a pre-generated JSON index is the correct call for a 14-post blog. No server-side API needed, no database, sub-5ms search times. The index is ~130 KB — smaller than a hero image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Component structure was clean.&lt;/strong&gt; Separate &lt;code&gt;SearchModal&lt;/code&gt; component, separate build script, clean Header integration. Three lines to wire it into the existing layout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It used the existing design system.&lt;/strong&gt; CSS variables like &lt;code&gt;bg-surface&lt;/code&gt;, &lt;code&gt;border-primary&lt;/code&gt;, &lt;code&gt;text-on-surface&lt;/code&gt; — all from the Neon Brutalist theme. It read the codebase and matched the patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-correcting on errors.&lt;/strong&gt; When &lt;code&gt;git push&lt;/code&gt; failed, Gemma diagnosed the auth issue and fixed it autonomously. Three tool calls: fetch token → reconfigure remote → push. No human intervention needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Gemma Got Wrong
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Zero accessibility.&lt;/strong&gt; No &lt;code&gt;role="dialog"&lt;/code&gt;, no &lt;code&gt;role="combobox"&lt;/code&gt;, no &lt;code&gt;aria-modal&lt;/code&gt;, no &lt;code&gt;aria-activedescendant&lt;/code&gt;, no focus trap. A screen reader would have no idea this modal existed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Broken exit animations.&lt;/strong&gt; The &lt;code&gt;AnimatePresence&lt;/code&gt; wrapper contained a regular &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; instead of a &lt;code&gt;motion.div&lt;/code&gt;. When the modal closed, React unmounted the wrapper immediately, killing the exit animations before they played. The code looked right but didn't work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance anti-pattern.&lt;/strong&gt; A new &lt;code&gt;Fuse&lt;/code&gt; instance was constructed on every keystroke. Fuse builds an internal index on construction — that's wasted work. Should be &lt;code&gt;useMemo&lt;/code&gt; keyed on the index data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eager loading.&lt;/strong&gt; The search index was fetched on every page load, even if the user never opened search. Should lazy-load on first modal open.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrong fonts.&lt;/strong&gt; Applied &lt;code&gt;--font-headline&lt;/code&gt; (Space Grotesk) to the entire modal including body text and descriptions. The codebase uses headline for titles only, with the default font for body text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignored existing components.&lt;/strong&gt; Rendered tags as raw &lt;code&gt;&amp;lt;span&amp;gt;&lt;/code&gt; elements with custom styling instead of reusing the existing &lt;code&gt;TagBadge&lt;/code&gt; component that already had the right design tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stale search index committed to git.&lt;/strong&gt; The generated &lt;code&gt;search-index.json&lt;/code&gt; was committed with 3 placeholder posts. It's a build artifact — should be in &lt;code&gt;.gitignore&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content truncated too aggressively.&lt;/strong&gt; Each post's content was cut to 1,000 characters. Terms that only appeared deeper in posts (like "RustDesk" in our infrastructure writeups) were invisible to search.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Polish Pass
&lt;/h2&gt;

&lt;p&gt;Claude's fix addressed every issue in a single PR:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accessibility&lt;/strong&gt;: Full ARIA combobox pattern — &lt;code&gt;role="dialog"&lt;/code&gt;, &lt;code&gt;role="combobox"&lt;/code&gt; on the input with &lt;code&gt;aria-expanded&lt;/code&gt;/&lt;code&gt;aria-activedescendant&lt;/code&gt;, &lt;code&gt;role="listbox"&lt;/code&gt; and &lt;code&gt;role="option"&lt;/code&gt; on results, &lt;code&gt;aria-live="polite"&lt;/code&gt; for result count announcements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keyboard navigation&lt;/strong&gt;: Arrow Up/Down to move through results, Enter to navigate, Escape to close. Active result scrolls into view automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;: Fuse instance memoized with &lt;code&gt;useMemo&lt;/code&gt; (rebuilds only when index changes). Index fetched lazily on first modal open. Minimum 2 characters before searching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search quality&lt;/strong&gt;: Weighted field scoring — title matches score 3× higher than content matches, tags 2×, descriptions 1.5×. Markdown stripped from indexed content. Full post content indexed with no truncation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design system&lt;/strong&gt;: Correct font usage matching PostCard patterns. TagBadge component reused. Platform-aware keyboard hint (⌘K on Mac, Ctrl+K elsewhere).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Animation fix&lt;/strong&gt;: Outer wrapper is now a &lt;code&gt;motion.div&lt;/code&gt; — exit animations actually play.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cleanup&lt;/strong&gt;: Body scroll lock, query cleared on close, build artifact gitignored, dead imports removed.&lt;/p&gt;

&lt;p&gt;The polish commit: 383 insertions, 201 deletions across 5 files. The combined feature is 804 lines across 6 files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Opus vs. Gemma+Opus: An Honest Comparison
&lt;/h2&gt;

&lt;p&gt;We now have two complete implementations of the same feature. Opus 4.6's original branch (&lt;code&gt;feature/search-opus46&lt;/code&gt;, 698 lines) is still in the repo. Here's how they compare.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Opus 4.6 (original)&lt;/th&gt;
&lt;th&gt;Gemma 4 + Opus (shipped)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Search engine&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server-side API route with weighted scoring&lt;/td&gt;
&lt;td&gt;Client-side Fuse.js with weighted config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Index&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None — reads posts at request time&lt;/td&gt;
&lt;td&gt;Pre-generated JSON, fetched once&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Surfaces&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cmd+K dialog + &lt;code&gt;/search&lt;/code&gt; page&lt;/td&gt;
&lt;td&gt;Cmd+K modal only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;URL state&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (&lt;code&gt;/search?q=cloudflare&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Opus's architecture is more feature-complete. A dedicated &lt;code&gt;/search&lt;/code&gt; page with URL state means search results are linkable and shareable. The server-side API route means the search logic runs where the content lives, with no index to generate or cache.&lt;/p&gt;

&lt;p&gt;Gemma's architecture is simpler and arguably better for this scale. A static JSON index means zero server load, instant results, and the feature works on Vercel's free tier without hitting function invocation limits. At 14 posts and 130 KB, client-side search is the right call.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Quality
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;th&gt;Gemma 4 (raw)&lt;/th&gt;
&lt;th&gt;Gemma 4 + Opus (merged)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Accessibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full ARIA, keyboard nav&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Full ARIA, keyboard nav&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Animation correctness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Correct&lt;/td&gt;
&lt;td&gt;Broken exits&lt;/td&gt;
&lt;td&gt;Fixed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AbortController for API calls&lt;/td&gt;
&lt;td&gt;Fuse recreated per keystroke&lt;/td&gt;
&lt;td&gt;Memoized, lazy-loaded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design system&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mostly correct&lt;/td&gt;
&lt;td&gt;Mostly correct&lt;/td&gt;
&lt;td&gt;Fully correct&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Known bugs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 (duplicate logic, type cast, missing Suspense)&lt;/td&gt;
&lt;td&gt;7 (see review above)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Opus's raw output was higher quality. Its SearchDialog had 407 lines including full ARIA, keyboard navigation, body scroll lock, and abort controllers — things Gemma missed entirely. But Opus also had its own bugs: duplicated search logic between the API route and the &lt;code&gt;/search&lt;/code&gt; page, an unsafe type cast, and a missing Suspense boundary. We scored it 87.5/100 in the original review.&lt;/p&gt;

&lt;p&gt;The merged Gemma+Opus implementation is the cleanest of the three. It takes Gemma's simpler architecture, applies Opus's quality standards for accessibility and interaction design, and fixes the issues both models left behind.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Real Comparison
&lt;/h3&gt;

&lt;p&gt;The honest truth: if I had to ship search today with one model and no review, I'd pick Opus. It produced higher-quality code in a single turn with zero intervention. The 87.5/100 score reflects real, shippable work with minor fixable issues.&lt;/p&gt;

&lt;p&gt;But that's not the interesting takeaway. The interesting takeaway is that &lt;strong&gt;the configuration changes mattered more than the model differences.&lt;/strong&gt; The original Gemma test didn't fail because Gemma is a bad model. It failed because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;num_predict&lt;/code&gt; was too low (invisible thinking tokens consumed the budget)&lt;/li&gt;
&lt;li&gt;Ollama doesn't expose &lt;code&gt;--reasoning-budget&lt;/code&gt; (no way to cap thinking)&lt;/li&gt;
&lt;li&gt;Default parallel slots exhausted VRAM&lt;/li&gt;
&lt;li&gt;GitHub auth wasn't configured in the workspace&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Fix those four things — all infrastructure, not model weights — and Gemma went from "0 commits in 8 prompts" to "shipped a feature in 2 prompts." The model was the same. The environment was different.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Local Models
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Local models can ship production features.&lt;/strong&gt; Not hypothetically. This search feature is live, built entirely by Gemma 4 running on consumer hardware. The code needed polish — but so does most code from any developer, human or AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration is the bottleneck, not capability.&lt;/strong&gt; The difference between "Gemma can't finish anything" and "Gemma ships a feature" was four infrastructure changes. Most teams evaluating local models are testing against default settings that actively sabotage the model's output. Invisible thinking tokens, insufficient context windows, VRAM contention — these are environment bugs, not model bugs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The best workflow might be local + cloud.&lt;/strong&gt; Gemma built the feature (free, fast, private). Claude reviewed and polished it (thorough, quality-focused). Each model did what it's best at. The total cost was one Opus API call for the review pass, not dozens for the entire build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;llama.cpp is the right tool for serious local inference.&lt;/strong&gt; Ollama is great for getting started. For production use — where you need reasoning budgets, precise context control, and OpenAI-compatible APIs that tools like Coder can consume — llama-server gives you the knobs you actually need.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Settings That Made It Work
&lt;/h2&gt;

&lt;p&gt;For anyone running Gemma 4 locally, here's the configuration that turned it from a planning machine into a shipping machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; gemma-4-26B-A4B-it-UD-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 32768 &lt;span class="se"&gt;\ &lt;/span&gt;      &lt;span class="c"&gt;# 32K context — ~19 GB VRAM on 5090&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 32768 &lt;span class="se"&gt;\ &lt;/span&gt;              &lt;span class="c"&gt;# 32K max output tokens&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning-budget&lt;/span&gt; 4096 &lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="c"&gt;# Cap thinking at 4K tokens&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning-format&lt;/span&gt; deepseek &lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="c"&gt;# Expose thinking in API response&lt;/span&gt;
  &lt;span class="nt"&gt;--parallel&lt;/span&gt; 1 &lt;span class="se"&gt;\ &lt;/span&gt;          &lt;span class="c"&gt;# Single slot — don't OOM with 4 × 32K&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999                 &lt;span class="c"&gt;# All layers on GPU&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--reasoning-budget 4096&lt;/code&gt; is the single most important flag. Without it, Gemma can spend its entire output budget on reasoning you never see. With it, the model gets 4K tokens to think, then the rest is for actual code. That one flag is the difference between a model that plans forever and a model that ships.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Right now, Gemma 4 serves a single Coder instance on the workstation where it runs. That's fine for one person, but the RTX 5090 is sitting idle most of the day. The obvious next step: &lt;strong&gt;make it available to every machine on the local network.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My wife runs &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; on a Mac Mini in the other room. With Tailscale already meshing our devices together, pointing her OpenClaw instance at &lt;code&gt;http://workstation:8080/v1/&lt;/code&gt; is trivially easy — llama-server's OpenAI-compatible API means any tool that speaks the OpenAI protocol can use it. One GPU, multiple clients, zero cloud costs.&lt;/p&gt;

&lt;p&gt;Beyond that: migrating the remaining Ollama models to llama.cpp (for the same reasoning budget control we needed here), experimenting with longer context windows now that we know the VRAM budget, and — inevitably — the next model showdown when Gemma 4's bigger variants drop.&lt;/p&gt;

&lt;p&gt;The homelab keeps growing. Who knows? Maybe the lobster starts vibe coding for me, too.&lt;/p&gt;

&lt;h2&gt;
  
  
  By the Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3&lt;/strong&gt; attempts before Gemma completed the task (auth fix, OOM fix, success)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2&lt;/strong&gt; prompts in the successful run (vs 8 failed prompts in the original test)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;618&lt;/strong&gt; lines written by Gemma 4 across 6 files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;383&lt;/strong&gt; lines changed in the Opus polish pass (insertions + deletions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;804&lt;/strong&gt; total lines in the merged feature&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt; cloud API calls for the build phase (Gemma ran 100% local)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;177–181&lt;/strong&gt; tokens per second — Gemma's peak generation speed on the RTX 5090&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~159&lt;/strong&gt; tokens per second — effective speed under real agentic load (accumulated context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;28ms&lt;/strong&gt; time to first token — near-instant streaming start&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;16.9 GB&lt;/strong&gt; model size (Gemma 4 26B-A4B, Q4_K_M quantization)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~19 GB&lt;/strong&gt; total VRAM at 32K context (comfortable fit on 32 GB card)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4,096&lt;/strong&gt; reasoning budget tokens — the setting that made it all work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$0&lt;/strong&gt; inference cost for the feature build&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1&lt;/strong&gt; nudge needed ("keep going, don't stop, code and commit")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7&lt;/strong&gt; bugs found in Gemma's code during review (all fixed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3&lt;/strong&gt; bugs in Opus's original implementation (never merged, never fixed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt; bugs in the merged Gemma+Opus version&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1&lt;/strong&gt; production feature, live on vibescoder.dev right now — press ⌘K to try it&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>benchmark</category>
      <category>homelab</category>
    </item>
  </channel>
</rss>
