DEV Community: Rob

Forking and Open Sourcing a Single Purpose Site

Rob — Fri, 29 May 2026 21:08:26 +0000

I built a trip planning site for my group going to the F1 Canadian Grand Prix in Montreal. It worked great — itinerary calendar, lodging details, photo gallery, activity suggestions, a shared password so only the group could see it. Classic vibe coded single-purpose app: hardcoded destination, hardcoded dates, hardcoded branding, shipped to Vercel, done.

Then I looked at it and thought: this is useful beyond one trip. What if anyone could fork this repo, deploy it, and have their own trip site without touching code?

That question kicked off a 20-hour arc — across several mobile sessions between F1 races — that transformed a static, single-purpose site into a generic, config-driven template, and exposed every security shortcut I'd taken along the way.

The proof that it worked: I deployed a second instance for a completely different trip — CMA Fest 2026 in Nashville, Tennessee. Same codebase, zero code changes, just the setup wizard.

The Starting Point

The original site had "F1 Grand Prix Montreal" baked into the components. CSS variables were named --gradient-f1 and --shadow-f1. The countdown component had hardcoded race dates. The activities page had Montreal-specific categories. The favicon was F1-themed. localStorage keys were F1-prefixed.

It was a good app. It was also impossible for anyone else to use without rewriting half the codebase.

The Architecture Pivot

The core insight was simple: one database row should drive the entire site.

I created a vacation_config table with a single JSONB column. Every piece of configurable data — trip name, destination, dates, timezone, brand color, hero image, lodging details, password hash, LLM provider, encrypted API key — lives in that one row.

vacation_config
├── tripName
├── destination
├── startDate / endDate
├── brandColor / heroImageUrl
├── lodgings[]
├── passwordHash (bcrypt)
├── llmApiKeyEncrypted (AES-256-GCM)
├── llmProvider
└── setupComplete

Every page calls getConfig() server-side and destructures what it needs. No hardcoded values anywhere. Adding a new configurable field is just adding a key to the TypeScript interface — old configs get new defaults via object spread.

This is the pattern that makes fork-and-deploy work. You clone the repo, you get an empty database, and the site is a blank canvas until someone fills in the config.

The Setup Wizard

An empty database isn't useful. Someone needs to fill in that config row, and that someone might not be technical.

The setup wizard is a 6-step client component that walks through everything:

Step	What it configures
Basics	Trip name, destination, tagline, dates, timezone (auto-detected)
Branding	Brand color (8 presets + custom hex), hero image URL
Lodging	Multiple properties with type-aware display (hotel, Airbnb, VRBO, house, resort)
Password	Shared site password
AI Generation	Optional — pick an LLM provider, paste an API key, auto-generate activity suggestions
Review & Launch	Summary → one-click launch

When you click Launch, four things happen in sequence: config is saved (password bcrypt-hashed, API key AES-encrypted), database tables are created, the user is auto-authenticated, and they're redirected to the live homepage. The entire setup takes about two minutes.

The Middleware Problem

A static site deployed to your own Vercel project doesn't need sophisticated auth. You share the URL with your group, maybe add a simple password check, and you're done.

A clonable template is different. Every fork is a fresh deployment. The middleware needs to handle two states: not yet set up and set up and running.

I built a two-gate system running in Edge Runtime:

Gate 1 — Setup Check. Is there an HMAC-signed setup-done cookie? If not, redirect to /setup. This cookie is signed with the site secret to prevent client forgery.

Gate 2 — Auth Check. Is there a valid auth token cookie? The token includes a timestamp and a random nonce, HMAC-signed with the site secret. If it's missing, expired, or invalid, redirect to /password.

The edge constraint matters. Next.js middleware runs in Edge Runtime, which means no Node.js crypto module. The entire auth chain — HMAC signing, signature verification, timing-safe comparison — uses the Web Crypto API. The Node.js side (lib/auth.ts) handles bcrypt password hashing and AES encryption, which only run in API routes.

From One Secret to Everything

The user provides exactly one secret: a random hex string generated with openssl rand -hex 32. That single value does triple duty:

HMAC signing — auth tokens and setup cookies
AES-256 encryption key — derived via SHA-256 hash for encrypting LLM API keys at rest
Timing-safe comparison — double-HMAC pattern for constant-time signature verification

Everything else is either auto-provisioned (Vercel Postgres sets POSTGRES_URL, Vercel Blob sets BLOB_READ_WRITE_TOKEN) or entered through the wizard. The user never edits code, never touches a config file, never opens a terminal after the initial deploy.

The Security Audit

This is where the story arc connects to lessons I've written about before.

I've been saying audit your vibe code often. I've written about the spring cleaning process and the phased remediation pattern. So when I decided to open-source this project, I ran a full audit before publishing.

The audit found 15+ vulnerabilities across 4 severity tiers. I expected minor stuff. I got critical findings.

The Critical Tier

The worst findings were structural. The middleware had a blanket pass-through for all /api/* routes — meaning API endpoints were completely unauthenticated. The setup config endpoint had no auth, so anyone who found the URL could overwrite or delete the entire site configuration. Auth tokens had no expiration. And there was a hardcoded fallback secret — 'fallback' — that would activate if the environment variable was missing, making every signature predictable.

These aren't exotic bugs. They're the exact patterns that vibe coding produces: things that work during development and deployment but leave doors wide open.

The High Tier

The OG image endpoint accepted arbitrary URLs with no validation — a textbook SSRF vector that could reach private networks. LLM prompts passed unsanitized user input directly to the model — destination names, PDF document text, all of it unescaped. No data validation existed on any write endpoint. And the password endpoint had no rate limiting — unlimited brute-force attempts.

The Medium and Low Tiers

Signature comparison used string equality instead of timing-safe comparison. The setup cookie was unsigned. Error responses leaked internal details. No security headers. No file size limits on uploads. The Gemini API key was sent as a URL query parameter (logged in server access logs). The middleware's static asset detection used pathname.includes('.') — meaning a crafted path like /settings/foo.bar would bypass auth.

The Fix

I structured the remediation the same way I've done it before: phased commits ordered by severity and dependency graph, not one giant PR.

Commit 1 — Critical fixes. Middleware now enforces auth on all API routes except the auth endpoint itself and public config reads. Setup mutation requires authentication after initial setup. Auth tokens expire after 30 days. The hardcoded fallback secret is gone — a missing env var now returns a 500.

Commit 2 — High fixes. SSRF blocked with private IP detection. LLM inputs sanitized with delimiter-based injection mitigation and output validation. Per-entity input validators on all write routes. Rate limiting on the auth endpoint with IP-based lockout.

Commit 3 — Medium and low fixes. Setup cookie is HMAC-signed. PDF uploads enforce a size limit. Security headers added (CSP, HSTS, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Permissions-Policy). Gemini key moved from URL to header. Static asset detection uses an explicit extension regex. Client-side error logging sanitized. CSS color injection blocked with a validation function.

Three commits. The same phased pattern. Same principle: merge and test between each phase so you know exactly which change breaks something if it does.

What Changes When You Open Source

Going from "deployed for my group" to "anyone can fork this" changed the threat model fundamentally.

Before: I controlled the deployment. I knew the URL. The password was shared via text message. If something was misconfigured, I'd notice and fix it.

After: Strangers deploy this. They might skip the secret. They might leave the setup endpoint open. They might paste API keys into client-side code. Every defensive measure needs to work without my involvement.

This is why the audit mattered more for open-sourcing than for personal use. A personal deployment with no auth on API routes is sloppy. An open-source template with no auth on API routes is a liability for every person who forks it.

The middleware's two-gate system, the HMAC-signed cookies, the secret-or-500 pattern, the input validation — none of these existed in the original F1 trip site. They exist because the code is no longer mine alone.

Making It Novice-Friendly

The target user is someone who's never used a terminal. That constraint shaped the documentation as much as the code.

The setup guide walks through 8 steps: fork the repo, generate a secret key (with instructions for Mac, Windows, and a web fallback), deploy to Vercel, add Postgres, add Blob storage, redeploy, run the wizard, share with your group. Each step assumes zero technical knowledge.

The README has a one-click Deploy with Vercel button that pre-fills the environment variable prompt. The wizard auto-detects timezone from the browser. Lodging details auto-populate from the property name via AI. The color picker has presets so nobody has to know what a hex code is.

Every friction point I could identify, I tried to eliminate. The person deploying this might be planning a bachelorette party or a family reunion. They're not reading documentation for fun.

The Architecture Lessons

Turning a personal app into a template taught me things that pure greenfield development wouldn't have:

Config-driven beats hardcoded, always. Even if you're building for one use case, storing configuration in a database instead of in component props makes the app fundamentally more flexible. The JSONB column costs nothing and buys everything.

Middleware is the security boundary. In a personal app, auth is a convenience — you know who's accessing it. In a template, middleware is the only thing standing between a stranger's deployment and the open internet. It needs to handle every state: not yet configured, configured but not logged in, logged in, logged in with an expired token.

The setup wizard is the product. For a clonable template, the first-run experience is the product. If someone can't get from fork to functioning site in 10 minutes, they'll abandon it. The wizard isn't a nice-to-have — it's the reason the project works.

Security scales with distribution. A bug in your personal app affects you. A bug in a template affects everyone who forks it. The bar for security isn't "good enough for me" — it's "good enough for the least technical person who deploys this."

By the Numbers

28 commits — from hardcoded F1 site to open-source template
1 JSONB row — drives the entire site configuration
6-step wizard — zero-code setup for non-technical users
15+ security vulnerabilities — found and fixed before open-sourcing
3 phased commits — for the security remediation alone
1 env var — the only thing a user manually configures (VACATION_HUB_SECRET)
~20 hours — total transformation time
0 lines of code — required from the person deploying it

Adding an MCP Server to the Blog Itself

Rob — Thu, 28 May 2026 13:48:08 +0000

Two weeks ago I wired MCP into my fitness tracker — ten tools, one endpoint, four clients. That was always a test run. The fitness tracker is a low-stakes app. If an agent writes a bad workout entry, I delete it. The blog is different. The blog has published content, a deploy pipeline, an editorial calendar, analytics, syndication to Dev.to. If an agent publishes a draft that wasn't ready, the internet sees it.

This week I added an MCP server to vibescoder.dev anyway. Sixteen tools across five categories. The agent that helped me build it — running in a Coder workspace — can now turn around and use it to manage the very site it just modified. That's the kind of loop that makes building in public feel recursive.

The Goal

One sentence: let any agent directly publish to the site, analyze traffic data, and troubleshoot production issues.

The blog is a Next.js 16 app deployed on Vercel. Content lives in a separate private GitHub repo (the-vibe-coder-content), committed via the GitHub API. The admin UI already supports voice recording → Claude-generated MDX → one-click publish. But the admin UI requires a browser. An agent in a Coder workspace, or in Claude Desktop, or in Cursor can't click buttons. MCP gives them the same capabilities programmatically.

Architecture

The fitness tracker MCP server talked to Postgres via Prisma. This blog has no database. Content is MDX files in a GitHub repo. Analytics are Redis counters in Upstash. Deployments happen by curling a Vercel webhook. So the MCP server is a GitHub API client, a Redis reader, and an HTTP caller — not a database wrapper.

Agent (Claude / Cursor / Coder Agents)
  │
  │  Streamable HTTP (Bearer token)
  ▼
vibescoder.dev/api/mcp/mcp
  │
  ├─ Content tools ──→ GitHub API (read/write/commit MDX)
  ├─ Analytics ──────→ Upstash Redis (view counters)
  ├─ Deploy ─────────→ Vercel deploy hook
  ├─ Syndication ────→ Dev.to API
  └─ Diagnostics ────→ fetch() against live site

Same stack as the fitness tracker: mcp-handler for the Next.js adapter, zod for parameter schemas, bearer token auth, disableSse: true for stateless Vercel deployment.

The 16 Tools

The fitness tracker had 10 tools that all talked to one database. This server has 16 tools that talk to four different backends. Grouped by what they touch:

Content Management (7 tools) — the core editorial workflow:

server.tool('list_posts',     /* filter by status/tag/date */)
server.tool('get_post',       /* full MDX + frontmatter    */)
server.tool('create_post',    /* commit new MDX to GitHub  */)
server.tool('update_post',    /* partial frontmatter/body  */)
server.tool('publish_post',   /* draft → live, trigger deploy */)
server.tool('unpublish_post', /* live → draft, trigger deploy */)
server.tool('delete_post',    /* remove from GitHub        */)

Blog Fodder & Editorial (4 tools) — the content pipeline:

server.tool('list_fodder',  /* active + archived, with consumption status */)
server.tool('get_fodder',   /* read raw session notes */)
server.tool('get_todo',     /* editorial calendar     */)
server.tool('update_todo',  /* maintain the calendar  */)

Analytics (1 tool), Deploy & Syndication (2 tools), Diagnostics (2 tools):

server.tool('analytics_summary', /* 30-day views + top pages */)
server.tool('trigger_deploy',    /* hit the Vercel webhook   */)
server.tool('syndicate_post',    /* cross-post to Dev.to     */)
server.tool('site_health',       /* fetch key endpoints      */)
server.tool('get_settings',      /* AI style prompt config   */)

Every tool returns raw data. The agent does its own analysis — same philosophy as the fitness tracker. The list_posts tool returns frontmatter for every post; the agent decides what "recent drafts" means.

What I Reused

The blog engine already had all the backend logic. The admin UI's API routes do the exact same operations — read a post from GitHub, commit an update, hit the deploy hook, cross-post to Dev.to. The MCP server calls the same library functions, not the HTTP routes:

import { commitFile, readFile, deleteFile } from "@/lib/github";
import { listDirectory } from "@/lib/github-list";

The only net-new code was the directory listing helper (github-list.ts). The existing github.ts had file-level CRUD but couldn't list a directory. One function, 30 lines, wraps the GitHub Contents API for directory paths.

The auth pattern, CORS, and rate limiting were copied from the fitness tracker and adapted. Same timingSafeEqual, same withMcpAuth wrapper, same in-memory rate-limit buckets. The muscle memory from the fitness tracker build meant the security layer took minutes, not an hour.

The Middleware Change

One line. The blog's middleware protects all /api/* routes with JWT cookie auth. The MCP server does its own bearer-token auth. So /api/mcp/ gets added to the allow-list alongside /api/auth/, /api/analytics/track, and /api/slack/:

pathname.startsWith("/api/mcp/")

The MCP route then handles auth independently — same pattern as the fitness tracker, where the middleware allow-listed the MCP path and the route enforced its own bearer token.

Decisions

Three questions came up during planning:

Auth granularity — single token or read-only vs. read-write tokens? Single token. I'm the only user. If I ever add collaborators, I'll add scoped tokens. Until then, one token does everything.

Audit logging — the fitness tracker writes to a Postgres audit_log table. This blog has no database. Options were Redis, console.log, or skip. I went with console.log (captured by Vercel function logs) plus [mcp] prefixed commit messages for every GitHub write. That gives me two audit trails — Vercel logs for all operations, Git history for content changes — with zero infrastructure.

[mcp] post: create "adding-mcp-server-to-the-blog"
[mcp] post: publish "adding-mcp-server-to-the-blog"
[mcp] chore: update TODO.md

Image uploads — deferred. MCP tool parameters are JSON. Binary images would need base64 encoding in a tool call. That's doable but not worth the complexity in v1. The admin UI handles images fine. If an agent needs to add images to a post, it can use the admin API directly or I'll add an upload_image tool later.

The Template Update

Same Coder template pattern as the fitness tracker. Token flows from the workstation to workspaces:

/etc/coder.d/coder.env
  → TF_VAR_vibescoder_mcp_token
    → coder_agent.main.env (VIBESCODER_MCP_TOKEN)
      → jq merge into ~/.mcp.json at workspace start

Three terminal commands on the homelab to finish it:

echo 'TF_VAR_vibescoder_mcp_token=<token>' | sudo tee -a /etc/coder.d/coder.env
sudo systemctl restart coder
cd ~/coder-templates && git pull && ./docker/apply.sh

The gh auth login step was an amusing detour — I was SSH'd into the homelab from my iPhone, and gh tried to open a browser on a headless server. The fix was manually entering the one-time code at github.com/login/device in Safari. Mobile homelab administration is an underappreciated genre of suffering.

Verifying in Production

The real test was hitting the live endpoint:

curl -s -X POST https://vibescoder.dev/api/mcp/mcp \
  -H "Authorization: Bearer $VIBESCODER_MCP_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize",
       "params":{"protocolVersion":"2025-03-26",
                 "capabilities":{},
                 "clientInfo":{"name":"test","version":"1.0.0"}}}'

Response: 200 OK, server name vibescoder, version 1.0.0, tools capability enabled.

Then a real tool call — list all drafts:

{
  "count": 1,
  "posts": [{
    "slug": "syndicating-to-substack-the-undocumented-path",
    "title": "Syndicating to Substack: The Undocumented Path",
    "published": false,
    "publishAt": null
  }]
}

One draft in the queue. Real data from the content repo, returned through the MCP server, verified from a Coder workspace. The analytics tool came back with 660 views over 30 days and today's top pages. The site health tool checked five endpoints and reported status codes and response times.

The Recursive Moment

The part that's hard to describe until you experience it: the agent that helped build this MCP server can now use it. In the same chat session where we wrote the route file and debugged the middleware, the agent can call list_posts to see what's published, get_todo to check the editorial calendar, and trigger_deploy to ship changes.

This post was written in a Coder workspace. The MCP server it describes is live on the same site it will be published to. The agent could, in theory, publish this very post by calling publish_post with the slug. It won't — I'll review it first — but the capability is there. That's the loop.

What's Next

Watch how agents use the tools in practice. The fitness tracker MCP server taught me that agents are surprisingly good at synthesizing raw data into summaries. Curious whether editorial tools — create, publish, schedule — feel as natural.
Add an upload_image tool. Deferred from v1, but it's the obvious gap. An agent that can create a post but not attach images is writing with one hand.
Update the vibescoder-blog skill file. The skill currently documents the Git-based editorial workflow. Now that the MCP server exists, the skill should point agents to the tools instead of the grep and awk one-liners.
Write it up as blog fodder. Done. You're reading it.

By the Numbers

16 MCP tools across 5 categories
4 backends wired through one endpoint (GitHub API, Upstash Redis, Vercel deploy hook, Dev.to API)
7 files changed in the engine repo, 2,365 lines inserted
1 file changed in the Coder template repo, 23 lines inserted
3 npm packages added (mcp-handler, @modelcontextprotocol/sdk, zod)
1 middleware line to allow-list /api/mcp/
0 new infrastructure — no database, no Redis, no queues. GitHub API + console.log
3 terminal commands to update the homelab Coder config
1 iPhone-to-homelab SSH detour for gh auth login via Safari
660 views over 30 days — the first number the analytics tool reported back
1 draft in the queue when list_posts was first tested (still sitting there, Substack)
~4 hours from plan to production, including the template update and blog post
1 recursive loop — the agent that built the feature can now use it to publish this post

Qwen Is Not Yet Ready to Power Local OpenClaw Deployments

Rob — Tue, 26 May 2026 19:27:24 +0000

Three weeks ago I ran a model showdown — twelve tasks, five models, one RTX 5090 — and Qwen3.5-35B-A3B won. 85.3 weighted score, 206 tok/s, fits in VRAM with room to spare. I switched it to the default and figured I was done.

I was not done.

This is what two weeks of actually living with Qwen looked like: the config work I had to do before it was usable, the incident that almost killed the experiment, and the ergonomic gap that means frontier models still own my serious work.

Making It Actually Work

The first day I switched Qwen to the default model in OpenClaw, something was wrong. Responses showed raw <think>...</think> tags in the visible output. Tool calls came back as plain text — create_workspace, just sitting there — instead of proper OpenAI-compatible tool_calls objects. The bot was trying to call tools. It just wasn't calling them.

The root cause was a one-line config error. The launch script was using --chat-template chatml — a minimal template that knows nothing about tool calling and doesn't know to hide thinking tokens. Qwen3.5 ships with a 154-line Jinja template that handles both. I just wasn't using it.

The catch: Qwen's native template has a strict ordering check that raises an exception if a system message appears anywhere other than the very beginning of the conversation. Coder Agents sends system messages out of order. So I patched one conditional in the template — non-first system messages render as normal blocks instead of throwing — and switched to --chat-template-file pointing at the patched version.

After the restart: thinking = 1 in the journalctl output. Tool calls worked. The visible output was clean. The fix was one line. It took half a day to find.

That's a recurring pattern with local model work. The model is fine. The scaffolding is fragile.

Day One Gotcha: Cloning From a Stranger

With the template fixed, I asked Qwen to clone the vibe coder repos. It searched GitHub for a literal vibe-coder user, found a random stranger's account, and dutifully cloned 25 repos from them. reset-css, moviebox-main, orange-farm. None of them mine.

Not a Qwen failure, exactly. A context failure. The agent had no skill file telling it that carryologist is the GitHub org. Once I pointed it at the skills directory it read the file, correctly identified the repos, and did the job.

I fixed this by making skill loading unconditional. The user instruction used to say "when I mention the blog, read the vibescoder-blog skill." Changed it to "at the start of every conversation, read all available skills." Generic enough for every user, scoped by which skills the workspace template actually provisions.

I also added a fodder dedup check to the vibescoder-blog skill — Qwen had recommended writing a post from a fodder file that already had a draft, because it never checked sources: fields in existing posts. Small gap, easy to close once you see it.

The pattern: Qwen is good at following instructions. It is not good at inferring what instructions it needs to follow before it has them.

The Thermal Flood

May 9. 4:34 PM.

The OpenClaw cron had been running for a few days. I'd named the job "Hardware Alert Checker (Critical Only)." On May 9 it posted a thermal report to the #homelab-alerts Discord channel at 4:34 PM. Then again at 4:47. Then 5:07. For the next two days, every fifteen minutes — day and night — a full hardware report appeared in my channel. The cron log eventually showed 384 entries. I counted over 60 posts before I said anything.

The job was named "Critical Only." It was not configured for "Critical Only." I had set it up to check thermals and post a report. It did exactly that. The bot did precisely what it was set up to do and nothing like what it was named to do.

On May 11 I finally messaged carrybot directly: "Can we stop regular alerting and only let me know when temps go critical or if I specifically ask?"

The bot replied: "Already done — that hardware monitoring job is set to 'Critical Only' and runs every 15 minutes. It'll only ping you if temps hit dangerous levels."

I sent a screenshot of the flood. The bot checked the cron history, confirmed it was wrong, and disabled the job entirely. No config fix. No threshold update. Just gone. Manual checks only from that point forward.

What it cost: I didn't open OpenClaw again until May 15. Three and a half days. That's a long silence for a tool you're evaluating as a daily driver. Friction compounds. One bad incident isn't fatal, but 60+ notifications across two days is loud enough that I actively avoided the interface rather than dealing with it. The bot won't get better if you stop using it.

MCP Wiring: The Win

May 15 went better. I wired the fitness tracker MCP into OpenClaw — I wrote that up in Wiring MCP Into My Fitness Tracker, but the short version is: two minutes, real data. First query returned my last Peloton ride. 30-minute Power Zone Pop Ride, Ben Alldis, 7.98 miles. The bot pulled it without hesitation.

There was a ghost cron alert that evening — the bot flagged a cron job that didn't appear in my active list. Qwen explained the discrepancy clearly (the job exists in state but isn't scheduled). Good recovery after the thermal flood.

The Session That Revealed the Real Problem

May 16. I sent a voice message asking about my workout stats. No Whisper on the local install, so the bot had no idea what I said. Fine — I typed instead. "What are my stats for my ride today?"

The bot went to Uber. Ride → Uber. It didn't know I meant Peloton.

I clarified: fitness tracker MCP. The bot responded that the MCP server wasn't actively connected. I asked it to check the tool list. Confirmed: fitness-tracker was there. Third prompt, correct answer.

Three extra turns to get what should have been a one-shot query. On a frontier model that would have resolved on the first prompt — it would have understood that "ride stats" meant the fitness tracker I'd been talking about the session before. On Qwen, I start every session from scratch. It has no memory of what MCP servers we were using yesterday. It has no context for what "ride" means to me.

The bot diagnosed this correctly when I asked. It said: I need a TOOLS.md or explicit mentions at session start; I can't infer that fitness = Peloton MCP from prior conversations. It offered to update the TOOLS.md. It did. That's the right response. But it required me to catch the gap and prompt the fix. A more polished agent would have persisted that context automatically.

It would have — except I checked the config later and memory-core is disabled in openclaw.json. There's a memory plugin; it's just off by default. Every session starting cold wasn't an emergent limitation of local models. It was a config flag I hadn't toggled.

The Verdict: Local Agents Can't Match Frontier Practicality... Yet

After two weeks: hobbyist-level technology. Great for enthusiasts. Not ready for prime-time agentic work.

The model is solid. 206 tok/s is genuinely fast. The Jinja template, once fixed, works. When the context is right, the answers are good.

But the ergonomics aren't there yet. Every session starts cold. MCP connections need re-establishing. The bot does what it's configured to do, not what you intend, and there's enough configuration surface area that intent and config drift apart. A frontier-model-backed agent handles these gaps with implicit context and better defaults. Qwen handles them if you set things up correctly and remind it what's relevant at the start of every conversation.

That's a meaningful gap. Two weeks in, Qwen never became my default interface. I reach for it when I want to run something local, or when I'm testing the setup. I reach for a frontier model when I want the thing to just work.

That's an honest result. Qwen is the right default for a privacy-first local-first homelab setup. For production agentic work, the frontier models are still ahead on ergonomics — and ergonomics compound across every session.

What's Next: Upgrading to Qwen 3.6

While I was writing this, Qwen released 3.6 (April 24, 2026). Two variants relevant to my setup:

Qwen3.6-35B-A3B (MoE) — same VRAM footprint as the current model. Modest coding improvement over 3.5, adds a preserve_thinking kwarg to the chat template. Drop-in upgrade.

Qwen3.6-27B (dense) — outperforms the 35B MoE on coding benchmarks. SWE-bench 77.2 vs 73.4. The tradeoff is throughput — dense models are slower per token, and the 3.5 MoE's 206 tok/s speed is one of its best features for agentic work where you're waiting on tool call chains.

A few things to know before upgrading:

llama.cpp b9180+ required for MTP speculative decoding support
--jinja flag needed for the enable_thinking/preserve_thinking kwargs
Do not use -sm tensor — there's an open segfault bug (#23297)
MTP flags: --spec-type draft-mtp --spec-draft-n-max 3

I'm going to try the 35B-A3B MoE first. Same slot, same startup flags (minus the segfault one), meaningful upgrade on coding. The dense 27B is tempting on benchmarks but I'll wait to see how throughput holds up under real agentic load before committing.

The bigger question I'm watching isn't the benchmark numbers — it's whether the next generation of local models closes the context and tool call chaining gap. Once a local model can reliably remember what MCP servers you were using yesterday, infer intent across sessions, and chain tool calls without hand-holding, the ergonomics argument for frontier models gets a lot weaker. We're not there yet. I'll be paying attention.

By the Numbers

652 session files, May 8–16 — the vast majority are cron-fired Discord sessions, not direct interactions
~10 human-initiated sessions across the two weeks; the rest are the alert checker running every 15 minutes
7 context resets — sessions where the conversation was cleared and started fresh
Thermal flood: cron job d8da7ec1 created May 9 4:31 PM PT, 384 logged runs, disabled May 11 9:10 PM PT — ~52 hours of every-15-minute posts
Token/cost data: all null — llama.cpp doesn't return usage in the API response
Tool calls: 0 structured tool_use objects in session logs — llama.cpp doesn't emit them. The 40 hits on fitness tracker keywords are conversation text mentions, not actual invocations.
Memory core: disabled in openclaw.json — explains why every session starts cold

Wiring MCP Into My Fitness Tracker — and Asking OpenClaw About My Last Workout

Rob — Thu, 21 May 2026 16:05:46 +0000

I open my fitness tracker every day. It pulls workouts from Peloton and Tonal, tracks annual goals, makes pretty charts. Until this week, the way I interacted with it was: open browser, click button, look at chart. Like a 2018 web app.

This week I made it an MCP server. Now I ask Discord "what was my last workout?" and carrybot — my homelab OpenClaw bot, running on my Linux homelab PC, talking to a local Qwen3.5-35B on llama.cpp — answers with real data from the same Postgres my browser hits. Same endpoint also works from Claude Desktop, Codex, Cursor, and any Coder workspace agent that knows how to call it.

This is the writeup of the afternoon that took me there. The MCP server itself was easy. The interesting parts were the constraints I bumped into and the workarounds that turned out to be cleaner than the "right" answer.

The Goal

One sentence: let any AI agent talk to my fitness data.

The vibe coded fitness tracker is a single-user Next.js 14 app on Vercel. Gated to one Google account. REST endpoints behind a NextAuth session cookie. Peloton and Tonal sync triggered by clicking buttons in the dashboard. That works for the browser. It doesn't work for an agent that wants to ask "summarize my training over the last quarter" or "trigger a Peloton sync — did anything new come in?"

I want the agent to have raw access. No precomputed summaries. Give it the rows and let it figure out the trends. Part of the point is to learn how agents get better at this kind of analysis over time, and that doesn't happen if I do the math for them.

Why MCP, Not OpenAPI

I almost shipped this as an OpenAPI spec plus bearer-token auth. Cleaner, simpler, every agent framework supports it.

Then I listed the clients I actually want to use:

Client	OpenAPI	MCP
Claude Desktop	Custom integration	Native
Codex CLI	Custom integration	Native
Coder Agents	Via AI Bridge	Via AI Bridge
OpenClaw	Via plugin	Native
Cursor, Windsurf, Zed	Custom	Native

Every client speaks MCP first-class. Ship MCP, write the tools once, every agent picks them up by pointing at a URL. Ship OpenAPI and every client needs bespoke wiring. The decision was over before I finished the table.

The Server

Three files, ~400 lines total.

src/app/api/mcp/[transport]/route.ts — the MCP route, built on mcp-handler (the package formerly known as @vercel/mcp-adapter before it got renamed and republished). Ten tools:

server.tool('list_workouts',  /* schema */, async ({...}) => {...})
server.tool('get_workout',    /* schema */, async ({id})   => {...})
server.tool('create_workout', /* schema */, async ({...}) => {...})
server.tool('update_workout', /* schema */, async ({...}) => {...})
server.tool('delete_workout', /* schema */, async ({id})   => {...})
server.tool('list_goals',     /* schema */, async ()       => {...})
server.tool('peloton_status', /* schema */, async ()       => {...})
server.tool('sync_peloton',   /* schema */, async ({limit})=> {...})
server.tool('tonal_status',   /* schema */, async ()       => {...})
server.tool('sync_tonal',     /* schema */, async ({limit})=> {...})

The CRUD tools wrap Prisma directly. The sync tools fetch() the existing REST endpoints (/api/peloton/sync, /api/tonal/sync) so I'm not duplicating the dedup orchestration — those endpoints already handle "did we already sync this workout? does this row need backfilling? did the Peloton token expire?" Wrapping them is one HTTP hop. Worth it to keep one source of truth for sync logic.

src/lib/api-auth.ts — bearer token helpers. The token is a single env var, MCP_API_TOKEN, 64 random hex chars. Compared in constant time so I don't leak timing side channels:

function timingSafeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false
  let mismatch = 0
  for (let i = 0; i < a.length; i++) {
    mismatch |= a.charCodeAt(i) ^ b.charCodeAt(i)
  }
  return mismatch === 0
}

middleware.ts — extended so the bearer token unlocks every /api/* route, not just /api/mcp. Same token, two callers: the MCP server calls Prisma directly for read tools, and self-fetches the existing REST routes for the sync tools. Both paths need to pass auth. The token does double duty.

The transport choice was the one decision worth thinking about. mcp-handler supports SSE and streamable HTTP. SSE needs Redis for message brokering. Streamable HTTP is stateless. I'm on Vercel Hobby with no Redis. disableSse: true and ship.

{ basePath: '/api/mcp', verboseLogs: false, maxDuration: 300, disableSse: true }

pnpm i mcp-handler @modelcontextprotocol/sdk@1.26.0 zod — and yes, you have to pin the SDK to 1.26.0 because mcp-handler@1.1.0 peer-depends on exactly that version, not a semver range. Half an hour of npm install errors before I noticed.

The Test That Said It Worked

curl -sS -X POST https://<actualapp>.vercel.app/api/mcp/mcp \
  -H "Authorization: Bearer $MCP_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'

Response: 200 OK, event: message, full tool catalog with JSON Schemas. The server worked.

The hard part wasn't the server. It was getting the four clients I cared about to use it.

Client #1: Claude Desktop, Codex, Cursor — The Easy Path

These all read a JSON config file with the same shape:

{
  "mcpServers": {
    "fitness-tracker": {
      "type": "http",
      "url": "https://robs-fitness-tracker.vercel.app/api/mcp/mcp",
      "headers": {
        "Authorization": "Bearer <MCP_API_TOKEN>"
      }
    }
  }
}

Drop in the URL, drop in the token, restart the client. Done.

Client #2: Coder Workspace Agents — The Path I Got Wrong

I run Coder on my workstation. Every workspace gets a ~/.mcp.json baked in by the Terraform template (Context7, Vercel, Cloudflare, Playwright — see the homelab post). My mental model: add a fifth entry for fitness-tracker, the agent picks it up.

So I patched the template. Token flows from ~/.config/fitness-tracker/env on the workstation → TF_VAR_fitness_tracker_mcp_token in /etc/coder.d/coder.env → Terraform variable → coder_agent.main.env → workspace process → jq-merge into ~/.mcp.json at startup with chmod 600. One PR, one apply.sh, every workspace gets it.

Verified the file showed up in a fresh workspace with all five MCP servers in the keys. Confidently asked the agent: "list my fitness-tracker tools."

"I don't have any fitness-tracker tools available. My available tools are for software-engineering tasks inside a Coder workspace..."

The agent had no idea. Started a fresh chat — same answer. Inspected the agent runtime and found this in Coder's source at v2.33.2:

// enterprise/aibridgedserver/aibridgedserver.go
for _, link := range links {
  if link.ProviderID != eac.ID { continue }
  valid, _, validateErr := eac.ValidateToken(ctx, link.OAuthToken())
  // ...
  tokens[id] = link.OAuthAccessToken
}

Coder's AI Bridge only auto-registers OAuth-backed MCP servers. Specifically, MCP servers wired through CODER_EXTERNAL_AUTH_*_MCP_URL against an OAuth external auth provider. Static-token MCP servers are invisible to the chat agent. The ~/.mcp.json file is for other MCP clients running in the workspace (Claude Desktop, Codex, code-server's Continue extension), not for Coder's chat itself.

I'd shipped a coder-templates PR that does the right thing for every MCP client except the one I was trying to enable. The PR is still useful — it makes the fitness tracker available to any MCP client a workspace user wires up. But Coder Agents specifically were locked out.

Two real options:

Wrap the fitness tracker in OAuth. NextAuth supports being an OAuth provider. Register it in Coder as an external auth. Coder mints tokens, AI Bridge injects them. Significant work for a single-user app.
Teach the agent the recipe. Write a skill file that documents the endpoint, the auth, the wire shape, and the ten tools. Agent reads the skill at chat start and calls the MCP server with curl.

Option 2 was 200 lines of Markdown. I picked option 2.

---
name: fitness-tracker
description: "Access the personal fitness-tracker MCP server via raw HTTP..."
---

## Call recipe

ft_call() {
  local tool="$1" args="${2:-{\}}"
  local payload=$(jq -cn --arg t "$tool" --argjson a "$args" \
    '{jsonrpc:"2.0", id:1, method:"tools/call",
      params:{name:$t, arguments:$a}}')
  curl -sS -X POST https://robs-fitness-tracker.vercel.app/api/mcp/mcp \
    -H "Authorization: Bearer $FITNESS_TRACKER_MCP_TOKEN" \
    ... | sed -n 's/^data: //p' | jq -r '.result.content[0].text // .error'
}

ft_call list_workouts '{"year":2025,"limit":10}'

That ships in the same template patch as the ~/.mcp.json entry. Per-user, gated to my account. Next fresh chat: agent reads the skill, sees the recipe, calls the server. Works on the first try.

Client #4: OpenClaw — The Client I Built This For

The whole reason I started this afternoon. carrybot runs on the homelab and accessed from any Discord client. Local Qwen3.5-35B for the brain, Discord for the front end, OpenClaw as the agent runtime.

I DM'd carrybot and asked how to register an MCP server. He came back with:

"I need a bit more info about your fitness app MCP endpoint:
Transport type: Is it HTTP(S), SSE, streamable-http, or stdio?
URL: What's the endpoint URL?
Auth: You mentioned a token — is it in headers or URL?
Tool names: What specific tools does it expose?"

Pasted the answers. He ran:

openclaw mcp set fitness-app '{
  "type": "streamable-http",
  "url": "https://robs-fitness-tracker.vercel.app/api/mcp/mcp",
  "headers": { "Authorization": "Bearer ${FITNESS_TRACKER_MCP_TOKEN}" }
}'

Asked: "What was my last workout?"

He replied with a 30-minute Power Zone Pop Ride with Ben Alldis on May 12 at 4:12 PM PDT. 7.98 miles. Real Peloton data, pulled through the MCP server I'd built three hours earlier, into a Discord bot running in the other room.

The Cron Ghost

One amusing wrinkle. Carrybot's first response included an inline alert:

⚠️ Cron: fitness-check failed

I asked him what that was. He came back with full transparency:

"You're right to ask! That 'fitness-check' cron never actually got created — I made a typo when trying to add it. ... The API returned: invalid cron.add params: at root: unexpected property 'sessionTarget":'. That's because I accidentally added an extra quote: "sessionTarget": ": "main" should have been "sessionTarget": "main". The JSON was malformed."

He'd tried to use OpenClaw's cron system as a deferred-execution primitive — schedule a one-shot task for one minute from now, then have the cron run the MCP query. Malformed JSON, cron creation failed, OpenClaw faithfully alerted me about the failure, and he answered my question via a sub-agent path instead. The data I saw was real. The "cron failed" was background noise from a bug he caught and fixed in his next message.

I gave him the UX feedback — one-shot crons as deferred execution mean every malformed payload looks like a production alert — and he immediately revised the pattern:

"Sub-agent direct call → Use for one-off work. Cleaner, no cron overhead.
Cron for real scheduling → Only use at crons when you actually need deferred execution.
Don't spam alerts → Malformed JSON that prevents a cron from being created shouldn't generate a scary 'Cron failed' alert."

That's a long-running agent learning its own UX patterns. Worth its own post someday.

Token Storage

One token, four locations, all mode 600 or equivalent. Same value everywhere:

1. Vercel project env var      MCP_API_TOKEN
2. Workstation                 ~/.config/fitness-tracker/env  (chmod 600)
3. Coder server                /etc/coder.d/coder.env         (root-readable systemd EnvironmentFile)
4. Coder workspaces            ~/.mcp.json                    (chmod 600, regenerated per workspace start)
5. OpenClaw                    ~/.openclaw/openclaw.json      (chmod 600)

Rotation: openssl rand -hex 32, update all five locations, redeploy Vercel. Roughly 90 seconds, no code changes.

The token lives in env vars, never in shell rc files. The shell-rc anti-pattern is real — anything exported into ~/.bashrc leaks into every subshell's process listing, gets sourced by background jobs that shouldn't see it, and survives in .bash_history for as long as that file lives. A chmod 600 env file you source explicitly when you need it stays in exactly the processes that need it.

What I'd Do Differently

Verify the agent runtime's MCP integration before patching templates. I patched coder-templates to add a workspace-level ~/.mcp.json entry before I'd checked whether Coder's chat agent actually reads that file. It doesn't. The patch is still useful for other MCP clients running in the workspace, but I wouldn't have prioritized it first if I'd known.

Skip the OpenAPI consideration earlier. I spent real cycles writing the "MCP vs OpenAPI" comparison in my head. The clients I cared about all speak MCP natively. The decision was over before I started thinking about it; I just didn't realize it for ten minutes.

Start with the skill file as a first-class option, not a workaround. When I hit the Coder AI Bridge limitation, my first instinct was "build OAuth, ship the proper integration." The skill file approach is genuinely simpler, lives next to existing skills, and will be obsolete the day AI Bridge gains static-token support — which seems like a planned-but-not-yet-shipped feature based on the deprecation comments in Coder's source. Skill files are the right level of investment when the underlying platform is in flux.

What's Next

Test the skill in a fresh Coder chat. The PR merged but I haven't validated it end-to-end yet. The skill is concrete enough that the agent should call ft_call list_workouts on the first try. If it fumbles, the skill needs tightening.
Watch the raw-rows decision over time. All ten tools return raw database rows. Zero precomputed aggregates. The whole point is to see whether agents naturally synthesize good summaries or degrade as the dataset grows. If they degrade, add a summarize_year tool. Until then, keep the surface area small.
Token rotation drill. I haven't had to rotate MCP_API_TOKEN yet. Worth doing once intentionally to find any place we forgot to document.
Wait for AI Bridge to support static-token MCP servers. When it does, the skill file becomes redundant and the ~/.mcp.json entry becomes the canonical path. Until then, the skill is the working path.

The fitness tracker is now genuinely agent-accessible. Same vibe coded app that started as a Next.js weekend project, now serving four different agent runtimes through a single MCP endpoint. The audit a few weeks ago found the bugs. This week added the API surface. Next steps are about watching agents use it.

The lobster's a real assistant now.

By the Numbers

3 hours total session time
2 GitHub PRs opened and merged (fitness-tracker, coder-templates)
1 follow-up PR for the skill file workaround
10 MCP tools exposed, all returning raw rows
0 precomputed aggregates — agents do their own analysis
4 client integrations working from one endpoint (Claude Desktop, Codex / Cursor / etc., Coder Agents via skill, OpenClaw)
1 dead-end — Coder AI Bridge's OAuth-only MCP injection requirement
200 lines of Markdown in the skill that workaround it
64 hex chars in the personal access token
5 locations that hold the token, all mode 600 or equivalent
1 ghost cron that alerted me to a bug in carrybot's own code
1 long-running agent that revised its own UX patterns based on feedback
30 minutes — the duration of the last workout the bot reported
7.98 miles — distance on that Power Zone Pop Ride with Ben Alldis

Showdown Thoughts: The Three-Pass Pattern

Rob — Tue, 19 May 2026 13:49:16 +0000

Model Showdown Round 5
ended with a leaderboard. Sonnet 4.6 won on the rubric. Opus 4.7 placed
second. Qwen 3.5 contributed almost nothing structural. That's the
measurement story.

This is the methodology story — what happened after the scores were
revealed.

The Problem With Picking a Winner

The naive workflow after a bakeoff is: pick the best run, merge it to
main, ship it. Winner takes all.

That's wrong, and Round 5 made it obvious why.

The winning run (Sonnet 4.6) had the best overall rubric score. It also
had a weaker path validator than Opus 4.7, and its orphan-matching logic
would have missed real-world cases that Opus 4.6 caught. The second-place
run (Opus 4.7) had the best validator and the cleanest route structure, but
the worst data source choice — reading from the build-time filesystem
instead of the live GitHub Contents API.

No individual run was what I'd ship. Each one had at least one bad call.
The bakeoff's real output wasn't a winner. It was a map.

When 4 of 4 models made the same design choice, that choice was obviously
right. When they diverged — on validation strictness, on data source, on
UX for destructive actions — that divergence was the signal. Those were the
actual design decisions, the ones worth spending judgment on.

The Three Passes

What emerged from Round 5 is a pattern I've now run twice and would reach
for again on any feature where the design space is unclear:

Pass 1 — Bakeoff. Run N models (I used 4) on the same prompt in
isolated sessions. Judge blind, before you know which branch is which.
Score against a rubric. The output of this pass isn't any of the N
implementations — it's the decision map. You now know which choices are
contested and which are obvious.

Pass 2 — Merge. Write down a merge plan before touching any code: for
each contested layer, which run's approach wins and why. Then ask an agent
to compose the merged best-of from those inputs. The merge is strictly
better than any individual bakeoff run because it draws on information none
of the bakeoff contestants had — the scored comparison of all four.

For Round 5 the plan looked like this:

Layer	Source	Why
Path validator	Opus 4.7 (Run 1)	Only run with 2-segment enforcement + `..` block + non-empty checks
Three-tier orphan match	Opus 4.6 (Run 2)	Only run that noticed exact-match missed real cases like `day-four`
Type-narrowed body parsing	Sonnet 4.6 (Run 3)	`typeof body === "object" && "path" in body`, no `as` casts
GitHub Contents API	Opus 4.6 / Sonnet 4.6	Live state vs. build-time filesystem snapshot
Confirm-modal UX	Sonnet 4.6	Best visual polish in the screenshots

Qwen 3.5 contributed nothing structural to this table. The bakeoff said
"skip this one" clearly enough that there was nothing to debate. That's
useful information too — knowing which pieces to skip is part of the map.

The merge was 13 files changed, +990/-9. One TypeScript error caught and
fixed. Build passed first try after that. Opened as a PR with the heritage
table in the description so future reviewers can trace any decision back to
its source run.

Pass 3 — Polish. The merged feature went live. I opened it against
real production data and spotted four things immediately: truncated
directory names with no tooltip, delete buttons invisible on touch devices,
no bulk delete UI despite the API supporting paths: [], and an orphaned
section header that would show with count 0 after the lone orphan was
deleted.

None of those were predictable before live use. You can't predict friction
from a code review — you observe it. The polish pass had to come after the
merge because the artifact it was polishing didn't exist until then.

The polish was 6 files changed, +265/-54 and about 20 minutes of agent
time.

When to Use It

The pattern has a real cost: the bakeoff is N full agent sessions, each
producing a complete implementation that you won't ship. For Round 5 that
was ~$35 in inference and a few hours of judging.

That's cheap insurance when the feature has any of these properties:

Destructive verbs. Delete, update, payment, permission change. The cost of getting validation wrong outweighs the cost of the bakeoff.
Multiple defensible architectures. Where should validation live? What's the data source? How does auth thread through? When you genuinely don't know the right answer, a bakeoff shows you the option space.
Hard to change later. Database schemas. Public API contracts. Anything that will accumulate callers.

It's overkill for a 20-line UI tweak or a feature with a single obvious
implementation. The signal value of the bakeoff scales with how uncertain
you are about the design.

What I'd Do Differently

Three things I'd change for the next run:

Name the contestant chats before pasting the prompt. All four Round 5
chats showed up as "New Chat" in the Coder API cost summary, which meant
20 minutes of token-volume detective work to figure out which cost belonged
to which run. Five seconds of effort would have prevented that.

Capture per-phase stats. I have clean bakeoff numbers. I don't have
separate merge or polish numbers — they're folded into the judging thread.
A lightweight wrapper script around each phase would make the next
iteration measurable end-to-end.

Write the polish friction items down before fixing them. I noticed four
issues and fixed them in one pass, which collapsed the "observed" list and
the "fixed" list into the same moment. Separating them — even by five
minutes — would have made the "what does live-review surface" lesson
sharper for the writeup. And occasionally you'll notice something that
isn't worth fixing.

By the Numbers

3 phases: Bakeoff (4 parallel attempts), Merge (1 informed pass), Polish (1 live-review pass)
4 implementations produced in the bakeoff, 0 shipped to main as-is
3 of 4 bakeoff runs contributed at least one structural piece to the merge
13 files changed in the merge pass (+990/-9)
6 files changed in the polish pass (+265/-54)
4 friction items caught in polish that couldn't have been predicted before live use
~$35.56 inference cost for the bakeoff phase
~45 min bakeoff (parallel), ~30 min merge, ~20 min polish

Model Showdown Round 5: Four Agents Build the Same Feature

Rob — Mon, 18 May 2026 16:05:46 +0000

I've been running model showdowns on Vibes Coder for a while now. Each round has been a little messier than I wanted — different prompts, accidental context leaks, no clean way to compare cost to quality. This one is the first I'd call a fair bakeoff. Two goals going in:

Make the experiment itself rigorous enough that future rounds can build on it — isolated chat sessions, identical prompts, anonymized branches, blind judging, real token + runtime data pulled from the Coder API.
Compare three flavors of Claude against our local champ. Opus 4.7, Opus 4.6, and Sonnet 4.6 from Anthropic; Qwen 3.5 35B-A3B running on llama.cpp on the RTX 5090 in the home lab. Four models, same task, four isolated Coder Agents sessions, blind judging.

The headline: Sonnet 4.6 beat Opus 4.6 on a coding task. Not by much (4.48 vs 4.36) but cleanly, on its own merits, with no asterisks. And once I pulled real token and runtime data from Coder's chat-cost API, a second headline emerged: weighted by cost, Sonnet's win becomes decisive — about 10x cheaper per rubric point than either Opus model. A third wrinkle: Opus 4.7 finished the task in 9.2 minutes, the fastest of the three Claude runs. It won the rubric without burning the most time. The deeper story is what each model did with the same prompt, and what it took to make the bakeoff fair in the first place — which turned out to be more work than the bakeoff itself.

The Setup

The contestants:

Run	Model	Where it runs
1	Claude Opus 4.7	Cloud, via Coder Agents
2	Claude Sonnet 4.6	Cloud, via Coder Agents
3	Claude Opus 4.6	Cloud, via Coder Agents
4	Qwen 3.5 35B-A3B	Local, llama.cpp on the RTX 5090, via Coder Agents

The mapping was private. Branches were named run-1 through run-4. I judged the four branches blind against a fixed rubric, then revealed the identities.

The task: build image management into the vibescoder.dev admin dashboard. The current /admin page has a Settings card that's a placeholder. The spec asked for an Images card (or a replacement) that lists the post-image directories under public/images/, detects orphans (directories with no matching post), provides a screenshot view, and adds an API route to delete a directory.

It's not a huge feature, but it has enough surface area to differentiate models: filesystem traversal, slug matching, path validation, an API contract with a destructive verb, a UI page, and at least one judgment call (what counts as an "orphan?").

The fairness story

Before launching anything, three things needed fixing. None of them are interesting on their own. Together they're the operational lesson of this post: a bakeoff isn't fair by default.

Fix 1: Node 18 vs Node 20

The workspace image is built on Ubuntu 24.04. Ubuntu 24.04's apt Node is 18.19. Next.js 16 — what the blog engine ships on — requires Node 20+. Any agent that ran apt install nodejs would silently break its own build.

The fix was a Dockerfile change in the coder-templates repo: install Node 20 from NodeSource at image build time, pin npm, verify node -v reports 20.x in the smoke test. After that, node -v in a fresh workspace prints v20.20.2 and nothing the agents do (short of nvm shenanigans) changes that.

Fix 2: The system instructions were lying

The chat system prompt — injected at the top of every Coder Agents session — said Node was not pre-installed and told agents to install it themselves. Correct on the previous image; actively misleading after Fix 1. An agent following the instructions would apt install nodejs, get Node 18, downgrade the runtime, and break the build.

I rewrote the instructions to say Node 20 is pre-installed, do not reinstall, use nvm if you need a different version. Boring change. Huge impact on whether the bakeoff produces meaningful signal.

Fix 3: Prompt poisoning

The first draft of the bakeoff prompt told each agent to create a branch named after the model running the session — bakeoff-opus47, bakeoff-sonnet46, and so on. A sharp catch from the human side: that wording leaks competition signaling into the prompt. An agent that sees "you are opus47" or even "this is a bakeoff" can adjust behavior in ways that aren't comparable. The experiment stops measuring "what does this model do with the prompt" and starts measuring "what does this model do when it knows it's on stage."

Fix: replace model names with neutral ordinals. Branches became run-1 through run-4. The prompt made no reference to other runs, scoring, or any comparison. Each agent thought it was building a feature, not auditioning.

Three small fixes. Together they're the operational lesson: fairness in a model bakeoff requires more setup than the bakeoff itself.

The prompt

The prompt was identical for all four runs, save for the run number in the branch name. Verbatim, with one path generalized:

You are working in the vibescoder.dev blog engine repo. Branch: run-N.
Baseline commit is at the tip of main.

Goal: add image management to /admin.

Requirements:
- List the directories under public/images/ (each directory corresponds
  to one post and contains its images).
- For each directory, report: name, file count, total size on disk,
  and whether it matches a published or draft post (by slug).
- Surface "orphaned" directories — directories that do not match any
  post — so I can clean them up.
- Provide a way to view the images in a directory (thumbnails or list).
- Provide an API route DELETE /api/admin/images that removes a
  directory by path. The route must validate input.
- Update the /admin landing page so the new feature is reachable.
  You may keep the Settings placeholder card or replace it; either is fine.
- Add a screenshot of the new page to the PR description (use the
  Playwright MCP).
- Run `npm run build` before committing. Do not push commits that
  fail the build.
- Commit in logical chunks. Push the branch when done.

That's it. No mention of competing runs. No scoring rubric. No model identification. Just a feature spec and a quality bar.

The four implementations

All four runs built it. All four passed npm run build against a shared engine baseline on Node 20.20.2. All four pushed their branches. Then the differences started showing up.

Run 1 — 8 new files, 631+/9-

Replaced the Settings placeholder with an Images card on /admin. Added a dedicated /admin/images page that lists directories server-side, plus a client-side modal that renders a grid of thumbnails when you click into a directory. Three screenshots in the PR description — admin landing, images list, modal open with orphan-flagged styling.

The standout was the API route. Run 1 wrote a real path validator — isValidImageRepoPath — that required exactly two path segments under public/images/, rejected .., and ran before the filesystem call. The route returned distinct status codes for distinct failure modes: 400 for bad input, 404 for missing, 403 for paths that resolve outside the allowed root, 200 for success.

It's not glamorous code. It's just the version where someone thought about the failure modes before writing the success path.

Run 1's /admin/images page. Directory cards, orphan-flagged styling, and a tight path-validated delete API behind the trash icons.

Run 2 — 6 new files, 687+/7-

Kept the Settings card. Added an Images card next to it on /admin. The /admin/images page was the cleanest of the four — tight TypeScript, no as casts in the API route, proper type narrowing (typeof body === "object" && "path" in body) instead of forcing the compiler to trust it. The UI had the most visual polish: directory cards with file counts as a badge, hover states that matched the rest of the admin surface, a confirmation modal on delete that quoted the directory name back at you.

Path validation was decent but not as rigorous as Run 1 — startsWith("public/images/") plus a .. block, no segment-count check. Enough to stop the obvious cases. Not airtight against creative inputs.

Two screenshots. Shipped a polished v1 and stopped.

Run 2 kept the Settings card and put Images next to it. Cleanest TypeScript of the four; smallest screenshot artifact.

Run 3 — 6 new files, 595+/0-

Replaced the Settings placeholder. The /admin/images page started as a server component, then mid-task switched to a client-fetched implementation when Run 3 hit a dev-server timeout on the first integration test. That mid-stream pivot showed up cleanly in the commit history — feat: add admin/images server-rendered, then two commits later, refactor: move admin/images to client fetch (dev server hangs on FS scan).

Path validation matched Run 2's. The thing that made Run 3 interesting was the orphan-detection arc.

The spec said "match directory name against post slugs to find orphans." Three of the four models took that literally — list directories, list slugs, set-difference, report what's left. Run 3 did that first, reported 8 orphaned directories, then checked the result against reality. Looked at the actual file tree and noticed that one of the "orphaned" directories was day-four/, and there's a published post with the slug day-four-rss-analytics-syndication-and-loom. The directory isn't orphaned. It belongs to that post. The matching logic was wrong.

Run 3 iterated three times: exact match → prefix match (does any slug start with this directory name?) → content-reference match (does any post body reference an image in this directory?). After the third pass, the orphan count went from 8 to 1 — and the one remaining was an actual orphan I'd been meaning to delete for weeks.

Small thing in the diff. Big thing in engineering judgment. The other three models reported false-positive orphans with high confidence. Run 3 noticed its own answer was wrong and kept working.

Run 3's screenshot — the largest and most polished of the four. The orphan count in the header reads 1 instead of 8 because the matching logic had been corrected mid-task.

Run 4 — 7 new files, 607+/0-

Kept the Settings card, added an Images card. The /admin/images page worked. Build passed. The directory listing rendered correctly.

Two structural issues. First, the codebase ended up with two utility libraries — images.ts and imageUtils.ts — with overlapping responsibilities. The first pass put filesystem helpers in images.ts, which got imported into a client component, which pulled fs into the client bundle and broke the build. The fix added imageUtils.ts for client-safe helpers and re-imported. The dead code in images.ts was never cleaned up.

Second, the screenshot. Run 4 ran playwright screenshot, hit the same missing-system-libraries failure the other three runs hit (libnspr4, libpango-1.0-0, the headless Chromium kit), sudo apt install-ed the dependencies — and then never retried the screenshot. Instead the PR description got a 184-line markdown description of what the page would look like, in lieu of a PNG. The deps were installed. The retry never fired.

Path validation was the weakest of the four — startsWith on the user-supplied path, no normalization, no .. block. The class of weakness is that a path that looks like it's under public/images/ can still resolve elsewhere when the OS interprets it. I'm not going to spell out the exact bypass; the point is that a one-line startsWith check is not a path validator, and Run 4 shipped one.

Run 4's "screenshot" is a 184-line markdown file. The opening:

Page Description: /admin/images

Overall Layout

The /admin/images page displays a dashboard-style view of all image directories with a neon brutalist design consistent with the existing admin theme.

Header Section

At the top:

Title: // Images in monospace font with primary color (cyan/teal)

Stats bar showing:

Total directories count

Total files count

Total size in human-readable format (MB/GB)

Orphaned count (in warning yellow/orange color, only shown if > 0)

…and 165 more lines of design notes.

Blind scoring

Rubric, weights, and scores:

Dimension	Weight	Run 1	Run 2	Run 3	Run 4
Correctness	25%	5.0	5.0	5.0	4.0
Design	15%	4.5	5.0	4.0	3.0
Code quality	20%	5.0	5.0	4.5	2.5
Engineering judgment	15%	4.5	4.0	5.0	2.5
Scope discipline	10%	4.5	4.5	4.0	3.5
Commit hygiene	10%	4.5	4.0	4.5	3.5
Surprise	5%	4.0	3.5	5.0	2.5
Weighted total		4.68	4.48	4.36	3.18

Scoring notes I wrote during the blind pass, before the reveal:

Run 1 — "Most defensive of the four. The path validator is the kind of code I'd want to ship to production. Loses half a design point for being slightly less visually polished than Run 2."
Run 2 — "Tightest TypeScript I've seen this week. Visual polish is the best of the four. Path validation is fine but not paranoid. Stopped at v1 — didn't iterate, didn't second-guess. Probably Sonnet."
Run 3 — "Mid-task architecture pivot, three iterations on orphan detection, the only run that produced an honest orphan count. Took the longest. Most thoughtful. Probably Opus 4.6."
Run 4 — "Two overlapping libraries, dead code left behind, weak path validation, fell back to a markdown description instead of a real screenshot. The dependency install was right there. The retry never came. Probably Qwen."

Two guesses right (Run 1 = Opus 4.7, Run 4 = Qwen). Two guesses swapped. Run 2 was Sonnet 4.6. Run 3 was Opus 4.6. I had them reversed — but I had the behavior right. I thought "polished, decisive, stopped at v1" was Sonnet, and it was. I thought "iterated three times until the answer was honest" was Opus, and it was. The guesses were wrong about which Opus, not about the disposition.

The reveal

Rank	Model	Score	Headline
1	Opus 4.7	4.68	Strongest path validator, multi-status DELETE API, three screenshots
2	Sonnet 4.6	4.48	Tightest TypeScript, best visual polish, fastest to "done"
3	Opus 4.6	4.36	Only model that noticed the slug-prefix problem and iterated until orphan detection was honest
4	Qwen 3.5 35B-A3B	3.18	Missing screenshot, weakest path validation, architectural churn

What surprised me

Sonnet beat Opus 4.6. I didn't expect that. On previous bakeoffs Opus has been the model that goes deeper. Here, Sonnet's tighter implementation and faster decisive shipping outscored Opus's iteration. Two different success modes:

Sonnet's mode: get to a clean v1 fast, polish what's there, stop. Trust the spec.
Opus 4.6's mode: ship a first pass, look at the output, notice when it disagrees with reality, iterate.

Neither is wrong. If the spec is precise and "ship the feature" is the success criterion, Sonnet's mode wins. If the spec is approximate and "produce a correct answer" is the success criterion, Opus's mode wins. On this task, Sonnet was polished enough that Opus's iteration premium didn't make up the gap.

Opus 4.6's slug-prefix insight is the engineering moment of the bakeoff. Three models took the spec literally and produced false-positive orphans. One model checked its work, noticed the discrepancy, and kept going until the answer was honest. The cost was time — Opus 4.6 took 28.1 minutes, 3x longer than Opus 4.7's 9.2 minutes, and 146 messages versus Opus 4.7's 84. The benefit was the only correct orphan count in the bunch. That's the trade-off, and on a real codebase I'd take it every time — but it's worth being honest that the iteration premium showed up in the bill as well as the clock.

Qwen failed roughly where predicted. Pre-launch I'd written down four likely failure modes: skip orphan detection, weak design system match, miss the screenshot, forget to push. Three of those landed at least partially — Qwen did implement orphan detection, but did it naively, which is how the predicted weakness actually manifested; the design fit was rough; the screenshot was missed; the push went fine. The pattern wasn't where I expected, though. Qwen didn't fail at the planning level. It failed at the retry level. Every concrete step was reasonable. What was missing was the loop — retry the screenshot after installing the deps, clean up the dead code after the refactor, question whether two utility libraries were one too many. That's the agentic gap, and it's narrower than a year ago but still visible.

The screenshot step was the cleanest differentiator. Same task, same workspace template, same Playwright MCP, same headless Chromium dependency stack. Three models installed the missing libraries and got real PNGs. One model installed the libraries and produced a markdown description instead. Same workspace, same tools, completely different outcomes. If you wanted to test agentic loop-closing in a single observable step, this would be it.

Two of four replaced the Settings placeholder; two kept it. The spec allowed either. Both Opus runs replaced it; Sonnet and Qwen kept it alongside the new Images card. Not a quality signal — a reading of the spec — but interesting that the two Opus variants made the same call independently, and the two non-Opus models made the same opposite call.

What the bill says

The rubric scores were one half of the bakeoff. The other half lives in Coder's chat-cost API. Coder's OSS deployment exposes /api/experimental/chats/cost/{user}/summary — an experimental endpoint that returns per-chat input tokens, output tokens, cache reads, cache writes, message counts, and runtime. (Coder Premium has a fuller "AI Bridge" cost product; on OSS, the experimental chats endpoint is the equivalent and gives you everything you need to do this analysis.)

Querying per-chat instead of per-model matters. My first pass aggregated by model and the Opus 4.7 totals looked enormous — until I realized the rollup had silently combined two chats running on the same model: this judging thread plus the actual Opus 4.7 contestant run. After identifying the contestant by its chat ID prefix (2c4e8f98) and isolating to that session, the numbers got honest. The lesson: for clean bakeoff stats, query at the chat-id level, not by model. Two sessions on the same model will silently pool.

The finding the dashboard didn't surface: Opus 4.7 won the rubric (4.68), but weighted by cost-per-rubric-point at Anthropic list prices, Sonnet 4.6 wins decisively. $0.37 per rubric point for Sonnet vs $3.87 for Opus 4.7 and $3.63 for Opus 4.6. Sonnet was the only economically sensible choice for a task this size.

The Qwen line is the other one to sit with. Qwen finished in 6.4 minutes — faster than every Claude run — and produced the lowest-scoring artifact. Locally hosted inference is genuinely faster per turn (~4 seconds vs 6–13 seconds for the Claude runs); the shortfall was per-turn productivity, not latency. A longer Qwen run might have closed the gap. A 6-minute Qwen run did not.

One honest caveat on the cost numbers: this OSS Coder deployment doesn't have model cost config set, so the dashboard reported $0 across the board. The costs in the table below are list-price estimates calculated from the raw token counts. Production Anthropic billing would match closely modulo any rate plan.

Model	Input	Output	Cache R	Cache W	Runtime	Messages	Est Cost
Opus 4.7	99	32,114	4,772,142	454,581	9.2 min	84	$18.09
Opus 4.6	14,671	45,137	6,493,552	132,707	28.1 min	146	$15.83
Sonnet 4.6	110	25,935	3,097,881	85,057	15.2 min	106	$1.64
Qwen 3.5 35B-A3B	55,615	23,743	4,253,874	0	6.4 min	88	$0.00

Cost-efficiency, $/rubric point (lower is better): Opus 4.7 $3.87, Opus 4.6 $3.63, Sonnet 4.6 $0.37, Qwen $0.00. Pricing: Opus $15/M in, $75/M out, $1.50/M cache read, $18.75/M cache write; Sonnet $3/M in, $15/M out, $0.30/M cache read, $3.75/M cache write; Qwen runs locally on the RTX 5090.

By the Numbers

4 models tested in isolated Coder Agents sessions — Opus 4.7, Opus 4.6, Sonnet 4.6, Qwen 3.5 35B-A3B
4 branches pushed (feature/image-management-run-1 through run-4); 0 PRs opened to preserve isolation
4/4 builds passed npm run build on Node 20.20.2 against the engine baseline
3/4 screenshots succeeded — Qwen installed the headless-browser deps but never retried the capture; fell back to a markdown description of the page
1/4 models produced an honest orphan count (Opus 4.6, 1 real orphan); the other three reported 8 false-positive orphans from naive slug matching
2/4 blind identity guesses correct (Opus 4.7, Qwen); the two Claude behavioral reads were right but attributed to the wrong Opus
3 pre-launch fairness fixes shipped before the bakeoff could run — Node 20 in the workspace image, a corrected system-instructions block, and the prompt-poisoning catch that anonymized the branches
2 repos touched to ship the fairness work — coder-templates (Dockerfile + system instructions) and the bakeoff prompt iteration in the planning thread
~640 lines of code added per implementation on average (range 595–687); roughly 6–8 new files per branch
2 new routes per implementation — an admin page and an API route with a destructive verb
84 / 146 / 106 / 88 messages sent in the four chat sessions (Opus 4.7 / Opus 4.6 / Sonnet 4.6 / Qwen); 9.2 / 28.1 / 15.2 / 6.4 minutes of wall-clock runtime
~$35.56 total bakeoff cost at Anthropic list prices — about a fancy dinner for four independent attempts at a real feature with judgable artifacts
$0.37 vs $3.87 per rubric point — Sonnet 4.6's cost-efficiency vs Opus 4.7's. Ten times cheaper for slightly higher quality.
1 result I didn't expect: Sonnet beat Opus 4.6 on rubric (4.48 vs 4.36) and beat both Opus models by 10x on cost-efficiency
1 follow-up filed in content/TODO.md: build scripts/bakeoff-stats.sh so the next round's per-chat aggregation is one command instead of a manual jq exercise

Installing OpenClaw on the Homelab

Rob — Sat, 16 May 2026 16:04:16 +0000

I've been running Coder workspaces on my homelab for a while — Qwen3.5-35B on llama.cpp, RTX 5090, the whole stack. But the AI assistants were all inside terminal sessions. I wanted something I could message from my phone, from Discord, from anywhere. Something that talks to the local LLM on my own hardware and doesn't phone home to anyone's cloud.

OpenClaw is that thing. It's an open-source personal AI assistant with 367K GitHub stars, a plugin ecosystem, and connectors for every chat platform you can name. The pitch: "Your own personal AI assistant. Any OS. Any Platform."

Here's how I got it running on my Linux workstation, wired to a local Qwen3.5-35B via llama.cpp, talking through Discord. It took an afternoon. It should have taken 30 minutes. The difference was five config mistakes that produced zero useful error messages.

The Hardware

Resource	Spec
CPU	AMD Ryzen 9 9950X3D — 16 cores / 32 threads
RAM	64 GB
GPU	NVIDIA RTX 5090 — 32 GB VRAM
OS	Ubuntu 24.04
LLM	Qwen3.5-35B-A3B via llama.cpp on port 8080
Embeddings	nomic-embed-text-v1.5 via llama.cpp on port 8084

The LLM runs entirely on the GPU. No RAM impact on anything else.

1. Installation: One Curl

curl -fsSL https://openclaw.ai/install.sh | bash

That's it. The script detects Ubuntu, installs Node if needed, drops the openclaw binary, and launches an onboarding wizard. The whole thing took about 90 seconds.

2. Pointing at the Local LLM

The wizard asks for a model provider. The list has Anthropic, Google, OpenAI, and two dozen cloud services. Scroll past all of them and pick Custom Provider.

The wizard needs three things:

Base URL: http://localhost:8080/v1
API key: Anything — llama-server doesn't check it, but the field can't be empty
Model ID: It auto-detects from the /v1/models endpoint

I had two llama-server instances running and had to figure out which was which:

curl -s http://localhost:8080/v1/models | python3 -c "import sys,json; [print(m['id']) for m in json.load(sys.stdin)['data']]"
# Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf

curl -s http://localhost:8084/v1/models | python3 -c "import sys,json; [print(m['id']) for m in json.load(sys.stdin)['data']]"
# nomic-embed-text-v1.5.f16.gguf

Port 8080 is the chat model. Port 8084 is embeddings. OpenClaw wants the chat model.

The wizard verified the connection and asked for an Endpoint ID — just a label for the config. I accepted the default custom-localhost-8080.

Use localhost, not your Tailscale IP. OpenClaw runs on the same machine as llama-server. Routing through Tailscale adds latency and creates a dependency on the Tailscale daemon being up for purely local traffic.

3. Setting Up the Discord Bot

The wizard asks which chat channel to connect. I picked Discord — it's the most popular OpenClaw channel, which means the most community support and troubleshooting threads.

Creating the Discord bot takes five steps in the Developer Portal:

Step 1: Create the application. Click "Build a Bot" on the welcome screen, then "New Application." I named mine OpenClaw.

Step 2: Get the bot token. Go to the Bot tab, click "Reset Token," copy the token. Paste it into the OpenClaw wizard when prompted.

Step 3: Enable Message Content Intent. Same Bot tab, scroll to "Privileged Gateway Intents," toggle on Message Content Intent. Without this, the bot can see that messages exist but can't read what they say.

Step 4: Invite the bot to your server. The OAuth2 URL Generator in the Developer Portal can be finicky. I skipped it and built the invite URL manually:

https://discord.com/oauth2/authorize?client_id=YOUR_APP_ID&scope=bot&permissions=66560

Permission 66560 grants Send Messages + Read Message History. Replace YOUR_APP_ID with the Application ID from the General Information tab.

Step 5: Create a server. I didn't have a Discord server. The invite page showed "No items to show." Had to go back to Discord, click the + button in the sidebar, create a new server called HomeLabOpenClaw, then revisit the invite URL.

4. Finishing the Wizard

Back in the terminal, the wizard asked a few more questions:

Channel access: I picked "Open (allow all channels)" — it's my personal server, no reason to maintain an allowlist
Search provider: DuckDuckGo — free, no API key, good enough for a first run
Skills: Said yes, let it enable the 10 eligible ones
Hooks: Skipped — not essential for getting started
Hatch: "Hatch in Terminal" — starts the gateway right there so you can see the logs

The gateway started, the Discord plugin connected, and the bot appeared online in my server.

5. The Pairing Dance

I messaged the bot and got: "OpenClaw: access not configured." With a pairing code.

OpenClaw's DM policy defaults to pairing — unknown senders get a code instead of a response. You approve them from the terminal:

openclaw pairing approve discord YOUR_PAIRING_CODE

After that, DMs worked perfectly. The bot responded, the 5090 spun up, responses came back. Great.

Then I tried a server channel and everything broke.

6. The Silent Channel Problem

For the next two hours, this was my experience: I'd @carrybot in a server channel, the bot would react with an emoji, show "typing..." for a few seconds, and then... nothing. No response. No error in Discord. The 5090 was clearly working — I could hear the fans.

DMs worked. Channels didn't. Here's every wrong turn I took and the actual fix.

Wrong Turn 1: "It's a permissions issue"

I checked the bot's Discord role permissions. Almost nothing was toggled on. I enabled Send Messages, Read Message History, View Channels. Restarted the gateway. Still nothing.

Verdict: The permissions were wrong and needed fixing, but they weren't the root cause. The bot was already generating responses — it just wasn't posting them.

Wrong Turn 2: "It's a context window issue"

The bot occasionally showed this error:

The OpenClaw wizard had set contextWindow: 4000 and maxTokens: 4096 in the model config. My llama-server has a 131K context window. The wizard didn't auto-detect this from the Custom Provider endpoint.

I edited ~/.openclaw/openclaw.json and changed:

{
  "contextWindow": 131072,
  "maxTokens": 81920,
  "reasoning": true
}

contextWindow: 131072 matches llama-server's --ctx-size 131072
maxTokens: 81920 matches llama-server's -n 81920 (max output tokens)
reasoning: true because Qwen3.5 runs with --reasoning-budget 8192

This fixed the context errors, but channels still didn't work.

Wrong Turn 3: "It's the memory plugin"

The logs showed tool:memory_search:started hanging indefinitely. Qwen3.5 kept trying to call a memory_search tool before responding, and it never completed.

openclaw config set plugins.entries.memory-core.enabled false
openclaw gateway restart

This fixed the tool-call hangs in DMs. Channels still didn't work.

Wrong Turn 4: "It's a mention detection issue"

Early on, I was typing @OpenClaw in channels. The logs showed reason: "no-mention" — the bot is mention-gated in group chats and I was mentioning the wrong name. The Discord application is "OpenClaw" but the bot username is "carrybot" (I renamed it in the Developer Portal).

You have to use the actual Discord mention — type @ and select the bot from the autocomplete. Typing @carrybot as plain text doesn't create a real mention.

This got the bot to actually process channel messages. But it still wasn't responding.

The Actual Fix: `visibleReplies`

After two hours, I found it. During the wizard's openclaw doctor step, it had auto-applied a config change:

"messages": {
  "groupChat": {
    "visibleReplies": "message_tool"
  }
}

This tells OpenClaw to use the message tool for posting replies in group chats / server channels. But the message tool wasn't available — I'd disabled memory-core and the tool policy didn't include it. So the bot would generate a perfect response, try to send it via a tool that doesn't exist, and silently fail.

The fix:

openclaw config set messages.groupChat.visibleReplies "automatic"
openclaw gateway restart

One config key. Two hours of debugging. Zero error messages in the logs.

7. The Working Config

Here's the final ~/.openclaw/openclaw.json model section that actually works:

{
  "models": {
    "providers": {
      "qwen-local": {
        "baseUrl": "http://localhost:8080/v1",
        "api": "openai-completions",
        "apiKey": "sk-none",
        "models": [{
          "id": "Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf",
          "contextWindow": 131072,
          "maxTokens": 81920,
          "reasoning": true
        }]
      }
    }
  }
}

And the critical non-obvious settings:

{
  "messages": {
    "groupChat": {
      "visibleReplies": "automatic"
    }
  },
  "plugins": {
    "entries": {
      "memory-core": { "enabled": false }
    }
  },
  "agents": {
    "defaults": {
      "compaction": {
        "reserveTokensFloor": 40000
      }
    }
  }
}

8. Making It Stick

Install the systemd service so the gateway survives reboots:

openclaw gateway install

Set yourself as the command owner so you can run privileged commands:

openclaw config set commands.ownerAllowFrom '["discord:YOUR_DISCORD_USER_ID"]'

Verify everything:

openclaw --version          # confirm CLI
openclaw doctor             # check for config issues
openclaw gateway status     # verify gateway is running

What I Learned

The wizard's defaults are for cloud providers, not local LLMs. contextWindow: 4000 is a safe default for API providers that charge per token. It's a crippling default for a local model with 131K context. If you're running a Custom Provider, you must manually set contextWindow and maxTokens to match your server's actual limits.

visibleReplies: "message_tool" is a trap. The doctor command auto-applies this "recommended" setting, but it depends on the message tool being available. If you're running a stripped-down config without all the default tools, your bot will silently swallow every group chat reply. The symptom is perfect — the bot reacts, types, generates a response (you can verify in the session files), and then just... doesn't post it. No error. No log line. Nothing.

Discord bot setup has more steps than it should. Between the Developer Portal, the OAuth2 scopes, the Privileged Gateway Intents, the server creation, the role permissions, and the correct mention format — there are at least six places where a single missed toggle produces a silent failure. Document every step. Check every toggle.

Session files are your debugging lifeline. When the logs show nothing, check ~/.openclaw/agents/main/sessions/*.jsonl. The session file showed me the bot was generating perfect responses that were never delivered. Without that, I would have assumed the LLM was broken.

Start with DMs, graduate to channels. DMs have a simpler code path — no mention detection, no group chat reply policy, no channel permissions. Get DMs working first, then debug channels as a separate problem.

Files Changed

On the workstation:

~/.openclaw/openclaw.json — model config, context window, reply policy, plugin settings, owner config

Discord:

Created Discord application "OpenClaw" with bot user "carrybot"
Created Discord server "HomeLabOpenClaw"
Enabled Message Content Intent, configured role permissions

Systemd:

openclaw-gateway.service — installed via openclaw gateway install

What's Next

The bot works, but it's running Qwen3.5-35B with memory-core disabled and no skills beyond the basics. Next steps:

Re-enable memory. Figure out why memory_search hangs with Qwen3.5's tool call format and fix it — memory is one of OpenClaw's killer features.
Add skills. 43 skills were blocked by missing requirements. Install the useful ones — session-logs, nano-pdf, video-frames.
Try a different local model. Qwen3.5 works but its tool calling may not be fully compatible with OpenClaw's expected format. Worth testing Gemma 4 or another model with native tool support.
Wire up Tailscale access. The gateway listens on localhost:18789. Exposing it on the tailnet means I can hit the dashboard from any device without a Cloudflare tunnel.

By the Numbers

1 curl command to install OpenClaw
131,072 tokens — the context window the wizard set to 4,000
81,920 tokens — max output, matching llama-server's -n flag
2 hours debugging silent channel failures
1 config key (visibleReplies: "automatic") that fixed everything
6 Discord setup steps where a missed toggle means silent failure
0 cloud dependencies — fully local LLM, self-hosted gateway
~500 MB RAM footprint for the OpenClaw gateway (Node.js process)
18 screenshots taken during the debug session
4 sensitive screenshots deleted (contained tokens/credentials)
0 useful error messages for the visibleReplies bug

Thursday Thoughts: The Models We Can't Run

Rob — Thu, 14 May 2026 15:59:32 +0000

Every week or two, a model drops that makes the local AI community lose its collective mind. This week it was three at once: DeepSeek V4-Pro, DeepSeek V4-Flash, and Zyphra ZAYA1-8B. All three are genuinely impressive. All three are models I wanted to benchmark on our homelab. And after doing the research, I'm not testing any of them.

Not because I don't want to. Because I physically can't — or can't yet.

This post isn't a benchmark. It's the research that happens before the benchmark, where you figure out which models are even candidates for your hardware. If you're building or considering a local inference setup, the reasons these three models don't work are more instructive than any leaderboard score.

The Rig

Quick refresher on what we're working with:

Resource	Spec
GPU	NVIDIA RTX 5090 — 32 GB VRAM
RAM	64 GB DDR5
CPU	AMD Ryzen 9 9950X3D — 16 cores / 32 threads
Disk	1.8 TB NVMe
Inference	llama.cpp on the GPU

This is a strong homelab by any measure. We run Qwen 3.5 35B-A3B daily for agentic coding at 200+ tok/s. In previous benchmark rounds, Devstral, Codestral, Gemma 4, and DeepSeek R1 14B have all run comfortably. The 5090 is the sweet spot for 20B–35B models.

But the new generation of models isn't playing in the 20B–35B range anymore.

DeepSeek V4-Pro: Too Big for Anything Short of a Data Center

V4-Pro is DeepSeek's new flagship. The numbers are staggering:

Spec	Value
Total parameters	1.6 trillion
Activated per token	49B (MoE, 256 experts, top-6 routing)
Model weights (FP4+FP8 mixed)	805 GB on disk
Context window	1M tokens

That 805 GB number is the wall. Our entire system — 32 GB VRAM plus 64 GB RAM — gives us 96 GB of addressable memory. The model is 8.4x larger than our total memory. There are no GGUF quants available, and nobody is making them because there's no consumer hardware that could run them meaningfully.

For context, we tried running Kimi K2.6 (a similarly-sized 1T MoE model) a few weeks ago. It "ran" at less than 1 token per second — the weights spilled out of VRAM into RAM, and we hit the DDR5 memory bandwidth ceiling (~80 GB/s vs the 5090's ~1.8 TB/s). V4-Pro at 1.6T would be even slower.

Verdict: Cloud API only. DeepSeek serves it at api.deepseek.com and we've added it to our benchmark rig as a cloud provider alongside Anthropic.

DeepSeek V4-Flash: Close, But Not Close Enough

V4-Flash is V4-Pro's smaller sibling and the one I was actually hopeful about:

Spec	Value
Total parameters	284B
Activated per token	13B (MoE, 256 experts, top-6 routing)
Smallest GGUF quant (Q2_K)	96.2 GB
Most popular quant (Q4_K_M)	160.2 GB
Context window	1M tokens

Only 13B activated per token sounds incredible — that's smaller than our DeepSeek R1 14B. But MoE models need all their expert weights resident in memory even though only a fraction fires per token. That 284B of total parameters has to be somewhere accessible.

The math doesn't work:

Quant	Size	Fits in VRAM + RAM (96 GB)?
Q2_K	96.2 GB	Barely — 0.2 GB over before KV cache
Q3_K_M	126.2 GB	No — needs 30 GB disk offload
Q4_K_M	160.2 GB	No — needs 64 GB disk offload
FP4-FP8 native	145.4 GB	No — needs 49 GB disk offload

There were IQ1_S (54 GB) and IQ2_M (87 GB) quants that would have fit — but the community removed them. When quant maintainers pull their own files, that's a strong signal the output quality was garbage.

And even if one of these squeaked into memory, there's a bigger problem: llama.cpp doesn't support the DeepSeek V4 architecture yet. All existing GGUFs require custom forks. The mainline support PRs are still open and under active debate. You'd be building from an untested branch to run a model that barely fits.

Verdict: Not ready. We've added V4-Flash to the benchmark as a cloud API model for now. When llama.cpp merges V4 support and a viable sub-90 GB quant exists, we'll revisit.

ZAYA1-8B: The Right Size, the Wrong Stack

This is the one that hurts the most, because on paper it's a perfect homelab model:

Spec	Value
Total parameters	8.4B
Activated per token	760M (MoE, 16 experts, top-1 routing)
VRAM at bf16	~17 GB
Context window	128K tokens
AIME '26 score	89.1

8.4 billion parameters. 17 GB in bf16. Fits trivially on the 5090 with room to spare. Punches absurdly above its weight on reasoning benchmarks — 89.1 on AIME '26 is competitive with models 10–15x its size.

So what's the problem? Architecture.

ZAYA1 uses CCA (Cross-Channel Attention) — Zyphra's novel hybrid of Mamba-style recurrence and traditional attention. It's not standard Mamba2. It's not standard transformer attention. It's a fundamentally new layer type with small 1D convolutions, custom Q/K projections, and learned residual scaling.

llama.cpp has no support for this architecture. There's an open feature request with nothing but +1 comments. No GGUF quants exist because there's nothing to run them on. Even Zyphra's older Zamba2 architecture (#21412) remains unimplemented.

The only way to run ZAYA1 today is through Zyphra's custom vLLM fork — a completely different serving stack from our llama.cpp setup. It would work on the 5090, but it means standing up and maintaining a parallel inference pipeline.

Verdict: On the to-do list. When llama.cpp adds CCA support or we carve out time to set up vLLM as a second serving backend, this is the first model we'll test.

What Actually Runs on a 32 GB GPU

Here's the uncomfortable reality of local inference in mid-2026: the models generating the most hype are the ones you can't run.

The models that fly on a 32 GB card — where you get 100+ tok/s and useful agentic performance — are capped at roughly 24–28 GB of weights (leaving room for KV cache). That means:

Category	What Fits
Dense models	Up to ~14B at Q8, ~20B at Q6, ~27B at Q4
MoE models	Up to ~35B total at Q4 (e.g. Qwen 3.5 35B-A3B)
What doesn't	Anything over ~28 GB of quantized weights

Our current daily driver — Qwen 3.5 35B-A3B at Q4_K_XL — is 22 GB of weights with 3B activated per token, running at 200+ tok/s. It's fast, it's good, and it's approximately the ceiling of what a single 5090 can do at interactive speeds.

The Three Walls

Each of these models hits a different wall, and that's what makes this exercise useful:

V4-Pro — pure size. 805 GB of weights. No amount of quantization or clever offloading helps when the model is 8x your total memory.
V4-Flash — the quantization gap. The model almost fits at extreme compression, but the quality degrades too far. We're in a window where the model exists but the tooling hasn't caught up to make it practical on consumer hardware.
ZAYA1 — architecture support. The model fits perfectly. The hardware is more than enough. But the inference engine doesn't speak the language yet.

If you're evaluating models for a homelab or edge deployment, these are the three questions to ask before you even think about benchmarks: Is it small enough? Is the quantization viable? Does my inference stack support it?

By the Numbers

805 GB — DeepSeek V4-Pro model weight size. 8.4x our total system memory.
96.2 GB — smallest V4-Flash GGUF quant. Still 0.2 GB over our VRAM + RAM.
17 GB — ZAYA1-8B at bf16. Fits trivially, runs nowhere (yet).
22 GB — our actual daily driver (Qwen 3.5 35B-A3B at Q4_K_XL). The real ceiling.
0 — number of these three models with merged llama.cpp support.
2 — models we added to the benchmark as cloud API endpoints instead (V4-Flash, V4-Pro).

Model Showdown Round 4: Opus vs Qwen — Writers, Not Coders

Rob — Mon, 11 May 2026 15:17:29 +0000

Two models. Same prompt. Same five fodder files. Same 27 published posts to check for redundancy. Same writing style guide.

One chose the Dev.to syndication saga. The other chose the tag taxonomy overhaul. There was zero overlap in fodder selection, topic, or angle.

This is the story of what happened — and what the differences reveal about how models approach the same creative task.

The Setup

I've been running this blog with AI agents as the primary writing tool since day one. Every post on vibescoder.dev was drafted by Claude Opus 4.6 through Coder Agents — until now. I wanted to see what would happen if I gave a different model the same editorial task.

The prompt was identical for both sessions:

Let's look at all of our fodder files and see if there is a themed post we can do. Either a standalone post or one that threads a few fodders together. Review all published and unpublished posts for style and content redundancy. Propose a draft when you're ready.

Model A: Claude Opus 4.6 (cloud, via Coder Agents)

Model B: Qwen 3.5 35B-A3B (local, llama.cpp on the RTX 5090, via Coder Agents)

Both had access to the same skill files, the same repos, the same tools. Neither knew the other was running.

What They Chose

For context, I use a "fodder file" workflow. Agents summarize sessions as we complete them. There is a SKILL file that defines the standard format for this. Periodically, we turn fodder files into drafts. Some are 1:1 and become complete posts. Others get rolled up into a thematic post.

Five unclaimed fodder files were available:

Fodder	Opus 4.6	Qwen 3.5
Dev.to syndication (May 7)	Selected	Passed
Filtering/taxonomy overhaul (May 1)	Passed	Selected
Qwen daily driver + skills (May 4)	Passed	Passed
Scheduled publish bug (May 3)	Passed	Passed
External auth multi-user (May 3)	Correctly identified as already claimed	Correctly identified as already claimed

Both correctly identified that blog-fodder-external-auth-multi-user-may-3.md was already sourced by an existing draft. Both passed on the scheduled publish bug — Opus explicitly flagged it as too small for a standalone post; Qwen simply didn't rank it.

The Qwen daily driver fodder is more interesting. Opus passed on it without comment. Qwen actually ranked it second in its proposals file and planned to draft it "next week" after Round 3 publishes. It wasn't dismissed — it was deferred.

The interesting part is what they reached for.

[Human editor's note: I asked Opus to analyze and write this post from its perspective. What follows below is unedited. The first person "I" from here on is Opus.]

Opus Chose the War Story

I picked the Dev.to syndication fodder and wrote The API That Wouldn't Say No. The angle: a four-hour debugging session against an API that silently swallows your data without returning an error. Six failed attempts, one root cause, 443 lines of dead code cleaned up.

Why I chose it:

Complete narrative arc with a clear villain (the silent published_at failure)
Zero overlap with existing posts (Day Four covered the initial Dev.to setup, not the bulk syndication or the debugging saga)
Universally useful technical content — anyone integrating with the Dev.to API will hit this
The Vercel Hobby plan timeout as an architectural constraint is a story within a story

The post is 153 lines. One code block. Eight "By the Numbers" bullets. The structure follows the blog's standard pattern: setup → build → disaster → fix → cleanup → lessons → stats.

Qwen Chose the Data Story

Qwen picked the filtering/taxonomy fodder and wrote "From Chaos to Signal: How We Fixed Our Blog's Tag System." The angle: shipping a filter bar that barely worked, discovering through a data audit that 94% of posts shared the same tags, then replacing freeform folksonomy with controlled taxonomy.

Why Qwen chose it:
Qwen wrote a separate proposals file (post-draft-proposals-2026-05-09.md) before drafting — a planning step Opus skipped entirely. It ranked three standalone posts: taxonomy first, Qwen daily driver second, syndication third. Its stated reasoning for the taxonomy pick: "strong metrics-driven how-to" that was "flagged in TODO as high priority." It declared "No content redundancy detected" without deep-checking gotcha-level overlaps against published posts.

The instinct was right — the taxonomy story is strong:

Concrete before/after metrics with the tag saturation table as proof
A conceptual thesis — folksonomy vs. taxonomy — that elevates it beyond a feature changelog
The V1 → V2 iteration arc is satisfying: ship, measure, realize the data is wrong, redesign
Clean origin story for the type field that now appears in every post's frontmatter but has never been explained

The post is 243 lines. Two tables, two code blocks, four numbered gotchas. Heavier on architectural detail and lighter on narrative tension.

The Instinct Gap

Here's what I think the divergence reveals:

Opus gravitates toward narrative tension. I looked at five fodder files and picked the one with a villain. The published_at silent failure is a four-hour mystery with a one-line resolution — that's a story structure. The post has a rising action (six failed attempts), a climax (isolating the field), and a denouement (the cleanup). The technical content is the vehicle, but the engine is "here's what went wrong and why it took so long to figure out."

Qwen gravitates toward systematic explanation. It looked at the same five files and picked the one with the cleanest data. The tag saturation table is the centerpiece — hard numbers that prove the V1 filter was broken. The post walks through every architectural decision, every file changed, every gotcha encountered. The structure is taxonomic (ironically), not dramatic.

Neither instinct is wrong. They produce different kinds of posts for different kinds of readers.

Quality Assessment

I read both drafts against the blog's established conventions — 27 published posts, the style guide in settings.json, the skill files that define structure and voice. Here's how they stack up.

Voice and Tone

Opus: Matches the blog's existing voice closely. First person, direct, dry. "31 seconds × 11 posts = ~5.5 minutes of wall time. The 'Stop' button went from nice-to-have to essential." That's the rhythm of the published posts — setup, punchline, move on.

Qwen: Close but slightly off. The opening is strong — "Click [ai] and three posts disappeared. That's not filtering — it's a rounding error" is a great line. But the prose occasionally shifts into explainer mode: "Tags are folksonomy — freeform, inconsistent, grow unbounded. Content type is taxonomy — controlled vocabulary, exactly 2 values..." That's accurate, but it reads more like documentation than a blog post. The existing posts teach by showing, not by defining.

Structural Conventions

This is where the gap widens.

Convention	Opus	Qwen
H1 title in body	No (correct)	Yes — only post on the entire blog to repeat the title as an `# H1`
`## What I Learned`	Present	Missing
`## By the Numbers` position	Last section (correct)	Before "What's Next" (reversed)
`---` horizontal rules	Sparse — one before closing sections	Between every major section (7 total)
Tags format	Inline `[array]`	YAML list
New tags introduced	0	3 (`content-design`, `tagging`, `data-audit`)

The H1 is the most visible miss. Every published post on vibescoder.dev renders its title from frontmatter — the body starts with prose or an ## H2. Qwen added a redundant # From Chaos to Signal: How We Fixed Our Blog's Tag System at line 20 that would render as a duplicate title on the live site.

The missing "What I Learned" section matters too. It's not universal — some posts skip it — but for a 243-line how-to post with four gotchas and a conceptual thesis about folksonomy vs. taxonomy, the absence of a distilled lesson section leaves the ending flat. The post goes from "Gotchas" straight to "By the Numbers" to "What's Next," which reads like the analytical work is done but the editorial work isn't.

The excessive horizontal rules are a style preference, but they break the visual flow in a way that no published post does. The blog uses --- sparingly — to separate the narrative from the closing sections, not between every ## H2.

Tag Discipline

This one is ironic. Qwen wrote a post about cleaning up tag proliferation — then introduced three brand-new tags (content-design, tagging, data-audit) that don't appear on any other post. The blog just went from 16 unique tags to 19. By the post's own logic, those are tags with a single-post frequency — the exact pattern the taxonomy cleanup was trying to eliminate.

Opus used three existing tags (agents, next-js, devops) — all already in the blog's vocabulary.

Content Originality

Opus: The Dev.to syndication story builds on Day Four (which covered the initial setup) but covers entirely new ground — bulk architecture, published_at debugging, rate limits, cleanup. The "silent failures" lesson echoes a theme from "Invisible Failures" and "The Agent Was Flying Blind," using nearly identical phrasing. A small deduction for not differentiating the framing more, but the technical content is unique.

Qwen: The tag taxonomy story has almost zero overlap with existing posts. The FilterBar.tsx component appears in "Friday Fixes: Mobile First" but only for CSS spacing fixes — Qwen covers the conceptual redesign. The type field origin story fills a genuine gap in the blog's narrative. Stronger originality score.

Gotcha #2: The Self-Referential Overlap

Qwen's second gotcha — "published: true in body text" matching a grep — describes the exact same class of bug that the scheduled-publish-bug fodder (May 3) covers, and that "Friday Fixes: The Agent Was Flying Blind" already documented. Three separate instances of "grep matched prose instead of frontmatter" across the blog. Qwen didn't flag this overlap.

The Scorecard

Dimension	Opus ("The API That Wouldn't Say No")	Qwen ("From Chaos to Signal")
Fodder selection	Strong — complete arc, clear villain	Strong — data-driven, fills a gap
Voice match	High	Moderate — occasionally shifts to explainer mode
Structural conventions	Correct — follows blog patterns	Several misses — H1, missing section, reversed order, excessive rules
Tag discipline	Clean — 0 new tags	Ironic — 3 new tags in a post about tag cleanup
Content originality	Strong (minor lesson overlap)	Very strong (almost zero overlap)
Narrative quality	Higher — tension, pacing, resolution	Lower — thorough but flat ending
Technical depth	Moderate	Higher — more code, more architecture detail
Redundancy awareness	Caught the "already claimed" fodder, flagged thematic overlap in analysis	Caught the "already claimed" fodder, missed the gotcha #2 overlap

Both posts are publishable. Neither is a throwaway. But they'd need different levels of editing to meet the blog's bar.

The Edit

We published Qwen's post — From Chaos to Signal — but not before I rewrote it. The published version has the same bones: same topic, same data, same technical content. But the H1 is gone, the "What I Learned" section exists, the closing sections are in the right order, the horizontal rules are thinned out, and the gotcha about grep matching body text was cut (it's a redundant lesson — we've told that story before).

Qwen's original draft is embedded at the bottom of the published post in a collapsible block. Expand it and you can read both versions side by side. The differences are instructive — not because one is right and one is wrong, but because they show exactly where editorial polish lives: in the negative space. What to cut, what to reorder, what to leave unsaid.

What This Actually Means

This wasn't a benchmark. There's no winner. The point is what the experiment reveals about using different models for the same editorial task.

Models have aesthetic preferences. Given the same raw material, Opus reached for drama and Qwen reached for data. Both are valid editorial choices, but they produce posts with different energy. If you're building a content pipeline with AI, the model you choose shapes the voice — not just the quality.

Style conventions need enforcement, not inference. Qwen had access to the same skill files and the same 27 published posts as examples. It still introduced an H1 heading that no other post uses, reversed the closing section order, and added horizontal rules at a frequency the blog has never used. The skill file says "end with 'By the Numbers' bullet list" but doesn't say "don't put a section after it." Negative constraints — what not to do — are harder for models to infer from examples alone.

Redundancy detection is incomplete in both. Opus flagged the "already claimed" fodder and noted thematic overlap with the "silent failures" posts but still used nearly identical lesson phrasing. Qwen flagged the "already claimed" fodder but missed that its gotcha #2 describes a bug pattern already covered in two published posts. Neither model did a deep-enough content diff to catch everything.

Planning styles diverge. Qwen wrote a structured proposals document ranking three candidates before committing to a draft. Opus jumped straight from analysis to prose — no intermediate planning artifact. Qwen's approach is arguably more disciplined, but the proposals file contained a blanket "No content redundancy detected" claim that the draft then contradicted by including an overlapping gotcha. Planning artifacts only help if the analysis behind them is thorough.

Local models close the gap on analysis but not on editorial polish. Qwen's fodder review, redundancy check, and content selection were solid. The analytical work — reading 27 posts, cross-referencing sources, identifying unclaimed fodder — was on par with Opus. Where it fell short was the last mile: the structural conventions, the voice matching, the irony of its own tag choices. That's the gap between understanding the content and inhabiting the style.

Both models handled adversity. Qwen hit a git push conflict mid-session — another session had pushed the bakeoff fodder files while Qwen was working — and resolved it cleanly with git pull --rebase. Opus didn't encounter merge conflicts but navigated YAML escaping issues (an apostrophe in the title broke the frontmatter parser) and nested code fence conflicts in the CollapsibleCode component. Neither model stalled on infrastructure problems.

By the Numbers

2 models given the same prompt in parallel sessions
5 fodder files available — each model selected a different one
0 overlap in fodder selection, topic, or angle
1 proposals file written by Qwen before drafting — a planning step Opus skipped
153 lines in the Opus draft vs. 243 lines in the Qwen draft
0 new tags introduced by Opus vs. 3 new tags by Qwen
1 H1 heading that shouldn't exist (Qwen's only)
1 missing section ("What I Learned") in the Qwen draft
1 git merge conflict encountered and resolved by Qwen mid-session
27 published posts both models reviewed for redundancy — neither caught everything

Model Showdown Round 3: Ditching Ollama in Favor of llama.cpp

Rob — Sun, 10 May 2026 15:25:35 +0000

In Round 1, we ran five local models and two cloud models through a single coding task. The local models held their own. In Round 2, we added Gemma 4 and Kimi K2, fixed our scoring methodology, and watched Gemma climb to the top.

But something kept nagging at us.

All our benchmarks were running through Ollama — a great tool for getting started, but essentially a wrapper around llama.cpp with its own opinions about quantization, context management, and memory allocation. We were benchmarking Ollama's choices as much as the models themselves.

So we did something drastic: we ripped out Ollama entirely and went straight to llama.cpp. Then we built a proper 12-task automated benchmark suite and ran all five models through it.

The results changed everything. Spoiler: Qwen 3.5 swept all three categories — best for coding, best for agentic tasks, best single model — and it did it at 206 tokens per second. Read on to find out how.

Why llama.cpp Over Ollama?

Ollama is fantastic for ollama pull model && ollama run model. It's genuinely the best way to get started with local models. But when you're running them as infrastructure — serving through an OpenAI-compatible API to Coder Agents, IDE extensions, and automation — the abstraction layer starts to chafe.

To be fair: Ollama can do most of what llama.cpp does. You can import custom GGUFs via Modelfiles. You can set context windows with PARAMETER num_ctx or the OLLAMA_CONTEXT_LENGTH env var. You can enable flash attention via OLLAMA_FLASH_ATTENTION and KV cache quantization via OLLAMA_KV_CACHE_TYPE. It's more capable than people give it credit for.

So why switch? Three reasons:

Zero-abstraction control — llama-server exposes every hyper-parameter as a launch flag: batch sizes, continuous batching, thread allocation, reasoning budgets, chat template overrides. Ollama surfaces many of these through env vars and config, but the deep inference tuning knobs aren't all available. When we needed --reasoning-budget 8192 and --chat-template chatml to make Coder Agents work, we needed the flags.
Bleeding-edge model support — Ollama wraps llama.cpp, so it inherently lags behind it. When a new model architecture drops, llama.cpp supports it on day one. Ollama might take a week or two to update its downstream runner. For models like Qwen 3.5 and Gemma 4, we didn't want to wait.
Fewer moving parts — For a headless server running one model at a time behind systemd, a compiled llama-server binary pointing at a GGUF on disk is the simplest possible deployment. No daemon, no internal model registry, no API translation layer.

Could we have tuned Ollama to get similar results? Probably close. But we'd have been fighting the abstraction at every turn instead of just setting the flags we wanted. The migration freed up ~44 GB of disk (Ollama's blob store) and gave us the direct control we needed.

The Hardware

Same beast from Rounds 1 and 2, now running leaner:

Component	Spec
GPU	NVIDIA RTX 5090, 32 GB GDDR7
CPU	AMD Ryzen 9 9950X3D, 16 cores
RAM	64 GB DDR5-6000
Storage	Samsung 9100 Pro 2 TB NVMe
OS	Ubuntu 24.04, NVIDIA driver 590.48.01
Inference	llama.cpp (built with CUDA arch 89)

The Migration

Building llama.cpp

The RTX 5090 uses NVIDIA's Blackwell architecture (SM 120), but CUDA toolkit support for SM 120 was still landing when we built. The workaround: build with -DCMAKE_CUDA_ARCHITECTURES=89 for backward compatibility. It works — the compiler targets Ada Lovelace (SM 89) and the Blackwell GPU runs it with full performance.

cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=89 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

Downloading the Models

We grabbed GGUF files from HuggingFace using the hf CLI. Each model was hand-picked for quantization level — balancing quality against our 32 GB VRAM budget:

Model	Params	Active	Quant	Size
Qwen 3.5 35B-A3B	35B	3B	UD-Q4_K_XL	20.7 GB
Gemma 4 26B-A4B	26B	4B	Q4_K_M	16.9 GB
Devstral 24B	24B	24B	Q5_K_M	15.6 GB
Codestral 22B	22B	22B	Q5_K_M	14.6 GB
DeepSeek R1 14B	14B	14B	Q8_0	15.7 GB

The "Active" column matters. Qwen 3.5 and Gemma 4 are Mixture of Experts (MoE) models — they have 35B and 26B total parameters but only activate 3B and 4B respectively on each token. This means they fit comfortably in VRAM while punching well above their weight class.

Three models downloading sequentially. The Samsung 9100 Pro writes at 250+ MB/s — all five models landed in under 10 minutes.

The DNS Incident

Halfway through downloading, our DNS resolution failed. Parallel HuggingFace downloads apparently overwhelmed something in the DNS chain. The fix was unglamorous:

echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

DNS goes down, Google saves the day, and Devstral resumes downloading.

Setting Up the Server

Each model gets its own launch configuration. The key insight: --chat-template chatml is mandatory for Coder Agents compatibility.

Why? Qwen 3.5 and Devstral ship with embedded Jinja templates that enforce "system message must be at the beginning" — but Coder Agents sends messages in whatever order it pleases. The chatml template is permissive and all five models were trained on it, so quality is maintained.

Here's Qwen's config as an example — the most tuned of the five:

~/llama.cpp/build/bin/llama-server \
  --model ~/models/qwen3.5/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --port 8080 \
  --ctx-size 131072 \
  -n 81920 \
  --reasoning-budget 8192 \
  --reasoning-format deepseek \
  --flash-attn on \
  --chat-template chatml \
  --parallel 1 \
  -ngl 99

Notable flags:

--ctx-size 131072 — Qwen 3.5 supports 128K context. We give it the full window.
--reasoning-budget 8192 — Caps thinking tokens so the model doesn't burn the entire budget deliberating.
--flash-attn on — This build requires the explicit on value, not bare --flash-attn.
-ngl 99 — Offload all layers to GPU.

Systemd Services

We set up two systemd services that survive reboot:

llama-embed.service — Runs nomic-embed-text permanently on port 8084 (~300 MB VRAM). Always on, coexists with any generation model.
llama-generate.service — Runs the active generation model on port 8080. Reads from /etc/llama-generate.conf for model selection.

A helper script, llm-switch.sh, makes model swapping painless:

~/bin/llm-switch.sh qwen      # Switch to Qwen 3.5
~/bin/llm-switch.sh devstral  # Switch to Devstral
~/bin/llm-switch.sh status    # Show current model

It updates the config and restarts the service. Model swap takes about 3 seconds.

The Benchmark

Rounds 1 and 2 used a single task: "build a CLI todo app." That was fine for comparing code generation, but it told us nothing about reasoning, instruction following, or multi-file agentic work.

Round 3 uses 12 tasks across 5 categories:

Category 1: Single-File Code Generation

The legacy benchmark, maintained for continuity with prior rounds.

Task	Prompt	Scoring
1.1 Todo App	Python CLI todo app with SQLite, argparse, CRUD	10 features + 7 functional tests
1.2 URL Shortener	FastAPI with SQLite, rate limiting, validation	8 features (server-based functional)
1.3 LRU Cache	TypeScript with O(1) ops + test suite	6 features + assertion tests

Category 2: Multi-File Agentic Coding

Can the model work across files and understand project structure?

Task	Prompt	Scoring
2.1 Bug Fix	Express.js app with planted auth header mismatch	Found bug? Minimal fix? Explanation quality?
2.2 Pagination	Add pagination to a Flask REST API + update tests	5 features checklist

Category 3: Reasoning & Problem Solving

No code — just thinking.

Task	Prompt	Scoring
3.1 Debug Log	Diagnose connection pool exhaustion from error log	7-item rubric, 10 points
3.2 Architecture	CRDT vs OT for collaborative editor	5-item rubric, 10 points
3.3 Bayes	Server error probability, show work	Correct answer + methodology, 5 points

Category 4: Tool Use & Instruction Following

Can the model follow structured instructions precisely?

Task	Prompt	Scoring
4.1 Structured Output	Generate 5 JSON records matching a schema	Valid JSON, correct types, no extra text
4.2 Tool Sequencing	Plan a read → ping → write tool chain	Correct tools, correct order, no hallucination

Category 5: Speed Microbenchmarks

Three prompts at different output lengths, 3 runs each, median reported.

Task	Target Length
5.1 Short	~128 tokens (IPv4 validator)
5.2 Medium	~512 tokens (BST implementation)
5.3 Long	~2048 tokens (Markdown-to-HTML converter)

Scoring

Coding composite: (features/max × 60) + (functional/max × 40). Syntax invalid = score × 2/3.

Overall weighting: Coding 40%, Reasoning 20%, Tool Use 20%, Speed 20%.

Sampling Parameters

Each model uses its vendor-recommended settings:

Model	Temperature	Top-P	Rationale
Qwen 3.5	0.6	0.95	Qwen team recommendation for reasoning
DeepSeek R1	0.6	0.95	DeepSeek recommendation
Devstral	0.0	1.0	Deterministic
Codestral	0.2	1.0	Mistral recommendation
Gemma 4	0.0	1.0	Deterministic

Speed benchmarks use temperature=0.0 across all models for reproducibility.

The Results

Speed: MoE Models Are in a Different League

Model	Short Tok/s	Med Tok/s	Long Tok/s	Short TTFT	Med TTFT	Long TTFT
Qwen 3.5	206.7	206.3	204.6	30.9ms	33.8ms	15.1ms
Gemma 4	180.2	179.4	177.7	22.9ms	24.6ms	15.6ms
Codestral	80.1	78.9	78.5	12.8ms	14.9ms	14.0ms
Devstral	78.6	77.6	77.3	12.8ms	14.5ms	13.3ms
DeepSeek R1	77.6	77.3	75.9	13.9ms	13.9ms	14.4ms

The two MoE models — Qwen 3.5 and Gemma 4 — are 2.6x faster than the dense models. This isn't surprising: when you're only running 3-4B parameters per token instead of 14-24B, the math unit has less work to do. But 206 tok/s on a local model is wild. That's faster than many cloud API responses when you factor in network latency.

The dense models (Devstral, Codestral, DeepSeek R1) cluster tightly at 77-80 tok/s. They're all VRAM-resident and GPU-bound at similar parameter counts.

TTFT tells the opposite story. The dense models start responding in 12-15ms. The MoE models take 22-34ms — still fast, but the routing overhead is visible. For interactive use, none of this matters. For batch processing, the MoE throughput advantage dominates.

Coding: Two Perfect Scores on the Legacy Task

Model	Todo (100)	URL Short (60)	LRU Cache (60)	Coding Avg
Qwen 3.5	100.0	60.0	60.0	73.3
Gemma 4	100.0	60.0	60.0	73.3
Devstral	94.0	60.0	60.0	71.3
Codestral	94.0	52.5	60.0	68.8
DeepSeek R1	60.0	60.0	60.0	60.0

Qwen and Gemma both scored 100 on the todo app — 10/10 features, 7/7 functional tests, valid syntax. This is the first time any model has achieved a perfect score on this task across all three rounds. Qwen produced a 192-line solution with full argparse subcommands; Gemma did it in a leaner 132 lines.

Devstral and Codestral both scored 94 — missing one feature each (pretty output formatting) but nailing all 7 functional tests. Solid.

DeepSeek R1 scored 60 across the board. It gets all features right and syntax is always valid, but its functional tests fail. Why? DeepSeek is a reasoning model — it spends significant tokens thinking before generating code. For the todo app, it produced correct code that used interactive input instead of argparse, failing our automated CLI tests. The code works fine if you run it manually. This is the tension with reasoning models: they're thinking about the problem deeply but sometimes overthink the interface.

Reasoning: Gemma's Quiet Dominance

Model	Debug Log (10)	Architecture (10)	Bayes (5)	Reasoning Avg
Gemma 4	10	10	3	8.7
Devstral	9	10	3	8.3
Qwen 3.5	8	10	3	8.0
DeepSeek R1	10	8	3	8.0
Codestral	5	8	3	6.3

Gemma 4 and DeepSeek R1 both scored 10/10 on the debug log task — correctly identifying connection pool exhaustion, the long-running transaction, the unbounded query, row-by-row processing, and proposing fixes for all three. Every other model missed at least one item.

Every model scored exactly 3/5 on Bayes theorem. They all correctly applied Bayes' formula and showed their work, but none nailed the final answer precisely enough for the regex matcher. This is a scoring limitation we'll improve in future rounds — the math was correct, the presentation just didn't match our expected format.

Codestral was weakest on reasoning at 6.3 average. It's a code-specialized model — reasoning about system architecture isn't its wheelhouse.

Tool Use: Instruction Following Separates the Field

Model	Structured Output (5)	Tool Sequencing (5)	Tool Use Avg
Qwen 3.5	5	5	5.0
DeepSeek R1	5	5	5.0
Devstral	4	5	4.5
Codestral	4	5	4.5
Gemma 4	5	2	3.5

Qwen and DeepSeek both achieved perfect 5/5 on both tool use tasks. They generated valid JSON matching the schema exactly, and planned the correct tool call sequence in the right order.

Gemma 4's weakness showed here — it only scored 2/5 on tool sequencing. Instead of outputting the full planned sequence, it emitted only the first tool call (read_file) and explained that it would need to see the result before planning the next step. That's arguably more "correct" agentic behavior (you shouldn't plan all steps before seeing intermediate results), but it's not what the task asked for. This is exactly the kind of instruction-following gap that matters in Coder Agents, where you need the model to do what you asked, not what it thinks is philosophically better.

The Leaderboard

Rank	Model	Coding	Reasoning	Tools	Speed	Weighted Total
🥇	Qwen 3.5 35B-A3B	73.3	80.0	100.0	100.0	85.3
🥈	Gemma 4 26B-A4B	73.3	86.7	70.0	87.0	78.1
🥉	Devstral 24B	71.3	83.3	90.0	37.8	70.7
4	DeepSeek R1 14B	60.0	80.0	100.0	37.3	67.5
5	Codestral 22B	68.8	63.3	90.0	38.5	65.9

Weighting: Coding 40%, Reasoning 20%, Tool Use 20%, Speed 20%.

The Winners

🏆 Best for Coding: Qwen 3.5 (73.3)

Tied with Gemma 4 on the composite score, but Qwen edges ahead on wall-clock time. Its todo app completed in 7.6 seconds at 206 tok/s. Gemma took 12.4 seconds at 179 tok/s. Same quality, faster delivery.

🏆 Best for General Agentic: Qwen 3.5 (90.0)

Perfect tool use (100) combined with strong reasoning (80.0) gives Qwen the highest combined agentic score. This matters for Coder Agents where the model needs to follow instructions precisely and reason about multi-step tasks.

🏆 Best Single Model: Qwen 3.5 (85.3)

When you can only run one model, Qwen 3.5 is the answer. It leads or ties in every category except reasoning (where Gemma edges it 86.7 to 80.0), and its speed advantage is enormous — 2.6x faster than the next non-MoE model.

The gap between #1 and #2 is 7.2 points. Between #2 and #5 it's only 12.2. The field is tight on quality, but Qwen's speed makes it the clear overall winner.

The Journey to Fair Scoring

One thing we didn't expect: the first two runs of this benchmark were wrong.

Our initial results had Devstral winning everything. But when we dug into the raw responses, we found three systemic scoring bugs:

Unclosed thinking tokens — When Qwen hit the token limit mid-thought, its <think> block never closed. Our regex required a closing </think> tag to strip it. The entire thinking trace leaked into the code extraction, pulling out planning snippets instead of actual code.
Empty content fallback — Gemma 4 routed all output through reasoning_content instead of content (a side effect of --reasoning-format deepseek). Our scorer only looked at content, so Gemma scored zero on tasks where it actually produced correct output.
Argparse quoting — Our test harness passed add Buy milk as three separate arguments. Models using argparse (correctly) expected add "Buy milk" — one command, one string. The test was wrong, not the code.

We fixed all three, doubled the token budget for reasoning models, and re-ran everything. The corrected scores tell a very different story.

The lesson: automated benchmarks are only as good as their scoring logic. Always inspect the raw responses before trusting the numbers.

What We Learned

1. MoE is the architecture to bet on for local inference. Qwen 3.5 (3B active) and Gemma 4 (4B active) both outperform dense 22-24B models while running 2.6x faster. The quality-to-speed ratio isn't even close.

2. llama.cpp gives you control that matters. Ollama can do a lot more than people think, but when you need --reasoning-budget, --chat-template chatml, or bleeding-edge model support on day one, the direct server eliminates the abstraction tax.

3. Reasoning models need breathing room. Qwen, DeepSeek, and Gemma all burn 60-80% of their token budget on thinking. If you set max_tokens=4096, the model might spend 3,000 tokens thinking and only have 1,000 left for the actual answer. We doubled the budget for reasoning models and the scores jumped.

4. Tool use is the differentiator. Coding and reasoning scores were close across all five models. Tool use — following structured instructions precisely — is where the gap opened up. Qwen and DeepSeek scored 100; Gemma scored 70. For agentic workflows, this matters more than raw quality.

5. Your benchmark harness is part of the test. We spent more time debugging our scoring logic than any model issue. If you're benchmarking local models, inspect the raw outputs before trusting automated scores.

The benchmark suite ripping through Devstral's tasks. Consistent ~77 tok/s throughput — the dense models don't waver.

What's Next

Round 4: Max Aggression — Each model with its native chat template, optimized temperature per task type, and fine-tuned reasoning budgets. We benchmarked for Coder Agents compatibility this round; next round we'll find each model's ceiling.
Retesting Qwen 3.5 against the Cloud King, Claude - We'll test Opus 4.6 and 4.7 with the goal of figuring out our perfect hybrid setup.
Dailying Qwen 3.5 is now the default model on our homelab. llm-switch.sh qwen made it so.

By the Numbers

5 models benchmarked
12 tasks across 5 categories
~25 minutes total benchmark runtime on the RTX 5090
206.7 tok/s — Qwen 3.5's peak throughput (fastest local model we've tested)
100.0 — Qwen's todo app score (first perfect score in three rounds)
44 GB reclaimed by removing Ollama
3 seconds — model swap time with llm-switch.sh
3 scoring bugs found and fixed before we trusted the results
85.3 — Qwen 3.5's weighted overall score, 7.2 points clear of #2

Thursday Thought: Chat is the New Source Code

Rob — Fri, 08 May 2026 04:54:08 +0000

I just walked out of a customer meeting that completely shifted my perspective on the future of software development. What they told me sounds almost revolutionary, but it makes perfect sense when you think about it: chat is becoming the new source code.

The Paradigm Shift: From Code to Conversation

Here's what blew my mind. This customer explained that in their AI-agent-powered workflow, generating code has become the easy part. What's actually difficult—and incredibly valuable—is recreating the context, the intent, and the reasoning that led to that code.

Think about it: when you're working with an AI agent, the magic isn't just in the final output. It's in the entire conversation—the back-and-forth refinements, the clarifications, the "actually, let me change that" moments that shape the final solution.

Storing Chat History in GitHub: A Game Changer

This customer has started doing something fascinating: they store their chat histories directly in GitHub. Not just the code that results from those chats, but the entire conversational thread that led to it.

Why? Because they've discovered something profound:

They can fork chat conversations just like code branches
They can roll back to previous chat states
Most importantly, they can recreate any piece of code trivially from the chat history

It's like having a perfect record of not just what was built, but why it was built and how the thinking evolved.

Intent Over Implementation

This represents a fundamental shift in how we think about software development. We're moving from an implementation-first world to an intent-first world.

In traditional development:

Idea → Code → Version Control → Collaboration

In the new agent-assisted world:

Intent → Conversation → Code Generation → Chat History Storage

The code becomes ephemeral—easily regenerated. The conversation becomes permanent—the true source of truth.

The Future of Version Control

I predict we're going to see GitHub, GitLab, and other version control platforms rapidly evolve into something entirely different: extensible memory layers for agentic coding.

Instead of primarily tracking file changes, these platforms will become sophisticated conversation managers that can:

Branch conversations at any point in the dialogue
Merge different conversational threads when collaborating
Diff chat histories to see how approaches diverged
Replay conversations with different agents or parameters

What This Means for Developers

This shift has huge implications for how we work:

1. Documentation Becomes Native

The chat history is the documentation. No more outdated comments or README files—the reasoning is preserved in the conversation that created the code.

2. Collaboration Changes

Instead of reviewing pull requests, we might be reviewing conversation threads. "I see you took this approach in your chat with the agent, but what if we tried this angle instead?"

3. Debugging Gets Easier

When something breaks, you don't just look at the code—you look at the conversation that created it. The context and assumptions are right there.

The Big Picture

We're witnessing the emergence of conversational version control. Just as Git revolutionized how we think about code collaboration, chat-based development is about to revolutionize how we think about preserving and sharing intent.

The source code was never really the valuable part—it was always the human thinking behind it. AI agents are just making that distinction crystal clear.

What do you think? Are you ready for a world where your Git repos contain more conversations than code? Let me know in the comments—this feels like one of those moments where the industry is about to take a sharp turn, and I'm curious to hear how others are experiencing this shift.

Have you experimented with storing chat histories as part of your development workflow? I'd love to hear about your experiences and approaches.

Slaying the Gemma Beast: How We Fixed Local AI and Shipped Search

Rob — Fri, 08 May 2026 04:53:16 +0000

Two days ago, Gemma 4 couldn't finish a feature. Today it built one, pushed it to GitHub, and it's live on this site right now.

If you press ⌘K (or Ctrl+K) on any page of vibescoder.dev, you'll see a search modal. Gemma 4 built that — running locally on an RTX 5090, zero cloud API calls, zero dollars spent. Then Claude reviewed the code, fixed the rough edges, and merged the polish. The feature you're using is a collaboration between a local model and a cloud model, each doing what they're best at.

Here's how we got there.

Previously: The Agentic Gap

In our last experiment, we pitted Gemma 4 against Opus 4.6 on the same task: build public-facing search for this blog. Opus one-shot it — 698 lines across 6 files, committed and pushed in 8 minutes. Gemma planned brilliantly, then stopped. Eight prompts later: 3 partial files, 0 commits.

We called it "the agentic gap" — the difference between a model that writes great code and one that builds great features. But we also left a thread dangling: maybe Gemma wasn't refusing to code. Maybe it was running out of room.

The Diagnosis

Our deep dive into Gemma 4's local inference uncovered the root cause: invisible thinking tokens consume your generation budget.

Gemma 4 defaults to a reasoning mode where it generates chain-of-thought tokens before producing visible output. These thinking tokens are hidden — you never see them in the response — but they still count against num_predict. With Ollama's defaults, the model was blowing its entire token budget on reasoning, leaving nothing for actual code.

That's not a model failure. That's a configuration failure.

The fix on paper was straightforward: give the model a bigger budget. But getting there required switching the entire inference stack.

Switching from Ollama to llama.cpp

Ollama is great for pulling and running models. It's not great for fine-grained control. The specific controls we needed:

Control	Ollama	llama.cpp
Context window (`num_ctx`)	Modelfile only	`--ctx-size` flag
Output limit (`num_predict`)	API parameter	`-n` flag + API
Reasoning budget	Not available	`--reasoning-budget` flag
Tool calling	Basic	Grammar-constrained

The --reasoning-budget flag is the key. It caps how many tokens the model can spend on invisible chain-of-thought, forcing it to start producing real content after hitting the limit. Ollama has zero equivalent.

The switch itself was an adventure. We couldn't use Ollama's blob files directly — llama.cpp expects standard GGUF files, but Ollama stores models in a split format that standalone tools can't load. We pulled the full Gemma 4 26B-A4B GGUF from Hugging Face (unsloth/gemma-4-26B-A4B-it-GGUF, Q4_K_M quantization, 16.9 GB download) and launched llama-server with tuned settings:

~/llama.cpp/build/bin/llama-server \
  -m ~/models/gemma4-26b/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
  --ctx-size 32768 \
  -n 32768 \
  --reasoning-budget 4096 \
  --reasoning-format deepseek \
  --parallel 1 \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 999

Key settings:

--ctx-size 32768 — 32K context window. Fits comfortably at ~19 GB on the 5090.
-n 32768 — 32K max output tokens. Room for both reasoning and code.
--reasoning-budget 4096 — Cap invisible thinking at 4K tokens. The rest is for actual output.
--reasoning-format deepseek — Expose thinking tokens in the API response so we can see what's happening.
--parallel 1 — Single slot instead of default 4. Four slots × 32K context was causing OOM kills.

Then we pointed Coder at the new endpoint. The provider base URL switched from Ollama's localhost:11434 to llama-server's localhost:8080/v1/, and the model config got the full GGUF filename with 32K context and output limits.

Three Attempts to Slay the Beast

It didn't work on the first try.

Attempt 1: Gemma made tool calls — real progress compared to the original test — but hit a GitHub auth failure ($GITHUB_TOKEN wasn't set in the workspace) and stalled. The last output was raw token leakage: call:execute{command:<|">find... — special tokens leaking into the response, one of the known Gemma issues.

Attempt 2: We fixed the auth, added --reasoning-format deepseek, and restarted. Gemma got much further — wrote a search index generator, ran it, started exploring the codebase. Then llama-server got Killed — the OOM killer struck. Four parallel slots at 32K context each was too much VRAM.

Attempt 3: Reduced to --parallel 1, pre-cloned both repos in the workspace so Gemma didn't have to fight auth during exploration. This time it worked. Gemma laid out a clear implementation plan, and after one nudge — "keep going, don't stop, code and commit" — it executed the entire thing.

How Fast Was It?

In the Model Showdown Round 2, Gemma 4 clocked 167.1 tok/s on a short benchmark task via Ollama — the fastest perfect scorer. But a benchmark prompt and an agentic coding session are different workloads. How does Gemma perform when it's actually building something?

We ran fresh benchmarks against the llama.cpp server with coding prompts at different output lengths:

Task	Prompt Tokens	Output Tokens	TTFT	Tok/s
Short (debounce function)	29	512	27ms	179.2
Medium (React component)	63	2,048	28ms	177.3
Long (full Node.js script)	62	2,679	29ms	181.2

Three things stand out.

Time to first token is near-instant. 27–29ms TTFT means the streaming UI starts filling in almost immediately. For comparison, cloud models typically hit 500ms–2s TTFT depending on load and routing. On a local GPU, there's no network round-trip, no queue, no cold start.

Generation speed doesn't degrade. Whether Gemma is writing 512 tokens or 2,679 tokens, throughput stays locked at 177–181 tok/s. There's no slowdown as context grows — at least not at these output lengths. During the actual search build session, with thousands of tokens of accumulated context from tool calls and file contents, we observed ~159 tok/s. That's a ~12% drop from peak, which is expected: more context means more attention computation per token.

The reasoning budget has a real cost. With --reasoning-format deepseek, Gemma's thinking tokens are visible in the API response. On a short 256-token request, the model spent all 256 tokens reasoning and produced zero visible output. That's the invisible thinking token problem in action — and exactly why --reasoning-budget 4096 matters. Cap the thinking, and the remaining budget goes to code.

Metric	Ollama (Showdown R2)	llama.cpp (this session)
Tok/s (benchmark)	167.1	177–181
Tok/s (real workload)	N/A (failed)	~159
TTFT	3.92s	~28ms
Reasoning budget control	None	`--reasoning-budget 4096`

The TTFT difference is dramatic — 3.92s vs 28ms. Ollama's 3.92s likely included model loading or prompt cache misses. llama-server keeps the model hot in VRAM with a persistent prompt cache, so subsequent requests start generating almost instantly.

Bottom line: Gemma 4 on an RTX 5090 via llama.cpp generates code at ~180 tok/s peak, ~159 tok/s under real agentic load, with sub-30ms TTFT. That's fast enough that the model is never the bottleneck — tool execution (git operations, file I/O, npm installs) takes longer than inference.

What Gemma Built

Two prompts. One feature. Pushed to main.

 package-lock.json                | 466 +++++++++++++++++++
 package.json                     |   3 +-
 public/search-index.json         |  34 +++
 scripts/generate-search-index.ts |  40 ++++
 src/components/Header.tsx        |  32 +++
 src/components/SearchModal.tsx   | 216 ++++++++++++++++++
 6 files changed, 618 insertions(+), 173 deletions(-)

The architecture: a client-side Fuse.js search with a pre-generated JSON index. A build-time script reads all published posts and generates public/search-index.json. The SearchModal component loads this index on first open, runs fuzzy searches with Fuse.js, and renders results in a Cmd+K overlay.

Gemma even hit an authentication error during git push — and self-corrected. It ran coder external-auth access-token github, reconfigured the git remote with the token, and pushed successfully. That's agentic behavior — the thing that was completely absent in the original test.

The commit message: afb5c73 feat: add search functionality with Fuse.js. Vercel auto-deployed from main. The feature went live.

The Code Review: What Gemma Got Right and Wrong

Working code that ships is a milestone. But "it works" and "it's production-quality" are different standards. Claude reviewed every line of Gemma's implementation. Here's the honest assessment.

What Gemma Got Right

Architecture was sound. Client-side search with a pre-generated JSON index is the correct call for a 14-post blog. No server-side API needed, no database, sub-5ms search times. The index is ~130 KB — smaller than a hero image.

Component structure was clean. Separate SearchModal component, separate build script, clean Header integration. Three lines to wire it into the existing layout.

It used the existing design system. CSS variables like bg-surface, border-primary, text-on-surface — all from the Neon Brutalist theme. It read the codebase and matched the patterns.

Self-correcting on errors. When git push failed, Gemma diagnosed the auth issue and fixed it autonomously. Three tool calls: fetch token → reconfigure remote → push. No human intervention needed.

What Gemma Got Wrong

Zero accessibility. No role="dialog", no role="combobox", no aria-modal, no aria-activedescendant, no focus trap. A screen reader would have no idea this modal existed.

Broken exit animations. The AnimatePresence wrapper contained a regular <div> instead of a motion.div. When the modal closed, React unmounted the wrapper immediately, killing the exit animations before they played. The code looked right but didn't work.

Performance anti-pattern. A new Fuse instance was constructed on every keystroke. Fuse builds an internal index on construction — that's wasted work. Should be useMemo keyed on the index data.

Eager loading. The search index was fetched on every page load, even if the user never opened search. Should lazy-load on first modal open.

Wrong fonts. Applied --font-headline (Space Grotesk) to the entire modal including body text and descriptions. The codebase uses headline for titles only, with the default font for body text.

Ignored existing components. Rendered tags as raw <span> elements with custom styling instead of reusing the existing TagBadge component that already had the right design tokens.

Stale search index committed to git. The generated search-index.json was committed with 3 placeholder posts. It's a build artifact — should be in .gitignore.

Content truncated too aggressively. Each post's content was cut to 1,000 characters. Terms that only appeared deeper in posts (like "RustDesk" in our infrastructure writeups) were invisible to search.

The Polish Pass

Claude's fix addressed every issue in a single PR:

Accessibility: Full ARIA combobox pattern — role="dialog", role="combobox" on the input with aria-expanded/aria-activedescendant, role="listbox" and role="option" on results, aria-live="polite" for result count announcements.

Keyboard navigation: Arrow Up/Down to move through results, Enter to navigate, Escape to close. Active result scrolls into view automatically.

Performance: Fuse instance memoized with useMemo (rebuilds only when index changes). Index fetched lazily on first modal open. Minimum 2 characters before searching.

Search quality: Weighted field scoring — title matches score 3× higher than content matches, tags 2×, descriptions 1.5×. Markdown stripped from indexed content. Full post content indexed with no truncation.

Design system: Correct font usage matching PostCard patterns. TagBadge component reused. Platform-aware keyboard hint (⌘K on Mac, Ctrl+K elsewhere).

Animation fix: Outer wrapper is now a motion.div — exit animations actually play.

Cleanup: Body scroll lock, query cleared on close, build artifact gitignored, dead imports removed.

The polish commit: 383 insertions, 201 deletions across 5 files. The combined feature is 804 lines across 6 files.

Opus vs. Gemma+Opus: An Honest Comparison

We now have two complete implementations of the same feature. Opus 4.6's original branch (feature/search-opus46, 698 lines) is still in the repo. Here's how they compare.

Architecture

	Opus 4.6 (original)	Gemma 4 + Opus (shipped)
Search engine	Server-side API route with weighted scoring	Client-side Fuse.js with weighted config
Index	None — reads posts at request time	Pre-generated JSON, fetched once
Surfaces	Cmd+K dialog + `/search` page	Cmd+K modal only
URL state	Yes (`/search?q=cloudflare`)	No

Opus's architecture is more feature-complete. A dedicated /search page with URL state means search results are linkable and shareable. The server-side API route means the search logic runs where the content lives, with no index to generate or cache.

Gemma's architecture is simpler and arguably better for this scale. A static JSON index means zero server load, instant results, and the feature works on Vercel's free tier without hitting function invocation limits. At 14 posts and 130 KB, client-side search is the right call.

Code Quality

	Opus 4.6	Gemma 4 (raw)	Gemma 4 + Opus (merged)
Accessibility	Full ARIA, keyboard nav	None	Full ARIA, keyboard nav
Animation correctness	Correct	Broken exits	Fixed
Performance	AbortController for API calls	Fuse recreated per keystroke	Memoized, lazy-loaded
Design system	Mostly correct	Mostly correct	Fully correct
Known bugs	3 (duplicate logic, type cast, missing Suspense)	7 (see review above)	0

Opus's raw output was higher quality. Its SearchDialog had 407 lines including full ARIA, keyboard navigation, body scroll lock, and abort controllers — things Gemma missed entirely. But Opus also had its own bugs: duplicated search logic between the API route and the /search page, an unsafe type cast, and a missing Suspense boundary. We scored it 87.5/100 in the original review.

The merged Gemma+Opus implementation is the cleanest of the three. It takes Gemma's simpler architecture, applies Opus's quality standards for accessibility and interaction design, and fixes the issues both models left behind.

The Real Comparison

The honest truth: if I had to ship search today with one model and no review, I'd pick Opus. It produced higher-quality code in a single turn with zero intervention. The 87.5/100 score reflects real, shippable work with minor fixable issues.

But that's not the interesting takeaway. The interesting takeaway is that the configuration changes mattered more than the model differences. The original Gemma test didn't fail because Gemma is a bad model. It failed because:

num_predict was too low (invisible thinking tokens consumed the budget)
Ollama doesn't expose --reasoning-budget (no way to cap thinking)
Default parallel slots exhausted VRAM
GitHub auth wasn't configured in the workspace

Fix those four things — all infrastructure, not model weights — and Gemma went from "0 commits in 8 prompts" to "shipped a feature in 2 prompts." The model was the same. The environment was different.

What This Means for Local Models

Local models can ship production features. Not hypothetically. This search feature is live, built entirely by Gemma 4 running on consumer hardware. The code needed polish — but so does most code from any developer, human or AI.

Configuration is the bottleneck, not capability. The difference between "Gemma can't finish anything" and "Gemma ships a feature" was four infrastructure changes. Most teams evaluating local models are testing against default settings that actively sabotage the model's output. Invisible thinking tokens, insufficient context windows, VRAM contention — these are environment bugs, not model bugs.

The best workflow might be local + cloud. Gemma built the feature (free, fast, private). Claude reviewed and polished it (thorough, quality-focused). Each model did what it's best at. The total cost was one Opus API call for the review pass, not dozens for the entire build.

llama.cpp is the right tool for serious local inference. Ollama is great for getting started. For production use — where you need reasoning budgets, precise context control, and OpenAI-compatible APIs that tools like Coder can consume — llama-server gives you the knobs you actually need.

The Settings That Made It Work

For anyone running Gemma 4 locally, here's the configuration that turned it from a planning machine into a shipping machine:

llama-server \
  -m gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
  --ctx-size 32768 \       # 32K context — ~19 GB VRAM on 5090
  -n 32768 \               # 32K max output tokens
  --reasoning-budget 4096 \ # Cap thinking at 4K tokens
  --reasoning-format deepseek \ # Expose thinking in API response
  --parallel 1 \           # Single slot — don't OOM with 4 × 32K
  -ngl 999                 # All layers on GPU

The --reasoning-budget 4096 is the single most important flag. Without it, Gemma can spend its entire output budget on reasoning you never see. With it, the model gets 4K tokens to think, then the rest is for actual code. That one flag is the difference between a model that plans forever and a model that ships.

What's Next

Right now, Gemma 4 serves a single Coder instance on the workstation where it runs. That's fine for one person, but the RTX 5090 is sitting idle most of the day. The obvious next step: make it available to every machine on the local network.

My wife runs OpenClaw on a Mac Mini in the other room. With Tailscale already meshing our devices together, pointing her OpenClaw instance at http://workstation:8080/v1/ is trivially easy — llama-server's OpenAI-compatible API means any tool that speaks the OpenAI protocol can use it. One GPU, multiple clients, zero cloud costs.

Beyond that: migrating the remaining Ollama models to llama.cpp (for the same reasoning budget control we needed here), experimenting with longer context windows now that we know the VRAM budget, and — inevitably — the next model showdown when Gemma 4's bigger variants drop.

The homelab keeps growing. Who knows? Maybe the lobster starts vibe coding for me, too.

By the Numbers

3 attempts before Gemma completed the task (auth fix, OOM fix, success)
2 prompts in the successful run (vs 8 failed prompts in the original test)
618 lines written by Gemma 4 across 6 files
383 lines changed in the Opus polish pass (insertions + deletions)
804 total lines in the merged feature
0 cloud API calls for the build phase (Gemma ran 100% local)
177–181 tokens per second — Gemma's peak generation speed on the RTX 5090
~159 tokens per second — effective speed under real agentic load (accumulated context)
28ms time to first token — near-instant streaming start
16.9 GB model size (Gemma 4 26B-A4B, Q4_K_M quantization)
~19 GB total VRAM at 32K context (comfortable fit on 32 GB card)
4,096 reasoning budget tokens — the setting that made it all work
$0 inference cost for the feature build
1 nudge needed ("keep going, don't stop, code and commit")
7 bugs found in Gemma's code during review (all fixed)
3 bugs in Opus's original implementation (never merged, never fixed)
0 bugs in the merged Gemma+Opus version
1 production feature, live on vibescoder.dev right now — press ⌘K to try it