Two days ago, Gemma 4 topped our local model benchmark — 167 tokens per second, perfect code quality score, smallest download. Faster than Sonnet. Faster than Opus. The blog post said "Gemma 4 is the new default."
Today we tested whether that's actually true.
## The Experiment
Instead of another toy benchmark, we pulled a real item off the vibescoder.dev backlog: public-facing search across all blog posts. Multi-file feature, architectural decisions required, design system integration, no specification beyond "make search work."
Two models. Same prompt. Same codebase. Same workspace template. One shot — no follow-up instructions, no hand-holding. Walk away and see what happens.
| | Gemma 4 27B | Opus 4.6 |
|---|---|---|
| Provider | Ollama (local, RTX 5090) | Anthropic API (cloud) |
| Benchmark speed | 167.1 tok/s | 74.3 tok/s |
| Benchmark score | 100/100 | 100/100 |
| Cost | $0 | Per-token pricing |
The prompt was deliberately vague on implementation details:
> Add public-facing search to vibescoder.dev. Users should be able to search across all published blog posts by title, content, tags, and description. The search should feel fast and match the site's existing Neon Brutalist design system. Consider: how users discover search, how results display, empty/no-result states, search state management (URL, keyboard shortcuts). Must be accessible from any page, work on mobile, and not introduce new design libraries. Commit and push when complete. Do not ask clarifying questions — make your own decisions.
## Setting Up the Arena
Each model got its own Coder workspace with identical starting conditions: same Docker template, same base commit on main, same content repo.
Both workspaces built from the same Docker template — only the model selection differed.
We created two feature branches from the same commit (`12fd589`) and verified Vercel was configured to auto-build preview deployments for any branch push.
Vercel preview deployments would give us side-by-side URLs to compare the finished features.
Both prompts were delivered at the same time. Then we stepped back.
## Opus 4.6: The Quiet Professional
Opus received the prompt and went silent. No questions. No plan narrated back. Just the spinning indicator showing it was working.
Over the next eight minutes, Opus:
- Cloned both repos and installed dependencies
- Read `package.json`, `tsconfig.json`, the app layout, existing components, `lib/posts.ts`, `lib/types.ts`, the design system in `globals.css`, and the middleware
- Made architectural decisions: a Cmd+K dialog with live API results for quick navigation, plus a full `/search` page for detailed browsing
- Built a weighted scoring search API (`/api/search`) that ranks title matches above tag matches above content matches
- Created a 407-line `SearchDialog` component with keyboard navigation, body scroll lock, abort controllers for in-flight requests, and ARIA accessibility
- Built a server-rendered search results page with debounced URL state
- Modified `Header.tsx` with three lines: import, component placement, mobile nav link
- Updated middleware to whitelist the search API route
- Committed everything in one clean commit and pushed
One prompt. One commit. 698 lines across 6 files. Pushed to GitHub, Vercel preview building.
```
 src/app/api/search/route.ts     | 104 ++++++++++
 src/app/search/SearchInput.tsx  |  78 ++++++++
 src/app/search/page.tsx         |  97 ++++++++++
 src/components/Header.tsx       |  10 +
 src/components/SearchDialog.tsx | 407 +++++++++++++++++++++++++++++++++
 src/middleware.ts               |   3 +-
 6 files changed, 698 insertions(+), 1 deletion(-)
```
## What Opus Built
**A Cmd+K search dialog.** Press Cmd+K (or `/`) from any page and a full-screen overlay appears with a search input. Results appear live as you type, debounced at 200ms, with score-based ranking. Arrow keys navigate results, Enter selects, Escape closes. The dialog shows up to 8 results with title, date, tags, reading time, and a context snippet showing where the match was found.
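The dialog's source isn't reproduced here, but the debounce-plus-abort pattern described above looks roughly like this minimal sketch. The hook name and the `/api/search` response shape are assumptions, not Opus's actual code:

```tsx
"use client";

import { useEffect, useState } from "react";

// Hypothetical result shape; the real /api/search response isn't shown in the post.
type SearchResult = { slug: string; title: string; snippet: string };

// Debounced live search that cancels in-flight requests on each keystroke.
export function useLiveSearch(query: string) {
  const [results, setResults] = useState<SearchResult[]>([]);

  useEffect(() => {
    if (!query) {
      setResults([]);
      return;
    }
    const controller = new AbortController();
    // 200ms debounce, matching the behavior described above.
    const timer = setTimeout(async () => {
      try {
        const res = await fetch(`/api/search?q=${encodeURIComponent(query)}`, {
          signal: controller.signal,
        });
        const data = await res.json();
        setResults(data.results.slice(0, 8)); // dialog shows at most 8 results
      } catch (err) {
        // Aborted requests are expected when the query changes mid-flight.
        if ((err as Error).name !== "AbortError") console.error(err);
      }
    }, 200);
    return () => {
      clearTimeout(timer);
      controller.abort();
    };
  }, [query]);

  return results;
}
```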
**A full search page at `/search`.** Accessible from the mobile hamburger menu and via a "View all results" link in the dialog. Server-rendered with URL state (`/search?q=cloudflare`). Shows the full `PostCard` component for each result, consistent with the blog's existing post listing.
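A server-rendered page with URL state might look like the sketch below. `getAllPosts` and `PostCard` stand in for the site's real helpers; this illustrates the pattern, not the committed code:

```tsx
// Hypothetical sketch of a server-rendered search page with URL state.
// getAllPosts and PostCard stand in for the site's real helpers.
import { getAllPosts } from "@/lib/posts";
import { PostCard } from "@/components/PostCard";

export default function SearchPage({
  searchParams,
}: {
  searchParams: { q?: string };
}) {
  const q = (searchParams.q ?? "").toLowerCase();
  const results = q
    ? getAllPosts().filter((post) =>
        [post.title, post.description, post.tags.join(" ")]
          .join(" ")
          .toLowerCase()
          .includes(q)
      )
    : [];

  return (
    <main>
      <h1>Search</h1>
      {results.map((post) => (
        <PostCard key={post.slug} post={post} />
      ))}
      {q && results.length === 0 && <p>No results for "{q}".</p>}
    </main>
  );
}
```

Note the flat `includes` filter here is exactly the duplication the review below flags: the dialog's API route ranks, the page merely filters.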
**A scored API route.** Title matches score 100 points (150 for an exact match). Tag matches score 50. Description matches score 25. Content matches score 10 plus 3 per occurrence. Results are sorted by score descending, capped at 20. The API strips markdown from content before extracting snippets.
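Those rules are concrete enough to sketch. This is a reconstruction from the description above, not the route's actual code:

```ts
// Reconstruction of the scoring rules as described; not the route's actual code.
type Post = {
  slug: string;
  title: string;
  description: string;
  tags: string[];
  content: string; // markdown is stripped before snippet extraction
};

function scorePost(post: Post, rawQuery: string): number {
  const q = rawQuery.toLowerCase().trim();
  if (!q) return 0;
  let score = 0;

  const title = post.title.toLowerCase();
  if (title === q) score += 150; // exact title match
  else if (title.includes(q)) score += 100; // partial title match

  if (post.tags.some((t) => t.toLowerCase().includes(q))) score += 50;
  if (post.description.toLowerCase().includes(q)) score += 25;

  const occurrences = post.content.toLowerCase().split(q).length - 1;
  if (occurrences > 0) score += 10 + 3 * occurrences;

  return score;
}

// Sort by score descending and cap at 20, as described above.
const rank = (posts: Post[], q: string) =>
  posts
    .map((post) => ({ post, score: scorePost(post, q) }))
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, 20);
```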
The Header diff tells the story of good integration:
```diff
+import { SearchDialog } from "@/components/SearchDialog";
 ...
+  <SearchDialog />
   <ThemeToggle />
 ...
+  <Link href="/search" ...>Search</Link>
```
Three lines to wire a 407-line feature into the existing layout. That's how you integrate with a codebase.
## Gemma 4: The Brilliant Planner Who Never Coded
Gemma's run was a different experience entirely.
**Prompt 1 — the original task.** Gemma cloned the repos, checked out the branch... and stopped. Checkmark. Done. It treated the setup commands as the entire task.
Gemma completed the setup and declared victory. The search feature hadn't been touched.
**Prompt 2 — "You only completed the setup steps. Now read the codebase and implement the search feature."** Gemma cloned the content repo, listed the directory, read `package.json`, wrote a thoughtful analysis plan (Architecture, Data Flow, Design System)... and stopped again. "I will begin by reading `package.json`."
**Prompt 3 — "Keep going. Execute your plan — read the files, build the search feature, commit and push."** Gemma read more files, outlined a detailed preliminary plan with four numbered steps... and stopped. "I will start by reading the core scripts directory."
**Prompt 4 — "Do not stop between steps. Read all the files you need, write all the code, and push to git — all in one go. Do not pause or ask for confirmation."**
The plan was actually good: client-side JSON index, search in the header, `/search?q=query` routing. It just wouldn't execute it.
Gemma responded with its most detailed plan yet. A JSON search index builder, modifications to `fetch-content.sh`, a `SearchInput` component, Header integration, a search results page. Smart architectural choices. Then: "I will perform all these changes now." And stopped.
**Prompt 5 — "Stop planning. Start coding. Write the files now."**
After being told to code, Gemma showed code in the chat window instead of writing it to disk.
This time Gemma actually wrote some code: `build-search-index.js`, an edit to `fetch-content.sh`, and `SearchInput.tsx`. Three files to disk. Progress. Then it listed the three remaining tasks (Header, search page, commit) and stopped.
**Prompts 6, 7, 8 — "Go." / "Go." / an explicit task list with three items.** Gemma showed "Thinking..." briefly, then nothing. No output. No tool calls. The workspace eventually showed "unhealthy."
Eight prompts. Three partial files. Zero commits.
## The AGENTS.md Experiment
Before giving up, we tried one more thing. We added explicit agentic behavioral instructions to AGENTS.md in the repo — the file that Coder agents read for project-level guidance:
```markdown
# Agentic Execution Rules

You are an autonomous coding agent. Execute tasks end-to-end
in a single turn. Never stop to describe what you will do
next — just do it.

## What You Must Never Do

- Output a multi-step plan and then stop.
- Describe code you intend to write instead of writing it.
- Leave uncommitted changes in the workspace.
```
Started a fresh Gemma session with the same prompt. Same result. Clone, read `package.json`, plan, stop.
The instructions were clear. Gemma read them. And then it planned what it was going to do next and stopped.
## The Scoreboard
| | Opus 4.6 | Gemma 4 27B |
|---|---|---|
| Prompts needed | 1 | 8 (incomplete) |
| Files changed | 6 | 3 (never committed) |
| Lines written | 698 | ~150 (partial, uncommitted) |
| Commits pushed | 1 | 0 |
| Feature complete | Yes | No |
| Time to completion | ~8 minutes | Never |
| Errors self-corrected | Yes (middleware, routing) | N/A |
| Design system match | Yes (Neon Brutalist tokens) | N/A |
| Keyboard shortcuts | Cmd+K, /, Escape, arrows | N/A |
| Mobile support | Yes (hamburger menu link) | N/A |
| Accessibility | Full ARIA | N/A |
## Technical Review: Opus Implementation
The code isn't perfect. A few things to fix before merging:
**Duplicate search logic.** The API route uses weighted scoring; the search page uses flat boolean filtering. Same query, different result order depending on which surface you use.
**Unsafe type cast.** `post as Post` in the search page strips content, then casts back to `Post`, which expects a `content` field. It works at runtime but lies to TypeScript.
**Missing Suspense boundary.** `useSearchParams()` in `SearchInput` needs a `Suspense` wrapper for Next.js 14+; a sketch of the fix follows.
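The Suspense item, at least, is a quick fix. A minimal sketch of the wrapper, assuming `SearchInput` is a default export rendered from the search page:

```tsx
// Sketch of the Suspense fix: any component calling useSearchParams() must
// render inside a Suspense boundary in Next.js 14+.
import { Suspense } from "react";
import SearchInput from "./SearchInput"; // default export assumed

export default function SearchPage() {
  return (
    <Suspense fallback={<p>Loading search…</p>}>
      <SearchInput />
    </Suspense>
  );
}
```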
But these are code review items — the kind of things you'd catch in a PR review and fix in 10 minutes. The feature works, the architecture is sound, the UX is polished.
Score: 87.5/100 across correctness (88), architecture (82), code quality (90), performance (85), completeness (92), and integration (91).
## What We Learned
Benchmarks test generation, not agency. Gemma 4 writes excellent code when you tell it exactly what to write. That's what our todo-app benchmark measured — single-turn code generation from a clear spec. Agentic coding is a different skill entirely: reading a codebase, making decisions, chaining dozens of tool calls, self-correcting, and maintaining a plan across many steps. Gemma can't do that yet.
The plan-and-stop pattern is a model behavior, not a configuration problem. We tried explicit instructions ("do not stop"), behavioral directives in AGENTS.md, and increasingly urgent nudges. Gemma consistently planned what it would do, narrated the plan in detail, and then yielded control back to the user. It's not a token limit or context issue — it's how the model was trained to interact.
Speed doesn't matter if you can't finish. Gemma generates at 167 tok/s. Opus generates at 74 tok/s. But Opus delivered a complete, working, tested feature in 8 minutes with zero human intervention. Gemma delivered nothing usable in 20+ minutes with eight human prompts. The fastest model in our benchmark is the slowest in production.
The daily driver earned its spot. Opus 4.6 has been behind every line of code on vibescoder.dev since day one. This experiment didn't just confirm that choice — it quantified why. On a real task, the gap between "writes great code" and "builds great features" is the difference between a benchmark score and a shipping product.
Local models aren't there yet for agentic coding. This isn't a permanent verdict. Gemma 4 was released weeks ago. Agentic capabilities are the frontier every model vendor is racing toward. But today, if you need an AI agent that can autonomously build features, cloud models with tool-calling training (Claude, GPT) are still the only game in town.
## What's Next
We have a working search feature on a Vercel preview branch, courtesy of Opus. Next step is reviewing the code, fixing the three issues identified, and merging it to production. vibescoder.dev gets search.
But we're not done with Gemma. The more we dug into the results, the more we think this shootout wasn't a fair fight. Our Gemma 4 deep dive ran Gemma through Ollama with default settings, and we've since discovered that Gemma's reasoning tokens are invisible but still eat into the generation budget. With `num_predict: 16384`, the model may have blown its entire token budget on chain-of-thought we never saw, leaving nothing for actual code output. That would explain the plan-and-stop pattern perfectly: Gemma wasn't refusing to code; it was running out of room.
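For context, those knobs live in Ollama's per-request options. A minimal sketch of how they're set, with the model tag and values illustrative rather than taken from our actual runs:

```ts
// Sketch of the Ollama request options in question; model tag and values
// are illustrative. num_predict caps ALL generated tokens, so invisible
// reasoning and visible code output come out of the same budget.
const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gemma4:27b", // hypothetical tag
    prompt: "Implement public-facing search...",
    stream: false,
    options: {
      num_ctx: 32768,     // context window
      num_predict: 16384, // generation cap shared by reasoning and output
    },
  }),
});
const { response } = await res.json();
console.log(response);
```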
So we're rerunning the shootout. This time we're loading both models through llama.cpp directly, giving us fine-grained control over thinking budgets and VRAM allocation. We'll crank `num_predict` and `num_ctx` to 32K+, experiment with `--reasoning-budget` to cap invisible thinking tokens, and give Gemma the full 32 GB of RTX 5090 VRAM to work with. No more starving the local model on default settings and then calling it a fair comparison.
If Gemma was choking on its own reasoning, the fix might be as simple as giving it room to breathe. If it still can't finish — even with aggressive resources and tuned inference settings — then the agentic gap is real and it's in the model weights, not the configuration. Either way, we'll have a definitive answer.
## By the Numbers
- 2 models tested head-to-head on a real feature
- 1 prompt for Opus: the identical one-shot prompt both models received, no follow-ups needed
- 8 prompts needed for Gemma before it stalled permanently
- 698 lines of working code from Opus
- 0 lines committed by Gemma
- 6 files changed by Opus (API route, search dialog, search page, input component, header, middleware)
- 3 files partially written by Gemma (never committed)
- 8 minutes from prompt to pushed commit (Opus)
- 20+ minutes of attempted nudging before calling it (Gemma)
- 87.5/100 technical review score for Opus implementation
- 407 lines in `SearchDialog.tsx` alone: keyboard nav, ARIA, scroll lock, abort controllers
- 3 code review items to fix before merging (duplicated logic, type cast, Suspense)
- $0 spent on Gemma inference (also $0 of value delivered)
- 1 AGENTS.md rewrite attempted to fix Gemma's behavior (didn't work)
- 1 clear winner