DEV Community

Bishwas Bhandari

Posted on • Originally published at webmatrices.com

Gemma4 vs Claude Code: I Tried the Switch. Here's What Broke First.

Every few months someone drops a new open model and the local AI community collectively loses its mind. "This is the one. This kills the subscriptions." It happened with Llama 3. With Qwen. With Gemma 3.

None of them actually did it.

So when Gemma 4 landed and the numbers looked genuinely scary, I decided to stop theorizing and just try it — wired into my actual dev workflow, building actual software. Not chat. Not benchmarks. Shipping code.

Here's what happened.


What Made This Test Worth Running

I'll be honest: I almost didn't bother. I'd been burned too many times by models that looked great in a playground and fell apart the second I asked them to do real work.

But Gemma 4's numbers are different. 31B parameters, #3 on Chatbot Arena, Apache 2.0 license, runs on a single GPU. Someone got it running on a 6GB phone. A developer built an Android automation agent with it before the week was out. The 26B MoE variant hits 80–110 tokens per second on an RTX 3090.

And here's the number that actually convinced me to test it: the τ²-bench agentic tool-use score. Gemma 3 scored 6.6% on that benchmark — meaning it failed roughly 93 of every 100 tool calls. Basically useless as an agent. Gemma 4 31B scores 86.4% on the same test. That's not a marginal improvement. That's an entirely different model category.

So I tested it. And I kept notes.


The First 4 Hours Are Genuinely Impressive

Single-file edits? Fast and accurate. Writing fresh functions from scratch? Solid. It understood what I was asking, gave clean code, and didn't hallucinate imports.

Speed alone made me want this to work. No API latency. No waiting. Just results.

But then I gave it a real task.


Where It Broke

I asked it to refactor a module that touched 4 files. Nothing exotic — just a cleanup that involved renaming a function and updating its callers.

Here's what happened:

  • File 1: Edited perfectly. I was impressed.
  • File 2: Hallucinated the path. Generated changes to a file that didn't exist at that location.
  • File 3: Wrote code that called the function it had just deleted in File 1.

Classic context collapse. The model understood the task. It just couldn't hold the thread across multiple files under load.

I went back to Claude Code. Same refactor. One shot. Done.

That's the gap right now — and it's not about intelligence.


The Tool Calling Problem Is Worse Than the Benchmarks Suggest

Here's the thing about that 86.4% τ²-bench score: it's for the 31B dense model. Most people running Gemma 4 on consumer hardware are using the 26B MoE variant — because it's faster and fits in less VRAM.

That model scores 68% on the same agentic benchmark. Qwen's comparable variant scores 81%.

In a 15-step workflow, a 68% tool-call success rate isn't a stat. It's a guarantee of failure somewhere in the middle.
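The arithmetic behind that claim is worth seeing. If you treat each tool call as an independent event (a simplification — real failures correlate — but it shows the scale), the chance of a clean 15-step run collapses fast:

```python
# Probability that every step in an n-step agentic workflow succeeds,
# assuming each tool call is independent with the same success rate.

def workflow_success_rate(per_call: float, steps: int) -> float:
    return per_call ** steps

# The three per-call rates from the benchmarks above, over 15 steps.
for rate in (0.68, 0.81, 0.864):
    print(f"{rate:.1%} per call -> {workflow_success_rate(rate, 15):.1%} over 15 steps")
# 68.0% per call -> 0.3% over 15 steps
# 81.0% per call -> 4.2% over 15 steps
# 86.4% per call -> 11.2% over 15 steps
```

Even the 31B dense model's 86.4% compounds to barely one clean run in nine. The 26B MoE's 68% compounds to essentially never.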

I ran a test that made this painfully obvious. I asked Gemma 4 to scaffold a SvelteKit project — search the web for the latest setup command, then build it. Simple, explicit instructions. It said it couldn't access the web. I pointed it at a GitHub link. It said it couldn't open URLs. I pointed it at an MCP tool that was literally already connected and listed in its available tools.

It asked me for clarification instead of using the tool.

The same prompt, sent to a different model, returned a web search, a shell command, and a running project. No hand-holding needed.

(The side-by-side screenshots from the developer community testing this are wild — you can see both responses in our forum thread linked below. One says "I can't access that." The other just builds the thing.)


There's Hidden Performance That Hasn't Shipped Yet

This is where it gets interesting.

Some researchers digging through Gemma 4's model weights found undocumented multi-token prediction heads baked into the architecture. Speculative decoding, essentially — just not officially enabled. Google confirmed the finding but said it's "not yet officially supported."

So there's raw performance sitting in the model that nobody can use yet.
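For readers unfamiliar with the technique: multi-token prediction heads let a model draft several tokens cheaply, then verify them against the full model in one pass — the speculative decoding loop. Here's a toy sketch of that acceptance logic. Both `target` and `draft` are deliberately silly stand-in functions, not Gemma 4's actual decode path; the point is the accept-prefix-then-fall-back structure:

```python
# Toy sketch of the speculative decoding loop that MTP heads enable.
# target() stands in for the expensive full forward pass; draft() for
# the cheap multi-token heads. Both rules are arbitrary illustrations.

def target(context):          # "expensive" model: the correct next token
    return sum(context) % 10

def draft(context, k=4):      # "cheap" model: guesses k tokens ahead,
    out, ctx = [], list(context)  # sometimes wrongly
    for _ in range(k):
        guess = sum(ctx) % 10 if len(ctx) % 2 else (sum(ctx) + 1) % 10
        out.append(guess)
        ctx.append(guess)
    return out

def speculative_step(context, k=4):
    """Accept the longest draft prefix the target agrees with, then
    take one guaranteed token from the target itself."""
    proposed, ctx, accepted = draft(context, k), list(context), []
    for tok in proposed:
        if target(ctx) == tok:   # verify draft token against target
            accepted.append(tok)
            ctx.append(tok)
        else:
            break                # first mismatch ends acceptance
    accepted.append(target(ctx)) # target always yields >= 1 token
    return accepted

print(speculative_step([3, 1, 4]))  # -> [8, 6]: two tokens this step
```

The real win is that the target verifies all `k` draft positions in a single batched forward pass rather than the per-token loop shown here — the output is provably identical to decoding with the target alone, just faster whenever the draft guesses right.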

If they release proper MTP support alongside tool-calling fixes in a point release, the numbers improve significantly. And that's before mentioning the Ollama bugs that are tripping people up on Apple Silicon right now — a streaming bug that routes tool-call responses to the wrong field, and a Flash Attention freeze on prompts over 500 tokens.

These are fixable. The question is when.


What Claude Code Actually Does That's Hard to Replace

People frame this as a capability question. It's not. It's a reliability question.

Claude Code isn't the best model on any single benchmark. What it does is boring things consistently — reads your files, understands what's already in your codebase, calls tools in the right sequence, and doesn't lose the thread halfway through a 50-step workflow.

One developer I spoke with said it best: working with Gemma 4 on an existing codebase felt like pairing with a smart contractor who refuses to read the existing code. Technically correct suggestions, all of them wrong for this project.

That's not a parameter count problem. It's a training problem. And it's hard to fix.


So What Actually Happens to the Subscription Economy?

Here's my actual prediction, not the polite one:

Gemma 4 eats ChatGPT's casual usage hard. If you're using a subscription for summarizing stuff, answering questions, writing emails — Gemma 4 running locally is free and nearly as good. That category is gone.

But agentic coding is a different product. Maintaining context across 50 files, calling 15 tools in sequence, not hallucinating paths at 2am when you're exhausted — that's not what benchmarks test, and it's not what Gemma 4 is reliably doing yet.

The smart move right now is probably hybrid: Gemma 4 for the 80% of tasks that are fast and straightforward. Claude Code for the 20% that need depth and reliability. Especially for enterprises with data residency requirements — Apache 2.0 means you can finally run a capable model entirely on your own infrastructure.

The "one model" framing is wrong. It's becoming a stack.


The Conversation Developers Are Actually Having

I wrote a longer version of this take in our community forum and the responses were more honest than anything I've seen in a formal comparison. Real developers. Real workflows. Real failure modes.

One person tried switching for a weekend and lasted 4 hours. Another found the MTP heads in the weights. Another showed the side-by-side screenshots of Gemma 4 refusing to use available tools while a different model just executed.

That thread is here → webmatrices.com/post/will-gemma-4-actually-replace-claude-code

Worth reading if you're thinking about making this switch — or if you already tried and want to compare notes.


Gemma 3 couldn't have this conversation. Gemma 4 makes it real. Gemma 5 might actually answer it.

Have you wired Gemma 4 into a real dev workflow and tried to ship something? Not chat — actual coding work. Drop your experience in the comments or the forum thread. The honest data is in the field reports, not the leaderboard.
