This is a submission for the Gemma 4 Challenge: Write About Gemma 4
Every open-weight model release in 2026 comes with a benchmark table and a claim about efficiency. Most of them are incremental. Gemma 4 has one number that isn't: 6.6% to 86.4% on agentic tool use. That's not an improvement. That's a category change.
The Number That Actually Matters
When Google DeepMind dropped Gemma 4 on April 2, 2026, the coverage focused on the headline scores - AIME 2026, LiveCodeBench, Arena AI rankings. Those numbers are impressive. The 31B dense model scores 89.2% on AIME (up from Gemma 3 27B's 20.8%), 80% on LiveCodeBench (up from 29.1%), and sits third among all open models on Arena AI.
But the benchmark that actually changes what developers can build is τ2-bench - the agentic tool use evaluation that measures whether a model can reliably execute multi-step tasks across real tool schemas, partial information, and policy constraints. Gemma 3 27B scored 6.6% on τ2-bench Retail. Gemma 4 31B scores 86.4%.
Put that concretely: Gemma 3 failed 93 times out of 100 on structured tool use. Gemma 4 fails roughly 14 times out of 100. Those aren't the same class of model for anyone building agents.
The 26B MoE variant scores 85.5% on the same benchmark while activating only 3.8 billion of its 26 billion parameters per forward pass. You get near-flagship agentic capability at a fraction of the inference cost.
What Changed Architecturally
The τ2-bench jump didn't happen because Google made a bigger model. Gemma 4 31B has roughly the same parameter count as Gemma 3 27B. What changed is how the model was trained and what capabilities were baked in natively.
Gemma 4 ships with native function calling via dedicated control tokens - structured tool use is built into the model's vocabulary rather than bolted on through prompt engineering. It has configurable thinking modes where the model can generate 4,000+ tokens of step-by-step reasoning before committing to a tool call, which directly improves accuracy on complex multi-step pipelines. And it has native system prompt support, meaning you can define agent behavior, tool schemas, and constraints in the system turn without workarounds.
The architecture also came from the same research stack as Gemini 3, Google's closed frontier family. The knowledge transfer is visible in the benchmark gaps - particularly on tasks requiring multi-turn planning and policy-compliant tool execution, which are exactly the conditions τ2-bench tests.
One important hardware caveat on the 26B MoE: while it activates only 3.8B parameters per token during generation, all 26 billion parameters must be loaded into memory for routing. Its memory footprint is close to a dense 26B model, not a 4B one. The speed advantage is real - the MoE reaches 40+ tokens per second on consumer GPUs versus 10+ for the dense 31B - but size your VRAM accordingly before assuming it runs like a small model.
Why This Matters for Developers Building Agents
Before Gemma 4, the honest answer to "should I use a local open model for my agent?" was usually no - at least not for anything where tool call reliability mattered. A 6.6% success rate on structured tool use means the agent fails almost every time it needs to call a function, check a schema, or chain tool outputs. That's not a foundation for anything in production.
86.4% changes the calculation. It's not at parity with frontier closed models - GPT-5.4 still leads on complex multi-step benchmarks - but it's in the range where developers can build real agentic workflows locally, catch edge cases with retries and error handling, and ship something that actually works. The failure modes are now manageable rather than fundamental.
This matters especially for three deployment contexts that couldn't practically use local models before.
Privacy-sensitive agentic applications. Healthcare tools, legal review pipelines, financial compliance agents - any workflow where raw query data can't leave the device. Gemma 4's native function calling running locally means the model decides which tool to call on-device, and only the structured API request goes out over the network. Your prompt, your context, and your intermediate reasoning stay local.
Cost-controlled production agents. Per-token API costs accumulate fast in multi-step agentic workflows where each task triggers 5–20 tool calls. Running Gemma 4 26B MoE locally on a consumer GPU eliminates that variable entirely. The 26B MoE's inference speed (40+ tokens/sec on an RTX 4090) is fast enough for real-time agentic loops without the latency penalty you'd expect from a model this capable.
MCP-integrated local pipelines. Gemma 4's native function calling maps directly to Model Context Protocol tool schemas. The setup is straightforward: run Gemma 4 via llama.cpp or vLLM with an OpenAI-compatible endpoint, point your MCP client at it, and the model handles tool selection and call generation locally. What previously required a cloud model API can now run on your own infrastructure with no per-call cost and no data leaving your server.
Picking the Right Model for Agentic Work
Gemma 4 ships as a family of four, and the right choice for agentic deployment isn't automatically the biggest one.
The 31B dense model is the accuracy ceiling - highest τ2-bench score, best reasoning on complex multi-step tasks, strongest fine-tuning base. It runs unquantized on a single 80GB H100, and quantized (Q4_K_M) on consumer GPUs with 24GB+ VRAM. If you're building a server-side agent where quality is the constraint and hardware isn't, start here.
The 26B MoE is the practical production choice for most agentic deployments. 85.5% on τ2-bench is close enough to the 31B that the tradeoff is almost always worth it: 4x faster token generation, lower GPU memory pressure during inference, same 256K context window. For agents running continuous loops or handling high request volume, the speed difference compounds significantly.
The E4B (4B edge model) hits 52% on LiveCodeBench and supports native audio input - the only model in the family that handles speech natively. If you're building on-device Android agents that need voice input or mobile-first agentic workflows, this is your model. The agentic tool use scores are lower, but the hardware targets are completely different: this runs on a phone.
The E2B (2B edge model) reaches 133 prefill tokens/sec on a Raspberry Pi 5 CPU. For IoT agents, offline-first deployments, or anything constrained to sub-1.5GB RAM, it's the only viable option in this family and still handles multimodal input.
The Apache 2.0 License Is Not a Minor Detail
Every previous Gemma release shipped under a Google proprietary license. Gemma 4 is the first under Apache 2.0.
For agentic AI specifically, this matters more than it does for general language model use. Agents get embedded in products. They get fine-tuned on proprietary data. They get wrapped in commercial services that customers pay for. All of that required legal review and negotiation under the old Gemma license. Under Apache 2.0, you can build, ship, fine-tune, and commercialize without clearing Google's terms first.
For startups and solo developers building on open-weight models, this is one less legal headache at exactly the moment when the model became capable enough to actually deploy in production.
Getting Started
# Pull with Ollama - fastest path to a running model
ollama pull gemma4:31b
ollama pull gemma4:26b-moe
# Or via Hugging Face
pip install transformers
Google AI Studio has the 31B and 26B MoE available in-browser with no local setup. Google AI Edge Gallery covers the E4B and E2B for on-device testing. Full framework support at launch includes Hugging Face Transformers, vLLM, llama.cpp, MLX, NVIDIA NIM, SGLang, Ollama, LM Studio, and more.
For MCP integration, the gemma-mcp package handles client setup against a locally-served Gemma 4 endpoint.
One practical note if you're running the 26B MoE via Ollama on Apple Silicon: as of v0.20.3 there's a known streaming bug that routes tool-call responses to the wrong field. Use llama.cpp directly or wait for the Ollama fix before deploying in an agentic context.
The Honest Caveat
86.4% on τ2-bench Retail is not 100%. In agentic pipelines where tool calls chain across 10–20 steps, a 14% per-call failure rate compounds. Production deployments need retry logic, error handling, and validation layers between tool outputs - the same engineering discipline you'd apply to any distributed system with failure modes.
Gemma 4 doesn't eliminate the need for defensive agent architecture. It makes the failure rate manageable enough that the architecture is worth building.
That's the real shift. Not that local open models are now perfect for agentic work. It's that they crossed the threshold from "interesting experiment" to "defensible production choice" - and they did it on your hardware, under a license you can actually ship with.
Follow for more coverage on MCP, agentic AI, and AI infrastructure.
Top comments (0)