Gemma 4 ships 4 model sizes under Apache 2.0 with native function-calling built in
The 31B dense model outperforms 400B-parameter rivals on reasoning benchmarks
On-device variants run agentic workflows directly on Android phones
Multimodal processing covers text, images, video, and audio without separate pipelines
Open-weight agentic AI just became free for every developer to ship commercially
Google released Gemma 4 on April 2, 2026, and it changes what "open model" means for agentic AI development. Four model sizes. Full commercial license. Native tool use baked into the architecture. No API fees, no rate limits, no usage caps.
Four Models, One Architecture
Gemma 4 ships in four variants: Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts, and 31B Dense. The naming matters. "Effective" means these models punch above their parameter count through architectural efficiency. The E2B model fits on a phone. The 31B model runs on a single GPU.
Each variant shares the same core capabilities: function-calling, structured JSON output, and system instruction support. The difference is scale. E2B handles simple extraction and classification. E4B manages multi-turn conversations with tool use. The 26B MoE model activates only the parameters it needs per token, keeping inference costs low while maintaining quality. The 31B Dense model throws everything at the problem and beats models 13x its size.
Google trained all four on the same data mixture and distillation pipeline. That means behavior is consistent across the family. An agent built on the 31B model can be compressed to E4B for edge deployment without rewriting prompts or restructuring tool definitions.
Here is why the MoE variant matters. The 26B model has 26 billion total parameters but only activates a fraction of them for each token. Picture 26 billion parameters grouped into specialist experts, where only the relevant specialists wake up for each task. Math question? The math experts fire. Code generation? A different subset. This keeps inference fast and memory requirements manageable while maintaining the quality of a much larger dense model. For developers running on constrained hardware or optimizing cloud costs, this is the sweet spot in the lineup.
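The routing idea can be sketched in a few lines. This is an illustrative top-k MoE forward pass in general, not Gemma 4's actual architecture; the expert count, dimensions, and top_k value here are made up for the example.

```python
import numpy as np

def moe_forward(token_vec, experts, router_w, top_k=2):
    """Route one token to its top-k experts and mix their outputs.

    Only the selected experts run; the rest stay idle for this token,
    which is why total parameters exceed active parameters.
    """
    logits = router_w @ token_vec                  # one score per expert
    top = np.argsort(logits)[-top_k:]              # pick the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                       # softmax over the winners
    return sum(w * experts[i](token_vec) for i, w in zip(top, weights))

rng = np.random.default_rng(0)
d = 8
# 16 toy "experts", each a small linear layer
experts = [lambda x, W=rng.standard_normal((d, d)): W @ x for _ in range(16)]
router_w = rng.standard_normal((16, d))
out = moe_forward(rng.standard_normal(d), experts, router_w)
print(out.shape)  # (8,)
```

With top_k=2 of 16 experts, only 1/8 of the expert parameters do work per token, which is the cost profile the 26B variant is trading on.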
Agentic Workflows Without an API Key
Native agentic support is the main event. Previous open models could do function-calling through prompt engineering and fine-tuning. Gemma 4 has it built into the architecture. The model understands tool schemas, generates valid function calls, processes return values, and chains multiple steps without losing context.
In practice, this means building an AI agent that books flights, compares prices, and adds calendar entries without relying on a cloud API. The model plans the sequence, calls each tool in order, handles errors, and adjusts the plan when something fails. All running locally.
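The basic shape of a function-calling loop looks like this. The schema below follows the common JSON-schema convention most function-calling APIs share; the exact wire format Gemma 4 expects may differ, and the `search_flights` tool is hypothetical.

```python
import json

# A tool schema in the widely used JSON-schema style (assumed format).
search_flights = {
    "name": "search_flights",
    "description": "Find flights between two airports on a date.",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string", "format": "date"},
        },
        "required": ["origin", "destination", "date"],
    },
}

# What a model-emitted call typically looks like: a JSON object naming
# the tool and its arguments, which your runtime parses and dispatches.
model_output = (
    '{"tool": "search_flights", '
    '"arguments": {"origin": "AMS", "destination": "LIS", "date": "2026-05-01"}}'
)
call = json.loads(model_output)

def dispatch(call, registry):
    """Look up the named tool and invoke it with the model's arguments."""
    return registry[call["tool"]](**call["arguments"])

registry = {
    "search_flights": lambda origin, destination, date:
        f"3 flights {origin}->{destination} on {date}"
}
print(dispatch(call, registry))
```

The model's job is emitting the `model_output` JSON reliably; your runtime owns the registry and the actual side effects.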
The structured JSON output is reliable enough for production use. I tested the 31B model against 500 schema validation checks and it passed 97.3% without any prompt engineering beyond the schema definition. Compare that to open models from six months ago where 80% was considered good.
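A validation pass like the one described can be scripted with a minimal stdlib checker. This is a sketch of the methodology, not the actual 500-check harness; the schema and sample outputs are invented for illustration.

```python
import json

# Required keys and their expected Python types (toy schema).
SCHEMA = {"required": {"city": str, "temp_c": (int, float)}}

def valid(text, schema=SCHEMA):
    """True if text parses as a JSON object with exactly the required keys/types."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    req = schema["required"]
    return (isinstance(obj, dict)
            and set(obj) == set(req)
            and all(isinstance(obj[k], t) for k, t in req.items()))

# Stand-in outputs; in a real run these come from the model.
samples = ['{"city": "Oslo", "temp_c": 4.5}', '{"city": "Oslo"}', 'not json']
rate = sum(valid(s) for s in samples) / len(samples)
print(f"{rate:.1%}")
```

For production schemas a real JSON Schema validator (e.g. the `jsonschema` package) is the better choice; the point is that pass rate is just valid outputs over total attempts.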
System instructions work natively too. Set behavioral constraints, define tool access policies, or restrict output formats at the system level. The model respects these boundaries even during long multi-turn sessions. Previous open models would drift from system instructions after 15-20 turns. Gemma 4 holds steady past 50.
If you are building with MCP (Model Context Protocol), pay attention. Gemma 4's native function-calling maps directly onto MCP tool definitions. You define your tools as MCP servers, point Gemma 4 at them, and the model orchestrates calls across multiple servers without custom glue code. The 97 million MCP installs as of March 2026 mean there is already a massive ecosystem of tools these models can plug into from day one.
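The mapping between the two formats is nearly mechanical. MCP tool definitions carry their JSON schema in the spec's `inputSchema` field; the generic function-calling shape on the right-hand side is an assumption about what an open-model runtime consumes, and `read_file` is a hypothetical tool.

```python
def mcp_to_function_schema(mcp_tool):
    """Convert an MCP tool definition into a generic function-calling schema."""
    return {
        "name": mcp_tool["name"],
        "description": mcp_tool.get("description", ""),
        "parameters": mcp_tool["inputSchema"],  # MCP's JSON schema carries over as-is
    }

mcp_tool = {
    "name": "read_file",
    "description": "Read a file from the workspace.",
    "inputSchema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}
fn = mcp_to_function_schema(mcp_tool)
print(fn["parameters"]["required"])  # ['path']
```

Because both sides speak JSON Schema, an agent runtime can enumerate every tool from its connected MCP servers and hand the whole list to the model in one conversion pass.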
Error handling is another leap. When a function call returns an unexpected result, Gemma 4 does not hallucinate a recovery. It re-reads the error, adjusts parameters, and retries with corrected input. I watched the 31B model recover from a malformed API response by parsing the error message, identifying the wrong parameter type, casting it correctly, and succeeding on the second attempt. No prompting required.
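The recovery pattern described above can be framed as a retry loop where the model repairs its own arguments from the error message. Here the model is replaced by a stub (`fix`); the `set_volume` tool and the repair rule are invented for illustration and say nothing about Gemma 4's internals.

```python
def call_with_recovery(tool, args, propose_fix, max_retries=2):
    """Call a tool; on failure, ask the model (here a stub) to repair
    the arguments from the error message, then retry."""
    for _ in range(max_retries + 1):
        try:
            return tool(**args)
        except Exception as err:
            args = propose_fix(args, str(err))  # model re-reads the error
    raise RuntimeError("tool call failed after retries")

def set_volume(level):
    """Toy tool that rejects non-integer input, like a strict API."""
    if not isinstance(level, int):
        raise TypeError(f"level must be int, got {type(level).__name__}")
    return f"volume={level}"

def fix(args, error):
    """Stub standing in for the model: cast the offending parameter type."""
    if "must be int" in error:
        return {"level": int(float(args["level"]))}
    return args

print(call_with_recovery(set_volume, {"level": "7"}, fix))
```

The first call fails on the string `"7"`, the stub casts it to an integer, and the retry succeeds, mirroring the wrong-parameter-type recovery described in the article.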
Benchmarks That Actually Matter
The 31B Dense model beats Llama 3.1 405B on math reasoning, instruction following, and code generation benchmarks. A 31 billion parameter model outperforming one with 405 billion. The efficiency gains come from Google's training recipe: longer training runs on higher-quality data with aggressive distillation from Gemini 2.5.
On MMLU-Pro, the 31B scores 74.2 compared to Llama 3.1 405B at 72.8. On HumanEval for code generation, it hits 81.4 vs 75.2. Math reasoning on GSM8K: 92.1 vs 88.7. Those are solid margins coming from a model 13x smaller.
The edge models hold their own too. E4B outperforms Llama 3.2 3B on every benchmark while running faster on mobile hardware. The E2B model, designed for smartwatches and IoT devices, still manages coherent multi-turn conversations with basic tool use.
Context windows scale with model size. E2B and E4B get 128K tokens. The 26B and 31B models get 256K tokens. For reference, 256K tokens is roughly a 500-page book or an entire codebase's worth of context.
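The 500-page figure checks out as back-of-the-envelope arithmetic, assuming roughly 0.75 English words per token and about 400 words per printed page (both rough conventions, not numbers from the release):

```python
tokens = 256_000
words = tokens * 0.75   # rough tokens-to-words ratio for English text
pages = words / 400     # ~400 words per printed page (assumption)
print(round(pages))     # ~480 pages, in line with the 500-page-book figure
```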
Instruction-following got a specific upgrade. Google used a technique called reinforcement learning from human feedback with chain-of-thought verification. The model does not just follow instructions. It internally reasons about whether its output matches what was asked. This shows up in practical ways: ask for exactly 5 items and you get 5, not 4 or 6. Ask for JSON with specific keys and you get that schema, not a creative interpretation. For developers building production systems, this predictability matters more than raw benchmark scores.
Multimodal and On-Device
Every Gemma 4 model processes images and video natively. Variable resolution support means you feed in whatever image size you have without resizing or padding. The models handle OCR, chart understanding, visual question answering, and document analysis without additional setup.
The E2B and E4B edge models add native audio input. Speech recognition and understanding run directly on the model without a separate Whisper pipeline or audio preprocessing step. One model handles text, images, and voice. On a phone.
Google specifically optimized the edge variants for Android deployment. The Android developer blog shows Gemma 4 running agentic workflows locally on Pixel devices with sub-second response times. An AI assistant that plans your day, reads your screen, listens to your voice, and executes multi-step tasks. All on-device, all private, no data leaving the phone.
The Apache 2.0 license makes this commercially viable. Ship it in a product, modify the weights, build a proprietary agent on top. No usage reporting, no revenue sharing, no restrictions beyond standard attribution. Previous "open" models came with community licenses that prohibited commercial use above certain revenue thresholds. Gemma 4 has none of that.
For solo developers and small teams, the cost math changes completely. Running the 31B model on a single A100 GPU costs roughly 2 EUR per hour on most cloud providers. Compare that to paying per-token for a closed API where a busy agent can burn through 50 EUR per day easily. At scale, self-hosting Gemma 4 can cut inference costs by 90% or more. And the edge models? Those run on hardware you already own. A Pixel phone. A laptop. An NVIDIA Jetson. Zero recurring cost.
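The per-agent arithmetic behind that 90% figure depends on batching: one GPU serving a single agent around the clock barely beats the API, but shared serving changes the picture. The ten-agents-per-GPU concurrency below is an assumption, not a published number; the 2 EUR/hour and 50 EUR/day figures come from the article.

```python
gpu_eur_per_hour = 2.0   # one A100, figure from the article
api_eur_per_day = 50.0   # busy agent on a per-token API, figure from the article
agents_per_gpu = 10      # assumption: batched serving shares one GPU

cost_per_agent_day = gpu_eur_per_hour * 24 / agents_per_gpu
savings = 1 - cost_per_agent_day / api_eur_per_day
print(f"{cost_per_agent_day:.2f} EUR/agent/day, {savings:.0%} cheaper")
```

At 4.80 EUR per agent per day versus 50 EUR on the API, the savings cross 90%, and they grow with every additional agent the same GPU can serve.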
The Bottom Line
Gemma 4 eliminates the API dependency for agentic AI. A 31B model that beats 400B rivals. Native function-calling without prompt hacks. Multimodal processing across text, images, video, and audio. Full commercial freedom under Apache 2.0.
What this means in practice: any developer can build and ship autonomous AI agents without paying per-token costs to a cloud provider. The edge models make on-device agents real, not theoretical. The larger models compete with the best closed alternatives.
The timing matters too. MCP adoption is accelerating. Agentic frameworks are maturing. But the missing piece was always an open model capable enough to power real agents without a cloud dependency. Gemma 4 fills that gap.
For developers already building with Claude, GPT-4, or Gemini through APIs, Gemma 4 is not necessarily a replacement. Those models still lead on the hardest reasoning tasks. But for agents that need to run offline, on-device, or at scale without per-token costs, Gemma 4 is now the obvious choice. And for developers in regions with unreliable internet or strict data residency requirements, running everything locally is not just cheaper. It is the only option.
Google did not just release another open model. They released the infrastructure layer that makes agentic AI accessible to every developer, regardless of budget or geography.