Gabriel Anhaia

Gemma 4 Was Built for Agents. Here's What That Means in Code.


April 2, 2026. Google DeepMind ships Gemma 4: four open-weight variants under Apache 2.0 (a notable shift from the custom Gemma Terms of Use that governed Gemma 1 through 3), with the headline phrase "built for agents" stamped on every announcement page. The Google blog post calls them "byte for byte, the most capable open models." The InfoQ writeup tracks the agentic angle. Both pieces are accurate; both are vague.

You've seen this rhetoric before. Every model release for the past two years has claimed to be "built for agents" or "agent-native" or some variation. Most of the time it means the team added function-calling to the post-training mix and called it a day. Gemma 4 is a more interesting case, and the difference shows up in code rather than slide decks.

The four things "built for agents" means here

Longer effective context. The 31B dense model and the 26B MoE both hold their reasoning quality further into a 128K window than Gemma 3 did. The Hugging Face blog reports near-flat needle-in-haystack performance through 96K tokens, where Gemma 3 reportedly degraded earlier in the window. For agent loops where every step appends tool results to the conversation, that's the difference between a 30-step loop staying coherent and a 30-step loop forgetting what it was doing.

Tighter tool-call output formatting. Gemma 4 was trained with six dedicated special tokens (<|tool>, <|tool_call>, <|tool_result>, and their closing pairs) baked into all instruction-tuned variants. This isn't cosmetic. Older models guessed at JSON via prompting; Gemma 4 was trained to emit a tool call as a distinct token sequence. Malformed JSON drops sharply.

Better function-call schema adherence. The model takes tool definitions in JSON Schema, including optional parameters, enum types, and nested objects, and (per the function calling docs) reliably picks the right tool, knows when not to call one, and respects enum constraints in arguments. Previous open models frequently hallucinated enum values on hard prompts; Gemma 4 reduces that gap meaningfully on the eval suites Google reports.

Cheaper agentic loops thanks to small variants. The 2B and 4B "edge" variants run on a laptop with a discrete GPU, or even on a phone. For agent loops that fire 50+ tool calls per task, routing the easy decisions ("did the user mean A or B?") to a 4B variant and the hard reasoning to the 31B saves an order of magnitude on cost without trashing quality. The Google Developers blog frames this as on-device agents; the practical use is hybrid routing.

That's the four-point list. None of them is "Gemma 4 is smarter." All of them are about what happens when you wire the model into a loop with tools.
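To put the context point in numbers, here's a back-of-envelope sketch. The per-step token counts are assumptions for illustration, not figures from the release notes:

```python
def steps_until_full(window: int = 128_000,
                     preamble: int = 1_000,
                     tokens_per_step: int = 1_800) -> int:
    """Estimate how many append-only agent steps fit in a context window.

    preamble covers the system prompt and user question; tokens_per_step
    is a guessed average for one assistant turn plus one tool result.
    Measure your own traces before trusting these numbers.
    """
    return (window - preamble) // tokens_per_step

print(steps_until_full())         # ~70 steps at these assumed sizes
print(steps_until_full(32_000))   # a 32K window fills after ~17
```

The exact figure doesn't matter; what matters is that holding quality deep into the window is what lets step 30 still see the tool results from step 3.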

A 100-line agent that exercises all four

The fastest way to see what changed is to run a real agent loop. This one talks to Gemma 4 through Ollama, has one tool (a fake stock-price lookup), and is capped at six steps so it can't run forever. It's intentionally small. Small enough that you can read every line and see where the model's behavior matters.

import json
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "gemma4:31b-instruct"

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current price of a stock by ticker.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticker": {
                        "type": "string",
                        "description": "Stock ticker, e.g. AAPL",
                    },
                    "currency": {
                        "type": "string",
                        "enum": ["USD", "EUR", "GBP"],
                    },
                },
                "required": ["ticker"],
            },
        },
    }
]

The tool definition is plain JSON Schema with one required parameter and one enum. The enum is the part where older open models tend to wander; Gemma 4 should respect it.

def get_stock_price(ticker: str, currency: str = "USD") -> dict:
    fake = {"AAPL": 220.45, "GOOGL": 178.10, "MSFT": 415.20}
    price = fake.get(ticker.upper(), 0.0)
    return {"ticker": ticker, "price": price, "currency": currency}


def call_model(messages: list[dict]) -> dict:
    payload = {
        "model": MODEL,
        "messages": messages,
        "tools": TOOLS,
        "stream": False,
        "options": {"temperature": 0.0},
    }
    r = requests.post(OLLAMA_URL, json=payload, timeout=120)
    r.raise_for_status()
    return r.json()["message"]

The loop is the part that earns its keep. Three things to watch: the model should emit tool_calls rather than free text when a tool is needed, the arguments should parse as valid JSON, and the loop should terminate when the model stops calling tools.

def run_agent(user_question: str, max_steps: int = 6) -> str:
    messages = [
        {
            "role": "system",
            "content": (
                "You are a finance assistant. "
                "Use tools when you need real prices. "
                "Answer concisely once you have data."
            ),
        },
        {"role": "user", "content": user_question},
    ]

At step zero, messages is just the system prompt and the user question. Every subsequent step appends one assistant message and any tool results, so the conversation grows linearly with the number of tool calls. This is where the 128K window starts to pay off on longer loops.

    for step in range(max_steps):
        msg = call_model(messages)
        messages.append(msg)

        tool_calls = msg.get("tool_calls") or []
        if not tool_calls:
            return msg.get("content", "")

        for call in tool_calls:
            name = call["function"]["name"]
            raw = call["function"]["arguments"]
            args = raw if isinstance(raw, dict) else json.loads(raw)

            if name == "get_stock_price":
                result = get_stock_price(**args)
            else:
                result = {"error": f"unknown tool {name}"}

            messages.append({
                "role": "tool",
                "name": name,
                "content": json.dumps(result),
            })

    return "Hit max steps without final answer."


if __name__ == "__main__":
    answer = run_agent(
        "What's the current AAPL price in USD, and how does "
        "MSFT compare?"
    )
    print(answer)

Pull the model with ollama pull gemma4:31b-instruct, run the script, and read the trace. You'll see the model emit two tool calls (one per ticker), receive the JSON results, and produce a one-sentence answer. The total loop is three model rounds.
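One thing the script doesn't handle: on much longer loops, the append-only messages list eventually threatens even a 128K window. A crude trim keeps the system prompt plus the most recent turns. Character counts stand in for tokens here; a real version would use a tokenizer and keep tool results paired with the assistant turns that requested them:

```python
def trim_messages(messages: list[dict], max_chars: int = 200_000) -> list[dict]:
    """Keep the system prompt plus as many recent messages as fit."""
    system, rest = messages[:1], messages[1:]
    kept: list[dict] = []
    total = 0
    # Walk backwards so the most recent turns survive.
    for msg in reversed(rest):
        size = len(msg.get("content") or "")
        if total + size > max_chars:
            break
        kept.append(msg)
        total += size
    return system + list(reversed(kept))
```

Call it on messages at the top of each loop iteration and the loop degrades gracefully instead of erroring out at the context limit.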

What the same script does on an older model

Swap the model line to gemma3:27b-instruct and run again. You'll see one or more of these failure modes, depending on how the prompt phrases the request:

  • The model returns the tool call as a fenced code block in the assistant content rather than as a structured tool_calls field. Your parser breaks.
  • The arguments JSON has trailing commas or single quotes. json.loads raises.
  • The model hallucinates a currency value of "USD/EUR" because it tried to compare them in the same call.
  • The model calls get_stock_price for "AAPL", gets the result, and then calls it again for "AAPL" instead of moving to MSFT, because the loop reasoning lost track of what was already answered.

Gemma 4's contribution is that all four failure modes are addressed together rather than in sequence. The trained tool-call tokens parse cleanly, the longer context keeps the loop's state intact across more steps, and the schema-adherence training kills most of the enum hallucinations. The combination is what people mean when they say "built for agents."

Hybrid routing with the small variants

The 4B variant is where this gets practical for cost. Most agent loops have two kinds of decisions: routing decisions ("is this a price question or a news question?") and content decisions ("write the analysis"). The first kind doesn't need a 31B model.

ROUTER_MODEL = "gemma4:4b-instruct"
WORKER_MODEL = "gemma4:31b-instruct"

def classify(question: str) -> str:
    r = requests.post(OLLAMA_URL, json={
        "model": ROUTER_MODEL,
        "messages": [
            {"role": "system", "content": (
                "Classify the user's question as one of: "
                "price, news, analysis. Answer with only the label."
            )},
            {"role": "user", "content": question},
        ],
        "stream": False,
        "options": {"temperature": 0.0},
    }, timeout=30)
    r.raise_for_status()
    return r.json()["message"]["content"].strip().lower()

A 4B running locally answers a classification call in under 200ms with negligible compute. The 31B handles the heavy step. For an agent that fires 200 tool calls a day, the math gets attractive fast. And because both models are Apache 2.0, you can run them on the same host without licensing gymnastics.
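Wiring the router into the loop is one small function. This sketch injects the model calls as plain callables so the routing logic stays testable without a GPU; the label set matches the classifier prompt above:

```python
from typing import Callable

def route(question: str,
          classify: Callable[[str], str],
          small_agent: Callable[[str], str],
          large_agent: Callable[[str], str]) -> str:
    """Send easy lookups to the 4B loop, heavy analysis to the 31B.

    Treating price/news as "easy" is an assumption -- measure your
    own task mix before committing to a split.
    """
    label = classify(question)
    if label in ("price", "news"):
        return small_agent(question)
    return large_agent(question)  # "analysis", or anything unexpected
```

In production the three callables wrap the Ollama calls above; defaulting unexpected labels to the large model is the safe failure direction.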

What this doesn't change

Gemma 4 didn't make agent design free. You still have to think about loop termination conditions, tool error handling, retry budgets, and what happens when the model decides to call your tool with empty arguments because it gave up. The model is more reliable; the loop discipline still matters.
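A minimal version of that discipline: give each tool call a retry budget and reject empty arguments, returning errors as data the model can react to. A sketch; production code would also log and cap total loop wall-clock time:

```python
def execute_with_budget(call, args: dict, retries: int = 2) -> dict:
    """Run one tool call with a retry budget and an empty-args guard.

    `call` is the Python function backing the tool. Exceptions go
    back to the model as data instead of crashing the loop.
    """
    if not args:
        return {"error": "tool called with empty arguments"}
    last = {}
    for attempt in range(retries + 1):
        try:
            return call(**args)
        except Exception as e:  # report, don't crash
            last = {"error": f"{type(e).__name__}: {e}", "attempt": attempt}
    return last
```

Replace the bare `get_stock_price(**args)` call in the loop with this wrapper and a flaky tool stops taking the whole run down with it.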

You also still have to evaluate. The fact that schema adherence improved on Google's reported benchmarks doesn't tell you whether your specific tools, your specific prompts, your specific user questions exercise the model in the regime where the improvement holds. Build a harness with 30 representative tasks and run it against both Gemma 3 and Gemma 4 before you migrate. The deltas are real. They're also smaller than the marketing implies, and they vary wildly by task type.
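The harness doesn't need to be fancy. A skeleton like this gets you per-model pass rates, where run_task wraps the agent loop with the model name swapped in and checks the final answer against an expectation (the names here are illustrative, not from any library):

```python
def run_harness(tasks: list, models: list[str], run_task) -> dict[str, float]:
    """Pass rate per model; run_task(model, task) returns True on success."""
    results = {}
    for model in models:
        passed = sum(bool(run_task(model, t)) for t in tasks)
        results[model] = passed / len(tasks)
    return results
```

Run it against both model tags, diff the rates per task type, and you have the migration evidence the blog posts can't give you.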

The case for upgrading is concrete. The case for "this changes everything" is overblown. Hold both ideas at once and you'll make a decent migration plan.

If this was useful

Designing the loop around the model is the part nobody writes a blog post about: when to retry, when to bail, how to bound state across steps so the agent doesn't hallucinate progress. The AI Agents Pocket Guide walks through the patterns that survive a model swap, with the failure modes that catch teams shipping their first agent to production.

