Mukunda Rao Katta

Posted on May 19

What I shipped during I/O 2026 week: Gemma 4 on Ollama with a five-piece safety stack

#googleio #gemma #ai #opensource

Drafted in anticipation of the Google I/O 2026 Writing Challenge. Will add the devchallenge and challenge-specific tags once the announcement is live on May 19.

I/O week is the one week of the year where my GitHub feed and my coffee maker are equally caffeinated. This year the announcement I cared about most was Gemma 4, the new family of open models from Google. By Friday I had gemma4:2b and gemma4:4b running on a laptop via Ollama, a small research agent loop, and a handful of tiny libraries I had been meaning to ship anyway.

Here is what I shipped, why I shipped each piece, and what I learned about running a 2B-parameter open model as the brain of a real agent.

The headline: 2B is enough if the scaffolding is right

Pointing a 2B-parameter model at "answer this question, use tools when you need to, return JSON" goes badly without scaffolding. The model wraps JSON in markdown fences. It hallucinates tool args. It drops a required field. Each of those is a separate failure mode, and each one has a clean, small fix.

The agent I built is around 200 lines of code. The five libraries it depends on are around 200 lines each. Total surface area is small enough that I can hold the whole thing in my head and stick a debugger into any of it.

Concretely, on the question "What is RLHF?", the agent:

1. Receives the question.
2. Plans (gemma4:2b returns a structured plan).
3. Picks a tool (fetch_url to read the Wikipedia page).
4. Validates the tool args before running them (tool validator rejects garbage).
5. Fetches the page through a domain allowlist (the agent cannot wander).
6. Calls the model again with the fetched text in context (with the context fitted to a budget).
7. Returns a structured JSON answer with sources.

Steps 4, 5, and 6 are the load-bearing ones. Without them the 2B model is a toy. With them it is a useful tool that runs on a laptop with no API key and no rate limit.

The five small problems and the five small fixes

Problem 1: Output drifts

Gemma 4 often returns JSON wrapped in markdown code fences tagged json, sometimes with a trailing comma, sometimes prefixed with "Sure, here you go:". I built a three-pass repair: strip fences, extract the largest balanced JSON object, remove trailing commas. Then validate against a schema. If validation still fails, hand the model a short hint and retry once.
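The three repair passes can be sketched with nothing but the standard library. This is a minimal sketch under my own assumptions, not the actual library code: the function names (`strip_fences`, `extract_largest_object`, `remove_trailing_commas`, `repair_json`) are hypothetical, and the schema validation that follows the repair is left out.

```rust
/// Pass 1: remove markdown fence markers (three backticks, optionally tagged `json`).
pub fn strip_fences(raw: &str) -> String {
    // The marker is built at runtime so this listing contains no literal fence.
    let fence = "`".repeat(3);
    let tagged = format!("{fence}json");
    raw.replace(tagged.as_str(), "").replace(fence.as_str(), "")
}

/// Pass 2: return the longest top-level `{ ... }` span with balanced braces.
/// Braces inside JSON strings are ignored.
pub fn extract_largest_object(text: &str) -> Option<&str> {
    let mut best: Option<(usize, usize)> = None;
    let (mut depth, mut start) = (0usize, 0usize);
    let (mut in_string, mut escaped) = (false, false);
    for (i, c) in text.char_indices() {
        if in_string {
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        match c {
            // Quotes in surrounding prose (depth 0) are not JSON strings.
            '"' if depth > 0 => in_string = true,
            '{' => {
                if depth == 0 {
                    start = i;
                }
                depth += 1;
            }
            '}' if depth > 0 => {
                depth -= 1;
                let end = i + 1;
                if depth == 0 && best.map_or(true, |(s, e)| end - start > e - s) {
                    best = Some((start, end));
                }
            }
            _ => {}
        }
    }
    best.map(|(s, e)| &text[s..e])
}

/// Pass 3: drop any comma whose next non-whitespace character closes an
/// object or array. Commas inside strings are left alone.
pub fn remove_trailing_commas(json: &str) -> String {
    let chars: Vec<char> = json.chars().collect();
    let mut out = String::with_capacity(json.len());
    let (mut in_string, mut escaped) = (false, false);
    for (i, &c) in chars.iter().enumerate() {
        if in_string {
            out.push(c);
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        if c == '"' {
            in_string = true;
        }
        if c == ',' {
            let next = chars[i + 1..].iter().copied().find(|ch| !ch.is_whitespace());
            if matches!(next, Some('}') | Some(']')) {
                continue; // skip the trailing comma
            }
        }
        out.push(c);
    }
    out
}

/// Run the three passes in order. Returns None when no object is found.
pub fn repair_json(raw: &str) -> Option<String> {
    let unfenced = strip_fences(raw);
    let object = extract_largest_object(&unfenced)?;
    Some(remove_trailing_commas(object))
}
```

The output of `repair_json` is only a candidate: it still has to go through a real JSON parser and the schema check before the agent trusts it.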

The hint is the trick. Small models self-correct beautifully when you tell them precisely what was wrong. They do not self-correct on "invalid JSON, please try again." Give them the field name and the constraint.
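Here is roughly what "give them the field name and the constraint" looks like as a retry-once loop. It is a sketch under assumptions: `Issue`, `hint_for`, and `call_with_one_retry` are hypothetical names, the hint wording is mine, and the model and validator are passed in as closures so the example stays independent of any particular client.

```rust
/// A validation failure that names the field and the violated constraint.
#[derive(Debug, Clone, PartialEq)]
pub struct Issue {
    pub field: String,
    pub constraint: String,
}

/// Turn issues into a short, specific hint for the model.
pub fn hint_for(issues: &[Issue]) -> String {
    let parts: Vec<String> = issues
        .iter()
        .map(|i| format!("field `{}` {}", i.field, i.constraint))
        .collect();
    format!(
        "Your last reply was rejected: {}. Reply with corrected JSON only.",
        parts.join("; ")
    )
}

/// Call the model, validate the reply, and retry exactly once with a hint.
/// `model` maps a prompt to raw output; `validate` returns an empty list on success.
pub fn call_with_one_retry<M, V>(
    prompt: &str,
    mut model: M,
    validate: V,
) -> Result<String, Vec<Issue>>
where
    M: FnMut(&str) -> String,
    V: Fn(&str) -> Vec<Issue>,
{
    let first = model(prompt);
    let issues = validate(&first);
    if issues.is_empty() {
        return Ok(first);
    }
    // Append the targeted hint to the original prompt for the single retry.
    let retry_prompt = format!("{prompt}\n\n{}", hint_for(&issues));
    let second = model(&retry_prompt);
    let issues = validate(&second);
    if issues.is_empty() {
        Ok(second)
    } else {
        Err(issues)
    }
}
```

Capping the loop at one retry is deliberate in this sketch: if a precise hint does not fix the output, a second identical hint is unlikely to.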

Problem 2: Tool args go wrong

When the 2B model picks a tool, it sometimes picks args that violate the schema you sent it. The fix is to validate every tool call before running it. If validation fails, do not run the tool. Feed the validation issues back to the model as the tool's response. The model gets exactly the structural complaint it needs to correct on the next turn.

This catches three classes of bugs: wrong types (string where number was wanted), missing required fields, and extra fields the schema does not permit. All of them happen with smaller models. None of them happen if you validate first.
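A sketch of that check, covering the three classes named above. Everything here is illustrative: `ArgValue`, `ArgType`, `FieldSpec`, and `ToolSchema` are hypothetical stand-ins for a real JSON value type and a real schema, and the message format is my own.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a JSON value in tool arguments.
#[derive(Debug, Clone, PartialEq)]
pub enum ArgValue {
    Str(String),
    Num(f64),
    Bool(bool),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ArgType {
    Str,
    Num,
    Bool,
}

pub struct FieldSpec {
    pub ty: ArgType,
    pub required: bool,
}

/// A flat schema: field name to expected type and required flag.
pub struct ToolSchema {
    pub fields: BTreeMap<&'static str, FieldSpec>,
}

fn type_of(v: &ArgValue) -> ArgType {
    match v {
        ArgValue::Str(_) => ArgType::Str,
        ArgValue::Num(_) => ArgType::Num,
        ArgValue::Bool(_) => ArgType::Bool,
    }
}

impl ToolSchema {
    /// Returns one message per problem; an empty list means the call may run.
    pub fn validate(&self, args: &BTreeMap<String, ArgValue>) -> Vec<String> {
        let mut issues = Vec::new();
        for (name, spec) in &self.fields {
            match args.get(*name) {
                // Missing required field.
                None if spec.required => issues.push(format!("missing required field `{name}`")),
                None => {}
                // Wrong type.
                Some(v) if type_of(v) != spec.ty => issues.push(format!(
                    "field `{name}` must be {:?}, got {:?}",
                    spec.ty,
                    type_of(v)
                )),
                Some(_) => {}
            }
        }
        // Extra fields the schema does not permit.
        for name in args.keys() {
            if !self.fields.contains_key(name.as_str()) {
                issues.push(format!("unexpected field `{name}`"));
            }
        }
        issues
    }
}
```

When the list is non-empty, the agent skips the tool and sends the joined messages back as the tool's response, so the model sees the exact structural complaint on its next turn.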

Problem 3: The agent can wander to the wrong domains

Once the model can pick URLs to fetch, you have handed it URL-picking power. That is not always what you want. A naive prompt-injection attack can convince a small model to fetch from an attacker-controlled domain.

The fix is the smallest piece in the stack: a declarative domain allowlist. Set it to the three or four hosts the agent legitimately needs. Block everything else with an actionable error. The model never gets to wander.
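A minimal sketch of such an allowlist, assuming exact host matching over http(s) URLs. The `Allowlist` type and `host_of` helper are hypothetical, and the hand-rolled host extraction is a simplification (it does not handle IPv6 literals, for example); a production version should lean on a proper URL parser.

```rust
use std::collections::BTreeSet;

/// Declarative allowlist of exact hostnames (no wildcards in this sketch).
pub struct Allowlist {
    hosts: BTreeSet<String>,
}

impl Allowlist {
    pub fn new(hosts: &[&str]) -> Self {
        Self {
            hosts: hosts.iter().map(|h| h.to_ascii_lowercase()).collect(),
        }
    }

    /// Ok(()) if the URL's host is allowed; otherwise an error message that
    /// tells the model which hosts it may use.
    pub fn check(&self, url: &str) -> Result<(), String> {
        let host = host_of(url).ok_or_else(|| format!("`{url}` is not an http(s) URL"))?;
        if self.hosts.contains(&host) {
            Ok(())
        } else {
            let allowed: Vec<&str> = self.hosts.iter().map(String::as_str).collect();
            Err(format!(
                "host `{host}` is not on the allowlist; allowed hosts: {}",
                allowed.join(", ")
            ))
        }
    }
}

/// Extract the lowercase host from an http(s) URL. Fails closed on anything else.
fn host_of(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first path, query, or fragment delimiter.
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    // Drop any userinfo: the real host comes after the last '@'.
    let host_port = authority.rsplit('@').next()?;
    // Drop the port, if any.
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}
```

Taking the host after the last `@` matters: a URL like `https://en.wikipedia.org@evil.example/` looks allowed at a glance but actually points at `evil.example`.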

Problem 4: Context budget gets tight

Gemma 4 advertises a 128k-token context window, but the window that stays usable at practical throughput is much smaller. Bounded chat histories matter. The fix is anchored truncation: always preserve the system message at the top and the trailing user turn at the bottom. When the total goes over budget, drop turns from between those two anchors.

DropOldest is the right default. DropMiddle is a reasonable alternative if you want to keep both early grounding context and recent turns. Both keep the load-bearing pieces of the prompt.
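Anchored truncation with both strategies fits in a few lines. This is a sketch under assumptions: `Message`, `Strategy`, and `fit` are hypothetical names, the first message is assumed to be the system message, and tokens are estimated by counting whitespace-separated words, which a real implementation would replace with the model's tokenizer.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

#[derive(Debug, Clone, Copy)]
pub enum Strategy {
    /// Remove the oldest turn after the system message first.
    DropOldest,
    /// Remove the turn nearest the middle, keeping early and recent turns.
    DropMiddle,
}

/// Rough token estimate (assumption): one token per whitespace-separated word.
fn estimate_tokens(m: &Message) -> usize {
    m.content.split_whitespace().count()
}

/// Shrink `messages` until the estimate fits `budget`. Index 0 (system message)
/// and the last index (trailing user turn) are anchors and are never removed,
/// so the result can still exceed the budget if the anchors alone are too big.
pub fn fit(mut messages: Vec<Message>, budget: usize, strategy: Strategy) -> Vec<Message> {
    let total = |ms: &[Message]| ms.iter().map(estimate_tokens).sum::<usize>();
    while total(&messages) > budget && messages.len() > 2 {
        let victim = match strategy {
            Strategy::DropOldest => 1,
            Strategy::DropMiddle => messages.len() / 2,
        };
        messages.remove(victim);
    }
    messages
}
```

With at least three messages, `messages.len() / 2` always lands strictly between the two anchors, so neither strategy can touch the system message or the final user turn.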

Problem 5: You will regress, and not notice

You tweak a system prompt. The agent picks a different tool order. Sometimes that is fine. Sometimes it is a regression that breaks the deployed app. By Friday afternoon.

The fix is a snapshot test. Record one agent run end-to-end as a JSON trace. First test run writes the snapshot. Every subsequent run compares against the snapshot and fails with a unified diff if anything diverges. Refresh the snapshot when the change is intentional. Five lines per test.
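The write-then-compare behavior can be sketched with the standard library alone. The names `check_snapshot` and `line_diff` are hypothetical, the `refresh` flag stands in for whatever mechanism you use to accept an intentional change, and the diff here is a simple position-by-position line comparison rather than a true unified diff.

```rust
use std::fs;
use std::path::Path;

/// Position-by-position line diff: `-N:` is the snapshot, `+N:` is the new run.
/// (A simplification; a real tool would emit a unified diff.)
pub fn line_diff(expected: &str, actual: &str) -> String {
    let exp: Vec<&str> = expected.lines().collect();
    let act: Vec<&str> = actual.lines().collect();
    let mut out = String::new();
    for i in 0..exp.len().max(act.len()) {
        match (exp.get(i), act.get(i)) {
            (Some(e), Some(a)) if e == a => {}
            (e, a) => {
                if let Some(e) = e {
                    out.push_str(&format!("-{}: {}\n", i + 1, e));
                }
                if let Some(a) = a {
                    out.push_str(&format!("+{}: {}\n", i + 1, a));
                }
            }
        }
    }
    out
}

/// First run (or `refresh == true`): write the trace as the snapshot.
/// Later runs: compare against the snapshot and return the diff on mismatch.
pub fn check_snapshot(path: &Path, actual: &str, refresh: bool) -> Result<(), String> {
    if refresh || !path.exists() {
        fs::write(path, actual).map_err(|e| e.to_string())?;
        return Ok(());
    }
    let expected = fs::read_to_string(path).map_err(|e| e.to_string())?;
    if expected == actual {
        Ok(())
    } else {
        Err(line_diff(&expected, actual))
    }
}
```

In a test, you would serialize the agent's trace to a string, call `check_snapshot`, and panic with the returned diff on `Err`. For this to be stable, the trace has to be deterministic, so timestamps and other run-specific values need to be stripped or fixed before comparing.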

Why "small open model + scaffolding" matters

The pitch for big closed models is that they hide all of these problems for you. The pitch for small open models is everything else: latency, cost, privacy, offline-ness, the ability to fine-tune. The five problems above are the price of admission for the latter.

The good news is that each problem is small and each fix is small. The combined scaffolding is around 1000 lines of code, MIT-licensed, distributed as separate libraries you can adopt one at a time. You can swap any one of them for your own implementation without the rest noticing.

The whole loop in 20 lines

let messages = build_messages(question);
let fitted = Fitter::new(8_000).fit(messages, Strategy::DropOldest);

let raw = call_gemma4_via_ollama(&fitted, &tap).await?;
let action = action_caster.parse(&raw)?;

if action.kind == "tool" {
    let v = tool_validator(&action.tool)?;
    v.validate(&action.args)
        .map_err(|e| anyhow::anyhow!(e.for_llm()))?;

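    // Egress allowlist before any fetch; a non-string url is an error, not a panic
    if let Some(url) = action.args.get("url") {
        let url = url.as_str().ok_or_else(|| anyhow::anyhow!("`url` must be a string"))?;
        allow.check(url)?;
    }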
    run_tool(&action).await
} else {
    Ok(action.text)
}

That is the whole thing. The 2B model on the other end. Five small libraries doing the boring work. The result is reliable enough for me to dogfood on local tasks without worrying about the model going off the rails.

What I'm taking away from I/O 2026

Two things, mostly.

Open is having a moment, but only with scaffolding. Gemma 4 (2B) running locally is a real productivity tool once the safety net is in place. Without the safety net it is a demo that breaks the first time a user asks something weird. The community has been quietly building the safety net for a year; pick it up off the shelf.

Local-first lowers the threshold for trying things. I built and tested the loop above without a single API call to a paid endpoint. The whole iteration cycle was free. The thing that would have been three weeks of work on a paid model was three nights of work on Ollama.

If you build something on Gemma 4 this week, the meta-lesson is: do not be afraid to scaffold around it. The model is the easy part. The scaffolding is the part that ships.

Happy I/O week.
