Mukunda Rao Katta

Posted on May 17

gemma4-safe-agent: making 2B parameters production-usable

#devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Build With Gemma 4 Submission

Gemma 4 e2b is small enough to run on a laptop. The question I wanted to answer is whether a 2-billion-parameter model can be production-usable for actual agent work, not just demos. The honest answer: the model is fine; the reliability cliff is in everything around it.

This is my Build track entry for the Gemma 4 DEV Challenge.

What I built

gemma4-safe-agent is a tiny research agent that runs Gemma 4 e2b locally through Ollama, with five small zero-dependency libraries doing the unglamorous reliability work around the model. Ask it a question, it picks tools, hits Wikipedia and arXiv, returns a structured JSON answer.

ollama pull gemma4:e2b
npm install
npm run demo -- "What is RLHF?"

Output:

{
  "final": "RLHF is a technique that uses human preferences as a reward signal to fine-tune language models.",
  "sources": ["https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback"],
  "steps": 2
}

The point of the project is not the demo. The point is the scaffolding around the model.

The reliability cliff

When you switch a working agent from a frontier model to Gemma 4 e2b, four things break in the same week.

The model emits tool arguments that fail your tool's input schema.
The model invents URLs and tries to fetch them.
The model's final answer is JSON-shaped but not quite valid JSON.
The conversation outgrows the 2B model's context window and you lose your system prompt.

Each one feels like a model problem. None of them are. They are all integration problems that show up dramatically more often on small models, because small models are less forgiving when the surrounding code is sloppy.
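
The first failure is easy to picture with a hand-rolled check. This is a minimal sketch, not agentvet's API: `searchSchema`, `checkArgs`, and the schema format (argument name mapped to an expected `typeof` result) are hypothetical names chosen for illustration.

```javascript
// Hypothetical schema format: argument name -> expected typeof result.
const searchSchema = { query: 'string', limit: 'number' };

// Returns a list of human-readable problems; an empty list means the call is OK.
function checkArgs(schema, args) {
  if (args === null || typeof args !== 'object' || Array.isArray(args)) {
    return ['arguments must be a plain object'];
  }
  const problems = [];
  for (const [key, expected] of Object.entries(schema)) {
    if (!Object.hasOwn(args, key)) {
      problems.push(`missing "${key}"`);
    } else if (typeof args[key] !== expected) {
      problems.push(`"${key}" should be ${expected}, got ${typeof args[key]}`);
    }
  }
  for (const key of Object.keys(args)) {
    if (!Object.hasOwn(schema, key)) problems.push(`unexpected "${key}"`);
  }
  return problems;
}

// A small model might send the limit as a string and invent an extra field.
const problems = checkArgs(searchSchema, { query: 'RLHF', limit: '5', lang: 'en' });
console.log(problems);
```

Rejecting the call here, before the tool runs, turns a confusing downstream failure into a message the loop can hand straight back to the model.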

The five libraries

Each one is a small open-source library I published on npm. Four of them each handle one of the failure modes above, and the fifth adds snapshot testing. They share one idea: do the strict thing, return a clear error, never paper over a failure.

Library	Role in the loop
agentfit	Trim chat history to fit the model's context window before every turn
agentguard	Network egress firewall: only Wikipedia, arXiv, and localhost (for Ollama) allowed
agentvet	Reject tool calls with wrong argument shapes before they run
agentcast	Force the final answer into a valid JSON schema, retry on miss
agentsnap	Snapshot the tool-call trace, fail CI on regressions
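
To make the first row concrete, here is a rough sketch of history trimming that always keeps the system prompt. It is not agentfit's implementation: `trimHistory` and `estimateTokens` are hypothetical, and the 4-characters-per-token estimate is an assumption standing in for a real tokenizer.

```javascript
// Rough token estimate: about 4 characters per token. This is an assumption
// for illustration; a real tokenizer would give different counts.
const estimateTokens = (msg) => Math.ceil(msg.content.length / 4);

function trimHistory(messages, maxTokens) {
  const system = messages.filter((m) => m.role === 'system');
  const rest = messages.filter((m) => m.role !== 'system');
  // System messages are always kept, so they come out of the budget first.
  let budget = maxTokens - system.reduce((n, m) => n + estimateTokens(m), 0);
  const kept = [];
  // Walk from newest to oldest, keeping turns until the budget runs out.
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i]);
    if (cost > budget) break;
    kept.unshift(rest[i]);
    budget -= cost;
  }
  return [...system, ...kept];
}

const history = [
  { role: 'system', content: 'You are a research agent.' },
  { role: 'user', content: 'a'.repeat(400) },
  { role: 'assistant', content: 'b'.repeat(400) },
  { role: 'user', content: 'What is RLHF?' },
];
const trimmed = trimHistory(history, 120);
console.log(trimmed.map((m) => m.role));
```

Dropping the oldest turns first is only one possible policy; the property that matters for the failure described above is that the system prompt is never the thing that gets cut.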

The loop wires them together like this:

const POLICY = policy({
  network: { allow: ['en.wikipedia.org', 'arxiv.org', 'localhost'] },
  budget: { maxRequests: 30 },
  violations: 'throw',
});

await firewall(POLICY, async () => {
  // Trim history to the token budget, keeping the system prompt.
  const fitted = fit(messages, { maxTokens: 4096, preserveSystem: true });
  const raw = await ollamaChat(fitted.messages);   // gemma4:e2b
  // Check the tool call's arguments against the schema before the tool runs.
  const args = vet({ name: 'search', schema, args: raw.toolCall });
  const result = await tools[raw.toolCall.name](args);
  // ... loop ...
});

const final = await cast({
  llm: ollamaChat,
  validate: zod(AnswerSchema),
  prompt: 'Give the final answer as JSON only.',
});

That is the whole agent. About 200 lines.
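
The validate-and-retry idea behind the final `cast` call can be sketched without any library. The version below is an illustration under assumptions, not agentcast's code: `castWithRetry`, `validateAnswer`, and the stub model are hypothetical, and the validator checks only two fields by hand instead of using a schema library.

```javascript
// Hypothetical validator: returns null when the value is acceptable,
// otherwise a string describing the problem.
function validateAnswer(value) {
  if (typeof value !== 'object' || value === null) return 'answer must be an object';
  if (typeof value.final !== 'string') return '"final" must be a string';
  if (!Array.isArray(value.sources)) return '"sources" must be an array';
  return null;
}

async function castWithRetry(llm, prompt, maxAttempts = 3) {
  let lastError = 'no attempts made';
  let nextPrompt = prompt;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const text = await llm(nextPrompt);
    try {
      const value = JSON.parse(text);
      const problem = validateAnswer(value);
      if (problem === null) return { value, attempts: attempt };
      lastError = problem;
    } catch (err) {
      lastError = `invalid JSON: ${err.message}`;
    }
    // Tell the model what was wrong so the next attempt can correct it.
    nextPrompt = `${prompt}\nYour previous reply was rejected: ${lastError}`;
  }
  throw new Error(`no valid answer after ${maxAttempts} attempts: ${lastError}`);
}

// Stub model: the first reply has a trailing comma, the second is valid JSON.
const replies = [
  '{"final": "RLHF is ...", "sources": [],}',
  '{"final": "RLHF is ...", "sources": []}',
];
const stubLlm = async () => replies.shift();

castWithRetry(stubLlm, 'Give the final answer as JSON only.')
  .then((result) => console.log(result.attempts, result.value.final));
```

The retry is bounded and the last error is carried into the thrown message, so a model that never produces valid output fails loudly instead of looping.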

What changed once the scaffolding was in

Two things changed between the bare Gemma 4 + Ollama loop and the wrapped version.

The agent stopped failing in the middle of runs. Before: silent crashes on JSON parse, runaway fetches to invented URLs, context overflow erasing the system prompt. After: every failure mode raises a typed error that the loop can recover from, and the snapshot test catches behaviour drift.
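
As an illustration of what "a typed error that the loop can recover from" might look like for the invented-URL case, here is a small sketch. `EgressError`, `assertAllowed`, and `stepResult` are hypothetical names, not agentguard's API; the host list is the one from the policy shown earlier.

```javascript
// Hypothetical error type for blocked network calls.
class EgressError extends Error {
  constructor(host) {
    super(`egress to ${host} is not allowed`);
    this.name = 'EgressError';
    this.host = host;
  }
}

const ALLOWED_HOSTS = new Set(['en.wikipedia.org', 'arxiv.org', 'localhost']);

// Throws before any request is made if the host is not on the allowlist.
function assertAllowed(url) {
  const { hostname } = new URL(url);
  if (!ALLOWED_HOSTS.has(hostname)) throw new EgressError(hostname);
  return url;
}

// One loop step: a blocked URL becomes feedback for the model, not a crash.
function stepResult(url) {
  try {
    return { ok: true, url: assertAllowed(url) };
  } catch (err) {
    if (err instanceof EgressError) {
      return { ok: false, feedback: `Blocked: ${err.host}. Use an allowed source.` };
    }
    throw err; // anything else is unexpected and should still surface
  }
}

console.log(stepResult('https://en.wikipedia.org/wiki/RLHF'));
console.log(stepResult('https://rlhf-papers.example.com/intro'));
```

Because the error carries the offending host, the loop can decide per error type whether to retry, feed the message back to the model, or stop.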

The agent became debuggable. agentsnap records every tool call into a file the run can diff against. When a refactor changed which tools the model picks, the snapshot test failed on the first run and pointed at the exact tool order change.
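
A trace comparison of this kind can be very small. The sketch below is an assumption about the general technique, not agentsnap's format: the trace shape (`{ tool }` objects), the tool names, and `diffTrace` are hypothetical, and the snapshot is inlined here where a real test would load it from a file.

```javascript
// Reduce a trace to the part worth comparing: tool names in call order.
const toolOrder = (trace) => trace.map((call) => call.tool);

// Returns null when the traces match, otherwise a description of the first difference.
function diffTrace(expected, actual) {
  const a = toolOrder(expected);
  const b = toolOrder(actual);
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    if (a[i] !== b[i]) {
      return `step ${i + 1}: expected ${a[i] ?? '(none)'}, got ${b[i] ?? '(none)'}`;
    }
  }
  return null;
}

// Stored snapshot versus a run where a refactor flipped the tool order.
const snapshot = [{ tool: 'wikipedia' }, { tool: 'arxiv' }];
const currentRun = [{ tool: 'arxiv' }, { tool: 'wikipedia' }];
console.log(diffTrace(snapshot, currentRun));
```

Comparing only tool names and order keeps the snapshot stable across harmless wording changes in the model's output while still catching a change in which tools get picked.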

The model itself never changed. Only the code surrounding it.

How I used Gemma 4

Gemma 4 e2b is the agent's only LLM. Everything goes through Ollama on localhost. The 2B size is what made the scaffolding necessary in the first place: the model is genuinely good at tool selection, but the response budget is small, the JSON-mode behaviour is less polished than frontier models, and the context window is tight enough that any sloppy message history kills the run.

This is why I think the project matters. Open models on the small end of the curve are about to be a lot more available, and the reliability cliff is the bottleneck, not the raw capability. Showing that 200 lines of scaffolding take Gemma 4 e2b from "broken in week one" to "snapshot-tested in CI" is the point.

Demo

The repo includes both the real Ollama path and a stub-LLM path that runs end-to-end without a model, for CI and judging:

AGENT_MOCK=1 node examples/run-stub.js
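
A stub path like this usually comes down to swapping the chat function. The sketch below shows one way that could work and is not the repo's actual code: `makeStubChat`, `realChat`, and the reply shapes are hypothetical; only the `AGENT_MOCK` variable comes from the command above.

```javascript
// Returns a chat function that replays canned replies in order.
function makeStubChat(script) {
  let turn = 0;
  return async function stubChat(_messages) {
    if (turn >= script.length) throw new Error('stub script exhausted');
    return script[turn++];
  };
}

// Hypothetical placeholder for the real Ollama-backed chat function.
async function realChat(_messages) {
  throw new Error('real model path not wired up in this sketch');
}

// Pick the stub when AGENT_MOCK=1 so CI never needs a model download.
const chat = process.env.AGENT_MOCK === '1'
  ? makeStubChat([
      { toolCall: { name: 'search', args: { query: 'RLHF' } } },
      { content: '{"final": "...", "sources": []}' },
    ])
  : realChat;
```

Since the stub replays a fixed script, the tool-call trace it produces is deterministic, which is what makes a snapshot test on it meaningful in CI.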

CI runs on every push; the badge is green. License is MIT.

Code

github.com/MukundaKatta/gemma4-safe-agent

The five libraries are public on npm and used as plain dependencies. The agent itself is a single small loop. Nothing fancy. Everything readable.

What I would build next

A second version that targets Gemma 4 26B MoE on the cloud side, with the same scaffolding contract: same five libraries, same snapshot test, same egress allowlist. The goal would be a single agent codebase that switches model sizes with one env var, where any behaviour regression is caught by the snapshot test first.

The fun part of open models in 2026 is that this is now a real engineering question, not a research one.
