Diagent: when the static auditor and the sandbox disagree, who's right?

#devchallenge #gemmachallenge #gemma #langgraph


This is a submission for the Gemma 4 Challenge: Build with Gemma 4

I drew one workflow on a notepad. A few boxes, some arrows, a loop. Then I fed the photo to my project six times — same drawing, same prompt, same model — and got six different LangGraph programs back. Four ran clean. One refused to compile. One spun until it hit a recursion limit and died. Same sketch, three different verdicts. It turns out the interesting question for an agent builder isn't "did the code compile" — it's "which run did you get?"

What I Built

Diagent turns a hand-drawn picture of an agent workflow into a verified, runnable LangGraph agent.

The pipeline:

You sketch an agent workflow on paper — boxes for nodes, arrows for edges, a diamond for a branch.
Gemma 4 E4B reads the photo (it's natively multimodal) and produces a Mermaid diagram — a clean, text-based intermediate representation you can eyeball and edit before anything else happens.
Gemma 4 E4B again, this time in reasoning mode, turns that Mermaid into LangGraph Python.
Shingan, a small static auditor I wrote for this project, reads the generated code and flags hazards.
A Docker sandbox actually runs the code and reports back what happened.

Everything runs locally. Apache 2.0. No API calls, no per-token bill.

Why two verification layers

This is the design decision the rest of the article is about, because day 4 of the build is when it backfired in the most instructive way possible.

Diagent verifies generated code twice: once statically (Shingan reads it without running it), once dynamically (the sandbox runs it).

Static analysis is excellent at the decidable things. Syntax errors, an obviously unreachable node, a router referenced but never defined: Shingan catches those instantly. But cycles are special. "Does this loop terminate?" is the halting problem wearing a small hat. It is undecidable in general, and running the code does not decide it either. A run only shows what one bounded execution did, which happens to be the question that matters for the program in hand. Shingan, reading statically, sees a cycle that has an exit branch but no recursion guard and does the only honest thing available to it: it raises a warning. The sandbox runs the code and reports what actually happened.
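To make the static side concrete, here is a minimal sketch of a cycle check in the spirit of the one described above. It is not Shingan's actual implementation: the graph is assumed to be a plain dict of node name to successor names, and `nodes_on_cycles`, `cycle_warnings`, and the `has_guard` flag are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # assumed shape: node name -> successor names


def nodes_on_cycles(graph: Graph) -> Set[str]:
    """Return every node that can reach itself by following edges."""

    def reachable_from(start: str) -> Set[str]:
        seen: Set[str] = set()
        stack = list(graph.get(start, []))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, []))
        return seen

    return {node for node in graph if node in reachable_from(node)}


def cycle_warnings(graph: Graph, has_guard: bool) -> List[str]:
    """Warn about cycle nodes that have an exit branch but no iteration guard."""
    if has_guard:
        return []
    on_cycle = nodes_on_cycles(graph)
    warnings = []
    for node in sorted(on_cycle):
        # An "exit" here is any successor that is not itself on a cycle.
        exits = [target for target in graph[node] if target not in on_cycle]
        if exits:
            warnings.append(
                f"cycle_detection: '{node}' can exit to {exits}, "
                "but nothing bounds the loop"
            )
    return warnings


# Sketch 3 from the article, written out as an edge map.
sketch3: Graph = {
    "Input": ["Classify"],
    "Classify": ["Validate"],
    "Validate": ["Classify", "Output"],
    "Output": [],
}

for warning in cycle_warnings(sketch3, has_guard=False):
    print(warning)
```

The check can say that an exit exists and that no guard is present. It cannot say whether the router will ever choose the exit, which is why the result is a warning rather than an error.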

My thesis going in was: both layers necessary, neither sufficient. The empirical evidence ended up much stronger than I expected.

Demo

Diagent demo: sketch photo → Mermaid IR → LangGraph code → Shingan cycle_detection warning → Sandbox exit code 0

The clip above is a 4× speedup of a full pipeline run on Sketch 3 — a textbook agent loop (Input → Classify → Validate, where Validate loops back to Classify on failure or proceeds to Output on success). The whole sequence from photo drop to sandbox verdict takes about two minutes in real time.

But the more important demo is what happened when I recorded that same pipeline six times in one sitting.

Same sketch, three verdicts

I set up a recording session and ran Sketch 3 six times — same drawing, same prompt, same E4B model:

4 takes: cycle_detection warning → sandbox exit code 0. Clean.
1 take: cycle_detection warning → ValueError: Graph must have an entrypoint. The generated code was malformed.
1 take: cycle_detection warning → GraphRecursionError: Recursion limit of 25 reached. It looped until LangGraph's recursion limit stopped it.

Same input, three outcomes. The Mermaid IR was byte-identical every time. The Shingan warning text was identical every time. The Python was different every time — because Gemma 4 E4B emits a different router function on each generation, and the router is the exact thing that decides whether the loop ever takes its exit branch.

I also ran, three times, a sketch my README labelled "the deterministic infinite loop — always crashes". Two runs terminated cleanly. One failed to compile. Zero hit the GraphRecursionError I had documented. The sketch I'd called reliably broken turned out not to be reliable in either direction.

This is the most honest demonstration of verify-then-execute I could have asked for, and I didn't design it — I tripped over it. The drawing is fixed. The IR is fixed. The static warning is fixed. The generated program is a fresh sample from Gemma 4 E4B every single time, and a cycle's behaviour depends entirely on a router function the model writes a little differently on each pass. The static auditor cannot read the future. The sandbox doesn't have to — it runs the program and watches.
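The router point can be illustrated without LangGraph at all. The sketch below is a stdlib-only stand-in, not generated Diagent output: `run_loop` is a hypothetical driver for the Classify → Validate loop, `StepLimitReached` plays the role of GraphRecursionError, and the two routers are invented examples of the kind of small difference that changes the outcome.

```python
from typing import Callable, Dict

State = Dict[str, int]
Router = Callable[[State], str]

STEP_LIMIT = 25  # mirrors the limit quoted in the recorded error message


class StepLimitReached(RuntimeError):
    """Hypothetical stand-in for LangGraph's GraphRecursionError."""


def run_loop(router: Router, limit: int = STEP_LIMIT) -> State:
    """Step through Classify -> Validate until the router picks Output."""
    state: State = {"attempts": 0}
    node = "Classify"
    for _ in range(limit):
        if node == "Classify":
            state["attempts"] += 1
            node = "Validate"
        elif node == "Validate":
            node = router(state)  # the only place the exit can be taken
        else:  # node == "Output"
            return state
    raise StepLimitReached(f"step limit of {limit} reached")


def router_that_exits(state: State) -> str:
    # Leaves the loop after the third classification attempt.
    return "Output" if state["attempts"] >= 3 else "Classify"


def router_that_never_exits(state: State) -> str:
    # Checks a key that no node ever sets, so the exit is never taken.
    return "Output" if state.get("valid") else "Classify"


for router in (router_that_exits, router_that_never_exits):
    try:
        print(router.__name__, "->", run_loop(router))
    except StepLimitReached as error:
        print(router.__name__, "->", error)
```

Both routers have the same graph shape and would draw the same static warning. Only executing them separates the one that exits from the one that runs into the limit.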

Code

GitHub: github.com/hatyibei/diagent (Apache 2.0)

The notable pieces:

backend/parser.py — multimodal call to Gemma 4 E4B that turns a sketch photo into Mermaid. Single-shot, no chain-of-thought needed.
backend/codegen.py — reasoning-mode call to Gemma 4 E4B that turns Mermaid into LangGraph Python. think=True. This is the step that introduces the non-determinism shown above.
backend/shingan/ — the static auditor. Walks the generated AST, flags hazards. The most useful check is cycle_detection: any non-Loop node sitting inside a cycle that has an exit branch but no max_iterations / recursion_limit guard. Exactly the line a careful reviewer would circle in a PR with "are we sure this terminates?".
backend/sandbox/ — Docker harness. Generated code is mounted in, runs with dropped capabilities, no network, read-only FS, memory + PID caps, and a hard timeout. LangGraph's own recursion_limit (25 by default) is what converts "infinite loop" into a tidy 5-second GraphRecursionError instead of a wedged container.
frontend/ — minimal SPA so you can drop a sketch, watch the four panels populate, and click into the sandbox.
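For readers who want a feel for the sandbox constraints listed above, here is a minimal sketch of how such a `docker run` invocation could be assembled from Python. It is an assumption-laden illustration, not the code in backend/sandbox/: the image name, the `agent.py` entry file, the mount point, and the specific memory, PID, and timeout values are all hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def sandbox_command(code_dir: Path, image: str = "python:3.12-slim") -> List[str]:
    """Build a docker run command with the restrictions described above."""
    return [
        "docker", "run", "--rm",
        "--network", "none",       # no network access
        "--read-only",             # read-only root filesystem
        "--cap-drop", "ALL",       # drop all Linux capabilities
        "--memory", "512m",        # memory cap (value is an assumption)
        "--pids-limit", "64",      # PID cap (value is an assumption)
        "--volume", f"{code_dir.resolve()}:/app:ro",  # mount generated code
        image,
        "python", "/app/agent.py",  # hypothetical entry file name
    ]


def run_sandboxed(code_dir: Path, timeout_s: float = 30.0) -> subprocess.CompletedProcess:
    """Run the generated code and capture exit code, stdout, and stderr.

    The timeout here stops the docker client process; a real harness would
    also need to make sure the container itself is stopped.
    """
    return subprocess.run(
        sandbox_command(code_dir),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )


print(" ".join(sandbox_command(Path("generated"))))
```

Keeping the command builder separate from the runner makes the restrictions easy to inspect and test without a Docker daemon.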

Local setup: docker compose up, then visit http://localhost:8080. README has the model-download instructions.

How I Used Gemma 4

I chose Gemma 4 E4B for both the parsing step and the codegen step, even though E4B isn't the most capable model in the family. The reasoning:

E4B is multimodal natively. The parser step takes a photo of a hand-drawn diagram and has to output structured Mermaid. With the smaller E2B, the parser missed edges drawn in faint pencil; with the larger 31B Dense I gained nothing parser-wise but blew up my VRAM budget. E4B at 4-bit fits on a single 16 GB consumer GPU with room left for the codegen step.

E4B handles think=True codegen well. For the Mermaid → LangGraph step I need the model to plan: pick a state schema, decide which edges become conditional routers, decide where the recursion_limit goes. E4B's reasoning mode produces code that runs about 80% of the time on the first try, and that failure rate is what creates the bug distribution this whole project is built around. A more reliable model would write a working router on nearly every pass, and you'd rarely see the variance that makes the sandbox layer necessary. A less capable model would fail to compile so often that you'd never get to the interesting cases.

E4B lands in the right "interesting failures" zone. This is the most important point, and it answers the judging rubric's question about intentional model choice. I want the model to write code that's usually right and occasionally non-trivially wrong, because that's the regime where the verify-then-execute story has a point to make. If the model were perfect, Shingan would be a museum piece and the sandbox a formality. If the model were broken, every generation would crash and there'd be no contrast to demonstrate. E4B sits squarely in the band where the static auditor and the sandbox disagree in interesting ways, and where Diagent's whole architecture earns its keep.

Local-only matters for this project. The build measures variance across many regenerations of the same prompt. With a hosted API I'd be metering every demo. With E4B local I can run six takes back-to-back without thinking about the bill, and I can ship the project as a thing readers actually run on their own machine.

What I'd build next

A statistical sandbox runner. Regenerate and run the code N times and show the distribution of outcomes ("4/6 clean, 1/6 crash, 1/6 malformed") instead of a single pass/fail. The distribution is the answer.
Targeted auto-fix. When a specific run crashes on a specific cycle, regenerate just that router with the failure in context, instead of re-rolling the whole graph and hoping.
Confidence-graded warnings. Let Shingan separate "I think this is fine, but verify" from "I think this is broken." The cycle warning is the first kind, and treating it like the second kind is crying wolf.
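As a rough idea of what the statistical runner could look like, here is a minimal sketch. Nothing in it exists in Diagent today: `run_once` is a hypothetical callable that performs one generate-and-run pass and returns an exit code plus stderr, and the three outcome labels and the exit code of 1 for failures are assumptions. The replayed results are the six recorded takes of Sketch 3 described earlier.

```python
from collections import Counter
from typing import Callable, Tuple

RunResult = Tuple[int, str]  # assumed shape: (exit code, stderr text)


def classify(result: RunResult) -> str:
    """Map one sandbox result to a coarse outcome label."""
    exit_code, stderr = result
    if exit_code == 0:
        return "clean"
    if "GraphRecursionError" in stderr:
        return "recursion_limit"
    return "other_error"


def outcome_distribution(run_once: Callable[[], RunResult], n: int) -> Counter:
    """Call run_once n times and tally the outcome labels."""
    return Counter(classify(run_once()) for _ in range(n))


def summarize(counts: Counter) -> str:
    """Render the tally as '4/6 clean, ...' style text."""
    total = sum(counts.values())
    return ", ".join(
        f"{count}/{total} {label}" for label, count in counts.most_common()
    )


# Replay of the six recorded takes instead of a live sandbox.
recorded = iter(
    [(0, "")] * 4
    + [
        (1, "ValueError: Graph must have an entrypoint"),
        (1, "GraphRecursionError: Recursion limit of 25 reached"),
    ]
)
counts = outcome_distribution(lambda: next(recorded), 6)
print(summarize(counts))
```

Because the variance comes from code generation rather than from execution, `run_once` would need to regenerate the program on each call for the tally to mean anything.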

The thing I keep coming back to: I drew one workflow, and Gemma 4 wrote me six different programs that each claimed to be it. The static auditor flagged all six the same way. Only the sandbox could tell me which one I had actually gotten. When the auditor and the sandbox disagree, neither is wrong — they are answering different questions. You need both, and the gap between their answers is exactly where the bugs live.