Nick Chaves

Posted on May 25

The Gemma4 Dungeon Master Companion

#devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

authors:
@nickc672
@jake_mckenna_1025

What We Built

Hey, we're Nick and Jake, and we built Dungeon Master Companion - an offline AI Dungeon Master that puts Gemma 4 through three passes per turn, two tool-calling and one narrative, so the model can be creative and correct at the same time. The whole thing runs locally through Ollama with no API calls, no cloud, no payment, and no player data ever leaving your machine.

We built a Streamlit front end on top of it so a player can actually sit down and play - type what your character wants to do, the way you'd tell a real human DM, and the system handles the narration, the world state, the NPCs, the dice, and the memory of what happened.

Here's the problem we were trying to solve. Most LLM-driven interactive fiction has the same failure mode. You ask one model pass to narrate the scene and also record what changed in the world, and the text looks fine while the underlying state desynchronizes. The player picks up the dagger, the dagger never lands in their inventory. The DM warmly remembers an NPC it has structurally never met. The fix is not a bigger model or a fancier prompt - it's structural:

Phase Loop 1: Read and Resolve. A tool-calling pass with read-only tools and the dice mechanics. The model gathers scene context, validates that the player can actually reach what they want to reach, rolls any skill checks the situation demands, and emits a turn_summary plus a narration_focus brief. Sampling is deterministic.
Narration Phase: A single creative pass that writes the DM response against the pre-write world snapshot, guided only by what Phase 1 surfaced. Sampling is high-variance. The narrator has no write access. It physically cannot change state.
Phase Loop 2: Write. Another deterministic tool-calling pass that reads the narration plus the Phase 1 trace and applies the exact world mutations the narration implies. Move the player, write entity memories, materialize newly-engaged NPCs and items, advance the world.
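
To make the split concrete, here is a minimal sketch of a turn with that shape. It is not the code in pipeline.py: the `World` fields, the `run_turn` name, and the three phase callables are all hypothetical stand-ins, and the stub phases replace real model calls. The point it illustrates is that the narrator only ever receives a copy of the world, so anything it does to that copy is discarded.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class World:
    # Hypothetical, heavily simplified world state.
    player_location: str = "tavern"
    inventory: list = field(default_factory=list)


def run_turn(world, player_input, read, narrate, write):
    """Run one turn: read and resolve, narrate, then write."""
    brief = read(world, player_input)       # Phase 1: read-only tools plus dice
    snapshot = copy.deepcopy(world)         # pre-write snapshot for the narrator
    narration = narrate(snapshot, brief)    # creative pass, sees only the copy
    write(world, brief, narration)          # Phase 2: apply implied mutations
    return narration


# Stub phases standing in for the model-driven passes (assumptions).
def fake_read(world, text):
    return {"turn_summary": "player takes the dagger",
            "narration_focus": "the dagger"}


def fake_narrate(snapshot, brief):
    snapshot.inventory.append("tampered")   # lost: this is only a copy
    return f"You pick up {brief['narration_focus']}."


def fake_write(world, brief, narration):
    world.inventory.append("dagger")


world = World()
text = run_turn(world, "I grab the dagger", fake_read, fake_narrate, fake_write)
print(text, world.inventory)
```

In this sketch the only path into the live `World` is the write callable, which is the property the three-phase design relies on.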

A reconciliation step then diffs the world before and after, derives which entities and items came into existence this turn, and commits the turn to history.

The bigger picture for us is that we're interested in what local LLMs can do for video game NPCs. Right now, every character in every shipped game says the same lines forever. We wanted to prove you can get believable, in-world AI storytelling - characters that actually remember what you did, conversations that fit the world - without retraining a model from scratch and without sending player data to the cloud. This dungeon master is our proof of concept.

Demo

Code

nickc672 / gemma4-dungeon-master

Using Gemma4 as Dungeon Master in our custom DnD game.

Dungeon Master Companion

A local-first, agentic interactive-fiction orchestrator that turns a small open-weight LLM into a competent AI Dungeon Master, plus a closed-world benchmark suite that grades both the LLM and the orchestration code on the same scenarios.

What this project is

Dungeon Master Companion is where the model runs the system. Every non-trivial decision in a turn is a tool call the model chooses to make:

which character to look up
which memory needs retrieving
whether the player's words warrant a skill check
whether something happened worth remembering
whether the scene calls for a character or item that does not exist yet

The orchestrator's job is to make sure the world the model writes to actually exists, the lookups it asks for are honest, and the mutations it commits are valid.

Each turn runs in three phases, and within each phase the model operates as an autonomous agent in…


Four files for the speed-run review:

pipeline.py - the StoryEngine and the turn orchestrator
agent_loop.py - the ReAct loop with lifecycle hooks
prompt_texts.py - the Phase 1, Narration, and Phase 2 system prompts with worked examples
reconciliation.py - the before-and-after world diff
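
As a rough picture of what a before-and-after world diff can compute, here is a small sketch. It assumes the world can be flattened into a dictionary of entity IDs to attribute dictionaries; the `diff_world` function and that layout are illustrative assumptions, not the contents of reconciliation.py.

```python
def diff_world(before: dict, after: dict) -> dict:
    """Compare two {entity_id: attributes} snapshots of the world."""
    created = sorted(after.keys() - before.keys())
    removed = sorted(before.keys() - after.keys())
    changed = {
        key: {"before": before[key], "after": after[key]}
        for key in sorted(before.keys() & after.keys())
        if before[key] != after[key]
    }
    return {"created": created, "removed": removed, "changed": changed}


# Hypothetical snapshots taken around a single turn.
before = {"player": {"location": "gate"}, "guard": {"mood": "wary"}}
after = {
    "player": {"location": "courtyard"},
    "guard": {"mood": "wary"},
    "dagger": {"holder": "player"},
}

result = diff_world(before, after)
print(result["created"])   # entities that came into existence this turn
print(result["changed"])   # entities whose attributes moved
```

The `created` list is the part that corresponds to deriving which entities and items came into existence during the turn.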

How We Used Gemma 4

The default config runs Gemma 4 31B Dense locally through Ollama. The nice part is that the same pipeline drops onto E4B for a laptop or E2B for a Raspberry Pi 5 or a high-end phone, with zero code changes. We'll get into why that matters in a bit. First, here's why Gemma 4 ended up being such a good fit for what we were building.

One model, two personalities, three phases.
The pipeline actually asks the model to do two pretty different jobs on the same turn.

Phase 1 and Phase 2 need the model to behave like a careful surgeon. Pick from a closed list of tools, fill in the right arguments, hit finalize_turn or finalize_writes exactly once, and stop. That kind of work wants low-temperature, predictable output.

Narration is the opposite. We need the model to write reactive, scene-aware prose that sounds like a real dungeon master and not a database changelog. That wants creative sampling.

So in app_config.json we run the tool-calling phases at temperature: 0.2, top_p: 0.5, top_k: 10, and the narration stage overrides to temperature: 0.75, top_p: 0.93, top_k: 50, repeat_penalty: 1.15. Same weights, same context window, two completely different operating modes within a single turn. Gemma 4 handles both cleanly, which is what lets this design work as a single-model pipeline instead of needing a routing layer on top of three different models.
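
Here is a sketch of how those two sampling profiles could be selected per phase. The numbers are the ones quoted above; the dictionary layout, the phase names, and the `options_for` helper are assumptions for illustration and may not match the real structure of app_config.json.

```python
# Values from the post; the layout and names are hypothetical.
TOOL_PHASE_OPTIONS = {"temperature": 0.2, "top_p": 0.5, "top_k": 10}
NARRATION_OVERRIDES = {
    "temperature": 0.75,
    "top_p": 0.93,
    "top_k": 50,
    "repeat_penalty": 1.15,
}


def options_for(phase: str) -> dict:
    """Return sampling options for a phase, applying narration overrides."""
    options = dict(TOOL_PHASE_OPTIONS)   # copy so the defaults stay untouched
    if phase == "narration":
        options.update(NARRATION_OVERRIDES)
    return options


for phase in ("phase_one", "narration", "phase_two"):
    print(phase, options_for(phase))
```

The resulting dictionaries use the same option names that Ollama accepts for sampling, so a helper like this could feed the options of each request while the model itself stays the same.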

Tool calling on local models fails in some pretty predictable ways. The model writes finalize_turn(...) as a markdown code block instead of actually calling the function. It tries to call writer tools during the reader phase. It stops responding without finalizing anything. It calls the same read tool five times in a row because it forgot it already had the answer. It crams two tool calls into one response when the contract is one.

So around Gemma 4's native tool calling we built a defensive ReAct loop in agent_loop.py, with hooks in phase_one.py and phase_two.py that catch those failure modes specifically:

A pre_tool_use hook enforces which tools each phase is allowed to use. Phase 1 cannot touch writer tools, and once finalize_turn has succeeded, calling it again is blocked.
A response hook requires every assistant message to start with Decision Summary: and rejects responses with more than one tool call. It also catches the markdown-as-tool-call thing and tells the model to issue a real function invocation.
A post_tool_use hook injects targeted recovery notes when a tool fails non-retryably, with an explicit DO NOT RETRY hint so the model does not loop on a dead call.
A stop hook catches early stops and hands the model the exact JSON shape it owes finalize_turn.
An early_exit check kills the loop the instant a terminal tool succeeds, so the model cannot keep going past the natural end of the phase.
Phase 1 also caches deterministic read tools like get_current_context, check_can_interact, and list_scene_entities within the turn. If the model re-asks the same question, it gets the same answer instantly, with no penalty for thinking out loud.
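
A few of the guards above can be sketched in a handful of lines. This is a simplified illustration, not the code in agent_loop.py: the function signatures, the `TurnCache` class, and the assignment of move_to_location to Phase 2 are assumptions, and the real hooks cover more cases.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built indirectly

# Hypothetical per-phase allow-lists using tool names mentioned in the post.
PHASE_TOOLS = {
    "phase_one": {"get_current_context", "check_can_interact",
                  "list_scene_entities", "finalize_turn"},
    "phase_two": {"move_to_location", "finalize_writes"},
}
TERMINAL_TOOL = {"phase_one": "finalize_turn", "phase_two": "finalize_writes"}
CACHEABLE = {"get_current_context", "check_can_interact", "list_scene_entities"}


def pre_tool_use(phase, tool, finalized):
    """Return an error message to block the call, or None to allow it."""
    if tool not in PHASE_TOOLS[phase]:
        return f"{tool} is not available in {phase}."
    if finalized and tool == TERMINAL_TOOL[phase]:
        return f"{tool} already succeeded. DO NOT RETRY."
    return None


def check_response(text, tool_calls):
    """Return a correction note for a malformed assistant message, or None."""
    if not text.startswith("Decision Summary:"):
        return "Start your message with 'Decision Summary:'."
    if len(tool_calls) > 1:
        return "Issue exactly one tool call per response."
    if not tool_calls and FENCE in text:
        return "Do not write the call as markdown. Issue a real function call."
    return None


class TurnCache:
    """Cache deterministic read tools for the duration of one turn."""

    def __init__(self):
        self._results = {}

    def call(self, tool, args, run_tool):
        if tool not in CACHEABLE:
            return run_tool(tool, args)
        key = (tool, json.dumps(args, sort_keys=True))
        if key not in self._results:
            self._results[key] = run_tool(tool, args)
        return self._results[key]


print(pre_tool_use("phase_one", "move_to_location", finalized=False))
print(check_response("Decision Summary: look around", [{"name": "get_current_context"}]))
```

Returning a message instead of raising lets the loop feed the correction back to the model as its next observation, which is one plausible way to implement the recovery notes described above.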

With all that scaffolding in place, the interesting question stops being "does Gemma 4 emit a valid function call." It becomes "does Gemma 4 orchestrate thirteen tools across a multi-iteration loop and converge on a clean finalize?" In our testing, it does.

The 128K context window is doing a lot of work, because the prompt state grows steadily as a session goes on. Every turn includes rolling history, the current and next story beat, a session summary, the current scene with actors and items and connections, per-entity scene memory surfaced by check_can_interact, recent skill check results, the turn-level TODO list, and any unresolved interaction targets the player gestured at but did not actually name.

Smaller-context local models start hard-truncating session history within the first half-hour of play, and the DM forgets what just happened. Gemma 4's 128K window means a real campaign session stays coherent without aggressive summarization eating away at the texture of the story.

The other thing Gemma 4 gives us is real choice on hardware. The same architecture turns into three different products:

31B Dense on a workstation - the full experience
E4B on a laptop - playable performance on consumer hardware, identical pipeline
E2B on a phone or a Pi 5 - the same architecture compressed onto an edge device

One pipeline, three hardware tiers, zero model-specific code. That is something an open model family with a real size ladder lets you do, and honestly, it would not be possible with a single closed API.

We decided early on that this project would only run on local Ollama models. No cloud, no API keys, no per-token billing. That constraint meant we needed something fast enough to feel like a game and accurate enough to not break the game.

To figure out what actually fit, we built a benchmark harness (benchmark/runner.py, benchmark/metrics.py) that scores each phase on two layers.

Function calls: did the model call the right tools.
State changes: did those calls actually mutate the world the way the narration claimed.

The state-change layer is the one that catches the failure mode that ruins a DM, where the model calls move_to_location with the wrong arguments and the player silently fails to move. Pure tool-name scoring would shrug at that. The harness flags it.
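
The difference between the two layers is easy to see in a toy scorer. The sketch below is an assumption about how such scoring could work, not the logic in benchmark/metrics.py; the function names, the flat world dictionary, and the example case are made up for illustration.

```python
def score_function_calls(expected_tools, actual_calls):
    """Fraction of expected tool names that the model actually called."""
    if not expected_tools:
        return 1.0
    called = {call["name"] for call in actual_calls}
    return sum(tool in called for tool in expected_tools) / len(expected_tools)


def score_state_changes(expected_state, world_after):
    """Fraction of expected world facts that hold after the turn."""
    if not expected_state:
        return 1.0
    hits = sum(world_after.get(key) == value
               for key, value in expected_state.items())
    return hits / len(expected_state)


# Hypothetical case: the right tool is called with the wrong argument,
# so the move is rejected and the player stays where they were.
calls = [{"name": "move_to_location", "args": {"location": "cellar"}}]
world_after = {"player_location": "tavern"}

call_score = score_function_calls(["move_to_location"], calls)
state_score = score_state_changes({"player_location": "courtyard"}, world_after)
print(call_score, state_score)
```

Tool-name scoring alone gives this case full marks, while the state layer scores it zero, which is the gap the second layer exists to expose.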

We ran four models through 30 cases each.

Phase by phase, Gemma 4 31B scored 0.982 on Phase 1 (read + mechanics), 0.990 on Narration, and 0.950 on Phase 2 (writer). It was the only model to break 0.9 on the writer phase, which is exactly where state corruption happens. gpt-oss 20B and Llama 3.1 8B ran fast on our hardware, but they dropped or hallucinated tool calls often enough that the game would visibly break mid-session. Devstral was decent but still well behind on Phase 2.

We also tested Qwen 3.6, which got closer to Gemma's accuracy than anything else we tried. The problem was speed - it ran so slowly on our hardware that it was not viable for actual gameplay, so it never made it onto the chart.

Gemma 4 averages about 24 seconds per case versus 10 for gpt-oss. That tradeoff was worth it for us, because the faster models would lie about what just happened in the world. In a story game, waiting a few extra seconds for a turn to resolve is dramatically better than the model silently corrupting your save.

What really sealed it for us was the actual playtesting. Sessions with Gemma 4 were just more fun. It was creative, it would invent new characters that fit the tone of the world, and then it would correctly call the tools to register those characters into the system memory, so when you talked to that character again three turns later the game actually remembered them. But it also stayed inside the bounds of the story. It would not go off the rails or break the mood.

The test set is not huge and we are not claiming any kind of universal model ranking here. What the benchmark does show is that the pipeline correctly translates Gemma 4's tool calls into world state mutations, and that with this architecture in place, the model and the system finally stopped fighting each other and started telling a story together.