austin amento
Prompt Architecture for a Reliable AI Dungeon Master

How I structured prompts to make an LLM behave like a consistent, rule-following D&D 5e GM and actually stay that way across long sessions.


Building an AI Dungeon Master sounds straightforward until you actually try it.

Left to its own devices, an LLM will happily let a level-1 fighter dual-wield longswords, cast spells they never learned, and narrate a killing blow before the player has rolled a single die. Fun the first time. Maddening as a recurring loop.

The core challenge isn't creativity; LLMs are genuinely great at that. The challenge is consistency. D&D 5e has hundreds of interlocking rules, and a solo player with no human GM to correct mistakes will notice fast when the AI forgets that Warlocks recharge spell slots on a short rest, or that a Bard can't swap their Known Spells mid-dungeon. Get caught bending the rules once, and the player trusts nothing.

Here's how I tackled it in SoloQuest.


Layer 1: The System Prompt as a Rules Contract

The foundation is a dense, versioned system instruction (currently on v9) that functions less like a creative brief and more like a contract the model has to sign before every session.

It covers things like:

Inventory and resource tracking. The AI is told explicitly what it can and cannot offer based on the current character state. Confiscated items are off-limits. When a consumable gets used, the model emits a machine-readable tag (ITEM_USED: Item Name) that the game parser watches for. Same with ability charges. This isn't vague guidance; it's a protocol.
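The parser side of that protocol can be a simple regex scan. This is a minimal sketch, not the actual SoloQuest code; the tag name follows the article, but the regex and function name are my assumptions.

```python
import re

# Hypothetical parser for the ITEM_USED protocol tag; the tag format
# follows the article, everything else is an illustrative assumption.
ITEM_USED_RE = re.compile(r"ITEM_USED:\s*(.+)")

def parse_item_usage(ai_response: str) -> list[str]:
    """Extract consumed-item names from the AI's machine-readable tags."""
    return [m.group(1).strip() for m in ITEM_USED_RE.finditer(ai_response)]

used = parse_item_usage("You drink deeply. ITEM_USED: Potion of Healing")
# used == ["Potion of Healing"]
```

Because the tag grammar is fixed in the prompt, the game engine never has to guess at prose like "you quaff your last potion."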

Spell management by caster type. Rather than one generic spell rule, the prompt distinguishes three different systems: prepared casters (Cleric, Druid, Paladin), known casters (Bard, Sorcerer, Warlock), and spellbook casters (Wizard). Each one has different rules for what they can cast and when their slots come back. Warlocks even get a dedicated callout, because "Warlock slots recharge on short rest, not long rest" is one of the most common AI failure modes I ran into early on.

Rolls are mandatory before outcomes. This one is bolded in the prompt. The model gets shown the wrong pattern ("Your sword strikes true, dealing 8 damage!") and the correct one: request the roll, wait, then narrate after. Without this, the AI just collapses into a choose-your-own-adventure book where the dice are decoration.

Encounter balance guardrails. At levels 1-3, solo encounters have hard HP caps baked into the prompt (level 1 enemies: max 7 HP, no multiattack). A single bad roll at low level with no party to pick you up is a run-ender, so I encoded the balance logic directly rather than hoping the model had good instincts about what's "survivable."
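Encoding the caps as data rather than prose makes them enforceable in code too. A minimal sketch, assuming a cap table keyed by level; the level-1 numbers come from the article, the level 2-3 values are illustrative assumptions.

```python
# Hypothetical cap table; only the level-1 row (max 7 HP, no multiattack)
# is from the article, levels 2-3 are assumed for illustration.
ENCOUNTER_CAPS = {
    1: {"max_enemy_hp": 7, "multiattack_allowed": False},
    2: {"max_enemy_hp": 12, "multiattack_allowed": False},
    3: {"max_enemy_hp": 18, "multiattack_allowed": False},
}

def clamp_enemy(level: int, hp: int, multiattack: bool) -> tuple[int, bool]:
    """Clamp a generated enemy to the solo-survivability caps for low levels."""
    caps = ENCOUNTER_CAPS.get(level)
    if caps is None:  # levels 4+ are uncapped in this sketch
        return hp, multiattack
    return min(hp, caps["max_enemy_hp"]), multiattack and caps["multiattack_allowed"]
```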


Layer 2: Injecting the Rules Right When They're Needed

The system prompt handles general behavior, but specific mechanics need to be fresh in context exactly when they matter.

Every turn, I run the player's input through a lightweight keyword extractor that picks up signals: Are there active enemies? Is the player low on HP? What class are they? What verb did they use: "attack," "sneak," "cast"?

From those signals, a scoring function pulls the top 3 most relevant rules from a structured SRD database and drops them directly into the user prompt:

RELEVANT D&D 5E RULES (Follow these strictly):

  • Sneak Attack: Requires a finesse or ranged weapon, plus advantage or an ally adjacent to the target...
  • Two-Weapon Fighting: The off-hand attack uses a bonus action and does not add your modifier to damage...

This means the model doesn't have to recall Sneak Attack from training data. The rule is just sitting right there in the prompt when the Rogue tries to use it. Scoring is weighted (exact tag match scores highest, loose content match scores lowest), and irrelevant rules get filtered out so they don't clutter the context window.
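The weighted scoring can be sketched in a few lines. This is a toy version under stated assumptions: the rule objects, tag sets, and weight values are illustrative, not the real SRD database or scoring constants.

```python
# Toy version of the weighted rule retrieval; rule shapes, tags, and
# weights are assumptions, not the actual SoloQuest implementation.
RULES = [
    {"name": "Sneak Attack", "tags": {"sneak", "rogue"},
     "text": "Requires a finesse or ranged weapon, plus advantage..."},
    {"name": "Two-Weapon Fighting", "tags": {"attack", "dual-wield"},
     "text": "The off-hand attack uses a bonus action and does not add your modifier..."},
    {"name": "Short Rest", "tags": {"rest", "warlock"},
     "text": "Warlock spell slots recharge on a short rest..."},
]

def score_rule(rule: dict, signals: set[str]) -> int:
    score = 10 * len(rule["tags"] & signals)                       # exact tag match: highest weight
    score += sum(1 for s in signals if s in rule["text"].lower())  # loose content match: lowest
    return score

def top_rules(signals: set[str], k: int = 3) -> list[str]:
    scored = [(score_rule(r, signals), r["name"]) for r in RULES]
    # Zero-scoring rules are filtered out rather than padded to k,
    # so irrelevant mechanics never clutter the context window.
    return [name for score, name in sorted(scored, reverse=True) if score > 0][:k]
```

A Rogue input that triggers the "sneak" and "attack" signals pulls in Sneak Attack and Two-Weapon Fighting while the Warlock rest rule stays out of the prompt entirely.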


Layer 3: Game State as the Source of Truth

The most brittle moment in a long session is continuity. After 30 turns, does the model remember the player is concentrating on Hold Person? That the goblin lieutenant fled rather than died? That the fighter is at 3 of 12 HP?

To handle this, I serialize the entire engine state into every prompt turn. The AI always receives:

  • Combat state: initiative order, whose turn it is, each enemy's distance and line of sight, cover values
  • Exploration state: travel pace, light level, time elapsed, last passive perception result
  • Death save tracker: active or not, current success/failure counts
  • Active effects: conditions with round durations, poisons with their DCs, disease stage progression

The prompt declares these trackers as authoritative. The AI cannot narrate hitting an enemy behind full cover, can't apply damage to a character with immunity, can't start concentrating on a second spell without dropping the first. The state in context is the source of truth, not whatever the model thinks it remembers.
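Serialization itself is mechanical once the state lives in typed structures. A minimal sketch, assuming dataclasses modeled on the trackers listed above; the field names and the prompt header are my assumptions.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical shape of the serialized state; field names are assumptions
# modeled on the trackers described in the article.
@dataclass
class CombatState:
    initiative_order: list[str]
    current_turn: str
    enemy_cover: dict[str, str]  # enemy name -> "none" | "half" | "full"

@dataclass
class GameState:
    combat: CombatState
    death_saves: dict = field(default_factory=lambda: {"active": False, "successes": 0, "failures": 0})
    active_effects: list[dict] = field(default_factory=list)

def state_block(state: GameState) -> str:
    """Render the authoritative state as a prompt section the model must obey."""
    return "AUTHORITATIVE GAME STATE (do not contradict):\n" + json.dumps(asdict(state), indent=2)
```

Rebuilding this block every turn is what makes turn 30 as reliable as turn 3: the model reads the trackers instead of trusting its own summary of the conversation.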

One more detail worth calling out: the machine-readable combat trace from the previous turn also gets passed in. This gives the model a clear record of what mechanics were actually applied last round, which prevents a frustrating drift pattern where the model re-narrates a hit that was actually a miss.


Layer 4: Enforced Response Structure

Parsing AI output reliably means removing all ambiguity from the format. Every turn, the model is required to return four sections:

  • [NARRATIVE] — the story beat, written however it likes
  • [MECHANICS] — machine-readable tags only (HP_CHANGE:-8, ENEMY_HP: Goblin Scout, 4, ROLL:1d20+DEX for Stealth)
  • [SUGGESTIONS] — player options, each tagged as requiring a roll or not
  • [CHRONICLE] — a one-line campaign log entry, only on significant beats

The parser maps the mechanics tags directly to state transitions in the game engine. HP_CHANGE, ENEMY_SPAWN, ITEM_GAINED, SPELL_SLOT_USED, CONCENTRATION all feed deterministic updates. The AI's prose can be whatever it wants, but the mechanics section has to be machine-readable.
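A stripped-down dispatcher gives the flavor. The tag grammar follows the examples above; the handler bodies and state dict shape are assumptions for illustration.

```python
import re

# Sketch of mapping [MECHANICS] tags to deterministic state updates.
# Tag names come from the article; the state shape is an assumption.
TAG_RE = re.compile(r"^(\w+):\s*(.+)$")

def apply_mechanics(section: str, state: dict) -> dict:
    for line in section.strip().splitlines():
        m = TAG_RE.match(line.strip())
        if not m:
            continue
        tag, value = m.groups()
        if tag == "HP_CHANGE":
            state["hp"] += int(value)
        elif tag == "SPELL_SLOT_USED":
            state["spell_slots"][value] -= 1
        elif tag == "ITEM_GAINED":
            state["inventory"].append(value)
        # ...ENEMY_SPAWN, CONCENTRATION, etc. dispatch to their own handlers
    return state

state = {"hp": 12, "spell_slots": {"1st": 2}, "inventory": []}
apply_mechanics("HP_CHANGE: -8\nSPELL_SLOT_USED: 1st", state)
# state["hp"] == 4, state["spell_slots"]["1st"] == 1
```

Every update is deterministic: the prose can embellish, but HP only moves when a tag says so.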

The roll:true/false flag on each suggestion is checked by the client before an action gets sent. If a suggestion requires a roll and the player hasn't provided one, the UI stops and prompts for it. The AI can't skip the dice.


Testing: No Unit Tests, Just Instrumented Playtesting

There's no test suite for DM behavior. A spec can tell you if a function returns the right boolean, but it can't tell you whether the goblin acted like a goblin.

Instead, I leaned on a few practices:

Debug logging at parse time. Every injected SRD rule gets logged. Every mechanic event gets logged. This is how I caught the model emitting ITEM_GAINED: Health Potion x1 (with a quantity suffix the parser didn't expect) and traced it back to a missing naming constraint in the prompt.

Version bumping as change control. The system instruction version constant forces a cache miss on character caches whenever the prompt changes. What felt like a minor clarification sometimes produced drastically different combat behavior. Treating each meaningful edit as a version was the only way to reason about what actually changed.
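The mechanism is just a version constant baked into the cache key. A minimal sketch; the constant name and key format are assumptions, though v9 matches the version mentioned above.

```python
# Sketch of version bumping as change control; the key format is an
# assumption, the version number matches the article's v9.
SYSTEM_PROMPT_VERSION = 9

def cache_key(character_id: str) -> str:
    """Bumping the version constant invalidates every cached character prompt."""
    return f"{character_id}:v{SYSTEM_PROMPT_VERSION}"
```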

Failure pattern cataloguing. When the AI broke a rule, I wrote it down, not as a bug ticket, but as a new clause in the system prompt. "Do not narrate level-ups directly, the game system handles this automatically" is in there because the model confidently told players "You feel a surge of power, you are now level 3!" on three separate playtests. Every clause traces back to a real failure.


What I'd Do Differently

The biggest tension is prompt size vs. model attention. At v9, the system instruction is around 350 lines before any per-turn context is added. Long-context models handle the length fine, but rules buried in the middle of a dense block can get overshadowed by recency effects from the conversation history.

Looking back, I'd front-load the most commonly violated rules (roll-before-outcome, spell prep rules) closer to the top, and probably split the narrative craft guidance into a separate injection that only appears during non-combat turns. There's no reason to spend context budget on "pacing and encounter variety" guidance while the player is mid-fight.

The SRD injection approach scales really well though. Adding a new mechanic like attunement, exhaustion, or diseases just means writing one structured rule object with tags and dropping it into the right category file. The scoring handles the rest.


Getting an LLM to follow structured rules reliably is harder than it looks. The model will do what the prompt describes, but only if the description is precise enough, and "precise enough" turns out to be a moving target. The real work is translating a rulebook written for human interpretation into something that survives a 50-turn campaign without drifting.

That gap between "the rules" and "what the model actually does" is where most of the engineering lives.


Building something AI-powered and running into consistency issues? I'd love to hear how you're handling it in the comments.

