Harish Kotra (he/him)

Posted on Jun 8

Same Prompt, 4 LLMs: The Roguelike Showdown

#ai #programming #python #dailybuild2026

I gave four different LLMs exactly the same prompt: build a complete, single-file, terminal-based Python 3 roguelike game using only the standard library. Three models produced runnable code. One timed out before generating anything.

The goal was not to pick a winner but to see what changes when the prompt is held constant. The results revealed things about LLM behavior that a leaderboard score would not show.

The Prompt

The specification was detailed:

Build a terminal-based roguelike with player stats (Health 100, Gold 0, Attack 15), exactly 20 turns, 3 randomized choices per turn from a pool of 4+ event types (fight, chest, tavern, merchant), ANSI color-coded output, screen clearing, input validation, and an end-game summary.

Full prompt is in the repository README.

[Figure: architecture diagram]

Model-by-Model Analysis

DeepSeek V4 Flash (524 lines)

DeepSeek took the most architecturally disciplined approach. It defined an abstract Event base class and four concrete subclasses:

class Event:
    def describe(self, player): ...
    def can_afford(self, player): return True
    def execute(self, player): ...

class FightMonsterEvent(Event): ...
class ChestEvent(Event): ...
class TavernEvent(Event): ...
class MerchantEvent(Event): ...

Choices are generated by sampling event classes from a list, then instantiating them:

EVENT_CLASSES = [FightMonsterEvent, ChestEvent, TavernEvent, MerchantEvent]
chosen_types = random.sample(EVENT_CLASSES, k=3)
choices = [cls() for cls in chosen_types]

This polymorphic approach is clean and extensible — adding a new event type means adding one class and registering it in the list.
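To make the extensibility claim concrete, here is a minimal sketch rather than DeepSeek's actual code. Only the Event interface and the EVENT_CLASSES registry mirror the excerpts above; Player, ShrineEvent, pick_events, and every number in the event bodies are hypothetical stand-ins.

```python
import random
from dataclasses import dataclass


@dataclass
class Player:
    # Starting stats come from the prompt; the class itself is hypothetical.
    health: int = 100
    gold: int = 0
    attack: int = 15


class Event:
    """Base interface mirrored from the excerpt above."""

    def describe(self, player):
        raise NotImplementedError

    def can_afford(self, player):
        return True

    def execute(self, player):
        raise NotImplementedError


class ChestEvent(Event):
    def describe(self, player):
        return "Open a chest"

    def execute(self, player):
        player.gold += 10  # illustrative payout, not from the source


class ShrineEvent(Event):
    """Hypothetical new event: pay gold to restore health."""

    COST = 5

    def describe(self, player):
        return f"Pray at a shrine ({self.COST} gold)"

    def can_afford(self, player):
        return player.gold >= self.COST

    def execute(self, player):
        player.gold -= self.COST
        player.health = min(100, player.health + 20)


# Registering the new class here is the only change outside the class itself.
EVENT_CLASSES = [ChestEvent, ShrineEvent]


def pick_events(k, rng=random):
    """Sample k distinct event classes and instantiate them."""
    return [cls() for cls in rng.sample(EVENT_CLASSES, k=k)]
```

Because the menu code only talks to the Event interface, nothing else needs to know that ShrineEvent exists.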

The bug: The end-game summary displays the player's current gold, not the lifetime gold accumulated. If you spend gold at taverns or merchants, the final figure under-reports. The prompt asked for "total Gold accumulated," but no lifetime counter was implemented.
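One way to avoid this class of bug is to keep the spendable balance and the lifetime total as separate fields that are updated together. The sketch below is an assumption about how that could look; the Wallet class and its method names are hypothetical and do not come from any of the generated games.

```python
class Wallet:
    """Hypothetical helper that separates spendable gold from lifetime earnings."""

    def __init__(self):
        self.gold = 0          # current, spendable balance
        self.total_earned = 0  # lifetime counter; never decreases

    def earn(self, amount):
        # Every gain updates both figures.
        self.gold += amount
        self.total_earned += amount

    def spend(self, amount):
        # Spending only touches the current balance.
        if amount > self.gold:
            return False
        self.gold -= amount
        return True


wallet = Wallet()
wallet.earn(30)
wallet.spend(20)
wallet.earn(5)
print(wallet.gold, wallet.total_earned)  # prints: 15 35
```

An end-game summary that reads total_earned reports 35 here, while one that reads gold would report 15.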

MiniMax M3 (802 lines)

MiniMax went all-in on user experience. It added color fallback detection, health bars that change color (green > yellow > red), character-by-character slow-print narration, ASCII art banners, and a multi-tier verdict system.

Architecturally, it used a Choice class carrying a Callable closure:

from typing import Callable

class Choice:
    def __init__(self, key, title, description, color, action: Callable):
        self.key = key
        self.action = action  # invoked later as choice.action(player)

The game engine simply calls choice.action(self.player) — a clean dispatch pattern.
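The dispatch pattern can be sketched in a few lines. This is an illustrative reduction, not MiniMax's code: the Player class, the rest and loot actions, and the resolve helper are all hypothetical, and the Choice constructor is trimmed to three fields.

```python
from typing import Callable


class Player:
    def __init__(self):
        self.health = 100
        self.gold = 0


class Choice:
    def __init__(self, key: str, title: str, action: Callable[[Player], str]):
        self.key = key
        self.title = title
        self.action = action  # a plain function or closure taking the player


def rest(player):
    player.health = min(100, player.health + 10)
    return "You rest."


def loot(player):
    player.gold += 5
    return "You find 5 gold."


# Menu keyed by the character the user types.
MENU = {c.key: c for c in (Choice("1", "Rest", rest), Choice("2", "Loot", loot))}


def resolve(key, player):
    """The engine needs no if/elif chain: it just calls the stored action."""
    return MENU[key].action(player)
```

The engine stays unaware of what each choice does, which is what makes the dispatch clean.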

Choice generation used an interesting strategy. Instead of always showing all options, it filtered out unaffordable ones and padded with duplicates:

def generate_choices(player):
    pool = [make_fight_choice(), make_chest_choice()]
    tavern = make_tavern_choice(player)    # None if can't afford
    merchant = make_merchant_choice(player)
    if tavern:    pool.append(tavern)
    if merchant:  pool.append(merchant)
    while len(pool) < 3:
        pool.append(random.choice(pool[:2]))   # pad!
    random.shuffle(pool)
    return pool[:3]

The bugs: This padding strategy can produce duplicate choices (e.g., fight, chest, chest) on early turns. Also, like DeepSeek, it reports current gold rather than lifetime earned gold.
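A duplicate-free alternative is to sample three distinct event types and flag the unaffordable ones instead of filtering them out. The sketch below is one possible approach under stated assumptions: the event names follow the prompt, but the EVENT_COSTS values and the (name, affordable) return shape are invented for illustration.

```python
import random

# Assumed costs; the source does not state the real prices.
EVENT_COSTS = {"fight": 0, "chest": 0, "tavern": 10, "merchant": 25}


def generate_choices(gold, k=3, rng=random):
    """Return k distinct (event_name, affordable) pairs.

    Sampling without replacement guarantees no duplicates, and
    unaffordable events are labeled rather than removed, so the
    pool never shrinks below k.
    """
    names = rng.sample(sorted(EVENT_COSTS), k)
    return [(name, gold >= EVENT_COSTS[name]) for name in names]
```

The trade-off is that the player may see options they cannot take, so the menu has to render the affordable flag clearly.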

Mimo 2.5 (800 lines)

Mimo produced the cleanest architecture, splitting responsibilities across four classes:

GameState        — owns player stats + summary counters
EventHandler     — event execution, mutates GameState
ChoiceGenerator  — menu generation, affordability checks
GameEngine       — control flow, wires everything together

It was the only model that correctly tracked total_gold_earned:

class GameState:
    def __init__(self):
        ...
        self.total_gold_earned = 0
        self.monsters_slain = 0
        self.damage_dealt = 0
        self.damage_taken = 0

The hidden bug: Mimo stored live Monster instances in a global MONSTER_POOL and mutated them during combat. If the same monster object was selected again in a later fight, it could start with corrupted stats or already be dead. This is precisely the kind of state-management defect that survives surface-level review.
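The shared-instance problem and a common fix can be reproduced in isolation. This is a minimal sketch, not Mimo's code: the Monster fields, the stats, and both spawn helpers are hypothetical, and copying a template is only one of several reasonable fixes.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class Monster:
    name: str
    health: int
    attack: int


def spawn_shared(pool, rng=random):
    # Buggy pattern: returns the pooled object itself, so combat
    # damage is written back into the pool.
    return rng.choice(pool)


def spawn_fresh(pool, rng=random):
    # Safer pattern: treat pool entries as templates and copy them.
    return copy.copy(rng.choice(pool))


templates = [Monster("Goblin", 30, 5)]
spawn_shared(templates).health = 0   # "kill" the monster in combat
print(templates[0].health)           # prints: 0 (template is now dead)

templates = [Monster("Goblin", 30, 5)]
spawn_fresh(templates).health = 0
print(templates[0].health)           # prints: 30 (template untouched)
```

A shallow copy is enough here because every field is an immutable value; a monster holding lists or nested objects would need copy.deepcopy.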

There was also an off-by-one in turn tracking: a full 20-turn survival path would report 21/20 turns.
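The source does not show Mimo's loop, so the following is only one plausible way such an off-by-one arises: reporting the "next turn" counter instead of the number of turns actually played. Both functions and the survives callback are hypothetical.

```python
MAX_TURNS = 20


def play_buggy(survives=lambda turn: True):
    """Returns the loop counter, which points at the NEXT turn after a full run."""
    turn = 1
    while turn <= MAX_TURNS:
        if not survives(turn):
            break
        turn += 1
    return turn


def play_fixed(survives=lambda turn: True):
    """Counts completed turns separately from the loop index."""
    turns_played = 0
    for turn in range(1, MAX_TURNS + 1):
        if not survives(turn):
            break
        turns_played += 1
    return turns_played


print(f"{play_buggy()}/{MAX_TURNS}")  # prints: 21/20
print(f"{play_fixed()}/{MAX_TURNS}")  # prints: 20/20
```

Keeping "which turn is next" and "how many turns were completed" as separate values makes the summary line hard to get wrong.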

Nemotron 3 Ultra (0 lines)

Nemotron produced no game source. The captured error was:

write Failed
"Upstream idle timeout exceeded"

This is a reliability data point, not a code quality data point. It belongs in the comparison because if the experiment is "ask models to perform the same task," completion reliability is part of the outcome.

Lessons for Engineers Using LLM-Generated Code

1. Surface Compliance ≠ Correctness

All three successful models implemented ANSI colors, turn loops, combat, and summaries: the visible requirements. The bugs were in edge cases and bookkeeping, which a human reviewer has to trace through the code to find.

2. Cardinality Constraints Force Design Tradeoffs

The "exactly 3 choices" requirement pushed each model into a different strategy:

DeepSeek: Show the sampled options regardless of affordability, block unaffordable ones at execution time (player wastes a turn)
MiniMax: Filter unaffordable ones, pad with duplicates (violates "distinct" intent)
Mimo: Show unaffordable options with labels, route to handler anyway (most pragmatic)

None of these is obviously correct. The prompt left this case ambiguous, and each model resolved it differently.

3. Architecture Quality and Correctness Are Independent Axes

Mimo had the cleanest state/handler/generator separation — and the most interesting hidden state bug (mutated global monsters). MiniMax had the most polished UX — and the most obvious choice-generation issue. Clean architecture is a good thing, but it doesn't guarantee correct behavior.

4. Bookkeeping Is the Weak Spot

Two of the three successful models got the "total gold accumulated" requirement wrong because they conflated "current gold" with "lifetime earned gold." This points to a pattern: LLMs are good at generating visible features but often miss exact semantic bookkeeping.

5. Failed Runs Are Still Useful

The Nemotron timeout tells us something about infrastructure reliability. When evaluating LLMs for real work, completion rate is a legitimate metric — separate from code quality.
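Tracking completion separately from code quality can be as simple as recording each run's outcome. The sketch below uses the line counts and error message reported in this post; the RunResult structure and the completion_rate helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    model: str
    lines: int
    error: Optional[str] = None  # None means the run produced code

    @property
    def completed(self):
        return self.error is None


RESULTS = [
    RunResult("DeepSeek V4 Flash", 524),
    RunResult("MiniMax M3", 802),
    RunResult("Mimo 2.5", 800),
    RunResult("Nemotron 3 Ultra", 0, error="Upstream idle timeout exceeded"),
]


def completion_rate(results):
    """Fraction of runs that produced any code at all."""
    return sum(r.completed for r in results) / len(results)


print(completion_rate(RESULTS))  # prints: 0.75
```

With a single run per model this number says little on its own; it becomes meaningful once the same prompt is repeated many times.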

A Practical Review Checklist

Based on this experiment, here is a review lens for generated code:

Check visible prompt compliance — does the broad structure match?
Check edge-case semantics — are numbers genuinely correct, not just present?
Check hidden mutable state — are objects shared where they should be copied?
Check final-summary bookkeeping — does "total X" mean what it says?
Separate model quality from infrastructure reliability — a timeout is not bad code.

Running the Games

Anyone can run these locally; no dependencies are required:

python3 deepseek-v4-flash/roguelike.py
python3 minimax-m3/rogue.py
python3 mimo-2.5/roguelike.py

The most useful result from this experiment is not a ranking. It is the observation that same-prompt comparisons reveal different model instincts:

One model optimized for concise, prompt-faithful coverage.
One optimized for presentation and game feel.
One optimized for clean, separated architecture.
One failed at the execution layer before code could be evaluated.

For real engineering work, that suggests a practical approach: use LLMs for drafts, apply structured review for correctness, and never assume that compliant-looking output is correct output.