The day before, I ran an experiment: give several local LLMs the exact same prompt (build a terminal-based roguelike game in Python 3) and compare what comes out. The goal was never to declare a winner; it was to see how architectural choices, edge-case handling, and failure modes diverge when the input is held perfectly constant.
The results were more interesting than I expected. Surface compliance was common; correct state management was not. Presentation polish sometimes masked mechanical defects. Failed generations were still useful data points.
So I did it again. Same methodology, different domain. This time: a static site generator.
The Experiment
Prompt
The models were given this prompt verbatim:
Act as an expert software engineer. Build a complete, single-file, production-ready
Python 3 static site generator (SSG). The program must run entirely from the command
line with zero external dependencies (standard library only).
You must implement the entire program in a single shot. Do not use placeholders,
TODOs, or truncated code.
The full prompt specified Markdown parsing (headings, bold, italic, lists, code blocks, links, horizontal rules, paragraphs), front matter support (title, date, description), HTML5 output with embedded CSS, auto-generated index page with working links, deterministic output, a --clean flag, proper error handling for missing directories and unreadable files, and an object-oriented architecture with distinct parsing/rendering/file-management responsibilities.
Models
| Model | Output | Lines |
|---|---|---|
| Big Pickle | ssg.py | 410 |
| DeepSeek V4 Flash | ssg.py | 583 |
| Mimo 2.5 | ssg.py | 566 |
| Nemotron Ultra | (empty) | 0 |
Methodology
- All folders preserved as raw, unedited artifacts.
- Analysis based on source inspection, syntax parsing (python3 -m py_compile), and functional testing with sample Markdown content.
- No edits were made to any model output.
- Empty folders are treated as completion/reliability failures, not implementation comparisons.
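The syntax-parsing step can be scripted rather than run by hand. The following is a minimal sketch, assuming each artifact is a single ssg.py file; the helper name compiles_cleanly is hypothetical and not part of the original tooling. It writes the bytecode to a temporary directory so the artifact folders stay untouched.

```python
import py_compile
import tempfile
from pathlib import Path


def compiles_cleanly(path: Path) -> bool:
    """Return True if the file byte-compiles, False on a syntax error.

    Hypothetical helper: mirrors `python3 -m py_compile` without leaving
    a __pycache__ directory next to the raw artifact.
    """
    with tempfile.TemporaryDirectory() as tmp:
        try:
            # doraise=True raises PyCompileError instead of printing to stderr.
            py_compile.compile(
                str(path),
                cfile=str(Path(tmp) / "out.pyc"),
                doraise=True,
            )
            return True
        except py_compile.PyCompileError:
            return False
```

Note that an empty folder never reaches this check at all, which is why it is counted as a reliability failure rather than a syntax failure.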
Results: The Baseline
All three successful models passed syntax checks and produced valid HTML output on functional tests. Every model met the core prompt requirements:
- ✅ Markdown→HTML conversion (headings, bold, italic, unordered lists, ordered lists, inline code, fenced code blocks, hyperlinks, horizontal rules, paragraphs)
- ✅ Front matter parsing (title, date, description) with body stripping
- ✅ Complete HTML5 documents with <!DOCTYPE html>, <head>, <body>
- ✅ Embedded CSS stylesheet (no external resources)
- ✅ Auto-generated index.html listing all pages
- ✅ --clean flag for output directory reset
- ✅ Error handling for missing source directories
- ✅ Graceful handling of files with no .md content
- ✅ Per-file error logging without aborting the build
- ✅ Deterministic output (identical input → identical output)
- ✅ Object-oriented architecture with separated responsibilities
This baseline was important to establish. All three models did what the prompt asked. The interesting observations are in the divergences.
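One way to check the "identical input → identical output" item is to hash the whole output tree after two separate runs and compare the digests. This is a sketch of that idea, not the exact check used in the experiment; tree_digest is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Hash every file under root (relative path plus bytes) in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renamed files change the digest.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Running a generator twice on the same input and comparing tree_digest(Path("output")) between runs would flag nondeterminism such as embedded build timestamps or unordered directory listings.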
Divergence #1: The index.md Collision Problem
The prompt explicitly called out a subtle edge case:
If a file is named index.md, it should be rendered as a page (not replace the generated index).
Three models read this and produced three different outcomes:
Big Pickle renamed the output file:
def output_path_for(self, source_path):
    name = source_path.stem
    if name == 'index':
        name = 'index-page'  # rename to avoid collision
    return self.output / f'{name}.html'
DeepSeek V4 Flash routed it to a subdirectory:
if md_file.stem == "index":
    relative_url = "pages/index.html"
else:
    relative_url = md_file.stem + ".html"
Mimo 2.5 did not handle it at all. An index.md source file would be written to output/index.html, silently overwriting the auto-generated site index. Same prompt, same warning — but the edge case was either missed or deemed unimportant.
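For comparison, here is a minimal sketch of a collision guard in the spirit of Big Pickle's rename approach. It is not taken from any of the model outputs: RESERVED_STEMS, output_name_for, and the numeric suffix scheme are all assumptions made for illustration. It also covers the follow-on case where a source file is already named index-page.md.

```python
from pathlib import Path

# Stems the generator writes itself (assumption: only the site index).
RESERVED_STEMS = {"index"}


def output_name_for(source: Path, taken: set[str]) -> str:
    """Pick an output filename that cannot overwrite a reserved or used name."""
    stem = source.stem
    if stem in RESERVED_STEMS:
        stem = f"{stem}-page"  # same rename idea as Big Pickle's output
    candidate = f"{stem}.html"
    counter = 2
    while candidate in taken:
        # Hypothetical fallback: add a numeric suffix until the name is free.
        candidate = f"{stem}-{counter}.html"
        counter += 1
    taken.add(candidate)
    return candidate
```

Seeding taken with "index.html" before processing any source files keeps the generated index safe regardless of what the content directory contains.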
This is a recurring pattern from the game experiment too, where the "exactly 3 choices" requirement was met differently by every model. Prompt edge cases are interpreted differently even when explicitly called out.
Divergence #2: Configurable Output Directory
The prompt says:
Each .md file must be converted to a corresponding .html file in an output/ directory (created if it doesn't exist).
Is output/ a hard requirement or a default? The models disagreed:
- Big Pickle: Hardcoded to output/. No configuration option.
- DeepSeek V4 Flash: Hardcoded to output/. No configuration option.
- Mimo 2.5: Configurable via --output/-o flag, defaults to output/.
Only one model added customization. The other two treated "output/" as fixed. This is a small detail, but it reflects how models handle "configuration" hints in prompts — some stay literal, others extrapolate to flexibility.
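The "default, but overridable" reading takes only a few lines with argparse. The sketch below is an illustration, not Mimo's actual code: the positional source argument and the help strings are assumptions, while the --output/-o and --clean flags match what the article describes.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """CLI sketch: output/ stays the default but can be overridden."""
    parser = argparse.ArgumentParser(description="Static site generator (sketch)")
    # Assumption: the source directory is passed as a positional argument.
    parser.add_argument("source", help="directory containing .md files")
    parser.add_argument(
        "-o", "--output",
        default="output",
        help="output directory (default: output)",
    )
    parser.add_argument(
        "--clean",
        action="store_true",
        help="remove the output directory before building",
    )
    return parser
```

With this parser, running with only a source directory behaves exactly like the hardcoded versions, so the literal reading of the prompt is still satisfied.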
Divergence #3: HTML Escaping (Security)
This is the most important quality gap in the experiment.
Big Pickle and DeepSeek V4 Flash both correctly escape HTML special characters before applying inline Markdown formatting:
# From Big Pickle's MarkdownParser
def _parse_inline(self, text):
    text = self._escape(text)  # &, <, >, " escaped first
    # Inline code first, to protect content from further parsing
    text = re.sub(r'`([^`]+)`', r'<code>\1</code>', text)
    text = re.sub(r'\*\*(.+?)\*\*', r'<strong>\1</strong>', text)
    # ...etc
Mimo 2.5 applies regex substitutions to raw text without prior escaping:
def _parse_inline(self, text):
    text = self.bold_pattern.sub(r'<strong>\1</strong>', text)
    text = self.italic_pattern.sub(r'<em>\1</em>', text)
    text = self.code_pattern.sub(r'<code>\1</code>', text)
    text = self.link_pattern.sub(r'<a href="\2">\1</a>', text)
    return text  # raw HTML from markdown content passes through unescaped
An _escape_html method exists in HTMLRenderer, but it is only used for metadata and title rendering — not for body content. This means user-provided Markdown content containing &, <, >, or raw HTML tags would produce unexpected (and potentially unsafe) output.
Two of three models got this right. The third had a perfectly clean implementation otherwise. This is exactly the kind of defect that is easy to miss in a single-shot generation but matters in production.
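The difference is easy to reproduce in isolation. The two functions below are a simplified sketch, not code from either model: they use the standard library's html.escape in place of each model's own escape helper, and the function names are hypothetical.

```python
import html
import re


def render_inline_unescaped(text: str) -> str:
    """Inline formatting on raw text: markup in the source passes through."""
    return re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)


def render_inline_escaped(text: str) -> str:
    """Escape first, then apply inline formatting to the escaped text."""
    text = html.escape(text, quote=True)  # &, <, >, " and ' become entities
    text = re.sub(r"`([^`]+)`", r"<code>\1</code>", text)
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)
    return text


sample = "**bold** & <script>alert(1)</script>"
unsafe = render_inline_unescaped(sample)
safe = render_inline_escaped(sample)
```

Here unsafe still contains a live script tag, while safe contains only the entity-encoded text. The order matters: escaping after the substitutions would also encode the strong and em tags the parser just produced.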
Divergence #4: Architecture Style
Each model chose a different architectural approach:
Big Pickle — Coordinated Class Dependencies
Big Pickle uses five classes with SiteGenerator as the orchestrator that composes FileManager, MarkdownParser, and HTMLRenderer. The FileManager is the most feature-rich of the three, handling file discovery, non-Markdown file counting, read/write, output preparation, and path collision resolution.
The pipeline flow is straightforward:
- FileManager.find_markdown_files()
- For each file: FrontMatter.parse() → MarkdownParser.parse() → HTMLRenderer.render() → FileManager.write()
- _generate_index() → summary
DeepSeek V4 Flash — Block-Based Pipeline
DeepSeek separates parsing and rendering into two explicit phases with an intermediate representation. MarkdownParser splits text into semantic blocks stored as (type, content) tuples:
# Blocks look like:
("heading", (1, "My Title"))
("paragraph", "Some text here")
("code", ("python", "print('hello')"))
("unordered_list", ["item one", "item two"])
HTMLRenderer.render_block() then consumes these tuples polymorphically. This two-phase approach is the cleanest separation of concerns among the three implementations.
DeepSeek also went beyond the prompt by supporting blockquotes (>), language-tagged code blocks (<pre><code class="language-python">), and an extra_head injection point for custom <head> content.
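To make the two-phase idea concrete, here is a small renderer that consumes tuples of the shape shown above. It is a sketch written for this article, not DeepSeek's actual render_block(): the dispatch logic, the escaping calls, and the error for unknown block types are all assumptions.

```python
import html


def render_block(block: tuple) -> str:
    """Render one (type, content) block tuple to an HTML fragment."""
    kind, content = block
    if kind == "heading":
        level, text = content
        return f"<h{level}>{html.escape(text, quote=False)}</h{level}>"
    if kind == "paragraph":
        return f"<p>{html.escape(content, quote=False)}</p>"
    if kind == "code":
        lang, source = content
        # The language tag ends up in an attribute, so quotes are escaped too.
        cls = f' class="language-{html.escape(lang)}"' if lang else ""
        return f"<pre><code{cls}>{html.escape(source, quote=False)}</code></pre>"
    if kind == "unordered_list":
        items = "".join(
            f"<li>{html.escape(item, quote=False)}</li>" for item in content
        )
        return f"<ul>{items}</ul>"
    raise ValueError(f"unknown block type: {kind!r}")


blocks = [
    ("heading", (1, "My Title")),
    ("paragraph", "Some text here"),
    ("code", ("python", "print('hello')")),
    ("unordered_list", ["item one", "item two"]),
]
page_body = "\n".join(render_block(b) for b in blocks)
```

Because the parser only ever emits tuples, the renderer can be tested without any Markdown input, which is the practical payoff of keeping an intermediate representation.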
Mimo 2.5 — Dataclass-Driven with Builder Pattern
Mimo uses Python @dataclass for clean data modeling:
@dataclass
class PageMetadata:
    title: str
    date: Optional[str] = None
    description: Optional[str] = None

@dataclass
class Page:
    filename: str
    source_path: str
    output_path: str
    metadata: PageMetadata
    content: str
    html: str
The SiteBuilder.build() method returns a bool rather than calling sys.exit() directly, giving callers more control over the exit flow. The index page sorts entries by date in descending order — a nice UX touch.
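Both design choices are small in code terms. The sketch below illustrates them under stated assumptions: IndexEntry, sort_for_index, and exit_code_for are hypothetical names rather than Mimo's, dates are assumed to be ISO-formatted strings (YYYY-MM-DD), and undated pages are placed last, which the original may or may not do.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class IndexEntry:
    """Hypothetical stand-in for the per-page data the index needs."""
    title: str
    date: Optional[str] = None


def sort_for_index(entries: List[IndexEntry]) -> List[IndexEntry]:
    """Newest first; ties and undated pages fall back to title order."""
    dated = sorted((e for e in entries if e.date), key=lambda e: e.title)
    # Stable sort: entries with equal dates keep their title order.
    dated.sort(key=lambda e: e.date, reverse=True)
    undated = sorted((e for e in entries if not e.date), key=lambda e: e.title)
    return dated + undated


def exit_code_for(build: Callable[[], bool]) -> int:
    """Map a bool-returning build step to a process exit code."""
    return 0 if build() else 1
```

A caller would typically finish with sys.exit(exit_code_for(builder.build)), which keeps the exit decision in one place and leaves build() usable from tests or other scripts. Sorting with an explicit tie-break also keeps the index deterministic when several pages share a date.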
Divergence #5: Line Count vs. Completeness
The smallest implementation (Big Pickle, 410 lines) covered every prompt requirement. The largest (DeepSeek, 583 lines) added features beyond the prompt (blockquotes, language-tagged code blocks). Mimo (566 lines) also added UX extras (date-sorted index, short CLI flags).
| Model | Lines | Extra features | Missing features |
|---|---|---|---|
| Big Pickle | 410 | index.md collision handling | No --output flag |
| DeepSeek | 583 | Blockquotes, lang tags, custom exceptions | No --output flag |
| Mimo 2.5 | 566 | --output/-o flags, date-sorted index | No HTML escaping, no index.md handling |
More code did not mean more completeness. Big Pickle was 30% smaller than DeepSeek but handled the index.md edge case that Mimo missed entirely.
Failure Case: Nemotron Ultra
The Nemotron Ultra folder is empty. Unlike the game experiment's Nemotron 3 Ultra, which at least captured an upstream timeout error message, this run produced nothing at all — no code, no error log, no artifact. This is a harder failure to diagnose and a reminder that generation reliability is a separate concern from generation quality.
Patterns Across Both Experiments
The game experiment (round 1) and the SSG experiment (round 2) share some common findings:
1. Surface compliance is the easy part
Every successful model in both experiments met the stated requirements. The divergences are in edge-case handling, subtle bugs, and architectural choices — not in whether the program runs.
2. Ambiguity drives divergence
The game prompt's "exactly 3 choices" was interpreted differently by every model. The SSG prompt's "output/ directory" got two interpretations. When a prompt leaves room for interpretation, models fill that room differently.
3. Security details are fragile
In the game experiment, Mimo 2.5 had hidden mutable global state that would cause bugs after repeated play. In the SSG experiment, Mimo 2.5 missed HTML escaping. These are both "works at first glance, fails under real use" defects.
4. Architecture ≠ correctness
The most architecturally clean implementations (Mimo 2.5's GameState/EventHandler/ChoiceGenerator/GameEngine separation in the game, DeepSeek's block-based pipeline in the SSG) also had the most subtle bugs. Clean code is not a guarantee of correct code.
5. Failures matter
Both experiments had a model that produced no usable code. These are not just null data points; they highlight reliability constraints in the generation pipeline itself.
Running the same experiment across two very different domains (a terminal game and a static site generator) reveals patterns that a single run couldn't. The same types of divergence — edge-case handling, architectural taste, security awareness — showed up in both.
The raw artifacts and full analysis are available at:
https://github.com/kotrats/same-prompt-multiple-local-models
The repo now contains both experiments with unedited model outputs, detailed comparison tables, architecture diagrams, and per-model observations. If you run your own experiments or have ideas for the next prompt to test, I'd love to hear about it.
All model outputs preserved as raw artifacts. No edits were made to any generated code. Analysis based on source inspection, syntax parsing, and functional testing.