Harish Kotra (he/him)

Posted on Jun 10

Same Prompt, Multiple Local Models — Round 2: Static Site Generator

#ai #programming #productivity #python

A day earlier, I ran an experiment: give several local LLMs the exact same prompt (build a terminal-based roguelike game in Python 3) and compare what comes out. The goal was never to declare a winner; it was to see how architectural choices, edge-case handling, and failure modes diverge when the input is held perfectly constant.

The results were more interesting than I expected. Surface compliance was common; correct state management was not. Presentation polish sometimes masked mechanical defects. Failed generations were still useful data points.

So I did it again. Same methodology, different domain. This time: a static site generator.

The Experiment

Prompt

The models were given this prompt verbatim:

Act as an expert software engineer. Build a complete, single-file, production-ready
Python 3 static site generator (SSG). The program must run entirely from the command
line with zero external dependencies (standard library only).

You must implement the entire program in a single shot. Do not use placeholders,
TODOs, or truncated code.

The full prompt specified Markdown parsing (headings, bold, italic, lists, code blocks, links, horizontal rules, paragraphs), front matter support (title, date, description), HTML5 output with embedded CSS, auto-generated index page with working links, deterministic output, a --clean flag, proper error handling for missing directories and unreadable files, and an object-oriented architecture with distinct parsing/rendering/file-management responsibilities.

Models

Model	Output	Lines
Big Pickle	`ssg.py`	410
DeepSeek V4 Flash	`ssg.py`	583
Mimo 2.5	`ssg.py`	566
Nemotron Ultra	(empty)	0

Methodology

All folders preserved as raw, unedited artifacts.
Analysis based on source inspection, syntax parsing (python3 -m py_compile), and functional testing with sample Markdown content.
No edits were made to any model output.
Empty folders are treated as completion/reliability failures, not implementation comparisons.
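
The syntax-check step can be sketched roughly as follows. This is a minimal sketch, not the script used for the analysis: the one-folder-per-model layout and the function name syntax_check are assumptions. Bytecode is written to a scratch directory so the artifact folders stay untouched.

```python
import py_compile
import tempfile
from pathlib import Path


def syntax_check(artifact_root: Path) -> dict:
    """Map each model folder name to 'ok', 'syntax error', or 'empty'."""
    results = {}
    with tempfile.TemporaryDirectory() as scratch:
        for folder in sorted(p for p in artifact_root.iterdir() if p.is_dir()):
            scripts = sorted(folder.glob("*.py"))
            if not scripts:
                # An empty folder is a completion failure, not a comparison point.
                results[folder.name] = "empty"
                continue
            try:
                for script in scripts:
                    # cfile points into the scratch dir so no __pycache__ is
                    # created next to the raw artifact; doraise=True turns
                    # compile problems into PyCompileError.
                    py_compile.compile(
                        str(script),
                        cfile=str(Path(scratch) / (script.name + "c")),
                        doraise=True,
                    )
                results[folder.name] = "ok"
            except py_compile.PyCompileError:
                results[folder.name] = "syntax error"
    return results
```

Calling syntax_check(Path("artifacts")) on a hypothetical artifacts/ folder returns one status per model, which maps directly onto the "empty folder equals reliability failure" rule above.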

Results: The Baseline

All three successful models passed syntax checks and produced valid HTML output on functional tests. Every model met the core prompt requirements:

✅ Markdown→HTML conversion (headings, bold, italic, unordered lists, ordered lists, inline code, fenced code blocks, hyperlinks, horizontal rules, paragraphs)
✅ Front matter parsing (title, date, description) with body stripping
✅ Complete HTML5 documents with <!DOCTYPE html>, <head>, <body>
✅ Embedded CSS stylesheet (no external resources)
✅ Auto-generated index.html listing all pages
✅ --clean flag for output directory reset
✅ Error handling for missing source directories
✅ Graceful handling of files with no .md content
✅ Per-file error logging without aborting the build
✅ Deterministic output (identical input → identical output)
✅ Object-oriented architecture with separated responsibilities

This baseline was important to establish. All three models did what the prompt asked. The interesting observations are in the divergences.
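
One way to check the determinism item is to build twice and compare a digest of the output tree. The sketch below is an assumption about how such a check could look, not the exact test I ran; tree_digest is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def tree_digest(output_dir: Path) -> str:
    """Hash the relative path and bytes of every file under output_dir."""
    digest = hashlib.sha256()
    # Sorting makes the digest independent of filesystem listing order.
    for path in sorted(p for p in output_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(output_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Run the generator, record tree_digest(Path("output")), run it again on the same input, and compare. Any embedded timestamp or unordered directory listing shows up as a mismatch.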

Divergence #1: The index.md Collision Problem

The prompt explicitly called out a subtle edge case:

If a file is named index.md, it should be rendered as a page (not replace the generated index).

Three models read this and produced three different outcomes:

Big Pickle renamed the output file:

def output_path_for(self, source_path):
    name = source_path.stem
    if name == 'index':
        name = 'index-page'       # rename to avoid collision
    return self.output / f'{name}.html'

DeepSeek V4 Flash routed it to a subdirectory:

if md_file.stem == "index":
    relative_url = "pages/index.html"
else:
    relative_url = md_file.stem + ".html"

Mimo 2.5 did not handle it at all. An index.md source file would be written to output/index.html, silently overwriting the auto-generated site index. Same prompt, same warning — but the edge case was either missed or deemed unimportant.
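
A collision-aware mapping from source files to output names could look like the sketch below. It follows the rename idea from Big Pickle's snippet, but plan_outputs, RESERVED, and the extra collision check are my own hypothetical additions, not code from any model.

```python
from pathlib import Path

RESERVED = "index.html"  # name of the auto-generated site index


def plan_outputs(sources, rename_to="index-page.html"):
    """Map each source file to an output name that never shadows the site index."""
    plan = {}
    taken = {RESERVED}
    for src in sorted(sources):
        name = Path(src).stem + ".html"
        if name == RESERVED:
            # index.md is still rendered as a page, just under another name.
            name = rename_to
        if name in taken:
            # Covers the rarer case where the rename itself collides,
            # e.g. both index.md and index-page.md exist.
            raise ValueError(f"output name collision: {src} -> {name}")
        taken.add(name)
        plan[src] = name
    return plan
```

For example, plan_outputs(["about.md", "index.md"]) maps index.md to index-page.html and leaves index.html free for the generated index.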

This is a recurring pattern from the game experiment too, where the "exactly 3 choices" requirement was met differently by every model. Prompt edge cases are interpreted differently even when explicitly called out.

Divergence #2: Configurable Output Directory

The prompt says:

Each .md file must be converted to a corresponding .html file in an output/ directory (created if it doesn't exist).

Is output/ a hard requirement or a default? The models disagreed:

Big Pickle: Hardcoded to output/. No configuration option.
DeepSeek V4 Flash: Hardcoded to output/. No configuration option.
Mimo 2.5: Configurable via --output/-o flag, defaults to output/.

Only one model added customization. The other two treated "output/" as fixed. This is a small detail, but it reflects how models handle "configuration" hints in prompts — some stay literal, others extrapolate to flexibility.
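
For reference, the "default, but overridable" reading takes only a few lines with argparse. This is a sketch of the general shape, not Mimo 2.5's actual code; in particular, the positional source argument is an assumption.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Minimal SSG command-line sketch")
    # Assumption: the source directory is passed as a positional argument.
    parser.add_argument("source", help="directory containing .md files")
    # output/ stays the default, so the literal prompt requirement still holds.
    parser.add_argument("-o", "--output", default="output",
                        help="output directory (default: output)")
    parser.add_argument("--clean", action="store_true",
                        help="delete the output directory before building")
    return parser
```

With this shape, running without -o behaves exactly like the hardcoded versions, so the flexibility costs nothing in prompt compliance.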

Divergence #3: HTML Escaping (Security)

This is the most important quality gap in the experiment.

Big Pickle and DeepSeek V4 Flash both correctly escape HTML special characters before applying inline Markdown formatting:

# From Big Pickle's MarkdownParser
def _parse_inline(self, text):
    text = self._escape(text)  # &, <, >, " escaped first

    # Inline code — first to protect content from further parsing
    text = re.sub(r'`([^`]+)`', r'<code>\1</code>', text)
    text = re.sub(r'\*\*(.+?)\*\*', r'<strong>\1</strong>', text)
    # ...etc

Mimo 2.5 applies regex substitutions to raw text without prior escaping:

def _parse_inline(self, text):
    text = self.bold_pattern.sub(r'<strong>\1</strong>', text)
    text = self.italic_pattern.sub(r'<em>\1</em>', text)
    text = self.code_pattern.sub(r'<code>\1</code>', text)
    text = self.link_pattern.sub(r'<a href="\2">\1</a>', text)
    return text  # raw HTML from markdown content passes through unescaped

An _escape_html method exists in HTMLRenderer, but it is only used for metadata and title rendering, not for body content. This means Markdown content containing &, <, >, or raw HTML tags reaches the generated page unescaped, where the browser interprets it as markup. The output is unexpected at best and unsafe at worst.

Two of three models got this right. The third had a perfectly clean implementation otherwise. This is exactly the kind of defect that is easy to miss in a single-shot generation but matters in production.
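
The difference is easy to reproduce in isolation. The sketch below is a deliberately minimal illustration using the standard library's html.escape; both function names are hypothetical and neither is taken from a model's output.

```python
import html
import re


def render_inline(text: str) -> str:
    """Escape first, then apply a few inline Markdown rules."""
    # Escaping before substitution means only tags we emit ourselves survive.
    text = html.escape(text, quote=True)
    text = re.sub(r"`([^`]+)`", r"<code>\1</code>", text)
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)
    return text


def render_inline_unescaped(text: str) -> str:
    """Same bold rule without escaping: raw HTML passes straight through."""
    return re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)


sample = "**bold** <script>alert(1)</script>"
print(render_inline(sample))
print(render_inline_unescaped(sample))
```

The first call prints the script tag as inert text (&lt;script&gt;...), the second emits a live script element. The order of operations is the whole fix.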

Divergence #4: Architecture Style

Each model chose a different architectural approach:

Big Pickle — Coordinated Class Dependencies

Big Pickle uses five classes with SiteGenerator as the orchestrator that composes FileManager, MarkdownParser, and HTMLRenderer. The FileManager is the most feature-rich of the three, handling file discovery, non-Markdown file counting, read/write, output preparation, and path collision resolution.

The pipeline flow is straightforward:

FileManager.find_markdown_files()
For each file: FrontMatter.parse() → MarkdownParser.parse() → HTMLRenderer.render() → FileManager.write()
_generate_index() → summary
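
A rough sketch of that loop, including the per-file error logging from the baseline list, might look like this. The convert and write callables stand in for the parse/render and file-management steps; build_all and its signature are assumptions, not Big Pickle's code.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def build_all(
    files: List[Path],
    convert: Callable[[Path], str],
    write: Callable[[Path, str], None],
) -> Tuple[int, List[str]]:
    """Convert every file, collecting errors instead of aborting the build."""
    built = 0
    errors: List[str] = []
    for src in sorted(files):  # sorted for deterministic processing order
        try:
            write(src, convert(src))
            built += 1
        except (OSError, UnicodeDecodeError, ValueError) as exc:
            # One unreadable or malformed file should not stop the others.
            errors.append(f"{src}: {exc}")
    return built, errors
```

The returned counts are enough to print a build summary and to decide the process exit code.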

DeepSeek V4 Flash — Block-Based Pipeline

DeepSeek separates parsing and rendering into two explicit phases with an intermediate representation. MarkdownParser splits text into semantic blocks stored as (type, content) tuples:

# Blocks look like:
("heading", (1, "My Title"))
("paragraph", "Some text here")
("code", ("python", "print('hello')"))
("unordered_list", ["item one", "item two"])

HTMLRenderer.render_block() then consumes these tuples polymorphically. This two-phase approach is the cleanest separation of concerns among the three implementations.

DeepSeek also went beyond the prompt by supporting blockquotes (>), language-tagged code blocks (<pre><code class="language-python">), and an extra_head injection point for custom <head> content.
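
To make the two-phase idea concrete, here is a minimal sketch of a renderer that consumes the block tuples shown above. It dispatches on the block type with plain if statements and skips inline formatting entirely; it illustrates the shape and is not DeepSeek's implementation.

```python
import html


def render_block(block) -> str:
    """Turn one (type, content) tuple into an HTML fragment."""
    kind, content = block
    if kind == "heading":
        level, text = content
        return f"<h{level}>{html.escape(text, quote=False)}</h{level}>"
    if kind == "paragraph":
        return f"<p>{html.escape(content, quote=False)}</p>"
    if kind == "code":
        lang, source = content
        # The language lands in an attribute, so quotes are escaped there too.
        return (f'<pre><code class="language-{html.escape(lang)}">'
                f"{html.escape(source, quote=False)}</code></pre>")
    if kind == "unordered_list":
        items = "".join(f"<li>{html.escape(i, quote=False)}</li>" for i in content)
        return f"<ul>{items}</ul>"
    raise ValueError(f"unknown block type: {kind}")
```

Because the parser only produces tuples, a renderer like this can be tested without any Markdown input at all, which is the practical payoff of the intermediate representation.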

Mimo 2.5 — Dataclass-Driven with Builder Pattern

Mimo uses Python @dataclass for clean data modeling:

@dataclass
class PageMetadata:
    title: str
    date: Optional[str] = None
    description: Optional[str] = None

@dataclass
class Page:
    filename: str
    source_path: str
    output_path: str
    metadata: PageMetadata
    content: str
    html: str

The SiteBuilder.build() method returns a bool rather than calling sys.exit() directly, giving callers more control over the exit flow. The index page sorts entries by date in descending order — a nice UX touch.
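
Both ideas are small enough to sketch. The code below assumes ISO-style dates (YYYY-MM-DD), so that string order matches chronological order; IndexEntry, sort_for_index, and exit_code are hypothetical names, not Mimo 2.5's.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexEntry:  # hypothetical stand-in for a page's index metadata
    title: str
    date: Optional[str] = None


def sort_for_index(entries: List[IndexEntry]) -> List[IndexEntry]:
    """Newest first; undated pages keep their input order at the end."""
    dated = sorted((e for e in entries if e.date),
                   key=lambda e: e.date, reverse=True)
    undated = [e for e in entries if not e.date]
    return dated + undated


def exit_code(build_ok: bool) -> int:
    """Translate a build() result into a process exit status."""
    return 0 if build_ok else 1
```

A caller can then write sys.exit(exit_code(builder.build())) at the very edge of the program, while tests call build() directly and assert on the returned bool.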

Divergence #5: Line Count vs. Completeness

The smallest implementation (Big Pickle, 410 lines) covered every prompt requirement. The largest (DeepSeek, 583 lines) added features beyond the prompt (blockquotes, language-tagged code blocks). Mimo (566 lines) also added UX extras (date-sorted index, short CLI flags).

Model	Lines	Extra features	Missing features
Big Pickle	410	index.md collision handling	No --output flag
DeepSeek	583	Blockquotes, lang tags, custom exceptions	No --output flag
Mimo 2.5	566	--output/-o flags, date-sorted index	No HTML escaping, no index.md handling

More code did not mean more completeness. Big Pickle was 30% smaller than DeepSeek but handled the index.md edge case that Mimo missed entirely.
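
The size comparison is simple arithmetic on the line counts from the table. The helper name percent_smaller is just for illustration.

```python
LINES = {"Big Pickle": 410, "DeepSeek V4 Flash": 583, "Mimo 2.5": 566}


def percent_smaller(smaller: int, larger: int) -> float:
    """How much smaller one line count is than another, in percent."""
    return round((1 - smaller / larger) * 100, 1)


# Big Pickle relative to DeepSeek: roughly 30% fewer lines.
print(percent_smaller(LINES["Big Pickle"], LINES["DeepSeek V4 Flash"]))
```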

Failure Case: Nemotron Ultra

The Nemotron Ultra folder is empty. Unlike the game experiment's Nemotron 3 Ultra, which at least captured an upstream timeout error message, this run produced nothing at all — no code, no error log, no artifact. This is a harder failure to diagnose and a reminder that generation reliability is a separate concern from generation quality.

Patterns Across Both Experiments

The game experiment (round 1) and the SSG experiment (round 2) share some common findings:

1. Surface compliance is the easy part

Every successful model in both experiments met the stated requirements. The divergences are in edge-case handling, subtle bugs, and architectural choices — not in whether the program runs.

2. Ambiguity drives divergence

The game prompt's "exactly 3 choices" was interpreted differently by every model. The SSG prompt's "output/ directory" got two interpretations. When a prompt leaves room for interpretation, models fill that room differently.

3. Security details are fragile

In the game experiment, Mimo 2.5 had hidden mutable global state that would cause bugs after repeated play. In the SSG experiment, Mimo 2.5 missed HTML escaping. These are both "works at first glance, fails under real use" defects, although only the second is strictly a security issue.
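
For readers who have not hit the mutable-global-state class of bug, here is a purely hypothetical illustration. It is not taken from any model's output; it only shows why such code passes a single playthrough and fails on the second.

```python
# Hypothetical illustration, not code from either experiment.
_EVENT_POOL = ["trap", "treasure", "monster"]  # module-level, shared by every game


def play_leaky():
    """Consumes the shared pool, so a second game starts with no events."""
    seen = []
    while _EVENT_POOL:
        seen.append(_EVENT_POOL.pop())
    return seen


def play_isolated(event_pool=("trap", "treasure", "monster")):
    """Copies its input, so every game starts from the same state."""
    pool = list(event_pool)
    seen = []
    while pool:
        seen.append(pool.pop())
    return seen
```

The first play_leaky() call sees three events and the second sees none, while play_isolated() returns the same result on every call. A single functional test would never notice the difference.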

4. Architecture ≠ correctness

The most architecturally clean implementation in the game experiment (Mimo 2.5's GameState/EventHandler/ChoiceGenerator/GameEngine separation) also had the most subtle bugs. In the SSG experiment, Mimo 2.5's tidy dataclass-based design still shipped without body escaping or index.md handling. Clean code is not a guarantee of correct code.

5. Failures matter

Both experiments had a model that produced no usable output: an upstream timeout message in the game run, and an empty folder in the SSG run. These are not just null data points; they highlight reliability constraints in the generation pipeline itself.

Running the same experiment across two very different domains (a terminal game and a static site generator) reveals patterns that a single run couldn't. The same types of divergence — edge-case handling, architectural taste, security awareness — showed up in both.

The raw artifacts and full analysis are available at:

https://github.com/kotrats/same-prompt-multiple-local-models

The repo now contains both experiments with unedited model outputs, detailed comparison tables, architecture diagrams, and per-model observations. If you run your own experiments or have ideas for the next prompt to test, I'd love to hear about it.

All model outputs preserved as raw artifacts. No edits were made to any generated code. Analysis based on source inspection, syntax parsing, and functional testing.

DEV Community