Tebogo Tseka

Posted on Mar 30 • Originally published at tebogo.cloud

5 Models, 467 Actions, 1 Winner — What We Learned Comparing LLMs on Real Code Generation

#ai #testing #webdev #llm

We ran five AI models through the same website-building task and scored 467 individual actions. Each run produced a complete deployable website: not a code snippet, not a function, not a patch. A real site with HTML, CSS, JavaScript, and assets.

The question: can cheaper models match Claude Sonnet for production code generation?

The short answer is no. The longer answer is more interesting.

The Models

Five models, spanning roughly a 12x range in input price and a 39x range in output price:

Model	Provider	Input/1M Tokens	Output/1M Tokens	Why We Tested It
Claude Sonnet 4.6	OpenRouter	$3.00	$15.00	Assumed gold standard
Claude Haiku 4.5	OpenRouter/CLI	$1.00	$5.00	Same family, lower tier
Kimi K2.5	OpenRouter	$0.42	$2.20	Moonshot AI's latest
DeepSeek V3.2	OpenRouter	$0.26	$0.38	Budget option
DeepSeek R1	OpenRouter	$0.70	$2.50	Reasoning-focused

These five represent distinct price tiers and architectural approaches. Sonnet and Haiku share a lineage. Kimi is multimodal. DeepSeek V3.2 optimises for cost. R1 optimises for step-by-step reasoning.

The 16-Action Pipeline

Each model received the same template skeleton and business requirements, then applied 16 sequential actions:

#	Action	Category
1	apply-colours	Brand
2	swap-fonts	Brand
3	replace-header-logo	Brand
4	replace-footer-logo	Brand
5	replace-favicon	Brand
6	replace-hero-bg	Images
7	replace-section-bgs	Images
8	update-hero-text	Content
9	update-about-text	Content
10	update-contact	Content
11	apply-hero-layout	Layout
12	apply-sections-layout	Layout
13	add-seo-meta	Technical
14	add-structured-data	Technical
15	add-accessibility	Technical
16	verify-contrast	Quality

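Same requirements spec, same gold standard, same judge for all models. Each action is scored out of 10 using a violation-deduction model (see Part 1); deductions can push a score below zero when an action leaves the page worse than it started. Maximum possible: 160 points.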

Actions are sequential — each builds on the previous output. Errors compound. This is deliberate: it mirrors how agents work in production.
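The sequential structure can be sketched as a simple loop. This is a minimal illustration, not the actual harness: `apply_action` and `judge` are hypothetical callables standing in for the model call and the judge call, and only the action names and the 160-point maximum come from the table above.

```python
from typing import Callable

# Action names in pipeline order, copied from the table above.
ACTIONS = [
    "apply-colours", "swap-fonts", "replace-header-logo", "replace-footer-logo",
    "replace-favicon", "replace-hero-bg", "replace-section-bgs", "update-hero-text",
    "update-about-text", "update-contact", "apply-hero-layout",
    "apply-sections-layout", "add-seo-meta", "add-structured-data",
    "add-accessibility", "verify-contrast",
]
MAX_SCORE = 10 * len(ACTIONS)  # 16 actions x 10 points = 160


def run_pipeline(
    site: str,
    apply_action: Callable[[str, str], str],  # hypothetical: (action, site) -> new site
    judge: Callable[[str, str], float],       # hypothetical: (action, site) -> score
) -> float:
    """Apply every action in order, feeding each output into the next step."""
    total = 0.0
    for action in ACTIONS:
        site = apply_action(action, site)  # a mistake here is inherited by later steps
        total += judge(action, site)
    return total


# Stub usage: a "model" that appends a marker and a "judge" that always awards 10.
perfect = run_pipeline(
    "<html></html>",
    apply_action=lambda action, site: site + f"<!-- {action} -->",
    judge=lambda action, site: 10.0,
)
print(perfect, "of", MAX_SCORE)
```

Because `site` is overwritten on every iteration, an early bad edit stays in the input for every later action, which is the compounding effect described above.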

The Results

Model	Avg Score	95% CI	% of Max	Std Dev	Runs
Claude Sonnet 4.6	149.5	N/A†	93.4%	0.0†	21
Kimi K2.5	108.2	[92.7, 123.7]	67.6%	20.1	9
Claude Haiku 4.5	107.7	[91.0, 124.4]	67.3%	13.4	5
DeepSeek V3.2	94.0	[78.0, 110.0]	58.8%	28.9	15
DeepSeek R1	41.9	N/A (n=2)	26.2%	3.3	2

† Sonnet's score comes from gold standard evaluation rather than the 16-action pipeline; see the caveats below.

Sonnet 4.6:    ████████████████████████████████████████████████████████ 149.5 (93%)
Kimi K2.5:     ████████████████████████████████████████                108.2 (68%)  ±15.5
Claude Haiku:  ████████████████████████████████████████                107.7 (67%)  ±16.7
DeepSeek V3.2: ██████████████████████████████████                       94.0 (59%)  ±16.0
DeepSeek R1:   ███████████████                                          41.9 (26%)  n=2
               |---------|---------|---------|---------|---------|
               0        30        60        90       120       150
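
The intervals in the results table can be approximately reproduced from the mean, standard deviation, and run count. The sketch below assumes a two-sided 95% Student-t interval, which the article does not state explicitly; the critical values are standard t-table lookups hard-coded by degrees of freedom. Under that assumption the output agrees with the table to within about 0.1 points.

```python
import math

# Two-sided 95% Student-t critical values from a standard t-table,
# keyed by degrees of freedom (runs - 1).
T_CRIT_95 = {4: 2.776, 8: 2.306, 14: 2.145}

# (mean, standard deviation, runs) copied from the results table.
RESULTS = {
    "Kimi K2.5": (108.2, 20.1, 9),
    "Claude Haiku 4.5": (107.7, 13.4, 5),
    "DeepSeek V3.2": (94.0, 28.9, 15),
}


def confidence_interval(mean, sd, n):
    """Return (low, high) for a 95% t-interval around the mean."""
    margin = T_CRIT_95[n - 1] * sd / math.sqrt(n)
    return mean - margin, mean + margin


def overlaps(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


intervals = {name: confidence_interval(*row) for name, row in RESULTS.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: [{low:.1f}, {high:.1f}]")
```

Calling `overlaps` on any pair of these three intervals returns `True`, which is why the middle of the ranking cannot be separated at these sample sizes.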

The Honesty Moment

Before interpreting these rankings, three caveats:

Sonnet was measured differently. Its 149.5 score comes from gold standard evaluation (automated quality signals against 21 templates), not the same 16-action pipeline as the alternatives. The 41-point gap between Sonnet and the field may be partly methodological. We're fixing this in Round 2.

Rankings 2–4 are noise. Kimi's confidence interval is [93, 124]. Haiku's is [91, 124]. DeepSeek V3.2's is [78, 110]. These overlap heavily. With current sample sizes, we cannot say which of these three is genuinely better. What we CAN say: all three cluster around 59–68% of max, well below Sonnet's 93%.

Sample sizes are small. 2–15 runs per model. We need n≥16 for 80% statistical power to detect a 20-point difference. The rankings are directionally useful but not statistically conclusive for the middle tier.
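
The n≥16 figure is consistent with the usual normal-approximation formula for comparing two means, n = 2·((z_α/2 + z_power)·σ/δ)². The sketch below assumes a per-run standard deviation of 20 points (close to Kimi's 20.1); that value is an assumption, since the article does not say which standard deviation was used.

```python
import math
from statistics import NormalDist


def runs_per_model(delta, sd, alpha=0.05, power=0.80):
    """Runs needed per model to detect a `delta`-point difference in means.

    Normal-approximation formula for a two-sided, two-sample comparison
    with equal group sizes and a shared standard deviation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# Assumed standard deviation of 20 points, target difference of 20 points.
print(runs_per_model(delta=20, sd=20))
```

The required sample grows with the square of the standard deviation, so a model as variable as DeepSeek V3.2 (28.9) would need considerably more runs to reach the same power.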

Per-Template Performance

Template	Sonnet	Kimi	Haiku	DeepSeek V3.2	Best Alt % of Sonnet
AI Page Builder (SaaS)	149.5	134.8	124.2	99.5	90.2%
Association Corporate	149.5	126.0	120.2	105.5	84.3%
Safari Lodge	149.5	—	108.2	120.5	80.6%
SaaS Product	149.5	112.0	89.5	112.0	74.9%
Gala Event	149.5	98.8	96.0	86.8	66.1%

The AI Page Builder template is the closest contest — Kimi reaches 90.2% of Sonnet's quality. The Gala Event template is the widest gap at 66.1%. Template complexity matters: simpler structures with fewer sections are easier for all models.
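
The last column of the table is the best alternative's score divided by Sonnet's 149.5. A short sketch of that arithmetic, using only the numbers in the table (the function name is illustrative):

```python
SONNET = 149.5

# Per-template averages for the alternatives, copied from the table above.
TEMPLATES = {
    "AI Page Builder (SaaS)": {"Kimi": 134.8, "Haiku": 124.2, "DeepSeek V3.2": 99.5},
    "Association Corporate": {"Kimi": 126.0, "Haiku": 120.2, "DeepSeek V3.2": 105.5},
    "Safari Lodge": {"Haiku": 108.2, "DeepSeek V3.2": 120.5},  # no Kimi score listed
    "SaaS Product": {"Kimi": 112.0, "Haiku": 89.5, "DeepSeek V3.2": 112.0},
    "Gala Event": {"Kimi": 98.8, "Haiku": 96.0, "DeepSeek V3.2": 86.8},
}


def best_alternative(scores):
    """Return (model, percent of Sonnet) for the top-scoring alternative.

    Ties go to whichever model is listed first.
    """
    name = max(scores, key=scores.get)
    return name, round(100 * scores[name] / SONNET, 1)


for template, scores in TEMPLATES.items():
    model, pct = best_alternative(scores)
    print(f"{template}: {model} at {pct}% of Sonnet")
```

Note that the best alternative changes by template: Kimi leads on most, but DeepSeek V3.2 leads on Safari Lodge and ties Kimi on SaaS Product.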

Action Difficulty: What's Easy and What's Impossible

This is where the data gets interesting. Not all 16 actions are created equal:

Rank	Action	Avg Score	Category
1	add-accessibility	9.4/10	Technical
2	add-seo-meta	9.2/10	Technical
3	update-about-text	8.8/10	Content
4	replace-favicon	8.6/10	Brand
...	...	...	...
14	apply-colours	5.2/10	Brand
15	apply-hero-layout	2.8/10	Layout
16	apply-sections-layout	-0.8/10	Layout

The pattern is clear when you group by category:

Category	Avg Score	Observation
Technical (SEO, a11y, schema)	8.7/10	Models follow structured specs reliably
Content (text updates)	7.7/10	Good when verbatim rules enforced
Brand (colours, fonts, logos)	6.8/10	Moderate — CSS variable application is fragile
Images (hero, section bgs)	6.2/10	All models hallucinate descriptions as src
Layout (hero, sections)	1.0/10	Consistently catastrophic

Structured, well-defined tasks score high. Spatial, visual tasks score low. Same models, wildly different results depending on task type.

The Gap Analysis: Where Alternatives Fall Behind

Comparing each action against Sonnet reveals where the quality gap actually lives:

Action	Sonnet	Kimi	Haiku	DS-V3	Avg Gap
add-accessibility	9.5	9.6	9.8	9.2	+0.0
replace-favicon	9.0	9.0	8.8	8.4	-0.3
add-seo-meta	10.0	9.4	9.6	9.0	-0.7
...
apply-colours	9.5	6.2	5.8	6.5	-3.3
apply-hero-layout	9.0	4.7	3.2	2.8	-5.4
apply-sections-layout	9.0	1.6	-3.8	-1.5	-10.2

Three actions account for most of the quality gap:

apply-sections-layout (-10.2 point gap) — alternatives actively break layouts. Haiku scores -3.8 on average, meaning it makes pages significantly worse.
apply-hero-layout (-5.4 point gap) — layout transformation is fundamentally hard for all models below Sonnet.
apply-colours (-3.3 point gap) — CSS variable propagation is inconsistent. Models update some variables but miss gradients, overlays, and header tints.

Three actions show little or no gap:

add-accessibility (+0.0) — every model follows accessibility specs equally well.
replace-favicon (-0.3) — simple file replacement.
add-seo-meta (-0.7) — structured metadata is a universal strength.

This has a practical implication: if you could route easy tasks to cheap models and hard tasks to Sonnet, you could potentially cut costs without cutting quality on the tasks that matter. More on this in Part 4.
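
As a rough sketch of what such routing could look like, the snippet below recomputes the "Avg Gap" column from the rows shown in the table and sends an action to a cheap model only when the gap stays within a tolerance. The `MAX_GAP` threshold and the `"cheap"`/`"sonnet"` labels are hypothetical, and the Sonnet figures carry the measurement caveat discussed earlier.

```python
# Average per-action scores copied from the gap table: (Sonnet, Kimi, Haiku, DS-V3).
GAP_TABLE = {
    "add-accessibility": (9.5, 9.6, 9.8, 9.2),
    "replace-favicon": (9.0, 9.0, 8.8, 8.4),
    "add-seo-meta": (10.0, 9.4, 9.6, 9.0),
    "apply-colours": (9.5, 6.2, 5.8, 6.5),
    "apply-hero-layout": (9.0, 4.7, 3.2, 2.8),
    "apply-sections-layout": (9.0, 1.6, -3.8, -1.5),
}

MAX_GAP = 1.0  # hypothetical tolerance: points of quality we accept losing


def average_gap(row):
    """Mean score of the alternatives minus Sonnet's score."""
    sonnet, *alternatives = row
    return sum(alternatives) / len(alternatives) - sonnet


def route(action):
    """Pick a model tier for an action based on its measured gap."""
    return "cheap" if average_gap(GAP_TABLE[action]) >= -MAX_GAP else "sonnet"


for action, row in GAP_TABLE.items():
    print(f"{action:24s} gap={average_gap(row):+.1f} -> {route(action)}")
```

With a one-point tolerance, the three low-gap actions go to the cheap tier and the colour and layout actions stay on Sonnet. Where to set the threshold is a judgment call that depends on how much quality loss is acceptable per action.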

The Action Heatmap

Here's every model scored on every action — the full picture:

                    Kimi  Haiku  DS-V3  DS-R1
add-accessibility   9.6   9.8    9.2    8.1
add-seo-meta        9.4   9.6    9.0    6.8
update-about-text   9.2   8.8    8.6    0.6
replace-favicon     9.0   8.8    8.4    6.0
replace-header-logo 8.2   9.2    7.4    4.8
add-structured-data 7.8   8.8    7.0    5.1
update-hero-text    7.6   7.7    7.2    1.6
update-contact      7.4   7.6    7.0   -1.2
swap-fonts          7.6   7.0    6.8    2.1
replace-hero-bg     7.3   6.2    6.5    2.8
verify-contrast     6.4   7.8    5.8    4.8
replace-section-bgs 7.6   2.4    5.5    3.0
replace-footer-logo 6.0   8.6    4.8    2.0
apply-colours       6.2   5.8    6.5    0.2
apply-hero-layout   4.7   3.2    2.8   -3.9
apply-sections-lyt  1.6  -3.8   -1.5   -2.5

Notice DeepSeek R1's column. It scores -1.2 on contact updates and -3.9 on hero layout. These aren't just bad scores — they mean the model made the page actively worse than the starting template on basic tasks.

The Reasoning Model Trap

DeepSeek R1 scored 26.2%, far below every other model. Across two runs, it averaged 41.9/160. For context, 41.9 points is what about four perfect actions would earn; in the heatmap above, R1 reaches 6/10 or better on only three actions and goes negative on three others.

Why? R1 is a reasoning model. It's optimised for step-by-step logical deduction — mathematical proofs, multi-hop reasoning, chain-of-thought problem solving. Code generation is not reasoning. It's pattern completion with spatial awareness.

R1 spent tokens "thinking" about CSS instead of writing it. Its chain-of-thought preambles consumed context window without producing better output. On layout tasks, it reasoned its way into worse solutions than models that simply pattern-matched from training data.

The lesson: match the model architecture to the task type. Reasoning models are the wrong tool for code generation. This seems obvious in hindsight, but R1's pricing ($0.70/$2.50) sits between Kimi and Haiku, so it looks like a mid-tier option until you run the evaluation.

The Variance Problem

Average scores tell half the story. The other half is variance.

Model	Avg Score	Std Dev	Best Run	Worst Run	Range
Claude Haiku	107.7	13.4	~121	~94	27
Kimi K2.5	108.2	20.1	~128	~88	40
DeepSeek V3.2	94.0	28.9	120.5	25.8	95

The ~ figures for Haiku and Kimi correspond to the mean plus or minus one standard deviation, so read them as approximate bounds rather than observed best and worst runs.

Haiku is the most consistent model: you know what you're getting. Its standard deviation (13.4) is about two-thirds of Kimi's and less than half of DeepSeek V3.2's.

DeepSeek V3.2's variance is remarkable. Its best run (120.5) beats Haiku's average (107.7). Its worst run (25.8) is catastrophic, worse than R1's average. Same model, same template, same requirements, 95-point swing.

For production systems, unpredictable quality is worse than consistently mediocre quality. A restaurant that's amazing 50% of the time and terrible 50% isn't a good restaurant. Haiku's consistency is a genuine advantage that doesn't show up in averages.
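
One way to make consistency visible alongside the average is to compare a pessimistic "floor" (mean minus one standard deviation) and the coefficient of variation. Both are generic summary statistics chosen here for illustration, not metrics the evaluation itself reports; the inputs come from the variance table.

```python
# (average score, standard deviation) copied from the variance table.
VARIANCE = {
    "Claude Haiku": (107.7, 13.4),
    "Kimi K2.5": (108.2, 20.1),
    "DeepSeek V3.2": (94.0, 28.9),
}


def floor_score(mean, sd, k=1.0):
    """Pessimistic score: the mean minus k standard deviations."""
    return mean - k * sd


def coefficient_of_variation(mean, sd):
    """Standard deviation relative to the mean (lower is steadier)."""
    return sd / mean


# Rank by floor rather than by average.
ranked = sorted(VARIANCE, key=lambda m: floor_score(*VARIANCE[m]), reverse=True)
for model in ranked:
    mean, sd = VARIANCE[model]
    print(f"{model}: floor={floor_score(mean, sd):.1f}, "
          f"cv={coefficient_of_variation(mean, sd):.2f}")
```

Ranked this way, Haiku moves ahead of Kimi even though Kimi's average is slightly higher, which is the advantage that averages alone hide.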

What We'd Do Differently

This was an exploratory evaluation — designed to identify patterns, not prove rankings. For Round 2, we're addressing three issues:

Run Sonnet through the same pipeline. The gold standard scoring method makes Sonnet's score non-comparable. In Round 2, Sonnet runs the same 16-action pipeline as every other model. Same judge, same conditions, same denominator.

Increase sample sizes. Minimum 16 runs per model across the same template set. That gives us 80% statistical power to detect a 20-point difference at alpha=0.05. The goal is confidence intervals tight enough to separate the middle tier, if the differences are real.

Calibrate the judge. Our Claude Opus judge scores Claude models. There's an obvious bias risk. Round 2 will score a subset with a second judge model and compute inter-rater agreement. We'll also blind the judge by stripping model-identifying patterns from outputs.
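
For the judge calibration step, two simple agreement checks are the correlation between the two judges' scores and the average signed difference (which shows whether one judge is systematically more generous). The sketch below uses invented placeholder scores purely for illustration; they are not results from this evaluation, and the choice of these two measures is an assumption rather than the planned Round 2 method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_signed_difference(xs, ys):
    """Average of (x - y); positive means the first judge scores higher."""
    return sum(x - y for x, y in zip(xs, ys)) / len(xs)


# Invented placeholder scores for the same six actions, for illustration only.
primary_judge = [9.0, 7.5, 3.0, 8.0, -1.0, 6.0]
second_judge = [8.5, 7.0, 3.5, 7.0, -2.0, 6.5]

print(f"correlation: {pearson(primary_judge, second_judge):.2f}")
print(f"mean signed difference: {mean_signed_difference(primary_judge, second_judge):+.2f}")
```

A high correlation with a consistently positive difference on one model family's outputs would be the pattern to watch for, since it would suggest the judges agree on ordering but one favours that family.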

Key Takeaways

No model matches Sonnet. The gap is directionally clear even with measurement caveats. For client-facing output where quality is non-negotiable, Sonnet remains the production choice.

The middle tier is a tie. Kimi, Haiku, and DeepSeek V3.2 are statistically indistinguishable. Pick based on secondary factors: Haiku for consistency, Kimi for peak performance, DeepSeek for cost.

Task type matters more than model choice. The difference between the easiest action (9.4/10) and the hardest (-0.8/10) is 10.2 points, larger than the difference between any two pipeline-tested models on the same action (at most 8.8 points in the heatmap). If you optimise which tasks you give to AI rather than which AI you use, you'll see bigger quality gains.

Reasoning models don't generate code well. R1's architecture is wrong for this task. Don't pick a model based on its benchmark scores on reasoning tasks if your workload is code generation.

Variance is a feature, not noise. DeepSeek V3.2 is the cheapest option but the least predictable. Haiku costs roughly 4x more on input and 13x more on output but delivers consistent results. The reliability premium is real.

This is part 2 of a 7-part series documenting how we built an evaluation framework for AI code generators, tested 5 models across 467 real code generation tasks, and turned the results into production improvements.

Previous: Beyond Text: How We Built an Evaluation Framework for Multi-File AI Outputs
Next: Building an LLM Judge That Doesn't Lie to You


Top comments (1)

Max Quimby • Apr 2

This is one of the most honest AI benchmarking pieces I've read. The action difficulty breakdown is the real gem here — the finding that layout tasks score 1.0/10 while structured technical tasks hit 8.7/10 perfectly matches what I've seen building multi-step agent pipelines.

The implication you hint at near the end (routing easy tasks to cheap models, hard tasks to Sonnet) is exactly where production systems are heading. We've been experimenting with a similar approach — running a lightweight classifier that scores task complexity, then dispatches to the appropriate model tier. The savings are significant, but the tricky part is building confidence in the classifier itself. Too aggressive and you send a layout task to Haiku and get -3.8 scores. Too conservative and you're burning Sonnet tokens on accessibility tags.

One thing I'd love to see in Round 2: how do error compounding rates differ across models? Since actions are sequential, does a single early failure cascade differently for Kimi vs. DeepSeek? That could change the routing calculus entirely — reliability of early actions might matter more than average score.

Great methodology. Looking forward to the next round with normalized Sonnet comparison.