DEV Community: Jarek

Deterministic where you can, model where you must, human for taste

Jarek — Fri, 31 Jul 2026 06:45:00 +0000

Last time I wrote about the judge that scores my benchmark, I left one thing out. That post was about adding a model to the scoring. It was not about the fact that the model is the last thing I reach for in this machine.

Evaluating an AI coding model is not one number. It is several different questions: did it meet the spec, does the thing actually run, does it look good, is the code clean, and did the agent behave sensibly. Each of those questions has a different natural judge. So in Refio I split scoring into three layers: a deterministic judge (measures what a machine can measure), strong LLM judges (read the code and score what a machine cannot weigh), and a human (visual taste and the final call). The whole thing runs on the same agent engine the production plugin uses, just headless. The result: model comparisons you can actually trust, because they are multi-dimensional, repeatable, and reflect how the tool really behaves.

"Works on my machine" does not scale to 480 runs

Anyone who has plugged a new model into their agent knows the moment. You fire off a prompt, the model spits out an app, you open it in a browser and say "huh, not bad" or "nope, garbage". You try again and get something different. You plug in the next model, and an hour later all you have is a vague "the first one was probably better". That is not evaluation. That is guessing in good faith.

Leaderboards do not save you, because they measure something else: how a model handles closed, self-contained puzzles. They do not tell you whether your agent, with your tools and your context budget, will get something out of it that opens and does not flood the console with errors. I have that problem in a sharp form while building the plugin, because Refio only makes sense if I can honestly say: this local model, free, on your GPU, is good enough for this kind of task, and that one is not.

Why one number lies

Take a real task from my catalogue: a simple 3D editor in the browser, single HTML file - scene view, adding and moving objects, a properties panel, zero console errors. How do you score that with one digit? You cannot, because several independent questions live inside it:

spec compliance - can you actually add an object, is there a properties panel, does the file name match,
works out of the box - does it open without manual fixing, does it throw errors,
look - this is taste, not measurement,
code quality and structure - clean, or a tangle with dead functions,
logic correctness - does it hold up when you read the code. A screen can lie beautifully,
agent logic - did the agent check files, generate, verify, summarise. Or did it wander.

A model can crush one dimension and faceplant in another - a great look on top of broken logic is a daily occurrence here. On top of that comes non-determinism: the same prompt twice gives you two different files.

The principle this stands on

Do not judge everything with one tool. Do not call a model where a plain function will do, and do not pretend a regex can weigh taste.

Look at the extremes. "Does this page render?" has a binary, machine-checkable answer; asking a model "do you think it renders?" injects noise where you could have had certainty. Just render it. In the other direction, "does this look good?" is an aesthetic judgement by definition - a regex will not weigh it, and even a strong model does it worse than a human with taste.

Hence three layers.

Layer 1: the judge with no opinions

The first layer has nothing to do with AI, and an important caveat up front: it is a hint, not a result. Those scores sit next to an entry while I am looking at it, and that is where their role ends - they never make it into the final results. It measures three things you can compute objectively from an e2e run.

Compliance through needles. For every task I define a set of needles - content markers that must appear in the delivered file. All of them hit is 1, some is 0.5, none is 0. A needle is either plain text or a regex, so the rule is hard and repeatable. This is where people usually ask "but there are no unit tests here, how do you measure anything?" - exactly like this. When the artifact is an HTML file rather than a function with a signature, you cannot write classic tests, but you can say what must be in it.
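A minimal sketch of the needle rule, assuming a needle is just a pattern plus a flag saying whether it is a regex. The `Needle` class, the function name and the example markers are hypothetical, not Refio's actual code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Needle:
    """A content marker: plain text by default, a regex when is_regex is set."""
    pattern: str
    is_regex: bool = False

    def hits(self, content: str) -> bool:
        if self.is_regex:
            return re.search(self.pattern, content) is not None
        return self.pattern in content


def compliance_score(content: str, needles: list[Needle]) -> float:
    """All needles hit -> 1.0, some -> 0.5, none -> 0.0."""
    if not needles:
        raise ValueError("a task needs at least one needle")
    hit_count = sum(needle.hits(content) for needle in needles)
    if hit_count == len(needles):
        return 1.0
    return 0.5 if hit_count > 0 else 0.0


# Hypothetical needles for the 3D editor task, not the real catalogue entry.
EDITOR_NEEDLES = [
    Needle("<canvas"),
    Needle(r"function\s+addObject\s*\(", is_regex=True),
]

html = "<canvas id='scene'></canvas><script>function addObject(kind) {}</script>"
print(compliance_score(html, EDITOR_NEEDLES))
```

The point of the design is that the rule has no judgement in it: the same file and the same needles always give the same score.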

Works out of the box through an actual render. The file goes into a headless browser. Renders clean - 1, renders with console errors - 0.5, does not render at all - 0. Zero interpretation. This layer grew directly out of what I wrote about in the post on the code that kills the browser: once you have seen an infinite draw loop lock up a tab, you stop trusting a screenshot as proof that something "works".

Agent logic through status and tool order. The run has to finish with a SUCCESS status, and if a task requires a specific tool order, I check whether it is a subsequence of the actual calls - they do not have to be adjacent, but the order has to hold. SUCCESS with the right order is 1, with the wrong one 0.5, no SUCCESS is 0.
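The subsequence rule is easy to get subtly wrong, so here is a small sketch of it. The tool names are hypothetical and the function names are mine; only the SUCCESS status and the 1 / 0.5 / 0 scoring come from the rule above.

```python
from collections.abc import Sequence


def is_subsequence(required: Sequence[str], actual: Sequence[str]) -> bool:
    """True if `required` appears in `actual` in order, gaps allowed."""
    calls = iter(actual)
    # `tool in calls` consumes the iterator up to the first match,
    # so each required tool must be found after the previous one.
    return all(tool in calls for tool in required)


def agent_logic_score(status: str, actual: Sequence[str],
                      required: Sequence[str] = ()) -> float:
    """SUCCESS with the right order -> 1.0, wrong order -> 0.5, no SUCCESS -> 0.0."""
    if status != "SUCCESS":
        return 0.0
    return 1.0 if is_subsequence(required, actual) else 0.5


# Hypothetical tool calls recorded during one run.
CALLS = ["list_files", "read_file", "write_file", "verify"]

print(agent_logic_score("SUCCESS", CALLS, ["list_files", "write_file", "verify"]))
print(agent_logic_score("SUCCESS", CALLS, ["verify", "write_file"]))
print(agent_logic_score("FAILED", CALLS, ["list_files"]))
```

A task with no required order passes an empty `required`, which is a subsequence of anything, so it scores on status alone.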

All of this is free, instant, and one hundred percent repeatable. Before I look at anything, I already know whether the page came up and whether the agent took the right path. And importantly: it deliberately does not touch look or code quality, because those cannot be measured mechanically.

Layer 2: models, but only now

The second layer takes what a machine cannot weigh but a careful read of the code can: code structure and logic correctness. This is where judgement needs intelligence, so this is where I finally call a model (through the Claude Code and Codex agents). I described the construction in the previous post - an evidence folder instead of "rate this 1 to 10", screenshots at three points in time, clicking through controls, blind scoring. Two things follow directly from the layering.

I aggregate with the median, not the mean. The median shrugs off one judge who went off the rails; the mean politely folds that excursion into the result. The known sins of model judges (verbosity bias, position sensitivity, inconsistency across runs) do not disappear, but they stop deciding once you have two of them and take the middle. Which is why the judge here is an advisor, not a final authority - it tells me where to look, and the ranking is still signed by me.
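To make the difference concrete, a tiny sketch with made-up verdicts on a hypothetical scale. One caveat worth keeping in mind: with exactly two scores the median and the mean are the same number, so the robustness shown here only appears from three scores up.

```python
from statistics import mean, median


def aggregate(scores: list[float]) -> float:
    """Aggregate judge scores with the median."""
    if not scores:
        raise ValueError("no judge scores to aggregate")
    return median(scores)


# Hypothetical verdicts; the third judge went off the rails.
verdicts = [7.0, 7.5, 2.0]

print(aggregate(verdicts))  # 7.0, the outlier is ignored
print(mean(verdicts))       # 5.5, the outlier drags the result down
```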

LLM judges do not score agent logic. Not because they cannot - because how the agent worked live cannot be reconstructed from a dead file. That is a question about process, not product. A good evaluation system knows what its judges should not touch.

Layer 3: the human, trimmed to the minimum

The last mile is mine, and I do exactly one thing in it that cannot be automated: I add look (the only criterion on a 0-2 scale, because that is where "better than I expected" belongs) and make the final call.

An automated step assembles a ready entry in a review queue - artifact, screenshots, duration, tokens, cost, deterministic hints, judge verdicts and a tentative pass/fail. I walk in, look, add my part, and either promote the entry into the results or reject it. Nothing that could be computed gets retyped by hand. On top of that there is a divergence marker: it shows the largest gap between my score and the judges' aggregate. It works both ways - either the judge got it wrong, or I missed something it caught while reading the code.
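The divergence marker can be sketched in a few lines. This assumes my scores and the judges' scores share criterion names and a scale, and that the judges' aggregate is the median; the criterion names and numbers below are hypothetical.

```python
from statistics import median


def divergence(human: dict[str, float],
               judges: list[dict[str, float]]) -> tuple[str, float]:
    """Return (criterion, gap) for the largest |human - median(judges)|."""
    gaps: dict[str, float] = {}
    for criterion, mine in human.items():
        votes = [verdict[criterion] for verdict in judges if criterion in verdict]
        if votes:  # skip criteria the judges did not score
            gaps[criterion] = abs(mine - median(votes))
    if not gaps:
        raise ValueError("no criterion scored by both sides")
    worst = max(gaps, key=lambda name: gaps[name])
    return worst, gaps[worst]


# Hypothetical entry: I disagree with the judges on logic, not on code quality.
my_scores = {"code_quality": 1.0, "logic": 0.5}
judge_scores = [
    {"code_quality": 1.0, "logic": 1.0},
    {"code_quality": 0.5, "logic": 1.0},
    {"code_quality": 1.0, "logic": 1.0},
]

print(divergence(my_scores, judge_scores))
```

The marker does not say who is right, only which criterion deserves a second look.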

Repeatability, and where the tasks come from

Since models are non-deterministic, one good result can be luck, so I measure stability separately. For me that is six runs per model per task - fewer than three and you cannot tell luck from form, more than six and it stops paying off across a dozen-plus models. For any group (task, model, environment) with at least two runs I compute the variance of scores and the similarity of the generated code. Incidentally, that second measure turned into its own post - it turns out every LLM writes the same Todo App, quite literally.
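A sketch of the score-variance half of that measurement. The grouping key and the two-run minimum come from the description above; the use of population variance (`pvariance`) and the run data are assumptions for illustration.

```python
from collections import defaultdict
from statistics import pvariance

Run = tuple[str, str, str, float]  # (task, model, environment, score)


def score_variance(runs: list[Run]) -> dict[tuple[str, str, str], float]:
    """Variance of scores per (task, model, environment) group with >= 2 runs."""
    groups: dict[tuple[str, str, str], list[float]] = defaultdict(list)
    for task, model, environment, score in runs:
        groups[(task, model, environment)].append(score)
    return {
        key: pvariance(scores)
        for key, scores in groups.items()
        if len(scores) >= 2  # a single run says nothing about stability
    }


# Hypothetical runs; model-b has only one run and is left out.
runs = [
    ("todo", "model-a", "local", 1.0),
    ("todo", "model-a", "local", 0.5),
    ("todo", "model-b", "local", 1.0),
]

print(score_variance(runs))
```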

The tasks themselves have a single source of truth: one description plus one prompt, from which a generator emits both the automated e2e scenario and the review task in the admin panel. I learned this the hard way - keep a task description in two places and a month later you have two different tasks under one name. Difficulty is spread on purpose: from Snake and Todo, through the 3D editor, up to single-file monsters like a modular Web Audio synth, a logic circuit simulator or a procedural voxel world. That is where weaker models simply break, which is the whole point.

The same engine, just without the windows

The most important part last. The benchmark does not call the model through a clean API under laboratory conditions - it drives it through exactly the same agent engine, the same tools, the same context pipeline and the same guardrails the production IntelliJ plugin uses. Just headless.

So I measure the model the way you will actually experience it in Refio: with its tool-use discipline and its behaviour inside the agent loop. That makes the benchmark useful twice over - it scores models, but it is also my own proving ground. When a guardrail kills a sensible run or a model falls into a loop, I see it here before anyone else does.

What this buys me

Honesty. A model cannot hide broken logic behind a pretty surface, because those are two criteria judged by two different judges.
Repeatability. The hints I start every entry with are deterministic, and stability is measured explicitly.
Realism. I measure the model in the same engine you use it in.
Cheap decisions. I drop in a new local model and know by the end of the afternoon whether it is worth shipping.

The whole philosophy fits in one sentence: deterministic where you can, model where you must, and a human for taste. Boringly simple, and it turns "the first one was probably better" into numbers I can trust.

Results live here: benchmark.refio.dev. If you are building your own evaluation harness, I am curious where you drew the line between "I will compute this myself" and "let a model score it". My hunch is that most people jump straight to layer two and pay for it in noise.

Quantization for the impatient - why int4 doesn't speed up image generation on GPU

Jarek — Wed, 15 Jul 2026 08:00:00 +0000

Recently I sat down with an old codebase and put together a quantization pipeline for diffusion models out of curiosity - SDXL, FLUX, SD3, some video. Just to see quantization in practice, since almost every "tutorial" online repeats the same thing: fewer bits = faster. Makes sense, right? Less to multiply, less work for the GPU, image pops out quicker.

Except... no. And this isn't a minor nuance, it's a fundamental misunderstanding.

I ran measurements on a laptop with RTX 4090 mobile (16 GB VRAM), Windows 11, CUDA 12.9, Python 3.14, diffusers 0.38, torch 2.5+. Real numbers, on a real card. Without that it's just "it runs faster on my machine".

Why quantize at all

Let's start from basics, because this is the first place things get muddled. Model quantization has two goals - and almost never both at once:

Fit the model in VRAM - weights go from fp16 (2 bytes/parameter) to int8 (1 byte) or int4 (half a byte) or fp4 (on the new RTX 50xx). A 12B parameter model drops from 24 GB in fp16 to 6 GB in int4. This is the main reason anyone does it at all.
Sometimes speed up inference - and here's where the minefield starts. Because "sometimes" means "only if your GPU has native tensor cores for that format". And most formats used by quantization libraries... don't have them.

Everything else in this article follows from point two. If you take only one thing from this post, make it this: quantize if you have to because of VRAM. Don't quantize for speed, because it'll almost always be slower.
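The VRAM arithmetic from the first goal, as a small sketch. It counts weights only, in decimal GB, and ignores the scales and zero-points a real quantized checkpoint also stores, as well as activations; the function name is mine.

```python
# Bytes per parameter for the formats mentioned above.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params: float, fmt: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[fmt] / 1e9


for fmt in ("fp16", "int8", "int4"):
    print(fmt, weight_gb(12e9, fmt))
```

For a 12B model this gives 24, 12 and 6 GB, which is the whole difference between fitting on a 16 GB card and not fitting at all.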

[Image: Full FP16]

Why fewer bits ≠ faster

The mechanics are simple once you see them. A GPU has physical, silicon-level MatMul instructions only for specific number formats. Everything else is software.

Architecture	Native tensor cores
Ampere (RTX 30xx, A100)	fp16, bf16, tf32, int8, int4
Ada Lovelace (RTX 40xx, incl. 4090 mobile)	+ fp8 (E4M3/E5M2)
Blackwell (RTX 50xx, B200)	+ fp4, mxfp8, mxfp4

Everything else - nf4, int2, int1, int3, fp4 on consumer Ada - has no native MatMul on any GPU available to mortals. Which means when a quantization library "promises" you NF4, what actually happens is this:

Weights sit in VRAM packed (e.g. 8 NF4 weights squeezed into one int32). VRAM saved, works here.
Right before the multiplication... we unpack them back to fp16, with a custom CUDA kernel doing the bitwise unpack.
MatMul runs normally in fp16, on the same tensor cores as the fp16 baseline.

So we save VRAM (weights are packed), but we add 10-30% overhead for dequantize on every step. The lower the bits, the more bitwise operations on unpack. More operations = slower.
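To show where that overhead comes from, here is the pack/unpack step in plain Python. Real libraries do this in a CUDA kernel over whole tensors; this sketch only shows the bit arithmetic. The nibble order and the 16-level lookup table are assumptions for illustration, and the table is evenly spaced, not the real NF4 table.

```python
def pack_4bit(codes: list[int]) -> int:
    """Pack eight 4-bit codes (0..15) into one 32-bit word, lowest nibble first."""
    if len(codes) != 8 or any(not 0 <= code <= 15 for code in codes):
        raise ValueError("expected exactly eight codes in 0..15")
    word = 0
    for position, code in enumerate(codes):
        word |= code << (4 * position)
    return word


def unpack_4bit(word: int) -> list[int]:
    """Inverse of pack_4bit: shift and mask out each nibble."""
    return [(word >> (4 * position)) & 0xF for position in range(8)]


# Hypothetical lookup table: 16 evenly spaced levels in [-1, 1].
LEVELS = [(index - 7.5) / 7.5 for index in range(16)]


def dequantize(word: int, scale: float) -> list[float]:
    """Unpack one word and map each code back to a float weight."""
    return [LEVELS[code] * scale for code in unpack_4bit(word)]


codes = [0, 15, 3, 12, 7, 8, 1, 14]
word = pack_4bit(codes)
print(hex(word), unpack_4bit(word) == codes)
print(dequantize(word, scale=0.02))
```

Every one of those shifts, masks and table lookups happens again on every denoising step, for every quantized layer, before the fp16 MatMul can start.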

For fp16 → int4 the savings are 4×, dequantize is relatively cheap - sometimes it roughly breaks even. For fp16 → int2 or int1 the savings are 8×-16×, but dequantize starts costing more than the multiplication itself. Result: on 1-bit weights, inference can be 3× slower than fp16.

[Image: quanto int8]

Benchmark table - so nobody has to take my word for it

SDXL 1024×1024, 25 steps, RTX 4090 mobile (Ada, 16 GB VRAM), same base SDXL 1.0 weights:

Preset	Time / image	Time vs fp16 (higher = slower)
fp16 / bf16 (baseline)	9 s	1.0×
bnb-int4	9 s	1.0× (int4 has tensor cores!)
bnb-nf4	13 s	1.4×
torchao-int8	17 s	1.9×
torchao-fp8	17 s	1.9×
hqq-8bit	13 s	1.4×
hqq-4bit	18 s	2.0×
hqq-3bit	52 s	5.8× (!)
hqq-2bit	24 s	2.7×
hqq-1bit	26 s	2.9×
bnb-int8	32 s	3.6×

Look at the first two rows. bnb-int4 is exactly as fast as fp16. Because Ampere/Ada have int4 tensor cores - weights go through natively, no on-the-fly dequantize.

Now look at bnb-int8 - 3.6× slower. Why? Because bnb-int8 isn't pure INT8 weight-only - it's mixed precision with outlier decomposition (Tim Dettmers' LLM.int8()). Some weights go in INT8, some in fp16, there are two separate multiplications, plus a bf16→fp16 cast between operations. It was designed for LLM stability, not speed. Works great for what it was built for, but for diffusion it's not a good "default" choice.

HQQ-3bit - 52 seconds instead of 9. Nearly six times slower than baseline (5.8×). So to "save VRAM" you pay almost six times the generation time on every image. If you have 1000 images to generate, that difference gets very real very fast.

The conclusion from the table is brutal: don't look at the bit count, look at whether your GPU has native MatMul for that format. That's the one and only heuristic that actually works.
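That heuristic fits in a lookup. The sets below are transcribed from the architecture table earlier in this post, read cumulatively (each "+" row adds to the one before); the dictionary layout and function name are mine, and this says nothing about whether a given library actually uses the native path.

```python
# Native tensor-core formats per architecture, per the table in this post.
NATIVE_FORMATS: dict[str, frozenset[str]] = {
    "ampere": frozenset({"fp16", "bf16", "tf32", "int8", "int4"}),
}
NATIVE_FORMATS["ada"] = NATIVE_FORMATS["ampere"] | {"fp8"}
NATIVE_FORMATS["blackwell"] = NATIVE_FORMATS["ada"] | {"fp4", "mxfp8", "mxfp4"}


def has_native_matmul(architecture: str, fmt: str) -> bool:
    """True if the table lists a native MatMul for this format."""
    return fmt.lower() in NATIVE_FORMATS[architecture.lower()]


for fmt in ("int4", "nf4", "fp8", "fp4"):
    print(fmt, has_native_matmul("ada", fmt))
```

If the answer is False, expect dequantize overhead on every step, whatever the bit count says.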

Five backends - what to use when

The diffusers/transformers ecosystem has five main quantization libraries. Each with its own character. Quick overview, because each deserves its own post, but you need a map:

1. bitsandbytes (`bnb`)

Widest support, simplest path. Supports int8 and 4-bit (NF4, INT4, FP4-E2M1). OOTB in diffusers via BitsAndBytesConfig.

bnb-int8 - mixed precision with outlier decomposition. Stable, but slow (see table).
bnb-nf4 - format designed for LLMs (normal weight distribution). Stable, but no tensor cores → dequant overhead.
bnb-int4 - classic INT4 weight-only. On Ampere/Ada hits int4 tensor cores, free speed.

[Image: torchao int8]

2. torchao

The official PyTorch path. Requires torchao≥0.7 + diffusers≥0.38. The package that will catch up to the rest of the ecosystem in 2026/2027 - but still a bit painful to use because the API keeps changing.

torchao-int8 - Int8WeightOnlyConfig. Stable everywhere.
torchao-int4 - Int4WeightOnlyConfig. Stable on the denoiser, unstable on the CLIP-G text encoder (more on that below).
torchao-fp8 - Float8WeightOnlyConfig. Unstable on text encoder. OK on UNet.
torchao-fp4 - Float8DynamicActivationInt4WeightConfig. Requires Blackwell. Won't run on consumer Ada.

Important gotcha: in diffusers 0.38, TorchAoConfig(quant_type=...) requires an instance of AOBaseConfig, not a string. Older diffusers accepted "int8wo". If something suddenly breaks after pip upgrade - this is probably why.

3. quanto (optimum-quanto)

Hugging Face's backend. Narrow precision support (int2, int4, int8, float8). The int2 mode is the rare part: among the backends here, only HQQ also goes that low. Just don't expect it to speed anything up :)

[Image: hqq 8bit]

4. HQQ (Half-Quadratic Quantization)

Calibration-free, arbitrary bit count. 1, 2, 3, 4, 8 bit. Replaces nn.Linear with HQQLinear at runtime.

Upsides: no calibration dataset (unlike GPTQ/AWQ), free choice of bits, runs on any GPU. Downsides: Python forward() path with no fused kernel (slow for low bits - see HQQ-3bit in the table), 1/2/3 bit produce visible artifacts on SDXL, 30+ second load times.

5. GGUF (community)

Format from llama.cpp, adapted by the community for FLUX/SD3 (e.g. city96/FLUX.1-dev-gguf repo).

[Image: bnb int4]

[Image: hqq 4bit]

Stability table: not everything quantizes equally well

Here's where it gets actually interesting - because some components handle quantization fine, while others... fall apart completely. Empirical observations on SDXL:

Component	bnb-int8	bnb-int4/nf4	torchao-int8	torchao-fp8	torchao-int4	quanto-int8	hqq-4/8bit
UNet	OK	OK	OK	OK	OK	OK	OK
text_encoder (CLIP-L)	OK	OK	OK	OK	questionable	OK	OK
text_encoder_2 (CLIP-G)	OK	OK	OK	NaN!	NaN!	OK	OK
VAE	avoid	avoid	avoid	avoid	avoid	avoid	avoid

Two observations worth noting.

First - CLIP-G hates fp8 and int4 weight-only. Why? CLIP-G is the big text encoder (roughly 0.7B parameters, about 1.4 GB in fp16) with significant dynamics outside the [-448, 448] range (the fp8 E4M3 range). Quantize the weights → lose the outliers → embeddings come out garbage → UNet computes noise → VAE outputs NaN → black image.

A black image is the classic symptom of NaN in latents. If you see it, first thought should be: "what did I break in the text encoder". That's where it is 90% of the time.

Second - VAE stays in fp16, always. It's a small component (~80M parameters, roughly 160 MB in fp16), but the operations (convolutions, upsampling) are very sensitive to noise. There's no point touching it - the VRAM savings are negligible and the quality degradation can be visible.

[Image: hqq 2bit]

[Image: quanto int2]

[Image: hqq 1bit]

Profiler output

sdf-xl-1.0:torchao-fp4   TextToImageGenerator:create_model  not finished  N/A           N/A            N/A           N/A                               1
sdf-xl-1.0:quanto-int8   TextToImageGenerator:create_model  9.67669 sec   9.67669 sec   9.67669 sec    9.67669 sec   4458246144                        1
sdf-xl-1.0:quanto-int8   TextToImageGenerator:generate      10.05025 sec  10.05025 sec  10.05025 sec   10.05025 sec  -12623872                         1
sdf-xl-1.0:quanto-int8   TextToImageGenerator:save_results  0.02003 sec   0.02003 sec   0.02003 sec    0.02003 sec   0                                 1
sdf-xl-1.0:torchao-int8  TextToImageGenerator:create_model  11.80200 sec  11.80200 sec  11.80200 sec   11.80200 sec  3877761024                        1
sdf-xl-1.0:torchao-int8  TextToImageGenerator:generate      13.60181 sec  13.60181 sec  13.60181 sec   13.60181 sec  10059776                          1
sdf-xl-1.0:torchao-int8  TextToImageGenerator:save_results  0.01174 sec   0.01174 sec   0.01174 sec    0.01174 sec   0                                 1
sdf-xl-1.0:hqq-8bit      TextToImageGenerator:create_model  29.19757 sec  29.19757 sec  29.19757 sec   29.19757 sec  4755091456                        1
sdf-xl-1.0:hqq-8bit      TextToImageGenerator:generate      16.60006 sec  16.60006 sec  16.60006 sec   16.60006 sec  4853760                           1
sdf-xl-1.0:hqq-8bit      TextToImageGenerator:save_results  0.01172 sec   0.01172 sec   0.01172 sec    0.01172 sec   0                                 1
sdf-xl-1.0:torchao-int4  TextToImageGenerator:create_model  not finished  N/A           N/A            N/A           N/A                               2
sdf-xl-1.0:hqq-4bit      TextToImageGenerator:create_model  35.28185 sec  35.28185 sec  35.28185 sec   35.28185 sec  1573851136                        1
sdf-xl-1.0:hqq-4bit      TextToImageGenerator:generate      28.32846 sec  28.32846 sec  28.32846 sec   28.32846 sec  22056960                          1
sdf-xl-1.0:hqq-4bit      TextToImageGenerator:save_results  0.01318 sec   0.01318 sec   0.01318 sec    0.01318 sec   0                                 1
sdf-xl-1.0:bnb-int4      TextToImageGenerator:create_model  17.52835 sec  17.52835 sec  17.52835 sec   17.52835 sec  1704247296                        1
sdf-xl-1.0:bnb-int4      TextToImageGenerator:generate      13.44075 sec  13.44075 sec  13.44075 sec   13.44075 sec  -14581760                         1
sdf-xl-1.0:bnb-int4      TextToImageGenerator:save_results  0.01033 sec   0.01033 sec   0.01033 sec    0.01033 sec   0                                 1
sdf-xl-1.0:quanto-int2   TextToImageGenerator:create_model  14.87508 sec  14.87508 sec  14.87508 sec   14.87508 sec  870027264                         1
sdf-xl-1.0:quanto-int2   TextToImageGenerator:generate      24.74862 sec  24.74862 sec  24.74862 sec   24.74862 sec  13602816                          1
sdf-xl-1.0:quanto-int2   TextToImageGenerator:save_results  0.01268 sec   0.01268 sec   0.01268 sec    0.01268 sec   0                                 1
sdf-xl-1.0:hqq-2bit      TextToImageGenerator:create_model  29.01123 sec  29.01123 sec  29.01123 sec   29.01123 sec  1251397632                        1
sdf-xl-1.0:hqq-2bit      TextToImageGenerator:generate      15.55375 sec  15.55375 sec  15.55375 sec   15.55375 sec  21401600                          1
sdf-xl-1.0:hqq-2bit      TextToImageGenerator:save_results  0.01248 sec   0.01248 sec   0.01248 sec    0.01248 sec   0                                 1
sdf-xl-1.0:hqq-1bit      TextToImageGenerator:create_model  31.83030 sec  31.83030 sec  31.83030 sec   31.83030 sec  558329856                         1
sdf-xl-1.0:hqq-1bit      TextToImageGenerator:generate      34.51453 sec  34.51453 sec  34.51453 sec   34.51453 sec  -3112960                          1
sdf-xl-1.0:hqq-1bit      TextToImageGenerator:save_results  0.01281 sec   0.01281 sec   0.01281 sec    0.01281 sec   0                                 1

Overall time for each index:

Index                    Total Time     Min          Max            Average         Call Count
-----------------------  -------------  -----------  -------------  ------------  ------------
sdf-xl-1.0:quanto-int8   19.74698 sec   0.02003 sec  10.05025 sec   6.58233 sec              3
sdf-xl-1.0:torchao-int8  25.41556 sec   0.01174 sec  13.60181 sec   8.47185 sec              3
sdf-xl-1.0:hqq-8bit      45.80934 sec   0.01172 sec  29.19757 sec   15.26978 sec             3
sdf-xl-1.0:hqq-4bit      63.62349 sec   0.01318 sec  35.28185 sec   21.20783 sec             3
sdf-xl-1.0:bnb-int4      30.97943 sec   0.01033 sec  17.52835 sec   10.32648 sec             3
sdf-xl-1.0:quanto-int2   39.63639 sec   0.01268 sec  24.74862 sec   13.21213 sec             3
sdf-xl-1.0:hqq-2bit      44.57746 sec   0.01248 sec  29.01123 sec   14.85915 sec             3
sdf-xl-1.0:hqq-1bit      66.35763 sec   0.01281 sec  34.51453 sec   22.11921 sec             3

Every LLM writes the same Todo App. And I mean that literally

Jarek — Sun, 05 Jul 2026 17:53:35 +0000

I'm building a benchmark of local models.

Simple task: build a Todo App in a single HTML file - add, delete, mark as done, filters, localStorage. Trivial. Each model gets 6 attempts. 96 files total from 16 models - from Claude Sonnet 4.6 and Haiku 4.5, through the whole Qwen family (9B to 122B), Gemma 4, GPT-OSS 20B and 120B, GPT-4.1-mini, GPT-5.4-mini, Codex-mini, all the way to Z.AI GLM-5-turbo.

I open file after file to judge whether it works, how the code looks, and run it in the browser - just like a caveman, instead of automating it, because I actually want to see what the AI produced.

Somewhere around the Nth one I get this weird déjà vu. After the next one it clicks - it's as if one person wrote all of them. The same function names (addTodo, saveTodos, deleteTodo), the same CSS classes (.filter-btn, .container, .todo-text), the same localStorage key: 'todos'. Models from three different companies, trained on three different datasets - structurally indistinguishable.

I decided to check it - but this time using an agent.

Fingerprint instead of diff

I'm not comparing the code literally - whitespace, quotes, function order are noise. I want to compare structure. For each file I pull out a set of "tokens":

id="..." attributes from the HTML
CSS class names
JS function names
CSS variables

Each file becomes a bag of ~30 labels. For every pair I compute the Jaccard coefficient - |A ∩ B| / |A ∪ B|. 1.0 = identical fingerprint, 0.0 = not a single shared identifier.
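A rough Python sketch of the fingerprint and the Jaccard coefficient. My actual script is Node.js; the regexes here are deliberately simple assumptions (class names are read from `class="..."` attributes, function names from `function name` declarations only), so treat it as an illustration of the idea rather than the real extractor.

```python
import re

ID_RE = re.compile(r"""\bid\s*=\s*["']([^"']+)["']""")
CLASS_RE = re.compile(r"""\bclass\s*=\s*["']([^"']+)["']""")
FUNC_RE = re.compile(r"\bfunction\s+([A-Za-z_$][\w$]*)")
CSS_VAR_RE = re.compile(r"(--[A-Za-z][\w-]*)\s*:")


def fingerprint(source: str) -> frozenset[str]:
    """Collect ids, class names, function names and CSS variables as labels."""
    tokens: set[str] = set()
    tokens.update(f"id:{name}" for name in ID_RE.findall(source))
    for attribute in CLASS_RE.findall(source):
        tokens.update(f"class:{name}" for name in attribute.split())
    tokens.update(f"fn:{name}" for name in FUNC_RE.findall(source))
    tokens.update(f"var:{name}" for name in CSS_VAR_RE.findall(source))
    return frozenset(tokens)


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """|A ∩ B| / |A ∪ B|; two empty fingerprints count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two hypothetical, heavily shortened files.
file_a = ('<ul id="list" class="container"></ul>'
          "<script>function addTodo(){} function saveTodos(){}</script>")
file_b = ('<ul id="list" class="wrap"></ul>'
          "<script>function addTodo(){} function render(){}</script>")

print(sorted(fingerprint(file_a)))
print(jaccard(fingerprint(file_a), fingerprint(file_b)))
```

The two sample files share two labels out of six distinct ones, so their similarity is 2/6.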

Then single-link clustering with a 0.45 threshold: two files land in the same cluster if there's a chain of pairs with similarity ≥ 0.45. A naive algorithm, but on 96 files it runs in a fraction of a second. The script is ~150 lines of Node.js with no dependencies beyond Playwright for generating the PNGs.
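Single-link clustering with a threshold is the same thing as finding connected components in a graph whose edges are the pairs at or above the threshold. A sketch using union-find, again in Python rather than the original Node.js; the file names and label sets are made up.

```python
from itertools import combinations


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """|A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def single_link_clusters(fingerprints: dict[str, frozenset[str]],
                         threshold: float = 0.45) -> list[list[str]]:
    """Group files connected by a chain of pairs with similarity >= threshold."""
    parent = {name: name for name in fingerprints}

    def find(name: str) -> str:
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in combinations(fingerprints, 2):
        if jaccard(fingerprints[a], fingerprints[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two components

    clusters: dict[str, list[str]] = {}
    for name in fingerprints:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values(), key=len, reverse=True)


# a-b and b-c are similar (0.5 each), a-c is not (0.2): the chain still links them.
files = {
    "a": frozenset({"p", "q", "r"}),
    "b": frozenset({"q", "r", "s"}),
    "c": frozenset({"r", "s", "t"}),
    "d": frozenset({"x", "y"}),
}

print(single_link_clusters(files))
```

The example shows the chaining effect that makes single-link "naive": `a` and `c` end up together even though their direct similarity is well below the threshold.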

The result: one giant monocluster

A 96×96 heatmap. The top-left corner is one huge dark block spanning 60 of the 96 files (62.5% of each axis), from four companies:

Anthropic Haiku 4.5 - all 6 attempts
Qwen 3.5 (9B, 27B, 35B, 122B) and Qwen 3.6 (27B, 35B)
Gemma 4 (26B and 31B)
Z.AI GLM-5-turbo - all 6 attempts

Four companies, five model families, spanning 9-122 billion parameters. Structurally the same code.

The most surprising numbers:

Jaccard 1.000  qwen3.5_122b_03  ↔  qwen3.5_122b_04
Jaccard 1.000  qwen3.6_35b_01   ↔  qwen3.6_35b_03

Same model, two different runs - an identical set of identifiers. Zero variance. The model has a "favorite piece of code" that it returns almost deterministically.

The most common signatures across the whole corpus: the .active class in 92/96 files, addTodo in 80/96, deleteTodo and .filters in 75/96 each, saveTodos in 74/96. These aren't coincidences.

It's the imprint of one specific canon - TodoMVC plus its hundreds of clones on GitHub, which all these models very likely swallowed during training.

There are other schools, though

Beyond the mainstream you can see two distinct bubbles.

The "OpenAI minimalist" cluster - GPT-OSS 120B and 20B, GPT-4.1-mini, part of Codex-mini. Characteristics: ~200 LOC instead of ~330, fewer CSS classes, simpler structure. Same problem, terser and without the decoration. Interestingly, GPT-OSS 120B and 20B sit together despite a 6× difference in parameters - clearly the same "style school" inside the company.

GPT-5.4-mini stands out differently: it consistently uses CSS variables with a color palette - --primary-color, --bg-color, --danger-color. The other models hard-code their colors. This one ran into newer frontend patterns during training.

Sonnet 4.6 - an interesting case. 6 attempts split into three small subgroups and one singleton. Intra-model Jaccard = 0.441 - the lowest in the whole "class A". Haiku 4.5 sits at 0.815. So the same Haiku gives you the same thing every time, while Sonnet gives you something different every time.

Codex-mini - the lowest score in the set, mean Jaccard = 0.277. Three attempts in the OpenAI cluster, three singletons. A model that doesn't even have a "favorite solution" for a trivial task.
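For the per-model numbers, a sketch of one way to compute them: the mean Jaccard over all pairs of attempts from the same model. Treating "intra-model Jaccard" as the mean over pairs is an assumption here, and the label sets are made up.

```python
from itertools import combinations
from statistics import mean


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """|A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def intra_model_jaccard(attempts: list[frozenset[str]]) -> float:
    """Mean pairwise similarity between attempts of a single model."""
    pairs = list(combinations(attempts, 2))
    if not pairs:
        raise ValueError("need at least two attempts")
    return mean(jaccard(a, b) for a, b in pairs)


# Hypothetical model: two identical attempts and one that went its own way.
attempts = [
    frozenset({"fn:addTodo", "fn:saveTodos"}),
    frozenset({"fn:addTodo", "fn:saveTodos"}),
    frozenset({"fn:addTodo", "fn:render"}),
]

print(intra_model_jaccard(attempts))
```

With six attempts per model there are 15 pairs, so a single odd run moves the mean noticeably but does not dominate it.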

Will Snake be any different?

The Todo numbers point to a recall task rather than a design task. Snake - with modes (human-vs-human, human-vs-cpu, cpu-vs-cpu), a score, a start menu and a game-over screen - has enough decision dimensions to force the models to design rather than remember. I ran the same script on another 96 files.

The heatmap looks completely different. No big dark blocks. The network view: a few small groups in one corner, surrounded by a ring of 73 singletons.

Metric	Todo App	Snake
Largest cluster	60 files (62.5%)	6 files (6.3%)
Singletons	4 (4%)	73 (76%)
Identical pairs (Jaccard 1.0)	2	0
Cross-company clusters	2	0

All seven Snake clusters are intra-family - not one of them links models from different companies. The exact opposite of todo.

In todo the dominant signatures were things like addTodo, .active, .filter-btn - names of specific design decisions that everyone could have made differently, but everyone made the same. In Snake the dominant ones are startGame, id="gameCanvas", endGame, draw, spawnFood - that's the domain vocabulary, the minimum you can't get around. Everything else - game modes, menu structure, how state is held - is different in every file.

Sonnet 4.6 in Snake: 6 attempts, 6 singletons, max Jaccard = 0.324. The highest intra-model diversity in the whole set.

What follows from this

The Todo App isn't a benchmark of programming skill. It's a recall test. The code exists in the training data in hundreds of copies, and the model reproduces it. Swap the task for something with real complexity and the "model consensus" shatters into 76% individual approaches.

Four companies in one bucket isn't a coincidence - Anthropic, Alibaba, Google and Z.AI very likely trained on overlapping GitHub scrapes (TodoMVC + clones) and/or on synthetic data from a shared stronger teacher. It's not a conspiracy, it's the consequence of "good code" on the web being a finite set.

Low variance between attempts isn't a feature - it's a defect. Codex-mini with a mean Jaccard of 0.277 isn't the "worse" model. It's the model that gives you different approaches on the first answer. In real work, where the first answer is rarely the final one, that has value. On leaderboards with one attempt per task it counts as "instability".

And practically: if you're building CRUDs and wondering whether to grab Haiku, Qwen 9B or Z.AI GLM - it probably makes no difference to code quality. What matters is price, latency and context size. Not "intelligence".

I started this benchmark to check how the agent I'm building works with different models, and which is the "best local model for coding".

I ended up with the conclusion that on basic tasks all local models are the same model in different wrappers. The real difference starts where TodoMVC ends in the training data - at real projects with specific context and state the model has never seen before.