<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jarek</title>
    <description>The latest articles on DEV Community by Jarek (@shadbb).</description>
    <link>https://dev.to/shadbb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4016320%2Faf9d5c6b-808b-4190-9ea6-f3e836e36bce.png</url>
      <title>DEV Community: Jarek</title>
      <link>https://dev.to/shadbb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shadbb"/>
    <language>en</language>
    <item>
      <title>Every LLM writes the same Todo App. And I mean that literally</title>
      <dc:creator>Jarek</dc:creator>
      <pubDate>Sun, 05 Jul 2026 17:53:35 +0000</pubDate>
      <link>https://dev.to/shadbb/every-llm-writes-the-same-todo-app-and-i-mean-that-literally-721</link>
      <guid>https://dev.to/shadbb/every-llm-writes-the-same-todo-app-and-i-mean-that-literally-721</guid>
      <description>&lt;p&gt;I'm building a &lt;a href="https://benchmark.refio.dev/" rel="noopener noreferrer"&gt;benchmark of local models&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Simple task: build a Todo App in a single HTML file - add, delete, mark as done, filters, localStorage. Trivial. Each model gets 6 attempts. 96 files total from 16 models - from Claude Sonnet 4.6 and Haiku 4.5, through the whole Qwen family (9B to 122B), Gemma 4, GPT-OSS 20B and 120B, GPT-4.1-mini, GPT-5.4-mini, Codex-mini, all the way to Z.AI GLM-5-turbo.&lt;/p&gt;

&lt;p&gt;I open file after file, read the code, and run each one in the browser to judge whether it works - just like a caveman, instead of automating it, because I actually want to see what the AI produced.&lt;/p&gt;

&lt;p&gt;Somewhere around the Nth one I get this weird &lt;em&gt;déjà vu&lt;/em&gt;. After the next one it clicks - it's as if one person wrote all of them. The same function names (&lt;code&gt;addTodo&lt;/code&gt;, &lt;code&gt;saveTodos&lt;/code&gt;, &lt;code&gt;deleteTodo&lt;/code&gt;), the same CSS classes (&lt;code&gt;.filter-btn&lt;/code&gt;, &lt;code&gt;.container&lt;/code&gt;, &lt;code&gt;.todo-text&lt;/code&gt;), the same localStorage key: &lt;code&gt;'todos'&lt;/code&gt;. Models from three different companies, trained on three different datasets - structurally indistinguishable.&lt;/p&gt;

&lt;p&gt;I decided to check it - but this time using an agent.&lt;/p&gt;

&lt;h2&gt;Fingerprint instead of diff&lt;/h2&gt;

&lt;p&gt;I'm not comparing the code literally - whitespace, quotes, function order are noise. I want to compare structure. For each file I pull out a set of "tokens":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;id="..."&lt;/code&gt; attributes from the HTML&lt;/li&gt;
&lt;li&gt;CSS class names&lt;/li&gt;
&lt;li&gt;JS function names&lt;/li&gt;
&lt;li&gt;CSS variables&lt;/li&gt;
&lt;/ul&gt;
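
&lt;p&gt;A minimal sketch of how such a fingerprint could be pulled out of one file, not the actual script. The helper name &lt;code&gt;extractFingerprint&lt;/code&gt; and all four regexes are assumptions made for illustration: class names are read from &lt;code&gt;class="..."&lt;/code&gt; attributes, and only &lt;code&gt;function name(&lt;/code&gt; declarations are counted, so arrow functions are missed.&lt;/p&gt;

```javascript
// Hypothetical helper: collect structural identifiers from one HTML file.
// The regexes are simplified assumptions, not the original script.
function extractFingerprint(source) {
  const tokens = new Set();

  // id="..." attributes from the HTML
  for (const m of source.matchAll(/(?:^|\s)id="([^"]+)"/g)) {
    tokens.add(`id="${m[1]}"`);
  }

  // CSS class names, read here from class="..." attributes (assumption)
  for (const m of source.matchAll(/(?:^|\s)class="([^"]+)"/g)) {
    for (const name of m[1].trim().split(/\s+/)) {
      if (name) tokens.add(`.${name}`);
    }
  }

  // JS function names from `function name(` declarations only
  for (const m of source.matchAll(/\bfunction\s+([A-Za-z_$][\w$]*)\s*\(/g)) {
    tokens.add(m[1]);
  }

  // CSS variables such as --primary-color, taken where they are declared
  for (const m of source.matchAll(/(--[A-Za-z][\w-]*)\s*:/g)) {
    tokens.add(m[1]);
  }

  return tokens;
}
```

&lt;p&gt;Calling it on the text of a generated file returns a &lt;code&gt;Set&lt;/code&gt; of strings such as &lt;code&gt;addTodo&lt;/code&gt;, &lt;code&gt;.filter-btn&lt;/code&gt; or &lt;code&gt;--bg-color&lt;/code&gt;. Whitespace, quoting style and function order never enter the result.&lt;/p&gt;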

&lt;p&gt;Each file becomes a set of ~30 labels. For every pair I compute the &lt;strong&gt;Jaccard coefficient&lt;/strong&gt; - &lt;code&gt;|A ∩ B| / |A ∪ B|&lt;/code&gt;. 1.0 = identical fingerprint, 0.0 = not a single shared identifier.&lt;/p&gt;
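
&lt;p&gt;The coefficient itself fits in a few lines. This is a sketch under one assumption of mine: two empty fingerprints count as identical. The two sample sets are made up and only reuse identifier names mentioned earlier.&lt;/p&gt;

```javascript
// Jaccard coefficient of two Sets: |A ∩ B| / |A ∪ B|.
function jaccard(a, b) {
  const unionSize = new Set([...a, ...b]).size;
  // Assumption: two empty fingerprints are treated as identical.
  if (unionSize === 0) return 1;
  let shared = 0;
  for (const token of a) {
    if (b.has(token)) shared += 1;
  }
  return shared / unionSize;
}

// Made-up fingerprints, for illustration only.
const fileA = new Set(["addTodo", "saveTodos", "deleteTodo", ".filter-btn"]);
const fileB = new Set(["addTodo", "saveTodos", "renderTodos", ".filter-btn"]);
console.log(jaccard(fileA, fileB)); // 3 shared / 5 distinct = 0.6
```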

&lt;p&gt;Then single-link clustering with a 0.45 threshold: two files land in the same cluster if there's a chain of pairs with similarity ≥ 0.45. A naive algorithm, but on 96 files it runs in a fraction of a second. The script is ~150 lines of Node.js with no dependencies beyond Playwright for generating the PNGs.&lt;/p&gt;
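
&lt;p&gt;The "chain of pairs" rule can be sketched with a small union-find. This is one possible way to write it, not necessarily how the script does it. The function name &lt;code&gt;singleLinkClusters&lt;/code&gt;, the file names and the similarity values below are all hypothetical.&lt;/p&gt;

```javascript
// Single-link clustering: any pair at or above the threshold merges
// the two clusters those files currently belong to (union-find).
function singleLinkClusters(names, similarity, threshold) {
  const n = names.length;
  const parent = names.map((_, i) => i);

  // Follow parent links up to the representative of i's cluster.
  function find(i) {
    while (parent[i] !== i) {
      parent[i] = parent[parent[i]]; // path halving
      i = parent[i];
    }
    return i;
  }

  // Naive pass over every pair, O(n^2) similarity calls.
  for (let i = 0; i !== n; i += 1) {
    for (let j = i + 1; j !== n; j += 1) {
      if (similarity(i, j) >= threshold) parent[find(i)] = find(j);
    }
  }

  // Collect file names per cluster representative.
  const groups = new Map();
  names.forEach((name, i) => {
    const root = find(i);
    if (!groups.has(root)) groups.set(root, []);
    groups.get(root).push(name);
  });
  return [...groups.values()];
}

// Made-up similarities for four hypothetical files, for illustration only.
const names = ["a_01", "a_02", "b_01", "c_01"];
const sim = [
  [1.0, 0.8, 0.5, 0.1],
  [0.8, 1.0, 0.3, 0.1],
  [0.5, 0.3, 1.0, 0.2],
  [0.1, 0.1, 0.2, 1.0],
];
const clusters = singleLinkClusters(names, (i, j) => sim[i][j], 0.45);
console.log(clusters); // [["a_01", "a_02", "b_01"], ["c_01"]]
```

&lt;p&gt;Note that &lt;code&gt;a_02&lt;/code&gt; and &lt;code&gt;b_01&lt;/code&gt; end up together even though their own similarity is only 0.3: both are linked to &lt;code&gt;a_01&lt;/code&gt;. That chaining is what lets single-link clustering grow one very large cluster.&lt;/p&gt;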

&lt;h2&gt;The result: one giant monocluster&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fg84yaeq2omzmn46z4pa2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fg84yaeq2omzmn46z4pa2.png" alt="todo-similarity-heatmap" width="800" height="954"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A 96×96 heatmap. The top-left corner is one huge dark block spanning &lt;strong&gt;62.5% of each axis&lt;/strong&gt; (about 39% of the matrix area) - 60 of the 96 files, from four companies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anthropic Haiku 4.5 - all 6 attempts&lt;/li&gt;
&lt;li&gt;Qwen 3.5 (9B, 27B, 35B, 122B) and Qwen 3.6 (27B, 35B)&lt;/li&gt;
&lt;li&gt;Gemma 4 (26B and 31B)&lt;/li&gt;
&lt;li&gt;Z.AI GLM-5-turbo - all 6 attempts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Four companies, five model families, spanning 9-122 billion parameters. &lt;strong&gt;Structurally the same code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most surprising numbers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Jaccard 1.000  qwen3.5_122b_03  ↔  qwen3.5_122b_04
Jaccard 1.000  qwen3.6_35b_01   ↔  qwen3.6_35b_03
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same model, two different runs - &lt;strong&gt;an identical set of identifiers&lt;/strong&gt;. Zero variance. The model has a "favorite piece of code" that it returns almost deterministically.&lt;/p&gt;

&lt;p&gt;The most common signatures across the whole corpus: the &lt;code&gt;.active&lt;/code&gt; class in 92/96 files, &lt;code&gt;addTodo&lt;/code&gt; in 80/96, &lt;code&gt;deleteTodo&lt;/code&gt; and &lt;code&gt;.filters&lt;/code&gt; in 75/96 each, &lt;code&gt;saveTodos&lt;/code&gt; in 74/96. These aren't coincidences.&lt;/p&gt;

&lt;p&gt;It's the imprint of one specific canon - &lt;strong&gt;TodoMVC plus its hundreds of clones on GitHub&lt;/strong&gt;, which all these models very likely swallowed during training.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffyf5cmogkuanlv8p53zj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffyf5cmogkuanlv8p53zj.png" alt="todo-similarity-network" width="800" height="708"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;There are other schools, though&lt;/h2&gt;

&lt;p&gt;Beyond the mainstream you can see two distinct bubbles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "OpenAI minimalist" cluster&lt;/strong&gt; - GPT-OSS 120B and 20B, GPT-4.1-mini, part of Codex-mini. Characteristics: ~200 LOC instead of ~330, fewer CSS classes, simpler structure. Same problem, terser and without the decoration. Interestingly, GPT-OSS 120B and 20B sit together despite a 6× difference in parameters - clearly the same "style school" inside the company.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.4-mini&lt;/strong&gt; stands out differently: it consistently uses CSS variables with a color palette - &lt;code&gt;--primary-color&lt;/code&gt;, &lt;code&gt;--bg-color&lt;/code&gt;, &lt;code&gt;--danger-color&lt;/code&gt;. The other models hard-code their colors. This one presumably ran into newer frontend patterns during training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sonnet 4.6&lt;/strong&gt; - an interesting case. 6 attempts split into three small subgroups and one singleton. Intra-model Jaccard = 0.441 - the lowest in the whole "class A". Haiku 4.5 sits at 0.815. So Haiku gives you nearly the same thing every time, while Sonnet gives you something different every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex-mini&lt;/strong&gt; - the lowest score in the set, mean Jaccard = 0.277. Three attempts in the OpenAI cluster, three singletons. A model that doesn't even have a "favorite solution" for a trivial task.&lt;/p&gt;
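
&lt;p&gt;A sketch of how a per-model figure like this could be computed, assuming it is the mean Jaccard over all pairs of one model's attempts (15 pairs for 6 attempts). That reading is my assumption. The helper names and the three sample fingerprints are made up.&lt;/p&gt;

```javascript
// Mean similarity over all unordered pairs of items.
function meanPairwise(items, similarity) {
  let total = 0;
  let pairs = 0;
  items.forEach((a, i) => {
    for (const b of items.slice(i + 1)) {
      total += similarity(a, b);
      pairs += 1;
    }
  });
  return pairs === 0 ? NaN : total / pairs;
}

// Jaccard written via |A| + |B| - |A ∩ B| for the union size.
const overlap = (a, b) => {
  const shared = [...a].filter((token) => b.has(token)).length;
  return shared / (a.size + b.size - shared);
};

// Three made-up attempts by one hypothetical model.
const attempts = [
  new Set(["addTodo", "saveTodos", "render"]),
  new Set(["addTodo", "saveTodos", "render"]),
  new Set(["addTodo", "persist", "draw"]),
];
console.log(meanPairwise(attempts, overlap).toFixed(3)); // "0.467"
```

&lt;p&gt;Here two attempts are identical (1.0) and the third shares one of five identifiers with each (0.2), so the mean is 1.4 / 3. A model that repeats itself scores close to 1, a model that varies scores low.&lt;/p&gt;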

&lt;h2&gt;Will Snake be any different?&lt;/h2&gt;

&lt;p&gt;My working hypothesis is that todo is a recall task. Snake - with modes (human-vs-human, human-vs-cpu, cpu-vs-cpu), a score, a start menu and a game-over screen - has enough decision dimensions to force the models to &lt;em&gt;design&lt;/em&gt; rather than &lt;em&gt;remember&lt;/em&gt;. I ran the same script on another 96 files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhc1jpaec816u2n0ijnz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhc1jpaec816u2n0ijnz0.png" alt="snake-similarity-heatmap" width="800" height="927"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The heatmap looks completely different. No big dark blocks. The network view: a few small groups in one corner, surrounded by a ring of &lt;strong&gt;73 singletons&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Todo App&lt;/th&gt;
&lt;th&gt;Snake&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Largest cluster&lt;/td&gt;
&lt;td&gt;60 files (62.5%)&lt;/td&gt;
&lt;td&gt;6 files (6.3%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Singletons&lt;/td&gt;
&lt;td&gt;4 (4%)&lt;/td&gt;
&lt;td&gt;73 (76%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Identical pairs (Jaccard 1.0)&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-company clusters&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All seven Snake clusters are &lt;strong&gt;intra-family&lt;/strong&gt; - not one of them links models from different companies. The exact opposite of todo.&lt;/p&gt;

&lt;p&gt;In todo the dominant signatures were things like &lt;code&gt;addTodo&lt;/code&gt;, &lt;code&gt;.active&lt;/code&gt;, &lt;code&gt;.filter-btn&lt;/code&gt; - names of &lt;em&gt;specific design decisions&lt;/em&gt; that everyone could have made differently, but everyone made the same. In Snake the dominant ones are &lt;code&gt;startGame&lt;/code&gt;, &lt;code&gt;id="gameCanvas"&lt;/code&gt;, &lt;code&gt;endGame&lt;/code&gt;, &lt;code&gt;draw&lt;/code&gt;, &lt;code&gt;spawnFood&lt;/code&gt; - that's the &lt;strong&gt;domain vocabulary&lt;/strong&gt;, the minimum you can't get around. Everything else - game modes, menu structure, how state is held - is different in every file.&lt;/p&gt;

&lt;p&gt;Sonnet 4.6 in Snake: 6 attempts, 6 singletons, max Jaccard = 0.324. The highest intra-model diversity in the whole set.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F88rj4fpzj3em5uie15rk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F88rj4fpzj3em5uie15rk.png" alt="snake-similarity-network" width="800" height="708"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What follows from this&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Todo App isn't a benchmark of programming skill.&lt;/strong&gt; It's a recall test. The code exists in the training data in hundreds of copies, and the model reproduces it. Swap the task for something with real complexity and the "model consensus" shatters into 76% individual approaches.&lt;/p&gt;

&lt;p&gt;Four companies in one bucket isn't a coincidence - Anthropic, Alibaba, Google and Z.AI very likely trained on overlapping GitHub scrapes (TodoMVC + clones) and/or on synthetic data from a shared stronger teacher. It's not a conspiracy, it's the consequence of "good code" on the web being a finite set.&lt;/p&gt;

&lt;p&gt;Low variance between attempts isn't a feature - it's a defect. Codex-mini with a mean Jaccard of 0.277 isn't the "worse" model. It's the model that gives you a different approach on every attempt. In real work, where the first answer is rarely the final one, that has value. On leaderboards with one attempt per task it counts as "instability".&lt;/p&gt;

&lt;p&gt;And practically: if you're building CRUDs and wondering whether to grab Haiku, Qwen 9B or Z.AI GLM - &lt;strong&gt;it probably makes no difference&lt;/strong&gt; to code quality. What matters is price, latency and context size. Not "intelligence".&lt;/p&gt;

&lt;p&gt;I started this benchmark to check how the agent I'm building works with different models, and which is the "best local model for coding".&lt;/p&gt;

&lt;p&gt;I ended up with the conclusion that on basic tasks all local models are the same model in different wrappers. The real difference starts where TodoMVC ends in the training data - at real projects with specific context and state the model has never seen before.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
