<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jonathan D Borgia</title>
    <description>The latest articles on DEV Community by Jonathan D Borgia (@jborgia).</description>
    <link>https://dev.to/jborgia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F261634%2Fad35a796-4392-4b35-bf86-b8db808a2773.jpeg</url>
      <title>DEV Community: Jonathan D Borgia</title>
      <link>https://dev.to/jborgia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jborgia"/>
    <language>en</language>
    <item>
      <title>A 13 KB text file beat a smarter model: benchmarking AI codegen across 5 Angular state libraries</title>
      <dc:creator>Jonathan D Borgia</dc:creator>
      <pubDate>Sat, 30 May 2026 00:37:51 +0000</pubDate>
      <link>https://dev.to/jborgia/a-13-kb-text-file-beat-a-smarter-model-benchmarking-ai-codegen-across-5-angular-state-libraries-3p36</link>
      <guid>https://dev.to/jborgia/a-13-kb-text-file-beat-a-smarter-model-benchmarking-ai-codegen-across-5-angular-state-libraries-3p36</guid>
      <description>&lt;p&gt;&lt;strong&gt;Disclosure up front:&lt;/strong&gt; I maintain one of the five libraries tested (SignalTree), and it's the one that scored &lt;em&gt;worst&lt;/em&gt; in the cold run — so this isn't a "look how good my thing is" post. The cross-library pattern and the fix were interesting enough that I wanted to put the numbers in front of people who use Copilot/Cursor/Claude Code every day. The whole harness is reproducible (one command, link at the bottom); I'd rather it get torn apart than taken on faith.&lt;/p&gt;

&lt;h2&gt;Setup&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Libraries&lt;/strong&gt;: NgRx (classic), NgRx SignalStore, Akita, Elf, SignalTree.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents&lt;/strong&gt;: Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro, Perplexity Sonar Pro, Claude Haiku 4.5, GPT-5.4-mini.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8 prompts&lt;/strong&gt;: counter, paginated users, debounced search, derived totals, login form, undo/redo, deep nested state, multi-marker editor.&lt;/li&gt;
&lt;li&gt;5 libs × 6 agents × 8 prompts × 3 priming modes = &lt;strong&gt;720 cells&lt;/strong&gt;. Temperature 0. Identical prompt text per library (only the library name swapped).&lt;/li&gt;
&lt;li&gt;Scored on three orthogonal checks: idiomatic-pattern match, import resolution (does every import resolve to a real package), and method validity (do the called methods actually exist on the API).&lt;/li&gt;
&lt;/ul&gt;
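
&lt;p&gt;The 720 figure is the product of all four axes: libraries, agents, prompts, and priming modes. A minimal sketch that enumerates the grid, assuming illustrative string labels (the identifiers below are placeholders and may not match what the harness uses):&lt;/p&gt;

```typescript
// Illustrative labels only; the real harness may name these differently.
const libraries: string[] = ["ngrx", "ngrx-signalstore", "akita", "elf", "signaltree"];
const agents: string[] = [
  "claude-sonnet-4.6", "gpt-5.4", "gemini-3.1-pro",
  "sonar-pro", "claude-haiku-4.5", "gpt-5.4-mini",
];
const prompts: string[] = [
  "counter", "paginated-users", "debounced-search", "derived-totals",
  "login-form", "undo-redo", "deep-nested-state", "multi-marker-editor",
];
const modes: string[] = ["cold", "llms-txt", "llms-txt-plus-notes"];

type Cell = { library: string; agent: string; prompt: string; mode: string };

// Cartesian product of the four axes: one cell per combination.
const cells: Cell[] = [];
for (const library of libraries) {
  for (const agent of agents) {
    for (const promptId of prompts) {
      for (const mode of modes) {
        cells.push({ library, agent, prompt: promptId, mode });
      }
    }
  }
}

console.log(cells.length); // 5 * 6 * 8 * 3 = 720
```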

&lt;p&gt;&lt;strong&gt;What this measures: one-shot generation.&lt;/strong&gt; The agent gets the prompt, returns a file, we score it. Real interactive use — Cursor/Copilot with chat back-and-forth, where the model sees its own errors and gets a second try — is a different setting, and the lift could be larger or smaller there. This is the cold-shot case.&lt;/p&gt;

&lt;h2&gt;Finding 1: cold accuracy basically tracks how much the library is in the training data&lt;/h2&gt;

&lt;p&gt;No context provided, just "write this in library X":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Cold score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Akita&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Elf&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NgRx (classic)&lt;/td&gt;
&lt;td&gt;91%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NgRx SignalStore&lt;/td&gt;
&lt;td&gt;86%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SignalTree&lt;/td&gt;
&lt;td&gt;49%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The libraries that have been around for years, with thousands of blog posts and Stack Overflow answers, score in the 90s; the newer NgRx SignalStore lands at 86%. The youngest and smallest library in the set scores 49%. That gap isn't really a quality signal; it's a &lt;em&gt;corpus&lt;/em&gt; signal. The models have presumably seen orders of magnitude more Akita than SignalTree. Worth keeping in mind any time you judge a library by how well your AI assistant writes it cold: you're partly measuring its age, not its design.&lt;/p&gt;

&lt;h2&gt;Finding 2: a single retrievable context file closes most of that gap&lt;/h2&gt;

&lt;p&gt;I shipped a ~13.5 KB &lt;code&gt;llms.txt&lt;/code&gt; (a plain-text API summary) inside the npm package and re-ran with it in context:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;SignalTree score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cold&lt;/td&gt;
&lt;td&gt;49%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ &lt;code&gt;llms.txt&lt;/code&gt; (13.5 KB)&lt;/td&gt;
&lt;td&gt;91%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ &lt;code&gt;llms.txt&lt;/code&gt; + extra notes (~25 KB)&lt;/td&gt;
&lt;td&gt;87%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;+42 percentage points from one small file — enough to pull the least-known library up into the range of the well-established ones. Two things I didn't expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More context made it worse.&lt;/strong&gt; Adding a second doc regressed accuracy from 91% to 87%. The extra source seems to dilute the signal rather than reinforce it, with Gemini in particular over-indexing on the noise. Past some point you're hurting, not helping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It bled across libraries.&lt;/strong&gt; Loading &lt;em&gt;one&lt;/em&gt; library's context dragged the &lt;em&gt;others&lt;/em&gt; down — models that had SignalTree's API in context started cross-pollinating it into their Akita/Elf answers:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Cold&lt;/th&gt;
&lt;th&gt;With SignalTree's context loaded&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SignalTree&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;td&gt;91&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NgRx (classic)&lt;/td&gt;
&lt;td&gt;91&lt;/td&gt;
&lt;td&gt;88&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NgRx SignalStore&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Akita&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Elf&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;87&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
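
&lt;p&gt;The per-library movement is easier to read as deltas. A small sketch that computes them from the scores in the table above; the &lt;code&gt;Row&lt;/code&gt; type and variable names are made up for this example:&lt;/p&gt;

```typescript
// Scores copied from the table above (percent).
type Row = { library: string; cold: number; primed: number };

const rows: Row[] = [
  { library: "SignalTree", cold: 49, primed: 91 },
  { library: "NgRx (classic)", cold: 91, primed: 88 },
  { library: "NgRx SignalStore", cold: 86, primed: 80 },
  { library: "Akita", cold: 94, primed: 85 },
  { library: "Elf", cold: 94, primed: 87 },
];

// Change in percentage points once SignalTree's context is loaded.
const deltas = rows.map((row) => ({
  library: row.library,
  delta: row.primed - row.cold,
}));

for (const entry of deltas) {
  const sign = entry.delta > 0 ? "+" : "";
  console.log(`${entry.library}: ${sign}${entry.delta} pp`);
}
// SignalTree: +42 pp
// NgRx (classic): -3 pp
// NgRx SignalStore: -6 pp
// Akita: -9 pp
// Elf: -7 pp
```

&lt;p&gt;The four untargeted libraries lose between 3 and 9 points each, while the targeted one gains 42.&lt;/p&gt;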

&lt;p&gt;&lt;strong&gt;Practical takeaway: more context is not better.&lt;/strong&gt; Going from 13.5 KB to ~25 KB of context, the numbers went down, not up. If you maintain or use a less-common library, a small retrievable context file does more for codegen accuracy than reaching for a "smarter" model (primed mid-tier models beat cold top-tier ones in my runs), but dumping your whole docs site in backfires.&lt;/p&gt;

&lt;h2&gt;Finding 3 (the one I found most useful): the benchmark exposed an API-consistency bug&lt;/h2&gt;

&lt;p&gt;The failures weren't random. Agents kept calling methods that didn't exist, and the pattern pointed straight at my own inconsistency. I'd named predicate accessors two different ways across the API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// some markers used an is- prefix&lt;/span&gt;
&lt;span class="nx"&gt;saveStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isLoading&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isEmpty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// others used bare names&lt;/span&gt;
&lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;feed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loading&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An agent that learned &lt;code&gt;isLoading()&lt;/code&gt; would confidently try &lt;code&gt;isDirty()&lt;/code&gt;, which never existed. That's not an AI failure — it's a human one wearing an AI costume. Any developer reading the docs hits the same wall; they just fail more quietly and blame themselves. I standardized on bare names (matching &lt;code&gt;FormControl.dirty&lt;/code&gt;/&lt;code&gt;.valid&lt;/code&gt;), kept the old names as deprecated aliases, shipped it.&lt;/p&gt;
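
&lt;p&gt;The "bare name plus deprecated alias" approach can be sketched roughly as follows. The &lt;code&gt;StatusMarker&lt;/code&gt; class and its &lt;code&gt;start()&lt;/code&gt; method are hypothetical stand-ins for illustration, not SignalTree's actual types; only the &lt;code&gt;loading()&lt;/code&gt; and &lt;code&gt;isLoading()&lt;/code&gt; names come from the post:&lt;/p&gt;

```typescript
// Hypothetical marker for illustration; not SignalTree's real implementation.
class StatusMarker {
  private pending = false;

  // Hypothetical helper that flips the marker into its loading state.
  start(): void {
    this.pending = true;
  }

  /** Canonical bare-name predicate. */
  loading(): boolean {
    return this.pending;
  }

  /** @deprecated Use loading() instead. Kept so existing callers keep working. */
  isLoading(): boolean {
    return this.loading();
  }
}

const saveStatus = new StatusMarker();
saveStatus.start();
console.log(saveStatus.loading(), saveStatus.isLoading()); // true true
```

&lt;p&gt;Delegating the alias to the canonical method keeps the two from drifting apart, and the &lt;code&gt;@deprecated&lt;/code&gt; doc tag lets editors flag the old spelling at the call site.&lt;/p&gt;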

&lt;p&gt;The generalizable takeaway, and the reason I think this is worth writing up rather than burying in a changelog: &lt;strong&gt;an API surface a model can't keep straight is usually one a human can't either.&lt;/strong&gt; Codegen accuracy turns out to be a surprisingly good proxy for naming consistency, and a cheap one to measure.&lt;/p&gt;
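
&lt;p&gt;The same inconsistency can also be caught without a model in the loop. A rough sketch of a naming lint, assuming the input is already a list of predicate accessor names and that an &lt;code&gt;is&lt;/code&gt; prefix followed by a capital letter marks the prefixed style (both assumptions are made for this example, and the function names are hypothetical):&lt;/p&gt;

```typescript
// Hypothetical lint: the input is assumed to contain predicate accessor names only.
const PREFIXED = /^is[A-Z]/;

function splitByPrefix(names: string[]): { prefixed: string[]; bare: string[] } {
  const prefixed = names.filter((name) => PREFIXED.test(name));
  const bare = names.filter((name) => !PREFIXED.test(name));
  return { prefixed, bare };
}

function isConsistent(names: string[]): boolean {
  const groups = splitByPrefix(names);
  // Consistent when at most one naming style is in use.
  return groups.prefixed.length === 0 || groups.bare.length === 0;
}

// Names from the snippet above, mixing both styles.
const before = ["isLoading", "isEmpty", "dirty", "loading"];
// An illustrative all-bare list.
const after = ["loading", "dirty", "valid"];

console.log(isConsistent(before)); // false
console.log(isConsistent(after)); // true
```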

&lt;h2&gt;Where I'd attack this&lt;/h2&gt;

&lt;p&gt;I'd rather list the holes than have them found, so here are the three I'd lead with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conflict of interest, four ways.&lt;/strong&gt; I built one of the five libraries, wrote its context file, picked the eight prompts, &lt;em&gt;and&lt;/em&gt; wrote the scoring rubric. That's four levers I could have pulled in my own favor, so don't trust the absolute percentages. The one number I &lt;em&gt;couldn't&lt;/em&gt; rig in my favor is that my library scored worst in the cold run — and the thing I'd actually defend is the &lt;em&gt;relative&lt;/em&gt; movement (cold→primed deltas, cross-library bleed), not the exact values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sample size / "temperature 0 ≠ deterministic."&lt;/strong&gt; This is one pass per cell, and temp 0 on hosted APIs is not truly deterministic — identical prompts drift run to run. So treat any single cell as having a few points of noise around it; this is indicative, not publication-grade. The effects I'm pointing at (a ~40pp jump, established libs clustering in the 90s cold) are large enough to survive that noise, but a 3-point difference between two cells means nothing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The rubric structurally favors smaller APIs.&lt;/strong&gt; One of the three checks is "does the called method exist." A library with fewer total methods is simply easier to get right — and the tree-shaped library I built has a smaller surface than, say, classic NgRx. So some of the primed lift is plausibly "less to get wrong," not "better designed." I think that's a real confound, not a fully-controlled variable.&lt;/li&gt;
&lt;/ul&gt;
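
&lt;p&gt;The noise point in the second bullet can be made concrete by repeating a cell and comparing any gap against the run-to-run spread. A minimal sketch, assuming you have several scores for the same cell; the numbers in &lt;code&gt;repeatedRuns&lt;/code&gt; are placeholders, not benchmark results, and the helper names are made up:&lt;/p&gt;

```typescript
// Placeholder scores for illustration only; these are NOT benchmark results.
const repeatedRuns: number[] = [90, 92, 89, 91, 93];

function mean(values: number[]): number {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Run-to-run spread: highest score minus lowest score.
function spread(values: number[]): number {
  return Math.max(...values) - Math.min(...values);
}

// A gap only counts as signal if it is larger than the observed spread.
function exceedsNoise(gap: number, noise: number): boolean {
  return Math.abs(gap) > noise;
}

const noise = spread(repeatedRuns);
console.log(mean(repeatedRuns)); // 91
console.log(noise); // 4
console.log(exceedsNoise(3, noise)); // false
console.log(exceedsNoise(42, noise)); // true
```

&lt;p&gt;With a spread like the placeholder one, a 3-point gap is indistinguishable from noise while a 42-point jump is not.&lt;/p&gt;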

&lt;p&gt;And one that's less a flaw than a "yeah, obviously": &lt;strong&gt;cold score ≈ training-data volume is barely a finding&lt;/strong&gt; — it's close to a truism once you say it out loud. The only mildly non-obvious part is &lt;em&gt;how cheaply&lt;/em&gt; a retrievable file substitutes for years of corpus presence.&lt;/p&gt;

&lt;h2&gt;Reproduce it yourself&lt;/h2&gt;

&lt;p&gt;One OpenRouter key, ~$15, ~30 minutes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/JBorgia/signaltree
&lt;span class="nb"&gt;cd &lt;/span&gt;signaltree
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-...
node scripts/ai-codegen-benchmark/runner.mjs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prompts (YAML), scoring rubric, adapters, and per-cell results all live in &lt;code&gt;scripts/ai-codegen-benchmark/&lt;/code&gt;. The prompts and rubric are the parts most worth disagreeing with — if you spot one that's unfair to a particular library, that's the most useful feedback I can get.&lt;/p&gt;

&lt;h2&gt;Links&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Benchmark source + full per-agent results: &lt;a href="https://github.com/JBorgia/signaltree/tree/main/scripts/ai-codegen-benchmark" rel="noopener noreferrer"&gt;https://github.com/JBorgia/signaltree/tree/main/scripts/ai-codegen-benchmark&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/JBorgia/signaltree" rel="noopener noreferrer"&gt;https://github.com/JBorgia/signaltree&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Demo + docs: &lt;a href="https://jborgia.github.io/signaltree/" rel="noopener noreferrer"&gt;https://jborgia.github.io/signaltree/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;I'd love to hear&lt;/h2&gt;

&lt;p&gt;For those of you using Copilot / Cursor / Claude Code daily: when the generated code for a library is bad, &lt;strong&gt;what's actually fixed it for you&lt;/strong&gt; — a custom rules file, pasted docs, an MCP server, something else? I'm especially curious whether the "ship a small context file" result holds outside my own setup, or whether interactive back-and-forth makes it moot.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>angular</category>
      <category>typescript</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
