Jonathan D Borgia

Posted on May 30

A 13 KB text file beat a smarter model: benchmarking AI codegen across 5 Angular state libraries

#ai #angular #typescript #webdev

Disclosure up front: I maintain one of the five libraries tested (SignalTree), and it's the one that scored worst in the cold run — so this isn't a "look how good my thing is" post. The cross-library pattern and the fix were interesting enough that I wanted to put the numbers in front of people who use Copilot/Cursor/Claude Code every day. The whole harness is reproducible (one command, link at the bottom); I'd rather it get torn apart than taken on faith.

Setup

Libraries: NgRx (classic), NgRx SignalStore, Akita, Elf, SignalTree.
Agents: Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro, Perplexity Sonar Pro, Claude Haiku 4.5, GPT-5.4-mini.
8 prompts: counter, paginated users, debounced search, derived totals, login form, undo/redo, deep nested state, multi-marker editor.
8 prompts × 5 libs × 6 agents × 3 priming modes = 720 cells. Temperature 0. Identical prompt text per library (only the library name swapped).
Scored on three orthogonal checks: idiomatic-pattern match, import resolution (does every import resolve to a real package), and method validity (do the called methods actually exist on the API).
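As a rough illustration of what those three checks could look like, here is a minimal TypeScript sketch. It is not the harness's actual scorer: the `LibrarySpec` shape, the regex-based extraction, the equal weighting, and the `example-lib` sample are all assumptions made for this example.

```typescript
// Hypothetical sketch of the three checks; not the real harness code.
interface LibrarySpec {
  idiomPatterns: RegExp[];    // patterns an idiomatic answer should contain (no "g" flag)
  knownPackages: Set<string>; // packages that really exist
  knownMethods: Set<string>;  // method names that really exist on the API
}

interface CellScore {
  idiomatic: number; // fraction of idiom patterns matched
  imports: number;   // fraction of imports that resolve
  methods: number;   // fraction of called methods that exist
  total: number;     // unweighted mean of the three (an assumption)
}

// Collect the first capture group of every match of `pattern` in `source`.
function captureAll(source: string, pattern: RegExp): string[] {
  const re = new RegExp(pattern.source, "g"); // fresh copy so lastIndex stays local
  const found: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(source)) !== null) found.push(m[1]);
  return found;
}

// Fraction of items passing a predicate; an empty list counts as fully passing.
function fraction<T>(items: T[], ok: (item: T) => boolean): number {
  return items.length === 0 ? 1 : items.filter(ok).length / items.length;
}

function scoreFile(source: string, spec: LibrarySpec): CellScore {
  const imported = captureAll(source, /from\s+['"]([^'"]+)['"]/); // module specifiers
  const called = captureAll(source, /\.([A-Za-z_$][\w$]*)\s*\(/); // `.name(` calls

  const idiomatic = fraction(spec.idiomPatterns, p => p.test(source));
  const imports = fraction(imported, pkg => spec.knownPackages.has(pkg));
  const methods = fraction(called, name => spec.knownMethods.has(name));
  return { idiomatic, imports, methods, total: (idiomatic + imports + methods) / 3 };
}

// Made-up library and generated file, purely for illustration.
const spec: LibrarySpec = {
  idiomPatterns: [/createStore\(/],
  knownPackages: new Set(["example-lib"]),
  knownMethods: new Set(["set", "dirty"]),
};
const sample = `
import { createStore } from 'example-lib';
import { helper } from 'missing-pkg';
const store = createStore({ count: 0 });
store.set(1);
store.isDirty();
`;
const result = scoreFile(sample, spec);
console.log(result.idiomatic, result.imports, result.methods); // 1 0.5 0.5
```

A regex pass like this will miss aliased imports and dynamic calls; a real scorer would more plausibly walk the TypeScript AST.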

What this measures: one-shot generation. The agent gets the prompt, returns a file, we score it. Real interactive use — Cursor/Copilot with chat back-and-forth, where the model sees its own errors and gets a second try — is a different setting, and the lift could be larger or smaller there. This is the cold-shot case.

Finding 1: cold accuracy basically tracks how much the library is in the training data

No context provided, just "write this in library X":

Library	Cold score
Akita	94%
Elf	94%
NgRx (classic)	91%
NgRx SignalStore	86%
SignalTree	49%

The libraries that have been around for years, with thousands of blog posts and Stack Overflow answers, score in the 90s. The youngest/smallest library in the set scores ~49%. That gap isn't really a quality signal — it's a corpus signal. The models have simply seen orders of magnitude more Akita than SignalTree. Worth keeping in mind any time you judge a library by how well your AI assistant writes it cold: you're partly measuring its age, not its design.

Finding 2: a single retrievable context file closes most of that gap

I shipped a ~13.5 KB llms.txt (a plain-text API summary) inside the npm package and re-ran with it in context:

Mode	SignalTree score
Cold	49%
+ `llms.txt` (13.5 KB)	91%
+ `llms.txt` + extra notes (~25 KB)	87%

+42 percentage points from one small file — enough to pull the least-known library up into the range of the well-established ones. Two things I didn't expect:

More context made it worse. Adding a second doc regressed accuracy — the extra source seems to dilute the signal rather than reinforce it, with Gemini in particular over-indexing on the noise. Past some point you're hurting, not helping.
It bled across libraries. Loading one library's context dragged the others down — models that had SignalTree's API in context started cross-pollinating it into their Akita/Elf answers:

Library	Cold	With SignalTree's context loaded
SignalTree	49%	91%
NgRx (classic)	91%	88%
NgRx SignalStore	86%	80%
Akita	94%	85%
Elf	94%	87%
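To make the bleed easier to read, the deltas can be worked out directly from the table. The snippet below only re-does that arithmetic on the numbers above; the field names are mine.

```typescript
// Scores copied from the table above (percent).
interface Row { cold: number; withContext: number; }

const table: Record<string, Row> = {
  "SignalTree": { cold: 49, withContext: 91 },
  "NgRx (classic)": { cold: 91, withContext: 88 },
  "NgRx SignalStore": { cold: 86, withContext: 80 },
  "Akita": { cold: 94, withContext: 85 },
  "Elf": { cold: 94, withContext: 87 },
};

// Change in percentage points once SignalTree's context is loaded.
function delta(row: Row): number {
  return row.withContext - row.cold;
}

const names = Object.keys(table);
const others = names.filter(n => n !== "SignalTree");
const meanBleed =
  others.reduce((sum, n) => sum + delta(table[n]), 0) / others.length;

for (const n of names) {
  const d = delta(table[n]);
  console.log(`${n}: ${d > 0 ? "+" : ""}${d} pp`);
}
// SignalTree: +42 pp
// NgRx (classic): -3 pp
// NgRx SignalStore: -6 pp
// Akita: -9 pp
// Elf: -7 pp
console.log(`mean change for the other four: ${meanBleed} pp`); // -6.25 pp
```

So the target library gains 42 points while the other four lose between 3 and 9 each, about 6 on average.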

Practical takeaway: more context is not better. Going from the 13.5 KB file to ~25 KB of context, the numbers went down, not up. If you maintain or use a less-common library, a small retrievable context file does more for codegen accuracy than reaching for a "smarter" model (primed mid-tier models beat cold top-tier ones in my runs), but dumping your whole docs site in backfires.
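One way to act on that is to cap how much context gets prepended to a prompt. This is a minimal sketch, not how the harness assembles its prompts: `buildPrimedPrompt`, the `ContextDoc` shape, and the 15,000-character budget are all hypothetical, and size is counted in characters as a stand-in for bytes (the two match for plain ASCII docs).

```typescript
// Hypothetical helper: prepend context docs in priority order, within a size budget.
interface ContextDoc {
  name: string;
  text: string;
}

interface PrimedPrompt {
  prompt: string;
  included: string[]; // docs that fit in the budget
  skipped: string[];  // docs left out because they would exceed it
}

function buildPrimedPrompt(task: string, docs: ContextDoc[], maxChars: number): PrimedPrompt {
  const included: ContextDoc[] = [];
  const skipped: string[] = [];
  let used = 0;

  // Docs are assumed to be sorted most-important first.
  for (const doc of docs) {
    if (used + doc.text.length <= maxChars) {
      included.push(doc);
      used += doc.text.length;
    } else {
      skipped.push(doc.name);
    }
  }

  const context = included.map(d => `--- ${d.name} ---\n${d.text}`).join("\n\n");
  return {
    prompt: context ? `${context}\n\n${task}` : task,
    included: included.map(d => d.name),
    skipped,
  };
}

// Placeholder docs sized like the two context files in this post.
const docs: ContextDoc[] = [
  { name: "llms.txt", text: "x".repeat(13500) },
  { name: "extra-notes", text: "y".repeat(11500) },
];
const primed = buildPrimedPrompt("Write a counter store.", docs, 15000);
console.log(primed.included, primed.skipped); // [ 'llms.txt' ] [ 'extra-notes' ]
```

The budget is a knob to tune per setup, not a threshold this benchmark established.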

Finding 3 (the one I found most useful): the benchmark exposed an API-consistency bug

The failures weren't random. Agents kept calling methods that didn't exist, and the pattern pointed straight at my own inconsistency — I'd named predicate accessors two different ways across the API:

// some markers used an is- prefix
saveStatus.isLoading()
users.isEmpty()

// others used bare names
profile.dirty()
feed.loading()

An agent that learned isLoading() would confidently try isDirty(), which never existed. That's not an AI failure — it's a human one wearing an AI costume. Any developer reading the docs hits the same wall; they just fail more quietly and blame themselves. I standardized on bare names (matching FormControl.dirty/.valid), kept the old names as deprecated aliases, shipped it.
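For readers who haven't done this kind of rename, a deprecated alias can be as small as the sketch below. `StatusMarker` is a made-up class for illustration, not SignalTree's implementation; the injectable `warn` callback is also my assumption, used so the example has no dependency on a console.

```typescript
// Hypothetical marker class showing a bare-name accessor plus a deprecated alias.
class StatusMarker {
  private state: "idle" | "loading" = "idle";
  private warned = false;

  // `warn` is injectable so callers (and tests) decide where notices go.
  constructor(private warn: (message: string) => void = () => {}) {}

  start(): void {
    this.state = "loading";
  }

  /** Canonical bare-name predicate. */
  loading(): boolean {
    return this.state === "loading";
  }

  /** @deprecated Use `loading()` instead. */
  isLoading(): boolean {
    if (!this.warned) {
      this.warned = true; // warn once, not on every call
      this.warn("isLoading() is deprecated; use loading()");
    }
    return this.loading();
  }
}

const notices: string[] = [];
const marker = new StatusMarker(msg => notices.push(msg));
marker.start();
console.log(marker.loading(), marker.isLoading(), marker.isLoading()); // true true true
console.log(notices.length); // 1
```

The `@deprecated` JSDoc tag is what makes editors strike the old name through, so existing callers keep working while new code gets steered to the bare name.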

The generalizable takeaway, and the reason I think this is worth writing up rather than burying in a changelog: an API surface a model can't keep straight is usually one a human can't either. Codegen accuracy turns out to be a surprisingly good proxy for naming consistency, and a cheap one to measure.
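The naming check itself doesn't need a model. Here is a small sketch, under the assumption that you can list an API's boolean accessors by name; `checkPredicateNaming` and `toBareName` are hypothetical helpers, not part of any library.

```typescript
// Hypothetical check: do an API's boolean accessors mix `isX` and bare `x` naming?
interface NamingReport {
  prefixed: string[];  // names like isLoading
  bare: string[];      // names like dirty
  consistent: boolean; // true when only one convention is in use
}

const IS_PREFIX = /^is[A-Z]/;

function checkPredicateNaming(accessors: string[]): NamingReport {
  const prefixed = accessors.filter(name => IS_PREFIX.test(name));
  const bare = accessors.filter(name => !IS_PREFIX.test(name));
  return { prefixed, bare, consistent: prefixed.length === 0 || bare.length === 0 };
}

// Suggest the bare form of an is-prefixed name: isLoading -> loading.
function toBareName(name: string): string {
  return IS_PREFIX.test(name) ? name.charAt(2).toLowerCase() + name.slice(3) : name;
}

// The four accessor names from the example above.
const report = checkPredicateNaming(["isLoading", "isEmpty", "dirty", "loading"]);
console.log(report.consistent);               // false
console.log(report.prefixed.map(toBareName)); // [ 'loading', 'empty' ]
```

Run against a list of exported accessor names, a check like this would flag the mix before any benchmark does.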

Where I'd attack this

I'd rather list the holes than have them found, so here are the three I'd lead with:

Conflict of interest, four ways. I built one of the five libraries, wrote its context file, picked the eight prompts, and wrote the scoring rubric. That's four levers I could have pulled in my own favor, so don't trust the absolute percentages. The one number I couldn't rig in my favor is that my library scored worst in the cold run — and the thing I'd actually defend is the relative movement (cold→primed deltas, cross-library bleed), not the exact values.
Sample size / "temperature 0 ≠ deterministic." This is one pass per cell, and temp 0 on hosted APIs is not truly deterministic — identical prompts drift run to run. So treat any single cell as having a few points of noise around it; this is indicative, not publication-grade. The effects I'm pointing at (a ~40pp jump, established libs clustering in the 90s cold) are large enough to survive that noise, but a 3-point difference between two cells means nothing.
The rubric structurally favors smaller APIs. One of the three checks is "does the called method exist." A library with fewer total methods is simply easier to get right — and the tree-shaped library I built has a smaller surface than, say, classic NgRx. So some of the primed lift is plausibly "less to get wrong," not "better designed." I think that's a real confound, not a fully-controlled variable.
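On the sample-size point, the obvious upgrade is several passes per cell and a look at the spread before reading anything into a gap. The sketch below is a crude heuristic rather than a proper statistical test, and the run values in it are invented for illustration; they are not benchmark results.

```typescript
// Mean and sample standard deviation of repeated runs of one cell.
interface Summary {
  mean: number;
  stdDev: number;
}

function summarize(runs: number[]): Summary {
  if (runs.length < 2) throw new Error("need at least two runs to estimate spread");
  const mean = runs.reduce((a, b) => a + b, 0) / runs.length;
  const variance =
    runs.reduce((acc, x) => acc + (x - mean) * (x - mean), 0) / (runs.length - 1);
  return { mean, stdDev: Math.sqrt(variance) };
}

// Crude rule of thumb (an assumption, not a significance test):
// treat a gap as real only if it exceeds k times the larger spread.
function clearlyDifferent(a: Summary, b: Summary, k: number = 2): boolean {
  return Math.abs(a.mean - b.mean) > k * Math.max(a.stdDev, b.stdDev);
}

// Invented scores for three hypothetical cells, three passes each.
const cellA = summarize([90, 92, 88]); // mean 90, stdDev 2
const cellB = summarize([91, 93, 89]); // mean 91, stdDev 2
const cellC = summarize([49, 51, 47]); // mean 49, stdDev 2

console.log(clearlyDifferent(cellA, cellB)); // false: a 1-point gap is inside the noise
console.log(clearlyDifferent(cellA, cellC)); // true: a 41-point gap is not
```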

And one that's less a flaw than a "yeah, obviously": cold score ≈ training-data volume is barely a finding — it's close to a truism once you say it out loud. The only mildly non-obvious part is how cheaply a retrievable file substitutes for years of corpus presence.

Reproduce it yourself

One OpenRouter key, ~$15, ~30 minutes:

git clone https://github.com/JBorgia/signaltree
cd signaltree
export OPENROUTER_API_KEY=sk-or-...
node scripts/ai-codegen-benchmark/runner.mjs

Prompts (YAML), scoring rubric, adapters, and per-cell results all live in scripts/ai-codegen-benchmark/. The prompts and rubric are the parts most worth disagreeing with — if you spot one that's unfair to a particular library, that's the most useful feedback I can get.

I'd love to hear

For those of you using Copilot / Cursor / Claude Code daily: when the generated code for a library is bad, what's actually fixed it for you — a custom rules file, pasted docs, an MCP server, something else? I'm especially curious whether the "ship a small context file" result holds outside my own setup, or whether interactive back-and-forth makes it moot.

Top comments (1)

Jonathan D Borgia • May 30

Author here. I maintain one of the five libraries and it scored worst cold, so this isn't a pitch. The part I found interesting is that cold accuracy tracked training-data volume more than design, and a small retrievable file mostly erased the gap. Happy to defend the methodology; the harness is one command.
