Disclosure up front: I maintain one of the five libraries tested (SignalTree), and it's the one that scored worst in the cold run — so this isn't a "look how good my thing is" post. The cross-library pattern and the fix were interesting enough that I wanted to put the numbers in front of people who use Copilot/Cursor/Claude Code every day. The whole harness is reproducible (one command, link at the bottom); I'd rather it get torn apart than taken on faith.
Setup
- Libraries: NgRx (classic), NgRx SignalStore, Akita, Elf, SignalTree.
- Agents: Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro, Perplexity Sonar Pro, Claude Haiku 4.5, GPT-5.4-mini.
- 8 prompts: counter, paginated users, debounced search, derived totals, login form, undo/redo, deep nested state, multi-marker editor.
- 5 libs × 6 agents × 8 prompts × 3 priming modes = 720 cells. Temperature 0. Identical prompt text per library (only the library name swapped).
- Scored on three orthogonal checks: idiomatic-pattern match, import resolution (does every import resolve to a real package), and method validity (do the called methods actually exist on the API).
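To make the import-resolution and method-validity checks concrete, here is a minimal sketch of how such checks could work. It is not the harness's actual code: the function names, the regex-based extraction, and the `example-store` package are all assumptions made up for illustration, and a real scorer would want a proper parser.

```typescript
// Hypothetical sketch: names and approach are illustrative, not the real harness.
interface CheckResult {
  unknownImports: string[];
  unknownMethods: string[];
  importsResolve: boolean;
  methodsValid: boolean;
}

// Collect module specifiers from `import ... from "pkg"` statements.
function extractImports(source: string): string[] {
  const found: string[] = [];
  const pattern = /import\s+[^;]*?from\s+["']([^"']+)["']/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    found.push(match[1]);
  }
  return found;
}

// Collect names used in `.name(` member calls, de-duplicated.
// A regex is a crude stand-in for a parser and will miss edge cases.
function extractMethodCalls(source: string): string[] {
  const found: string[] = [];
  const pattern = /\.([A-Za-z_$][\w$]*)\s*\(/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    if (found.indexOf(match[1]) === -1) found.push(match[1]);
  }
  return found;
}

// Compare what the generated file uses against what is known to exist.
function checkGeneratedFile(
  source: string,
  knownPackages: string[],
  knownMethods: string[],
): CheckResult {
  const unknownImports = extractImports(source).filter(
    (pkg) => knownPackages.indexOf(pkg) === -1,
  );
  const unknownMethods = extractMethodCalls(source).filter(
    (method) => knownMethods.indexOf(method) === -1,
  );
  return {
    unknownImports,
    unknownMethods,
    importsResolve: unknownImports.length === 0,
    methodsValid: unknownMethods.length === 0,
  };
}

// Usage with a made-up library and a made-up generated file.
const generated = [
  'import { createStore } from "example-store";',
  'import { helper } from "made-up-pkg";',
  "const s = createStore();",
  "s.dirty();",
  "s.isDirty();",
].join("\n");

const report = checkGeneratedFile(generated, ["example-store"], ["dirty", "loading"]);
```

Here `report.unknownImports` contains `made-up-pkg` and `report.unknownMethods` contains `isDirty`, so both checks fail for this file. Relative imports and method calls on non-library objects would need extra handling that this sketch leaves out.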
What this measures: one-shot generation. The agent gets the prompt, returns a file, we score it. Real interactive use — Cursor/Copilot with chat back-and-forth, where the model sees its own errors and gets a second try — is a different setting, and the lift could be larger or smaller there. This is the cold-shot case.
Finding 1: cold accuracy basically tracks how much the library is in the training data
No context provided, just "write this in library X":
| Library | Cold score |
|---|---|
| Akita | 94% |
| Elf | 94% |
| NgRx (classic) | 91% |
| NgRx SignalStore | 86% |
| SignalTree | 49% |
The libraries that have been around for years, with thousands of blog posts and Stack Overflow answers, score in the 90s. The youngest/smallest library in the set scores ~49%. That gap isn't really a quality signal — it's a corpus signal. The models have simply seen orders of magnitude more Akita than SignalTree. Worth keeping in mind any time you judge a library by how well your AI assistant writes it cold: you're partly measuring its age, not its design.
Finding 2: a single retrievable context file closes most of that gap
I shipped a ~13.5 KB llms.txt (a plain-text API summary) inside the npm package and re-ran with it in context:
| Mode | SignalTree score |
|---|---|
| Cold | 49% |
| + llms.txt (13.5 KB) | 91% |
| + llms.txt + extra notes (~25 KB) | 87% |
+42 percentage points from one small file — enough to pull the least-known library up into the range of the well-established ones. Two things I didn't expect:
- More context made it worse. Adding a second doc regressed accuracy — the extra source seems to dilute the signal rather than reinforce it, with Gemini in particular over-indexing on the noise. Past some point you're hurting, not helping.
- It bled across libraries. Loading one library's context dragged the others down — models that had SignalTree's API in context started cross-pollinating it into their Akita/Elf answers:
| Library | Cold (%) | With SignalTree's context loaded (%) |
|---|---|---|
| SignalTree | 49 | 91 |
| NgRx (classic) | 91 | 88 |
| NgRx SignalStore | 86 | 80 |
| Akita | 94 | 85 |
| Elf | 94 | 87 |
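The relative movement in the table above is easier to see as deltas. This short sketch only redoes the arithmetic on the numbers already shown; the variable names are my own.

```typescript
// Scores (percent) copied from the table above.
const scores = [
  { library: "SignalTree", cold: 49, primed: 91 },
  { library: "NgRx (classic)", cold: 91, primed: 88 },
  { library: "NgRx SignalStore", cold: 86, primed: 80 },
  { library: "Akita", cold: 94, primed: 85 },
  { library: "Elf", cold: 94, primed: 87 },
];

// Change in percentage points once SignalTree's context is loaded.
const deltas = scores.map((row) => ({
  library: row.library,
  delta: row.primed - row.cold,
}));

// Average change across the four libraries whose own context was not loaded.
const others = deltas.filter((d) => d.library !== "SignalTree");
const meanBleed = others.reduce((sum, d) => sum + d.delta, 0) / others.length;

console.log(deltas);
console.log(meanBleed);
```

The deltas come out to +42 for SignalTree and -3, -6, -9, and -7 for the other four, an average of -6.25 points of bleed.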
Practical takeaway: more context is not automatically better. Going from the 13.5 KB file to ~25 KB of context dropped SignalTree from 91% to 87%, so somewhere past roughly 15 KB the extra material started to hurt. If you maintain or use a less-common library, a small retrievable context file does more for codegen accuracy than reaching for a "smarter" model (primed mid-tier models beat cold top-tier ones in my runs), but dumping your whole docs site in backfires.
Finding 3 (the one I found most useful): the benchmark exposed an API-consistency bug
The failures weren't random. Agents kept calling methods that didn't exist, and the pattern pointed straight at my own inconsistency — I'd named predicate accessors two different ways across the API:
// some markers used an is- prefix
saveStatus.isLoading()
users.isEmpty()
// others used bare names
profile.dirty()
feed.loading()
An agent that learned isLoading() would confidently try isDirty(), which never existed. That's not an AI failure — it's a human one wearing an AI costume. Any developer reading the docs hits the same wall; they just fail more quietly and blame themselves. I standardized on bare names (matching FormControl.dirty/.valid), kept the old names as deprecated aliases, shipped it.
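For anyone making the same kind of rename, a deprecated alias can be as small as the sketch below. The `StatusMarker` class is hypothetical and is not SignalTree's actual implementation; it only illustrates keeping an old `is`-prefixed name working while steering callers to the bare name.

```typescript
// Hypothetical marker class, for illustration only.
class StatusMarker {
  private pending = false;
  private warned = false;

  start(): void {
    this.pending = true;
  }

  finish(): void {
    this.pending = false;
  }

  /** Canonical bare-name predicate. */
  loading(): boolean {
    return this.pending;
  }

  /** @deprecated Use loading() instead. */
  isLoading(): boolean {
    // Warn once per instance so logs are not flooded.
    if (!this.warned) {
      console.warn("isLoading() is deprecated; use loading() instead.");
      this.warned = true;
    }
    return this.loading();
  }
}

const saveStatus = new StatusMarker();
saveStatus.start();
const viaNewName = saveStatus.loading();
const viaOldName = saveStatus.isLoading();
```

Both calls return the same value because the alias delegates to the canonical method, and the `@deprecated` JSDoc tag makes editors strike through the old name.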
The generalizable takeaway, and the reason I think this is worth writing up rather than burying in a changelog: an API surface a model can't keep straight is usually one a human can't either. Codegen accuracy turns out to be a surprisingly good proxy for naming consistency, and a cheap one to measure.
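This kind of inconsistency can also be caught without a model in the loop. The sketch below is one possible check, assuming you can list your public method names and the predicate stems you care about; the function name and the stem list are hypothetical.

```typescript
// Hypothetical consistency check: splits predicate accessors into
// `is`-prefixed and bare-name styles and flags a mix of the two.
interface PredicateNaming {
  prefixed: string[];
  bare: string[];
  consistent: boolean;
}

function classifyPredicates(methodNames: string[], stems: string[]): PredicateNaming {
  const prefixed: string[] = [];
  const bare: string[] = [];
  for (const method of methodNames) {
    if (/^is[A-Z]/.test(method)) {
      // "isLoading" -> "loading"
      const stem = method.charAt(2).toLowerCase() + method.slice(3);
      if (stems.indexOf(stem) !== -1) prefixed.push(method);
    } else if (stems.indexOf(method) !== -1) {
      bare.push(method);
    }
  }
  // Consistent means only one of the two styles is in use.
  return { prefixed, bare, consistent: prefixed.length === 0 || bare.length === 0 };
}

// Method names taken from the example earlier in this post, plus one non-predicate.
const naming = classifyPredicates(
  ["isLoading", "isEmpty", "dirty", "loading", "set"],
  ["loading", "empty", "dirty"],
);
```

With those inputs `naming.prefixed` holds `isLoading` and `isEmpty`, `naming.bare` holds `dirty` and `loading`, and `naming.consistent` is `false`, which is the same mix the agents tripped over.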
Where I'd attack this
I'd rather list the holes than have them found, so here are the three I'd lead with:
- Conflict of interest, four ways. I built one of the five libraries, wrote its context file, picked the eight prompts, and wrote the scoring rubric. That's four levers I could have pulled in my own favor, so don't trust the absolute percentages. The one number I couldn't rig in my favor is that my library scored worst in the cold run — and the thing I'd actually defend is the relative movement (cold→primed deltas, cross-library bleed), not the exact values.
- Sample size / "temperature 0 ≠ deterministic." This is one pass per cell, and temp 0 on hosted APIs is not truly deterministic — identical prompts drift run to run. So treat any single cell as having a few points of noise around it; this is indicative, not publication-grade. The effects I'm pointing at (a ~40pp jump, established libs clustering in the 90s cold) are large enough to survive that noise, but a 3-point difference between two cells means nothing.
- The rubric structurally favors smaller APIs. One of the three checks is "does the called method exist." A library with fewer total methods is simply easier to get right — and the tree-shaped library I built has a smaller surface than, say, classic NgRx. So some of the primed lift is plausibly "less to get wrong," not "better designed." I think that's a real confound, not a fully-controlled variable.
And one that's less a flaw than a "yeah, obviously": cold score ≈ training-data volume is barely a finding — it's close to a truism once you say it out loud. The only mildly non-obvious part is how cheaply a retrievable file substitutes for years of corpus presence.
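On the sample-size point above, one way to put a number on the noise is to repeat each cell a few times and compare the gap between two cells against their run-to-run spread. The sketch below is a rough heuristic, not a proper significance test, and the scores in it are made-up placeholders rather than results from this benchmark.

```typescript
// Arithmetic mean of a list of scores.
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Sample standard deviation (n - 1 in the denominator).
function sampleStdDev(values: number[]): number {
  const m = mean(values);
  const squared = values.reduce((sum, v) => sum + (v - m) * (v - m), 0);
  return Math.sqrt(squared / (values.length - 1));
}

// Crude rule of thumb: treat a gap as real only if it exceeds
// `k` times the larger of the two run-to-run spreads.
function differenceExceedsNoise(a: number[], b: number[], k: number = 2): boolean {
  const gap = Math.abs(mean(a) - mean(b));
  const noise = Math.max(sampleStdDev(a), sampleStdDev(b));
  return gap > k * noise;
}

// Placeholder scores for three repeated runs per cell (NOT measured data).
const cellA = [90, 92, 91];
const cellB = [88, 91, 93];
const cellC = [48, 50, 49];

const smallGapIsReal = differenceExceedsNoise(cellA, cellB);
const largeGapIsReal = differenceExceedsNoise(cellA, cellC);
```

With these placeholder inputs the sub-point gap between `cellA` and `cellB` does not clear the threshold, while the roughly 40-point gap between `cellA` and `cellC` does.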
Reproduce it yourself
One OpenRouter key, ~$15, ~30 minutes:
git clone https://github.com/JBorgia/signaltree
cd signaltree
export OPENROUTER_API_KEY=sk-or-...
node scripts/ai-codegen-benchmark/runner.mjs
Prompts (YAML), scoring rubric, adapters, and per-cell results all live in scripts/ai-codegen-benchmark/. The prompts and rubric are the parts most worth disagreeing with — if you spot one that's unfair to a particular library, that's the most useful feedback I can get.
Links
- Benchmark source + full per-agent results: https://github.com/JBorgia/signaltree/tree/main/scripts/ai-codegen-benchmark
- Repo: https://github.com/JBorgia/signaltree
- Demo + docs: https://jborgia.github.io/signaltree/
I'd love to hear
For those of you using Copilot / Cursor / Claude Code daily: when the generated code for a library is bad, what's actually fixed it for you — a custom rules file, pasted docs, an MCP server, something else? I'm especially curious whether the "ship a small context file" result holds outside my own setup, or whether interactive back-and-forth makes it moot.