**TL;DR.** I built an Anthropic Agent Skill for @ngrx/signals and ran it through the full eval pipeline: capability A/B benchmarks, token and wall-time accounting, and a description-optimizer loop. The skill lifts pass rate from 84% to 100%. It also adds 14 seconds and ~12,000 tokens per invocation (about $0.04 at Sonnet 4.6 input pricing). The description optimizer ran for three iterations and never beat the description I started with. And my evals are now saturated at 100%, which is itself a problem. All four findings are useful, and most teams ship skills without measuring any of them.
## The two questions almost no one asks
You write a 200-line SKILL.md, drop it in ~/.claude/skills/, restart Claude Code, and move on. Skill shipped.
Two questions you didn't answer:
A) Does this skill actually outperform the base model on the same task, or are you shipping a glorified prompt that the model could already do without help?
B) Does the description match how real users phrase requests, so the model auto-loads the skill at the right time, and stays out of the way when it shouldn't?
Both questions are invisible failures. The skill loads, the output looks fine, you assume it's working. You can do this for months.
Anthropic ran its own description-optimization loop across its six built-in document-creation skills and "saw improved triggering on 5 out of 6." If the team that designed the system has gaps in its own skills, your custom skills almost certainly do too.
Stronger evidence still: in January 2026, Vercel published *AGENTS.md outperforms Skills in our agent evals*. They measured Skills against an inline AGENTS.md doc index on a Next.js 16 task suite:
| Configuration | Pass rate |
|---|---|
| Baseline (no docs) | 53% |
| Skills (default) | 53% |
| Skills (with explicit instructions) | 79% |
| AGENTS.md docs index | 100% |
Their finding: Skills remained uninvoked in 56% of cases despite the agent having access. Same under-triggering problem I'm about to show in my own data.
Vercel concluded "skip Skills, inline the docs." I disagree for libraries the model already knows. Skills are still the right shape for encoded preference, your team's specific idioms, where progressive disclosure beats stuffing thousands of tokens of style guide into every conversation. But the measurement is the same: agents do not reliably consult skills on conversational prompts. Both teams converge on one thing. Without an eval suite you cannot tell whether your skill is doing what you think it's doing.
In March 2026 Anthropic shipped an updated skill-creator that brings test/measure/refine discipline to skills. I used it on a real @ngrx/signals skill and wrote down everything that happened.
## An eval suite for a skill: three parts, written before the skill
A skill eval has three parts:
- Test prompts. Realistic tasks a user would actually give.
- Assertions. Verifiable statements about good output, like "HTTP is in a separate service, not inlined in the store."
- Two configurations. The same prompt run with the skill loaded and without it. The delta is the skill's value.
For my @ngrx/signals skill I wrote five tasks:
| ID | Task |
|---|---|
| 1 | Build a CartStore with optimistic updates, request status, and rollback |
| 2 | Refactor a legacy BehaviorSubject-based service to Signal Store + API service split |
| 3 | Write a typed reusable withSelectedEntity<T>() custom feature |
| 4 | Build a typeahead BookSearchStore with rxMethod, debounce, and cancellation |
| 5 | Build a TodosStore using @ngrx/signals/entities updaters |
41 assertions total. Each was written before the skill was finalised. Assertions are the spec, not a confirmation. (Caveat: my v2 skill ends up passing all 41. That's a problem in itself, and I come back to it at the bottom under "100% is a saturation signal".)
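For concreteness, here's roughly what one entry looks like. This is a hedged sketch: the field names approximate the `evals.json` shape skill-creator scaffolds (check your generated files for the exact schema), and the assertion strings paraphrase my actual suite.

```typescript
// Sketch of one evals.json entry; field names are illustrative.
const cartStoreEval = {
  id: 'cart-store-optimistic',
  prompt:
    'Build a CartStore with optimistic updates, request status, and rollback.',
  assertions: [
    'State is explicitly typed: withState<CartState>(initialState)',
    'A full-state snapshot is taken via getState(store) before the optimistic patch',
    'On HTTP error, the snapshot is restored in a single patchState call',
    'HTTP is in a separate CartService, not inlined in the store',
  ],
};
```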
**Methodology disclosure.** I ran each (eval × configuration) pair once, not three times. The ± stddev numbers below are dispersion across the 5 evals, not across repeated runs of the same eval. A more rigorous setup would be 3 runs per cell to absorb temperature noise. Treat the deltas as directional signal, not as p-values.
## Failure mode A: the skill exists but adds no value
This is the embarrassing one. Your skill loads, costs tokens, and does nothing the base model couldn't do on its own.
The benchmark viewer (generated by skill-creator's `eval-viewer/generate_review.py`) shows the headline numbers and the per-eval breakdown side by side.
Per-eval, base model vs. the same model with the skill loaded:
| Eval | with_skill | without_skill | Δ |
|---|---|---|---|
| 1. cart-store-optimistic | 10/10 = 100% | 7/10 = 70% | +30pp |
| 2. refactor-behaviorsubject | 8/8 = 100% | 7/8 = 87.5% | +12.5pp |
| 3. custom-feature-with-selected-entity | 8/8 = 100% | 7/8 = 87.5% | +12.5pp |
| 4. rxmethod-debounced-search | 7/7 = 100% | 7/7 = 100% | ±0 |
| 5. todos-entities-crud | 8/8 = 100% | 6/8 = 75% | +25pp |
| Aggregate | 41/41 = 100% | 34/41 = 82.9% (84% mean across evals) | +16pp |
The skill provides a real, measurable +16 percentage-point uplift across 41 assertions. Look at eval 4 though: zero uplift. The base model already aces the `debounceTime` → `distinctUntilChanged` → `switchMap` → `tapResponse` recipe. Keeping the `rxMethod` section of the skill in context costs tokens on every invocation while contributing nothing.
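For reference, that recipe in sketch form. The store shape and `BooksService` are my own assumptions; the operator chain is the part the base model already reproduces unaided:

```typescript
import { inject } from '@angular/core';
import { Observable, pipe, debounceTime, distinctUntilChanged, switchMap } from 'rxjs';
import { patchState, signalStore, withMethods, withState } from '@ngrx/signals';
import { rxMethod } from '@ngrx/signals/rxjs-interop';
import { tapResponse } from '@ngrx/operators';

interface Book {
  id: string;
  title: string;
}

// Assumed service shape, not prescribed by the eval.
abstract class BooksService {
  abstract search(query: string): Observable<Book[]>;
}

export const BookSearchStore = signalStore(
  withState({ books: [] as Book[] }),
  withMethods((store, booksService = inject(BooksService)) => ({
    search: rxMethod<string>(
      pipe(
        debounceTime(300),      // let typing settle
        distinctUntilChanged(), // skip repeated queries
        switchMap((query) =>    // cancel the in-flight request
          booksService.search(query).pipe(
            tapResponse({
              next: (books) => patchState(store, { books }),
              error: () => patchState(store, { books: [] }),
            }),
          ),
        ),
      ),
    ),
  })),
);
```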
This is what blind A/B benchmarking is for. If a skill section scores no better than baseline, that section is paying tokens for nothing. Trim it. Without the benchmark I would have happily kept that section in there forever.
### Where the baseline failed and the skill caught it
The interesting question isn't "did the skill help," it's "where exactly did it help?" The viewer renders this side-by-side per assertion, and the green-and-red wall makes it obvious where the skill is doing real work. The recurring pattern across the 7 baseline failures:
- **Explicit state typing.** Base model writes `withState(initialState)` (inferred). Skill teaches `withState<FooState>(initialState)`. Catches patches with the wrong shape at compile time (sketched below).
- **`getState`-based snapshot for rollback.** Base model snapshots via `store.items()` directly. Skill teaches the `getState()` helper for full-state snapshots, the right tool when rollback needs to restore multiple slices.
- **Standalone updaters vs. store-bound methods.** Base model exposes `setPending`/`setFulfilled` as methods on the store. Skill teaches them as standalone functions returning `Partial<State>`, composable in a single `patchState(...)` call.
- **`updateEntity` function-form changes.** Base model reads the entity, computes the toggle, passes a partial. Skill teaches `changes: (entity) => ({ ... })` for atomic state-dependent updates (sketched below).
- **`removeEntities` predicate form.** Base model filters and maps to ids and passes an id array. Skill teaches the predicate form.
- **Typed `signalStoreFeature` prerequisite overload.** Base model uses bare `signalStoreFeature(...)` with manual casts. Skill teaches `signalStoreFeature({ state: type<EntityState<T>>() }, ...)` so misuse is a compile error, not a runtime surprise.
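A hedged sketch of the first and fourth corrections in the skill's "wrong → right" style (store and type names are mine, not from the skill):

```typescript
import { patchState, signalStore, withMethods, withState, type } from '@ngrx/signals';
import { updateEntity, withEntities } from '@ngrx/signals/entities';

interface Todo {
  id: string;
  done: boolean;
}
interface TodosState {
  filter: 'all' | 'open';
}
const initialState: TodosState = { filter: 'all' };

// Wrong: withState(initialState) infers the type, so a patch with a
// misspelled key still compiles. Right: the explicit type parameter
// turns wrong-shape patches into compile errors.
const TodosStore = signalStore(
  withState<TodosState>(initialState),
  withEntities({ entity: type<Todo>() }),
  withMethods((store) => ({
    toggle(id: string) {
      // Wrong (races): read the entity, compute, pass a partial.
      //   const todo = store.entityMap()[id];
      //   patchState(store, updateEntity({ id, changes: { done: !todo.done } }));

      // Right: function form, computed from the current entity atomically.
      patchState(store, updateEntity({ id, changes: (todo) => ({ done: !todo.done }) }));
    },
  })),
);
```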
All six are encoded-preference failures. The model knows the API, but doesn't know which form is idiomatic. That distinction sets the skill's lifetime: a capability uplift skill (teaching the model something it didn't know) gets retired the moment the next model release closes the gap. An encoded preference skill stays useful as long as your team's idioms stay the same. For popular libraries like @ngrx/signals, you should expect to land in the second bucket, and the eval is what tells you you're there.
## Failure mode B: the skill exists but doesn't trigger
Worse than failure mode A, because it's invisible. The skill never runs, you blame the model, you never discover the description was the bottleneck.
skill-creator ships a script called `run_loop.py` that handles description optimization with proper ML discipline:
- Generate ~20 synthetic prompts: half should-trigger, half should-NOT-trigger.
- The negatives deliberately share keywords with the skill, adjacent intent rather than random noise. This is what catches false positives.
- 60/40 train/test split. 3 runs per query (single runs are too noisy to be a signal).
- Up to N iterations of: evaluate, propose new description, re-evaluate.
- Pick the description with the best test-set score, not train score. Standard ML hygiene applied to a YAML field.
I drafted 20 queries: 10 substantive Signal Store scenarios and 10 near-misses (classic @ngrx/store, plain RxJS, plain signal() for component-local state, @ngrx/effects setup, the Subject vs. BehaviorSubject question, and so on). Then I let the loop run for 3 iterations.
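A sketch of the query set, with field names of my own choosing (check the schema `run_loop.py` actually expects before copying):

```typescript
// Illustrative trigger-eval entries; the negatives deliberately share
// keywords with the skill (adjacent intent, not random noise).
const triggerQueries = [
  // should trigger: substantive Signal Store work, no keyword anchor
  { query: "What's the right pattern for an optimistic update with rollback?", shouldTrigger: true },
  { query: 'Refactor this BehaviorSubject service to a Signal Store', shouldTrigger: true },
  // should NOT trigger: adjacent intent, shared vocabulary
  { query: 'Set up @ngrx/effects for my existing reducers', shouldTrigger: false },
  { query: 'When should I use Subject vs. BehaviorSubject?', shouldTrigger: false },
  { query: 'Is a plain signal() fine for component-local state?', shouldTrigger: false },
];
```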
The HTML report is the most striking visual in the whole experiment.
The per-iteration scores:
| Iteration | Description style | Train (n=13) | Test (n=7, held-out) |
|---|---|---|---|
| 0 (original) | Keyword-anchored + explicit exclusions | 6/13 = 46.2% | 3/7 = 42.9% |
| 1 | "Use for any Angular state management question…" (intent-driven, longer) | 6/13 = 46.2% | 3/7 = 42.9% |
| 2 | "Invoke this skill when…" (imperative, codebase-context cue) | 6/13 = 46.2% | 3/7 = 42.9% |
The optimizer ran exactly as designed. It proposed two alternatives, evaluated each over 60 model calls per iteration, and selected the description with the best test score. It picked the original. No proposed alternative beat the baseline.
This is not a failure of the loop. It's the right answer when no improvement exists.
### The recall ceiling no description rewrite can break
Two reasons the optimizer found nothing better, both diagnostic:
- **The model under-triggers conversational queries no matter what the description says.** Several should-trigger queries used plain English with no `signalStore`/`withState` anchor, like "what's the right pattern for an optimistic update with rollback?" or "I have an order detail page where the `OrderEditStore` should live for the duration of `/orders/:id/edit`". The model decides it can answer without consulting a skill. No description rewrite fixes that.
- **Precision was already 100%.** All three descriptions correctly avoided triggering on the 10 should-not-trigger queries. The only failure mode left was recall, and the model's under-triggering bias is hostile to recall improvements via prose alone.
This is the kind of insight pure capability evals would never surface. Capability evals only describe the world where the skill loads. Trigger evals add the rest of the picture: even a perfect skill leaks value if the agent only consults it half the time it ought to. That leak is invisible from the capability side.
Independent confirmation. This isn't a quirk of my eval set. Vercel's January 2026 benchmark hit the same wall from a different angle: their Skills configuration was uninvoked in 56% of cases even when the agent had access. Different conclusion (drop Skills entirely), same root observation: agents do not reliably consult skills on conversational prompts.
## What it costs: +14 seconds and ~$0.04 per invocation
Skills aren't free. Every invocation reads the SKILL.md body into context. Reference files are loaded on demand, but they still cost tokens when the model decides to read them.
| Metric | with_skill (v2) | baseline | Δ vs baseline |
|---|---|---|---|
| Pass rate | 100% ± 0% | 84% ± 12% | +16 pp |
| Time per run | 62.7s ± 16.1s | 49.0s ± 4.8s | +13.7s |
| Tokens per run | 32,343 ± 2,737 | 19,927 ± 394 | +12,416 |
That's a 28% wall-time increase and a 62% token increase per invocation. At Sonnet 4.6 pricing ($3 / M input, $15 / M output) the per-call delta works out to about +$0.04 at the headline rate, since the 12k delta is almost entirely additional input from reading the SKILL.md and references. With prompt caching on, repeat invocations of the same skill inside the 5-minute cache window read cached input at 10% of base ($0.30 / M), so each warm call drops to roughly +$0.004. Cold starts still pay the full +$0.04, and any edit to the skill body or the container's skills list breaks the cache.

For a team running 500 skill-loading prompts a day, the realistic range is somewhere between ~$2/day if most calls hit a warm cache and ~$20/day if every call is cold. Either way, the +14 second wall-time hit is more painful than the dollar cost for an interactive coding assistant. That's the difference between "snappy" and "noticeably waiting."
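The arithmetic behind those numbers, spelled out. Only the input side is priced, since the delta is almost entirely input tokens:

```typescript
// Per-call input-token delta priced at the headline Sonnet 4.6 rates.
const deltaTokens = 12_416; // extra input per skill-loading call
const inputPerM = 3.0;      // $ per 1M input tokens
const cachedPerM = 0.3;     // $ per 1M cached input tokens (10% of base)

const coldCost = (deltaTokens / 1e6) * inputPerM;  // ≈ $0.037 → "+$0.04"
const warmCost = (deltaTokens / 1e6) * cachedPerM; // ≈ $0.0037 → "+$0.004"

// 500 skill-loading prompts/day brackets the daily spend:
const allWarm = 500 * warmCost; // ≈ $1.86/day → "~$2/day"
const allCold = 500 * coldCost; // ≈ $18.62/day → "~$20/day"
```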
Two things make the trade-off worth it:
- The +16pp pass rate is concentrated on patterns that are hard to debug after the fact. A wrong-form `updateEntity` won't crash, it'll just race in production. A missed `getState` snapshot won't crash either, it'll just leak optimistic-update state. Catching these at generation time is genuinely valuable.
- You can iterate the skill to be cheaper. My initial v1 cost +20.6s and +13.3k tokens. After one iteration (adding a canonical scaffold and six "wrong → right" snippets inline) v2 dropped to +13.7s and +12.4k tokens, a 33% time reduction and 6% token reduction while keeping the same 100% pass rate. The trick: inline corrections meant the model didn't need to read as many reference files per task.
Skill body length is a poor proxy for skill cost. What matters is how much additional content the model has to load on top of the body. v2 was 138 lines longer than v1 and still cost less in practice.
## Wire benchmarks into PR CI before the model changes under you
Minimum viable eval setup for any custom skill:
```text
your-skill/
├── SKILL.md
├── references/
└── evals/
    ├── evals.json          # 4-6 substantive tasks with assertions
    └── trigger-eval.json   # 20 should-/should-not queries
```
Then a CI job that re-runs both benchmarks after every model bump or skill PR. Minimal GitHub Actions shape:
```yaml
name: skill-evals
on:
  pull_request:
    paths: [ 'skills/your-skill/**' ]
  push:
    paths: [ 'skills/your-skill/**' ]
  workflow_dispatch:
  schedule:
    - cron: '0 6 * * 1' # weekly, catches model-side drift
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - name: Run capability benchmark
        run: python -m scripts.aggregate_benchmark workspace/iteration-latest --skill-name your-skill
      - name: Fail PR if pass rate drops below threshold
        run: |
          jq '.run_summary.with_skill.pass_rate.mean' workspace/iteration-latest/benchmark.json \
            | awk '{ if ($1 < 0.95) { print "Pass rate below 95%"; exit 1 } }'
      - name: Run description trigger optimizer
        run: python -m scripts.run_loop --eval-set evals/trigger-eval.json --skill-path skills/your-skill --model claude-sonnet-4-6 --max-iterations 1
```
The whole loop takes about 15 minutes of wall time (with executor and grader subagents running in parallel). The cost of not running it is the slow accretion of capability-uplift skills that became no-ops two model versions ago, and triggering descriptions that quietly stopped firing.
## A footnote on my own evals: 100% is a saturation signal
My with-skill pass rate is 100% across both iterations. That sounds good. It isn't.
In *Demystifying evals for AI agents*, Anthropic put it bluntly:

> "An eval at 100% tracks regressions but provides no signal for improvement."
Mine track regressions just fine. If claude-opus-5 ships and pass rate drops, I'll know within 15 minutes. But for further skill iteration the suite is saturated. I can't tell whether v3 of the skill is better than v2, because both will hit 100%.
Concrete next-round eval candidates that would not hit 100%:
- A multi-store coordination task where a `CheckoutStore` reads from `CartStore` and `UserStore` without violating the no-cross-store-injection rule.
- An entity store with a custom `idKey: 'uuid'` plus a named collection. The kind of thing that exposes whether the model knows the multi-collection updater syntax.
- A migration from `@ngrx/store` (classic NgRx with reducers/effects) to Signal Store, where the agent has to translate actions into store methods correctly.
- A route-scoped store with `withHooks({ onDestroy })` cleanup that triggers a pending HTTP cancel (sketched below).
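A minimal sketch of that last candidate, assuming a hypothetical `OrderEditStore` (none of this is in the eval suite yet):

```typescript
import { signalStore, withHooks, withState } from '@ngrx/signals';

interface Order {
  id: string;
}

// Hypothetical route-scoped store: provide it on the /orders/:id/edit
// route so leaving the route destroys it. rxMethod-based calls are
// unsubscribed automatically on destroy; onDestroy covers anything
// managed by hand.
export const OrderEditStore = signalStore(
  withState({ order: null as Order | null }),
  withHooks({
    onDestroy() {
      // assertion target: pending HTTP work (AbortController, timers,
      // manual subscriptions) is torn down here
    },
  }),
);
```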
Same skill, same idioms, harder ground truth. Without explicit pass-rate visibility I would happily keep iterating in the dark, congratulating myself on each "100%" run.
## Where to go from here
For a popular library like @ngrx/signals, the base model knows the API. The skill's job is encoded preference: making the model produce your team's idiomatic forms, not just any working code. Without evals you can't tell whether the skill is actually doing that.
Four findings, all worth carrying forward:
- Capability uplift is real (+16pp pass rate), but concentrated in 6 specific idiom corrections, not in API knowledge. As long as your idioms are the value, the skill survives model upgrades.
- Skill cost is real (+14s wall time, +12k tokens, ~$0.04 per call), and iteration can reduce it (v1 → v2 cut wall time by 33%).
- Triggering has a ceiling (~46% recall) that prose-only optimization can't break, independently confirmed by Vercel's 56% uninvoked rate.
- 100% pass rate is a saturation signal. Time to add harder evals, not to declare victory.
Concrete next steps if you want to do this on your own skills:
- The skill itself, with eval suite and per-skill README, is at github.com/danielsogl/skills. Install with `npx skills add danielsogl/skills@ngrx-signals`.
- The `skill-creator` plugin and its `run_loop.py` script are at `anthropics/skills`. You can copy the eval scaffolding from there and adapt the assertions to whatever skill you maintain.
- After every Claude release, re-run the capability benchmark first (~5 minutes), then the description optimizer (~10 minutes). Track the deltas in a CHANGELOG next to your `SKILL.md`. That's how you catch the day a capability-uplift skill becomes redundant.
## References

### Anthropic, Agent Skills
- Improving skill-creator: Test, measure, and refine Agent Skills (Mar 3, 2026): central announcement; "5 of 6 document-creation skills" comes from here.
- Equipping agents for the real world with Agent Skills by Barry Zhang, Keith Lazuka, Mahesh Murag: engineering deep-dive on progressive disclosure.
- Introducing Agent Skills: original Oct 2025 announcement.
- Agent Skills documentation: current API and filesystem reference.
- Seeing like an agent: how we design tools in Claude Code: broader tool-design philosophy.
- `anthropics/skills` on GitHub: official skills repo, including `skill-creator/run_loop.py`.
### Anthropic, eval methodology
- Demystifying evals for AI agents (Jan 9, 2026) by Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, Jiri De Jonghe: source of the "an eval at 100% tracks regressions but provides no signal for improvement" quote and the test-both-sides principle.
- Effective context engineering for AI agents: context-window discipline that progressive disclosure is built on.
### Vercel, counterpoint and eval framework
- AGENTS.md outperforms Skills in our agent evals (Jan 27, 2026) by Jude Gao: source of the 100% / 79% / 53% numbers and the "uninvoked in 56% of cases" finding.
- An Introduction to Evals: runner / scorer / dataset pattern; "20 well-chosen test cases > 200 random examples."
- The no-nonsense approach to AI agent development: tighten-the-loop methodology before adding eval scaffolding.
- AI SDK 6 release notes: Agent abstraction, ToolLoopAgent, MCP support.
### Open standard
- agentskills.io: Optimizing skill descriptions: train/validation split walkthrough.
### Library under test
- NgRx Signal Store official guide.
- Custom Store Features.
- Manfred Steyer: NGRX Signal Store and Your Architecture (AngularArchitects): store/service split rules used in the skill.
### Framing
- Rick Hightower (Towards AI): Build, Evaluate, and Tune AI Agent Skills: Capability Uplift vs. Encoded Preference framing this article borrows.


