**TL;DR.** I built an Anthropic Agent Skill for @ngrx/signals and ran it through the full eval pipeline: capability A/B benchmarks, token and wall-time accounting, and a description-optimizer loop. The skill lifts pass rate from 84% to 100%. It also adds 14 seconds and ~12,000 tokens per invocation (about $0.04 at Sonnet 4.6 input pricing). The description optimizer ran for three iterations and never beat the description I started with. And my evals are now saturated at 100%, which is itself a problem. All four findings are useful, and most teams ship skills without measuring any of them.
## The two questions almost no one asks
You write a 200-line SKILL.md, drop it in ~/.claude/skills/, restart Claude Code, and move on. Skill shipped.
Two questions you didn't answer:
A) Does this skill actually outperform the base model on the same task, or are you shipping a glorified prompt that the model could already do without help?
B) Does the description match how real users phrase requests, so the model auto-loads the skill at the right time, and stays out of the way when it shouldn't?
Both questions are invisible failures. The skill loads, the output looks fine, you assume it's working. You can do this for months.
Anthropic ran its own description-optimization loop across its six built-in document-creation skills and "saw improved triggering on 5 out of 6." If the team that designed the system has gaps in its own skills, your custom skills almost certainly do too.
Stronger evidence still: in January 2026, Vercel published *AGENTS.md outperforms Skills in our agent evals*. They measured Skills against an inline AGENTS.md doc index on a Next.js 16 task suite:
| Configuration | Pass rate |
|---|---|
| Baseline (no docs) | 53% |
| Skills (default) | 53% |
| Skills (with explicit instructions) | 79% |
| AGENTS.md docs index | 100% |
Their finding: Skills remained uninvoked in 56% of cases despite the agent having access. Same under-triggering problem I'm about to show in my own data.
Vercel concluded "skip Skills, inline the docs." I disagree for libraries the model already knows. Skills are still the right shape for encoded preference, your team's specific idioms, where progressive disclosure beats stuffing thousands of tokens of style guide into every conversation. But the measurement is the same: agents do not reliably consult skills on conversational prompts. Both teams converge on one thing. Without an eval suite you cannot tell whether your skill is doing what you think it's doing.
In March 2026 Anthropic shipped an updated skill-creator that brings test/measure/refine discipline to skills. I used it on a real @ngrx/signals skill and wrote down everything that happened.
## An eval suite for a skill: three parts, written before the skill
A skill eval has three parts:
- Test prompts. Realistic tasks a user would actually give.
- Assertions. Verifiable statements about good output, like "HTTP is in a separate service, not inlined in the store."
- Two configurations. The same prompt run with the skill loaded and without it. The delta is the skill's value.
For my @ngrx/signals skill I wrote five tasks:
| ID | Task |
|---|---|
| 1 | Build a CartStore with optimistic updates, request status, and rollback |
| 2 | Refactor a legacy BehaviorSubject-based service to Signal Store + API service split |
| 3 | Write a typed reusable withSelectedEntity<T>() custom feature |
| 4 | Build a typeahead BookSearchStore with rxMethod, debounce, and cancellation |
| 5 | Build a TodosStore using @ngrx/signals/entities updaters |
41 assertions total. Each was written before the skill was finalised. Assertions are the spec, not a confirmation. (Caveat: my v2 skill ends up passing all 41. That's a problem in itself, and I come back to it at the bottom under "100% is a saturation signal".)
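For concreteness, here's roughly what one entry looks like. This is a hedged sketch: the field names approximate the `evals.json` shape skill-creator scaffolds (check your generated files for the exact schema), and the assertion strings paraphrase my actual suite.

```typescript
// Sketch of one evals.json entry; field names are illustrative.
const cartStoreEval = {
  id: 'cart-store-optimistic',
  prompt:
    'Build a CartStore with optimistic updates, request status, and rollback.',
  assertions: [
    'State is explicitly typed: withState<CartState>(initialState)',
    'A full-state snapshot is taken via getState(store) before the optimistic patch',
    'On HTTP error, the snapshot is restored in a single patchState call',
    'HTTP is in a separate CartService, not inlined in the store',
  ],
};
```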
**Methodology disclosure.** I ran each (eval × configuration) pair once, not three times. The ± stddev numbers below are dispersion across the 5 evals, not across repeated runs of the same eval. A more rigorous setup would be 3 runs per cell to absorb temperature noise. Treat the deltas as directional signal, not as p-values.
## Failure mode A: the skill exists but adds no value
This is the embarrassing one. Your skill loads, costs tokens, and does nothing the base model couldn't do on its own.
The benchmark viewer (generated by skill-creator's `eval-viewer/generate_review.py`) shows the headline numbers and the per-eval breakdown side by side.
Per-eval, base model vs. the same model with the skill loaded:
| Eval | with_skill | without_skill | Δ |
|---|---|---|---|
| 1. cart-store-optimistic | 10/10 = 100% | 7/10 = 70% | +30pp |
| 2. refactor-behaviorsubject | 8/8 = 100% | 7/8 = 87.5% | +12.5pp |
| 3. custom-feature-with-selected-entity | 8/8 = 100% | 7/8 = 87.5% | +12.5pp |
| 4. rxmethod-debounced-search | 7/7 = 100% | 7/7 = 100% | ±0 |
| 5. todos-entities-crud | 8/8 = 100% | 6/8 = 75% | +25pp |
| Aggregate | 41/41 = 100% | 34/41 = 82.9% (84% mean across evals) | +16pp |
The skill provides a real, measurable +16 percentage-point uplift across 41 assertions. Look at eval 4 though: zero uplift. The base model already aces the `debounceTime` → `distinctUntilChanged` → `switchMap` → `tapResponse` recipe. Keeping the `rxMethod` section of the skill in context costs tokens on every invocation while contributing nothing.
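For reference, that recipe in sketch form. The store shape and `BooksService` are my own assumptions; the operator chain is the part the base model already reproduces unaided:

```typescript
import { inject } from '@angular/core';
import { Observable, pipe, debounceTime, distinctUntilChanged, switchMap } from 'rxjs';
import { patchState, signalStore, withMethods, withState } from '@ngrx/signals';
import { rxMethod } from '@ngrx/signals/rxjs-interop';
import { tapResponse } from '@ngrx/operators';

interface Book {
  id: string;
  title: string;
}

// Assumed service shape, not prescribed by the eval.
abstract class BooksService {
  abstract search(query: string): Observable<Book[]>;
}

export const BookSearchStore = signalStore(
  withState({ books: [] as Book[] }),
  withMethods((store, booksService = inject(BooksService)) => ({
    search: rxMethod<string>(
      pipe(
        debounceTime(300),      // let typing settle
        distinctUntilChanged(), // skip repeated queries
        switchMap((query) =>    // cancel the in-flight request
          booksService.search(query).pipe(
            tapResponse({
              next: (books) => patchState(store, { books }),
              error: () => patchState(store, { books: [] }),
            }),
          ),
        ),
      ),
    ),
  })),
);
```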
This is what blind A/B benchmarking is for. If a skill section scores no better than baseline, that section is paying tokens for nothing. Trim it. Without the benchmark I would have happily kept that section in there forever.
### Where the baseline failed and the skill caught it
The interesting question isn't "did the skill help," it's "where exactly did it help?" The viewer renders this side-by-side per assertion, and the green-and-red wall makes it obvious where the skill is doing real work. The recurring pattern across the 7 baseline failures:
- **Explicit state typing.** Base model writes `withState(initialState)` (inferred). Skill teaches `withState<FooState>(initialState)`. Catches patches with the wrong shape at compile time (sketched below).
- **`getState`-based snapshot for rollback.** Base model snapshots via `store.items()` directly. Skill teaches the `getState()` helper for full-state snapshots, the right tool when rollback needs to restore multiple slices.
- **Standalone updaters vs. store-bound methods.** Base model exposes `setPending`/`setFulfilled` as methods on the store. Skill teaches them as standalone functions returning `Partial<State>`, composable in a single `patchState(...)` call.
- **`updateEntity` function-form changes.** Base model reads the entity, computes the toggle, passes a partial. Skill teaches `changes: (entity) => ({ ... })` for atomic state-dependent updates (sketched below).
- **`removeEntities` predicate form.** Base model filters and maps to ids and passes an id array. Skill teaches the predicate form.
- **Typed `signalStoreFeature` prerequisite overload.** Base model uses bare `signalStoreFeature(...)` with manual casts. Skill teaches `signalStoreFeature({ state: type<EntityState<T>>() }, ...)` so misuse is a compile error, not a runtime surprise.
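A hedged sketch of the first and fourth corrections in the skill's "wrong → right" style (store and type names are mine, not from the skill):

```typescript
import { patchState, signalStore, withMethods, withState, type } from '@ngrx/signals';
import { updateEntity, withEntities } from '@ngrx/signals/entities';

interface Todo {
  id: string;
  done: boolean;
}
interface TodosState {
  filter: 'all' | 'open';
}
const initialState: TodosState = { filter: 'all' };

// Wrong: withState(initialState) infers the type, so a patch with a
// misspelled key still compiles. Right: the explicit type parameter
// turns wrong-shape patches into compile errors.
const TodosStore = signalStore(
  withState<TodosState>(initialState),
  withEntities({ entity: type<Todo>() }),
  withMethods((store) => ({
    toggle(id: string) {
      // Wrong (races): read the entity, compute, pass a partial.
      //   const todo = store.entityMap()[id];
      //   patchState(store, updateEntity({ id, changes: { done: !todo.done } }));

      // Right: function form, computed from the current entity atomically.
      patchState(store, updateEntity({ id, changes: (todo) => ({ done: !todo.done }) }));
    },
  })),
);
```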
All six are encoded-preference failures. The model knows the API, but doesn't know which form is idiomatic. That distinction sets the skill's lifetime: a capability uplift skill (teaching the model something it didn't know) gets retired the moment the next model release closes the gap. An encoded preference skill stays useful as long as your team's idioms stay the same. For popular libraries like @ngrx/signals, you should expect to land in the second bucket, and the eval is what tells you you're there.
## Failure mode B: the skill exists but doesn't trigger
Worse than failure mode A, because it's invisible. The skill never runs, you blame the model, you never discover the description was the bottleneck.
skill-creator ships a script called `run_loop.py` that handles description optimization with proper ML discipline:
- Generate ~20 synthetic prompts: half should-trigger, half should-NOT-trigger.
- The negatives deliberately share keywords with the skill, adjacent intent rather than random noise. This is what catches false positives.
- 60/40 train/test split. 3 runs per query (single runs are too noisy to be a signal).
- Up to N iterations of: evaluate, propose new description, re-evaluate.
- Pick the description with the best test-set score, not train score. Standard ML hygiene applied to a YAML field.
I drafted 20 queries: 10 substantive Signal Store scenarios and 10 near-misses (classic @ngrx/store, plain RxJS, plain signal() for component-local state, @ngrx/effects setup, the Subject vs. BehaviorSubject question, and so on). Then I let the loop run for 3 iterations.
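A sketch of the query set, with field names of my own choosing (check the schema `run_loop.py` actually expects before copying):

```typescript
// Illustrative trigger-eval entries; the negatives deliberately share
// keywords with the skill (adjacent intent, not random noise).
const triggerQueries = [
  // should trigger: substantive Signal Store work, no keyword anchor
  { query: "What's the right pattern for an optimistic update with rollback?", shouldTrigger: true },
  { query: 'Refactor this BehaviorSubject service to a Signal Store', shouldTrigger: true },
  // should NOT trigger: adjacent intent, shared vocabulary
  { query: 'Set up @ngrx/effects for my existing reducers', shouldTrigger: false },
  { query: 'When should I use Subject vs. BehaviorSubject?', shouldTrigger: false },
  { query: 'Is a plain signal() fine for component-local state?', shouldTrigger: false },
];
```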
The HTML report is the most striking visual in the whole experiment.
The per-iteration scores:
| Iteration | Description style | Train (n=13) | Test (n=7, held-out) |
|---|---|---|---|
| 0 (original) | Keyword-anchored + explicit exclusions | 6/13 = 46.2% | 3/7 = 42.9% |
| 1 | "Use for any Angular state management question…" (intent-driven, longer) | 6/13 = 46.2% | 3/7 = 42.9% |
| 2 | "Invoke this skill when…" (imperative, codebase-context cue) | 6/13 = 46.2% | 3/7 = 42.9% |
The optimizer ran exactly as designed. It proposed two alternatives, evaluated each over 60 model calls per iteration, and selected the description with the best test score. It picked the original. No proposed alternative beat the baseline.
This is not a failure of the loop. It's the right answer when no improvement exists.
### The recall ceiling no description rewrite can break
Two reasons the optimizer found nothing better, both diagnostic:
- **The model under-triggers conversational queries no matter what the description says.** Several should-trigger queries used plain English with no `signalStore`/`withState` anchor, like "what's the right pattern for an optimistic update with rollback?" or "I have an order detail page where the `OrderEditStore` should live for the duration of `/orders/:id/edit`". The model decides it can answer without consulting a skill. No description rewrite fixes that.
- **Precision was already 100%.** All three descriptions correctly avoided triggering on the 10 should-not-trigger queries. The only failure mode left was recall, and the model's under-triggering bias is hostile to recall improvements via prose alone.
This is the kind of insight pure capability evals would never surface. Capability evals only describe the world where the skill loads. Trigger evals add the rest of the picture: even a perfect skill leaks value if the agent only consults it half the time it ought to. That leak is invisible from the capability side.
Independent confirmation. This isn't a quirk of my eval set. Vercel's January 2026 benchmark hit the same wall from a different angle: their Skills configuration was uninvoked in 56% of cases even when the agent had access. Different conclusion (drop Skills entirely), same root observation: agents do not reliably consult skills on conversational prompts.
## What it costs: +14 seconds and ~$0.04 per invocation
Skills aren't free. Every invocation reads the SKILL.md body into context. Reference files are loaded on demand, but they still cost tokens when the model decides to read them.
| Metric | with_skill (v2) | baseline | Δ vs baseline |
|---|---|---|---|
| Pass rate | 100% ± 0% | 84% ± 12% | +16 pp |
| Time per run | 62.7s ± 16.1s | 49.0s ± 4.8s | +13.7s |
| Tokens per run | 32,343 ± 2,737 | 19,927 ± 394 | +12,416 |
That's a 28% wall-time increase and a 62% token increase per invocation. At Sonnet 4.6 pricing ($3 / M input, $15 / M output) the per-call delta works out to about +$0.04 at the headline rate, since the 12k delta is almost entirely additional input from reading the SKILL.md and references. With prompt caching on, repeat invocations of the same skill inside the 5-minute cache window read cached input at 10% of base ($0.30 / M), so each warm call drops to roughly +$0.004. Cold starts still pay the full +$0.04, and any edit to the skill body or the container's skills list breaks the cache.

For a team running 500 skill-loading prompts a day, the realistic range is somewhere between ~$2/day if most calls hit a warm cache and ~$20/day if every call is cold. Either way, the +14 second wall-time hit is more painful than the dollar cost for an interactive coding assistant. That's the difference between "snappy" and "noticeably waiting."
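The arithmetic behind those numbers, spelled out. Only the input side is priced, since the delta is almost entirely input tokens:

```typescript
// Per-call input-token delta priced at the headline Sonnet 4.6 rates.
const deltaTokens = 12_416; // extra input per skill-loading call
const inputPerM = 3.0;      // $ per 1M input tokens
const cachedPerM = 0.3;     // $ per 1M cached input tokens (10% of base)

const coldCost = (deltaTokens / 1e6) * inputPerM;  // ≈ $0.037 → "+$0.04"
const warmCost = (deltaTokens / 1e6) * cachedPerM; // ≈ $0.0037 → "+$0.004"

// 500 skill-loading prompts/day brackets the daily spend:
const allWarm = 500 * warmCost; // ≈ $1.86/day → "~$2/day"
const allCold = 500 * coldCost; // ≈ $18.62/day → "~$20/day"
```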
Two things make the trade-off worth it:
- The +16pp pass rate is concentrated on patterns that are hard to debug after the fact. A wrong-form `updateEntity` won't crash, it'll just race in production. A missed `getState` snapshot won't crash either, it'll just leak optimistic-update state. Catching these at generation time is genuinely valuable.
- You can iterate the skill to be cheaper. My initial v1 cost +20.6s and +13.3k tokens. After one iteration (adding a canonical scaffold and six "wrong → right" snippets inline) v2 dropped to +13.7s and +12.4k tokens, a 33% time reduction and 6% token reduction while keeping the same 100% pass rate. The trick: inline corrections meant the model didn't need to read as many reference files per task.
Skill body length is a poor proxy for skill cost. What matters is how much additional content the model has to load on top of the body. v2 was 138 lines longer than v1 and still cost less in practice.
## Wire benchmarks into PR CI before the model changes under you
Minimum viable eval setup for any custom skill:
```text
your-skill/
├── SKILL.md
├── references/
└── evals/
    ├── evals.json          # 4-6 substantive tasks with assertions
    └── trigger-eval.json   # 20 should-/should-not queries
```
Then a CI job that re-runs both benchmarks after every model bump or skill PR. Minimal GitHub Actions shape:
```yaml
name: skill-evals
on:
  pull_request:
    paths: [ 'skills/your-skill/**' ]
  push:
    paths: [ 'skills/your-skill/**' ]
  workflow_dispatch:
  schedule:
    - cron: '0 6 * * 1' # weekly, catches model-side drift
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - name: Run capability benchmark
        run: python -m scripts.aggregate_benchmark workspace/iteration-latest --skill-name your-skill
      - name: Fail PR if pass rate drops below threshold
        run: |
          jq '.run_summary.with_skill.pass_rate.mean' workspace/iteration-latest/benchmark.json \
            | awk '{ if ($1 < 0.95) { print "Pass rate below 95%"; exit 1 } }'
      - name: Run description trigger optimizer
        run: python -m scripts.run_loop --eval-set evals/trigger-eval.json --skill-path skills/your-skill --model claude-sonnet-4-6 --max-iterations 1
```
The whole loop takes about 15 minutes of wall time (with executor and grader subagents running in parallel). The cost of not running it is the slow accretion of capability-uplift skills that became no-ops two model versions ago, and triggering descriptions that quietly stopped firing.
## A footnote on my own evals: 100% is a saturation signal
My with-skill pass rate is 100% across both iterations. That sounds good. It isn't.
In *Demystifying evals for AI agents*, Anthropic put it bluntly:

> "An eval at 100% tracks regressions but provides no signal for improvement."
Mine track regressions just fine. If claude-opus-5 ships and pass rate drops, I'll know within 15 minutes. But for further skill iteration the suite is saturated. I can't tell whether v3 of the skill is better than v2, because both will hit 100%.
Concrete next-round eval candidates that would not hit 100%:
- A multi-store coordination task where a `CheckoutStore` reads from `CartStore` and `UserStore` without violating the no-cross-store-injection rule.
- An entity store with a custom `idKey: 'uuid'` plus a named collection. The kind of thing that exposes whether the model knows the multi-collection updater syntax.
- A migration from `@ngrx/store` (classic NgRx with reducers/effects) to Signal Store, where the agent has to translate actions into store methods correctly.
- A route-scoped store with `withHooks({ onDestroy })` cleanup that triggers a pending HTTP cancel (sketched below).
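A minimal sketch of that last candidate, assuming a hypothetical `OrderEditStore` (none of this is in the eval suite yet):

```typescript
import { signalStore, withHooks, withState } from '@ngrx/signals';

interface Order {
  id: string;
}

// Hypothetical route-scoped store: provide it on the /orders/:id/edit
// route so leaving the route destroys it. rxMethod-based calls are
// unsubscribed automatically on destroy; onDestroy covers anything
// managed by hand.
export const OrderEditStore = signalStore(
  withState({ order: null as Order | null }),
  withHooks({
    onDestroy() {
      // assertion target: pending HTTP work (AbortController, timers,
      // manual subscriptions) is torn down here
    },
  }),
);
```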
Same skill, same idioms, harder ground truth. Without explicit pass-rate visibility I would happily keep iterating in the dark, congratulating myself on each "100%" run.
## Where to go from here
For a popular library like @ngrx/signals, the base model knows the API. The skill's job is encoded preference: making the model produce your team's idiomatic forms, not just any working code. Without evals you can't tell whether the skill is actually doing that.
Four findings, all worth carrying forward:
- Capability uplift is real (+16pp pass rate), but concentrated in 6 specific idiom corrections, not in API knowledge. As long as your idioms are the value, the skill survives model upgrades.
- Skill cost is real (+14s wall time, +12k tokens, ~$0.04 per call), and iteration can reduce it (v1 → v2 cut wall time by 33%).
- Triggering has a ceiling (~46% recall) that prose-only optimization can't break, independently confirmed by Vercel's 56% uninvoked rate.
- 100% pass rate is a saturation signal. Time to add harder evals, not to declare victory.
Concrete next steps if you want to do this on your own skills:
- The skill itself, with eval suite and per-skill README, is at github.com/danielsogl/skills. Install with `npx skills add danielsogl/skills@ngrx-signals`.
- The `skill-creator` plugin and its `run_loop.py` script are at `anthropics/skills`. You can copy the eval scaffolding from there and adapt the assertions to whatever skill you maintain.
- After every Claude release, re-run the capability benchmark first (~5 minutes), then the description optimizer (~10 minutes). Track the deltas in a CHANGELOG next to your `SKILL.md`. That's how you catch the day a capability-uplift skill becomes redundant.
## References

### Anthropic, Agent Skills
- Improving skill-creator: Test, measure, and refine Agent Skills (Mar 3, 2026): central announcement; "5 of 6 document-creation skills" comes from here.
- Equipping agents for the real world with Agent Skills by Barry Zhang, Keith Lazuka, Mahesh Murag: engineering deep-dive on progressive disclosure.
- Introducing Agent Skills: original Oct 2025 announcement.
- Agent Skills documentation: current API and filesystem reference.
- Seeing like an agent: how we design tools in Claude Code: broader tool-design philosophy.
- `anthropics/skills` on GitHub: official skills repo, including `skill-creator/run_loop.py`.
### Anthropic, eval methodology
- Demystifying evals for AI agents (Jan 9, 2026) by Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, Jiri De Jonghe: source of the "an eval at 100% tracks regressions but provides no signal for improvement" quote and the test-both-sides principle.
- Effective context engineering for AI agents: context-window discipline that progressive disclosure is built on.
### Vercel, counterpoint and eval framework
- AGENTS.md outperforms Skills in our agent evals (Jan 27, 2026) by Jude Gao: source of the 100% / 79% / 53% numbers and the "uninvoked in 56% of cases" finding.
- An Introduction to Evals: runner / scorer / dataset pattern; "20 well-chosen test cases > 200 random examples."
- The no-nonsense approach to AI agent development: tighten-the-loop methodology before adding eval scaffolding.
- AI SDK 6 release notes: Agent abstraction, ToolLoopAgent, MCP support.
### Open standard
- agentskills.io: Optimizing skill descriptions: train/validation split walkthrough.
### Library under test
- NgRx Signal Store official guide.
- Custom Store Features.
- Manfred Steyer: NGRX Signal Store and Your Architecture (AngularArchitects): store/service split rules used in the skill.
### Framing
- Rick Hightower (Towards AI): Build, Evaluate, and Tune AI Agent Skills: Capability Uplift vs. Encoded Preference framing this article borrows.


