gentic news

Posted on Jun 29 • Originally published at gentic.news

Caliper: Run Your Claude Code Skills k Times and Get a pass@k Score That

#ai #opensource #programming #machinelearning

Key Takeaways

Caliper gives Claude Code users a pass@k reliability score for skills, with a baseline delta showing if the skill beats the base agent.
Install via pipx or npx.

What Changed — Caliper Brings pass@k Reliability Testing to Claude Code Skills

Skills for Claude Code are non-deterministic. A skill that works on your machine, with your prompt, today, might silently break tomorrow after a model update or a one-line prompt edit. Until now, there was no standard way to catch that.

Caliper is a lightweight, local harness that runs a skill k times in isolated environments and reports a pass@k score: how many of the k attempts succeeded.

It also includes a --baseline flag that re-runs every task without the skill, so you can see the delta. That shows whether your skill is actually doing the work or whether the base agent would have passed anyway.
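
As a rough illustration of the arithmetic, the sketch below computes a pass rate over k runs and the baseline delta. It assumes, as the description above suggests, that the score is the fraction of the k runs that passed (the pass@k metric used in code-generation benchmarks is defined differently, as the chance that at least one of k samples passes). The function names and the boolean run outcomes are hypothetical and are not part of Caliper's API.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of runs that passed, between 0.0 and 1.0."""
    if not outcomes:
        raise ValueError("need at least one run")
    return sum(outcomes) / len(outcomes)


def baseline_delta(with_skill: list[bool], without_skill: list[bool]) -> float:
    """Pass rate with the skill minus pass rate without it."""
    return pass_rate(with_skill) - pass_rate(without_skill)


# Hypothetical outcomes for k=5 runs of a single task.
with_skill = [True, True, True, True, True]
without_skill = [True, False, True, True, False]

print(f"With skill  {pass_rate(with_skill):.0%}")
print(f"No skill    {pass_rate(without_skill):.0%}")
print(f"Delta       {baseline_delta(with_skill, without_skill):+.0%}")
```

A delta near zero would suggest the base agent passes the task on its own; a negative delta would suggest the skill is getting in the way.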

What It Means For You — Concrete Impact on Daily Claude Code Usage

If you publish or maintain Claude Code skills, Caliper replaces guesswork with data. Here's what you get:

Track reliability over time. Did your prompt edit actually improve the skill? Run Caliper before and after.
Catch regressions. Does it still pass the workflows it passed last week? Caliper saves results to .caliper/results/.
Compare agents. Run the same skill on Claude Code, Codex, and Pi — see which agent runs it more reliably.
Prove your skill adds value. The delta between "with skill" and "no skill" is your evidence.
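
The regression check in the list above can be sketched as a comparison of per-task pass rates between two runs. The article does not describe the format of the files in .caliper/results/, so this sketch works on plain dictionaries with made-up task names and numbers; `find_regressions` is a hypothetical helper, not a Caliper function.

```python
def find_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, tuple[float, float]]:
    """Return tasks whose pass rate dropped by more than `tolerance`."""
    regressions = {}
    for task, old_rate in previous.items():
        new_rate = current.get(task)
        if new_rate is None:
            continue  # task was removed or renamed; nothing to compare
        if old_rate - new_rate > tolerance:
            regressions[task] = (old_rate, new_rate)
    return regressions


# Made-up pass rates (fraction of k runs passed) for two runs.
last_week = {"happy-path": 1.0, "edge-case": 0.8, "adversarial": 0.6}
today = {"happy-path": 1.0, "edge-case": 0.4, "adversarial": 0.6}

for task, (old, new) in find_regressions(last_week, today).items():
    print(f"{task}: {old:.0%} -> {new:.0%}")
```

With a small k, ordinary run-to-run variance can look like a regression, so a nonzero tolerance or a larger k may be worth considering before acting on a single drop.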

Try It Now — How to Install and Run Caliper

Option 1: Install as a skill (works inside Claude Code)

npx skills@latest add edonadei/caliper

Then, inside Claude Code or Codex, use:

/grill-skill ./my-skill/SKILL.md — reads your SKILL.md, interviews you, and writes a 3-task .eval.yaml spec (happy path, edge case, adversarial)
/evaluate-skill run my-skill.eval.yaml --k 3 --baseline — runs the evaluation
/evaluate-skill list — browse past runs
/evaluate-skill report my-skill — view a report

Option 2: Install as a standalone CLI

pipx install caliper-eval  # requires Python 3.10+

Write a YAML spec:

# my-skill.eval.yaml
skill:
  path: ./SKILL.md
  backend: claude-code
judge:
  backend: claude-code
tasks:
  - name: Writes a conventional commit message
    prompt: "Summarize the staged git diff as a commit message."
    expect: >
      The response is a conventional-commit message: a concise subject line under 72 characters, followed by a body explaining why the change was made.

  - name: Generates a valid config file
    cleanup: rm -f /tmp/app.config.json
    prompt: "Generate a config at /tmp/app.config.json with a 'port' of 8080."
    assert: |
      import json
      from pathlib import Path
      data = json.loads(Path("/tmp/app.config.json").read_text())
      assert data["port"] == 8080
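
Before putting a code check like the one in the second task into a spec, it can help to dry-run the same logic locally. The sketch below writes sample configs to a temporary directory and applies the same assertion. How Caliper itself executes the assert block is not described here, so `check_config` and the temporary paths are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path


def check_config(path: Path, expected_port: int = 8080) -> None:
    """Same check as the spec's assert block, parameterized by path."""
    data = json.loads(path.read_text())
    assert data["port"] == expected_port, f"unexpected port: {data['port']!r}"


with tempfile.TemporaryDirectory() as tmp:
    # A config the agent should produce: the check passes silently.
    good = Path(tmp) / "app.config.json"
    good.write_text(json.dumps({"port": 8080}))
    check_config(good)

    # A plausible failure: the port written as a string instead of a number.
    bad = Path(tmp) / "bad.config.json"
    bad.write_text(json.dumps({"port": "8080"}))
    try:
        check_config(bad)
    except AssertionError as exc:
        print(f"check failed as expected: {exc}")
```

Trying a known-bad input like this confirms that the assertion can actually fail, which guards against a check that passes no matter what the agent writes.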

Run it:

caliper run my-skill.eval.yaml --k 5 --baseline

Output example:

CALIPER - my-skill - k=5 - claude-code
ID      Task                           k(5)  pass@k
task-1  Extracts action items as JSON  5/5   100%  PASS

With skill   100%
No skill      60%
Delta        +40%

The Eval Starter Pack

Caliper includes four copy-paste templates that catch real agent failures: false success, tool misuse, run-to-run variance, and instruction drift. These are available in the project's GitHub repo.

Why This Matters for Claude Code Users

Skills are how you extend Claude Code's capabilities. But without testing, you're shipping blind. Caliper gives you a pass@k score you can track, compare, and cite when you tell your team "this skill works."

It's also the first tool to surface the delta — how much better the skill performs than the base agent. Sometimes that delta is 0%. Sometimes it's -100%. Now you'll know before your users do.

Source: github.com

[Updated 29 Jun via hn_claude_code]

Caliper now supports running skills on multiple backends, including Claude Code, Codex, Pi, Claude API, and OpenAI API, and lets you use separate backends for the agent and the judge [per Show HN]. The project also introduces two companion skills: evaluate-skill, for managing evals within your workflow, and grill-skill, which reads your SKILL.md, interviews you, and auto-generates a 3-task evaluation spec covering happy path, edge case, and adversarial scenarios.
