Caliper gives Claude Code users a pass@k reliability score for skills, with a baseline delta showing if the skill beats the base agent. Install via pipx or npx.
What Changed — Caliper Brings pass@k Reliability Testing to Claude Code Skills
Skills for Claude Code are non-deterministic. A skill that works on your machine, with your prompt, today, might silently break tomorrow after a model update or a one-line prompt edit. Until now, there was no standard way to catch that.
Caliper is a lightweight, local harness that runs a skill k times in isolated environments and gives you a pass@k score. It answers the question: "How many times did the skill succeed out of k attempts?"
It also includes a --baseline flag that re-runs everything without the skill, so you see the delta — proving whether your skill is actually doing the work, or the base agent would have passed anyway.
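The score and the delta are simple arithmetic, and a minimal sketch may make them concrete. It follows this article's description (successes out of k attempts) and is not taken from Caliper's source; the function names and the sample outcomes are hypothetical.

```python
def pass_rate(outcomes):
    """Fraction of runs that passed; `outcomes` is a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one run")
    return sum(outcomes) / len(outcomes)


def baseline_delta(with_skill, without_skill):
    """Percentage-point gap between the skill runs and the baseline runs."""
    return round((pass_rate(with_skill) - pass_rate(without_skill)) * 100)


# Hypothetical outcomes for k=5: one list per configuration.
with_skill = [True, True, True, True, True]        # 5/5 passed
without_skill = [True, False, True, True, False]   # 3/5 passed

delta = baseline_delta(with_skill, without_skill)
print(f"With skill {pass_rate(with_skill):.0%}")
print(f"No skill   {pass_rate(without_skill):.0%}")
print(f"Delta      {delta:+d} points")
```

A delta near zero would suggest the base agent passes the task on its own, which is the case the --baseline flag is meant to expose.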
What It Means For You — Concrete Impact on Daily Claude Code Usage
If you publish or maintain Claude Code skills, Caliper replaces guesswork with data. Here's what you get:
- Track reliability over time. Did your prompt edit actually improve the skill? Run Caliper before and after.
- Catch regressions. Does it still pass the workflows it passed last week? Caliper saves results to .caliper/results/.
- Compare agents. Run the same skill on Claude Code, Codex, and Pi to see which agent runs it more reliably.
- Prove your skill adds value. The delta between "with skill" and "no skill" is your evidence.
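The before-and-after comparison behind the first two bullets can be sketched as below. The article does not describe the layout of the files under .caliper/results/, so this sketch takes per-task pass rates as plain dictionaries instead of reading those files; the helper name, the task names, and the tolerance parameter are all hypothetical.

```python
def find_regressions(previous, current, tolerance=0.0):
    """Return tasks whose pass rate dropped by more than `tolerance`.

    Both arguments map task name -> pass rate in [0, 1].
    Hypothetical helper for illustration; not part of Caliper.
    """
    regressions = {}
    for task, old_rate in previous.items():
        new_rate = current.get(task)
        if new_rate is None:
            continue  # task was removed from the spec; nothing to compare
        if old_rate - new_rate > tolerance:
            regressions[task] = (old_rate, new_rate)
    return regressions


# Hypothetical pass rates from two runs of the same spec.
last_week = {"happy-path": 1.0, "edge-case": 0.8}
today = {"happy-path": 1.0, "edge-case": 0.4}

for task, (old, new) in find_regressions(last_week, today).items():
    print(f"{task}: {old:.0%} -> {new:.0%}")
```

With a small k, run-to-run noise can look like a regression, so a nonzero tolerance may be worth setting.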
Try It Now — How to Install and Run Caliper
Option 1: Install as a skill (works inside Claude Code)
npx skills@latest add edonadei/caliper
Then, inside Claude Code or Codex, use:
- /grill-skill ./my-skill/SKILL.md: reads your SKILL.md, interviews you, and writes a 3-task .eval.yaml spec (happy path, edge case, adversarial)
- /evaluate-skill run my-skill.eval.yaml --k 3 --baseline: runs the evaluation
- /evaluate-skill list: browse past runs
- /evaluate-skill report my-skill: view a report
Option 2: Install as a standalone CLI
pipx install caliper-eval # requires Python 3.10+
Write a YAML spec:
# my-skill.eval.yaml
skill:
  path: ./SKILL.md
backend: claude-code
judge:
  backend: claude-code
tasks:
  - name: Writes a conventional commit message
    prompt: "Summarize the staged git diff as a commit message."
    expect: >
      The response is a conventional-commit message: a concise subject line
      under 72 characters, followed by a body explaining why the change was made.
  - name: Generates a valid config file
    cleanup: rm -f /tmp/app.config.json
    prompt: "Generate a config at /tmp/app.config.json with a 'port' of 8080."
    assert: |
      import json
      from pathlib import Path
      data = json.loads(Path("/tmp/app.config.json").read_text())
      assert data["port"] == 8080
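The assert block in the second task is ordinary Python, so one way to sanity-check it before handing it to Caliper is to run the same checks against files you write yourself. This is only a sketch: the article does not say how Caliper executes the block, and the check_config wrapper and the temporary files are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def check_config(path):
    """Same checks as the spec's assert block, parameterized on the path."""
    data = json.loads(Path(path).read_text())
    assert data["port"] == 8080


with tempfile.TemporaryDirectory() as tmp:
    # A file the agent should produce: the check passes silently.
    good = Path(tmp) / "app.config.json"
    good.write_text(json.dumps({"port": 8080}))
    check_config(good)

    # A near miss: the port is a string, so the equality check fails.
    bad = Path(tmp) / "bad.config.json"
    bad.write_text(json.dumps({"port": "8080"}))
    try:
        check_config(bad)
    except AssertionError:
        print("string port rejected")
```

Trying a near miss like the string port shows whether the assertion is strict enough to catch a false success.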
Run it:
caliper run my-skill.eval.yaml --k 5 --baseline
Output example:
CALIPER - my-skill - k=5 - claude-code

ID      Task                            k(5)  pass@k
task-1  Extracts action items as JSON   5/5   100% PASS

With skill  100%
No skill    60%
Delta       +40%
The Eval Starter Pack
Caliper includes four copy-paste templates that catch real agent failures: false success, tool misuse, run-to-run variance, and instruction drift. These are available in the project's GitHub repo.
Why This Matters for Claude Code Users
Skills are how you extend Claude Code's capabilities. But without testing, you're shipping blind. Caliper gives you a pass@k score you can track, compare, and cite when you tell your team "this skill works."
It's also the first tool to surface the delta — how much better the skill performs than the base agent. Sometimes that delta is 0%. Sometimes it's -100%. Now you'll know before your users do.
Source: github.com
[Updated 29 Jun via hn_claude_code]
Caliper now supports running skills on multiple backends including Claude Code, Codex, Pi, Claude API, and OpenAI API, with the ability to use separate backends for the agent and the judge [per Show HN]. The project also introduces two new companion skills: evaluate-skill for managing evals within your workflow, and grill-skill that reads your SKILL.md, interviews you, and auto-generates a 3-task evaluation spec covering happy path, edge case, and adversarial scenarios.
Originally published on gentic.news
