Caliper is a lightweight harness that runs Claude Code skills k times, scores them with pass@k, and compares against a no-skill baseline so you know if your skill actually helps.
What Changed — Skills Are Now Testable, Not Guesswork
If you've ever published a Claude Code skill, you've felt the anxiety: Will this work for other people? Will the next model update silently break it? The answer was always "I don't know" — until now.
Caliper (github.com/edonadei/caliper) is a new open-source harness that runs your skill k times in isolated environments and gives you a pass@k score. You define success in a YAML spec with either an LLM judge, a Python assertion, or both. Then you run:
caliper run extract-actions.eval.yaml --k 5 --baseline
And see:
ID      Task                            k(5)  pass@k
task-1  Extracts action items as JSON   5/5   100%  PASS

With skill  100%
No skill     60%
Delta       +40%
The --baseline flag is the killer feature: it re-runs every task without your skill, so you see the delta. A delta of +40% (here 100% minus 60%, so 40 percentage points) means your skill is actually helping. A delta of 0% means it's doing nothing, and a negative delta means it's actively harming results.
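The arithmetic behind that delta can be sketched in a few lines. This is an illustration only, assuming the reported percentage is simply the fraction of the k runs that passed (which matches the 5/5 and 100% row above). How Caliper computes its score internally is not stated here, and the function name and the run results are made up for the example.

```python
def pass_rate(results):
    """Percentage of runs that passed (hypothetical helper, not Caliper's API)."""
    if not results:
        raise ValueError("need at least one run")
    return 100.0 * sum(results) / len(results)


# Made-up outcomes for k = 5 runs, chosen to match the sample output above.
with_skill = [True, True, True, True, True]
no_skill = [True, True, True, False, False]

delta = pass_rate(with_skill) - pass_rate(no_skill)

print(f"With skill {pass_rate(with_skill):.0f}%")
print(f"No skill   {pass_rate(no_skill):.0f}%")
print(f"Delta      {delta:+.0f}%")
```

The delta is a difference of two percentages, so it is best read as percentage points rather than a relative improvement.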
What It Means For You — Stop Shipping Untested Skills
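Most Claude Code skills are tested once, look good, and then break silently when a new model is released. Caliper addresses this by making evaluation repeatable: the same spec can be re-run k times whenever the skill or the model changes, and the result is a pass rate instead of a single anecdote.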
Here's what you can do right now:
- Install Caliper via a Claude Code skill:
npx skills@latest add edonadei/caliper
This installs two skills: evaluate-skill (run and manage evals) and grill-skill (reads your SKILL.md, interviews you, and writes a 3-task spec).
- Write your first eval spec in a .eval.yaml file:
tasks:
  - name: Extracts action items as clean JSON
    prompt: "Read /tmp/transcript.txt and write the action items to /tmp/actions.json."
    expect: "A valid JSON array where every item has owner, task, due. No markdown fences."
    assert: |
      import json
      items = json.load(open("/tmp/actions.json"))
      assert isinstance(items, list)
      assert all({"owner","task","due"} <= i.keys() for i in items)
- Run it with --k 5 and --baseline to see your skill's true performance.
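Before spending k agent runs on a spec, it can help to dry-run the assertion against hand-written samples. The sketch below wraps the same checks as the assert block above in a function and feeds it one good and one bad sample through a temporary file. The helper names and the sample data are hypothetical, and nothing here calls Caliper itself.

```python
import json
import os
import tempfile


def check_actions(path):
    """Same checks as the assert block in the spec, wrapped in a function."""
    with open(path) as f:
        items = json.load(f)
    assert isinstance(items, list)
    assert all({"owner", "task", "due"} <= i.keys() for i in items)


def passes(sample):
    """Write a sample to a temporary file and report whether the checks pass."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "actions.json")
        with open(path, "w") as f:
            json.dump(sample, f)
        try:
            check_actions(path)
        except AssertionError:
            return False
        return True


# Hypothetical samples: one complete item, one missing the "due" key.
good = [{"owner": "Ana", "task": "Send meeting notes", "due": "2025-01-10"}]
missing_due = [{"owner": "Ana", "task": "Send meeting notes"}]

print(passes(good))         # True
print(passes(missing_due))  # False
```

Note that an empty list also satisfies these checks, since `all()` over no items is true. If an empty result should count as a failure, the assertion needs an explicit length check.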
Try It Now — Your First Caliper Run
# Install Caliper
pip install caliper-eval

# Or add it as a Claude Code skill
npx skills@latest add edonadei/caliper
# Create a simple eval
cat > my-skill.eval.yaml << 'EOF'
tasks:
  - name: Generates valid Python
    prompt: "Write a function named fibonacci(n) that returns the nth Fibonacci number and save it to /tmp/fib.py"
    expect: "A valid Python file with a function that returns correct Fibonacci numbers"
    assert: |
      import sys
      sys.path.insert(0, "/tmp")
      from fib import fibonacci
      assert fibonacci(0) == 0
      assert fibonacci(1) == 1
      assert fibonacci(10) == 55
EOF
# Run it 10 times with baseline
caliper run my-skill.eval.yaml --k 10 --baseline
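For reference, here is one possible /tmp/fib.py that would satisfy the assertion in this spec. It is only a sketch of a passing answer, assuming the agent names the function fibonacci (the name the assertion imports) and indexes the sequence from fibonacci(0) == 0. An agent's actual output may look quite different.

```python
def fibonacci(n: int) -> int:
    """Return the nth Fibonacci number, with fibonacci(0) == 0."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    # After i iterations, a holds the ith Fibonacci number.
    for _ in range(n):
        a, b = b, a + b
    return a


# The same checks the spec's assert block performs.
assert fibonacci(0) == 0
assert fibonacci(1) == 1
assert fibonacci(10) == 55
```

A file that defines the function under another name, such as fib, would fail the import in the assertion even if the logic is correct, so the expected name is worth stating in the prompt.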
Caliper supports multiple backends: you can run the skill on one model and judge with another. This is especially useful if you want to test a Claude Code skill but use a cheaper model (like GPT-4o-mini) for judging.
The Bottom Line
Testing agentic code is fundamentally different from testing deterministic code. A skill that works once might fail 40% of the time. Caliper gives you the data to know for sure — and the --baseline flag tells you if your skill is actually adding value or just getting in the way.
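A quick calculation shows why a single successful run says so little. The sketch assumes a hypothetical skill whose true pass rate is 60% (it fails 40% of the time) and treats runs as independent, which real agent runs may not be.

```python
pass_rate = 0.6  # hypothetical true pass rate: the skill fails 40% of the time
k = 5            # number of repeated runs

# Chance that a one-off manual test happens to pass.
p_single_ok = pass_rate

# Chance that all k independent runs pass, hiding the problem entirely.
p_all_ok = pass_rate ** k

# Chance that k runs expose at least one failure.
p_failure_seen = 1 - p_all_ok

print(f"One run passes:            {p_single_ok:.0%}")
print(f"All {k} runs pass:           {p_all_ok:.1%}")
print(f"At least one failure seen: {p_failure_seen:.1%}")
```

Under these assumptions a one-off test looks fine more often than not, while five runs reveal a failure roughly nine times out of ten.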
Source: apimatic.io
[Updated 30 Jun via hn_claude_code]
Caliper now supports multiple agent harnesses beyond Claude Code, including Codex, Pi, Claude API, and OpenAI API [per creator's Show HN on Hacker News]. The --baseline flag reveals that even basic JSON extraction — 'solved by 2-year-old models' — can show a 40% delta, proving the skill's real value. The project also ships two companion skills: evaluate-skill for running evals without leaving your workflow, and grill-skill which reads your SKILL.md, interviews you, and auto-generates a 3-task spec covering happy path, edge case, and adversarial scenarios.
Originally published on gentic.news
