Stop Testing Skills Once: Use Caliper's pass@k to Measure What Actually

Caliper is a lightweight harness that runs Claude Code skills k times, scores them with pass@k, and compares against a no-skill baseline so you know if your skill actually helps.

What Changed — Skills Are Now Testable, Not Guesswork

If you've ever published a Claude Code skill, you've felt the anxiety: Will this work for other people? Will the next model update silently break it? The answer was always "I don't know" — until now.

Caliper (github.com/edonadei/caliper) is a new open-source harness that runs your skill k times in isolated environments and gives you a pass@k score. You define success in a YAML spec with either an LLM judge, a Python assertion, or both. Then you run:

caliper run extract-actions.eval.yaml --k 5 --baseline

And see:

ID      Task                           k(5)  pass@k
task-1  Extracts action items as JSON  5/5   100%  PASS
With skill   100%
No skill      60%
Delta        +40%

The --baseline flag is the killer feature: it re-runs everything without your skill, so you see the delta. A +40% delta means the skill lifts the pass rate by 40 percentage points over the bare model. A delta of 0% means the skill is doing nothing, and a negative delta means it's actively harming results.
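
The source does not say which pass@k formula Caliper uses, so the snippet below is only a sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k) for c passes observed in n runs, together with the delta arithmetic. The names pass_at_k and baseline_delta are hypothetical and are not part of Caliper. At k=1 the estimator reduces to the plain pass rate c/n, which is consistent with the 5/5 = 100% row in the sample output.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled runs passes,
    given that c out of n observed runs passed (hypothetical helper)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-sample contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def baseline_delta(with_skill: float, without_skill: float) -> float:
    """Difference in pass rate between the skill run and the no-skill run."""
    return with_skill - without_skill


# Numbers from the sample output: 5/5 with the skill, 60% (3/5) without it.
with_skill = pass_at_k(n=5, c=5, k=1)
without_skill = pass_at_k(n=5, c=3, k=1)
print(f"Delta: {baseline_delta(with_skill, without_skill):+.0%}")
```

A positive result here means the skill run passed more often than the baseline; whether Caliper reports exactly this quantity is an assumption.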

What It Means For You — Stop Shipping Untested Skills

Most Claude Code skills are tested once, look good, and then break silently when a new model releases. Caliper addresses this by making evaluation repeatable: the same spec can be run k times today and re-run after every model update.

Here's what you can do right now:

Install Caliper via a Claude Code skill:

   npx skills@latest add edonadei/caliper

This installs two skills: evaluate-skill (run and manage evals) and grill-skill (reads your SKILL.md, interviews you, and writes a 3-task spec).

Write your first eval spec in a .eval.yaml file:

   tasks:
     - name: Extracts action items as clean JSON
       prompt: "Read /tmp/transcript.txt and write the action items to /tmp/actions.json."
       expect: "A valid JSON array where every item has owner, task, due. No markdown fences."
       assert: |
         import json
         items = json.load(open("/tmp/actions.json"))
         assert isinstance(items, list)
         assert all({"owner","task","due"} <= i.keys() for i in items)

Run it with --k 5 and --baseline to see your skill's true performance.
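
Before spending k runs on a spec, it can help to try the assertion logic locally against a hand-written file. The following is a minimal sketch of the same checks as the assert block above, with a clearer failure message per item. The function name check_action_items and the sample data are made up for illustration, and how Caliper actually executes the assert block is not described in the source.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"owner", "task", "due"}


def check_action_items(path: str) -> None:
    """Raise AssertionError unless the file holds a JSON array of
    objects that each carry owner, task, and due (hypothetical helper)."""
    # json.loads raises json.JSONDecodeError if the file is not pure JSON,
    # for example when the agent wrapped the output in markdown fences.
    items = json.loads(Path(path).read_text())
    assert isinstance(items, list), "top-level value must be a JSON array"
    for index, item in enumerate(items):
        assert isinstance(item, dict), f"item {index} is not an object"
        missing = REQUIRED_KEYS - item.keys()
        assert not missing, f"item {index} is missing {sorted(missing)}"


# Made-up sample data standing in for /tmp/actions.json.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "actions.json"
    sample.write_text(
        json.dumps([{"owner": "Ana", "task": "Send notes", "due": "2025-01-10"}])
    )
    check_action_items(str(sample))
    print("sample output passes")
```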

Try It Now — Your First Caliper Run

# Install Caliper
pip install caliper-eval

# Or add it as a Claude Code skill
npx skills@latest add edonadei/caliper

# Create a simple eval
cat > my-skill.eval.yaml << 'EOF'
tasks:
  - name: Generates valid Python
    prompt: "Write a function named fibonacci(n) that returns the nth Fibonacci number and save it to /tmp/fib.py"
    expect: "A valid Python file with a function that returns correct Fibonacci numbers"
    assert: |
      import sys
      sys.path.insert(0, "/tmp")
      from fib import fibonacci
      assert fibonacci(0) == 0
      assert fibonacci(1) == 1
      assert fibonacci(10) == 55
EOF

# Run it 10 times with baseline
caliper run my-skill.eval.yaml --k 10 --baseline
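
For reference, this is the kind of file that would satisfy the assertions in the spec above. It is an illustrative implementation written by hand, not output captured from an actual run, and the negative-input check is an extra that the spec does not require.

```python
def fibonacci(n: int) -> int:
    """Return the nth Fibonacci number, with fibonacci(0) == 0."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    for _ in range(n):
        # Advance the pair (F(i), F(i+1)) one step.
        a, b = b, a + b
    return a


# The same checks the spec's assert block performs.
assert fibonacci(0) == 0
assert fibonacci(1) == 1
assert fibonacci(10) == 55
```

Note that the spec imports a function called fibonacci, so a run where the agent names it something else (fib, for instance) would fail the assertion even if the logic is correct.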

Caliper supports multiple backends: you can run the skill on one model and judge with another. This is especially useful if you want to test a Claude Code skill but use a cheaper model (like GPT-4o-mini) for judging.

The Bottom Line

Testing agentic code is fundamentally different from testing deterministic code. A skill that works once might fail 40% of the time. Caliper gives you the data to know for sure — and the --baseline flag tells you if your skill is actually adding value or just getting in the way.
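
To put rough numbers on that claim, here is a small sketch of the arithmetic for a skill that fails 40% of the time. It assumes every run is independent with the same pass probability, which is a simplification, and the helper names are hypothetical.

```python
def prob_all_pass(p: float, k: int) -> float:
    """Chance that k independent runs all pass, given per-run pass rate p."""
    return p ** k


def prob_any_pass(p: float, k: int) -> float:
    """Chance that at least one of k independent runs passes."""
    return 1 - (1 - p) ** k


p = 0.6  # the skill fails 40% of the time
print(f"single run passes:       {prob_all_pass(p, 1):.1%}")  # 60.0%
print(f"all 5 runs pass:         {prob_all_pass(p, 5):.1%}")  # 7.8%
print(f"at least 1 of 5 passes:  {prob_any_pass(p, 5):.1%}")  # 99.0%
```

Under these assumptions, a single manual test looks fine more often than not, while five clean runs in a row would be rare, which is why repeating the run exposes flakiness that one test hides.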

Source: apimatic.io

[Updated 30 Jun via hn_claude_code]

Caliper now supports multiple agent harnesses beyond Claude Code, including Codex, Pi, Claude API, and OpenAI API [per creator's Show HN on Hacker News]. With the --baseline flag, even basic JSON extraction ('solved by 2-year-old models') can show a 40% delta, which suggests the skill is contributing real value rather than riding on the model's default ability. The project also ships two companion skills: evaluate-skill, for running evals without leaving your workflow, and grill-skill, which reads your SKILL.md, interviews you, and auto-generates a 3-task spec covering happy path, edge case, and adversarial scenarios.

Originally published on gentic.news