Catch skill regressions before they ship in 2026
Build a deterministic regression suite that reruns every skill case on each prompt change and blocks the merge when the risk-weighted pass rate drops below your gate. The Skill Regression Suite Builder generates those case files for you in a CI-ready format. It can't replace human judgment on edge cases. What it does is stop the silent breakages that ship when you tweak one line of a system prompt.
I built the Skill Regression Suite Builder linked below. I tried four eval frameworks first, and every one assumed I'd ship my prompts and test data to their servers. I didn't want my system prompts leaving my laptop, so the builder runs entirely client-side. It's free and asks for no signup, and nothing you paste ever leaves the browser. If you've got a better workflow, tell me. I'm not precious about it.
The bug I shipped because I trusted a one-line prompt edit
Three weeks ago, on a Tuesday afternoon, I edited a single sentence in a skill that classifies incoming support tickets. The change looked harmless. I added "prioritize billing issues" to the instructions because a stakeholder asked. I ran the skill once by hand. The output looked sensible, so I merged it.
By Thursday morning the skill was misrouting password-reset tickets into the billing queue. Not all of them. About 1 in 6. Enough that nobody caught it for a day, and enough that our support lead pinged me at 8:42am asking why the billing queue had doubled overnight.
The fix took two minutes. Finding it took most of the morning, because I had no record of what "working" used to look like. I'd been testing skills the way most people do: open it, type a few inputs I can think of off the top of my head, eyeball the output, ship. That works right up until it doesn't. The trouble is that the inputs I dream up on the spot are the easy ones. The cases that actually break are the ones I'd never think to type, which is exactly why they slip through.
Here's the thing about skills. A skill is just a prompt plus some tools, and prompts are absurdly sensitive to wording. Shift one clause and the model quietly re-weights everything downstream. A regression suite for application code is normal practice. A regression suite for prompts barely exists in most teams I've talked to, even though prompts break in the same sneaky, invisible ways code does. I wanted the same safety net I already have for my functions: a fixed set of inputs with known-good outputs that runs automatically on every change and ends in one clear pass or fail.
That gap is what this tool fills. It does one job: turn the test cases living in your head into a file your CI can run on every change.
How the builder turns loose test ideas into a CI gate
The core idea is boring on purpose, and that's a compliment. You hand it a list of cases. Each case has three parts: an input, the behavior you expect for that input, and a risk weight you assign. The builder produces a deterministic suite file plus a small runner you can drop straight into CI. Deterministic is the operative word: the suite pins temperature to 0 and uses exact or rule-based matching, so the same input gives you the same verdict every single run. Flaky evals are worse than no evals, because the first time one fails for no reason you stop trusting all of them. A suite you don't trust is just guilt that runs in CI.
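To make "exact or rule-based matching" concrete, here is a minimal sketch of a deterministic matcher. The rule kinds (`exact`, `contains`, `regex`) and the `matches` helper are illustrative names I'm assuming for this example, not necessarily the format the builder emits.

```python
import re

# Hypothetical rule format for illustration; the builder's real format may differ.
def matches(output: str, rule: dict) -> bool:
    """Return True when `output` satisfies one deterministic rule."""
    kind = rule["kind"]
    if kind == "exact":
        # Exact match, ignoring only surrounding whitespace.
        return output.strip() == rule["value"]
    if kind == "contains":
        # Case-insensitive substring check.
        return rule["value"].lower() in output.lower()
    if kind == "regex":
        # Pattern match anywhere in the output.
        return re.search(rule["pattern"], output) is not None
    raise ValueError(f"unknown rule kind: {kind}")

print(matches("billing", {"kind": "exact", "value": "billing"}))
print(matches("Routed to BILLING queue", {"kind": "contains", "value": "billing"}))
print(matches("ticket-4821", {"kind": "regex", "pattern": r"^ticket-\d+$"}))
```

None of these rules calls a model to judge the output, so the verdict for a given output string never changes between runs.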
The risk weight is the piece I didn't expect to care about, and now I won't build a suite without it. Test cases don't all matter the same amount. A misrouted billing ticket costs real money and a real apology. A slightly stiff greeting costs nothing. So each case carries a weight, and the gate checks a weighted pass rate rather than a raw count. You can pass 95% of your cases and still fail the gate if the 5% you broke happened to be the expensive ones. I set my own weights on a one-to-five scale and treat anything that touches money or data as a five. You'll pick your own scale. The point is just that the gate respects it.
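A quick sketch of that arithmetic, using a made-up suite of 20 cases (the counts and weights are assumptions chosen only to illustrate the point): 19 cheap cases pass and one weight-5 case fails.

```python
# Hypothetical results: 19 low-risk cases (weight 1) pass, one high-risk case (weight 5) fails.
results = [(1, True)] * 19 + [(5, False)]  # (weight, passed)

# Raw pass rate: every case counts the same.
raw = sum(passed for _, passed in results) / len(results)

# Weighted pass rate: passing weight divided by total weight.
weighted = sum(w for w, passed in results if passed) / sum(w for w, _ in results)

print(f"raw pass rate:      {raw:.1%}")       # 95.0%
print(f"weighted pass rate: {weighted:.1%}")  # 79.2%
print("gate:", "PASS" if weighted >= 0.90 else "FAIL")  # FAIL
```

The raw count says 95%, but 19 of 24 weighted points is about 79%, so a 0.90 gate blocks the merge.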
Here is what a generated suite and its runner look like. This is real, runnable Python against a stubbed skill function, so you can read the gate logic without needing an API key:
# Cases generated by the Skill Regression Suite Builder.
# Each case: input, expected route, and a risk weight.
CASES = [
    {"id": "route-password-reset",
     "input": "I forgot my password and can't log in",
     "expect_route": "auth", "weight": 5},
    {"id": "route-billing-charge",
     "input": "Why was I charged twice this month?",
     "expect_route": "billing", "weight": 5},
    {"id": "greeting-tone",
     "input": "hello there",
     "expect_route": "general", "weight": 1},
]


def run_skill(text):
    # Swap this stub for your real skill call.
    return "billing" if "charg" in text.lower() else "general"


def evaluate(cases, gate=0.90):
    total = sum(c["weight"] for c in cases)
    earned = 0
    for c in cases:
        ok = run_skill(c["input"]) == c["expect_route"]
        earned += c["weight"] if ok else 0
        print(f"{'PASS' if ok else 'FAIL'} {c['id']} (w={c['weight']})")
    score = earned / total
    print(f"weighted pass rate: {score:.1%}")
    # Exit code 0 keeps CI green; exit code 1 fails the job.
    raise SystemExit(0 if score >= gate else 1)


evaluate(CASES)
Run that and the stubbed skill fails the password-reset case, which carries weight 5, because the dummy router only looks for the substring "charg" and sends everything else to "general". The weighted score lands at 6 out of 11, about 54.5%, well under the 0.90 gate, so the process exits with code 1 and your CI run goes red. Swap run_skill for your actual skill call and you have a genuine regression gate. The builder writes both files for you, including the GitHub Actions step, so the only part you actually do by hand is describe the cases and assign the weights.
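One way to make that swap painless is to pass the skill in as an argument instead of hard-coding it. This is a sketch of my own, not the builder's generated runner: `weighted_score` and `keyword_router` are hypothetical names, and `keyword_router` stands in for whatever real backend call you would make.

```python
from typing import Callable


def weighted_score(cases: list[dict], skill: Callable[[str], str]) -> float:
    """Weighted pass rate for any callable that maps input text to a route."""
    total = sum(c["weight"] for c in cases)
    earned = sum(c["weight"] for c in cases
                 if skill(c["input"]) == c["expect_route"])
    return earned / total


def keyword_router(text: str) -> str:
    # Hypothetical stand-in for a real skill call (no network, no API key).
    lowered = text.lower()
    if "password" in lowered:
        return "auth"
    if "charg" in lowered:
        return "billing"
    return "general"


cases = [
    {"input": "I forgot my password and can't log in", "expect_route": "auth", "weight": 5},
    {"input": "Why was I charged twice this month?", "expect_route": "billing", "weight": 5},
    {"input": "hello there", "expect_route": "general", "weight": 1},
]

score = weighted_score(cases, keyword_router)
print(f"weighted pass rate: {score:.1%}")  # 100.0%
```

Because the scoring function returns a number rather than exiting, you can unit-test it with a stub and keep the exit-code decision in one small wrapper for CI.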
How it stacks up against the eval tools I tried
I didn't build this in a vacuum. I tried the obvious options first, and a couple of them are excellent. Here's the honest comparison, including the rows where the builder loses:
| Tool | Risk-weighted gates | Runs client-side | Setup time | Best for |
|---|---|---|---|---|
| Skill Regression Suite Builder | Yes | Yes | ~5 min | Skill/prompt regression in CI |
| Promptfoo | No (raw pass rate) | Local CLI, configs on disk | ~30 min | Broad LLM eval matrices |
| Hosted eval platform (generic) | Partial | No, cloud upload | ~1 hr + account | Teams wanting dashboards |
| Hand-rolled pytest | Whatever you build | Yes | Hours | Full control, no time budget |
Promptfoo is genuinely good and far more flexible than what I built. If you need a sprawling matrix of models against providers, reach for it. It does treat every assertion as equal by default, though, so wiring up weighted gates meant writing my own scoring on top. The hosted platforms gave me pretty dashboards and a login screen I didn't ask for, plus the upload problem from earlier. Hand-rolled pytest is exactly what I had before the bug, and it works fine. It just costs you an afternoon every time you want to reshape the suite.
The builder sits in a narrow spot: you want weighted regression gates for your skills running in CI today, and you don't want your prompts sitting on someone else's server. If that describes you, generate your suite at aidevhub.io/skill-regression-suite-builder and paste the output into your repo. It took me about five minutes the first time, and most of that was me arguing with myself over the weights. The weights are the only judgment call, and that's by design.
When you should skip this entirely
I'd rather tell you when not to use it, because the "when not to" section is the part AI-written tool reviews always quietly drop. Honesty about limits is the only reason to trust the rest.
Don't reach for a regression suite while your skill is still changing shape every day. Early on, the "correct" output is a moving target, and you'll burn more time rewriting expected values than improving the actual skill. Wait until behavior settles. I usually start a suite once a skill has gone a full week without a structural rewrite.
Skip it for purely generative, open-ended skills where no single right answer exists. If the output is a creative paragraph, exact matching is useless and rule-based matching is mostly wishful thinking. You want an LLM-as-judge setup there, which is a different tool and a longer argument about how much you trust the judge model.
And please don't read a green gate as proof the skill is correct. A green gate means the skill still does the things you remembered to test. My password-reset disaster taught me that the case I most need is usually the exact one I forgot to write down. A suite catches regressions against behavior you already knew about. It can't catch a whole category of input you never imagined existed.
One last thing. If you have three test cases total, skip all of this and use plain pytest. The weighting and the CI scaffolding start earning their keep somewhere around 15 to 20 cases, not 3.
FAQ
Q: Does this work with Claude skills and Codex skills?
A: Yes. The case format is model-agnostic. You point the runner's skill call at whatever backend you use, and the builder only cares about the case input and the expected behavior.
Q: How is the weighted pass rate actually calculated?
A: Each case has a weight. The score is the sum of the weights for passing cases divided by the sum of every weight. A gate of 0.90 means you need 90% of your weighted risk to pass, so breaking one heavy case hurts far more than breaking a light one.
Q: Can I run it without an API key?
A: The builder itself runs in your browser with no key and no signup. The generated runner needs whatever credentials your real skill call uses, since it has to actually invoke your skill to test it.
Q: Is it really deterministic?
A: As deterministic as your skill call is. The suite pins temperature to 0 and uses exact or rule-based matching, but if your model still wobbles at temperature 0, add a tolerance rule or a retry. I haven't needed to yet, though your mileage may vary.
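If you do hit that wobble, one possible shape for the retry is a majority vote over a few calls. This is only a sketch under my own assumptions: `majority_route` is a hypothetical helper, and `wobbly_skill` is a fake skill that simulates an inconsistent model.

```python
from collections import Counter
from typing import Callable


def majority_route(skill: Callable[[str], str], text: str, attempts: int = 3) -> str:
    """Call the skill several times and return the most common answer."""
    votes = Counter(skill(text) for _ in range(attempts))
    return votes.most_common(1)[0][0]


# Fake skill that answers inconsistently, to simulate wobble at temperature 0.
_answers = iter(["auth", "billing", "auth"])


def wobbly_skill(text: str) -> str:
    return next(_answers)


print(majority_route(wobbly_skill, "I forgot my password"))  # auth
```

A vote like this reduces flakiness rather than eliminating it, and it multiplies your call count by `attempts`, so I'd only reach for it on the cases that actually wobble.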
Written with AI assistance and human review. Try the tool at aidevhub.io/skill-regression-suite-builder.