I updated a system prompt on a Friday. By Monday, a user filed a bug: the chatbot was giving wrong answers.
The output looked totally fine. Valid format. Natural language. No errors in the logs. Just... wrong.
That's the thing about LLM regressions — they're completely silent.
The problem with testing LLMs
Traditional software tests don't catch this. Unit tests mock the model. Integration tests verify the request went through. Neither catches that your prompt change made the model start hallucinating, or quietly drop a required field.
I looked at what existed:
- Promptfoo — gets it, but regression is manual diffs
- DeepEval — Python-only, useless if you're not on that stack
- LangSmith / Braintrust — cloud platforms starting at $249/month
- RAGAS — RAG-specific, no baseline comparison
I wanted something that:
- Compares every run against a baseline automatically
- Runs in CI and returns a meaningful exit code
- Doesn't require Python, Node, or Docker
So I built it.
Introducing Regtrace
Regtrace is an open-source CLI for LLM quality gates. Standalone binary — drop it in and go.
curl -L https://github.com/decimozs/regtrace/releases/latest/download/regtrace-linux-x64 -o regtrace
chmod +x regtrace
sudo mv regtrace /usr/local/bin/
regtrace init
regtrace run --generate
Deterministic checks (format, JSON schema, regex, length) work with zero API keys. LLM-judged metrics need a provider key via .env.
Four metric pillars
| Pillar | What it checks | How |
|---|---|---|
| Factuality | Accuracy against expected output | Heuristic overlap or LLM-as-judge, auto-detects JSON |
| Format | Structure compliance | JSON validity, schema, required fields, regex, forbidden content |
| Tone | Style consistency | Formality, sentiment, assertiveness, persona, verbosity |
| Regression | Drift over time | Every run vs baseline, per-metric tolerance, stale alerts |
The regression pillar is the one most tools skip entirely.
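To make the Format row concrete, here is a minimal Python sketch of what deterministic checks of this kind could look like. It is not Regtrace's implementation: `check_format` and its parameters are hypothetical names chosen for illustration, and it covers only JSON validity, required fields, a regex, and forbidden content.

```python
import json
import re


def check_format(output, required_fields=(), pattern=None, forbidden=()):
    """Return a list of failure messages; an empty list means the output passed.

    Hypothetical helper for illustration, not part of Regtrace.
    """
    # JSON validity: the output must parse, and here we also expect an object.
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["expected a JSON object at the top level"]

    failures = []

    # Required fields: every listed key must be present.
    for field in required_fields:
        if field not in data:
            failures.append(f"missing required field: {field}")

    # Regex: the raw output must match the pattern somewhere.
    if pattern is not None and re.search(pattern, output) is None:
        failures.append(f"output does not match pattern: {pattern}")

    # Forbidden content: case-insensitive substring match on the raw output.
    for phrase in forbidden:
        if phrase.lower() in output.lower():
            failures.append(f"forbidden content found: {phrase}")

    return failures


good = '{"answer": "Paris", "confidence": 0.93}'
bad = '{"answer": "As an AI language model, I cannot say."}'

print(check_format(good, required_fields=["answer", "confidence"]))
print(check_format(bad, required_fields=["answer", "confidence"],
                   forbidden=["as an ai language model"]))
```

The second call reports two failures (a missing `confidence` field and a forbidden phrase) even though the output is valid JSON and reads naturally, which is exactly the kind of quiet breakage a format check is meant to surface.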
Why delta-gating beats threshold-gating
Most eval frameworks gate on absolute thresholds:
pass_rate >= 0.85 ✓
That seems fine — until your model improves, every test is passing at 0.97, and then a regression to 0.88 slips right through because it's still above the threshold.
Regtrace gates on delta vs baseline. Pass rates can go up. They should never go down by more than the tolerance you set for that metric.
metrics:
  regression:
    enabled: true
    metric_tolerances:
      format: 0        # zero tolerance for format drift
      factuality: 0.1  # 10% variance allowed
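The difference between the two gating styles is easy to see in a few lines of Python. This is a sketch of the idea only, not Regtrace's code: `threshold_gate` and `delta_gate` are hypothetical names, the tolerances mirror the config above, and the scores are made-up example data.

```python
def threshold_gate(current, threshold=0.85):
    """Absolute gate: passes whenever the score clears a fixed bar."""
    return current >= threshold


def delta_gate(baseline, current, tolerance=0.0, eps=1e-9):
    """Delta gate: passes unless the score dropped by more than `tolerance`.

    Improvements always pass. `eps` absorbs floating-point noise.
    """
    drop = baseline - current
    return drop <= tolerance + eps


# Tolerances mirror the config above; the scores are invented for illustration.
tolerances = {"format": 0.0, "factuality": 0.1}
baseline = {"format": 0.97, "factuality": 0.90}
current = {"format": 0.88, "factuality": 0.85}

for metric, tol in tolerances.items():
    print(
        metric,
        "threshold:", threshold_gate(current[metric]),
        "delta:", delta_gate(baseline[metric], current[metric], tol),
    )
```

With these numbers, `format` clears the 0.85 threshold but fails the delta gate because it dropped from its baseline with zero tolerance, while `factuality` passes both because its drop stays inside the allowed 0.1.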
CI in one YAML file
name: LLM Quality Gate
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download regtrace
        run: |
          curl -L https://github.com/decimozs/regtrace/releases/latest/download/regtrace-linux-x64 -o /usr/local/bin/regtrace
          chmod +x /usr/local/bin/regtrace
      - name: Evaluate
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: regtrace run --format json --output results.json
        # Exit codes: 0 = pass, 1 = gate failure, 2 = config error
Four quality gates — suite score, max failures, regression status, NFR — are AND-composed. All must pass.
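As a rough illustration of what AND-composition means for the exit code, here is a small Python sketch. The `GateResult` type and both functions are hypothetical and not taken from Regtrace; only the gate names and the 0/1 exit codes come from the text above.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool


def failed_gates(gates):
    """Names of the gates that did not pass, in their original order."""
    return [g.name for g in gates if not g.passed]


def exit_code(gates):
    """AND-compose the gates: 0 only if every gate passed, otherwise 1.

    A config error (exit code 2) would be reported before any gate is
    evaluated, so it is out of scope for this sketch.
    """
    return 0 if all(g.passed for g in gates) else 1


# Invented results for the four gates named above.
results = [
    GateResult("suite_score", True),
    GateResult("max_failures", True),
    GateResult("regression_status", False),
    GateResult("nfr", True),
]

print("failed:", failed_gates(results))
print("exit code:", exit_code(results))
```

One failing gate is enough to produce a non-zero exit code, so a CI step that runs the evaluation fails the pull request without any extra scripting.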
NFR enforcement (the gates most tools ignore)
nfr_gates:
  max_latency_ms: 5000
  max_cost_usd: 1.00
  min_coverage: 80
Latency, cost, and test coverage thresholds. Failed NFRs block the suite just like a regression would.
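A minimal Python sketch of how thresholds like these could be evaluated follows. It rests on several assumptions: `check_nfr` and the `measured` keys are hypothetical, `min_coverage` is treated as a percentage, and the latency and cost figures are treated as single values for the run, since the config above does not say whether they are per-test or aggregate.

```python
# Thresholds copied from the config above.
nfr_gates = {"max_latency_ms": 5000, "max_cost_usd": 1.00, "min_coverage": 80}


def check_nfr(measured, gates):
    """Return a list of violated NFRs; an empty list means all gates passed.

    `measured` uses hypothetical keys: latency_ms, cost_usd, coverage (percent).
    """
    violations = []
    if measured["latency_ms"] > gates["max_latency_ms"]:
        violations.append(
            f"latency {measured['latency_ms']} ms exceeds {gates['max_latency_ms']} ms"
        )
    if measured["cost_usd"] > gates["max_cost_usd"]:
        violations.append(
            f"cost ${measured['cost_usd']:.2f} exceeds ${gates['max_cost_usd']:.2f}"
        )
    if measured["coverage"] < gates["min_coverage"]:
        violations.append(
            f"coverage {measured['coverage']}% is below {gates['min_coverage']}%"
        )
    return violations


# Invented measurements: fast and cheap enough, but under-covered.
measured = {"latency_ms": 3200, "cost_usd": 0.42, "coverage": 75}
print(check_nfr(measured, nfr_gates))
```

Here the run is within the latency and cost budgets but still fails, because coverage of 75% falls below the 80% minimum.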
Try it in 2 minutes — no API key needed
regtrace init
# Edit golden-sets/qa.yaml — fill in actual_output values
regtrace run
Format checks, word overlap, and JSON validation all run locally. Only factuality (deep) and tone require a provider key.
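For a sense of what a local word-overlap check can and cannot do, here is a minimal Python sketch. It is an assumption about the general technique, not Regtrace's actual heuristic: `tokenize` and `overlap_score` are hypothetical names, and the score is a simple recall over the expected answer's unique words.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(expected, actual):
    """Fraction of the expected answer's unique words found in the actual output."""
    expected_words = tokenize(expected)
    if not expected_words:
        return 1.0  # nothing to match against, so nothing is missing
    return len(expected_words & tokenize(actual)) / len(expected_words)


expected = "The capital of France is Paris"

# Same facts, different word order: every expected word is present.
print(overlap_score(expected, "Paris is the capital of France."))

# Wrong answer, but five of the six expected words still match.
print(overlap_score(expected, "The capital of France is Lyon"))
```

The second score is still high even though the answer is wrong, which shows why a cheap overlap heuristic works as a fast local signal while a deeper, LLM-judged factuality check is worth a provider key.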
How it compares
| Tool | Interface | Regression detection |
|---|---|---|
| Promptfoo | CLI + Web UI | Manual diff |
| DeepEval | Library | Pytest plugin |
| LangSmith | Platform | Platform-level |
| Braintrust | Platform | Experiment tracking |
| RAGAS | Library | None |
| Regtrace | CLI | Automatic, per-metric, CI-native |
Full roadmap on GitHub.
The weekend project I wish I'd had before that Friday deploy. It's currently in beta, and I'd love feedback from the community, especially from anyone who's fought silent LLM regressions before. Every suggestion helps improve it.