I Broke a Chatbot With a Prompt Change. Then I Built the Tool That Would've Caught It.

#llm #opensource #devops #testing

I updated a system prompt on a Friday. By Monday, a user filed a bug: the chatbot was giving wrong answers.

The output looked totally fine. Valid format. Natural language. No errors in the logs. Just... wrong.

That's the thing about LLM regressions — they're completely silent.

The problem with testing LLMs

Traditional software tests don't catch this. Unit tests mock the model. Integration tests verify the request went through. Neither catches that your prompt change made the model start hallucinating, or quietly drop a required field.

I looked at what existed:

Promptfoo: gets it, but regression checking means diffing runs by hand
DeepEval: Python-only, useless if you're not on that stack
LangSmith / Braintrust: cloud platforms starting at $249/month
RAGAS: RAG-specific, no baseline comparison

I wanted something that:

Compares every run against a baseline automatically
Runs in CI and returns a meaningful exit code
Doesn't require Python, Node, or Docker

So I built it.

Introducing Regtrace

Regtrace is an open-source CLI for LLM quality gates. Standalone binary — drop it in and go.

curl -L https://github.com/decimozs/regtrace/releases/latest/download/regtrace-linux-x64 -o regtrace
chmod +x regtrace
sudo mv regtrace /usr/local/bin/
regtrace init
regtrace run --generate

Deterministic checks (format, JSON schema, regex, length) work with zero API keys. LLM-judged metrics need a provider key via .env.

Four metric pillars

Pillar	What it checks	How
Factuality	Accuracy against expected output	Heuristic overlap or LLM-as-judge, auto-detects JSON
Format	Structure compliance	JSON validity, schema, required fields, regex, forbidden content
Tone	Style consistency	Formality, sentiment, assertiveness, persona, verbosity
Regression	Drift over time	Every run vs baseline, per-metric tolerance, stale alerts
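
To make the Format row concrete, here is a rough Python sketch of what deterministic checks of this kind can look like. It is illustrative only and not Regtrace's implementation; the function name `check_format` and its parameters are hypothetical.

```python
import json
import re


def check_format(output, required_fields=(), forbidden_patterns=()):
    """Return a list of problems found in a model output (empty list = pass)."""
    # Step 1: the output must parse as JSON at all.
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["expected a JSON object at the top level"]

    problems = []
    # Step 2: every required field must be present.
    for field in required_fields:
        if field not in data:
            problems.append(f"missing required field: {field}")
    # Step 3: no forbidden pattern may appear anywhere in the raw output.
    for pattern in forbidden_patterns:
        if re.search(pattern, output, flags=re.IGNORECASE):
            problems.append(f"forbidden content matched: {pattern}")
    return problems


print(check_format('{"answer": "42"}', required_fields=["answer", "sources"]))
# ['missing required field: sources']
```

Nothing here calls a model, which is why checks like this can run without an API key.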

The regression pillar is the one most tools skip entirely.

Why delta-gating beats threshold-gating

Most eval frameworks gate on absolute thresholds:

pass_rate >= 0.85  ✓

That seems fine — until your model improves, every test is passing at 0.97, and then a regression to 0.88 slips right through because it's still above the threshold.

Regtrace gates on the delta against a baseline instead. Pass rates can go up. They should never drop by more than the tolerance you set for each metric.

metrics:
  regression:
    enabled: true
    metric_tolerances:
      format: 0        # zero tolerance for format drift
      factuality: 0.1  # 10% variance allowed
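
The difference between the two gating styles can be sketched in a few lines of Python. This is a sketch of the idea, not Regtrace's code: the function names are hypothetical, and it assumes a tolerance is the largest absolute drop in score allowed against the baseline (so 0.1 means ten points).

```python
def threshold_gate(score, threshold=0.85):
    """Absolute gate: passes as long as the score clears a fixed bar."""
    return score >= threshold


def delta_gate(baseline, current, tolerances):
    """Relative gate: return the metrics that dropped more than their tolerance."""
    failures = {}
    for metric, base_score in baseline.items():
        # A metric missing from the current run is treated as a score of 0.
        drop = base_score - current.get(metric, 0.0)
        allowed = tolerances.get(metric, 0.0)
        # Small epsilon so float noise does not trip a zero-tolerance gate.
        if drop > allowed + 1e-9:
            failures[metric] = round(drop, 4)
    return failures


baseline = {"format": 1.00, "factuality": 0.97}
current = {"format": 0.98, "factuality": 0.88}
tolerances = {"format": 0.0, "factuality": 0.1}

# Both scores clear a 0.85 threshold, so an absolute gate sees nothing wrong.
print(all(threshold_gate(score) for score in current.values()))  # True
# The delta gate flags the format drop; factuality is within its tolerance.
print(delta_gate(baseline, current, tolerances))  # {'format': 0.02}
```

Improvements never fail the gate, because a negative drop is always within tolerance.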

CI in one YAML file

name: LLM Quality Gate
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download regtrace
        run: |
          # -f makes curl fail on an HTTP error instead of saving the error page as the binary
          sudo curl -fsSL https://github.com/decimozs/regtrace/releases/latest/download/regtrace-linux-x64 -o /usr/local/bin/regtrace
          sudo chmod +x /usr/local/bin/regtrace
      - name: Evaluate
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: regtrace run --format json --output results.json
        # Exit codes: 0 = pass, 1 = gate failure, 2 = config error

Four quality gates (suite score, max failures, regression status, and non-functional requirements, or NFRs) are AND-composed: all four must pass for the run to exit 0.
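
AND-composition is easy to picture in Python. The gate names and the `exit_code` helper below are hypothetical stand-ins, and the sketch only covers exit codes 0 and 1; a config error (exit code 2) would be raised before any gate is evaluated.

```python
def exit_code(gates):
    """gates maps gate name -> passed?; return 0 only if every gate passed."""
    return 0 if all(gates.values()) else 1


gates = {
    "suite_score": True,
    "max_failures": True,
    "regression": False,  # a single failing gate is enough to fail the run
    "nfr": True,
}

failed = [name for name, passed in gates.items() if not passed]
print(failed)            # ['regression']
print(exit_code(gates))  # 1
```

Because CI treats any non-zero exit code as a failed step, one failing gate is enough to block the pull request.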

NFR enforcement (the gates most tools ignore)

nfr_gates:
  max_latency_ms: 5000
  max_cost_usd: 1.00
  min_coverage: 80

Latency, cost, and test coverage thresholds. Failed NFRs block the suite just like a regression would.
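
A minimal sketch of how thresholds like these might be evaluated, in Python. The gate keys match the config above, but the `measured` keys and the `check_nfr` helper are hypothetical, and the sketch assumes coverage is a percentage and that latency and cost are single aggregated numbers for the run.

```python
def check_nfr(measured, gates):
    """Return the names of the NFR gates that the measured values violate."""
    violations = []
    if measured["latency_ms"] > gates["max_latency_ms"]:
        violations.append("latency")
    if measured["cost_usd"] > gates["max_cost_usd"]:
        violations.append("cost")
    if measured["coverage"] < gates["min_coverage"]:
        violations.append("coverage")
    return violations


nfr_gates = {"max_latency_ms": 5000, "max_cost_usd": 1.00, "min_coverage": 80}
measured = {"latency_ms": 6200, "cost_usd": 0.42, "coverage": 91}

print(check_nfr(measured, nfr_gates))  # ['latency']
```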

Try it in 2 minutes — no API key needed

regtrace init
# Edit golden-sets/qa.yaml — fill in actual_output values
regtrace run

Format checks, word overlap, and JSON validation all run locally. Only factuality (deep) and tone require a provider key.
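
For a feel of what a local word-overlap check can and cannot tell you, here is a small Python sketch. It is an assumption about the general technique, not Regtrace's actual heuristic; `overlap_score` is a hypothetical name, and the score is simply the fraction of expected words that show up in the actual output.

```python
import re


def tokens(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(expected, actual):
    """Fraction of the expected words that also appear in the actual output."""
    expected_tokens = tokens(expected)
    if not expected_tokens:
        return 1.0  # nothing expected, so nothing can be missing
    return len(expected_tokens & tokens(actual)) / len(expected_tokens)


expected = "Paris is the capital of France"
print(round(overlap_score(expected, "The capital of France is Paris."), 2))  # 1.0
print(round(overlap_score(expected, "The capital of France is Lyon."), 2))   # 0.83
```

The second answer is wrong but still scores 0.83, which shows the limit of a purely lexical check and why a judged factuality metric needs a model behind it.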

How it compares

Tool	Interface	Regression detection
Promptfoo	CLI + Web UI	Manual diff
DeepEval	Library	Pytest plugin
LangSmith	Platform	Platform-level
Braintrust	Platform	Experiment tracking
RAGAS	Library	None
Regtrace	CLI	Automatic, per-metric, CI-native

Full roadmap on GitHub.

The weekend project I wish I'd had before that Friday deploy. It's currently in beta, and I'd love feedback from the community, especially from anyone who's fought silent LLM regressions before. Every suggestion helps improve it.