DEV Community

Marlon Martin

I Broke a Chatbot With a Prompt Change. Then I Built the Tool That Would've Caught It.

I updated a system prompt on a Friday. By Monday, a user filed a bug: the chatbot was giving wrong answers.

The output looked totally fine. Valid format. Natural language. No errors in the logs. Just... wrong.

That's the thing about LLM regressions — they're completely silent.

The problem with testing LLMs

Traditional software tests don't catch this. Unit tests mock the model. Integration tests verify the request went through. Neither catches that your prompt change made the model start hallucinating, or quietly drop a required field.

I looked at what existed:

  • Promptfoo — gets it, but regression is manual diffs
  • DeepEval — Python-only, useless if you're not on that stack
  • LangSmith / Braintrust — cloud platforms starting at $249/month
  • RAGAS — RAG-specific, no baseline comparison

I wanted something that:

  • Compares every run against a baseline automatically
  • Runs in CI and returns a meaningful exit code
  • Doesn't require Python, Node, or Docker

So I built it.

Introducing Regtrace

Regtrace is an open-source CLI for LLM quality gates. Standalone binary — drop it in and go.

curl -L https://github.com/decimozs/regtrace/releases/latest/download/regtrace-linux-x64 -o regtrace
chmod +x regtrace
sudo mv regtrace /usr/local/bin/
regtrace init
regtrace run --generate

Deterministic checks (format, JSON schema, regex, length) work with zero API keys. LLM-judged metrics need a provider key via .env.
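To make "deterministic" concrete, here is a minimal Python sketch of the kind of checks that need no model call: length, regex, JSON validity, and required fields. It is not Regtrace's implementation; the function name `check_format` and its parameters are hypothetical, chosen only for illustration.

```python
import json
import re


def check_format(output, required_fields, pattern, max_chars):
    """Return failure messages; an empty list means every check passed."""
    failures = []

    # Length check: a plain character count.
    if len(output) > max_chars:
        failures.append(f"length {len(output)} exceeds {max_chars}")

    # Regex check against the raw output text.
    if re.search(pattern, output) is None:
        failures.append(f"pattern not found: {pattern}")

    # JSON validity check; the field check below needs parsed data.
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        failures.append(f"invalid JSON: {exc.msg}")
        return failures

    # Required-field check: only a JSON object can carry named fields.
    if not isinstance(data, dict):
        failures.append("expected a JSON object")
        return failures
    for field in required_fields:
        if field not in data:
            failures.append(f"missing required field: {field}")
    return failures


good = '{"answer": "Paris", "confidence": 0.93}'
bad = '{"answer": "Paris"}'
pattern = r'"answer":\s*"\w+"'

print(check_format(good, ["answer", "confidence"], pattern, 200))  # []
print(check_format(bad, ["answer", "confidence"], pattern, 200))
# ['missing required field: confidence']
```

The second output is the "quietly dropped a required field" case: the text is valid JSON and reads fine, but a check like this flags it without an API key.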

Four metric pillars

| Pillar | What it checks | How |
|---|---|---|
| Factuality | Accuracy against expected output | Heuristic overlap or LLM-as-judge, auto-detects JSON |
| Format | Structure compliance | JSON validity, schema, required fields, regex, forbidden content |
| Tone | Style consistency | Formality, sentiment, assertiveness, persona, verbosity |
| Regression | Drift over time | Every run vs baseline, per-metric tolerance, stale alerts |

The regression pillar is the one most tools skip entirely.

Why delta-gating beats threshold-gating

Most eval frameworks gate on absolute thresholds:

pass_rate >= 0.85  ✓

That seems fine until your model improves and the suite sits at 0.97. A regression to 0.88 then slips right through, because 0.88 is still above the threshold.

Regtrace gates on delta vs baseline. Pass rates can go up. They should never go down by more than the tolerance you set for each metric.

metrics:
  regression:
    enabled: true
    metric_tolerances:
      format: 0        # zero tolerance for format drift
      factuality: 0.1  # 10% variance allowed
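The difference between the two gating styles is easy to see in a few lines of Python. This is a sketch under assumptions, not Regtrace's code: the function names are hypothetical, and a tolerance is treated here as the maximum absolute drop in score that is allowed (the tool may define it differently).

```python
def threshold_gate(score: float, threshold: float) -> bool:
    """Absolute gating: pass as long as the score clears a fixed bar."""
    return score >= threshold


def delta_gate(baseline: dict, current: dict, tolerances: dict) -> dict:
    """Delta gating: pass a metric only if it dropped by no more than its tolerance."""
    results = {}
    for metric, base_score in baseline.items():
        drop = base_score - current.get(metric, 0.0)
        # The small epsilon keeps float rounding from failing a drop that
        # sits exactly on the tolerance.
        results[metric] = drop <= tolerances.get(metric, 0.0) + 1e-9
    return results


baseline = {"format": 0.97, "factuality": 0.90}
current = {"format": 0.88, "factuality": 0.85}
tolerances = {"format": 0.0, "factuality": 0.1}  # values from the config above

print(threshold_gate(current["format"], 0.85))    # True
print(delta_gate(baseline, current, tolerances))  # {'format': False, 'factuality': True}
```

The format score of 0.88 clears the absolute bar of 0.85, so a threshold gate stays green. The delta gate fails it because it fell from 0.97 with zero tolerance, while the 0.05 drop in factuality is inside its 0.1 allowance.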

CI in one YAML file

name: LLM Quality Gate
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download regtrace
        run: |
          curl -L https://github.com/decimozs/regtrace/releases/latest/download/regtrace-linux-x64 -o /usr/local/bin/regtrace
          chmod +x /usr/local/bin/regtrace
      - name: Evaluate
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: regtrace run --format json --output results.json
        # Exit codes: 0 = pass, 1 = gate failure, 2 = config error

Four quality gates (suite score, max failures, regression status, and non-functional requirements, or NFRs) are AND-composed: all four must pass for the run to exit with 0.
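As a rough sketch of what AND-composition means for the exit code, assuming each gate reduces to a single boolean: the gate keys and the function name below are hypothetical, and treating a missing gate result as a config error (exit code 2) is an assumption made for this example, not documented Regtrace behavior.

```python
# Hypothetical keys for the four gates named in the article.
EXPECTED_GATES = {"suite_score", "max_failures", "regression", "nfr"}


def suite_exit_code(gates: dict) -> int:
    """Map gate results to the exit codes 0 = pass, 1 = gate failure, 2 = config error."""
    if set(gates) != EXPECTED_GATES:
        return 2  # assumption: an incomplete or unknown gate set counts as a config error
    # AND-composition: a single failing gate fails the whole run.
    return 0 if all(gates.values()) else 1


all_green = {"suite_score": True, "max_failures": True, "regression": True, "nfr": True}
one_red = {**all_green, "regression": False}

print(suite_exit_code(all_green))  # 0
print(suite_exit_code(one_red))    # 1
```

Because the composition is a plain AND, a strong suite score cannot offset a regression or an NFR breach; CI sees a non-zero exit code either way.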

NFR enforcement (the gates most tools ignore)

nfr_gates:
  max_latency_ms: 5000
  max_cost_usd: 1.00
  min_coverage: 80

Latency, cost, and test coverage thresholds. Failed NFRs block the suite just like a regression would.
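A minimal sketch of how thresholds like these could be evaluated, using the values from the config above. The `measured` keys and the `check_nfr` helper are hypothetical, and the sketch assumes each measurement has already been aggregated into one number per run (how Regtrace aggregates latency or cost is not stated here).

```python
NFR_GATES = {"max_latency_ms": 5000, "max_cost_usd": 1.00, "min_coverage": 80}


def check_nfr(measured: dict, gates: dict) -> list:
    """Return one message per violated NFR; an empty list means the NFR gate passes."""
    violations = []
    if measured["latency_ms"] > gates["max_latency_ms"]:
        violations.append(f"latency {measured['latency_ms']} ms exceeds {gates['max_latency_ms']} ms")
    if measured["cost_usd"] > gates["max_cost_usd"]:
        violations.append(f"cost ${measured['cost_usd']:.2f} exceeds ${gates['max_cost_usd']:.2f}")
    if measured["coverage"] < gates["min_coverage"]:
        violations.append(f"coverage {measured['coverage']}% is below {gates['min_coverage']}%")
    return violations


run = {"latency_ms": 6200, "cost_usd": 0.42, "coverage": 85}
print(check_nfr(run, NFR_GATES))
# ['latency 6200 ms exceeds 5000 ms']
```

Note the direction of each comparison: latency and cost are upper bounds, while coverage is a lower bound.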

Try it in 2 minutes — no API key needed

regtrace init
# Edit golden-sets/qa.yaml — fill in actual_output values
regtrace run

Format checks, word overlap, and JSON validation all run locally. Only factuality (deep) and tone require a provider key.
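For a feel of what a local word-overlap score looks like, here is a small Python sketch. It is one simple way to define overlap (the share of distinct expected words found in the actual output), not necessarily the formula Regtrace uses, and `word_overlap` is a hypothetical name.

```python
import re


def word_overlap(expected: str, actual: str) -> float:
    """Fraction of distinct expected words that also appear in the actual output."""
    def words(text):
        # Lowercase and keep only alphanumeric runs, so punctuation is ignored.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    expected_words = words(expected)
    if not expected_words:
        return 1.0  # nothing was expected, so nothing can be missing
    return len(expected_words & words(actual)) / len(expected_words)


expected = "The capital of France is Paris"
print(word_overlap(expected, "Paris is the capital of France."))          # 1.0
print(round(word_overlap(expected, "The capital of France is Lyon"), 2))  # 0.83
```

The second result shows the limit of the heuristic: a factually wrong answer still scores 0.83 because most of the words match. That gap is a plausible reason to keep an LLM-judged factuality check alongside the free local one.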

How it compares

| Tool | Interface | Regression detection |
|---|---|---|
| Promptfoo | CLI + Web UI | Manual diff |
| DeepEval | Library | Pytest plugin |
| LangSmith | Platform | Platform-level |
| Braintrust | Platform | Experiment tracking |
| RAGAS | Library | None |
| Regtrace | CLI | Automatic, per-metric, CI-native |

Full roadmap on GitHub.

This is the weekend project I wish I'd had before that Friday deploy. Regtrace is currently in beta, and I'd love feedback from the community, especially from anyone who's fought silent LLM regressions before. Every suggestion helps improve it.
