<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marlon Martin</title>
    <description>The latest articles on DEV Community by Marlon Martin (@decimozs).</description>
    <link>https://dev.to/decimozs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3972006%2F877b3d2d-e0f8-4b13-b30b-92f8292d6496.jpeg</url>
      <title>DEV Community: Marlon Martin</title>
      <link>https://dev.to/decimozs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/decimozs"/>
    <language>en</language>
    <item>
      <title>I Broke a Chatbot With a Prompt Change. Then I Built the Tool That Would've Caught It.</title>
      <dc:creator>Marlon Martin</dc:creator>
      <pubDate>Sun, 07 Jun 2026 03:32:07 +0000</pubDate>
      <link>https://dev.to/decimozs/i-broke-a-chatbot-with-a-prompt-change-then-i-built-the-tool-that-wouldve-caught-it-m1g</link>
      <guid>https://dev.to/decimozs/i-broke-a-chatbot-with-a-prompt-change-then-i-built-the-tool-that-wouldve-caught-it-m1g</guid>
      <description>&lt;p&gt;I updated a system prompt on a Friday. By Monday, a user filed a bug: the chatbot was giving wrong answers.&lt;/p&gt;

&lt;p&gt;The output looked totally fine. Valid format. Natural language. No errors in the logs. Just... wrong.&lt;/p&gt;

&lt;p&gt;That's the thing about LLM regressions — they're completely silent.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem with testing LLMs
&lt;/h3&gt;

&lt;p&gt;Traditional software tests don't catch this. Unit tests mock the model. Integration tests verify the request went through. Neither catches that your prompt change made the model start hallucinating, or quietly drop a required field.&lt;/p&gt;

&lt;p&gt;I looked at what existed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Promptfoo&lt;/strong&gt; — gets it, but regression is manual diffs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepEval&lt;/strong&gt; — Python-only, useless if you're not on that stack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith / Braintrust&lt;/strong&gt; — cloud platforms starting at $249/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAGAS&lt;/strong&gt; — RAG-specific, no baseline comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted something that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compares every run against a baseline &lt;strong&gt;automatically&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Runs in CI and returns a meaningful exit code&lt;/li&gt;
&lt;li&gt;Doesn't require Python, Node, or Docker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I built it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Introducing Regtrace
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/decimozs/regtrace" rel="noopener noreferrer"&gt;&lt;strong&gt;Regtrace&lt;/strong&gt;&lt;/a&gt; is an open-source CLI for LLM quality gates. Standalone binary — drop it in and go.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fL&lt;/span&gt; https://github.com/decimozs/regtrace/releases/latest/download/regtrace-linux-x64 &lt;span class="nt"&gt;-o&lt;/span&gt; regtrace
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x regtrace
&lt;span class="nb"&gt;sudo mv &lt;/span&gt;regtrace /usr/local/bin/
regtrace init
regtrace run &lt;span class="nt"&gt;--generate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deterministic checks (format, JSON schema, regex, length) work with zero API keys. LLM-judged metrics need a provider key via &lt;code&gt;.env&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Four metric pillars
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pillar&lt;/th&gt;
&lt;th&gt;What it checks&lt;/th&gt;
&lt;th&gt;How&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Factuality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Accuracy against expected output&lt;/td&gt;
&lt;td&gt;Heuristic overlap or LLM-as-judge, auto-detects JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structure compliance&lt;/td&gt;
&lt;td&gt;JSON validity, schema, required fields, regex, forbidden content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Style consistency&lt;/td&gt;
&lt;td&gt;Formality, sentiment, assertiveness, persona, verbosity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Regression&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Drift over time&lt;/td&gt;
&lt;td&gt;Every run vs baseline, per-metric tolerance, stale alerts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The regression pillar is the one most tools skip entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why delta-gating beats threshold-gating
&lt;/h3&gt;

&lt;p&gt;Most eval frameworks gate on absolute thresholds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pass_rate &amp;gt;= 0.85  ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That seems fine — until your model improves, every test is passing at 0.97, and then a regression to 0.88 slips right through because it's still above the threshold.&lt;/p&gt;

&lt;p&gt;Regtrace gates on &lt;strong&gt;delta vs baseline&lt;/strong&gt;. Pass rates can go up. They should never drop by more than the tolerance you set for that metric.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;regression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;metric_tolerances&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;        &lt;span class="c1"&gt;# zero tolerance for format drift&lt;/span&gt;
      &lt;span class="na"&gt;factuality&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.1&lt;/span&gt;  &lt;span class="c1"&gt;# 10% variance allowed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
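
&lt;p&gt;To make the difference concrete, here is a minimal Python sketch of the two gating styles. It is not Regtrace's implementation: the function names are hypothetical, and it assumes a tolerance is an absolute drop in score, which may not be how &lt;code&gt;metric_tolerances&lt;/code&gt; is interpreted.&lt;/p&gt;

```python
def threshold_gate(score, threshold=0.85):
    """Absolute gate: pass whenever the score clears a fixed bar."""
    return score >= threshold


def delta_gate(score, baseline, tolerance=0.0):
    """Delta gate: fail once the score falls more than `tolerance` below baseline.

    Assumption: tolerance is an absolute drop in score, which may not match
    how Regtrace itself interprets metric_tolerances.
    """
    drop = baseline - score  # positive when quality went down
    # The tiny epsilon keeps floating-point noise from failing an exact tie.
    return not (drop > tolerance + 1e-9)


baseline, current = 0.97, 0.88
print(threshold_gate(current))             # True: 0.88 still clears 0.85
print(delta_gate(current, baseline))       # False: a 0.09 drop with zero tolerance
print(delta_gate(current, baseline, 0.1))  # True: the drop fits inside 0.1
```

&lt;p&gt;The same 0.97 to 0.88 drop passes the absolute gate and fails the zero-tolerance delta gate, which is the case described above.&lt;/p&gt;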



&lt;h3&gt;
  
  
  CI in one YAML file
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM Quality Gate&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;evaluate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Download regtrace&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fL https://github.com/decimozs/regtrace/releases/latest/download/regtrace-linux-x64 -o /usr/local/bin/regtrace&lt;/span&gt;
          &lt;span class="s"&gt;chmod +x /usr/local/bin/regtrace&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Evaluate&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;regtrace run --format json --output results.json&lt;/span&gt;
        &lt;span class="c1"&gt;# Exit codes: 0 = pass, 1 = gate failure, 2 = config error&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four quality gates are AND-composed: suite score, max failures, regression status, and NFR (non-functional requirements). All must pass.&lt;/p&gt;
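
&lt;p&gt;As a rough illustration of AND-composition, the Python sketch below maps four gate results to the exit codes listed in the workflow comment. The gate keys and the &lt;code&gt;exit_code&lt;/code&gt; helper are hypothetical names chosen for this example, not part of Regtrace.&lt;/p&gt;

```python
# Exit codes as listed in the workflow comment above.
EXIT_PASS, EXIT_GATE_FAILURE, EXIT_CONFIG_ERROR = 0, 1, 2


def exit_code(gates, config_ok=True):
    """Return the process exit code for a dict of gate name -> passed flag."""
    if not config_ok:
        return EXIT_CONFIG_ERROR
    # AND-composition: a single failing gate fails the whole run.
    return EXIT_PASS if all(gates.values()) else EXIT_GATE_FAILURE


gates = {
    "suite_score": True,
    "max_failures": True,
    "regression": False,  # one regression is enough to block the merge
    "nfr": True,
}
print(exit_code(gates))
```

&lt;p&gt;Because CI treats any non-zero exit code as a failed step, the pull request is blocked as soon as one gate fails.&lt;/p&gt;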

&lt;h3&gt;
  
  
  NFR enforcement (the gates most tools ignore)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;nfr_gates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;max_latency_ms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
  &lt;span class="na"&gt;max_cost_usd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.00&lt;/span&gt;
  &lt;span class="na"&gt;min_coverage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These gates set thresholds for latency, cost, and test coverage. A failed NFR blocks the suite just like a regression would.&lt;/p&gt;
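
&lt;p&gt;A minimal Python sketch of how limits like these could be checked against a run. The &lt;code&gt;measured&lt;/code&gt; keys and the &lt;code&gt;failed_nfr_gates&lt;/code&gt; helper are assumptions made for illustration. Whether Regtrace measures latency and cost per test case or per suite is not stated here, so the sketch simply compares one number per limit.&lt;/p&gt;

```python
def failed_nfr_gates(limits, measured):
    """Return the names of the NFR limits that the measured run violates."""
    checks = {
        # Upper bounds: the limit must be at least the measured value.
        "max_latency_ms": limits["max_latency_ms"] >= measured["latency_ms"],
        "max_cost_usd": limits["max_cost_usd"] >= measured["cost_usd"],
        # Lower bound: coverage must reach the minimum.
        "min_coverage": measured["coverage"] >= limits["min_coverage"],
    }
    return [name for name, ok in checks.items() if not ok]


limits = {"max_latency_ms": 5000, "max_cost_usd": 1.00, "min_coverage": 80}
measured = {"latency_ms": 6200, "cost_usd": 0.42, "coverage": 85}  # made-up run
print(failed_nfr_gates(limits, measured))  # ['max_latency_ms']
```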

&lt;h3&gt;
  
  
  Try it in 2 minutes — no API key needed
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;regtrace init
&lt;span class="c"&gt;# Edit golden-sets/qa.yaml — fill in actual_output values&lt;/span&gt;
regtrace run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Format checks, word overlap, and JSON validation all run locally. Only &lt;code&gt;factuality (deep)&lt;/code&gt; and tone require a provider key.&lt;/p&gt;
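
&lt;p&gt;For a sense of what a local word-overlap check can look like, here is a small Python sketch. It is one common heuristic (the share of expected words that also appear in the actual output), not necessarily the one Regtrace uses, and &lt;code&gt;word_overlap&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
import re


def word_overlap(expected, actual):
    """Fraction of distinct expected words that also appear in the actual output."""
    expected_words = set(re.findall(r"[a-z0-9]+", expected.lower()))
    actual_words = set(re.findall(r"[a-z0-9]+", actual.lower()))
    if not expected_words:
        # Nothing was expected: only an empty output counts as a match.
        return 1.0 if not actual_words else 0.0
    return len(expected_words.intersection(actual_words)) / len(expected_words)


print(word_overlap("Paris is the capital of France",
                   "The capital of France is Paris."))                 # 1.0
print(word_overlap("refund within 30 days", "refund within 14 days"))  # 0.75
```

&lt;p&gt;The second example shows the limit of the heuristic: one changed number makes the answer wrong, yet three of four words still match. That gap is why a deeper, LLM-judged factuality check exists alongside it.&lt;/p&gt;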

&lt;h3&gt;
  
  
  How it compares
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Interface&lt;/th&gt;
&lt;th&gt;Regression detection&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Promptfoo&lt;/td&gt;
&lt;td&gt;CLI + Web UI&lt;/td&gt;
&lt;td&gt;Manual diff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepEval&lt;/td&gt;
&lt;td&gt;Library&lt;/td&gt;
&lt;td&gt;Pytest plugin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangSmith&lt;/td&gt;
&lt;td&gt;Platform&lt;/td&gt;
&lt;td&gt;Platform-level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Braintrust&lt;/td&gt;
&lt;td&gt;Platform&lt;/td&gt;
&lt;td&gt;Experiment tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAGAS&lt;/td&gt;
&lt;td&gt;Library&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Regtrace&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CLI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Automatic, per-metric, CI-native&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Full roadmap on &lt;a href="https://github.com/decimozs/regtrace" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The weekend project I wish I'd had before that Friday deploy. It's currently in beta, and I'd love feedback from the community, especially from anyone who's fought silent LLM regressions before. Every suggestion helps improve it.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>devops</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
