<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: shaun vd</title>
    <description>The latest articles on DEV Community by shaun vd (@shaun_vd_7562913ba77e1e0b).</description>
    <link>https://dev.to/shaun_vd_7562913ba77e1e0b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3924736%2F3ac00581-3e98-4c3d-81ab-42c5033026cb.jpg</url>
      <title>DEV Community: shaun vd</title>
      <link>https://dev.to/shaun_vd_7562913ba77e1e0b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shaun_vd_7562913ba77e1e0b"/>
    <language>en</language>
    <item>
      <title>Prompt regression testing in CI: a 5-minute setup</title>
      <dc:creator>shaun vd</dc:creator>
      <pubDate>Mon, 11 May 2026 10:33:40 +0000</pubDate>
      <link>https://dev.to/shaun_vd_7562913ba77e1e0b/prompt-regression-testing-in-ci-a-5-minute-setup-4g03</link>
      <guid>https://dev.to/shaun_vd_7562913ba77e1e0b/prompt-regression-testing-in-ci-a-5-minute-setup-4g03</guid>
      <description>&lt;p&gt;Your code has tests. Your code has a CI pipeline. A bad change can't merge&lt;br&gt;
without going green.&lt;/p&gt;

&lt;p&gt;Your prompts? Vibes. A teammate edits the system prompt to fix one customer&lt;br&gt;
complaint, output quality drops 8% on the other 99% of cases, nobody&lt;br&gt;
notices for a month, and the regression eventually surfaces as a&lt;br&gt;
mysterious churn bump in the metrics deck.&lt;/p&gt;

&lt;p&gt;This post is the 5-minute setup that closes that gap.&lt;/p&gt;
&lt;h2&gt;
  
  
  What "tests for prompts" actually means
&lt;/h2&gt;

&lt;p&gt;There are two viable approaches and you need to know which to use when.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assertion-based.&lt;/strong&gt; You write code that checks the LLM output against&lt;br&gt;
fixed rules: regex matches, JSON shape validation, field-presence checks,&lt;br&gt;
length bounds. Fast, cheap, deterministic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it when:&lt;/strong&gt; the output is structured and the contract is rigid. JSON&lt;br&gt;
extraction, classification, function-call payloads, schema-conformant&lt;br&gt;
generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM-judge.&lt;/strong&gt; Another LLM compares the candidate output to a baseline and&lt;br&gt;
returns "regressed: yes/no" with a severity score. Slower, costs a few&lt;br&gt;
cents per comparison, handles fuzzy outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it when:&lt;/strong&gt; the output is freeform — summaries, rewrites, creative&lt;br&gt;
generation, anything where two correct answers can look very different.&lt;/p&gt;

&lt;p&gt;A mature setup uses both. PromptFork ships the LLM-judge built in (we&lt;br&gt;
chose Claude Haiku at temp 0 with a strict "only flag strictly worse"&lt;br&gt;
rubric); assertions are easy to add yourself in custom test cases.&lt;/p&gt;
&lt;h2&gt;
  
  
  The 5-minute setup
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Pin your prompts in version control
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompts/
  summarize_ticket.txt
  extract_email.txt
  classify_intent.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Plain text files. Not constants in &lt;code&gt;prompts.py&lt;/code&gt;. Not Notion docs. Files&lt;br&gt;
with a git history.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Push them to PromptFork
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;promptfork
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PROMPTFORK_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pf_xxxx

&lt;span class="k"&gt;for &lt;/span&gt;f &lt;span class="k"&gt;in &lt;/span&gt;prompts/&lt;span class="k"&gt;*&lt;/span&gt;.txt&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; .txt&lt;span class="si"&gt;)&lt;/span&gt;
  promptfork push &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$name&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--file&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--message&lt;/span&gt; &lt;span class="s2"&gt;"initial commit"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This creates v1 of each prompt server-side and gives you a stable identifier.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Add test cases
&lt;/h3&gt;

&lt;p&gt;For each prompt, pin 5-30 representative inputs. Real production inputs are&lt;br&gt;
worth 10x synthetic ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;promptfork add-test summarize_ticket happy_path &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; &lt;span class="nv"&gt;ticket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Order arrived. Loved it."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rubric&lt;/span&gt; &lt;span class="s2"&gt;"summary should be positive and under 20 words"&lt;/span&gt;

promptfork add-test summarize_ticket angry_refund &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; &lt;span class="nv"&gt;ticket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"3 weeks late, want money back NOW"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rubric&lt;/span&gt; &lt;span class="s2"&gt;"must mention refund and high urgency"&lt;/span&gt;

promptfork add-test summarize_ticket edge_garbled &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; &lt;span class="nv"&gt;ticket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"hi pls help thx"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rubric&lt;/span&gt; &lt;span class="s2"&gt;"summary should request more info, not invent details"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three test cases is a starting point. Six is a good baseline. Thirty is&lt;br&gt;
production-grade.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Wire the GitHub Action
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/prompt-tests.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prompt Regression Tests&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompts/**'&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Push current prompts&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;PROMPTFORK_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROMPTFORK_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;pip install promptfork&lt;/span&gt;
          &lt;span class="s"&gt;for f in prompts/*.txt; do&lt;/span&gt;
            &lt;span class="s"&gt;name=$(basename "$f" .txt)&lt;/span&gt;
            &lt;span class="s"&gt;promptfork push "$name" --file "$f" \&lt;/span&gt;
              &lt;span class="s"&gt;--message "PR #${{ github.event.pull_request.number }}"&lt;/span&gt;
          &lt;span class="s"&gt;done&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shaunvand/promptfork-cli@v0&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;summarize_ticket&lt;/span&gt;
          &lt;span class="na"&gt;baseline&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
          &lt;span class="na"&gt;api-key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROMPTFORK_API_KEY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the secret at &lt;code&gt;Settings → Secrets → PROMPTFORK_API_KEY&lt;/code&gt;. Done.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Open a PR that changes a prompt
&lt;/h3&gt;

&lt;p&gt;The action runs, executes your prompt across Claude/GPT/Gemini, has the&lt;br&gt;
LLM-judge compare each output against your baseline version, and posts a&lt;br&gt;
PR comment with the regression report. If anything regresses, the action&lt;br&gt;
exits non-zero, branch protection blocks the merge, the change goes back&lt;br&gt;
for review.&lt;/p&gt;

&lt;p&gt;You now have a CI gate for prompts. The same gate you have for code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What goes in the test suite
&lt;/h2&gt;

&lt;p&gt;After running this on a few projects, the pattern that works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One happy-path case.&lt;/strong&gt; "Normal" input, expected output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One edge case.&lt;/strong&gt; Empty input, very long input, input in another
language, malformed structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One adversarial case.&lt;/strong&gt; Prompt-injection attempt, contradictory
instructions, a customer trying to break the bot.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's 3 per prompt. If a prompt is mission-critical, scale to 10-30.&lt;/p&gt;

&lt;h2&gt;
  
  
  What goes wrong if you don't do this
&lt;/h2&gt;

&lt;p&gt;We've seen this play out enough times to predict it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;New model drops. Team migrates. "Looks fine in playground." Ships.&lt;/li&gt;
&lt;li&gt;Quality degrades 5-15% on a subset of inputs. No alert fires.&lt;/li&gt;
&lt;li&gt;Customer support volume creeps up. Nobody connects the dots.&lt;/li&gt;
&lt;li&gt;Three weeks later, churn ticks up. "Why?"&lt;/li&gt;
&lt;li&gt;Eventually somebody runs an A/B back-test and finds the regression.&lt;/li&gt;
&lt;li&gt;Rollback. Apology emails. Deck slide titled "Lessons Learned."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole loop is six commands and an afternoon.&lt;/p&gt;

&lt;p&gt;PromptFork has a free tier (3 prompts, 50 runs/mo) that's enough for the&lt;br&gt;
setup above on a small project. &lt;a href="https://promptfork.online/diff" rel="noopener noreferrer"&gt;https://promptfork.online/diff&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>cicd</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
