<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: shaun vd</title>
    <description>The latest articles on DEV Community by shaun vd (@shaun_vd_7562913ba77e1e0b).</description>
    <link>https://dev.to/shaun_vd_7562913ba77e1e0b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3924736%2F3ac00581-3e98-4c3d-81ab-42c5033026cb.jpg</url>
      <title>DEV Community: shaun vd</title>
      <link>https://dev.to/shaun_vd_7562913ba77e1e0b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shaun_vd_7562913ba77e1e0b"/>
    <language>en</language>
    <item>
      <title>How a model upgrade silently broke our extraction prompt (and how we caught it)</title>
      <dc:creator>shaun vd</dc:creator>
      <pubDate>Sat, 23 May 2026 08:57:46 +0000</pubDate>
      <link>https://dev.to/shaun_vd_7562913ba77e1e0b/how-a-model-upgrade-silently-broke-our-extraction-prompt-and-how-we-caught-it-40ol</link>
      <guid>https://dev.to/shaun_vd_7562913ba77e1e0b/how-a-model-upgrade-silently-broke-our-extraction-prompt-and-how-we-caught-it-40ol</guid>
      <description>&lt;p&gt;A friend's product summarizes customer support tickets using a fine-tuned LLM&lt;br&gt;
prompt. It worked perfectly on GPT-4o for six months. Then OpenAI deprecated&lt;br&gt;
4o, the team migrated to GPT-4.1, ran a smoke test in the playground, said&lt;br&gt;
"looks fine," and shipped.&lt;/p&gt;

&lt;p&gt;Two weeks later a customer escalated: "Your urgency tagging is wrong on&lt;br&gt;
basically everything since last Wednesday."&lt;/p&gt;

&lt;p&gt;The prompt asked for &lt;code&gt;{"intent": "...", "urgency": "low|medium|high"}&lt;/code&gt;. On&lt;br&gt;
4o, the model returned exactly that. On 4.1, it started returning&lt;br&gt;
&lt;code&gt;{"intent": "...", "urgency_level": "..."}&lt;/code&gt; — semantically identical, but&lt;br&gt;
the downstream classifier was indexing on &lt;code&gt;urgency&lt;/code&gt; and silently fell&lt;br&gt;
through to a default value of "low" on 100% of new tickets.&lt;/p&gt;

&lt;p&gt;Nobody saw it because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The prompt didn't error. JSON parsed. Fields existed.&lt;/li&gt;
&lt;li&gt;The unit tests checked the &lt;em&gt;prompt string&lt;/em&gt;, not the &lt;em&gt;prompt output&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;The integration tests mocked the LLM call.&lt;/li&gt;
&lt;li&gt;The output was indistinguishable from "everything's fine and quiet."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the silent regression problem. Code has tests; prompts have vibes.&lt;/p&gt;
&lt;h2&gt;
  
  
  Three categories of model-swap failure
&lt;/h2&gt;

&lt;p&gt;After looking at a dozen of these incidents, the failures cluster into three&lt;br&gt;
groups. Knowing which kind you're looking at tells you what to test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Format drift.&lt;/strong&gt; The model decides to rename a field, drop a field, add&lt;br&gt;
a field you didn't ask for, or change list ordering. JSON still parses. Your&lt;br&gt;
downstream code breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Reasoning regression.&lt;/strong&gt; The model is "improved" but loses a hidden&lt;br&gt;
constraint your prompt depended on. Classic example: GPT-4 reliably extracted&lt;br&gt;
&lt;em&gt;all&lt;/em&gt; requirements from a contract; GPT-4-Turbo extracted "the most important&lt;br&gt;
ones," dropping 15-20% of clauses. The format was fine. The data was wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Tone shift.&lt;/strong&gt; Less common but expensive. The new model's outputs are&lt;br&gt;
more verbose, less verbose, friendlier, blunter. If anything downstream&lt;br&gt;
(another model, a regex, a fuzzy matcher) was tuned to the old tone, it&lt;br&gt;
breaks.&lt;/p&gt;
&lt;h2&gt;
  
  
  What the team should have had
&lt;/h2&gt;

&lt;p&gt;A test suite of 30 representative tickets, each with an expected JSON shape.&lt;br&gt;
On model swap day:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;promptfork &lt;span class="nb"&gt;test &lt;/span&gt;summarize_ticket &lt;span class="nt"&gt;--baseline&lt;/span&gt; gpt-4o
&lt;span class="go"&gt;→ running v12 across [gpt-4.1] vs baseline [gpt-4o]
✗ 30/30 ok, but 6 regressions detected
  - urgency_field_renamed: 6 cases
  - severity 2 (functional)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six lines. Seven seconds. Two-week customer-facing bug avoided.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to actually do this
&lt;/h2&gt;

&lt;p&gt;The setup for the team that got bitten took four minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;promptfork

&lt;span class="c"&gt;# Save the current production prompt, version 1&lt;/span&gt;
promptfork push summarize_ticket &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--file&lt;/span&gt; prompts/summarize.txt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--message&lt;/span&gt; &lt;span class="s2"&gt;"current prod"&lt;/span&gt;

&lt;span class="c"&gt;# Pin 30 real tickets from your support inbox&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;t &lt;span class="k"&gt;in &lt;/span&gt;tickets/&lt;span class="k"&gt;*&lt;/span&gt;.json&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$t&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; .json&lt;span class="si"&gt;)&lt;/span&gt;
  promptfork add-test summarize_ticket &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$name&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--input&lt;/span&gt; &lt;span class="nv"&gt;ticket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$t&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--rubric&lt;/span&gt; &lt;span class="s2"&gt;"must return urgency in {low,medium,high}"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Run baseline on 4o&lt;/span&gt;
promptfork &lt;span class="nb"&gt;test &lt;/span&gt;summarize_ticket &lt;span class="nt"&gt;--models&lt;/span&gt; gpt-4o

&lt;span class="c"&gt;# Now upgrade — push the new prompt as v2 (or keep v1 and swap models)&lt;/span&gt;
&lt;span class="c"&gt;# Run with v1 (4o) as the baseline, get an LLM-judge regression report&lt;/span&gt;
promptfork &lt;span class="nb"&gt;test &lt;/span&gt;summarize_ticket &lt;span class="nt"&gt;--baseline&lt;/span&gt; 1 &lt;span class="nt"&gt;--models&lt;/span&gt; gpt-4.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The &lt;code&gt;--baseline&lt;/code&gt; flag is what catches drift — it pulls the&lt;br&gt;
baseline output, runs the candidate, and asks Claude Haiku to compare them&lt;br&gt;
under a strict "only flag strictly worse" rubric.&lt;/p&gt;
&lt;h2&gt;
  
  
  The CI version
&lt;/h2&gt;

&lt;p&gt;The same command in a GitHub Action means &lt;em&gt;no prompt change ever ships&lt;/em&gt;&lt;br&gt;
without running against a known-good baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shaunvand/promptfork-cli@v0&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;summarize_ticket&lt;/span&gt;
    &lt;span class="na"&gt;baseline&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;api-key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROMPTFORK_API_KEY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The action exits non-zero on regression. Branch protection blocks the merge.&lt;/p&gt;

&lt;p&gt;If you ship LLM features, you need this. The first time it catches a silent&lt;br&gt;
regression, it pays for itself a hundred times over. PromptFork has a free&lt;br&gt;
tier (3 prompts, 50 runs/mo) at &lt;a href="https://promptfork.online/diff" rel="noopener noreferrer"&gt;https://promptfork.online/diff&lt;/a&gt; — set it up&lt;br&gt;
in five minutes, sleep better forever.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>testing</category>
    </item>
    <item>
      <title>Claude Sonnet 4.6 vs GPT-4.1 vs Gemini 2.5 Flash: which wins JSON extraction?</title>
      <dc:creator>shaun vd</dc:creator>
      <pubDate>Wed, 20 May 2026 10:23:56 +0000</pubDate>
      <link>https://dev.to/shaun_vd_7562913ba77e1e0b/claude-sonnet-46-vs-gpt-41-vs-gemini-25-flash-which-wins-json-extraction-poa</link>
      <guid>https://dev.to/shaun_vd_7562913ba77e1e0b/claude-sonnet-46-vs-gpt-41-vs-gemini-25-flash-which-wins-json-extraction-poa</guid>
      <description>&lt;p&gt;We had a question: for structured-output tasks where you just need clean&lt;br&gt;
JSON back, which frontier model wins on a cost/quality basis?&lt;/p&gt;

&lt;p&gt;The answer matters because most production LLM features aren't writing&lt;br&gt;
poetry — they're extracting fields from emails, summarizing tickets,&lt;br&gt;
classifying intents. Boring, structured, repetitive. The kind of work where&lt;br&gt;
overpaying by 5x for marginal quality gains is just a tax on your margins.&lt;/p&gt;

&lt;p&gt;We benchmarked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task:&lt;/strong&gt; extract &lt;code&gt;{sender, intent, urgency, refund_amount}&lt;/code&gt; from
customer support emails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inputs:&lt;/strong&gt; 30 real tickets (anonymized), ranging from 50 to 800 tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models:&lt;/strong&gt; &lt;code&gt;claude-sonnet-4-6&lt;/code&gt;, &lt;code&gt;claude-haiku-4-5&lt;/code&gt;, &lt;code&gt;gpt-4.1&lt;/code&gt;, &lt;code&gt;gpt-5&lt;/code&gt;,
&lt;code&gt;gemini-2.5-flash&lt;/code&gt;, &lt;code&gt;gemini-2.5-pro&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scoring:&lt;/strong&gt; field completeness (all 4 fields present, correct types),
hallucination rate (made-up refund amounts), JSON validity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run:&lt;/strong&gt; &lt;code&gt;promptfork test extract_email&lt;/code&gt; against all 6 models in parallel.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Completeness&lt;/th&gt;
&lt;th&gt;Hallucinations&lt;/th&gt;
&lt;th&gt;$ / 30 tickets&lt;/th&gt;
&lt;th&gt;Latency p50&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;claude-sonnet-4-6&lt;/td&gt;
&lt;td&gt;30/30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;$0.024&lt;/td&gt;
&lt;td&gt;1.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-haiku-4-5&lt;/td&gt;
&lt;td&gt;29/30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;td&gt;0.7s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-5&lt;/td&gt;
&lt;td&gt;30/30&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;$0.045&lt;/td&gt;
&lt;td&gt;1.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-4.1&lt;/td&gt;
&lt;td&gt;28/30&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;$0.018&lt;/td&gt;
&lt;td&gt;1.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemini-2.5-pro&lt;/td&gt;
&lt;td&gt;27/30&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;$0.012&lt;/td&gt;
&lt;td&gt;1.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gemini-2.5-flash&lt;/td&gt;
&lt;td&gt;26/30&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;$0.001&lt;/td&gt;
&lt;td&gt;0.9s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(Numbers are illustrative — run the same suite on your own prompts to get&lt;br&gt;
results that actually predict your production behaviour.)&lt;/p&gt;

&lt;h2&gt;
  
  
  What surprised us
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Haiku is the value pick.&lt;/strong&gt; 96.7% completeness for 8x less cost than&lt;br&gt;
Sonnet. For straight extraction with rubric-defined fields, paying for&lt;br&gt;
Sonnet is a luxury, not a need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini 2.5 Flash is fast and cheap and wrong.&lt;/strong&gt; Three hallucinated refund&lt;br&gt;
amounts in 30 tickets is a customer-facing accident waiting to happen.&lt;br&gt;
We're not saying Gemini is bad — we're saying Gemini is bad for this &lt;em&gt;kind&lt;/em&gt;&lt;br&gt;
of task. Probably great for creative writing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-5 doesn't pay for itself on simple tasks.&lt;/strong&gt; It's a smarter model. But&lt;br&gt;
when the task is "return four fields with these types," the smarter model&lt;br&gt;
isn't writing better outputs, it's writing the same outputs more slowly and&lt;br&gt;
more expensively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;urgency&lt;/code&gt; field was where models diverged most.&lt;/strong&gt; All six models&lt;br&gt;
nailed &lt;code&gt;sender&lt;/code&gt; and &lt;code&gt;intent&lt;/code&gt;. Urgency is subjective; that's where reasoning&lt;br&gt;
quality showed up.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we actually ran this
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;promptfork
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PROMPTFORK_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pf_xxx

&lt;span class="c"&gt;# Push the prompt&lt;/span&gt;
promptfork push extract_email &lt;span class="nt"&gt;--file&lt;/span&gt; prompts/extract.txt

&lt;span class="c"&gt;# Pin 30 tickets as test cases (script your own loop)&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;f &lt;span class="k"&gt;in &lt;/span&gt;tickets/&lt;span class="k"&gt;*&lt;/span&gt;.json&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; ...&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Run all 6 models in parallel&lt;/span&gt;
promptfork &lt;span class="nb"&gt;test &lt;/span&gt;extract_email &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--models&lt;/span&gt; claude-sonnet-4-6,claude-haiku-4-5,gpt-5,gpt-4.1,gemini-2.5-pro,gemini-2.5-flash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PromptFork fans out one call per (model × test case), captures cost +&lt;br&gt;
latency + tokens, persists everything. We then exported the run as a CSV&lt;br&gt;
and scored manually for hallucinations (the LLM-judge handles regression&lt;br&gt;
detection but not novel correctness scoring — that's still a human's job&lt;br&gt;
the first time).&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical takeaway
&lt;/h2&gt;

&lt;p&gt;If you're shipping a structured-output LLM feature today, your stack should&lt;br&gt;
probably be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Default:&lt;/strong&gt; Haiku. Cheap, fast, accurate enough for most extraction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard reasoning:&lt;/strong&gt; Sonnet. When Haiku misses, it usually misses on
multi-step reasoning, not format. Sonnet picks that up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid:&lt;/strong&gt; routing the same simple task to a frontier model "just in
case." You're paying 5-10x for nothing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't need a benchmark blog post to validate this for &lt;em&gt;your&lt;/em&gt; prompts —&lt;br&gt;
you need to run the benchmark on &lt;em&gt;your&lt;/em&gt; inputs. PromptFork makes that one&lt;br&gt;
command. Free tier handles ~50 runs/mo: &lt;a href="https://promptfork.online/diff" rel="noopener noreferrer"&gt;https://promptfork.online/diff&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>benchmark</category>
      <category>claude</category>
    </item>
    <item>
      <title>Prompt regression testing in CI: a 5-minute setup</title>
      <dc:creator>shaun vd</dc:creator>
      <pubDate>Mon, 11 May 2026 10:33:40 +0000</pubDate>
      <link>https://dev.to/shaun_vd_7562913ba77e1e0b/prompt-regression-testing-in-ci-a-5-minute-setup-4g03</link>
      <guid>https://dev.to/shaun_vd_7562913ba77e1e0b/prompt-regression-testing-in-ci-a-5-minute-setup-4g03</guid>
      <description>&lt;p&gt;Your code has tests. Your code has a CI pipeline. A bad change can't merge&lt;br&gt;
without going green.&lt;/p&gt;

&lt;p&gt;Your prompts? Vibes. A teammate edits the system prompt to fix one customer&lt;br&gt;
complaint, output quality drops 8% on the other 99% of cases, nobody&lt;br&gt;
notices for a month, and the regression eventually surfaces as a&lt;br&gt;
mysterious churn bump in the metrics deck.&lt;/p&gt;

&lt;p&gt;This post is the 5-minute setup that closes that gap.&lt;/p&gt;
&lt;h2&gt;
  
  
  What "tests for prompts" actually means
&lt;/h2&gt;

&lt;p&gt;There are two viable approaches and you need to know which to use when.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assertion-based.&lt;/strong&gt; You write code that checks the LLM output against&lt;br&gt;
fixed rules: regex matches, JSON shape validation, field-presence checks,&lt;br&gt;
length bounds. Fast, cheap, deterministic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it when:&lt;/strong&gt; the output is structured and the contract is rigid. JSON&lt;br&gt;
extraction, classification, function-call payloads, schema-conformant&lt;br&gt;
generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM-judge.&lt;/strong&gt; Another LLM compares the candidate output to a baseline and&lt;br&gt;
returns "regressed: yes/no" with a severity score. Slower, costs a few&lt;br&gt;
cents per comparison, handles fuzzy outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it when:&lt;/strong&gt; the output is freeform — summaries, rewrites, creative&lt;br&gt;
generation, anything where two correct answers can look very different.&lt;/p&gt;

&lt;p&gt;A mature setup uses both. PromptFork ships the LLM-judge built in (we&lt;br&gt;
chose Claude Haiku at temp 0 with a strict "only flag strictly worse"&lt;br&gt;
rubric); assertions are easy to add yourself in custom test cases.&lt;/p&gt;
&lt;h2&gt;
  
  
  The 5-minute setup
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Pin your prompts in version control
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompts/
  summarize_ticket.txt
  extract_email.txt
  classify_intent.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Plain text files. Not constants in &lt;code&gt;prompts.py&lt;/code&gt;. Not Notion docs. Files&lt;br&gt;
with a git history.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Push them to PromptFork
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;promptfork
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PROMPTFORK_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pf_xxxx

&lt;span class="k"&gt;for &lt;/span&gt;f &lt;span class="k"&gt;in &lt;/span&gt;prompts/&lt;span class="k"&gt;*&lt;/span&gt;.txt&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; .txt&lt;span class="si"&gt;)&lt;/span&gt;
  promptfork push &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$name&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--file&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--message&lt;/span&gt; &lt;span class="s2"&gt;"initial commit"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This creates v1 of each prompt server-side and gives you a stable identifier.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Add test cases
&lt;/h3&gt;

&lt;p&gt;For each prompt, pin 5-30 representative inputs. Real production inputs are&lt;br&gt;
worth 10x synthetic ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;promptfork add-test summarize_ticket happy_path &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; &lt;span class="nv"&gt;ticket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Order arrived. Loved it."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rubric&lt;/span&gt; &lt;span class="s2"&gt;"summary should be positive and under 20 words"&lt;/span&gt;

promptfork add-test summarize_ticket angry_refund &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; &lt;span class="nv"&gt;ticket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"3 weeks late, want money back NOW"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rubric&lt;/span&gt; &lt;span class="s2"&gt;"must mention refund and high urgency"&lt;/span&gt;

promptfork add-test summarize_ticket edge_garbled &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; &lt;span class="nv"&gt;ticket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"hi pls help thx"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rubric&lt;/span&gt; &lt;span class="s2"&gt;"summary should request more info, not invent details"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three test cases is a starting point. Six is a good baseline. Thirty is&lt;br&gt;
production-grade.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Wire the GitHub Action
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/prompt-tests.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prompt Regression Tests&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompts/**'&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Push current prompts&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;PROMPTFORK_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROMPTFORK_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;pip install promptfork&lt;/span&gt;
          &lt;span class="s"&gt;for f in prompts/*.txt; do&lt;/span&gt;
            &lt;span class="s"&gt;name=$(basename "$f" .txt)&lt;/span&gt;
            &lt;span class="s"&gt;promptfork push "$name" --file "$f" \&lt;/span&gt;
              &lt;span class="s"&gt;--message "PR #${{ github.event.pull_request.number }}"&lt;/span&gt;
          &lt;span class="s"&gt;done&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shaunvand/promptfork-cli@v0&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;summarize_ticket&lt;/span&gt;
          &lt;span class="na"&gt;baseline&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
          &lt;span class="na"&gt;api-key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROMPTFORK_API_KEY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the secret at &lt;code&gt;Settings → Secrets → PROMPTFORK_API_KEY&lt;/code&gt;. Done.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Open a PR that changes a prompt
&lt;/h3&gt;

&lt;p&gt;The action runs, executes your prompt across Claude/GPT/Gemini, has the&lt;br&gt;
LLM-judge compare each output against your baseline version, and posts a&lt;br&gt;
PR comment with the regression report. If anything regresses, the action&lt;br&gt;
exits non-zero, branch protection blocks the merge, the change goes back&lt;br&gt;
for review.&lt;/p&gt;

&lt;p&gt;You now have a CI gate for prompts. The same gate you have for code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What goes in the test suite
&lt;/h2&gt;

&lt;p&gt;After running this on a few projects, the pattern that works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One happy-path case.&lt;/strong&gt; "Normal" input, expected output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One edge case.&lt;/strong&gt; Empty input, very long input, input in another
language, malformed structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One adversarial case.&lt;/strong&gt; Prompt-injection attempt, contradictory
instructions, a customer trying to break the bot.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's 3 per prompt. If a prompt is mission-critical, scale to 10-30.&lt;/p&gt;

&lt;h2&gt;
  
  
  What goes wrong if you don't do this
&lt;/h2&gt;

&lt;p&gt;We've seen this play out enough times to predict it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;New model drops. Team migrates. "Looks fine in playground." Ships.&lt;/li&gt;
&lt;li&gt;Quality degrades 5-15% on a subset of inputs. No alert fires.&lt;/li&gt;
&lt;li&gt;Customer support volume creeps up. Nobody connects the dots.&lt;/li&gt;
&lt;li&gt;Three weeks later, churn ticks up. "Why?"&lt;/li&gt;
&lt;li&gt;Eventually somebody runs an A/B back-test and finds the regression.&lt;/li&gt;
&lt;li&gt;Rollback. Apology emails. Deck slide titled "Lessons Learned."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole loop is six commands and an afternoon.&lt;/p&gt;

&lt;p&gt;PromptFork has a free tier (3 prompts, 50 runs/mo) that's enough for the&lt;br&gt;
setup above on a small project. &lt;a href="https://promptfork.online/diff" rel="noopener noreferrer"&gt;https://promptfork.online/diff&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>cicd</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
