DEV Community: shaun vd

How a model upgrade silently broke our extraction prompt (and how we caught it)

shaun vd — Sat, 23 May 2026 08:57:46 +0000

A friend's product summarizes customer support tickets using a hand-tuned LLM
prompt. It worked perfectly on GPT-4o for six months. Then OpenAI deprecated
4o, the team migrated to GPT-4.1, ran a smoke test in the playground, said
"looks fine," and shipped.

Two weeks later a customer escalated: "Your urgency tagging is wrong on
basically everything since last Wednesday."

The prompt asked for {"intent": "...", "urgency": "low|medium|high"}. On
4o, the model returned exactly that. On 4.1, it started returning
{"intent": "...", "urgency_level": "..."} — semantically identical, but
the downstream classifier was indexing on urgency and silently fell
through to a default value of "low" on 100% of new tickets.
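
A minimal sketch of that failure mode, assuming the downstream code read the field with a defaulting lookup. The team's real classifier isn't shown here, so read_urgency_lenient and read_urgency_strict are hypothetical names. The point is only that a default hides a renamed key, while a strict read turns it into an error.

```python
import json

ALLOWED_URGENCY = {"low", "medium", "high"}


def read_urgency_lenient(raw: str) -> str:
    """Hypothetical downstream read: a missing key silently becomes "low"."""
    return json.loads(raw).get("urgency", "low")


def read_urgency_strict(raw: str) -> str:
    """Same read, but a missing key or unexpected value raises instead."""
    data = json.loads(raw)
    if "urgency" not in data:
        raise KeyError(f"missing 'urgency'; got keys {sorted(data)}")
    if data["urgency"] not in ALLOWED_URGENCY:
        raise ValueError(f"unexpected urgency value: {data['urgency']!r}")
    return data["urgency"]


old_output = '{"intent": "refund", "urgency": "high"}'
new_output = '{"intent": "refund", "urgency_level": "high"}'  # renamed field

print(read_urgency_lenient(old_output))  # high
print(read_urgency_lenient(new_output))  # low
```

With the strict version, the renamed field raises a KeyError on the first ticket after the swap instead of quietly tagging it as low.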

Nobody saw it because:

The prompt didn't error. JSON parsed. Fields existed.
The unit tests checked the prompt string, not the model's output.
The integration tests mocked the LLM call.
The output was indistinguishable from "everything's fine and quiet."

This is the silent regression problem. Code has tests; prompts have vibes.

Three categories of model-swap failure

After looking at a dozen of these incidents, the failures cluster into three
groups. Knowing which kind you're looking at tells you what to test.

1. Format drift. The model decides to rename a field, drop a field, add
a field you didn't ask for, or change list ordering. JSON still parses. Your
downstream code breaks.

2. Reasoning regression. The model is "improved" but loses a hidden
constraint your prompt depended on. Classic example: GPT-4 reliably extracted
all requirements from a contract; GPT-4-Turbo extracted "the most important
ones," dropping 15-20% of clauses. The format was fine. The data was wrong.

3. Tone shift. Less common but expensive. The new model's outputs are
more verbose, less verbose, friendlier, blunter. If anything downstream
(another model, a regex, a fuzzy matcher) was tuned to the old tone, it
breaks.
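
Format drift is the easiest of the three to check mechanically. As a rough sketch (key_drift is a hypothetical helper, not part of any tool mentioned here), you can diff the top-level JSON keys of a baseline output against a candidate output. Reasoning regressions and tone shifts won't show up in a check like this; they need content-level comparison.

```python
import json


def key_drift(baseline_raw: str, candidate_raw: str) -> dict:
    """Compare the top-level JSON keys of a baseline and a candidate output."""
    baseline_keys = set(json.loads(baseline_raw))
    candidate_keys = set(json.loads(candidate_raw))
    return {
        "missing": sorted(baseline_keys - candidate_keys),
        "added": sorted(candidate_keys - baseline_keys),
    }


baseline = '{"intent": "refund", "urgency": "high"}'
candidate = '{"intent": "refund", "urgency_level": "high"}'

drift = key_drift(baseline, candidate)
print(drift)  # {'missing': ['urgency'], 'added': ['urgency_level']}
```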

What the team should have had

A test suite of 30 representative tickets, each with an expected JSON shape.
On model swap day:

$ promptfork test summarize_ticket --baseline gpt-4o
→ running v12 across [gpt-4.1] vs baseline [gpt-4o]
✗ 30/30 ok, but 6 regressions detected
  - urgency_field_renamed: 6 cases
  - severity 2 (functional)

Five lines of output. Seven seconds. A two-week customer-facing bug avoided.

How to actually do this

The setup for the team that got bitten took four minutes:

pip install promptfork

# Save the current production prompt, version 1
promptfork push summarize_ticket \
  --file prompts/summarize.txt \
  --message "current prod"

# Pin 30 real tickets from your support inbox
for t in tickets/*.json; do
  name=$(basename "$t" .json)
  promptfork add-test summarize_ticket "$name" \
    --input ticket="$(cat "$t")" \
    --rubric "must return urgency in {low,medium,high}"
done

# Run baseline on 4o
promptfork test summarize_ticket --models gpt-4o

# Now upgrade — push the new prompt as v2 (or keep v1 and swap models)
# Run with v1 (4o) as the baseline, get an LLM-judge regression report
promptfork test summarize_ticket --baseline 1 --models gpt-4.1

That's it. The --baseline flag is what catches drift — it pulls the
baseline output, runs the candidate, and asks Claude Haiku to compare them
under a strict "only flag strictly worse" rubric.

The CI version

The same command in a GitHub Action means no prompt change ever ships
without running against a known-good baseline:

- uses: shaunvand/promptfork-cli@v0
  with:
    prompt: summarize_ticket
    baseline: 1
    api-key: ${{ secrets.PROMPTFORK_API_KEY }}

The action exits non-zero on regression. Branch protection blocks the merge.

If you ship LLM features, you need this. The first time it catches a silent
regression, it pays for itself a hundred times over. PromptFork has a free
tier (3 prompts, 50 runs/mo) at https://promptfork.online/diff — set it up
in five minutes, sleep better forever.

Claude Sonnet 4.6 vs GPT-4.1 vs Gemini 2.5 Flash: which wins JSON extraction?

shaun vd — Wed, 20 May 2026 10:23:56 +0000

We had a question: for structured-output tasks where you just need clean
JSON back, which frontier model wins on a cost/quality basis?

The answer matters because most production LLM features aren't writing
poetry — they're extracting fields from emails, summarizing tickets,
classifying intents. Boring, structured, repetitive. The kind of work where
overpaying by 5x for marginal quality gains is just a tax on your margins.

We benchmarked.

Setup

Task: extract {sender, intent, urgency, refund_amount} from customer support emails.
Inputs: 30 real tickets (anonymized), ranging from 50 to 800 tokens.
Models: claude-sonnet-4-6, claude-haiku-4-5, gpt-4.1, gpt-5, gemini-2.5-flash, gemini-2.5-pro.
Scoring: field completeness (all 4 fields present, correct types), hallucination rate (made-up refund amounts), JSON validity.
Run: promptfork test extract_email against all 6 models in parallel.
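
For reference, here is roughly what a field-completeness check can look like. This is a sketch under assumptions: the exact schema isn't spelled out in this post, so treating refund_amount as "number or null" and the other three fields as strings is a guess, and is_complete is a hypothetical name.

```python
import json

# Assumed types per field; adjust to your own schema.
EXPECTED_TYPES = {
    "sender": (str,),
    "intent": (str,),
    "urgency": (str,),
    "refund_amount": (int, float, type(None)),
}


def is_complete(raw: str) -> bool:
    """True if raw is a JSON object with all four fields and the assumed types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    for field, types in EXPECTED_TYPES.items():
        if field not in data:
            return False
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            return False
    return True


outputs = [
    '{"sender": "a@example.com", "intent": "refund", "urgency": "high", "refund_amount": 42.5}',
    '{"sender": "b@example.com", "intent": "question", "urgency": "low"}',
    "not json",
]
complete = sum(is_complete(o) for o in outputs)
print(f"{complete}/{len(outputs)} complete")  # 1/3 complete
```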

Results

Model               Completeness   Hallucinations   $ / 30 tickets   Latency p50
claude-sonnet-4-6   30/30          0                $0.024           1.1s
claude-haiku-4-5    29/30          0                $0.003           0.7s
gpt-5               30/30          1                $0.045           1.8s
gpt-4.1             28/30          2                $0.018           1.4s
gemini-2.5-pro      27/30          4                $0.012           1.6s
gemini-2.5-flash    26/30          3                $0.001           0.9s

(Numbers are illustrative — run the same suite on your own prompts to get
results that actually predict your production behaviour.)
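
To make the cost/quality trade-off concrete, the sketch below turns the table's figures into completeness percentages and cost ratios against a reference model. The numbers are copied from the table; the compare helper and the choice of Haiku as the reference are ours for illustration.

```python
TICKETS = 30

# Figures copied from the results table; costs are in thousandths of a dollar
# per 30 tickets so the ratios stay exact.
RESULTS = {
    "claude-sonnet-4-6": {"complete": 30, "cost_milli_usd": 24},
    "claude-haiku-4-5": {"complete": 29, "cost_milli_usd": 3},
    "gpt-5": {"complete": 30, "cost_milli_usd": 45},
    "gpt-4.1": {"complete": 28, "cost_milli_usd": 18},
    "gemini-2.5-pro": {"complete": 27, "cost_milli_usd": 12},
    "gemini-2.5-flash": {"complete": 26, "cost_milli_usd": 1},
}


def compare(results: dict, reference: str) -> dict:
    """Completeness in percent and cost as a multiple of the reference model."""
    ref_cost = results[reference]["cost_milli_usd"]
    summary = {}
    for model, row in results.items():
        summary[model] = {
            "completeness_pct": round(100 * row["complete"] / TICKETS, 1),
            "cost_vs_reference": round(row["cost_milli_usd"] / ref_cost, 2),
        }
    return summary


summary = compare(RESULTS, reference="claude-haiku-4-5")
for model, stats in summary.items():
    print(model, stats)
```

Against Haiku, Sonnet comes out at 8x the cost and GPT-5 at 15x, for one extra complete ticket out of 30.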

What surprised us

Haiku is the value pick. 96.7% completeness (29/30) at one-eighth of
Sonnet's cost. For straight extraction with rubric-defined fields, paying for
Sonnet is a luxury, not a need.

Gemini 2.5 Flash is fast and cheap and wrong. Three hallucinated refund
amounts in 30 tickets is a customer-facing accident waiting to happen.
We're not saying Gemini is bad — we're saying Gemini is bad for this kind
of task. Probably great for creative writing.

GPT-5 doesn't pay for itself on simple tasks. It's a smarter model. But
when the task is "return four fields with these types," the smarter model
isn't writing better outputs, it's writing the same outputs more slowly and
more expensively.

The urgency field was where models diverged most. All six models
nailed sender and intent. Urgency is subjective; that's where reasoning
quality showed up.

How we actually ran this

pip install promptfork
export PROMPTFORK_API_KEY=pf_xxx

# Push the prompt
promptfork push extract_email --file prompts/extract.txt

# Pin 30 tickets as test cases (script your own loop)
for f in tickets/*.json; do ...; done

# Run all 6 models in parallel
promptfork test extract_email \
  --models claude-sonnet-4-6,claude-haiku-4-5,gpt-5,gpt-4.1,gemini-2.5-pro,gemini-2.5-flash

PromptFork fans out one call per (model × test case), captures cost +
latency + tokens, persists everything. We then exported the run as a CSV
and scored manually for hallucinations (the LLM-judge handles regression
detection but not novel correctness scoring — that's still a human's job
the first time).
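
Once the manual labels are in, tallying them per model is a few lines. The sketch below assumes a hypothetical export layout (columns model, test_case, hallucinated) and made-up sample rows; the real CSV columns may well differ.

```python
import csv
import io
from collections import Counter

# Hypothetical layout: one row per (model, test case), with a "hallucinated"
# column filled in by hand. The rows are sample data, not benchmark results.
EXPORT = """model,test_case,hallucinated
gpt-4.1,ticket_01,no
gpt-4.1,ticket_02,yes
claude-haiku-4-5,ticket_01,no
claude-haiku-4-5,ticket_02,no
"""


def hallucinations_per_model(csv_text: str) -> Counter:
    """Count rows labeled "yes" per model; models with none still get a 0 entry."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        flagged = row["hallucinated"].strip().lower() == "yes"
        counts[row["model"]] += 1 if flagged else 0
    return counts


counts = hallucinations_per_model(EXPORT)
for model, n in counts.items():
    print(model, n)
```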

Practical takeaway

If you're shipping a structured-output LLM feature today, your stack should
probably be:

Default: Haiku. Cheap, fast, accurate enough for most extraction.
Hard reasoning: Sonnet. When Haiku misses, it usually misses on multi-step reasoning, not format. Sonnet picks that up.
Avoid: routing the same simple task to a frontier model "just in case." You're paying 5-10x for nothing.

You don't need a benchmark blog post to validate this for your prompts —
you need to run the benchmark on your inputs. PromptFork makes that one
command. Free tier handles ~50 runs/mo: https://promptfork.online/diff

Prompt regression testing in CI: a 5-minute setup

shaun vd — Mon, 11 May 2026 10:33:40 +0000

Your code has tests. Your code has a CI pipeline. A bad change can't merge
without going green.

Your prompts? Vibes. A teammate edits the system prompt to fix one customer
complaint, output quality drops 8% on the other 99% of cases, nobody
notices for a month, and the regression eventually surfaces as a
mysterious churn bump in the metrics deck.

This post is the 5-minute setup that closes that gap.

What "tests for prompts" actually means

There are two viable approaches and you need to know which to use when.

Assertion-based. You write code that checks the LLM output against
fixed rules: regex matches, JSON shape validation, field-presence checks,
length bounds. Fast, cheap, deterministic.

Use it when: the output is structured and the contract is rigid. JSON
extraction, classification, function-call payloads, schema-conformant
generation.
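
A minimal assertion-based check, assuming a ticket summarizer whose contract is a JSON object with exactly two keys, intent and urgency. The check_output name and the 200-character bound are assumptions for illustration; swap in whatever your contract actually says.

```python
import json
import re

URGENCY_PATTERN = re.compile(r"^(low|medium|high)$")
MAX_INTENT_CHARS = 200  # assumed bound; pick one that fits your data


def check_output(raw: str) -> list[str]:
    """Return a list of contract violations; an empty list means the case passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    if set(data) != {"intent", "urgency"}:
        problems.append(f"unexpected key set: {sorted(data)}")

    intent = data.get("intent")
    if not isinstance(intent, str) or not intent.strip():
        problems.append("intent must be a non-empty string")
    elif len(intent) > MAX_INTENT_CHARS:
        problems.append("intent exceeds length bound")

    urgency = data.get("urgency")
    if not isinstance(urgency, str) or not URGENCY_PATTERN.match(urgency):
        problems.append("urgency must be low, medium, or high")
    return problems


print(check_output('{"intent": "refund request", "urgency": "high"}'))
print(check_output('{"intent": "refund request", "urgency_level": "high"}'))
```

The first call returns an empty list; the second reports both the unexpected key set and the missing urgency value.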

LLM-judge. Another LLM compares the candidate output to a baseline and
returns "regressed: yes/no" with a severity score. Slower, costs a few
cents per comparison, handles fuzzy outputs.

Use it when: the output is freeform — summaries, rewrites, creative
generation, anything where two correct answers can look very different.
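
To show the shape of an LLM-judge, here is a rough sketch. It is not PromptFork's implementation: the prompt wording, the 0 to 3 severity scale, and the judge function are all assumptions, and call_model stands in for whatever client you use to reach the judge model.

```python
import json
from typing import Callable

JUDGE_TEMPLATE = """You are comparing two outputs for the same input.
Rubric: {rubric}
Only flag the candidate if it is strictly worse than the baseline.

Baseline:
{baseline}

Candidate:
{candidate}

Reply with JSON only: {{"regressed": true or false, "severity": 0 to 3}}"""

INCONCLUSIVE = {"regressed": None, "severity": None}


def judge(baseline: str, candidate: str, rubric: str,
          call_model: Callable[[str], str]) -> dict:
    """Ask a judge model for a verdict; malformed replies count as inconclusive."""
    prompt = JUDGE_TEMPLATE.format(
        rubric=rubric, baseline=baseline, candidate=candidate
    )
    try:
        verdict = json.loads(call_model(prompt))
    except json.JSONDecodeError:
        return dict(INCONCLUSIVE)
    # Never default a missing verdict to "not regressed"; surface it instead.
    if not isinstance(verdict, dict) or not isinstance(verdict.get("regressed"), bool):
        return dict(INCONCLUSIVE)
    severity = verdict.get("severity")
    return {
        "regressed": verdict["regressed"],
        "severity": severity if isinstance(severity, int) else None,
    }


def fake_judge(prompt: str) -> str:
    """Stub standing in for a real model call, so the sketch runs offline."""
    return '{"regressed": true, "severity": 2}'


result = judge("Customer wants a refund.", "Customer is happy.",
               "must mention refund", fake_judge)
print(result)  # {'regressed': True, 'severity': 2}
```

Treating a malformed judge reply as inconclusive rather than as a pass is deliberate: a silent default is exactly the kind of failure this setup exists to catch.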

A mature setup uses both. PromptFork ships the LLM-judge built in (we
chose Claude Haiku at temp 0 with a strict "only flag strictly worse"
rubric); assertions are easy to add yourself in custom test cases.

The 5-minute setup

1. Pin your prompts in version control

prompts/
  summarize_ticket.txt
  extract_email.txt
  classify_intent.txt

Plain text files. Not constants in prompts.py. Not Notion docs. Files
with a git history.

2. Push them to PromptFork

pip install promptfork
export PROMPTFORK_API_KEY=pf_xxxx

for f in prompts/*.txt; do
  name=$(basename "$f" .txt)
  promptfork push "$name" --file "$f" --message "initial commit"
done

This creates v1 of each prompt server-side and gives you a stable identifier.

3. Add test cases

For each prompt, pin 3-30 representative inputs. Real production inputs are
worth 10x synthetic ones.

promptfork add-test summarize_ticket happy_path \
  --input ticket="Order arrived. Loved it." \
  --rubric "summary should be positive and under 20 words"

promptfork add-test summarize_ticket angry_refund \
  --input ticket="3 weeks late, want money back NOW" \
  --rubric "must mention refund and high urgency"

promptfork add-test summarize_ticket edge_garbled \
  --input ticket="hi pls help thx" \
  --rubric "summary should request more info, not invent details"

Three test cases is a starting point. Six is a good baseline. Thirty is
production-grade.

4. Wire the GitHub Action

# .github/workflows/prompt-tests.yml
name: Prompt Regression Tests
on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Push current prompts
        env:
          PROMPTFORK_API_KEY: ${{ secrets.PROMPTFORK_API_KEY }}
        run: |
          pip install promptfork
          for f in prompts/*.txt; do
            name=$(basename "$f" .txt)
            promptfork push "$name" --file "$f" \
              --message "PR #${{ github.event.pull_request.number }}"
          done
      - uses: shaunvand/promptfork-cli@v0
        with:
          prompt: summarize_ticket
          baseline: 1
          api-key: ${{ secrets.PROMPTFORK_API_KEY }}

Add PROMPTFORK_API_KEY as a repository secret under Settings → Secrets and
variables → Actions. Done.

5. Open a PR that changes a prompt

The action runs, executes your prompt across Claude/GPT/Gemini, has the
LLM-judge compare each output against your baseline version, and posts a
PR comment with the regression report. If anything regresses, the action
exits non-zero, branch protection blocks the merge, the change goes back
for review.

You now have a CI gate for prompts. The same gate you have for code.

What goes in the test suite

After running this on a few projects, the pattern that works:

One happy-path case. "Normal" input, expected output.
One edge case. Empty input, very long input, input in another language, malformed structure.
One adversarial case. Prompt-injection attempt, contradictory instructions, a customer trying to break the bot.

That's 3 per prompt. If a prompt is mission-critical, scale to 10-30.
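
If you keep test cases in code or config, a small coverage check can tell you when a prompt is missing one of the three categories. This is a sketch: PromptTest and missing_categories are hypothetical names, and the two sample cases reuse the happy-path and garbled-input examples from step 3.

```python
from dataclasses import dataclass

REQUIRED_CATEGORIES = {"happy_path", "edge", "adversarial"}


@dataclass(frozen=True)
class PromptTest:
    name: str
    category: str
    ticket: str
    rubric: str


def missing_categories(suite: list[PromptTest]) -> set[str]:
    """Categories from REQUIRED_CATEGORIES that no test in the suite covers."""
    return REQUIRED_CATEGORIES - {test.category for test in suite}


suite = [
    PromptTest("happy_path", "happy_path", "Order arrived. Loved it.",
               "summary should be positive and under 20 words"),
    PromptTest("edge_garbled", "edge", "hi pls help thx",
               "summary should request more info, not invent details"),
]
print(missing_categories(suite))  # {'adversarial'}
```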

What goes wrong if you don't do this

We've seen this play out enough times to predict it:

New model drops. Team migrates. "Looks fine in playground." Ships.
Quality degrades 5-15% on a subset of inputs. No alert fires.
Customer support volume creeps up. Nobody connects the dots.
Three weeks later, churn ticks up. "Why?"
Eventually somebody runs an A/B back-test and finds the regression.
Rollback. Apology emails. Deck slide titled "Lessons Learned."

Avoiding that whole loop takes six commands and an afternoon.

PromptFork has a free tier (3 prompts, 50 runs/mo) that's enough for the
setup above on a small project. https://promptfork.online/diff