Gabriel Anhaia

Posted on May 24

Prompt Diff Testing: A/B Your Prompts Without Changing the Model

#ai #llm #prompt #testing

Someone on your team changed two words in the system prompt. The PR diff was tiny. The reviewer approved in under a minute. Quality dropped 6 points on the enterprise customer slice and nobody noticed for a week, until the support tickets stacked up.

That's the failure mode this post fixes. A 50-line prompt-diff testing script, a bootstrap confidence interval, per-slice deltas, and a GitHub Actions job that comments on the PR before merge.

Why prompt changes are migrations, not edits

The framing matters. When you change a database schema, nobody calls it "an edit." It's a migration. There's a forward path, a rollback, a verification step, and someone signs off.

Prompt changes have the same properties and almost none of the discipline. The system prompt is the contract between your code and the model. Touching it can shift the output distribution in ways that don't show up in a unit test. The compiler doesn't catch it. The linter doesn't catch it. A spot-check on 5 examples doesn't catch it.

What catches it is the same thing that catches schema regressions: a test suite that runs against a known input set and compares the result distribution before and after.

The shape of the suite differs (you're scoring stochastic outputs, not asserting equality) but the contract is the same. Treat the prompt like code. Version it. Test the diff. Block the merge if the numbers say so.

A diff-testing script in 50 lines

The script has three jobs. Run the same eval set against two prompts. Score each output. Report the delta with enough statistical rigor that a 2-point swing means something.

Here's the core:

import asyncio, random, statistics
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def score_one(prompt_text: str, case: dict) -> float:
    resp = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        system=prompt_text,
        messages=[{"role": "user", "content": case["input"]}],
    )
    output = resp.content[0].text
    # task-specific judge: exact match, rubric LLM, regex, etc.
    return float(case["judge"](output, case["expected"]))

async def run(prompt_text: str, cases: list, k: int = 3) -> list:
    # k samples per case to smooth model stochasticity
    tasks = [score_one(prompt_text, c) for c in cases for _ in range(k)]
    raw = await asyncio.gather(*tasks)
    return [statistics.mean(raw[i*k:(i+1)*k]) for i in range(len(cases))]

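def bootstrap_ci(deltas: list, iters: int = 2000, alpha: float = 0.05):
    # Percentile bootstrap: resample the per-case deltas with replacement,
    # record the mean of each resample, then read the CI off the sorted means.
    samples = []
    for _ in range(iters):
        resample = random.choices(deltas, k=len(deltas))
        samples.append(statistics.mean(resample))
    samples.sort()
    lo = samples[int(iters * alpha / 2)]
    hi = samples[min(int(iters * (1 - alpha / 2)), iters - 1)]  # clamp to last index
    return statistics.mean(deltas), lo, hi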

async def diff(baseline: str, candidate: str, cases: list):
    a, b = await asyncio.gather(run(baseline, cases), run(candidate, cases))
    deltas = [bi - ai for ai, bi in zip(a, b)]
    mean, lo, hi = bootstrap_ci(deltas)
    by_slice = {}
    for case, d in zip(cases, deltas):
        by_slice.setdefault(case["slice"], []).append(d)
    slice_report = {s: bootstrap_ci(ds) for s, ds in by_slice.items()}
    return {"overall": (mean, lo, hi), "slices": slice_report}

That comes to about 40 lines, blank lines included. The remaining headroom is yours for retry logic, cost tracking, or the JSON output your CI wants.
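One way to spend that headroom on retry logic: the script as written fires every request at once, which can run into rate limits on a large eval set. The sketch below is an assumption about how you might bound concurrency and retry transient failures. The names with_retry and bounded_gather are hypothetical helpers, not part of the Anthropic SDK, and the broad except should be narrowed to whatever errors your client raises.

```python
import asyncio
import random

async def with_retry(make_call, attempts: int = 4, base_delay: float = 0.5):
    """Await make_call(), retrying with exponential backoff plus jitter.

    make_call is a zero-argument function that returns a fresh coroutine,
    because a coroutine object cannot be awaited twice.
    """
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:  # assumption: narrow this to rate-limit/timeout errors
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

async def bounded_gather(factories, limit: int = 8):
    """Like asyncio.gather, but with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(factory):
        async with sem:
            return await with_retry(factory)

    # gather preserves input order, so per-case slicing of the results still works
    return await asyncio.gather(*(guarded(f) for f in factories))
```

Inside run(), the task list would become a list of factories, for example [lambda c=c: score_one(prompt_text, c) for c in cases for _ in range(k)], passed to bounded_gather instead of asyncio.gather.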

A few things to call out. The k parameter samples each case 3 times and averages; without it, single-shot scoring gives you noise instead of signal. The judge is whatever fits your task: exact match for extraction, regex for format checks, an LLM-as-judge for open-ended generation, or a parsed-field check such as json.loads(output)["status"] in expected_statuses for structured tasks. The script only requires that the judge take (output, expected) and return something float() accepts.
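Because a JSONL file cannot store a Python function, the judge has to be attached when the cases are loaded. Here is a minimal sketch of one way to do that, assuming each case names its judge with a short string. The registry keys "exact" and "regex", the default slice name "all", and the load_cases helper are all hypothetical; only the input, expected, judge, and slice fields come from the script above.

```python
import json
import re

def exact_match(output: str, expected: str) -> float:
    # 1.0 when the output equals the expected string, ignoring outer whitespace
    return float(output.strip() == expected.strip())

def regex_match(output: str, expected: str) -> float:
    # here `expected` holds a regular expression the output must contain
    return float(re.search(expected, output) is not None)

# Hypothetical registry: maps the string stored in the JSONL to a callable.
JUDGES = {"exact": exact_match, "regex": regex_match}

def load_cases(lines):
    """Parse JSONL lines into case dicts with a callable under case["judge"]."""
    cases = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        case["judge"] = JUDGES[case["judge"]]  # KeyError on an unknown judge name
        case.setdefault("slice", "all")        # assumption: unlabeled cases share one slice
        cases.append(case)
    return cases
```

With a file on disk, the call would look like load_cases(open("evals/cases.jsonl")), since iterating a file yields its lines.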

The eval set is the load-bearing part. As a rule of thumb, 30 cases is the floor below which the interval is too wide and too unstable to act on; 100 is comfortable; 300 means you can slice it five ways and still have power per slice.

Statistical significance: when a 2-point delta is real

You'll see a 2-point average difference between baseline and candidate. You'll want to ship. Don't, until the confidence interval tells you it's real.

The bootstrap above resamples the per-case deltas 2,000 times with replacement and computes the mean each time. The 2.5th and 97.5th percentiles of that distribution give you a 95% confidence interval on the true mean delta.

Three reads on the CI:

mean +0.024  CI [+0.011, +0.037]   real improvement, ship
mean +0.024  CI [-0.005, +0.052]   noise, more cases needed
mean -0.018  CI [-0.041, +0.004]   probably a regression, hold

The interval that crosses zero is the one to respect. A 2-point lift with a confidence interval from -0.5 to +4.5 isn't a 2-point lift. It's a number you can't act on. Either gather more eval cases or accept that you don't know yet.
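The three reads above reduce to a small rule, sketched here under the assumption that you want the label computed rather than eyeballed. The function name read_ci and the label strings are hypothetical; the rule itself is only "which side of zero is the interval on".

```python
def read_ci(mean: float, lo: float, hi: float) -> str:
    """Classify a delta from its confidence interval."""
    if lo > 0:
        return "improvement"   # whole interval above zero
    if hi < 0:
        return "regression"    # whole interval below zero
    # The interval crosses zero: report which way it leans, but do not act on it.
    if mean < 0:
        return "inconclusive, leaning negative"
    if mean > 0:
        return "inconclusive, leaning positive"
    return "inconclusive"

print(read_ci(+0.024, +0.011, +0.037))  # improvement
print(read_ci(+0.024, -0.005, +0.052))  # inconclusive, leaning positive
print(read_ci(-0.018, -0.041, +0.004))  # inconclusive, leaning negative
```

The third case is the "probably a regression, hold" row: the interval still touches zero, so the label stays inconclusive and the lean tells you which way to worry.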

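Why bootstrap and not a t-test? The bootstrap doesn't assume normality. LLM scores are often bimodal (right or wrong, with mass at 0 and 1), left-skewed (mostly right), or weirder, so the per-case deltas pile up at a few values such as -1, 0, and +1. The t-test will technically run, but its interval leans on a normal approximation that is shaky for data like that at small sample sizes. The percentile bootstrap makes no distributional assumption, though it is not a cure for a tiny eval set: below the 30-case floor its interval is unreliable too.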

The 2,000 iterations cost essentially nothing; it's CPU, not API calls. The 100 cases × 2 prompts × 3 samples = 600 model calls is what costs money. Budget for that.
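The budget arithmetic is simple enough to keep next to the script. This is a rough sketch, and the llm_judge flag encodes an assumption that is not in the script above: that an LLM-as-judge makes exactly one extra model call per generated output.

```python
def model_calls(cases: int, k: int = 3, prompts: int = 2, llm_judge: bool = False) -> int:
    """Count the model calls in one diff run."""
    generations = cases * prompts * k
    # assumption: an LLM judge adds one call per generated output
    return generations * 2 if llm_judge else generations

print(model_calls(100))                  # 600
print(model_calls(100, llm_judge=True))  # 1200
```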

Per-slice deltas: surface the slice that regresses

The overall mean lies. A prompt change can be +3 points on average and -8 points on the slice your biggest customer cares about. The average tells you to ship. The slice tells you not to.

The script above bins by case["slice"], a label you put on each eval case when you write it. Common slice dimensions worth tracking separately:

Customer tier: enterprise vs SMB; their expectations differ
Language: English vs the other languages you support
Length bucket: short prompts vs long ones; behavior changes at context boundaries
Task subtype: extraction vs summarization vs classification, if your prompt covers multiple
Difficulty: easy / medium / hard, labeled by your team

The output you actually want from CI looks like this:

overall:                +0.024  [+0.011, +0.037]  ship-candidate
slice: enterprise       -0.061  [-0.092, -0.029]  REGRESSION
slice: smb              +0.048  [+0.031, +0.066]
slice: lang=de           0.000  [-0.012, +0.013]
slice: lang=en          +0.029  [+0.014, +0.045]
slice: long_context     -0.018  [-0.047, +0.011]
slice: short_context    +0.041  [+0.025, +0.058]

Now the conversation isn't "average improved by 2 points." It's "we won on SMB and short context, we regressed on enterprise, and long context is leaning negative but still inside the noise. Is the trade worth it?" That's a decision a human can make. A single average isn't.

Surface the worst regression at the top. The CI report should highlight any slice where the CI is fully below zero. That's the slice that needs eyes on it before merge.
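A minimal sketch of a renderer that does this, assuming it is fed the dict that diff() returns. The function name render_report and the column widths are hypothetical. Slices are sorted by mean delta so the worst one sits directly under the overall line, and a slice is flagged only when its whole interval is below zero.

```python
def render_report(overall, slices):
    """overall is (mean, lo, hi); slices maps slice name -> (mean, lo, hi)."""
    def row(label, stats, flag=""):
        mean, lo, hi = stats
        return f"{label:<24}{mean:+.3f}  [{lo:+.3f}, {hi:+.3f}]  {flag}".rstrip()

    # Worst mean delta first, so a regression is the first slice a reviewer sees.
    ordered = sorted(slices.items(), key=lambda kv: kv[1][0])
    lines = [row("overall:", overall)]
    for name, stats in ordered:
        flag = "REGRESSION" if stats[2] < 0 else ""  # upper bound below zero
        lines.append(row(f"slice: {name}", stats, flag))
    return "\n".join(lines)
```

Called as render_report(result["overall"], result["slices"]), the string can be written straight to the report file that the PR comment reads.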

Wiring it into CI: comment on PR

A test suite that only runs locally is a test suite that's not in your loop. The point is the merge gate. Here's the GitHub Actions shape:

# .github/workflows/prompt-diff.yml
name: prompt-diff
on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/cases.jsonl"

jobs:
  diff:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install anthropic
      - name: Resolve baseline and candidate prompts
        run: |
          git show origin/main:prompts/system.md > /tmp/baseline.md
          cp prompts/system.md /tmp/candidate.md
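      - name: Run diff script
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        # block scalar (|) so the backslash line continuations reach the shell intact
        run: |
          python evals/diff.py \
            --baseline /tmp/baseline.md \
            --candidate /tmp/candidate.md \
            --cases evals/cases.jsonl \
            --out report.md
      - name: Comment on PR
        # always() so the report still posts when the diff step fails the gate
        if: always()
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          path: report.md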

Three things make this work in practice.

The paths filter limits the workflow to pull requests that actually touched a prompt or the eval set. You don't want this firing on every JavaScript change.

The sticky-comment action edits the same comment on subsequent runs instead of stacking new ones. After three pushes, you don't want three reports cluttering the PR. You want one that reflects the current state.

The baseline is pulled from origin/main, not from disk. That means you're comparing against the current tip of the merge target, not whatever commit the branch happened to fork from. If your PRs target branches other than main, use origin/${{ github.base_ref }} instead.

For the merge-gate logic itself, set the job's exit code on the worst-slice CI. If any slice's CI upper bound is below zero, fail the job. Give the comment step if: always() so the PR comment still posts after the failure, and make the job a required status check so the merge button shows red. Override is a manual [skip-prompt-diff] label, used sparingly.
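The gate rule can be sketched as a pure function that the script calls last. The names failing_slices and exit_code are hypothetical, and so is the way PR labels reach the script; one option is to pass them from the workflow in an environment variable.

```python
def failing_slices(slices):
    """Names of slices whose whole CI sits below zero (upper bound < 0)."""
    return sorted(name for name, (_, _, hi) in slices.items() if hi < 0)

def exit_code(slices, labels=()):
    """0 to pass the job, 1 to fail it."""
    # assumption: the override label is handed to the script by the workflow
    if "skip-prompt-diff" in labels:
        return 0
    return 1 if failing_slices(slices) else 0

# At the end of evals/diff.py, after writing the report:
#     sys.exit(exit_code(result["slices"], labels))
```

Writing the report before exiting matters: the comment step can only post a file that already exists when the job fails.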

The gotcha: prompt cache invalidation changes latency

This one bit a team I talked to last quarter. The candidate prompt had identical content to the baseline except for a reordered paragraph. Quality scores moved by 0.5%, well inside the CI. Ship-it call.

Latency p95 doubled.

The reorder broke the prompt cache prefix. Same content, different cache key, every request paying the full encode cost. The eval script measured quality, not latency, and missed it entirely.

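The fix is to report cost and latency as separate axes in the diff. Same script: capture resp.usage.cache_read_input_tokens and resp.usage.cache_creation_input_tokens for each request, keep the two counts separate because cache reads and cache writes are billed at different rates, and report the per-request cost delta alongside the quality delta. Latency too: wall-clock per request, p50 and p95. This only tells you something if the eval sends the prompt with the same caching configuration production uses.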
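A sketch of the two summaries, assuming you have already collected a wall-clock duration and a usage dict for each request. The helper names are hypothetical, and the prices in the example are placeholders rather than real rates; substitute your model's current pricing.

```python
import statistics

def latency_summary(seconds):
    """Return (p50, p95) wall-clock latency in milliseconds."""
    cuts = statistics.quantiles(seconds, n=100, method="inclusive")
    return cuts[49] * 1000, cuts[94] * 1000  # 50th and 95th percentile cut points

def request_cost(usage, price_per_mtok):
    """Cost of one request in USD.

    usage: token counts keyed by the resp.usage field names.
    price_per_mtok: USD per million tokens for each of those keys.
    """
    return sum(usage.get(kind, 0) * rate for kind, rate in price_per_mtok.items()) / 1_000_000

# Placeholder prices, NOT real rates.
PRICES = {
    "input_tokens": 3.00,
    "cache_read_input_tokens": 0.30,
    "cache_creation_input_tokens": 3.75,
    "output_tokens": 15.00,
}

hit = {"input_tokens": 50, "cache_read_input_tokens": 2000, "output_tokens": 200}
miss = {"input_tokens": 50, "cache_creation_input_tokens": 2000, "output_tokens": 200}
cost_delta = request_cost(miss, PRICES) / request_cost(hit, PRICES) - 1  # relative increase
```

Summing the per-request costs for baseline and candidate gives the cost line of the report; the latency pair comes from timing each client call with time.perf_counter().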

The report turns into a 2D decision:

quality:  +0.024  [+0.011, +0.037]
cost:     +312%   (cache miss on every request)
p95 lat:  +840ms
verdict:  HOLD, quality win wiped out by cost and latency

The right shape for any prompt change is: did quality move, did cost move, did latency move? Three numbers, three confidence intervals, one ship/hold call. Anything less and you're optimizing on a single axis while quietly losing on the others.

Treat the prompt as code, version it, diff it, run the script on every PR, and put the report where the reviewer will see it. Your customer slices will thank you. Your finance team won't even notice, which is what good infrastructure feels like.

What's the worst slice regression you've ever caught with an eval suite, and what would have shipped without it?

If this was useful

Prompt diff testing is one of the workflows the Prompt Engineering Pocket Guide covers in the chapter on shipping prompts safely. The book walks through eval set construction, judge design, slice taxonomies, and the CI patterns that keep prompt changes from regressing in production. If the script above is the floor, that chapter is the ceiling.