Kartik N V J K

Posted on Jun 23 • Originally published at futureagi.com

How I A/B test LLM prompts without fooling myself

#llm #ai #prompts #testing

A while back I was building a support assistant and hit a simple-sounding question: is this new version of the prompt actually better than the old one? So I did the obvious thing. I wrote thirty test cases, ran both prompts, saw the new one score a little higher, and shipped it.

I felt great about it for half a day, until the support queue filled with complaints and I was rolling the change back from a Slack thread that evening.

The bump in the score was never real. The test was far too small to tell a tiny genuine improvement apart from luck, so the number was noise the whole time. That one lesson is behind almost everything below. This is the guide I swear by now, in plain language, and I hope it saves you the same evening.

First, figure out how small a change your test can even see

A small test can only catch large differences. If the real improvement is small, thirty examples cannot tell it apart from the random wobble of which examples you happened to pick. Run it again and it often flips.

So before testing anything, I now answer two questions:

What is the smallest improvement that would actually be worth shipping?
Do I have enough examples to see an improvement that small?

The catch is how fast the numbers grow. To catch an improvement half as big, you need about four times as many examples. In my tests, thirty examples could barely see anything under a ten-point swing. A four-point improvement took me a few hundred examples. A two-point one took over a thousand. The exact counts depend on how noisy your scores are, but the pattern holds. Below a hundred, you are measuring your scorer's mood, not your prompt.
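Here is a rough sketch of that arithmetic, using the standard normal-approximation formula for a paired comparison. The function name, the 5% significance level, the 80% power target, and the spread of 0.25 are all assumptions for illustration; you would estimate the spread from a small pilot run of your own.

```python
from math import ceil
from statistics import NormalDist


def examples_needed(min_gain, sd_of_diffs, alpha=0.05, power=0.80):
    """Rough count of paired examples needed to detect a mean gain of min_gain.

    sd_of_diffs is the standard deviation of the per-question score
    differences (B minus A). It is an assumption you estimate from a pilot.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided false-alarm threshold
    z_power = z.inv_cdf(power)          # chance of catching a gain that is real
    return ceil(((z_alpha + z_power) * sd_of_diffs / min_gain) ** 2)


# Scores on a 0-1 scale, so 0.04 is a four-point gain. The 0.25 spread is made up.
for gain in (0.10, 0.04, 0.02):
    print(f"{gain:.2f} gain -> about {examples_needed(gain, sd_of_diffs=0.25)} examples")
```

With that made-up spread, the sketch lands at roughly 50 examples for a ten-point gain, a bit over 300 for four points, and a bit over 1,200 for two points. Halving the gain squares into four times the examples, whatever spread you plug in.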

Test both versions on the very same inputs

My next mistake looked harmless: I gave version A one batch of questions and version B a different batch, then compared averages. But some questions are just harder. If B drew the easier batch, it looks better even when it is worse.

The fix is simple. Run both versions through the exact same questions and compare them one question at a time. The difficulty cancels out, because both faced the same hard ones, and all that is left is the real gap between the prompts.

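This let me trust results from a few hundred examples that would otherwise have needed about four times as many. How much you save depends on how closely the two versions' scores move together from question to question. The only cost is running every question through both versions, which is far cheaper than collecting four times the questions.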
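To see why the difficulty cancels, here is a small sketch on invented scores. The two lists are hypothetical; the point is only to compare the noise you get when you pair the scores by question with the noise you get when you treat the two lists as unrelated batches.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical scores (0 to 1) for the same eight questions under each prompt.
scores_a = [0.90, 0.40, 0.75, 0.20, 0.85, 0.55, 0.30, 0.95]
scores_b = [0.95, 0.45, 0.75, 0.30, 0.85, 0.60, 0.40, 0.95]

n = len(scores_a)
diffs = [b - a for a, b in zip(scores_a, scores_b)]

# Paired: the noise comes only from how much the per-question gap varies.
paired_se = stdev(diffs) / sqrt(n)

# Unpaired: the lists are treated as unrelated, so question difficulty counts as noise.
unpaired_se = sqrt(stdev(scores_a) ** 2 / n + stdev(scores_b) ** 2 / n)

print(f"mean gain: {mean(diffs):.3f}")
print(f"standard error, paired:   {paired_se:.3f}")
print(f"standard error, unpaired: {unpaired_se:.3f}")
```

On this toy data the paired standard error comes out several times smaller, because the hard questions drag both versions down together and drop out of the difference.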

Put a range on the improvement, not a single average

A single average never told me enough, even with a "statistically significant" stamp. That stamp only says the difference probably is not exactly zero. It says nothing about how big it is, and the size is what decides whether I ship.

So I report a range (a confidence interval, in statistics terms): the smallest and largest the true improvement is likely to be.

If the range still includes zero, I do not have a real win, however nice the average looks.
If the whole range sits above the smallest improvement worth shipping, I ship.

One easy mistake to avoid: build the range from the per-question differences, keeping each question's two scores paired together. If you instead compare the two overall averages as if they were unrelated, you throw away the benefit of testing on the same inputs and the range balloons.
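One way to build that range is a percentile bootstrap over the per-question differences, sketched below. The function names, the 95% level, and the resample count are assumptions, and the post does not say which method the author used. The two verdicts for ranges that are neither a clear win nor touching zero are also my own labels.

```python
import random


def paired_bootstrap_ci(diffs, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap range for the mean of per-question differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample questions with replacement; each draw keeps a question's pair intact
    # because we resample the differences, not the two score lists separately.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    tail = (1 - level) / 2
    lo = means[int(tail * n_resamples)]
    hi = means[int((1 - tail) * n_resamples) - 1]
    return lo, hi


def verdict(lo, hi, min_gain):
    """Apply the two rules from the text, plus labels for the leftover cases."""
    if lo > min_gain:
        return "ship"
    if lo <= 0 <= hi:
        return "no real win"
    if hi < 0:
        return "B is worse"
    return "real, but maybe too small to ship"


# Hypothetical per-question differences (B minus A) on a 0-1 scale.
diffs = [0.05, 0.05, 0.0, 0.10, 0.0, 0.05, 0.10, 0.0] * 25
lo, hi = paired_bootstrap_ci(diffs)
print(f"range: {lo:.3f} to {hi:.3f} -> {verdict(lo, hi, min_gain=0.02)}")
```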

Score each answer alone, or judge the two side by side

There are two ways to score, and they answer different questions.

Score each answer on its own against a checklist (did it stay grounded, did it fully answer). Use this when you need an absolute number to hold against a bar you set.
Put both answers side by side and ask which is better. Use this when the quality is fuzzy, like tone or helpfulness, or when your scorer drifts run to run. Comparing two things directly is steadier than scoring each in a vacuum.

Simple rule: if the question is "is B better than A," judge side by side. If it is "does B clear the bar," score each one alone.
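For the side-by-side route, the result is a pile of "A", "B", or "tie" verdicts rather than scores. A minimal sketch of how those could be tallied is below, using an exact sign test. The verdict list is invented, the function name is hypothetical, and dropping ties is a simplifying assumption rather than the only reasonable choice.

```python
from collections import Counter
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact sign test: how surprising is this split if A and B were equal?"""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin, one side.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical judge verdicts, one per question.
verdicts = ["B"] * 34 + ["A"] * 18 + ["tie"] * 8
counts = Counter(verdicts)

decided = counts["A"] + counts["B"]  # ties are dropped in this sketch
win_rate_b = counts["B"] / decided
p_value = sign_test_p_value(counts["A"], counts["B"])

print(f"B preferred in {win_rate_b:.0%} of decided pairs, p = {p_value:.3f}")
```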

When you cannot wait for a full test, let the traffic pick the winner

At one point I had three versions to compare on a low-traffic feature, and a normal test would have taken weeks. The whole time, two of the three were probably worse and still going out to real people. Waiting a month while users got worse answers did not sit right.

So I used something with an odd name: a bandit. A normal test gives every version an equal share of traffic and waits until the end to call a winner, so with two versions half your users keep getting the weaker prompt the whole time, and with three it is two thirds. A bandit does not wait. It watches results as they come in and quietly sends more people to whatever is pulling ahead, and fewer to the rest. The longer it runs, the fewer users ever see the weak versions.

The name is just a leftover from gambling. A slot machine was a one-armed bandit, and this is the old puzzle of picking between machines when you do not yet know which pays out best. Swap machines for prompt versions and that was my situation.

The catch is that it gives up a clean number. Because it starves the weaker versions of traffic on purpose, you cannot say at the end that B beat A by exactly this much. So I only reach for a bandit when I have three or more versions, when even the weak ones are still usable, and when waiting on a full test costs more than a fuzzy final verdict. When I need a result I can defend to leadership, or it is just two versions head to head, I go back to the normal test.
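The post does not say which bandit algorithm was used, so the sketch below uses Thompson sampling, one common choice. It assumes each response gets a yes/no outcome (say, a thumbs-up), and the class name, the simulated success rates, and the round count are all hypothetical.

```python
import random


class ThompsonBandit:
    """Thompson sampling over prompt versions with yes/no feedback."""

    def __init__(self, arms, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior: start every version with one pseudo-win and one pseudo-loss.
        self.wins = {arm: 1 for arm in arms}
        self.losses = {arm: 1 for arm in arms}

    def choose(self):
        # Draw a plausible success rate for each version, then serve the best draw.
        draws = {
            arm: self.rng.betavariate(self.wins[arm], self.losses[arm])
            for arm in self.wins
        }
        return max(draws, key=draws.get)

    def record(self, arm, success):
        if success:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1


def simulate(true_rates, rounds=3000, seed=0):
    """Route simulated traffic and count how often each version was served."""
    bandit = ThompsonBandit(true_rates, seed=seed)
    outcomes = random.Random(seed + 1)
    served = {arm: 0 for arm in true_rates}
    for _ in range(rounds):
        arm = bandit.choose()
        served[arm] += 1
        bandit.record(arm, outcomes.random() < true_rates[arm])
    return served


# Hypothetical thumbs-up rates; in real life these are exactly what you do not know.
print(simulate({"A": 0.55, "B": 0.75, "C": 0.45}))
```

In the simulation most of the traffic ends up on the strongest version, which is also why the weaker versions collect too few samples to pin down an exact gap between them.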

Sometimes it is better to generate the variants, not hand-write one

For a long time my loop was: think hard, write one new version, test it, keep it if it won. The blind spot is that one hand-written version is a tiny slice of all the ways you could phrase the thing.

Once I had a scorer I trusted and labeled examples, I let a tool do the exploring. It generates far more variations than I would write by hand and searches for the ones that score well. Then I take its best candidate and run the same careful comparison against what is already in production. The tool covers ground I could not, and the A/B still makes the final call.

The mistakes that quietly ruin these tests

A few traps I fell into, so you do not have to:

An average with no range. A small bump on thirty examples is noise wearing a result's clothes.
A different batch of questions per version. You cannot tell a better prompt from an easier batch.
Swapping the scorer mid-test. That is a different test, not the same one. Lock it down first.
Peeking and stopping the moment it looks good. That manufactures wins that were never there. Decide when you will look before you run it.
Watching five "main" numbers at once. With five, at the usual 5% threshold, there is roughly a one in four chance (about 23%) that one looks like a win by pure luck. Pick one number and one bar up front.
Trusting the scorer before checking the scorer. If your automatic judge does not match careful human judgement, it can flip the result on its own noise.
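The arithmetic behind the five-numbers trap is short enough to check directly. This sketch assumes the metrics are independent and each is judged at a 5% false-alarm threshold; real metrics are usually correlated, so treat the figure as a rough guide. The function name is hypothetical.

```python
def chance_of_false_win(num_metrics, alpha=0.05):
    """Chance that at least one of several no-effect metrics looks like a win."""
    # Each metric stays quiet with probability 1 - alpha; all must stay quiet.
    return 1 - (1 - alpha) ** num_metrics


for k in (1, 3, 5, 10):
    print(f"{k} metrics -> {chance_of_false_win(k):.0%} chance of a lucky win")
```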

That is the whole guide, and none of it is clever, which is sort of the point. The hard part was never running the test. It was learning when the result is real, and that comes down almost every time to whether I had enough examples to see what I was looking for. I hope having it in one place saves you an evening or two. If you have run these comparisons yourself, I would love to hear where your offline results and your live numbers parted ways. For me it is almost always how often the new prompt starts refusing things it should have answered.

Top comments (1)

Viktor • Jun 30

This is the post I wish I could hand everyone shipping prompt changes. The "below a hundred you're measuring your scorer's mood" line is exactly it.

One thing I'd add, because it's the trap that got me even after I fixed sample size: a bigger paired test still fools you if the scorer itself is noisy or biased, and that noise doesn't always average out. If your judge is an LLM, it can swing ten points on an answer you only reworded, and worse, it can be biased toward the style of whichever prompt you were tuning, so the bias correlates with the variable you're testing and no sample size cancels it. So now before I trust any A/B number, I sanity-check the judge first: feed it the same answer paraphrased a few ways and confirm the score barely moves. If it wobbles, I'm measuring the judge, not the prompt.

And the part your rollback story already taught: an offline win on a curated set is a hypothesis, not a result. The set is never the prod distribution. A guarded online rollout with a real metric was the only thing that actually settled it for us.