Raj Kundalia

Posted on Jun 7

Why 95 Reviews Beats 20 Reviews — Even When Both Score 95%

#ai #algorithms #datascience #llm

Understanding Wilson Score, confidence intervals, and the mysterious 1.96.

Originally published on Medium: Why 95 Reviews Beats 20 Reviews — Even When Both Score 95%

I was running a controlled experiment measuring how instruction placement in LLM prompts affects agent behavior. After collecting results across three placement variants, I wanted to know: is the difference I'm seeing real, or just noise from a small sample size?

Link for the aforementioned experiment: WIP.

While looking into ways to answer that question, I came across the Wilson Score interval. I saw an equation and the number 1.96, and I could not make sense of either right away. I spent some time working through it and wrote this short piece.

The good news: the idea behind Wilson Score is much simpler than the formula.

The Problem

Imagine two restaurants:
Restaurant A: 95 positive reviews out of 100
Restaurant B: 19 positive reviews out of 20

Both have a 95% positive rating. Should they rank equally?
Most people would say no. We trust Restaurant A more because it has much more evidence behind its score.

This is exactly the problem Wilson Score tries to solve.

Why Plain Percentages Fail

A naive ranking system only looks at percentages:
1 positive review out of 1 = 100%
1000 positive reviews out of 1000 = 100%

Clearly these are not equally trustworthy. A single review tells us almost nothing. A thousand reviews tell us a lot.

Wilson Score rewards both quality and evidence.

What Is Your Observed Rate?

Before going further, there is one simple idea to establish.
When you collect reviews, you end up with two numbers: how many were positive, and how many there were in total. Divide the first by the second and you get your observed rate - the percentage of positive reviews you actually saw.

95 positive reviews out of 100 → observed rate = 95 ÷ 100 = 0.95 (or 95%)
19 positive reviews out of 20 → observed rate = 19 ÷ 20 = 0.95 (or 95%)

Both restaurants have the same observed rate. The difference is that one has much more evidence behind it.

In the Wilson Score formula, this observed rate is written as p - just shorthand so the formula doesn't have to spell it out every time. But all it ever means is: the percentage you actually measured.

The Core Idea: Your Observation Is Just One Possibility

Here is the thing most explanations skip over.

When you see 19 out of 20 positive reviews, you naturally say "that restaurant is 95% good." But what you actually observed is just one possible outcome from many.

Imagine you could rewind time and collect reviews again. Maybe this time you'd get 17 out of 20. Or 18. Or 20 out of 20. All of those are realistic results from the same restaurant, just from a different lucky or unlucky sample. The fewer reviews you have, the more those outcomes can vary.
So the honest question isn't "what did I observe?" It's "given what I observed, what is the range of real quality levels that could have produced this?"

That range is called a confidence interval.
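To make the "rewind time" idea concrete, here is a minimal simulation sketch. It assumes, purely for illustration, that the restaurant's real positive rate is 0.95, and it re-collects reviews many times. The function name resample_rates and the seed are my own choices, not part of any library.

```python
import random
import statistics

def resample_rates(true_rate, n_reviews, n_repeats, seed=0):
    """Simulate collecting n_reviews reviews, n_repeats times; return the observed rates."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_repeats):
        # Each review is positive with probability true_rate.
        positives = sum(rng.random() < true_rate for _ in range(n_reviews))
        rates.append(positives / n_reviews)
    return rates

# Assumption for illustration only: the real positive rate is 0.95.
small = resample_rates(0.95, n_reviews=20, n_repeats=2000)
large = resample_rates(0.95, n_reviews=100, n_repeats=2000)

for label, rates in (("20 reviews", small), ("100 reviews", large)):
    spread = statistics.pstdev(rates)
    print(f"{label}: lowest {min(rates):.2f}, highest {max(rates):.2f}, spread {spread:.3f}")
```

The observed rates from 20 reviews should scatter noticeably more than the ones from 100 reviews, even though the underlying restaurant is identical in both cases.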

A Confidence Interval Is Just Honesty About Uncertainty

Instead of saying "it's exactly 95%", you say:
"Based on the evidence we have, the real quality of this restaurant is unlikely to be exactly 95%. There is a range of realistic answers around it."

That range reflects how uncertain you are based on how little evidence you have.

And "95% confidence" simply means: if you repeated this experiment many times and built an interval each time, about 95 out of every 100 of those intervals would contain the real answer. It's not about the rating itself - it's about how trustworthy your estimate is.
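That "repeat the experiment many times" reading can be checked with a small simulation. The sketch below assumes a made-up true rate of 0.8 and 100 reviews per experiment, and uses the Wilson score interval formula as given in the Wikipedia article. The helper names wilson_interval and coverage are hypothetical.

```python
import math
import random

def wilson_interval(positives, total, z=1.96):
    """Wilson score interval (lower, upper) for an observed proportion."""
    p = positives / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - margin, center + margin

def coverage(true_rate, n_reviews, n_experiments, seed=1):
    """Fraction of simulated experiments whose interval contains true_rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        positives = sum(rng.random() < true_rate for _ in range(n_reviews))
        lower, upper = wilson_interval(positives, n_reviews)
        if lower <= true_rate <= upper:
            hits += 1
    return hits / n_experiments

# Assumed values for illustration: true rate 0.8, 100 reviews, 5000 repeats.
rate = coverage(true_rate=0.8, n_reviews=100, n_experiments=5000)
print(f"intervals containing the true rate: {rate:.1%}")
```

Because review counts are whole numbers, the fraction lands near 95% rather than exactly on it.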

Where Does 1.96 Come From?

This was the part that confused me initially.

Think of it as a dial that controls how wide your range is. The wider your range, the more confident you can be that the truth falls inside it.

Multiplier   Confidence   What it means
1.645        90%          narrower range, less sure
1.96         95%          the standard choice
2.576        99%          wider range, very sure

Mathematicians worked out that if you move 1.96 standard deviations to the left and right of the center of a bell curve, you capture roughly 95% of the area under that curve. That's why 1.96 became the standard multiplier for 95% confidence intervals.
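If you would rather see the multiplier fall out of the bell curve than take it on faith, Python's standard library can compute it. This is a small sketch using statistics.NormalDist (available from Python 3.8). The function name z_multiplier is my own.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """How many standard deviations on each side of the center capture `confidence` of a bell curve."""
    tail = (1 - confidence) / 2  # area left out on each side
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> multiplier {z_multiplier(level):.3f}")
```

For 95% confidence, 2.5% of the area is left out on each side, so the multiplier is the point below which 97.5% of the curve sits, which is about 1.96.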

Two Different Meanings of 95%

This distinction matters.

When you say a restaurant's rating is 95%, you mean the observed percentage of positive reviews.

When you say Wilson Score at 95% confidence, you mean you're using a confidence level that corresponds to 1.96 as your multiplier.

These are completely different things:
One is the observed rating.
The other is how much you trust your estimate of that rating.

What Wilson Score Really Asks

Most people think Wilson Score is trying to calculate the true rating. It is not.

Instead, it asks:
Given the amount of evidence we have, what is a conservative lower estimate of the true rating?

For example:
95 positive reviews out of 100 → Wilson lower bound ≈ 88.8%
19 positive reviews out of 20 → Wilson lower bound ≈ 76.4%

Both have a 95% observed rating. But Wilson trusts the first one much more because it's backed by a larger sample.
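For readers who want to reproduce those two numbers, here is a minimal sketch of the Wilson lower bound, following the formula in the Wikipedia article. The function name wilson_lower_bound and the choice to return 0 for zero reviews are my assumptions, not a standard API.

```python
import math

def wilson_lower_bound(positives, total, z=1.96):
    """Lower end of the Wilson score interval; z=1.96 corresponds to 95% confidence."""
    if total == 0:
        return 0.0  # assumption: no reviews means no evidence
    p = positives / total                       # observed rate
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom  # observed rate pulled toward 50%
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - margin

print(f"95/100 -> {wilson_lower_bound(95, 100):.1%}")
print(f"19/20  -> {wilson_lower_bound(19, 20):.1%}")
```

Same observed rate in both calls; only the amount of evidence differs, and that alone moves the lower bound by roughly 12 percentage points.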

Wait, How Did 95% Become 88.8%?

Wilson Score is intentionally conservative.

The observed rating is still 95%. But because you have only a finite number of reviews, there's uncertainty around that number. Wilson subtracts an uncertainty penalty based on the sample size, the confidence level, and the observed rating.

The result is a lower bound that says:
Based on the evidence we have, we are reasonably confident the true rating is at least 88.8%.

The smaller the sample size, the larger the penalty. That's why 19/20 gets pushed down to roughly 76.4%.

Why Ranking Systems Use the Lower Bound

Wilson Score actually produces a full interval - a lower and upper bound.

For 19 positive reviews out of 20:
Lower bound ≈ 76.4%
Upper bound ≈ 99.1%

For 95 positive reviews out of 100:
Lower bound ≈ 88.8%
Upper bound ≈ 97.8%

In other words, the true positive rate is plausibly somewhere inside that range. Notice how this range is much narrower than the range we'd get from only 20 reviews. More evidence means less uncertainty.

So why do ranking systems focus only on the lower bound?
Because the lower bound answers the most useful question:
What's the minimum quality I'm comfortable believing this item has?

Using the upper bound would often favor items with very few reviews. A restaurant with 1 out of 1 positive reviews has a Wilson upper bound of 100%, which is clearly misleading, while its lower bound is only about 21%. The lower bound keeps that restaurant ranked conservatively until more evidence comes in.
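A quick sketch of how the choice of bound changes a ranking. The three entries reuse the counts from this article. The labels A, B, and C and the helper wilson_interval are hypothetical names for illustration.

```python
import math

def wilson_interval(positives, total, z=1.96):
    """Wilson score interval (lower, upper), clamped to the range 0..1."""
    p = positives / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - margin), min(1.0, center + margin)

# (positive reviews, total reviews) for three restaurants.
restaurants = {"A (95/100)": (95, 100), "B (19/20)": (19, 20), "C (1/1)": (1, 1)}
bounds = {name: wilson_interval(*counts) for name, counts in restaurants.items()}

by_lower = sorted(bounds, key=lambda name: bounds[name][0], reverse=True)
by_upper = sorted(bounds, key=lambda name: bounds[name][1], reverse=True)

for name, (lower, upper) in bounds.items():
    print(f"{name}: {lower:.1%} to {upper:.1%}")
print("ranked by lower bound:", by_lower)
print("ranked by upper bound:", by_upper)
```

Ranking by the lower bound puts the single-review restaurant last, while ranking by the upper bound puts it first.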

The Mental Model

Forget the formula. Think of Wilson Score as:
Observed Rating − Uncertainty Penalty

The penalty becomes larger when:
The sample size is small, so there is less evidence available
You want higher confidence

That's why a restaurant with 95 positive reviews out of 100 ranks above one with 19 out of 20, even though both show 95%.
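The mental model can be checked numerically. The sketch below treats the "penalty" as the observed rate minus the Wilson lower bound. That definition is my own shorthand for this article's mental model, not a standard term, and the function names are hypothetical.

```python
import math

def wilson_lower_bound(positives, total, z=1.96):
    """Lower end of the Wilson score interval."""
    p = positives / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - margin

def penalty(positives, total, z=1.96):
    """Gap between the observed rate and the Wilson lower bound."""
    return positives / total - wilson_lower_bound(positives, total, z)

# Same 95% observed rate, growing sample size.
for positives, total in [(19, 20), (95, 100), (950, 1000), (9500, 10000)]:
    print(f"{positives}/{total}: penalty {penalty(positives, total):.1%}")

# Same 95/100 sample, growing confidence level (90%, 95%, 99%).
for z in (1.645, 1.96, 2.576):
    print(f"z={z}: penalty {penalty(95, 100, z):.1%}")
```

The first loop should show the penalty shrinking as reviews accumulate, and the second should show it growing as you demand more confidence.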

Final Thought
The biggest insight is this: Wilson Score is not measuring quality. It is measuring quality adjusted for confidence.

A high percentage with very little evidence is treated cautiously. A high percentage with lots of evidence is trusted.

And that mysterious 1.96? It's simply the number that says: "Let's be 95% confident before we make claims." Nothing magical about it. Just a dial set to the most common standard.

The more reviews you collect, the smaller your uncertainty penalty, and the closer your Wilson Score gets to your observed rating. Evidence earns trust. That's really all there is to it.

Back to My Experiment

Link to the page of my experiment: WIP.

In my case, I wasn't ranking restaurants. I was measuring whether placing an instruction in the system prompt vs the user prompt vs the tool description made a real difference in how often an LLM followed it.

For each placement, I got a compliance rate - say the system prompt got 76% compliance across 50 runs and the user prompt got 62%.

The raw percentages tell me which placement looked better. But Wilson Score tells me something more useful:
"Is the gap between 76% and 62% real - or could it just be luck from 50 runs?"

Here is how to read the result:
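If the Wilson intervals of two placements do not overlap → the difference is very likely real. One placement genuinely works better.
If they do overlap → the overlap check alone cannot tell you which is better. It is a conservative check, so a direct test on the difference (for example, a two-proportion z-test) may still detect one. Otherwise, you need more runs.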

So in plain English, Wilson Score told me: "You ran 50 trials. System prompt got 76% compliance. The true compliance rate is somewhere between X% and Y% with 95% confidence. If that range does not overlap with the user prompt's range, system prompt is very likely better - not just luckier."

That is what I actually needed to know. Not a ranking. Not a score. Just: is this difference real?
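Here is a sketch of that overlap check using the illustrative rates above. It assumes both placements had 50 runs, so 76% becomes 38 compliant runs and 62% becomes 31. Those counts are my assumption, derived from the example percentages, and are not real experiment results. The helper names are hypothetical.

```python
import math

def wilson_interval(successes, runs, z=1.96):
    """Wilson score interval (lower, upper) for a compliance rate."""
    p = successes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    margin = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return center - margin, center + margin

def intervals_overlap(a, b):
    """True if the two (lower, upper) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Illustrative counts only: 76% and 62% of an assumed 50 runs each.
system_prompt = wilson_interval(38, 50)
user_prompt = wilson_interval(31, 50)

print(f"system prompt: {system_prompt[0]:.1%} to {system_prompt[1]:.1%}")
print(f"user prompt:   {user_prompt[0]:.1%} to {user_prompt[1]:.1%}")
print("overlap:", intervals_overlap(system_prompt, user_prompt))
```

With these example numbers the two intervals overlap, so 50 runs per placement would not be enough for the overlap check to separate a 76% rate from a 62% rate.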

Further Reading
If you want to go deeper - including the actual formula - the Wikipedia article on the Wilson score interval is a good next step.

To statisticians and experts in the field: Please comment if there is a mistake in my explanation.

DEV Community