Why Your Non-Significant Benchmark Result Might Be a Power Problem (Not a Model Problem)

In Week 11, Tenacious-Bench reported:

Delta A = -2.34 pts, 95% CI [-11.09, +6.20], p = 0.71 (not significant)
Delta B = +22.18 pts, 95% CI [+14.43, +29.82], p = 0.0 (reported significant)
At first glance, this looks straightforward: one result is meaningful, one is not.
But this interpretation can be wrong if the benchmark is underpowered for the effect sizes we actually care about.

This post answers two practical questions:

With 216 binary pass/fail tasks, what size improvement can this benchmark reliably detect at 80% power?
Is reporting p = 0.0 valid when bootstrapping with 2,000 samples?
1) The key statistical gap: significance without power is incomplete
A p-value tells you whether observed data are unusual under a null model.
It does not tell you whether your benchmark was large enough to detect a small-but-real improvement.

So p = 0.71 can mean either:

there is truly no effect, or
there is a small effect, but your benchmark has low detection power.
Those two cases call for very different next steps in model iteration.

2) MDE at current benchmark size (216 tasks)
Using a standard two-proportion planning approximation with:

baseline pass rate ≈ 74%
alpha = 0.05 (two-sided)
power = 0.80
n = 216 tasks
the minimum detectable effect (MDE) is about +10.9 percentage points.

That is the core result.

If your practical target is +3 to +5 points, 216 tasks is too small for reliable detection.
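For readers who want to reproduce the number, here is a minimal sketch of the calculation, assuming the standard normal-approximation sample-size formula for two independent proportions and treating the 216 tasks as one group per model. The helper names (`required_n`, `mde`) are illustrative, not part of Tenacious-Bench's tooling.

```python
from scipy.stats import norm

def required_n(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Per-group task count for a two-sided two-proportion z-test (normal approximation)."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p2 - p1) ** 2

def mde(p1: float, n: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest lift over p1 that n tasks can detect at the target power (bisection)."""
    lo, hi = 1e-6, 1.0 - p1 - 1e-6
    for _ in range(60):
        mid = (lo + hi) / 2
        if required_n(p1, p1 + mid, alpha, power) > n:
            lo = mid  # lift too small to detect with n tasks -> search larger lifts
        else:
            hi = mid
    return hi

print(f"80%-power MDE at 216 tasks, 74% baseline: +{mde(0.74, 216) * 100:.1f} points")
# -> roughly +10.9 points
```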

3) Reinterpreting Delta A (p = 0.71)
Given this power profile, Delta A is better interpreted as inconclusive for small effects, not as definitive evidence of no improvement.

Approximate detection probabilities at n=216:

true +3 pt effect -> ~11% detection chance
true +5 pt effect -> ~23% detection chance
true +8 pt effect -> ~52% detection chance
So failing to reject the null is the expected outcome most of the time when the true lift is +3 or +5 points.
This is exactly why a non-significant p-value should not be read as proof of no effect when power is low.
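Under the same planning approximation, those detection chances can be sketched directly; `detection_power` is an illustrative helper, not the benchmark's actual code, and it uses the usual pooled-variance normal approximation.

```python
from scipy.stats import norm

def detection_power(p1: float, delta: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test when the true lift is delta."""
    p2 = p1 + delta
    z_a = norm.ppf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se0 = (2 * p_bar * (1 - p_bar)) ** 0.5        # variance factor under H0 (pooled)
    se1 = (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5  # variance factor under H1
    return norm.cdf((delta * n ** 0.5 - z_a * se0) / se1)

for lift in (0.03, 0.05, 0.08):
    print(f"+{lift * 100:.0f} pt true effect -> ~{detection_power(0.74, lift, 216):.0%} detection chance")
# -> ~11%, ~23%, ~52%
```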

4) How large Tenacious-Bench v0.2 should be
At the same baseline and test settings, target task counts are approximately:

+3 pt detection -> 3,226 tasks
+5 pt detection -> 1,128 tasks
+8 pt detection -> 420 tasks
Design implication:

If +5 points is your minimum meaningful lift, v0.2 should have roughly 1,100+ tasks.
If +3 points matters, v0.2 needs multi-thousand scale.
If you only care about large lifts (around +8 points), roughly 420 tasks can be enough.
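Running the same planning formula in reverse reproduces these counts. This is a sketch under the same assumptions as above (independent two-proportion comparison, 74% baseline, per-group counts); `required_n` is an illustrative name.

```python
from math import ceil
from scipy.stats import norm

def required_n(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Per-group task count from the two-proportion planning approximation."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p2 - p1) ** 2

for lift in (0.03, 0.05, 0.08):
    print(f"+{lift * 100:.0f} pt detection -> {ceil(required_n(0.74, 0.74 + lift))} tasks")
# -> 3226, 1128, 420 tasks
```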
5) Correcting the bootstrap p-value (p = 0.0)
With finite bootstrap/Monte Carlo resampling, p = 0.0 is not valid.

Use the corrected empirical p-value:

p = (r + 1) / (B + 1)

where:

B = number of resamples
r = count of resamples at least as extreme as the observed statistic
For B = 2000 and r = 0:

p = 1 / 2001 ≈ 0.00050
Correct reporting:

bootstrap p ≈ 0.0005, or
bootstrap p <= 1/2001, or
bootstrap p < 0.001
Not correct: p = 0.0.
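A minimal sketch of the correction, assuming you already have the resampled statistics under the null; `corrected_empirical_p` and `null_stats` are illustrative names, not Tenacious-Bench's actual API.

```python
import numpy as np

def corrected_empirical_p(null_stats: np.ndarray, observed: float) -> float:
    """p = (r + 1) / (B + 1): r counts null resamples at least as extreme as the
    observed statistic, so the reported p can never fall below 1 / (B + 1)."""
    B = null_stats.size
    r = int((null_stats >= observed).sum())  # one-sided; adapt the comparison for two-sided tests
    return (r + 1) / (B + 1)

# Delta B's case: B = 2000 resamples, r = 0 exceedances
print((0 + 1) / (2000 + 1))  # ~0.0005 -> report p ≈ 0.0005 or p < 0.001, never p = 0.0
```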

6) Suggested report rewrite
A defensible rewrite is:

“Under a standard two-proportion planning approximation (baseline ~74%), a 216-task benchmark has an 80%-power MDE of about +10.9 points. Therefore, Delta A = -2.34 pts (95% CI [-11.09, +6.20], p = 0.71) is inconclusive for small practical gains, not definitive evidence of no effect. To detect +3/+5/+8 point gains at 80% power, v0.2 would require approximately 3,226 / 1,128 / 420 tasks. For Delta B with 2,000 bootstrap samples, report p ≈ 0.0005 (or p < 0.001), not p = 0.0.”

Final takeaway
Evaluation gives you a score difference.
Statistics tells you whether your benchmark could detect the difference you care about.

For Tenacious-Bench, the Day 4 conclusion is simple and actionable:

keep reporting CIs and p-values,
but add MDE + power-based sample-size planning as a first-class benchmark design step.
That turns “not significant” from an ambiguous label into a decision-ready result.
