Natnael Alemseged

Posted on May 8

Why Pairing Your Bootstrap Is Necessary — And When It Stops Helping

#machinelearning #statistics #llm #evaluation

A colleague's paired_bootstrap function resamples one set of 48 task indices and applies it to both the trained LoRA
scores and the baseline scores. The question: what mathematical property makes that the correct procedure — and would an
unpaired bootstrap have changed the reviewer-facing conclusion?

The short answer: pairing is correct by experimental design. When the two score vectors have positive covariance,
pairing reduces the model-based standard error; in this specific data the correlation is near-zero (r = 0.167), so the
paired and unpaired bootstrap CIs are practically identical — and neither changes the reviewer-facing conclusion.

Here is why, from first principles.

The experimental design justification: why pairing is valid at all

The 48 held-out tasks were not drawn independently for the baseline and then re-drawn independently for the trained
LoRA. The same 48 tasks were evaluated under both systems. Each task is a repeated measurement on the same subject —
this is a within-subject design (as opposed to a between-subject design where each group sees different samples),
and it is what makes pairing the correct procedure.

If the 48 baseline tasks and the 48 trained-LoRA tasks were different tasks drawn from the same population, unpaired
bootstrap would be correct. But here, resampling index 13 means "draw task 13 for both models together." Resampling each
vector independently breaks that structure and estimates uncertainty for a different experiment: baseline and LoRA
evaluated on unrelated task samples.

This distinction matters before any formula. The formula follows the design; the design is what you defend to the
reviewer.

The variance-reduction mechanism: the math behind why pairing helps

Once you have established that pairing is correct, the question is how much it helps. The bootstrap works by
resampling your data with replacement thousands of times to estimate the sampling distribution of a statistic — here,
the mean lift between two systems (Efron & Hastie, 2016). The standard error of the
mean paired lift is:

SE_paired = sqrt((Var(A) + Var(B) − 2·Cov(A, B)) / n)

where A is the baseline binary score vector, B is the trained-LoRA binary score vector, and n = 48.

The unpaired standard error treats A and B as independent, so the covariance term drops:

SE_unpaired = sqrt((Var(A) + Var(B)) / n)

The key distinction: a paired design estimates E[B - A] — expected within-task lift. An unpaired design estimates
E[B] - E[A] as if the two means came from unrelated samples. Same point estimate, different uncertainty model.

Pairing helps in proportion to the covariance between the two score vectors. If tasks where the baseline passes tend
also to be tasks where the trained model passes, the covariance is large and positive, the numerator shrinks, and the
paired SE is meaningfully smaller. If the two models fail and pass on largely different tasks — low covariance —
pairing buys almost nothing in precision, even though it remains the correct design.

The actual numbers

From the held-out evaluation traces:

	Trained LoRA passes	Trained LoRA fails
Baseline passes	15	1
Baseline fails	26	6

Baseline: 16 passes, 32 fails → pass rate p_A = 0.333
Trained LoRA: 41 passes, 7 fails → pass rate p_B = 0.854
Pearson r(A, B) = 0.167
Var(A) = 0.333 · 0.667 = 0.222 ; Var(B) = 0.854 · 0.146 = 0.125 ; Va + Vb = 0.347
Cov(A, B) = 0.167 · sqrt(0.222 · 0.125) ≈ 0.028

The task-level difference vector makes the paired structure visible:

+1 on 26 tasks where trained LoRA passes and baseline fails
-1 on 1 task where baseline passes and trained LoRA fails
0 on 21 tasks where both systems agree

The paired bootstrap resamples this population of task-level differences. The unpaired bootstrap destroys these
relationships by drawing baseline and trained outcomes independently.

Plugging in:

SE_paired   = sqrt((0.347 − 2·0.028) / 48) = sqrt(0.291 / 48) ≈ 0.0779
SE_unpaired = sqrt(0.347 / 48)             ≈ 0.0850

The paired SE is about 8.4% smaller — real but modest, because the covariance is small relative to
Var(A) + Var(B).

Empirical simulation

import numpy as np

rng = np.random.default_rng(42)
n_boot = 100_000

# Task-level binary outcomes ordered by contingency cell:
# both-pass (15), baseline-only-pass (1), trained-only-pass (26), both-fail (6)
baseline = np.array([1] * 15 + [1] + [0] * 26 + [0] * 6)
trained = np.array([1] * 15 + [0] + [1] * 26 + [0] * 6)

paired = []
unpaired = []
for _ in range(n_boot):
    idx = rng.integers(0, 48, 48)
    i_a = rng.integers(0, 48, 48)
    i_b = rng.integers(0, 48, 48)
    paired.append((trained[idx] - baseline[idx]).mean())
    unpaired.append(trained[i_b].mean() - baseline[i_a].mean())

print(np.percentile(paired, [2.5, 97.5]) * 100)
print(np.percentile(unpaired, [2.5, 97.5]) * 100)

Output:

Paired CI:   [+35.4, +68.8] percentage points  — width 33.3 pp
Unpaired CI: [+35.4, +66.7] percentage points  — width 31.3 pp

The two CIs are essentially identical — exactly what the near-zero covariance predicts. The SE formula says pairing
should modestly reduce the model-based standard error, but percentile bootstrap CIs (the 2.5th and 97.5th
percentiles of the resampled distribution) on binary-difference data are not symmetric ±1.96·SE intervals. Their tails
shift independently because the empirical distribution is discrete and skewed. The slight width inversion is not a
contradiction: pairing is still the right design, but here it does not buy meaningful precision.

Does this change the reviewer conclusion?

The reviewer-facing claim is: "The LoRA adapter's lift is statistically significant above zero."

The critical boundary is whether the CI lower bound stays positive. Both paired and unpaired bootstrap give a lower
bound of +35.4 percentage points — far above zero. Neither variant threatens the significance verdict. A CI
of [−2, +54] would change the conclusion; [+12, +40] would not. The actual data stays nowhere near the dangerous
boundary regardless of which bootstrap method is used.

Pairing is correct by experimental design, and in this experiment it makes no difference to the reviewer conclusion —
because the near-zero correlation means pairing provides almost no variance reduction.

Adjacent concepts worth connecting

When does pairing matter most? (Dror et al., 2017) When tasks are
heterogeneous in difficulty and both models are sensitive to that difficulty. If hard tasks fail both models and easy
tasks pass both, r(A,B) is large and paired bootstrapping can cut CI width sharply. In this data, the dominant pattern
is trained LoRA passing where baseline fails — pushing correlation toward zero and making pairing nearly irrelevant for
variance reduction.

When does low correlation arise in LLM evals? Near-zero r(A,B) signals a large capability gap: the stronger model
succeeds on tasks too hard for the weaker one, so their pass/fail patterns decorrelate. That is good news for the
trained model's lift, but it means paired bootstrapping loses its statistical efficiency advantage precisely when the
lift is largest.

Pointers

Papers:

Efron, B. & Hastie, T. (2016). Computer Age Statistical Inference, Ch. 11 — Bootstrap Confidence Intervals — authoritative treatment of bootstrap CIs for paired designs. Freely available via Stanford.
Dror, R. et al. ( 2017). Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets (TACL) — canonical reference for paired bootstrap and permutation tests in NLP evaluation.

Tool: NumPy default_rng + bootstrap loop — reproducible in a Colab cell with no additional dependencies.

Follow-on: For a valid one-sided p-value (not just a CI), use a paired permutation test: randomly flip the sign of
each task-pair's difference and count how often the null mean exceeds the observed mean. The bootstrap percentile CI
lower bound being positive is consistent with significance but is not a p-value.

Top comments (1)

Alex Morgan • May 9

The paired bootstrap problem is one I haven't seen written about clearly before. The key point you're making — that the paired design has higher statistical power because it cancels task-level variance — is exactly why this matters for eval design. Most people reach for an independent samples test by default and then wonder why they can't detect meaningful improvements without huge sample sizes. Bookmarking this for the next time someone asks why their 100-sample eval set keeps showing non-significant results.