Your p-value answered a question you didn't ask.

#datascience #python #statistics

You ran the A/B test. It came back p = 0.08. "Not significant." So you killed the feature and moved on. You may have just buried something that works — not because the math lied, but because it answered a different question than the one in your head.

This is the most common statistics mistake I see in data teams, and it's not about being bad at math. It's about a quiet mismatch between the question you're asking and the question the test is answering. Let's fix that, with code, and keep the examples generic — we'll use a plain conversion test, the kind every team runs.

What a p-value actually is (read this twice)

Here's the precise definition, and then the plain-English one.

Precise: a p-value is the probability of observing data at least as extreme as yours, assuming there is no real effect at all.

Plain: it answers "if nothing were going on, how surprised should I be by results this big?" A small p-value means "pretty surprised — random noise doesn't usually produce a gap this large."

Now read what it does not say. It does not tell you the probability that your effect is real. It does not tell you the probability you're wrong. Those are the questions you actually care about — and the p-value answers neither. It answers a question about a hypothetical world where the effect is zero. That's the mismatch, right there.
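To make that definition concrete, here is a minimal simulation sketch. It assumes the conversion counts from the example that follows (120 and 145 out of 2,400 each), pools them into a single "nothing is going on" rate, and counts how often pure chance produces a gap at least as large. The names n_sims, null_rate, and p_sim are mine, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Observed counts (same numbers as the conversion example in this post).
a_conv, a_n = 120, 2400
b_conv, b_n = 145, 2400
# Arms are equal-sized, so a gap in counts is equivalent to a gap in rates.
observed_gap = abs(b_conv - a_conv)

# The "null world": both variants share one true rate, estimated by pooling.
null_rate = (a_conv + b_conv) / (a_n + b_n)

# Replay the experiment many times in that world.
n_sims = 200_000  # assumption: large enough for a stable estimate
sim_a = rng.binomial(a_n, null_rate, n_sims)
sim_b = rng.binomial(b_n, null_rate, n_sims)

# Fraction of null-world experiments with a gap at least as extreme as ours.
p_sim = np.mean(np.abs(sim_b - sim_a) >= observed_gap)
print(f"simulated two-sided p-value: {p_sim:.3f}")
```

The fraction should land in the same neighborhood as the z-test's p-value, typically a touch higher because conversion counts are discrete. Either way, it is a statement about the null world, not about the probability that B works.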

Let's run one.

import numpy as np
from scipy import stats

# Variant A: 120 conversions out of 2400.  Variant B: 145 out of 2400.
a_conv, a_n = 120, 2400
b_conv, b_n = 145, 2400

p_a, p_b = a_conv / a_n, b_conv / b_n          # 5.0% vs ~6.04%
p_pool   = (a_conv + b_conv) / (a_n + b_n)

# two-proportion z-test, computed by hand so nothing is hidden
se = np.sqrt(p_pool * (1 - p_pool) * (1/a_n + 1/b_n))
z  = (p_b - p_a) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

print(f"A: {p_a:.3%}   B: {p_b:.3%}")
print(f"z = {z:.2f},  p = {p_value:.3f}")

You'll get something like p ≈ 0.11 (z ≈ 1.58). By the sacred 0.05 threshold, "not significant." Most teams stop here and conclude B doesn't work.

That conclusion is wrong, or at least unsupported. Watch.

"Not significant" means "not enough evidence," not "no effect"

"Not significant" does not mean "B is the same as A." It means "we don't have enough evidence to rule out luck as one possible explanation." Those are wildly different statements, and conflating them is how good ideas get killed.

Here's a demonstration that the verdict tracks sample size, not reality. Keep the observed conversion rates identical (5.0% vs 6.04%, the exact same measured gap) and just collect more data:

for scale in [1, 4, 10]:
    a_n2, b_n2 = a_n * scale, b_n * scale
    a_c2, b_c2 = round(p_a * a_n2), round(p_b * b_n2)
    p_pool = (a_c2 + b_c2) / (a_n2 + b_n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1/a_n2 + 1/b_n2))
    z  = (p_b - p_a) / se
    p  = 2 * (1 - stats.norm.cdf(abs(z)))
    print(f"sample x{scale:<2}  n={a_n2:>6}  p={p:.4f}")

Same effect every time. But the p-value marches from "not significant" to "highly significant" purely because the sample grew:

sample x1   n=  2400  p=0.1141
sample x4   n=  9600  p=0.0016
sample x10  n= 24000  p=0.0000

The effect never changed. Only the evidence did. So "not significant" was never a verdict about whether B works — it was a verdict about whether you'd collected enough data to tell. And "not enough data" is a collection problem, a sampling problem, a pipeline problem, long before it's a statistics problem. If your experiment is underpowered, no test will save you — the answer was decided when you chose the sample size.
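How much data would have been enough? Here is a rough sketch using the standard normal-approximation sample-size formula for a two-sided two-proportion test. The function name sample_size_per_arm and the defaults (alpha = 0.05, 80% power) are my assumptions, not a rule; plug in whatever your team has agreed on.

```python
import math
from scipy import stats

def sample_size_per_arm(p_base, p_variant, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided two-proportion z-test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical value for the test
    z_power = stats.norm.ppf(power)           # quantile for the desired power
    p_bar = (p_base + p_variant) / 2          # pooled rate under "no effect"
    null_sd = math.sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = math.sqrt(p_base * (1 - p_base) + p_variant * (1 - p_variant))
    n = ((z_alpha * null_sd + z_power * alt_sd) / (p_variant - p_base)) ** 2
    return math.ceil(n)

# Same rates as the running example: 5.0% vs ~6.04%.
n_needed = sample_size_per_arm(0.05, 145 / 2400)
print(f"users needed per arm: {n_needed}")
```

With these inputs the answer comes out around 7,500 users per arm, roughly three times the 2,400 the example collected. Treat it as a planning estimate; it assumes the true rates really are the ones you typed in.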

What the Bayesian crowd does instead

This is where the other camp has a genuine point. Instead of asking "how surprised would I be if the effect were zero," the Bayesian approach asks the question you actually wanted answered all along: "given this data, how probable is it that B beats A, and by how much?"

The cleanest version, for conversion rates, uses a Beta distribution — and it's only a few lines because conversions are just successes-out-of-trials:

rng = np.random.default_rng(0)

# Beta(1,1) is a flat "I know nothing yet" prior.
# Update it with what we observed: conversions and non-conversions.
post_a = rng.beta(1 + a_conv, 1 + a_n - a_conv, 200_000)
post_b = rng.beta(1 + b_conv, 1 + b_n - b_conv, 200_000)

prob_b_better = (post_b > post_a).mean()
lift          = (post_b - post_a) / post_a

print(f"P(B beats A)          = {prob_b_better:.1%}")
print(f"median relative lift  = {np.median(lift):.1%}")
print(f"95% credible interval = "
      f"[{np.percentile(lift, 2.5):.1%}, {np.percentile(lift, 97.5):.1%}]")

This might tell you P(B beats A) ≈ 94%. Read that against where we started. The frequentist test said "not significant, p ≈ 0.11" and you killed the feature. The Bayesian view says "there's about a 94% chance B is better," which points to a very different business decision from the same data. With a flat prior, that 94% is close to one minus the one-sided p-value, so the data didn't change. The question did.

Notice the other gift: a credible interval. A Bayesian 95% credible interval means what everyone wrongly thinks a confidence interval means: "given the model and the prior, there's a 95% probability the true lift is in this range." It directly describes your uncertainty about the effect size, which is usually the actual thing you're trying to decide on.
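The same posterior draws can answer the "by how much" part in absolute terms too. A small sketch, reusing the flat Beta(1, 1) prior from above; the threshold min_gain (half a percentage point) is a hypothetical business cutoff I made up for illustration, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)

a_conv, a_n = 120, 2400
b_conv, b_n = 145, 2400

# Posterior draws for each variant's conversion rate under a flat prior.
post_a = rng.beta(1 + a_conv, 1 + a_n - a_conv, 200_000)
post_b = rng.beta(1 + b_conv, 1 + b_n - b_conv, 200_000)

# Absolute difference in conversion rate (B minus A), draw by draw.
diff = post_b - post_a

min_gain = 0.005  # hypothetical: smallest absolute gain worth shipping
prob_worth_it = (diff > min_gain).mean()
lo, hi = np.percentile(diff, [2.5, 97.5])

print(f"P(B beats A by > {min_gain:.1%}) = {prob_worth_it:.1%}")
print(f"95% credible interval for B - A = [{lo:+.2%}, {hi:+.2%}]")
```

The interval still includes zero, so this is not a guarantee. It just puts the uncertainty in the units the decision is made in.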

You don't have to pick a religion

I'm not here to recruit you to a camp. Frequentist methods are fast, standard, and fine — when you respect what they're telling you. The failure isn't using a p-value. The failure is reading "not significant" as "no effect" and treating a sample-size problem as a verdict from nature.

So, practically:

→ Power your test before you run it. Decide the smallest effect worth caring about and compute the sample size that can actually detect it. Many "inconclusive" tests were doomed at the planning stage: underpowered before a single user saw the variant.

→ Report the effect size and its uncertainty, not just a yes/no. "B is up ~1 percentage point, 95% interval roughly [−0.3, +2.3] points" tells a decision-maker far more than "not significant," which tells them almost nothing.

→ When the cost of waiting is high, ask the Bayesian question directly. "How likely is B to win, and by how much" is often the decision you're actually making — so compute that, instead of a threshold on a hypothetical null world.

→ Remember the test only sees the data your pipeline fed it. Garbage sampling, a logging bug, a biased assignment — the math will faithfully analyze bad data and hand you a confident wrong answer. The statistics are downstream of the data engineering, always.
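For the "effect size and its uncertainty" bullet, here is one way to get the numbers, sketched with a plain Wald interval on the difference in proportions. That choice of interval is my assumption; other intervals exist and may behave better with small samples or rare conversions.

```python
import math
from scipy import stats

a_conv, a_n = 120, 2400
b_conv, b_n = 145, 2400
p_a, p_b = a_conv / a_n, b_conv / b_n

# Point estimate of the effect: absolute difference in conversion rate.
diff = p_b - p_a

# Unpooled standard error, since we are estimating the gap, not testing zero.
se_diff = math.sqrt(p_a * (1 - p_a) / a_n + p_b * (1 - p_b) / b_n)

z_crit = stats.norm.ppf(0.975)  # two-sided 95%
ci_low, ci_high = diff - z_crit * se_diff, diff + z_crit * se_diff

print(f"B - A = {diff * 100:+.2f} points, "
      f"95% CI [{ci_low * 100:+.2f}, {ci_high * 100:+.2f}]")
```

An interval that runs from slightly below zero to more than two points says "probably positive, size uncertain," which is a much more useful sentence than "not significant."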

The principle: know which question your number is answering before you let it make your decision. A p-value is a fine tool pointed at one specific question — just rarely the one you were actually asking.

Last time a test came back "not significant" — did you conclude the idea was dead, or that you just didn't have enough data yet? Those lead to opposite roadmaps, and most teams pick the first one without noticing they had a choice.

This touches statistical methods that get misused in high-stakes settings, so treat the code here as a teaching starting point, not a turnkey decision system for anything that actually matters — validate it against your own data and context.

I'm Vinicius Fagundes — principal data engineer, independent, and an MBA lecturer in São Paulo. I build the pipelines and experiment plumbing that feed analyses like these. That work lives at vf-insights.com.