Artem Terekhov

How long should we run an A/B test?

Most teams don't struggle with the math — they struggle with the question before the math. You open a calculator, plug in some numbers, get a sample size. But where did those numbers come from? Usually: a gut feeling, a copied benchmark, or whatever the LLM suggested.

The thing is, there's a clear framework for this. And once you see it, the calculator stops being a black box.


The framework: MDE is a product decision, not a statistical parameter

Everything in sample size calculation flows from one number: the Minimum Detectable Effect — the smallest improvement you want to be able to detect.

Most people treat MDE as a technical input. It's not. It's the answer to a product question:

"What's the smallest lift that would justify shipping this variant?"

If your team spent two sprints building a feature, it's probably not worth shipping unless it moves conversion by at least 5%. That's your MDE. Not 1%, not 0.5% — because even if those lifts are real, they don't change your decision.

This reframe matters because of one mathematical reality:

Required sample size scales with $\frac{1}{\text{MDE}^2}$.

Halve your MDE, and you need four times the sample. Want to detect a 5% relative lift instead of 10%? Four times the visitors. 1% instead of 10%? One hundred times.

Everything else — significance level, statistical power — adjusts sample size linearly. MDE adjusts it quadratically. It's the only input that actually matters at order-of-magnitude scale.
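To make the scaling concrete, here's a quick sketch of the relative sample-size factors implied by $\frac{1}{\text{MDE}^2}$, normalized to a 10% MDE (plain Python, nothing beyond arithmetic):

```python
# 1/MDE^2 scaling: how much more sample each MDE needs,
# relative to a 10% MDE baseline.
base_mde = 0.10

for mde in (0.10, 0.05, 0.01):
    factor = (base_mde / mde) ** 2
    print(f"MDE {mde:.0%}: {factor:.0f}x the sample")
# MDE 10%: 1x, MDE 5%: 4x, MDE 1%: 100x
```

Halving the MDE quadruples the sample; cutting it tenfold multiplies it by a hundred, exactly as above.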


The formula

This is here just for reference; bear with me.

For a standard two-proportion z-test:

$$ n = \frac{(Z_{\alpha/2} + Z_{\beta})^2 \cdot (p_1(1-p_1) + p_2(1-p_2))}{(p_1 - p_2)^2} $$

  • $Z_{\alpha/2} = 1.96$ at standard $\alpha = 0.05$ (95% confidence) — controls false positive rate
  • $Z_{\beta} = 0.84$ at 80% power — probability of detecting a real effect when one exists
  • $p_1$ — your baseline conversion rate, from actual data, 2-4 weeks minimum
  • $p_2$ — expected variant rate
  • $p_2 - p_1$ — your MDE
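The formula above is a few lines of code. A minimal sketch in plain Python, using the z-values listed above (the function name is mine):

```python
import math

def sample_size_two_proportions(p1: float, p2: float,
                                z_alpha: float = 1.96,
                                z_beta: float = 0.84) -> int:
    """Required sample size PER VARIANT for a two-proportion z-test.

    p1: baseline conversion rate, p2: expected variant rate.
    Defaults correspond to alpha = 0.05 (two-sided) and 80% power.
    """
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)

# Baseline 5%, expecting a 20% relative lift (p2 = 6%):
print(sample_size_two_proportions(0.05, 0.06))  # roughly 8,150 per variant
```

Note that the result is *per variant*: a 50/50 A/B test needs twice this many visitors in total.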

For continuous metrics (revenue, session duration), the formula changes:

$$ n = \frac{(Z_{\alpha/2} + Z_{\beta})^2 \cdot 2\sigma^2}{\delta^2} $$

Where $\sigma$ is the standard deviation of your metric and $\delta$ is the minimum detectable difference in means. Revenue data with heavy right tails can require 2-10x more sample than a comparable binary metric. Worth a separate conversation.
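The continuous-metric version is just as short. A sketch with made-up revenue numbers (the $50 standard deviation and $2 shift are illustrative, not benchmarks):

```python
import math

def sample_size_means(sigma: float, delta: float,
                      z_alpha: float = 1.96,
                      z_beta: float = 0.84) -> int:
    """Required sample size per variant to detect a difference in
    means of `delta`, given metric standard deviation `sigma`."""
    n = (z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)

# Hypothetical: ARPU with std dev $50, want to detect a $2 shift.
print(sample_size_means(sigma=50, delta=2))  # roughly 9,800 per variant
```

The $\sigma^2 / \delta^2$ ratio is what drives the "heavy right tails are expensive" point: a fatter tail inflates $\sigma$, and sample size grows with its square.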


Three ways to get MDE wrong

1. Set MDE from wishful thinking, not from a product decision

"We want to catch even a 1% relative lift — that's a lot of revenue at our scale."

Sounds reasonable. Let's see what it actually means for a checkout page converting at 3%:

$$ p_1 = 0.03,\ p_2 = 0.0303 \implies n \approx 5{,}100{,}000 $$

And that's per variant! At 10,000 visitors/day — that's roughly 1,000 days.
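The arithmetic behind that number, as a quick sketch (assuming a 50/50 traffic split, so each variant gets 5,000 of the 10,000 daily visitors):

```python
# Two-proportion z-test sample size for a 1% relative lift
# on a 3% baseline: alpha = 0.05, 80% power.
z_sq = (1.96 + 0.84) ** 2
p1, p2 = 0.03, 0.0303

n = z_sq * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
days = n / 5_000  # 10,000 visitors/day split evenly across two variants

print(f"n per variant ≈ {n:,.0f}, duration ≈ {days:,.0f} days")
```

Roughly 5.1 million visitors per variant, on the order of a thousand days.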

The problem isn't the math. The problem is that MDE was chosen based on what would be nice to know, not on what would change the decision. If a 1% lift wouldn't justify the engineering cost of the variant, why are you designing an experiment to detect it?

The fix: anchor MDE to your decision threshold. Ask the team — "at what lift would we definitely ship this?" Start there.

2. Skip checking whether that MDE is feasible at your baseline

A 10% relative lift sounds like a consistent benchmark. It's not — it means completely different things at different baselines.

| Baseline Rate | Relative MDE | Absolute MDE | Sample per Variant | At 1K visitors/day |
|---------------|--------------|--------------|--------------------|--------------------|
| 2%            | 10%          | 0.2pp        | ~80,700            | ~162 days          |
| 5%            | 10%          | 0.5pp        | ~31,200            | ~63 days           |
| 10%           | 10%          | 1.0pp        | ~14,750            | ~30 days           |
| 5%            | 20%          | 1.0pp        | ~8,200             | ~17 days           |
| 10%           | 20%          | 2.0pp        | ~3,850             | ~8 days            |

Same relative MDE, more than a five-fold difference in required sample. Low-conversion pages (free-to-paid at 1-2%) are a completely different ballgame from high-traffic surfaces (add-to-cart at 15%).

If the math says 160 days — the answer isn't "let's just run it." The answer is: increase your MDE, find a higher-traffic surface, or pick a different metric entirely.

3. Confuse sample size with calendar time (pure hygiene, but worth stating clearly)

Sample size is observations. Test duration is calendar days. These are related but different constraints.

The formula gives you $n$ — required visitors per variant. To convert to days:

$$ \text{days} = \frac{n}{\text{daily visitors per variant}} $$

But that's your floor, not your answer.

User behavior varies by day of week. E-commerce converts differently on Tuesday vs. Sunday. B2B SaaS looks different on Monday vs. Friday. If your formula says three days and you run Tuesday to Thursday, you're sampling from a biased slice of the week.

The rule: always run for at least one full week, ideally two — regardless of what the sample size formula says. If you need eight days of data, run for fourteen.

Same logic applies to pay cycles, promotional windows, or recurring traffic spikes — make sure your test window covers a representative period, not just a convenient one.


Pre-flight checklist

  1. Define your MDE before touching a calculator — what's the smallest lift that would justify shipping?
  2. Check feasibility at your baseline and traffic — if it's more than 4-6 weeks, revisit the MDE or the surface
  3. Use the right formula for your metric — proportions for binary, variance-based for continuous, delta method for ratios
  4. Convert sample size to calendar time with a weekly buffer — minimum one week, ideally two
  5. Don't peek — or use a sequential testing framework if you have to. Repeated checks without correction inflate your false positive rate well beyond 5%

Remember the first principles; you can catch up on the details any time.

To run these calculations you can use any of the thousands of A/B test calculators out there. I even made one myself just to have it handy (and to be confident about what's happening under the hood). Feel free to use mine if you want one:

Remember: the math is straightforward once you have the right inputs. The hard part is getting those inputs right — and that's a product conversation, not a statistics one.


Curious how others think about MDE in practice — share in the comments.
