# A/B Testing Statistical Framework
Complete A/B testing toolkit with sample size calculators, frequentist and Bayesian significance tests, sequential testing support, and automated results reporting. Built for analysts who need statistically rigorous experiment design without heavyweight platforms.
## Key Features
- Sample Size Calculator — compute required sample per variant given baseline rate, MDE, power, and significance level
- Frequentist Significance Tests — z-test and chi-squared tests for proportions and means with confidence intervals
- Bayesian A/B Analysis — Beta-Binomial posterior with credible intervals and probability-to-beat-control
- Sequential Testing — alpha-spending functions (O'Brien-Fleming, Pocock) for early stopping
- Segmented Analysis — break results by device, geo, or any custom dimension
- Power Analysis Charts — visualize trade-offs between sample size, MDE, and power
- Results Report Generator — export formatted summaries to Markdown or HTML
- Pre-Deployment Checklist — validate experiment setup before launch
## Quick Start

```python
from src.calculator import sample_size_calculator
from src.significance import run_ztest

# 1. Calculate required sample size
n = sample_size_calculator(
    baseline_rate=0.12,
    minimum_detectable_effect=0.02,  # absolute lift
    power=0.80,
    significance_level=0.05,
)
print(f"Required sample per variant: {n:,}")  # ~3,623

# 2. After collecting data, test for significance
result = run_ztest(
    control_visitors=4000, control_conversions=480,
    variant_visitors=4000, variant_conversions=552,
)
print(f"p-value: {result.p_value:.4f}")
print(f"Lift: {result.relative_lift:.1%}")
print(f"Significant: {result.is_significant}")
```
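For intuition, the sample-size step above can be approximated with a standard closed-form for a two-sided, two-proportion z-test. This is a sketch using only the standard library; `sample_size_calculator` may use a different approximation, so its result can differ from this formula's.

```python
import math
from statistics import NormalDist

def approx_sample_size(baseline_rate, mde, power=0.80, alpha=0.05):
    """Per-variant n for a two-sided two-proportion z-test (unpooled variances)."""
    p1, p2 = baseline_rate, baseline_rate + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-tailed
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

print(approx_sample_size(0.12, 0.02))  # 4435 with this particular formula
```

Halving the MDE roughly quadruples the required sample, which is why the calculator blows up for small effects.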
## Usage Examples

### Bayesian Analysis

```python
from src.bayesian import BayesianABTest

test = BayesianABTest(prior_alpha=1, prior_beta=1)
test.add_control(visitors=5000, conversions=600)
test.add_variant(visitors=5000, conversions=672)

summary = test.summarize()
print(f"P(variant > control): {summary.prob_variant_wins:.1%}")
print(f"Expected lift: {summary.expected_lift:.2%}")
print(f"95% credible interval: [{summary.ci_low:.2%}, {summary.ci_high:.2%}]")
```
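Under a Beta-Binomial model, the probability-to-beat-control reduces to a few lines of Monte Carlo over the two posteriors. A minimal standard-library sketch (the `BayesianABTest` class may use more draws or a closed form, so treat this as illustrative):

```python
import random

def prob_variant_wins(c_visitors, c_conv, v_visitors, v_conv,
                      prior_alpha=1, prior_beta=1, draws=50_000, seed=42):
    """Monte Carlo estimate of P(variant rate > control rate) under Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a Beta(a, b) prior with k successes in n trials is
        # Beta(a + k, b + n - k); draw one rate from each posterior and compare.
        p_control = rng.betavariate(prior_alpha + c_conv,
                                    prior_beta + c_visitors - c_conv)
        p_variant = rng.betavariate(prior_alpha + v_conv,
                                    prior_beta + v_visitors - v_conv)
        wins += p_variant > p_control
    return wins / draws

print(prob_variant_wins(5000, 600, 5000, 672))  # roughly 0.98 for this data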
### Sequential Testing with Early Stopping

```python
from src.sequential import SequentialTest, SpendingFunction

seq = SequentialTest(
    max_looks=5,
    overall_alpha=0.05,
    spending_function=SpendingFunction.OBRIEN_FLEMING,
)

# At each interim analysis (e.g., 20%, 40%, 60%, 80%, 100% of data);
# interim_results holds one (control_data, variant_data) pair per look:
for look, (ctrl, var) in enumerate(interim_results, 1):
    decision = seq.analyze(look, ctrl, var)
    if decision.stop_early:
        print(f"Stop at look {look}: {decision.conclusion}")
        break
```
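The spending-function math behind O'Brien-Fleming boundaries is compact. Below is a sketch of the Lan-DeMets O'Brien-Fleming-type spending function; this is an assumption about the internals, and `SequentialTest` may implement the boundaries differently:

```python
from statistics import NormalDist

def obf_alpha_spent(info_fraction, overall_alpha=0.05):
    """Cumulative alpha spent at information fraction t in (0, 1],
    per the Lan-DeMets O'Brien-Fleming-type spending function:
    alpha*(t) = 2 * (1 - Phi(z_{1 - alpha/2} / sqrt(t)))."""
    z = NormalDist().inv_cdf(1 - overall_alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / info_fraction ** 0.5))

# Very little alpha is spent at early looks, preserving nearly the full
# 0.05 for the final analysis.
for t in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"t={t:.1f}: cumulative alpha spent = {obf_alpha_spent(t):.5f}")
```

This is why O'Brien-Fleming is conservative early: an overwhelming effect is needed to stop at the first look, but the final-look threshold stays close to a fixed-sample test.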
### Segment-Level Breakdown

```python
from src.segments import segmented_analysis

results = segmented_analysis(
    data=experiment_df,
    variant_col="variant",
    metric_col="converted",
    segment_col="device_type",
)
for seg in results:
    print(f"{seg.name}: lift={seg.lift:.2%}, p={seg.p_value:.4f}")
```
## Configuration

Edit `config.example.yaml` to set organization defaults:

```yaml
defaults:
  significance_level: 0.05        # two-tailed alpha
  power: 0.80                     # 1 - beta
  minimum_detectable_effect: 0.02
  test_type: "two-sided"          # or "one-sided"

bayesian:
  prior_alpha: 1                  # Beta prior parameter
  prior_beta: 1                   # uninformative prior
  simulations: 100000             # Monte Carlo draws

sequential:
  max_looks: 5
  spending_function: "obrien_fleming"

reporting:
  output_format: "markdown"       # "markdown" or "html"
  include_charts: true
```
## Best Practices
- Set MDE before the test starts — never peek at results and adjust your threshold
- Run tests to full sample size unless using sequential testing with alpha correction
- Use Bonferroni or Holm correction when testing multiple variants or metrics
- Log-transform revenue metrics — they are rarely normally distributed
- Check sample ratio mismatch (SRM) — if observed split deviates from expected, the experiment is compromised
- Document every experiment — use the results report generator for consistent records
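The SRM check above is a one-line chi-squared goodness-of-fit test against the planned split. A standard-library sketch for the two-variant case (`srm_p_value` is a hypothetical helper, not part of the toolkit's API):

```python
import math
from statistics import NormalDist

def srm_p_value(control_n, variant_n, expected_ratio=0.5):
    """Chi-squared (1 df) test of the observed split against the expected ratio."""
    total = control_n + variant_n
    exp_control = total * expected_ratio
    exp_variant = total * (1 - expected_ratio)
    chi2 = ((control_n - exp_control) ** 2 / exp_control
            + (variant_n - exp_variant) ** 2 / exp_variant)
    # With 1 degree of freedom, the chi-squared survival function at x
    # equals 2 * P(Z > sqrt(x)) for a standard normal Z.
    return 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))

print(f"{srm_p_value(5120, 4880):.4f}")  # well below 0.05 -> investigate before analyzing
```

A common convention is to flag SRM at a strict threshold (e.g., p < 0.001) so that routine checks rarely false-alarm; pick a threshold before launch.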
## Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| Sample size unreasonably large | MDE is too small relative to baseline | Increase MDE or accept lower power |
| p-value exactly 0.0 | Floating-point underflow in large samples | Use log-space computation in `significance.py` |
| Bayesian and frequentist disagree | Different priors or assumptions | Align prior with historical data; check test type matches |
| SRM detected | Traffic allocation bug or bot filtering | Investigate logging and assignment logic before trusting results |
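One way to implement the log-space fix from the table is an asymptotic (Mills-ratio) expansion of the normal tail; this is a sketch of the idea, not necessarily what `significance.py` does:

```python
import math

def log_sf_normal(z):
    """log P(Z > z) for a standard normal Z, stable where sf(z) underflows.

    For moderate z, math.erfc is still representable and accurate; for large z,
    use the asymptotic expansion sf(z) ~ phi(z)/z * (1 - 1/z^2).
    """
    if z < 8:
        return math.log(0.5 * math.erfc(z / math.sqrt(2)))
    return (-0.5 * z * z - math.log(z) - 0.5 * math.log(2 * math.pi)
            + math.log1p(-1 / (z * z)))

print(log_sf_normal(40))  # finite, even though sf(40) underflows to 0.0 in a float
```

Reporting `log10(p)` (i.e., `log_sf_normal(z) / math.log(10)` plus the two-sided factor) avoids ever printing a misleading "p = 0.0000".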
## Requirements

- Python 3.10+
- Standard library only (`math`, `statistics`, `collections`)
This is 1 of 11 resources in the Data Analyst Toolkit. Get the complete A/B Testing Statistical Framework with all files, templates, and documentation for $39.
Or grab the entire Data Analyst Toolkit bundle (11 products) for $129 — save 30%.