Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

A/B Testing Statistical Framework

Complete A/B testing toolkit with sample size calculators, frequentist and Bayesian significance tests, sequential testing support, and automated results reporting. Built for analysts who need statistically rigorous experiment design without heavyweight platforms.

Key Features

  • Sample Size Calculator — compute required sample per variant given baseline rate, MDE, power, and significance level
  • Frequentist Significance Tests — z-test and chi-squared tests for proportions and means with confidence intervals
  • Bayesian A/B Analysis — Beta-Binomial posterior with credible intervals and probability-to-beat-control
  • Sequential Testing — alpha-spending functions (O'Brien-Fleming, Pocock) for early stopping
  • Segmented Analysis — break results by device, geo, or any custom dimension
  • Power Analysis Charts — visualize trade-offs between sample size, MDE, and power
  • Results Report Generator — export formatted summaries to Markdown or HTML
  • Pre-Deployment Checklist — validate experiment setup before launch

Quick Start

from src.calculator import sample_size_calculator
from src.significance import run_ztest

# 1. Calculate required sample size
n = sample_size_calculator(
    baseline_rate=0.12,
    minimum_detectable_effect=0.02,  # absolute lift
    power=0.80,
    significance_level=0.05,
)
print(f"Required sample per variant: {n:,}")  # ~4,438 per the standard pooled two-proportion formula

# 2. After collecting data, test significance
result = run_ztest(
    control_visitors=4000, control_conversions=480,
    variant_visitors=4000, variant_conversions=552,
)
print(f"p-value: {result.p_value:.4f}")
print(f"Lift: {result.relative_lift:.1%}")
print(f"Significant: {result.is_significant}")
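The toolkit's internals aren't shown here, but the required sample size can be sketched with the standard pooled two-proportion formula using only the standard library (consistent with the stdlib-only requirement below). This is an illustrative implementation, not necessarily the exact formula `sample_size_calculator` uses:

```python
import math
from statistics import NormalDist

def two_proportion_sample_size(baseline_rate, mde, power=0.80, alpha=0.05):
    """Per-variant sample size for a two-sided two-proportion z-test.

    baseline_rate: control conversion rate; mde: absolute lift to detect.
    Uses the classic pooled-variance formula.
    """
    p1, p2 = baseline_rate, baseline_rate + mde
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # critical value for two-sided alpha
    z_beta = nd.inv_cdf(power)            # critical value for power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return math.ceil(numerator / mde ** 2)

print(two_proportion_sample_size(0.12, 0.02))  # → 4438
```

Note how sensitive the result is to the MDE: halving it roughly quadruples the required sample, which is why the Power Analysis Charts feature is worth using before launch.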

Usage Examples

Bayesian Analysis

from src.bayesian import BayesianABTest

test = BayesianABTest(prior_alpha=1, prior_beta=1)
test.add_control(visitors=5000, conversions=600)
test.add_variant(visitors=5000, conversions=672)

summary = test.summarize()
print(f"P(variant > control): {summary.prob_variant_wins:.1%}")
print(f"Expected lift: {summary.expected_lift:.2%}")
print(f"95% credible interval: [{summary.ci_low:.2%}, {summary.ci_high:.2%}]")
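Under a Beta(1, 1) prior, the posterior for each arm is Beta(prior_alpha + conversions, prior_beta + visitors - conversions), and the probability-to-beat-control can be estimated by Monte Carlo with nothing but `random.betavariate`. A minimal sketch (the class above presumably does something equivalent internally; this standalone function is an assumption, not the toolkit's code):

```python
import random

def prob_variant_beats_control(c_visitors, c_conv, v_visitors, v_conv,
                               prior_alpha=1, prior_beta=1, draws=100_000):
    """Monte Carlo estimate of P(variant rate > control rate) under
    independent Beta-Binomial posteriors."""
    rng = random.Random(42)  # fixed seed for reproducible estimates
    wins = 0
    for _ in range(draws):
        pc = rng.betavariate(prior_alpha + c_conv,
                             prior_beta + c_visitors - c_conv)
        pv = rng.betavariate(prior_alpha + v_conv,
                             prior_beta + v_visitors - v_conv)
        wins += pv > pc
    return wins / draws

p = prob_variant_beats_control(5000, 600, 5000, 672)
print(f"P(variant beats control): {p:.1%}")
```

With 100,000 draws the Monte Carlo error is well under a percentage point, which matches the `simulations: 100000` default in the config below.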

Sequential Testing with Early Stopping

from src.sequential import SequentialTest, SpendingFunction

seq = SequentialTest(
    max_looks=5,
    overall_alpha=0.05,
    spending_function=SpendingFunction.OBRIEN_FLEMING,
)

# interim_results: list of (control_data, variant_data) snapshots collected
# at each interim analysis (e.g., 20%, 40%, 60%, 80%, 100% of planned sample)
for look, (ctrl, var) in enumerate(interim_results, 1):
    decision = seq.analyze(look, ctrl, var)
    if decision.stop_early:
        print(f"Stop at look {look}: {decision.conclusion}")
        break
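To see why early looks are so conservative under O'Brien-Fleming, the Lan-DeMets O'Brien-Fleming-type spending function can be evaluated directly: alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t))), where t is the information fraction. This sketch shows only the cumulative alpha budget per look; converting spends into actual z-boundaries requires the joint distribution across looks, which the toolkit handles internally (assumed, not shown in the source):

```python
from statistics import NormalDist

def obf_alpha_spent(t, overall_alpha=0.05):
    """Cumulative alpha spent at information fraction t under the
    Lan-DeMets O'Brien-Fleming-type spending function."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - overall_alpha / 2)
    return 2 * (1 - nd.cdf(z / t ** 0.5))

for k in range(1, 6):
    t = k / 5  # 5 equally spaced looks
    print(f"look {k}: cumulative alpha spent = {obf_alpha_spent(t):.5f}")
```

Almost no alpha is spent at the first look (roughly 0.00001 at t = 0.2), so early stopping only triggers on very large effects; the full 0.05 is available by the final look.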

Segment-Level Breakdown

from src.segments import segmented_analysis

results = segmented_analysis(
    data=experiment_df,
    variant_col="variant",
    metric_col="converted",
    segment_col="device_type",
)
for seg in results:
    print(f"{seg.name}: lift={seg.lift:.2%}, p={seg.p_value:.4f}")
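Conceptually, segmented analysis is just the same two-proportion z-test run per segment. A stdlib-only sketch of the idea, using rows as plain dicts (the column names and schema here are illustrative, not the toolkit's actual data contract):

```python
from collections import defaultdict
from statistics import NormalDist

def segment_lifts(rows, segment_key="device_type"):
    """Per-segment relative lift and two-sided two-proportion z-test p-value.

    rows: dicts with 'variant' ('control' or 'variant'), 'converted' (0/1),
    and a segment column (illustrative schema).
    """
    # counts[segment][arm] = [visitors, conversions]
    counts = defaultdict(lambda: {"control": [0, 0], "variant": [0, 0]})
    for r in rows:
        c = counts[r[segment_key]][r["variant"]]
        c[0] += 1
        c[1] += r["converted"]
    out = {}
    for seg, c in counts.items():
        (nc, xc), (nv, xv) = c["control"], c["variant"]
        pc, pv = xc / nc, xv / nv
        pp = (xc + xv) / (nc + nv)  # pooled rate for the standard error
        se = (pp * (1 - pp) * (1 / nc + 1 / nv)) ** 0.5
        z = (pv - pc) / se
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        out[seg] = {"lift": (pv - pc) / pc, "p_value": p}
    return out
```

Keep in mind that each segment is an additional hypothesis test; the multiple-comparison corrections mentioned under Best Practices apply here too.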

Configuration

Edit config.example.yaml to set organization defaults:

defaults:
  significance_level: 0.05       # Two-tailed alpha
  power: 0.80                    # 1 - beta
  minimum_detectable_effect: 0.02
  test_type: "two-sided"         # or "one-sided"

bayesian:
  prior_alpha: 1                 # Beta prior parameter
  prior_beta: 1                  # Uninformative prior
  simulations: 100000            # Monte Carlo draws

sequential:
  max_looks: 5
  spending_function: "obrien_fleming"

reporting:
  output_format: "markdown"      # "markdown" or "html"
  include_charts: true

Best Practices

  1. Set MDE before the test starts — never peek at results and adjust your threshold
  2. Run tests to full sample size unless using sequential testing with alpha correction
  3. Use Bonferroni or Holm correction when testing multiple variants or metrics
  4. Log-transform revenue metrics — they are rarely normally distributed
  5. Check sample ratio mismatch (SRM) — if observed split deviates from expected, the experiment is compromised
  6. Document every experiment — use the results report generator for consistent records

Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| Sample size unreasonably large | MDE is too small relative to baseline | Increase MDE or accept lower power |
| p-value exactly 0.0 | Floating-point underflow in large samples | Use log-space computation in significance.py |
| Bayesian and frequentist disagree | Different priors or assumptions | Align prior with historical data; check test type matches |
| SRM detected | Traffic allocation bug or bot filtering | Investigate logging and assignment logic before trusting results |

Requirements

  • Python 3.10+
  • Standard library only (math, statistics, collections)

This is 1 of 11 resources in the Data Analyst Toolkit. Get the complete A/B Testing Statistical Framework with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire Data Analyst Toolkit bundle (11 products) for $129 — save 30%.

Get the Complete Bundle →

