
Ritwika Kancharla

A/B Testing LLM Systems

When Your New Model "Looks Better" but the Metrics Disagree

You swapped in a new embedding model. Responses feel sharper. Your team is excited. You ship it.

Two weeks later, task completion is down 8%. You have no idea why, and no way to trace it back to the change.

This is the most common way LLM improvements go wrong. The new version looks better in demos, passes the vibe check, and fails silently in production. A/B testing is how you stop guessing and start knowing.


Why LLM A/B Testing Is Harder Than Normal A/B Testing

In a standard web A/B test, you change a button color and measure clicks. The metric is immediate, unambiguous, and causally close to the change.

LLM systems have three properties that make this harder:

Evaluation lag. Whether a response was actually helpful often isn't clear until the user does (or doesn't) complete their task — which might be minutes or sessions later.

Multi-component pipelines. Changing the embedding model affects retrieval quality, which affects generation quality, which affects user behavior. The signal is distributed across the whole pipeline, not just one component.

High variance outputs. The same query can produce meaningfully different responses across runs, which means you need more samples to detect real signal over noise.

None of these are insurmountable. They just mean you need to be more deliberate about experimental design than you would be for a UI test.


Part 1: What You're Actually Testing

Before writing any code, be precise about the hypothesis. LLM A/B tests fall into a few categories:

| Change Type | Example | Primary Metric |
| --- | --- | --- |
| Embedding model | text-embedding-3-small → text-embedding-3-large | Retrieval MRR, NDCG |
| Chunk size / strategy | 500 chars → 1000 chars with overlap | Faithfulness, relevance |
| Reranker | No reranking → cross-encoder reranking | Precision@5 |
| Generation model | GPT-4o-mini → GPT-4o | Faithfulness, task completion |
| Prompt change | Added chain-of-thought instruction | Citation accuracy, response quality |
| Temperature | 0.4 → 0.2 for factual queries | Hallucination rate |

Define your primary metric before you run the test. If you measure 12 things and declare victory on whichever one improved, you're doing p-hacking, not science.
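One lightweight way to enforce this is to write the design down in code before any traffic flows. A sketch — the `ExperimentSpec` name and its fields are illustrative, not from any library:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ExperimentSpec:
    """Pre-registered experiment design. Written before any traffic is routed."""
    experiment_id: str
    hypothesis: str
    primary_metric: str     # The one metric that decides ship / no-ship
    guardrail_metrics: list = field(default_factory=list)  # Must not regress
    minimum_detectable_effect: float = 0.03

spec = ExperimentSpec(
    experiment_id="emb-3-large-2024-06",
    hypothesis="text-embedding-3-large improves retrieval MRR on long queries",
    primary_metric="mrr",
    guardrail_metrics=["total_latency_ms", "faithfulness"],
)
```

Because the spec is frozen, nobody can quietly swap the primary metric after seeing the results.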


Part 2: Traffic Splitting

The infrastructure for LLM A/B testing is simpler than most people expect. You need a router that assigns users to variants consistently and logs which variant served each request.

import hashlib
from enum import Enum

class Variant(str, Enum):
    CONTROL = "control"
    TREATMENT = "treatment"

def assign_variant(user_id: str, experiment_id: str, treatment_pct: float = 0.10) -> Variant:
    """
    Deterministic assignment — same user always gets same variant.
    Hash-based so no state required.
    """
    key = f"{experiment_id}:{user_id}"
    hash_val = int(hashlib.md5(key.encode()).hexdigest(), 16)
    bucket = (hash_val % 1000) / 1000  # 0.000 to 0.999

    return Variant.TREATMENT if bucket < treatment_pct else Variant.CONTROL

Deterministic assignment matters. If the same user sees control on Monday and treatment on Thursday, their behavior becomes uninterpretable. Hash-based assignment is stateless and consistent across restarts.

Start with 10% treatment traffic. You can ramp up once you've verified nothing is obviously broken.
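A useful property of this bucket scheme: ramping from 10% to 30% only widens the threshold, so users already in treatment stay there — nobody gets reassigned mid-experiment. A quick standalone sketch to convince yourself (re-inlining the hash):

```python
import hashlib

def bucket(user_id: str, experiment_id: str) -> float:
    """Same hash-to-bucket mapping as assign_variant, returning the raw bucket."""
    key = f"{experiment_id}:{user_id}"
    return (int(hashlib.md5(key.encode()).hexdigest(), 16) % 1000) / 1000

# Ramping only widens the bucket range, so treatment at 10% is a
# subset of treatment at 30% — no user flips back to control.
users = [f"user-{i}" for i in range(1000)]
at_10 = {u for u in users if bucket(u, "exp-1") < 0.10}
at_30 = {u for u in users if bucket(u, "exp-1") < 0.30}
```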

class ExperimentRouter:
    def __init__(self, experiment_id: str, control_system, treatment_system):
        self.experiment_id = experiment_id
        self.control = control_system
        self.treatment = treatment_system

    def route(self, user_id: str, query: str) -> dict:
        variant = assign_variant(user_id, self.experiment_id)

        system = self.treatment if variant == Variant.TREATMENT else self.control
        result = system.query(query)

        # Tag every response with its variant for analysis
        result["experiment_id"] = self.experiment_id
        result["variant"] = variant
        result["user_id"] = user_id

        return result

Part 3: What to Measure

Automated Metrics (Available Immediately)

These you can compute on every request:

from dataclasses import dataclass

@dataclass
class RequestMetrics:
    experiment_id: str
    variant: str
    user_id: str
    query: str

    # Retrieval
    mrr: float
    ndcg_5: float

    # Generation
    faithfulness: float
    citation_accuracy: float
    response_length: int

    # Latency
    total_latency_ms: float
    ttft_ms: float          # Time to first token

    # Reliability
    guardrail_passed: bool
    error: bool

def collect_metrics(result: dict, golden_labels: dict = None) -> RequestMetrics:
    has_labels = golden_labels is not None
    return RequestMetrics(
        experiment_id=result["experiment_id"],
        variant=result["variant"],
        user_id=result["user_id"],
        query=result["query"],
        mrr=mean_reciprocal_rank(
            result["retrieved_ids"],
            golden_labels.get("relevant_ids", [])
        ) if has_labels else None,
        # ndcg_at_k: assumed helper alongside mean_reciprocal_rank
        ndcg_5=ndcg_at_k(
            result["retrieved_ids"],
            golden_labels.get("relevant_ids", []),
            k=5
        ) if has_labels else None,
        faithfulness=faithfulness_score(result["response"], result["sources"]),
        citation_accuracy=citation_accuracy(result["response"], result["sources"])["accuracy"],
        response_length=len(result["response"]),
        total_latency_ms=result["latency_ms"],
        ttft_ms=result["ttft_ms"],
        guardrail_passed=result["guardrail_passed"],
        error=result.get("error", False)
    )

Behavioral Metrics (Require Follow-Through)

These are harder to collect but closer to what actually matters:

from datetime import datetime, timezone

class BehaviorTracker:
    """Track what users do after receiving a response."""

    def log_followup(self, user_id: str, experiment_id: str, event: str):
        """
        Events worth tracking:
        - "thumbs_up" / "thumbs_down"
        - "clicked_source"
        - "copied_response"
        - "asked_followup"       # Might mean confused or engaged
        - "task_completed"       # Best signal, hardest to measure
        - "session_abandoned"    # Bad signal
        """
        log({
            "user_id": user_id,
            "experiment_id": experiment_id,
            "variant": get_last_variant(user_id, experiment_id),
            "event": event,
            "timestamp": datetime.now(timezone.utc).isoformat()
        })

Explicit feedback (thumbs up/down) has low response rates but high signal. Implicit signals (follow-up questions, session length, task completion) have high volume but require careful interpretation. Collect both.
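At low volume, report explicit-feedback rates with a confidence interval rather than a bare percentage. A sketch using the Wilson score interval, which behaves better than the normal approximation on small counts:

```python
from scipy import stats

def wilson_interval(successes: int, total: int, confidence: float = 0.95) -> tuple:
    """Wilson score interval for a proportion — robust at small sample sizes."""
    if total == 0:
        return (0.0, 1.0)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    margin = z * ((p * (1 - p) / total + z ** 2 / (4 * total ** 2)) ** 0.5) / denom
    return (center - margin, center + margin)

# 40 thumbs-up out of 55 ratings: the 0.73 point estimate hides wide uncertainty
lo, hi = wilson_interval(40, 55)
```

A treatment at 0.73 and a control at 0.68 with intervals this wide is not a result — it's a reason to keep collecting.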


Part 4: Statistical Analysis

This is where most teams go wrong. They run an experiment for a week, eyeball the numbers, and declare a winner. Here's how to do it properly.

Sample Size Calculation

Calculate required sample size before you start, not after you see the results:

from scipy import stats
import numpy as np

def required_sample_size(
    baseline_mean: float,
    minimum_detectable_effect: float,  # Smallest improvement worth caring about
    alpha: float = 0.05,               # False positive rate
    power: float = 0.80                # Probability of detecting a real effect
) -> int:
    """
    How many samples per variant do you need?
    """
    effect_size = minimum_detectable_effect / baseline_mean

    z_alpha = stats.norm.ppf(1 - alpha / 2)  # Two-tailed
    z_beta  = stats.norm.ppf(power)

    # Cohen's formula for proportions (simplified)
    n = (2 * ((z_alpha + z_beta) ** 2) * baseline_mean * (1 - baseline_mean)) / (minimum_detectable_effect ** 2)

    return int(np.ceil(n))

# Example: faithfulness baseline 0.82, want to detect 3% improvement
n = required_sample_size(
    baseline_mean=0.82,
    minimum_detectable_effect=0.03
)
print(f"Need {n} samples per variant ({n * 2} total)")
# → Need ~2,575 samples per variant

If you'd need 50,000 samples to detect a 0.5% improvement, either increase traffic to the experiment or reconsider whether 0.5% is worth detecting.
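To see that trade-off concretely, here's the same formula run across a range of effect sizes (re-inlined as `required_n` so the snippet runs standalone):

```python
from scipy import stats
import numpy as np

def required_n(baseline: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size per variant for a two-proportion test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return int(np.ceil(2 * (z_alpha + z_beta) ** 2 * baseline * (1 - baseline) / mde ** 2))

# Halving the detectable effect quadruples the samples you need
for mde in [0.05, 0.03, 0.01, 0.005]:
    print(f"MDE {mde:.3f} → {required_n(0.82, mde):,} samples per variant")
```

The inverse-square relationship is why "detect any improvement, however small" is not a viable experiment design.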

Significance Testing

def analyze_experiment(control_metrics: list, treatment_metrics: list) -> dict:
    control_arr   = np.array(control_metrics)
    treatment_arr = np.array(treatment_metrics)

    # Two-sided Welch's t-test (doesn't assume equal variances)
    t_stat, p_value = stats.ttest_ind(treatment_arr, control_arr, equal_var=False)

    # Effect size (Cohen's d)
    pooled_std = np.sqrt(
        (control_arr.std() ** 2 + treatment_arr.std() ** 2) / 2
    )
    cohens_d = (treatment_arr.mean() - control_arr.mean()) / pooled_std

    # Confidence interval on the difference
    diff = treatment_arr.mean() - control_arr.mean()
    se   = np.sqrt(control_arr.var() / len(control_arr) +
                   treatment_arr.var() / len(treatment_arr))
    ci   = stats.norm.interval(0.95, loc=diff, scale=se)

    return {
        "control_mean":   control_arr.mean(),
        "treatment_mean": treatment_arr.mean(),
        "absolute_change": diff,
        "relative_change": diff / control_arr.mean(),
        "p_value":   p_value,
        "significant": p_value < 0.05,
        "cohens_d":  cohens_d,
        "ci_95":     ci,
        "practical_significance": abs(cohens_d) > 0.2  # Small effect threshold
    }

Statistical significance and practical significance are different things. A p-value of 0.001 tells you the effect is real. Cohen's d tells you whether it's big enough to matter. You want both.
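A synthetic example makes the distinction concrete: at a large enough sample size, a negligible 0.002 difference comes out "statistically significant" while Cohen's d stays far below the 0.2 small-effect bar (the numbers here are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two variants with a real but tiny difference, measured at huge sample size
control   = rng.normal(loc=0.820, scale=0.10, size=200_000)
treatment = rng.normal(loc=0.822, scale=0.10, size=200_000)

t_stat, p_value = stats.ttest_ind(treatment, control)
pooled_std = np.sqrt((control.std() ** 2 + treatment.std() ** 2) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_std

# p-value is tiny ("the effect is real"), but d ≈ 0.02 — nowhere near 0.2.
# Real, and not worth shipping.
```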

The Full Analysis Report

def experiment_report(experiment_id: str, results_df) -> dict:
    control   = results_df[results_df["variant"] == "control"]
    treatment = results_df[results_df["variant"] == "treatment"]

    metrics_to_test = [
        "faithfulness",
        "citation_accuracy",
        "mrr",
        "total_latency_ms",
        "guardrail_pass_rate"
    ]

    report = {
        "experiment_id": experiment_id,
        "sample_sizes": {
            "control": len(control),
            "treatment": len(treatment)
        },
        "metrics": {}
    }

    for metric in metrics_to_test:
        if metric in results_df.columns:
            report["metrics"][metric] = analyze_experiment(
                control[metric].dropna().tolist(),
                treatment[metric].dropna().tolist()
            )

    # Overall recommendation
    significant_improvements = [
        m for m, r in report["metrics"].items()
        if r["significant"] and r["relative_change"] > 0 and r["practical_significance"]
    ]
    significant_regressions = [
        m for m, r in report["metrics"].items()
        if r["significant"] and r["relative_change"] < 0 and r["practical_significance"]
    ]

    report["recommendation"] = _make_recommendation(
        significant_improvements,
        significant_regressions
    )

    return report

def _make_recommendation(improvements: list, regressions: list) -> str:
    if regressions:
        return f"DO NOT SHIP — regressions detected in: {', '.join(regressions)}"
    if improvements:
        return f"SHIP — significant improvements in: {', '.join(improvements)}"
    return "INCONCLUSIVE — no significant changes detected. Extend experiment or increase traffic."

Part 5: Common Mistakes

Stopping early. You ran the experiment for 3 days, faithfulness is up 4%, p=0.04. You ship it. This is p-hacking. Decide your stopping criteria — sample size or duration — before you start, and don't peek at results until you hit it.
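If you want to see why peeking is dangerous, simulate it. The sketch below runs A/A tests — identical variants, no real effect at all — and counts how often at least one peek shows p < 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def false_positive_rate(peeks: int, n_per_peek: int = 250, sims: int = 1000) -> float:
    """Simulate A/A tests and count how often ANY peek reaches p < 0.05."""
    hits = 0
    for _ in range(sims):
        a = rng.normal(size=peeks * n_per_peek)
        b = rng.normal(size=peeks * n_per_peek)
        for k in range(1, peeks + 1):
            # Peek at the cumulative data so far and stop if it "looks significant"
            _, p = stats.ttest_ind(a[: k * n_per_peek], b[: k * n_per_peek])
            if p < 0.05:
                hits += 1
                break
    return hits / sims

# One look keeps the false-positive rate near the nominal 5%;
# five looks inflates it well past that — on data with NO real difference.
```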

Novelty effect. Users behave differently with new things. A new UI or response style might get better engagement for a week just because it's different. Run experiments for at least two full weeks for behavioral metrics.

Segment blindness. An overall improvement can hide a regression in a specific segment. Always break down results by query type, user cohort, and difficulty level.

def segment_analysis(results_df) -> dict:
    breakdowns = {}

    for segment_col in ["query_type", "user_cohort", "difficulty"]:
        if segment_col not in results_df.columns:
            continue

        breakdowns[segment_col] = {}
        for segment, group in results_df.groupby(segment_col):
            ctrl = group[group["variant"] == "control"]["faithfulness"].tolist()
            trt  = group[group["variant"] == "treatment"]["faithfulness"].tolist()

            if len(ctrl) > 30 and len(trt) > 30:  # Minimum for reliable stats
                breakdowns[segment_col][segment] = analyze_experiment(ctrl, trt)

    return breakdowns

Measuring the wrong thing. Faithfulness going up doesn't mean users are happier. Keep at least one behavioral metric (thumbs up rate, task completion) in every experiment so you're always connected to what actually matters.


Part 6: Decision Framework

When the experiment ends, you need a clear process for what to do with the results:

Experiment complete
        ↓
Did we hit the required sample size?
        ├── No → Extend or abort (don't analyze yet)
        └── Yes ↓
Any significant regressions?
        ├── Yes → Do not ship. Investigate why.
        └── No ↓
Any significant improvements on primary metric?
        ├── No → Inconclusive. Bigger change needed, or effect too small to matter.
        └── Yes ↓
Does improvement hold across segments?
        ├── No → Mixed results. Consider partial rollout or further investigation.
        └── Yes → Ship. Set new baseline. Document learnings.

The "set new baseline" step is critical and usually skipped. After you ship, update baseline.json so your regression detector compares against the new normal, not the old one.


Putting It Together: The Experiment Lifecycle

class Experiment:
    def __init__(self, experiment_id: str, hypothesis: str, primary_metric: str,
                 minimum_detectable_effect: float, control, treatment):
        self.id = experiment_id
        self.hypothesis = hypothesis
        self.primary_metric = primary_metric
        self.router = ExperimentRouter(experiment_id, control, treatment)

        # Pre-calculate required sample size
        baseline = load_baseline_metric(primary_metric)
        self.required_n = required_sample_size(baseline, minimum_detectable_effect)

        print(f"Experiment '{experiment_id}' initialized.")
        print(f"Hypothesis: {hypothesis}")
        print(f"Required samples per variant: {self.required_n}")

    def is_ready_to_analyze(self) -> bool:
        n_control   = count_samples(self.id, "control")
        n_treatment = count_samples(self.id, "treatment")
        return min(n_control, n_treatment) >= self.required_n

    def analyze(self) -> dict:
        if not self.is_ready_to_analyze():
            raise RuntimeError("Not enough samples yet. Don't peek.")

        results_df = load_experiment_results(self.id)
        report = experiment_report(self.id, results_df)
        report["segment_analysis"] = segment_analysis(results_df)

        return report

    def ship(self):
        """Call after analysis confirms improvement."""
        promote_treatment_to_production(self.id)
        update_baseline(self.primary_metric)
        archive_experiment(self.id)
        print(f"Experiment '{self.id}' shipped. Baseline updated.")

The Honest Truth About LLM A/B Testing

Most of the time, experiments are inconclusive. The new model is marginally better on some metrics, marginally worse on others, and you genuinely can't tell if shipping it is the right call.

That's useful information. It means the change doesn't matter enough to justify the operational risk of switching. Save the deployment for changes that move the needle clearly.

The teams that improve their LLM systems fastest aren't the ones running the most experiments — they're the ones running experiments with clear hypotheses, adequate sample sizes, and the discipline to ship nothing when the data says nothing.


Previous: Stop Eyeballing Your RAG Outputs. Start Measuring Quality.

Next up: Hybrid search — combining vector and BM25 for queries where pure semantic search falls flat.
