Abstract
Running experiments in a high-velocity marketplace environment involves a range of real-world challenges — from sample imbalance and session leakage to assignment logic and infrastructure limitations. This paper outlines hands-on practices used to improve experimentation reliability and decision-making speed. It highlights how assignment methods, cross-functional alignment, and strategic analysis play a critical role in producing valid, actionable results at scale.
1. Introduction
Experimentation plays a central role in product development for large-scale marketplaces. Rapid iteration depends on the ability to validate features, user experiences, and optimizations with measurable impact.
Hundreds of experiments run concurrently across platforms (App, Web), supported by automation tools that enable faster deployment, more frequent testing, and consistent metric evaluation. Each experiment is designed with clearly defined qualifying conditions. Assignment logic varies based on feature scope — some use persistent login-based identifiers, while others rely on cookie-based logic for anonymous users. These differences introduce risks that require careful attention to session management and assignment consistency.
2. Experiment Assignment Mechanism
Assignment logic is powered by a centralized backend service that evaluates eligibility criteria and computes variant assignment in real time. Logged-in users are assigned via user_id, stored in a persistent feature-flag platform. Anonymous users rely on a browser cookie (xpa_id) generated client-side and resolved via SDK.
The assignment system ensures deterministic bucketing and supports overrides, sticky assignment, and exclusion logic. To prevent cross-device inconsistencies, assignments for authenticated users are synced across platforms using an ID resolution layer and embedded in the page payload during server-side rendering (SSR).
All assignment events are logged for traceability, including timestamp, source (auth vs. cookie), and treatment group. Fail-safes are in place to prevent double exposure, mid-flight switching, or collisions between overlapping tests.
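The deterministic bucketing described above can be sketched as a hash of the experiment key and unit identifier (user_id for authenticated users, xpa_id for anonymous ones). This is a minimal illustrative sketch, not the production service; the hashing scheme and function names are assumptions.

```python
import hashlib

def assign_variant(unit_id: str, experiment_key: str, variants: list[str]) -> str:
    """Deterministically map a unit (user_id or cookie id) to a variant."""
    # Salting the hash with the experiment key keeps assignments
    # independent across experiments.
    digest = hashlib.md5(f"{experiment_key}:{unit_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    slice_size = 100 // len(variants)
    index = min(bucket // slice_size, len(variants) - 1)
    return variants[index]

# Same inputs always produce the same assignment (sticky bucketing),
# with no per-user state to store.
v1 = assign_variant("user_42", "checkout_redesign", ["control", "treatment"])
v2 = assign_variant("user_42", "checkout_redesign", ["control", "treatment"])
assert v1 == v2
```

Because the mapping is a pure function of the identifier, any service that shares the salt can reproduce the assignment, which is what makes cross-platform syncing and SSR embedding straightforward.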
2.1 Session Leakage
When the cookie is delayed or fails to trigger (e.g., due to late SDK loading or tag misfiring), the assignment engine may not assign users properly. This leads to leakage — where users fall outside the experimental population or are treated as unassigned. Leakage can distort group-level metrics or cause exposure inconsistencies across sessions.
2.2 Cookie Clearing
Users who clear their cookies may be treated as new visitors upon return, potentially getting reassigned to a different variant. This undermines assignment durability for anonymous flows and introduces noise, especially in long-running or re-engagement experiments.
2.3 Misassignment & Bugs
Occasionally, users are assigned incorrectly due to ID mismatches, logic errors, or corrupted assignment metadata. In such cases, the safest path is to pause the experiment, fix the logic, and restart with fresh exposure. Using pre-fix data risks generating false positives or invalid insights.
3. Sample Ratio Mismatch (SRM)
Experiments are monitored for sample imbalance using real-time Chi-square analysis. SRM checks are embedded into the experimentation dashboard and alerting stack, triggering alerts if group deviation exceeds 2% or the p-value drops below 0.05.
To detect latent assignment issues, these checks run across key slices such as mobile web, Android, and iOS traffic. The system also tracks assignment failure rates, missed exposure events, and abnormal traffic patterns — such as spikes caused by internal users or bots. A daily data validation pipeline compares observed user counts and event volumes against expected baselines to catch assignment drift early.
SRM is diagnosed using a Chi-square goodness-of-fit test, which compares observed user counts to expected proportions based on the experiment’s planned split. Results are surfaced in dashboards segmented by platform and device. When SRM is detected, root cause analysis includes validating hash logic, reviewing experiment configuration, and checking for delays in SDK initialization or variant rendering.
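The SRM check described above can be sketched as a Chi-square goodness-of-fit test combined with the 2% deviation guardrail. This is a simplified illustration of the logic, assuming SciPy is available; threshold defaults mirror the values mentioned earlier.

```python
from scipy.stats import chisquare

def check_srm(observed: list[int], planned_ratios: list[float],
              p_threshold: float = 0.05, deviation_threshold: float = 0.02) -> bool:
    """Return True if a sample ratio mismatch should be flagged."""
    total = sum(observed)
    # Expected counts under the experiment's planned split
    expected = [total * r for r in planned_ratios]
    _, p_value = chisquare(f_obs=observed, f_exp=expected)
    # Largest absolute deviation of any group's observed share from plan
    max_deviation = max(abs(o / total - r) for o, r in zip(observed, planned_ratios))
    return bool(p_value < p_threshold or max_deviation > deviation_threshold)

# A modest imbalance on a planned 50/50 split is flagged at this volume,
# even though the share deviation (0.5%) is well under 2%.
print(check_srm([50_500, 49_500], [0.5, 0.5]))
```

Note that with large samples the Chi-square test detects imbalances far smaller than 2%, which is why the p-value check usually fires first; the deviation threshold mostly guards low-traffic slices.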
4. Platform Bias and Randomization
Platform-specific user behavior can influence experiment results, particularly in marketplaces with large mobile and desktop user bases. To address this, experiments are stratified at assignment time based on platform, device type, and occasionally geography.
Mobile-only or app-first initiatives are launched separately from web to preserve consistency in traffic characteristics. Randomization keys (e.g., user_id, device_id) are carefully chosen to ensure stable bucketing across platforms and prevent users from switching buckets due to login state or cookie resets.
Dashboards visualize platform distribution and variant assignment in real time, helping teams identify imbalance early. Any platform-level inconsistencies flagged post-launch trigger validation checks on event logging, eligibility filters, and rollout gating.
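A minimal sketch of the platform-imbalance flagging described above: compare each platform's treatment share against the experiment-wide share. The counts and tolerance are illustrative, not production values.

```python
def platform_imbalance(counts: dict[str, dict[str, int]],
                       tolerance: float = 0.02) -> dict[str, float]:
    """counts: platform -> {variant: user count}. Returns platforms whose
    treatment share deviates from the overall share by more than tolerance."""
    total_treatment = sum(c["treatment"] for c in counts.values())
    total = sum(sum(c.values()) for c in counts.values())
    overall_share = total_treatment / total
    flagged = {}
    for platform, c in counts.items():
        share = c["treatment"] / sum(c.values())
        if abs(share - overall_share) > tolerance:
            flagged[platform] = share
    return flagged

# Web's 40% treatment share stands out against a near-50/50 overall split.
flagged = platform_imbalance({
    "ios": {"control": 10_000, "treatment": 10_000},
    "web": {"control": 1_200, "treatment": 800},
})
```

A flag like this is a trigger for the downstream checks mentioned above (event logging, eligibility filters, rollout gating), not a verdict on its own.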
5. Advanced Experimentation: Multi-Tenant & Bandits
5.1 Multi-Tenant Experiments
On shared surfaces like the homepage or search page, multiple teams often launch experiments concurrently. To avoid interaction effects or double exposure, coordination mechanisms are used:
— Mutually exclusive namespaces to prevent conflicting experiments on the same page
— Layered orthogonal randomization to isolate effects while allowing parallel testing
— Dynamic segment exclusions based on traffic source or page context
Pre-launch, experiments go through conflict checks in a centralized config system that enforces mutual exclusivity and rollout gating.
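Layered orthogonal randomization can be sketched as hashing the same unit identifier with a per-layer salt: a user's bucket in one layer is then statistically independent of their bucket in any other, so experiments in different layers can run in parallel without correlated exposure. Layer names here are hypothetical.

```python
import hashlib

def layer_bucket(unit_id: str, layer: str, num_buckets: int = 1000) -> int:
    """Deterministic bucket for a unit within a given layer."""
    # The layer name acts as a salt, decorrelating buckets across layers.
    digest = hashlib.sha256(f"{layer}:{unit_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

# The same user lands in independent buckets per layer, so a search-ranking
# test and a homepage test can split traffic without interfering.
search_bucket = layer_bucket("user_42", "search_ranking_layer")
homepage_bucket = layer_bucket("user_42", "homepage_layer")
```

Mutually exclusive experiments, by contrast, would carve up bucket ranges within a single layer so no unit can qualify for two of them.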
5.2 Bandit Testing
Bandit testing is used for features where dynamic allocation adds value — such as ranking algorithms, personalization models, or ad creatives. These use cases benefit from the ability to continuously learn and shift traffic toward higher-performing experiences without waiting for a fixed test window to conclude.
We primarily rely on Thompson Sampling to manage exploration-exploitation tradeoffs. The configuration includes several controls to ensure stability while allowing the model to learn:
— Traffic caps are applied during early rollout to prevent overreaction to noisy initial data.
— Dynamic reallocation is used to shift more traffic toward variants with stronger posterior performance.
— Stop-loss logic limits exposure to underperforming variants and helps maintain user experience quality.
Bandits are evaluated using Bayesian inference, with posterior distributions calculated for each variant’s reward probability (e.g., click or conversion). Before launching any bandit, we apply the same QA rigor as standard A/B tests — including instrumentation validation, rollout gating, and edge-case review.
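The core Thompson Sampling loop with Beta posteriors over a binary reward can be sketched as follows. This is a bare-bones illustration with a uniform Beta(1, 1) prior and a simulated reward; it omits the traffic caps and stop-loss logic described above, and the conversion rates are invented for the example.

```python
import random

class ThompsonBandit:
    def __init__(self, variants: list[str]):
        # Beta(1, 1) prior per variant: [successes + 1, failures + 1]
        self.posteriors = {v: [1, 1] for v in variants}

    def choose(self) -> str:
        # Sample once from each posterior; serve the variant with the
        # highest draw. Reallocation toward strong variants emerges
        # naturally as posteriors sharpen.
        draws = {v: random.betavariate(a, b)
                 for v, (a, b) in self.posteriors.items()}
        return max(draws, key=draws.get)

    def update(self, variant: str, reward: int) -> None:
        self.posteriors[variant][0] += reward       # success count
        self.posteriors[variant][1] += 1 - reward   # failure count

bandit = ThompsonBandit(["control", "treatment"])
for _ in range(1000):
    v = bandit.choose()
    # Simulated binary reward: treatment converts at 12%, control at 10%
    p = 0.12 if v == "treatment" else 0.10
    bandit.update(v, 1 if random.random() < p else 0)
```

In practice the posterior update would run on batched, deduplicated exposure and reward logs rather than per-request, which also makes the QA checks mentioned above (instrumentation validation, edge-case review) tractable.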
While bandits are not ideal for every feature, they have proven valuable in use cases where response curves evolve over time, or where it is critical to minimize opportunity cost without sacrificing measurement quality.
6. Best Practices and Infrastructure
Before launching any experiment, we run a dry run to verify assignment logic, ensure tracking instrumentation is firing correctly, and confirm that backend eligibility filters are functioning as expected. This pre-launch validation catches misconfigurations early and prevents data corruption in live traffic.
Once an experiment is live, we monitor variant-specific health metrics — such as load time, API error rates, and event logging fidelity — to catch regressions that could affect user experience or measurement accuracy. These diagnostics are automated and integrated into our QA and alerting workflows.
On the analysis side, statistical tests are embedded directly into our metric pipelines:
— For binary metrics like click-through or conversion rate, we use two-proportion z-tests or bootstrapping, depending on volume and skew.
— For continuous metrics like GMV per session, we apply Welch’s t-tests or bootstrapped confidence intervals to handle variance and outliers.
— Ratio metrics (e.g., add-to-cart per visit) are evaluated using delta bootstrapping with outlier filtering to avoid distortion from long-tail behavior.
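The binary-metric branch above can be sketched as a standard two-proportion z-test with a pooled standard error. This is an illustrative stdlib-only implementation, not the production pipeline; the continuous and ratio branches would typically use SciPy's Welch's t-test or a bootstrap instead.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for H0: p_a == p_b."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# A 10% -> 11% lift at 10k users per arm
z, p = two_proportion_z_test(1000, 10_000, 1100, 10_000)
```

The z-test is appropriate at marketplace volumes; for low-traffic segments or heavily skewed metrics, the bootstrapped alternatives mentioned above are the safer default.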
Dashboards expose not just raw lifts but also confidence intervals, p-values, and segment-level breakdowns, enabling product and analytics teams to make decisions with both speed and statistical rigor.
7. Conclusion
Robust experimentation in marketplace environments depends on consistent assignment logic, clean instrumentation, and scalable analysis infrastructure. Attention to detail — whether in handling SRM, managing platform bias, or coordinating multiple tests without interference — is critical to ensuring reliable outcomes. By embedding technical validation into every stage of the process, teams can confidently launch, monitor, and optimize experiments that drive product and business impact.