Running A/B tests on top of edge feature flags

#javascript #cloudflare #typescript #devops

Once you have feature flags, an A/B test is a small step further: a flag with more than one variant, plus honest measurement. Here is how we do it at the edge, and the two bugs that quietly invalidate experiments.

I build this for Zenovay (web analytics). This assumes you already read flag config at the edge with no extra latency.

A flag is on/off. An experiment is a bucket.

The only new pieces are: assigning each user to a variant consistently, and logging exposure so you can measure.

Deterministic assignment (bug #1 if you get it wrong)

Never use Math.random to pick a variant. The same user would flicker between variants on every request, which destroys the experiment and the user experience. Hash a stable id instead, so a given user always lands in the same bucket.

// Deterministically map a user to one variant of an experiment.
async function bucket(userId: string, experiment: string, variants: string[]): Promise<string> {
  const data = new TextEncoder().encode(`${experiment}:${userId}`);
  const digest = await crypto.subtle.digest("SHA-256", data);
  // take the first 4 bytes of the hash as an unsigned 32-bit int (big-endian)
  const n = new DataView(digest).getUint32(0);
  // divide by 2^32, not 0xffffffff: the fraction stays in [0, 1), so the
  // index can never equal variants.length and return undefined
  const bucketFraction = n / 0x100000000;
  const index = Math.floor(bucketFraction * variants.length);
  return variants[index];
}

// usage at the edge
const variant = await bucket(userId, "checkout_copy_v1", ["control", "treatment"]);

Including the experiment name in the hash matters: it makes a user's bucket in one experiment independent of their bucket in any other. Without it, the hash depends only on the user id, so with the same variant list everyone in "treatment" for experiment A is also in "treatment" for B, which confounds both results.
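
To see the effect, here is a minimal offline check, a sketch rather than production code. It assumes Node: it uses `node:crypto`'s synchronous `createHash` so the loop needs no `await`, and reads the same first 4 bytes of SHA-256 as `bucket()`. The names `bucketSync` and `agreement` and the `user-N` ids are hypothetical, chosen for illustration.

```typescript
import { createHash } from "node:crypto";

// Synchronous stand-in for bucket(): SHA-256, first 4 bytes, big-endian.
function bucketSync(key: string, variants: string[]): string {
  const digest = createHash("sha256").update(key).digest();
  const n = digest.readUInt32BE(0);
  return variants[Math.floor((n / 2 ** 32) * variants.length)];
}

// Fraction of users who land in the same variant in two experiments,
// given a function that builds the hash key.
function agreement(
  keyFor: (experiment: string, userId: string) => string,
  users = 10_000,
): number {
  const variants = ["control", "treatment"];
  let same = 0;
  for (let i = 0; i < users; i++) {
    const id = `user-${i}`;
    const a = bucketSync(keyFor("experiment_a", id), variants);
    const b = bucketSync(keyFor("experiment_b", id), variants);
    if (a === b) same++;
  }
  return same / users;
}

// key includes the experiment name, as in bucket()
const withName = agreement((experiment, userId) => `${experiment}:${userId}`);
// key is the user id alone
const withoutName = agreement((_experiment, userId) => userId);

console.log({ withName, withoutName });
```

Hashing the user id alone gives an agreement of exactly 1. With the experiment name in the key, it should sit close to 0.5 for a two-way split, which is what independent assignment looks like.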

Log exposure, not just conversion (bug #2)

You must record that a user was actually exposed to a variant, at the moment they were exposed. If you only look at who converted, you cannot compute a rate, because you do not know the denominator per variant.

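// fire once, when the variant is actually shown (runs in the browser)
function logExposure(userId: string, experiment: string, variant: string) {
  // navigator.sendBeacon takes a string or Blob body, not a plain object
  const payload = JSON.stringify({ userId, experiment, variant, ts: Date.now() });
  navigator.sendBeacon("/exposure", payload);
}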

The classic mistake is assigning a variant but only logging conversions. Then "treatment converted 40, control converted 30" tells you nothing without exposure counts.
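
One way to honor "fire once" is to deduplicate per user and experiment before sending. The sketch below assumes an injected `send` transport (for example, a thin wrapper around `navigator.sendBeacon` or `fetch`). `createExposureLogger`, `ExposureEvent`, and the in-memory `Set` are hypothetical, and the `Set` only dedupes within a single page or isolate lifetime.

```typescript
interface ExposureEvent {
  userId: string;
  experiment: string;
  variant: string;
  ts: number;
}

type Send = (path: string, event: ExposureEvent) => void;

// Returns a logExposure function that sends at most one event
// per (experiment, user) pair for as long as this logger lives.
function createExposureLogger(send: Send) {
  const seen = new Set<string>();
  return function logExposure(userId: string, experiment: string, variant: string): boolean {
    const key = `${experiment}:${userId}`;
    if (seen.has(key)) return false; // already logged, skip the duplicate
    seen.add(key);
    send("/exposure", { userId, experiment, variant, ts: Date.now() });
    return true;
  };
}

// usage: collect events in an array instead of hitting the network
const sent: ExposureEvent[] = [];
const logExposure = createExposureLogger((_path, event) => {
  sent.push(event);
});

logExposure("u1", "checkout_copy_v1", "treatment"); // sent
logExposure("u1", "checkout_copy_v1", "treatment"); // skipped as a duplicate
console.log(sent.length); // 1
```

Duplicates that slip through anyway (a new tab, a new isolate) are harmless as long as the analysis counts distinct users rather than raw events.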

Measuring the result

With exposure and conversion events, the rate per variant is straightforward.

select
  e.variant,
  count(distinct e.user_id) as exposed,
  count(distinct c.user_id) as converted,
  round(100.0 * count(distinct c.user_id) / count(distinct e.user_id), 2) as rate
from exposures e
left join conversions c
  on c.user_id = e.user_id
 and c.event_at >= e.event_at        -- only conversions at or after exposure
where e.experiment = 'checkout_copy_v1'  -- one experiment per query
group by e.variant;

Note the join condition: only count a conversion if it happened after the user was exposed. A conversion before exposure is not caused by the variant.
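
The same logic can be sketched in TypeScript on a handful of rows, which is handy as a unit-test fixture for the query. The event shapes and the `ratesByVariant` name are assumptions for illustration, and the sketch uses each user's earliest exposure as the cutoff.

```typescript
interface Exposure { userId: string; variant: string; eventAt: number }
interface Conversion { userId: string; eventAt: number }
interface VariantRate { exposed: number; converted: number; rate: number }

function ratesByVariant(
  exposures: Exposure[],
  conversions: Conversion[],
): Record<string, VariantRate> {
  // earliest exposure per user
  const first = new Map<string, Exposure>();
  for (const e of exposures) {
    const prev = first.get(e.userId);
    if (!prev || e.eventAt < prev.eventAt) first.set(e.userId, e);
  }
  // users with at least one conversion at or after their first exposure
  const converted = new Set<string>();
  for (const c of conversions) {
    const e = first.get(c.userId);
    if (e && c.eventAt >= e.eventAt) converted.add(c.userId);
  }
  // count distinct exposed and converted users per variant
  const out: Record<string, VariantRate> = {};
  first.forEach((e) => {
    if (!out[e.variant]) out[e.variant] = { exposed: 0, converted: 0, rate: 0 };
    out[e.variant].exposed++;
    if (converted.has(e.userId)) out[e.variant].converted++;
  });
  // rate as a percentage, rounded to 2 decimals
  Object.keys(out).forEach((v) => {
    out[v].rate = Math.round((10000 * out[v].converted) / out[v].exposed) / 100;
  });
  return out;
}

const rates = ratesByVariant(
  [
    { userId: "u1", variant: "control", eventAt: 100 },
    { userId: "u2", variant: "control", eventAt: 100 },
    { userId: "u3", variant: "treatment", eventAt: 100 },
    { userId: "u4", variant: "treatment", eventAt: 100 },
    { userId: "u6", variant: "treatment", eventAt: 100 },
  ],
  [
    { userId: "u1", eventAt: 150 }, // after exposure: counts
    { userId: "u3", eventAt: 50 },  // before exposure: ignored
    { userId: "u4", eventAt: 200 }, // after exposure: counts
    { userId: "u5", eventAt: 120 }, // never exposed: ignored
    { userId: "u6", eventAt: 100 }, // same instant as exposure: counts
  ],
);
console.log(rates);
```

Here control ends up with 1 of 2 exposed users converted and treatment with 2 of 3; the early conversion from `u3` and the unexposed `u5` are both left out.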

When not to roll your own

Do this yourself for simple, low-stakes tests. Reach for a real experimentation platform when you need sequential testing, guardrail metrics, automated significance calculations, or a way for non-engineers to launch tests.

The hard part of A/B testing is not assignment; it is the statistics, and not fooling yourself. The code above gives you rates, not confidence.
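
To make the gap between a rate and confidence concrete, here is a minimal sketch of a two-proportion z-test. It is a standard textbook check, not something the query above provides. The exposure counts of 1,000 per variant are hypothetical, `twoProportionZ` is an illustrative name, and the 1.96 cutoff assumes a two-sided 5% test evaluated once, at a sample size fixed in advance.

```typescript
// Two-proportion z statistic with a pooled standard error.
// convA/convB are converted users, nA/nB are exposed users.
function twoProportionZ(convA: number, nA: number, convB: number, nB: number): number {
  const pA = convA / nA;
  const pB = convB / nB;
  const pooled = (convA + convB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  return (pB - pA) / se;
}

// control converted 30, treatment converted 40,
// with an assumed 1000 exposed users in each variant
const z = twoProportionZ(30, 1000, 40, 1000);
const verdict = Math.abs(z) > 1.96 ? "significant at 5%" : "not significant at 5%";
console.log(z.toFixed(2), verdict); // 1.22 not significant at 5%
```

With those assumed counts, 40 versus 30 conversions does not clear the 5% threshold, which is why a raw rate difference is not enough to call a winner.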

Disclosure: I build Zenovay, which ties experiment exposure to downstream revenue so you can see which variant made money, not just which got clicks.

Do you stop tests on significance or on a fixed sample size? Stopping the moment it looks significant is the most common way to ship a false winner.