As developers, we’re trained to spot leaky abstractions.
When a model simplifies reality too much, edge cases pile up, behaviour diverges, and the system starts lying to you.
After years of building analytics pipelines, experimentation engines, and real-time behavioural systems (most recently Zyro, an adaptive growth platform), I’ve come to a slightly uncomfortable conclusion:
Traditional A/B testing is a leaky abstraction for modern user behaviour.
It worked.
It was necessary.
But it no longer maps cleanly to how traffic behaves today.
The Original Assumption Behind A/B Testing
Classic A/B testing assumes:
- A stable audience
- Homogeneous intent
- Static behaviour
- A single global optimum
You create variants, split traffic evenly, wait for statistical significance, then deploy a “winner”.
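That workflow can be sketched in a few lines. This is a minimal illustration using a two-proportion z-test; the traffic numbers and the 1.96 cutoff (roughly 95% confidence) are assumptions for the example, not taken from any particular tool:

```python
import math

def ab_test(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Classic flow: split traffic evenly, wait for significance,
    then declare a single global winner."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    if abs(z) < z_crit:
        return "keep waiting"              # not significant yet
    return "B wins" if z > 0 else "A wins"

print(ab_test(120, 2400, 160, 2400))  # -> "B wins"
```

Note what the model bakes in: one pooled population, one decision, made once.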
That model made sense when:
- traffic sources were limited
- users arrived with similar context
- sessions were isolated
None of that is true anymore.
Modern Traffic Is Heterogeneous by Default
Today, two visitors landing on the same URL can be fundamentally different:
- One comes from Google with a research mindset
- One comes from TikTok with zero patience
- One arrives from ChatGPT already primed with context
- One is a returning user carrying session history
Yet most experiments treat them identically.
From a systems perspective, that’s already a red flag.
You’re averaging over fundamentally different distributions.
Why “Global Winners” Are Usually Local Losers
Here’s the part most dashboards won’t show you.
When you declare a global winner, you often do so by averaging away cohort-level differences:
- Variant A performs slightly better overall
- But performs worse for specific cohorts
- Those losses get hidden inside aggregates
So you ship something that’s “better on average” while quietly degrading performance for high-intent traffic segments.
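The aggregation effect is easy to reproduce. Here is a toy example with made-up cohort numbers, where variant A wins the blended average while losing the high-intent cohort outright:

```python
# Hypothetical cohort data: (visitors, conversions) per variant.
cohorts = {
    "low_intent":  {"A": (9000, 288), "B": (9000, 225)},  # A 3.2%  vs B 2.5%
    "high_intent": {"A": (1000, 100), "B": (1000, 150)},  # A 10.0% vs B 15.0%
}

def overall_rate(variant):
    visitors = sum(cohorts[c][variant][0] for c in cohorts)
    conversions = sum(cohorts[c][variant][1] for c in cohorts)
    return conversions / visitors

def cohort_rate(cohort, variant):
    visitors, conversions = cohorts[cohort][variant]
    return conversions / visitors

print(f"overall: A={overall_rate('A'):.2%} B={overall_rate('B'):.2%}")
# -> overall: A=3.88% B=3.75%  (A "wins")
print(f"high intent: A={cohort_rate('high_intent', 'A'):.2%} "
      f"B={cohort_rate('high_intent', 'B'):.2%}")
# -> high intent: A=10.00% B=15.00%  (A loses where it matters most)
```

A dashboard showing only the first line ships A, and the high-intent regression never surfaces.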
This isn’t a statistics problem.
It’s a modelling problem.
Behaviour Happens Before Conversions
Most analytics systems anchor on terminal events:
- purchase
- signup
- submit
But behaviour doesn’t start there.
The meaningful signals appear earlier:
- hesitation
- comparison
- repeated scrolling
- policy checking
- copying product details
- bouncing between tabs
These are not noise.
They are state transitions.
And state transitions are what systems should respond to.
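One way to make that concrete is a transition table keyed on (state, event) pairs. The state names and events below are illustrative, not a fixed taxonomy:

```python
# Behavioural events drive state transitions, long before any
# terminal conversion event fires.
TRANSITIONS = {
    ("browsing", "repeated_scroll"):    "hesitating",
    ("browsing", "copy_details"):       "comparing",
    ("hesitating", "policy_check"):     "evaluating_risk",
    ("comparing", "tab_return"):        "high_intent",
    ("evaluating_risk", "tab_return"):  "high_intent",
}

def advance(state, event):
    # Unknown (state, event) pairs leave the state unchanged.
    return TRANSITIONS.get((state, event), state)

state = "browsing"
for event in ["repeated_scroll", "policy_check", "tab_return"]:
    state = advance(state, event)
print(state)  # -> "high_intent"
```

The point is that the session reaches "high_intent" while the analytics system anchored on terminal events still sees nothing.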
This insight didn’t come from dashboards.
It came from building systems that had to react to behaviour as it was happening, not after the fact.
The Shift From Experiments to Control Systems
While building Zyro, we stopped thinking in terms of “experiments” and started thinking in terms of control systems.
A control system:
- observes signals continuously
- reacts in real time
- adapts based on feedback
- never “finishes” optimisation
In this model:
- A/B testing becomes one input, not the governing mechanism
Multi-armed bandits, intent scoring, and source-aware routing aren’t replacements for testing.
They’re what naturally emerges when optimisation is treated as a live system instead of a batch process.
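As a sketch of the bandit side of this, here is a minimal epsilon-greedy implementation. The arm names and the epsilon value are placeholders; real systems typically use something more sample-efficient like Thompson sampling:

```python
import random

class EpsilonGreedyBandit:
    """Mostly exploit the best-known variant, occasionally explore.
    Optimisation never 'finishes' -- every reward updates the policy."""

    def __init__(self, arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}   # running mean reward per arm

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))   # explore
        return max(self.values, key=self.values.get)  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n

bandit = EpsilonGreedyBandit(["A", "B"], epsilon=0.1)
arm = bandit.choose()
bandit.update(arm, reward=1.0)  # e.g. reward from an observed conversion
```

Contrast this with the batch model: there is no "wait for significance, then deploy" step, just a continuous feedback loop.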
Why This Is a Developer Problem (Not a Marketing One)
Most growth tooling is designed for dashboards, not systems.
Developers care about:
- feedback loops
- latency
- reliability
- failure modes
- state awareness
Once you view optimisation through that lens, it becomes obvious why manual experimentation struggles:
- it’s slow
- it’s coarse-grained
- it reacts after the fact
- it assumes stationarity where none exists
Modern systems don’t wait for significance.
They adapt.
The Direction This Is Heading
The next generation of websites won’t run “tests” in the traditional sense.
They’ll behave more like adaptive systems:
- source-aware rendering
- intent-weighted decisions
- continuous learning
- server-enriched signals
- automated feedback into acquisition channels
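A rough sketch of what source-aware rendering can look like at the routing layer. Every source name, variant name, and threshold here is hypothetical, chosen only to mirror the visitor examples earlier in the post:

```python
# Illustrative routing: traffic source plus session state picks the
# rendering strategy, instead of one global winner for all visitors.
def pick_variant(source, returning=False, intent_score=0.0):
    if returning:
        return "resume_context"       # carry session history forward
    if source == "tiktok":
        return "instant_hook"         # low-patience traffic
    if source == "google" and intent_score < 0.5:
        return "comparison_layout"    # research mindset
    if source == "chatgpt":
        return "skip_intro"           # already primed with context
    return "default"

print(pick_variant("tiktok"))  # -> "instant_hook"
```

In a live system the branches would be learned rather than hand-written, but the shape is the same: the decision is conditioned on who arrived, not averaged over everyone.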
This isn’t theory.
It’s the architecture you end up with when you try to make optimisation actually match how users behave.
Final Thought
A/B testing didn’t fail.
It succeeded so well that we stopped questioning its assumptions.
But as traffic becomes more fragmented and behaviour more stateful, optimisation needs to evolve from experiments into systems.
When an abstraction starts leaking, the fix isn’t more tooling.
It’s a better model.
And that’s the kind of problem developers are very good at solving.