For the last two years, I’ve worked in an environment where A/B testing is a primary rollout mechanism — a completely new experience for me. Recently, Nicolas Gallagher tweeted 37 words that pretty accurately summed up at least half of my experience with A/B testing, and triggered me to write an article about it.
The promise of A/B testing
A/B testing, or multivariate testing, is a mechanism to serve two (or more) versions of the same feature or page and compare the statistics of those versions to see which one performs better. Ideally, this leads to more (data-)informed decision making, and enables fast feedback loops and continuous improvement. The classic example (and the first time I heard about A/B testing): Google once tested 41 shades of blue, to see which color made users convert the most.
At a high-level, this is how it works (or, how it should work):
- A hypothesis is formed about user behavior (often based on psychological theories, like Fear of missing out). To test this hypothesis, we create an experiment, where we test the current implementation (often called “Control”) against one or more variations of a new implementation. Before the test starts, a decision is made on which metrics the experiment will be evaluated against.
- A user accesses the website or application, and specifically the feature you are testing.
- They are then bucketed, meaning they are assigned to a variant of the experiment you are testing. Which bucket the user lands in determines which version of the experiment they get to see and use. They stay in this bucket for the duration of the experiment (see the sketch after this list).
- User behavior events are recorded and stored for statistical analysis. Usually, experiments are active for at least a week, often longer. Once an experiment completes, the results are analyzed, and in most cases the best-performing variant is rolled out to all users.
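To make the “sticky bucket” part concrete, here’s a minimal sketch in plain JavaScript (the function and variant names are made up for illustration, not taken from any particular tool): hashing a stable user ID together with the experiment ID means the same user always gets the same variant, without having to store the assignment anywhere.

```js
// Minimal sketch of deterministic, sticky bucketing (illustrative, not from any
// specific tool). The same userId + experimentId always hashes to the same
// number, so the user stays in the same bucket for the duration of the experiment.
function bucket(userId, experimentId, variants = ['control', 'B']) {
  let hash = 0;
  for (const char of `${experimentId}:${userId}`) {
    hash = (hash * 31 + char.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return variants[hash % variants.length];
}

bucket('user-42', 'checkout-cta-test'); // always returns the same variant for user-42
```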
Setting up an experiment
Now, suppose you want to implement an A/B test. What would you need? Here are some key features you could be looking for in any solution (a rough sketch tying the first three together follows the list):
- Bucketing: Users need to be distributed in buckets. At its core, it’s essentially a `Math.random() > 0.5`. Often, this split is 50/50 (or evenly split among variants), but if you run a riskier experiment, you might do something like an 80/10/10 split. You can also use audience or environment targeting to determine whether a user is eligible to enter an experiment.
- Remote configuration: It helps to be able to turn experiments off and on with configuration that lives separately from code. Deploying a new version of an application is a cumbersome process in most cases, and the people managing experiments are often not engineers themselves. Ideally, product managers/marketeers can manage the configuration via a separate process that is friendlier for them to use than rolling out a new release. At the very least, you’ll want to manage which experiments are enabled (because, you know, things break), as well as things like start and end dates, or targeting.
- Tracking: You’ll also need data about user behavior in your experiment, so you can evaluate which variant performed better. How much time do users spend on the page? Do they convert to the next step in the funnel? How many shopping items do they add to their basket? You’ll need to hook into these events, record them and persist them in a central location for…
- Analysis: Once you have that data and the experiment concludes, you need to analyze the results. Some basic numbers: how many users were in variant B or C? How did they perform on the metrics you picked? What is the statistical significance of the test? Maybe you want to segment on device type, or audience data too. You can import stats from a database and use a basic Excel sheet, or you can use a tool like Google Analytics that, for instance, allows you to query sequences, which can be very useful in analyzing user behavior.
- Visual editors: Tools like Optimizely offer a visual editor that allows you to click your way to designing a new experiment. Super useful if you don’t have direct access to an engineering team (if you do have that, there are likely better options).
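Here’s a rough sketch of how the first three features could fit together. The config shape, the /events endpoint and the event names are assumptions for illustration; a real setup would pull the configuration from a flag service or CMS and use a proper analytics SDK.

```js
// Rough sketch: remote configuration + weighted bucketing + tracking.
// The config would normally be fetched from a remote source so that product
// managers can toggle experiments without a deploy; it is inlined here.
const experimentConfig = {
  'new-checkout': {
    enabled: true,
    // riskier experiment: 80% control, 10% for each new variant
    variants: [
      { name: 'control', weight: 0.8 },
      { name: 'B', weight: 0.1 },
      { name: 'C', weight: 0.1 },
    ],
  },
};

function assignVariant(experimentId) {
  const experiment = experimentConfig[experimentId];
  if (!experiment || !experiment.enabled) return 'control';

  // Weighted assignment. In practice you would use the deterministic hash from
  // the earlier sketch instead of Math.random(), so the assignment is sticky.
  let roll = Math.random();
  for (const { name, weight } of experiment.variants) {
    if (roll < weight) return name;
    roll -= weight;
  }
  return 'control';
}

function track(eventName, payload) {
  // Record behavioral events in a central place for later analysis.
  // sendBeacon survives navigation; an analytics SDK would work just as well.
  navigator.sendBeacon('/events', JSON.stringify({ eventName, ...payload, ts: Date.now() }));
}

const variant = assignVariant('new-checkout');
track('experiment_exposure', { experimentId: 'new-checkout', variant });
```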
Implementation approaches
To my (admittedly limited) knowledge, there are at least five ways to implement A/B testing:
Canary releases: If you want to test a new variation of your website, you can deploy a new version per variant that you want to test (a feature branch perhaps), and then route a subset of your users (with sticky sessions) to that new deployment. To be able to use this, you have to have a well-managed infrastructure and release pipeline, especially when you want to run multiple tests in parallel and need many different deployments and the routing complexities that come with it. Likely you’ll need a decent amount of traffic, too. The upsides seem clear, though. For instance, any failed experiment does not introduce technical debt (code never lands on master, and deployments are just deleted). Another benefit is that this enforces that a user can only be in one experiment at a time; multiple experiments introduce both technical challenges and uncertainty about experiments influencing each other.
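As a rough illustration of the routing part, here’s a toy Node.js proxy. The upstream hosts, cookie name and 10% share are made up, and in practice this logic usually lives in your load balancer or CDN rather than in hand-rolled code.

```js
// Toy sketch of sticky canary routing. New sessions are split between the current
// deployment and the canary deployment; a cookie keeps the session sticky.
const http = require('http');

const UPSTREAMS = {
  control: 'http://control.internal:3000', // current version
  canary: 'http://canary.internal:3000',   // feature-branch deployment
};
const CANARY_SHARE = 0.1; // route 10% of new sessions to the canary

function proxyTo(target, req, res) {
  const upstreamReq = http.request(
    new URL(req.url, target),
    { method: req.method, headers: req.headers },
    (upstreamRes) => {
      res.writeHead(upstreamRes.statusCode, upstreamRes.headers);
      upstreamRes.pipe(res);
    }
  );
  req.pipe(upstreamReq);
}

http.createServer((req, res) => {
  let bucket = ((req.headers.cookie || '').match(/canary_bucket=(\w+)/) || [])[1];
  if (!bucket) {
    bucket = Math.random() < CANARY_SHARE ? 'canary' : 'control';
    // Sticky session: later requests from this user hit the same deployment.
    res.setHeader('Set-Cookie', `canary_bucket=${bucket}; Path=/; Max-Age=604800`);
  }
  proxyTo(UPSTREAMS[bucket], req, res);
}).listen(8080);
```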
Split URLs: Historically recommended by Google to prevent SEO issues, you can use URLs to route users to different experiments, for example `/amazing-feature/test-123/b`. The specific benefit of this approach is that you will not negatively impact any SEO value a given URL on your domain has while you’re experimenting with different designs.
Server-side: Users are bucketed on the server when a page is requested. A cookie is then set to ensure the user is “stuck” in this bucket, and it’s used to render an interface with whatever experiments the user is in. You can pretty much do whatever you want: A/B tests, multivariate tests, feature toggles, parallel experiments; it’s all up to you. For the user, this is one of the best options, because the performance impact is negligible. However, because you use cookies, the benefits of a CDN are limited: cookies introduce variation in requests (especially if users can enter multiple experiments), which leads to cache misses and leaves you without the protection of a CDN.
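A minimal sketch of what that could look like, assuming an Express app (the cookie name, route and render functions are placeholders, not from any real project):

```js
// Minimal sketch of server-side bucketing with a cookie (Express).
const express = require('express');
const cookieParser = require('cookie-parser');

const app = express();
app.use(cookieParser());

// Placeholder renderers standing in for your real templates/components.
const renderControl = (id) => `<h1>Product ${id}</h1>`;
const renderVariantB = (id) => `<h1>Product ${id}</h1><p>New gallery layout</p>`;

app.get('/product/:id', (req, res) => {
  // Reuse the bucket stored in the cookie, or assign one on the first visit.
  let variant = req.cookies['exp_product_gallery'];
  if (!variant) {
    variant = Math.random() < 0.5 ? 'control' : 'B';
    res.cookie('exp_product_gallery', variant, { maxAge: 30 * 24 * 3600 * 1000 });
  }

  // Note: the response now varies per user, which is exactly what makes it
  // hard to cache on a CDN.
  const html = variant === 'B' ? renderVariantB(req.params.id) : renderControl(req.params.id);
  res.send(html);
});

app.listen(3000);
```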
Client-side: If you don’t have access to the server, or you want maximum flexibility, client-side A/B testing is also an option. In this scenario, either no interface or the original interface is rendered first, and just before or right after that happens, the experiments are activated and the interface is augmented based on whatever variant the user is in. This choice often makes sense when you do not have access to an engineering team, and are using external tools to run experiments. However, it’s often the worst choice in terms of performance. As an example, let’s look at how client-side Optimizely is implemented: you embed a blocking script, which forces the browser to wait to display anything on screen until this script is downloaded, compiled and executed. Additionally, the browser will de-prioritize all other resources (that might be needed to progressively enhance your website) in order to load the blocking script as fast as possible. To top it off, the browser has to preconnect to another origin if you do not self-host the script, and the script can only be cached for a couple of minutes (that is, if you want to be able to turn off a conversion-destroying experiment as quickly as possible). With synthetic tests on mobile connections, I’ve seen Optimizely delay critical events by 1–2 seconds. Use with caution!
On the edge: If you have a CDN in front of your website, you can use the power of edge workers to run experiments. I’ll refer to Scott Jehl for the details, but the gist is that your server renders all variations of your interface, your CDN caches this response, and when a user loads your website, the cached response is served after the edge worker removes the HTML that is not applicable to the user requesting it. A very promising approach if you care about performance, because you get the benefits of a CDN without any impact on browser performance.
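For a flavor of what the edge approach can look like, here’s a rough sketch using Cloudflare Workers’ HTMLRewriter as an example (other CDNs have comparable primitives). The data-variant attribute and cookie name are assumptions for illustration, and Scott Jehl’s write-up goes into far more detail.

```js
// Rough sketch of an edge-worker experiment (Cloudflare Workers syntax).
// The origin renders *all* variants, marked with a data-variant attribute, so the
// response is identical for everyone and fully cacheable. The worker then strips
// the markup that does not apply to this particular user.
export default {
  async fetch(request) {
    const cookie = request.headers.get('Cookie') || '';
    const variant = /exp_variant=B/.test(cookie) ? 'B' : 'control';

    const response = await fetch(request); // served from the CDN cache when possible

    return new HTMLRewriter()
      .on('[data-variant]', {
        element(el) {
          // Keep only the markup for the variant this user is in.
          if (el.getAttribute('data-variant') !== variant) el.remove();
        },
      })
      .transform(response);
  },
};
```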
The reality of A/B testing
Turns out, A/B testing is hard. It can be very valuable, and I think you owe it to your bottom line to measure. However, it’s not a silver bullet, and you have to tailor your approach to the type of company you are (or want to be). Here’s what I learned at a mid-sized company with roughly 50–100k users a day:
Isolate experiments as much as possible
At my current employer, we implement experiments in parallel, and the implementation is always in production as soon as it has been verified, regardless of whether it will be used or not (basically, a feature toggle). This is mostly due to our tech choices: we have a server-side rendered, re-hydrated, single-page application, which makes it hard to use the canary strategy (because you never go back to a router or load balancer). Besides that, we cannot afford the luxury of one experiment per user across the platform, due to a lack of traffic. In practice, this means that experiments have side effects. There are two issues at play here.
Firstly, concurrently implemented experiments make any reasonable expectation of end-to-end test coverage impossible: even a small number like 10 A/B tests creates 2^10 = 1,024 possible combinations of your application. To test all of those combinations, our test suite would take roughly 250 hours (1,024 runs of 15 minutes) instead of 15 minutes. So, we disable all experiments during our tests, which means that any experiment can — and eventually will — break critical (and non-critical) user functionality. Additionally, besides the cache problem I mentioned earlier, it also makes it much harder to reliably reproduce bugs from your error reporting systems (and reproduction is hard enough to begin with!).
Secondly, running multiple experiments across a user’s journey will lead to uncertainty about test results. Suppose you have an experiment on a product page, and one on search. If the experiment on search has a big impact on the type of traffic you send to the product page, the results from the product experiment will be skewed.
The best isolation strategy I can think of is canary releases with feature branches. In my wild, lucid dreams, this is how it works: when you start an experiment, you create a branch that will contain the changes for a variant. You open a pull request, and a test environment with that change is deployed. Once it passes review, it is deployed to production, and the router configuration is updated to send a certain amount of traffic to the variant that you want to test. You have to look at expected usage, general traffic and the desired duration of the test to determine what traffic split makes sense. Suppose you estimate that 20% of traffic for a week would be enough: you would then exclude 80% of traffic from the test, and split the remaining 20% evenly between an instance running the current version of your website and an instance running the experiment variant.
I can imagine orchestrating this requires a significant engineering effort though, especially when you want to automatically turn experiments on and off, or when you want to use more advanced targeting. You also need enough traffic for this, and at this point you will see benefits from splitting up your website into smaller deployable units. For instance, you might want to consider splitting up your frontend into micro-frontends.
If you cannot properly isolate experiments, you could try to accept that not all problems in life are, or should be, solvable. If you’re more of a control freak (like me), you might want to consider mutually exclusive experiments — meaning that a user who is in experiment X cannot be in experiment Y at the same time. This will help eliminate behavioral side effects. If you need more testing confidence, you can opt for lower-level testing, like unit or component testing. Or you can deal with 250-hour-plus pipelines, whatever floats your boat.
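One common way to implement that mutual exclusivity, sketched here under the assumption that you control bucketing yourself (the experiment names and slot counts are made up), is to hash each user into exactly one slot and let every running experiment own a disjoint range of slots:

```js
// Sketch of mutually exclusive experiments: each user hashes into exactly one
// slot, and each slot belongs to at most one experiment, so a user who is in
// experiment X can never also be in experiment Y.
const SLOTS = 100;

// Hypothetical allocation of slots to currently running experiments.
const slotOwners = [
  { experimentId: 'search-ranking', from: 0, to: 49 },   // 50% of users
  { experimentId: 'product-gallery', from: 50, to: 79 }, // 30% of users
  // slots 80-99 stay free for future experiments
];

function slotFor(userId) {
  let hash = 0;
  for (const char of userId) hash = (hash * 31 + char.charCodeAt(0)) >>> 0;
  return hash % SLOTS;
}

function eligibleExperiment(userId) {
  const slot = slotFor(userId);
  const owner = slotOwners.find(({ from, to }) => slot >= from && slot <= to);
  return owner ? owner.experimentId : null; // at most one experiment per user
}

eligibleExperiment('user-42'); // e.g. 'search-ranking', 'product-gallery' or null
```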
Stick to high standards
One oft-repeated mantra around A/B testing is “it’s just a test, we’ll fix it later”. The idea here is that you build an MVP and gauge interest, and if there is interest, you build a better-designed, better-implemented final version. In practice I have not seen that work, presumably for two reasons. The first is that the incentive to fix things disappears after they are shipped. This applies to all parties: engineering, design and product. The experiment has already proven to be an uplift, and spending time re-designing or refactoring will feel unneeded. And things that feel unneeded — even if they are needed — will not happen, especially in the pressure cooker that is product engineering. The second reason is that re-implementing an experiment, or even redesigning it, could have an impact on any formerly assumed uplift. To be absolutely sure, you’d have to run another experiment, now with the production-ready implementation. Ain’t nobody got time for that, chief. And here’s the thing: the type of environment that needs to take shortcuts for the implementation of an experiment is also unlikely to allocate time to refactor and/or re-run a successful experiment.
What happens? You accumulate tech debt. Often not something that is clearly scoped, and quantitatively described. Debt that is hard to put a number on, and hard to make the case for it to be addressed. Debt that will creep up on you, until finally, everybody gives up and pulls out the Rewrite hammer. (I’ll refer back to Nicolas’ tweet at this point).
Different standards are not just unwise, they are also confusing. It’s hard enough to align engineers on one standard, but two? Impossible. Brace yourself for endless back-and-forths in code reviews. (On a personal note — as if the former declarations were not just that — lower standards are uninspiring as hell, too. But maybe that’s just me.)
Aim for impact
VWO, a conversion optimization platform, estimates that only around 1 in 7 experiments produces a winning variant. The common refrain in the CRO world is that failing is okay, as long as you learn from it. The assumption here is that knowing what doesn’t work is as valuable as knowing what does.
However, that does not mean you should start experimenting with things that are just guesswork, and/or can be figured out by common sense, experience, or qualitative research. Every one of those options is cheaper than throwing away over 85% of the capacity of 100k-a-year designers or developers — especially once you factor in churn, which will inevitably go up if your employees feel like all of their contributions are meaningless.
How do you keep morale high and make contributors feel valued? Sure, buy-in and emphasizing learning help. But for me, big bets are the most inspiring. They allow me to fully use my experience and skill set to make a difference. Now, what counts as a big bet depends on the type of company you are, but I wouldn’t consider repositioning or copy experiments to be in that category. A good indication that you are setting the bar too low is a lot of inconclusive experiments (or experiments that have to run for a long time to reach significance). If that’s the case, you’re placing too many small bets.
Now, back to that tweet…
Admittedly I used Nicolas’ tweet mostly to have a nice, stinging introduction to a rather boring topic, but it carries a potent truth: data, or the requirement of data, often leads to inertia. Data does not have all the answers. It does not replace a vision, and it is not a strategy.
Define a vision, then use A/B tests to validate your progress in reaching that vision. Not the other way around.