Simon Pfeiffer for Codesphere Inc.

All the Math you need to conduct an A/B test

If you’re an innovator, you are by definition in uncharted waters. There is no textbook on how to build your startup, because if the route to success was already figured out, then the world wouldn’t need your product.

This means that you can’t google a lot of the questions you may have about your users, the market, and how they might react to changes in your website, product, or advertising. As such, one of the only ways to find out this information is through field experiments: That’s where A/B testing comes in.

A/B testing is a method of comparing two variations of something in order to see which performs better according to some metric. The most common way this is used is to see how two different versions of a website affect the rate at which users sign up.


Procedure

The first step is to identify what your success metric is. What are you trying to increase? The amount of time a user spends on your platform? The rate at which they sign up? The click-through rate on an advertisement? Figure out what you are trying to optimize and make sure you have a method of accurately tracking it.
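As a rough illustration, here is a minimal Python sketch of what computing a signup conversion rate could look like once you log one record per visitor. The record structure and field names are invented for this example; in practice the numbers would come from your analytics tool or your own event logging.

```python
# Minimal sketch: computing a signup conversion rate from per-visitor records.
# The record structure and field names are made up for this example.
visitor_log = [
    {"visitor_id": "a1", "signed_up": True},
    {"visitor_id": "b2", "signed_up": False},
    {"visitor_id": "c3", "signed_up": False},
    {"visitor_id": "d4", "signed_up": True},
]

signups = sum(1 for visit in visitor_log if visit["signed_up"])
conversion_rate = signups / len(visitor_log)
print(f"{signups} signups / {len(visitor_log)} visitors = {conversion_rate:.1%}")
```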

Next is to identify what you want to vary. Is it the color of an advertisement? Is it the location of your sign-up button? Is it the adjective you use to describe your product? The variations in your test should be identical except for this one difference.

Additionally, identify your control variation (how things are now) and your experimental variation (the change you are considering making).

The next step is by far the hardest. You need the tracking infrastructure to measure how your success metric varies across your control and experimental variations. If you are using a user tracking tool like Mixpanel, FullStory, or LogRocket, then this is pretty easy. Otherwise, you may need to natively track this information on your website.

Additionally, you need the infrastructure to randomly assign users to either the control or experimental variation. The assignment should be truly random and should split users evenly between the two versions.
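If you are building this yourself, one common trick is to hash a stable user identifier and use the result to pick a bucket, so the split stays roughly 50/50 and a returning visitor always sees the same variant. Here is a rough Python sketch of that idea; the user IDs are made up for the example.

```python
import hashlib

def assign_variant(user_id: str) -> str:
    """Deterministically bucket a user into 'control' or 'experiment'.

    Hashing a stable identifier gives an approximately even 50/50 split,
    and a returning visitor always lands in the same group.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return "control" if int(digest, 16) % 2 == 0 else "experiment"

# Example with made-up user IDs:
for uid in ("user-1001", "user-1002", "user-1003", "user-1004"):
    print(uid, "->", assign_variant(uid))
```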

Finally, you need to decide ahead of time the duration of your test. Are you going to run this for a week? Are you going to run this until you have a sample size of 1000 users? Deciding this beforehand can prevent you from stopping the test at a time that biases the results.


Random Sampling and Biases

The backbone of any statistical experiment is random sampling. The idea behind random sampling is that it would be incredibly difficult to find out how every single possible customer might react to the change you are considering making. The next best thing, therefore, is to randomly pick a smaller group of possible customers, and ask them.

If our sample is truly randomly selected from the general population of possible customers, then the results we see in our A/B test should be an accurate estimate of how our customers might react to the change.

Let’s say we are testing how much a person in New York City is willing to spend on a hotdog. A random sample might go all across the city and ask random people. An experiment with sampling bias would instead go only to Wall Street and ask people there.

If you just go to Wall Street, however, you are going to find that people are willing to spend more money, and you will thus overestimate the average that New Yorkers are willing to spend. Our sample was not an accurate representation of the general population of New Yorkers, so our test is not very useful.
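A tiny simulation makes this concrete. The dollar amounts below are invented purely for illustration, but they show how a sample drawn only from a high-spending area overshoots the true city-wide average, while a random sample lands close to it.

```python
import random

random.seed(42)

def mean(values):
    return sum(values) / len(values)

# Invented willingness-to-pay values (in dollars), purely for illustration.
city_wide = [random.uniform(2, 6) for _ in range(95_000)]      # most of the city
wall_street = [random.uniform(6, 12) for _ in range(5_000)]    # one high-spending area
all_new_yorkers = city_wide + wall_street

random_sample = random.sample(all_new_yorkers, 500)   # unbiased: drawn city-wide
biased_sample = random.sample(wall_street, 500)       # biased: only Wall Street

print(f"True city-wide average:   ${mean(all_new_yorkers):.2f}")
print(f"Random sample estimate:   ${mean(random_sample):.2f}")   # close to the truth
print(f"Wall-Street-only sample:  ${mean(biased_sample):.2f}")   # systematically too high
```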

That being said, completely removing sampling bias is an unrealistic goal. What is actually feasible and what is incredibly important is to be able to describe what your sampling bias is.

Where does your sample come from? Are they from the portion of doctors that use Twitter, and hence were able to click on your Twitter ad? Are they from the portion of chefs that live in San Francisco and saw your billboard?


Statistical Significance

So let’s say that we finished conducting our test, and we got results like this:

            Visitors   Signups
Control          250        36
Variant          250        40

In the above example, our A/B test was seeing how a change in our website affected the rate at which people sign up.

As you can see above, with the same number of visitors, our variant produced 4 more signups. So we should make the change to our variant, right?

Not exactly. Once you have the results of your A/B test, you need to check for statistical significance.

A good way to grasp the idea of statistical insignificance is to take the example of flipping a coin. If we were to flip a fair coin 1,000 times, we would expect to get heads 500 times, and tails 500 times.

But let’s say we want to test whether a coin is fair by seeing if we get an equal number of heads and tails. We flip the coin 4 times and get 3 heads and 1 tail. If we drew a conclusion directly from this information, we might say that the coin has a 75% chance of landing on heads.

But we only tested the coin 4 times! The idea behind statistical significance is that when we have a small sample size, there is a certain amount of variation that is purely a coincidence.

In the coin example, the fact that we got 3 heads was likely just a coincidence. In our website A/B test example, the fact that there were 4 more signups might also have just been a coincidence.
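A quick simulation shows how much room pure chance has in small samples: a perfectly fair coin regularly produces lopsided results over 4 flips, but settles near 50% over thousands of flips. This is just an illustrative sketch.

```python
import random

random.seed(7)

def heads_fraction(n_flips: int) -> float:
    """Flip a fair (50/50) coin n_flips times and return the fraction of heads."""
    heads = sum(1 for _ in range(n_flips) if random.random() < 0.5)
    return heads / n_flips

for n in (4, 100, 10_000):
    print(f"{n:>6} flips -> {heads_fraction(n):.1%} heads")
# With only 4 flips, results like 75% heads show up all the time by pure chance;
# with 10,000 flips, the fraction stays very close to 50%.
```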

So how do we figure out if the results of our test are significant, or just purely a coincidence? That’s where hypothesis testing comes into play. Specifically, with A/B tests for conversion rates, we tend to use a Chi-Squared Test.


Chi-Squared Test

The basic idea behind the Chi-Squared test is to see whether the difference between two options is actually significant or purely a result of chance.

The idea is to identify what we call a null hypothesis. In our example:

Null hypothesis: The variation we tested on our website would not change the number of people who sign up.

In other words, the null hypothesis states that our variation and control would result in the same outcome.

In the above example, we observed an overall conversion rate of about 15% (76 signups across 500 visitors). So if the variations were truly equivalent, we would expect each version of the website to produce roughly 38 signups (0.152 × 250) from its 250 visitors.

A Chi-Squared test then compares the evidence we collected against the values we would expect if the null hypothesis were true, and asks whether we should reject the null hypothesis. In other words, is the observed difference so large that it would be very unlikely if the null hypothesis were true?

The Chi-Squared test returns a p-value, which is the probability of observing a difference this large (or larger), given that the null hypothesis is true. If that probability is very low, conventionally below 0.05, we reject the null hypothesis and conclude that our variation does indeed affect the success metric.


How you actually calculate the p-value will depend on your workflow, but there are step-by-step tutorials for doing this in Excel and Google Sheets.
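If Python is part of your workflow, SciPy can run the whole test in a few lines. This is just a sketch using the counts from the example table above (36 vs. 40 signups out of 250 visitors each); `correction=False` skips Yates' continuity correction so the output matches a plain chi-squared calculation.

```python
from scipy.stats import chi2_contingency

#             signed up, did not sign up
observed = [[36, 250 - 36],   # control
            [40, 250 - 40]]   # variant

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(f"chi-squared = {chi2:.4f}, df = {dof}, p-value = {p_value:.3f}")
# The p-value comes out far above 0.05, so we cannot reject the null hypothesis:
# the extra 4 signups are entirely consistent with random chance.
```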


How to Learn More

This was an extremely high-level overview of statistical hypothesis testing. Unless you are looking to delve deeper into statistics and data science, I would recommend worrying less about the math that makes hypothesis testing work and instead devoting most of your time to understanding biases.

Data is only as good as the way it is collected. There are very few environments where data collection can be close to perfect, and for startup founders, I can guarantee that you are not in one of them.

Understanding how conclusions drawn from data and experiments can be wrong is just as important as being able to draw them in the first place.

Got any questions? Drop them down below and we’d be happy to answer them!

As always, happy coding from the Codesphere team, the Swiss Army knife that every development team needs.

Top comments (2)

John P. Rouillard

This is simple enough that it can be solved by hand, which also helps explain the needed math. Can somebody check my solution? It's been a good decade since I had to do this.

First step set up the observed results matrix:

observed     signup   no signup   total
control          36         214     250
variant B        40         210     250
total            76         424     500

Then we calculate the expected results table:

expected     signup   no signup
control          38         212
variant B        38         212

where each expected value for a column/row is:

total for the column * (total for the row / overall total)

For the control/signup cell, we calculate 76 * (250/500). Since the same number of trials were done for control and variant B, the expected values are the same. Remember we are trying to discount the null hypothesis that variant B and the control experiments have the same result.

Next we calculate χ² by summing:

(observed - expected)² / expected

for each of the 4 outcomes.

For the signup/control cell we get: (36 - 38)² / 38 = 4/38 ≈ 0.1053.

This results in the values:

value        signup   no signup
control      0.1053      0.0189
variant B    0.1053      0.0189

Note that the values are the same in each column. The reason is that the difference between the rows in the original table is 4 and the expected value sits halfway between them, so every cell squares a difference of ±2 and gets the same numerator of 4. Only the denominators (38 vs. 212) differ between the columns.

Summing all 4 values gives us: χ² ≈ 0.2483

The last thing to do is get a p value. To do this we need to determine the degrees of freedom (df) for our experiment, calculated as:

(number of trials - 1) * (number of outcomes - 1)

We have 2 trials (control, variant B) and 2 outcomes (signup or no signup). So we have df = (2 - 1) * (2 - 1) = 1.

Now look at a chi-squared table (p values across the top, df down the side) for df=1 with a value of 0.2483.

p/df    0.995       0.99       0.975      0.95      0.9      0.5     0.2     0.1
1       0.0000397   0.000157   0.000982   0.00393   0.0158   0.455   1.642   2.706

Since 0.2483 falls between 0.0158 (p=0.9) and 0.455 (p=0.5), our p value is greater than 0.5. Since we need a p value less than 0.05 to reject the null hypothesis, variant B is no better than the control and we should not switch to it.

Q.E.D. (perhaps) 8-)

Simon Pfeiffer

Ya, that's exactly right! You summed the squared errors over the expected values and then found the p-value corresponding to that Chi-Squared test statistic.
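For anyone who wants to double-check the arithmetic with code, here is a rough Python/SciPy sketch that mirrors the same steps: expected counts from the row and column totals, the per-cell terms, and then the p-value from the chi-squared distribution with df = 1.

```python
from scipy.stats import chi2

# (signups, no signups) per group, as in the tables above
observed = {"control": (36, 214), "variant B": (40, 210)}

col_totals = [sum(row[i] for row in observed.values()) for i in range(2)]  # [76, 424]
grand_total = sum(col_totals)                                              # 500

statistic = 0.0
for signups, no_signups in observed.values():
    row_total = signups + no_signups
    for col, obs in enumerate((signups, no_signups)):
        expected = col_totals[col] * row_total / grand_total   # 38 or 212
        statistic += (obs - expected) ** 2 / expected          # 0.1053 or 0.0189 per cell

p_value = chi2.sf(statistic, df=1)   # survival function = 1 - CDF
print(f"chi-squared = {statistic:.4f}, p = {p_value:.3f}")     # roughly 0.248 and 0.62
```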