Hypothesis Testing Applied to A/B Testing
There are a lot of decisions to make during an A/B test, and most of them are made during the conception stage. In this article, I am going to focus on the steps following the data collection stage. You will not hear about sample size or power in this article (or maybe just a little).
This means our starting point is a dataset. The story behind this dataset will vary a lot depending on the product/experiment but, generally speaking, it is quite simple. It contains observations within two randomly assigned populations: A and B. Because this data was collected with intent, those observations have at least one metric associated with them.
We might want to know how many people clicked on a link, how long they stayed on the page after that, or how many of them subscribed to the service our company offers.
The fundamental concept in A/B testing is that these two populations differ only because they have been through two different processes or experiences. The purpose of the A/B test is to understand whether that process influenced these populations, and to what extent.
But how can we be sure that process A is more efficient than process B? From a statistical point of view, we can try to answer this question by testing a hypothesis.
1. Hypotheses
The null hypothesis might have been defined in more precise terms during the conception of the experiment. But most of the time, we want to know if population A is different enough from population B. Therefore, our hypotheses usually are:
- The null hypothesis, the status quo: there is no difference between population A and population B.
- The alternative hypothesis, on the other hand, claims there is a difference between these two populations.
Note: However the hypotheses are formulated, they have to be mutually exclusive.
As usual with hypothesis testing, the hypotheses are accompanied by two types of errors: Type I (rejecting a true null hypothesis) and Type II (failing to reject a false one). Unfortunately, once the data is collected, there is very little we can do to alter their rates. It is therefore worth reading any documentation that comes with the dataset. There might be indications about the significance level α, or the confidence level (1-α), or other valuable pieces of information.
In case no information is provided, α=0.05 can be assumed (in a commercial/customer relationship context, at least).
2. Statistical Testing
To choose the right test, we will need to know a few things:
- The kind of data we are dealing with. Do we have categorical, ordinal, or continuous data?
- What we are trying to measure. In an A/B test setup, we usually try to quantify the difference between 2 populations. But we could be interested in the correlation between those.
- Is our data ranked? Should the test be parametric, or not?
- Are samples independent or paired?
Once we can answer these questions, we can pick the correct test for our experiment.
2.1 Quick overview of different tests
There are variations of these tests which might be more or less suited to the dataset we want to analyse, but statistical tests can roughly be categorised as follows:
Z-test
For this test, we need to know the parameters of our population (its mean and, in particular, its standard deviation), which is rarely the case. It assumes the samples are independent and normally distributed. This test answers the following question: how far from the population mean is the observed sample mean? The answer is given as a number of standard deviations (σ) away from the population mean.
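As an illustration, here is a minimal sketch of a two-sided one-sample z-test using only the Python standard library. The function name, the sample values, and the population parameters (mean 30, standard deviation 2) are hypothetical and only serve the example.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, pop_mean, pop_sd):
    """Two-sided one-sample z-test with a known population standard deviation."""
    n = len(sample)
    # Standard error of the sample mean when sigma is known.
    standard_error = pop_sd / sqrt(n)
    # Distance between the sample mean and the population mean, in standard errors.
    z = (mean(sample) - pop_mean) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: time on page (in seconds) for a few visitors.
sample = [31.0, 29.5, 33.2, 30.8, 34.1, 32.4, 28.9, 33.7]
z, p = one_sample_z_test(sample, pop_mean=30.0, pop_sd=2.0)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

The z value is the "number of σ" mentioned above, expressed in units of the standard error of the mean.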
Pearson’s Correlation
It assesses the strength and direction of the linear relationship between two continuous variables.
Spearman’s Rank Correlation
It assesses the strength of the monotonic relationship between two ranked variables.
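To make the difference between the two coefficients concrete, here is a small sketch in plain Python. It treats Spearman's coefficient as Pearson's coefficient computed on ranks, with tied values sharing the average of their ranks. The helper names (`pearson`, `midranks`, `spearman`) and the data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def midranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(midranks(x), midranks(y))


# Hypothetical data: y grows with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(f"Pearson: {pearson(x, y):.3f}, Spearman: {spearman(x, y):.3f}")
```

On this kind of data the relationship is perfectly monotonic but not linear, so Spearman's coefficient reaches 1 while Pearson's stays below it.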
Wilcoxon
The Wilcoxon signed-rank test (also called the matched-pairs test) is a non-parametric statistical test that compares two paired groups.
T-test
Like a z-test, it compares means. Unlike the z-test, it does not require knowing the parameters of the population: the standard deviation is estimated from the samples.
One Sample T-test
It compares the mean of a sample to a null hypothesis value.
Paired T-test (or Dependent T-test)
It compares the difference between paired observations. It is a t-test run on the differences between measurements of the same population taken under two different conditions.
Independent T-test (compares means of two groups)
It compares the means of two independent groups and tests whether the difference between them is statistically significant.
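Here is a sketch of an independent t-test using only the standard library. It is one possible variant, not the only one: I use Welch's version, which does not assume equal variances, and I obtain the p-value by integrating Student's t density numerically with Simpson's rule. The function names and the data are hypothetical. In practice a statistics library would do this for us (SciPy's `scipy.stats.ttest_ind` with `equal_var=False`, for example).

```python
from math import exp, lgamma, log, pi, sqrt
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value: integrate the density from 0 to |t| (Simpson's rule)."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1 - 2 * central_half)


def welch_t_test(a, b):
    """Welch's t-test for two independent samples (unequal variances allowed)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df, t_two_sided_p(t, df)


# Hypothetical data: time on page (in seconds) for groups A and B.
group_a = [30.1, 28.4, 33.0, 31.2, 29.8, 32.5, 30.9, 27.6]
group_b = [33.4, 35.1, 31.9, 36.2, 34.0, 32.8, 35.7, 33.3]
t, df, p = welch_t_test(group_a, group_b)
print(f"t = {t:.2f}, df = {df:.1f}, p-value = {p:.4f}")
```

If the p-value is lower than α, we reject the null hypothesis that both groups have the same mean.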
Mann-Whitney U
It compares two independent samples without assuming they come from normal distributions. It also works with small samples (fewer than 30 observations).
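For small samples, the test can be sketched as an exact computation: rank the pooled observations, compute the U statistic for the first sample, and count how many of the possible splits of the ranks give a U at least as far from its expected value. The function names and the data below are hypothetical, and this brute-force enumeration is only practical for small samples. A library routine such as SciPy's `scipy.stats.mannwhitneyu` is the usual choice.

```python
from itertools import combinations


def midranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]


def mann_whitney_exact(a, b):
    """Exact two-sided Mann-Whitney U test by enumerating every split of the ranks."""
    n1, n2 = len(a), len(b)
    ranks = midranks(list(a) + list(b))
    offset = n1 * (n1 + 1) / 2  # smallest possible rank sum for sample a
    center = n1 * n2 / 2        # expected value of U under the null hypothesis
    u_observed = sum(ranks[:n1]) - offset
    observed_dev = abs(u_observed - center)
    extreme = total = 0
    for idx in combinations(range(n1 + n2), n1):
        u = sum(ranks[i] for i in idx) - offset
        total += 1
        # Count splits at least as extreme as the observed one (small float tolerance).
        if abs(u - center) >= observed_dev - 1e-9:
            extreme += 1
    return u_observed, extreme / total


# Hypothetical data: number of pages viewed per session in groups A and B.
group_a = [3, 5, 2, 8, 4, 6]
group_b = [7, 9, 12, 10, 11, 15]
u, p = mann_whitney_exact(group_a, group_b)
print(f"U = {u}, p-value = {p:.4f}")
```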
ANalysis Of VAriance
This test is usually not used in an A/B context as it requires three or more groups, but it can come in handy. Technically an ANOVA uses the F-test. The F statistic is the ratio: (between-group variance)/(within-group variance). This test compares the means of three or more populations:
- One-way, the test is based on one independent variable (factor).
- Two-way, the test is based on two independent variables (factors).
Note: An ANOVA only tells us that at least one group mean differs from the others. To know which groups differ, we will have to run a posthoc test afterwards.
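The F ratio described above can be sketched for the one-way case as follows. This only computes the statistic, not its p-value, which requires the F distribution (SciPy's `scipy.stats.f_oneway` returns both). The function name and the three groups are hypothetical.

```python
from statistics import mean


def one_way_f_statistic(groups):
    """F statistic of a one-way ANOVA: between-group over within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical data: one metric measured on three variants of a page.
variants = [
    [12.1, 11.4, 13.0, 12.6],
    [13.2, 14.1, 12.9, 13.8],
    [15.0, 14.3, 15.6, 14.9],
]
print(f"F = {one_way_f_statistic(variants):.2f}")
```

A large F value means the group means are spread out much more than the noise within each group would explain.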
Chi-square test
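This test checks whether two categorical variables are independent, for example whether the outcome (subscribed or not) depends on the group (A or B).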
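For a typical A/B conversion table (two groups, two outcomes), the test can be sketched with the standard library alone. With one degree of freedom, the p-value of a chi-square statistic x equals erfc(sqrt(x/2)), which is what the sketch relies on. The function name and the counts are hypothetical, and no continuity correction is applied here, whereas SciPy's `scipy.stats.chi2_contingency` applies Yates' correction to 2x2 tables by default, so its result would differ slightly.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if group and outcome were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: [subscribed, did not subscribe] for groups A and B.
observed = [
    [100, 900],  # group A
    [150, 850],  # group B
]
chi2, p = chi_square_2x2(observed)
print(f"chi2 = {chi2:.2f}, p-value = {p:.5f}")
```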
3. Interpreting the Test Results
The tests are going to return a p-value that we need to compare to the significance level, α.
The p-value is the probability of finding an effect at least as extreme as the one observed when the null hypothesis is true. If the p-value is lower than α, we can reject the null hypothesis in favour of the alternative hypothesis. Otherwise, we fail to reject the null hypothesis, which is not the same as proving it true.
4. Posthoc Tests
We can get a finer-grained answer by running extra tests. However, we need to keep in mind that now that we have seen the data, our decisions are more likely to be biased.
Tukey
This test identifies any difference between two means that is greater than the expected standard error. It assumes the sample means are normally distributed.
Bonferroni
This correction tackles the multiple comparisons issue: the more tests we run, the greater the chance of observing a rare event and mistaking it for a real effect. The Bonferroni correction controls the expected Type I error per family by comparing each p-value to α divided by the number of tests. It is conservative if there are a lot of tests and/or if the tests are positively correlated. Keep in mind your Type II error will increase as the statistical power decreases.
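A minimal sketch of the correction, assuming we already have one p-value per test: each p-value is compared to α divided by the number of tests or, equivalently, each p-value is multiplied by the number of tests (capped at 1) and compared to α. The function name and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: adjusted p-values and reject decisions."""
    m = len(p_values)
    # Each individual test is held to a stricter threshold.
    threshold = alpha / m
    # Equivalent view: inflate each p-value by the number of tests, capped at 1.
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p <= threshold for p in p_values]
    return adjusted, reject


# Hypothetical p-values from three pairwise comparisons.
p_values = [0.01, 0.02, 0.04]
adjusted, reject = bonferroni(p_values)
for p, adj, r in zip(p_values, adjusted, reject):
    print(f"p = {p:.3f}, adjusted p = {adj:.3f}, reject null: {r}")
```

With three tests and α=0.05, only p-values below roughly 0.0167 lead to a rejection, which shows how quickly the correction reduces power.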
Newman-Keuls
It identifies the sample means that are significantly different from each other. We can use this test when an ANOVA has revealed a significant difference between three or more sample means. As it uses different critical values for different pair comparisons, this test is more likely to reveal significant differences between group means (and is more exposed to Type I errors as a result). It is more powerful but less conservative than Tukey’s test.
Scheffé
This method adjusts the significance level in a linear regression analysis to account for multiple comparisons. It is more appropriate than Tukey when all contrasts are of interest, but Tukey is more precise (it gives narrower confidence intervals) when only a fixed number of pairwise comparisons are made.