A/B testing, also often called split testing, allows businesses to experiment with different versions of a webpage or marketing asset to find out which one performs better in terms of user engagement, click-through rates, and, most importantly, conversion rates.
Conversion rates — the share of visitors who complete a desired action, such as making a purchase or signing up for a newsletter — are often the key metrics that determine the success of online campaigns. By carefully testing variations of a webpage, businesses can make data-driven decisions that significantly improve these rates. Whether it’s tweaking the color of a call-to-action button, changing the headline, or rearranging the layout, A/B testing provides actionable insights that can transform the effectiveness of your online presence.
In this post, I will show how to do Bayesian A/B testing for comparing conversion rates. We will also look at a more complicated example where we examine differences in the change of customer behavior after an intervention. Finally, we will compare this approach to a frequentist one and discuss its possible benefits and disadvantages.
Let’s say we want to improve our e-commerce website. We do so by exposing two groups of customers to two versions of our website where we, for example, change a button. We stop this experiment after having exposed a certain number of visitors to both versions. Afterwards, we get one binary array per group, with a 1 indicating a conversion and a 0 indicating no conversion.
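The actual data is not reproduced here; as a minimal stand-in, we can construct toy arrays that match the numbers reported below (100 visitors per variant, 5 conversions in A and 3 in B):

import numpy as np

# hypothetical stand-in for the observed data: 100 visitors per variant,
# 5 conversions in variant A and 3 in variant B
rng = np.random.default_rng(42)
obsA = np.zeros(100, dtype=int)
obsA[:5] = 1
obsB = np.zeros(100, dtype=int)
obsB[:3] = 1
rng.shuffle(obsA)  # the order of visitors does not matter, shuffle just for realism
rng.shuffle(obsB)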
We can summarize the data in a contingency table that shows us the (relative) frequencies.
contingency = np.array([[obsA.sum(), (1-obsA).sum()], [obsB.sum(), (1-obsB).sum()]])
In our case, we showed each variation to 100 customers. In the first variation, 5 (or 5%) converted, and in the second variation 3 converted.
Frequentist Setting
We will run a statistical test to measure whether this result is significant or due to chance. In this case, we use a Chi2 test, which compares the observed frequencies to the frequencies that would be expected if there were no true difference between the two versions (the null hypothesis). For more information, you can take a look at this blog post that goes into more detail.
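The test call itself is not shown in the original; a minimal sketch using SciPy, applied to the contingency table from above, could look like this:

from scipy.stats import chi2_contingency

# Chi2 test of independence on the 2x2 contingency table
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.3f}")
print("expected frequencies under the null hypothesis:\n", expected)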
In this case, the p-value does not fall below the significance threshold (e.g. 5%), and therefore we cannot reject the null hypothesis that the two variants do not differ in their effect on the conversion rate.
Now, there are some pitfalls when using the Chi2 test that can make the insights gained from it erroneous. Firstly, it is very sensitive to the sample size. With a large sample size, even tiny differences become significant, whereas with a small sample size the test may fail to detect differences. This is especially the case if the expected frequency for any of the cells is smaller than five; in that situation one has to use another test. Furthermore, the test does not provide information on the magnitude or practical relevance of the difference. When conducting multiple A/B tests simultaneously, the probability of finding at least one significant result purely by chance increases. The Chi2 test does not account for this multiple comparisons problem, which can lead to false positives if not properly controlled (e.g., through Bonferroni correction).
Another common pitfall occurs when interpreting the results of the Chi2 test (or any statistical test, for that matter). The p-value gives us the probability of observing data at least as extreme as ours, given that the null hypothesis is true. It does not make a statement about the distribution of conversion rates or their difference. And this is a major limitation. We cannot make statements such as “the probability that the conversion rate of variant B is 2% is X%”, because for that we would need the probability distribution of the conversion rate (conditioned on the observed data).
These pitfalls highlight the importance of understanding the limitations of the Chi2 test and using it appropriately within its constraints. When applying this test, it is crucial to complement it with other statistical methods and contextual analysis to ensure accurate and meaningful conclusions.
Bayesian Setting
After looking at the frequentist way of dealing with A/B testing, let’s look at the Bayesian version. Here, we model the data-generating process (and therefore the conversion rate) directly. That is, we specify a likelihood and a prior that could lead to the observed outcome. Think of this as specifying a ‘story’ for how the data could have been created.
In this case, I am using the Python package PyMC for modeling since it has a clear and concise syntax. Inside the ‘with’ statement, we specify distributions that we can combine and that give rise to a data-generating process.
import pymc as pm

with pm.Model() as ConversionModel:
    # priors: uniform over [0, 1], i.e. no preference for any conversion rate
    pA = pm.Uniform('pA', 0, 1)
    pB = pm.Uniform('pB', 0, 1)
    # track the difference of the two rates as a derived quantity
    delta = pm.Deterministic('delta', pA - pB)
    # likelihood: each visitor converts (1) or not (0) with probability pA / pB
    obsA = pm.Bernoulli('obsA', pA, observed=obsA)
    obsB = pm.Bernoulli('obsB', pB, observed=obsB)
    # draw samples from the posterior
    trace = pm.sample(2000)
We have pA and pB, which are the probabilities of conversion in groups A and B, respectively. With pm.Uniform we specify our prior belief about these parameters. This is where we could encode prior knowledge. In our case, we stay neutral and allow any conversion rate between 0 and 1 to be equally likely.
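For instance, if historical campaigns suggested conversion rates of only a few percent, we could encode that belief with an informative prior instead. This is purely an illustrative alternative, not part of the model above:

# inside the model block, replacing the pm.Uniform prior:
# a Beta(2, 50) prior puts most of its mass on conversion rates below ~10%
pA = pm.Beta('pA', alpha=2, beta=50)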
PyMC then allows us to draw samples from the posterior distribution, which is our updated belief about the parameters after seeing the data. We now obtain a full probability distribution for the conversion probabilities.
From these distributions, we can directly read off quantities of interest such as credible intervals. This allows us to answer questions such as “What is the probability of a conversion rate between X% and Y%?”.
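Assuming trace is the InferenceData object returned by pm.sample above, such questions can be answered directly from the posterior samples, for example with ArviZ (a sketch, not the post's original code):

import arviz as az

# credible intervals (highest-density intervals) for the rates and their difference
print(az.summary(trace, var_names=['pA', 'pB', 'delta'], hdi_prob=0.94))

# posterior probability that variant A converts better than variant B
delta_samples = trace.posterior['delta'].values.flatten()
print('P(pA > pB) =', (delta_samples > 0).mean())

# probability that the conversion rate of variant B lies between 2% and 4%
pB_samples = trace.posterior['pB'].values.flatten()
print('P(2% < pB < 4%) =', ((pB_samples > 0.02) & (pB_samples < 0.04)).mean())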
The Bayesian approach allows for much more flexibility, as we will see later. Interpreting the results is also more straightforward and intuitive than in the frequentist setting.
We will now look at a more complicated example of A/B testing. Let’s say we expose subjects to some intervention at the beginning of the observation period. This would be the A/B part, where one group gets intervention A and the other intervention B. We then look at how the two groups interact with our platform over the following 100 days (perhaps something like the number of logins). What we might see is the following.
We now want to know whether these two groups show a meaningful difference in their response to the intervention. How would we solve this with a statistical test? Frankly, I don’t know. Someone would have to come up with a statistical test for exactly this scenario. The alternative is to come back to the Bayesian setting, where we first come up with a data-generating process. We assume that each individual is independent and that its interactions with the platform are normally distributed. Each individual has a switch point where it changes its behavior. This switch point occurs only once but can happen at any point in time. Before the switch point, we assume a mean interaction intensity of mu1, and after it an intensity of mu2. The syntax might look a bit complicated, especially if you have never used PyMC before. In that case, I would recommend checking out their learning material.
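The arrays ind_id, X, and obs are not shown here; a purely illustrative simulation, with shapes chosen to broadcast against the model below (X holding the day index per individual, obs holding one column of observations per individual), could look like this:

import numpy as np

# hypothetical simulation: 10 individuals per group, observed over 100 days,
# with group B (ids 10-19) reacting more strongly to the intervention
rng = np.random.default_rng(0)
n_ind, n_days = 20, 100
ind_id = np.arange(n_ind)

true_switch = rng.integers(20, 80, size=n_ind)             # day at which behavior changes
mu_before = rng.normal(5, 1, size=n_ind)                    # intensity before the switch
mu_after = mu_before + np.where(ind_id < 10, 1.0, 4.0)      # group B reacts more strongly

X = np.tile(np.arange(n_days), (n_ind, 1))                  # day index, shape (n_ind, n_days)
means = np.where(X < true_switch[:, None], mu_before[:, None], mu_after[:, None])
obs = rng.normal(means, 1.0).T                              # shape (n_days, n_ind)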
with pm.Model(coords={
    'ind_id': ind_id,
}) as SwitchPointModel:
    sigma = pm.HalfCauchy("sigma", beta=2, dims="ind_id")

    # draw a switch point from a uniform distribution for each individual
    switchpoint = pm.DiscreteUniform("switchpoint", lower=0, upper=100, dims="ind_id")

    # priors for the per-individual mean interaction intensities before and after the switch point
    mu1 = pm.HalfNormal("mu1", sigma=10, dims="ind_id")
    mu2 = pm.HalfNormal("mu2", sigma=10, dims="ind_id")
    diff = pm.Deterministic("diff", mu1 - mu2)

    # expected intensity per day and individual, switching from one mean to the other at the switch point
    intercept = pm.math.switch(switchpoint < X.T, mu1, mu2)

    obsA = pm.Normal("y", mu=intercept, sigma=sigma, observed=obs)

    trace = pm.sample()
The model can then show us the distribution of the switch point location as well as the distribution of the differences before and after the switch point.
We can take a closer look at those differences with a forest plot.
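Assuming trace is the result of pm.sample above, such plots can be produced with ArviZ along these lines (a sketch, not the post's exact plotting code):

import arviz as az
import matplotlib.pyplot as plt

# posterior of the per-individual switch point locations
az.plot_posterior(trace, var_names=['switchpoint'])

# forest plot of the per-individual differences diff = mu1 - mu2
az.plot_forest(trace, var_names=['diff'], combined=True)
plt.show()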
We can nicely see how the differences of Group A (ids 0 through 9) and Group B (ids 10 through 19) diverge, with group B showing a much greater response to the intervention.
Bayesian inference offers a lot of flexibility when it comes to modeling situations in which we do not have much data and where we care about quantifying uncertainty. Moreover, it forces us to make our assumptions explicit and to think them through. In simpler scenarios, frequentist statistical tests are often easier to use, but one has to be aware of the assumptions that come with them.
All code used in this article can be found on my GitHub. Unless otherwise stated, all images are created by the author.