
Bayesian AB Testing with Pyro


AB Testing Using Pyro

Consider an organization that has designed a new website landing page and wants to understand the impact this will have on conversion, i.e. do visitors continue their web session on the website after landing on the page? In test group A, website visitors will be shown the current landing page. In test group B, website visitors will be shown the new landing page. In the remainder of the article, I’ll refer to test group A as the control group, and group B as the treatment group. The business is sceptical about the change and has opted for an 80/20 split in session traffic. The total number of visitors and the total number of page conversions for each test group are summarised below.

Test Observations

Group           Visitors    Conversions
Control (A)        5,523          2,926
Treatment (B)      1,379            759

The null hypothesis of the AB test is that there will be no change in page conversion rate between the two test groups. Under the frequentist framework, this would be expressed as follows for a two-sided test, where r_c and r_t are the page conversion rates in the control and treatment groups, respectively.

Null and Alternative Hypotheses

H0: r_c = r_t
H1: r_c ≠ r_t

A significance test would then seek to either reject or fail to reject the null hypothesis. Under the Bayesian framework, we express the null hypothesis slightly differently, by asserting the same prior for each of the test groups.

Let’s pause and outline exactly what is happening during our test. The variable we are interested in is the page conversion rate. This is simply calculated by taking the number of distinct converted visitors over the total number of visitors. The event that generates this rate is whether the visitor clicks through the page. There are only two possible outcomes for each visitor: either the visitor clicks through the page and converts, or does not. Some of you might recognise that, for each distinct visitor, this is an example of a Bernoulli trial; there is one trial and two possible outcomes. When we collect a set of these Bernoulli trials, we have a binomial distribution. When the random variable X has a binomial distribution, we give it the following notation:

Binomial Distribution Notation

X ~ Bin(n, p)

Where n is the number of visitors (or the number of Bernoulli trials), and p is the probability of the event on each trial. p is what we are interested in here; we want to understand what the probability of a visitor converting on the page is in each test group. We have observed some data, but as mentioned in the previous section, we first need to define our prior. As always in Bayesian statistics, we need to define this prior as a probability distribution. This probability distribution is a characterisation of our uncertainty. Beta distributions are commonly used for modelling probabilities, since they are defined on the interval [0, 1]. Moreover, using a beta distribution as our prior for a binomial likelihood function gives us the helpful property of conjugacy, which means our posterior will be generated from the same family of distribution as our prior. We say that the beta distribution is a conjugate prior. A beta distribution is defined by two parameters: alpha and, confusingly, beta.

Beta Distribution Notation

p ~ Beta(alpha, beta)
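Because the beta prior is conjugate to the binomial likelihood, the posterior can also be written down in closed form as Beta(alpha + k, beta + n − k), where k is the number of conversions out of n visitors. As a quick aside (not part of the Pyro workflow that follows), this is easy to verify for the control group's figures:

import pyro.distributions as dist

# Closed-form Beta-Binomial update for the control group under a flat Beta(1, 1) prior.
alpha, beta = 1.0, 1.0
n, k = 5523.0, 2926.0            # control group visitors and conversions

posterior = dist.Beta(alpha + k, beta + (n - k))
print(posterior.mean)            # roughly 0.53, the observed conversion rate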

With access to historical data, we could assert an informed prior. We don’t necessarily need historical data, either; we could use our intuition to inform our understanding, but for now let’s assume we have neither (later in this tutorial we will use informed priors, but to demonstrate the impact, I’ll start with the uninformed). Let’s assume we have no understanding of the conversion rate on the company’s site, and therefore define our prior as Beta(1,1). This is called a flat prior. Its probability density looks like the graph below, equivalent to a uniform distribution defined on the interval [0, 1]. By asserting a Beta(1,1) prior, we are saying that all possible values of the page conversion rate are equally probable.

[Figure: probability density of the Beta(1, 1) flat prior. Credit: author]
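For reference, a density plot like the one above can be reproduced in a few lines; the sketch below simply evaluates the Beta(1, 1) density on a grid (imports are repeated so the snippet stands alone):

import torch
import pyro.distributions as dist
import matplotlib.pyplot as plt

# Evaluate the flat Beta(1, 1) density over the unit interval and plot it.
p = torch.linspace(0.01, 0.99, 200)
density = dist.Beta(1.0, 1.0).log_prob(p).exp()

plt.plot(p, density)
plt.xlabel('Page conversion rate')
plt.ylabel('Density')
plt.title('Beta(1, 1) prior')
plt.show()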

We now have all the information we need: the priors and the data. Let’s jump into the code. The code provided here offers a framework to get started with AB testing using Pyro; it therefore neglects some features of the package. To help optimise your code further and take full advantage of Pyro’s capabilities, I recommend referring to the official documentation.

First, we need to import our packages. The final line is good practice, particularly when working in notebooks, as it clears the store of parameters we have built up.

import pyro
import pyro.distributions as dist
from pyro.infer import NUTS, MCMC
import torch
from torch import tensor
import matplotlib.pyplot as plt
import seaborn as sns
from functools import partial
import pandas as pd

pyro.clear_param_store()

Models in Pyro are defined as regular Python functions. This is helpful, as it makes them intuitive to follow.

def model(beta_alpha, beta_beta):
    def _model_(traffic: tensor, number_of_conversions: tensor):
        # Define the stochastic primitives (the priors)
        prior_c = pyro.sample('prior_c', dist.Beta(beta_alpha, beta_beta))
        prior_t = pyro.sample('prior_t', dist.Beta(beta_alpha, beta_beta))
        priors = torch.stack([prior_c, prior_t])
        # Define the observed stochastic primitives (the likelihood)
        with pyro.plate('data'):
            observations = pyro.sample('obs', dist.Binomial(traffic, priors),
                                       obs=number_of_conversions)
    return partial(_model_)

A few things to break down and explain here. First, we have a function wrapped inside an outer function; the outer function returns a partial of the inner function. This allows us to change our priors without having to change the code. I have referred to the variables defined in the inner function as primitives; think of primitives as variables in the model. We have two types of primitives in the model: stochastic and observed stochastic. In Pyro, we do not need to explicitly define the difference; we simply add the obs argument to the sample method when it is an observed primitive, and Pyro interprets it accordingly. Observed primitives are contained within the context manager pyro.plate(), which is best practice and makes our code look cleaner. Our stochastic primitives are our two priors, characterised by beta distributions governed by the alpha and beta parameters that we pass in from the outer function. As previously mentioned, we assert the null hypothesis by defining these as equal. We then stack these two primitives together using torch.stack(), which concatenates them along a new dimension, much like stacking NumPy arrays. This returns a tensor, the data structure required for inference in Pyro. We have defined our model; now let’s move on to the inference stage.
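To make the closure pattern concrete: calling the outer function fixes the prior’s parameters and hands back a model that only expects the data. The variable names below are purely illustrative.

# Each call bakes a different prior into the returned model.
flat_prior_model = model(1, 1)      # Beta(1, 1) prior
informed_prior_model = model(2, 2)  # Beta(2, 2) prior

# Both can now be passed straight to the inference routine with only the data arguments.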

As previously mentioned, this tutorial will use MCMC. The function below takes the model we have defined above and the number of samples we want to use to generate our posterior distribution as parameters. We also pass our data into the function, as we did for the model.

def run_inference(model, number_of_samples, traffic, number_of_conversions):
    # NUTS kernel: tells Pyro how to sample from the posterior probability space
    kernel = NUTS(model)
    mcmc = MCMC(kernel, num_samples=number_of_samples, warmup_steps=200)
    mcmc.run(traffic, number_of_conversions)
    return mcmc

The first line inside this function defines our kernel. We use the NUTS class to define our kernel, which stands for No-U-Turn Sampler, an auto-tuning version of Hamiltonian Monte Carlo. This tells Pyro how to sample from the posterior probability space. Again, it is beyond the scope of this article to dive deeper into this topic, but for now it is sufficient to know that NUTS allows us to sample from the probability space intelligently. The kernel is then used to initialise the MCMC class on the second line, specifying it to use NUTS. We pass the number_of_samples argument to the MCMC class, which is the number of samples used to generate the posterior distribution. We assign the initialised MCMC class to the mcmc variable and call the run() method, passing our data as parameters. The function returns the mcmc variable.

That is all we need; the following code defines our data and calls the functions we have just written using the Beta(1,1) prior.

traffic = torch.tensor([5523., 1379.])
conversions = torch.tensor([2926., 759.])
inference = run_inference(model(1, 1), number_of_samples=1000,
                          traffic=traffic, number_of_conversions=conversions)

The first element of the traffic and conversions tensors is the count for the control group, and the second element in each tensor is the count for the treatment group. We pass the model function, with the parameters that govern our prior distribution, alongside the tensors we have defined. Running this code will generate our posterior samples. We run the following code to extract the posterior samples and pass them to a Pandas dataframe.

posterior_samples = inference.get_samples()
posterior_samples_df = pd.DataFrame(posterior_samples)

Notice the column names of this dataframe are the strings we passed when we defined our primitives in the model function. Each row in our dataframe contains samples drawn from the posterior distribution, and each of these samples represents an estimate of the page conversion rate, the probability value p that governs our binomial distribution. Now that we have returned the samples, we can plot our posterior distributions.
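As a quick sanity check before plotting, we can look at the posterior means directly, and Pyro’s MCMC object can print summary statistics and convergence diagnostics:

# Posterior mean conversion rate for each group; column names match the sample site names.
print(posterior_samples_df[['prior_c', 'prior_t']].mean())

# Built-in summary of the MCMC run (means, credible intervals, effective sample size, r_hat).
inference.summary()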

Results

An insightful way to visualise the results of an AB test with two test groups is with a joint kernel density plot. It allows us to visualise the density of samples in the probability space across both distributions. The graph below can be produced from the dataframe we have just built.

[Figure: joint kernel density plot of the control and treatment posterior samples. Credit: author]
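A sketch of how a joint density plot like this might be produced with seaborn from the dataframe (using the imports from earlier; the diagonal reference line is added manually):

# Joint kernel density of the control and treatment posteriors.
g = sns.jointplot(data=posterior_samples_df, x='prior_c', y='prior_t',
                  kind='kde', fill=True)
g.ax_joint.axline((0.5, 0.5), slope=1, color='black', linestyle='--')  # diagonal reference
g.set_axis_labels('Control conversion rate', 'Treatment conversion rate')
plt.show()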

The probability space contained in the graph above can be divided along its diagonal; anything above the line indicates regions where the estimate of the conversion rate is higher in the treatment group than in the control, and vice versa. As illustrated in the plot, the samples drawn from the posterior are densely populated in the region that indicates the conversion rate is higher in the treatment group. It is important to highlight that the posterior distribution for the treatment group is wider than that of the control group, reflecting a higher degree of uncertainty. This is a result of observing less data in the treatment group. Nevertheless, the plot strongly indicates that the treatment group has outperformed the control group. By collecting an array of samples from the posterior and taking the element-wise difference, we can say that the probability that the treatment group outperforms the control group is 90.4%. This figure means that 90.4% of the samples drawn from the posterior lie above the diagonal in the joint density plot above.
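That probability is simply the share of posterior samples in which the treatment rate exceeds the control rate, which can be computed directly from the dataframe:

# Probability that the treatment group outperforms the control group.
prob_treatment_better = (posterior_samples_df['prior_t'] > posterior_samples_df['prior_c']).mean()
print(prob_treatment_better)   # roughly 0.90 for this data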

These results were achieved using a flat (uninformed) prior. Using an informed prior can help improve the model, particularly when the availability of observed data is limited. A useful exercise is to explore the effect of using different priors. The plot below shows the Beta(2,2) probability density function and the joint plot it produces when we rerun the model. We can see that using the Beta(2,2) prior produces a very similar posterior distribution for both test groups.

[Figure: Beta(2, 2) prior density and the resulting joint posterior plot. Credit: author]
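Rerunning the analysis with a different prior only requires changing the arguments passed to the outer model function, for example:

# Same data, informed Beta(2, 2) prior.
inference_beta22 = run_inference(model(2, 2), number_of_samples=1000,
                                 traffic=traffic, number_of_conversions=conversions)
posterior_beta22_df = pd.DataFrame(inference_beta22.get_samples())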

The samples drawn from the posterior suggest there is a 91.5% probability that the treatment group performs better than the control. We therefore believe, with a higher degree of certainty than under the flat prior, that the treatment group is better than the control. However, in this example the difference is negligible.

There is one other thing I would like to highlight about these results. When we ran the inference, we told Pyro to generate 1000 samples from the posterior. This is an arbitrary number; choosing a different number of samples can change the results. To highlight the effect of increasing the number of samples, I ran an AB test where the observations from the control and treatment groups were identical, each with an overall conversion rate of 50%. Using a Beta(2,2) prior generates the following posterior distributions as we incrementally increase the number of samples.

[Figure: posterior distributions for increasing numbers of MCMC samples. Credit: author]
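An experiment along these lines can be sketched by looping over sample counts; the traffic and conversion figures below are hypothetical, chosen only so that both groups share a 50% conversion rate:

# Hypothetical, identical control and treatment groups with a 50% conversion rate.
equal_traffic = torch.tensor([1000., 1000.])
equal_conversions = torch.tensor([500., 500.])

posteriors = {}
for n_samples in [10, 100, 1000, 10000]:
    mcmc_run = run_inference(model(2, 2), number_of_samples=n_samples,
                             traffic=equal_traffic, number_of_conversions=equal_conversions)
    posteriors[n_samples] = pd.DataFrame(mcmc_run.get_samples())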

When we run our inference with just 10 samples, the posterior distributions for the control and treatment groups are relatively wide and adopt different shapes. As the number of samples we draw increases, the distributions converge, eventually generating nearly identical distributions. Moreover, we observe two properties of statistical distributions: the central limit theorem and the law of large numbers. The central limit theorem states that the distribution of sample means converges towards a normal distribution as the number of samples increases, and we can see that in the plot above. Additionally, the law of large numbers states that as the sample size grows, the sample mean converges towards the population mean. We can see that the mean of the distributions in the bottom-right tile is approximately 0.5, the conversion rate observed in each of the test samples.
