Which Outcome Matters?
Here is a typical scenario: an A/B test was conducted, where a random sample of units (e.g., customers) was chosen for a campaign and received Treatment A. Another sample was chosen to receive Treatment B. "A" could be a communication or offer and "B" could be no communication or no offer. "A" could be 10% off and "B" could be 20% off. Two groups, two different treatments, where A and B are two discrete treatments, but without loss of generality to more than two treatments and continuous treatments.
So, the campaign runs and results are made available. With our backend system, we can track which of those units took the action of interest (e.g., made a purchase) and which did not. Further, for those who did, we log the intensity of that action. A common scenario is that we can track purchase amounts for those who purchased. This is often called an average order amount or revenue per buyer metric. Or 100 different names that all mean the same thing — for those who purchased, how much did they spend, on average?
For some use cases, the marketer is interested in the former metric — the purchase rate. For example, did we drive more (potentially first-time) buyers in our acquisition campaign with Treatment A or B? Sometimes, we are interested in driving the revenue per buyer higher, so we put the emphasis on the latter.
More often though, we are interested in driving revenue in a cost-effective manner, and what we really care about is the revenue that the campaign produced. Did Treatment A or B drive more revenue? We don't always have balanced sample sizes (perhaps due to cost or risk aversion), and so we divide the measured revenue by the number of candidates that were treated in each group (call these counts N_A and N_B). We want to compare this measure between the two groups, so the standard contrast is simply:
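$$\frac{\sum_{i \in A} y_i}{N_A} \;-\; \frac{\sum_{i \in B} y_i}{N_B}$$

where $y_i$ is the revenue measured for unit $i$ (zero if unit $i$ did not respond).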
This is just the mean revenue for Treatment A minus the mean revenue for Treatment B, where the mean is taken over the entire set of targeted units, irrespective of whether they responded or not. Its interpretation is likewise straightforward — what is the increase in average revenue per promoted unit going from Treatment B to Treatment A?
Of course, this last measure accounts for both of the prior metrics: it is the response rate multiplied by the mean revenue per responder.
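In symbols, for either treatment group of size $N$:

$$\frac{\sum_i y_i}{N} = \underbrace{\frac{\#\{i : y_i > 0\}}{N}}_{\text{response rate}} \times \underbrace{\frac{\sum_i y_i}{\#\{i : y_i > 0\}}}_{\text{mean revenue per responder}}$$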
Uncertainty?
How much a buyer spends is highly variable, and a couple of large purchases in one treatment group or the other can skew the mean significantly. Likewise, sampling variation can be significant. So, we want to understand how confident we are in this comparison of means and to quantify the "significance" of the observed difference.
So, you throw the data into a t-test and stare at the p-value. But wait! Unfortunately for the marketer, the vast majority of the time, the purchase rate is relatively low (sometimes VERY low) and hence there are many zero revenue values — often the vast majority. The t-test assumptions may be badly violated. Very large sample sizes may come to the rescue, but there is a more principled way to analyze this data that is useful in multiple ways, as will be explained.
Example Dataset
Let's start with a sample dataset to make things practical. One of my favorite direct marketing datasets is from the KDD Cup 98.
url="https://kdd.ics.uci.edu/databases/kddcup98/epsilon_mirror/cup98lrn.zip"
filename="cup98LRN.txt"
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()
pdf_data = pd.read_csv(filename, sep=',')
pdf_data = pdf_data.query('TARGET_D >=0')
pdf_data['TREATMENT'] = np.where(pdf_data.RFA_2F >1,'A','B')
pdf_data['TREATED'] = np.where(pdf_data.RFA_2F >1,1,0)
pdf_data['GT_0'] = np.where(pdf_data.TARGET_D >0,1,0)
pdf_data = pdf_data[['TREATMENT', 'TREATED', 'GT_0', 'TARGET_D']]
In the code snippet above, we are downloading a zip file (the learning dataset specifically), extracting it and reading it into a Pandas data frame. This dataset is the campaign history of a non-profit organization that sought donations through direct mailings. There are no treatment variants within this dataset, so instead we pretend, segmenting the dataset based on the frequency of past donations. We call this indicator TREATMENT (the categorical label) and create TREATED as the binary indicator for Treatment 'A'. Consider this the result of a randomized control trial where a portion of the sample population was treated with an offer and the remainder were not. We track each individual and record the amount of their donation.
So, if we examine this dataset, we see that there are about 95,000 promoted individuals, roughly equally distributed across the two treatments:

Treatment A has a larger response rate, but overall the response rate in the dataset is only around 5%. So, we have 95% zeros.

For those who donated, Treatment A appears to be associated with a lower average donation amount.

Combining everyone who was targeted, Treatment A appears to be associated with a higher average donation amount — the higher response rate outweighs the lower donation amount for responders — but not by much.

Finally, the histogram of the donation amount is shown here, pooled over both treatments, which illustrates the mass at zero and the right skew.

A numerical summary of the two treatment groups quantifies the phenomenon observed above — while Treatment A appears to have driven significantly higher response, those who were treated with A donated less on average when they responded. The net of these two measures, the one we are ultimately after — the overall mean donation per targeted unit — appears to still be higher for Treatment A. How confident we are in that finding is the subject of this analysis.
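A summary along these lines can be computed directly from the data frame built above — a minimal pandas sketch (the column names are those defined earlier):

# per-treatment counts, response rate, and overall mean donation (zeros included)
summary = pdf_data.groupby('TREATMENT').agg(
    n_targeted=('TARGET_D', 'size'),
    response_rate=('GT_0', 'mean'),
    mean_donation_per_target=('TARGET_D', 'mean'),
)
# mean donation among responders only
summary['mean_donation_per_responder'] = (
    pdf_data.query('GT_0 == 1').groupby('TREATMENT')['TARGET_D'].mean()
)
print(summary)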

Gamma Hurdle
One way to model this data and answer our research question in terms of the difference between the two treatments in generating the average donation per targeted unit is with the Gamma Hurdle distribution. Similar to the more well-known Zero-Inflated Poisson (ZIP) or Negative Binomial (ZINB) distributions, this is a mixture distribution where one part relates to the mass at zero and the other, for the cases where the random variable is positive, is the gamma density function:
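$$f(y) = \begin{cases} 1 - \pi & \text{if } y = 0 \\[4pt] \pi \cdot \dfrac{\beta^{\alpha}}{\Gamma(\alpha)}\, y^{\alpha - 1} e^{-\beta y} & \text{if } y > 0 \end{cases}$$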

Here π represents the probability that the random variable y is greater than zero — in other words, the probability of the gamma process. Likewise, (1 − π) is the probability that the random variable is zero. In terms of our problem, this relates to the probability that a donation is made and, if so, its value.
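The mean of the gamma hurdle distribution follows directly, a fact we will use later to build the contrast of interest:

$$E[y] = (1 - \pi) \cdot 0 + \pi \cdot E[y \mid y > 0] = \pi\,\mu$$

where $\mu = \alpha / \beta$ is the mean of the gamma component.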
Let's start with the component parts of using this distribution in a regression: logistic and gamma regression.
Logistic Regression
The logit function is the link function here, relating the log odds to the linear combination of our predictor variables, which with a single variable such as our binary treatment indicator looks like:
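$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 \cdot \text{treated}$$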

Here π represents the probability that the outcome is a "positive" (denoted as 1) event such as a purchase, and (1 − π) represents the probability that the outcome is a "negative" (denoted as 0) event. Further, π, which is the quantity of interest above, is defined by the inverse logit function:
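$$\pi = \frac{e^{\beta_0 + \beta_1 \cdot \text{treated}}}{1 + e^{\beta_0 + \beta_1 \cdot \text{treated}}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{treated})}}$$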

Fitting this model is very easy: we need to find the values of the two betas that maximize the likelihood of the data (the outcome y) — which, assuming N iid observations, is:
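$$L(\beta_0, \beta_1 \mid y) = \prod_{i=1}^{N} \pi_i^{\,y_i}\,(1 - \pi_i)^{1 - y_i}$$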

We could use any of multiple libraries to quickly fit this model, but we will demonstrate PyMC as the means to build a simple Bayesian logistic regression.
Without going through the usual steps of the Bayesian workflow, we fit this simple model using MCMC.
import pymc as pm
import arviz as az
from scipy.special import expit

with pm.Model() as logistic_model:
    # noninformative priors
    intercept = pm.Normal('intercept', 0, sigma=10)
    beta_treat = pm.Normal('beta_treat', 0, sigma=10)

    # linear combination of the treated variable, passed through the
    # inverse logit to squish the linear predictor between 0 and 1
    p = pm.invlogit(intercept + beta_treat * pdf_data.TREATED)

    # individual-level binary outcome (respond or not)
    pm.Bernoulli(name="logit", p=p, observed=pdf_data.GT_0)

    idata = pm.sample(nuts_sampler="numpyro")

az.summary(idata, var_names=['intercept', 'beta_treat'])

If we construct a contrast of the two treatment mean response rates, we find that, as expected, the mean response rate lift for Treatment A is 0.026 larger than for Treatment B, with a 94% credible interval of (0.024, 0.029).
# create a new column in the posterior which contrasts Treatment A - B
idata.posterior['TREATMENT A - TREATMENT B'] = (
    expit(idata.posterior.intercept + idata.posterior.beta_treat)
    - expit(idata.posterior.intercept)
)

az.plot_posterior(
    idata,
    var_names=['TREATMENT A - TREATMENT B']
)

Gamma Regression
The next component is the gamma distribution, with one of the parameterizations of its probability density function shown here:
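$$f(y \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, y^{\alpha - 1} e^{-\beta y}, \qquad y > 0$$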

This distribution is defined for strictly positive random variables and is used in business for values such as costs, customer demand spending, and insurance claim amounts.
The mean and variance of the gamma distribution are defined in terms of α and β according to the formulas:
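$$E[y] = \frac{\alpha}{\beta}, \qquad \mathrm{Var}(y) = \frac{\alpha}{\beta^{2}}$$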

For gamma regression, we can parameterize by α and β or by μ and σ. If we define μ as a linear combination of predictor variables, then we can define the gamma distribution in terms of α and β using μ:
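Keeping a free shape parameter $k$ (the `shape` variable in the code below), one such choice is:

$$\alpha = k, \qquad \beta = \frac{k}{\mu} \quad \Rightarrow \quad E[y] = \frac{\alpha}{\beta} = \mu, \qquad \mathrm{Var}(y) = \frac{\mu^{2}}{k}$$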

The gamma regression model assumes the log link (in this case; the inverse link is another common option), which is intended to "linearize" the relationship between predictor and outcome:
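$$\log(\mu) = \beta_0 + \beta_1 \cdot \text{treated} \quad \Longleftrightarrow \quad \mu = e^{\beta_0 + \beta_1 \cdot \text{treated}}$$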

Following almost exactly the same methodology as for the response rate, we limit the dataset to responders only and fit the gamma regression using PyMC.
# limit the dataset to responders only
pdf_responders = pdf_data.query('GT_0 == 1')

with pm.Model() as gamma_model:
    # noninformative priors
    intercept = pm.Normal('intercept', 0, sigma=10)
    beta_treat = pm.Normal('beta_treat', 0, sigma=10)
    shape = pm.HalfNormal('shape', 5)

    # linear combination of the treated variable, passed through
    # exp to ensure the mean (mu) is positive
    mu = pm.Deterministic('mu', pm.math.exp(intercept + beta_treat * pdf_responders.TREATED))

    # individual-level donation amount for responders
    pm.Gamma(name="gamma", alpha=shape, beta=shape/mu, observed=pdf_responders.TARGET_D)

    idata = pm.sample(nuts_sampler="numpyro")

az.summary(idata, var_names=['intercept', 'beta_treat'])

# create a new column in the posterior which contrasts Treatment A - B
idata.posterior['TREATMENT A - TREATMENT B'] = (
    np.exp(idata.posterior.intercept + idata.posterior.beta_treat)
    - np.exp(idata.posterior.intercept)
)

az.plot_posterior(
    idata,
    var_names=['TREATMENT A - TREATMENT B']
)

Again, as expected, we see the mean lift for Treatment A to have an expected value equal to the sample value of −7.8. The 94% credible interval is (−8.3, −7.3).
The components — response rate and average amount per responder — shown above are about as simple as we can get. But it is a straightforward extension to add additional predictors in order to 1) estimate Conditional Average Treatment Effects (CATE) when we expect the treatment effect to differ by segment, or 2) reduce the variance of the average treatment effect estimate by conditioning on pre-treatment variables.
Hurdle Model (Gamma) Regression
At this point, it should be pretty straightforward to see where we are progressing. For the hurdle model, we have a conditional likelihood, depending on whether the specific observation is 0 or greater than zero, as shown above for the gamma hurdle distribution. We can fit the two component models (logistic and gamma regression) simultaneously. We get their product for free, which in our example is an estimate of the donation amount per targeted unit.
It would not be difficult to fit this model using a likelihood function with a switch statement depending on the value of the outcome variable, but PyMC has this distribution already encoded for us.
import pymc as pm
import arviz as az

with pm.Model() as hurdle_model:
    ## noninformative priors ##
    # logistic component
    intercept_lr = pm.Normal('intercept_lr', 0, sigma=5)
    beta_treat_lr = pm.Normal('beta_treat_lr', 0, sigma=1)
    # gamma component
    intercept_gr = pm.Normal('intercept_gr', 0, sigma=5)
    beta_treat_gr = pm.Normal('beta_treat_gr', 0, sigma=1)
    # gamma shape (alpha)
    shape = pm.HalfNormal('shape', 1)

    ## mean functions of predictors ##
    p = pm.Deterministic('p', pm.invlogit(intercept_lr + beta_treat_lr * pdf_data.TREATED))
    mu = pm.Deterministic('mu', pm.math.exp(intercept_gr + beta_treat_gr * pdf_data.TREATED))

    ## likelihood ##
    # psi plays the role of pi -- the probability of a nonzero outcome
    pm.HurdleGamma(name="hurdlegamma", psi=p, alpha=shape, beta=shape/mu, observed=pdf_data.TARGET_D)

    idata = pm.sample(cores=10)
If we examine the trace summary, we see that the results are exactly the same as for the two component models fit separately.

As noted, the mean of the gamma hurdle distribution is π × μ, so we can create a contrast:
# create a new column in the posterior which contrasts Treatment A - B
idata.posterior['TREATMENT A - TREATMENT B'] = (
    expit(idata.posterior.intercept_lr + idata.posterior.beta_treat_lr)
    * np.exp(idata.posterior.intercept_gr + idata.posterior.beta_treat_gr)
    - expit(idata.posterior.intercept_lr)
    * np.exp(idata.posterior.intercept_gr)
)

az.plot_posterior(
    idata,
    var_names=['TREATMENT A - TREATMENT B']
)
The mean expected value of this contrast is 0.043, with a 94% credible interval of (−0.0069, 0.092). We could interrogate the posterior to see what proportion of the time the donation per targeted unit is expected to be higher for Treatment A, and apply any other decision functions that make sense for our case — including adding a fuller P&L to the estimate (i.e., including margins and costs).
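For example, one such interrogation is essentially a one-liner — a minimal sketch, using the contrast column computed above:

# share of posterior draws in which Treatment A's expected donation
# per targeted unit exceeds Treatment B's
contrast = idata.posterior['TREATMENT A - TREATMENT B']
print(f"P(A > B) = {float((contrast > 0).mean()):.3f}")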

Notes: Some implementations parameterize the gamma hurdle model differently, where the probability of zero is π, and hence the mean of the gamma hurdle involves (1 − π) instead. Also note that at the time of this writing there appears to be an issue with the NUTS samplers in PyMC, and we had to fall back on the default Python implementation for running the above code.
Summary
With this approach, we get the same inference as for the two models fit individually, plus the added benefit of the third metric. Fitting these models with PyMC gives us all the benefits of Bayesian analysis — including injection of prior domain knowledge and a full posterior to answer questions and quantify uncertainty!
Credits:
- All images are the author's, unless otherwise noted.
- The dataset used is from the KDD 98 Cup sponsored by Epsilon. https://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html (CC BY 4.0)