Bayesian Linear Regression: A Complete Beginner’s Guide


A workflow and code walkthrough for constructing a Bayesian regression model in STAN

Note: See my previous article for a practical discussion of why Bayesian modeling may be the right choice for your task.

This tutorial will focus on a workflow + code walkthrough for building a Bayesian regression model in STAN, a probabilistic programming language. STAN is widely adopted and interfaces with your language of choice (R, Python, shell, MATLAB, Julia, Stata). See the installation guide and documentation.

I’ll use PyStan for this tutorial, simply because I code in Python. Even if you use another language, the general Bayesian practices and STAN syntax discussed here won’t vary much.
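If you want to follow along in Python, PyStan can typically be installed with pip (PyStan 3 supports Linux and macOS, and is imported as stan):

pip install pystan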

For the more hands-on reader, here’s a link to the notebook for this tutorial, part of my Bayesian modeling workshop at Northwestern University (April 2024).

Let’s dive in!

Let’s learn how to build a simple linear regression model, the bread and butter of any statistician, the Bayesian way. Assuming a dependent variable Y and covariate X, I propose the following simple model:

Y = α + β * X + ϵ

where α is the intercept, β is the slope, and ϵ is some random error. Assuming that

ϵ ~ Normal(0, σ)

we can show (since adding the constant α + β * X to a Normal(0, σ) error simply shifts its mean) that

Y ~ Normal(α + β * X, σ)

We will now learn how to code this model form in STAN.

Generate Data

First, let’s generate some fake data.

import numpy as np
import matplotlib.pyplot as plt

#Model Parameters
alpha = 4.0  #intercept
beta = 0.5   #slope
sigma = 1.0  #error scale

#Generate fake data
x = 8 * np.random.rand(100)
y = alpha + beta * x
y = np.random.normal(y, scale=sigma)  #add noise
#visualize generated data
plt.scatter(x, y, alpha=0.8)
Generated data for Linear Regression (Image from code by Creator)

Now that we have some data to model, let’s dive into how to structure it and pass it to STAN along with modeling instructions. This is done via the model string, which generally contains four (occasionally more) blocks: data, parameters, model, and generated quantities. Let’s discuss each of these blocks in detail.

Data Block

data {                    //input the data to STAN
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}

The data block is probably the simplest: it tells STAN internally what data it should expect, and in what format. For instance, here we pass:

N: the size of our dataset, as type int. The <lower=0> part declares that N ≥ 0. (Though it is obvious here that data length can’t be negative, stating these bounds is good standard practice that can make STAN’s job easier.)

x: the covariate as a vector of length N.

y: the dependent variable as a vector of length N.

See the docs here for the full range of supported data types. STAN offers support for a wide range of types like arrays, vectors, matrices, etc. As we saw above, STAN also supports encoding limits on variables. Encoding limits is beneficial! It leads to better-specified models and simplifies the probabilistic sampling processes operating under the hood.
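A natural place for such limits is the parameters block, where the model’s unknowns are declared using the same type syntax. It isn’t covered as its own section in this walkthrough, so here is a minimal sketch of what it could look like for our model, with the error scale constrained to be positive:

parameters {              //parameters to be fitted by STAN
  real alpha;             //intercept
  real beta;              //slope
  real<lower=0> sigma;    //error scale, must be positive
}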

Model Block

Next is the model block, where we tell STAN the structure of our model.

//simple model block
model {
  //priors
  alpha ~ normal(0, 10);
  beta ~ normal(0, 1);

  //model
  y ~ normal(alpha + beta * x, sigma);
}

The model block also contains an important, and often confusing, element: prior specification.

See my previous article for a primer on the role of, and intuition behind, priors. To summarize, the prior is a presupposed functional form for the distribution of parameter values, often referred to simply as the prior belief. Though priors don’t have to exactly match the final solution, they should allow us to sample from it.

In our example, we use Normal priors of mean 0 with different variances, depending on how sure we are of the supplied mean value: 10 for alpha (very unsure), 1 for beta (somewhat sure). Here, I supplied the general belief that while alpha can take a wide range of values, the slope is generally more constrained and won’t have a large magnitude.

Hence, in the example above, the prior for alpha is ‘weaker’ than the prior for beta.

As models get more complicated, the sampling solution space expands, and supplying beliefs gains importance. Otherwise, if there is no strong intuition, it is good practice to supply less belief to the model, i.e. use a weakly informative prior, and remain flexible to the incoming data.

The form for y, which you may have recognized already, is the standard linear regression equation.

Generated Quantities

Lastly, we have our block for generated quantities. Here we tell STAN what quantities we want to calculate and receive as output.

generated quantities {    //get quantities of interest from fitted model
  vector[N] yhat;
  vector[N] log_lik;
  for (n in 1:N) yhat[n] = normal_rng(alpha + x[n] * beta, sigma);            //samples for y from the fitted model
  for (n in 1:N) log_lik[n] = normal_lpdf(y[n] | alpha + x[n] * beta, sigma); //probability of data given the model and parameters
}

Note: STAN supports passing vectors either directly into equations, or as iterations over 1:N for each element n. In practice, I’ve found this support to vary across different versions of STAN, so it is good to try the iterative declaration if the vectorized version fails to compile.
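As an illustration of the iterative form, the vectorized likelihood from the model block above could equivalently be written element by element:

model {
  alpha ~ normal(0, 10);
  beta ~ normal(0, 1);
  for (n in 1:N)
    y[n] ~ normal(alpha + beta * x[n], sigma);
}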

In the above example:

yhat: generates samples for y from the fitted parameter values.

log_lik: generates the (log) probability of the data given the model and the fitted parameter values.

The purpose of these values will become clearer when we talk about model evaluation.

Altogether, we have now fully specified our first simple Bayesian regression model. The full model string simply stitches together the blocks above (including a parameters block declaring alpha, beta, and sigma):

model = """
data {                    //input the data to STAN
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
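parameters {              //parameters to be fitted, as sketched earlier
  real alpha;
  real beta;
  real<lower=0> sigma;    //error scale, must be positive
}
model {                   //priors and likelihood, exactly as in the model block above
  alpha ~ normal(0, 10);
  beta ~ normal(0, 1);
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {    //quantities of interest, as in the block above
  vector[N] yhat;
  vector[N] log_lik;
  for (n in 1:N) yhat[n] = normal_rng(alpha + x[n] * beta, sigma);
  for (n in 1:N) log_lik[n] = normal_lpdf(y[n] | alpha + x[n] * beta, sigma);
}
"""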

All that remains is to compile the model and run the sampling.

#STAN takes data as a dict
data = {'N': len(x), 'x': x, 'y': y}

STAN takes input data in the form of a dictionary. It is important that this dict contains all of the variables that we told STAN to expect in the data block, otherwise the model won’t compile.

import stan  #PyStan 3 is imported as "stan"

#parameters for STAN fitting
chains = 2
samples = 1000
warmup = 10

# Compile the model (random_seed makes the run reproducible)
posterior = stan.build(model, data=data, random_seed=42)
# Train the model and generate samples
fit = posterior.sample(num_chains=chains, num_samples=samples, num_warmup=warmup)

The .sample() method parameters control the Hamiltonian Monte Carlo (HMC) sampling process, where:
  • num_chains: the number of times we repeat the sampling process (each chain is an independent run).
  • num_samples: the number of samples to be drawn in each chain.
  • warmup (passed as num_warmup): the number of initial samples that we discard (since it takes some time to reach the general vicinity of the solution space).

Knowing the right values for these parameters depends on both the complexity of our model and the resources available.

Larger sample sizes are of course ideal, yet for an ill-specified model they may prove to be just a waste of time and computation. Anecdotally, I’ve had large-data models that I had to wait a week to finish running, only to find that the model hadn’t converged. It is important to start small and sanity-check your model before running a full-fledged sampling.
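Once sampling finishes, the draws can be inspected directly. With PyStan 3, one convenient way (assuming the pandas library is installed) is to pull them into a DataFrame:

df = fit.to_frame()  #one row per posterior draw, one column per parameter / generated quantity
print(df[["alpha", "beta", "sigma"]].describe())  #quick sanity check of the fitted values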

Model Evaluation

The generated quantities are used for:

  • evaluating the goodness of fit, i.e. convergence,
  • predictions,
  • model comparison.

Convergence

The first step for evaluating the model, in the Bayesian framework, is visual: we observe the sampling draws of the Hamiltonian Monte Carlo (HMC) sampling process.

Model Convergence: visually evaluating the overlap of independent sampling chains (Image from code by Creator)

In simplistic terms, STAN iteratively draws samples for our parameter values and evaluates them (HMC does much more, but that’s beyond our current scope). For a good fit, the sample draws must converge to some common general area which would, ideally, be the global optimum.

The figure above shows the sampling draws for our model across 2 independent chains (red and blue).

  • On the left, we plot the overall distribution of the fitted parameter values, i.e. the posteriors. We expect a normal distribution if the model, and its parameters, are well specified. (Why is that? Well, a normal distribution simply implies that there exists a certain range of best-fit values for the parameter, which speaks in support of our chosen model form.) Moreover, we should expect considerable overlap across chains IF the model is converging to an optimum.
  • On the right, we plot the actual samples drawn in each iteration (just to be extra sure). Here, again, we want to see not only a narrow range but also lots of overlap between the draws.
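Plots like these can be produced from the fit object in several ways; one option is the ArviZ library (an assumption on my part, since any plotting route works):

import arviz as az

idata = az.from_pystan(posterior=fit)                       #convert the PyStan fit to InferenceData
az.plot_trace(idata, var_names=["alpha", "beta", "sigma"])  #posterior densities + per-iteration draws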

Not all evaluation metrics are visual. Gelman et al. [1] also propose the Rhat diagnostic, which essentially is a mathematical measure of sample similarity across chains. Using Rhat, one can define a cutoff point beyond which the two chains are judged too dissimilar to be converging. The cutoff, however, is hard to define due to the iterative nature of the process and the variable warmup periods.
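If you have converted the fit to an InferenceData object as above, Rhat (along with the effective sample size) can be checked numerically, for example:

print(az.summary(idata, var_names=["alpha", "beta", "sigma"]))  #includes r_hat and effective sample size columns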

Visual comparison is hence a vital component, regardless of diagnostic tests.

A frequentist thought you might have here is: “well, if all we have is chains and distributions, what’s the actual parameter value?” This is exactly the point. The Bayesian formulation only deals in distributions, NOT point estimates with their hard-to-interpret test statistics.

That said, the posterior can still be summarized using credible intervals like the Highest Density Interval (HDI), which includes all of the x% highest-probability-density points.

95% HDI for beta (Image from code by Creator)
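A summary and plot like this can be obtained, for example, with ArviZ (again assuming the idata object from the convergence section):

az.plot_posterior(idata, var_names=["beta"], hdi_prob=0.95)  #posterior density for beta with its 95% HDI
print(az.hdi(idata, hdi_prob=0.95))                          #numeric HDI bounds for each parameter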

It is important to contrast Bayesian credible intervals with frequentist confidence intervals.

  • The credible interval gives a probability distribution over the possible values of the parameter, i.e. the probability of the parameter assuming each value in some interval, given the data.
  • The confidence interval regards the parameter value as fixed, and instead estimates the confidence that intervals built from repeated random samplings of the data would capture it.

Hence the

Bayesian approach lets the parameter values be fluid and takes the data at face value, while the frequentist approach demands that there exists one true parameter value… if only we had access to all the data ever

Phew. Let that sink in, read it again until it does.

Another important implication of using credible intervals, or in other words of allowing the parameter to be variable, is that the predictions we make capture this uncertainty transparently, with a certain HDI % informing the best-fit line.

95% HDI line of best fit (Image from code by Creator)
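One simple way to draw such a band is to summarize the yhat draws at each observed x. The sketch below uses central percentile bounds as a stand-in for the HDI (an exact HDI, e.g. via az.hdi, would be computed similarly):

yhat_draws = fit["yhat"]                        #shape (N, total draws) with PyStan 3
lower = np.percentile(yhat_draws, 2.5, axis=1)  #bounds of the central 95% interval per data point
upper = np.percentile(yhat_draws, 97.5, axis=1)
order = np.argsort(x)                           #sort by x so the band plots cleanly

plt.scatter(x, y, alpha=0.8)
plt.fill_between(x[order], lower[order], upper[order], alpha=0.3)
plt.show()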

Model comparison

In the Bayesian framework, the Watanabe-Akaike Information Criterion (WAIC) score is the widely accepted choice for model comparison. A simple explanation of the WAIC score is that it estimates the model likelihood while regularizing for the number of model parameters. In simple words, it can account for overfitting. This is also a major draw of the Bayesian framework: one does not necessarily need to hold out a validation dataset. Hence,

Bayesian modeling offers a vital advantage when data is scarce.

The WAIC score is a comparative measure, i.e. it only holds meaning when compared across different models that attempt to explain the same underlying data. Thus, in practice, one can keep adding complexity to the model so long as the WAIC increases. If at some point in this process of adding maniacal complexity the WAIC starts dropping, one can call it a day: any more complexity won’t offer an informational advantage in describing the underlying data distribution.
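As a concrete sketch, the log_lik values generated earlier are exactly what a WAIC computation needs. With ArviZ (the log_likelihood argument below reflects how the conversion step is typically written, and the second model in the comparison is purely hypothetical):

idata = az.from_pystan(posterior=fit, log_likelihood={"y": "log_lik"})
print(az.waic(idata))  #reports elpd_waic and p_waic (the effective number of parameters)
#to compare candidate models fitted to the same data:
#print(az.compare({"linear": idata, "richer_model": idata_richer}, ic="waic"))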

Conclusion

To summarize, the STAN model block is just a string. It tells STAN what you are going to give it (data), what is to be found (parameters), what you think is happening (model), and what it should give you back (generated quantities).

When turned on, STAN simply turns the crank and gives its output.

The real challenge lies in defining a proper model (see priors), structuring the data appropriately, asking STAN exactly what you want from it, and evaluating the sanity of its output.

Once we have this part down, we can delve into the real power of STAN, where specifying increasingly complicated models becomes just a simple syntactical task. In fact, in our next tutorial we will do exactly this. We will build upon this simple regression example to explore Bayesian hierarchical models: an industry standard, state of the art, de facto… you name it. We will see how to add group-level random or fixed effects to our models, and marvel at the ease of adding complexity while maintaining comparability in the Bayesian framework.

Subscribe if this article helped, and stay tuned for more!

References

[1] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari and Donald B. Rubin (2013). Bayesian Data Analysis, Third Edition. Chapman and Hall/CRC.
