
A Gentle Introduction to Bayesian Deep Learning


Welcome to the exciting world of Probabilistic Programming! This article is a gentle introduction to the field; you only need a basic understanding of Deep Learning and Bayesian statistics.

By the end of this article, you should have a basic understanding of the field, its applications, and how it differs from more traditional deep learning methods.

If, like me, you have heard of Bayesian Deep Learning and guess that it involves Bayesian statistics, but you don't know exactly how it is used, you are in the right place.

One of the main limitations of traditional deep learning models is that, even though they are very powerful tools, they do not provide a measure of their uncertainty.

ChatGPT can state false information with blatant confidence. Classifiers output probabilities that are often poorly calibrated.

Uncertainty estimation is a crucial aspect of decision-making, especially in areas such as healthcare and self-driving cars. We want a model to be able to tell us when it is very unsure about classifying a subject with brain cancer, so that we can require further diagnosis from a medical professional. Similarly, we want autonomous cars to be able to slow down when they identify a new environment.

To illustrate how badly a neural network can estimate probabilities, let's look at a very simple classifier neural network with a softmax layer at the end.

The softmax has a very understandable name: it is a Soft Max function, meaning it is a “smoother” version of a max function. The reason for this is that if we had picked a “hard” max function, just taking the class with the highest score, we would have a zero gradient for all the other classes.

With a softmax, the probability of a class can be close to 1, but never exactly 1. And since the probabilities of all classes sum to 1, there is still some gradient flowing to the other classes.

Hard max vs Soft Max, Image by author
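Here is a minimal NumPy sketch of the two functions compared in the figure (the function names are my own, not from any particular library):

```python
import numpy as np

def hard_max(logits):
    # "Hard" max: a one-hot vector on the largest logit.
    # The output is constant almost everywhere, so all other classes get zero gradient.
    out = np.zeros_like(logits)
    out[np.argmax(logits)] = 1.0
    return out

def softmax(logits):
    # "Soft" max: exponentiate and normalize (shifting by the max for numerical stability).
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])
print(hard_max(logits))          # [1. 0. 0.]
print(softmax(logits).round(2))  # approx. [0.66 0.24 0.1], every class keeps some probability mass
```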

However, the softmax function also presents an issue: it outputs probabilities that are poorly calibrated. The exponential exaggerates differences between the values fed into it, so even moderate gaps between logits get pushed towards probabilities of 0 or 1.

This often results in overconfidence, with the model giving high probabilities to certain classes even in the face of uncertainty, a characteristic inherent to the “max” nature of the softmax function.
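You can see this saturation with a few made-up logits (purely illustrative numbers, not from a real model):

```python
import numpy as np
from scipy.special import softmax  # same softmax as above, from SciPy

# A modest gap between logits already yields a near-certain prediction.
for logits in ([1.0, 0.0], [3.0, 0.0], [6.0, 0.0]):
    print(logits, "->", softmax(np.array(logits)).round(3))
# [1.0, 0.0] -> [0.731 0.269]
# [3.0, 0.0] -> [0.953 0.047]
# [6.0, 0.0] -> [0.998 0.002]
```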

Comparing a standard Neural Network (NN) with a Bayesian Neural Network (BNN) highlights the importance of uncertainty estimation. A BNN's certainty is high when it encounters familiar distributions from the training data, but as we move away from known distributions, the uncertainty increases, providing a more realistic estimate.

Here is what an estimate of uncertainty can look like:

Traditional NN vs Bayesian NN, Image by author

You can see that when we are close to the distribution observed during training, the model is very certain, but as we move farther from the known distribution, the uncertainty increases.
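To get an intuition for this behaviour without a full BNN, here is a toy bootstrap ensemble (my own construction, not Bayesian inference): the member models agree near the training data and disagree far from it, which is the same qualitative picture as in the figure above.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-2, 2, size=100)                 # training inputs live in [-2, 2]
y_train = np.sin(x_train) + 0.1 * rng.normal(size=100)

x_test = np.linspace(-6, 6, 200)                       # test inputs go well outside the data
preds = []
for _ in range(50):
    idx = rng.integers(0, len(x_train), len(x_train))  # bootstrap resample
    coefs = np.polyfit(x_train[idx], y_train[idx], deg=5)
    preds.append(np.polyval(coefs, x_test))

preds = np.stack(preds)
std = preds.std(axis=0)                                # disagreement = uncertainty proxy
print("spread near the data    :", std[np.abs(x_test) < 2].mean().round(3))
print("spread far from the data:", std[np.abs(x_test) > 4].mean().round(3))
```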

There is one central theorem to know in Bayesian statistics: Bayes' theorem.

Bayes' theorem, Image by author
  • The prior is the distribution of theta we think is the most likely before any observation. For a coin toss, for example, we could assume that the probability of getting a head follows a Gaussian centered around p = 0.5.
  • If we want to put as little inductive bias as possible, we could also say p is uniform on [0, 1].
  • The likelihood is: given a parameter theta, how likely is it that we got our observations X, Y?
  • The marginal likelihood is the likelihood integrated over all possible theta. It is called “marginal” because we marginalize theta out by averaging it over its whole distribution.
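Putting these terms together, the theorem in the figure above reads:

```latex
\underbrace{P(\theta \mid X, Y)}_{\text{posterior}}
  = \frac{\overbrace{P(Y \mid X, \theta)}^{\text{likelihood}} \;
          \overbrace{P(\theta)}^{\text{prior}}}
         {\underbrace{P(Y \mid X)}_{\text{marginal likelihood}}},
\qquad
P(Y \mid X) = \int P(Y \mid X, \theta)\, P(\theta)\, d\theta .
```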

The key idea to understand in Bayesian statistics is that you start from a prior, your best guess of what the parameter could be (it is a distribution). Then, with the observations you make, you adjust your guess and obtain a posterior distribution.

Note that the prior and posterior are not point estimates of theta but probability distributions.

To illustrate this:

Image by author

In this image you can see that the prior is shifted to the right, but the likelihood rebalances it to the left, and the posterior ends up somewhere in between.
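To make the coin-toss example above concrete, here is a minimal sketch using a Beta prior (the standard conjugate choice for a coin, swapped in for the Gaussian mentioned earlier) and made-up flip counts:

```python
# Beta-Bernoulli coin toss: with a Beta prior and Bernoulli observations,
# the posterior is again a Beta whose parameters simply add the observed counts.
from scipy import stats

a_prior, b_prior = 10, 10   # prior belief: roughly fair coin, peaked at p = 0.5
heads, tails = 3, 7         # hypothetical observations, chosen for illustration

a_post, b_post = a_prior + heads, b_prior + tails

print("prior mean of p    :", stats.beta(a_prior, b_prior).mean())  # 0.5
print("posterior mean of p:", stats.beta(a_post, b_post).mean())    # ~0.43, pulled towards the data
```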

Bayesian Deep Learning is an approach that marries two powerful mathematical theories: Bayesian statistics and Deep Learning.

The essential distinction from traditional Deep Learning resides in the treatment of the model's weights:

In traditional Deep Learning, we train a model from scratch: we randomly initialize a set of weights and train the model until it converges to a new set of parameters. We learn a single set of weights.

Conversely, Bayesian Deep Learning adopts a more dynamic approach. We start with a prior belief about the weights, often assuming they follow a normal distribution. As we expose our model to data, we adjust this belief, updating the posterior distribution of the weights. In essence, we learn a probability distribution over the weights instead of a single set.

During inference, we average the predictions of all models, weighting their contributions by the posterior. This means that if a set of weights is highly probable, its corresponding prediction is given more weight.

Let’s formalize all of that:

Inference, Image by author

Inference in Bayesian Deep Learning integrates over all potential values of theta (weights) using the posterior distribution.
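Written out, this is the posterior predictive distribution, an average of the model's prediction over all weight configurations, weighted by the posterior:

```latex
p(y \mid x, \mathcal{D})
  = \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta,
\qquad \mathcal{D} = \{X, Y\}.
```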

We can also see that in Bayesian statistics, integrals are everywhere. This is actually the principal limitation of the Bayesian framework: these integrals are often intractable (we do not always know an antiderivative of the posterior), so we have to resort to very computationally expensive approximations.
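In practice, the integral above is usually replaced by a Monte Carlo average over a finite number of posterior samples. A minimal sketch, where `sample_weights` and `forward` are hypothetical placeholders for your posterior approximation and network:

```python
import numpy as np

def predict_bayesian(x, sample_weights, forward, n_samples=100):
    """Average the predictions of `n_samples` models drawn from the posterior.

    `sample_weights()` and `forward(x, w)` stand in for whatever posterior
    approximation and network architecture you actually use.
    """
    preds = np.stack([forward(x, sample_weights()) for _ in range(n_samples)])
    return preds.mean(axis=0), preds.std(axis=0)  # mean prediction + spread (uncertainty)

# Toy usage: a "network" that is just a dot product, with a Gaussian
# posterior over its two weights (made-up numbers, purely for illustration).
rng = np.random.default_rng(0)
mean_pred, std_pred = predict_bayesian(
    x=np.array([1.0, 2.0]),
    sample_weights=lambda: rng.normal([0.5, -0.3], [0.1, 0.1]),
    forward=lambda x, w: x @ w,
    n_samples=1000,
)
print(mean_pred, std_pred)
```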

Advantage 1: Uncertainty estimation

  • Arguably the most prominent benefit of Bayesian Deep Learning is its capacity for uncertainty estimation. In many domains, including healthcare, autonomous driving, language models, computer vision, and quantitative finance, the ability to quantify uncertainty is crucial for making informed decisions and managing risk.

Advantage 2: Improved training efficiency

  • Closely tied to the concept of uncertainty estimation is improved training efficiency. Since Bayesian models are aware of their own uncertainty, they can prioritize learning from the data points where the uncertainty, and hence the potential for learning, is highest. This approach, known as Active Learning, leads to impressively effective and efficient training.

Demonstration of the effectiveness of Active Learning, Image by author

As demonstrated in the graph above, a Bayesian Neural Network using Active Learning achieves 98% accuracy with just 1,000 training images. In contrast, models that do not exploit uncertainty estimation tend to learn at a slower pace.
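Here is a sketch of the selection step. Predictive entropy of the averaged Monte Carlo probabilities is one common acquisition function (my choice here); the data is random and purely illustrative.

```python
import numpy as np

def select_most_uncertain(mc_probs, k=10):
    """mc_probs: array of shape (n_mc_samples, n_points, n_classes)."""
    mean_probs = mc_probs.mean(axis=0)                              # average over posterior samples
    entropy = -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=1)
    return np.argsort(entropy)[-k:]                                 # indices to send for labeling

# Example: 5 posterior samples, 1000 unlabeled points, 10 classes (random probabilities).
mc_probs = np.random.default_rng(0).dirichlet(np.ones(10), size=(5, 1000))
print(select_most_uncertain(mc_probs, k=5))
```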

Advantage 3: Inductive Bias

Another advantage of Bayesian Deep Learning is the effective use of inductive bias through priors. Priors allow us to encode our initial beliefs or assumptions about the model parameters, which can be particularly useful in scenarios where domain knowledge exists.

Consider generative AI, where the idea is to create new data (like medical images) that resemble the training data. For example, if you are generating brain images and you already know the general layout of a brain (white matter inside, grey matter outside), this information can be included in your prior. This means you can assign a higher probability to the presence of white matter in the center of the image, and gray matter towards the edges.

In essence, Bayesian Deep Learning not only empowers models to learn from data but also enables them to start learning from a point of knowledge rather than from scratch. This makes it a potent tool for a wide range of applications.

White Matter and Gray Matter, Image by author

It seems that Bayesian Deep Learning is incredible! So why is this field so underrated? We often talk about Generative AI, ChatGPT, SAM, or more traditional neural networks, but we almost never hear about Bayesian Deep Learning. Why is that?

Limitation 1: Bayesian Deep Learning is slooooow

The key thing to understand about Bayesian Deep Learning is that we “average” the predictions of the model, and whenever there is an average, there is an integral over the set of parameters.

But computing this integral is usually intractable, meaning there is no closed or explicit form that makes the computation quick. We cannot compute it directly; we have to approximate the integral by sampling some points, and this makes inference very slow.

Imagine that for each data point x we have to average the predictions of 10,000 models, and that each prediction can take 1 s to run: we end up with a model that does not scale to a large amount of data.

In most business cases, we need fast and scalable inference, which is why Bayesian Deep Learning is not so popular.

Limitation 2: Approximation Errors

In Bayesian Deep Learning, it is often necessary to use approximate methods, such as Variational Inference, to compute the posterior distribution of the weights. These approximations can lead to errors in the final model. The quality of the approximation depends on the choice of the variational family and the divergence measure, which can be difficult to choose and tune properly.
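For reference, Variational Inference picks an approximate posterior q_φ(θ) from a chosen family and maximizes the evidence lower bound (ELBO), which is exactly where both the variational family and the divergence measure (here a KL term) enter:

```latex
\mathcal{L}(\phi)
  = \mathbb{E}_{q_\phi(\theta)}\!\left[\log p(\mathcal{D} \mid \theta)\right]
  - \mathrm{KL}\!\left(q_\phi(\theta) \,\|\, p(\theta)\right)
  \;\le\; \log p(\mathcal{D}).
```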

Limitation 3: Increased Model Complexity and Interpretability

While Bayesian methods offer improved measures of uncertainty, this comes at the cost of increased model complexity. BNNs can be difficult to interpret because, instead of a single set of weights, we now have a distribution over possible weights. This complexity can make it harder to explain the model's decisions, especially in fields where interpretability is essential.

There is growing interest in XAI (Explainable AI). Traditional deep neural networks are already difficult to interpret because it is hard to make sense of their weights; Bayesian Deep Learning is even harder.


