An Intuitive Guide to MCMC (Part I): The Metropolis-Hastings Algorithm

If you've worked with Bayesian statistics, you've likely encountered MCMC. While the rest of the world is fixated on the latest LLM hype, Markov Chain Monte Carlo remains the quiet workhorse of high-end quantitative finance and risk management. It's the tool of choice when "guessing" isn't enough and you want to rigorously map out uncertainty.

Despite the intimidating acronym, MCMC is a combination of two straightforward concepts:

  • A Markov Chain is a stochastic process where the next state of the system depends only on its current state and never on the sequence of events that preceded it. This property is known as the Markov property.
  • A Monte Carlo method simply refers to any algorithm that relies on repeated random sampling to obtain numerical results.

In this series, we'll present the core algorithms used in MCMC frameworks, dealing primarily with those used for Bayesian methods.

We start with Metropolis-Hastings: the foundational algorithm that enabled the field's earliest breakthroughs. But before diving into the mechanics, let's discuss the problem MCMC methods help solve.

The Problem

Suppose we want to be able to sample values from a probability distribution whose density formula we know. In this example we use the standard normal distribution. Let's call the function that samples from it rnorm.

For rnorm to be considered useful, it must generate values (x) over the long run that match the probabilities of our target distribution. In other words, if we let rnorm run (100,000) times, gathered those values, and plotted them by the frequency at which they appeared (a histogram), the shape would resemble the standard normal distribution.
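As a quick illustrative sketch (using NumPy's generator as a stand-in for rnorm, which isn't defined in this article), we can check that the histogram of many draws tracks the normal density:

```python
import numpy as np

# Stand-in for rnorm: 100,000 draws from the standard normal.
rng = np.random.default_rng(42)
samples = rng.normal(size=100_000)

# Histogram the draws and compare bin heights to the normal density.
counts, edges = np.histogram(samples, bins=50, range=(-4, 4), density=True)
centres = (edges[:-1] + edges[1:]) / 2
target = np.exp(-centres**2 / 2) / np.sqrt(2 * np.pi)

# With this many draws the empirical shape hugs the true curve.
max_gap = np.max(np.abs(counts - target))
print(f"largest bin deviation from the density: {max_gap:.4f}")
```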

We start with the formula for the unnormalised density of the normal distribution:

\[p(x) = e^{-\frac{x^2}{2}}\]

This function returns a density for a given (x) rather than a probability. To get a probability, we need to normalise our density function by a constant, so that the total area under the curve integrates (sums) to (1).

To find this constant we need to integrate the density function across all possible values of (x).

\[C = \int_{-\infty}^{\infty} e^{-\frac{x^2}{2}}\,dx\]

There is no closed-form solution to the indefinite integral of (e^{-x^2}). Nevertheless, mathematicians solved the definite integral (from (-\infty) to (\infty)) by moving to polar coordinates (because apparently, turning a (1D) problem into a (2D) one makes it easier to solve), finding that the total area is (\sqrt{2\pi}).

Therefore, to make the area under the curve sum to (1), the constant must be the reciprocal of this area:

\[C = \frac{1}{\sqrt{2\pi}}\]

This is where the well-known normalisation constant (C) for the normal distribution comes from.

OK great, we have the mathematician-given constant that makes our distribution a valid probability distribution. But we still need to be able to sample from it.

Since our scale is continuous and infinite, the probability of drawing any particular number (e.g. (x = 1.2345…) to infinite precision) is exactly zero. This is because a single point has no width, and therefore contains no 'area' under the curve.

Instead, we must speak in terms of ranges: the probability that (x) falls between two values (a) and (b).

In other words, we need to find the area under the curve between (a) and (b). And as you've correctly guessed, to calculate this area we typically need to integrate our normalised density formula – the resulting integrated function is known as the Cumulative Distribution Function (CDF).

Since we cannot solve the integral, we cannot derive the CDF, and so we're stuck again. Clever mathematicians solved this too, using trigonometry (specifically the Box-Muller transform) to convert uniform random variables into normal random variables.
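As an aside, the Box-Muller transform is short enough to sketch (a generic textbook version, not code from this article): two uniforms in, two independent standard normals out.

```python
import numpy as np

def box_muller(u1, u2):
    """Turn two Uniform(0,1) draws into two independent N(0,1) draws."""
    r = np.sqrt(-2.0 * np.log(u1))   # radius: inverts the radial part of the 2D Gaussian
    theta = 2.0 * np.pi * u2         # uniform angle on the circle
    return r * np.cos(theta), r * np.sin(theta)

rng = np.random.default_rng(0)
u1 = 1.0 - rng.random(50_000)        # shift to (0, 1] so log(0) can't occur
u2 = rng.random(50_000)
z1, z2 = box_muller(u1, u2)

# The outputs should have mean ~0 and variance ~1.
print(np.mean(z1), np.var(z1))
```

Note the same polar-coordinates idea that cracked the Gaussian integral reappears here: the radius and angle are sampled separately.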

But here is the catch: in the real world of Bayesian statistics and machine learning, we deal with complex multi-dimensional distributions where:

  • We cannot solve the integral analytically.
  • Therefore we cannot find the normalisation constant (C).
  • Finally, without the integral, we cannot calculate the CDF, so standard sampling fails.

We're stuck with an unnormalised formula and no way to calculate the total area. This is where MCMC comes in. MCMC methods allow us to sample from these distributions without ever needing to solve that intractable integral.

Markov Chains: An Introduction

A Markov chain is uniquely defined by its transition probabilities (P(x \rightarrow x')).

For instance, in a system with (4) states:

\[P(x \rightarrow x') = \begin{bmatrix} 0.5 & 0.3 & 0.05 & 0.15 \\ 0.2 & 0.4 & 0.1 & 0.3 \\ 0.4 & 0.4 & 0 & 0.2 \\ 0.1 & 0.8 & 0.05 & 0.05 \end{bmatrix}\]

The probability of going from any state (x) to (x') is given by entry (i \rightarrow j) in the matrix.

Look at the third row, for example: ([0.4, 0.4, 0, 0.2]).

It tells us that if the system is currently in State (3), it has a (40%) probability of moving to State (1), a (40%) probability of moving to State (2), a (0%) probability of staying in State (3), and a (20%) probability of moving to State (4).

The matrix maps every possible path to its corresponding probability. Notice that each row (i) sums to (1), so that our transition matrix represents valid probabilities.

A Markov process also requires an initial state distribution (\pi_0) (the distribution over starting states).

For instance, it might look like this:

\[\pi_0 = \begin{bmatrix} 0.4 & 0.15 & 0.25 & 0.2 \end{bmatrix}\]

This simply means the probability of starting from state (1) is (0.4), from state (2) is (0.15), and so on.

To find the probability distribution of where the system will be after the first step ((t_0 + 1)), we multiply the initial distribution by the transition matrix:

\[\pi_1 = \pi_0 P\]

The matrix multiplication effectively gives us the probability of all the routes we can take to get to a state (j), by summing up the individual probabilities of reaching (j) from each starting state (i).
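Using the example transition matrix and initial distribution above, this step is a one-liner in NumPy:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.05, 0.15],
              [0.2, 0.4, 0.10, 0.30],
              [0.4, 0.4, 0.00, 0.20],
              [0.1, 0.8, 0.05, 0.05]])
pi0 = np.array([0.4, 0.15, 0.25, 0.2])

# pi1[j] sums pi0[i] * P[i, j] over every starting state i.
pi1 = pi0 @ P
print(pi1)          # probabilities 0.35, 0.44, 0.045, 0.165
print(pi1.sum())    # still sums to 1
```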

Why this works

By using matrix multiplication we're exploring every possible path to a destination and summing their likelihoods.

Notice that the operation also beautifully preserves the requirement that the probabilities of being in each state always sum to (1).

Stationary Distribution

A properly constructed Markov process reaches a state of equilibrium as the number of steps (t) approaches infinity:

\[\pi^* P = \pi^*\]

Such a state is known as equilibrium.

(\pi^*) is known as the stationary distribution and represents a point at which the probability distribution after a transition ((\pi^* P)) is identical to the probability distribution before the transition ((\pi^*)).

The existence of such a state happens to be the foundation of every MCMC method.

When sampling a target distribution using a stochastic process, we aren't asking where the chain is at any one moment, but rather how often it visits each state in the long run. To answer that, we need to introduce long-run predictability into the system.

This guarantees that there exists a theoretical step (t) at which the probabilities "settle down" instead of wandering randomly for all eternity. The point at which they settle is the point at which we can hope to start sampling from our target distribution.
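We can watch this settling happen numerically. A small sketch, reusing the 4-state transition matrix from earlier: repeatedly applying (P) drives the distribution to a point where one more transition changes nothing.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.05, 0.15],
              [0.2, 0.4, 0.10, 0.30],
              [0.4, 0.4, 0.00, 0.20],
              [0.1, 0.8, 0.05, 0.05]])

pi = np.array([0.4, 0.15, 0.25, 0.2])   # the earlier initial distribution
for _ in range(200):                     # keep stepping: pi <- pi P
    pi = pi @ P

# At equilibrium a further transition leaves the distribution unchanged.
print(pi)
print(np.allclose(pi @ P, pi))           # True once the chain has settled
```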

Thus, to be able to effectively estimate a probability distribution using a Markov process, we need to ensure that:

  • there exists a stationary distribution.
  • this stationary distribution is unique, otherwise we could have multiple states of equilibrium, some in regions far away from our target distribution.

The mathematical constraints imposed by the algorithm make a Markov process satisfy these conditions; this is central to all MCMC methods, though how it is achieved may vary.

Uniqueness

Generally, to guarantee the uniqueness of the stationary distribution, we need to satisfy three conditions:

The Holy Trinity
  • Irreducible: A system is irreducible if from any state (x) there is a non-zero probability of eventually visiting any point (x') in the sample space. Simply put, you can get from any state A to any state B, eventually.
  • Aperiodic: The system must not return to a particular state at fixed intervals. A sufficient condition for aperiodicity is that there exists at least one state where the probability of staying put is non-zero.
  • Positive Recurrent: A state (x) is positive recurrent if, starting from that state, the system is guaranteed to return to it and the average number of steps it takes to return is finite. This is assured by modelling a target that has a finite integral and is a proper probability distribution (the area under the curve must sum to (1)).

Any system that meets these conditions is known as an ergodic system. The tables at the end of the article show how the MH algorithm ensures ergodicity and therefore uniqueness.

Metropolis-Hastings

The approach the MH algorithm takes is to start with the definition of detailed balance – a sufficient but not necessary condition for global balance. Quite simply, if our algorithm satisfies detailed balance, our simulation is guaranteed to have a stationary distribution.

Derivation

The definition of detailed balance is:

\[\pi(x) P(x'|x) = \pi(x') P(x|x'),\]

which means the probability flow from (x) to (x') is the same as the probability flow from (x') to (x).

The idea is to find the stationary distribution by iteratively constructing the transition matrix (P(x'|x)), setting (\pi) to be the target distribution (P(x)) we wish to sample from.

To implement this, we decompose the transition probability (P(x’|x)) into two separate steps:

  1. Proposal ((g)): The probability of proposing a move to (x') given that we are at (x).
  2. Acceptance ((a)): The acceptance function gives us the probability of accepting the proposal.

Thus,

\[P(x'|x) = g(x'|x)\, a(x',x)\]

The Hastings Correction

Substituting these values back into the detailed balance equation gives us:

\[\frac{\pi(x)}{\pi(x')} = \frac{g(x|x')\, a(x,x')}{g(x'|x)\, a(x',x)},\]

and finally, rearranging, we get an expression for our acceptance as a ratio:

\[\frac{a(x',x)}{a(x,x')} = \frac{g(x|x')\, \pi(x')}{g(x'|x)\, \pi(x)}\]

This ratio represents how much more likely we are to accept a move to (x') compared to a move back to (x).

The proposal ratio (\frac{g(x|x')}{g(x'|x)}) is known as the Hastings correction.

Important Note

Because we often choose a symmetric distribution for the proposal, the probability of jumping from (x \rightarrow x') is the same as the probability of jumping from (x' \rightarrow x). Therefore, the proposal terms cancel each other out, leaving only the ratio of the target densities.

This special case, where the proposal is symmetric and the (g) terms vanish, is historically known as the Metropolis Algorithm (1953). The more general version that allows for asymmetric proposals (requiring the (g) ratio – the Hastings correction) is the Metropolis-Hastings Algorithm (1970).

The Breakthrough

Recall the original problem: we cannot calculate (\pi(x)) because we don't know the normalisation constant (C) (the integral).

However, look closely at the ratio (\frac{\pi(x')}{\pi(x)}). If we expand (\pi(x)) into its unnormalised density (f(x)) and the constant (C):

\[\frac{\pi(x')}{\pi(x)} = \frac{f(x')/C}{f(x)/C} = \frac{f(x')}{f(x)}\]

The constant (C) cancels out!

This is the breakthrough. We can now sample from a complex distribution using only the unnormalised density (which we know) and the proposal distribution (which we choose).

All that's left to do is to find an acceptance function (a) that satisfies detailed balance:

\[\frac{a(x',x)}{a(x,x')} = R,\]

where (R = \frac{g(x|x')\,\pi(x')}{g(x'|x)\,\pi(x)}).

The Metropolis Acceptance

The acceptance function the algorithm uses is:

\[a(x',x) = \min(1, R)\]

This ensures that the acceptance probability is always between (0) and (1).

To see why this choice satisfies detailed balance, we must verify that the equation holds for the reverse move as well. That is, we need to verify that:

\[\frac{a(x',x)}{a(x,x')} = R,\]

in two cases:

Case I: The move is advantageous ((R \ge 1))

Since (R \ge 1), the inverse (\frac{1}{R} \le 1):

  • our forward acceptance is (a(x',x) = \min(1,R) = 1)
  • our reverse acceptance is (a(x,x') = \min(1,\frac{1}{R}) = \frac{1}{R})

Thus:

\[\frac{1}{a(x,x')} = R\]

\[\frac{1}{\frac{1}{R}} = R\]

Case II: The move is not advantageous ((R < 1))

Since (R < 1), the inverse (\frac{1}{R} > 1):

  • our forward acceptance is (a(x',x) = \min(1,R) = R)
  • our reverse acceptance is (a(x,x') = \min(1,\frac{1}{R}) = 1)

Thus:

\[\frac{R}{a(x,x')} = R\]

\[\frac{R}{1} = R,\]

and the equality is satisfied in both cases.
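Both cases can be sanity-checked numerically in a few lines (the (R) values below are arbitrary):

```python
def a(r):
    """Metropolis acceptance probability for a move with ratio r."""
    return min(1.0, r)

for R in (2.5, 0.4):                    # Case I (R >= 1) and Case II (R < 1)
    forward = a(R)                      # a(x', x): accept the forward move
    reverse = a(1.0 / R)                # a(x, x'): accept the reverse move
    assert abs(forward / reverse - R) < 1e-9
    print(f"R = {R}: forward = {forward}, reverse = {reverse}")
```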

Implementation

Let's implement the MH algorithm in Python on two example target distributions.

I. Estimating a Gaussian Distribution
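The post's original listing isn't reproduced here, but a minimal Metropolis sampler for the unnormalised density (f(x) = e^{-x^2/2}) (symmetric Gaussian proposal, so the Hastings terms cancel) could look like this:

```python
import numpy as np

def f(x):
    """Unnormalised target density: the 1/sqrt(2*pi) constant isn't needed."""
    return np.exp(-x**2 / 2)

def metropolis(n_iterations=100_000, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0                                     # arbitrary starting point
    samples = np.empty(n_iterations)
    for i in range(n_iterations):
        proposal = x + rng.normal(scale=step)   # symmetric proposal g
        R = f(proposal) / f(x)                  # the constant C cancels here
        if rng.random() < R:                    # accept with probability min(1, R)
            x = proposal
        samples[i] = x                          # rejected moves repeat x
    return samples

samples = metropolis()
print(np.mean(samples), np.std(samples))        # should be near 0 and 1
```

Notice the trick in action: `f` is unnormalised, yet the sampler is valid because only the ratio `f(proposal) / f(x)` is ever used.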

When we plot the samples on a chart against a true normal distribution, this is what we get:

The MH algorithm sampling a Gaussian distribution

Now you might be wondering why we bothered running an MCMC method for something we could do with np.random.normal(size=n_iterations). That is a very valid point! In fact, for a 1-dimensional Gaussian, the inverse-transform solution (using trigonometry) is far more efficient and is what NumPy actually uses.

But at least we know that our code works! Now, let's try something more interesting.

II. Estimating the ‘Volcano’ Distribution

Let's attempt to sample from a much less 'standard' distribution, constructed in 2 dimensions, with the third dimension representing the distribution's density.

Because the sampling occurs in (2D) space (the algorithm only knows its x-y location, not the 'slope' of the volcano), we get a neat ring around the mouth of the volcano.

The sampler visits the points of highest density, the 'mouth' of the volcano.
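The exact 'volcano' density isn't given in the text; as an illustration, assume a hypothetical unnormalised density that peaks on a ring of radius 2. The same Metropolis loop then works unchanged in (2D):

```python
import numpy as np

def f(p):
    """Hypothetical 'volcano': unnormalised density peaking on a ring of radius 2."""
    r = np.hypot(p[0], p[1])
    return np.exp(-((r - 2.0) ** 2) / (2 * 0.3**2))

rng = np.random.default_rng(1)
x = np.array([2.0, 0.0])                          # start on the rim to skip burn-in
samples = np.empty((50_000, 2))
for i in range(len(samples)):
    proposal = x + rng.normal(scale=0.5, size=2)  # symmetric 2D proposal
    if rng.random() < f(proposal) / f(x):
        x = proposal
    samples[i] = x

# The samples trace a ring: radii cluster near 2, the rim of the volcano.
radii = np.hypot(samples[:, 0], samples[:, 1])
print(np.mean(radii), np.std(radii))
```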

Summary of Mathematical Conditions for MCMC

Now that we've seen the basic implementation, here is a quick summary of the mathematical conditions an MCMC method requires to actually work:

| Condition | Mechanism |
| --- | --- |
| Stationary Distribution (there exists a set of probabilities that, once reached, will not change) | Detailed Balance – the algorithm is designed to satisfy the detailed balance equation. |
| Convergence (the chain eventually converges to the stationary distribution) | Ergodicity – the system must satisfy the conditions in Table 2 to be ergodic. |
| Uniqueness of Stationary Distribution (there exists only one solution to the detailed balance equation) | Ergodicity – guaranteed if the system is ergodic. |

Table 1. The conditions required for an MCMC method and the mechanism the MH algorithm uses to satisfy them.

And here's how the MH algorithm satisfies the requirements for ergodicity:

| Condition | Mechanism |
| --- | --- |
| 1. Irreducible (ability to reach any state from any other state) | Proposal Function – usually satisfied by using a proposal (like a Gaussian) that has non-zero probability everywhere. |
| 2. Aperiodic (the system doesn't get trapped in a loop) | Rejection Step – the "coin flip" allows us to reject a move and stay in the same state, breaking any periodicity that might have occurred. |
| 3. Positive Recurrent (the expected return time to any state is finite) | Proper Probability Distribution – guaranteed by the fact that we model the target as a proper distribution (i.e., it integrates/sums to (1)). |

Table 2. The three requirements for ergodicity and how the MH algorithm meets them.

Conclusion

In this article we have seen how MCMC helps solve the two main challenges of sampling from a distribution given only its unnormalised density (likelihood) function:

  • The normalisation problem: For a distribution to be a valid probability distribution, the area under the curve must sum to (1). To achieve this we need to calculate the total area under the curve and then divide our unnormalised values by that constant. Calculating the area involves integrating a complex function, and in the case of our normal-distribution example no closed-form antiderivative exists.
  • The inversion problem: To generate a sample we need to pick a random probability and ask what (x) value corresponds to this area. To do that we not only have to solve the integral but also invert it. And since we can't write down the integral, it's impossible to solve for its inverse.

MCMC methods, starting with Metropolis-Hastings, allow us to bypass these intractable maths problems by using clever random walks and acceptance ratios.

For a more robust implementation of the Metropolis-Hastings algorithm, and an example of sampling with an asymmetric proposal (utilising the Hastings correction), check out the Go code here.

What’s Next?

We've successfully sampled from a complex (2D) distribution without ever calculating an integral. However, if you look at the Metropolis-Hastings code, you'll notice our proposal step is essentially a blind, random guess (np.random.normal).

In low dimensions, guessing works. But in the high-dimensional spaces of modern Bayesian methods, guessing randomly means almost every proposal you make will be rejected.

In Part II, we'll introduce Hamiltonian Monte Carlo (HMC), an algorithm that lets us efficiently explore high-dimensional spaces by using the geometry of the distribution to guide our steps.
