Are You Sure Your Posterior Makes Sense?


Introduction

Parameter estimation has long been one of the most important topics in statistics. While frequentist approaches, such as Maximum Likelihood Estimation, were once the gold standard, the advance of computation has opened space for Bayesian methods. Estimating posterior distributions with MCMC samplers became increasingly common, but reliable inference relies on a task that is far from trivial: ensuring that the sampler, and the processes it executes under the hood, worked as expected. Keep in mind what Lewis Carroll once wrote: "If you don't know where you're going, any road will take you there."

This article is meant to help data scientists evaluate an often-neglected aspect of Bayesian parameter estimation: the reliability of the sampling process. Throughout the sections, we mix simple analogies with technical rigor to keep our explanations accessible to data scientists at any level of familiarity with Bayesian methods. Although our implementations are in Python with PyMC, the concepts we cover are useful to anyone using an MCMC algorithm, from Metropolis-Hastings to NUTS.

Key Concepts

No data scientist or statistician would disagree with the importance of robust parameter estimation methods. Whether the goal is to make inferences or run simulations, being able to model the data-generating process is an important part of the process. For a long time, estimation was mainly performed using frequentist tools, such as Maximum Likelihood Estimation (MLE) or the famous Least Squares optimization used in regressions. Yet frequentist methods have clear shortcomings, such as the fact that they focus on point estimates and don't incorporate prior knowledge that could improve estimates.

As an alternative to these tools, Bayesian methods have gained popularity over the past decades. They provide statisticians not only with point estimates of the unknown parameter but also with credible intervals for it, all informed by the data and by the prior knowledge researchers hold. Originally, Bayesian parameter estimation was done through an adapted version of Bayes' theorem focused on unknown parameters (represented as θ) and known data points (represented as x). We can define P(θ|x), the posterior distribution of a parameter's value given the data, as:

\[
P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}
\]

In this formula, P(x|θ) is the likelihood of the data given a parameter value, P(θ) is the prior distribution over the parameter, and P(x) is the evidence, which is computed by integrating the joint distribution over all possible parameter values:

\[
P(x) = \int_{\theta} P(x, \theta)\, d\theta
\]

In some cases, due to the complexity of the calculations required, deriving the posterior distribution analytically was impossible. However, with the advance of computation, running sampling algorithms (especially MCMC ones) to estimate posterior distributions has become easier, giving researchers a powerful tool for situations where analytical posteriors are not trivial to find. Yet with such power also comes a great deal of responsibility to ensure that results make sense. This is where sampler diagnostics come in, offering a set of useful tools to gauge 1) whether an MCMC algorithm is working well and, consequently, 2) whether the estimated distribution we see is an accurate representation of the actual posterior distribution. But how can we know?

How samplers work

Before diving into the technicalities of diagnostics, we will cover how the process of sampling a posterior (especially with an MCMC sampler) works. In simple terms, we can think of a posterior distribution as a geographical area we haven't visited but whose topography we need to know. How can we draw an accurate map of the region?

One of our favorite analogies comes from Ben Gilbert. Suppose that the unknown region is a house whose floor plan we wish to map. For some reason, we cannot directly visit the house, but we can send bees inside with GPS devices attached to them. If everything works as expected, the bees will fly around the house, and using their trajectories, we can estimate what the floor plan looks like. In this analogy, the floor plan is the posterior distribution, and the sampler is the group of bees flying around the house.

The explanation we’re writing this text is that, in some cases, the bees won’t fly as expected. In the event that they get stuck in a certain room for some reason (because someone dropped sugar on the ground, for instance), the info they return won’t be representative of all the house; quite than visiting all rooms, the bees only visited a couple of, and our picture of what the home looks like will ultimately be incomplete. Similarly, when a sampler doesn’t work accurately, our estimation of the posterior distribution can also be incomplete, and any inference we draw based on it’s more likely to be unsuitable.

Markov Chain Monte Carlo (MCMC)

In technical terms, we call an MCMC process any algorithm that moves from one state to another with certain properties. Markov Chain refers to the fact that the next state only depends on the current one (or that the bee's next location is only influenced by its current place, and not by all the places it has been before). Monte Carlo means that the next state is chosen randomly. MCMC methods like Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo (HMC), and the No-U-Turn Sampler (NUTS) all operate by constructing Markov Chains (sequences of steps) that are nearly random and gradually explore the posterior distribution.

Now that you understand how a sampler works, let's dive into a practical scenario to help us explore sampling problems.

Case Study

Imagine that, in a faraway nation, a governor wants to know more about public annual spending on healthcare by mayors of cities with fewer than one million inhabitants. Rather than looking at sheer frequencies, he wants to understand the underlying distribution explaining expenditure, and a sample of spending data is about to arrive. The problem is that two of the economists involved in the project disagree about how the model should look.

Model 1

The first economist believes that all cities spend similarly, with some variation around a certain mean. As such, he creates a simple model. Although the specifics of how the economist chose his priors are irrelevant to us, we do need to keep in mind that he is trying to approximate a Normal (unimodal) distribution.

\[
\begin{align*}
x_i &\sim \text{Normal}(\mu, \sigma^2) \quad \text{i.i.d. for all } i \\
\mu &\sim \text{Normal}(10, 2) \\
\sigma^2 &\sim \text{Uniform}(0, 5)
\end{align*}
\]
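To make the setup concrete, here is a minimal PyMC sketch of this model. The data array `spending` and the seed are our own placeholders, and we read Normal(10, 2) as mean 10 and standard deviation 2; adapt as needed.

```python
import numpy as np
import pymc as pm

spending = np.loadtxt("spending.csv")  # hypothetical data file

with pm.Model() as model_1:
    # Priors: mu ~ Normal(10, 2), sigma^2 ~ Uniform(0, 5)
    m = pm.Normal("m", mu=10, sigma=2)
    s = pm.Uniform("s", lower=0, upper=5)  # plays the role of sigma^2

    # Likelihood: x_i ~ Normal(mu, sigma^2), i.i.d.
    pm.Normal("x", mu=m, sigma=pm.math.sqrt(s), observed=spending)

    idata_1 = pm.sample(draws=2000, chains=4, random_seed=42)
```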

Model 2

The second economist disagrees, arguing that spending is more complex than his colleague believes. He believes that, given ideological differences and budget constraints, there are two kinds of cities: those that do their best to spend very little and those that are not afraid of spending a lot. As such, he creates a slightly more complex model, using a mixture of Normals to reflect his belief that the true distribution is bimodal.

\[
\begin{align*}
x_i &\sim \text{Normal-Mixture}([\omega, 1-\omega],\, [m_1, m_2],\, [s_1^2, s_2^2]) \quad \text{i.i.d. for all } i \\
m_j &\sim \text{Normal}(2.3, 0.5^2) \quad \text{for } j = 1, 2 \\
s_j^2 &\sim \text{Inverse-Gamma}(1, 1) \quad \text{for } j = 1, 2 \\
\omega &\sim \text{Beta}(1, 1)
\end{align*}
\]
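Under the same assumptions (data in `spending`), a PyMC sketch of the mixture model could look as follows. The variable names mirror those that appear in the figures later on (m, s_squared, w):

```python
import pymc as pm

with pm.Model() as model_2:
    # Priors on the mixture weight, component means, and component variances
    w = pm.Beta("w", alpha=1, beta=1)
    m = pm.Normal("m", mu=2.3, sigma=0.5, shape=2)
    s_squared = pm.InverseGamma("s_squared", alpha=1, beta=1, shape=2)

    # Likelihood: a two-component Normal mixture
    pm.NormalMixture(
        "x",
        w=pm.math.stack([w, 1 - w]),
        mu=m,
        sigma=pm.math.sqrt(s_squared),
        observed=spending,
    )

    idata_2 = pm.sample(draws=2000, chains=4, random_seed=42)
```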

After the data arrives, each economist runs an MCMC algorithm to estimate their desired posteriors, which will be a reflection of reality (1) if their assumptions are true and (2) if the sampler worked correctly. The first condition, a discussion about assumptions, will be left to the economists. However, how can they know whether the second holds? In other words, how can they make sure that the sampler worked correctly and, as a consequence, their posterior estimations are unbiased?

Sampler Diagnostics

To evaluate a sampler's performance, we can explore a small set of metrics that reflect different parts of the estimation process.

Quantitative Metrics

R-hat (Potential Scale Reduction Factor)

In simple terms, R-hat evaluates whether bees that started at different places have all explored the same rooms at the end of the day. To estimate the posterior, an MCMC algorithm uses multiple chains (or bees) that start at random locations. R-hat is the metric we use to assess the convergence of the chains. It measures whether multiple MCMC chains have mixed well (i.e., if they have sampled the same topography) by comparing the variance of samples within each chain to the variance of the sample means across chains. Intuitively, it is calculated as:

\[
\hat{R} = \sqrt{\frac{\text{Variance Between Chains}}{\text{Variance Within Chains}}}
\]

If R-hat is close to 1.0 (or below 1.01), it means that the variance within each chain is very similar to the variance between chains, suggesting that they have converged to the same distribution. In other words, the chains are behaving similarly and are indistinguishable from one another. That is precisely what we see after sampling the posterior of the first model, shown in the last column of the table below:

Figure 1. Summary statistics of the sampler highlighting ideal R-hats.
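If you are following along with the sketches above, a summary table like this one can be produced with ArviZ from the object returned by `pm.sample`:

```python
import arviz as az

# Summary table: means, credible intervals, ESS, and R-hat (last column)
print(az.summary(idata_1))

# Or inspect the R-hat values on their own
print(az.rhat(idata_1))
```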

The R-hat from the second model, however, tells a different story. The fact that we have such large R-hat values indicates that, at the end of the sampling process, the different chains had not converged yet. In practice, this means that the chains explored different regions of the posterior, or that each bee created a map of a different room of the house. This fundamentally leaves us without a clue of how the pieces connect or what the complete floor plan looks like.

Figure 2. Summary statistics of the sampler showcasing problematic R-hats.

Given that our R-hat readouts were large, we know something went wrong with the sampling process in the second model. However, even if the R-hat had turned out within acceptable levels, this would not give us certainty that the sampling process worked: R-hat is a necessary, but not sufficient, condition for convergence. Sometimes, even if your R-hat readout is lower than 1.01, the sampler may not have properly explored the full posterior. This happens when multiple bees start their exploration in the same room and remain there. Likewise, if you are using a small number of chains, and if your posterior happens to be multimodal, there is a chance that all chains started in the same mode and failed to explore other peaks.

The R-hat readout reflects convergence, not completion. To get a more comprehensive picture, we need to check other diagnostic metrics as well.

Effective Sample Size (ESS)

When explaining what MCMC was, we mentioned that "Monte Carlo" refers to the fact that the next state is chosen randomly. This doesn't necessarily mean that the states are fully independent. Although the bees choose their next step at random, these steps are still correlated to some extent. If a bee is exploring a living room at time t=0, it will probably still be in the living room at time t=1, even though it is in a different part of the same room. Due to this natural connection between samples, we say these two data points are autocorrelated.

Due to their nature, MCMC methods inherently produce autocorrelated samples, which complicates statistical analysis and requires careful evaluation. In statistical inference, we often assume independent samples to ensure that the estimates of uncertainty are accurate, hence the need for uncorrelated samples. If two data points are too similar to each other, the correlation reduces their information content. Mathematically, the formula below represents the autocorrelation function between two time points (t₁ and t₂) in a random process:

\[
R_{XX}(t_1, t_2) = E\left[ X_{t_1} \overline{X_{t_2}} \right]
\]

where E is the expected value operator and X-bar is the complex conjugate. In MCMC sampling, this is crucial because high autocorrelation means that new samples don't teach us anything different from the old ones, effectively reducing the sample size we have. Unsurprisingly, the metric that reflects this is called Effective Sample Size (ESS), and it helps us determine how many truly independent samples we have.
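To make the idea concrete, here is a minimal NumPy sketch of the empirical (real-valued) autocorrelation of a single chain at increasing lags; for a fitted model, ArviZ draws the same quantity per chain with `az.plot_autocorr`:

```python
import numpy as np

def autocorrelation(chain, max_lag=50):
    """Empirical autocorrelation rho_k of a 1-D chain for lags 0..max_lag."""
    chain = np.asarray(chain, dtype=float)
    centered = chain - chain.mean()
    n, var = len(centered), centered.var()
    return np.array(
        [np.mean(centered[: n - k] * centered[k:]) / var for k in range(max_lag + 1)]
    )

# A highly autocorrelated "chain" (a random walk) decays slowly across lags
rng = np.random.default_rng(0)
print(autocorrelation(np.cumsum(rng.normal(size=1000)), max_lag=5))
```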

As hinted previously, the effective sample size accounts for autocorrelation by estimating how many truly independent samples would provide the same information as the autocorrelated samples we have. Mathematically, for a parameter θ, the ESS is defined as:

\[
\text{ESS} = \frac{n}{1 + 2 \sum_{k=1}^{\infty} \rho(\theta)_k}
\]

where n is the total number of samples and ρ(θ)_k is the autocorrelation at lag k for parameter θ.

Typically, for ESS readouts, the higher, the better. This is what we see in the readout for the first model. Two common ESS variations are Bulk-ESS, which assesses mixing in the central part of the distribution, and Tail-ESS, which focuses on the efficiency of sampling the distribution's tails. Together, they inform us whether our model accurately reflects the central tendency and credible intervals.

Figure 3. Summary statistics of the sampler highlighting ideal quantities for ESS bulk and tail.

In contrast, the readouts for the second model are very bad. Typically, we want to see readouts that are at least 1/10 of the total sample size. In this case, given each chain sampled 2,000 observations, we should expect ESS readouts of at least 800 (a tenth of the total of 8,000 samples across 4 chains of 2,000 draws each), which is not what we observe.

Figure 4. Summary statistics of the sampler demonstrating problematic ESS bulk and tail.
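Both ESS variants can be read off `az.summary` or computed directly; a sketch using the `idata_2` object from the mixture-model code above:

```python
import arviz as az

# Bulk-ESS: mixing in the central part of the distribution
print(az.ess(idata_2, method="bulk"))

# Tail-ESS: efficiency when sampling the distribution's tails
print(az.ess(idata_2, method="tail"))
```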

Visual Diagnostics

Apart from the numerical metrics, our understanding of sampler performance can be deepened through the use of diagnostic plots. The main ones are rank plots, trace plots, and pair plots.

Rank Plots

A rank plot helps us identify whether the different chains have explored the entire posterior distribution. If we once again think of the bee analogy, rank plots tell us which bees explored which parts of the house. Therefore, to evaluate whether the posterior was explored equally by all chains, we observe the shape of the rank plots produced by the sampler. Ideally, we want the distribution of all chains to look uniform, like in the rank plots generated after sampling the first model. Each color below represents a chain (or bee):

Figure 5. Rank plots for parameters ‘m’ and ‘s’ across 4 MCMC chains. Each bar represents the distribution of rank values for one chain, with ideally uniform ranks indicating good mixing and proper convergence.

Under the hood, a rank plot is produced with a simple sequence of steps. First, we run the sampler and let it sample from the posterior of each parameter. In our case, we are sampling the posteriors of parameters m and s of the first model. Then, parameter by parameter, we take all the samples from all chains, pool them together, and order them from smallest to largest. We then ask, for each sample, which chain it came from. This is what allows us to create plots like the ones we see above.
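A sketch of those steps with NumPy and SciPy, together with the ArviZ function that draws the finished plots (parameter names follow the first model's sketch):

```python
import numpy as np
from scipy.stats import rankdata

def ranks_by_chain(samples):
    """samples: array of shape (n_chains, n_draws) for a single parameter.
    Pools all draws, ranks them from smallest to largest, and returns each
    draw's rank attributed back to the chain it came from. Histogramming
    each row should give roughly uniform bars if the chains mixed well."""
    return rankdata(samples.ravel()).reshape(samples.shape)

# ArviZ implements exactly this visualization:
# import arviz as az
# az.plot_rank(idata_1, var_names=["m", "s"])
```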

In contrast, bad rank plots are easy to spot. Unlike the previous example, the distributions from the second model, shown below, are not uniform. From the plots, what we interpret is that each chain, after starting at a different random location, got stuck in a region and didn't explore the entirety of the posterior. Consequently, we cannot make inferences from the results, as they are unreliable and not representative of the true posterior distribution. This would be equivalent to having four bees that started in different rooms of the house and got stuck somewhere during their exploration, never covering the entirety of the property.

Figure 6. Rank plots for parameters m, s_squared, and w across 4 MCMC chains. Each subplot shows the distribution of ranks by chain. There are noticeable deviations from uniformity (e.g., stair-step patterns or imbalances across chains) suggesting potential sampling issues.

KDE and Trace Plots

Similar to R-hat, trace plots help us assess the convergence of MCMC samples by visualizing how the algorithm explores the parameter space over time. PyMC provides two kinds of trace plots to diagnose mixing issues: Kernel Density Estimate (KDE) plots and iteration-based trace plots. Each of these serves a distinct purpose in evaluating whether the sampler has properly explored the target distribution.

The KDE plot (usually on the left) estimates the posterior density for each chain, where each line represents the estimated density of one chain's samples. This allows us to check whether all chains have converged to the same distribution. If the KDEs overlap, it suggests that the chains are sampling from the same posterior and that mixing has occurred. In turn, the trace plot (usually on the right) visualizes how parameter values change over MCMC iterations (steps), with each line representing a different chain. A well-mixed sampler will produce trace plots that look noisy and random, with no clear structure or separation between chains.

Using the bee analogy, trace plots can be thought of as snapshots of the "features" of the house at different locations. If the sampler is working correctly, the KDEs in the left plot should align closely, showing that all bees (chains) have explored the house similarly. Meanwhile, the right plot should show highly variable traces that blend together, confirming that the chains are actively moving through the space rather than getting stuck in specific regions.
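In ArviZ, a single call produces both panels (KDEs on the left, traces on the right) for every parameter, as in the figure below:

```python
import arviz as az

az.plot_trace(idata_1)  # one KDE + trace pair per parameter, colored by chain
```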

Figure 7. Density and trace plots for parameters m and s from the first model across 4 MCMC chains. The left panel shows kernel density estimates (KDE) of the marginal posterior distribution for each chain, indicating consistent central tendency and spread. The right panel displays the trace plot over iterations, with overlapping chains and no apparent divergences, suggesting good mixing and convergence.

However, if your sampler has poor mixing or convergence issues, you will see something like the figure below. In this case, the KDEs will not overlap, meaning that different chains have sampled from different distributions rather than a shared posterior. The trace plot will also show structured patterns instead of random noise, indicating that chains are stuck in different regions of the parameter space and failing to fully explore it.

Figure 8. KDE (left) and trace plots (right) for parameters m, s_squared, and w across MCMC chains for the second model. Multimodal distributions are visible for m and w, suggesting potential identifiability issues. Trace plots reveal that chains explore different modes with limited mixing, particularly for m, highlighting challenges in convergence and effective sampling.

By using trace plots alongside the other diagnostics, you can identify sampling issues and determine whether your MCMC algorithm is effectively exploring the posterior distribution.

Pair Plots

A third kind of plot that is often useful for diagnostics is the pair plot. In models where we want to estimate the posterior distribution of multiple parameters, pair plots allow us to observe how pairs of parameters relate and whether they are correlated. To understand how such plots are formed, think again about the bee analogy. If you imagine that we will create a plot with the width and length of the house, each "step" that the bees take can be represented by an (x, y) combination. Likewise, each parameter of the posterior is represented as a dimension, and we create scatter plots showing where the sampler walked using parameter values as coordinates. Here, we are plotting each unique pair (x, y), resulting in the scatter plot you see in the middle of the image below. The one-dimensional plots you see on the edges are the marginal distributions over each parameter, giving us additional information on the sampler's behavior when exploring them.
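ArviZ builds these with `plot_pair`; a sketch for the first model's two parameters, with the marginal distributions drawn on the edges:

```python
import arviz as az

az.plot_pair(idata_1, var_names=["m", "s"], marginals=True)
```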

Take a look at the pair plot from the first model.

Figure 9. Joint posterior distribution of parameters m and s, with marginal densities. The scatter plot shows a roughly symmetric, elliptical shape, suggesting a low correlation between m and s.

Each axis represents one of the two parameters whose posteriors we are estimating. For now, let's focus on the scatter plot in the middle, which shows the parameter combinations sampled from the posterior. The fact that we have a very even distribution means that, for any particular value of m, there was a range of values of s that were equally likely to be sampled. Additionally, we don't see any correlation between the two parameters, which is usually good! There are cases where we would expect some correlation, such as when our model involves a regression line. However, in this instance, we have no reason to believe the two parameters should be highly correlated, so the fact that we don't observe unusual behavior is positive news.

Now, take a look at the pair plots from the second model.

Figure 10. Pair plot of the joint posterior distributions for parameters m, s_squared, and w. The scatter plots reveal strong correlations between several parameters.

Given that this model has five parameters to estimate, we naturally have a greater number of plots, since we analyze them pair-wise. However, they look odd compared to the previous example. Namely, rather than having an even distribution of points, the samples here either seem to be divided across two regions or seem somewhat correlated. This is another way of visualizing what the rank plots have shown: the sampler did not explore the full posterior distribution. Below, we isolate the top-left plot, which contains the samples for m₀ and m₁. Unlike the plot from model 1, here we see that the value of one parameter greatly influences the value of the other. If we sampled m₁ around 2.5, for example, m₀ is likely to be sampled from a very narrow range around 1.5.

Figure 11. Joint posterior distribution of parameters m₀ and m₁, with marginal densities.

Certain shapes can be observed in problematic pair plots relatively frequently. Diagonal patterns, for example, indicate a high correlation between parameters. Banana shapes are often connected to parametrization issues, frequently found in models with tight priors or constrained parameters. Funnel shapes might indicate hierarchical models with bad geometry. When we have two separate islands, like in the plot above, this can indicate that the posterior is bimodal AND that the chains haven't mixed well. However, keep in mind that these shapes can indicate problems but don't necessarily do so. It's up to the data scientist to examine the model and determine which behaviors are expected and which ones are not!

Some Fixing Techniques

When your diagnostics indicate sampling problems (high R-hat values, low ESS, unusual rank plots, separated trace plots, or strange parameter correlations in pair plots), several strategies can help you address the underlying issues. Sampling problems typically stem from the target posterior being too complex for the sampler to explore efficiently. Complex target distributions might have:

  • Multiple modes (peaks) that the sampler struggles to move between
  • Irregular shapes with narrow "corridors" connecting different regions
  • Areas of drastically different scales (like the "neck" of a funnel)
  • Heavy tails that are difficult to sample accurately

In the bee analogy, these complexities represent houses with unusual floor plans: disconnected rooms, extremely narrow hallways, or areas that change dramatically in size. Just as bees might get trapped in specific regions of such houses, MCMC chains can get stuck in certain areas of the posterior.

Figure 12. Examples of multimodal target distributions.
Figure 13. Examples of weirdly shaped distributions.

To help the sampler in its exploration, there are simple strategies we can use.

Strategy 1: Reparameterization

Reparameterization is particularly effective for hierarchical models and distributions with challenging geometries. It involves transforming your model's parameters to make them easier to sample. Back to the bee analogy, imagine the bees are exploring a house with a peculiar layout: a spacious living room that connects to the kitchen through a very, very narrow hallway. One aspect we hadn't mentioned before is that the bees must fly the same way through the entire house. That means that if we dictate the bees should use large "steps," they will explore the living room thoroughly but hit the walls of the hallway head-on. Likewise, if their steps are small, they will explore the narrow hallway well but take forever to cover the entire living room. The difference in scales, which is natural to the house, makes the bees' job harder.

A classic example that represents this scenario is Neal's funnel, where the scale of one parameter depends on another:

\[
p(y, x) = \text{Normal}(y \mid 0, 3) \times \prod_{n=1}^{9} \text{Normal}(x_n \mid 0, e^{y/2})
\]

Figure 14. Log marginal density of y and the first dimension of Neal's funnel. The neck is where the sampler struggles to sample from, and the required step size is much smaller there than in the body. (Image source: Stan User's Guide)

We can see that the scale of x depends on the value of y. To fix this problem, we can sample x and y as independent standard Normals and then transform those variables into the desired funnel distribution. Instead of sampling directly like this:

\[
\begin{align*}
y &\sim \text{Normal}(0, 3) \\
x &\sim \text{Normal}(0, e^{y/2})
\end{align*}
\]

You can reparameterize to sample from standard Normals first:

\[
\begin{align*}
y_{\text{raw}} &\sim \text{Normal}(0, 1) \\
x_{\text{raw}} &\sim \text{Normal}(0, 1) \\[4pt]
y &= 3\, y_{\text{raw}} \\
x &= e^{y/2}\, x_{\text{raw}}
\end{align*}
\]

This technique separates the hierarchical parameters and makes sampling more efficient by eliminating the dependency between them.
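In PyMC, the non-centered version takes only a few lines; `pm.Deterministic` simply records the transformed quantities in the trace. A sketch of the funnel above:

```python
import pymc as pm

with pm.Model() as funnel_noncentered:
    # Sample well-behaved standard Normals...
    y_raw = pm.Normal("y_raw", mu=0, sigma=1)
    x_raw = pm.Normal("x_raw", mu=0, sigma=1, shape=9)

    # ...and deterministically map them onto the funnel geometry
    y = pm.Deterministic("y", 3 * y_raw)
    x = pm.Deterministic("x", pm.math.exp(y / 2) * x_raw)

    idata_funnel = pm.sample(draws=2000, chains=4)
```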

Reparameterization is like redesigning the house so that, instead of forcing the bees through a single narrow hallway, we create a new layout where all passages have similar widths. This helps the bees use a consistent flying pattern throughout their exploration.

Strategy 2: Handling Heavy-tailed Distributions

Heavy-tailed distributions like the Cauchy and Student-t present challenges for samplers because no single step size works well everywhere. Their tails require larger step sizes than their central regions (similar to very long hallways that require the bees to travel long distances), which creates a challenge:

  • Small step sizes lead to inefficient sampling in the tails
  • Large step sizes cause too many rejections in the center

Figure 15. Probability density functions for different Cauchy distributions, illustrating the effects of changing the location and scale parameters. (Image source: Wikipedia)

Reparameterization solutions include:

  • For the Cauchy: defining the variable as a transformation of a Uniform distribution using the Cauchy inverse CDF (see the sketch after this list)
  • For the Student-t: using a Gamma-mixture representation
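As an illustration of the first idea, the Cauchy quantile function (inverse CDF) turns Uniform(0, 1) draws into Cauchy draws, so the sampler only ever has to explore a bounded, well-behaved space. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
loc, scale = 0.0, 1.0

u = rng.uniform(size=10_000)                      # u ~ Uniform(0, 1)
cauchy = loc + scale * np.tan(np.pi * (u - 0.5))  # Cauchy inverse CDF
```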

Strategy 3: Hyperparameter Tuning

Sometimes the solution lies in adjusting the sampler's hyperparameters, as in the sketch after this list:

  • Increase total iterations: The simplest approach; give the sampler more time to explore.
  • Increase the target acceptance rate (adapt_delta): Reduces divergent transitions (try 0.9 instead of the default 0.8 for complex models, for example).
  • Increase max_treedepth: Allows the sampler to take more steps per iteration.
  • Extend the warmup/adaptation phase: Gives the sampler more time to adapt to the posterior geometry.
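In PyMC, these knobs live on `pm.sample` (where `target_accept` plays the role of Stan's adapt_delta). A sketch reusing `model_2` from earlier, with values that are illustrative rather than prescriptive:

```python
with model_2:
    idata_2 = pm.sample(
        draws=4000,         # more iterations to explore
        tune=2000,          # longer warmup/adaptation phase
        chains=4,
        target_accept=0.9,  # PyMC's analogue of Stan's adapt_delta
        random_seed=42,
    )

# max_treedepth can be raised through an explicit NUTS step, e.g.:
# step = pm.NUTS(target_accept=0.9, max_treedepth=12)
```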

Remember that while these adjustments may improve your diagnostic metrics, they often treat symptoms rather than underlying causes. The other strategies (reparameterization and better proposal distributions) typically offer more fundamental solutions.

Strategy 4: Better Proposal Distributions

This strategy applies to samplers that rely on an explicit proposal step, such as Metropolis-Hastings, rather than to the model itself. It mainly asks the question: "I'm currently at this point in the landscape. Where should I jump next so that I explore the full landscape, and how do I know that the next jump is a good one?" Thus, choosing a good proposal distribution means ensuring that the sampling process explores the full parameter space instead of just a specific region. A good proposal distribution should:

  1. Have substantial probability mass where the target distribution does.
  2. Allow the sampler to make jumps of the appropriate size.

One common choice of proposal distribution is the Gaussian (Normal) distribution with mean μ and standard deviation σ, the scale of the distribution, which we can tune to decide how far to jump from the current position to the next. If we choose a scale that is too small, the sampler will either take too long to explore the entire posterior or get stuck in a region and never explore the full distribution. But if the scale is too large, the sampler might never explore some regions, jumping over them. It's like playing ping-pong where we only ever reach the two edges but never the middle.
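A toy random-walk Metropolis sampler makes the role of the proposal scale visible. This is a sketch for a one-dimensional target given by its log-density, not a production implementation:

```python
import numpy as np

def random_walk_metropolis(log_target, n_steps=5000, scale=0.5, x0=0.0, seed=0):
    """Random-walk Metropolis with a Gaussian proposal of std. dev. `scale`."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_steps)
    for i in range(n_steps):
        proposal = x + rng.normal(0.0, scale)  # jump drawn from the proposal
        # Accept with probability min(1, target(proposal) / target(x))
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal
        samples[i] = x
    return samples

# Try scale=0.01 (tiny steps, sticky chain) vs. scale=50 (mostly rejections)
draws = random_walk_metropolis(lambda x: -0.5 * x**2, scale=0.5)
```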

Improve Prior Specification

When all else fails, reconsider your model's prior specification. Vague or weakly informative priors (like uniformly distributed priors) can sometimes lead to sampling difficulties. More informative priors, when justified by domain knowledge, can help guide the sampler toward more reasonable regions of the parameter space. Sometimes, despite your best efforts, a model may remain difficult to sample effectively. In such cases, consider whether a simpler model might achieve similar inferential goals while being more computationally tractable. The best model is often not the most complex one, but the one that balances complexity with reliability. The table below summarizes fixing strategies for different issues.

| Diagnostic Signal | Potential Issue | Recommended Fix |
| --- | --- | --- |
| High R-hat | Poor mixing between chains | Increase iterations, adjust the step size |
| Low ESS | High autocorrelation | Reparameterization, increase adapt_delta |
| Non-uniform rank plots | Chains stuck in different regions | Better proposal distribution, start with multiple chains |
| Separated KDEs in trace plots | Chains exploring different distributions | Reparameterization |
| Funnel shapes in pair plots | Hierarchical model issues | Non-centered reparameterization |
| Disjoint clusters in pair plots | Multimodality with poor mixing | Adjusted proposal distribution, simulated annealing |

Conclusion

Assessing the quality of MCMC sampling is crucial for ensuring reliable inference. In this article, we explored key diagnostic metrics such as R-hat, ESS, rank plots, trace plots, and pair plots, discussing how each helps determine whether the sampler is performing properly.

If there is one takeaway we want you to remember, it is that you should always run diagnostics before drawing conclusions from your samples. No single metric provides a definitive answer; each serves as a tool that highlights potential issues rather than proving convergence. When problems arise, strategies such as reparameterization, hyperparameter tuning, and better prior specification can help improve sampling efficiency.

By combining these diagnostics with thoughtful modeling decisions, you can ensure a more robust analysis, reducing the risk of misleading inferences due to poor sampling behavior.

References

B. Gilbert, Bob's bees: the importance of using multiple bees (chains) to evaluate MCMC convergence (2018), YouTube

Chi-Feng, MCMC demo (n.d.), GitHub

D. Simpson, Maybe it's time to let the old ways die; or We broke R-hat so now we have to fix it (2019), Statistical Modeling, Causal Inference, and Social Science

M. Taboga, Markov Chain Monte Carlo (MCMC) methods (2021), Lectures on Probability Theory and Mathematical Statistics, Kindle Direct Publishing

T. Wiecki, MCMC Sampling for Dummies (2024), twiecki.io

Stan User's Guide, Reparameterization (n.d.), Stan Documentation
