Cracking the Density Code: Why MAF Flows Where KDE Stalls


One of the fundamental problems in high-dimensional density estimation is that as the dimension increases, the data becomes sparser. As a result, models that rely on local neighborhood estimation need exponentially more data with each added dimension to keep producing meaningful results. This is known as the curse of dimensionality.

In my previous article on density estimation, I demonstrated how the kernel density estimator (KDE) can be used effectively for one-dimensional data. However, its performance deteriorates significantly in higher dimensions. To illustrate this, I ran a simulation to determine how many samples are required for KDE to achieve a mean relative error of 0.2 when estimating the density of a multivariate Gaussian distribution across various dimensions. Bandwidth was chosen using Scott's rule. The results are as follows:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
np.random.seed(42)

# Gaussian sample generator
def generate_gaussian_samples(n_samples, dim, mean=0, std=1):
    return np.random.normal(mean, std, size=(n_samples, dim))

def compute_bandwidth(samples):
    # Scott's rule: bandwidth = n^(-1/(d+4))
    n, d = samples.shape
    return np.power(n, -1./(d + 4))

# KDE error computation
def compute_kde_error(samples, dim, n_test=1000):
    bandwidth = compute_bandwidth(samples)
    kde = KernelDensity(bandwidth=bandwidth).fit(samples)
    test_points = np.random.normal(0, 1, size=(n_test, dim))
    kde_density = np.exp(kde.score_samples(test_points))
    true_density = np.exp(-np.sum(test_points**2, axis=1) / 2) / ((2 * np.pi)**(dim / 2))
    error = np.mean(np.abs(kde_density - true_density) / true_density)
    return error, bandwidth

# Determine required samples for a target error
def find_required_samples(dim, target_error=0.2, max_samples=500000, start_samples=10, n_experiments=5):
    samples = start_samples
    while samples <= max_samples:
        errors = [compute_kde_error(generate_gaussian_samples(samples, dim), dim)[0] for _ in range(n_experiments)]
        avg_error = np.mean(errors)
        if avg_error <= target_error:
            return samples, avg_error
        samples = int(samples * 1.5)
    return max_samples, avg_error

# Main analysis
def analyze_kde(dims, target_error):
    results = []
    for dim in dims:
        samples, error = find_required_samples(dim, target_error)
        results.append((dim, samples))
        print(f"Dim {dim}: {samples} samples")
    return results

# Visualization
def plot_results(dims, results, target_error=0.2):
    samples = [x[1] for x in results]
    plt.figure(figsize=(8, 6))
    plt.plot(dims, samples, 'o-', color='blue')
    plt.yscale('log')
    plt.xlabel('Dimension')
    plt.ylabel('Required Number of Samples (log scale)')
    plt.title(f'Samples Needed for a Mean Relative Error of {target_error}')
    plt.grid(True)
    
    for i, sample in enumerate(samples):
        plt.text(dims[i], sample * 1.15, f'{sample}', fontsize=10, ha='right', color='black')  
    plt.show()

# Run the analysis
dims = range(1, 7)
target_error = 0.2
results = analyze_kde(dims, target_error)
plot_results(dims, results)

That's right: in my simulation, to match the accuracy of just 22 data points in one dimension, you would need more than 360,000 data points in six dimensions! Even more astonishingly, in his book Multivariate Density Estimation, David W. Scott shows that, depending on the metric, over a million data points are required in eight dimensions to achieve the same accuracy as just 50 data points in one dimension.

Hopefully, this is enough to convince you that the kernel density estimator is not ideal for estimating densities in higher dimensions. But what is the alternative?


Part 2: Introduction to Normalizing Flows

One promising alternative is Normalizing Flows, and the specific model I will focus on is the Masked Autoregressive Flow (MAF).

This section draws partially on the work of George Papamakarios and Balaji Lakshminarayanan, as presented in Chapter 23 of Probabilistic Machine Learning: Advanced Topics by Kevin P. Murphy (see the book for further details). 

The core idea behind normalizing flows is that a distribution p(x) can be modeled by starting with random variables sampled from a simple base distribution (such as a Gaussian) and then passing them through a sequence of differentiable, invertible transformations (diffeomorphisms). Each transformation incrementally reshapes the distribution, progressively mapping the base distribution into the target distribution. A visual illustration of this process is shown below.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
np.random.seed(42)

# Sample from a standard normal distribution
n_points = 1000
initial_dist = np.random.normal(loc=[0, 0], scale=1.0, size=(n_points, 2))

# Generate target distribution
theta = np.linspace(0, np.pi, n_points//2)
r = 2
x1 = r * np.cos(theta)
y1 = r * np.sin(theta)
x2 = (r-0.5) * np.cos(theta)
y2 = (r-0.5) * np.sin(theta) - 1
target_dist = np.vstack([
    np.column_stack([x1, y1 + 0.5]),
    np.column_stack([x2, y2 + 0.5])
])
target_dist += np.random.normal(0, 0.1, target_dist.shape)

def f1(x, t):
    """Split transformation"""
    shift = 2 * t * np.sign(x[:, 1])[:, np.newaxis] * np.array([1, 0])
    return x + shift

def f2(x, t):
    """Curve transformation"""
    theta = t * np.pi / 2
    r = np.sqrt(x[:, 0]**2 + x[:, 1]**2)
    phi = np.arctan2(x[:, 1], x[:, 0]) + theta * (1 - r/4)
    return np.column_stack([r * np.cos(phi), r * np.sin(phi)])

def f3(x, t):
    """Advantageous-tune to focus on"""
    return (1 - t) * x + t * target_dist

# Create figure
fig, ax = plt.subplots(figsize=(10, 10))
scatter = ax.scatter([], [], alpha=0.6, s=10)
ax.set_xlim(-4, 4)
ax.set_ylim(-4, 4)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)

def sigmoid(x):
    """Smooth transition function"""
    return 1 / (1 + np.exp(-(x - 0.5) * 10))

def get_title(t):
    if t < 0.33:
        return 'Applying Split Transformation (f₁)'
    elif t < 0.66:
        return 'Applying Curve Transformation (f₂)'
    else:
        return 'Fine-tuning to Target Distribution (f₃)'

def init():
    scatter.set_offsets(initial_dist)
    ax.set_title('Initial Gaussian Distribution', pad=20, fontsize=18)
    return [scatter]

def update(frame):
    #Normalize frame to [0, 1]
    t = frame / 100
    
    #Apply transformations sequentially
    points = initial_dist
    
    #f1: Split the distribution
    t1 = sigmoid(t * 3) if t < 0.33 else 1
    points = f1(points, t1)
    
    #f2: Create curves
    t2 = sigmoid((t - 0.33) * 3) if 0.33 <= t < 0.66 else (0 if t < 0.33 else 1)
    points = f2(points, t2)
    
    #f3: Fine-tune to target
    t3 = sigmoid((t - 0.66) * 3) if t >= 0.66 else 0
    points = f3(points, t3)
    
    #Update scatter plot
    scatter.set_offsets(points)
    colours = points[:, 0] + points[:, 1]
    scatter.set_array(colours)
    
    #Update title
    ax.set_title(get_title(t), pad=20, fontsize=18)
    
    return [scatter]

#Create animation
anim = FuncAnimation(fig, update, frames=100, init_func=init,
                    interval=50, blit=True)
plt.tight_layout()
plt.show()

#Save animation as a gif
anim.save('normalizing_flow_single.gif', writer='pillow')

More formally, assume the following: we have a random variable $u$ drawn from a simple base density $p_u(u)$ and a diffeomorphism $f$ such that $x = f(u)$. Then our target distribution is defined by the following change of variables formula:

$$p_x(x) = p_u\big(f^{-1}(x)\big)\,\big|\det J_{f^{-1}}(x)\big|,$$

where $J_{f^{-1}}(x)$ is the Jacobian of $f^{-1}$ evaluated at $x$.
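To see the formula in action, here is a quick one-dimensional example (added for illustration, not part of the original derivation): take $u \sim \mathcal{N}(0, 1)$ and the affine map $x = f(u) = \sigma u + \mu$. Then $f^{-1}(x) = (x - \mu)/\sigma$ and $|\det J_{f^{-1}}(x)| = 1/\sigma$, so

$$p_x(x) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\cdot\frac{1}{\sigma},$$

which is exactly the density of $\mathcal{N}(\mu, \sigma^2)$, as expected.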

Since we'd like to compute the determinant, there's also a computational consideration; our transformation functions should ideally have Jacobians whose determinants are easy to calculate. Designing a diffeomorphic function that each models a posh distribution and yields a tractable determinant is a difficult task. The way in which that is addressed in practice is by constructing the goal distribution through a flow of simpler functions. Thus, f is defined as follows:

Then, because the composition of diffeomorphic functions can also be diffeomorphic, f can be invertible and differentiable.
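A convenient consequence, standard for normalizing flows, is that the log-density then splits into a sum over the individual transformations, so each Jacobian determinant only needs to be tractable on its own. Writing $z_K = x$ and $z_{k-1} = f_k^{-1}(z_k)$ (so that $z_0 = u$):

$$\log p_x(x) = \log p_u\big(f^{-1}(x)\big) + \sum_{k=1}^{K} \log\big|\det J_{f_k^{-1}}(z_k)\big|.$$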

There are a few typical candidates for f. Here are some popular choices.

Affine Flows

Affine flows are given by the following function:

$$f(u) = Au + b.$$

We need to restrict A to be an invertible square matrix so that f is invertible. Affine flows are not very good at modeling data on their own, but they are useful when combined with other functions.
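For reference (a standard observation about affine maps, not specific to this article), the Jacobian here is constant:

$$J_f(u) = A, \qquad \big|\det J_{f^{-1}}(x)\big| = \frac{1}{|\det A|},$$

so the log-determinant term is the same for every data point, and A is often parameterized (for example as a triangular or LU-factorized matrix) so that the determinant stays cheap to evaluate.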

Elementwise Flows

Elementwise flows transform the vector u element-wise. Letting h be a scalar bijection, we can create a vector-valued bijection f defined as follows:

$$f(u) = \big(h(u_1), h(u_2), \ldots, h(u_d)\big).$$

The determinant of the Jacobian is then given by:

$$\det J_f(u) = \prod_{i=1}^{d} h'(u_i).$$

Similar to affine flows, elementwise flows are not very effective at modeling data on their own, since they do not capture interactions between dimensions. However, they are often used together with other transformations to construct more complex flows.

Coupling Flows

Coupling flows, introduced by Dinh et al. (2015), differ from the flows discussed earlier in that they allow the use of non-linear functions to better capture the structure of the data. Split u into two subsets $(u_a, u_b)$; the coupling flow is then defined as:

$$x_a = \hat{f}\big(u_a;\, \Theta(u_b)\big), \qquad x_b = u_b.$$

Here, the parameters of $\hat{f}$ are computed by sending the subset $u_b$ through Θ, where Θ is a general function known as the conditioner. This setup contrasts with affine flows, which only mix dimensions linearly, and elementwise flows, which keep each dimension isolated. Coupling flows allow for a non-linear mixing of dimensions through the conditioner. If you are interested in the kinds of coupling layers that have been proposed, see Kobyzev, Prince & Brubaker (2020).

The determinant is quite easy to calculate because the partial derivative of $x_b$ with respect to $u_a$ is 0. Hence, the Jacobian is the following upper block triangular matrix:

$$J_f(u) = \begin{pmatrix} \dfrac{\partial \hat{f}}{\partial u_a} & \dfrac{\partial \hat{f}}{\partial u_b} \\[4pt] 0 & I \end{pmatrix}.$$

The determinant of the Jacobian is then given by:

$$\det J_f(u) = \det\!\left(\frac{\partial \hat{f}}{\partial u_a}\right).$$
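To make this concrete, here is a minimal NumPy sketch of an affine coupling layer in the spirit of Dinh et al.; the tiny fixed-weight conditioner is a placeholder of my own, not a trained model, and only serves to show the structure and the log-determinant computation.

import numpy as np

rng = np.random.default_rng(0)
d, split, hidden = 4, 2, 16

# Fixed (untrained) conditioner weights: map u_b to a log-scale and a shift for u_a
W1 = rng.standard_normal((d - split, hidden)) * 0.1
W2 = rng.standard_normal((hidden, 2 * split)) * 0.1

def conditioner(u_b):
    # Hypothetical two-layer conditioner producing (log_scale, shift)
    h = np.tanh(u_b @ W1)
    params = h @ W2
    return params[:, :split], params[:, split:]

def coupling_forward(u):
    u_a, u_b = u[:, :split], u[:, split:]
    log_scale, shift = conditioner(u_b)
    x_a = u_a * np.exp(log_scale) + shift   # non-linear mixing, driven by u_b
    x_b = u_b                               # passed through unchanged
    log_det = log_scale.sum(axis=1)         # log|det J_f| per sample
    return np.concatenate([x_a, x_b], axis=1), log_det

u = rng.standard_normal((5, d))
x, log_det = coupling_forward(u)
print(x.shape, log_det)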

The figure below shows visually how each of these transformations can affect a distribution.

Masked Autoregressive Flows

Assume that u is a vector containing d elements. An autoregressive bijection function, which outputs a vector x with d elements, is defined as follows:

$$x_i = h\big(u_i;\, \Theta_i(x_1, \ldots, x_{i-1})\big), \qquad i = 1, \ldots, d.$$

Here, h is a scalar bijection parameterized by $\Theta_i$, where $\Theta_i$ is an arbitrary non-linear function, typically a neural network. As a result of the autoregressive structure, each element $x_i$ depends only on the elements of u up to the i-th index. Consequently, the Jacobian matrix is triangular, and its determinant is the product of the diagonal entries, as follows:

$$\det J_f(u) = \prod_{i=1}^{d} \frac{\partial h}{\partial u_i}\big(u_i;\, \Theta_i\big).$$

If we were to implement each conditioner $\Theta_i$ as its own network, we would need to train d different neural networks, which can be quite computationally expensive. To address this, a more efficient approach in practice is to share parameters between the conditioners by combining them into a single model Θ that takes the shared input x and outputs the full set of parameters $(\Theta_1, \Theta_2, \ldots, \Theta_d)$. However, to preserve the autoregressive structure, we have to ensure that each $\Theta_i$ depends only on $x_1, x_2, \ldots, x_{i-1}$.

Masked Autoregressive Flows (MAF) use a multi-layer perceptron as the non-linear function and then apply masking to zero out any computational paths that would violate the autoregressive structure. By doing so, MAF ensures that each output $\Theta_i$ depends only on the previous inputs $x_1, x_2, \ldots, x_{i-1}$, while still computing all the parameters in a single forward pass, allowing for efficient training. A sketch of this masking idea is shown below.
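Here is a minimal NumPy sketch of MADE-style masks of the kind MAF builds on (my own illustration, not the code used in the experiments); the random degree assignments and the tiny one-hidden-layer network are arbitrary choices for demonstration.

import numpy as np

rng = np.random.default_rng(0)
d, hidden = 4, 8

in_degrees = np.arange(1, d + 1)                  # input x_j gets degree j
hidden_degrees = rng.integers(1, d, size=hidden)  # hidden degrees in 1 .. d-1
out_degrees = np.arange(1, d + 1)                 # output i parameterizes Θ_i

# Hidden unit k may see input j iff degree(k) >= degree(j);
# output i may see hidden unit k iff degree(i) > degree(k).
mask_in = (hidden_degrees[:, None] >= in_degrees[None, :]).astype(float)   # (hidden, d)
mask_out = (out_degrees[:, None] > hidden_degrees[None, :]).astype(float)  # (d, hidden)

# One masked forward pass of a single-hidden-layer MLP, one parameter per Θ_i
W1 = rng.standard_normal((hidden, d)) * 0.1
W2 = rng.standard_normal((d, hidden)) * 0.1
x = rng.standard_normal(d)

h = np.tanh((W1 * mask_in) @ x)
theta = (W2 * mask_out) @ h  # theta[i] depends only on x[:i]; theta[0] is constant (0 here, no bias)
print(theta)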


Part 3: Showdown

To determine whether KDE or MAF better models distributions in higher dimensions, I designed an experiment similar to my introductory analysis of KDE. I trained each model on progressively larger datasets until both achieved a KL divergence of 0.5.

For those unfamiliar with this metric, KL divergence quantifies how one probability distribution differs from a reference distribution. Specifically, it measures the expected excess 'surprise' from using one distribution to approximate another. A KL divergence of 0.0 indicates perfect alignment between distributions, while higher values signify greater discrepancy. To give visual intuition, the figure below illustrates what a KL divergence of 0.5 looks like when comparing two three-dimensional distributions:

Figure: two three-dimensional distributions with a KL divergence of 0.5.
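In case a numerical recipe is helpful, here is a small sketch of how such a KL divergence can be estimated by Monte Carlo, using $D_{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}[\log p(x) - \log q(x)]$; the two Gaussians below are illustrative stand-ins, not the distributions from the experiment.

import numpy as np
from scipy.stats import multivariate_normal

dim = 3
p = multivariate_normal(mean=np.zeros(dim), cov=np.eye(dim))
q = multivariate_normal(mean=0.5 * np.ones(dim), cov=1.5 * np.eye(dim))

# Monte Carlo estimate: average log-density ratio over samples drawn from p
samples = p.rvs(size=100_000, random_state=42)
kl_estimate = np.mean(p.logpdf(samples) - q.logpdf(samples))
print(f"Estimated KL(p || q): {kl_estimate:.3f}")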

The experimental design encompassed three distinct distribution families, each chosen to test different facets of the models' capabilities. First, I examined Conditional Gaussian Distributions, which represent the simplest case with unimodal, symmetric probability mass. Second, I tested Conditional Mixtures of Gaussians, introducing multimodality to challenge the models' ability to capture multiple distinct modes in the data. Finally, I included Conditional Skew Normal distributions to assess performance on asymmetric distributions.

For the Kernel Density Estimation model, choosing appropriate bandwidth parameters was difficult in the larger dimensions. I ended up employing Leave-One-Out Cross-Validation (LOOCV), a technique where each data point is held out in turn while the remaining points are used to evaluate a candidate bandwidth. This process, while computationally expensive, requiring n separate model fits for n data points, was essential for achieving reliable results in higher dimensions. In previous versions of this experiment, alternative bandwidth selection methods all demonstrated inferior performance, requiring substantially more training data to reach the same KL divergence threshold.
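For readers who want to try this, here is a minimal sketch of LOOCV bandwidth selection via scikit-learn's GridSearchCV; the data and bandwidth grid are placeholders rather than the experiment's actual settings, and KernelDensity's built-in score (held-out log-likelihood) is used as the selection criterion.

import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(42)
samples = rng.standard_normal((200, 5))  # placeholder dataset

grid = GridSearchCV(
    KernelDensity(kernel='gaussian'),
    {'bandwidth': np.logspace(-1, 1, 20)},
    cv=LeaveOneOut(),  # one fit per held-out point and bandwidth candidate
)
grid.fit(samples)
print("Selected bandwidth:", grid.best_params_['bandwidth'])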

The Masked Autoregressive Flow model required a different optimization strategy. Like most neural-network-based models, MAF depends on a number of hyperparameters. I developed a scaling strategy where these hyperparameters were adjusted proportionally to the input dimensionality. It is important to note that this scaling was based on reasonable heuristics rather than an exhaustive optimization. The hyperparameter search was kept minimal to establish baseline performance; more sophisticated tuning would likely yield large performance improvements for the MAF model.

The entire codebase, including data generation, model implementations, training procedures, and evaluation metrics, is available in this repository for reproducibility and further experimentation. Here are the results:

The experimental results show a striking difference in the relative performance of KDE and MAF! As shown by the graphs, a transition occurs around the fifth dimension. Below this threshold, KDE performed better; beyond five dimensions, however, MAF begins to outperform KDE by increasingly dramatic margins.

The magnitude of this difference becomes stark at dimension 7, where the results show a profound disparity in data efficiency. Across all three distribution types tested, KDE consistently required more than 100,000 data points to achieve satisfactory performance. In contrast, MAF reached the same performance threshold with at most 2,000 data points across all distributions. This represents an improvement factor ranging from 50x to 100x!

Apart from sample efficiency, the computational performance differences are equally compelling: KDE required roughly 12 times longer to train than MAF at these higher dimensions.

The combination of superior data efficiency and faster training times makes MAF the clear winner for high-dimensional tasks. KDE is still a valuable tool for low-dimensional problems, but if you are working on an application involving more than five dimensions, I highly recommend trying the MAF approach.


Part 4: Why does MAF Crush KDE?

To understand why KDE suffers in high dimensions, we must first examine how it actually works under the hood. As discussed in my previous article, Kernel Density Estimation uses local neighborhood estimation: for any point at which we want to evaluate the density, KDE looks at nearby data points and uses their proximity to estimate the local probability density. Each kernel function creates a neighborhood around a data point, and the density estimate at any location is the sum of contributions from all kernels whose neighborhoods include that location.
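For reference, the standard form of the estimator with kernel $K$ and bandwidth $h$ in $d$ dimensions is:

$$\hat{p}(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$

so every evaluation is a sum of locally concentrated kernel contributions centered on the training points $x_1, \ldots, x_n$.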

This local approach works well in low dimensions. However, as the dimension increases, the data becomes sparser, so the estimator needs exponentially more data points to fill the space with the same density.

In contrast, MAF does not use local neighborhood estimation. Instead of estimating density by looking at nearby points, MAF learns functions that map previous variables to conditional distribution parameters. The neural network's weights are shared across the entire input space, allowing it to generalize from training data without needing to populate local neighborhoods. This architectural difference enables MAF to scale with dimension much better than KDE.

This distinction between local and global approaches explains the dramatic performance gap observed in my experiment. While KDE must populate an exponentially expanding space with data points to maintain accurate local neighborhoods, MAF can exploit the compositional structure of neural networks to learn global patterns using far fewer samples.

Conclusion

The Kernel Density Estimator is great at nonparametrically analyzing data in low dimensions; it is intuitive, fast, and requires far less tuning. However, for high-dimensional data, or when computation time is a concern, I would recommend trying out normalizing flows. While they are not nearly as battle-tested as KDE, they are a solid alternative to try, and might just end up being your new favorite density estimator.
