The Math Behind Kernel Density Estimation


The following derivation takes inspiration from Bruce E. Hansen's "Lecture Notes on Nonparametrics" (2009). If you are curious to learn more, you can refer to his original lecture notes.

Suppose we wanted to estimate a probability density function, f(t), from a sample of data. A natural starting point would be to estimate the cumulative distribution function, F(t), using the empirical distribution function (EDF). Let X1, …, Xn be independent, identically distributed real random variables with common cumulative distribution function F(t). The EDF is defined as:
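$$\hat{F}_n(t) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{X_i \le t\},$$

where $\mathbf{1}\{\cdot\}$ denotes the indicator function.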

Then, by the strong law of large numbers, as n approaches infinity, the EDF converges almost surely to F(t). The EDF is a step function that might look like the following:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Generate sample data
np.random.seed(14)
data = np.random.normal(loc=0, scale=1, size=40)

# Sort the data
data_sorted = np.sort(data)

# Compute ECDF values
ecdf_y = np.arange(1, len(data_sorted)+1) / len(data_sorted)

# Generate x values for the normal CDF
x = np.linspace(-4, 4, 1000)
cdf_y = norm.cdf(x)

# Create the plot
plt.figure(figsize=(6, 4))
plt.step(data_sorted, ecdf_y, where='post', color='blue', label='ECDF')
plt.plot(x, cdf_y, color='gray', label='Normal CDF')
plt.plot(data_sorted, np.zeros_like(data_sorted), '|', color='black', label='Data points')

# Label axes
plt.xlabel('X')
plt.ylabel('Cumulative Probability')

# Add grid
plt.grid(True)

# Set limits
plt.xlim([-4, 4])
plt.ylim([0, 1])

# Add legend
plt.legend()

# Show plot
plt.show()

Therefore, if we were to try to find an estimator for f(t) by taking the derivative of the EDF, we would get a scaled sum of Dirac delta functions, which is not very helpful. Instead, let us consider using the two-point central difference formula applied to the EDF as an approximation of the derivative. Then, for a small h > 0, we get:
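$$\hat{f}(t) = \frac{\hat{F}_n(t+h) - \hat{F}_n(t-h)}{2h} = \frac{1}{2nh}\sum_{i=1}^{n} \mathbf{1}\{t-h \le X_i \le t+h\}$$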

Now define the function k(u) as follows:
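$$k(u) = \begin{cases} \dfrac{1}{2} & \text{if } |u| \le 1 \\ 0 & \text{otherwise} \end{cases}$$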

Then we have that:
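$$\hat{f}(t) = \frac{1}{nh}\sum_{i=1}^{n} k\!\left(\frac{X_i - t}{h}\right)$$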

This is a special case of the kernel density estimator, where k is the uniform kernel function. More generally, a kernel function is a non-negative function from the reals to the reals that satisfies

$$\int_{-\infty}^{\infty} k(u)\,du = 1.$$
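To make the derivation concrete, here is a minimal sketch that implements this estimator directly with the uniform kernel and compares it against the true standard normal density. The helper names (uniform_kernel, uniform_kde) and the bandwidth h = 0.5 are arbitrary choices for illustration, not part of the derivation itself.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Generate sample data
np.random.seed(14)
data = np.random.normal(loc=0, scale=1, size=100)

def uniform_kernel(u):
    # k(u) = 1/2 for |u| <= 1, and 0 otherwise
    return 0.5 * (np.abs(u) <= 1)

def uniform_kde(t, sample, h):
    # f_hat(t) = (1 / (n * h)) * sum_i k((X_i - t) / h)
    n = len(sample)
    u = (sample[np.newaxis, :] - t[:, np.newaxis]) / h
    return uniform_kernel(u).sum(axis=1) / (n * h)

# Evaluate the estimator on a grid and compare it with the true density
t = np.linspace(-4, 4, 1000)
h = 0.5
plt.figure(figsize=(6, 4))
plt.plot(t, uniform_kde(t, data, h), label=f'Uniform-kernel KDE (h = {h})')
plt.plot(t, norm.pdf(t), color='gray', label='True density')
plt.xlabel('X')
plt.ylabel('Density')
plt.legend()
plt.show()

Because the uniform kernel is a box, the resulting estimate is itself a step function; the smoother kernels used later in the article produce smoother estimates.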

We will assume that all kernels discussed in this article are symmetric, hence k(-u) = k(u).

The j-th moment of a kernel, which provides insight into the shape and behavior of the kernel function, is defined as:
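$$\kappa_j(k) = \int_{-\infty}^{\infty} u^j\, k(u)\, du$$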

Lastly, the order of a kernel, ν, is defined as the index of its first non-zero moment. Since a symmetric kernel has $\kappa_1 = 0$, a symmetric kernel with $\kappa_2 \neq 0$ is a second-order kernel.
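As a quick numerical check of these definitions (the kernel_moment helper and the choice of kernels are mine, for illustration), the following sketch computes the first few moments of the Gaussian and uniform kernels with scipy.integrate.quad. Both integrate to one, have a zero first moment by symmetry, and have a non-zero second moment, so both are second-order kernels.

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kernel_moment(k, j, support):
    # j-th moment: integral of u^j * k(u) over the kernel's support
    value, _ = quad(lambda u: u**j * k(u), *support)
    return value

# Each kernel is paired with an integration range covering its support
kernels = {
    "gaussian": (norm.pdf, (-np.inf, np.inf)),
    "uniform": (lambda u: 0.5 * (np.abs(u) <= 1.0), (-1.0, 1.0)),
}

for name, (k, support) in kernels.items():
    moments = [kernel_moment(k, j, support) for j in range(3)]
    print(name, [round(m, 4) for m in moments])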

We can only minimize the error of the kernel density estimator by changing either the h value (bandwidth) or the kernel function. The bandwidth parameter has a much larger impact on the resulting estimate than the kernel function, but it is also much more difficult to choose. To demonstrate the influence of the h value, take the following two kernel density estimates. A Gaussian kernel was used to estimate the density of a sample generated from a standard normal distribution; the only difference between the estimators is the chosen h value.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Generate sample data
np.random.seed(14)
data = np.random.normal(loc=0, scale=1, size=100)

# Define the bandwidths
bandwidths = [0.1, 0.3]

# Plot the histogram and KDE for every bandwidth
plt.figure(figsize=(12, 8))
plt.hist(data, bins=30, density=True, color='gray', alpha=0.3, label='Histogram')

x = np.linspace(-5, 5, 1000)
for bw in bandwidths:
    # A scalar bw_method is used directly as gaussian_kde's bandwidth factor
    kde = gaussian_kde(data, bw_method=bw)
    plt.plot(x, kde(x), label=f'Bandwidth = {bw}')

# Add labels and title
plt.title('Impact of Bandwidth Selection on KDE')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
plt.show()

Quite a dramatic difference.

Now let us look at the impact of changing the kernel function while keeping the bandwidth constant.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

# Generate sample data
np.random.seed(14)
data = np.random.normal(loc=0, scale=1, size=100)[:, np.newaxis] # reshape for sklearn

# Initialize a constant bandwidth
bandwidth = 0.6

# Define different kernel functions
kernels = ["gaussian", "epanechnikov", "exponential", "linear"]

# Plot the histogram (transparent) and KDE for every kernel
plt.figure(figsize=(12, 8))

# Plot the histogram
plt.hist(data, bins=30, density=True, color="gray", alpha=0.3, label="Histogram")

# Plot KDE for every kernel function
x = np.linspace(-5, 5, 1000)[:, np.newaxis]
for kernel in kernels:
    kde = KernelDensity(bandwidth=bandwidth, kernel=kernel)
    kde.fit(data)
    log_density = kde.score_samples(x)
    plt.plot(x[:, 0], np.exp(log_density), label=f"Kernel = {kernel}")

plt.title("Impact of Different Kernel Functions on KDE")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend()
plt.show()

While visually there is a noticeable difference in the tails, the overall shape of the estimators is similar across the different kernel functions. Therefore, I will focus primarily on finding the optimal bandwidth for the estimator. Now, let's explore some of the properties of the kernel density estimator, including its bias and variance.
