
Curse of Dimensionality: An Intuitive Exploration


Photo by Mathew Schwartz on Unsplash

In the previous article, we discussed the surprising behavior of data in higher dimensions. We found that volume tends to accumulate in the corners of spaces in a strange way, and we simulated a hypersphere inscribed in a hypercube to investigate this, observing an interesting decrease in their volume ratio as the number of dimensions grew. Examples that demonstrated the advantages of multi-dimensional thinking included the DVD-paper experiment and the kernel trick in support vector machines (SVMs).

Today, we will explore some of the challenging aspects of high-dimensional data, collectively known as the curse of dimensionality. Our goal is to build an intuitive understanding of this concept and its practical implications. The diagram below outlines how this article is structured.

Image by Author

Understanding the Curse of Dimensionality

“Curse of dimensionality” is a term that was first used by Richard E. Bellman back in the 1960s. It began as Bellman’s idea from dynamic programming, and it turned out to be a fundamental concept for understanding complexity in high-dimensional spaces.

Good, but what exactly is the “curse of dimensionality”?

At its core, it refers to the difficulties and unique characteristics one faces when working with data in high-dimensional spaces (in our case, this means datasets with many features, columns, or attributes). These spaces go far beyond our everyday experience of three-dimensional space.

When we increase the number of dimensions in a dataset, the volume it occupies expands exponentially. This might initially look like an advantage: more room could mean more data and possibly more insights? However, that is not the case, because having many dimensions comes with plenty of challenges that change how we need to handle and understand high-dimensional data.
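To get a feel for that exponential growth, here is a tiny illustration (my own toy example, not from the original experiments): if we cover each dimension with just 10 bins, the number of cells we would need to fill with data explodes as dimensions are added.

# Number of cells needed to cover the space with 10 bins per dimension
for d in [1, 2, 3, 10, 20]:
    print(f"{d:>2} dimensions -> {10**d:,} cells")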

The shift from low-dimensional to high-dimensional data brings several harsh challenges. Two of them stand out because they have the most significant effects: 1) data sparsity, and 2) issues with distance metrics. Each of them makes analysis in higher dimensions much more complex.

Data Sparsity: Islands in an Ocean of Emptiness

Data sparsity in high-dimensional spaces is like a few small islands lost in a vast ocean. When dimensionality increases, data points that were close together in lower dimensions become increasingly separated. This is due to the fact that the amount of space expands exponentially with each additional dimension. Just imagine a cube becoming a hypercube: its corners move further away from its center, and its interior becomes emptier. This growing emptiness is what we refer to as data sparsity.

Many data analysis techniques struggle with sparsity. For example, many clustering algorithms rely on closely situated data points to form meaningful clusters. However, when data points become too dispersed, these algorithms face difficulties.
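To make this sparsity concrete, here is a minimal sketch (an assumed setup, not part of the article’s own experiments) that counts how many uniformly sampled points fall within a fixed radius of the center of the unit hypercube. The fraction collapses quickly as dimensions are added, leaving the neighborhood around any point nearly empty.

import numpy as np

def fraction_near_center(dimensions, num_points=10_000, radius=0.5, seed=0):
    # Sample points uniformly in the unit hypercube and measure how many
    # land within a fixed radius of its center.
    rng = np.random.default_rng(seed)
    points = rng.uniform(0, 1, size=(num_points, dimensions))
    center = np.full(dimensions, 0.5)
    distances = np.linalg.norm(points - center, axis=1)
    return np.mean(distances <= radius)

for dim in [1, 2, 3, 10, 50, 100]:
    print(f"{dim:>3}D: fraction within radius 0.5 of center = {fraction_near_center(dim):.4f}")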

Distance Metric Problems: When Proximity Loses Meaning

In high-dimensional spaces, distance metrics encounter significant challenges. Metrics like Euclidean or Manhattan distance, which are useful for measuring proximity between data points in lower dimensions, lose their effectiveness. In these expanded spaces, distances begin to converge. This means that most pairs of points become nearly equidistant from one another and from any reference point. This convergence makes it harder to distinguish between close neighbors and distant ones.
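A quick supplementary sketch of this convergence (my own illustration, with assumed parameters): the “relative contrast” between the farthest and nearest neighbor of a random reference point shrinks as dimensions grow, so “near” and “far” become hard to tell apart.

import numpy as np

rng = np.random.default_rng(1)
for d in [2, 10, 100, 1_000]:
    points = rng.uniform(0, 1, size=(2_000, d))
    query = rng.uniform(0, 1, size=d)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:>5}: (max - min) / min = {contrast:.3f}")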

In tasks like classification, where distance measurements are crucial for categorizing new data points, these metrics become less effective. As a result, algorithm performance drops, leading to less accurate predictions and analyses.

To better understand how distance behavior changes in higher dimensions, let’s perform a simple simulation. We’ll generate random points in both low- and high-dimensional spaces. This will allow us to observe and compare the distribution of distances, showing us how these distances evolve as we move to higher dimensions.

import numpy as np
import matplotlib.pyplot as plt

def generate_points(dimensions, num_points, range_min, range_max):
    return np.random.uniform(range_min, range_max, (num_points, dimensions))

def calculate_pairwise_distances(points):
    distances = np.sqrt(((points[:, np.newaxis, :] - points[np.newaxis, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(distances, np.nan)  # Ignore self-distances by setting them to NaN
    return distances

def calculate_distances_from_reference(points, reference_point):
    distances = np.sqrt(((points - reference_point) ** 2).sum(axis=1))
    return distances

def calculate_stats_for_dimensions(num_points, dimensions_range, range_min, range_max):
    means_pairwise = []
    stds_pairwise = []
    means_ref = []
    stds_ref = []

    for dim in dimensions_range:
        points = generate_points(dim, num_points, range_min, range_max)
        pairwise_distances = calculate_pairwise_distances(points)
        reference_point = generate_points(dim, 1, range_min, range_max)
        distances_from_ref = calculate_distances_from_reference(points, reference_point)

        means_pairwise.append(np.nanmean(pairwise_distances))
        stds_pairwise.append(np.nanstd(pairwise_distances))
        means_ref.append(np.mean(distances_from_ref))
        stds_ref.append(np.std(distances_from_ref))

    return means_pairwise, stds_pairwise, means_ref, stds_ref

def plot_histograms_and_stats(num_points, dimensions_range, range_min, range_max):
    fig, axs = plt.subplots(2, 3, figsize=(12, 7), tight_layout=True)

    # Plotting histograms for 3D and 100D
    for i, dim in enumerate([3, 100]):
        points = generate_points(dim, num_points, range_min, range_max)
        pairwise_distances = calculate_pairwise_distances(points)
        reference_point = generate_points(dim, 1, range_min, range_max)
        distances_from_ref = calculate_distances_from_reference(points, reference_point)

        axs[i, 0].hist(pairwise_distances[~np.isnan(pairwise_distances)], bins=50, alpha=0.7, color='blue', edgecolor='black')
        axs[i, 0].set_title(f'Pairwise Distances in {dim}D')
        axs[i, 1].hist(distances_from_ref, bins=30, alpha=0.7, color='green', edgecolor='black', range=(0, max(distances_from_ref)))
        axs[i, 1].set_title(f'Distances to Reference in {dim}D')

    # Calculating and plotting mean and std deviation trends across dimensions
    means_pairwise, stds_pairwise, means_ref, stds_ref = calculate_stats_for_dimensions(num_points, dimensions_range, range_min, range_max)

    # Plotting mean and std deviation graphs for pairwise distances
    axs[0, 2].plot(dimensions_range, means_pairwise, label='Mean Pairwise', marker='o', color='blue')
    axs[0, 2].plot(dimensions_range, stds_pairwise, label='Std Dev Pairwise', marker='x', color='cyan')
    axs[0, 2].set_title('Pairwise Distances Stats')

    # Plotting mean and std deviation graphs for distances to reference point
    axs[1, 2].plot(dimensions_range, means_ref, label='Mean Reference', marker='o', color='green')
    axs[1, 2].plot(dimensions_range, stds_ref, label='Std Dev Reference', marker='x', color='lime')
    axs[1, 2].set_title('Reference Point Distances Stats')

    axs[0, 2].legend()
    axs[1, 2].legend()

    plt.show()

plot_histograms_and_stats(1000, range(1, 101), 1, 100)

Image by Author

The code output shows how distances change across dimensions. In 3D, the distances between points vary widely. In 100D, distances between points tend to be similar. The graphs on the right also show that as dimensions increase, the mean distance between points gets larger, but the standard deviation stays roughly the same as it was in 2D or 3D space.

Another note here is that as dimensions increase, the mean distance between points gets larger and approaches the maximum distance. This happens because most of the space is concentrated in the corners.

To better understand this, we can simulate random points in dimensions up to 100D. This will allow us to compare the average distance to the maximum distance.

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

def generate_points(dimensions, num_points, range_min, range_max):
    return np.random.uniform(range_min, range_max, (num_points, dimensions))

def calculate_distances_stats(points):
    # Compute pairwise distances
    distances = pdist(points)

    # Calculate average and maximum distance
    average_distance = np.mean(distances)
    max_distance = np.max(distances)

    return average_distance, max_distance

def plot_normalized_difference(num_points, dimensions_range, range_min, range_max):
    normalized_differences = []

    for dim in dimensions_range:
        points = generate_points(dim, num_points, range_min, range_max)
        average_distance, max_distance = calculate_distances_stats(points)
        normalized_difference = (max_distance - average_distance) / max_distance
        normalized_differences.append(normalized_difference)

    plt.figure(figsize=(8, 6))
    plt.plot(dimensions_range, normalized_differences, label='Normalized Difference', marker='o', color='blue')
    plt.xlabel('Number of Dimensions')
    plt.ylabel('Normalized Difference')
    plt.title('Normalized Difference Between Max and Average Distances Across Dimensions')
    plt.legend()
    plt.show()

plot_normalized_difference(500, range(1, 101), 0, 1)

Image by Author

The graph shows that as we go into higher dimensions, the average distance gets closer to the maximum distance. We used normalization here to keep the comparison consistent across scales.

It’s important to understand the difference between absolute and relative distances. While absolute distances generally increase with more dimensions, it’s the relative differences that matter more. Clustering algorithms like K-means or DBSCAN work based on how points are positioned relative to one another, not on their exact distances. This lets us find patterns and relationships that we would miss if we only looked at the absolute distances.

But this leads to an interesting question: why do pairs of points in high-dimensional spaces tend to be roughly the same distance apart as we add more dimensions? What causes this to happen?

Photo by Aakash Dhage on Unsplash

To understand why pairs of points in high-dimensional spaces become equidistant, we can look at the Law of Large Numbers (LLN). This statistical principle states that as we increase our sample size or the number of dimensions, the average of our observations gets closer to the expected value.

Let’s consider the example of rolling a fair six-sided die. The expected mean of a roll is 3.5, which is the average of all possible outcomes. Initially, with just a few rolls, like 5 or 10, the average might be significantly different from 3.5 due to randomness. But as we increase the number of rolls to hundreds or thousands, the average roll value gets closer to 3.5. This phenomenon, where the average of many trials aligns with the expected value, shows the essence of the LLN. It demonstrates that while individual outcomes are unpredictable, the average becomes highly predictable over many trials.
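Here is a short simulation of that idea (an illustrative sketch with arbitrary roll counts): the average of fair die rolls drifts toward the expected value 3.5 as the number of rolls grows.

import numpy as np

rng = np.random.default_rng(42)
for num_rolls in [5, 10, 100, 1_000, 100_000]:
    rolls = rng.integers(1, 7, size=num_rolls)  # fair die: integers 1 through 6
    print(f"{num_rolls:>7} rolls -> average = {rolls.mean():.3f}")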

Now, how does this relate to distances in high-dimensional spaces?

The Euclidean distance between two points in an n-dimensional space is calculated by summing the squared differences across each dimension. We can think of each squared difference as a random variable, similar to a roll of a die. As the number of dimensions (or rolls) increases, the sum of these ‘rolls’ gets closer to its expected value.
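To see that convergence numerically, here is a rough check (an illustrative assumption on my part: coordinates drawn uniformly from [0, 1], so each squared difference has expected value 1/6). The squared distance then concentrates around d/6 and the distance around √(d/6), and the match tightens as d grows.

import numpy as np

rng = np.random.default_rng(0)
for d in [3, 10, 100, 1_000]:
    # Pairs of random points with coordinates uniform on [0, 1]
    x, y = rng.uniform(0, 1, size=(2, 2_000, d))
    distances = np.linalg.norm(x - y, axis=1)
    print(f"d={d:>5}: mean distance = {distances.mean():.3f}, "
          f"sqrt(d/6) = {np.sqrt(d / 6):.3f}")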

An important requirement for the LLN is the independence of the random variables. In high-dimensional vectors, this independence shows up through an interesting geometric property: the vectors tend to be almost orthogonal to each other.

import numpy as np

def test_orthogonality(dimensions, n_trials):
    for i in range(n_trials):
        # Generate two random vectors
        v1 = np.random.randn(dimensions)
        v2 = np.random.randn(dimensions)

        # Calculate dot product
        dot_product = np.dot(v1, v2)

        # Calculate magnitudes
        magnitude_v1 = np.linalg.norm(v1)
        magnitude_v2 = np.linalg.norm(v2)

        # Calculate the cosine of the angle
        cos_theta = dot_product / (magnitude_v1 * magnitude_v2)

        # Check if vectors are almost orthogonal
        if np.abs(cos_theta) < 0.1:  # Adjust this threshold as needed
            orthogonality = "Almost Orthogonal"
        else:
            orthogonality = "Not Orthogonal"

        # Calculate angle in degrees
        theta = np.arccos(cos_theta) * (180 / np.pi)  # Convert to degrees

        print(f"Trial {i+1}:")
        print(f"  Dot Product: {dot_product}")
        print(f"  Cosine of Angle: {cos_theta}")
        print(f"  Angle: {theta} degrees")
        print(f"  Status: {orthogonality}")
        print("--------------------------------")

# Try editing this and notice the near-orthogonality of vectors in higher dimensions
dimensions = 100  # Number of dimensions
n_trials = 10     # Number of trials

test_orthogonality(dimensions, n_trials)

Try running the code above and editing the number of dimensions and trials, and you will notice that vectors in higher dimensions are almost orthogonal.

The angle between two vectors, A and B, is determined by the cosine of the angle, which is derived from their dot product and magnitudes. The formula is expressed as:

cos(θ) = (A · B) / (‖A‖ ‖B‖)

Here, A · B represents the dot product of vectors A and B, and ‖A‖ and ‖B‖ are their respective magnitudes. For two vectors to be orthogonal, the angle between them must be 90 degrees, making cos(θ) equal to zero. This is achieved when the dot product A · B is zero, a condition familiar from lower dimensions.

However, in high-dimensional spaces, another phenomenon emerges: the ratio of the dot product to the product of the vectors’ magnitudes becomes so small that we can consider the vectors to be ‘almost orthogonal.’
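A quick numerical check of this (my own sketch, using standard normal vectors as an assumption): the average |cos(θ)| between random pairs of vectors shrinks toward zero as the dimensionality grows, roughly like 1/√d.

import numpy as np

rng = np.random.default_rng(0)
for d in [3, 30, 300, 3_000]:
    # 1,000 independent pairs of random d-dimensional vectors
    v1 = rng.standard_normal((1_000, d))
    v2 = rng.standard_normal((1_000, d))
    cos = np.sum(v1 * v2, axis=1) / (np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1))
    print(f"d={d:>5}: mean |cos(theta)| = {np.mean(np.abs(cos)):.4f}")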

But what does it mean for two vectors to be ‘independent’ in this context?

Navigating a Grid City: An Analogy for Independence in High Dimensions

Imagine you are in a city laid out in a grid, like Manhattan’s streets. Picture yourself at an intersection, trying to reach another point in this city. In this analogy, each street represents a dimension in a high-dimensional space. Moving along a street is like changing the value in a single dimension of a high-dimensional vector. Moving along one street doesn’t affect your position on another street, just as changing one dimension doesn’t affect the others.

To reach a specific intersection, you make a series of independent decisions, just like calculating distance in high-dimensional space. Each decision contributes independently but collectively leads you to your destination.

This analogy also applies to the concept of orthogonality in high-dimensional vectors. When vectors are almost orthogonal, they follow their own paths without significantly influencing one another. This condition complements the need for statistical independence in the LLN.

A crucial note: while this LLN analogy offers a helpful perspective, it may not capture the whole idea or all the causes behind this behavior. However, it serves as a useful proxy, providing an understanding of why pairs of points tend to be almost equidistant.

One way the problems of the curse of dimensionality show up is overfitting. Overfitting happens when a complex model learns noise instead of the patterns in the data. This is especially true in high-dimensional spaces, where there are many features. The model can pick up spurious connections or correlations and perform poorly when it sees new data (failing to generalize).
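Here is a hedged sketch of that failure mode, assuming scikit-learn is available (the data and model choice are my own illustration, not the article’s): with far more random features than samples, an unregularized linear model can fit the training labels almost perfectly yet tell us nothing about held-out data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))   # 60 samples, 500 purely random features
y = rng.normal(size=60)          # labels unrelated to the features

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("Train R^2:", round(model.score(X_train, y_train), 3))  # essentially perfect
print("Test R^2: ", round(model.score(X_test, y_test), 3))    # typically near or below zero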

The curse also makes it hard to find patterns in big datasets. High-dimensional data is spread out and sparse, so it’s difficult for traditional analysis methods to find meaningful insights. Some adjustments or specialized methods are needed to navigate and understand this kind of data.

Another implication is that processing high-dimensional data takes a lot of computational power and memory. Algorithms that work well in lower dimensions become far more complex and resource-heavy as the number of dimensions increases. This means either having more powerful hardware or optimizing algorithms to handle the increased computational load efficiently.

There are several strategies to deal with the curse of dimensionality. One way is to reduce the dimensionality while keeping the important information (e.g., the PCA algorithm). Another method is manifold learning (which can be thought of as a type of dimensionality reduction), which uncovers the structure within the high-dimensional data. The key idea behind manifold learning is that many high-dimensional datasets actually lie on a lower-dimensional manifold within the high-dimensional space (e.g., Isomap).

Note here that, generally speaking, traditional dimensionality reduction techniques like PCA (Principal Component Analysis) focus on preserving global data structure and variance in a linear fashion. In contrast, manifold learning techniques like Isomap (Isometric Mapping) emphasize uncovering the underlying non-linear structure (manifold) of the data, aiming to preserve local relationships and geometric features.
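As a brief illustration of that contrast (a sketch assuming scikit-learn; the S-curve dataset and parameters are my choices, not the article’s), both methods can be asked for a 2D embedding of 3D points that lie on a curved 2D surface: PCA projects linearly, while Isomap tries to preserve the manifold’s geodesic structure.

from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, _ = make_s_curve(n_samples=1_000, random_state=0)  # 3D points lying on a 2D manifold

X_pca = PCA(n_components=2).fit_transform(X)                     # linear projection
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)  # non-linear, manifold-aware embedding

print("Original shape:  ", X.shape)
print("PCA embedding:   ", X_pca.shape)
print("Isomap embedding:", X_iso.shape)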

Feature selection is also an option, where relevant features are chosen to improve model performance. Regularization techniques prevent overfitting by shrinking the weights of less important features. Increasing the sample size can help, although it may not always be possible. These methods can help us analyze high-dimensional data more accurately and efficiently.
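The sketch below (assuming scikit-learn; the synthetic dataset is my own illustration) shows two of these strategies side by side: univariate feature selection, which keeps only the highest-scoring features, and L1 (Lasso) regularization, which shrinks the weights of uninformative features toward zero.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso

# Synthetic data: 100 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=100, n_informative=5,
                       noise=10.0, random_state=0)

# Feature selection: keep the 5 features that score best against the target
X_selected = SelectKBest(f_regression, k=5).fit_transform(X, y)
print("Selected-feature matrix shape:", X_selected.shape)

# L1 regularization: Lasso drives most irrelevant coefficients to exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)
print("Non-zero Lasso coefficients:", int(np.sum(lasso.coef_ != 0)))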

The curse of dimensionality is one of the most important problems in data science and machine learning. It arises when dealing with high-dimensional spaces. Two significant challenges it creates are data sparsity and issues with distance metrics. These challenges can cause overfitting in machine learning models and make computations more complex. To tackle them, strategies like dimensionality reduction, feature selection, and regularization can be used.

If you have made it this far, I would like to thank you for spending time reading this! I hope you found the topic enjoyable and at least inspiring enough to delve deeper into the world of high-dimensional data. Please feel free to suggest any edits or point out any mistakes or inaccuracies.
