
Visualizing the True Extent of the Curse of Dimensionality


Using the Monte Carlo method to visualize the behavior of observations with very large numbers of features

Consider a dataset made of some number of observations, each observation having N features. If you convert all features to a numeric representation, you could say that each observation is a point in an N-dimensional space.

When N is low, the relationships between points are just what you would expect intuitively. But sometimes N grows very large; this may happen, for example, if you're creating many features via one-hot encoding, etc. For very large values of N, observations behave as if they are sparse, or as if the distances between them are somehow larger than what you would expect.
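As a quick illustration of how this happens (the data and column name here are invented for the example), one-hot encoding a single categorical column with pandas produces one new dimension per distinct category:

import pandas as pd

# hypothetical data: one categorical feature with several distinct values
df = pd.DataFrame({"city": ["Tokyo", "Paris", "Lima", "Oslo", "Cairo"]})

# one-hot encoding replaces the single column with one column per category
encoded = pd.get_dummies(df, columns=["city"])
print(encoded.shape)  # (5, 5): the feature count now equals the category count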

The phenomenon is real. As the number of dimensions N grows, and all else stays the same, the N-volume containing your observations really does increase in a sense (or at least the number of degrees of freedom becomes larger), and the Euclidean distances between observations also increase. The group of points actually does become more sparse. This is the geometric basis for the curse of dimensionality. The behavior of the models and techniques applied to the dataset will be influenced as a consequence of these changes.

Many things can go wrong if the number of features is very large. Having more features than observations is a typical setup for models overfitting in training. Any brute-force search in such a space (e.g. GridSearch) becomes less efficient: you need more trials to cover the same intervals along any axis. A subtle effect impacts any models based on distance or vicinity: as the number of dimensions grows to very large values, if you consider any point in your observations, all the other points appear to be far away and somehow nearly equidistant. Since these models rely on distance to do their job, the leveling out of distance differences makes their job much harder. E.g. clustering doesn't work as well if all points appear to be nearly equidistant.
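The near-equidistance effect is easy to check numerically. Here is a minimal sketch (not part of the main experiment below) that measures the relative contrast between the farthest and nearest neighbors of a point; the contrast shrinks as the dimension grows:

import numpy as np

rng = np.random.default_rng(0)

# watch the contrast between nearest and farthest neighbors fade with dimension
for d in [2, 10, 100, 1000]:
    points = rng.random((1000, d))
    # distances from the first point to all the others
    dist = np.linalg.norm(points[1:] - points[0], axis=1)
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:4d}  relative contrast = {contrast:.3f}")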

For all these reasons, and more, techniques such as PCA, LDA, etc. have been created: they move away from the peculiar geometry of spaces with very many dimensions, and distill the dataset down to a number of dimensions more compatible with the actual information contained in it.
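As an aside, this is roughly what that distillation looks like in practice; a minimal scikit-learn sketch on synthetic data (the sizes and noise level here are arbitrary choices for the example):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# hypothetical data: 200 observations embedded in 50 dimensions,
# but with only 3 directions of real variance plus a little noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (200, 3)
print(pca.explained_variance_ratio_.sum())  # close to 1.0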

It is hard to perceive intuitively the true magnitude of this phenomenon, and spaces with more than 3 dimensions are extremely difficult to visualize, so let's do some simple 2D visualizations to help our intuition. There is a geometric basis for the reason why dimensionality can become a problem, and that is what we'll visualize here. If you have not seen this before, the results might be surprising: the geometry of high-dimensional spaces is far more complex than typical intuition is likely to suggest.

Consider a square of side 1, centered at the origin. Within the square, you inscribe a circle.

A circle inscribed in a square

That's the setup in 2 dimensions. Now think of the general, N-dimensional case. In 3 dimensions, you have a sphere inscribed in a cube. Beyond that, you have an N-sphere inscribed in an N-cube, which is the most general way to put it. For simplicity, we'll refer to these objects as "sphere" and "cube", no matter how many dimensions they have.

The volume of the cube is fixed; it's always 1. The question is: as the number of dimensions N varies, what happens to the volume of the sphere?

Let's answer the question experimentally, using the Monte Carlo method. We will generate a very large number of points, distributed uniformly but randomly within the cube. For each point we calculate its distance to the origin; if that distance is less than 0.5 (the radius of the sphere), then the point is inside the sphere.

Random points within the square

If we divide the number of points inside the sphere by the total number of points, that will approximate the ratio of the volume of the sphere to the volume of the cube. Since the volume of the cube is 1, the ratio will be equal to the volume of the sphere. The approximation gets better as the total number of points grows.

In other words, the ratio inside_points / total_points will approximate the volume of the sphere.
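As a sanity check for the results, note that the volume of an N-ball has a closed form, V_N(r) = π^(N/2) · r^N / Γ(N/2 + 1); with r = 0.5 this gives the exact values the Monte Carlo estimates should approach. A small helper (not part of the original code) to compute them:

from math import gamma, pi

# exact volume of an N-ball of radius r; here r = 0.5, the inscribed sphere
def ball_volume(n: int, r: float = 0.5) -> float:
    return pi ** (n / 2) * r**n / gamma(n / 2 + 1)

for n in [2, 3, 5, 10, 20]:
    print(f"N={n:2d}  exact volume = {ball_volume(n):.6g}")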

The code is quite straightforward. Since we need many points, explicit loops must be avoided. We could use NumPy, but it's CPU-only and largely single-threaded, so it would be slow. Potential alternatives: CuPy (GPU), JAX (CPU or GPU), PyTorch (CPU or GPU), etc. We will use PyTorch, but the NumPy code would look almost identical (see the sketch after the PyTorch listing).

If you follow the nested torch statements, we generate 100 million random points, calculate their distances to the origin, count the points inside the sphere, and divide the count by the total number of points. The ratio array will end up containing the volume of the sphere in different numbers of dimensions.

The tunable parameters are set for a GPU with 24 GB of memory; adjust them if your hardware is different.

import numpy as np
import torch
from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# force CPU
# device = 'cpu'

# reduce d_max if too many ratio values are 0.0
d_max = 22
# reduce n if you run out of memory
n = 10**8

ratio = np.zeros(d_max)

for d in tqdm(range(d_max, 0, -1)):
    torch.manual_seed(0)
    # combine large tensor statements for better memory allocation
    ratio[d - 1] = (
        torch.sum(
            torch.sqrt(
                torch.sum(torch.pow(torch.rand((n, d), device=device) - 0.5, 2), dim=1)
            )
            <= 0.5
        ).item()
        / n
    )

    # clean up memory
    torch.cuda.empty_cache()
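For comparison, the NumPy version mentioned above might look like the sketch below; the much smaller n is an assumption made here so it finishes in reasonable time on a CPU:

import numpy as np
from tqdm import tqdm

d_max = 22
n = 10**6  # far fewer points than the GPU version, to keep CPU time reasonable

ratio_np = np.zeros(d_max)
rng = np.random.default_rng(0)

for d in tqdm(range(d_max, 0, -1)):
    # uniform points in the cube [-0.5, 0.5)^d
    points = rng.random((n, d)) - 0.5
    # a point is inside the sphere if its distance to the origin is <= 0.5
    inside = np.sqrt(np.sum(points**2, axis=1)) <= 0.5
    ratio_np[d - 1] = np.sum(inside) / n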

Let's visualize the results:
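The original chart is not reproduced here, but a minimal matplotlib sketch of the same visualization, reusing ratio and d_max from the code above, could look like this:

import matplotlib.pyplot as plt

# sphere volume (Monte Carlo estimate) as a function of the dimension
plt.plot(range(1, d_max + 1), ratio, marker="o")
plt.xlabel("number of dimensions N")
plt.ylabel("volume of the inscribed sphere")
plt.title("Sphere volume shrinks rapidly as dimensionality grows")
plt.grid(True)
plt.show()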
