Visualizing the True Extent of the Curse of Dimensionality
Volume
Linear distance
Distances between observations
Conclusions

Using the Monte Carlo method to visualize the behavior of observations with very large numbers of features

Consider a dataset made up of some number of observations, each observation having N features. If you convert all features to a numeric representation, you could say that each observation is a point in an N-dimensional space.

When N is low, the relationships between points are just what you would expect intuitively. But sometimes N grows very large; this may happen, for instance, if you're creating many features via one-hot encoding, etc. For very large values of N, observations behave as if they are sparse, or as if the distances between them are somehow larger than what you would expect.

The phenomenon is real. As the number of dimensions N grows, all else staying the same, the N-volume containing your observations really does increase in a sense (or at least the number of degrees of freedom becomes larger), and the Euclidean distances between observations also increase. The group of points actually does become more sparse. This is the geometric basis for the curse of dimensionality. The behavior of the models and techniques applied to the dataset will be influenced by these changes.

Many things can go wrong if the number of features is very large. Having more features than observations is a typical setup for models overfitting in training. Any brute-force search in such a space (e.g. GridSearch) becomes less efficient: you need more trials to cover the same linear intervals. A subtle effect impacts any model based on distance or vicinity: since these models depend on distance to do their job, the leveling out of differences in distance makes their job harder. E.g. clustering does not work as well if all points appear to be nearly equidistant.
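The leveling out of distances is easy to demonstrate numerically. Below is a minimal NumPy sketch (the function name and parameter values are illustrative, not from the article) that measures the relative spread of pairwise distances between random points in the unit cube, in low and in high dimensions:

```python
import numpy as np

def distance_spread(n_dims, n_points=500, seed=0):
    """Relative spread (std / mean) of pairwise Euclidean distances
    among random points in the unit cube of the given dimension."""
    rng = np.random.default_rng(seed)
    x = rng.random((n_points, n_dims))
    # squared pairwise distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq = (x**2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * x @ x.T
    d = np.sqrt(np.clip(d2, 0, None))
    d = d[np.triu_indices(n_points, k=1)]  # keep each unique pair once
    return d.std() / d.mean()

print(distance_spread(2))     # noticeable spread in 2 dimensions
print(distance_spread(1000))  # much smaller: distances nearly equal
```

In 2 dimensions the standard deviation of the distances is a large fraction of their mean; in 1000 dimensions it is a tiny fraction, which is exactly the near-equidistance that hurts clustering and nearest-neighbor methods.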

For all these reasons, and more, techniques such as PCA, LDA, etc. have been created, in order to move away from the peculiar geometry of spaces with very many dimensions, and to distill the dataset down to a number of dimensions more compatible with the actual information contained in it.

It is hard to perceive intuitively the true magnitude of this phenomenon, and spaces with more than 3 dimensions are extremely hard to visualize, so let's do some simple 2D visualizations to support our intuition. There is a geometric basis for the reason why dimensionality can become a problem, and this is what we will visualize here. If you have not seen this before, the results might be surprising: the geometry of high-dimensional spaces is far more complex than the usual intuition is likely to suggest.

Consider a square of size 1, centered at the origin. Within the square, you inscribe a circle.

[figure: a circle inscribed in a square]

That is the setup in 2 dimensions. Now think of the general, N-dimensional case. In 3 dimensions, you have a sphere inscribed in a cube. Beyond that, you have an N-sphere inscribed in an N-cube, which is the most general way to put it. For simplicity, we will refer to these objects as “sphere” and “cube”, no matter how many dimensions they have.

The volume of the cube is fixed; it is always 1. The question is: as the number of dimensions N varies, what happens to the volume of the sphere?
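This question also has an exact answer we can check against: the volume of an N-ball of radius r is the standard closed-form expression π^(N/2) r^N / Γ(N/2 + 1). A quick sketch to evaluate it (the function name is mine, not the article's):

```python
import math

def sphere_volume(n_dims, radius=0.5):
    """Exact volume of an n-ball with the given radius,
    via the closed-form formula pi^(n/2) r^n / Gamma(n/2 + 1)."""
    return math.pi ** (n_dims / 2) * radius ** n_dims / math.gamma(n_dims / 2 + 1)

for d in (2, 3, 10, 22):
    print(d, sphere_volume(d))
```

For radius 0.5 this gives π/4 ≈ 0.785 in 2 dimensions and π/6 ≈ 0.524 in 3, and the values shrink rapidly as the dimension grows, which is what the Monte Carlo experiment below should reproduce.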

Let's answer the question experimentally, using the Monte Carlo method. We will generate a very large number of points, distributed uniformly but randomly within the cube. For each point we calculate its distance to the origin; if that distance is less than 0.5 (the radius of the sphere), then the point is inside the sphere.

[figure: random points]

If we divide the number of points inside the sphere by the total number of points, that will approximate the ratio of the volume of the sphere to the volume of the cube. Since the volume of the cube is 1, the ratio will be equal to the volume of the sphere. The approximation gets better when the total number of points is large.

In other words, the ratio inside_points / total_points will approximate the volume of the sphere.
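As a quick sanity check in 2 dimensions, here is a tiny NumPy sketch of exactly this ratio (the point count and seed are illustrative); with a million points it lands close to the area of the circle, π/4 ≈ 0.785:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
# points uniform in the square [-0.5, 0.5]^2
points = rng.random((n, 2)) - 0.5
# count points whose distance to the origin is at most the radius 0.5
inside = (np.linalg.norm(points, axis=1) <= 0.5).sum()
print(inside / n)  # close to pi/4
```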

The code is quite straightforward. Since we need many points, explicit loops must be avoided. We could use NumPy, but it is CPU-only and single-threaded, so it would be slow. Potential alternatives: CuPy (GPU), JAX (CPU or GPU), PyTorch (CPU or GPU), etc. We will use PyTorch, but the NumPy code would look almost identical.

If you follow the nested torch statements, we generate 100 million random points, calculate their distances to the origin, count the points inside the sphere, and divide the count by the total number of points. The ratio array will end up containing the volume of the sphere for different numbers of dimensions.

The tunable parameters are set for a GPU with 24 GB of memory; adjust them if your hardware is different.

import numpy as np
import torch
from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# force CPU
# device = "cpu"

# reduce d_max if too many ratio values are 0.0
d_max = 22
# reduce n if you run out of memory
n = 10**8

ratio = np.zeros(d_max)

for d in tqdm(range(d_max, 0, -1)):
    torch.manual_seed(0)
    # combine the large tensor statements for better memory allocation
    ratio[d - 1] = (
        torch.sum(
            torch.sqrt(
                torch.sum(torch.pow(torch.rand((n, d), device=device) - 0.5, 2), dim=1)
            )
            <= 0.5
        ).item()
        / n
    )

    # clean up memory
    torch.cuda.empty_cache()
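As mentioned above, the NumPy version looks almost identical. Here is a CPU-sized sketch with a much smaller n and d_max than the PyTorch run (so the estimates are rougher, but it finishes in seconds):

```python
import numpy as np
from tqdm import tqdm

# smaller problem size so this runs quickly on a CPU
d_max = 5
n = 200_000

ratio = np.zeros(d_max)
rng = np.random.default_rng(0)

for d in tqdm(range(d_max, 0, -1)):
    # points uniform in the cube [-0.5, 0.5]^d
    points = rng.random((n, d)) - 0.5
    # distance of each point to the origin
    dist = np.sqrt(np.sum(points**2, axis=1))
    # fraction of points inside the sphere of radius 0.5
    ratio[d - 1] = np.sum(dist <= 0.5) / n

print(ratio)
```

In 1 dimension every point falls inside the "sphere" (the interval itself), and the estimated volume then shrinks with every added dimension.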

Let's visualize the results:
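The article's plot is not reproduced in this extract. As a stand-in, here is a minimal matplotlib sketch of volume against dimension; to keep it self-contained, it uses the exact closed-form volumes in place of the Monte Carlo `ratio` array (substitute your own array if you ran the experiment):

```python
import math
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

d_max = 22
# stand-in for the Monte Carlo estimates: exact volume of a sphere
# of radius 0.5 in d dimensions, pi^(d/2) r^d / Gamma(d/2 + 1)
ratio = [
    math.pi ** (d / 2) * 0.5**d / math.gamma(d / 2 + 1)
    for d in range(1, d_max + 1)
]

plt.figure(figsize=(8, 5))
plt.plot(range(1, d_max + 1), ratio, marker="o")
plt.xlabel("number of dimensions N")
plt.ylabel("volume of the inscribed sphere")
plt.grid(True)
plt.savefig("sphere_volume.png")
```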
