10. PCA + tSNE/UMAP
More data doesn’t necessarily mean better models. Some datasets are simply too large, and you can do well without using them in full. But if you aren’t comfortable setting aside part of the data, I suggest using dimensionality reduction techniques to project it into a lower-dimensional space.
A rise in model performance isn’t guaranteed, but in the long run you get to run many more experiments on the smaller dataset, because RAM usage is lower and computation times can be much shorter.
The issue is that quality dimensionality reduction can take too long when the dataset has many features. You won’t get it right on the first try, so the extra experimentation becomes much more costly time-wise.
That’s why the Sklearn documentation suggests combining dimensionality reduction algorithms with PCA (Principal Component Analysis).
PCA is fast for any number of dimensions, making it ideal for a first-stage reduction. The recommendation is to project the data down to a reasonable number of dimensions, like 30–50, with PCA, and then use another algorithm, like tSNE or UMAP, to reduce it even further.
Below is the combination of PCA and tSNE:
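A minimal sketch of the two-stage pipeline with scikit-learn. The dataset here is a small synthetic stand-in (the shapes are placeholders chosen so the example runs quickly; the article's experiment used ~1M rows and ~300 features):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Small synthetic stand-in for a wide dataset (sizes are illustrative)
X, y = make_classification(
    n_samples=1000, n_features=300, n_informative=50, random_state=0
)

# Stage 1: PCA down to 30 dimensions — fast regardless of input width
X_pca = PCA(n_components=30, random_state=0).fit_transform(X)

# Stage 2: tSNE from 30 dimensions down to 2 for visualization
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_pca)
print(X_2d.shape)  # (1000, 2)
```

Running tSNE on the 30 PCA components instead of the raw 300 features is what keeps the second stage tractable.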
On a synthetic dataset with 1M rows and ~300 features, projecting the data to the first 30 dimensions and then down to 2 took 4.5 hours. Unfortunately, the results aren’t pretty:
That’s why I recommend using UMAP. It is far faster than tSNE and preserves the local structure of the data better:
UMAP managed to find a clear distinction between the target classes, and it did so 20 times faster than tSNE.