Oversampling and Undersampling, Explained: A Visual Guide with Mini 2D Dataset


DATA PREPROCESSING

Artificially generating and deleting data for the greater good

Collecting a dataset where each class has exactly the same number of samples to predict can be a challenge. In reality, things are rarely perfectly balanced, and when you are building a classification model, this can be a problem. When a model is trained on such a dataset, where one class has more examples than the other, it often becomes better at predicting the larger groups and worse at predicting the smaller ones. To help with this issue, we can use techniques like oversampling and undersampling: creating more examples of the smaller group or removing some examples from the larger group.

There are many different oversampling and undersampling methods out there (with intimidating names like SMOTE, ADASYN, and Tomek Links), but there don't seem to be many resources that visually compare how they work. So, here, we will use one simple 2D dataset to show the changes that occur in the data after applying these methods, so we can see how different the output of each method is. You will see in the visuals that the various approaches give different solutions, and who knows, one might be suitable for your specific machine learning challenge!

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

Oversampling

Oversampling makes a dataset more balanced when one group has far fewer examples than the other. It works by making more copies of the examples from the smaller group. This helps the dataset represent both groups more equally.

Undersampling

Undersampling, on the other hand, works by deleting some of the examples from the larger group until it is almost the same size as the smaller group. In the end, the dataset is smaller, yes, but both groups will have a more similar number of examples.

Hybrid Sampling

Combining oversampling and undersampling is often called "hybrid sampling". It increases the size of the smaller group by making more copies of its examples and, at the same time, shrinks the larger group by removing some of its examples. It tries to create a dataset that is more balanced: not too big and not too small.

Let's use a simple artificial golf dataset to show both oversampling and undersampling. This dataset shows what kind of golf activity a person does in a certain weather condition.

Columns: Temperature (0–3), Humidity (0–3), Golf Activity (A=Normal Course, B=Drive Range, or C=Indoor Golf). The training dataset has 2 dimensions and 9 samples.

⚠️ Note that while this small dataset is good for understanding the concepts, in real applications you would want much larger datasets before applying these techniques, as sampling with too little data can lead to unreliable results.
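
The code sketches below show how each technique might be applied in practice. They assume the scikit-learn and imbalanced-learn (imblearn) libraries, which the article itself does not mention, and they use a larger, hypothetical synthetic dataset instead of the 9-point golf data, since (as noted above) these methods need more samples to give reliable results. Later snippets reuse the X and y defined here.

```python
from collections import Counter

from sklearn.datasets import make_classification

# Hypothetical imbalanced 2D dataset with three classes standing in for A, B, C.
# The 10% / 30% / 60% split loosely mirrors the small, medium, and large groups
# in the article; class_sep is lowered so the groups overlap a little.
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=0.8,
    weights=[0.1, 0.3, 0.6], random_state=42,
)
print(Counter(y))  # roughly 30 / 90 / 180 samples per class
```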

Random Oversampling

Random Oversampling is a simple way to make the smaller group bigger. It works by making duplicates of the examples from the smaller group until all the classes are balanced.

👍 Best for very small datasets that need to be balanced quickly
👎 Not recommended for complicated datasets

Random Oversampling simply duplicates selected samples from the smaller group (A) while keeping all samples from the larger groups (B and C) unchanged, as shown by the A×2 markings in the right plot.
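
A minimal sketch, assuming the setup block above: imbalanced-learn's RandomOverSampler duplicates minority samples until every class matches the largest one.

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler

# X, y come from the setup block above
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print(Counter(y_res))  # all classes now as large as the biggest class
```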

SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling technique that makes new examples by interpolating between points of the smaller group. Unlike random oversampling, it doesn't just copy what's there; it uses the existing examples of the smaller group to generate new examples between them.

👍 Best when you have a decent number of examples to work with and need variety in your data
👎 Not recommended if you have only a few examples
👎 Not recommended if data points are too scattered or noisy

SMOTE creates new A samples by choosing pairs of A points and placing new points somewhere along the line between them. Similarly, a new B point is placed between pairs of randomly chosen B points.
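
As a sketch (reusing X and y from the setup block), SMOTE in imbalanced-learn interpolates each new point between a minority sample and one of its nearest minority-class neighbors:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

# k_neighbors controls how many minority neighbors are candidates for interpolation
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y_res))
```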

ADASYN

ADASYN (Adaptive Synthetic) is like SMOTE but focuses on making new examples in the harder-to-learn parts of the smaller group. It finds the examples that are trickiest to classify and makes more new points around those. This helps the model better understand the difficult areas.

👍 Best when some parts of your data are harder to classify than others
👍 Best for complex datasets with difficult areas
👎 Not recommended if your data is fairly simple and straightforward

ADASYN creates more synthetic points from the smaller group (A) in 'difficult areas' where A points are close to other groups (B and C). It also generates new B points in similar areas.
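
A sketch along the same lines; imbalanced-learn's ADASYN weights each minority sample by how many of its neighbors belong to other classes, so more synthetic points appear in the difficult areas:

```python
from collections import Counter

from imblearn.over_sampling import ADASYN

# n_neighbors defines the neighborhood used to judge how "difficult" a sample is
adasyn = ADASYN(n_neighbors=5, random_state=42)
X_res, y_res = adasyn.fit_resample(X, y)
print(Counter(y_res))  # counts are approximately, not exactly, balanced
```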

Undersampling shrinks the larger group to make it closer in size to the smaller group. There are several ways of doing this:

Random Undersampling

Random Undersampling removes examples from the larger group at random until it is the same size as the smaller group. Just like random oversampling, the method is pretty simple, but it might throw away important information that actually shows how different the groups are.

👍 Best for very large datasets with lots of repetitive examples
👍 Best when you need a quick, simple fix
👎 Not recommended if every example in your larger group is important
👎 Not recommended if you cannot afford to lose any information

Random Undersampling removes randomly chosen points from the larger groups (B and C) while keeping all points from the smaller group (A) unchanged.
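
A minimal sketch, again reusing X and y from the setup block:

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler

# Randomly drops samples from the larger classes until they match the smallest one
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))  # every class reduced to the size of the smallest class
```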

Tomek Links

Tomek Links is an undersampling method that makes the "lines" between groups clearer. It searches for pairs of examples from different groups that are very similar. When it finds a pair where the examples are each other's closest neighbors but belong to different groups, it removes the example from the larger group.

👍 Best when your groups overlap too much
👍 Best for cleaning up messy or noisy data
👍 Best when you need clear boundaries between groups
👎 Not recommended if your groups are already well separated

Tomek Links identifies pairs of points from different groups (A-B, B-C) that are closest neighbors to each other. Points from the larger groups (B and C) that form these pairs are then removed, while all points from the smaller group (A) are kept.
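
A sketch with imbalanced-learn's TomekLinks (reusing the setup block). Note that it only removes the majority-class member of each Tomek link, so it cleans boundaries rather than fully balancing the classes:

```python
from collections import Counter

from imblearn.under_sampling import TomekLinks

tl = TomekLinks()  # deterministic, so no random_state is needed
X_res, y_res = tl.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```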

Near Miss

Near Miss is a set of undersampling techniques that work on different rules:

  • Near Miss-1: Keeps examples from the larger group that are closest to the examples in the smaller group.
  • Near Miss-2: Keeps examples from the larger group that have the smallest average distance to their three closest neighbors in the smaller group.
  • Near Miss-3: Keeps examples from the larger group that are furthest away from other examples in their own group.

The main idea here is to keep the most informative examples from the larger group and get rid of the ones that are not as important.

👍 Best when you want control over which examples to keep
👎 Not recommended if you need a simple, quick solution

NearMiss-1 keeps points from the larger groups (B and C) that are closest to the smaller group (A), while removing the rest. Here, only the B and C points nearest to A points are kept.
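
A sketch using imbalanced-learn's NearMiss (setup block assumed); the version parameter selects which of the three rules above is applied:

```python
from collections import Counter

from imblearn.under_sampling import NearMiss

nm = NearMiss(version=1)  # version=1, 2, or 3
X_res, y_res = nm.fit_resample(X, y)
print(Counter(y_res))
```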

ENN

The Edited Nearest Neighbors (ENN) method removes examples that are likely noise or outliers. For each example in the larger group, it checks whether most of its closest neighbors belong to the same group. If they don't, it removes that example. This helps create cleaner boundaries between the groups.

👍 Best for cleaning up messy data
👍 Best when you need to remove outliers
👍 Best for creating cleaner group boundaries
👎 Not recommended if your data is already clean and well-organized

ENN removes points from the larger groups (B and C) whose nearest neighbors mostly belong to a different group. In the right plot, crossed-out points are removed because most of their closest neighbors are from other groups.
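
A sketch with imbalanced-learn's EditedNearestNeighbours (setup block assumed). Like Tomek Links, it removes boundary noise rather than forcing exact balance:

```python
from collections import Counter

from imblearn.under_sampling import EditedNearestNeighbours

# Removes samples from the larger classes whose 3 nearest neighbors
# mostly disagree with their own label
enn = EditedNearestNeighbours(n_neighbors=3)
X_res, y_res = enn.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```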

SMOTETomek

SMOTETomek works by first creating new examples for the smaller group using SMOTE, then cleaning up messy boundaries by removing "confusing" examples using Tomek Links. This helps create a more balanced dataset with clearer boundaries and less noise.

👍 Best for severely imbalanced data
👍 Best when you need both more examples and cleaner boundaries
👍 Best when dealing with noisy, overlapping groups
👎 Not recommended if your data is already clean and well-organized
👎 Not recommended for small datasets

SMOTETomek combines two steps: first applying SMOTE to create new A points along lines between existing A points (shown in the middle plot), then removing Tomek Links from the larger groups (B and C). The result has more balanced groups with clearer boundaries between them.
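
A sketch of the combined approach with imbalanced-learn (setup block assumed):

```python
from collections import Counter

from imblearn.combine import SMOTETomek

# Oversamples with SMOTE, then removes Tomek links from the resampled data
smt = SMOTETomek(random_state=42)
X_res, y_res = smt.fit_resample(X, y)
print(Counter(y_res))
```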

SMOTEENN

SMOTEENN works by first creating new examples for the smaller group using SMOTE, then cleaning up both groups by removing examples that don't fit well with their neighbors using ENN. Just like SMOTETomek, this helps create a cleaner dataset with clearer borders between the groups.

👍 Best for cleaning up both groups at once
👍 Best when you need more examples but cleaner data
👍 Best when dealing with lots of outliers
👎 Not recommended if your data is already clean and well-organized
👎 Not recommended for small datasets

SMOTEENN combines two steps: first using SMOTE to create new A points along lines between existing A points (middle plot), then applying ENN to remove points from the larger groups (B and C) whose nearest neighbors are mostly from different groups. The final plot shows the cleaned, balanced dataset.
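
And the corresponding sketch for SMOTEENN (setup block assumed):

```python
from collections import Counter

from imblearn.combine import SMOTEENN

# Oversamples with SMOTE, then cleans both groups with Edited Nearest Neighbours
sme = SMOTEENN(random_state=42)
X_res, y_res = sme.fit_resample(X, y)
print(Counter(y_res))
```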