How to Tackle Class Imbalance Without Resampling


Imbalanced classification is a relevant machine learning task. This problem is usually handled with one of three approaches: resampling, cost-sensitive models, or threshold tuning.

In this article, you’ll learn a different approach. We’ll explore how to use clustering analysis to tackle imbalanced classification.

Many real-world problems involve imbalanced data sets. In these, one of the classes is rare and, usually, more important to users.

Take fraud detection for instance. Fraud cases are rare instances amongst vast amounts of normal activity. The accurate detection of rare but fraudulent activity is key across many domains. Other common examples involving imbalanced data sets include customer churn or credit default prediction.

Imbalanced distributions are a challenge for machine learning algorithms. There’s relatively little information about the minority class. This hinders the ability of algorithms to train good models because they tend to be biased toward the majority class.

There are three standard approaches for coping with class imbalance:

  1. Resampling methods;
  2. Cost-sensitive models;
  3. Threshold tuning.

Resampling is arguably the most popular strategy for handling imbalanced classification tasks. This type of method transforms the training set to increase the relevance of the minority class.

Resampling can be used to create new cases for the minority class (over-sampling), discard cases from the majority class (under-sampling), or a combination of both.

Here’s an example of how resampling methods work using the imblearn library:

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

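# synthetic data set with roughly 10% of the cases in the minority class
X_train, y_train = make_classification(n_samples=500, n_features=5, n_informative=3, weights=[0.9, 0.1])
# over-sample the minority class until both classes have the same size
X_res, y_res = SMOTE().fit_resample(X_train, y_train)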

Resampling methods are versatile and easy to couple with any learning algorithm. But they have some limitations.

Under-sampling the majority class may lead to the loss of important information. Over-sampling may increase the chance of overfitting. This happens if resampling propagates noise from cases of the minority class.

There are some alternatives to resampling the training data. These include tuning the decision threshold or using cost-sensitive models. Different thresholds lead to distinct precision and recall scores. So, adjusting the decision threshold can improve the performance of models.

Cost-sensitive models work by assigning different costs to misclassification errors. Errors in the minority class are typically more costly. This approach requires domain expertise to define the cost of each type of error.
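To make the effect of the decision threshold concrete, here is a minimal sketch. The synthetic data, the logistic regression model, and the two cut-off values (0.5 and 0.2) are assumptions chosen for illustration only; they are not part of the method discussed in this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 10% of the cases are in the positive class
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

# Compare the default cut-off (0.5) with a lower, illustrative one (0.2)
results = {}
for threshold in (0.5, 0.2):
    preds = (probs >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, preds, zero_division=0),
        recall_score(y_te, preds),
    )
    print(threshold, results[threshold])
```

Lowering the threshold can only add positive predictions, so recall never decreases; precision usually drops in exchange. In practice the threshold would be picked on a validation set rather than fixed by hand.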
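As a rough illustration of a cost-sensitive model, the sketch below uses the `class_weight` parameter available in many scikit-learn classifiers. The 1:10 cost ratio is a hypothetical value picked for the example; in a real project it would come from domain knowledge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 10% of the cases are in the minority class
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Hypothetical costs: an error on a minority case (class 1) counts
# ten times as much as an error on a majority case (class 0)
costs = {0: 1, 1: 10}

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight=costs).fit(X_tr, y_tr)

# Number of test instances each model flags as minority
n_plain = int(plain.predict(X_te).sum())
n_weighted = int(weighted.predict(X_te).sum())
print(n_plain, n_weighted)
```

Weighting the minority class more heavily typically makes the model flag more instances as minority cases, trading some precision for recall.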

Most resampling methods work by finding instances near the decision boundary, the frontier that separates the instances of the majority class from those of the minority class. Borderline cases are, in principle, the most difficult to classify. So, they’re used to drive the resampling process.

Decision boundary of an SVM model. Original: Alisneaky; vector: Zirguezi, CC BY-SA 4.0.

For instance, ADASYN is a popular over-sampling technique. It creates artificial instances using cases from the minority class whose nearest neighbors are from the majority class.

Finding borderline cases with clustering analysis

We can also capture which observations are near the decision boundary using clustering analysis.

Suppose there’s a cluster whose observations all belong to the majority class. This suggests that the cluster is relatively far from the decision boundary, on the side of the majority class. Generally, those observations are easy to model.

On the other hand, an instance can be considered borderline if it belongs to a cluster that contains both classes.

We can use this information to build a hierarchical model for imbalanced classification.
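The idea can be sketched on a toy data set. The points, labels, and the choice of exactly three clusters below are assumptions made for illustration; they are not how the number of clusters is chosen in the implementation shown later.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy 1-D data: three well-separated groups of three points each
X = np.array([[0.0], [0.1], [0.2],      # majority class only
              [5.0], [5.1], [5.2],      # both classes
              [10.0], [10.1], [10.2]])  # majority class only
y = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0])

# Ward hierarchical clustering, cut into three clusters (fixed by hand here)
Z = linkage(pdist(X), method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")

# An instance is borderline if its cluster contains both classes
borderline = np.zeros(len(y), dtype=bool)
for lab in np.unique(labels):
    members = labels == lab
    if len(np.unique(y[members])) > 1:
        borderline[members] = True

print(np.where(borderline)[0])  # [3 4 5]
```

Only the middle group is flagged, because it is the only cluster where a minority case sits among majority cases.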

How to build a hierarchical model for imbalanced classification

We build a hierarchical model based on two levels.

In the first level, a model is built to separate easy instances from borderline ones. So, the goal is to predict whether an input instance belongs to a cluster with at least one observation from the minority class.

In the second level, we discard the easy cases. Then, we build a model to solve the original classification task with the remaining data. The first level affects the second by removing easy instances from the training set.

In both levels, the imbalance problem is reduced, which makes the modeling task simpler.
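A small worked example may help to see how the two training sets are derived. The cluster assignment below is hypothetical and hard-coded; it stands in for the output of a clustering step.

```python
import numpy as np

# Hypothetical cluster assignment for eight training instances
clusters = np.array([1, 1, 1, 2, 2, 2, 3, 3])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1])

# Flag every instance whose cluster has at least one minority observation
keep = np.zeros(len(y), dtype=bool)
for lab in np.unique(clusters):
    members = clusters == lab
    if y[members].any():
        keep[members] = True

# Level 1 target: 1 for instances in flagged clusters, 0 for the easy ones
y_l1 = keep.astype(int)

# Level 2 target: the original labels, restricted to the flagged instances
y_l2 = y[keep]

print(y_l1)  # [0 0 0 1 1 1 1 1]
print(y_l2)  # [0 1 0 1 1]
```

In this toy case the positive class makes up 3 of 8 instances in the original task, 5 of 8 in the first level, and 3 of 5 in the second, so both sub-problems are less imbalanced than the original one.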

Python implementation

The method described above is called ICLL (for Imbalanced Classification via Layered Learning). Here’s its implementation:

from collections import Counter
from typing import List

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist


class ICLL:
    """
    Imbalanced Classification via Layered Learning
    """

    def __init__(self, model_l1, model_l2):
        """
        :param model_l1: Predictive model for the first layer
        :param model_l2: Predictive model for the second layer
        """
        self.model_l1 = model_l1
        self.model_l2 = model_l2
        self.clusters = []
        self.mixed_arr = np.array([])

    def fit(self, X: pd.DataFrame, y: np.ndarray):
        """
        :param X: Explanatory variables
        :param y: binary target variable
        """
        assert isinstance(X, pd.DataFrame)
        X = X.reset_index(drop=True)

        if isinstance(y, pd.Series):
            y = y.values

        self.clusters = self.clustering(X=X)

        # indices of instances in clusters with at least one minority case
        self.mixed_arr = self.cluster_to_layers(clusters=self.clusters, y=y)

        # first layer target: 1 for those instances, 0 for the easy ones
        y_l1 = y.copy()
        y_l1[self.mixed_arr] = 1

        # second layer: original target, without the easy instances
        X_l2 = X.loc[self.mixed_arr, :]
        y_l2 = y[self.mixed_arr]

        self.model_l1.fit(X, y_l1)
        self.model_l2.fit(X_l2, y_l2)

    def predict(self, X):
        """
        Predicting new instances
        """

        yh_l1, yh_l2 = self.model_l1.predict(X), self.model_l2.predict(X)

        # positive only if both layers predict the positive class
        yh_f = np.asarray([x1 * x2 for x1, x2 in zip(yh_l1, yh_l2)])

        return yh_f

    def predict_proba(self, X):
        """
        Probabilistic predictions
        """

        yh_l1_p = self.model_l1.predict_proba(X)
        try:
            yh_l1_p = np.array([x[1] for x in yh_l1_p])
        except IndexError:
            # the first layer model returned a single column
            yh_l1_p = yh_l1_p.flatten()

        yh_l2_p = self.model_l2.predict_proba(X)
        yh_l2_p = np.array([x[1] for x in yh_l2_p])

        # product of the positive class probabilities of the two layers
        yh_fp = np.asarray([x1 * x2 for x1, x2 in zip(yh_l1_p, yh_l2_p)])

        return yh_fp

    @classmethod
    def cluster_to_layers(cls, clusters: List[np.ndarray], y: np.ndarray) -> np.ndarray:
        """
        Defining the layers from clusters
        """

        maj_cls, min_cls, both_cls = [], [], []
        for clst in clusters:
            y_clt = y[np.asarray(clst)]

            if len(Counter(y_clt)) == 1:
                # pure cluster: majority only or minority only
                if y_clt[0] == 0:
                    maj_cls.append(clst)
                else:
                    min_cls.append(clst)
            else:
                # cluster with both classes
                both_cls.append(clst)

        both_cls_ind = np.array(sorted(np.concatenate(both_cls).ravel()))
        both_cls_ind = np.unique(both_cls_ind)

        if len(min_cls) > 0:
            min_cls_ind = np.array(sorted(np.concatenate(min_cls).ravel()))
        else:
            min_cls_ind = np.array([])

        both_cls_ind = np.unique(np.concatenate([both_cls_ind, min_cls_ind])).astype(int)

        return both_cls_ind

    @classmethod
    def clustering(cls, X, method='ward'):
        """
        Hierarchical clustering analysis
        """

        d = pdist(X)

        Z = linkage(d, method)
        # cut the tree at mean + std of the log-transformed merge heights
        Z[:, 2] = np.log(1 + Z[:, 2])
        sZ = np.std(Z[:, 2])
        mZ = np.mean(Z[:, 2])

        clust_labs = fcluster(Z, mZ + sZ, criterion='distance')

        clusters = []
        for lab in np.unique(clust_labs):
            clusters.append(np.where(clust_labs == lab)[0])

        return clusters

The clustering step is done automatically, without user input. So, the only thing you need to define is the learning algorithm at each level of the hierarchy.

Below is an example of how you can use the method. In this case, the model in each level is a Random Forest.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as RFC

# https://github.com/vcerqueira/blog/blob/main/src/icll.py
from src.icll import ICLL

# making a dummy data set
X, y = make_classification(n_samples=500, n_features=5, n_informative=3)
X = pd.DataFrame(X)

# creating an instance of the model
icll = ICLL(model_l1=RFC(), model_l2=RFC())
# training
icll.fit(X, y)
# probabilistic predictions
probs = icll.predict_proba(X)

A more serious example

How does the hierarchical method compare with resampling?

Below is a comparison based on a data set related to diabetes. You can check reference [1] for details. Here’s how we can apply both methods to this data:

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
from imblearn.over_sampling import SMOTE

# loading diabetes dataset https://github.com/vcerqueira/blog/tree/main/data
data = pd.read_csv('data/pima.csv')

X, y = data.drop('target', axis=1), data['target']
X = X.fillna(X.mean())

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# resampling with SMOTE
X_res, y_res = SMOTE().fit_resample(X_train, y_train)

# creating the models
smote = RFC()
icll = ICLL(model_l1=RFC(), model_l2=RFC())

# training
smote.fit(X_res, y_res)
icll.fit(X_train, y_train)

# inference
smote_probs = smote.predict_proba(X_test)
icll_probs = icll.predict_proba(X_test)

Below is the ROC curve for each approach:

ROC curve for each method. Image by author.

ICLL’s curve is closer to the top-left corner, which indicates it is the better model on this data set.

Many more experiments were carried out in the paper in reference [2], where ICLL was presented. The results suggest that ICLL provides competitive performance in imbalanced classification problems. You can check the code for the experiments on GitHub.
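A ROC curve can also be summarized with the area under it (AUC). The sketch below shows how `roc_curve` and `roc_auc_score` from scikit-learn are typically used; the labels and scores are hypothetical values for illustration, not results from the diabetes data set.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and positive-class scores, for illustration only
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

print(auc)  # 0.75
```

To apply this to the comparison above, note that scikit-learn’s `predict_proba` returns one column per class, so the positive-class scores of the SMOTE model would be `smote_probs[:, 1]`, whereas `ICLL.predict_proba` already returns a one-dimensional array.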


