## Imbalanced classification beyond resampling, threshold tuning, or cost-sensitive models

Imbalanced classification is an important machine learning task. This problem is often handled with one of three approaches: resampling, cost-sensitive models, or threshold tuning.

In this article, you’ll learn a different approach. We’ll explore how to use clustering analysis to tackle imbalanced classification.

Many real-world problems involve imbalanced data sets. In these, one of the classes is rare and, usually, more important to users.

Take fraud detection, for instance. Fraud cases are rare among vast amounts of normal activity, and detecting them accurately is key across many domains. Other common examples involving imbalanced data sets include customer churn and credit default prediction.

Imbalanced distributions are a challenge for machine learning algorithms. There’s relatively little information about the minority class. This hinders the ability of algorithms to train good models because they tend to be biased toward the majority class.

There are three standard approaches for coping with class imbalance:

- Resampling methods;
- Cost-sensitive models;
- Threshold tuning.

Resampling is arguably the most popular strategy for handling imbalanced classification tasks. This type of method transforms the training set to increase the relevance of the minority class.

Resampling can be used to create new cases for the minority class (over-sampling), discard cases from the majority class (under-sampling), or a combination of both.

Here’s an example of how resampling methods work using the *imblearn* library:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X_train, y_train = make_classification(n_samples=500, n_features=5, n_informative=3)

X_res, y_res = SMOTE().fit_resample(X_train, y_train)
```

Resampling methods are versatile and easy to couple with any learning algorithm. But they have some limitations.

Under-sampling the majority class may result in the loss of important information. Over-sampling may increase the chance of overfitting. This happens if resampling propagates noise from cases of the minority class.

There are some alternatives to resampling the training data. These include tuning the decision threshold or using cost-sensitive models. Different thresholds lead to different precision and recall scores. So, adjusting the decision threshold can improve the performance of models.
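To make the effect of the threshold concrete, here is a minimal sketch. The synthetic data, the logistic regression model, and the three candidate thresholds are assumptions chosen for illustration, not part of the method discussed in this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 10% of cases in the minority class (assumed)
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]  # probability of the minority class


def scores_at(threshold):
    """Precision and recall of the minority class at a given threshold."""
    preds = (probs >= threshold).astype(int)
    return (precision_score(y_te, preds, zero_division=0),
            recall_score(y_te, preds, zero_division=0))


for t in (0.5, 0.3, 0.1):
    precision, recall = scores_at(t)
    print(f"threshold={t:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold can only add positive predictions, so recall never decreases, usually at the expense of precision.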

Cost-sensitive models work by assigning different costs to misclassification errors. Errors in the minority class are typically more costly. This approach requires domain expertise to define the cost of each type of error.
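One common way to express such costs in scikit-learn is the `class_weight` parameter. The sketch below assumes, purely for illustration, that missing a minority case is 10 times as costly as a false alarm; in practice these values would come from domain knowledge. The helper `total_cost` is a hypothetical function written for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical costs: a false negative is assumed to cost 10, a false positive 1
COST_FN, COST_FP = 10, 1


def total_cost(y_true, y_pred, cost_fn=COST_FN, cost_fp=COST_FP):
    """Total misclassification cost under the assumed cost values."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return cost_fn * fn + cost_fp * fp


X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Errors on minority cases (class 1) weigh 10 times more during training
weighted = LogisticRegression(max_iter=1000,
                              class_weight={0: COST_FP, 1: COST_FN}).fit(X_tr, y_tr)

for name, model in (("plain", plain), ("cost-sensitive", weighted)):
    print(name, total_cost(y_te, model.predict(X_te)))
```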

Most resampling methods work by finding instances close to the decision boundary, the frontier that separates the instances of the majority class from those of the minority class. Borderline cases are, in principle, the most difficult to classify. So, they are used to drive the resampling process.

For instance, ADASYN is a popular over-sampling technique. It creates synthetic instances using cases from the minority class whose nearest neighbors are from the majority class.
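The neighborhood idea can be sketched with scikit-learn’s `NearestNeighbors`. This is not the ADASYN implementation, only an illustration of the quantity that drives it: for each minority case, the share of its nearest neighbors that belong to the majority class. The function name `majority_neighbor_ratio` and the toy data are assumptions made for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def majority_neighbor_ratio(X, y, k=5, minority_label=1):
    """Share of majority-class cases among the k nearest neighbors
    of each minority case (1.0 means fully surrounded by the majority)."""
    X, y = np.asarray(X), np.asarray(y)
    # k + 1 neighbors because each training point is returned as its own
    # nearest neighbor (assuming there are no duplicate rows)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority_label])
    neighbor_labels = y[idx[:, 1:]]  # drop the point itself
    return (neighbor_labels != minority_label).mean(axis=1)


# Toy data: one minority case next to the majority, three far away from it
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [100.0], [101.0], [102.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])

ratios = majority_neighbor_ratio(X, y, k=2)
print(ratios)
```

Minority cases with a high ratio sit close to the majority class, and these are the ones a method like ADASYN focuses on when generating synthetic instances.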

## Finding borderline cases with clustering analysis

We can also identify which observations are close to the decision boundary using clustering analysis.

Suppose there’s a cluster whose observations all belong to the majority class. This may mean that the cluster is far from the decision boundary, on the side of the majority class. In general, those observations are easy to model.

On the other hand, an instance can be considered borderline if it belongs to a cluster that contains both classes.

We can use this information to build a hierarchical model for imbalanced classification.
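As a minimal sketch of this idea, the snippet below clusters a toy data set with SciPy and flags the observations that fall into clusters containing both classes. The function name `borderline_indices`, the toy points, and the choice of a fixed number of clusters are assumptions for illustration; the implementation later in this article picks the cut-off automatically.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def borderline_indices(X, y, n_clusters):
    """Indices of observations in clusters that contain both classes."""
    y = np.asarray(y)
    Z = linkage(X, method="ward")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    mixed = [lab for lab in np.unique(labels)
             if len(np.unique(y[labels == lab])) > 1]
    return np.where(np.isin(labels, mixed))[0]


# Three well-separated groups: all-majority, mixed, and all-minority
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10],
              [20, 0], [20, 1], [21, 0]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1])

print(borderline_indices(X, y, n_clusters=3))
```

Only the observations of the middle group are flagged, because that is the only cluster where both classes appear.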

## How to build a hierarchical model for imbalanced classification

We build a hierarchical model based on two levels.

In the first level, a model is built to separate easy instances from borderline ones. So, the goal is to predict whether an input instance belongs to a cluster with at least one observation from the minority class.

In the second level, we discard the easy cases. Then, we build a model to solve the original classification task with the remaining data. The first level affects the second by removing easy instances from the training set.

In both levels, the imbalance problem is reduced, which makes the modeling task easier.
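A small worked example shows why the imbalance shrinks at both levels. The counts below are hypothetical. The only structural fact used is that every minority case belongs to a cluster with at least one minority observation, so all minority cases reach the second level.

```python
# Hypothetical counts, chosen only to illustrate the arithmetic
n_total = 1000       # training cases
n_minority = 50      # minority-class cases (5% of the data)
n_borderline = 200   # cases in clusters with at least one minority case


def positive_share(n_positive, n_cases):
    """Fraction of positive cases in a training set."""
    return n_positive / n_cases


original = positive_share(n_minority, n_total)       # original task
level_1 = positive_share(n_borderline, n_total)      # borderline vs. easy
level_2 = positive_share(n_minority, n_borderline)   # minority vs. majority

print(f"original: {original:.0%}, level 1: {level_1:.0%}, level 2: {level_2:.0%}")
```

With these numbers, the positive class goes from 5% in the original task to 20% in the first level and 25% in the second.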

## Python implementation

The method described above is called ICLL (for Imbalanced Classification via Layered Learning). Here’s its implementation:

```python
from collections import Counter
from typing import List

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist


class ICLL:
    """
    Imbalanced Classification via Layered Learning
    """

    def __init__(self, model_l1, model_l2):
        """
        :param model_l1: Predictive model for the first layer
        :param model_l2: Predictive model for the second layer
        """
        self.model_l1 = model_l1
        self.model_l2 = model_l2
        self.clusters = []
        self.mixed_arr = np.array([])

    def fit(self, X: pd.DataFrame, y: np.ndarray):
        """
        :param X: Explanatory variables
        :param y: binary target variable
        """
        assert isinstance(X, pd.DataFrame)
        X = X.reset_index(drop=True)

        if isinstance(y, pd.Series):
            y = y.values

        self.clusters = self.clustering(X=X)
        self.mixed_arr = self.cluster_to_layers(clusters=self.clusters, y=y)

        # first layer target: 1 for borderline cases, 0 for easy majority cases
        y_l1 = y.copy()
        y_l1[self.mixed_arr] = 1

        # second layer data: borderline cases only, with the original target
        X_l2 = X.loc[self.mixed_arr, :]
        y_l2 = y[self.mixed_arr]

        self.model_l1.fit(X, y_l1)
        self.model_l2.fit(X_l2, y_l2)

    def predict(self, X):
        """
        Predicting new instances
        """
        yh_l1, yh_l2 = self.model_l1.predict(X), self.model_l2.predict(X)
        # positive only if both layers predict the positive class
        yh_f = np.asarray([x1 * x2 for x1, x2 in zip(yh_l1, yh_l2)])

        return yh_f

    def predict_proba(self, X):
        """
        Probabilistic predictions
        """
        yh_l1_p = self.model_l1.predict_proba(X)
        try:
            yh_l1_p = np.array([x[1] for x in yh_l1_p])
        except IndexError:
            # the first-layer model returned a single column
            yh_l1_p = yh_l1_p.flatten()

        yh_l2_p = self.model_l2.predict_proba(X)
        yh_l2_p = np.array([x[1] for x in yh_l2_p])

        # the final score is the product of the scores of the two layers
        yh_fp = np.asarray([x1 * x2 for x1, x2 in zip(yh_l1_p, yh_l2_p)])

        return yh_fp

    @classmethod
    def cluster_to_layers(cls, clusters: List[np.ndarray], y: np.ndarray) -> np.ndarray:
        """
        Defining the layers from clusters
        """
        maj_cls, min_cls, both_cls = [], [], []
        for clst in clusters:
            y_clt = y[np.asarray(clst)]

            if len(Counter(y_clt)) == 1:
                if y_clt[0] == 0:
                    maj_cls.append(clst)
                else:
                    min_cls.append(clst)
            else:
                both_cls.append(clst)

        # guard against the case where no cluster contains both classes
        if len(both_cls) > 0:
            both_cls_ind = np.unique(np.concatenate(both_cls).ravel())
        else:
            both_cls_ind = np.array([])

        if len(min_cls) > 0:
            min_cls_ind = np.array(sorted(np.concatenate(min_cls).ravel()))
        else:
            min_cls_ind = np.array([])

        # borderline cases: mixed clusters plus clusters with only minority cases
        both_cls_ind = np.unique(np.concatenate([both_cls_ind, min_cls_ind])).astype(int)

        return both_cls_ind

    @classmethod
    def clustering(cls, X, method='ward'):
        """
        Hierarchical clustering analysis
        """
        d = pdist(X)

        Z = linkage(d, method)
        Z[:, 2] = np.log(1 + Z[:, 2])
        sZ = np.std(Z[:, 2])
        mZ = np.mean(Z[:, 2])

        # cut the tree at the mean plus one standard deviation of the log-distances
        clust_labs = fcluster(Z, mZ + sZ, criterion='distance')

        clusters = []
        for lab in np.unique(clust_labs):
            clusters.append(np.where(clust_labs == lab)[0])

        return clusters
```

The clustering part is done automatically, without user input. So, the only thing you need to define is the learning algorithm at each level of the hierarchy.

Below is an example of how you can use the method. In this case, the model at each level is a Random Forest.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as RFC

# https://github.com/vcerqueira/blog/blob/foremost/src/icll.py
from src.icll import ICLL

# creating a dummy data set
X, y = make_classification(n_samples=500, n_features=5, n_informative=3)
X = pd.DataFrame(X)

# creating an instance of the model
icll = ICLL(model_l1=RFC(), model_l2=RFC())

# training
icll.fit(X, y)

# probabilistic predictions
probs = icll.predict_proba(X)
```

## A more serious example

How does the hierarchical method compare with resampling?

Below is a comparison based on a data set related to diabetes. You can check reference [1] for details. Here’s how we can apply both methods to this data:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
from imblearn.over_sampling import SMOTE

from src.icll import ICLL

# loading diabetes dataset https://github.com/vcerqueira/blog/tree/foremost/data
data = pd.read_csv('data/pima.csv')

X, y = data.drop('target', axis=1), data['target']
X = X.fillna(X.mean())

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# resampling with SMOTE
X_res, y_res = SMOTE().fit_resample(X_train, y_train)

# creating the models
smote = RFC()
icll = ICLL(model_l1=RFC(), model_l2=RFC())

# training
smote.fit(X_res, y_res)
icll.fit(X_train, y_train)

# inference
# smote_probs has one column per class; icll_probs is a single array of scores
smote_probs = smote.predict_proba(X_test)
icll_probs = icll.predict_proba(X_test)
```

Below is the ROC curve for each approach:

ICLL’s curve is closer to the top-left corner, which indicates it’s the better model.
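A visual comparison of ROC curves can be complemented with the area under the curve (AUC). The sketch below uses scikit-learn’s `roc_auc_score` on made-up scores for two hypothetical models, `scores_a` and `scores_b`. These numbers are not results from the diabetes data set.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented for illustration only
y_true = np.array([0, 0, 1, 1])
scores_a = np.array([0.1, 0.4, 0.35, 0.8])  # one positive ranked below a negative
scores_b = np.array([0.1, 0.2, 0.7, 0.9])   # all positives ranked above negatives

auc_a = roc_auc_score(y_true, scores_a)
auc_b = roc_auc_score(y_true, scores_b)

print(f"model A: {auc_a:.2f}, model B: {auc_b:.2f}")  # model A: 0.75, model B: 1.00
```

A curve closer to the top-left corner corresponds to a larger AUC, so the model with the higher value ranks minority cases above majority cases more often.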

Many more experiments were carried out in the paper in reference [2], where ICLL was presented. The results suggest that ICLL provides competitive performance in imbalanced classification problems. You can check the code for the experiments on GitHub.
