Model Selection with Imbalance Data: Only AUC may Not Prevent

Photo by Mpho Mojapelo on Unsplash

Most data scientists who present ML results to business stakeholders in meetings end up answering questions like this one:

AUC? What is it? Could you please elaborate?

This happens all the time when artificial intelligence products are developed to solve real-world problems. In this scenario, data scientists collaborate with domain experts to understand the dynamics of the field and incorporate them into automated solutions.

In many situations, adopting machine learning may be useless because the tasks can be solved with simple automation rules, or because there is no evidence in the available data that justifies the use of artificial intelligence techniques.

Selecting AUC as the metric to show business stakeholders the strengths of the adopted machine learning approach may be dangerous. Firstly, because the definition of AUC may not be clear to everyone. Secondly, because business people are money-oriented: if they don’t understand that the proposed solution saves them time or money, they will likely reject it.

In this post, we don’t suggest a method for choosing the right business metric. Instead, we focus on a more technical problem that is strictly related to the metric definition: model selection. The aim is to analyze how simple decisions (like metric selection or threshold tuning) influence the final results and how these relate to the business goals.

We start by simulating an imbalanced tabular dataset with 90% negative and 10% positive target samples.

Target distribution of simulated data [image by the author]
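A minimal sketch of how such a dataset could be simulated, assuming scikit-learn’s make_classification. The sample size, number of features, and seed are illustrative choices, not the ones used in the original experiment.

```python
from sklearn.datasets import make_classification

# Illustrative settings (assumed): only the 90/10 class split comes from the text.
X, y = make_classification(
    n_samples=10_000, n_features=20, n_informative=5,
    weights=[0.9, 0.1],   # 90% negative, 10% positive
    flip_y=0,             # no label noise, so the class split stays close to 90/10
    random_state=1234,
)

# Share of positive (minority) samples in the simulated target.
positive_share = y.mean()
print(f"positive share: {positive_share:.3f}")
```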

We can think of the minority class (10% of the samples in our case) as the customers who churned in a fixed time range, as the failures that occurred in an engine system, or as the frauds that took place.

Imbalance is difficult to deal with. Instead of fighting extreme imbalance, a better approach, which is simple and works in most cases, consists in leveraging it during the learning phase. By applying a reasonable undersampling ratio, it’s possible to make the models learn from the data. Moreover, the imbalanced nature of the phenomenon is preserved and reproduced at inference time.

Comparing techniques to handle target imbalance [image by the author]
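One possible way to apply an undersampling ratio during training is sketched below with plain NumPy. The helper undersample_majority and its ratio parameter are hypothetical names introduced here for illustration. The post does not show which resampling implementation was actually used.

```python
import numpy as np

def undersample_majority(X, y, ratio=0.5, random_state=1234):
    """Randomly drop majority-class rows until n_minority / n_majority == ratio.

    Hypothetical helper: `ratio` is an assumed knob, not a parameter from the post.
    """
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    idx_min = np.flatnonzero(y == minority)
    idx_maj = np.flatnonzero(y == majority)
    # Number of majority rows to keep, never more than what is available.
    n_keep = min(len(idx_maj), int(round(len(idx_min) / ratio)))
    keep_maj = rng.choice(idx_maj, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([idx_min, keep_maj]))
    return X[keep], y[keep]

# Toy data with a 90/10 split, mirroring the simulated target distribution.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 900 + [1] * 100)

X_res, y_res = undersample_majority(X, y, ratio=0.5)
print(np.bincount(y_res))
```

The resampling should touch the training portion only, so that validation and test data keep the original class proportions seen at inference time.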

With this simple modeling strategy in mind, we’re ready to dive into model selection.

Machine Learning use cases lifecycle [image by the author]

By scoring metrics, we mean all the metrics that evaluate the goodness of fit using the predicted probabilities. In a binary classification context, the best-known scoring metrics are AUC, average precision, and cross-entropy.

Using a scoring metric in this situation seems a reasonable solution. We evaluate the goodness of fit independently of a hard threshold, like the one used to compute accuracy, precision, recall, or Fbeta.
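To make the distinction concrete, here is a small sketch on hypothetical toy labels and probabilities (not the article’s data). Scoring metrics consume the probabilities directly, while a threshold-based metric such as precision changes with the cut-off we pick.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, log_loss,
                             precision_score, roc_auc_score)

# Hypothetical toy labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
proba = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9])

# Scoring metrics: computed on the probabilities, no threshold involved.
auc = roc_auc_score(y_true, proba)
ap = average_precision_score(y_true, proba)
ce = log_loss(y_true, proba)

def precision_at(th):
    """Precision after turning probabilities into classes with cut-off `th`."""
    return precision_score(y_true, (proba > th).astype(int))

# Threshold-based metric: the value depends on the chosen cut-off.
precision_default = precision_at(0.5)
precision_strict = precision_at(0.65)

print(f"AUC={auc:.3f}  AP={ap:.3f}  cross-entropy={ce:.3f}")
print(f"precision@0.50={precision_default:.3f}  precision@0.65={precision_strict:.3f}")
```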

Coming back to our simulated use case, suppose one of the requirements defined by the business stakeholders is to obtain high precision on the minority class. How can we carry out model selection and parameter tuning to satisfy this request?

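from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

model = RandomizedSearchCV(
    RandomForestClassifier(random_state=1234),  # estimator settings elided in the source; assumed
    {'n_estimators': randint(50, 300)},  # search range elided in the source; assumed
    n_iter=20, random_state=1234,
    cv=5, n_jobs=-1,
    refit=False, error_score='raise',
    scoring={'roc_auc': 'roc_auc', 'average_precision': 'average_precision',
             'fbeta': make_scorer(fbeta_score, beta=0.1)},
).fit(X, y)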

We set up a randomized search with a random forest, looking for the optimal number of trees. We record cross-validated scores for AUC, average precision, and Fbeta. We choose a low beta (0.1) for Fbeta so that it gives more weight to precision (what we are trying to optimize). The trial results are reported in the plots below.

Fbeta as a function of AUC (on the left) and average precision (on the right) [image by the author]
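As a side note on why Fbeta with beta=0.1 tracks precision, recall that Fbeta = (1 + beta²) · P · R / (beta² · P + R). The sketch below uses hypothetical toy predictions (perfect precision, poor recall) to show how a small beta keeps the score close to precision, while F1 is dragged down by recall.

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical toy predictions: one true positive, no false positives, three misses.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

precision = precision_score(y_true, y_pred)  # 1 / 1
recall = recall_score(y_true, y_pred)        # 1 / 4

def fbeta_manual(p, r, beta):
    """Fbeta written out from its definition, for comparison with scikit-learn."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

f_low_beta = fbeta_score(y_true, y_pred, beta=0.1)  # precision-oriented
f1 = fbeta_score(y_true, y_pred, beta=1.0)          # balanced

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"F0.1={f_low_beta:.3f} (manual {fbeta_manual(precision, recall, 0.1):.3f})")
print(f"F1={f1:.3f}")
```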

As expected, there isn’t a clear relationship between AUC/average precision and Fbeta.

At this point, with our “optimal” parameter configuration chosen based on AUC, we have to carry out an extra fine-tuning step, on a newer set of data, to select a hard threshold that maximizes precision and makes our stakeholders happy.

There is nothing wrong with doing this, but is there a more efficient approach?

Embedding the threshold search inside the model training is simple. After fitting the model, we look for the threshold that optimizes the metric of interest (Fbeta in our case). This is done automatically on a validation set obtained by splitting the received training data. The predicted classes are obtained by discretizing the probabilities based on the tuned threshold.

import numpy as np
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.base import clone, BaseEstimator, ClassifierMixin

class ThresholdClassifier(BaseEstimator, ClassifierMixin):
    """Wrap a probabilistic binary classifier and tune its decision threshold."""

    def __init__(self, estimator, refit=True, val_size=0.3):
        self.estimator = estimator
        self.refit = refit
        self.val_size = val_size

    def fit(self, X, y):

        def scoring(th, y, prob):
            # Negated Fbeta (lower is better); 0 if nothing is predicted positive.
            pred = (prob > th).astype(int)
            return 0 if not pred.any() else -fbeta_score(y, pred, beta=0.1)

        # Hold out a stratified validation set for the threshold search.
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, stratify=y, test_size=self.val_size,
            shuffle=True, random_state=1234
        )

        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X_train, y_train)

        # Scan candidate thresholds on the validation probabilities.
        prob_val = self.estimator_.predict_proba(X_val)[:,1]
        thresholds = np.linspace(0,1, 200)[1:-1]
        scores = [scoring(th, y_val, prob_val)
                  for th in thresholds]
        self.score_ = np.min(scores)  # negated Fbeta at the best threshold
        self.th_ = thresholds[np.argmin(scores)]

        # Optionally refit on all the data, keeping the tuned threshold.
        if self.refit:
            self.estimator_.fit(X, y)
        if hasattr(self.estimator_, 'classes_'):
            self.classes_ = self.estimator_.classes_

        return self

    def predict(self, X):
        # Discretize the positive-class probabilities with the tuned threshold.
        proba = self.estimator_.predict_proba(X)[:,1]
        return (proba > self.th_).astype(int)

    def predict_proba(self, X):
        # Probabilities are passed through untouched.
        return self.estimator_.predict_proba(X)

The estimator is model-agnostic and can be used with any binary classifier that outputs probabilities. In our example, we apply it to our random forest and, as before, search for the optimal parameters.

Fbeta as a function of AUC (on the left) and average precision (on the right) [image by the author]

Not surprisingly, there isn’t a relationship between AUC/average precision and Fbeta. Comparing the scores obtained by the raw random forest and by the random forest with threshold tuning, we observe a difference in the value of Fbeta.

Fbeta obtained w/ (red) and w/o threshold tuning (blue) for the same set of parameters [image by the author]

Threshold tuning doesn’t affect the produced probabilities, so scoring metrics like AUC or average precision remain unaltered.

Fbeta, as a function of AUC (on the left) and average precision (on the right), obtained w/ (red) and w/o threshold tuning (blue) [image by the author]

We are not here to claim the best-performing model on the strength of marginal improvements in validation metrics. We must pursue business goals. In our simulated scenario, it’s evident that simple tricks like threshold tuning make a visible difference in Fbeta. The notable point is that we get these findings without altering the scoring metrics.

In this post, we outlined the main differences between scoring metrics and accuracy-based ones. We saw how these behave in an imbalanced binary classification context when solving real business problems. If our aim is to measure how good we are at detecting churned customers, identifying frauds, or finding failed engine components, using only AUC may produce incomplete or suboptimal solutions. As always, we must deeply understand the business logic from the start and try to satisfy its requirements.


