Achieve Better Classification Results with ClassificationThresholdTuner


A Python tool to tune and visualize the threshold choices for binary and multi-class classification problems

Adjusting the thresholds used in classification problems (that is, adjusting the cut-offs on the probabilities used to decide between predicting one class or another) is a step that's often forgotten, but it is quite easy to do and can significantly improve the quality of a model. It's a step that should be performed with most classification problems (with some exceptions depending on what we wish to optimize for, described below).

In this article, we look more closely at what's actually happening when we do this — with multi-class classification in particular, this can be a bit nuanced. And we look at an open source tool, written by myself, called ClassificationThresholdTuner, that automates and describes the process to users.

Given how common the task of tuning the thresholds is with classification problems, and how similar the process usually is from one project to another, I've been able to use this tool on many projects. It eliminates a lot of (nearly duplicate) code I was adding for many classification problems and provides much more information about tuning the threshold than I would have had otherwise.

Although ClassificationThresholdTuner is a useful tool, you may find the ideas behind it, described in this article, more relevant — they're easy enough to replicate where useful for your classification projects.

In a nutshell, ClassificationThresholdTuner is a tool to optimally set the thresholds used for classification problems and to present clearly the results of different thresholds. Compared with most other available options (and the code we'd most likely develop ourselves for optimizing the threshold), it has two major benefits:

  1. It provides visualizations, which help data scientists understand the implications of using the optimal threshold that's discovered, as well as alternative thresholds that may be selected. This can also be very useful when presenting the modeling decisions to other stakeholders, for example where it's necessary to find a good balance between false positives and false negatives. Often business understanding, as well as data modeling knowledge, is needed for this, and having a clear and full understanding of the choices for the threshold can facilitate discussing and deciding on the best balance.
  2. It supports multi-class classification, which is a common type of problem in machine learning, but is more complicated with respect to tuning the thresholds than binary classification (for example, it requires identifying multiple thresholds). Optimizing the thresholds used for multi-class classification is, unfortunately, not well supported by other tools of this kind.

Although supporting multi-class classification is one of the important properties of ClassificationThresholdTuner, binary classification is simpler to understand, so we'll begin by describing this.

Just about all modern classifiers (including those in scikit-learn, CatBoost, LGBM, XGBoost, and most others) support producing both predictions and probabilities.

For example, if we create a binary classifier to predict which clients will churn in the next year, then for each client we can generally produce either a binary prediction (a Yes or a No for each client) or a probability for each client (e.g. one client may be estimated to have a probability of 0.862 of leaving in that timeframe).

Given a classifier that can produce probabilities, even where we ask for binary predictions, behind the scenes it will generally produce a probability for each record. It will then convert the probabilities to class predictions.

By default, binary classifiers will predict the positive class where the predicted probability of the positive class is greater than or equal to 0.5, and the negative class where the probability is under 0.5. In this example (predicting churn), the model will, by default, predict Yes if the predicted probability of churn is ≥ 0.5 and No otherwise.
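For instance, with a typical scikit-learn classifier, the default predict() output matches thresholding the positive-class probability at 0.5. A minimal sketch (my own illustration on a synthetic dataset, not from the article):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder data and model, not from the article
X, y = make_classification(n_samples=1_000, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba_positive = clf.predict_proba(X)[:, 1]        # probability of the positive class
default_preds = clf.predict(X)                     # uses the implicit 0.5 cut-off
manual_preds = (proba_positive >= 0.5).astype(int)

# Expected: True (up to tie-handling at exactly 0.5)
print((default_preds == manual_preds).all())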

However, this may not be the ideal behavior, and often a threshold other than 0.5 can work better — possibly a threshold somewhat lower or somewhat higher, and sometimes a threshold substantially different from 0.5. This depends on the data, the classifier built, and the relative importance of false positives vs false negatives.

In order to create a strong model (including balancing the false positives and false negatives well), we will often want to optimize for some metric, such as F1 score, F2 score (or others in the family of f-beta metrics), Matthews Correlation Coefficient (MCC), Kappa score, or another. If so, a major part of optimizing for these metrics is setting the threshold appropriately, which will most often set it to a value other than 0.5. We'll describe soon how this works.

Scikit-learn provides good background on the idea of threshold tuning in its Tuning the decision threshold for class prediction page. Scikit-learn also provides two tools, FixedThresholdClassifier and TunedThresholdClassifierCV (introduced in version 1.5 of scikit-learn), to help with tuning the threshold. They work quite similarly to ClassificationThresholdTuner.

Scikit-learn's tools can be considered convenience methods, as they're not strictly necessary; as indicated, tuning is fairly straightforward in any case (at least for the binary classification case, which is what these tools support). But having them is convenient — it is still quite a bit easier to call these than to code the process yourself.

ClassificationThresholdTuner was created as an alternative to these, but where scikit-learn's tools work well, they're excellent choices as well. Specifically, where you have a binary classification problem and don't require any explanations or descriptions of the threshold discovered, scikit-learn's tools can work perfectly well, and may even be slightly more convenient, as they allow us to skip the small step of installing ClassificationThresholdTuner.
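For reference, a minimal sketch of how scikit-learn's own tools are used (requires scikit-learn 1.5 or later; the dataset and classifier here are placeholders, not from the article):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import (
    FixedThresholdClassifier, TunedThresholdClassifierCV, train_test_split)

# Placeholder data and model, not from the article -- substitute your own
X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Search (via cross-validation) for the threshold that maximizes F1
tuned = TunedThresholdClassifierCV(
    LogisticRegression(max_iter=1000), scoring="f1", cv=5).fit(X_train, y_train)
print("best threshold found:", tuned.best_threshold_)

# Or pin a specific, manually chosen threshold
fixed = FixedThresholdClassifier(
    LogisticRegression(max_iter=1000), threshold=0.3).fit(X_train, y_train)
print("F1 with threshold 0.3:", f1_score(y_test, fixed.predict(X_test)))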

ClassificationThresholdTuner could also be more useful where explanations of the thresholds found (including some context related to alternative values for the brink) are needed, or where you will have a multi-class classification problem.

As indicated, it also may at times be the case that the ideas described in this text are what’s most beneficial, not the precise tools, and you could be best to develop your individual code — perhaps along similar lines, but possibly optimized by way of execution time to more efficiently handle the info you will have, possibly more able support other metrics to optimize for, or possibly providing other plots and descriptions of the threshold-tuning process, to supply the data relevant to your projects.

With most scikit-learn classifiers, in addition to CatBoost, XGBoost, and LGBM, the possibilities for every record are returned by calling predict_proba(). The function outputs a probability for every class for every record. In a binary classification problem, they’ll output two probabilities for every record, for instance:

[[0.6, 0.4],
 [0.3, 0.7],
 [0.1, 0.9],
 ...
]

For each pair of probabilities, we can take the first as the probability of the negative class and the second as the probability of the positive class.

However, with binary classification, one probability is simply 1.0 minus the other, so only the probabilities of one of the classes are strictly needed. In fact, when working with class probabilities in binary classification problems, we often use only the probabilities of the positive class, so we could work with an array such as: [0.4, 0.7, 0.9, …].

Thresholds are easy to understand in the binary case, as they can be viewed simply as the minimum predicted probability of the positive class needed to actually predict the positive class (in the churn example, to predict customer churn). If we have a threshold of, say, 0.6, it's then easy to convert the array of probabilities above to predictions, in this case to: [No, Yes, Yes, …].
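As a tiny illustration (my own sketch, not the tool's code), applying a threshold of 0.6 to the positive-class probabilities above:

import numpy as np

proba_positive = np.array([0.4, 0.7, 0.9])  # probabilities of the positive class (churn)
threshold = 0.6

# Predict Yes wherever the probability meets or exceeds the threshold
preds = np.where(proba_positive >= threshold, "Yes", "No")
print(preds)  # ['No' 'Yes' 'Yes']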

By using different thresholds, we allow the model to be more, or less, eager to predict the positive class. If a relatively low threshold, say 0.3, is used, then the model will predict the positive class even when there's only a moderate chance this is correct. Compared with using 0.5 as the threshold, more predictions of the positive class will be made, increasing both true positives and false positives, and also reducing both true negatives and false negatives.

In the case of churn, this can be useful if we wish to focus on catching most cases of churn, even though, in doing so, we also predict that many clients will churn when they will not. That is, a low threshold is good where false negatives (missed churn) are more of a problem than false positives (erroneously predicting churn).

Setting the threshold higher, say to 0.8, will have the opposite effect: fewer clients will be predicted to churn, but of those that are predicted to churn, a large portion will quite likely actually churn. We will increase the false negatives (miss some who will actually churn), but decrease the false positives. This can be appropriate where we can only follow up with a small number of potentially churning clients, and wish to label only those that are most likely to churn.

There's almost always a strong business component to the decision of where to set the threshold. Tools such as ClassificationThresholdTuner can make these decisions clearer, as there's otherwise usually no obvious point for the threshold. Picking a threshold simply based on intuition (possibly deciding 0.7 feels about right) will not likely work optimally, and generally no better than simply using the default of 0.5.

Setting the threshold can be a bit unintuitive: adjusting it a little up or down can often help or hurt the model more than would be expected. Often, for instance, increasing the threshold can greatly decrease false positives, with only a small effect on false negatives; in other cases the opposite may be true. Using a Receiver Operating Characteristic (ROC) curve is a good way to help visualize these trade-offs. We'll see some examples below.

Ultimately, we'll set the threshold so as to optimize for some metric (such as F1 score). ClassificationThresholdTuner is simply a tool to automate and describe that process.

Generally, we can view the metrics used for classification as being of three main types:

  • Those that examine how well-ranked the prediction probabilities are, for example: Area Under the Receiver Operating Characteristic curve (AUROC), Area Under the Precision-Recall Curve (AUPRC)
  • Those that examine how well-calibrated the prediction probabilities are, for example: Brier Score, Log Loss
  • Those that look at how correct the predicted labels are, for example: F1 Score, F2 Score, MCC, Kappa Score, Balanced Accuracy

The first two categories of metrics listed here work with predicted probabilities, and the last works with predicted labels.

While there are many metrics within each of these categories, for simplicity, we will consider for the moment just two of the more common: the AUROC and the F1 score.

These two metrics have an interesting relationship (as does AUROC with other metrics based on predicted labels), which ClassificationThresholdTuner takes advantage of to tune and to explain the optimal thresholds.

The idea behind ClassificationThresholdTuner is, once the model is well tuned to have a strong AUROC, to take advantage of this to optimize for other metrics — metrics that are based on predicted labels, such as the F1 score.

Quite often, metrics that look at how correct the predicted labels are are the most relevant for classification. This is the case where the model will be used to assign predicted labels to records and what's relevant is the number of true positives, true negatives, false positives, and false negatives. That is, if it's the predicted labels that are used downstream, then once the labels are assigned, it's not relevant what the underlying predicted probabilities were, just these final label predictions.

For example, if the model assigns labels of Yes and No to clients indicating whether they're expected to churn in the next year, and the clients with a prediction of Yes receive some treatment while those with a prediction of No do not, what's most relevant is how correct these labels are, not, ultimately, how well-ranked or well-calibrated the prediction probabilities (that these class predictions are based on) were. Though, how well-ranked the predicted probabilities are is relevant, as we'll see, to assigning the predicted labels accurately.

This isn't true for every project: often metrics such as AUROC or AUPRC that look at how well the predicted probabilities are ranked are the most relevant; and sometimes metrics such as Brier Score and Log Loss that look at how accurate the predicted probabilities are are the most relevant.

Tuning the thresholds will not affect these metrics and, where these metrics are the most relevant, there is no reason to tune the thresholds. But, for this article, we'll consider cases where the F1 score, or another metric based on the predicted labels, is what we wish to optimize.

ClassificationThresholdTuner starts with the predicted probabilities (the quality of which can be assessed with the AUROC) and then works to optimize the specified metric (where the specified metric is based on predicted labels).

Metrics based on the correctness of the predicted labels are all, in different ways, calculated from the confusion matrix. The confusion matrix, in turn, depends on the threshold chosen, and can look quite different depending on whether a low or high threshold is used.

The AUROC metric is, as the name implies, based on the ROC, a curve showing how the true positive rate relates to the false positive rate. An ROC curve doesn't assume any specific threshold is used. But each point on the curve corresponds to a particular threshold.

In the plot below, the blue curve is the ROC. The area under this curve (the AUROC) measures how strong the model is generally, averaged over all potential thresholds. It measures how well ranked the probabilities are: if the probabilities are well-ranked, records that are assigned higher predicted probabilities of being in the positive class are, in fact, more likely to be in the positive class.

For example, an AUROC of 0.95 means a random positive sample has a 95% chance of being ranked higher than a random negative sample.
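That probabilistic interpretation can be checked empirically by sampling random positive/negative pairs and counting how often the positive one receives the higher score. A small sketch with made-up scores (my own, not from the article):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)
# Made-up scores that tend to rank positives higher
scores = y_true * 1.0 + rng.normal(0, 0.8, size=10_000)

pos_scores = scores[y_true == 1]
neg_scores = scores[y_true == 0]
sampled_pos = rng.choice(pos_scores, 100_000)
sampled_neg = rng.choice(neg_scores, 100_000)

print("fraction of random pairs ranked correctly:", (sampled_pos > sampled_neg).mean())
print("roc_auc_score:", roc_auc_score(y_true, scores))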

First, having a model with a strong AUROC is important — this is the job of the model tuning process (which may actually optimize for other metrics). This is done before we start tuning the threshold, and coming out of this, it's important to have well-ranked probabilities, which implies a high AUROC score.

Then, where the project requires class predictions for all records, it's necessary to select a threshold (the default of 0.5 can be used, but likely with sub-optimal results), which is equivalent to picking a point on the ROC curve.

The figure above shows two points on the ROC. For each, a vertical and a horizontal line are drawn to the x and y axes to indicate the associated True Positive Rate and False Positive Rate.

Given an ROC curve, as we go left and down, we're using a higher threshold (for example, moving from the green to the red line). Fewer records will be predicted positive, so there will be both fewer true positives and fewer false positives.

As we move right and up (for example, from the red to the green line), we're using a lower threshold. More records will be predicted positive, so there will be both more true positives and more false positives.

That is, in the plot here, the red and green lines represent two possible thresholds. Moving from the green line to the red, we see a small drop in the true positive rate, but a larger drop in the false positive rate, making this quite likely a better choice of threshold than where the green line is situated. But not necessarily — we also need to consider the relative cost of false positives and false negatives.

What's important, though, is that moving from one threshold to another can often adjust the False Positive Rate much more or much less than the True Positive Rate.

The following presents a set of thresholds on a given ROC curve. We can see how moving from one threshold to another can affect the true positive and false positive rates to significantly different extents.

This is the main idea behind adjusting the threshold: it's often possible to achieve a large gain in one sense, while taking only a small loss in the other.

It's possible to look at the ROC curve and see the effect of moving the threshold up and down. Given that, it's possible, to an extent, to eyeball the process and pick a point that appears to best balance true positives and false positives (which also effectively balances false positives and false negatives). In some sense, this is what ClassificationThresholdTuner does, but it does so in a principled way, in order to optimize for a specified metric (such as the F1 score).

Moving the threshold to different points on the ROC generates different confusion matrices, which can then be converted to metrics (F1 score, F2 score, MCC, etc.). We can then take the point that optimizes this score.
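In effect, this amounts to scanning candidate thresholds, computing the chosen metric at each, and keeping the best. A simplified sketch of that idea (my own illustration, not ClassificationThresholdTuner's actual implementation, which searches iteratively):

import numpy as np
from sklearn.metrics import f1_score

def best_threshold_for_metric(y_true, proba_positive, candidates=None):
    """Return the candidate threshold giving the best macro F1, and that score."""
    if candidates is None:
        candidates = np.linspace(0.01, 0.99, 99)
    scores = [
        f1_score(y_true, (proba_positive >= t).astype(int), average="macro")
        for t in candidates
    ]
    best_idx = int(np.argmax(scores))
    return candidates[best_idx], scores[best_idx]

# Hypothetical labels and probabilities, for illustration only
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1_000)
proba = np.clip(0.35 * y_true + rng.normal(0.3, 0.2, 1_000), 0.0, 1.0)
print(best_threshold_for_metric(y_true, proba))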

As long as a model is trained to have a strong AUROC, we can normally find a good threshold to achieve a high F1 score (or other such metric).

In this ROC plot, the model is very accurate, with an AUROC of 0.98. It will, then, be possible to select a threshold that results in a high F1 score, though it is still necessary to select a good threshold, and the optimum may easily not be 0.5.

Being well-ranked, the model is not necessarily also well-calibrated, but this isn't necessary: as long as records that are in the positive class tend to get higher predicted probabilities than those in the negative class, we can find a good threshold to separate those predicted to be positive from those predicted to be negative.

Looking at this another way, we can view the distribution of probabilities in a binary classification problem with two histograms, as shown here (actually using KDE plots). The blue curve shows the distribution of probabilities for the negative class and the orange curve for the positive class. The model is not likely well-calibrated: the probabilities for the positive class are consistently well below 1.0. But they are well-ranked: the probabilities for the positive class tend to be higher than those for the negative class, which means the model would have a high AUROC and can assign labels well if using an appropriate threshold, in this case likely about 0.25 or 0.3. Given that there's overlap in the distributions, though, it's impossible to have a perfect system to label the records, and the F1 score can never be quite 1.0.

It is possible to have, even with a high AUROC score, a low F1 score: where there's a poor choice of threshold. This can occur, for example, where the ROC hugs the axis as in the ROC shown above — a very low or very high threshold may work poorly. Hugging the y-axis can also occur where the data is imbalanced.

In the case of the histograms shown here, though the model is well-calibrated and would have a high AUROC score, a poor choice of threshold (such as 0.5 or 0.6, which would result in everything being predicted as the negative class) would result in a very low F1 score.

It's also possible (though less likely) to have a low AUROC and a high F1 score. This is possible with a particularly good choice of threshold (where most thresholds would perform poorly).

As well, it's not common, but possible, to have ROC curves that are asymmetrical, which can greatly affect where it's best to place the threshold.

This is taken from a notebook available on the GitHub site (where it's possible to see the full code). We'll go over the main points here. For this example, we first generate a test dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from threshold_tuner import ClassificationThresholdTuner

NUM_ROWS = 100_000

def generate_data():
    num_rows_per_class = NUM_ROWS // 2
    np.random.seed(0)
    d = pd.DataFrame(
        {"Y": ['A']*num_rows_per_class + ['B']*num_rows_per_class,
         "Pred_Proba":
             np.random.normal(0.7, 0.3, num_rows_per_class).tolist() +
             np.random.normal(1.4, 0.3, num_rows_per_class).tolist()
        })
    return d, ['A', 'B']

d, target_classes = generate_data()

Here, for simplicity, we don't generate the original data or the classifier that produced the predicted probabilities, only a test dataset containing the true labels and the predicted probabilities, as this is what ClassificationThresholdTuner works with and is all that's needed to select the best threshold.

There's actually also code in the notebook to scale the probabilities, to ensure they're between 0.0 and 1.0, but here we'll just assume the probabilities are well-scaled.

We can then set the Pred column using a threshold of 0.5:

d['Pred'] = np.where(d["Pred_Proba"] > 0.50, "B", "A")

This simulates what’s normally done with classifiers, simply using 0.5 as the brink. That is the baseline we are going to attempt to beat.

We then create a ClassificationThresholdTuner object and use this, to start out, simply to see how strong the present predictions are, calling considered one of it’s APIs, print_stats_lables().

tuner = ClassificationThresholdTuner()

tuner.print_stats_labels(
    y_true=d["Y"],
    target_classes=target_classes,
    y_pred=d["Pred"])

This displays the precision, recall, and F1 scores for both classes (as well as the macro averages of these) and presents the confusion matrix.
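If you prefer, roughly the same summary can also be produced directly with scikit-learn (a sketch; the tool's own formatting and plots will differ):

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision / recall / F1 (plus macro averages) and the confusion matrix
print(classification_report(d["Y"], d["Pred"]))
print(confusion_matrix(d["Y"], d["Pred"], labels=target_classes))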

This API assumes the labels have been predicted already; where only the probabilities are available, this method can't be used, though we can always, as in this example, select a threshold and set the labels based on it.

We can also call the print_stats_proba() method, which also presents some metrics, in this case related to the predicted probabilities. It shows: the Brier Score, the AUROC, and several plots. The plots require a threshold, though 0.5 is used if not specified, as in this example:

tuner.print_stats_proba(
    y_true=d["Y"],
    target_classes=target_classes,
    y_pred_proba=d["Pred_Proba"])

This displays the results of a threshold of 0.5. It shows the ROC curve, which itself doesn't require a threshold, but draws the threshold on the curve. It then presents how the data is split into two predicted classes based on the threshold, first as a histogram, and second as a swarm plot. Here there are two classes, with class A in green and class B (the positive class in this example) in blue.

In the swarm plot, any misclassified records are shown in red. These are the records where the true class is A but the predicted probability of B is above the threshold (so the model would predict B), and those where the true class is B but the predicted probability of B is below the threshold (so the model would predict A).

We can then examine the results of different thresholds using plot_by_threshold():

tuner.plot_by_threshold(
    y_true=d['Y'],
    target_classes=target_classes,
    y_pred_proba=d["Pred_Proba"])

In this example, we use the default set of potential thresholds: 0.1, 0.2, 0.3, … up to 0.9. For each threshold, it will predict any records with predicted probabilities over the threshold as the positive class and anything lower as the negative class. Misclassified records are shown in red.

To save space in this article, the image shows just three potential thresholds: 0.2, 0.3, and 0.4. For each we see: the position on the ROC curve this threshold represents, the split in the data it leads to, and the resulting confusion matrix (together with the F1 macro score related to that confusion matrix).

We can see that setting the threshold to 0.2 results in almost everything being predicted as B (the positive class) — nearly all records of class A are misclassified and so drawn in red. As the threshold is increased, more records are predicted to be A and fewer as B (though at 0.4 most records that are truly B are still correctly predicted as B; it is not until a threshold of about 0.8 that nearly all records that are truly class B are erroneously predicted as A: very few have a predicted probability over 0.8).

Examining this for nine possible values from 0.1 to 0.9 gives a good overview of the possible thresholds, but it may be more useful to call this function to display a narrower, and more realistic, range of possible values, for example:

tuner.plot_by_threshold(
    y_true=d['Y'],
    target_classes=target_classes,
    y_pred_proba=d["Pred_Proba"],
    start=0.50, end=0.55, num_steps=6)

This will show each threshold from 0.50 to 0.55. Showing the first two of these:

The API helps present the implications of different thresholds.

We can also view this by calling describe_slices(), which describes the data between pairs of potential thresholds (i.e., within slices of the data) in order to see more clearly what the specific changes will be of moving the threshold from one potential location to the next (we see how many records of each true class will be re-classified).

tuner.describe_slices(
    y_true=d['Y'],
    target_classes=target_classes,
    y_pred_proba=d["Pred_Proba"],
    start=0.3, end=0.7, num_slices=5)

This shows each slice visually and in table format:

Here, the slices are fairly thin, so we see plots showing them both in the context of the full range of probabilities (the left plot) and zoomed in (the right plot).

We can see, for example, that moving the threshold from 0.38 to 0.46 would re-classify the points in the third slice, which has 17,529 true instances of class A and 1,464 true instances of class B. This is evident both in the rightmost swarm plot and in the table (in the swarm plot, there are far more green than blue points within slice 3).

This API can also be called for a narrower, and more realistic, range of potential thresholds:

tuner.describe_slices(
    y_true=d['Y'],
    target_classes=target_classes,
    y_pred_proba=d["Pred_Proba"],
    start=0.4, end=0.6, num_slices=10)

This produces:

Having called these (or another useful API, print_stats_table(), skipped here for brevity, but described on the GitHub page and in the example notebooks), we have some idea of the effects of moving the threshold.

We can then move to the main task, searching for the optimal threshold, using the tune_threshold() API. With some projects, this may actually be the only API called. Or it may be called first, with the above APIs being called later to provide context for the optimal threshold discovered.

In this example, we optimize the F1 macro score, though any metric supported by scikit-learn and based on class labels is possible. Some metrics require additional parameters, which can be passed here as well. In this example, scikit-learn's f1_score() requires the 'average' parameter, passed here as a parameter to tune_threshold().

from sklearn.metrics import f1_score

best_threshold = tuner.tune_threshold(
    y_true=d['Y'],
    target_classes=target_classes,
    y_pred_proba=d["Pred_Proba"],
    metric=f1_score,
    average='macro',
    higher_is_better=True,
    max_iterations=5
)
best_threshold

This, optionally, displays a set of plots demonstrating how the method, over five iterations (in this example max_iterations is specified as 5), narrows in on the threshold value that optimizes the specified metric.

The first iteration considers the full range of potential thresholds between 0.0 and 1.0. It then narrows in on the range 0.5 to 0.6, which is examined more closely in the next iteration, and so on. Ultimately a threshold of 0.51991 is selected.

After this, we can call print_stats_labels() again, which shows:

We can see, in this example, an increase in the macro F1 score from 0.875 to 0.881. In this case the gain is small, but it comes for almost free. In other cases the gain may be smaller or larger, sometimes much larger. It's also never counter-productive; at worst the optimal threshold found will be the default, 0.5000, in any case.

As indicated, multi-class classification is somewhat more complicated. In the binary classification case, a single threshold is selected, but with multi-class classification, ClassificationThresholdTuner identifies an optimal threshold per class.

Also different from the binary case, we need to specify one of the classes to be the default class. Going through an example should make it clearer why this is the case.

In many cases, having a default class can be fairly natural. For example, if the target column represents various possible medical conditions, the default class may be "No Issue" and the other classes may each relate to specific conditions. For each of these conditions, we'd have a minimum predicted probability we'd require to actually predict that condition.

Or, if the data represents network logs and the target column relates to various intrusion types, then the default may be "Normal Behavior", with the other classes each referring to specific network attacks.

In the example of network attacks, we may have a dataset with four distinct target values, with the target column containing the classes: "Normal Behavior", "Buffer Overflow", "Port Scan", and "Phishing". For any record for which we run prediction, we will get a probability for each class, and these will sum to 1.0. We may get, for example: [0.3, 0.4, 0.1, 0.2] (the probabilities for each of the four classes, in the order above).

Normally, we would predict "Buffer Overflow", as this has the highest probability, 0.4. However, we can set a threshold in order to modify this behavior, which will then affect the rate of false negatives and false positives for this class.

We may specify, for example, that: the default class is "Normal Behavior"; the threshold for "Buffer Overflow" is 0.5; for "Port Scan" it is 0.55; and for "Phishing" it is 0.45. By convention, the threshold for the default class is set to 0.0, as it doesn't actually use a threshold. So, the set of thresholds here would be: 0.0, 0.5, 0.55, 0.45.

Then, to make a prediction for any given record, we consider only the classes where the probability is over the relevant threshold. In this example (with predictions [0.3, 0.4, 0.1, 0.2]), none of the probabilities are over their thresholds, so the default class, "Normal Behavior", is predicted.

If the predicted probabilities were instead: [0.1, 0.6, 0.2, 0.1], then we would predict "Buffer Overflow": its probability (0.6) is the highest prediction and is over its threshold (0.5).

If the predicted probabilities were: [0.1, 0.2, 0.7, 0.0], then we would predict "Port Scan": its probability (0.7) is over its threshold (0.55) and it is the highest prediction.

This means: if one or more classes have predicted probabilities over their thresholds, we take the one of these with the highest predicted probability. If none are over their threshold, we take the default class. And, if the default class has the highest predicted probability, it will be predicted.

So, a default class is required to cover the case where none of the predictions are over the threshold for their class.

If the predictions are: [0.1, 0.3, 0.4, 0.2] and the thresholds are: 0.0, 0.55, 0.5, 0.45, another way to look at this is: the third class would normally be predicted, as it has the highest predicted probability (0.4). But, if the threshold for that class is 0.5, then a prediction of 0.4 is not high enough, so we go to the next highest prediction, which is the second class, with a predicted probability of 0.3. That's below its threshold (0.55), so we go again to the next highest predicted probability, which is the fourth class, with a predicted probability of 0.2. It is also below the threshold for that target class. Here, we have all classes with predictions that are fairly high, but not sufficiently high, so the default class is used.

This also highlights why it's convenient to use 0.0 as the threshold for the default class — when examining the prediction for the default class, we don't need to consider whether its prediction is under or over the threshold for that class; we can always make a prediction of the default class.
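To make the rule concrete, here is a small sketch of the prediction logic just described (my own illustration, not the tool's internal code): take the highest-probability class among those at or above their thresholds, falling back to the default class.

classes = ["Normal Behavior", "Buffer Overflow", "Port Scan", "Phishing"]
thresholds = [0.0, 0.5, 0.55, 0.45]   # 0.0 for the default class, by convention
default_class = "Normal Behavior"

def predict_with_thresholds(proba_row):
    """proba_row: the predicted probabilities, in the order of `classes`."""
    best_class, best_proba = default_class, 0.0
    for cls, p, t in zip(classes, proba_row, thresholds):
        # A class is eligible only if its probability meets its threshold;
        # among eligible classes, keep the most probable one
        if p >= t and p > best_proba:
            best_class, best_proba = cls, p
    return best_class

print(predict_with_thresholds([0.3, 0.4, 0.1, 0.2]))  # Normal Behavior (nothing over its threshold)
print(predict_with_thresholds([0.1, 0.6, 0.2, 0.1]))  # Buffer Overflow
print(predict_with_thresholds([0.1, 0.2, 0.7, 0.0]))  # Port Scan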

It’s actually, in principle, also possible to have more complex policies — not only using a single default class, but as a substitute having multiple classes that might be chosen under different conditions. But these are beyond the scope of this text, are sometimes unnecessary, and should not supported by ClassificationThresholdTuner, no less than at present. For the rest of this text, we’ll assume there’s a single default class specified.

Again, we’ll start by creating the test data (using considered one of the test data sets provided within the example notebook for multi-class classification on the github page), on this case, having three, as a substitute of just two, goal classes:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from threshold_tuner import ClassificationThresholdTuner

NUM_ROWS = 10_000

def generate_data():
    num_rows_for_default = int(NUM_ROWS * 0.9)
    num_rows_per_class = (NUM_ROWS - num_rows_for_default) // 2
    np.random.seed(0)
    d = pd.DataFrame({
        "Y": ['No Attack']*num_rows_for_default + ['Attack A']*num_rows_per_class + ['Attack B']*num_rows_per_class,
        "Pred_Proba No Attack":
            np.random.normal(0.7, 0.3, num_rows_for_default).tolist() +
            np.random.normal(0.5, 0.3, num_rows_per_class * 2).tolist(),
        "Pred_Proba Attack A":
            np.random.normal(0.1, 0.3, num_rows_for_default).tolist() +
            np.random.normal(0.9, 0.3, num_rows_per_class).tolist() +
            np.random.normal(0.1, 0.3, num_rows_per_class).tolist(),
        "Pred_Proba Attack B":
            np.random.normal(0.1, 0.3, num_rows_for_default).tolist() +
            np.random.normal(0.1, 0.3, num_rows_per_class).tolist() +
            np.random.normal(0.9, 0.3, num_rows_per_class).tolist()
    })
    d['Y'] = d['Y'].astype(str)
    return d, ['No Attack', 'Attack A', 'Attack B']

d, target_classes = generate_data()

There’s some code within the notebook to scale the scores and ensure they sum to 1.0, but for here, we will just assume this is finished and that now we have a set of well-formed probabilities for every class for every record.

As is common with real-world data, one of the classes (the 'No Attack' class) is much more frequent than the others; the dataset is imbalanced.

We then set the target predictions, for now just taking the class with the highest predicted probability:

def set_class_prediction(d):
    # For each record, take the class whose predicted probability is highest
    max_cols = d[proba_cols].idxmax(axis=1)
    max_cols = [x[len("Pred_Proba "):] for x in max_cols]
    return max_cols

d['Pred'] = set_class_prediction(d)

This produces:

Taking the class with the highest probability is the default behaviour, and in this example, the baseline we wish to beat.

We can, as with the binary case, call print_stats_labels(), which works similarly, handling any number of classes:

tuner.print_stats_labels(
    y_true=d["Y"],
    target_classes=target_classes,
    y_pred=d["Pred"])

This outputs:

Using these labels, we get an F1 macro score of only 0.447.

Calling print_stats_proba(), we also get the output related to the prediction probabilities:

This is a bit more involved than the binary case, since we have three probabilities to consider: the probabilities of each class. So, we first show how the data lines up relative to the probabilities of each class. In this case, there are three target classes, so three plots in the first row.

As would be hoped, when plotting the data based on the predicted probability of 'No Attack' (the left-most plot), the records for 'No Attack' are given a higher probability of this class than the other records are. Similarly for 'Attack A' (the middle plot) and 'Attack B' (the right-most plot).

We can also see that the classes are not perfectly separated, so there is no set of thresholds that will result in a perfect confusion matrix. We will have to choose a set of thresholds that best balances correct and incorrect predictions for each class.

In the figure above, the bottom plot shows each point based on the probability of its true class. So for the records where the true class is 'No Attack' (the green points), we plot these by their predicted probability of 'No Attack'; for the records where the true class is 'Attack A' (in dark blue), we plot these by their predicted probability of 'Attack A'; and similarly for Attack B (in dark yellow). We see that the model has similar probabilities for Attack A and Attack B, and higher probabilities for these than for No Attack.

The above plots didn't consider any specific thresholds that might be used. We can also, optionally, generate more output, passing a set of thresholds (one per class, using 0.0 for the default class):

tuner.print_stats_proba(
    y_true=d["Y"],
    target_classes=target_classes,
    y_pred_proba=d[proba_cols].values,
    default_class='No Attack',
    thresholds=[0.0, 0.4, 0.4]
)

This is most useful for plotting the set of thresholds discovered as optimal by the tool, but it can also be used to view other potential sets of thresholds.

It produces a report for each class. To save space, we just show one here, for class Attack A (the full report is shown in the example notebook; viewing the reports for the other two classes as well is useful to understand the full implications of using, in this example, [0.0, 0.4, 0.4] as the thresholds):

As we have a set of thresholds specified here, we can see the implications of using these thresholds, including how many records of each class will be correctly and incorrectly classified.

We see first where the threshold appears on the ROC curve. In this case, we're viewing the report for Class A, so we see a threshold of 0.4 (0.4 was specified for class A in the API call above).

The AUROC score is also shown. This metric applies only to binary prediction, but in a multi-class problem we can calculate the AUROC score for each class by treating the problem as a series of one-vs-all problems. Here we treat the problem as 'Attack A' vs not 'Attack A' (and similarly for the other reports).
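For instance, the one-vs-all AUROC for 'Attack A' could be computed directly with scikit-learn (a sketch; the tool computes and reports this for you):

from sklearn.metrics import roc_auc_score

# Treat the problem as 'Attack A' vs everything else
auroc_attack_a = roc_auc_score(
    (d["Y"] == "Attack A").astype(int),
    d["Pred_Proba Attack A"])
print(auroc_attack_a)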

The next plots show the distribution of each class with respect to the predicted probabilities of Attack A. As there are different counts of the different classes, these are shown in two ways: one showing the actual distributions, and one showing them scaled to be more comparable. The former is more relevant, but the latter allows all classes to be seen clearly where some classes are much rarer than others.

We can see that records where the true class is 'Attack A' (in dark blue) do have higher predicted probabilities of 'Attack A', but there is some decision to be made as to where exactly the threshold is placed. We see here the effect of using 0.4 for this class. It seems that 0.4 is likely close to ideal, if not exactly ideal.

We also see this in the form of a swarm plot (the right-most plot), with the misclassified points in red. We can see that using a higher threshold (say 0.45 or 0.5), we'd have more records where the true class is Attack A misclassified, but fewer records where the true class is 'No Attack' misclassified. And using a lower threshold (say 0.3 or 0.35) would have the opposite effect.

We can also call plot_by_threshold() to look at different potential thresholds:

tuner.plot_by_threshold(
    y_true=d['Y'],
    target_classes=target_classes,
    y_pred_proba=d[proba_cols].values,
    default_class='No Attack'
)

This API is purely for explanation and not tuning, so for simplicity it uses (for each potential threshold) the same threshold for each class (other than the default class). Showing this for the potential thresholds 0.2, 0.3, and 0.4:

The first row of figures shows the implications of using 0.2 as the threshold for all classes other than the default (that is, not predicting Attack A unless the estimated probability of Attack A is at least 0.2, and not predicting Attack B unless the predicted probability of Attack B is at least 0.2 — though otherwise always taking the class with the highest predicted probability). Similarly in the next two rows for thresholds of 0.3 and 0.4.

We can see here the trade-offs of using lower or higher thresholds for each class, and the confusion matrices that will result (together with the F1 scores related to these confusion matrices).

In this example, moving from 0.2 to 0.3 to 0.4, we can see how the model will less often predict Attack A or Attack B (raising the thresholds, we will less and less often predict anything other than the default) and more often No Attack, which results in fewer misclassifications where the true class is No Attack, but more where the true class is Attack A or Attack B.

When the threshold is quite low, such as 0.2, then of the records where the true class is the default, only those with the highest predicted probability of No Attack (roughly the top half) were predicted correctly.

Once the threshold is set above about 0.6, nearly everything is predicted as the default class, so all cases where the ground truth is the default class are correct and all others are incorrect.

As expected, setting the thresholds higher means predicting the default class more often and missing fewer of these, though missing more of the other classes. Attack A and B are generally predicted correctly when using low thresholds, but mostly incorrectly when using higher thresholds.

To tune the thresholds, we again use tune_threshold(), with code such as:

from sklearn.metrics import f1_score

best_thresholds = tuner.tune_threshold(
    y_true=d['Y'],
    target_classes=target_classes,
    y_pred_proba=d[proba_cols].values,
    metric=f1_score,
    average='macro',
    higher_is_better=True,
    default_class='No Attack',
    max_iterations=5
)
best_thresholds

This outputs: [0.0, 0.41257, 0.47142]. That is, it found that a threshold of about 0.413 for Attack A and about 0.471 for Attack B works best to optimize for the specified metric, the macro F1 score in this case.

Calling print_stats_proba() again, we get:

tuner.print_stats_proba(
    y_true=d["Y"],
    target_classes=target_classes,
    y_pred_proba=d[proba_cols].values,
    default_class='No Attack',
    thresholds=best_thresholds
)

Which outputs:

The macro F1 score, using the thresholds discovered here, has improved from about 0.44 to 0.68 (results will vary slightly from run to run).

One additional API is provided which can be very convenient, get_predictions(), to get label predictions given a set of probabilities and thresholds. This can be called, for example, as:

tuned_pred = tuner.get_predictions(
    target_classes, d["Pred_Proba"], None, best_threshold)

Testing has been performed with many real datasets as well. Often the thresholds discovered work no better than the defaults, but more often they work noticeably better. One notebook is included on the GitHub page covering a small number (four) of real datasets. This was provided more to give real examples of using the tool and the plots it generates (as opposed to the synthetic data used to explain the tool), but it also gives some examples where the tool does, in fact, improve the F1 macro scores.

To summarize these quickly, in terms of the thresholds discovered and the gain in F1 macro scores:

  • Breast cancer: discovered an optimal threshold of 0.5465, which improved the macro F1 score from 0.928 to 0.953.
  • Steel plates fault: discovered an optimal threshold of 0.451, which improved the macro F1 score from 0.788 to 0.956.
  • Phoneme: discovered an optimal threshold of 0.444, which improved the macro F1 score from 0.75 to 0.78.
  • Digits: no improvement over the default was found, though there may be with different classifiers or otherwise different conditions.

This project consists of a single .py file.

This must be copied into your project and imported. For example:

from threshold_tuner import ClassificationThresholdTuner

tuner = ClassificationThresholdTuner()

There are some subtle points about setting thresholds in multi-class settings, which may or may not be relevant for any given project. This may get more into the weeds than is necessary for your work, and this article is already quite long, but a section is provided on the main GitHub page to cover the cases where this is relevant. Specifically, thresholds set above 0.5 can behave slightly differently than those below 0.5.

While tuning the thresholds used for classification projects won't always improve the quality of the model, it very often will, and sometimes significantly. This is easy enough to do, but using ClassificationThresholdTuner makes it a bit easier, and with multi-class classification it can be particularly useful.

It also provides visualizations that explain the choices for the threshold, which can be helpful, either in understanding and accepting the threshold(s) it discovers, or in selecting other thresholds to better match the goals of the project.

With multi-class classification, it can still take a bit of effort to understand well the effects of moving the thresholds, but this is much easier with tools such as this than without, and in many cases simply tuning the thresholds and testing the results will be sufficient in any case.

All images are by the author.
