A comprehensive (and colorful) guide to everything you need to know about evaluating classification models
Over the course of my learning journey, I’ve realized that I’m an incredibly visual learner: I appreciate color and fun illustrations when learning new concepts, especially scientific ones that are typically explained like this:
From my previous articles, and through tons of lovely comments and messages (thanks for all the support!), I discovered that several people share this sentiment. So I decided to start a new series in which I’ll try to illustrate machine learning and computer science concepts to hopefully make learning them fun. So, buckle up and enjoy the ride!
Let’s begin this series by exploring a fundamental question in machine learning: how do we evaluate classification models?
In previous articles, such as Decision Tree Classification and Logistic Regression, we discussed how to build classification models. However, it’s crucial to quantify how well these models perform, which raises the question: what metrics should we use to do this?
To illustrate this idea, let’s build a loan repayment classifier.
Our goal is to predict whether a person is likely to repay their loan based on their credit score. While other variables like age, salary, loan amount, loan type, occupation, and credit history might also factor into such a classifier, for the sake of simplicity we’ll consider credit score as the only determinant in our model.
Following the steps laid out in the Logistic Regression article, we build a classifier that predicts the probability that someone will repay their loan based on their credit score.
From this, we see that the lower the credit score, the more likely it is that the person will not repay their loan, and vice versa.
Right now, the output of this model is the probability that a person is going to repay their loan. However, if we want to classify the person as going to repay or not going to repay, we need a way to turn these probabilities into a classification.
One way to do this is to set a threshold of 0.5 and classify anyone below that threshold as not going to repay and anyone above it as going to repay.
From this, we deduce that the model will classify anyone with a credit score below 600 as not going to repay (pink) and anyone above 600 as going to repay (blue).
Using 0.5 as a cutoff, we classify this person with a credit score of 420 as…
…not going to repay. And this person with a credit score of 700 as…
…going to repay.
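The thresholding step can be sketched in a few lines of Python. The logistic curve here is a hypothetical stand-in for the fitted model: its midpoint of 600 matches the 0.5 cutoff described above, but the slope is an assumed value picked only for illustration, and the function names are mine.

```python
import math

# Hypothetical logistic curve: the midpoint of 600 matches the 0.5 cutoff in
# the text; the slope is an assumed value, not the fitted model's coefficient.
MIDPOINT = 600.0
SLOPE = 0.0122


def repay_probability(credit_score: float) -> float:
    """Estimated probability that a person repays, given their credit score."""
    return 1.0 / (1.0 + math.exp(-SLOPE * (credit_score - MIDPOINT)))


def classify(credit_score: float, threshold: float = 0.5) -> str:
    """Turn the probability into a label by comparing it with the threshold."""
    if repay_probability(credit_score) >= threshold:
        return "going to repay"
    return "not going to repay"


print(classify(420))  # not going to repay
print(classify(700))  # going to repay
```

Keeping the threshold as a parameter makes it easy to try cutoffs other than 0.5 later on.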
Now, to find out how effective our model is, we need way more than 2 people. So let’s dig through past records and collect 10,000 people’s credit scores, along with whether or not they repaid their loans.
NOTE: In our records, we have 9500 people who repaid their loan and only 500 who didn’t.
We then run our classifier on each person and, based on their credit score, predict whether the person is going to repay their loan or not.
Confusion Matrix
To better visualize how our predictions compare with the truth, we create something called a confusion matrix.
In this particular confusion matrix, we consider a person who repaid their loan as a Positive label and a person who didn’t repay their loan as a Negative label.
- True Positives (TP): People who actually repaid their loan and were classified by the model as going to repay
- False Negatives (FN): People who actually repaid their loan, but were classified by the model as not going to repay
- True Negatives (TN): People who in reality didn’t repay their loans and were classified by the model as not going to repay
- False Positives (FP): People who in reality didn’t repay their loans, but were classified by the model as going to repay
Now imagine we passed the information about the 10,000 people through our model. We end up with a confusion matrix that looks like this:
From this, we can deduce that:
- Out of the 9500 people who repaid their loan, 9000 were correctly classified (TP) and 500 were incorrectly classified (FN).
- Out of the 500 people who didn’t repay their loan, 200 were correctly classified (TN) and 300 were incorrectly classified (FP).
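The four cells of the confusion matrix can be tallied with a short loop. This is a minimal sketch: the function name is my own, and the records are rebuilt from the counts in the text rather than loaded from real data.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, TN, FP, treating True ("repaid") as the Positive label."""
    tp = fn = tn = fp = 0
    for did_repay, predicted_repay in zip(actual, predicted):
        if did_repay and predicted_repay:
            tp += 1
        elif did_repay:
            fn += 1
        elif predicted_repay:
            fp += 1
        else:
            tn += 1
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}


# Rebuild the 10,000 records as (actual, predicted) pairs from the counts above.
records = (
    [(True, True)] * 9000      # repaid, predicted to repay
    + [(True, False)] * 500    # repaid, predicted not to repay
    + [(False, False)] * 200   # did not repay, predicted not to repay
    + [(False, True)] * 300    # did not repay, predicted to repay
)
actual = [a for a, _ in records]
predicted = [p for _, p in records]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 9000, 'FN': 500, 'TN': 200, 'FP': 300}
```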
Accuracy
Intuitively, the first thing we ask ourselves is: how accurate is our model? In other words, what fraction of all the people did it classify correctly?
In our case, accuracy is (TP + TN) / (TP + TN + FP + FN) = (9000 + 200) / 10,000 = 92%.
92% accuracy is certainly impressive, but it’s important to note that accuracy is often too simplistic a metric for evaluating model performance.
If we take a closer look at the confusion matrix, we can see that while most of the people who repaid their loans were correctly classified, only 200 of the 500 people who didn’t repay were correctly labeled by the model, with the other 300 being incorrectly classified.
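As a quick check, the accuracy figure and the per-class error breakdown can be reproduced from the four counts. This is just a sketch of the arithmetic; the variable names are my own.

```python
tp, fn, tn, fp = 9000, 500, 200, 300
total = tp + fn + tn + fp

# Accuracy: everything classified correctly, divided by everything.
accuracy = (tp + tn) / total
print(f"accuracy: {accuracy:.0%}")  # accuracy: 92%

# The same errors, broken down by what actually happened.
repaid_error_rate = fn / (tp + fn)       # repaid, but labeled "not going to repay"
not_repaid_error_rate = fp / (tn + fp)   # did not repay, but labeled "going to repay"
print(f"{repaid_error_rate:.1%} vs {not_repaid_error_rate:.1%}")  # 5.3% vs 60.0%
```

Because the people who repaid outnumber the others 19 to 1, the high error rate on the smaller group barely dents the overall accuracy.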
So let’s explore some other commonly used metrics that we can use to assess the performance of our model.
Precision
Another question we can ask is: of all the people the model predicted would repay, what percentage actually repaid?
To calculate precision, we divide the True Positives by the total number of predicted Positives (i.e., people classified as going to repay): Precision = TP / (TP + FP) = 9000 / (9000 + 300) ≈ 96.8%.
So when our classifier predicts that a person is going to repay, it is correct 96.8% of the time.
Sensitivity (aka Recall)
Next, we can ask ourselves: of all the people who actually repaid their loans, what percentage did the model correctly classify?
To compute sensitivity, we divide the True Positives by the total number of people who actually repaid their loans: Sensitivity = TP / (TP + FN) = 9000 / (9000 + 500) ≈ 94.7%.
The classifier correctly labeled 94.7% of the people who actually repaid their loans and incorrectly labeled the rest as not going to repay.
NOTE: The terms used in the Precision and Sensitivity formulas can be confusing at times. One easy mnemonic to tell the two apart is to remember that both formulas use TP (True Positive) in the numerator, but the denominators differ. Precision has (TP + FP) in the denominator, while Sensitivity has (TP + FN).
To remember this difference, think of the “P” in FP matching the “P” in Precision: Precision = TP / (TP + FP)
and that leaves FN, which we find in the denominator of Sensitivity: Sensitivity = TP / (TP + FN)
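Here is a small sketch that puts the two formulas side by side, using the counts from the confusion matrix. The function names are my own, and neither function guards against a zero denominator.

```python
def precision(tp: int, fp: int) -> float:
    """Of everyone predicted Positive, the fraction that really was Positive."""
    return tp / (tp + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Of everyone actually Positive, the fraction the model caught (recall)."""
    return tp / (tp + fn)


tp, fn, fp = 9000, 500, 300
print(round(precision(tp, fp), 3))    # 0.968
print(round(sensitivity(tp, fn), 3))  # 0.947
```

The numerator is the same in both; only the second argument changes, which mirrors the FP versus FN mnemonic.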
F1 Score
Another useful metric that combines sensitivity and precision is the F1 score, which is the harmonic mean of precision and sensitivity: F1 = 2 × (Precision × Sensitivity) / (Precision + Sensitivity).
The F1 score in our case: 2 × (0.968 × 0.947) / (0.968 + 0.947) ≈ 0.957.
The F1 score usually provides a more comprehensive evaluation of model performance than accuracy does, which often makes it the more useful metric in practice.
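To see why the harmonic mean is used rather than a plain average, here is a minimal sketch. The lopsided classifier in the second half is a made-up example, not something from our loan data.

```python
def f1_score(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity (0.0 if both are zero)."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


tp, fn, fp = 9000, 500, 300
precision = tp / (tp + fp)
sensitivity = tp / (tp + fn)
print(round(f1_score(precision, sensitivity), 3))  # 0.957

# Made-up lopsided case: perfect precision, but only 10% sensitivity.
lopsided_f1 = f1_score(1.0, 0.1)
lopsided_average = (1.0 + 0.1) / 2
print(round(lopsided_f1, 3), round(lopsided_average, 2))  # 0.182 0.55
```

The plain average makes the lopsided classifier look mediocre, while the harmonic mean is pulled toward the weaker of the two values.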
Specificity
Another critical metric to consider is specificity, which asks the question: of all the people who didn’t repay their loans, what percentage did the model correctly classify?
To calculate specificity, we divide the True Negatives by the total number of people who didn’t repay their loans: Specificity = TN / (TN + FP) = 200 / (200 + 300) = 40%.
We can see that our classifier correctly identifies only 40% of the people who didn’t repay their loans.
This stark difference between specificity and the other evaluation metrics emphasizes the importance of choosing the right metrics to assess model performance. It’s crucial to consider all the evaluation metrics and interpret them appropriately, as each provides a distinct perspective on the model’s effectiveness.
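Putting all the metrics so far next to each other makes the contrast easy to see. This is a sketch with a helper name of my own; it assumes none of the denominators are zero.

```python
def summarize(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute the metrics discussed so far from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "specificity": tn / (tn + fp),
    }


metrics = summarize(tp=9000, fn=500, tn=200, fp=300)
for name, value in metrics.items():
    print(f"{name:<12}{value:.1%}")
# accuracy    92.0%
# precision   96.8%
# sensitivity 94.7%
# f1          95.7%
# specificity 40.0%
```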
NOTE: I often find it helpful to combine various metrics or devise my own metric based on the problem at hand.
In our scenario, correctly identifying people who will not repay their loans is more critical, as lending to such people can incur significant costs compared with rejecting loans for those who would repay. So we need to think about ways to improve the model’s performance on that front.
One way to achieve this is by adjusting the threshold.
Although doing so may seem counterintuitive, what matters to us is correctly identifying the people who are not going to repay their loans. Incorrectly labeling people who are actually going to repay is less important to us.
By adjusting the threshold value, we can make our model more sensitive to the Negative class (people who aren’t going to repay) at the expense of the Positive class (people who are going to repay). This may increase the number of False Negatives (classifying people who repaid as not going to repay), but can potentially reduce False Positives (failing to correctly identify people who didn’t repay).
Until now we’ve used a threshold of 0.5, but let’s try changing it to see whether our model performs better.
Let’s start by setting the threshold to 0.
This means that every person will be classified as going to repay (represented by blue):
This results in a confusion matrix with TP = 9500, FN = 0, TN = 0, and FP = 500…
…with accuracy: (9500 + 0) / 10,000 = 95%
…sensitivity and precision: 9500 / (9500 + 0) = 100% and 9500 / (9500 + 500) = 95%
…and specificity: 0 / (0 + 500) = 0%
At threshold = 0, our classifier cannot correctly classify any of the people who didn’t repay their loans, rendering it ineffective even though the accuracy and sensitivity may seem impressive.
Let’s try a threshold of 0.1:
So anyone with a credit score below 420 will be classified as not going to repay. This results in this confusion matrix and metrics:
Again, we see that all the metrics except specificity are pretty great.
Next, let’s go to the other extreme and set the threshold to 0.9:
So anyone with a credit score below 760 will be labeled as not going to repay. This results in this confusion matrix and metrics:
Here, we see the metrics are almost flipped. The specificity and precision are great, but the accuracy and sensitivity are terrible.
You get the idea. We can do this for a bunch more thresholds (0.004, 0.3, 0.6, 0.875…). But doing so will result in a staggering number of confusion matrices and metrics. And that will cause a lot of confusion. Pun definitely intended.
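To get a feel for how the metrics move as the threshold changes, here is a rough sketch of a threshold sweep. Everything about the data is an assumption: the logistic curve is a hypothetical stand-in for the fitted model, and the records are randomly generated, so the numbers it prints will not match the confusion matrices in this article.

```python
import math
import random


def repay_probability(credit_score: float) -> float:
    """Hypothetical logistic curve standing in for the fitted model."""
    return 1.0 / (1.0 + math.exp(-0.0122 * (credit_score - 600.0)))


# Synthetic records (an assumption, not the article's data): random credit
# scores, with repayment sampled from the hypothetical curve above.
rng = random.Random(0)
scores = [rng.uniform(300, 850) for _ in range(2000)]
repaid = [rng.random() < repay_probability(s) for s in scores]


def ratio(numerator: int, denominator: int) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def metrics_at(threshold: float) -> dict:
    """Confusion-matrix counts and metrics for one threshold."""
    tp = fn = tn = fp = 0
    for score, did_repay in zip(scores, repaid):
        predicted_repay = repay_probability(score) >= threshold
        if did_repay and predicted_repay:
            tp += 1
        elif did_repay:
            fn += 1
        elif predicted_repay:
            fp += 1
        else:
            tn += 1
    return {
        "TP": tp, "FN": fn, "TN": tn, "FP": fp,
        "accuracy": ratio(tp + tn, len(scores)),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


for threshold in (0.0, 0.1, 0.5, 0.9):
    m = metrics_at(threshold)
    print(
        f"threshold={threshold:.1f}  accuracy={m['accuracy']:.3f}  "
        f"precision={m['precision']:.3f}  sensitivity={m['sensitivity']:.3f}  "
        f"specificity={m['specificity']:.3f}"
    )
```

Whatever the data, raising the threshold can only shrink the set of people labeled as going to repay, so sensitivity never goes up and specificity never goes down as the threshold increases.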
ROC Curve
This is where the Receiver Operating Characteristic (ROC) curve comes in to dispel this confusion.
The y-axis of the curve is the True Positive Rate, which is the same as Sensitivity. The x-axis is the False Positive Rate, which is 1 - Specificity.
The False Positive Rate tells us the proportion of people who didn’t repay that were incorrectly classified as going to repay: FP / (FP + TN).
So when threshold = 0, we saw earlier that our confusion matrix and metrics were:
We know that the True Positive Rate = Sensitivity = 1 and the False Positive Rate = 1 - Specificity = 1 - 0 = 1.
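Each confusion matrix boils down to a single (False Positive Rate, True Positive Rate) point. Here is a minimal sketch of that conversion; the function name is my own, and the threshold = 0 counts follow from everyone being classified as going to repay.

```python
def roc_point(tp: int, fn: int, tn: int, fp: int) -> tuple:
    """Convert confusion-matrix counts into an (FPR, TPR) point."""
    tpr = tp / (tp + fn)   # same as sensitivity
    fpr = fp / (fp + tn)   # same as 1 - specificity
    return fpr, tpr


# Threshold = 0: all 10,000 people are classified as going to repay.
print(roc_point(tp=9500, fn=0, tn=0, fp=500))  # (1.0, 1.0)

# Threshold = 0.5: the counts from the original confusion matrix.
fpr, tpr = roc_point(tp=9000, fn=500, tn=200, fp=300)
print(round(fpr, 3), round(tpr, 3))  # 0.6 0.947
```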
Now let’s plot this point on the ROC curve:
This dotted blue line shows us where the True Positive Rate = False Positive Rate:
Any point on this line means that the proportion of correctly classified people who repaid is the same as the proportion of incorrectly classified people who didn’t repay.
The bottom line is that we want our threshold point to be as far to the left of this line as possible, and we don’t want any point below this line.
Now when threshold = 0.1:
Plotting this threshold on the ROC curve:
Since the new point (0.84, 0.989) is to the left of the blue dotted line, we know that the proportion of correctly classified people who repaid is greater than the proportion of incorrectly classified people who didn’t repay.
In other words, the new threshold is better than the first one, which sits on the blue dotted line.
Now let’s increase the threshold to 0.2. We calculate the True Positive Rate and False Positive Rate for this threshold and plot it:
The new point (0.75, 0.98) is even further to the left of the dotted blue line, showing that the new threshold is better than the previous one.
Now we keep repeating the same process for a few other thresholds (0.35, 0.5, 0.65, 0.7, 0.8) until we reach threshold = 1.
At threshold = 1, we’re at the point (0, 0), where True Positive Rate = False Positive Rate = 0, because the classifier classifies all the points as not going to repay.
Now, without having to sort through all the confusing matrices and metrics, I can see that the threshold at the purple point is better than the threshold at the blue point.
That’s because at the purple point, TPR = 0.8 and FPR = 0.
In other words, this threshold resulted in no False Positives. Whereas at the blue point, although 80% of the people who repaid are correctly classified, only 80% of the people who didn’t repay are correctly classified (versus 100% for the previous threshold).
Now if we connect all these dots…
…we end up with the ROC curve.
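The whole procedure (pick a threshold, count, convert to a point, repeat) fits in one small function. This is a sketch on a tiny made-up dataset, so the points it produces are not the ones plotted in this article.

```python
def roc_points(labels, probabilities, thresholds):
    """Return one (FPR, TPR) point per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in thresholds:
        tp = fp = 0
        for label, probability in zip(labels, probabilities):
            if probability >= threshold:   # classified as going to repay
                if label == 1:
                    tp += 1
                else:
                    fp += 1
        points.append((fp / negatives, tp / positives))
    return points


# Made-up data: 1 = repaid, 0 = did not repay, plus the model's probabilities.
labels = [1, 1, 0, 1, 1, 0, 1, 0]
probabilities = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]

points = roc_points(labels, probabilities, thresholds)
for threshold, (fpr, tpr) in zip(thresholds, points):
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The first point is always (1, 1) and the last is (0, 0), matching the two extremes described above; connecting the points in between traces out the curve.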
AUC
Now let’s say we want to compare two different classifiers that we built. For instance, the first classifier is the logistic regression one we have been using so far, which resulted in this ROC curve:
And suppose we built another classifier, a decision tree, which resulted in this ROC curve:
Then one way to compare the two classifiers is to calculate the areas under their respective curves, or AUCs.
Since the AUC of the logistic regression curve is greater, we conclude that it’s the better classifier.
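One simple way to estimate the area under a curve made of a handful of points is the trapezoidal rule. This is a sketch under that assumption, and the two sets of ROC points are made up for illustration, not read off the curves above.

```python
def auc(points):
    """Area under a ROC curve given (FPR, TPR) points, via the trapezoidal rule."""
    ordered = sorted(points)  # left to right along the x-axis (FPR)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up ROC points for two hypothetical classifiers.
logistic_points = [(0.0, 0.0), (0.0, 0.8), (0.2, 0.9), (0.5, 0.97), (1.0, 1.0)]
tree_points = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.7), (0.6, 0.85), (1.0, 1.0)]

print(auc([(0.0, 0.0), (1.0, 1.0)]))  # 0.5, the diagonal line
print(auc(logistic_points) > auc(tree_points))  # True
```

The diagonal line gives an area of 0.5, and a curve that hugs the top-left corner approaches 1, which is why a larger AUC is read as a better classifier.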
In summary, we discussed commonly used metrics for evaluating classification models. However, it is important to note that the choice of metrics is subjective and depends on understanding the problem at hand and the business requirements. It can also be useful to use a combination of these metrics, or even to create new ones that are more appropriate for the specific model’s needs.
Massive shoutout to StatQuest, my favorite statistics and machine learning resource. And please feel free to connect with me on LinkedIn or shoot me an email at shreya.statistics@gmail.com.


