The Machine Learning “Advent Calendar” Bonus 1: AUC in Excel

In this bonus article, we'll implement AUC in Excel.

AUC is commonly used as a performance metric for classification tasks.

But we start with a confusion matrix, because that’s where everyone begins in practice. Then we’ll see why a single confusion matrix is just not enough.

And we may even answer these questions:

  • AUC means Area Under the Curve, but under which curve?
  • Where does that curve come from?
  • Why is the area meaningful?
  • Is AUC a probability? (Yes, it has a probabilistic interpretation)

1. Why a confusion matrix is just not enough

1.1 Scores from models

A classifier will often give us scores, not final decisions. The decision comes later, when we choose a threshold.

If you read the previous “Advent Calendar” articles, you may have already seen that “score” can mean various things depending on the model family:

  • Distance-based models (such as k-NN) often compute the proportion of neighbors for a given class (or a distance-based confidence), which becomes a score.
  • Density-based models compute a likelihood under each class, then normalize to get a final (posterior) probability.
  • Classification Tree-based models often output the proportion of a given class among the training samples contained in the leaf (that is why many points share the same score).
  • Weight-based models (linear models, kernels, neural networks) compute a weighted sum or a non-linear score, and sometimes apply a calibration step (sigmoid, softmax, Platt scaling, etc.) to map it to a probability.

So regardless of the approach, we end up with the same situation: a score per observation.

Then, in practice, we pick a threshold, often 0.5, and we convert scores into predicted classes.
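As a minimal sketch (assuming a purely illustrative layout where the scores sit in column A starting at A2, the actual labels in column B, and the threshold in cell F1), the conversion is a single formula placed in C2 and copied down:

    =IF(A2 >= $F$1, 1, 0)

Changing the value in F1 instantly changes every predicted class in column C.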

And this is precisely where the confusion matrix enters the story.

1.2 The confusion matrix at one threshold

Once a threshold is chosen, every observation becomes a binary decision:

  • predicted positive (1) or predicted negative (0)

From that, we can count four numbers:

  • TP (True Positives): predicted 1 and actually 1
  • TN (True Negatives): predicted 0 and actually 0
  • FP (False Positives): predicted 1 but actually 0
  • FN (False Negatives): predicted 0 but actually 1

This 2×2 counting table is the confusion matrix.
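In Excel, the four counts can be obtained with COUNTIFS. As a sketch, with the same illustrative layout as above (actual labels in B2:B21, predicted classes in C2:C21) and the counts stored in F2 to F5:

    TP (F2):  =COUNTIFS($C$2:$C$21, 1, $B$2:$B$21, 1)
    FP (F3):  =COUNTIFS($C$2:$C$21, 1, $B$2:$B$21, 0)
    FN (F4):  =COUNTIFS($C$2:$C$21, 0, $B$2:$B$21, 1)
    TN (F5):  =COUNTIFS($C$2:$C$21, 0, $B$2:$B$21, 0)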

Then we typically compute ratios such as the following (an Excel sketch comes right after the list):

  • Precision = TP / (TP + FP)
  • Recall (TPR) = TP / (TP + FN)
  • Specificity = TN / (TN + FP)
  • FPR = FP / (FP + TN)
  • Accuracy = (TP + TN) / Total
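With the counts stored as above (TP in F2, FP in F3, FN in F4, TN in F5, an assumed layout), these ratios are direct translations of the definitions:

    Precision:     =F2 / (F2 + F3)
    Recall (TPR):  =F2 / (F2 + F4)
    Specificity:   =F5 / (F5 + F3)
    FPR:           =F3 / (F3 + F5)
    Accuracy:      =(F2 + F5) / (F2 + F3 + F4 + F5)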

So far, everything is clean and intuitive.

But there is a hidden limitation: all these values depend on the threshold. So the confusion matrix evaluates the model at one operating point, not the model itself.

Confusion matrix – image by author

1.3 When one threshold breaks everything

This is a strange example, but it still makes the point very clearly.

Imagine that your threshold is set to 0.50, and all scores are below 0.50.

Then the classifier predicts:

  • Predicted Positive: none
  • Predicted Negative: everyone

So you get:

  • TP = 0, FP = 0
  • FN = 10, TN = 10
Confusion matrix with all scores below 0.5 – image by author

This is a perfectly valid confusion matrix. It also creates a very strange feeling:

  • Precision becomes #DIV/0! because there are no predicted positives (a small guard for this is sketched after the list).
  • Recall is 0% because you did not capture any positive.
  • Accuracy is 50%, which sounds “not too bad”, although the model found nothing.
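If you want the sheet to stay readable in this degenerate case, one option (a sketch, reusing the assumed cells with TP in F2 and FP in F3) is to wrap the precision formula in IFERROR:

    =IFERROR(F2 / (F2 + F3), "no predicted positives")

This does not fix anything about the model; it only replaces the #DIV/0! error with an explicit message.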

Nothing is wrong with the confusion matrix. The issue is the question we asked it to answer.

A confusion matrix answers: “How good is the model at this specific threshold?”

If the threshold is poorly chosen, the confusion matrix can make a model look useless, even when the scores contain real separation.

And in your table, the separation is visible: positives often have scores around 0.49, negatives are more around 0.20 or 0.10. The model is not random. Your threshold is just too strict.

That’s the reason a single threshold is just not enough.

What we need instead is a way to evaluate the model across thresholds, not at a single one.

2. ROC

First we have to build the curve: AUC stands for Area Under a Curve, so we have to understand this curve.

2.1 What ROC means (and what it is)

Because the first question everyone should ask is: AUC under which curve?

The answer is:

AUC is the area under the ROC curve.

But this raises another question.

What is the ROC curve, and where does it come from?

ROC stands for Receiver Operating Characteristic. The name is historical (early signal detection), but the idea is modern and simple: it describes what happens when you change the decision threshold.

The ROC curve is a plot with:

  • x-axis: FPR (False Positive Rate)
    FPR = FP / (FP + TN)
  • y-axis: TPR (True Positive Rate), also called Recall or Sensitivity
    TPR = TP / (TP + FN)

Each threshold gives one point (FPR, TPR). When you connect all points, you get the ROC curve.

At this stage, one detail matters: the ROC curve is not directly observed; it is constructed by sweeping the threshold over the score ordering.

2.2 Constructing the ROC curve from scores

Each score can be used as a threshold (and of course, we could also define custom thresholds).

For each threshold:

  • we compute TP, FP, FN, TN from the confusion matrix
  • then we calculate FPR and TPR

So the ROC curve is just the collection of all these (FPR, TPR) pairs, ordered from strict thresholds to permissive thresholds.

This is precisely what we’ll implement in Excel.

ROC from scores – image by author

At this point, it is important to note something that feels almost too simple. When we build the ROC curve, the actual numeric values of the scores don't matter. What matters is the order.

If one model outputs scores between 0 and 1, another outputs scores between -12 and +5, and a third outputs only two distinct values, ROC works the same way. As long as higher scores tend to correspond to the positive class, the threshold sweep will create the same sequence of decisions.

That is why the first step in Excel is always the same: sort by score from highest to lowest. Once the rows are in the right order, the rest is just counting.

2.3 Reading the ROC curve

In the Excel sheet, the construction becomes very concrete.

You sort observations by score, from highest to lowest. Then you walk down the list. At each row, you act as if the threshold is set to that score, meaning: everything above is predicted positive.

That lets Excel compute cumulative counts:

  • how many positives you have accepted so far
  • how many negatives you have accepted so far

From these cumulative counts and the dataset totals, we compute TPR and FPR.

Now every row is one ROC point.
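As a sketch of that construction, assume the sorted sheet holds the scores in A3:A22 and the actual labels (1 = positive, 0 = negative) in B3:B22, with row 2 left free for the starting point of the curve; the exact columns are illustrative:

    Total positives (H1):                    =COUNTIF($B$3:$B$22, 1)
    Total negatives (H2):                    =COUNTIF($B$3:$B$22, 0)
    Cumulative positives (C3, copied down):  =SUM($B$3:B3)
    Cumulative negatives (D3, copied down):  =COUNT($B$3:B3) - C3
    TPR (E3, copied down):                   =C3 / $H$1
    FPR (F3, copied down):                   =D3 / $H$2

With E2 = 0 and F2 = 0 as the starting point, rows 2 to 22 give the full list of ROC points.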

Why the ROC curve looks like a staircase

  • When the next accepted row is a positive, TP increases, so TPR increases while FPR stays flat.
  • When the next accepted row is a negative, FP increases, so FPR increases while TPR stays flat.

That’s the reason, with real finite data, the ROC curve is a staircase. Excel makes this visible.

2.4 Reference cases you should recognize

A few reference cases help you read the curve immediately:

  • Perfect classification: the curve goes straight up (TPR reaches 1 while FPR stays 0), then goes right along the top.
Perfect classification ROC – image by author
  • Random classifier: the curve stays near the diagonal line from (0,0) to (1,1).
Random classification ROC – image by author
  • Inverted ranking: the curve falls below the diagonal, and the AUC becomes smaller than 0.5. In that case we can simply replace each score with 1 - score. In theory, we can treat this as a fictional case; in practice, it usually happens when scores are interpreted in the wrong direction or class labels are swapped.
Inverted ranking ROC – image by author

These are not just theory. They are visual anchors. Once you have them, you can interpret any real ROC curve quickly.

3. ROC AUC

Now, with the curve, what can we do?

3.1 Computing the area

Once the ROC curve exists as a list of points (FPR, TPR), the AUC is pure geometry.

Between two consecutive points, the area added is the area of a trapezoid:

  • width = change in FPR
  • height = average TPR of the two points

In Excel, this becomes a “delta column” approach (sketched below the figure):

  • compute dFPR between consecutive rows
  • multiply by the average TPR
  • sum everything
ROC AUC in Excel – image by author
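A minimal sketch, reusing the assumed layout above (TPR in E2:E22, FPR in F2:F22, with row 2 holding the (0, 0) starting point): each trapezoid slice gets its own cell and the AUC is their sum.

    Slice (G3, copied down):  =(F3 - F2) * (E3 + E2) / 2
    AUC:                      =SUM(G3:G22)

If you prefer a single cell, SUMPRODUCT does the same computation at once:

    =SUMPRODUCT(F3:F22 - F2:F21, (E3:E22 + E2:E21) / 2)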

Different cases:

  • perfect classification: AUC = 1
  • random ranking: AUC ≈ 0.5
  • inverted ranking: AUC < 0.5

So the AUC is literally the summary of the entire ROC staircase.

3.2 AUC as a probability

AUC is not about choosing a threshold.

It answers a much simpler question:

If I randomly pick one positive example and one negative example, what is the probability that the model assigns a higher score to the positive one?

That’s all.

  • AUC = 1.0 means perfect ranking (the positive always gets a higher score)
  • AUC = 0.5 means random ranking (it is essentially a coin flip)
  • AUC < 0.5 means the ranking is inverted (negatives tend to get higher scores)

This interpretation is extremely useful, because it explains this essential point once more:

AUC only depends on the score ordering, not on the absolute values.

This is why ROC AUC works even when the “scores” are not perfectly calibrated probabilities. They can be raw scores, margins, leaf proportions, or any monotonic confidence value. As long as higher means “more likely positive”, AUC can evaluate the ranking quality.
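This pairwise reading can even be checked directly in the sheet, without building the curve, through the Mann-Whitney rank formula. As a sketch, reusing the assumed layout (scores in A3:A22, labels in B3:B22, number of positives in H1, number of negatives in H2) and putting the ranks in column I:

    Rank of each score (I3, copied down):  =RANK.AVG(A3, $A$3:$A$22, 1)
    Sum of the positive ranks (H3):        =SUMIF($B$3:$B$22, 1, $I$3:$I$22)
    AUC (H4):                              =(H3 - $H$1 * ($H$1 + 1) / 2) / ($H$1 * $H$2)

This value should match the trapezoid AUC from the previous section, with a tie between a positive and a negative score counted as half a win.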

Conclusion

A confusion matrix evaluates a model at one threshold, but classifiers produce scores, not decisions.
ROC and AUC evaluate the model across all thresholds by focusing on ranking, not calibration.

In the end, AUC answers a simple question: how often does a positive example receive a higher score than a negative one?
Seen this way, ROC AUC is an intuitive metric, and a spreadsheet is enough to make every step explicit.
