The Machine Learning “Advent Calendar” Day 14: Softmax Regression in Excel


With Logistic Regression, we learned to classify into two classes.

Now, what happens if there are more than two classes?

Softmax Regression is just the multiclass extension of this concept. And we are going to discuss this model for Day 14 of my Machine Learning “Advent Calendar” (follow this link to get all the details about the approach and the files I use).

Instead of one score, we now create one score per class. Instead of one probability, we apply the Softmax function to produce probabilities that sum to 1.

Understanding the Softmax model

Before training the model, let us first understand what the model is.

Softmax Regression is not about optimization yet.
It is first about how predictions are computed.

A tiny dataset with 3 classes

Let us use a small dataset with one feature x and three classes.

As we said before, the target variable y should not be treated as numerical.
It represents categories, not quantities.

A typical way to represent this is one-hot encoding, where each class is represented by its own indicator.

From this standpoint, Softmax Regression can be seen as three Logistic Regressions running in parallel, one per class.
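As a minimal sketch of what the one-hot encoding could look like in the sheet (the layout is an assumption: y in column B, data starting at row 2), each indicator column tests for one class:

indicator for class 0: =IF($B2=0,1,0)
indicator for class 1: =IF($B2=1,1,0)
indicator for class 2: =IF($B2=2,1,0)

Each row then contains exactly one 1 and two 0s.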

Small datasets are perfect for learning.
You can see every formula, every value, and how each part of the model contributes to the result.

Softmax regression in Excel – All images by author

Description of the Model

So what’s the model, exactly?

Score per class

In logistic regression, the model score is a simple linear expression: score = a * x + b.

Softmax Regression does exactly the same, but with one score per class:

score_0 = a0 * x + b0
score_1 = a1 * x + b1
score_2 = a2 * x + b2

At this stage, these scores are just real numbers.
They are not probabilities yet.
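As a sketch of how this could look in the sheet (the layout is an assumption: x in column A starting at row 2, the coefficients a0, a1, a2 in $N$2:$N$4 and b0, b1, b2 in $O$2:$O$4):

In C2 (score_0): =$N$2*$A2+$O$2
In D2 (score_1): =$N$3*$A2+$O$3
In E2 (score_2): =$N$4*$A2+$O$4

Dragging these formulas down computes the three scores for every row.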

Turning scores into probabilities: the Softmax step

Softmax converts the three scores into three probabilities. Each probability is positive, and all three sum to 1.

The computation is direct:

  1. Exponentiate each score
  2. Compute the sum of all exponentials
  3. Divide each exponential by this sum

This gives us p0, p1, and p2 for each row.

These values represent the model confidence for each class.
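Continuing the assumed layout (scores in columns C, D, E), the Softmax step could be written as:

In F2 (p0): =EXP($C2)/(EXP($C2)+EXP($D2)+EXP($E2))
In G2 (p1): =EXP($D2)/(EXP($C2)+EXP($D2)+EXP($E2))
In H2 (p2): =EXP($E2)/(EXP($C2)+EXP($D2)+EXP($E2))

As a quick sanity check, F2+G2+H2 should always equal 1.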

At this point, the model is fully defined.
Training the model will simply consist in adjusting the coefficients a_k and b_k so that these probabilities match the observed classes as well as possible.

Softmax regression in Excel – All images by author

Visualizing the Softmax model

At this point, the model is fully defined.

We now have:

  • one linear score per class
  • a Softmax step that turns these scores into probabilities

Training the model simply consists in adjusting the coefficients a_k and b_k so that these probabilities match the observed classes as well as possible.

Once the coefficients have been found, we can visualize the model behavior.

To do this, we take a range of input values, for example x from 0 to 7, and we compute score_0, score_1, score_2 and the corresponding probabilities p0, p1, p2.

Plotting these probabilities gives three smooth curves, one per class.
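One possible way to build this plotting table (an assumption, on a separate sheet): generate the x grid with an increment formula, then reuse the score and Softmax formulas from above with the final coefficients.

In A2: 0
In A3: =A2+0.1 (dragged down until x reaches 7)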

Softmax regression in Excel – All images by author

The result is very intuitive.

For small values of x, the probability of class 0 is high.
As x increases, this probability decreases, while the probability of class 1 increases.
For larger values of x, the probability of class 2 becomes dominant.

At every value of x, the three probabilities sum to 1.
The model doesn’t make abrupt decisions; instead, it expresses how confident it is in each class.

This plot makes the behavior of Softmax Regression easy to understand.

  • You can see how the model transitions smoothly from one class to another
  • Decision boundaries correspond to intersections between probability curves
  • The model logic becomes visible, not abstract

This is one of the key advantages of building the model in Excel:
you don’t just compute predictions, you can see how the model thinks.

Now that the model is defined, we need a way to evaluate how good it is, and a way to improve its coefficients.

Both steps reuse ideas we already saw with Logistic Regression.

Evaluating the model: Cross-Entropy Loss

Softmax Regression uses the same loss function as Logistic Regression.

For each data point, we look at the probability assigned to the correct class, and we take the negative logarithm:

loss = -log(p_true_class)

If the model assigns a high probability to the correct class, the loss is small.
If it assigns a low probability, the loss becomes large.

In Excel, this is very easy to implement.

We select the correct probability based on the value of y, and apply the logarithm:

loss = -LN( CHOOSE(y + 1, p0, p1, p2) )

Finally, we compute the average loss over all rows.
This average loss is the quantity we want to minimize.
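With the assumed layout (y in column B, probabilities in F, G, H, and data in rows 2 to 16, a hypothetical range), this could be:

In I2 (row loss): =-LN(CHOOSE($B2+1,F2,G2,H2))
Average loss: =AVERAGE(I2:I16)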

Softmax regression in Excel – All images by author

Computing residuals

To update the coefficients, we start by computing residuals, one per class.

For each row:

  • residual_0 = p0 - (1 if y = 0, otherwise 0)
  • residual_1 = p1 - (1 if y = 1, otherwise 0)
  • residual_2 = p2 - (1 if y = 2, otherwise 0)

In other words, for the correct class, we subtract 1.
For the other classes, we subtract 0.

These residuals measure how far the predicted probabilities are from the observed classes.
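In the assumed layout (probabilities in F, G, H and y in column B), the residuals could be computed as:

In J2 (residual_0): =F2-IF($B2=0,1,0)
In K2 (residual_1): =G2-IF($B2=1,1,0)
In L2 (residual_2): =H2-IF($B2=2,1,0)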

Computing the gradients

The gradients are obtained by combining the residuals with the feature values.

For each class k:

  • the gradient for a_k is the average of residual_k * x
  • the gradient for b_k is the average of residual_k

In Excel, this is implemented with simple formulas such as SUMPRODUCT and AVERAGE.

At this point, everything is explicit:
you see the residuals, the gradients, and how each data point contributes.
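For class 0, still with the assumed layout (x in column A, residual_0 in column J, and data in rows 2 to 16), the two gradients could be:

grad_a0: =SUMPRODUCT($A$2:$A$16,$J$2:$J$16)/COUNT($A$2:$A$16)
grad_b0: =AVERAGE($J$2:$J$16)

The same pattern on columns K and L gives the gradients for the other two classes.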

Softmax regression in Excel – All images by author

Updating the coefficients

Once the gradients are known, we update the coefficients using gradient descent.

This step is identical to what we saw before, for Logistic Regression or Linear Regression.
The only difference is that we now update six coefficients instead of two.

To visualize learning, we create a second sheet with one row per iteration:

  • the current iteration number
  • the six coefficients (a0, b0, a1, b1, a2, b2)
  • the loss
  • the gradients

Row 2 corresponds to iteration 0, with the initial coefficients.

Row 3 computes the updated coefficients using the gradients from row 2.

By dragging the formulas down for hundreds of rows, we simulate gradient descent over many iterations.
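A minimal sketch of the update formulas on this iteration sheet (the layout is an assumption: iteration number in column A, the coefficients a0, b0, a1, b1, a2, b2 in columns B to G, the gradients in columns I to N, and the learning rate in a named cell LearningRate):

In A3: =A2+1
In B3 (a0): =B2-LearningRate*I2
In C3 (b0): =C2-LearningRate*J2
In D3 (a1): =D2-LearningRate*K2
In E3 (b1): =E2-LearningRate*L2
In F3 (a2): =F2-LearningRate*M2
In G3 (b2): =G2-LearningRate*N2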

You can then clearly see:

  • the coefficients progressively stabilizing
  • the loss decreasing iteration after iteration

This makes the learning process tangible.
Instead of imagining an optimizer, you can watch the model learn.

Logistic Regression as a Special Case of Softmax Regression

Logistic Regression and Softmax Regression are sometimes presented as different models.

In reality, they are the same idea at different scales.

Softmax Regression computes one linear score per class and turns these scores into probabilities by comparing them.
When there are only two classes, this comparison depends only on the difference between the two scores.

This difference is a linear function of the input, and applying Softmax in this case produces exactly the logistic (sigmoid) function.
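To see why, write out the two-class case:

p1 = exp(score_1) / (exp(score_0) + exp(score_1))
   = 1 / (1 + exp(-(score_1 - score_0)))

Since score_1 - score_0 = (a1 - a0) * x + (b1 - b0) is itself a single linear score, p1 is exactly the sigmoid of a linear expression, which is the logistic model.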

In other words, Logistic Regression is just Softmax Regression applied to two classes, with redundant parameters removed.

Once this is understood, moving from binary to multiclass classification becomes a natural extension, not a conceptual jump.

Softmax Regression does not introduce a new way of thinking.

It simply shows that Logistic Regression already contained everything we needed.

By duplicating the linear score once per class and normalizing the scores with Softmax, we move from binary decisions to multiclass probabilities without changing the underlying logic.

The loss is the same idea.
The gradients have the same structure.
The optimization is the same gradient descent we already know.

What changes is only the number of parallel scores.

Another Way to Handle Multiclass Classification?

Softmax is not the only way to deal with multiclass problems in weight-based models.

There is another approach, less elegant conceptually, but quite common in practice:
one-vs-rest or one-vs-one classification.

Instead of building a single multiclass model, we train several binary models and combine their results.
This strategy is used extensively with Support Vector Machines.

Tomorrow, we will take a look at SVM.
And you will see that it can be explained in a rather unusual way… and, as usual, directly in Excel.
