The Machine Learning “Advent Calendar” Day 13: LASSO and Ridge Regression in Excel


One day, a data scientist told me that Ridge Regression was an advanced model, because he saw that the training formula is more complicated.

Well, this is precisely the goal of my Machine Learning “Advent Calendar”: to demystify this kind of complexity.

So, today, we are going to talk about penalized versions of linear regression.

  • First, we will see why regularization, or penalization, is essential, and how it modifies the model.
  • Then we will explore various kinds of regularization and their effects.
  • We will also train the models with regularization and test different hyperparameters.
  • We will even ask an extra question about weighting the weights in the penalization term. (Confused? You will see.)

Linear regression and its “conditions”

When we talk about linear regression, people often mention that some conditions need to be satisfied.

You may have heard statements like:

  • the residuals should be Gaussian (this is often confused with the target being Gaussian, which is false)
  • the explanatory variables should not be collinear

In classical statistics, these conditions are required for inference. In machine learning, the main focus is on prediction, so these assumptions are less central, but the underlying issues still exist.

Here, we will look at an example of two collinear features, and let’s make them perfectly equal.

And now we have the relationship: y = x1 + x2, with x1 = x2.

I know that if they are perfectly equal, we can just write: y = 2*x1. But the idea is that they could be merely very similar, and we can always build a model using both of them, right?

Then what’s the problem?

When features are perfectly collinear, the solution is not unique. Here is an example in the screenshot below.

y = 10000*x1 - 9998*x2

Ridge and Lasso in Excel – all images by author

And we can notice that the norm of the coefficients is huge.

So, the idea is to limit the norm of the coefficients.
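To make this concrete, here is a minimal Python sketch (the post itself does everything in Excel; this snippet and its tiny dataset are purely illustrative). Two wildly different coefficient pairs produce exactly the same predictions, but very different coefficient norms:

    import numpy as np

    # Two perfectly collinear features: x2 is an exact copy of x1
    x1 = np.array([1.0, 2.0, 3.0, 4.0])
    x2 = x1.copy()
    y = x1 + x2  # the "true" relationship: y = x1 + x2

    # Two very different coefficient pairs give identical predictions
    for a1, a2 in [(1.0, 1.0), (10000.0, -9998.0)]:
        predictions = a1 * x1 + a2 * x2      # equals 2 * x1 in both cases
        coef_norm = np.sqrt(a1**2 + a2**2)   # L2 norm of the coefficients
        print(f"a1={a1:+.0f}, a2={a2:+.0f} -> norm={coef_norm:.1f}, predictions={predictions}")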

And after applying the regularization, the conceptual model is the same!

That’s right. The parameters of the linear regression are modified, but the model itself stays the same.

Different Versions of Regularization

So the idea is to combine the MSE and the norm of the coefficients.

Instead of just minimizing the MSE, we try to minimize the sum of the two terms.

Which norm? We can use the L1 norm, the L2 norm, or even combine them.

There are three classical ways to do this, each with a corresponding model name.
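Written out in the plain-text style of this post (with lambda as the regularization strength and ai the coefficients; the exact weighting conventions vary slightly between textbooks and libraries), the three penalized losses are:

  • Ridge: Loss = MSE + lambda * sum(ai^2)
  • Lasso: Loss = MSE + lambda * sum(|ai|)
  • Elastic Net: Loss = MSE + lambda * (alpha * sum(|ai|) + (1 - alpha) * sum(ai^2))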

Ridge regression (L2 penalty)

Ridge regression adds a penalty on the squared values of the coefficients.

Intuitively:

  • large coefficients are heavily penalized (because of the square)
  • coefficients are pushed toward zero
  • but they never become exactly zero

Effect:

  • all features remain in the model
  • coefficients are smoother and more stable
  • very effective against collinearity

Ridge shrinks, but doesn’t select.

Ridge regression in Excel – All images by author
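Outside Excel, a minimal scikit-learn sketch shows the same shrinking behaviour; the synthetic dataset below is my own illustration, not the one from the workbook:

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))              # the third feature is pure noise
    y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

    print(LinearRegression().fit(X, y).coef_)  # roughly [3, 2, 0]
    print(Ridge(alpha=10.0).fit(X, y).coef_)   # all shrunk toward 0, none exactly 0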

Lasso regression (L1 penalty)

Lasso uses a different penalty: the absolute value of the coefficients.

This small change has a big consequence.

With Lasso:

  • some coefficients can become exactly zero
  • the model automatically ignores some features

This is why LASSO is named the way it is: it stands for Least Absolute Shrinkage and Selection Operator.

  • Operator: it refers to the regularization operator added to the loss function
  • Least: it is derived from a least-squares regression framework
  • Absolute: it uses the absolute value of the coefficients (L1 norm)
  • Shrinkage: it shrinks coefficients toward zero
  • Selection: it can set some coefficients exactly to zero, performing feature selection

Important nuance:

  • we can say that the model still has the same number of coefficients
  • but some of them are forced to zero during training

The model form is unchanged, but Lasso effectively removes features by driving coefficients to zero.

Lasso in Excel – All images by author
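Again a small scikit-learn sketch (same synthetic data idea as in the Ridge snippet above): with a strong enough penalty, the useless coefficient lands exactly on zero.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))   # the third feature is pure noise
    y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

    lasso = Lasso(alpha=0.5).fit(X, y)
    print(lasso.coef_)              # the noise coefficient is exactly 0.0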

Elastic Net (L1 + L2)

Elastic Net is a combination of Ridge and Lasso.

It uses:

  • an L1 penalty (like Lasso)
  • and an L2 penalty (like Ridge)

Why combine them?

Because:

  • Lasso can be unstable when features are highly correlated
  • Ridge handles collinearity well but doesn’t select features

Elastic Net gives a balance between:

  • stability
  • shrinkage
  • sparsity

It is often the most practical choice on real datasets.
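As a sketch in scikit-learn terms (the parameter names are sklearn’s; l1_ratio balances the two penalties):

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

    # l1_ratio=1.0 would be pure Lasso, l1_ratio=0.0 pure Ridge
    model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
    print(model.coef_)   # shrinkage plus, possibly, some exact zeros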

What really changes: model, training, tuning

Let us look at this from a Machine Learning standpoint.

The model does not really change

For all the regularized versions, we still write the same model:

y = a*x + b

  • Same number of coefficients
  • Same prediction formula
  • But the coefficient values will be different.

From a certain perspective, Ridge, Lasso, and Elastic Net are not different models.

The training principle is also the same

We still:

  • define a loss function
  • minimize it
  • compute gradients
  • update coefficients

The only difference is:

  • the loss function now includes a penalty term

That’s it.
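As a minimal sketch of that single difference (one coefficient a and one intercept b, as in the Excel sheets; the function name and defaults are mine):

    import numpy as np

    def penalized_loss(a, b, x, y, lam=0.0, penalty=None):
        # MSE plus an optional penalty term on the coefficient a
        mse = np.mean((y - (a * x + b)) ** 2)
        if penalty == "ridge":
            return mse + lam * a ** 2    # L2 penalty
        if penalty == "lasso":
            return mse + lam * abs(a)    # L1 penalty
        return mse                       # plain OLS loss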

Hyperparameters are added (this is the real difference)

For standard linear regression, we have no control over the “complexity” of the model.

  • Standard linear regression: no hyperparameter
  • Ridge: one hyperparameter (lambda)
  • Lasso: one hyperparameter (lambda)
  • Elastic Net: two hyperparameters
    • one for overall regularization strength
    • one to balance L1 vs L2
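In scikit-learn, for instance, this difference is visible directly in the constructors (note that sklearn calls the strength alpha rather than lambda):

    from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

    models = {
        "OLS":         LinearRegression(),                   # nothing to tune
        "Ridge":       Ridge(alpha=1.0),                     # one knob: the strength
        "Lasso":       Lasso(alpha=1.0),                     # one knob: the strength
        "Elastic Net": ElasticNet(alpha=1.0, l1_ratio=0.5),  # strength + L1/L2 balance
    }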

So:

  • standard linear regression doesn’t need tuning
  • penalized regressions do

This is why standard linear regression is often seen as “not really Machine Learning”, while regularized versions clearly are.

Implementation of Regularized gradients

We keep the gradient descent of OLS regression as a reference, and for Ridge regression, we only need to add the regularization term for the coefficient.

We will use a simple dataset that I generated (the same one we already used for Linear Regression).

We can see that the three “models” differ in terms of coefficients. The goal of this chapter is to implement the gradient for each model and compare them.

Ridge lasso regression in Excel – All images by author

Ridge with penalized gradient

First, we can do it for Ridge: we only need to change the gradient of a.

This does not mean that the value of b is unchanged, since the gradient of b at each step also depends on a.

Ridge lasso regression in Excel – All images by author
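Here is a minimal numpy version of the update the Excel sheet performs; the learning rate, the lambda default, and the choice not to penalize the intercept are my own assumptions:

    import numpy as np

    def ridge_gd(x, y, lam=0.1, lr=0.01, n_iter=1000):
        # Gradient descent for y = a*x + b with an L2 penalty on a
        a, b = 0.0, 0.0
        for _ in range(n_iter):
            residual = y - (a * x + b)
            grad_a = -2 * np.mean(x * residual) + 2 * lam * a  # penalty only here
            grad_b = -2 * np.mean(residual)                    # intercept not penalized
            a -= lr * grad_a
            b -= lr * grad_b
        return a, b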

LASSO with penalized gradient

Then we can do the same for LASSO.

And again, the only difference is the gradient of a.

For each model, we can also calculate the MSE and the regularized MSE. It is quite satisfying to see how they decrease over the iterations.

Ridge lasso regression in Excel – All images by author
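And a matching LASSO sketch, identical except for the (sub)gradient of the penalty; the commented line shows where the MSE tracking mentioned above would go:

    import numpy as np

    def lasso_gd(x, y, lam=0.1, lr=0.01, n_iter=1000):
        # Gradient descent for y = a*x + b with an L1 penalty on a
        a, b = 0.0, 0.0
        for _ in range(n_iter):
            residual = y - (a * x + b)
            grad_a = -2 * np.mean(x * residual) + lam * np.sign(a)  # subgradient of |a|
            grad_b = -2 * np.mean(residual)
            a -= lr * grad_a
            b -= lr * grad_b
            # mse = np.mean(residual ** 2); reg_mse = mse + lam * abs(a)
        return a, b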

Comparison of the coefficients

Now, we can visualize the coefficient a for all three models. In order to see the differences, we use very large lambdas.

Ridge lasso regression in Excel – All images by author

Impact of lambda

For large values of lambda, we can see that the coefficient a becomes small.

And if the LASSO lambda becomes extremely large, then we theoretically get a value of 0 for a. Numerically, we would have to improve the gradient descent, since plain gradient steps tend to oscillate around zero instead of landing exactly on it.

Ridge lasso regression in Excel – All images by author
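Reusing the ridge_gd and lasso_gd helpers sketched above on a synthetic dataset, the sweep could look like this (exact numbers depend on the data and learning rate):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=50)
    y = 2 * x + 1 + rng.normal(scale=0.1, size=50)

    for lam in [0.0, 0.1, 1.0, 10.0]:
        a_ridge, _ = ridge_gd(x, y, lam=lam)
        a_lasso, _ = lasso_gd(x, y, lam=lam)  # huge lam: plain steps oscillate near 0
        print(f"lambda={lam:5.1f}  a_ridge={a_ridge:.3f}  a_lasso={a_lasso:.3f}")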

Regularized Logistic Regression?

We saw Logistic Regression yesterday, and one question we can ask is whether it can also be regularized. If yes, what are the regularized versions called?

The answer is of course yes: Logistic Regression can be regularized.

The exact same idea applies.

Logistic regression can also be:

  • L1 penalized
  • L2 penalized
  • Elastic Net penalized

There are no special names like “Ridge Logistic Regression” in common usage.

Why?

Because the concept is no longer new.

In practice, libraries like scikit-learn simply allow you to specify:

  • the loss function
  • the penalty type
  • the regularization strength
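For example (where, as a caveat, sklearn’s C is the inverse of the regularization strength, and the solver has to support the chosen penalty):

    from sklearn.linear_model import LogisticRegression

    l2 = LogisticRegression(penalty="l2", C=1.0)   # L2 is the default
    l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    en = LogisticRegression(penalty="elasticnet", solver="saga",
                            l1_ratio=0.5, C=1.0)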

The naming mattered when the concept was new.
Now, regularization is just a standard option.

Other questions we can ask:

  • Is regularization always useful?
  • How does the scaling of features impact the performance of regularized linear regression?

Conclusion

Ridge and Lasso do not change the linear model itself; they change how the coefficients are learned. By adding a penalty, regularization favors stable and meaningful solutions, especially when features are correlated. Seeing the process step by step in Excel makes it clear that these methods are not more complex, just more controlled.
