The Machine Learning “Advent Calendar” Day 20: Gradient Boosted Linear Regression in Excel

In the previous article, we explored ensemble learning with voting, bagging, and Random Forest.

Voting itself is simply an aggregation mechanism. It does not create diversity; it combines predictions from models that are already different.
Bagging, however, explicitly creates diversity by training the same base model on multiple bootstrapped versions of the training dataset.

Random Forest extends bagging by additionally restricting the set of features considered at each split.

From a statistical viewpoint, the idea is simple and intuitive: diversity is created through randomness, without introducing any fundamentally new modeling concept.

But ensemble learning doesn’t stop there.

There is another family of ensemble methods that does not rely on randomness at all, but on optimization. Gradient Boosting belongs to this family. And to really understand it, we will start with a deliberately strange idea:

We will apply Gradient Boosting to Linear Regression.

Yes, I know. This is probably the first time you have heard of Gradient Boosted Linear Regression.

(We will see Gradient Boosted Decision Trees tomorrow.)

In this article, here is the plan:

  • First, we will step back and revisit the three fundamental steps of machine learning.
  • Then, we will introduce the Gradient Boosting algorithm.
  • Next, we will apply Gradient Boosting to linear regression.
  • Finally, we will reflect on the connection between Gradient Boosting and Gradient Descent.

1. Machine Learning in Three Steps

To make machine learning easier to learn, I always separate it into three clear steps. Let us apply this framework to Gradient Boosted Linear Regression.

Unlike with bagging, each step here reveals something interesting.

Three learning steps in Machine Learning – all images by the author

1. Model

A model is something that takes input features and produces an output prediction.

In this article, the base model will be Linear Regression.

1 bis. Ensemble Method Model

Gradient Boosting is not a model itself. It is an ensemble method that aggregates several base models into a single meta-model. By itself, it does not map inputs to outputs; it must be applied to a base model.

Here, Gradient Boosting will be used to aggregate linear regression models.

2. Model fitting

Each base model must be fitted to the training data.

For Linear Regression, fitting means estimating the coefficients. This can be done numerically with Gradient Descent, but also analytically. In Google Sheets or Excel, we can directly use the LINEST function to estimate these coefficients.
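As a point of comparison outside the spreadsheet, here is a minimal Python sketch (with made-up data) of the same analytical least-squares fit that LINEST performs in a sheet:

```python
import numpy as np

# Toy data (made up for illustration, not the article's dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Analytical least-squares fit, the same computation as =LINEST(y_range, x_range) in a sheet
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # coefficients of y ≈ slope * x + intercept
```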

2 bis. Ensemble model learning

At first, Gradient Boosting may look like a simple aggregation of models. But it is still a learning process. As we will see, it relies on a loss function, exactly like classical models that learn weights.

3. Model tuning

Model tuning consists of optimizing hyperparameters.

In our case, the base model, Linear Regression, has no hyperparameters by itself (unless we use regularized variants such as Ridge or Lasso).

Gradient Boosting, however, introduces two important hyperparameters: the number of boosting steps and the learning rate. We will see this in the next section.

In a nutshell, that’s machine learning, made easy, in three steps!

2. Gradient Boosting Regressor algorithm

2.1 Algorithm principle

Here are the main steps of the Gradient Boosting algorithm, applied to regression.

  1. Initialization: We start with a very simple model. For regression, this is usually the average value of the target variable.
  2. Residual Calculation: We compute the residuals, defined as the difference between the actual values and the current predictions.
  3. Fitting a Linear Regression to the Residuals: We fit a new base model (here, a linear regression) to these residuals.
  4. Updating the ensemble: We add this new model to the ensemble, scaled by a learning rate (also called shrinkage).
  5. Repeating the process: We repeat steps 2 to 4 until we reach the desired number of boosting iterations or until the error converges.

That's it! This is the basic procedure for Gradient Boosting applied to Linear Regression.
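To make the loop concrete, here is a minimal Python sketch of these five steps, assuming a single feature and a least-squares fit at each iteration (the data and variable names are mine, not the article's):

```python
import numpy as np

def gradient_boosted_linear_regression(x, y, n_iterations=10, learning_rate=0.5):
    """Gradient Boosting with linear regression base models (single feature)."""
    f0 = y.mean()
    prediction = np.full_like(y, f0, dtype=float)   # Step 1: start from the average of y
    corrections = []
    for _ in range(n_iterations):
        residuals = y - prediction                  # Step 2: compute the residuals
        a, b = np.polyfit(x, residuals, deg=1)      # Step 3: fit a line to the residuals
        prediction = prediction + learning_rate * (a * x + b)  # Step 4: update the ensemble
        corrections.append((a, b))                  # Step 5: repeat
    return f0, corrections

# Toy data (made up for illustration, not the article's ten observations)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
f0, corrections = gradient_boosted_linear_regression(x, y)
```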

2.2 Algorithm expressed with formulas

Now let us write the formulas explicitly; it helps make each step concrete.

Step 1 – Initialization
We start with a constant model equal to the average of the target variable:
f0 = average(y)

Step 2 – Residual computation
We compute the residuals, defined as the difference between the actual values and the current predictions:
r1 = y − f0

Step 3 – Fit a base model to the residuals
We fit a linear regression model to those residuals:
r̂1 = a0 · x + b0

Step 4 – Update the ensemble
We update the model by adding the fitted regression, scaled by the learning rate:
f1 = f0 + learning_rate · (a0 · x + b0)

Next iteration
We repeat the same procedure:
r2 = y − f1
r̂2 = a1 · x + b1
f2 = f1 + learning_rate · (a1 · x + b1)

By expanding this expression, we obtain:
f2 = f0 + learning_rate · (a0 · x + b0) + learning_rate · (a1 · x + b1)

The same process continues at each iteration. Residuals are recomputed, a new model is fitted, and the ensemble is updated by adding this model, scaled by the learning rate.

This formulation makes it clear that Gradient Boosting builds the final model as a sum of successive correction models.
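As a toy illustration with made-up numbers (not the article's dataset), take x = (1, 2), y = (2, 4) and learning_rate = 0.5:

f0 = average(y) = 3
r1 = y − f0 = (−1, 1)
Least-squares fit of r1 on x: r̂1 = 2 · x − 3
f1 = f0 + 0.5 · (2 · x − 3) = x + 1.5
r2 = y − f1 = (−0.5, 0.5)

Each iteration halves the residuals here; with learning_rate = 1, a single step would already give f1 = 2 · x and zero residuals.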

3. Gradient Boosted Linear Regression

3.1 Base model training

We start with a simple linear regression as our base model, using a small dataset of ten observations that I generated.

To fit the base model, we will use a function available in Google Sheets (it also works in Excel), LINEST, to estimate the coefficients of the linear regression.

Gradient Boosted Linear Regression: simple dataset with linear regression — Image by the author

3.2 Gradient Boosting algorithm

Implementing these formulas is straightforward in Google Sheets or Excel.

The table below shows the training dataset together with the successive steps of the gradient boosting procedure:

Gradient Boosted Linear Regression with all steps in Excel — Image by the author

For each fitting step, we use the LINEST function:

Gradient Boosted Linear Regression with the formula for coefficient estimation — Image by the author

We will only run two iterations, but it is easy to guess how things continue with more iterations. Below is a chart showing the model at each iteration. The different shades of red illustrate the convergence of the model, and we also show the final model obtained by fitting directly to y.

Gradient Boosted Linear Regression — Image by the author

3.3 Why Boosting Linear Regression is purely pedagogical

If you look carefully at the algorithm, two important observations emerge.

First, in step 3, we fit a linear regression to the residuals. This takes time and iterations; instead of fitting a linear regression to the residuals, we could directly fit a linear regression to the actual values of y and immediately obtain the final optimal model!

Second, when we add one linear regression to another linear regression, the result is still a linear regression.

For instance, we can rewrite f2 as:

f2 = f0 + learning_rate · (b0 + b1) + learning_rate · (a0 + a1) · x

This is still a linear function of x.
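A quick sketch of this collapse, using hypothetical coefficient values (not the ones from the spreadsheet):

```python
# Collapse f2 = f0 + lr * (a0 * x + b0) + lr * (a1 * x + b1) into a single slope and intercept.
lr = 0.5
f0, a0, b0, a1, b1 = 10.0, 2.0, -1.0, 0.8, -0.4  # hypothetical values for illustration

slope = lr * (a0 + a1)
intercept = f0 + lr * (b0 + b1)
print(f"f2(x) = {slope} * x + {intercept}")  # still a single linear function of x
```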

This explains why Gradient Boosted Linear Regression does not bring any practical benefit. Its value is purely pedagogical: it helps us understand how the Gradient Boosting algorithm works, but it does not improve predictive performance.

In fact, it is even less useful than bagging applied to linear regression. With bagging, the variability between bootstrapped models allows us to estimate prediction uncertainty and construct confidence intervals. Gradient Boosted Linear Regression, on the other hand, collapses back to a single linear model and provides no additional information about uncertainty.

As we will see tomorrow, the situation is very different when the base model is a decision tree.

3.4 Tuning hyperparameters

There are two hyperparameters we can tune: the number of iterations and the learning rate.

For the number of iterations, we only implemented two, but it is easy to imagine more, and we can decide when to stop by examining the magnitude of the residuals.

For the learning rate, we can change it in Google Sheets and see what happens. When the learning rate is small, the "learning process" will be slow. And if the learning rate is 1, we can see that convergence is achieved at iteration 1.

Gradient Boosted Linear Regression with learning rate = 1 — Image by the author

And the residuals at iteration 1 are already zero.

Gradient Boosted Linear Regression with learning rate = 1 — Image by the author

If the learning rate is too large, the corrections overshoot the residuals and the model diverges.

Gradient Boosted Linear Regression divergence — Image by the author
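To reproduce this behaviour outside the spreadsheet, here is a minimal Python sketch (with made-up data and my own variable names) that runs a few boosting steps at different learning rates and prints the largest remaining residual:

```python
import numpy as np

# Toy data (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

for lr in (0.5, 1.0, 2.5):
    prediction = np.full_like(y, y.mean())
    for _ in range(5):
        residuals = y - prediction
        a, b = np.polyfit(x, residuals, deg=1)  # fit a line to the current residuals
        prediction += lr * (a * x + b)
    print(f"learning rate {lr}: max |residual| = {np.abs(y - prediction).max():.3f}")
# learning rate 0.5 converges gradually, learning rate 1.0 reaches the least-squares fit
# in a single step, while a learning rate that is much too large makes the corrections
# overshoot and the residuals grow at every iteration.
```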

4. Boosting as Gradient Descent in Function Space

4.1 Comparison with Gradient Descent Algorithm

At first glance, the role of the learning rate and the number of iterations in Gradient Boosting looks very similar to what we see in Gradient Descent. This naturally leads to confusion.

  • Beginners often notice that both algorithms contain the word "Gradient" and follow an iterative procedure. It is therefore tempting to assume that Gradient Descent and Gradient Boosting are closely related, without really knowing why.
  • Experienced practitioners usually react differently. From their perspective, the two methods appear unrelated. Gradient Descent is used to fit weight-based models by optimizing their parameters, while Gradient Boosting is an ensemble method that combines multiple models fitted to the residuals. The use cases, the implementations, and the intuition seem completely different.
  • At a deeper level, however, experts will say that these two algorithms are in fact the same optimization idea. The difference does not lie in the update rule, but in the space where this rule is applied. Or we can say that the variable of interest is different.

Gradient Descent performs gradient-based updates in parameter space. Gradient Boosting performs gradient-based updates in function space.

That is the only difference in this numerical optimization. Let's look at the equations for the regression case and for the general case below.

4.2 The Mean Squared Error Case: Same Algorithm, Different Space

With the Mean Squared Error, Gradient Descent and Gradient Boosting minimize the same objective and are driven by the same quantity: the residual.

In Gradient Descent, residuals influence the updates of the model parameters.

In Gradient Boosting, residuals directly update the prediction function.

In both cases, the learning rate and the number of iterations play the same role. The difference lies only in where the update is applied: parameter space versus function space.

Once this distinction is clear, it becomes evident that Gradient Boosting with MSE is just Gradient Descent expressed at the level of functions.
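In the notation of section 2.2, with L denoting the loss and w the parameters of a weight-based model, the two update rules read:

Gradient Descent (parameter space): w_new = w − learning_rate · dL/dw
Gradient Boosting with MSE (function space): f_new = f − learning_rate · dL/df = f + learning_rate · (y − f)

With L = ½ · (y − f)², the derivative dL/df equals −(y − f), so the function-space update simply adds the residual, and fitting a base model to the residuals is how this update is approximated in practice.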

4.3 Gradient Boosting with any loss function

The comparison above is not limited to the Mean Squared Error. Both Gradient Descent and Gradient Boosting can be defined with respect to different loss functions.

In Gradient Descent, the loss is defined in parameter space. This requires the model to be differentiable with respect to its parameters, which naturally restricts the method to weight-based models.

In Gradient Boosting, the loss is defined in prediction space. Only the loss needs to be differentiable with respect to the predictions. The base model itself does not need to be differentiable, and of course, it does not need to have its own loss function.

This explains why Gradient Boosting can combine arbitrary loss functions with non-weight-based models such as decision trees.
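As a small illustration (my own sketch, not the article's implementation), the only thing that changes from one loss to another is the pseudo-residual, i.e. the negative gradient of the loss with respect to the predictions:

```python
import numpy as np

def pseudo_residuals(y, prediction, loss="mse"):
    """Negative gradient of the loss with respect to the predictions."""
    if loss == "mse":       # L = 1/2 * (y - f)^2  ->  -dL/df = y - f (the usual residual)
        return y - prediction
    if loss == "absolute":  # L = |y - f|          ->  -dL/df = sign(y - f)
        return np.sign(y - prediction)
    raise ValueError(f"unknown loss: {loss}")

y = np.array([2.0, 4.0, 6.0])
f = np.array([3.0, 3.0, 3.0])
print(pseudo_residuals(y, f, "mse"))       # [-1.  1.  3.]
print(pseudo_residuals(y, f, "absolute"))  # [-1.  1.  1.]
```

The base model (a tree, a linear regression, anything else) is then fitted to these pseudo-residuals exactly as before.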

Conclusion

Gradient Boosting is not just a naive ensemble technique but an optimization algorithm. It follows the same learning logic as Gradient Descent, differing only in the space where the optimization is performed: parameters versus functions. Using linear regression allowed us to isolate this mechanism in its simplest form.

In the next article, we will see how this framework becomes truly powerful when the base model is a decision tree, leading to Gradient Boosted Decision Tree Regressors.


All the Excel files are available through this Kofi link. Your support means a lot to me. The price will increase during the month, so early supporters get the best value.

All Excel/Google sheet files for ML and DL
