The Machine Learning “Advent Calendar” Day 19: Bagging in Excel


For 18 days, we have explored many of the core machine learning models, organized into three major families: distance- and density-based models, tree- or rule-based models, and weight-based models.

Up to this point, each article focused on a single model, trained on its own. Ensemble learning changes this perspective completely. It is not a standalone model. Instead, it is a way of combining these base models to build something new.

As illustrated in the diagram below, an ensemble is a meta-model. It sits on top of individual models and aggregates their predictions.

Three learning steps in Machine Learning – Image by author

Voting: the simplest ensemble idea

The simplest form of ensemble learning is voting.

The concept is quite simple: train several models, take their predictions, and compute the average. If one model is wrong in one direction and another is wrong in the opposite direction, the errors should cancel out. At least, that is the intuition.

On paper, this sounds reasonable. In practice, things are very different.

As soon as you try voting with real models, one fact becomes obvious: voting is not magic. Simply averaging predictions does not guarantee better performance. In many cases, it actually makes things worse.

The reason is simple. When you mix models that behave very differently, you also mix their weaknesses. If the models do not make complementary errors, averaging can dilute useful structure instead of reinforcing it.

To see this clearly, consider a very simple example. Take a decision tree and a linear regression trained on the same dataset. The decision tree captures local, non-linear patterns. The linear regression captures a global linear trend. When you average their predictions, you do not obtain a better model. You obtain a compromise that is often worse than each model taken individually.

Voting in machine learning – all images by author
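
To make this comparison easy to reproduce, here is a minimal Python sketch of plain voting between a decision tree and a linear regression. The toy dataset and model settings are illustrative assumptions, not the spreadsheet used in the article.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Illustrative non-linear toy data (not the article's dataset)
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
linear = LinearRegression().fit(X, y)

# Plain voting: average the two predictions point by point
pred_tree = tree.predict(X)
pred_linear = linear.predict(X)
pred_vote = (pred_tree + pred_linear) / 2

for name, pred in [("tree", pred_tree), ("linear", pred_linear), ("vote", pred_vote)]:
    print(f"{name:6s} MSE: {mean_squared_error(y, pred):.3f}")
```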

This illustrates an important point: ensemble learning requires more than averaging. It requires a strategy: a way to combine models that actually improves stability or generalization.

Moreover, if we consider the ensemble as a single model, then it should be trained as such. Simple averaging offers no parameter to adjust. There is nothing to learn, nothing to optimize.

One possible improvement to voting is to assign different weights to the models. Instead of giving each model the same importance, we could try to learn which ones should matter more. But as soon as we introduce weights, a new question appears: how do we train them? At that point, the ensemble itself becomes a model that must be fitted.
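
As a small, hedged illustration of what "fitting the ensemble" could look like, the sketch below learns one weight per model by least squares on held-out predictions. This is just one possible choice, not the method developed in the rest of the article.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 300).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

models = [
    DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train),
    LinearRegression().fit(X_train, y_train),
]

# One column of held-out predictions per model; solve for the weights that
# best reproduce y_val -- the ensemble itself is now a fitted model.
P = np.column_stack([m.predict(X_val) for m in models])
weights, *_ = np.linalg.lstsq(P, y_val, rcond=None)
print("learned weights:", weights)
```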

This observation leads naturally to more structured ensemble methods.

In this article, we start with one statistical approach that resamples the training dataset before averaging: Bagging.

The intuition behind Bagging

Why “bagging”?

What’s bagging?

The answer is actually hidden in the name itself.

Bagging = Bootstrap + Aggregating.

You can immediately tell that a mathematician or a statistician named it. 🙂

Behind this slightly intimidating word, the idea is extremely simple. Bagging is about doing two things: first, creating many versions of the dataset using the bootstrap, and second, aggregating the results obtained from these datasets.

The core idea is therefore not about changing the model. It is about changing the data.

Bootstrapping the dataset

Bootstrapping means sampling the dataset with replacement. Each bootstrap sample has the same size as the original dataset, but not the same observations. Some rows appear several times. Others disappear.

In Excel, this is very easy to implement and, more importantly, very easy to see.

You start by adding an ID column to your dataset, one unique identifier per row. Then, using the RANDBETWEEN function, you randomly draw row indices. Each draw corresponds to one row in the bootstrap sample. By repeating this process, you generate a full dataset that looks familiar, but is slightly different from the original one.
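
If you want to reproduce this step outside the spreadsheet, here is a minimal Python sketch of the same bootstrap draw; in Excel the article does it with an ID column and RANDBETWEEN, and the toy columns below are illustrative.

```python
import numpy as np
import pandas as pd

# Small illustrative dataset with an explicit ID column, as in the spreadsheet
data = pd.DataFrame({
    "id": np.arange(1, 11),
    "x": np.arange(10, dtype=float),
    "y": 2.0 * np.arange(10) + 1.0,
})

rng = np.random.default_rng(42)

def bootstrap_sample(df: pd.DataFrame, rng: np.random.Generator) -> pd.DataFrame:
    # Draw row positions with replacement: same size as the original,
    # so some rows appear several times and others disappear.
    idx = rng.integers(0, len(df), size=len(df))  # Excel: RANDBETWEEN over the IDs
    return df.iloc[idx].reset_index(drop=True)

sample = bootstrap_sample(data, rng)
print(sample["id"].values)  # duplicated and missing IDs are visible at a glance
```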

This step alone already makes the idea of bagging concrete. You can literally see the duplicates. You can see which observations are missing. Nothing is abstract.

Below, you can see examples of bootstrap samples generated from the same original dataset. Each sample tells a slightly different story, though they all come from the same data.

These alternative datasets are the foundation of bagging.

Dataset generated by author – image by author

Bagging linear regression: understanding the principle

Bagging process

Yes, this is probably the first time you hear about bagging linear regression.

In theory, there is nothing wrong with it. As we said earlier, bagging is an ensemble method that can be applied to any base model. Linear regression is a model, so technically, it qualifies.

In practice, however, you will quickly see that this is not very useful.

But nothing prevents us from doing it. And precisely because it is not very useful, it makes for a good learning example. So let us do it.

For each bootstrap sample, we fit a linear regression. In Excel, this is easy: we can directly use the LINEST function to estimate the coefficients. Each color in the plot corresponds to one bootstrap sample and its associated regression line.

So far, everything behaves exactly as expected. The lines are close to one another, but not identical. Each bootstrap sample slightly changes the coefficients, and therefore the fitted line.

Bagging of linear regression – image by author

Now comes the key observation.

You may notice that one additional model is plotted in black. This one corresponds to the standard linear regression fitted on the original dataset, without bootstrapping.

What happens when we compare it to the bagged models?

When we average the predictions of all these linear regressions, the result is still a linear regression. The shape of the prediction does not change. The relationship between the variables stays linear. We did not create a more expressive model.

And more importantly, the bagged model ends up being very close to the standard linear regression trained on the original data.
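
Here is a minimal Python sketch of the same experiment (the article uses LINEST in Excel): fit one regression per bootstrap sample, average the coefficients, and compare with the regression fitted on the original data. The toy data is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=50)

# Fit one linear regression per bootstrap sample (np.polyfit plays the role of LINEST)
coefs = []
for _ in range(8):
    idx = rng.integers(0, len(x), size=len(x))
    coefs.append(np.polyfit(x[idx], y[idx], 1))  # (slope, intercept)

# For linear models, averaging predictions is the same as averaging coefficients
bagged_slope, bagged_intercept = np.mean(coefs, axis=0)
standard_slope, standard_intercept = np.polyfit(x, y, 1)  # fit on the original data

print("bagged:  ", round(bagged_slope, 3), round(bagged_intercept, 3))
print("standard:", round(standard_slope, 3), round(standard_intercept, 3))
```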

We can even push the example further by using a dataset with a clearly non-linear structure. In this case, each linear regression fitted on a bootstrap sample struggles in its own way. Some lines tilt slightly upward, others downward, depending on which observations were duplicated or missing in the sample.

Bagging of linear regression – image by author

Bootstrap confidence intervals

From a prediction performance point of view, bagging linear regression is not very useful.

Nevertheless, bootstrapping remains extremely useful for one important statistical notion: estimating the confidence interval of the predictions.

Instead of looking only at the average prediction, we can look at the distribution of predictions produced by all the bootstrapped models. For each input value, we now have many predicted values, one from each bootstrap sample.

A straightforward and intuitive way to quantify uncertainty is to compute the standard deviation of these predictions. This standard deviation tells us how sensitive the prediction is to changes in the data. A small value means the prediction is stable. A large value means it is uncertain.

This concept works naturally in Excel. Once you have all the predictions from the bootstrapped models, computing their standard deviation is easy. The result can be interpreted as a confidence band around the prediction.

This is clearly visible in the plot below. The interpretation is simple: in regions where the training data is sparse or highly dispersed, the confidence interval becomes wide, as predictions vary significantly across bootstrap samples.

Conversely, where the data is dense, predictions are more stable and the confidence interval narrows.
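
The sketch below shows one way to compute such a band in Python, assuming a ±2 standard-deviation interval around the mean bootstrap prediction; in Excel the same quantity boils down to a standard deviation taken over the bootstrapped predictions.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 60)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=60)
x_grid = np.linspace(0, 10, 100)

# Predictions of many bootstrapped linear regressions on a common grid
preds = []
for _ in range(200):
    idx = rng.integers(0, len(x), size=len(x))
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    preds.append(slope * x_grid + intercept)
preds = np.array(preds)

mean_pred = preds.mean(axis=0)
std_pred = preds.std(axis=0)                   # uncertainty of the prediction
lower, upper = mean_pred - 2 * std_pred, mean_pred + 2 * std_pred

# The band is narrow near the centre of the data and widens at the edges
print("band width:", (upper - lower).min(), "to", (upper - lower).max())
```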

Now, when we apply this to non-linear data, something becomes very clear. In regions where the linear model struggles to fit the data, the predictions from different bootstrap samples spread out much more. The confidence interval becomes wider.

This is an important insight. Even when bagging does not improve prediction accuracy, it provides valuable information about uncertainty. It tells us where the model is reliable and where it is not.

Seeing these confidence intervals emerge directly from bootstrap samples in Excel makes this statistical concept very concrete and intuitive.

Bagging decision trees: from weak learners to a strong model

Now we move to decision trees.

The principle of bagging stays exactly the same. We generate multiple bootstrap samples, train one model on each of them, and then aggregate their predictions.

I improved the Excel implementation to make the splitting process more automatic. To keep things manageable in Excel, we restrict the trees to a single split. Building deeper trees is possible, but it quickly becomes cumbersome in a spreadsheet.

Below, you can see two of the bootstrapped trees. In total, I built eight of them by simply copying and pasting formulas, which makes the process straightforward and easy to reproduce.

Since decision trees are highly non-linear models and their predictions are piecewise constant, averaging their outputs has a smoothing effect.

As a result, bagging naturally smooths the predictions. Instead of sharp jumps created by individual trees, the aggregated model produces more gradual transitions.

In Excel, this effect is very easy to observe. The bagged predictions are clearly smoother than the predictions of any single tree.

Some of you may have already heard of decision stumps, which are decision trees with a maximum depth of 1. That is exactly what we use here. Each model is extremely simple. On its own, a stump is a weak learner.
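
For comparison, here is a minimal Python sketch of the same construction: eight depth-1 trees, each fitted on its own bootstrap sample, averaged into one prediction. The toy data is illustrative, not the spreadsheet's.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 80).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.2, size=80)
x_grid = np.linspace(0, 10, 200).reshape(-1, 1)

# One decision stump (max_depth=1) per bootstrap sample
stump_preds = []
for seed in range(8):
    idx = rng.integers(0, len(x), size=len(x))
    stump = DecisionTreeRegressor(max_depth=1, random_state=seed).fit(x[idx], y[idx])
    stump_preds.append(stump.predict(x_grid))

bagged = np.mean(stump_preds, axis=0)

# A single stump predicts only 2 distinct values; the average has many more levels
print("levels of one stump:", len(np.unique(stump_preds[0])))
print("levels of the bagged model:", len(np.unique(np.round(bagged, 6))))
```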

The question here is:
is a group of decision stumps sufficient when combined with bagging?

We’ll come back to this later in my Machine Learning “Advent Calendar”.

Random Forest: extending bagging

What about Random Forest?

It is probably one of the favorite models among data scientists.

So why not discuss it here, even in Excel?

In fact, what we have just built is already very close to a Random Forest!

To understand why, recall that Random Forest introduces two sources of randomness:

  • The first one is the bootstrap of the dataset. This is exactly what we have already done with bagging.
  • The second is randomness in the splitting process. At each split, only a random subset of features is considered.

In our case, however, we only have one feature. That means there is nothing to select from. Feature randomness simply does not apply.

As a result, what we obtain here can be seen as a simplified Random Forest.

Once this idea is clear, extending it to multiple features is just an additional layer of randomness, not a new concept.
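
If you want to check this equivalence in code, here is a hedged sketch using scikit-learn's RandomForestRegressor: with one feature, depth-1 trees, and bootstrap sampling, it behaves like the bagged stumps above. The settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 80).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.2, size=80)

# With a single feature there is nothing for feature subsampling to choose from,
# so this forest of eight depth-1 trees is essentially bagging of decision stumps.
forest = RandomForestRegressor(
    n_estimators=8,
    max_depth=1,
    bootstrap=True,     # bootstrap of the dataset: the first source of randomness
    max_features=1.0,   # the second source of randomness has no effect here
    random_state=0,
).fit(x, y)

print(forest.predict(np.array([[2.5], [7.5]])))
```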

And you may even ask: can we apply the same principle to linear regression and build a Random Forest of linear regressions?

Conclusion

Ensemble learning is less about complex models and more about managing instability.

Simple voting is not very effective. Bagging linear regression changes little and stays mostly pedagogical, though it is useful for estimating uncertainty. With decision trees, however, bagging truly matters: averaging unstable models results in smoother and more robust predictions.

Random Forest naturally extends this idea by adding extra randomness, without changing the core principle. Seen in Excel, ensemble methods stop being black boxes and become a logical next step.

Further Reading

Thanks for your support of my Machine Learning “Advent Calendar”.

People often talk a lot about supervised learning, but unsupervised learning is sometimes overlooked, even though it can reveal structure that no label could ever show.
If you want to explore these ideas further, here are three articles that dive into powerful unsupervised models.

Gaussian Mixture Model

An improved and more flexible version of k-means.

Unlike k-means, GMM allows clusters to stretch, rotate, and adapt to the true shape of the data.

But when do k-means and GMM actually produce different results?

Have a look at this article to see concrete examples and visual comparisons.

Local Outlier Factor (LOF)
A clever method that compares each point’s local density to that of its neighbors to detect anomalies.


All the Excel files are available through this Ko-fi link. Your support means a lot to me. The price will increase over the month, so early supporters get the best value.

All Excel/Google sheet files for ML and DL

