In the previous article, we introduced the core mechanism of Gradient Boosting through Gradient Boosted Linear Regression.
That example was deliberately simple. Its goal was not performance, but understanding.
Using a linear model allowed us to make every step explicit: residuals, updates, and the additive nature of the model. It also made the link with Gradient Descent very clear.
In this article, we move to the setting where Gradient Boosting truly becomes useful in practice: Decision Tree Regressors.
We will reuse the same conceptual framework as before, but the behavior of the algorithm changes in an important way. Unlike linear models, decision trees are non-linear and piecewise constant. When they are combined through Gradient Boosting, they no longer collapse into a single model. Instead, each new tree adds structure and refines the predictions of the previous ones.
For that reason, we will only briefly recap the general Gradient Boosting mechanism and focus instead on what is specific to Gradient Boosted Decision Trees: how trees are trained on residuals, how the ensemble evolves, and why this approach is so powerful.
1. Machine Learning in Three Steps
We will again use the same three-step framework to keep the explanation consistent and intuitive.
1. Base model
We will use decision tree regressors as our base model.
A decision tree is non-linear by construction. It splits the feature space into regions and assigns a constant prediction to each region.
An important point is that when trees are added together, they do not collapse into a single tree.
Each new tree introduces additional structure to the model.
This is where Gradient Boosting becomes particularly powerful.
1 bis. Ensemble model
Gradient Boosting is the mechanism used to aggregate these base models into a single predictive model.
2. Model fitting
For clarity, we will use decision stumps, meaning trees with a depth of 1 and a single split.
Each tree is trained to predict the residuals of the previous model.
2 bis. Ensemble learning
The ensemble itself is built using gradient descent in function space.
Here, the objects being optimized are not parameters but functions, and those functions are decision trees.
3. Model tuning
Decision trees have several hyperparameters, such as:
- maximum depth
- minimum number of samples required to split
- minimum number of samples per leaf
In this article, we fix the tree depth to 1.
At the ensemble level, two additional hyperparameters are essential:
- the learning rate
- the number of boosting iterations
These parameters control how fast the model learns and how complex it becomes; the sketch below shows where they appear in a standard implementation.
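As a reference point, here is a minimal sketch (assuming scikit-learn is available) of how these hyperparameters map onto a standard library implementation; the values shown are illustrative, not recommendations.

```python
# Minimal sketch: the hyperparameters discussed above, as exposed by scikit-learn.
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    max_depth=1,        # decision stumps, as used in this article
    learning_rate=0.1,  # how much each new tree contributes
    n_estimators=100,   # number of boosting iterations
)
```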
2. Gradient Boosting Algorithm
The Gradient Boosting algorithm follows a simple and repetitive structure.
2.1 Algorithm overview
Here are the main steps of the Gradient Boosting algorithm (a short code sketch follows the list):
- Initialization: start with a constant model. For regression with squared loss, this is the average value of the target.
- Residual computation: compute the residuals between the current predictions and the observed values.
- Fit a weak learner: train a decision tree regressor to predict these residuals.
- Model update: add the new tree to the existing model, scaled by a learning rate.
- Repeat: iterate until the chosen number of boosting steps is reached or the error stabilizes.
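To make these steps concrete, here is a minimal sketch of the loop in Python, using a scikit-learn DecisionTreeRegressor as the weak learner; the function names and default values are illustrative, not part of the original article.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_steps=50, learning_rate=0.1, max_depth=1):
    """Gradient Boosting for regression with squared loss."""
    f0 = y.mean()                                   # initialization: constant model
    predictions = np.full(len(y), f0)               # current predictions
    trees = []
    for _ in range(n_steps):
        residuals = y - predictions                 # residual computation
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                      # fit a weak learner on the residuals
        predictions += learning_rate * tree.predict(X)  # model update
        trees.append(tree)
    return f0, trees

def boosted_predict(X, f0, trees, learning_rate=0.1):
    """f(x) = f0 + learning_rate * (h1(x) + h2(x) + ...)"""
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)
```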
2.2 Dataset
To illustrate the behavior of Gradient Boosted Trees, we will use several types of datasets that I generated:
- Piecewise linear data, where the relationship changes by segments
- Non-linear data, such as curved patterns
- Binary targets, for classification tasks
For classification, we will start with the squared loss for simplicity. This allows us to reuse the same mechanics as in regression. The loss function can later be replaced by alternatives better suited to classification, such as logistic or exponential loss.
These different datasets help highlight how Gradient Boosting adapts to various data structures and loss functions while relying on the same underlying algorithm.
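The exact datasets from the article are not reproduced here, but a sketch of similar synthetic data could look like this (the shapes, thresholds, and noise levels are arbitrary choices made for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 10, size=(100, 1)), axis=0)
x = X[:, 0]

# Piecewise linear target: the relationship changes at x = 5
y_piecewise = np.where(x < 5, 2 * x, 15 - x) + rng.normal(0, 0.5, size=100)

# Non-linear (curved) target
y_curved = np.sin(x) + rng.normal(0, 0.1, size=100)

# Binary target
y_binary = (x > 5).astype(float)
```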

2.3 Initialization
The Gradient Boosting process starts with a constant model.
For regression with squared loss, this initial prediction is simply the average value of the target variable.
This average value represents the best initial prediction before any structure is learned from the features.
It is also a good opportunity to recall that almost every regression model can be seen as an improvement over the global average.
- k-NN looks for similar observations and predicts with their average value.
- Decision Tree Regressors split the dataset into regions and compute the average value inside each leaf to predict for a new observation that falls into that leaf.
- Weight-based models adjust feature weights to balance or update the global average for a given new observation.
Here, for Gradient Boosting, we also start with the average value, and then we will see how it is progressively corrected.

2.4 First Tree
The first decision tree is then trained on the residuals of this initial model.
After the initialization, the residuals are simply the differences between the observed values and the average.

To build this first tree, we use exactly the same procedure as in the article on Decision Tree Regressors.
The only difference is the target: instead of predicting the original values, the tree predicts the residuals.
This first tree provides the initial correction to the constant model and sets the direction for the boosting process.
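As a sketch (reusing the hypothetical X and y_piecewise from the dataset example above), fitting this first stump on the residuals could look like this:

```python
from sklearn.tree import DecisionTreeRegressor

y = y_piecewise                 # hypothetical target from the earlier sketch
f0 = y.mean()                   # constant initial model
residuals = y - f0              # residuals of the initial model

h1 = DecisionTreeRegressor(max_depth=1)  # a decision stump: one split
h1.fit(X, residuals)            # the target is the residuals, not the original values
```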

2.5 Model update
Once the first tree has been trained on the residuals, we can compute the first improved prediction.
The updated model is obtained by combining the initial prediction and the first tree’s correction:
f1(x) = f0 + learning_rate * h1(x)
where:
- f0 is the initial prediction, equal to the average value of the target
- h1(x) is the prediction of the first tree trained on the residuals
- learning_rate controls how much of this correction is applied
This update step is the core mechanism of Gradient Boosting.
Each tree slightly adjusts the current predictions instead of replacing them, allowing the model to improve progressively and remain stable.
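Continuing the same sketch, this update is a single line:

```python
learning_rate = 0.1
f1 = f0 + learning_rate * h1.predict(X)   # updated predictions on the training data
residuals_1 = y - f1                      # new residuals, used to train the next tree
```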
2.6 Repeating the Process
Once the first update has been applied, the same procedure is repeated.
At each iteration, new residuals are computed using the current predictions, and a new decision tree is trained to predict these residuals. This tree is then added to the model using the learning rate.
To make this process easier to follow in Excel, the formulas can be written in a way that is fully automated. Once this is done, the formulas for the second tree and all subsequent trees can simply be copied to the right.

As the iterations progress, all the predictions of the residual models are grouped together. This makes the structure of the final model very clear.
In the end, the prediction can be written in a compact form:
f(x) = f0 + eta * (h1(x) + h2(x) + h3(x) + …)
This representation highlights an important idea: the final model is simply the initial prediction plus a weighted sum of residual predictions.
It also opens the door to possible extensions. For example, the learning rate does not have to be constant. It can decrease over time, following a decay schedule during the iterations.
This is the same idea as learning-rate decay in gradient descent or stochastic gradient descent.
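As a sketch of that decay idea (the initial rate and decay factor below are arbitrary illustrative values, and the loop reuses the hypothetical X and y from the earlier sketches), the learning rate can simply be recomputed at each iteration and stored alongside its tree:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

eta_0, decay = 0.3, 0.05
predictions = np.full(len(y), y.mean())          # start from the constant model
scaled_trees = []
for m in range(50):
    eta_m = eta_0 / (1 + decay * m)              # learning rate shrinks over iterations
    residuals = y - predictions
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    predictions += eta_m * tree.predict(X)
    scaled_trees.append((eta_m, tree))           # keep the rate used for each tree
```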

3. Understanding the Final Model
3.1 How the model evolves across iterations
We start with a piecewise dataset. In the visualization below, we can see all the intermediate models produced during the Gradient Boosting process.
First, we see the initial constant prediction, equal to the average value of the target.
Then comes f1, obtained after adding the first tree with a single split.
Next, f2, after adding a second tree, and so forth.
Each new tree introduces a local correction. As more trees are added, the model progressively adapts to the structure of the data.

The same behavior appears with a curved dataset. Although each individual tree is piecewise constant, their additive combination results in a smooth curve that follows the underlying pattern.

When applied to a binary target, the algorithm still works, but some predictions can become negative or greater than one. This is expected when using squared error loss, which treats the problem as regression and does not constrain the output range.
If probability-like outputs are required, a classification-oriented loss function, such as logistic loss, should be used instead.
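For reference, a library implementation with a classification loss (a sketch assuming scikit-learn and the hypothetical y_binary from the earlier dataset sketch) constrains the outputs to valid probabilities:

```python
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(max_depth=1, learning_rate=0.1, n_estimators=100)
clf.fit(X, y_binary)                 # binary target from the earlier sketch
proba = clf.predict_proba(X)[:, 1]   # probabilities, always within [0, 1]
```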

In conclusion, Gradient Boosting can be applied to different types of datasets, including piecewise, non-linear, and binary cases. Whatever the dataset, the final model remains piecewise constant by construction, because it is built as a sum of decision trees.
Nevertheless, the accumulation of many small corrections allows the overall prediction to closely approximate complex patterns.
3.2 Comparison with a single decision tree
When showing these plots, a natural question often arises: isn’t the final result simply the same as a single decision tree?
This impression is understandable, especially when working with a small dataset. Visually, the final prediction can look similar, which makes the two approaches harder to distinguish at first glance.
However, the difference becomes clear once we look at how the splits are computed.
A single Decision Tree Regressor is built through a sequence of splits. At each split, the available data is divided into smaller subsets. As the tree grows, each new decision is based on fewer and fewer observations, which can make the model sensitive to noise.
Once a split is made, data points that fall into different regions are no longer related. Each region is treated independently, and early decisions cannot be revised.
Gradient Boosted Trees work in a completely different way.
Each tree in the boosting process is trained on the entire dataset. No observation is ever removed from the training process. At every iteration, all data points contribute through their residuals.
This changes the behavior of the model fundamentally.
A single tree makes hard, irreversible decisions. Gradient Boosting, on the other hand, allows later trees to correct the mistakes made by earlier ones.
Instead of committing to one rigid partition of the feature space, the model progressively refines its predictions through a sequence of small adjustments.
This ability to revise and improve earlier decisions is one of the key reasons why Gradient Boosted Trees are both robust and powerful in practice.
3.3 General comparison with other models
In comparison with a single decision tree, Gradient Boosted Trees produce smoother predictions, reduce overfitting, and improve generalization.
In comparison with linear models, they naturally capture non-linear patterns, automatically model feature interactions, and require no manual feature engineering.
In comparison with non-linear weight-based models, such as kernel methods or neural networks, Gradient Boosted Trees offer a different set of trade-offs. They rely on simple, interpretable building blocks, are less sensitive to feature scaling, and require fewer assumptions about the structure of the data. In many practical situations, they also train faster and require less tuning.
These combined properties explain why Gradient Boosted Decision Tree Regressors perform so well across a wide range of real-world applications.
Conclusion
In this article, we showed how Gradient Boosting builds powerful models by combining simple decision trees trained on residuals. Starting from a constant prediction, the model is refined step by step through small, local corrections.
We saw that this approach adapts naturally to different types of datasets and that the choice of the loss function is crucial, especially for classification tasks.
By combining the flexibility of trees with the stability of boosting, Gradient Boosted Decision Trees achieve strong performance in practice while remaining conceptually simple and interpretable.
