Over the first 5 days of this Machine Learning “Advent Calendar”, we explored 5 models (or algorithms) that are all based on distances (local Euclidean distance, or global Mahalanobis distance).
So it’s time to change the approach, right? We’ll come back to the notion of distance later.
For today, we are going to see something totally different: Decision Trees!
Introduction with a Simple Dataset
Let’s use a straightforward dataset with just one continuous feature.
As always, the idea is that you can visualize the results yourself. Then you have to think about how to make the computer do it.

We can visually guess that for the first split, there are two possible values: one around 5.5 and the other around 12.
Now the question is: which one do we pick?
This is exactly what we are going to find out: how do we determine the value of the first split, with an implementation in Excel?
Once we determine the value of the first split, we can apply the same process to the following splits.
That is why we will only implement the first split in Excel.
The Algorithmic Principle of Decision Tree Regressors
I wrote an article about always distinguishing the three steps of machine learning in order to learn it effectively, so let's apply this principle to Decision Tree Regressors.
For the first time, we have a “true” machine learning model, with non-trivial steps for all three.
What’s the model?
The model here is a set of rules to partition the dataset, and for each partition, we assign a value. Which one? The average value of y over all the observations in the same group.
So whereas k-NN predicts with the average value of the nearest neighbors (similar observations in terms of the feature variables), the Decision Tree regressor predicts with the average value of a group of observations (similar in terms of the feature variable).
Model fitting or training process
For a decision tree, we also call this process fully growing a tree. In the case of a fully grown Decision Tree Regressor, the leaves will contain just one observation each, and thus have an MSE of zero.
Growing a tree consists of recursively partitioning the input data into smaller and smaller chunks or regions. For each region, a prediction can be calculated.
In the case of regression, the prediction is the average of the target variable over the region.
At each step of the building process, the algorithm selects the feature and the split value that optimize the chosen criterion; in the case of a regressor, it is often the Mean Squared Error (MSE) between the actual values and the prediction, which we want to minimize.
Model Tuning or Pruning
For a decision tree, the general term of model tuning is also called pruning, because it can be seen as dropping nodes and leaves from a fully grown tree.
It is also equivalent to saying that the building process stops when a criterion is met, such as a maximum depth or a minimum number of samples in each leaf node. These are the hyperparameters that can be optimized during the tuning process.
Inferencing process
Once the decision tree regressor is built, it can be used to predict the target variable for new input instances by applying the rules and traversing the tree from the root node to the leaf node that corresponds to the input's feature values.
The predicted target value for the input instance is then the mean of the target values of the training samples that fall into the same leaf node.
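To make the traversal concrete, here is a minimal Python sketch, assuming a hypothetical Node structure (this is not scikit-learn's internal representation): an internal node applies its rule to route the input left or right, and a leaf returns the mean of its training samples.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # index of the feature tested at this node (None for a leaf)
    threshold: Optional[float] = None  # split value of the rule "x[feature] < threshold"
    left: Optional["Node"] = None      # child for x[feature] < threshold
    right: Optional["Node"] = None     # child for x[feature] >= threshold
    value: Optional[float] = None      # leaf prediction: mean of y of the samples in the leaf

def predict_one(node: Node, x) -> float:
    """Traverse the tree from the root to a leaf and return the leaf's mean value."""
    while node.left is not None:       # internal nodes always have two children in this sketch
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.value

# A stump that splits on feature 0 at 5.5 (made-up values for illustration)
stump = Node(feature=0, threshold=5.5, left=Node(value=4.0), right=Node(value=13.0))
print(predict_one(stump, [3.0]))   # prints 4.0
```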
Split with One Continuous Feature
Here are the steps we will follow:
- List all possible splits
- For each split, we will calculate the MSE (Mean Squared Error)
- We will select the split that minimizes the MSE as the optimal next split
All possible splits
First, we have to list all the possible splits, which are the average values of two consecutive values. There is no need to test more values.
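As a small sketch of this step (with made-up x values, since the actual dataset lives in the spreadsheet), the candidate thresholds are simply the midpoints between consecutive sorted values:

```python
import numpy as np

# Hypothetical feature values; replace with the column from the spreadsheet
x = np.array([1, 3, 4, 6, 7, 10, 11, 13, 14], dtype=float)

x_sorted = np.sort(np.unique(x))                       # sort and drop duplicates
candidate_splits = (x_sorted[:-1] + x_sorted[1:]) / 2  # midpoints of consecutive values
print(candidate_splits)   # -> 2.0, 3.5, 5.0, 6.5, 8.5, 10.5, 12.0, 13.5
```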

MSE calculation for each possible split
As a starting point, we can calculate the MSE before any split. This means that the prediction is simply the average value of y, and the MSE is then equal to the variance of y.
Now, the idea is to find a split so that the MSE after the split is lower than before. It is possible that the split does not significantly improve the performance (that is, lower the MSE); in that case, the final tree would be trivial: it would just predict the average value of y.
For each possible split, we can then calculate the MSE (Mean Squared Error). The image below shows the calculation for the first possible split, which is x = 2.

We can see the details of the calculation (a small Python sketch after this list reproduces the same steps):
- Cut the dataset into two regions: with the value x=2, we get two possibilities, x<2 or x>2, so the x-axis is cut into two parts.
- Calculate the prediction: for each part, we calculate the average of y. That is the potential prediction for y.
- Calculate the error: then we compare the prediction to the actual value of y.
- Calculate the squared error: for each observation, we can calculate the squared error.
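Here is a hedged sketch of these steps in Python; x and y are placeholder arrays standing in for the spreadsheet columns:

```python
import numpy as np

def split_mse(x, y, threshold):
    """MSE of a split: in each region, the prediction is the mean of y in that region."""
    mask = x < threshold                  # cut the dataset into two regions
    total_squared_error = 0.0
    for region in (y[mask], y[~mask]):    # assumes both regions are non-empty (true for midpoints)
        prediction = region.mean()        # prediction = average of y in the region
        errors = region - prediction      # error = actual value minus prediction
        total_squared_error += (errors ** 2).sum()   # squared errors of the region
    return total_squared_error / len(y)   # MSE over all observations

# Placeholder data and the first candidate split x = 2
x = np.array([1, 3, 4, 6, 7, 10, 11, 13, 14], dtype=float)
y = np.array([5, 6, 5, 12, 13, 11, 20, 21, 19], dtype=float)
print(split_mse(x, y, threshold=2.0))
```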

Optimal split
For each possible split, we do the same to obtain the MSE. In Excel, we can copy and paste the formula, and the only value that changes is the possible split value for x.

Then we can plot the MSE on the y-axis and the possible split values on the x-axis, and we can see that the MSE reaches its minimum for x=5.5, which is exactly the result obtained with Python code.
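If you want to check the spreadsheet result against scikit-learn, a decision stump (a tree of depth 1) exposes its root split directly. The arrays below are placeholders; paste in the values from the Google Sheet to verify that the root threshold comes out at 5.5:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Placeholder data: replace x and y with the columns from the Google Sheet
x = np.array([1, 3, 4, 6, 7, 10, 11, 13, 14], dtype=float).reshape(-1, 1)
y = np.array([5, 6, 5, 12, 13, 11, 20, 21, 19], dtype=float)

stump = DecisionTreeRegressor(max_depth=1)   # a single split
stump.fit(x, y)

print(stump.tree_.threshold[0])   # split value chosen for the root node
```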

Now, one small exercise you can do is to change the MSE to the MAE (Mean Absolute Error).
And you can try to answer this question: what is the impact of this change?
Condensing All Split Calculations into One Summary Table
In the earlier sections, we computed each split step by step, so that we could better visualize the details of the calculations.
Now we put everything together in one single table, so the whole process becomes compact and easy to automate.
To do this, we first simplify the calculations.
In a node, the prediction is the average, so the MSE is exactly the variance. And for the variance, we can use the simplified formula:
Var(y) = mean(y²) - (mean(y))²
So in the following table, we use one row for each possible split.
For each split, we calculate the MSE of the left node and the MSE of the right node.
The variance of each group can be computed from the intermediate sums of y and y squared.
Then we compute the weighted average of the two MSE values. And in the end, we get exactly the same result as with the step-by-step method.
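The same one-row-per-split logic is easy to sketch in Python with running sums of y and y² (again with placeholder data, sorted by x and with distinct x values, so each midpoint cleanly separates a left prefix from a right suffix):

```python
import numpy as np

# Placeholder data, already sorted by x
x = np.array([1, 3, 4, 6, 7, 10, 11, 13, 14], dtype=float)
y = np.array([5, 6, 5, 12, 13, 11, 20, 21, 19], dtype=float)
n = len(y)

cum_y = np.cumsum(y)         # running sum of y
cum_y2 = np.cumsum(y ** 2)   # running sum of y squared

rows = []
for i in range(1, n):                       # split between observations i-1 and i
    threshold = (x[i - 1] + x[i]) / 2
    n_left, n_right = i, n - i
    # Var(y) = mean(y^2) - mean(y)^2, computed from the running sums
    mse_left = cum_y2[i - 1] / n_left - (cum_y[i - 1] / n_left) ** 2
    mse_right = (cum_y2[-1] - cum_y2[i - 1]) / n_right - ((cum_y[-1] - cum_y[i - 1]) / n_right) ** 2
    total = (n_left * mse_left + n_right * mse_right) / n   # weighted average of the two MSEs
    rows.append((threshold, total))

best_threshold, best_mse = min(rows, key=lambda row: row[1])
print(best_threshold, best_mse)
```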

Split with Multiple Continuous Features
Now we will use two features.
This is where it becomes interesting.
We will have candidate splits coming from both features.
How do we decide?
We will simply consider all of them, and then select the split with the smallest MSE.
The idea in Excel is:
- first, put all possible splits from the two features into one single column,
- then, for each of these splits, calculate the MSE as before,
- finally, pick the best one.
Splits concatenation
First, we list all possible splits for Feature 1 (for example, all thresholds between two consecutive sorted values).
Then, we list all possible splits for Feature 2 in the same way.
In Excel, we concatenate these two lists into one column of candidate splits.
So each row in this column represents:
- “If I cut here, using this feature, what happens to the MSE?”
This gives us one unified list of all splits from both features.

MSE calculations
Once we have the list of all splits, the rest follows the same logic as before.
For each row (that is, for each split):
- we divide the points into a left node and a right node,
- we compute the MSE of the left node,
- we compute the MSE of the right node,
- we take the weighted average of the two MSE values.
At the end, we look at this column of “Total MSE” and we pick the split (and therefore the feature and threshold) that gives the minimum value.
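A minimal Python sketch of this candidate list over two features could look like the following (X and y are made-up placeholders): the loop simply tries every feature/threshold pair and keeps the one with the smallest MSE.

```python
import numpy as np

def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values."""
    v = np.sort(np.unique(values))
    return (v[:-1] + v[1:]) / 2

def weighted_mse(y_left, y_right):
    """Weighted average of the MSE of the two nodes (equal to total SSE / n)."""
    sse = ((y_left - y_left.mean()) ** 2).sum() + ((y_right - y_right.mean()) ** 2).sum()
    return sse / (len(y_left) + len(y_right))

# Placeholder data: 20 observations, 2 continuous features
rng = np.random.default_rng(0)
X = rng.uniform(0, 15, size=(20, 2))
y = 10.0 * (X[:, 0] > 5.5) + rng.normal(0, 1, size=20)

best = None
for j in range(X.shape[1]):                       # candidate splits from both features...
    for threshold in candidate_thresholds(X[:, j]):
        mask = X[:, j] < threshold
        mse = weighted_mse(y[mask], y[~mask])     # ...all evaluated with the same MSE formula
        if best is None or mse < best[0]:
            best = (mse, j, threshold)            # keep the split with the smallest MSE

best_mse, best_feature, best_threshold = best
print(best_feature, best_threshold, best_mse)
```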

Split with One Continuous and One Categorical Feature
Now let us mix two very different types of features:
- one continuous feature
- one categorical feature (for instance, A, B, C).
A decision tree can split on both, but the way we generate candidate splits is not the same.
With a continuous feature, we test thresholds.
With a categorical feature, we test groups of categories.
So the idea is exactly the same as before:
we consider all possible splits from both features, then we select the one with the smallest MSE.
Categorical feature splits
For a categorical feature, the logic is different:
- each category is already a “group”,
- so the simplest split is: one category vs all the others.
For instance, if the categories are A, B, and C:
- split 1: A vs (B, C)
- split 2: B vs (A, C)
- split 3: C vs (A, B)
This already gives meaningful candidate splits and keeps the Excel formulas manageable.
Each of these category-based splits is added to the same list that contains the continuous thresholds.
MSE calculation
Once all splits are listed (continuous thresholds + categorical partitions), the calculations follow the same steps:
- assign each point to the left or right node,
- compute the MSE of the left node,
- compute the MSE of the right node,
- compute the weighted average.
The split with the lowest total MSE becomes the optimal first split.
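Here is a hedged sketch of this mixed candidate list in Python (the small dataset is invented for illustration): continuous candidates are midpoint thresholds, categorical candidates are one-category-vs-the-rest, and both go through the same weighted MSE.

```python
import numpy as np

# Invented mixed dataset: one continuous feature, one categorical feature
x_num = np.array([1, 3, 4, 6, 7, 10, 11, 13], dtype=float)
x_cat = np.array(["A", "B", "A", "C", "B", "C", "A", "B"])
y = np.array([5, 6, 5, 12, 13, 11, 20, 21], dtype=float)

def weighted_mse(y_left, y_right):
    sse = ((y_left - y_left.mean()) ** 2).sum() + ((y_right - y_right.mean()) ** 2).sum()
    return sse / (len(y_left) + len(y_right))

candidates = []

# Continuous feature: thresholds halfway between consecutive distinct values
v = np.sort(np.unique(x_num))
for threshold in (v[:-1] + v[1:]) / 2:
    mask = x_num < threshold
    candidates.append((f"x_num < {threshold}", weighted_mse(y[mask], y[~mask])))

# Categorical feature: one category vs all the others
for category in np.unique(x_cat):
    mask = x_cat == category
    candidates.append((f"x_cat == {category}", weighted_mse(y[mask], y[~mask])))

best_split, best_mse = min(candidates, key=lambda row: row[1])
print(best_split, best_mse)
```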

Exercises you can try out
Now, you can play with the Google Sheet:
- You can try to find the next split
- You can change the criterion: instead of MSE, you can use absolute error, Poisson, or friedman_mse, as indicated in the documentation of DecisionTreeRegressor (see the sketch after this list)
- You can change the target variable to a binary variable. In principle, this becomes a classification task, but 0 and 1 are also numbers, so the MSE criterion can still be applied. But if you want to create a proper classifier, you have to use the usual Entropy or Gini criteria. That is for the next article.
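For the second exercise, scikit-learn lets you swap the criterion directly. A quick way to compare them, with the same placeholder data as before (replace it with the spreadsheet values), is:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Placeholder data; replace with the values from the Google Sheet
x = np.array([1, 3, 4, 6, 7, 10, 11, 13, 14], dtype=float).reshape(-1, 1)
y = np.array([5, 6, 5, 12, 13, 11, 20, 21, 19], dtype=float)

# Criteria listed in the DecisionTreeRegressor documentation ("poisson" requires non-negative y)
for criterion in ["squared_error", "friedman_mse", "absolute_error", "poisson"]:
    stump = DecisionTreeRegressor(criterion=criterion, max_depth=1).fit(x, y)
    print(criterion, stump.tree_.threshold[0])   # root split value chosen by each criterion
```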
Conclusion
Using Excel, it is possible to implement one split and gain more insight into how Decision Tree Regressors work. Even though we did not build a full tree, it is still instructive, since the most important part is finding the optimal split among all possible splits.
Another thing about missing values
Have you noticed something interesting about how features are handled by distance-based models versus decision trees?
For distance-based models, everything must be numeric. Continuous features stay continuous, and categorical features must be transformed into numbers. The model compares points in space, so everything has to live on a numeric axis.
Decision Trees do the inverse: they turn features into groups. A continuous feature becomes intervals. A categorical feature stays categorical.
And a missing value? It simply becomes another category. There is no need to impute first. The tree can naturally handle it by sending all “missing” values to one branch, just like any other group.
