Over the first 5 days of this Machine Learning “Advent Calendar”, we explored 5 models (or algorithms) that are all based on distances (local Euclidean distance, or global Mahalanobis distance).
So it’s time to change the approach, right? We’ll come back to the notion of distance later.
For today, we are going to see something totally different: Decision Trees!
Introduction with a Simple Dataset
Let’s use a straightforward dataset with just one continuous feature.
As always, the idea is that you can visualize the results yourself. Then you have to think about how to make the computer do it.

We can visually guess that for the first split, there are two possible values: one around 5.5 and the other around 12.
Now the question is: which one do we pick?
This is exactly what we are going to find out: how do we determine the value of the first split, with an implementation in Excel?
Once we determine the value of the first split, we can apply the same process to the following splits.
That is why we will only implement the first split in Excel.
The Algorithmic Principle of Decision Tree Regressors
I wrote an article about always distinguishing the three steps of machine learning in order to learn it effectively, so let's apply this principle to Decision Tree Regressors.
For the first time, we have a “true” machine learning model, with non-trivial steps for all three.
What’s the model?
The model here is a set of rules to partition the dataset, and for each partition, we assign a value. Which one? The average value of y over all the observations in the same group.
So whereas k-NN predicts with the average value of the nearest neighbors (similar observations in terms of the feature variables), the Decision Tree regressor predicts with the average value of a group of observations (similar in terms of the feature variable).
Model fitting or training process
For a decision tree, we also call this process fully growing a tree. In the case of a fully grown Decision Tree Regressor, the leaves will contain just one observation each, and thus have an MSE of zero.
Growing a tree consists of recursively partitioning the input data into smaller and smaller chunks or regions. For each region, a prediction can be calculated.
In the case of regression, the prediction is the average of the target variable over the region.
At each step of the building process, the algorithm selects the feature and the split value that optimize the chosen criterion; in the case of a regressor, it is often the Mean Squared Error (MSE) between the actual values and the prediction, which we want to minimize.
Model Tuning or Pruning
For a decision tree, the general term of model tuning is also called pruning, because it can be seen as dropping nodes and leaves from a fully grown tree.
It is also equivalent to saying that the building process stops when a criterion is met, such as a maximum depth or a minimum number of samples in each leaf node. These are the hyperparameters that can be optimized during the tuning process.
Inferencing process
Once the decision tree regressor is built, it can be used to predict the target variable for new input instances by applying the rules and traversing the tree from the root node to the leaf node that corresponds to the input's feature values.
The predicted target value for the input instance is then the mean of the target values of the training samples that fall into the same leaf node.
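To make the traversal concrete, here is a minimal Python sketch, assuming a hypothetical Node structure (this is not scikit-learn's internal representation): an internal node applies its rule to route the input left or right, and a leaf returns the mean of its training samples.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # index of the feature tested at this node (None for a leaf)
    threshold: Optional[float] = None  # split value of the rule "x[feature] < threshold"
    left: Optional["Node"] = None      # child for x[feature] < threshold
    right: Optional["Node"] = None     # child for x[feature] >= threshold
    value: Optional[float] = None      # leaf prediction: mean of y of the samples in the leaf

def predict_one(node: Node, x) -> float:
    """Traverse the tree from the root to a leaf and return the leaf's mean value."""
    while node.left is not None:       # internal nodes always have two children in this sketch
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.value

# A stump that splits on feature 0 at 5.5 (made-up values for illustration)
stump = Node(feature=0, threshold=5.5, left=Node(value=4.0), right=Node(value=13.0))
print(predict_one(stump, [3.0]))   # prints 4.0
```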
Split with One Continuous Feature
Here are the steps we will follow:
- List all possible splits
- For each split, we will calculate the MSE (Mean Squared Error)
- We will select the split that minimizes the MSE as the optimal next split
All possible splits
First, we have to list all the possible splits, which are the average values of two consecutive values. There is no need to test more values.
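As a small sketch of this step (with made-up x values, since the actual dataset lives in the spreadsheet), the candidate thresholds are simply the midpoints between consecutive sorted values:

```python
import numpy as np

# Hypothetical feature values; replace with the column from the spreadsheet
x = np.array([1, 3, 4, 6, 7, 10, 11, 13, 14], dtype=float)

x_sorted = np.sort(np.unique(x))                       # sort and drop duplicates
candidate_splits = (x_sorted[:-1] + x_sorted[1:]) / 2  # midpoints of consecutive values
print(candidate_splits)   # -> 2.0, 3.5, 5.0, 6.5, 8.5, 10.5, 12.0, 13.5
```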

MSE calculation for each possible split
As a starting point, we can calculate the MSE before any split. This means that the prediction is simply the average value of y, and the MSE is then equal to the variance of y.
Now, the idea is to find a split so that the MSE after the split is lower than before. It is possible that the split does not significantly improve the performance (that is, lower the MSE); in that case, the final tree would be trivial: it would just predict the average value of y.
For each possible split, we can then calculate the MSE (Mean Squared Error). The image below shows the calculation for the first possible split, which is x = 2.

We can see the details of the calculation (a small Python sketch after this list reproduces the same steps):
- Cut the dataset into two regions: with the value x=2, we get two possibilities, x<2 or x>2, so the x-axis is cut into two parts.
- Calculate the prediction: for each part, we calculate the average of y. That is the potential prediction for y.
- Calculate the error: then we compare the prediction to the actual value of y.
- Calculate the squared error: for each observation, we can calculate the squared error.
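Here is a hedged sketch of these steps in Python; x and y are placeholder arrays standing in for the spreadsheet columns:

```python
import numpy as np

def split_mse(x, y, threshold):
    """MSE of a split: in each region, the prediction is the mean of y in that region."""
    mask = x < threshold                  # cut the dataset into two regions
    total_squared_error = 0.0
    for region in (y[mask], y[~mask]):    # assumes both regions are non-empty (true for midpoints)
        prediction = region.mean()        # prediction = average of y in the region
        errors = region - prediction      # error = actual value minus prediction
        total_squared_error += (errors ** 2).sum()   # squared errors of the region
    return total_squared_error / len(y)   # MSE over all observations

# Placeholder data and the first candidate split x = 2
x = np.array([1, 3, 4, 6, 7, 10, 11, 13, 14], dtype=float)
y = np.array([5, 6, 5, 12, 13, 11, 20, 21, 19], dtype=float)
print(split_mse(x, y, threshold=2.0))
```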

Optimal split
For each possible split, we do the same to obtain the MSE. In Excel, we can copy and paste the formula, and the only value that changes is the possible split value for x.

Then we can plot the MSE on the y-axis and the possible split values on the x-axis, and we can see that the MSE reaches its minimum for x=5.5, which is exactly the result obtained with Python code.
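If you want to check the spreadsheet result against scikit-learn, a decision stump (a tree of depth 1) exposes its root split directly. The arrays below are placeholders; paste in the values from the Google Sheet to verify that the root threshold comes out at 5.5:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Placeholder data: replace x and y with the columns from the Google Sheet
x = np.array([1, 3, 4, 6, 7, 10, 11, 13, 14], dtype=float).reshape(-1, 1)
y = np.array([5, 6, 5, 12, 13, 11, 20, 21, 19], dtype=float)

stump = DecisionTreeRegressor(max_depth=1)   # a single split
stump.fit(x, y)

print(stump.tree_.threshold[0])   # split value chosen for the root node
```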

Now, one small exercise you can do is to change the MSE to the MAE (Mean Absolute Error).
And you can try to answer this question: what is the impact of this change?
Condensing All Split Calculations into One Summary Table
In the earlier sections, we computed each split step by step, so that we could better visualize the details of the calculations.
Now we put everything together in one single table, so the whole process becomes compact and easy to automate.
To do this, we first simplify the calculations.
In a node, the prediction is the average, so the MSE is exactly the variance. And for the variance, we can use the simplified formula:
Var(y) = mean(y²) - (mean(y))²
So in the following table, we use one row for each possible split.
For each split, we calculate the MSE of the left node and the MSE of the right node.
The variance of each group can be computed from the intermediate sums of y and y squared.
Then we compute the weighted average of the two MSE values. And in the end, we get exactly the same result as with the step-by-step method.
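The same one-row-per-split logic is easy to sketch in Python with running sums of y and y² (again with placeholder data, sorted by x and with distinct x values, so each midpoint cleanly separates a left prefix from a right suffix):

```python
import numpy as np

# Placeholder data, already sorted by x
x = np.array([1, 3, 4, 6, 7, 10, 11, 13, 14], dtype=float)
y = np.array([5, 6, 5, 12, 13, 11, 20, 21, 19], dtype=float)
n = len(y)

cum_y = np.cumsum(y)         # running sum of y
cum_y2 = np.cumsum(y ** 2)   # running sum of y squared

rows = []
for i in range(1, n):                       # split between observations i-1 and i
    threshold = (x[i - 1] + x[i]) / 2
    n_left, n_right = i, n - i
    # Var(y) = mean(y^2) - mean(y)^2, computed from the running sums
    mse_left = cum_y2[i - 1] / n_left - (cum_y[i - 1] / n_left) ** 2
    mse_right = (cum_y2[-1] - cum_y2[i - 1]) / n_right - ((cum_y[-1] - cum_y[i - 1]) / n_right) ** 2
    total = (n_left * mse_left + n_right * mse_right) / n   # weighted average of the two MSEs
    rows.append((threshold, total))

best_threshold, best_mse = min(rows, key=lambda row: row[1])
print(best_threshold, best_mse)
```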

Split with Multiple Continuous Features
Now we will use two features.
This is where it becomes interesting.
We will have candidate splits coming from both features.
How do we decide?
We will simply consider all of them, and then select the split with the smallest MSE.
The idea in Excel is:
- first, put all possible splits from the two features into one single column,
- then, for each of these splits, calculate the MSE as before,
- finally, pick the best one.
Splits concatenation
First, we list all possible splits for Feature 1 (for example, all thresholds between two consecutive sorted values).
Then, we list all possible splits for Feature 2 in the same way.
In Excel, we concatenate these two lists into one column of candidate splits.
So each row in this column represents:
- “If I cut here, using this feature, what happens to the MSE?”
This gives us one unified list of all splits from both features.

MSE calculations
Once we have the list of all splits, the rest follows the same logic as before.
For each row (that is, for each split):
- we divide the points into a left node and a right node,
- we compute the MSE of the left node,
- we compute the MSE of the right node,
- we take the weighted average of the two MSE values.
At the end, we look at this column of “Total MSE” and we pick the split (and therefore the feature and threshold) that gives the minimum value.
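A minimal Python sketch of this candidate list over two features could look like the following (X and y are made-up placeholders): the loop simply tries every feature/threshold pair and keeps the one with the smallest MSE.

```python
import numpy as np

def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values."""
    v = np.sort(np.unique(values))
    return (v[:-1] + v[1:]) / 2

def weighted_mse(y_left, y_right):
    """Weighted average of the MSE of the two nodes (equal to total SSE / n)."""
    sse = ((y_left - y_left.mean()) ** 2).sum() + ((y_right - y_right.mean()) ** 2).sum()
    return sse / (len(y_left) + len(y_right))

# Placeholder data: 20 observations, 2 continuous features
rng = np.random.default_rng(0)
X = rng.uniform(0, 15, size=(20, 2))
y = 10.0 * (X[:, 0] > 5.5) + rng.normal(0, 1, size=20)

best = None
for j in range(X.shape[1]):                       # candidate splits from both features...
    for threshold in candidate_thresholds(X[:, j]):
        mask = X[:, j] < threshold
        mse = weighted_mse(y[mask], y[~mask])     # ...all evaluated with the same MSE formula
        if best is None or mse < best[0]:
            best = (mse, j, threshold)            # keep the split with the smallest MSE

best_mse, best_feature, best_threshold = best
print(best_feature, best_threshold, best_mse)
```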

Split with One Continuous and One Categorical Feature
Now let us mix two very different types of features:
- one continuous feature
- one categorical feature (for instance, A, B, C).
A decision tree can split on both, but the way we generate candidate splits is not the same.
With a continuous feature, we test thresholds.
With a categorical feature, we test groups of categories.
So the idea is exactly the same as before:
we consider all possible splits from both features, then we select the one with the smallest MSE.
Categorical feature splits
For a categorical feature, the logic is different:
- each category is already a “group”,
- so the simplest split is: one category vs all the others.
For instance, if the categories are A, B, and C:
- split 1: A vs (B, C)
- split 2: B vs (A, C)
- split 3: C vs (A, B)
This already gives meaningful candidate splits and keeps the Excel formulas manageable.
Each of these category-based splits is added to the same list that contains the continuous thresholds.
MSE calculation
Once all splits are listed (continuous thresholds + categorical partitions), the calculations follow the same steps:
- assign each point to the left or right node,
- compute the MSE of the left node,
- compute the MSE of the right node,
- compute the weighted average.
The split with the lowest total MSE becomes the optimal first split.
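Here is a hedged sketch of this mixed candidate list in Python (the small dataset is invented for illustration): continuous candidates are midpoint thresholds, categorical candidates are one-category-vs-the-rest, and both go through the same weighted MSE.

```python
import numpy as np

# Invented mixed dataset: one continuous feature, one categorical feature
x_num = np.array([1, 3, 4, 6, 7, 10, 11, 13], dtype=float)
x_cat = np.array(["A", "B", "A", "C", "B", "C", "A", "B"])
y = np.array([5, 6, 5, 12, 13, 11, 20, 21], dtype=float)

def weighted_mse(y_left, y_right):
    sse = ((y_left - y_left.mean()) ** 2).sum() + ((y_right - y_right.mean()) ** 2).sum()
    return sse / (len(y_left) + len(y_right))

candidates = []

# Continuous feature: thresholds halfway between consecutive distinct values
v = np.sort(np.unique(x_num))
for threshold in (v[:-1] + v[1:]) / 2:
    mask = x_num < threshold
    candidates.append((f"x_num < {threshold}", weighted_mse(y[mask], y[~mask])))

# Categorical feature: one category vs all the others
for category in np.unique(x_cat):
    mask = x_cat == category
    candidates.append((f"x_cat == {category}", weighted_mse(y[mask], y[~mask])))

best_split, best_mse = min(candidates, key=lambda row: row[1])
print(best_split, best_mse)
```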

Exercises you can try out
Now, you can play with the Google Sheet:
- You can try to find the next split
- You can change the criterion: instead of MSE, you can use absolute error, Poisson, or friedman_mse, as indicated in the documentation of DecisionTreeRegressor (see the sketch after this list)
- You can change the target variable to a binary variable. In principle, this becomes a classification task, but 0 and 1 are also numbers, so the MSE criterion can still be applied. But if you want to create a proper classifier, you have to use the usual Entropy or Gini criteria. That is for the next article.
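For the second exercise, scikit-learn lets you swap the criterion directly. A quick way to compare them, with the same placeholder data as before (replace it with the spreadsheet values), is:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Placeholder data; replace with the values from the Google Sheet
x = np.array([1, 3, 4, 6, 7, 10, 11, 13, 14], dtype=float).reshape(-1, 1)
y = np.array([5, 6, 5, 12, 13, 11, 20, 21, 19], dtype=float)

# Criteria listed in the DecisionTreeRegressor documentation ("poisson" requires non-negative y)
for criterion in ["squared_error", "friedman_mse", "absolute_error", "poisson"]:
    stump = DecisionTreeRegressor(criterion=criterion, max_depth=1).fit(x, y)
    print(criterion, stump.tree_.threshold[0])   # root split value chosen by each criterion
```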
Conclusion
Using Excel, it is possible to implement one split and gain more insight into how Decision Tree Regressors work. Even though we did not build a full tree, it is still instructive, since the most important part is finding the optimal split among all possible splits.
Another thing about missing values
Have you noticed something interesting about how features are handled by distance-based models versus decision trees?
For distance-based models, everything must be numeric. Continuous features stay continuous, and categorical features must be transformed into numbers. The model compares points in space, so everything has to live on a numeric axis.
Decision Trees do the inverse: they turn features into groups. A continuous feature becomes intervals. A categorical feature stays categorical.
And a missing value? It simply becomes another category. There is no need to impute first. The tree can naturally handle it by sending all “missing” values to one branch, just like any other group.
