Yesterday, we explored how a Decision Tree Regressor chooses its optimal split by minimizing the Mean Squared Error (MSE).
Today, for Day 7 of the Machine Learning “Advent Calendar”, we continue with the same approach, but with a Decision Tree Classifier, the classification counterpart of yesterday’s model.
Quick intuition experiment with two simple datasets
Let us begin with a very small toy dataset that I generated, with one numerical feature and one target variable with two classes: 0 and 1.
The idea is to cut the dataset into two parts, based on one rule. But the question is: what should this rule be? What is the criterion that tells us which split is best?
Now, even if we do not know the mathematics yet, we can already look at the data and guess possible split points.
And visually, it could be 8 or 12, right?
But the question is which one is better numerically.
If we think intuitively:
- With a split at 8:
- left side: no misclassification
- right side: one misclassification
- With a split at 12:
- right side: no misclassification
- left side: two misclassifications
So clearly, the split at 8 feels better.

Now, let us look at an example with three classes. I added some more random data and made 3 classes.

Here I label them 0, 1, 3, and I plot them vertically.
But we have to be careful: these numbers are just category names, not numeric values. They should not be interpreted as “ordered”.
So the intuition is always: how homogeneous is each region after the split?
But it is harder to visually determine the best split.
Now, we need a mathematical way to express this idea.
This is exactly the topic of the next section.
Impurity measure as the split criterion
In the Decision Tree Regressor, we already know:
- The prediction for a region is the average of the target.
- The quality of a split is measured by the MSE.
In the Decision Tree Classifier:
- The prediction for a region is the majority class of the region.
- The quality of a split is measured by an impurity measure: Gini impurity or Entropy.
Both are standard in textbooks, and both are available in scikit-learn. Gini is used by default.
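If you want to see this choice in code, here is a minimal scikit-learn sketch; the tiny X and y below are made-up placeholders, not the toy dataset from the figures:

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up placeholder data: one numerical feature, two classes
X = [[1], [2], [3], [9], [10], [11]]
y = [0, 0, 0, 1, 1, 1]

# criterion="gini" is the default; switch to "entropy" to compare
clf = DecisionTreeClassifier(criterion="gini", max_depth=1)
clf.fit(X, y)
print(clf.tree_.threshold[0])  # split point chosen for the root node
```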
BUT, what is this impurity measure, really?
If you look at the curves of Gini and Entropy, they both behave the same way:
- They are 0 when the node is pure (all samples have the same class).
- They reach their maximum when the classes are evenly mixed (50 percent / 50 percent).
- The curve is smooth, symmetric, and increases with disorder.
That is the essential property of any impurity measure:
Impurity is low when groups are clean, and high when groups are mixed.

So we will use these measures to decide which split to create.
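To make the curves concrete, here is a small sketch of both measures for a two-class node, as a function of the proportion p of class 1:

```python
import math

def gini(p):
    # Gini impurity of a two-class node where class 1 has proportion p
    return 1 - p**2 - (1 - p)**2

def entropy(p):
    # Entropy (in bits) of a two-class node; pure nodes (p = 0 or 1) give 0
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```

Both functions are 0 at p = 0 and p = 1 and reach their maximum at p = 0.5, exactly as described above.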
Split with One Continuous Feature
Just like for the Decision Tree Regressor, we will follow the same structure.
List of all possible splits
Exactly like in the regressor version, with one numerical feature, the only splits we need to test are the midpoints between consecutive sorted x values.
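In code, listing these candidate splits takes one line; the feature values below are made up for illustration:

```python
x = sorted([1, 3, 5, 6, 8, 12, 14])  # made-up feature values
candidate_splits = [(a + b) / 2 for a, b in zip(x, x[1:]) if a != b]
print(candidate_splits)  # midpoints between consecutive distinct values
```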
For each split, compute the impurity on both sides
Let us take a split value, for example, x = 5.5.
We separate the dataset into two regions:
- Region L: x < 5.5
- Region R: x ≥ 5.5
For each region:
- we count the total number of observations,
- we compute the Gini impurity,
- and finally, we compute the weighted impurity of the split, as shown in the sketch below.
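Here is a minimal sketch of this computation, assuming a small made-up dataset of (x, class) pairs and the split at x = 5.5:

```python
def gini_from_labels(labels):
    # Gini impurity computed from a list of class labels
    n = len(labels)
    if n == 0:
        return 0.0
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Made-up toy data: (x, class)
data = [(1, 0), (2, 0), (4, 0), (7, 1), (9, 1), (10, 0)]

split = 5.5
left = [c for x, c in data if x < split]
right = [c for x, c in data if x >= split]

weighted = (len(left) * gini_from_labels(left)
            + len(right) * gini_from_labels(right)) / len(data)
print(weighted)  # weighted Gini impurity of the split at 5.5
```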

Select the split with the lowest impurity
Like in the regressor case:
- List all possible splits
- Compute the impurity for each
- The optimal split is the one with the minimum impurity

Summary Table of All Splits
To make everything automatic in Excel, we organize all calculations in one table, where:
- each row corresponds to one candidate split,
- for each row, we compute:
- the Gini of the left region,
- the Gini of the right region,
- and the overall weighted Gini of the split.
This table gives a clean, compact overview of every possible split, and the best split is simply the one with the lowest value in the final column.
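The same table can be mimicked in a few lines of Python; this sketch reuses the made-up toy data from above and prints one row per candidate split:

```python
def gini_from_labels(labels):
    # Gini impurity computed from a list of class labels
    n = len(labels)
    if n == 0:
        return 0.0
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_table(data):
    # data: list of (x, class) pairs
    # returns one row per candidate split: (split, gini_left, gini_right, weighted_gini)
    xs = sorted({x for x, _ in data})
    rows = []
    for a, b in zip(xs, xs[1:]):
        split = (a + b) / 2
        left = [c for x, c in data if x < split]
        right = [c for x, c in data if x >= split]
        weighted = (len(left) * gini_from_labels(left)
                    + len(right) * gini_from_labels(right)) / len(data)
        rows.append((split, gini_from_labels(left), gini_from_labels(right), weighted))
    return rows

data = [(1, 0), (2, 0), (4, 0), (7, 1), (9, 1), (10, 0)]  # made-up toy data
rows = split_table(data)
for row in rows:
    print(row)
best = min(rows, key=lambda r: r[3])  # smallest weighted Gini in the final column
print("best split:", best[0])
```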

Multi-class classification
Until now, we worked with two classes. But the Gini impurity extends naturally to three classes, and the logic of the split stays exactly the same.
Nothing changes in the structure of the algorithm:
- we list all possible splits,
- we compute impurity on both sides,
- we take the weighted average,
- we select the split with the lowest impurity.
Only the formula of the Gini impurity becomes slightly longer.
Gini impurity with three classes
If a region contains proportions p1, p2, p3 for the three classes, then the Gini impurity is:
Gini = 1 − (p1² + p2² + p3²)

The same idea as before:
a region is “pure” when one class dominates,
and the impurity becomes large when classes are mixed.
Left and Right regions
For each split:
- Region L contains some observations of classes 1, 2, and 3
- Region R contains the remaining observations
For each region:
- count how many points belong to each class
- compute the proportions p1, p2, p3
- compute the Gini impurity using the formula above
Everything is exactly the same as in the binary case, just with one more term.
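Here is a minimal sketch of these per-region steps, with made-up class counts for the three classes:

```python
def gini_from_counts(counts):
    # Gini impurity from the number of observations of each class in a region
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1 - sum((c / n) ** 2 for c in counts)

# Made-up example: region L holds 4 / 1 / 0 observations of the three classes,
# region R holds 1 / 3 / 5
gini_L = gini_from_counts([4, 1, 0])
gini_R = gini_from_counts([1, 3, 5])
n_L, n_R = 5, 9
weighted = (n_L * gini_L + n_R * gini_R) / (n_L + n_R)
print(gini_L, gini_R, weighted)
```

The weighted impurity is then compared across all candidate splits, exactly as in the binary case.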
Summary Table for 3-class splits
Just like before, we collect all the computations in a single table:
- each row is one possible split
- we count class 1, class 2, class 3 on the left
- we count class 1, class 2, class 3 on the right
- we compute Gini (Left), Gini (Right), and the weighted Gini
The split with the smallest weighted impurity is the one chosen by the decision tree.

We can easily generalize the algorithm to K classes, using the following formulas to calculate the Gini impurity or the Entropy:
Gini = 1 − (p1² + p2² + … + pK²)
Entropy = −(p1·log2(p1) + p2·log2(p2) + … + pK·log2(pK))

How Different Are Impurity Measures, Really?
Now, we always mention Gini or Entropy as the criterion, but do they really differ? Looking at the mathematical formulas, you might expect them to.
The answer is: not that much.
In almost all practical situations:
- Gini and Entropy select the same split
- The tree structure is almost identical
- The predictions are the same
Why?
Because their curves look extremely similar.
They both peak at 50 percent mixing and drop to zero at purity.
The only difference is the shape of the curve:
- Gini is a quadratic function. It penalizes misclassification more linearly.
- Entropy is a logarithmic function, so it penalizes uncertainty a bit more strongly near 0.5.
But the difference is tiny in practice, and you can see it for yourself in Excel!
Other impurity measures?
Another natural question: is it possible to invent or use other measures?
Yes, you could invent your own function, as long as:
- It’s 0 when the node is pure
- It’s maximal when classes are mixed
- It’s smooth and strictly increasing in “disorder”
For example: Impurity = 4*p0*p1
This is another valid impurity measure. And it is actually equal to the Gini impurity multiplied by a constant when there are only two classes.
So again, it gives the same splits. If you are not convinced, you can check the algebra below, or verify it yourself in Excel.
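For two classes, the algebra takes one line, since p1 = 1 − p0:

$$4\,p_0\,p_1 = 2\left(1 - p_0^2 - p_1^2\right) = 2 \times \text{Gini}$$

So this measure is just twice the Gini impurity, and minimizing one minimizes the other.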
Here are some other measures that can also be used.

Exercises in Excel
Tests with other parameters and features
Once you have built the first split, you can extend your file:
- Try Entropy instead of Gini
- Try adding categorical features
- Try building the next split
- Try changing max depth and observe under- and over-fitting
- Try making a confusion matrix for predictions
These simple tests already give you a good intuition for how real decision trees behave.
Implementing the rules for the Titanic Survival Dataset
A natural follow-up exercise is to recreate decision rules for the famous Titanic Survival Dataset (CC0 / Public Domain).
First, we can start with only two features: sex and age.
Implementing the rules in Excel is long and a bit tedious, but that is exactly the point: it makes you realize what decision rules really look like.
They are nothing more than a sequence of IF / ELSE statements, repeated over and over.
That is the true nature of a decision tree: simple rules, stacked on top of one another.
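As a purely illustrative sketch (the thresholds below are made up, not rules actually fitted on the Titanic data), such stacked IF / ELSE rules look like this:

```python
def predict_survival(sex, age):
    # Illustrative decision rules only: the feature names follow the Titanic dataset,
    # but the thresholds and outcomes are hypothetical, not a fitted tree.
    if sex == "female":
        return 1          # predicted: survived
    else:
        if age < 10:      # hypothetical age threshold
            return 1
        else:
            return 0      # predicted: did not survive

print(predict_survival("female", 30))  # -> 1
print(predict_survival("male", 40))    # -> 0
```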

Conclusion
Implementing a Decision Tree Classifier in Excel is surprisingly accessible.
With just a few formulas, you uncover the heart of the algorithm:
- list possible splits
- compute impurity
- select the cleanest split

This simple mechanism is the foundation of more advanced ensemble models like Gradient Boosted Trees, which we will discuss later in this series.
And stay tuned for Day 8 tomorrow!
