Yesterday, we explored how a Decision Tree Regressor chooses its optimal split by minimizing the Mean Squared Error (MSE).
Today, for Day 7 of the Machine Learning “Advent Calendar”, we continue with the same approach, but with a Decision Tree Classifier, the classification counterpart of yesterday’s model.
Quick intuition experiment with two simple datasets
Let us begin with a very small toy dataset that I generated, with one numerical feature and one target variable with two classes: 0 and 1.
The idea is to cut the dataset into two parts, based on one rule. But the question is: what should this rule be? What is the criterion that tells us which split is best?
Now, even if we do not know the mathematics yet, we can already look at the data and guess possible split points.
And visually, it could be 8 or 12, right?
But the question is which one is better numerically.
If we think intuitively:
- With a split at 8:
- left side: no misclassification
- right side: one misclassification
- With a split at 12:
- right side: no misclassification
- left side: two misclassifications
So clearly, the split at 8 feels better.

Now, let us look at an example with three classes. I added some more random data and made 3 classes.

Here I label them 0, 1, 3, and I plot them vertically.
But we have to be careful: these numbers are just category names, not numeric values. They should not be interpreted as “ordered”.
So the intuition is always: how homogeneous is each region after the split?
But it is harder to visually determine the best split.
Now, we need a mathematical way to express this idea.
This is exactly the topic of the next section.
Impurity measure as the split criterion
In the Decision Tree Regressor, we already know:
- The prediction for a region is the average of the target.
- The quality of a split is measured by the MSE.
In the Decision Tree Classifier:
- The prediction for a region is the majority class of the region.
- The quality of a split is measured by an impurity measure: Gini impurity or Entropy.
Both are standard in textbooks, and both are available in scikit-learn. Gini is used by default.
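If you want to see this choice in code, here is a minimal scikit-learn sketch; the tiny X and y below are made-up placeholders, not the toy dataset from the figures:

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up placeholder data: one numerical feature, two classes
X = [[1], [2], [3], [9], [10], [11]]
y = [0, 0, 0, 1, 1, 1]

# criterion="gini" is the default; switch to "entropy" to compare
clf = DecisionTreeClassifier(criterion="gini", max_depth=1)
clf.fit(X, y)
print(clf.tree_.threshold[0])  # split point chosen for the root node
```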
BUT, what is this impurity measure, really?
If you look at the curves of Gini and Entropy, they both behave the same way:
- They are 0 when the node is pure (all samples have the same class).
- They reach their maximum when the classes are evenly mixed (50 percent / 50 percent).
- The curve is smooth, symmetric, and increases with disorder.
That is the essential property of any impurity measure:
Impurity is low when groups are clean, and high when groups are mixed.

So we will use these measures to decide which split to create.
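To make the curves concrete, here is a small sketch of both measures for a two-class node, as a function of the proportion p of class 1:

```python
import math

def gini(p):
    # Gini impurity of a two-class node where class 1 has proportion p
    return 1 - p**2 - (1 - p)**2

def entropy(p):
    # Entropy (in bits) of a two-class node; pure nodes (p = 0 or 1) give 0
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```

Both functions are 0 at p = 0 and p = 1 and reach their maximum at p = 0.5, exactly as described above.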
Split with One Continuous Feature
Just like for the Decision Tree Regressor, we will follow the same structure.
List of all possible splits
Exactly like in the regressor version, with one numerical feature, the only splits we need to test are the midpoints between consecutive sorted x values.
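In code, listing these candidate splits takes one line; the feature values below are made up for illustration:

```python
x = sorted([1, 3, 5, 6, 8, 12, 14])  # made-up feature values
candidate_splits = [(a + b) / 2 for a, b in zip(x, x[1:]) if a != b]
print(candidate_splits)  # midpoints between consecutive distinct values
```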
For each split, compute the impurity on both sides
Let us take a split value, for example, x = 5.5.
We separate the dataset into two regions:
- Region L: x < 5.5
- Region R: x ≥ 5.5
For each region:
- we count the total number of observations,
- we compute the Gini impurity,
- and finally, we compute the weighted impurity of the split, as shown in the sketch below.
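Here is a minimal sketch of this computation, assuming a small made-up dataset of (x, class) pairs and the split at x = 5.5:

```python
def gini_from_labels(labels):
    # Gini impurity computed from a list of class labels
    n = len(labels)
    if n == 0:
        return 0.0
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Made-up toy data: (x, class)
data = [(1, 0), (2, 0), (4, 0), (7, 1), (9, 1), (10, 0)]

split = 5.5
left = [c for x, c in data if x < split]
right = [c for x, c in data if x >= split]

weighted = (len(left) * gini_from_labels(left)
            + len(right) * gini_from_labels(right)) / len(data)
print(weighted)  # weighted Gini impurity of the split at 5.5
```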

Select the split with the lowest impurity
Like in the regressor case:
- List all possible splits
- Compute the impurity for each
- The optimal split is the one with the minimum impurity

Summary Table of All Splits
To make everything automatic in Excel, we organize all calculations in one table, where:
- each row corresponds to one candidate split,
- for each row, we compute:
- the Gini of the left region,
- the Gini of the right region,
- and the overall weighted Gini of the split.
This table gives a clean, compact overview of every possible split, and the best split is simply the one with the lowest value in the final column.
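The same table can be mimicked in a few lines of Python; this sketch reuses the made-up toy data from above and prints one row per candidate split:

```python
def gini_from_labels(labels):
    # Gini impurity computed from a list of class labels
    n = len(labels)
    if n == 0:
        return 0.0
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_table(data):
    # data: list of (x, class) pairs
    # returns one row per candidate split: (split, gini_left, gini_right, weighted_gini)
    xs = sorted({x for x, _ in data})
    rows = []
    for a, b in zip(xs, xs[1:]):
        split = (a + b) / 2
        left = [c for x, c in data if x < split]
        right = [c for x, c in data if x >= split]
        weighted = (len(left) * gini_from_labels(left)
                    + len(right) * gini_from_labels(right)) / len(data)
        rows.append((split, gini_from_labels(left), gini_from_labels(right), weighted))
    return rows

data = [(1, 0), (2, 0), (4, 0), (7, 1), (9, 1), (10, 0)]  # made-up toy data
rows = split_table(data)
for row in rows:
    print(row)
best = min(rows, key=lambda r: r[3])  # smallest weighted Gini in the final column
print("best split:", best[0])
```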

Multi-class classification
Until now, we worked with two classes. But the Gini impurity extends naturally to three classes, and the logic of the split stays exactly the same.
Nothing changes in the structure of the algorithm:
- we list all possible splits,
- we compute impurity on both sides,
- we take the weighted average,
- we select the split with the lowest impurity.
Only the formula of the Gini impurity becomes slightly longer.
Gini impurity with three classes
If a region contains proportions p1, p2, p3 for the three classes, then the Gini impurity is:
Gini = 1 − (p1² + p2² + p3²)

The same idea as before:
a region is “pure” when one class dominates,
and the impurity becomes large when classes are mixed.
Left and Right regions
For each split:
- Region L contains some observations of classes 1, 2, and 3
- Region R contains the remaining observations
For each region:
- count how many points belong to each class
- compute the proportions p1, p2, p3
- compute the Gini impurity using the formula above
Everything is exactly the same as in the binary case, just with one more term.
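Here is a minimal sketch of these per-region steps, with made-up class counts for the three classes:

```python
def gini_from_counts(counts):
    # Gini impurity from the number of observations of each class in a region
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1 - sum((c / n) ** 2 for c in counts)

# Made-up example: region L holds 4 / 1 / 0 observations of the three classes,
# region R holds 1 / 3 / 5
gini_L = gini_from_counts([4, 1, 0])
gini_R = gini_from_counts([1, 3, 5])
n_L, n_R = 5, 9
weighted = (n_L * gini_L + n_R * gini_R) / (n_L + n_R)
print(gini_L, gini_R, weighted)
```

The weighted impurity is then compared across all candidate splits, exactly as in the binary case.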
Summary Table for 3-class splits
Just like before, we collect all the computations in a single table:
- each row is one possible split
- we count class 1, class 2, class 3 on the left
- we count class 1, class 2, class 3 on the right
- we compute Gini (Left), Gini (Right), and the weighted Gini
The split with the smallest weighted impurity is the one chosen by the decision tree.

We can easily generalize the algorithm to K classes, using the following formulas to calculate the Gini impurity or the Entropy:
Gini = 1 − (p1² + p2² + … + pK²)
Entropy = −(p1·log2(p1) + p2·log2(p2) + … + pK·log2(pK))

How Different Are Impurity Measures, Really?
Now, we always mention Gini or Entropy as the criterion, but do they really differ? Looking at the mathematical formulas, you might expect them to.
The answer is: not that much.
In almost all practical situations:
- Gini and Entropy select the same split
- The tree structure is almost identical
- The predictions are the same
Why?
Because their curves look extremely similar.
They both peak at 50 percent mixing and drop to zero at purity.
The only difference is the shape of the curve:
- Gini is a quadratic function. It penalizes misclassification more linearly.
- Entropy is a logarithmic function, so it penalizes uncertainty a bit more strongly near 0.5.
But the difference is tiny in practice, and you can see it for yourself in Excel!
Other impurity measures?
Another natural question: is it possible to invent or use other measures?
Yes, you could invent your own function, as long as:
- It’s 0 when the node is pure
- It’s maximal when classes are mixed
- It’s smooth and strictly increasing in “disorder”
For example: Impurity = 4*p0*p1
This is another valid impurity measure. And it is actually equal to the Gini impurity multiplied by a constant when there are only two classes.
So again, it gives the same splits. If you are not convinced, you can check the algebra below, or verify it yourself in Excel.
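For two classes, the algebra takes one line, since p1 = 1 − p0:

$$4\,p_0\,p_1 = 2\left(1 - p_0^2 - p_1^2\right) = 2 \times \text{Gini}$$

So this measure is just twice the Gini impurity, and minimizing one minimizes the other.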
Here are some other measures that can also be used.

Exercises in Excel
Tests with other parameters and features
Once you have built the first split, you can extend your file:
- Try Entropy instead of Gini
- Try adding categorical features
- Try building the next split
- Try changing max depth and observe under- and over-fitting
- Try making a confusion matrix for predictions
These simple tests already give you a good intuition for how real decision trees behave.
Implementing the rules for the Titanic Survival Dataset
A natural follow-up exercise is to recreate decision rules for the famous Titanic Survival Dataset (CC0 / Public Domain).
First, we can start with only two features: sex and age.
Implementing the rules in Excel is long and a bit tedious, but that is exactly the point: it makes you realize what decision rules really look like.
They are nothing more than a sequence of IF / ELSE statements, repeated over and over.
That is the true nature of a decision tree: simple rules, stacked on top of one another.
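As a purely illustrative sketch (the thresholds below are made up, not rules actually fitted on the Titanic data), such stacked IF / ELSE rules look like this:

```python
def predict_survival(sex, age):
    # Illustrative decision rules only: the feature names follow the Titanic dataset,
    # but the thresholds and outcomes are hypothetical, not a fitted tree.
    if sex == "female":
        return 1          # predicted: survived
    else:
        if age < 10:      # hypothetical age threshold
            return 1
        else:
            return 0      # predicted: did not survive

print(predict_survival("female", 30))  # -> 1
print(predict_survival("male", 40))    # -> 0
```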

Conclusion
Implementing a Decision Tree Classifier in Excel is surprisingly accessible.
With just a few formulas, you uncover the heart of the algorithm:
- list possible splits
- compute impurity
- select the cleanest split

This simple mechanism is the foundation of more advanced ensemble models like Gradient Boosted Trees, which we will discuss later in this series.
And stay tuned for Day 8 tomorrow!
