It is often said that machine learning algorithms can't handle categorical variables. But decision trees (DTs) can. Classification trees don't require a numerical target either. Below is an illustration of a tree that classifies a subset of Cyrillic letters into vowels and consonants using no numeric features at all.
Many also promote mean target encoding (MTE) as a clever technique for converting categorical data into numerical form without inflating the feature space the way one-hot encoding does. Nevertheless, I haven't seen any mention on TDS of the inherent connection between MTE and decision tree logic. This text addresses exactly that gap through an illustrative experiment. In particular:
- I'll start with a quick recap of how decision trees handle categorical features.
- We'll see that this becomes a computational challenge for features with high cardinality.
- I'll show how mean target encoding naturally emerges as a solution to this problem (unlike, say, label encoding).
- You can reproduce my experiment using the code from GitHub.
A quick note: one-hot encoding is often portrayed unfavorably by fans of mean target encoding, but it's not as bad as they suggest. In fact, in our benchmark experiments, it often ranked first among the 32 categorical encoding methods we evaluated [1].
Decision trees and the curse of categorical features
Decision tree learning is a recursive algorithm. At each recursive step, it iterates over all features, trying to find the best split. So it's enough to look at how a single recursive iteration handles a categorical feature. If you're unsure how this operation generalizes to the construction of the complete tree, have a look here [2].
For a categorical feature, the algorithm evaluates all possible ways to divide the categories into two nonempty sets and selects the one that yields the best split quality. Quality is typically measured using Gini impurity for binary classification or mean squared error for regression; both are better when lower. See their pseudocode below.
# ---------- Gini impurity criterion ----------
FUNCTION GiniImpurityForSplit(split):
    left, right = split
    total = size(left) + size(right)
    RETURN (size(left)/total) * GiniOfGroup(left) +
           (size(right)/total) * GiniOfGroup(right)

FUNCTION GiniOfGroup(group):
    n = size(group)
    IF n == 0: RETURN 0
    ones = count(values equal 1 in group)
    zeros = n - ones
    p1 = ones / n
    p0 = zeros / n
    RETURN 1 - (p0² + p1²)

# ---------- Mean-squared-error criterion ----------
FUNCTION MSECriterionForSplit(split):
    left, right = split
    total = size(left) + size(right)
    IF total == 0: RETURN 0
    RETURN (size(left)/total) * MSEOfGroup(left) +
           (size(right)/total) * MSEOfGroup(right)

FUNCTION MSEOfGroup(group):
    n = size(group)
    IF n == 0: RETURN 0
    μ = mean(Value column of group)
    RETURN sum( (v − μ)² for every v in group ) / n
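For concreteness, here is a minimal NumPy translation of the two criteria. It is a sketch rather than the exact code from the GitHub repository; the function names mirror the pseudocode above, and the labels are assumed to arrive as array-like sequences.

# ---------- Python sketch of the two criteria ----------
import numpy as np

def gini_of_group(y):
    """Gini impurity of one group of binary (0/1) labels."""
    y = np.asarray(y, dtype=float)
    if len(y) == 0:
        return 0.0
    p1 = y.mean()            # fraction of ones
    p0 = 1.0 - p1
    return 1.0 - (p0 ** 2 + p1 ** 2)

def gini_impurity_for_split(y_left, y_right):
    """Size-weighted Gini impurity of a two-way split."""
    total = len(y_left) + len(y_right)
    return (len(y_left) / total) * gini_of_group(y_left) \
         + (len(y_right) / total) * gini_of_group(y_right)

def mse_of_group(y):
    """Mean squared deviation from the group mean."""
    y = np.asarray(y, dtype=float)
    if len(y) == 0:
        return 0.0
    return float(((y - y.mean()) ** 2).mean())

def mse_criterion_for_split(y_left, y_right):
    """Size-weighted MSE of a two-way split."""
    total = len(y_left) + len(y_right)
    if total == 0:
        return 0.0
    return (len(y_left) / total) * mse_of_group(y_left) \
         + (len(y_right) / total) * mse_of_group(y_right)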
Let's say the feature has cardinality k. Each category can belong to either of the two sets, giving 2^k total combinations. Excluding the two trivial cases where one of the sets is empty, we're left with 2^k − 2 feasible splits. Next, note that we don't care about the order of the sets: splits like {{A,B},{C}} and {{C},{A,B}} are equivalent. This cuts the number of unique combinations in half, leaving a final count of (2^k − 2)/2 iterations. For our toy example above with five Cyrillic letters, that number is 15. But when k = 20, it balloons to 524,287 combinations, enough to significantly slow down DT training.
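If you'd like to sanity-check that count, here is a small Python helper (an illustration, not from the repository) that enumerates the unordered two-set partitions; the five category labels are placeholders.

# ---------- Enumerating all (2^k − 2)/2 splits ----------
from itertools import combinations

def all_binary_splits(categories):
    """All unordered partitions of `categories` into two nonempty sets."""
    cats = list(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    # Fixing `first` in the left set counts each unordered split exactly once.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(cats) - left
            if right:                       # exclude the trivial empty split
                splits.append((left, right))
    return splits

print(len(all_binary_splits(["A", "B", "C", "D", "E"])))   # 15 = (2^5 - 2)/2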
Mean target encoding solves the efficiency problem
What if one could reduce the search space from (2^k − 2)/2 candidates to something more manageable, without losing the optimal split? It turns out this is indeed possible. One can show theoretically that mean target encoding enables this reduction [3]. Specifically, if the categories are arranged in order of their MTE values, and only splits that respect this order are considered, the optimal split according to Gini impurity for classification or mean squared error for regression will be among them. There are exactly k − 1 such splits, a dramatic reduction compared with (2^k − 2)/2. The pseudocode for MTE is below.
# ---------- Mean-target encoding ----------
FUNCTION MeanTargetEncode(table):
    category_means = average(Value) for every Category in table   # Category → mean(Value)
    encoded_column = lookup(table.Category, category_means)       # replace label with mean
    RETURN encoded_column
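In pandas, the whole encoding is two lines. This sketch assumes the same column names as the pseudocode ('Category' and 'Value'); note that it ignores refinements such as smoothing or out-of-fold encoding, which a production encoder would add.

# ---------- Python sketch of mean-target encoding ----------
import pandas as pd

def mean_target_encode(df, cat_col="Category", target_col="Value"):
    """Replace each category label with the mean target of that category."""
    category_means = df.groupby(cat_col)[target_col].mean()
    return df[cat_col].map(category_means)

# Example: A -> 0.5, B -> 1.0, C -> 0.0
df = pd.DataFrame({"Category": ["A", "A", "B", "B", "C"],
                   "Value":    [1,    0,   1,   1,   0]})
print(mean_target_encode(df))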
Experiment
I'm not going to repeat the theoretical derivations that support the above claims. Instead, I designed an experiment to validate them empirically and to get a sense of the efficiency gains MTE brings over native partitioning, which exhaustively iterates over all possible splits. In what follows, I explain the data generation process and the experiment setup.
Data
To generate synthetic data for the experiment, I used a simple function that constructs a two-column dataset. The first column contains n distinct categorical levels, each repeated m times, resulting in a total of n × m rows. The second column represents the target variable and can be either binary or continuous, depending on the input parameter. Below is the pseudocode for this function.
# ---------- Synthetic-dataset generator ----------
FUNCTION GenerateData(num_categories, rows_per_cat, target_type='binary'):
    total_rows = num_categories * rows_per_cat
    categories = ['Category_' + i for i in 1..num_categories]
    category_col = repeat_each(categories, rows_per_cat)
    IF target_type == 'continuous':
        target_col = random_floats(0, 1, total_rows)
    ELSE:
        target_col = random_ints(0, 1, total_rows)
    RETURN DataFrame{ 'Category': category_col,
                      'Value'   : target_col }
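A direct NumPy/pandas rendering of this generator might look as follows; the optional seed parameter is my addition for reproducibility and is not part of the pseudocode.

# ---------- Python sketch of the dataset generator ----------
import numpy as np
import pandas as pd

def generate_data(num_categories, rows_per_cat, target_type="binary", seed=None):
    """Build the n-categories-by-m-repeats dataset described above."""
    rng = np.random.default_rng(seed)
    total_rows = num_categories * rows_per_cat
    categories = [f"Category_{i}" for i in range(1, num_categories + 1)]
    category_col = np.repeat(categories, rows_per_cat)
    if target_type == "continuous":
        target_col = rng.random(total_rows)             # floats in [0, 1)
    else:
        target_col = rng.integers(0, 2, total_rows)     # 0 or 1
    return pd.DataFrame({"Category": category_col, "Value": target_col})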
Experiment setup
The experiment function takes a list of cardinalities and a splitting criterion, either Gini impurity or mean squared error, depending on the target type. For each categorical feature cardinality in the list, it generates 100 datasets and compares two strategies: exhaustive evaluation of all possible category splits, and the restricted, MTE-informed ordering. It measures the runtime of each method and checks whether both approaches produce the same optimal split score. The function returns the number of matching cases together with average runtimes. The pseudocode is given below.
# ---------- Split comparison experiment ----------
FUNCTION RunExperiment(list_num_categories, splitting_criterion):
    results = []
    FOR k IN list_num_categories:
        times_all = []
        times_ord = []
        REPEAT 100 times:
            df = GenerateData(k, 100)
            t0 = now()
            s_all = MinScore(df, AllSplits, splitting_criterion)
            t1 = now()
            t2 = now()
            s_ord = MinScore(df, MTEOrderedSplits, splitting_criterion)
            t3 = now()
            times_all.append(t1 - t0)
            times_ord.append(t3 - t2)
            IF round(s_all, 10) != round(s_ord, 10):
                PRINT "Discrepancy at k=", k
        results.append({
            'k': k,
            'avg_time_all': mean(times_all),
            'avg_time_ord': mean(times_ord)
        })
    RETURN DataFrame(results)
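The heart of the comparison is the pair of split generators and the MinScore helper. Here is a minimal sketch of how they might look, reusing all_binary_splits, the criteria functions, and the 'Category'/'Value' column names from earlier; again, this illustrates the idea rather than reproducing the repository code.

# ---------- Python sketch of the split generators and MinScore ----------
def all_splits(df):
    """Exhaustive generator: every unordered two-set partition."""
    return all_binary_splits(df["Category"].unique())

def mte_ordered_splits(df):
    """The k - 1 candidate splits along the MTE ordering of categories."""
    order = (df.groupby("Category")["Value"].mean()
               .sort_values().index.tolist())
    # Cut the sorted category list at each of the k - 1 positions.
    return [(set(order[:i]), set(order[i:])) for i in range(1, len(order))]

def min_score(df, split_generator, criterion):
    """Best (lowest) criterion value over the generated splits."""
    best = float("inf")
    for left_cats, right_cats in split_generator(df):
        mask = df["Category"].isin(left_cats)
        y_left = df.loc[mask, "Value"].to_numpy()
        y_right = df.loc[~mask, "Value"].to_numpy()
        best = min(best, criterion(y_left, y_right))
    return best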
Results
You can take my word for it, or repeat the experiment (GitHub), but the optimal split scores from both approaches always matched, just as the theory predicts. The figure below shows the time required to evaluate splits as a function of the number of categories; the vertical axis is on a logarithmic scale. The line representing exhaustive evaluation appears linear in these coordinates, meaning the runtime grows exponentially with the number of categories, confirming the theoretical complexity discussed earlier. Already at 12 categories (on a dataset with 1,200 rows), checking all possible splits takes about one second: three orders of magnitude slower than the MTE-based approach, which yields the same optimal split.

Conclusion
Decision trees can natively handle categorical data, but this ability comes at a computational cost when category counts grow. Mean target encoding offers a principled shortcut, drastically reducing the number of candidate splits without compromising the result. Our experiment confirms the theory: MTE-based ordering finds the same optimal split, exponentially faster.
At the time of writing, scikit-learn doesn't support categorical features directly. So what do you think: if you preprocess the data using MTE, will the resulting decision tree match one built by a learner that handles categorical features natively?
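If you want to explore that question yourself, a minimal starting point, reusing the generate_data and mean_target_encode sketches above, might look like this:

# ---------- Fitting a scikit-learn stump on MTE-encoded data ----------
from sklearn.tree import DecisionTreeClassifier

df = generate_data(num_categories=8, rows_per_cat=100, seed=0)
df["Category_encoded"] = mean_target_encode(df)

stump = DecisionTreeClassifier(criterion="gini", max_depth=1)
stump.fit(df[["Category_encoded"]], df["Value"])

# The root threshold falls between two adjacent MTE values, i.e. the
# stump picks one of the k - 1 order-respecting splits discussed above.
print(stump.tree_.threshold[0])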
References
[1] A Benchmark and Taxonomy of Categorical Encoders. Towards Data Science. https://towardsdatascience.com/a-benchmark-and-taxonomy-of-categorical-encoders-9b7a0dc47a8c/
[2] Mining Rules from Data. Towards Data Science. https://towardsdatascience.com/mining-rules-from-data
[3] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Vol. 2. New York: Springer, 2009.