It is often said that machine learning algorithms can't handle categorical variables. But decision trees (DTs) can. Classification trees don't require a numerical target either. Below is an illustration of a tree that classifies a subset of Cyrillic letters into vowels and consonants using no numeric features at all.
Many also promote mean target encoding (MTE) as a clever technique for converting categorical data into numerical form without inflating the feature space the way one-hot encoding does. Nevertheless, I haven't seen any mention on TDS of the inherent connection between MTE and decision tree logic. This text addresses exactly that gap through an illustrative experiment. In particular:
- I'll start with a quick recap of how decision trees handle categorical features.
- We'll see that this becomes a computational challenge for features with high cardinality.
- I'll show how mean target encoding naturally emerges as a solution to this problem (unlike, say, label encoding).
- You can reproduce my experiment using the code from GitHub.
A quick note: one-hot encoding is often portrayed unfavorably by fans of mean target encoding, but it's not as bad as they suggest. In fact, in our benchmark experiments, it often ranked first among the 32 categorical encoding methods we evaluated [1].
Decision trees and the curse of categorical features
Decision tree learning is a recursive algorithm. At each recursive step, it iterates over all features, trying to find the best split. So it's enough to look at how a single recursive iteration handles a categorical feature. If you're unsure how this operation generalizes to the construction of the complete tree, have a look here [2].
For a categorical feature, the algorithm evaluates all possible ways to divide the categories into two nonempty sets and selects the one that yields the best split quality. Quality is typically measured using Gini impurity for binary classification or mean squared error for regression; both are better when lower. See their pseudocode below.
# ---------- Gini impurity criterion ----------
FUNCTION GiniImpurityForSplit(split):
    left, right = split
    total = size(left) + size(right)
    RETURN (size(left)/total) * GiniOfGroup(left) +
           (size(right)/total) * GiniOfGroup(right)

FUNCTION GiniOfGroup(group):
    n = size(group)
    IF n == 0: RETURN 0
    ones = count(values equal 1 in group)
    zeros = n - ones
    p1 = ones / n
    p0 = zeros / n
    RETURN 1 - (p0² + p1²)

# ---------- Mean-squared-error criterion ----------
FUNCTION MSECriterionForSplit(split):
    left, right = split
    total = size(left) + size(right)
    IF total == 0: RETURN 0
    RETURN (size(left)/total) * MSEOfGroup(left) +
           (size(right)/total) * MSEOfGroup(right)

FUNCTION MSEOfGroup(group):
    n = size(group)
    IF n == 0: RETURN 0
    μ = mean(Value column of group)
    RETURN sum( (v − μ)² for every v in group ) / n
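For concreteness, here is a minimal NumPy translation of the two criteria. It is a sketch rather than the exact code from the GitHub repository; the function names mirror the pseudocode above, and the labels are assumed to arrive as array-like sequences.

# ---------- Python sketch of the two criteria ----------
import numpy as np

def gini_of_group(y):
    """Gini impurity of one group of binary (0/1) labels."""
    y = np.asarray(y, dtype=float)
    if len(y) == 0:
        return 0.0
    p1 = y.mean()            # fraction of ones
    p0 = 1.0 - p1
    return 1.0 - (p0 ** 2 + p1 ** 2)

def gini_impurity_for_split(y_left, y_right):
    """Size-weighted Gini impurity of a two-way split."""
    total = len(y_left) + len(y_right)
    return (len(y_left) / total) * gini_of_group(y_left) \
         + (len(y_right) / total) * gini_of_group(y_right)

def mse_of_group(y):
    """Mean squared deviation from the group mean."""
    y = np.asarray(y, dtype=float)
    if len(y) == 0:
        return 0.0
    return float(((y - y.mean()) ** 2).mean())

def mse_criterion_for_split(y_left, y_right):
    """Size-weighted MSE of a two-way split."""
    total = len(y_left) + len(y_right)
    if total == 0:
        return 0.0
    return (len(y_left) / total) * mse_of_group(y_left) \
         + (len(y_right) / total) * mse_of_group(y_right)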
Let's say the feature has cardinality k. Each category can belong to either of the two sets, giving 2^k total combinations. Excluding the two trivial cases where one of the sets is empty, we're left with 2^k − 2 feasible splits. Next, note that we don't care about the order of the sets: splits like {{A,B},{C}} and {{C},{A,B}} are equivalent. This cuts the number of unique combinations in half, leaving a final count of (2^k − 2)/2 iterations. For our toy example above with five Cyrillic letters, that number is 15. But when k = 20, it balloons to 524,287 combinations, enough to significantly slow down DT training.
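If you'd like to sanity-check that count, here is a small Python helper (an illustration, not from the repository) that enumerates the unordered two-set partitions; the five category labels are placeholders.

# ---------- Enumerating all (2^k − 2)/2 splits ----------
from itertools import combinations

def all_binary_splits(categories):
    """All unordered partitions of `categories` into two nonempty sets."""
    cats = list(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    # Fixing `first` in the left set counts each unordered split exactly once.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(cats) - left
            if right:                       # exclude the trivial empty split
                splits.append((left, right))
    return splits

print(len(all_binary_splits(["A", "B", "C", "D", "E"])))   # 15 = (2^5 - 2)/2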
Mean target encoding solves the efficiency problem
What if one could reduce the search space from (2^k − 2)/2 candidates to something more manageable, without losing the optimal split? It turns out this is indeed possible. One can show theoretically that mean target encoding enables this reduction [3]. Specifically, if the categories are arranged in order of their MTE values, and only splits that respect this order are considered, the optimal split according to Gini impurity for classification or mean squared error for regression will be among them. There are exactly k − 1 such splits, a dramatic reduction compared with (2^k − 2)/2. The pseudocode for MTE is below.
# ---------- Mean-target encoding ----------
FUNCTION MeanTargetEncode(table):
    category_means = average(Value) for every Category in table   # Category → mean(Value)
    encoded_column = lookup(table.Category, category_means)       # replace label with mean
    RETURN encoded_column
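In pandas, the whole encoding is two lines. This sketch assumes the same column names as the pseudocode ('Category' and 'Value'); note that it ignores refinements such as smoothing or out-of-fold encoding, which a production encoder would add.

# ---------- Python sketch of mean-target encoding ----------
import pandas as pd

def mean_target_encode(df, cat_col="Category", target_col="Value"):
    """Replace each category label with the mean target of that category."""
    category_means = df.groupby(cat_col)[target_col].mean()
    return df[cat_col].map(category_means)

# Example: A -> 0.5, B -> 1.0, C -> 0.0
df = pd.DataFrame({"Category": ["A", "A", "B", "B", "C"],
                   "Value":    [1,    0,   1,   1,   0]})
print(mean_target_encode(df))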
Experiment
I'm not going to repeat the theoretical derivations that support the above claims. Instead, I designed an experiment to validate them empirically and to get a sense of the efficiency gains MTE brings over native partitioning, which exhaustively iterates over all possible splits. In what follows, I explain the data generation process and the experiment setup.
Data
To generate synthetic data for the experiment, I used a simple function that constructs a two-column dataset. The first column contains n distinct categorical levels, each repeated m times, resulting in a total of n × m rows. The second column represents the target variable and can be either binary or continuous, depending on the input parameter. Below is the pseudocode for this function.
# ---------- Synthetic-dataset generator ----------
FUNCTION GenerateData(num_categories, rows_per_cat, target_type='binary'):
    total_rows = num_categories * rows_per_cat
    categories = ['Category_' + i for i in 1..num_categories]
    category_col = repeat_each(categories, rows_per_cat)
    IF target_type == 'continuous':
        target_col = random_floats(0, 1, total_rows)
    ELSE:
        target_col = random_ints(0, 1, total_rows)
    RETURN DataFrame{ 'Category': category_col,
                      'Value'   : target_col }
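A direct NumPy/pandas rendering of this generator might look as follows; the optional seed parameter is my addition for reproducibility and is not part of the pseudocode.

# ---------- Python sketch of the dataset generator ----------
import numpy as np
import pandas as pd

def generate_data(num_categories, rows_per_cat, target_type="binary", seed=None):
    """Build the n-categories-by-m-repeats dataset described above."""
    rng = np.random.default_rng(seed)
    total_rows = num_categories * rows_per_cat
    categories = [f"Category_{i}" for i in range(1, num_categories + 1)]
    category_col = np.repeat(categories, rows_per_cat)
    if target_type == "continuous":
        target_col = rng.random(total_rows)             # floats in [0, 1)
    else:
        target_col = rng.integers(0, 2, total_rows)     # 0 or 1
    return pd.DataFrame({"Category": category_col, "Value": target_col})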
Experiment setup
The experiment function takes a list of cardinalities and a splitting criterion, either Gini impurity or mean squared error, depending on the target type. For each categorical feature cardinality in the list, it generates 100 datasets and compares two strategies: exhaustive evaluation of all possible category splits, and the restricted, MTE-informed ordering. It measures the runtime of each method and checks whether both approaches produce the same optimal split score. The function returns the number of matching cases together with average runtimes. The pseudocode is given below.
# ---------- Split comparison experiment ----------
FUNCTION RunExperiment(list_num_categories, splitting_criterion):
    results = []
    FOR k IN list_num_categories:
        times_all = []
        times_ord = []
        REPEAT 100 times:
            df = GenerateData(k, 100)
            t0 = now()
            s_all = MinScore(df, AllSplits, splitting_criterion)
            t1 = now()
            t2 = now()
            s_ord = MinScore(df, MTEOrderedSplits, splitting_criterion)
            t3 = now()
            times_all.append(t1 - t0)
            times_ord.append(t3 - t2)
            IF round(s_all, 10) != round(s_ord, 10):
                PRINT "Discrepancy at k=", k
        results.append({
            'k': k,
            'avg_time_all': mean(times_all),
            'avg_time_ord': mean(times_ord)
        })
    RETURN DataFrame(results)
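The heart of the comparison is the pair of split generators and the MinScore helper. Here is a minimal sketch of how they might look, reusing all_binary_splits, the criteria functions, and the 'Category'/'Value' column names from earlier; again, this illustrates the idea rather than reproducing the repository code.

# ---------- Python sketch of the split generators and MinScore ----------
def all_splits(df):
    """Exhaustive generator: every unordered two-set partition."""
    return all_binary_splits(df["Category"].unique())

def mte_ordered_splits(df):
    """The k - 1 candidate splits along the MTE ordering of categories."""
    order = (df.groupby("Category")["Value"].mean()
               .sort_values().index.tolist())
    # Cut the sorted category list at each of the k - 1 positions.
    return [(set(order[:i]), set(order[i:])) for i in range(1, len(order))]

def min_score(df, split_generator, criterion):
    """Best (lowest) criterion value over the generated splits."""
    best = float("inf")
    for left_cats, right_cats in split_generator(df):
        mask = df["Category"].isin(left_cats)
        y_left = df.loc[mask, "Value"].to_numpy()
        y_right = df.loc[~mask, "Value"].to_numpy()
        best = min(best, criterion(y_left, y_right))
    return best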
Results
You can take my word for it, or repeat the experiment (GitHub), but the optimal split scores from both approaches always matched, just as the theory predicts. The figure below shows the time required to evaluate splits as a function of the number of categories; the vertical axis is on a logarithmic scale. The line representing exhaustive evaluation appears linear in these coordinates, meaning the runtime grows exponentially with the number of categories, confirming the theoretical complexity discussed earlier. Already at 12 categories (on a dataset with 1,200 rows), checking all possible splits takes about one second: three orders of magnitude slower than the MTE-based approach, which yields the same optimal split.

Conclusion
Decision trees can natively handle categorical data, but this ability comes at a computational cost when category counts grow. Mean target encoding offers a principled shortcut, drastically reducing the number of candidate splits without compromising the result. Our experiment confirms the theory: MTE-based ordering finds the same optimal split, exponentially faster.
At the time of writing, scikit-learn doesn't support categorical features directly. So what do you think: if you preprocess the data using MTE, will the resulting decision tree match one built by a learner that handles categorical features natively?
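If you want to explore that question yourself, a minimal starting point, reusing the generate_data and mean_target_encode sketches above, might look like this:

# ---------- Fitting a scikit-learn stump on MTE-encoded data ----------
from sklearn.tree import DecisionTreeClassifier

df = generate_data(num_categories=8, rows_per_cat=100, seed=0)
df["Category_encoded"] = mean_target_encode(df)

stump = DecisionTreeClassifier(criterion="gini", max_depth=1)
stump.fit(df[["Category_encoded"]], df["Value"])

# The root threshold falls between two adjacent MTE values, i.e. the
# stump picks one of the k - 1 order-respecting splits discussed above.
print(stump.tree_.threshold[0])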
References
[1] A Benchmark and Taxonomy of Categorical Encoders. Towards Data Science. https://towardsdatascience.com/a-benchmark-and-taxonomy-of-categorical-encoders-9b7a0dc47a8c/
[2] Mining Rules from Data. Towards Data Science. https://towardsdatascience.com/mining-rules-from-data
[3] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Vol. 2. New York: Springer, 2009.