Gradient boosting is a cornerstone technique for modeling tabular data thanks to its speed and simplicity. It delivers strong results with almost no fuss. When you start looking around you'll see multiple options like LightGBM, XGBoost, and more. CatBoost is one such variant. In this post, we'll take an in-depth look at this model, explore its inner workings, and understand what makes it a great choice for real-world tasks.
Target Statistic
Target Encoding Example: the average value of the target variable for a category is used to replace each category
One of the important contributions of the CatBoost paper is a new approach to calculating the Target Statistic. What is a Target Statistic? If you have worked with categorical variables before, you know that the most rudimentary way to deal with them is one-hot encoding. From experience, you also know that this opens a can of problems: sparsity, the curse of dimensionality, memory issues, and so on, especially for categorical variables with high cardinality.
Greedy Target Statistic
To avoid one-hot encoding, we calculate the Target Statistic for the categorical variables instead. This means we calculate the mean of the target variable at each unique value of the categorical variable. So if a categorical variable takes the values A, B, and C, we replace each of them with the average of \(y\) over the samples that take that value.
That sounds good, right? It does, but this approach comes with its own problem: target leakage. To understand it, let's take an extreme example; extreme examples are often the easiest way to expose issues in an approach. Consider the dataset below:
| Categorical Column | Target Column |
|---|---|
| A | 0 |
| B | 1 |
| C | 0 |
| D | 1 |
| E | 0 |
Now let's write the equation for calculating the Target Statistic:
\[
\hat{x}^i_k = \frac{\sum_{j=1}^{n} 1_{\{x^i_j = x^i_k\}} \cdot y_j + a\,p}{\sum_{j=1}^{n} 1_{\{x^i_j = x^i_k\}} + a}
\]
Here \(x^i_j\) is the value of the \(i\)-th categorical feature for the \(j\)-th sample. So for the \(k\)-th sample, we iterate over all samples of \(x^i\), select those having the value \(x^i_k\), and take the average value of \(y\) over those samples. Instead of taking a direct average, we take a smoothed average, which is what the \(a\) and \(p\) terms are for: \(a\) is the smoothing parameter and \(p\) is the global mean of \(y\).
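To make the formula concrete, here is a minimal sketch of the smoothed (greedy) Target Statistic in Python with pandas; the function name and column names are my own illustration, not CatBoost internals:

```python
import pandas as pd

def greedy_target_statistic(df, cat_col, target_col, a=1.0):
    """Smoothed (greedy) TS: (sum_of_y_in_category + a*p) / (count_in_category + a)."""
    p = df[target_col].mean()                              # global mean of the target, the prior
    stats = df.groupby(cat_col)[target_col].agg(["sum", "count"])
    encoding = (stats["sum"] + a * p) / (stats["count"] + a)
    return df[cat_col].map(encoding)                       # replace each category with its statistic

# The toy dataset from the table above
df = pd.DataFrame({"cat": list("ABCDE"), "y": [0, 1, 0, 1, 0]})
df["ts"] = greedy_target_statistic(df, "cat", "y", a=1.0)
print(df)
```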
If we calculate the Target Statistic using the formula above, we get:
| Categorical Column | Target Column | Target Statistic |
|---|---|---|
| A | 0 | \(\frac{ap}{1+a}\) |
| B | 1 | \(\frac{1+ap}{1+a}\) |
| C | 0 | \(\frac{ap}{1+a}\) |
| D | 1 | \(\frac{1+ap}{1+a}\) |
| E | 0 | \(\frac{ap}{1+a}\) |
Now if I use this Target Statistic column as my training data, I get a perfect split at \(threshold = \frac{0.5+ap}{1+a}\): anything above this value is classified as 1 and anything below as 0. I have a perfect classification at this point, so I get 100% accuracy on my training data.
Now let's take a look at the test data. Since we're assuming that the feature values are all unique, every test category is unseen at training time, so its Target Statistic becomes:
\[TS = \frac{0+ap}{0+a} = p\]
If the threshold is larger than \(p\), all test predictions will be 0; conversely, if the threshold is lower than \(p\), all test predictions will be 1, resulting in poor performance on the test set.
Although we rarely see datasets where the values of a categorical variable are all unique, we do see cases of very high cardinality. This extreme example shows the pitfalls of using the Greedy Target Statistic as an encoding approach.
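To see this collapse numerically, here is a small continuation of the sketch above (again with made-up column names): applying the training-time encoding to a test set of entirely unseen categories sends every one of them to the prior \(p\):

```python
import pandas as pd

train = pd.DataFrame({"cat": list("ABCDE"), "y": [0, 1, 0, 1, 0]})
a, p = 1.0, train["y"].mean()
stats = train.groupby("cat")["y"].agg(["sum", "count"])
encoding = (stats["sum"] + a * p) / (stats["count"] + a)   # training-time greedy TS per category

test_cats = pd.Series(list("FGHIJ"))           # all-unique, unseen test categories
test_ts = test_cats.map(encoding).fillna(p)    # (0 + a*p) / (0 + a) = p for every unseen value
print(test_ts)                                 # the whole test column collapses to the prior
```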
Leave-One-Out Target Statistic
So the Greedy TS didn't work out too well for us. Let's try another method: the Leave-One-Out Target Statistic. At first glance, this looks promising, but as it turns out, it has its own problems. Let's see how with another extreme example. This time, let's assume that our categorical variable \(x^i\) has only one unique value, i.e., all values are the same. Consider the data below:
| Categorical Column | Target Column |
|---|---|
| A | 0 |
| A | 1 |
| A | 0 |
| A | 1 |
If we calculate the Leave-One-Out Target Statistic, we get:
| Categorical Column | Target Column | Target Statistic |
|---|---|---|
| A | 0 | \(\frac{n^+ - y_k + ap}{n+a}\) |
| A | 1 | \(\frac{n^+ - y_k + ap}{n+a}\) |
| A | 0 | \(\frac{n^+ - y_k + ap}{n+a}\) |
| A | 1 | \(\frac{n^+ - y_k + ap}{n+a}\) |
Here:
- \(n\) is the total number of samples in the data (in our case, 4)
- \(n^+\) is the number of positive samples in the data (in our case, 2)
- \(y_k\) is the value of the target column in that row
Substituting the above, we get:
| Categorical Column | Target Column | Target Statistic |
|---|---|---|
| A | 0 | \(\frac{2 + ap}{4+a}\) |
| A | 1 | \(\frac{1 + ap}{4+a}\) |
| A | 0 | \(\frac{2 + ap}{4+a}\) |
| A | 1 | \(\frac{1 + ap}{4+a}\) |
Now, if I use this Target Statistic column as my training data, I get a perfect split at \(threshold = \frac{1.5+ap}{4+a}\): anything above this value is classified as 0 and anything below as 1. I again have a perfect classification, and therefore 100% accuracy on my training data.
You see the problem, right? My categorical variable, which has no more than a single unique value, is producing different values of the Target Statistic. These will perform great on the training data but fail miserably on the test data.
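A small sketch makes the leak visible. Assuming a pandas DataFrame and the formula written above (the function name is my own), rows with different labels receive different encodings even though the feature is constant:

```python
import pandas as pd

def leave_one_out_ts(df, cat_col, target_col, a=1.0):
    """Leave-one-out TS as written above: (n_plus - y_k + a*p) / (n + a) per category."""
    p = df[target_col].mean()
    grp = df.groupby(cat_col)[target_col]
    n_plus = grp.transform("sum")      # positives in the row's category (binary target assumed)
    n = grp.transform("count")         # total rows in the row's category
    return (n_plus - df[target_col] + a * p) / (n + a)

# The single-valued categorical column from the example above
df = pd.DataFrame({"cat": ["A"] * 4, "y": [0, 1, 0, 1]})
print(leave_one_out_ts(df, "cat", "y", a=1.0))
# Rows with y=0 and y=1 get different encodings even though the feature is constant,
# so the encoding leaks the label.
```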
Ordered Target Statistic

CatBoost introduces a technique called the Ordered Target Statistic to address the issues discussed above. This is the core principle behind CatBoost's handling of categorical variables.
This method, inspired by online learning, uses only past data to make predictions. CatBoost generates a random permutation (a random ordering) of the training data, \(\sigma\). To compute the Target Statistic for the sample at row \(k\), CatBoost uses only the samples from row \(1\) to \(k-1\). For the test data, it uses the entire training data to compute the statistic.
Moreover, CatBoost generates a new permutation for each tree rather than reusing the same permutation every time. This reduces the variance that would otherwise plague the early samples of a single permutation.
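Here is a minimal sketch of the Ordered Target Statistic under these assumptions (a single permutation, the global mean as the prior; names are illustrative, not CatBoost internals):

```python
import numpy as np
import pandas as pd

def ordered_target_statistic(df, cat_col, target_col, a=1.0, seed=0):
    """Encode each row using only the rows that precede it in a random permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(df))          # sigma: a random ordering of the rows
    p = df[target_col].mean()                # prior (the global mean, as above)
    sums, counts = {}, {}                    # running per-category statistics over "past" rows
    ts = np.empty(len(df))
    for pos in perm:                         # walk the data in permutation order
        cat = df[cat_col].iloc[pos]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        ts[pos] = (s + a * p) / (c + a)      # only previously seen rows contribute
        sums[cat] = s + df[target_col].iloc[pos]
        counts[cat] = c + 1
    return pd.Series(ts, index=df.index)

df = pd.DataFrame({"cat": list("ABABA"), "y": [0, 1, 0, 1, 0]})
print(ordered_target_statistic(df, "cat", "y"))
```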
Ordered Boosting

Another important innovation introduced by the CatBoost paper is Ordered Boosting. It builds on the same principle as the Ordered Target Statistic: CatBoost randomly permutes the training data at the start of each tree and makes predictions sequentially.
In traditional boosting methods, when training tree \(t\), the model uses the predictions of the ensemble built up to tree \(t-1\) for all training samples, including the one it is currently predicting. This can lead to target leakage, because the model indirectly uses the label of the current sample during training.
To address this issue, CatBoost uses Ordered Boosting: for a given sample, it only uses predictions derived from the rows that precede it in the training data to calculate gradients and build trees. For each row \(i\) in the permutation, CatBoost calculates the output value of a leaf using only the samples before \(i\), and uses this value to obtain the prediction for row \(i\). Thus, the model predicts each row without ever seeing its label.
CatBoost trains each tree on a new random permutation to average out the variance that the early samples of a single permutation would otherwise introduce.
Let's say we have 5 data points: A, B, C, D, E. CatBoost creates a random permutation of these points. Suppose the permutation is σ = [C, A, E, B, D]:
| Step | Data Used to Train | Data Point Being Predicted | Notes |
|---|---|---|---|
| 1 | — | C | No previous data → use prior |
| 2 | C | A | Model trained on C only |
| 3 | C, A | E | Model trained on C, A |
| 4 | C, A, E | B | Model trained on C, A, E |
| 5 | C, A, E, B | D | Model trained on C, A, E, B |
This avoids using the actual label of the current row to obtain its prediction, thus preventing leakage.
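The following is a deliberately over-simplified sketch of this idea, where each "tree" is just a constant learner equal to the running mean of the residuals seen so far in the permutation; it is my toy illustration of the ordering, not CatBoost's actual tree fitting:

```python
import numpy as np

def ordered_boosting_toy(y, n_rounds=3, lr=0.5, seed=0):
    """Toy ordered boosting with a constant weak learner.

    For the row at position i of the permutation, the update is computed from the
    residuals of rows 1..i-1 of that permutation only, so the row's own label never
    influences the value used to predict it.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    pred = np.zeros_like(y)                      # initial prediction of 0 for every row
    for _ in range(n_rounds):
        perm = rng.permutation(len(y))           # fresh permutation for each "tree"
        run_sum, run_cnt = 0.0, 0
        update = np.zeros_like(y)
        for i in perm:
            update[i] = run_sum / run_cnt if run_cnt else 0.0   # fit on previous rows only
            residual = y[i] - pred[i]            # residual of the current row
            run_sum += residual                  # now the row joins the "past"
            run_cnt += 1
        pred += lr * update
    return pred

print(ordered_boosting_toy([0, 1, 0, 1, 1]))
```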
Constructing a Tree
Every time CatBoost builds a tree, it creates a random permutation of the training data and calculates the Ordered Target Statistic for all the categorical variables with more than two unique values. Binary categorical variables are simply mapped to zeros and ones.
CatBoost processes the data as if it were arriving sequentially. It begins with an initial prediction of zero for all instances, meaning the residuals are initially equal to the target values.
As training proceeds, CatBoost updates the leaf output for each sample using the residuals of the previous samples that fall into the same leaf. Because the current sample's label is never used for its own prediction, CatBoost effectively prevents data leakage.
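In practice, the library handles all of this internally; you only need to point it at the categorical columns. A minimal usage sketch with the catboost package (the dataset and hyperparameters below are purely illustrative):

```python
import pandas as pd
from catboost import CatBoostClassifier

# Tiny illustrative dataset: one categorical and one numeric feature
train = pd.DataFrame({
    "city": ["NY", "SF", "NY", "LA", "SF", "LA"],
    "age": [25, 32, 40, 28, 36, 50],
    "bought": [0, 1, 0, 1, 1, 0],
})

model = CatBoostClassifier(iterations=100, depth=4, learning_rate=0.1, verbose=0)
# cat_features tells CatBoost which columns to encode with its ordered target statistics
model.fit(train[["city", "age"]], train["bought"], cat_features=["city"])

print(model.predict(pd.DataFrame({"city": ["NY"], "age": [30]})))
```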
Split Candidates

At the core of a decision tree lies the task of choosing the optimal feature and threshold for splitting a node. This involves evaluating many feature-threshold combinations and selecting the one that gives the best reduction in loss. CatBoost does something similar, but it first discretizes continuous variables into bins to simplify the search: it then evaluates each feature-bin combination to determine the best split.
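As a rough sketch of the binning step, here is quantile-based discretization with NumPy; CatBoost controls the number of borders through its own settings (e.g., border_count), and the choices below are illustrative:

```python
import numpy as np

def quantile_bins(values, n_bins=8):
    """Discretize a continuous feature into bins using quantile borders.

    Candidate split thresholds are then just the bin borders, which keeps the
    search for the best feature/threshold combination cheap.
    """
    borders = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])  # interior borders
    borders = np.unique(borders)              # drop duplicate borders for skewed data
    bin_ids = np.digitize(values, borders)    # bin index per sample
    return borders, bin_ids

values = np.random.default_rng(0).normal(size=1000)
borders, bin_ids = quantile_bins(values, n_bins=8)
print(len(borders), bin_ids[:10])
```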
CatBoost uses Oblivious Trees, a key difference compared to other tree implementations: it uses the same split condition across all nodes at the same depth.
Oblivious Trees

Unlike standard decision trees, where different nodes can split on different conditions (feature-threshold pairs), Oblivious Trees use the same condition across all nodes at the same depth of a tree: at a given depth, every sample is evaluated against the same feature-threshold combination. This symmetry has several implications (a small prediction sketch follows the list):
- Speed and simplicity: since the same condition is applied across all nodes at the same depth, the trees produced are simpler and faster to train
- Regularization: since every tree is forced to apply the same condition at a given depth, there is a regularizing effect on the predictions
- Parallelization: the uniformity of the split condition makes it easier to parallelize tree construction and to use GPUs to speed up training
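To illustrate why this symmetry pays off, here is a sketch of prediction with an oblivious tree: each depth contributes one bit, so a sample's leaf index is just the bit pattern of its comparisons. The data layout here is my own illustration, not CatBoost's internal representation:

```python
import numpy as np

def oblivious_tree_predict(X, conditions, leaf_values):
    """Predict with an oblivious tree of depth d.

    conditions: list of (feature_index, threshold), one shared condition per depth.
    leaf_values: array of length 2**d; a sample's leaf index is the bit pattern of
    its comparisons, so prediction is d comparisons plus one lookup.
    """
    X = np.asarray(X, dtype=float)
    leaf_index = np.zeros(len(X), dtype=int)
    for feature, threshold in conditions:
        leaf_index = (leaf_index << 1) | (X[:, feature] > threshold)
    return np.asarray(leaf_values)[leaf_index]

# Depth-2 example: the same condition is shared by every node at a given depth
conditions = [(0, 0.5), (1, 2.0)]                # (feature, threshold) per depth
leaf_values = [0.1, 0.4, 0.6, 0.9]               # 2**2 leaves
X = [[0.2, 1.0], [0.7, 3.0], [0.9, 1.5]]
print(oblivious_tree_predict(X, conditions, leaf_values))
```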
Conclusion
CatBoost stands out by directly tackling a long-standing challenge: how to handle categorical variables effectively without causing target leakage. Through innovations like the Ordered Target Statistic, Ordered Boosting, and the use of Oblivious Trees, it strikes an efficient balance between robustness and accuracy.
If you found this deep dive helpful, you might enjoy another deep dive on the differences between the Stochastic Gradient Classifier and Logistic Regression.