
Gradient-Boosted Trees: To Early Stop or Not to Early Stop?

Photo by Julian Berengar Sölter

Gradient-boosted decision trees (GBDTs) currently outperform deep learning on tabular-data problems, with popular implementations such as LightGBM, XGBoost, and CatBoost dominating Kaggle competitions [1]. Early stopping, a popular technique in deep learning, can also be used when training and tuning GBDTs. Nevertheless, it is common to see practitioners explicitly tune the number of trees in GBDT ensembles instead of using early stopping. In this article, we show that early stopping can halve training time while maintaining the same performance as explicitly tuning the number of trees.

By reducing training time, early stopping can lower computational costs and decrease practitioner downtime while waiting for models to run. Such savings are especially valuable in industries with large-scale GBDT applications, such as content recommendation, financial fraud detection, or credit scoring. But how does early stopping reduce training time without harming performance? Let's dive in.

Gradient-Boosted Decision Trees

Gradient-boosted decision trees (GBDTs) currently achieve state-of-the-art performance in classification and regression problems based on (heterogeneous) tabular data, i.e., two-dimensional datasets with diverse column types. Deep learning techniques, although dominant in natural language processing and computer vision, have yet to steal the crown in the tabular data domain [2, 3, 4, 5].

Gradient-boosted decision trees (GBDTs).

GBDTs work by sequentially adding decision trees to an ensemble. Unlike in random forests, trees in GBDTs are not independent. Instead, each one is trained to correct the mistakes of the previous trees. As such, given enough trees, a GBDT model can achieve perfect performance on the training set. However, this behavior, known as overfitting, harms the model's ability to generalize to unseen data.
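To make the sequential mechanism concrete, here is a minimal toy sketch of gradient boosting for regression with squared error, where each new tree fits the residuals of the current ensemble (an illustration using scikit-learn, not how LightGBM is implemented internally):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # Start from a constant prediction (the mean of the targets).
    base = y.mean()
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        # Each tree is trained on the residuals (the negative gradient
        # of the squared error), i.e. the mistakes of the ensemble so far.
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def gbdt_predict(X, base, trees, learning_rate=0.1):
    # The ensemble prediction is the base value plus the shrunken
    # contribution of every tree.
    return base + learning_rate * sum(tree.predict(X) for tree in trees)

With enough trees, this loop can drive the training residuals arbitrarily close to zero, which is exactly the overfitting behavior described above.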

Hyperparameter Tuning and Early Stopping

To optimize the degree of fitting to the training data, practitioners tune several key hyperparameters: the number of trees, the learning rate, and the maximum depth of each tree, among others. To find the optimal set of values, several configurations are tested on a separate validation dataset; the model performing best on the holdout data is chosen as the final model.

Another tool that helps fight overfitting is early stopping. Common in deep learning, early stopping is a technique where the learning process is halted if performance on holdout data is not improving. In GBDTs, this means not adding more trees beyond that point.

Early stopping halts training at the point where the loss on the validation set stops decreasing.
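In pseudocode terms, the mechanism boils down to tracking the best validation loss seen so far and giving up after a fixed number of non-improving rounds. A conceptual sketch follows (boost_one_round and validation_loss are placeholder callables; LightGBM implements this logic internally in its early_stopping callback):

def train_with_early_stopping(boost_one_round, validation_loss,
                              max_rounds=4000, patience=100):
    best_loss, best_round = float('inf'), 0
    for current_round in range(1, max_rounds + 1):
        boost_one_round()            # add one more tree to the ensemble
        loss = validation_loss()     # evaluate on the holdout set
        if loss < best_loss:
            best_loss, best_round = loss, current_round
        elif current_round - best_round >= patience:
            break                    # `patience` rounds without improvement
    # The model is then rolled back to its best iteration.
    return best_round, best_loss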

Although ubiquitous in deep learning, early stopping is not as popular among GBDT users. Instead, it is common to see practitioners tune the number of trees through the aforementioned search process. But what if using early stopping amounts to the same as explicitly tuning the number of trees? After all, both mechanisms aim to find the optimal size of the GBDT ensemble, given the learning rate and other hyperparameters. If that were the case, the same performance could be achieved at a greatly reduced search time by using early stopping, since it halts the training of time-consuming, unpromising iterations. Let's test this hypothesis.

Experimental Setup

To this end, with the authors' permission, I use the public bank-account-fraud dataset recently published at NeurIPS '22 [6]. It is a synthetic replica of a real fraud-detection dataset, generated by a privacy-preserving GAN. As the GBDT implementation, I opt for LightGBM for its speed and state-of-the-art performance [1, 7]. All the code used in this experiment can be found in this Kaggle notebook.
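For orientation, the setup looks roughly like the following sketch. It rests on assumptions: the file name Base.csv and the fraud_bool target column are taken from the public dataset release, and a simple random split stands in for the notebook's exact splitting logic.

import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('Base.csv')  # assumed file name for the base dataset variant
X, y = df.drop(columns='fraud_bool'), df['fraud_bool']

# LightGBM handles categorical features natively if they use the
# pandas 'category' dtype.
for col in X.select_dtypes('object'):
    X[col] = X[col].astype('category')

# Train / validation / test split: the validation set drives tuning and
# early stopping; the test set stays untouched for the final evaluation.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

dtrain = lgb.Dataset(X_train, label=y_train)
dval = lgb.Dataset(X_val, label=y_val, reference=dtrain)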

As mentioned above, the most common approach to finding the optimal set of hyperparameters is to experiment with several configurations. Ultimately, the model that performs best on the validation set is chosen as the final model. I follow this approach, randomly sampling hyperparameters from sensible distributions at each iteration.

To test my hypothesis, I run two parallel random search processes:

  1. Without early stopping, the number-of-trees parameter is sampled uniformly between 10 and 4000.
  2. With early stopping, the maximum number of trees is set to 4000, but the final count is determined by the early stopping criterion. Early stopping monitors the cross-entropy loss on the validation set. The training process is only halted after 100 non-improving iterations (the patience parameter), at which point the model is reset to its best version.

The following function is used to run each random search trial within an Optuna study (truncated for clarity; full version in the aforementioned notebook):

def _objective(t, dtrain, dval, early_stopping):
    params = {
        'objective': 'binary',  # fraud detection is a binary classification task
        'boosting_type': t.suggest_categorical('boosting_type', ['gbdt', 'goss']),
        'learning_rate': t.suggest_float('learning_rate', 0.01, 0.5, log=True),
        'min_split_gain': t.suggest_float('min_split_gain', 0.00001, 2, log=True),
        'num_leaves': t.suggest_int('num_leaves', 2, 1024, log=True),
        'max_depth': t.suggest_int('max_depth', 1, 15),
        'min_child_samples': t.suggest_int('min_child_samples', 2, 100, log=True),
        'bagging_freq': t.suggest_categorical('bagging_freq', [0, 1]),
        'pos_bagging_fraction': t.suggest_float('pos_bagging_fraction', 0, 1),
        'neg_bagging_fraction': t.suggest_float('neg_bagging_fraction', 0, 1),
        'reg_alpha': t.suggest_float('reg_alpha', 0.00001, 0.1, log=True),
        'reg_lambda': t.suggest_float('reg_lambda', 0.00001, 0.1, log=True),
    }
    model = lgb.train(
        params, dtrain,
        num_boost_round=(
            4000 if early_stopping
            else t.suggest_int('num_boost_rounds', 10, 4000)),
        valid_sets=[dval] if early_stopping else None,
        callbacks=(
            [lgb.early_stopping(stopping_rounds=100)] if early_stopping
            else None))
    # Truncated: the full notebook evaluates the trained model on the
    # validation set and returns its cross-entropy loss to Optuna.
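For reference, the two parallel searches can then be launched along these lines (a sketch; the trial budget shown is illustrative, and it assumes the full, non-truncated _objective returns the validation cross-entropy for Optuna to minimize):

import functools
import optuna

for early_stopping in (False, True):
    study = optuna.create_study(direction='minimize')
    study.optimize(
        functools.partial(_objective, dtrain=dtrain, dval=dval,
                          early_stopping=early_stopping),
        n_trials=100)  # illustrative trial budget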

Performance

Since early stopping monitors performance on the validation set, all models are evaluated on an unseen test set, thus avoiding biased results.

Results on the test set. Bottom 20% of trials removed for visualization clarity.

To early stop or not to early stop? Both approaches achieve similar results. This outcome is consistent both when measuring cross-entropy loss (the metric monitored by early stopping) and recall at 5% FPR (a binary classification metric especially relevant in this dataset's domain [6]). On the first criterion, the no-early-stopping strategy achieves marginally better results, whereas on the second, it is the early-stopping strategy that has the edge.
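As an aside, recall at a fixed false-positive rate is easy to compute from the ROC curve. A minimal sketch with scikit-learn (y_test are the true labels, y_score the model's predicted probabilities on the test set):

import numpy as np
from sklearn.metrics import roc_curve

def recall_at_fpr(y_true, y_score, target_fpr=0.05):
    # Recall equals the true-positive rate; take the TPR at the last
    # ROC-curve point whose FPR does not exceed the target.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return tpr[np.searchsorted(fpr, target_fpr, side='right') - 1]

# e.g. with a trained LightGBM booster `model`:
recall = recall_at_fpr(y_test, model.predict(X_test))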

In sum, the results of this experiment fail to reject my hypothesis that there is no significant difference between employing early stopping and explicitly tuning the number of trees in GBDTs. Naturally, a more robust evaluation would require experimenting with several datasets, hyperparameter search spaces, and random seeds.

Training Time

Part of my hypothesis was also that early stopping reduces average training time by halting the addition of unpromising trees. Can a meaningful difference be measured?

Distribution of training time in seconds.

Results confirm the second part of my hypothesis: training times are substantially lower when using early stopping. This strategy, even with a high patience value of 100 iterations, halves the average training time, from 122 seconds to 58 seconds. This amounts to a reduction in total search time from 3 hours and 23 minutes to 1 hour and 37 minutes.

This reduction comes despite the extra computation required by the early stopping mechanism to monitor cross-entropy loss on the validation set, which is accounted for in the measurements presented above.

Conclusion

Gradient-boosted decision trees (GBDTs) are currently state-of-the-art in problems involving tabular data. I find that using early stopping in the training of these models halves training times while maintaining the same performance as explicitly tuning the number of trees. This makes popular GBDT implementations like LightGBM, XGBoost, and CatBoost that much more powerful for applications in large industries, such as digital marketing and finance.

In the future, it would be valuable to corroborate the findings presented here on other datasets and across other GBDT implementations. Tuning the patience parameter could also prove helpful, although its optimal value will likely vary for each dataset.

Except where otherwise noted, all images are by the author.

References

[1] H. Carlens, The State of Competitive Machine Learning in 2022, ML Contests (2023).
[2] Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko, Revisiting Deep Learning Models for Tabular Data, 35th Conference on Neural Information Processing Systems (NeurIPS 2021).
[3] R. Shwartz-Ziv and A. Armon, Tabular Data: Deep Learning is Not All You Need, Information Fusion 81 (2022): 84–90.
[4] V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, and G. Kasneci, Deep Neural Networks and Tabular Data: A Survey, IEEE Transactions on Neural Networks and Learning Systems (2022).
[5] L. Grinsztajn, E. Oyallon, and G. Varoquaux, Why do tree-based models still outperform deep learning on typical tabular data?, 36th Conference on Neural Information Processing Systems, Datasets and Benchmarks Track (NeurIPS 2022).
[6] S. Jesus, J. Pombal, D. Alves, A. Cruz, P. Saleiro, R. Ribeiro, J. Gama, and P. Bizarro, Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation, 36th Conference on Neural Information Processing Systems, Datasets and Benchmarks Track (NeurIPS 2022).
[7] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu, LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 31st Conference on Neural Information Processing Systems (NIPS 2017).
