A Visual Guide to Tuning Gradient Boosted Trees


Introduction

My previous posts looked at the bog-standard decision tree and the wonder of a random forest. Now, to complete the trilogy, I’ll visually explore gradient boosted trees!

There are a bunch of gradient boosted tree libraries, including XGBoost, CatBoost, and LightGBM. However, for this I’m going to use sklearn’s one. Why? Simply because, compared with the others, it made visualisation easier. In practice I tend to use the other libraries more than the sklearn one; however, this project is about visual learning, not pure performance.

Fundamentally, a GBT is a combination of trees, each of which corrects the errors of the trees before it. While a single decision tree (including one extracted from a random forest) can make a decent prediction by itself, taking an individual tree from a GBT is unlikely to give anything usable.

Beyond this, as always: no theory, no maths, just plots and hyperparameters. As before, I’ll be using the California housing dataset via scikit-learn (CC-BY) and the same general process as described in my previous posts, the code is at https://github.com/jamesdeluk/data-projects/tree/main/visualising-trees, and all images below are created by me (apart from the GIF, which is from Tenor).

A basic gradient boosted tree

Starting with a basic GBT: gb = GradientBoostingRegressor(random_state=42). As with other tree types, the default settings for min_samples_split, min_samples_leaf, and max_leaf_nodes are 2, 1, and None respectively. Interestingly, the default max_depth is 3, not None as it is with decision trees/random forests. Notable hyperparameters, which I’ll look into more later, include learning_rate (how strongly each tree’s correction is applied, default 0.1) and n_estimators (as with random forests, the number of trees).
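For reference, the basic setup looks roughly like this. This is a minimal sketch: the 75/25 train/test split is an assumption (it matches the ~5000-block test set mentioned later), and the details may differ from the actual code in the repo.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Load the California housing data and make a simple train/test split
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit the default GBT and predict on the held-out blocks
gb = GradientBoostingRegressor(random_state=42)
gb.fit(X_train, y_train)
preds = gb.predict(X_test)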

Fitting took 2.2s, predicting took 0.005s, and the results:

Metric Default
MAE 0.369
MAPE 0.216
MSE 0.289
RMSE 0.538
R² 0.779

So, quicker than the default random forest, but slightly worse performance. For my chosen block, it predicted 0.803 (actual 0.894).

Visualising

This is why you’re here, right?

The tree

As before, we can plot a single tree. This is the first one, accessed with gb.estimators_[0, 0]:
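A minimal sketch of how such a plot can be produced (plot_tree works on the individual regression trees inside the ensemble; the figure size here is arbitrary):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# estimators_ is a 2D array of trees: one row per boosting iteration, one column per output
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(gb.estimators_[0, 0], feature_names=list(X.columns), filled=True, ax=ax)
plt.show()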

I’ve explained these in the previous posts, so I won’t do so again here. One thing I will bring to your attention though: notice how terrible the values are! Three of the leaves even have negative values, which we know can’t be the case. This is why a GBT only works as a combined ensemble, not as separate standalone trees like in a random forest.

Predictions and errors

My favourite way to visualise GBTs is with prediction vs iteration plots, using gb.staged_predict. For my chosen block:
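A rough sketch of how this kind of plot can be built with staged_predict (the block index used here is just a stand-in for my chosen block):

import matplotlib.pyplot as plt

# One-row DataFrame for the block of interest (index 0 is illustrative)
chosen_block = X_test.iloc[[0]]
true_value = y_test.iloc[0]

# staged_predict yields the ensemble's prediction after each boosting iteration
staged = [p[0] for p in gb.staged_predict(chosen_block)]

plt.plot(range(1, len(staged) + 1), staged, label="prediction")
plt.axhline(true_value, linestyle="--", label="true value")
plt.xlabel("Iteration")
plt.ylabel("Predicted value")
plt.legend()
plt.show()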

Remember the default model has 100 estimators? Well, here they are. The initial prediction was way off, at 2! But each iteration it learnt (remember learning_rate?) and got closer to the true value. Of course, it was trained on the training data, not this specific block, so the final value was still off (0.803, so about 10% off), but you can clearly see the process.

In this case, it reached a fairly steady state after about 50 iterations. Later we’ll see how to stop iterating at this point, to save time and money.

Similarly, the error (i.e. the prediction minus the true value) can be plotted. Of course, this gives us the same plot, just with different y-axis values:

Let’s take this one step further! The test data has over 5000 blocks to predict; we can loop through each and predict all of them, at every iteration!
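In sketch form (staged_predict already returns predictions for every block at each iteration, so stacking its output is simpler than an explicit loop over blocks, but gives the same result):

import numpy as np
import matplotlib.pyplot as plt

# Stack the staged predictions into an (n_iterations, n_blocks) array;
# each column is then one block's prediction trajectory
all_staged = np.vstack(list(gb.staged_predict(X_test)))

plt.plot(all_staged, linewidth=0.2, alpha=0.3)
plt.xlabel("Iteration")
plt.ylabel("Predicted value")
plt.show()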

I like this plot.

They all start around 2, but spread out over the iterations. We know the true values range from 0.15 to 5, with a mean of 2.1 (check my first post), so this spreading out of predictions (from ~0.3 to ~5.5) is as expected.

We can also plot the errors:

At first glance, it seems a bit strange; we’d expect them to start at, say, ±2, and converge on 0. Looking carefully though, this does happen for most, as can be seen on the left-hand side of the plot, in the first 10 iterations or so. The issue is, with over 5000 lines on this plot, there are a lot of overlapping ones, making the outliers stand out more. Perhaps there’s a better way to visualise these? How about…

The median error is 0.05, which is excellent! The IQR is less than 0.5, which is also decent. So, while there are some terrible predictions, most are decent.
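These summary numbers can be computed along these lines. This sketch only summarises the errors at the final iteration, reusing all_staged from above; the actual figure may be built differently.

import numpy as np
import matplotlib.pyplot as plt

# Errors at the final iteration, one per test block
final_errors = all_staged[-1] - y_test.to_numpy()

q1, median, q3 = np.percentile(final_errors, [25, 50, 75])
print(f"Median error: {median:.3f}, IQR: {q3 - q1:.3f}")

plt.boxplot(final_errors, vert=False)
plt.xlabel("Prediction error")
plt.show()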

Hyperparameter tuning

Decision tree hyperparameters

Same as before, let’s compare how the hyperparameters explored in the original decision tree post apply to GBTs, with the default hyperparameters of learning_rate = 0.1 and n_estimators = 100. The min_samples_leaf, min_samples_split, and max_leaf_nodes ones also have max_depth = 10, to make them a fair comparison to previous posts and to each other.
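The comparison in the table below can be reproduced with something like this sketch (the timing and metric code is omitted; the parameter combinations follow the description above):

from sklearn.ensemble import GradientBoostingRegressor

# Each configuration keeps the default learning_rate=0.1 and n_estimators=100
configs = {
    "max_depth=None": dict(max_depth=None),
    "max_depth=10": dict(max_depth=10),
    "min_samples_leaf=10": dict(max_depth=10, min_samples_leaf=10),
    "min_samples_split=10": dict(max_depth=10, min_samples_split=10),
    "max_leaf_nodes=100": dict(max_depth=10, max_leaf_nodes=100),
}

models = {
    name: GradientBoostingRegressor(random_state=42, **params).fit(X_train, y_train)
    for name, params in configs.items()
}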

Model max_depth=None max_depth=10 min_samples_leaf=10 min_samples_split=10 max_leaf_nodes=100
Fit Time (s) 10.889 7.009 7.101 7.015 6.167
Predict Time (s) 0.089 0.019 0.015 0.018 0.013
MAE 0.454 0.304 0.301 0.302 0.301
MAPE 0.253 0.177 0.174 0.174 0.175
MSE 0.496 0.222 0.212 0.217 0.210
RMSE 0.704 0.471 0.460 0.466 0.458
R² 0.621 0.830 0.838 0.834 0.840
Chosen Prediction 0.885 0.906 0.962 0.918 0.923
Chosen Error 0.009 0.012 0.068 0.024 0.029

Unlike decision trees and random forests, the deeper tree performed far worse! And took longer to fit. However, increasing the depth from 3 (the default) to 10 improved the scores. The other constraints resulted in further improvements, again showing how all the hyperparameters can play a role.

learning_rate

GBTs operate by tweaking predictions after each iteration based on the error. The bigger the adjustment (a.k.a. the learning rate), the more the prediction changes between iterations.

There’s a clear trade-off for the learning rate. Comparing learning rates of 0.01 (Slow), 0.1 (Default), and 0.5 (Fast), over 100 iterations:
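A sketch of how the three models can be compared on the chosen block (reusing chosen_block from earlier; the plotting details are assumptions):

import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor

rates = {"Slow (0.01)": 0.01, "Default (0.1)": 0.1, "Fast (0.5)": 0.5}

for label, lr in rates.items():
    # Fit one GBT per learning rate and track its staged predictions for the block
    model = GradientBoostingRegressor(learning_rate=lr, random_state=42)
    model.fit(X_train, y_train)
    staged = [p[0] for p in model.staged_predict(chosen_block)]
    plt.plot(staged, label=label)

plt.axhline(y_test.iloc[0], linestyle="--", label="true value")
plt.xlabel("Iteration")
plt.ylabel("Predicted value")
plt.legend()
plt.show()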

Faster learning rates can get to the correct value quicker, but they’re more likely to overcorrect and jump past the true value (think fishtailing in a car), and can result in oscillations. Slow learning rates may never reach the correct value (think… not turning the steering wheel enough and driving straight into a tree). As for the stats:

Model Default Fast Slow
Fit Time (s) 2.159 2.288 2.166
Predict Time (s) 0.005 0.004 0.015
MAE 0.370 0.338 0.629
MAPE 0.216 0.197 0.427
MSE 0.289 0.247 0.661
RMSE 0.538 0.497 0.813
R² 0.779 0.811 0.495
Chosen Prediction 0.803 0.949 1.44
Chosen Error 0.091 0.055 0.546

Unsurprisingly, the slow learning model was terrible. Overall, fast was slightly better than the default. However, we can see on the plot how, at least for the chosen block, it was the last 90 iterations that got the fast model to be more accurate than the default one; if we’d stopped at 40 iterations, for the chosen block at least, the default model would have been much better. The joys of visualisation!

n_estimators

As mentioned above, the number of estimators goes hand in hand with the learning rate. Generally, the more estimators the better, as it gives more iterations to measure and adjust for the error, although this comes at an additional time cost.

As seen above, a sufficiently high number of estimators is especially important for a low learning rate, to ensure the correct value is reached. Increasing the number of estimators to 500:

With enough iterations, the slow learning GBT did reach the true value. In fact, all of them ended up much closer. The stats confirm this:

Model DefaultMore FastMore SlowMore
Fit Time (s) 12.254 12.489 11.918
Predict Time (s) 0.018 0.014 0.022
MAE 0.323 0.319 0.410
MAPE 0.187 0.185 0.248
MSE 0.232 0.228 0.338
RMSE 0.482 0.477 0.581
R² 0.823 0.826 0.742
Chosen Prediction 0.841 0.921 0.858
Chosen Error 0.053 0.027 0.036

Unsurprisingly, increasing the number of estimators five-fold increased the time to fit significantly (in this case by six-fold, but that may just be a one-off). However, we still haven’t surpassed the scores of the constrained trees above; I guess we’ll need to do a hyperparameter search to see if we can beat them. Also, for the chosen block, as can be seen in the plot, after about 300 iterations none of the models really improved. If that is consistent across all the data, then the extra 200 iterations were unnecessary. I mentioned earlier how it’s possible to avoid spending time on iterations that don’t improve anything; now is the time to look into that.

n_iter_no_change, validation_fraction, and tol

It’s possible for additional iterations to not improve the result, yet it still takes time to run them. This is where early stopping comes in.

There are three relevant hyperparameters. The first, n_iter_no_change, is how many consecutive iterations there must be with “no change” before no more iterations are run. tol[erance] is the threshold for what counts as a change: if the validation score improves by less than this, it’s classified as “no change”. And validation_fraction is how much of the training data to use as a validation set to generate that validation score (note this is separate from the test data).

Comparing a 1000-estimator GBT with one with fairly aggressive early stopping (n_iter_no_change=5, validation_fraction=0.1, tol=0.005), the latter stopped after only 61 estimators (and hence only took 5~6% of the time to fit):
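In sketch form, the two models compared here look something like this (n_estimators_ reports how many stages were actually fitted before stopping):

from sklearn.ensemble import GradientBoostingRegressor

gb_full = GradientBoostingRegressor(n_estimators=1000, random_state=42)
gb_early = GradientBoostingRegressor(
    n_estimators=1000,
    n_iter_no_change=5,       # stop after 5 iterations with "no change"
    validation_fraction=0.1,  # hold out 10% of the training data for validation
    tol=0.005,                # improvements smaller than this count as "no change"
    random_state=42,
)

gb_full.fit(X_train, y_train)
gb_early.fit(X_train, y_train)

print(gb_early.n_estimators_)  # number of stages fitted before early stopping kicked in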

As expected though, the results were worse:

Model Default Early Stopping
Fit Time (s) 24.843 1.304
Predict Time (s) 0.042 0.003
MAE 0.313 0.396
MAPE 0.181 0.236
MSE 0.222 0.321
RMSE 0.471 0.566
R² 0.830 0.755
Chosen Prediction 0.837 0.805
Chosen Error 0.057 0.089

But as always, the question to ask is: is it worth investing 20x the time to improve the R² by 10%, or reduce the error by 20%?

Bayes searching

You were probably expecting this. The search spaces:

search_spaces = {
    'learning_rate': (0.01, 0.5),
    'max_depth': (1, 100),
    'max_features': (0.1, 1.0, 'uniform'),
    'max_leaf_nodes': (2, 20000),
    'min_samples_leaf': (1, 100),
    'min_samples_split': (2, 100),
    'n_estimators': (50, 1000),
}

Most are similar to my previous posts; the only additional hyperparameter is learning_rate.
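A sketch of how the search might be run, assuming BayesSearchCV from scikit-optimize (the n_iter, cv, and scoring values here are assumptions, not taken from the actual run):

from skopt import BayesSearchCV
from sklearn.ensemble import GradientBoostingRegressor

opt = BayesSearchCV(
    GradientBoostingRegressor(random_state=42),
    search_spaces,
    n_iter=100,               # number of parameter combinations to evaluate
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
    random_state=42,
)
opt.fit(X_train, y_train)
print(opt.best_params_)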

It took the longest so far, at 96 mins (~50% more than the random forest!). The best hyperparameters are:

best_parameters = OrderedDict({
    'learning_rate': 0.04345459461297153,
    'max_depth': 13,
    'max_features': 0.4993693929975871,
    'max_leaf_nodes': 20000,
    'min_samples_leaf': 1,
    'min_samples_split': 83,
    'n_estimators': 325,
})

max_features, max_leaf_nodes, and min_samples_leaf are very similar to the tuned random forest. n_estimators is too, and it aligns with what the chosen block plot above suggested: the extra 700 iterations were mostly unnecessary. However, compared with the tuned random forest, the trees are only a third as deep, and min_samples_split is far higher than we’ve seen so far. The value of learning_rate was not too surprising given what we saw above.

And the cross-validated scores:

Metric Mean Std
MAE -0.289 0.005
MAPE -0.161 0.004
MSE -0.200 0.008
RMSE -0.448 0.009
R² 0.849 0.006

Of all the models so far, this is the best, with smaller errors, a higher R², and lower variances!
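These scores can be reproduced along these lines (the scorer names and fold count are assumptions; the negative values in the table come from sklearn’s neg_* scorers):

from sklearn.model_selection import cross_validate

scoring = {
    "MAE": "neg_mean_absolute_error",
    "MAPE": "neg_mean_absolute_percentage_error",
    "MSE": "neg_mean_squared_error",
    "RMSE": "neg_root_mean_squared_error",
    "R2": "r2",
}

# Cross-validate the tuned model and report the mean and std of each metric
cv_results = cross_validate(opt.best_estimator_, X_train, y_train, cv=5, scoring=scoring)
for name in scoring:
    scores = cv_results[f"test_{name}"]
    print(f"{name}: mean {scores.mean():.3f}, std {scores.std():.3f}")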

Finally, our old friend, the box plots:

Conclusion

And so we come to the end of my mini-series on the three most common types of tree-based models.

My hope is that, by seeing different ways of visualising trees, you now (a) better understand how the different models function, without having to look at equations, and (b) can use your own plots to tune your own models. It can also help with stakeholder management: execs prefer pretty pictures to tables of numbers, so showing them a tree plot can help them understand why what they’re asking you to do is impossible.

Based on this dataset, and these models, the gradient boosted one was slightly superior to the random forest, and both were far superior to a lone decision tree. However, this may have been because the GBT had 50% more time to search for better hyperparameters (they are typically more computationally expensive; after all, it was the same number of iterations). It’s also worth noting that GBTs have a higher tendency to overfit than random forests. And while the decision tree had worse performance, it’s faster, and in some use cases that is more important. Additionally, as mentioned, there are other libraries, with their own pros and cons; for instance, CatBoost handles categorical data out of the box, whereas other GBT libraries typically require categorical data to be preprocessed (e.g. one-hot or label encoding). Or, if you’re feeling really brave, how about stacking the different tree types in an ensemble for even better performance…

Anyway, until next time!
