Marginal Effect of Hyperparameter Tuning with XGBoost


In many modeling contexts, the XGBoost algorithm reigns supreme. It provides performance and efficiency gains over other tree-based methods and other boosting implementations. The XGBoost algorithm features a laundry list of hyperparameters, although often only a subset is chosen through the hyperparameter tuning process. In my experience, I've always used a grid search with k-fold cross-validation to identify the optimal combination of hyperparameters, but there are alternative methods for hyperparameter tuning, such as those in the hyperopt library, that can search the hyperparameter space more systematically.

Through my work building XGBoost models across different projects, I came across a great resource by Matt Harrison: a textbook covering XGBoost, including how to tune its hyperparameters. Chapter 12 of the book is devoted to tuning hyperparameters using the hyperopt library; however, some natural questions arose while reading the section. The introduction to the chapter provides a high-level overview of how hyperopt and Bayesian optimization offer a more guided approach to tuning hyperparameters compared with grid search. Still, I was curious: what is actually happening under the hood?

In addition, as is the case with many tutorials about tuning XGBoost hyperparameters, the ranges for the hyperparameters seemed somewhat arbitrary. Harrison explains that he pulled the list of hyperparameters to be tuned from a talk that data scientist Bradley Boehmke gave (here). Both Harrison and Boehmke provide tutorials for using hyperopt with the same set of hyperparameters, although they use slightly different search spaces for finding an optimal combination. In Boehmke's case, the search space is much larger; for example, he recommends that the maximum depth of each tree (max_depth) be allowed to vary between 1 and 100. Harrison narrowed the ranges he presents in his book somewhat, but these two cases led to the question: What is the marginal gain, compared with the marginal increase in time, from widening the hyperparameter search space when tuning XGBoost models?

This article centers on those two questions. First, we will explore how hyperopt works when tuning hyperparameters at a slightly deeper level, to build some intuition for what is happening under the hood. Second, we will explore the tradeoff between large and narrower search spaces in a rigorous way. I hope to answer these questions so that this article can serve as a resource for understanding hyperparameter tuning in the future.

All code for the project can be found on my GitHub page here: https://github.com/noahswan19/XGBoost-Hyperparameter-Evaluation

hyperopt with Tree-Structured Parzen Estimators for Hyperparameter Tuning

In the chapter of his textbook covering hyperopt, Harrison describes the process of using hyperopt for hyperparameter tuning as using "Bayesian optimization" to identify sequential hyperparameter combinations to try during the tuning process.

The high-level description makes it clear why hyperopt is superior to grid search, but I was curious how this is implemented. What is actually happening when we run the fmin function with the Tree-Structured Parzen Estimator (TPE) algorithm?

Sequential Model-Based Optimization

To start, the TPE algorithm originates from a 2011 paper, "Algorithms for Hyper-Parameter Optimization," written by James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl (Bergstra is also an author of the hyperopt package). The paper begins by introducing Sequential Model-Based Optimization (SMBO) algorithms; TPE is one version of this broader SMBO method. SMBO provides a systematic way to select the next hyperparameter combination to evaluate, avoiding the brute-force nature of grid search and the inefficiency of random search. It involves developing a "surrogate" model for the underlying model we are optimizing (i.e., XGBoost in our case), which we can use to direct the search for optimal hyperparameters in a way that is computationally cheaper than evaluating the underlying model. The SMBO algorithm is described in the following image:

Image by author, from Figure 1 of "Algorithms for Hyper-Parameter Optimization" (Bergstra et al.)

There are a number of symbols here, so let's break down each one:

  • x* and x: x* represents the hyperparameter combination being tested in a given trial, while x represents a general hyperparameter combination.
  • f: This is the "fitness function," i.e., the underlying model that we are optimizing. Within this algorithm, f(x*) maps a hyperparameter combination x* to the performance of that combination on a validation data set.
  • M_0: The M terms in the algorithm correspond to the "surrogate" model we use to approximate f. Since f is expensive to run, we can use a cheaper estimate, M, to help identify which hyperparameter combinations are likely to improve performance.
  • H: The curly H corresponds to the history of hyperparameters searched so far. It is updated on every iteration and is used to fit an updated surrogate model after each iteration.
  • T: This corresponds to the number of trials we use for hyperparameter tuning. It is fairly self-explanatory and corresponds to the max_evals argument of the fmin function from hyperopt.
  • S: The S corresponds to the criterion used to select which hyperparameter combination to check next, given a surrogate model. In the hyperopt implementation of the TPE algorithm, S corresponds to the Expected Improvement (EI) criterion, described in the following image.
Image by author, from Equation 1 of "Algorithms for Hyper-Parameter Optimization" (Bergstra et al.)
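
For readers who want the criterion written out rather than as an image, Equation 1 of the paper defines the expected improvement (in my transcription of its notation) as:

```latex
EI_{y^{*}}(x) = \int_{-\infty}^{\infty} \max(y^{*} - y,\, 0)\; p_{M}(y \mid x)\, dy
```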

Each iteration, some number of candidate hyperparameter combinations are drawn (in the Python hyperopt package, this is set to 24 by default). We will discuss shortly how TPE determines how these 24 are drawn. These 24 hyperparameter combinations are evaluated using the EI criterion and the surrogate model (which is cheap) to identify the single combination that is most likely to deliver the best performance. This is where we see the benefit of the surrogate model: instead of training and evaluating 24 XGBoost models to find the single best hyperparameter combination, we can approximate this with a computationally inexpensive surrogate model. As the name would suggest, the formula above corresponds to the expected performance improvement of a hyperparameter combination x:

  • max(y*-y, 0): This represents the actual improvement in performance for a hyperparameter combination x. y* corresponds to the best validation loss we have attained so far; we are aiming to minimize the validation loss, so we are looking for values of y that are lower than y*. This means we want to maximize EI in our algorithm.
  • p_M(y|x): This is the piece of the criterion that is approximated using the surrogate model and the piece where TPE fits in. It is the probability density over possible values of y given a hyperparameter combination x.

So each round, we take a set of 24 hyperparameter combinations and proceed with the one that maximizes the EI criterion, which uses our surrogate model M.
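
As a toy illustration of the loop (not hyperopt's actual internals — the candidate generator and scoring function here are stand-ins I made up), the SMBO pseudocode translates to something like:

```python
import random

# Toy stand-in for f: "train an XGBoost model with these hyperparameters and return validation loss".
def f(x):
    return (x["max_depth"] - 6) ** 2 + random.random()

# Stand-in for drawing candidates; a real SMBO draws them with guidance from the surrogate model.
def propose_candidates(n=24):
    return [{"max_depth": random.randint(1, 100)} for _ in range(n)]

# Stand-in for the EI criterion: prefer candidates near the best hyperparameters seen so far.
def surrogate_score(history, x):
    if not history:
        return 0.0
    best_x, _ = min(history, key=lambda h: h[1])
    return -abs(x["max_depth"] - best_x["max_depth"])

history = []   # H: the growing record of (hyperparameters, loss) pairs
T = 30         # T: total trials, i.e. max_evals in fmin
for _ in range(T):
    candidates = propose_candidates()
    x_star = max(candidates, key=lambda x: surrogate_score(history, x))  # cheap surrogate step
    y_star = f(x_star)                                                   # expensive evaluation
    history.append((x_star, y_star))

print(min(history, key=lambda h: h[1]))  # best hyperparameters and loss found
```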

Where does the TPE algorithm fit in?

The key piece of the SMBO algorithm that can vary across implementations is the surrogate model, or how we approximate the success of hyperparameter combinations. With the EI criterion, the surrogate model is required to estimate the density p(y|x). The paper mentioned above introduces one method, the Gaussian Process approach, which models p(y|x) directly, but the TPE approach (which is more often used for XGBoost hyperparameter optimization) instead approximates p(x|y) and p(y). This approach follows from Bayes' theorem:

Image by author
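
For reference, the Bayes' theorem identity being used here is simply:

```latex
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}
```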

The TPE algorithm splits p(x|y) into a piecewise combination of two distributions:

  • l(x) if y < y*
  • g(x) if y ≥ y*

These two distributions have an intuitive interpretation: l(x) is the distribution of hyperparameters associated with models that have a lower loss (better) than the best model so far, while g(x) is the distribution of hyperparameters associated with models that have a higher loss (worse) than the best model so far. This expression for p(x|y) is substituted into the equation for EI in the paper, and a mathematical derivation (too verbose to break down entirely here) arrives at the fact that maximizing EI is equivalent to choosing points that are more likely under l(x) and less likely under g(x).
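
Written out (again in my transcription of the paper's notation), the split and the end result of that derivation look like:

```latex
p(x \mid y) =
\begin{cases}
\ell(x) & \text{if } y < y^{*} \\
g(x)    & \text{if } y \ge y^{*}
\end{cases}
\qquad\qquad
EI_{y^{*}}(x) \propto \left( \gamma + \frac{g(x)}{\ell(x)}\,(1 - \gamma) \right)^{-1}
```

where gamma is the probability p(y < y*); maximizing EI therefore amounts to maximizing the ratio l(x)/g(x).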

So how does this work in practice? When using hyperopt, we call the fmin function and supply the tpe.suggest algorithm to specify that we want to use the TPE algorithm. We provide a space of hyperparameters where each parameter is associated with a uniform or log-uniform distribution. These initial distributions are used to initialize l(x) and g(x) and serve as a prior for them while working with a small number of initial trials. By default (the n_startup_jobs parameter of tpe.suggest), hyperopt runs 20 trials by randomly sampling hyperparameter combinations from the distributions provided in the space argument of fmin. For each of the 20 trials, an XGBoost model is run and a validation loss obtained.
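
A minimal sketch of what that looks like in code is below; the hyperparameter ranges and the toy objective are illustrative placeholders, not the setup used later in the article.

```python
from functools import partial
from hyperopt import fmin, tpe, hp, Trials

# Toy objective; in practice this trains an XGBoost model and returns its validation loss.
def objective(params):
    return (params["max_depth"] - 6) ** 2

# Each hyperparameter gets a prior distribution that seeds l(x) and g(x).
space = {
    "max_depth": hp.quniform("max_depth", 1, 100, 1),        # uniform over integers 1-100
    "learning_rate": hp.loguniform("learning_rate", -7, 0),  # log-uniform between e^-7 and 1
    "subsample": hp.uniform("subsample", 0.5, 1.0),          # uniform between 0.5 and 1
}

trials = Trials()
best = fmin(
    fn=objective,
    space=space,
    algo=partial(tpe.suggest, n_startup_jobs=20),  # 20 is already the default; shown explicitly
    max_evals=200,
    trials=trials,
)
print(best)
```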

The 20 observations are then split into two subsets used to construct non-parametric densities for l(x) and g(x); subsequent observations are used to update these distributions. The densities are estimated using a non-parametric method (which I'm not qualified to describe fully) involving the prior distributions for each hyperparameter (that we specified) and individual distributions for each observation in the trial history. Observations are split into subsets using a rule that changes with the total number of trials run: the "n" observations with the lowest loss are used for l(x), with the remaining observations used for g(x). The "n" is determined by multiplying a parameter gamma (the default in tpe.suggest is 0.25) by the square root of the number of trials and rounding up; however, a maximum for "n" is set at 25, so l(x) is parameterized with at most 25 values. If we use the default settings for tpe.suggest, then the best two observations (0.25 * sqrt(20) = 1.12, which rounds up to 2) from the initial trials are used to parameterize l(x), with the remaining 18 used for g(x). The 0.25 value is the gamma parameter of tpe.suggest, which can be modified if desired. Looking back at the pseudocode for the SMBO algorithm and the formula for EI, if n observations are used to parameterize l(x), then the (n+1)th observation provides the threshold value y*.
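
A small numeric sketch of that split rule (my paraphrase of the logic just described, not hyperopt's source code):

```python
import math

def n_for_l(num_observations, gamma=0.25, cap=25):
    # Number of lowest-loss observations used to parameterize l(x); the rest go to g(x).
    return min(cap, math.ceil(gamma * math.sqrt(num_observations)))

print(n_for_l(20))     # ceil(0.25 * sqrt(20)) = ceil(1.118) = 2 -> best 2 trials form l(x)
print(n_for_l(200))    # ceil(0.25 * sqrt(200)) = ceil(3.536) = 4
print(n_for_l(20000))  # would be 36, but the cap keeps it at 25
```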

Once l(x) and g(x) are instantiated using the startup trials, we can move forward with each evaluation of our objective function, up to the max_evals we specify for fmin. For each iteration, a set of candidate hyperparameter combinations (24 by default in tpe.suggest, but configurable with n_EI_candidates) is generated by taking random draws from l(x). Each of these candidates is evaluated using the ratio l(x)/g(x); the combination that maximizes this ratio is selected as the combination to use for the iteration. The ratio increases for hyperparameter combinations that are either (1) more likely to be associated with low losses or (2) unlikely to be associated with high losses (which drives exploration). This process of selecting the best candidate corresponds to using the surrogate model with the EI criterion, as discussed with the pseudocode for an SMBO.
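
To make the candidate-ranking step concrete, here is a toy sketch that uses ordinary Gaussian kernel density estimates in place of hyperopt's adaptive Parzen estimators, so it mirrors the idea rather than the library's internals:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Pretend these are max_depth values from past trials, split by validation loss.
low_loss_depths = np.array([4.0, 5.0, 6.0])      # the few best trials parameterize l(x)
high_loss_depths = rng.uniform(1, 100, size=18)  # the remaining trials parameterize g(x)

l_density = gaussian_kde(low_loss_depths)
g_density = gaussian_kde(high_loss_depths)

# Draw 24 candidates from l(x) and keep the one maximizing l(x)/g(x).
candidates = l_density.resample(24).ravel()
ratios = l_density(candidates) / g_density(candidates)
best_candidate = candidates[np.argmax(ratios)]
print(round(float(best_candidate), 2))  # the combination we would actually train XGBoost on
```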

An XGBoost model is then trained with the top candidate for the iteration; a loss value is obtained, and the data point (x*, f(x*)) is used to update the surrogate model (l(x) and g(x)) to continue the optimization.

Marginal Effect of Hyperparameter Tuning

So now, with a background on how the hyperopt library can be used in the hyperparameter tuning process, we move to the question of how using wider distributions impacts model performance. When attempting to compare the performance of models trained on large search spaces against those trained on narrower search spaces, the immediate question is how to create the narrower search space. For example, the presentation from Boehmke advises using a uniform distribution from 1 to 100 for the max_depth hyperparameter. XGBoost models tend to generalize better when combining a large number of weak learners, but does that mean we narrow the distribution to a minimum of 1 and a maximum of 50? We might have some general understanding from work others have done to intuitively narrow the space, but can we find a way to narrow the search space analytically?

The solution proposed in this article involves running a set of shorter hyperparameter tuning trials to narrow the search space, based on shallow searches of a wider hyperparameter space. The wide search space we use comes from slide 20 of Boehmke's aforementioned presentation (here). Instead of running hyperparameter tuning on a wide search space for 1,000 rounds of hyperparameter testing, we will run 20 independent trials with 25 rounds of hyperparameter testing each. We will then narrow the search space using percentile values for each hyperparameter from the trial results. With the percentiles, we will run a final search for 200 rounds using the narrower hyperparameter search space, where the distribution we provide for each hyperparameter is given a minimum and maximum from the percentile values we see in the trials.

For example, say we run our 20 trials and get 20 optimal values for max_depth from the shallow searches. We might narrow the search space for max_depth from the uniform distribution over 1 to 100 to the uniform distribution from the 10th percentile value of max_depth across our trials to the 90th percentile value. We will run a handful of different models, changing the percentiles we use, to compare more and less aggressive narrowing strategies.
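
In code, that narrowing step is just a percentile computation over the trial results; the max_depth values below are made up for illustration:

```python
import numpy as np
from hyperopt import hp

# Hypothetical best max_depth values from the 20 shallow (25-evaluation) searches.
best_depths = np.array([3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 10, 11, 12, 14, 17, 22, 31])

# Replace the wide uniform(1, 100) prior with the 10th-90th percentile range of those values.
low, high = np.percentile(best_depths, [10, 90])
narrowed_max_depth = hp.quniform("max_depth", low, high, 1)
print(low, high)  # 4.0 17.5 for this made-up set of values
```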

Models produced using the trial-based method require 700 evaluations of hyperparameter combinations (500 from the trials and 200 from the final search). We will compare the performance of these models against one tuned for 1,000 hyperparameter evaluations on the wide space and one tuned for 700 hyperparameter evaluations on the wide space. We are curious whether this approach to narrowing the hyperparameter search space leads to faster convergence toward the optimal hyperparameter combination, or whether the narrowing negatively impacts results.

We test this method on a task from a previous project involving simulated tennis match results (more information in the article I wrote here). Part of that project involved building post-match win probability models using high-level details about each match and statistics for a given player in the match that followed a truncated normal distribution; this is the task used to test the hyperparameter tuning method here. More information about the specific task can be found in that article and in the code linked at the beginning of this one. At a high level, we are trying to take details about what happened in a match and predict a binary win/loss outcome; one could use a post-match win probability model to identify players who may be overperforming their statistical performance and who could be candidates for regression. To train each XGBoost model, we use log loss/cross-entropy loss as the loss function. The data for the task comes from Jeff Sackmann's GitHub page here: https://github.com/JeffSackmann/tennis_atp. Anyone interested in tennis or tennis data should check out his GitHub and excellent website, tennisabstract.com.
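
For the tennis task, the objective passed to fmin looks roughly like the following sketch; the synthetic data here is a stand-in for the engineered match features (the real feature construction lives in the linked repository):

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration; the real task uses engineered tennis match features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=2000) > 0).astype(int)  # binary win/loss stand-in
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

def objective(params):
    # hyperopt hands over floats, so cast the integer-valued hyperparameters.
    model = xgb.XGBClassifier(
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        subsample=params["subsample"],
        n_estimators=200,
        objective="binary:logistic",  # binary win/loss target
    )
    model.fit(X_train, y_train, verbose=False)
    preds = model.predict_proba(X_val)[:, 1]
    return log_loss(y_val, preds)     # the validation log loss that hyperopt minimizes
```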

For this task and our method, we have six models: two trained on the full search space and four trained on a narrower space. They are titled as follows in the charts:

  • “Full Search”: This is the model trained for 1,000 hyperparameter evaluations across the full hyperparameter search space.
  • “XX-XX Percentile”: These models are trained on a narrower search space for 200 evaluations after the 500 rounds of trial evaluations on the full hyperparameter search space. The “10–90 Percentile” model, for example, trains on a hyperparameter search space where the distribution for each hyperparameter is determined by the 10th and 90th percentile values from the 20 trials.
  • “Shorter Search”: This is the model trained for 700 hyperparameter evaluations across the full hyperparameter search space. We use it to compare the trial method against the wider search space when allotting the same number of hyperparameter evaluations to both methods.

A log of training the models is included on the GitHub page linked at the top of the article; it contains the hyperparameters found at each step of the process given the random seeds used, along with the time it took to run each model on my laptop. It also provides the results of the 20 trials, so you can see how each narrowed search space would be parameterized. Those times are listed below:

  • Full Search: ~6000 seconds
  • 10–90 Percentile: ~4300 seconds (~3000 seconds for trials, ~1300 for narrower search)
  • 20–80 Percentile: ~3800 seconds (~3000 seconds for trials, ~800 for narrower search)
  • 30–70 Percentile: ~3650 seconds (~3000 seconds for trials, ~650 for narrower search)
  • 40–60 Percentile: ~3600 seconds (~3000 seconds for trials, ~600 for narrower search)
  • Shorter Search: ~4600 seconds

The timing doesn't scale 1:1 with the total number of evaluations used; the trial-method models tend to take less time to train given the same number of evaluations, with narrower searches taking even less time. The next question is whether this time saving impacts model performance at all. We'll start by looking at validation log loss across the models.

Image by author

Very little distinguishes the log losses across the models, but we'll zoom in a little to get a visual look at the differences. We present the full-range y-axis first to contextualize the minor differences in the log losses.

Image by author

Okay, that's a bit better, but we'll zoom in one more time to see the trend most clearly.

Image by author

We find that the 20–80 Percentile model attains the best validation log loss, slightly better than the Full Search and Shorter Search methods. The other percentile models all perform slightly worse than the wider-search models, but the differences are minor across the board. We will now look at the differences in accuracy between the models.

Image by author

As with the log losses, we see very minor differences, so we zoom in to see a more definitive trend.

Image by author

The Full Search model attains the best accuracy of any model, but the 10–90 Percentile and 20–80 Percentile models both beat the Shorter Search model over the same number of evaluations. This is the kind of tradeoff I was hoping to identify, with the caveat that this result is task-specific and on a very small scale.

The results using log loss and accuracy suggest a potential efficiency-performance tradeoff when choosing how wide to make the XGBoost hyperparameter search space. We found that models trained on a narrower search space can outperform, or at least match, models trained on wider search spaces while taking less time to train overall.

Further Work

The code provided in the prior section should be modular enough to run this test against different tasks without problems; the results for this classification task could vary from those of others. Altering the number of evaluations run when exploring the hyperparameter search space, or the number of trials run to get percentile ranges, could lead to conclusions different from those found here. This work also took the set of hyperparameters to tune as given; another question I would be interested in exploring is the marginal effect of including additional hyperparameters to tune (e.g., colsample_bylevel) on the performance of an XGBoost model.

References

(used with permission)

[2] M. Harrison, Effective XGBoost (2023), MetaSnake

[3] B. Boehmke, “Advanced XGBoost Hyperparameter Tuning on Databricks” (2021), GitHub

[4] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, “Algorithms for Hyper-Parameter Optimization” (2011), NeurIPS 2011
