XGBoost: How Deep Learning Can Replace Gradient Boosting and Decision Trees — Part 1
XGBoost is extremely efficient, but…
Constructing Decision Trees isn’t a differentiable process
Feature selection regularization
Introducing a threshold
Next steps

If you’ve read my previous articles on Gradient Boosting and Decision Trees, you know that Gradient Boosting, combined with Ensembles of Decision Trees, achieves excellent performance on classification and regression tasks involving tabular data.

However, despite the efforts made by XGBoost, LightGBM, and CatBoost to improve the training of Ensembles of Decision Trees, training such a model still relies on a brute-force approach: systematically evaluating the gain associated with candidate split values and features.

Refer to my previous article on the topic:

The modern approach to training a Machine Learning model, and programs more generally, involves using Differentiable Programming. If you are unfamiliar with the concept, I have written an introductory article on the topic:

Essentially, the idea is to identify the parameters to optimize in your model/program and use a gradient-based approach to find their optimal values. However, for this to work, your model must be differentiable with respect to the parameters of interest.
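To make this concrete, here is a minimal sketch (my own toy example, not taken from the article it links to) of Differentiable Programming with JAX: a single parameter is optimized by following the gradient of a loss function.

import jax
import jax.numpy as jnp

# Toy example: fit a single parameter w so that w * x approximates y.
def loss(w, x, y):
    return jnp.mean((w * x - y) ** 2)

x = jnp.array([1.0, 2.0, 3.0])
y = jnp.array([2.0, 4.0, 6.0])

w = 0.0
grad_fn = jax.grad(loss)  # derivative of the loss with respect to w
for _ in range(100):
    w -= 0.1 * grad_fn(w, x, y)  # plain gradient descent step

print(w)  # converges towards 2.0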

Although Ensembles of Decision Trees employ a gradient-based approach for training, their construction process is not differentiable.

A decision tree is a simple yet powerful data structure with numerous applications. When training such a structure for regression or classification tasks, we aim to identify three essential parameters:

  1. The feature to use at each decision node to reduce the error, which involves feature selection.
  2. The threshold used to split the input data at each node. Data whose selected feature value is greater than the threshold is assigned to the right child node, while the rest is assigned to the left child node.
  3. The values assigned to the leaves of the tree.

The first two steps are non-differentiable. In current implementations of Gradient Boosted Trees, the first step requires iterating over the features and is non-smooth. The second step evaluates the gain of each candidate split value on the resulting data partition, which is also non-differentiable.

Fortunately, in 2019, Popov, Morozov, and Babenko introduced a differentiable formulation of Decision Trees.

I highly recommend reading their paper, as it is always fruitful to go back to the source. However, to facilitate comprehension, I have written some Python code using JAX.

The questions addressed in the paper are:

  1. How can we reformulate feature selection to make it smooth and differentiable?
  2. How can we reformulate comparison to make it smooth and differentiable?

As mentioned earlier, feature selection in standard Gradient Boosting algorithms is performed using an exhaustive search: we try each feature and keep the one that yields the highest gain.
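Here is a simplified sketch of that brute-force search, written by me for illustration. It uses a plain variance-reduction gain, whereas the actual gain formulas in XGBoost, LightGBM, and CatBoost are more elaborate.

import numpy as np

def best_split(X, y):
    """Simplified exhaustive split search: try every feature and every
    candidate threshold, keep the pair with the highest gain.
    The gain here is a plain variance reduction, for illustration only."""
    best = (None, None, -np.inf)
    for feature in range(X.shape[1]):               # loop over features
        for threshold in np.unique(X[:, feature]):  # loop over candidate thresholds
            right = X[:, feature] > threshold
            if right.all() or (~right).all():
                continue  # degenerate split, nothing to gain
            gain = np.var(y) - (np.var(y[right]) * right.mean()
                                + np.var(y[~right]) * (~right).mean())
            if gain > best[2]:
                best = (feature, threshold, gain)
    return best

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([1.0, 1.0, 5.0, 5.0])
print(best_split(X, y))  # -> feature 0, threshold 2.0

Both loops are discrete argmax operations over features and thresholds, which is exactly what makes this construction non-differentiable.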

Instead of this iterative process, we would like to reformulate the selection as a function that returns the value of the feature ensuring maximal gain.

First, let’s assume that the features are gathered in a vector. We need a way to extract the value of the selected feature from this array, ideally using vector arithmetic. Starting with the assumption that we already know which feature to retain, we can achieve this with a simple dot product:

import numpy as np

features = np.array([1.1, 3.2, 4.5])
# One-hot selection vector: only the third feature is kept
selection_weights = np.array([0, 0, 1])
selected_feature_value = np.dot(selection_weights, features)
print(selected_feature_value)
# Output: 4.5

Hence, we can retrieve a specific value from an array of features by multiplying all of them by 0, except the one we want to keep, and summing the products.

Learning which feature achieves the best performance amounts to learning the vector selection_weights, with the additional constraint that the sum of its elements must be equal to 1. Ideally, this 1 should be concentrated in a single element of the vector.

Enforcing this constraint on the selection vector can be achieved with a normalization function. An obvious candidate is the softmax function, which forces the weights to sum to 1. However, softmax produces a smooth distribution and cannot output an exact 1 for the largest weight and 0 for the others.

The following code snippet illustrates this:

import numpy as np
from jax.nn import softmax

weights = np.array([2, 3, 5, 7, 11, 17])
# The L2 norm of a probability vector equals 1 only when it is one-hot
print(np.linalg.norm(softmax(weights)))
# Output: 0.99747807
print(softmax(weights))
# Output: [3.0512990e-07 8.2942910e-07 6.1286983e-06 4.5285295e-05 2.4724933e-03 9.9747497e-01]

Not only is the norm not exactly equal to 1.0 (for a probability vector, the L2 norm equals 1 only when it is one-hot), but the distribution is spread across all dimensions.

Therefore, the NODE paper proposes using the entmax function instead:

import numpy as np
from entmax_jax import entmax15

weights = np.array([2, 3, 5, 7, 11, 17])
# entmax produces a sparse distribution: here it is exactly one-hot
print(np.linalg.norm(entmax15(weights)))
# Output: 1.0
print(entmax15(weights))
# Output: [0. 0. 0. 0. 0. 1.]

This appears to be a good choice, as it produces the expected result!

By using the dot product in conjunction with the entmax function, we can retrieve the value of a given feature.

Differentiable feature selection can then be as simple as:

import jax.numpy as jnp
from entmax_jax import entmax15

def pick_features(weights, features):
    # entmax turns the learnable weights into a (sparse) selection vector
    feature = jnp.dot(entmax15(weights), features)
    return feature

During the training phase, the weights are trained to learn which feature to keep for a given decision.
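As a quick sanity check (my own addition, not code from the paper), we can ask JAX for the gradient of pick_features with respect to the selection weights, which is precisely what makes them learnable:

import jax
import jax.numpy as jnp
from entmax_jax import entmax15

def pick_features(weights, features):
    return jnp.dot(entmax15(weights), features)

features = jnp.array([1.1, 3.2, 4.5])
weights = jnp.array([0.1, 0.2, 0.3])

# The whole selection is differentiable, so any gradient-based
# optimizer can update the weights.
print(jax.grad(pick_features)(weights, features))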

The other parameter used in a decision node, which must also be learned, is the threshold. If the value of the selected feature is greater than or equal to this threshold, the prediction follows the right path of the tree. Conversely, if the value is strictly lower, the left path is taken.

A similar approach can be used to pick the right or the left value depending on the threshold, again using a dot product and the entmax function, but this time with a vector of size 2. The first element holds the difference between the threshold and the selected feature value, while the second element is 0.

Thanks to the entmax function, if the value is greater than the threshold, the first element is negative: entmax turns the 0 into a 1 and the negative element into a 0, so the right value is selected. Conversely, when the value is below the threshold, the first element is positive, entmax concentrates the weight on it, and the left value is selected.

Here is an example implementation:

import jax.numpy as jnp
from entmax_jax import entmax15

def select(feature, threshold, left, right):
    decisions = jnp.array([left, right])
    # Positive first element -> left value, negative -> right value
    steer = jnp.array([threshold - feature, 0])
    pred = jnp.dot(entmax15(steer), decisions)
    return pred

print(select(12, 10, -1, 1))
# Output: 1
print(select(8, 10, -1, 1))
# Output: -1

During the training process, the parameters to be learned are not only the threshold but also the values attached to the right and left leaves of this simple 1-level tree.
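To wrap up, here is a minimal sketch (my own composition, not code from the paper) gluing the two ingredients together into a single differentiable decision stump, assuming the same entmax_jax package used above. All of its parameters are exposed to jax.grad, which is what a training loop can exploit.

import jax
import jax.numpy as jnp
from entmax_jax import entmax15

def decision_stump(params, features):
    """A differentiable 1-level tree: select a feature with entmax-weighted
    averaging, compare it to the threshold, and steer towards a leaf value."""
    value = jnp.dot(entmax15(params["selection"]), features)  # feature selection
    steer = jnp.array([params["threshold"] - value, 0.0])     # comparison
    leaves = jnp.array([params["left"], params["right"]])
    return jnp.dot(entmax15(steer), leaves)                   # leaf assignment

params = {
    "selection": jnp.array([0.5, 1.5, 1.0]),
    "threshold": jnp.array(3.0),
    "left": jnp.array(-1.0),
    "right": jnp.array(1.0),
}

features = jnp.array([1.1, 3.2, 4.5])
print(decision_stump(params, features))
# Per-parameter gradients of the stump output (they may be exactly zero
# where entmax has saturated to a fully sparse distribution):
print(jax.grad(decision_stump)(params, features))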
