XGBoost: How Deep Learning Can Replace Gradient Boosting and Decision Trees — Part 1
XGBoost is very efficient, but…
Constructing Decision Trees is not a differentiable process
Feature selection regularization
Introducing a threshold
Next steps

If you have read my previous articles on Gradient Boosting and Decision Trees, you will be aware that Gradient Boosting, combined with ensembles of Decision Trees, achieves excellent performance on classification and regression tasks involving tabular data.

Nevertheless, despite the work put into XGBoost, LightGBM, and CatBoost to speed up the training of ensembles of Decision Trees, fitting such a model still relies on a brute-force approach: it systematically evaluates the gain associated with every candidate splitting value and feature.
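To make that concrete, here is a minimal sketch of such an exhaustive search for a single split. It is not the code of any of these libraries: the helper name best_split and the toy variance-reduction gain are illustrative choices only.

import numpy as np

def best_split(X, y):
    """Exhaustively evaluate every (feature, threshold) pair and keep the
    one that most reduces the variance of the targets (a toy gain)."""
    n_samples, n_features = X.shape
    best = (None, None, -np.inf)  # (feature index, threshold, gain)
    base = y.var() * n_samples
    for j in range(n_features):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            gain = base - (left.var() * len(left) + right.var() * len(right))
            if gain > best[2]:
                best = (j, t, gain)
    return best

X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
y = np.array([1.0, 1.2, 3.1, 3.0])
print(best_split(X, y))
# -> splits on feature 0 at threshold 3.0 (plus the achieved gain)

Nothing in these two nested loops is differentiable: the split is found by enumeration and comparison, not by following a gradient.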

For more details on how this search is performed, refer to my previous article on the topic.

The modern approach to training a Machine Learning model, and programs more generally, is Differentiable Programming. If you are unfamiliar with the concept, I have written an introductory article on the topic.

Essentially, the idea is to identify the parameters to optimize in your model/program and use a gradient-based approach to find their optimal values. However, for this to work, your model must be differentiable with respect to the parameters of interest.
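As a minimal illustration of this idea (unrelated to trees for now), here is a tiny JAX example that learns the single parameter w of a linear model by gradient descent; the quadratic loss, the learning rate, and the data are arbitrary choices of mine.

import jax
import jax.numpy as jnp

def loss(w, x, y):
    """Squared error of a one-parameter linear model y ≈ w * x."""
    return jnp.mean((w * x - y) ** 2)

x = jnp.array([1.0, 2.0, 3.0])
y = jnp.array([2.0, 4.0, 6.0])  # generated with w = 2

w = 0.0
grad_fn = jax.grad(loss)         # gradient with respect to the first argument
for _ in range(100):
    w -= 0.1 * grad_fn(w, x, y)  # plain gradient descent
print(w)
# approaches 2.0

The whole pipeline works only because loss is differentiable with respect to w; this is exactly the property that the construction of a Decision Tree lacks.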

Although ensembles of Decision Trees employ a gradient-based approach for training, their construction process is not differentiable.

A Decision Tree is a simple yet powerful data structure with numerous applications. When training such a structure for regression or classification tasks, we aim to identify three primary parameters (a minimal sketch of a decision node follows the list):

  1. The feature to use at each decision node to minimize the error, i.e. feature selection.
  2. The threshold used to split the input data at each node. Samples whose selected feature value is greater than the threshold are assigned to the right child node, while the rest are assigned to the left child node.
  3. The value assigned to each leaf of the tree.
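Here is that sketch: a deliberately plain, non-differentiable decision node in Python. The class name Node and its fields are purely illustrative.

import numpy as np

class Node:
    """A single decision node: it either holds a leaf value, or routes the
    input to a child according to one feature and one threshold."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # 1. index of the feature used for the split
        self.threshold = threshold  # 2. threshold the feature is compared against
        self.left, self.right = left, right
        self.value = value          # 3. prediction stored in a leaf

    def predict(self, x):
        if self.value is not None:             # leaf: return the stored value
            return self.value
        if x[self.feature] >= self.threshold:  # >= threshold -> right child
            return self.right.predict(x)
        return self.left.predict(x)            # otherwise -> left child

# A depth-1 tree: "is the first feature at least 10?"
tree = Node(feature=0, threshold=10.0, left=Node(value=-1.0), right=Node(value=1.0))
print(tree.predict(np.array([12.0, 3.0])))  # 1.0
print(tree.predict(np.array([8.0, 3.0])))   # -1.0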

The first two steps are non-differentiable. In current implementations of Gradient Boosted Trees, the first step requires iterating over the features and is non-smooth. The second step involves evaluating the gain of each candidate threshold on the resulting data partition, which is also non-differentiable.

Fortunately, in 2019, Popov, Morozov, and Babenko introduced a differentiable formulation of Decision Trees in their NODE (Neural Oblivious Decision Ensembles) paper.

I highly recommend reading their paper, as it is always fruitful to gather information from the source. Nevertheless, to ease comprehension, I have written some Python code using JAX.

The questions addressed in the paper are:

  1. How can we reformulate feature selection to make it smooth and differentiable?
  2. How can we reformulate the threshold comparison to make it smooth and differentiable?

As mentioned earlier, feature selection in standard Gradient Boosting algorithms is performed using an exhaustive search: we try each feature and keep the one that gives the best gain.

Instead of this iterative process, we would like to reformulate the selection as a function that returns the value of the feature ensuring maximal gain.

First, let's assume that the features are gathered in a vector. We need a way to extract the value of the selected feature from this array, ideally using only vector arithmetic. Assuming for a moment that we already know which feature to retain, we can achieve this with a simple dot product:

import numpy as np

features = np.array([1.1, 3.2, 4.5])
selection_weights = np.array([0, 0, 1])  # one-hot vector: select the third feature
selected_feature_value = np.dot(selection_weights, features)
print(selected_feature_value)
# Output: 4.5

Hence, we can retrieve a given value from an array of features by multiplying all of them by 0, except the one we want to keep, which is multiplied by 1, and summing the products.

Learning which feature achieves the best performance then amounts to learning the vector selection_weights, with the additional constraint that the sum of its elements must be equal to 1. Ideally, this mass should be concentrated in a single element of the vector.

This constraint on the selection vector can be enforced with a normalizing function. An obvious candidate is the softmax function, which guarantees that the weights are positive and sum to 1. However, softmax produces a dense distribution and cannot output exactly 1 for the largest weight and 0 for the others.

The following code snippet illustrates this:

import numpy as np
from jax.nn import softmax

weights = np.array([2, 3, 5, 7, 11, 17])

# The L2 norm of a probability vector equals 1 only when it is one-hot.
print(np.linalg.norm(softmax(weights)))
# Output: 0.99747807
print(softmax(weights))
# Output: [3.0512990e-07 8.2942910e-07 6.1286983e-06 4.5285295e-05 2.4724933e-03 9.9747497e-01]

Not only is the L2 norm not exactly equal to 1.0, but the distribution also spreads some mass across every dimension.

Therefore, the NODE paper proposes using the entmax function instead:

import numpy as np
from entmax_jax import entmax15

weights = np.array([2, 3, 5, 7, 11, 17])

print(np.linalg.norm(entmax15(weights)))
# Output: 1.0
print(entmax15(weights))
# Output: [0. 0. 0. 0. 0. 1.]

This appears to be a good choice, as it generates exactly the expected result!

By combining the dot product with the entmax function, we can retrieve the value of the selected feature.

Differentiable feature selection can be as simple as:

import jax.numpy as jnp
from entmax_jax import entmax15

def pick_features(weights, features):
    # entmax15 turns the raw weights into a (nearly) one-hot selection vector
    feature = jnp.dot(entmax15(weights), features)
    return feature

During the training phase, the weights will be trained to learn which feature to keep for a given decision node.
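Because pick_features is built only from vector arithmetic and entmax, JAX can differentiate it with respect to the selection weights, which is what allows the feature choice itself to be learned by gradient descent. Here is a quick check (the definition is repeated so the snippet runs on its own; the weights and feature values are arbitrary, and I assume entmax_jax exposes gradients for entmax15, which is the purpose of the library):

import jax
import jax.numpy as jnp
from entmax_jax import entmax15

def pick_features(weights, features):
    return jnp.dot(entmax15(weights), features)

weights = jnp.array([0.1, 0.2, 0.3])
features = jnp.array([1.1, 3.2, 4.5])

# Gradient of the selected value with respect to the selection weights.
print(jax.grad(pick_features)(weights, features))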

The other parameter used in a decision node, which must also be learned, is the threshold. If the value of the selected feature is greater than or equal to this threshold, the prediction follows the right path of the tree. Conversely, if the value is strictly lower, the left path is taken.

A similar approach, based on a dot product and the entmax function, can be used to route the prediction depending on whether the value is above or below the threshold, this time with a vector of size 2. Its first element holds the difference between the threshold and the selected value, while the second element is 0.

If the selected value is greater than the threshold, the first element of this vector is negative; applying entmax then turns the 0 into a 1 and the negative element into a 0, so the dot product with the pair of candidate outputs returns the right one. Conversely, when the value is below the threshold, the first element is positive, entmax maps the vector to approximately [1, 0], and the left output is returned.

Here is an example implementation:

import jax.numpy as jnp
from entmax_jax import entmax15

def select(feature, threshold, left, right):
    decisions = jnp.array([left, right])
    # Negative when the feature exceeds the threshold, positive otherwise.
    steer = jnp.array([threshold - feature, 0])
    # entmax maps steer to (approximately) [1, 0] or [0, 1],
    # which picks the left or the right value respectively.
    pred = jnp.dot(entmax15(steer), decisions)
    return pred

print(select(12, 10, -1, 1))
# Output: 1
print(select(8, 10, -1, 1))
# Output: -1

During the training process, the parameters to be learned are not only the threshold but also the values attached to the right and left leaves of this simple 1-level tree.
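Putting both pieces together, a complete 1-level tree becomes a single differentiable function of all its parameters, so one jax.grad call yields the gradients needed to train them jointly. The sketch below is my own combination of the two snippets above, not the paper's exact formulation: the parameter dictionary, the squared-error loss, and all numeric values are arbitrary, and it again assumes entmax_jax's entmax15 is differentiable.

import jax
import jax.numpy as jnp
from entmax_jax import entmax15

def soft_tree(params, features):
    """Differentiable depth-1 tree: select a feature, compare it to a
    threshold, and blend the two leaf values accordingly."""
    feature = jnp.dot(entmax15(params["selection"]), features)
    steer = jnp.stack([params["threshold"] - feature, jnp.array(0.0)])
    leaves = jnp.stack([params["left"], params["right"]])
    return jnp.dot(entmax15(steer), leaves)

def loss(params, features, target):
    return (soft_tree(params, features) - target) ** 2

params = {
    "selection": jnp.array([0.1, 0.2, 0.3]),  # which feature to pick
    "threshold": jnp.array(2.0),              # where to split
    "left": jnp.array(-1.0),                  # leaf values
    "right": jnp.array(1.0),
}
features = jnp.array([1.1, 3.2, 4.5])

# One call returns the gradient of the loss with respect to every
# parameter of the node: selection weights, threshold, and both leaves.
print(jax.grad(loss)(params, features, 0.5))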
