Training XGBoost Models with GPU-Accelerated Polars DataFrames


One of the many strengths of the PyData ecosystem is interoperability, which enables seamlessly moving data between libraries that focus on exploratory analysis, training, and inference. The latest release of XGBoost introduces exciting new capabilities, including a category re-coder and integration with Polars DataFrames. This provides a streamlined approach to data handling.

This post guides you through how to leverage the Polars GPU engine with the XGBoost machine learning library. It highlights the seamless integration of categorical features, including the new category re-coder in XGBoost.

Using XGBoost with the Polars GPU Engine

Polars is a high-performance DataFrame library written in Rust, offering a lazy evaluation model and GPU acceleration that can significantly optimize data processing workflows.

One of the key aspects of using Polars with XGBoost in a GPU-accelerated pipeline is understanding lazy evaluation. Polars operations are often lazy, meaning that they build a query plan but don’t execute it unless explicitly directed to do so. To execute the query plan using a GPU, call the collect method of the LazyFrame and specify the engine="gpu" parameter.

This tutorial uses a small subset of the Microsoft Malware Prediction dataset for illustration. The dataset, available through Kaggle, has a moderate size with both numerical and categorical features. In the following snippets, only a few columns are selected to demonstrate the use of categorical features with XGBoost.

Setting up the environment

Before diving into the code, ensure you have the following libraries installed: xgboost, polars[gpu], and pyarrow. The [gpu] dependency specifier downloads the GPU-enabled version of Polars:

pip install xgboost polars[gpu] pyarrow

When consuming Polars inputs, XGBoost uses the zero-copy to_arrow method of Polars DataFrames. Consequently, PyArrow is required to pass data between Polars and XGBoost. As shown later in this post, it is also used as a data exchange format for exporting categories from XGBoost models.
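A quick way to confirm that the environment is ready is to check whether each module can be located before running the rest of the tutorial. The following is a minimal sketch using only the standard library. The check_modules helper is hypothetical, and cudf_polars is assumed to be the module name of the backend that powers the Polars GPU engine.

```python
from importlib.util import find_spec


def check_modules(names):
    """Return {module_name: bool} telling whether each module can be located."""
    return {name: find_spec(name) is not None for name in names}


# Assumption: "cudf_polars" is the module behind the Polars GPU engine.
status = check_modules(["xgboost", "polars", "pyarrow", "cudf_polars"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

Because find_spec only locates top-level modules and does not import them, the check is cheap and has no side effects on GPU state.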

Data preparation and model training

First, import the necessary libraries:

import polars as pl
import xgboost as xgb

Three features are used from the dataset, two of which are categorical. The HasDetections column is the prediction target with binary values. To utilize the Polars execution engine for optimal performance, create a LazyFrame object with scan_csv:

columns = [
    "ProductName", # Categorical
    "IsBeta",  # Boolean
    "Census_OSArchitecture", # Categorical
    "HasDetections",  # Binary target
]

# ignore_errors is set to True to let polars infer the schema
df_lazy = pl.scan_csv(
    "./microsoft-malware-prediction/train.csv",
    ignore_errors=True,
).select(columns)
# Cast the categorical features to the Polars Enum type
df_lazy = df_lazy.with_columns(
    [
        pl.col("ProductName").cast(
            pl.Enum(
                ["fep", "mseprerelease", "win8defender", "scep", "mse", "windowsintune"]
            )
        ),
        pl.col("Census_OSArchitecture").cast(pl.Enum(["amd64", "x86", "arm64"])),
    ]
)

After reading the data, train an XGBoost binary classification model by feeding the DataFrame into an XGBClassifier:

X = df_lazy.drop("HasDetections")
y = df_lazy.select("HasDetections")

# Use GPU to fit the classification model. We let XGBoost handle the categorical features by setting the `enable_categorical` parameter.
clf = xgb.XGBClassifier(device="cuda", enable_categorical=True)
# Validation data is not created for simplicity
clf.fit(X, y)

This snippet uses the GPU only for model training. The DataFrame loading and processing are still performed on the CPU. When calling the fit method, XGBoost issues a warning stating that it is recommended to convert the lazy frame into a concrete DataFrame for optimal performance. To achieve this and to customize the query plan execution on a per-frame basis, use the following snippet:

# Convert the lazy frame into a concrete DataFrame using a GPU
df = df_lazy.collect(engine="gpu")

X = df.drop("HasDetections")
y = df.select("HasDetections")

clf.fit(X, y)

Alternatively, to enable global GPU acceleration for Polars in addition to model training, set the engine affinity to GPU:

import polars as pl
import xgboost as xgb

# set the engine before using polars
pl.Config.set_engine_affinity("gpu")

Automatically re-code categorical data with XGBoost

The latest XGBoost release significantly enhances its handling of categorical features with the introduction of the re-coder. Polars encodes categorical and enum data types into integers based on the ordering of input values. For example, given three categories ["aa", "bb", "cc"], Polars might store the data as follows:

Values   Encoding   Categories
"cc"     2          "aa"
"cc"     2          "bb"
"bb"     1          "cc"
"aa"     0
Table 1. Example encoding of categories

The scheme is shared among DataFrame implementations, including pandas and cuDF. For an in-depth explanation of the categorical type and the enum type in Polars, refer to the Polars documentation.

In prior XGBoost versions, users had to manually re-code categorical features for XGBoost before inference. Aside from being error-prone, this can be difficult and inefficient.

For example, given a feature with three categories in the training dataset, ["aa", "bb", "cc"], Polars would encode them into numerical values [0, 1, 2] with a mapping {"aa": 0, "bb": 1, "cc": 2}. However, during inference, the test dataset might contain only a subset of the categories, ["bb", "cc"], which could be encoded as [0, 1] with a mapping {"bb": 0, "cc": 1}. The same strings then carry different integer codes than they did during training, resulting in an invalid test-time encoding.
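The mismatch can be reproduced with a toy encoder in plain Python. This is only a sketch: the encode function is hypothetical and assigns codes by sorted order, an assumption chosen to match the example mappings above. The real encoding is produced by the DataFrame library.

```python
def encode(values):
    """Toy encoder: assign integer codes by the sorted order of distinct values."""
    categories = sorted(set(values))
    mapping = {cat: code for code, cat in enumerate(categories)}
    return mapping, [mapping[v] for v in values]


# Training data contains all three categories.
train_map, train_codes = encode(["cc", "cc", "bb", "aa"])
# Test data contains only a subset of them.
test_map, test_codes = encode(["bb", "cc"])

print(train_map)  # {'aa': 0, 'bb': 1, 'cc': 2}
print(test_map)   # {'bb': 0, 'cc': 1}

# The same string now has a different integer code in each dataset.
assert train_map["bb"] != test_map["bb"]
```

A model that only sees the integer codes would interpret the test-time code 0 as "aa" rather than "bb", which is why the codes must be translated before prediction.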

With the latest XGBoost, the booster object can remember the encoding from the training dataset and use it in the predict method to re-code the categories automatically. The following example demonstrates how to use this with a synthetic dataset containing categorical features:

import numpy as np
import polars as pl
import xgboost as xgb

# Create a dataframe with a categorical feature (f1)
f0 = [1, 3, 2, 4, 4]
cats = ["aa", "cc", "bb", "ee", "ee"]

df = pl.DataFrame(
    {"f0": f0, "f1": cats},
    schema=[("f0", pl.Int64()), ("f1", pl.Categorical(ordering="lexical"))],
)
rng = np.random.default_rng(2025)
y = rng.normal(size=(df.shape[0]))

# Train a regression model
reg = xgb.XGBRegressor(enable_categorical=True, device="cuda")
reg.fit(df, y)
predt_0 = reg.predict(df)

# Use a subset of rows to create a different encoding, "aa" and "ee" are removed
df_new = pl.DataFrame(
    {"f0": f0[1:3], "f1": cats[1:3]},
    schema=[("f0", pl.Int64()), ("f1", pl.Categorical(ordering="lexical"))],
)
predt_1 = reg.predict(df_new)

# Check that the resulting predictions are the same as with the original encoding
np.testing.assert_allclose(predt_0[1:3], predt_1)

In this snippet, a test DataFrame is created with a subset of the categories from the training DataFrame. The final assertion verifies that the output predictions remain the same despite the different encoding. This feature removes the need for a separate transformation pipeline.

In addition, the re-coder in XGBoost is more efficient than re-coding the DataFrame directly when dealing with a large number of features. Internally, the re-coding is performed in-place and on-the-fly. XGBoost can handle all categorical columns simultaneously using a GPU without copying the DataFrame.
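Conceptually, re-coding amounts to a lookup table that translates test-time codes into the training-time code space. The following plain-Python sketch illustrates the idea for a single column. It is not the XGBoost implementation: the recode function is hypothetical, and marking unseen categories with -1 is an assumption made only for this sketch.

```python
def recode(test_categories, test_codes, train_categories):
    """Translate test-time integer codes into the training-time code space."""
    train_index = {cat: code for code, cat in enumerate(train_categories)}
    # One lookup entry per test category; -1 marks a category unseen in training
    # (an assumption of this sketch, not documented XGBoost behavior).
    lookup = [train_index.get(cat, -1) for cat in test_categories]
    return [lookup[code] for code in test_codes]


train_categories = ["aa", "bb", "cc"]
test_categories = ["bb", "cc"]
test_codes = [0, 1, 1]  # "bb", "cc", "cc" under the test-time encoding

print(recode(test_categories, test_codes, train_categories))  # [1, 2, 2]
```

The lookup table has one entry per category rather than per row, so building it is cheap even when the column itself is large.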

Exporting the categories (experimental)

As previously mentioned, the XGBoost model can now remember the categories. Advanced users can export the saved categories to a list of Arrow arrays by accessing the underlying booster object from the high-level model. This can be useful for verifying whether the model is trained with the expected categories. Continuing with the example:

# Get the underlying booster object
booster = reg.get_booster()
# Export the categories from the booster
categories = booster.get_categories(export_to_arrow=True)
# Export the categories into a list of Arrow arrays
print(categories.to_arrow())

The export_to_arrow option is required to exchange data with PyArrow:

[('f0', None), ('f1', 
[
  "aa",
  "cc",
  "bb",
  "ee"
])]

The complete list of categories in f1 is stored in the booster. The interface for exporting categories is experimental as of XGBoost 3.1. Note that the examples in this post use the scikit-learn interface, because it handles most configurations automatically. Using the native interface (the booster) can be more involved, especially when working with training continuation. For more details, see the XGBoost documentation.

The GPU acceleration for Polars is currently limited to its execution plan. The resulting DataFrame is still stored in CPU memory. Consequently, XGBoost needs to copy it back to the GPU during inference, and users will see a one-time warning about the performance impact.

Get started with Polars and XGBoost

You can build highly efficient and robust GPU-accelerated pipelines by understanding how to materialize lazy Polars DataFrames and effectively use the new XGBoost categorical feature handling, including its re-coder. This approach not only streamlines your workflow, but also unlocks new performance levels for your machine learning models.

For more information, see GPU Acceleration with Polars and NVIDIA RAPIDS. You can also provide feedback or ask questions about training XGBoost on the dmlc/xgboost GitHub repo.


